High-Fidelity 3D Subject-Driven Video Generation
3DreamBooth learns the subject's spatial identity via 1-frame optimization. 3Dapter is pre-trained with single-view conditioning and then extended to multi-view joint attention with shared weights during joint optimization.
High-fidelity, identity-preserving videos across diverse subjects and creative scenarios. Each clip pairs four multiview reference images with a text-prompted video.
The same subject rendered across six distinct creative contexts, demonstrating robust 3D identity preservation regardless of scene, lighting, or narrative.


We compare our full multi-view framework against single-view conditioning baselines VACE and Phantom, which use brief textual descriptions following standard subject-driven generation protocols.
Ablation of each component at a fixed step count. 3DreamBooth alone converges slowly; 3Dapter provides a strong initialization that markedly accelerates convergence under the same step budget.




Evaluated on 3D-CustomBench. S = single-view, M = multi-view. Underline = 2nd best.

| Method | Views | CLIP-I ↑ | DINO-I ↑ | Shape ↑ | Color ↑ | Detail ↑ | Overall ↑ |
|---|---|---|---|---|---|---|---|
| VACE | S | 0.8964 | 0.7395 | 4.39±0.05 | 4.09±0.09 | 3.35±0.15 | 3.95±0.11 |
| Phantom | S | 0.8576 | 0.5861 | 3.48±0.12 | 3.94±0.13 | 3.03±0.16 | 3.31±0.15 |
| 3Dapter | S | 0.8647 | 0.5899 | 3.06±0.03 | 3.09±0.06 | 2.28±0.08 | 2.67±0.07 |
| 3DreamBooth | M | 0.8382 | 0.6530 | 4.18±0.06 | 3.63±0.09 | 3.14±0.11 | 3.53±0.07 |
| 3Dapter + 3DB | M | 0.8871 | 0.7420 | 4.80±0.03 | 4.53±0.04 | 4.04±0.13 | 4.57±0.04 |
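
CLIP-I and DINO-I are commonly computed as the average cosine similarity between image embeddings of generated frames and of the reference images. The page does not spell out the exact protocol, so the following is only a minimal sketch of that common convention. The function names are hypothetical, and the vectors are toy placeholders standing in for real CLIP or DINO encoder outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_similarity(frame_embs, ref_embs):
    """Average cosine similarity over every (frame, reference) pair."""
    sims = [cosine_similarity(f, r) for f in frame_embs for r in ref_embs]
    return sum(sims) / len(sims)

# Toy placeholder embeddings (assumption: real ones would come from an image encoder).
frame_embeddings = [[1.0, 0.0], [0.0, 1.0]]
reference_embeddings = [[1.0, 0.0]]

score = mean_pairwise_similarity(frame_embeddings, reference_embeddings)
print(f"identity similarity: {score:.4f}")
```

Under this convention a higher score means the generated frames sit closer to the references in embedding space, which matches the ↑ direction in the table.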

| Method | Views | Accuracy ↓ | Completeness ↓ | Chamfer Distance (CD) ↓ |
|---|---|---|---|---|
| VACE | S | 0.0278 | 0.0427 | 0.0353 |
| Phantom | S | 0.0289 | 0.0388 | 0.0338 |
| 3Dapter | S | 0.0315 | 0.0659 | 0.0487 |
| 3DreamBooth | M | 0.0156 | 0.0322 | 0.0239 |
| 3Dapter + 3DB | M | 0.0182 | 0.0172 | 0.0177 |
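
In the table above, each CD value equals the mean of the Accuracy and Completeness values up to rounding. That is consistent with a common point-cloud convention: accuracy is the mean distance from predicted points to their nearest ground-truth point, completeness is the reverse, and CD averages the two. The sketch below illustrates that convention on toy points and checks the table rows. The exact distance definition and point sampling used for the benchmark are not stated on the page, so the function names and toy data are assumptions.

```python
import math

def mean_nearest_distance(source, target):
    """Mean Euclidean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

def geometry_metrics(predicted, ground_truth):
    """Return (accuracy, completeness, chamfer) under the assumed convention."""
    accuracy = mean_nearest_distance(predicted, ground_truth)
    completeness = mean_nearest_distance(ground_truth, predicted)
    chamfer = 0.5 * (accuracy + completeness)
    return accuracy, completeness, chamfer

# Toy point clouds (assumption: real ones would come from 3D reconstruction).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
acc, comp, cd = geometry_metrics(predicted, ground_truth)

# Rows copied from the table: (accuracy, completeness, reported CD).
table_rows = {
    "VACE": (0.0278, 0.0427, 0.0353),
    "Phantom": (0.0289, 0.0388, 0.0338),
    "3Dapter": (0.0315, 0.0659, 0.0487),
    "3DreamBooth": (0.0156, 0.0322, 0.0239),
    "3Dapter + 3DB": (0.0182, 0.0172, 0.0177),
}
# Each reported CD should match the mean of the other two columns up to rounding.
consistent = {
    name: abs(0.5 * (a + c) - reported) <= 1e-4
    for name, (a, c, reported) in table_rows.items()
}
print(acc, comp, cd, consistent)
```

In the toy case the prediction misses one ground-truth point, so accuracy stays at zero while completeness rises. The full model's main gain in the table is likewise in completeness.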

| Method | Views | Aesthetic Quality ↑ | Imaging Quality ↑ | Motion Smoothness ↑ | ViCLIP ↑ |
|---|---|---|---|---|---|
| VACE | S | 0.5915 | 70.84 | 0.9916 | 0.2663 |
| Phantom | S | 0.5798 | 70.58 | 0.9934 | 0.2634 |
| 3Dapter | S | 0.6283 | 71.65 | 0.9944 | 0.2048 |
| 3DreamBooth | M | 0.5245 | 73.34 | 0.9928 | 0.2415 |
| 3Dapter + 3DB | M | 0.5920 | 74.33 | 0.9918 | 0.2388 |
3DreamBooth is model-agnostic. We demonstrate consistent identity preservation on WanVideo 2.1 (720p) in addition to our primary Hunyuan-based model.