3DreamBooth

High-Fidelity 3D Subject-Driven Video Generation

¹Yonsei University   ²Sungkyunkwan University
* Equal contribution    † Corresponding author
Abstract
Creating dynamic, view-consistent videos of customized subjects is highly sought after for immersive VR/AR, virtual production, and next-generation e-commerce. Despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities — focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, these approaches lack the spatial priors necessary to reconstruct 3D geometry, and must rely on generating plausible but arbitrary details for unseen regions.

We introduce a novel framework comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm, baking a robust 3D prior without exhaustive video-based training. 3Dapter is a visual conditioning module that enhances fine-grained textures and accelerates convergence — acting as a dynamic selective router that queries view-specific geometric hints via multi-view joint attention with shared weights.
Architecture

Method Overview

3DreamBooth learns spatial identity via 1-frame optimization. 3Dapter pre-trains on single-view conditioning, then extends to multi-view joint attention with shared weights during joint optimization.

Training Pipeline
Training Pipeline. Given multiview images, 3DreamBooth performs 1-frame spatial optimization with a [v] identifier token. 3Dapter conditions on reference latents via 3DB LoRA (text branch) and Shared 3Dapter (image branch).
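The 3DB LoRA branch follows the standard low-rank adaptation recipe: the base weights stay frozen while only two small rank-r factors are trained, so the subject identity learned during 1-frame optimization lives in a tiny set of extra parameters. A minimal numpy sketch of that idea (shapes and names are illustrative, not the authors' code):

```python
import numpy as np

def lora_linear(x, W, A, B, scale=1.0):
    """Low-rank adapted linear layer: y = x @ W + scale * (x @ A) @ B.

    W (d_in, d_out) is the frozen base weight; only the rank-r factors
    A (d_in, r) and B (r, d_out) are trained. With A or B at zero the
    layer reduces exactly to the frozen base model. Shapes are
    hypothetical, for illustration only."""
    return x @ W + scale * (x @ A) @ B
```

Initializing B (or A) to zero is the usual trick that makes the adapted model start out identical to the base model before fine-tuning.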
Joint Attention
Single-view → Multi-view Joint Attention. 3Dapter is pre-trained in single-view mode, then extended to multi-view with shared weights — enabling consistency without redundant parameter overhead.
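The shared-weight extension above amounts to concatenating the tokens of all views and running one attention pass through a single set of projections, so every view attends to every other without per-view parameter copies. A minimal numpy sketch of that pattern (single head, no masking; an illustration of the idea, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(views, Wq, Wk, Wv):
    """Multi-view joint attention with one shared weight set.

    views: list of (tokens_i, d) arrays, one per view. All views are
    projected with the SAME Wq/Wk/Wv, their tokens are concatenated,
    and one attention pass mixes information across views."""
    x = np.concatenate(views, axis=0)            # (sum tokens, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])      # cross-view token mixing
    out = softmax(scores) @ v
    # split back into per-view token groups
    splits = np.cumsum([vw.shape[0] for vw in views])[:-1]
    return np.split(out, splits, axis=0)
```

Because the weights are shared, going from single-view pre-training to multi-view joint optimization adds no new attention parameters, only longer token sequences.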
Qualitative Results

Video Gallery

High-fidelity, identity-preserving videos across diverse subjects and creative scenarios. Each clip pairs four multiview reference images with a text-prompted video.

An actress on a Hollywood red carpet raising a [v] bag and peering over it into the camera.
A video of a rider mounting a [v] motorcycle on a rain-soaked, neon-lit Tokyo street at night.
A video of hands rotating a [v] watch over autumn leaves, rustic stone wall and red apples behind.
A video of a [v] chair in a modern office, sunlight through large windows, skyscrapers outside.
A video of a [v] figure held in a person's hand at a flea market with vintage stalls in the background.
A dramatic museum heist at night, two thieves in tactical gear carrying the [v] sculpture.
A video of a [v] rubber duck floating in a clear swimming pool, swimmers in the background.
A video of a woman twisting open the cap of a [v] pill bottle on a kitchen counter, natural daylight.
A video of a [v] chair at a bustling outdoor café, with people walking by and waiters carrying trays.
A woman reviewing a [v] snack bag in a bright supermarket aisle, filmed by a foreground cameraman.
A video of a climber picking up a [v] bottle on a rocky mountain ledge, snow-capped peaks behind.
A video of a young girl applying [v] glue to construction paper in a classroom, classmates behind.
A video of hands sliding a smartphone into a [v] shoulder bag, futuristic blue and magenta lighting.
A video of hands in garden gloves holding a [v] flowerpot in a lush green garden.
A video of a [v] mug being filled with hot chocolate and topped with swirled whipped cream.
A video of a [v] rubber duck bobbing in a clear blue pool, ripples forming, poolside chairs visible.
A [v] plush being washed by sudsy hands, bright morning sunlight, modern bathroom background.
A video of a [v] plushie sitting alone on a bare mattress in a dorm room during golden hour.
Prompt Generalization

Diverse Creative Scenarios

The same subject rendered across six distinct creative contexts — demonstrating robust 3D identity preservation regardless of scene, lighting, or narrative.

Gloved hands placing aviator sunglasses on a [v] plushie, cool blue and magenta holiday lights, snowy city through a window.
A claw machine grabbing a [v] plushie and slowly rotating it 360°, neon arcade lights in the background.
Marilyn Monroe smiling and kissing a [v] plushie in a Hollywood dressing room, vanity mirror with glowing bulb lights.
A gloved hand holding a magnifying glass in front of a [v] plushie on a mahogany desk, vintage typewriter and coffee behind.
An elegant woman in a black dress wiggling a [v] plushie to tease a fluffy orange tabby cat in a 1960s New York apartment.
A [v] plushie on an ornate teacup at a Mad Hatter tea party, playing cards and pocket watches scattered across the table.
Qualitative Comparison

Single-View Conditioning Baselines

We compare our full multi-view framework against single-view conditioning baselines VACE and Phantom, which use brief textual descriptions following standard subject-driven generation protocols.

Ablation Study

Component Analysis

We compare each component under an identical training-step budget. 3DreamBooth alone converges slowly; 3Dapter provides a strong initialization that dramatically accelerates convergence at the same number of steps.

A [v] plushie held in a person's hands inside a warmly lit Parisian bistro. The person slowly rotates it 360°. Fixed close-up camera. Eiffel Tower glittering at night visible through the window.
3Dapter only
3DreamBooth · 400 steps
3DreamBooth · 1600 steps
★ 3Dapter + 3DreamBooth · 400 steps
Quantitative Results

Benchmark Evaluation

Evaluated on 3D-CustomBench. S = single-view, M = multi-view.

Multi-View Subject Fidelity
Feature similarity (CLIP-I, DINO-I) and GPT-4o perceptual evaluation across Shape, Color, Detail, and Overall axes.
| Method | Views | CLIP-I ↑ | DINO-I ↑ | Shape ↑ | Color ↑ | Detail ↑ | Overall ↑ |
|---|---|---|---|---|---|---|---|
| VACE | S | 0.8964 | 0.7395 | 4.39 ± 0.05 | 4.09 ± 0.09 | 3.35 ± 0.15 | 3.95 ± 0.11 |
| Phantom | S | 0.8576 | 0.5861 | 3.48 ± 0.12 | 3.94 ± 0.13 | 3.03 ± 0.16 | 3.31 ± 0.15 |
| 3Dapter | S | 0.8647 | 0.5899 | 3.06 ± 0.03 | 3.09 ± 0.06 | 2.28 ± 0.08 | 2.67 ± 0.07 |
| 3DreamBooth | M | 0.8382 | 0.6530 | 4.18 ± 0.06 | 3.63 ± 0.09 | 3.14 ± 0.11 | 3.53 ± 0.07 |
| 3Dapter + 3DB | M | 0.8871 | 0.7420 | 4.80 ± 0.03 | 4.53 ± 0.04 | 4.04 ± 0.13 | 4.57 ± 0.04 |
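The CLIP-I and DINO-I columns follow the usual subject-fidelity recipe: embed generated frames and reference images with a frozen encoder, then average pairwise cosine similarity. A minimal sketch of the similarity step, with feature extraction assumed to happen upstream:

```python
import numpy as np

def image_feature_similarity(gen_feats, ref_feats):
    """Mean pairwise cosine similarity between generated-frame features
    (N, d) and reference-image features (M, d).

    This is the scoring step of CLIP-I / DINO-I; producing the features
    with a frozen CLIP or DINO encoder is assumed to happen elsewhere."""
    g = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return float((g @ r.T).mean())  # average over all N*M pairs
```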
3D Geometric Fidelity
Chamfer Distance between point clouds reconstructed from generated frames and ground-truth multi-view images from 3D-CustomBench. Lower is better.
| Method | Views | Accuracy ↓ | Completeness ↓ | CD ↓ |
|---|---|---|---|---|
| VACE | S | 0.0278 | 0.0427 | 0.0353 |
| Phantom | S | 0.0289 | 0.0388 | 0.0338 |
| 3Dapter | S | 0.0315 | 0.0659 | 0.0487 |
| 3DreamBooth | M | 0.0156 | 0.0322 | 0.0239 |
| 3Dapter + 3DB | M | 0.0182 | 0.0172 | 0.0177 |
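The three geometric metrics above are the standard symmetric Chamfer decomposition: accuracy is the mean distance from each reconstructed point to its nearest ground-truth point, completeness the reverse, and CD their average. A brute-force numpy sketch (fine for small clouds; large clouds would use a KD-tree):

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance between point clouds pred (N, 3)
    and gt (M, 3).

    accuracy:     mean nearest-neighbor distance pred -> gt
    completeness: mean nearest-neighbor distance gt -> pred
    CD:           average of the two (matching the table's columns)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M)
    accuracy = d.min(axis=1).mean()
    completeness = d.min(axis=0).mean()
    return accuracy, completeness, 0.5 * (accuracy + completeness)
```

As a sanity check against the table, CD is the mean of the two directed terms, e.g. (0.0278 + 0.0427) / 2 ≈ 0.0353 for VACE.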
Video Quality & Text Alignment
Intrinsic video quality via VBench and text-video alignment via ViCLIP Score (ViCLIP-L/14).
| Method | Views | Aesthetic Quality ↑ | Imaging Quality ↑ | Motion Smoothness ↑ | ViCLIP ↑ |
|---|---|---|---|---|---|
| VACE | S | 0.5915 | 70.84 | 0.9916 | 0.2663 |
| Phantom | S | 0.5798 | 70.58 | 0.9934 | 0.2634 |
| 3Dapter | S | 0.6283 | 71.65 | 0.9944 | 0.2048 |
| 3DreamBooth | M | 0.5245 | 73.34 | 0.9928 | 0.2415 |
| 3Dapter + 3DB | M | 0.5920 | 74.33 | 0.9918 | 0.2388 |
Model-Agnostic Generalization
Generalization

WanVideo 2.1 Results

3DreamBooth is model-agnostic. We demonstrate consistent identity preservation on WanVideo 2.1 (720p) in addition to our primary Hunyuan-based model.

Citation

BibTeX

```bibtex
@misc{ko20263dreamboothhighfidelity3dsubjectdriven,
  title         = {3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model},
  author        = {Hyun-kyu Ko and Jihyeon Park and Younghyun Kim and Dongheok Park and Eunbyung Park},
  year          = {2026},
  eprint        = {2603.18524},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.18524},
}
```