3DreamBooth

High-Fidelity 3D Subject-Driven Video Generation

¹Yonsei University   ²Sungkyunkwan University
* Equal contribution    † Corresponding author
Abstract
Creating dynamic, view-consistent videos of customized subjects is highly sought after for immersive VR/AR, virtual production, and next-generation e-commerce. Despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities — focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, these approaches lack the spatial priors necessary to reconstruct 3D geometry, and must rely on generating plausible but arbitrary details for unseen regions.

We introduce a novel framework comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm, baking a robust 3D prior without exhaustive video-based training. 3Dapter is a visual conditioning module that enhances fine-grained textures and accelerates convergence — acting as a dynamic selective router that queries view-specific geometric hints via multi-view joint attention with shared weights.
Architecture

Method Overview

3DreamBooth learns spatial identity via 1-frame optimization. 3Dapter pre-trains on single-view conditioning, then extends to multi-view joint attention with shared weights during joint optimization.

Training Pipeline
Training Pipeline. Given multiview images, 3DreamBooth performs 1-frame spatial optimization with a [v] identifier token. 3Dapter conditions on reference latents via 3DB LoRA (text branch) and Shared 3Dapter (image branch).
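The 3DB LoRA branch follows the standard low-rank adaptation recipe: the base weights stay frozen while only two small rank-r factors are trained, so the subject identity learned during 1-frame optimization lives in a tiny set of extra parameters. A minimal numpy sketch of that idea (shapes and names are illustrative, not the authors' code):

```python
import numpy as np

def lora_linear(x, W, A, B, scale=1.0):
    """Low-rank adapted linear layer: y = x @ W + scale * (x @ A) @ B.

    W (d_in, d_out) is the frozen base weight; only the rank-r factors
    A (d_in, r) and B (r, d_out) are trained. With A or B at zero the
    layer reduces exactly to the frozen base model. Shapes are
    hypothetical, for illustration only."""
    return x @ W + scale * (x @ A) @ B
```

Initializing B (or A) to zero is the usual trick that makes the adapted model start out identical to the base model before fine-tuning.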
Joint Attention
Single-view → Multi-view Joint Attention. 3Dapter is pre-trained in single-view mode, then extended to multi-view with shared weights — enabling consistency without redundant parameter overhead.
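The shared-weight extension above amounts to concatenating the tokens of all views and running one attention pass through a single set of projections, so every view attends to every other without per-view parameter copies. A minimal numpy sketch of that pattern (single head, no masking; an illustration of the idea, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(views, Wq, Wk, Wv):
    """Multi-view joint attention with one shared weight set.

    views: list of (tokens_i, d) arrays, one per view. All views are
    projected with the SAME Wq/Wk/Wv, their tokens are concatenated,
    and one attention pass mixes information across views."""
    x = np.concatenate(views, axis=0)            # (sum tokens, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])      # cross-view token mixing
    out = softmax(scores) @ v
    # split back into per-view token groups
    splits = np.cumsum([vw.shape[0] for vw in views])[:-1]
    return np.split(out, splits, axis=0)
```

Because the weights are shared, going from single-view pre-training to multi-view joint optimization adds no new attention parameters, only longer token sequences.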
Qualitative Results

Video Gallery

High-fidelity, identity-preserving videos across diverse subjects and creative scenarios. Each clip pairs four multiview reference images with a text-prompted video.

An actress on a Hollywood red carpet raising a [v] bag and peering over it into the camera.
A video of a rider mounting a [v] motorcycle on a rain-soaked, neon-lit Tokyo street at night.
A video of hands rotating a [v] watch over autumn leaves, rustic stone wall and red apples behind.
A video of a [v] chair in a modern office, sunlight through large windows, skyscrapers outside.
A video of a [v] figure held in a person's hand at a flea market with vintage stalls in the background.
A dramatic museum heist at night, two thieves in tactical gear carrying the [v] sculpture.
A video of a [v] rubber duck floating in a clear swimming pool, swimmers in the background.
A video of a woman twisting open the cap of a [v] pill bottle on a kitchen counter, natural daylight.
A video of a [v] chair at a bustling outdoor café, with people walking by and waiters carrying trays.
A woman reviewing a [v] snack bag in a bright supermarket aisle, filmed by a foreground cameraman.
A video of a climber picking up a [v] bottle on a rocky mountain ledge, snow-capped peaks behind.
A video of a young girl applying [v] glue to construction paper in a classroom, classmates behind.
A video of hands sliding a smartphone into a [v] shoulder bag, futuristic blue and magenta lighting.
A video of hands in garden gloves holding a [v] flowerpot in a lush green garden.
A video of a [v] mug being filled with hot chocolate and topped with swirled whipped cream.
A video of a [v] rubber duck bobbing in a clear blue pool, ripples forming, poolside chairs visible.
A [v] plush being washed by sudsy hands, bright morning sunlight, modern bathroom background.
A video of a [v] plushie sitting alone on a bare mattress in a dorm room during golden hour.
Prompt Generalization

Diverse Creative Scenarios

The same subject rendered across six distinct creative contexts — demonstrating robust 3D identity preservation regardless of scene, lighting, or narrative.

Gloved hands placing aviator sunglasses on a [v] plushie, cool blue and magenta holiday lights, snowy city through a window.
A claw machine grabbing a [v] plushie and slowly rotating it 360°, neon arcade lights in the background.
Marilyn Monroe smiling and kissing a [v] plushie in a Hollywood dressing room, vanity mirror with glowing bulb lights.
A gloved hand holding a magnifying glass in front of a [v] plushie on a mahogany desk, vintage typewriter and coffee behind.
An elegant woman in a black dress wiggling a [v] plushie to tease a fluffy orange tabby cat in a 1960s New York apartment.
A [v] plushie on an ornate teacup at a Mad Hatter tea party, playing cards and pocket watches scattered across the table.
Qualitative Comparison

Single-View Conditioning Baselines

We compare our full multi-view framework against single-view conditioning baselines VACE and Phantom, which use brief textual descriptions following standard subject-driven generation protocols.

Ablation Study

Component Analysis

We compare each component under an identical training-step budget. 3DreamBooth alone converges slowly; 3Dapter provides a strong initialization that dramatically accelerates convergence at the same number of steps.

A [v] plushie held in a person's hands inside a warmly lit Parisian bistro. The person slowly rotates it 360°. Fixed close-up camera. Eiffel Tower glittering at night visible through the window.
3Dapter only
3DreamBooth · 400 steps
3DreamBooth · 1600 steps
★ 3Dapter + 3DreamBooth · 400 steps
Quantitative Results

Benchmark Evaluation

Evaluated on 3D-CustomBench. S = single-view, M = multi-view.

Multi-View Subject Fidelity
Feature similarity (CLIP-I, DINO-I) and GPT-4o perceptual evaluation across Shape, Color, Detail, and Overall axes.
| Method | Views | CLIP-I ↑ | DINO-I ↑ | Shape ↑ | Color ↑ | Detail ↑ | Overall ↑ |
|---|---|---|---|---|---|---|---|
| VACE | S | 0.8964 | 0.7395 | 4.39 ± 0.05 | 4.09 ± 0.09 | 3.35 ± 0.15 | 3.95 ± 0.11 |
| Phantom | S | 0.8576 | 0.5861 | 3.48 ± 0.12 | 3.94 ± 0.13 | 3.03 ± 0.16 | 3.31 ± 0.15 |
| 3Dapter | S | 0.8647 | 0.5899 | 3.06 ± 0.03 | 3.09 ± 0.06 | 2.28 ± 0.08 | 2.67 ± 0.07 |
| 3DreamBooth | M | 0.8382 | 0.6530 | 4.18 ± 0.06 | 3.63 ± 0.09 | 3.14 ± 0.11 | 3.53 ± 0.07 |
| 3Dapter + 3DB | M | 0.8871 | 0.7420 | 4.80 ± 0.03 | 4.53 ± 0.04 | 4.04 ± 0.13 | 4.57 ± 0.04 |
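The CLIP-I and DINO-I columns follow the usual subject-fidelity recipe: embed generated frames and reference images with a frozen encoder, then average pairwise cosine similarity. A minimal sketch of the similarity step, with feature extraction assumed to happen upstream:

```python
import numpy as np

def image_feature_similarity(gen_feats, ref_feats):
    """Mean pairwise cosine similarity between generated-frame features
    (N, d) and reference-image features (M, d).

    This is the scoring step of CLIP-I / DINO-I; producing the features
    with a frozen CLIP or DINO encoder is assumed to happen elsewhere."""
    g = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return float((g @ r.T).mean())  # average over all N*M pairs
```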
3D Geometric Fidelity
Chamfer Distance between point clouds reconstructed from generated frames and ground-truth multi-view images from 3D-CustomBench. Lower is better.
| Method | Views | Accuracy ↓ | Completeness ↓ | CD ↓ |
|---|---|---|---|---|
| VACE | S | 0.0278 | 0.0427 | 0.0353 |
| Phantom | S | 0.0289 | 0.0388 | 0.0338 |
| 3Dapter | S | 0.0315 | 0.0659 | 0.0487 |
| 3DreamBooth | M | 0.0156 | 0.0322 | 0.0239 |
| 3Dapter + 3DB | M | 0.0182 | 0.0172 | 0.0177 |
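The three geometric metrics above are the standard symmetric Chamfer decomposition: accuracy is the mean distance from each reconstructed point to its nearest ground-truth point, completeness the reverse, and CD their average. A brute-force numpy sketch (fine for small clouds; large clouds would use a KD-tree):

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance between point clouds pred (N, 3)
    and gt (M, 3).

    accuracy:     mean nearest-neighbor distance pred -> gt
    completeness: mean nearest-neighbor distance gt -> pred
    CD:           average of the two (matching the table's columns)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M)
    accuracy = d.min(axis=1).mean()
    completeness = d.min(axis=0).mean()
    return accuracy, completeness, 0.5 * (accuracy + completeness)
```

As a sanity check against the table, CD is the mean of the two directed terms, e.g. (0.0278 + 0.0427) / 2 ≈ 0.0353 for VACE.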
Video Quality & Text Alignment
Intrinsic video quality via VBench and text-video alignment via ViCLIP Score (ViCLIP-L/14).
| Method | Views | Aesthetic Quality ↑ | Imaging Quality ↑ | Motion Smoothness ↑ | ViCLIP ↑ |
|---|---|---|---|---|---|
| VACE | S | 0.5915 | 70.84 | 0.9916 | 0.2663 |
| Phantom | S | 0.5798 | 70.58 | 0.9934 | 0.2634 |
| 3Dapter | S | 0.6283 | 71.65 | 0.9944 | 0.2048 |
| 3DreamBooth | M | 0.5245 | 73.34 | 0.9928 | 0.2415 |
| 3Dapter + 3DB | M | 0.5920 | 74.33 | 0.9918 | 0.2388 |
Model-Agnostic Generalization
Generalization

WanVideo 2.1 Results

3DreamBooth is model-agnostic. We demonstrate consistent identity preservation on WanVideo 2.1 (720p) in addition to our primary Hunyuan-based model.

Citation

BibTeX

```bibtex
@misc{ko20263dreamboothhighfidelity3dsubjectdriven,
  title         = {3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model},
  author        = {Hyun-kyu Ko and Jihyeon Park and Younghyun Kim and Dongheok Park and Eunbyung Park},
  year          = {2026},
  eprint        = {2603.18524},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.18524},
}
```