Motif-Video 2B
A micro-budget text-to-video diffusion transformer from Motif Technologies
Technical Report | Hugging Face | Project Page
News
- [2026-04-29] RTX 4090 benchmarks added: SageAttention achieves ~3.16× speedup, and all GGUF variants fit in 24 GB. See GGUF + SageAttention.
- [2026-04-28] ComfyUI custom nodes released: ComfyUI-MotifVideo2B. GGUF workflow support coming soon.
- [2026-04-28] GGUF quantized weights now available at Motif-Video-2B-GGUF: up to 2.7 GB VRAM savings with no speed penalty. SageAttention support for ~2× faster inference. See GGUF + SageAttention below.
- [2026-04-14] We release Motif-Video 2B, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full technical report.
Introduction
Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. Motif-Video 2B asks whether competitive text-to-video quality is reachable at a much smaller budget (fewer than 10M training clips and under 100,000 H200 GPU hours) and shows that the answer is yes, provided the model design explicitly separates objectives that scaling would otherwise leave entangled.
Our central observation is that prompt alignment, temporal consistency, and fine-detail recovery interfere with one another when handled through the same pathway. Motif-Video 2B addresses this objective interference architecturally rather than relying on scale alone, through two contributions:
- Shared Cross-Attention. A residual cross-attention mechanism that reuses self-attention K/V weights to stabilize text–video alignment under long-context token sparsity, where standard joint attention dilutes text influence as the video token sequence grows.
- Three-stage DDT-style backbone. 12 dual-stream + 16 single-stream + 8 DDT decoder layers, separating early modality fusion, joint representation learning, and high-frequency detail reconstruction into dedicated components. Per-block attention analysis shows that the DDT decoder spontaneously develops inter-frame attention structure absent from the encoder layers.
These are paired with a micro-budget training recipe combining TREAD token routing and early-phase REPA with a frozen V-JEPA teacher; to our knowledge, this is the first time this combination has been applied to text-to-video training.
On VBench, Motif-Video 2B reaches 83.76%, the highest Total Score among the open-source models we evaluate, surpassing Wan2.1-14B with 7× fewer parameters and roughly an order of magnitude less training data.
Highlights
- Two tasks, one set of weights. A single checkpoint handles both text-to-video (T2V) and image-to-video (I2V) generation, trained jointly without a learnable task-type embedding.
- Up to 720p, 121 frames. The final model generates 720p video at 121 frames under the standard rectified flow-matching sampler.
- Architectural specialization over brute-force scale. Three-stage backbone with role-separated dual-stream / single-stream / DDT decoder layers.
- Shared Cross-Attention. Stabilizes text alignment under long video-token sequences by grounding cross-attention K/V in the self-attention manifold.
- Micro-budget recipe. TREAD token routing (~27% per-step FLOP reduction; see the sketch after this list) + early-phase REPA with a V-JEPA teacher + an offline bucket-balanced sampler (~90% data utilization, up from a ~20% baseline).
- Open and reproducible. Trained on ~64 H200 GPUs with FSDP2, with the full curriculum and recipe documented in the technical report.
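To make the TREAD component of the recipe concrete, here is a minimal, illustrative sketch of token routing: during training, a random subset of tokens bypasses a contiguous range of backbone blocks and is scattered back into its original positions afterward, which is where the per-step FLOP reduction comes from. The function name, keep ratio, and block interface below are illustrative assumptions, not the Motif-Video 2B training code.

```python
import torch


def tread_route(tokens: torch.Tensor, routed_blocks, keep_ratio: float = 0.5):
    """Illustrative TREAD-style token routing (sketch, not the released code).

    A random subset of tokens passes through `routed_blocks`; the rest skip
    those blocks entirely and are scattered back into place afterward, so the
    routed blocks only ever see a shortened sequence.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))  # keep_ratio is an assumed value

    # Random permutation per sample; the first n_keep indices are routed.
    ids = torch.argsort(torch.rand(B, N, device=tokens.device), dim=1)
    keep_ids = ids[:, :n_keep].unsqueeze(-1).expand(-1, -1, D)
    kept = torch.gather(tokens, 1, keep_ids)

    # Only the kept tokens are processed by the routed range of blocks.
    for block in routed_blocks:
        kept = block(kept)

    # Scatter processed tokens back; skipped tokens pass through unchanged.
    out = tokens.clone()
    out.scatter_(1, keep_ids, kept)
    return out
```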
Architecture
Motif-Video 2B is a flow-matching diffusion transformer organized around a single principle: each component is assigned a well-defined responsibility, and components with conflicting objectives are not asked to share capacity.
| Component | Choice |
|---|---|
| Text encoder | T5Gemma2 (encoder–decoder, UL2-adapted Gemma 3) |
| Video tokenizer | Wan2.1 VAE (8×8 spatial, 4× temporal compression), 2×2×1 patchify |
| Backbone | 12 dual-stream + 16 single-stream + 8 DDT decoder layers |
| Hidden dim / heads | 1536 / 12 heads × 128 |
| Normalization | QK-normalization throughout |
| Position encoding | RoPE |
| Cross-attention | Shared Cross-Attention in the single-stream stage |
| Objective | Rectified flow matching (velocity prediction) |
| I2V conditioning | First-frame latent + SigLIP image embeddings, with timestep-aware blur |
A high-level walkthrough of the role separation:
- Dual-stream stage (12 layers). Text and video tokens are processed through separate self-attention pathways, exchanging information via cross-attention. This prevents premature feature entanglement before either modality has formed coherent representations.
- Single-stream stage (16 layers). Text and video tokens attend freely in a joint sequence. Shared Cross-Attention is attached here to repair the text-attention dilution that emerges as the video token sequence grows.
- DDT decoder (8 layers). A dedicated velocity decoder atop the 28-layer encoder, freeing the encoder from high-frequency detail reconstruction. Per-block attention analysis shows that the DDT decoder develops inter-frame attention structure that single-stream layers do not.
For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the technical report.
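For readers who prefer code, the sketch below shows one way such a shared-K/V residual cross-attention block could be written in PyTorch. The class name, the exact way the self-attention K/V projections are passed in, and the head layout are our own illustrative assumptions; the authoritative formulation is the one in Section 3.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedCrossAttention(nn.Module):
    """Sketch of a residual cross-attention block that reuses self-attention K/V.

    Q is a separate projection of the video tokens; K and V reuse the K/V
    linears of the block's self-attention, applied to the text tokens. The
    output projection is zero-initialized, so the block starts as a no-op.
    """

    def __init__(self, self_attn_k: nn.Linear, self_attn_v: nn.Linear,
                 dim: int = 1536, num_heads: int = 12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)     # Q is NOT shared
        self.to_k = self_attn_k             # shared with self-attention
        self.to_v = self_attn_v             # shared with self-attention
        self.to_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.to_out.weight)  # standard zero-init of W_O
        nn.init.zeros_(self.to_out.bias)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        B, N, D = video.shape
        q = self.to_q(video).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.to_k(text).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.to_v(text).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(B, N, D)
        return video + self.to_out(attn)    # residual connection
```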
Quickstart / Usage
Requirements
- Python 3.10+
- CUDA-capable GPU with 30 GB+ VRAM (e.g., A100, H100); for 24 GB GPUs, see Memory-efficient Inference
pip install "transformers>=5.5.4" torch accelerate ftfy einops sentencepiece regex Pillow imageio imageio-ffmpeg
pip install git+https://github.com/waitingcheung/diffusers.git@feat/motif-video
Text-to-Video (T2V)
import torch
from diffusers import (
AdaptiveProjectedGuidance,
DPMSolverMultistepScheduler,
MotifVideoPipeline,
)
from diffusers.utils import export_to_video
guider = AdaptiveProjectedGuidance(
guidance_scale=8.0,
adaptive_projected_guidance_rescale=12.0,
adaptive_projected_guidance_momentum=0.1,
use_original_formulation=True,
normalization_dims="spatial",
)
pipe = MotifVideoPipeline.from_pretrained(
"Motif-Technologies/Motif-Video-2B",
torch_dtype=torch.bfloat16,
guider=guider,
)
pipe = pipe.to("cuda")
output = pipe(
prompt="A woman standing in a sunlit field as flower petals swirl around her in slow motion. Each petal floats gently through the golden light, casting tiny shadows. Her hair moves like water, and time seems to stand still.",
negative_prompt="text overlay, graphic overlay, watermark, logo, subtitles, timestamp, broadcast graphics, UI elements, random letters, frozen pose, rigid, static expression, jerky motion, mechanical motion, discontinuous motion, flat framing, depthless, dull lighting, monotone, crushed shadows, blown-out highlights, shifting background, fading background, poor continuity, identity drift, deformation, flickering, ghosting, smearing, duplication, mutated proportions, inconsistent clothing, flat colors, desaturated, tonally compressed, poor background separation, exposure shift, uneven brightness, color balance shift",
height=736,
width=1280,
num_frames=121,
num_inference_steps=50,
frame_rate=24,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
Image-to-Video (I2V)
import torch
from diffusers import (
AdaptiveProjectedGuidance,
DPMSolverMultistepScheduler,
MotifVideoImage2VideoPipeline,
)
from diffusers.utils import export_to_video, load_image
guider = AdaptiveProjectedGuidance(
guidance_scale=8.0,
adaptive_projected_guidance_rescale=12.0,
adaptive_projected_guidance_momentum=0.1,
use_original_formulation=True,
normalization_dims="spatial",
)
pipe = MotifVideoImage2VideoPipeline.from_pretrained(
"Motif-Technologies/Motif-Video-2B",
torch_dtype=torch.bfloat16,
guider=guider,
)
pipe = pipe.to("cuda")
image = load_image("https://huggingface.co/Motif-Technologies/Motif-Video-2B/resolve/main/assets/i2v_sample.jpg")
output = pipe(
prompt="Three friends stride through a sun-bleached meadow as a warm breeze ripples the tall dry grass around their legs.",
negative_prompt="text overlay, graphic overlay, watermark, logo, subtitles, timestamp, broadcast graphics, UI elements, random letters, frozen pose, rigid, static expression, jerky motion, mechanical motion, discontinuous motion, flat framing, depthless, dull lighting, monotone, crushed shadows, blown-out highlights, shifting background, fading background, poor continuity, identity drift, deformation, flickering, ghosting, smearing, duplication, mutated proportions, inconsistent clothing, flat colors, desaturated, tonally compressed, poor background separation, exposure shift, uneven brightness, color balance shift",
image=image,
height=736,
width=1280,
num_frames=121,
num_inference_steps=50,
frame_rate=24,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
CLI Inference
# Text-to-Video (default settings)
python inference.py \
--prompt "A woman standing in a sunlit field as..." \
--output t2v_output.mp4
# With SageAttention (~2x faster, requires sageattention package)
python inference.py \
--prompt "Three friends stride through a sun-bleached meadow..." \
--use-sage-attention \
--output t2v_output.mp4
See inference.py --help for all available options.
Recommended Settings
| Parameter | Default | Notes |
|---|---|---|
| Resolution | 1280×736 | 720p, best quality |
| Frames | 121 | ~5 seconds at 24fps |
| Scheduler | DPMSolver++ | solver_order=2, flow_shift=15.0 |
| Guidance scale | 8.0 | With APG (normalization_dims="spatial") |
| Inference steps | 50 | |
| Negative prompt | (built-in) | See code examples above |
| use_linear_quadratic_schedule | False | Must be set explicitly |
| dtype | bfloat16 | Recommended for H100/A100 |
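Note that the Python examples above import DPMSolverMultistepScheduler but leave the pipeline's default scheduler in place. If you want to apply the DPMSolver++ settings from this table explicitly, a sketch like the following should work, assuming a diffusers build whose DPMSolverMultistepScheduler exposes the flow-matching options (use_flow_sigmas, flow_shift); check the fork's documentation for the exact configuration we used.

```python
from diffusers import DPMSolverMultistepScheduler

# Swap in DPMSolver++ with the recommended settings before running the pipeline.
# use_flow_sigmas / flow_shift are assumed to be available in this diffusers build.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config,
    algorithm_type="dpmsolver++",
    solver_order=2,
    use_flow_sigmas=True,
    flow_shift=15.0,
)
```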
Memory-efficient Inference
For GPUs with 24 GB or less (e.g. RTX 4090, RTX 3090), CPU offloading and FP8 quantization can reduce peak VRAM from ~30 GB to ~15 GB with minimal speed impact.
| Mode | Peak VRAM | Recommended GPU |
|---|---|---|
| pipe.to("cuda") | ~30 GB | A100, H100, H200 |
| enable_model_cpu_offload() | ~19 GB | RTX 4090, RTX 3090 |
| + FP8 quantization | ~15 GB | RTX 4090, RTX 3090 |
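As a minimal sketch, the ~19 GB mode only requires replacing the pipe.to("cuda") line in the examples above; the FP8 path is covered in the full guide linked below.

```python
# In either example above, replace `pipe = pipe.to("cuda")` with:
pipe.enable_model_cpu_offload()  # each submodule is moved to the GPU only while it runs
```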
Full guide → docs/memory-efficient-inference.md
GGUF + SageAttention
GGUF quantized weights are available at Motif-Video-2B-GGUF, offering up to 2.7 GB VRAM savings with no speed penalty. Combined with SageAttention, this gives ~1.6× faster inference.
| Variant | Sage (s/it) | Speedup | Peak alloc (GB) |
|---|---|---|---|
| BF16 | 14.75 | 1.58x | 15.12 |
| Q8_0 | 14.49 | 1.60x | 13.44 |
| Q4_K_M | 14.59 | 1.60x | 12.53 |
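As a rough sketch of loading a GGUF variant through diffusers' GGUF support: the transformer class name (MotifVideoTransformer3DModel) and the checkpoint filename below are assumptions for illustration, and the guider setup from the examples above is omitted for brevity; see the full guide for the supported path. SageAttention itself is enabled via the --use-sage-attention flag of inference.py shown earlier.

```python
import torch
from diffusers import GGUFQuantizationConfig, MotifVideoPipeline
# Hypothetical class name for the video transformer; check the fork's docs.
from diffusers import MotifVideoTransformer3DModel

transformer = MotifVideoTransformer3DModel.from_single_file(
    # Assumed filename; point this at the GGUF variant you downloaded.
    "Motif-Video-2B-Q8_0.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```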
Full guide → docs/gguf-sageattention.md
ComfyUI
Official ComfyUI custom nodes: ComfyUI-MotifVideo2B
Note: Currently requires High VRAM mode. GGUF quantized model loading in ComfyUI is in progress.
Performance
VBench
Motif-Video 2B achieves the highest Total Score among open-source models we evaluate.
| Model | Params | Total | Quality | Semantic |
|---|---|---|---|---|
| Wan2.2-T2V (prompt-opt.) | A14B | 84.23 | 85.42 | 79.50 |
| Motif-Video 2B (Ours) | 2B | 83.76 | 84.59 | 80.44 |
| SANA-Video | 2B | 83.71 | 84.35 | 81.35 |
| Wan2.1-T2V | 14B | 83.69 | 85.59 | 76.11 |
| OpenSora 2.0 (T2I2V) | 11B | 83.60 | 84.40 | 80.30 |
| Wan2.1-T2V | 1.3B | 83.31 | 85.23 | 75.65 |
| HunyuanVideo | 13B | 83.24 | 85.09 | 75.82 |
| CogVideoX1.5-5B (prompt-opt.) | 5B | 82.17 | 82.78 | 79.76 |
| Step-Video-T2V | 30B | 81.83 | 84.46 | 71.28 |
| LTX-Video | 2B | 80.00 | 82.30 | 70.79 |
Notable per-dimension highlights for Motif-Video 2B among open-source models:
- Spatial Relationship: 83.02%, best among open-source models
- Semantic Score: 80.44%, highest among open-source models reporting per-dimension results
- Object Class: 92.93%, Multiple Objects: 77.29%, Imaging Quality: 70.50%, each second-best in its category
The full 16-dimension breakdown is in Table 3 of the technical report.
A note on VBench vs. perceptual quality. Motif-Video 2B leads on VBench Total Score, but in our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy. We discuss the sources of this gap (uniform dimension weighting, near-correct semantic credit) in Section 7 of the report. We report the gap explicitly rather than smoothing it over.
Human evaluation
In a blind pairwise study against six contemporaneous open-source baselines (SANA-Video, LTX-Video 2, Wan2.1-14B, Wan2.1-1.3B, Wan2.2-5B, CogVideoX-5B) on 40 LLM-generated prompts, Motif-Video 2B is preferred over both SANA-Video (similar parameter count) and Wan2.1-1.3B (similar parameter count, larger training corpus) on prompt-following and video-fidelity axes. Wan2.1-14B remains the preferred model overall, consistent with its 7× larger parameter count and substantially larger training data.
Showcase
Text-to-Video
Image-to-Video
Limitations
We report limitations as the boundary conditions under which the design decisions in this report should be interpreted, not as caveats.
- Micro-scale semantic distortion. Motif-Video 2B occasionally produces sub-object-level artifacts that leave the category label intact but break perceptual plausibility: distorted hands on close-up human subjects, degraded body structure under high-displacement motion, and attribute leakage between visually similar co-present subjects. We attribute these primarily to data coverage rather than backbone design.
- Temporal failures. Three distinct modes that frame-level metrics do not surface: (i) physically implausible liquid / cloth / collision dynamics, (ii) coherence loss under high scene complexity (multi-agent crowds), and (iii) unintended mid-clip scene transitions in long sequences.
- Recipe components are evaluated jointly, not in isolation. We do not present per-component ablations for Shared Cross-Attention, the DDT decoder, REPA phasing, or TREAD routing at full scale. Readers should interpret our results as evidence that the composed recipe works at 2B, not as a marginal-contribution claim about any single component.
We view temporal stability and data coverage, rather than architectural depth, as the primary remaining ceilings on this model. Both are the most natural axes for a future iteration, and the current architecture is built to absorb improvements along them.
Citation
If you find Motif-Video 2B useful in your research, please cite:
@techreport{motifvideo2b2026,
title = {Motif-Video 2B: Technical Report},
author = {Motif Technologies},
year = {2026},
institution = {Motif Technologies},
url = {https://arxiv.org/abs/2604.16503}
}
Acknowledgements
We build on a number of excellent open-source projects, including the Wan2.1 VAE [Wan Team, 2025], T5Gemma / Gemma 3 [Google], TREAD [Krause et al., 2025], REPA with the V-JEPA family of visual encoders [Bardes et al.], DDT [Wang et al.], and the broader diffusers and Accelerate ecosystems. Compute was provisioned on Microsoft Azure and orchestrated with SkyPilot on Kubernetes.
License
This model is released under the Apache 2.0 License. See LICENSE for details.