Motif-Video 2B
A micro-budget text-to-video diffusion transformer from Motif Technologies
Technical Report | Hugging Face | Project Page
News
- [2026-04-29] RTX 4090 benchmarks added: SageAttention achieves ~3.16× speedup, and all GGUF variants fit in 24 GB. See GGUF + SageAttention.
- [2026-04-28] ComfyUI custom nodes released: ComfyUI-MotifVideo2B. GGUF workflow support coming soon.
- [2026-04-28] GGUF quantized weights now available at Motif-Video-2B-GGUF: up to 2.7 GB VRAM savings with no speed penalty. SageAttention support for ~2× faster inference. See GGUF + SageAttention below.
- [2026-04-14] We release Motif-Video 2B, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full technical report.
Introduction
Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. Motif-Video 2B asks whether competitive text-to-video quality is reachable at a much smaller budget (fewer than 10M training clips and under 100,000 H200 GPU hours) and shows that the answer is yes, provided the model design explicitly separates objectives that scaling would otherwise leave entangled.
Our central observation is that prompt alignment, temporal consistency, and fine-detail recovery interfere with one another when handled through the same pathway. Motif-Video 2B addresses this objective interference architecturally rather than relying on scale alone, through two contributions:
- Shared Cross-Attention. A residual cross-attention mechanism that reuses self-attention K/V weights to stabilize text–video alignment under long-context token sparsity, where standard joint attention dilutes text influence as the video token sequence grows.
- Three-stage DDT-style backbone. 12 dual-stream + 16 single-stream + 8 DDT decoder layers, separating early modality fusion, joint representation learning, and high-frequency detail reconstruction into dedicated components. Per-block attention analysis shows that the DDT decoder spontaneously develops inter-frame attention structure absent from the encoder layers.
These are paired with a micro-budget training recipe combining TREAD token routing and early-phase REPA with a frozen V-JEPA teacher; to our knowledge, this is the first time this combination has been applied to text-to-video training.
On VBench, Motif-Video 2B reaches 83.76%, the highest Total Score among the open-source models we evaluate, surpassing Wan2.1-14B with 7× fewer parameters and roughly an order of magnitude less training data.
Highlights
- Two tasks, one set of weights. A single checkpoint handles both text-to-video (T2V) and image-to-video (I2V) generation, trained jointly without a learnable task-type embedding.
- Up to 720p, 121 frames. The final model generates 720p video at 121 frames under the standard rectified flow-matching sampler.
- Architectural specialization over brute-force scale. Three-stage backbone with role-separated dual-stream / single-stream / DDT decoder layers.
- Shared Cross-Attention. Stabilizes text alignment under long video-token sequences by grounding cross-attention K/V in the self-attention manifold.
- Micro-budget recipe. TREAD token routing (~27% per-step FLOP reduction; see the sketch after this list) + early-phase REPA with a V-JEPA teacher + an offline bucket-balanced sampler (~90% data utilization, up from a ~20% baseline).
- Open and reproducible. Trained on ~64 H200 GPUs with FSDP2, with the full curriculum and recipe documented in the technical report.
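To make the TREAD component of the recipe concrete, here is a minimal, illustrative sketch of token routing: during training, a random subset of tokens bypasses a contiguous range of backbone blocks and is scattered back into its original positions afterward, which is where the per-step FLOP reduction comes from. The function name, keep ratio, and block interface below are illustrative assumptions, not the Motif-Video 2B training code.

```python
import torch


def tread_route(tokens: torch.Tensor, routed_blocks, keep_ratio: float = 0.5):
    """Illustrative TREAD-style token routing (sketch, not the released code).

    A random subset of tokens passes through `routed_blocks`; the rest skip
    those blocks entirely and are scattered back into place afterward, so the
    routed blocks only ever see a shortened sequence.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))  # keep_ratio is an assumed value

    # Random permutation per sample; the first n_keep indices are routed.
    ids = torch.argsort(torch.rand(B, N, device=tokens.device), dim=1)
    keep_ids = ids[:, :n_keep].unsqueeze(-1).expand(-1, -1, D)
    kept = torch.gather(tokens, 1, keep_ids)

    # Only the kept tokens are processed by the routed range of blocks.
    for block in routed_blocks:
        kept = block(kept)

    # Scatter processed tokens back; skipped tokens pass through unchanged.
    out = tokens.clone()
    out.scatter_(1, keep_ids, kept)
    return out
```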
Architecture
Motif-Video 2B is a flow-matching diffusion transformer organized around a single principle: each component is assigned a well-defined responsibility, and components with conflicting objectives are not asked to share capacity.
| Component | Choice |
|---|---|
| Text encoder | T5Gemma2 (encoder–decoder, UL2-adapted Gemma 3) |
| Video tokenizer | Wan2.1 VAE (8×8 spatial, 4× temporal compression), 2×2×1 patchify |
| Backbone | 12 dual-stream + 16 single-stream + 8 DDT decoder layers |
| Hidden dim / heads | 1536 / 12 heads × 128 |
| Normalization | QK-normalization throughout |
| Position encoding | RoPE |
| Cross-attention | Shared Cross-Attention in the single-stream stage |
| Objective | Rectified flow matching (velocity prediction) |
| I2V conditioning | First-frame latent + SigLIP image embeddings, with timestep-aware blur |
A high-level walkthrough of the role separation:
- Dual-stream stage (12 layers). Text and video tokens are processed through separate self-attention pathways, exchanging information via cross-attention. This prevents premature feature entanglement before either modality has formed coherent representations.
- Single-stream stage (16 layers). Text and video tokens attend freely in a joint sequence. Shared Cross-Attention is attached here to repair the text-attention dilution that emerges as the video token sequence grows.
- DDT decoder (8 layers). A dedicated velocity decoder atop the 28-layer encoder, freeing the encoder from high-frequency detail reconstruction. Per-block attention analysis shows that the DDT decoder develops inter-frame attention structure that single-stream layers do not.
For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the technical report.
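For readers who prefer code, the sketch below shows one way such a shared-K/V residual cross-attention block could be written in PyTorch. The class name, the exact way the self-attention K/V projections are passed in, and the head layout are our own illustrative assumptions; the authoritative formulation is the one in Section 3.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedCrossAttention(nn.Module):
    """Sketch of a residual cross-attention block that reuses self-attention K/V.

    Q is a separate projection of the video tokens; K and V reuse the K/V
    linears of the block's self-attention, applied to the text tokens. The
    output projection is zero-initialized, so the block starts as a no-op.
    """

    def __init__(self, self_attn_k: nn.Linear, self_attn_v: nn.Linear,
                 dim: int = 1536, num_heads: int = 12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)     # Q is NOT shared
        self.to_k = self_attn_k             # shared with self-attention
        self.to_v = self_attn_v             # shared with self-attention
        self.to_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.to_out.weight)  # standard zero-init of W_O
        nn.init.zeros_(self.to_out.bias)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        B, N, D = video.shape
        q = self.to_q(video).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.to_k(text).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.to_v(text).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(B, N, D)
        return video + self.to_out(attn)    # residual connection
```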
Quickstart / Usage
Requirements
- Python 3.10+
- CUDA-capable GPU with 30 GB+ VRAM (e.g., A100, H100); for 24 GB GPUs, see Memory-efficient Inference
pip install "transformers>=5.5.4" torch accelerate ftfy einops sentencepiece regex Pillow imageio imageio-ffmpeg
pip install git+https://github.com/waitingcheung/diffusers.git@feat/motif-video
Text-to-Video (T2V)
import torch
from diffusers import (
AdaptiveProjectedGuidance,
DPMSolverMultistepScheduler,
MotifVideoPipeline,
)
from diffusers.utils import export_to_video
guider = AdaptiveProjectedGuidance(
guidance_scale=8.0,
adaptive_projected_guidance_rescale=12.0,
adaptive_projected_guidance_momentum=0.1,
use_original_formulation=True,
normalization_dims="spatial",
)
pipe = MotifVideoPipeline.from_pretrained(
"Motif-Technologies/Motif-Video-2B",
torch_dtype=torch.bfloat16,
guider=guider,
)
pipe = pipe.to("cuda")
output = pipe(
prompt="A woman standing in a sunlit field as flower petals swirl around her in slow motion. Each petal floats gently through the golden light, casting tiny shadows. Her hair moves like water, and time seems to stand still.",
negative_prompt="text overlay, graphic overlay, watermark, logo, subtitles, timestamp, broadcast graphics, UI elements, random letters, frozen pose, rigid, static expression, jerky motion, mechanical motion, discontinuous motion, flat framing, depthless, dull lighting, monotone, crushed shadows, blown-out highlights, shifting background, fading background, poor continuity, identity drift, deformation, flickering, ghosting, smearing, duplication, mutated proportions, inconsistent clothing, flat colors, desaturated, tonally compressed, poor background separation, exposure shift, uneven brightness, color balance shift",
height=736,
width=1280,
num_frames=121,
num_inference_steps=50,
frame_rate=24,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
Image-to-Video (I2V)
import torch
from diffusers import (
AdaptiveProjectedGuidance,
DPMSolverMultistepScheduler,
MotifVideoImage2VideoPipeline,
)
from diffusers.utils import export_to_video, load_image
guider = AdaptiveProjectedGuidance(
guidance_scale=8.0,
adaptive_projected_guidance_rescale=12.0,
adaptive_projected_guidance_momentum=0.1,
use_original_formulation=True,
normalization_dims="spatial",
)
pipe = MotifVideoImage2VideoPipeline.from_pretrained(
"Motif-Technologies/Motif-Video-2B",
torch_dtype=torch.bfloat16,
guider=guider,
)
pipe = pipe.to("cuda")
image = load_image("https://huggingface.co/Motif-Technologies/Motif-Video-2B/resolve/main/assets/i2v_sample.jpg")
output = pipe(
prompt="Three friends stride through a sun-bleached meadow as a warm breeze ripples the tall dry grass around their legs.",
negative_prompt="text overlay, graphic overlay, watermark, logo, subtitles, timestamp, broadcast graphics, UI elements, random letters, frozen pose, rigid, static expression, jerky motion, mechanical motion, discontinuous motion, flat framing, depthless, dull lighting, monotone, crushed shadows, blown-out highlights, shifting background, fading background, poor continuity, identity drift, deformation, flickering, ghosting, smearing, duplication, mutated proportions, inconsistent clothing, flat colors, desaturated, tonally compressed, poor background separation, exposure shift, uneven brightness, color balance shift",
image=image,
height=736,
width=1280,
num_frames=121,
num_inference_steps=50,
frame_rate=24,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
CLI Inference
# Text-to-Video (default settings)
python inference.py \
--prompt "A woman standing in a sunlit field as..." \
--output t2v_output.mp4
# With SageAttention (~2x faster, requires sageattention package)
python inference.py \
--prompt "Three friends stride through a sun-bleached meadow..." \
--use-sage-attention \
--output t2v_output.mp4
See inference.py --help for all available options.
Recommended Settings
| Parameter | Default | Notes |
|---|---|---|
| Resolution | 1280×736 | 720p, best quality |
| Frames | 121 | ~5 seconds at 24fps |
| Scheduler | DPMSolver++ | solver_order=2, flow_shift=15.0 |
| Guidance scale | 8.0 | With APG (normalization_dims="spatial") |
| Inference steps | 50 | |
| Negative prompt | (built-in) | See code examples above |
| use_linear_quadratic_schedule | False | Must be set explicitly |
| dtype | bfloat16 | Recommended for H100/A100 |
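Note that the Python examples above import DPMSolverMultistepScheduler but leave the pipeline's default scheduler in place. If you want to apply the DPMSolver++ settings from this table explicitly, a sketch like the following should work, assuming a diffusers build whose DPMSolverMultistepScheduler exposes the flow-matching options (use_flow_sigmas, flow_shift); check the fork's documentation for the exact configuration we used.

```python
from diffusers import DPMSolverMultistepScheduler

# Swap in DPMSolver++ with the recommended settings before running the pipeline.
# use_flow_sigmas / flow_shift are assumed to be available in this diffusers build.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config,
    algorithm_type="dpmsolver++",
    solver_order=2,
    use_flow_sigmas=True,
    flow_shift=15.0,
)
```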
Memory-efficient Inference
For GPUs with 24 GB or less (e.g. RTX 4090, RTX 3090), CPU offloading and FP8 quantization can reduce peak VRAM from ~30 GB to ~15 GB with minimal speed impact.
| Mode | Peak VRAM | Recommended GPU |
|---|---|---|
| pipe.to("cuda") | ~30 GB | A100, H100, H200 |
| enable_model_cpu_offload() | ~19 GB | RTX 4090, RTX 3090 |
| + FP8 quantization | ~15 GB | RTX 4090, RTX 3090 |
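As a minimal sketch, the ~19 GB mode only requires replacing the pipe.to("cuda") line in the examples above; the FP8 path is covered in the full guide linked below.

```python
# In either example above, replace `pipe = pipe.to("cuda")` with:
pipe.enable_model_cpu_offload()  # each submodule is moved to the GPU only while it runs
```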
Full guide → docs/memory-efficient-inference.md
GGUF + SageAttention
GGUF quantized weights are available at Motif-Video-2B-GGUF, offering up to 2.7 GB VRAM savings with no speed penalty. Combined with SageAttention, this gives ~1.6× faster inference.
| Variant | Sage (s/it) | Speedup | Peak alloc (GB) |
|---|---|---|---|
| BF16 | 14.75 | 1.58x | 15.12 |
| Q8_0 | 14.49 | 1.60x | 13.44 |
| Q4_K_M | 14.59 | 1.60x | 12.53 |
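As a rough sketch of loading a GGUF variant through diffusers' GGUF support: the transformer class name (MotifVideoTransformer3DModel) and the checkpoint filename below are assumptions for illustration, and the guider setup from the examples above is omitted for brevity; see the full guide for the supported path. SageAttention itself is enabled via the --use-sage-attention flag of inference.py shown earlier.

```python
import torch
from diffusers import GGUFQuantizationConfig, MotifVideoPipeline
# Hypothetical class name for the video transformer; check the fork's docs.
from diffusers import MotifVideoTransformer3DModel

transformer = MotifVideoTransformer3DModel.from_single_file(
    # Assumed filename; point this at the GGUF variant you downloaded.
    "Motif-Video-2B-Q8_0.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```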
Full guide → docs/gguf-sageattention.md
ComfyUI
Official ComfyUI custom nodes: ComfyUI-MotifVideo2B
Note: Currently requires High VRAM mode. GGUF quantized model loading in ComfyUI is in progress.
Performance
VBench
Motif-Video 2B achieves the highest Total Score among open-source models we evaluate.
| Model | Params | Total | Quality | Semantic |
|---|---|---|---|---|
| Wan2.2-T2V (prompt-opt.) | A14B | 84.23 | 85.42 | 79.50 |
| Motif-Video 2B (Ours) | 2B | 83.76 | 84.59 | 80.44 |
| SANA-Video | 2B | 83.71 | 84.35 | 81.35 |
| Wan2.1-T2V | 14B | 83.69 | 85.59 | 76.11 |
| OpenSora 2.0 (T2I2V) | 11B | 83.60 | 84.40 | 80.30 |
| Wan2.1-T2V | 1.3B | 83.31 | 85.23 | 75.65 |
| HunyuanVideo | 13B | 83.24 | 85.09 | 75.82 |
| CogVideoX1.5-5B (prompt-opt.) | 5B | 82.17 | 82.78 | 79.76 |
| Step-Video-T2V | 30B | 81.83 | 84.46 | 71.28 |
| LTX-Video | 2B | 80.00 | 82.30 | 70.79 |
Notable per-dimension highlights for Motif-Video 2B among open-source models:
- Spatial Relationship: 83.02%, best among open-source models
- Semantic Score: 80.44%, highest among open-source models reporting per-dimension results
- Object Class: 92.93%, Multiple Objects: 77.29%, Imaging Quality: 70.50%, each second-best in its category
The full 16-dimension breakdown is in Table 3 of the technical report.
A note on VBench vs. perceptual quality. Motif-Video 2B leads on VBench Total Score, but in our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy. We discuss the sources of this gap (uniform dimension weighting, near-correct semantic credit) in Section 7 of the report. We report the gap explicitly rather than smoothing it over.
Human evaluation
In a blind pairwise study against six contemporaneous open-source baselines (SANA-Video, LTX-Video 2, Wan2.1-14B, Wan2.1-1.3B, Wan2.2-5B, CogVideoX-5B) on 40 LLM-generated prompts, Motif-Video 2B is preferred over both SANA-Video (similar parameter count) and Wan2.1-1.3B (similar parameter count, larger training corpus) on prompt-following and video-fidelity axes. Wan2.1-14B remains the preferred model overall, consistent with its 7× larger parameter count and substantially larger training data.
Showcase
Text-to-Video
Image-to-Video
Limitations
We report limitations as the boundary conditions under which the design decisions in this report should be interpreted, not as caveats.
- Micro-scale semantic distortion. Motif-Video 2B occasionally produces sub-object-level artifacts that leave the category label intact but break perceptual plausibility: distorted hands on close-up human subjects, degraded body structure under high-displacement motion, and attribute leakage between visually similar co-present subjects. We attribute these primarily to data coverage rather than backbone design.
- Temporal failures. Three distinct modes that frame-level metrics do not surface: (i) physically implausible liquid / cloth / collision dynamics, (ii) coherence loss under high scene complexity (multi-agent crowds), and (iii) unintended mid-clip scene transitions in long sequences.
- Recipe components are evaluated jointly, not in isolation. We do not present per-component ablations for Shared Cross-Attention, the DDT decoder, REPA phasing, or TREAD routing at full scale. Readers should interpret our results as evidence that the composed recipe works at 2B, not as a marginal-contribution claim about any single component.
We view temporal stability and data coverage, rather than architectural depth, as the primary remaining ceilings on this model. Both are the most natural axes for a future iteration, and the current architecture is built to absorb improvements along them.
Citation
If you find Motif-Video 2B useful in your research, please cite:
@techreport{motifvideo2b2026,
title = {Motif-Video 2B: Technical Report},
author = {Motif Technologies},
year = {2026},
institution = {Motif Technologies},
url = {https://arxiv.org/abs/2604.16503}
}
Acknowledgements
We build on a number of excellent open-source projects, including the Wan2.1 VAE [Wan Team, 2025], T5Gemma / Gemma 3 [Google], TREAD [Krause et al., 2025], REPA with the V-JEPA family of visual encoders [Bardes et al.], DDT [Wang et al.], and the broader diffusers and Accelerate ecosystems. Compute was provisioned on Microsoft Azure and orchestrated with SkyPilot on Kubernetes.
License
This model is released under the Apache 2.0 License. See LICENSE for details.