georgexin/cointeract
🎬 CoInteract

Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Project Page · arXiv · GitHub

Alibaba Group & Tsinghua University


✨ Highlights

CoInteract is the first end-to-end framework that generates physically consistent human-object interaction (HOI) videos with zero additional inference cost. Given a person image, a product image, a text prompt, and optional speech audio, CoInteract produces realistic videos in which humans naturally grasp, wear, present, and manipulate objects, with no hand-object interpenetration or geometric misalignment.

🔥 Key Results:

  • 🏆 State-of-the-art on HOI video synthesis benchmarks
  • 🤝 Physically plausible hand-object contact (significantly reduced interpenetration)
  • ⚡ Zero inference overhead: the auxiliary HOI branch is removed at test time
  • 🎯 Supports diverse interactions: grasping, wearing, presenting, carrying, and more
  • 🛒 Real-world ready: Virtual Try-On, Digital Human Commerce, Physics Simulation

๐Ÿ—๏ธ Architecture

CoInteract Architecture

CoInteract embeds structural priors and interaction geometry directly into a Diffusion Transformer (DiT) backbone through three core innovations:

  • 🧠 Human-Aware MoE: A spatially supervised Mixture-of-Experts routes tokens to region-specialized experts (Head, Hand, Base), ensuring high structural fidelity for hands and faces with minimal parameter overhead.

  • 🔗 Dual-Stream Co-Generation: An auxiliary HOI structure stream is jointly trained with the RGB stream within a shared DiT backbone, forcing the model to learn spatial and interaction relationships.

  • ✨ Asymmetric Co-Attention: A two-stage training strategy with asymmetric attention masks embeds physical interaction rules, enabling the HOI branch to be completely removed at inference with zero overhead.
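One way to picture the asymmetric masking is as a minimal sketch, assuming a token layout and mask direction that are illustrative only, not the paper's exact formulation:

```python
# Toy illustration (not the released code): build an asymmetric co-attention
# mask over a sequence of RGB tokens followed by auxiliary HOI-structure tokens.
# The token counts here are made up for the example.
N_RGB, N_HOI = 6, 4
N = N_RGB + N_HOI

# allowed[q][k] == True means query token q may attend to key token k.
# RGB queries attend only to RGB keys; HOI queries attend to everything.
allowed = [[(k < N_RGB) or (q >= N_RGB) for k in range(N)] for q in range(N)]

# Because no RGB token ever reads from an HOI token, deleting the HOI
# branch at inference leaves the RGB attention pattern untouched, which
# is how a zero-overhead claim can hold. What remains is plain full
# self-attention over the RGB tokens:
inference_allowed = [row[:N_RGB] for row in allowed[:N_RGB]]
```

Under this kind of masking, the auxiliary stream is a training-time scaffold: it can condition on the RGB stream to learn interaction structure, while the RGB stream's computation graph never depends on it.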


🎬 Demo

Our model handles diverse real-world products across various scenarios. Visit our Project Page for full video demos.

Supported Interaction Types:

  • 🤲 Grasping: Macaron box · Teapot · Skincare serum · Coffee mug
  • 👜 Presenting: Leather handbag · Eyeshadow palette · Decorative plate
  • 👗 Wearing: Emerald necklace · Sports jacket
  • 🌵 Holding: Cactus pot · Various daily objects

Application Scenarios:

  • 🛍️ Digital Human Commerce: AI-powered digital humans presenting products in live-stream e-commerce
  • 👗 Virtual Try-On: Physically realistic garment and accessory interactions
  • ⚙️ Physics Simulation: Generating high-quality HOI training data for robotics

🔧 Model Details

  • Backbone: Diffusion Transformer (DiT)
  • Resolution: 720p / 480p
  • Frame Count: Up to 81 frames per chunk
  • Multi-Chunk: Supported for long-form video
  • Inputs: Person image + Object image + Text prompt + Audio + (Optional) Pose
  • Training Data: Large-scale HOI video dataset with structure annotations
  • Precision: FP16 / BF16
  • License: Apache 2.0
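The multi-chunk setting can be sketched as a simple frame-budgeting helper. The 81-frame limit comes from the details above; the overlap value and the idea of blending overlapping chunks are assumptions for illustration:

```python
# Hypothetical planner for long-form generation: split a target frame count
# into chunks of at most 81 frames (the stated per-chunk limit), overlapping
# consecutive chunks so their seams can be blended. The overlap size is assumed.
def plan_chunks(total_frames: int, chunk: int = 81, overlap: int = 8) -> list[range]:
    stride = chunk - overlap
    starts = range(0, max(total_frames - overlap, 1), stride)
    return [range(s, min(s + chunk, total_frames)) for s in starts]

# A 200-frame video needs three overlapping chunks:
# [0, 81), [73, 154), [146, 200)
```

Anything at or below 81 frames fits in a single chunk; longer videos reuse the tail of each chunk as context for the next.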

📊 Quantitative Results


Full quantitative results will be released upon paper acceptance.


๐Ÿ“ Citation

If you find CoInteract useful for your research, please consider citing:

@article{luo2025cointeract,
  title={CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation},
  author={Luo, Xiangyang and Xin, Xiaozhe and Feng, Tao and Guo, Xu and Jin, Meiguang and Ma, Junfeng},
  journal={arXiv preprint arXiv:2604.19636},
  year={2026}
}

๐Ÿ™ Acknowledgements

This work is supported by Taobao Live Tech, Alibaba Group and Tsinghua University.


Project Page · Paper · Code · Demo

If you like this project, please give us a ⭐!
