# 🎬 CoInteract

**Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation**
Alibaba Group & Tsinghua University
## ✨ Highlights
CoInteract is the first end-to-end framework that generates physically-consistent human-object interaction (HOI) videos with zero additional inference cost. Given a person image, a product image, a text prompt, and optional speech audio, CoInteract produces realistic videos in which humans naturally grasp, wear, present, and manipulate objects, with no hand-object interpenetration or geometric misalignment.
**🔥 Key Results:**
- 🏆 State-of-the-art on HOI video synthesis benchmarks
- 🤝 Physically plausible hand-object contact with significantly reduced interpenetration
- ⚡ Zero inference overhead: the auxiliary HOI branch is removed at test time
- 🎯 Supports diverse interactions: grasping, wearing, presenting, carrying, and more
- 🌍 Real-world ready: Virtual Try-On, Digital Human Commerce, Physics Simulation
## 🏗️ Architecture
CoInteract embeds structural priors and interaction geometry directly into a Diffusion Transformer (DiT) backbone through three core innovations:
- 🧠 **Human-Aware MoE** – a spatially-supervised Mixture-of-Experts routes tokens to region-specialized experts (Head, Hand, Base), ensuring high structural fidelity for hands and faces with minimal parameter overhead.
- 🔗 **Dual-Stream Co-Generation** – an auxiliary HOI structure stream is jointly trained with the RGB stream within a shared DiT backbone, forcing the model to learn spatial and interaction relationships.
- ✨ **Asymmetric Co-Attention** – a two-stage training strategy with asymmetric attention masks embeds physical interaction rules, enabling the HOI branch to be completely removed at inference with zero overhead.
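As a rough illustration of the Human-Aware MoE, spatially-supervised routing can be pictured as a top-1 dispatch of tokens to region experts. The sketch below is ours, not the repo's API; the toy lambdas stand in for the real Head/Hand/Base expert networks, and in training the router logits would be supervised with body-region masks:

```python
import numpy as np

def route_tokens(tokens, region_logits):
    """Top-1 dispatch of tokens to region-specialized experts.

    Hypothetical sketch: expert 0 = "Head", 1 = "Hand", 2 = "Base".
    The toy transforms below stand in for real expert FFNs.
    """
    experts = [lambda x: x + 1.0,   # "Head" expert (toy transform)
               lambda x: x * 3.0,   # "Hand" expert (toy transform)
               lambda x: x]         # "Base" expert (identity)
    assign = region_logits.argmax(axis=-1)   # top-1 expert per token
    out = np.empty_like(tokens)
    for e, fn in enumerate(experts):
        sel = assign == e                    # tokens routed to expert e
        out[sel] = fn(tokens[sel])
    return out, assign

# Each token is processed only by its assigned expert, so capacity is
# spent where structural fidelity matters (hands, faces) with little
# parameter overhead relative to a dense layer per region.
out, assign = route_tokens(np.ones((3, 2)), np.eye(3))
assert list(assign) == [0, 1, 2]
```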
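The asymmetric co-attention can likewise be sketched. Our reading (not the paper's exact formulation): if RGB queries are barred from attending to auxiliary HOI keys while HOI queries may attend to both streams, the RGB stream never depends on the auxiliary branch, so that branch can be dropped at test time without changing RGB outputs:

```python
import numpy as np

def asymmetric_coattention_mask(n_rgb, n_hoi):
    """Boolean attention mask over concatenated [RGB | HOI] tokens.

    Hypothetical sketch: True = attention allowed. RGB queries are
    blocked from HOI keys (so the HOI branch is removable at inference);
    HOI queries see both streams and absorb spatial structure from RGB.
    """
    n = n_rgb + n_hoi
    mask = np.ones((n, n), dtype=bool)
    mask[:n_rgb, n_rgb:] = False   # RGB -> HOI attention blocked
    return mask

mask = asymmetric_coattention_mask(n_rgb=4, n_hoi=2)
# RGB rows only see RGB columns; HOI rows see everything.
assert mask[:4, :4].all() and not mask[:4, 4:].any()
assert mask[4:, :].all()
```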
## 🎬 Demo
Our model handles diverse real-world products across various scenarios. Visit our Project Page for full video demos.
**Supported Interaction Types:**
| Category | Examples |
|---|---|
| 🤲 Grasping | Macaron box · Teapot · Skincare serum · Coffee mug |
| 🎁 Presenting | Leather handbag · Eyeshadow palette · Decorative plate |
| 👗 Wearing | Emerald necklace · Sports jacket |
| 🌵 Holding | Cactus pot · Various daily objects |
**Application Scenarios:**
- 🛍️ **Digital Human Commerce** – AI-powered digital humans presenting products in live-stream e-commerce
- 👗 **Virtual Try-On** – physically realistic garment and accessory interactions
- ⚙️ **Physics Simulation** – generating high-quality HOI training data for robotics
## 🔧 Model Details
| Property | Value |
|---|---|
| Backbone | Diffusion Transformer (DiT) |
| Resolution | 720p / 480p |
| Frame Count | Up to 81 frames per chunk |
| Multi-Chunk | Supported for long-form video |
| Inputs | Person image + Object image + Text prompt + Audio + (Optional) Pose |
| Training Data | Large-scale HOI video dataset with structure annotations |
| Precision | FP16 / BF16 |
| License | Apache 2.0 |
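The table implies that long-form videos are produced in chunks of up to 81 frames. A minimal scheduling sketch, assuming overlapping chunks for temporal continuity (the overlap value is our illustrative assumption, not a documented setting):

```python
def chunk_frames(total_frames, chunk_len=81, overlap=8):
    """Split a long video into overlapping generation chunks.

    `chunk_len=81` follows the model-details table; `overlap=8` is an
    illustrative assumption for stitching consecutive chunks smoothly.
    Returns a list of (start, end) frame ranges covering the video.
    """
    starts, s = [], 0
    step = chunk_len - overlap
    while s + chunk_len < total_frames:
        starts.append((s, s + chunk_len))
        s += step                      # next chunk re-uses `overlap` frames
    starts.append((s, total_frames))   # final (possibly shorter) chunk
    return starts

# A 200-frame video needs three chunks, each at most 81 frames long.
assert chunk_frames(200) == [(0, 81), (73, 154), (146, 200)]
```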
## 📊 Quantitative Results

Full quantitative results will be released upon paper acceptance.
## 📝 Citation
If you find CoInteract useful for your research, please consider citing:
```bibtex
@article{luo2025cointeract,
  title={CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation},
  author={Luo, Xiangyang and Xin, Xiaozhe and Feng, Tao and Guo, Xu and Jin, Meiguang and Ma, Junfeng},
  journal={arXiv preprint arXiv:2604.19636},
  year={2026}
}
```
## 🙏 Acknowledgements
This work is supported by Taobao Live Tech, Alibaba Group and Tsinghua University.
Project Page · Paper · Code · Demo
If you like this project, please give us a ⭐!