nvidia/PixelDiT-1300M-1024px
nvidia • image
PixelDiT: Pixel Diffusion Transformers for Image Generation
Yongsheng Yu¹,², Wei Xiong¹†, Weili Nie¹, Yichen Sheng¹, Shiqiu Liu¹, Jiebo Luo²
¹NVIDIA, ²University of Rochester
†Project Lead and Main Advising
Key Features
- VAE-free
- Dual-level architecture: Patch-level DiT + Pixel-level DiT
- MM-DiT text-image fusion: Joint attention between text and image tokens
- Text encoder: Gemma-2-2B-IT
- Multi-aspect-ratio: Supports various aspect ratios at 1024px
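As an illustration of the multi-aspect-ratio feature, the sketch below picks a height and width for a requested aspect ratio while keeping the pixel count close to 1024×1024. The helper `size_for_aspect_ratio` is hypothetical and not part of the PixelDiT repository. Snapping both sides to a multiple of 16 is an assumption based on the patch size listed under Model Architecture; the official code may use its own predefined resolution buckets instead.

```python
import math


def size_for_aspect_ratio(ratio, target_area=1024 * 1024, multiple=16):
    """Return (height, width) for a width/height ratio.

    Hypothetical helper: keeps the area near `target_area` and snaps both
    sides to a multiple of `multiple` (assumed equal to the patch size).
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    height = math.sqrt(target_area / ratio)
    width = height * ratio

    def snap(value):
        # Round to the nearest multiple, never below one full patch.
        return max(multiple, int(round(value / multiple)) * multiple)

    return snap(height), snap(width)


if __name__ == "__main__":
    for name, ratio in [("1:1", 1.0), ("16:9", 16 / 9), ("3:4", 3 / 4)]:
        h, w = size_for_aspect_ratio(ratio)
        print(f"{name}: --custom_height {h} --custom_width {w}")
```

The resulting values could be passed to `--custom_height` and `--custom_width` in the inference command shown under Usage.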
Usage
Installation
pip install -r requirements.txt
Inference
# See the full inference script at: https://github.com/NVlabs/PixelDiT
cd t2i/
python inference.py \
    --config configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml \
    --model_path PixelDiT-T2I-v1.pth \
    --txt_file prompts.txt \
    --custom_height 1024 --custom_width 1024 \
    --cfg_scale 2.75 --seed 2025 \
    --negative_prompt "low quality, worst quality, over-saturated, blurry, deformed, watermark" \
    --work_dir "."
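For scripted runs, the command above can be assembled from Python. This is a minimal sketch, not part of the repository: `write_prompts` and `build_inference_command` are hypothetical helpers, and the one-prompt-per-line layout of `prompts.txt` is an assumption. Only flags that appear in the command above are used.

```python
import shlex


def write_prompts(path, prompts):
    # Assumption: the --txt_file input holds one prompt per line.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(prompts) + "\n")


def build_inference_command(prompts_file, height=1024, width=1024,
                            cfg_scale=2.75, seed=2025,
                            negative_prompt="", work_dir="."):
    """Build the argument list for inference.py (hypothetical helper)."""
    args = [
        "python", "inference.py",
        "--config", "configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml",
        "--model_path", "PixelDiT-T2I-v1.pth",
        "--txt_file", prompts_file,
        "--custom_height", str(height),
        "--custom_width", str(width),
        "--cfg_scale", str(cfg_scale),
        "--seed", str(seed),
        "--work_dir", work_dir,
    ]
    if negative_prompt:
        args += ["--negative_prompt", negative_prompt]
    return args


if __name__ == "__main__":
    cmd = build_inference_command(
        "prompts.txt",
        negative_prompt="low quality, worst quality, blurry",
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

To actually launch the run, the list can be handed to `subprocess.run(cmd, cwd="t2i", check=True)`, which mirrors the `cd t2i/` step.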
Inference Parameters
| Parameter | Default | Description |
|---|---|---|
| `--cfg_scale` | 3.5 | Classifier-free guidance scale |
| `--step` | 50 | Number of sampling steps (25 for fast, 50 for quality) |
| `--seed` | 0 | Random seed |
| `--negative_prompt` | "" | Negative prompt for CFG |
| `--interval_guidance` | [0, 1] | CFG application interval |
| `--sampling_algo` | flow_dpm-solver | Sampling algorithm |
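To make the interaction between `--cfg_scale` and `--interval_guidance` concrete, here is a sketch of the standard classifier-free guidance formula restricted to an interval. It illustrates the general technique, not the repository's implementation: `apply_cfg` is a hypothetical function, and the normalization of the time variable `t` to [0, 1], as well as falling back to the conditional prediction outside the interval, are assumptions.

```python
def apply_cfg(cond, uncond, cfg_scale, t, interval=(0.0, 1.0)):
    """Combine conditional and unconditional predictions.

    cond / uncond: model outputs as flat lists of floats. The unconditional
    branch is where a negative prompt would typically be fed.
    t: normalized sampling time in [0, 1] (assumed convention).
    """
    low, high = interval
    if not (low <= t <= high):
        # Outside the guidance interval: use the conditional output as-is,
        # which is equivalent to a guidance scale of 1.
        return list(cond)
    # Standard CFG: move from the unconditional prediction toward the
    # conditional one, scaled by cfg_scale.
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond, uncond = [1.0, 2.0], [0.0, 1.0]
    print(apply_cfg(cond, uncond, cfg_scale=3.5, t=0.5))
    print(apply_cfg(cond, uncond, cfg_scale=3.5, t=0.5, interval=(0.0, 0.3)))
```

With the default interval `[0, 1]`, guidance is applied at every sampling step; a narrower interval leaves the remaining steps unguided.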
Model Architecture
| Component | Value |
|---|---|
| Parameters | 1.3B |
| Patch size | 16 |
| Hidden size | 1536 |
| Attention heads | 24 |
| Patch-level depth | 14 |
| Pixel-level depth | 2 |
| Pixel hidden size | 16 |
| Pixel attention hidden size | 1152 |
| Text embedding dim | 2304 |
| Text max length | 300 |
| Text encoder | Gemma-2-2B-IT |
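The figures in the table imply a few derived quantities, worked out in the sketch below for a 1024×1024 input. Two points are assumptions rather than statements from the table: that the hidden size is split evenly across attention heads, and that joint attention runs over image and text tokens concatenated into one sequence. The function names are hypothetical.

```python
PATCH_SIZE = 16
HIDDEN_SIZE = 1536
NUM_HEADS = 24
TEXT_MAX_LENGTH = 300


def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Return (rows, cols) of the patch grid for an image size."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch_size, width // patch_size


def derived_sizes(height=1024, width=1024):
    rows, cols = patch_grid(height, width)
    image_tokens = rows * cols
    return {
        "patch_grid": (rows, cols),
        "image_tokens": image_tokens,
        "pixels_per_patch": PATCH_SIZE * PATCH_SIZE,
        # Assumption: hidden size is divided evenly among the heads.
        "head_dim": HIDDEN_SIZE // NUM_HEADS,
        # Assumption: text tokens are concatenated with image tokens.
        "max_joint_tokens": image_tokens + TEXT_MAX_LENGTH,
    }


if __name__ == "__main__":
    for key, value in derived_sizes().items():
        print(f"{key}: {value}")
```

At 1024×1024 this gives a 64×64 grid, that is 4096 patch tokens of 256 pixels each.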
Citation
@inproceedings{yu2026pixeldit,
title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026},
}
License
This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.