# Z-Image-Turbo-Fun-Controlnet-Union-2.1

## Update
- [2026.02.26] Updated to version 2602, with support for Gray Control.
- [2026.01.12] Updated to version 2601, with support for Scribble Control. Added lite models (1.9 GB, 5 layers). Retrained the Control and Tile models with more varied masks, improved training schedules, and multi-resolution control images (512–1536) to fix mask-pattern leakage and artifacts at large `control_context_scale` values.
- [2025.12.22] Performed 8-step distillation on v2.1 to restore the acceleration lost when applying ControlNet. Uploaded a Tile model for super-resolution.
- [2025.12.17] Fixed a v2.0 typo (`control_layers` used instead of `control_noise_refiner`) that caused a double forward pass and slow inference. Speed is restored in v2.1.
## Model Card
### a. 2602 Models
| Name | Description |
|---|---|
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-2602-8steps.safetensors | Supports multiple control conditions (Canny, Depth, Pose, MLSD, Hed, Scribble, and Gray). |
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2602-8steps.safetensors | Same training scheme as the 2601 version, but with control applied to fewer layers. Supports multiple control conditions (Canny, Depth, Pose, MLSD, Hed, Scribble, and Gray). |
### b. 2601 Models
| Name | Description |
|---|---|
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps.safetensors | Compared to the old version, this model uses more diverse masks, a more reasonable training schedule, and multi-resolution control images (512–1536) instead of a single resolution (512). This reduces artifacts and mask information leakage while improving robustness. Supports multiple control conditions (Canny, Depth, Pose, MLSD, Hed, and Scribble). |
| Z-Image-Turbo-Fun-Controlnet-Tile-2.1-2601-8steps.safetensors | Compared to the old version, uses higher training resolution and a more refined distillation schedule, reducing bright spots and artifacts. |
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2601-8steps.safetensors | Same training scheme as the 2601 version, but with control applied to fewer layers, resulting in weaker control. This allows larger `control_context_scale` values with more natural results, and is also better suited for lower-spec machines. Supports multiple control conditions (Canny, Depth, Pose, MLSD, Hed, and Scribble). |
| Z-Image-Turbo-Fun-Controlnet-Tile-2.1-lite-2601-8steps.safetensors | Same training scheme as the 2601 version, but with control applied to fewer layers, resulting in weaker control. Allows larger `control_context_scale` values with more natural results, and better suits lower-spec machines. |
### c. Models Before 2601
| Name | Description |
|---|---|
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors | Distilled from version 2.1 using an 8-step distillation algorithm. Compared to version 2.1, 8-step prediction yields clearer images with more reasonable composition. Supports Canny, Depth, Pose, MLSD, and Hed. |
| Z-Image-Turbo-Fun-Controlnet-Tile-2.1-8steps.safetensors | A Tile model trained on high-definition datasets (up to 2048×2048) for super-resolution, distilled using an 8-step algorithm. 8-step prediction is recommended. |
| Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors | A retrained model fixing the typo in version 2.0, with faster per-step speed. Supports Canny, Depth, Pose, MLSD, and Hed. However, like version 2.0, some acceleration capability was lost during training, requiring more steps and CFG. |
| Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors | ControlNet weights for Z-Image-Turbo. Compared to version 1.0, more layers are modified with longer training. However, a code typo caused layer blocks to forward twice, resulting in slower speed. Supports Canny, Depth, Pose, MLSD, and Hed. Some acceleration capability was lost during training, requiring more steps. |
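The v2.0 bug is easy to reproduce in miniature: when the code iterates the wrong module list, the same blocks end up executing twice per denoising step. A minimal, self-contained illustration (not the actual VideoX-Fun code; `Block`, `Model`, and the attribute names here are stand-ins echoing the real `control_layers`/`control_noise_refiner` names):

```python
# Hypothetical sketch of how iterating the wrong layer list makes the
# main blocks run twice and skips the refiner blocks entirely.
class Block:
    def __init__(self):
        self.calls = 0  # count forward passes through this block

    def __call__(self, x):
        self.calls += 1
        return x + 1

class Model:
    def __init__(self):
        self.control_layers = [Block(), Block()]    # main controlled blocks
        self.control_noise_refiner = [Block()]      # refiner blocks

    def forward(self, x, buggy=False):
        for blk in self.control_layers:
            x = blk(x)
        # v2.0-style typo: reusing control_layers here instead of
        # control_noise_refiner forwards the main blocks a second time.
        refiners = self.control_layers if buggy else self.control_noise_refiner
        for blk in refiners:
            x = blk(x)
        return x

m = Model()
m.forward(0, buggy=True)
print([b.calls for b in m.control_layers])        # -> [2, 2] (ran twice)
print([b.calls for b in m.control_noise_refiner]) # -> [0] (never ran)
```

With `buggy=False`, each block runs exactly once, which is the behavior the v2.1 fix restores.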
## Model Features
- This ControlNet is applied to 15 layer blocks and 2 refiner layer blocks (lite models: 3 layer blocks and 2 refiner layer blocks). It supports multiple control conditions, including Canny, HED, Depth, Pose, and MLSD (plus Scribble in the 2601 models and Gray in the 2602 models).
- Inpainting mode is also supported. For inpaint mode, use a larger `control_context_scale` for better image continuity.
- Training Process:
  - 2.0: Trained from scratch for 70,000 steps on 1M high-quality images (general and human-centric content) at 1328 resolution with BFloat16 precision, batch size 64, learning rate 2e-5, and text dropout ratio 0.10.
  - 2.1: Continued training from the 2.0 weights for 11,000 additional steps after fixing the typo, using the same parameters and dataset.
  - 2.1-8steps: Distilled from version 2.1 for 5,500 steps using an 8-step distillation algorithm.
- Note on Steps:
  - 2.0 and 2.1: Higher `control_context_scale` values may require more inference steps for better results, likely because the control model has not been distilled.
  - 2.1-8steps: Use 8 steps for inference.
- Adjust `control_context_scale` (optimal range: 0.65–1.00) for stronger control and better detail preservation. A detailed prompt is highly recommended for stability.
- In versions 2.0 and 2.1, applying ControlNet to Z-Image-Turbo caused a loss of acceleration capability and blurry images. For strength and step-count testing details, refer to the Scale Test Results section (generated with version 2.0).
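The 2601 retraining uses control images at multiple resolutions (512–1536) rather than a fixed 512. A sketch of the kind of resolution bucketing this implies (my own helper, not code from the repository; the target of 1024 and the rounding multiple of 16 are assumptions): scale the control image so its geometric-mean side lands near a target within the 512–1536 range, preserving aspect ratio.

```python
import math

def bucket_size(width, height, target=1024, multiple=16, lo=512, hi=1536):
    """Scale (width, height) so the geometric-mean side is close to `target`
    (clamped to [lo, hi]), preserving aspect ratio and rounding each side
    to a multiple of `multiple`. Illustrative only."""
    target = max(lo, min(hi, target))
    scale = target / math.sqrt(width * height)
    w = max(multiple, round(width * scale / multiple) * multiple)
    h = max(multiple, round(height * scale / multiple) * multiple)
    return w, h

print(bucket_size(1920, 1080, target=1024))  # -> (1360, 768)
print(bucket_size(512, 512, target=512))     # -> (512, 512)
```

A control image resized this way keeps its composition while matching one of the resolutions seen during training.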
## Results
### a. Difference between 2.1-8steps and 2.1-2601-8steps
The old 8-steps model produced bright spots and artifacts when `control_context_scale` was too large; the new version does not.

| Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps | Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps |
|---|---|
| *(image)* | *(image)* |
The old 8-steps model sometimes learned the mask information and tended to completely fill the mask during removal; the new version does not.

| Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps | Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps |
|---|---|
| *(image)* | *(image)* |
### b. Difference between 2.1 and 2.1-8steps

8-step results:

| Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps | Z-Image-Turbo-Fun-Controlnet-Union-2.1 |
|---|---|
| *(image)* | *(image)* |
### c. Generation Results With 2.1-lite-2601-8steps

Shares the same training scheme as the 2601 version, but with control applied to fewer layers, resulting in weaker control. This allows larger `control_context_scale` values with more natural results and better suits lower-spec machines.

| Pose | Output |
|---|---|
| *(image)* | *(image)* |

| Pose | Output |
|---|---|
| *(image)* | *(image)* |

| Canny | Output |
|---|---|
| *(image)* | *(image)* |
### d. Generation Results With 2.1-2601-8steps

| Depth | Output |
|---|---|
| *(image)* | *(image)* |

| Pose | Output |
|---|---|
| *(image)* | *(image)* |

| Pose | Output |
|---|---|
| *(image)* | *(image)* |

| Pose | Output |
|---|---|
| *(image)* | *(image)* |

| Canny | Output |
|---|---|
| *(image)* | *(image)* |

| HED | Output |
|---|---|
| *(image)* | *(image)* |

| Depth | Output |
|---|---|
| *(image)* | *(image)* |

| Low Resolution | High Resolution |
|---|---|
| *(image)* | *(image)* |
### e. Gray Control Results with 2602 Models

| Low Resolution | High Resolution |
|---|---|
| *(image)* | *(image)* |
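Gray Control conditions generation on a grayscale version of the image. A minimal sketch of preparing such a condition (my own code, not the repository's preprocessing, which may differ), using the standard Rec. 601 luma weights:

```python
def to_gray(pixels):
    """Convert a nested list of (R, G, B) tuples (0-255) into a grayscale
    image using Rec. 601 luma: Y = 0.299 R + 0.587 G + 0.114 B."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b)
             for (r, g, b) in row]
            for row in pixels]

img = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
print(to_gray(img))  # -> [[76, 150], [29, 255]]
```

In practice the same conversion is done on full images (e.g. with an image library); the grayscale result is then fed to the ControlNet as the control image.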
## Inference
Go to the VideoX-Fun repository for more details.

Clone the VideoX-Fun repository and create the required directories:

```shell
# Clone the code
git clone https://github.com/aigc-apps/VideoX-Fun.git

# Enter VideoX-Fun's directory
cd VideoX-Fun

# Create model directories
mkdir -p models/Diffusion_Transformer
mkdir -p models/Personalized_Model
```

Then download the weights into `models/Diffusion_Transformer` and `models/Personalized_Model`:
```
📦 models/
├── 📂 Diffusion_Transformer/
│   └── 📂 Z-Image-Turbo/
└── 📂 Personalized_Model/
    ├── Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors
    ├── Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors
    └── Z-Image-Turbo-Fun-Controlnet-Tile-2.1-8steps.safetensors
```
Then run `examples/z_image_fun/predict_t2i_control_2.1.py` (text-to-image with control) or `examples/z_image_fun/predict_i2i_inpaint_2.1.py` (inpainting).
## (Obsolete) Scale Test Results
The table below shows the generation results under different combinations of Diffusion steps and Control Scale strength:
| Diffusion Steps | Scale 0.65 | Scale 0.70 | Scale 0.75 | Scale 0.80 | Scale 0.90 | Scale 1.00 |
|---|---|---|---|---|---|---|
| 9 | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* |
| 10 | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* |
| 20 | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* |
| 30 | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* |
| 40 | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* | *(image)* |
Parameter Description:
- Diffusion Steps: number of iteration steps for the diffusion model (9, 10, 20, 30, 40)
- Control Scale: control strength coefficient (0.65–1.00)
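The sweep above is just the Cartesian product of the two parameter grids. A driver loop for reproducing it might look like this (the `generate` callable is a hypothetical stand-in for the actual pipeline call):

```python
from itertools import product

# Grid values taken from the table above.
steps_grid = [9, 10, 20, 30, 40]
scale_grid = [0.65, 0.70, 0.75, 0.80, 0.90, 1.00]

def sweep(generate):
    """Run generate(steps, scale) for every grid combination and
    return the results keyed by (steps, scale)."""
    return {(s, c): generate(s, c) for s, c in product(steps_grid, scale_grid)}

# Stub showing the call pattern; swap in the real pipeline invocation.
results = sweep(lambda steps, scale: f"img_{steps}_{scale:.2f}.png")
print(len(results))           # 30 combinations (5 steps x 6 scales)
print(results[(9, 0.65)])     # -> img_9_0.65.png
```

Each cell of the table corresponds to one `(steps, scale)` key in the resulting dictionary.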