
kohya-ss/Anima-LLLite


Anima ControlNet-LLLite Sample Weights

Sample ControlNet-LLLite weights for the Anima image generation model, trained with anima_train_control_net_lllite.py from the sd-scripts repository.

ControlNet-LLLite is a lightweight, LoRA-like conditional control module ported to Anima's DiT (MiniTrainDIT) architecture. See the training & inference guide for full details on the v2 architecture, dataset format, and how to run inference.
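As a rough mental model only (the actual port in sd-scripts may differ), an LLLite module pairs a small convolutional trunk that encodes the conditioning image with a LoRA-like bottleneck MLP whose output is added to the input of one targeted attention projection. The PyTorch sketch below is illustrative: the module names, the hidden width, the 16× downsample, and the exact wiring are assumptions, not the real implementation.

```python
import torch
import torch.nn as nn

class LLLiteAdapterSketch(nn.Module):
    """Illustrative sketch of a ControlNet-LLLite adapter (not the sd-scripts code).

    The three widths mirror the flags used throughout this card:
    --cond_emb_dim / --lllite_cond_dim / --lllite_mlp_dim.
    `hidden` is a placeholder for the DiT token width.
    """

    def __init__(self, hidden=2048, cond_emb_dim=32, cond_dim=32, mlp_dim=32):
        super().__init__()
        # Shared trunk ("conditioning1"): cond image -> one embedding per latent token.
        self.conditioning1 = nn.Sequential(
            nn.Conv2d(3, cond_emb_dim, kernel_size=4, stride=4),
            nn.SiLU(),
            nn.Conv2d(cond_emb_dim, cond_emb_dim, kernel_size=4, stride=4),
        )
        # Per-target-layer LoRA-like bottleneck (e.g. injected before self_attn_q_pre).
        self.down = nn.Linear(hidden, cond_dim)
        self.mid = nn.Linear(cond_dim + cond_emb_dim, mlp_dim)
        self.up = nn.Linear(mlp_dim, hidden)
        self.act = nn.SiLU()

    def forward(self, x, cond_image, multiplier=1.0):
        # x: (B, N, hidden) tokens entering the targeted projection.
        c = self.conditioning1(cond_image)   # (B, emb, H/16, W/16)
        c = c.flatten(2).transpose(1, 2)     # (B, N, emb); N must match x
        h = self.act(self.down(x))           # squeeze tokens to the bottleneck
        h = self.act(self.mid(torch.cat([h, c], dim=-1)))
        return x + multiplier * self.up(h)   # additive correction, scaled at inference
```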

An experimental ComfyUI node is also available: kohya-ss/ComfyUI-Anima-LLLite.

Note on effect strength. The effect of these weights is intentionally moderate overall, and the pose model in particular has noticeably weaker control than the others. They are intended as community starting points / references rather than strong production-grade ControlNets.


Released Weights

| File | Type | Conditioning source |
| --- | --- | --- |
| anima-lllite-lineart-1.safetensors | Lineart | White background, black lines |
| anima-lllite-depth-1.safetensors | Depth map | White = near, black = far (Depth Anything V2) |
| anima-lllite-pose-1.safetensors | Pose | DWPose standard (colored skeleton + face/hand keypoints) |
| anima-lllite-scribble-1.safetensors | Fake scribble | HED / PiDiNet + hand-drawn-style augmentation |
| anima-lllite-any-test-like-1-step1000.safetensors | any-test like (mixed) | Lineart / scribble (HED, PiDiNet) / grayscale, all heavily augmented; 1,000-step checkpoint |
| anima-lllite-any-test-like-1-step2000.safetensors | any-test like (mixed) | Same as above; 2,000-step checkpoint (stronger effect) |

Samples

| Type | Cond image | Generated image |
| --- | --- | --- |
| Lineart | lineart1 | lineart_ComfyUI |
| Depth map | depth1 | depth_ComfyUI |
| Pose | pose1 | pose_ComfyUI |
| Fake scribble | scribble1 | scribble_ComfyUI |
| any-test like (1,000 steps) | grayscale1000 | |
| any-test like (2,000 steps) | grayscale2000 | |

(Cell entries are the names of the sample images shown on the model page.)

Common Setup

Base models

  • Anima DiT: anima-preview3-base
  • VAE: Qwen-Image VAE
  • Text encoder: Qwen3-0.6B (base)

Dataset (common to the four single-condition models)

  • Target images: ~2,000 images generated by Anima from random prompts.
  • Image composition: ~3/4 contain people (varied gender, single-person to multi-person scenes); the remaining ~1/4 are animals, landscapes, or other no-person content.
  • Resolution buckets: 768×1344, 832×1216, 896×1152, 1024×1024, 1152×896, 1216×832, 1344×768.
  • Conditioning images: automatically generated from each target image. Generation method differs per model (see below).

Common training hyperparameters

  • Optimizer: adamw8bit
  • Mixed / save precision: bf16
  • Batch size: 6 (gradient checkpointing disabled)
  • Seed: 42
  • LLLite dims: --cond_emb_dim 32 --lllite_cond_dim 32 --lllite_mlp_dim 32
  • ASPP: not used
  • Caching: --cache_latents_to_disk --cache_text_encoder_outputs_to_disk
  • Attention backend: --attn_mode flash
  • Hardware: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  • Wall-clock: ~45 minutes for 4 epochs on ~2,000 pairs (scribble runs the same wall-clock budget for ~1 epoch over 8,000 pairs).
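For reference, the sketch below assembles the flags above into a single invocation, using the lineart model's per-model values and launching via accelerate as is typical for sd-scripts. The dataset config path is a placeholder, and the optimizer / precision flag spellings (--optimizer_type, --mixed_precision, --save_precision) follow common sd-scripts conventions rather than being confirmed by this card:

```bash
accelerate launch anima_train_control_net_lllite.py \
  --dataset_config dataset.toml \
  --optimizer_type adamw8bit --mixed_precision bf16 --save_precision bf16 \
  --seed 42 \
  --cond_emb_dim 32 --lllite_cond_dim 32 --lllite_mlp_dim 32 \
  --cache_latents_to_disk --cache_text_encoder_outputs_to_disk \
  --attn_mode flash \
  --learning_rate 2e-4 --max_train_epochs 4 --discrete_flow_shift 1.0 \
  --lllite_target_layers self_attn_q_pre
```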

Per-Model Details

1. Lineart

  • Conditioning: white background, black lines.
  • Generation method: tori29umai's FramePack lineart LoRA (https://note.com/tori29umai/n/n3447ca5b1437).
  • Pairs: 2,000.
  • Per-model hyperparameters (differences from the common setup):
    • --learning_rate 2e-4
    • --max_train_epochs 4 (published = epoch 4)
    • --discrete_flow_shift 1.0
    • --lllite_target_layers self_attn_q_pre
    • --lllite_cond_resblocks 1 (default)

2. Depth

  • Conditioning: depth map (white = near, black = far).
  • Generation method: Depth Anything V2.
  • Pairs: 2,000.
  • Per-model hyperparameters:
    • --learning_rate 2e-4
    • --max_train_epochs 4 (published = epoch 4)
    • --discrete_flow_shift 1.0
    • --lllite_target_layers self_attn_q_pre
    • --lllite_cond_resblocks 3 (deeper conditioning1 trunk to better capture depth's global structure)

3. Pose

  • Conditioning: DWPose standard output — colored body skeleton, white face keypoints, and hand keypoints.

  • Generation method: easy_dwpose (a convenience wrapper around DWPose).

  • Pairs: 1,544 (only images where DWPose successfully extracted a pose).

  • ⚠ Caveat: this model's effect is noticeably weaker than the other three. It is best treated as a soft pose prior rather than a strict pose-locking ControlNet.

  • Two-stage training. The published weight is the result of resuming Stage 1 with a different discrete_flow_shift (a command sketch follows this list):

    Stage 1 — trained for 3 epochs; the epoch-3 checkpoint (*-000003.safetensors) serves as the resume point for Stage 2.

    • --learning_rate 1e-3
    • --discrete_flow_shift 1.0
    • --lllite_target_layers self_attn_q_pre,self_attn_kv_pre (K/V also injected)
    • --lllite_cond_resblocks 3

    Stage 2 — additional 4 epochs, resumed from Stage 1 epoch-3 via --network_weights. Same settings as Stage 1 except:

    • --discrete_flow_shift 3.0
    • Published weight: anima-lllite-pose-1.safetensors
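In command form, Stage 2 amounts to re-running Stage 1 with the resume weight and the new flow shift. This is a sketch only: the dataset and common flags are elided, and the checkpoint filename is illustrative (it is the Stage 1 epoch-3 output, whatever it was named):

```bash
# Stage 2: same flags as Stage 1, except the resume point and the flow shift.
accelerate launch anima_train_control_net_lllite.py \
  --network_weights pose-stage1-000003.safetensors \
  --learning_rate 1e-3 --discrete_flow_shift 3.0 --max_train_epochs 4 \
  --lllite_target_layers self_attn_q_pre,self_attn_kv_pre \
  --lllite_cond_resblocks 3
```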

4. Fake scribble

  • Conditioning: scribble-style line drawings (white lines on a black background) derived from HED / PiDiNet edge maps, with random hand-drawn-style augmentation.
  • Generation method: HED and PiDiNet detectors via controlnet_aux. Four conditioning variants per target image — HED, HED + augmentation, PiDiNet, PiDiNet + augmentation — combined into one dataset.
  • Pairs: 8,000 (= 2,000 target images × 4 conditioning variants).
  • Augmentation procedure, applied independently per image to the HED / PiDiNet output (a code sketch follows this list):
    1. Gaussian blur (probability 80%, σ ∈ [0.3, 1.6]) — smooths detector-side speckle.
    2. Random binarization threshold (∈ [80, 180]).
    3. Small-component removal (min_area ∈ [8, 50]).
    4. Line-width jitter: 35% dilate (kernel 2 or 3), 15% erode (kernel 2), 50% unchanged.
    5. Random partial-line dropout (probability ∈ [0, 0.20]) using a coarse keep-mask upscaled with INTER_CUBIC.
    6. Final small-component cleanup.
  • Per-model hyperparameters:
    • --learning_rate 1e-3
    • Stopped at 1,500 steps (~50 min). The loss curve had plateaued, so training was halted before reaching the configured --max_train_epochs 4.
    • --discrete_flow_shift 1.0
    • --lllite_target_layers self_attn_q_pre
    • --lllite_cond_resblocks 3
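The augmentation script itself is not published; the OpenCV sketch below implements the six steps as described, with the connected-component helper, the 1/32-scale dropout mask, and the exact probability handling as assumptions:

```python
import random
import cv2
import numpy as np

def remove_small_components(img: np.ndarray, min_area: int) -> np.ndarray:
    # Drop white connected components smaller than min_area pixels.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(img, connectivity=8)
    out = np.zeros_like(img)
    for i in range(1, n):  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            out[labels == i] = 255
    return out

def scribble_augment(edge: np.ndarray) -> np.ndarray:
    """edge: uint8 HED/PiDiNet map, white lines on a black background."""
    img = edge.copy()
    # 1. Gaussian blur (p=0.8, sigma in [0.3, 1.6]) to smooth detector speckle.
    if random.random() < 0.8:
        img = cv2.GaussianBlur(img, (0, 0), random.uniform(0.3, 1.6))
    # 2. Random binarization threshold in [80, 180].
    img = (img > random.randint(80, 180)).astype(np.uint8) * 255
    # 3. Small-component removal (min_area in [8, 50]).
    img = remove_small_components(img, random.randint(8, 50))
    # 4. Line-width jitter: 35% dilate (kernel 2 or 3), 15% erode (kernel 2).
    r = random.random()
    if r < 0.35:
        k = random.choice([2, 3])
        img = cv2.dilate(img, np.ones((k, k), np.uint8))
    elif r < 0.50:
        img = cv2.erode(img, np.ones((2, 2), np.uint8))
    # 5. Partial-line dropout: coarse keep-mask upscaled with INTER_CUBIC.
    p_drop = random.uniform(0.0, 0.20)
    h, w = img.shape
    coarse = np.random.rand(max(1, h // 32), max(1, w // 32)) > p_drop
    keep = cv2.resize(coarse.astype(np.float32), (w, h),
                      interpolation=cv2.INTER_CUBIC) > 0.5
    img[~keep] = 0
    # 6. Final small-component cleanup.
    return remove_small_components(img, random.randint(8, 50))
```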

5. any-test like (mixed training over multiple conditioning types)

About the name. This is not anytest itself. It is an experimental ControlNet developed independently from 月須和・那々's anytest (an SDXL ControlNet). Inspired by publicly shared notes about training with multiple grayscale conditioning patterns, this model attempts to reproduce that direction on Anima's DiT via ControlNet-LLLite.

  • Conditioning: a heterogeneous mix of lineart, scribble (HED / PiDiNet, augmented) and grayscale images, so that the same LLLite weight reacts to whichever cond modality is supplied at inference time.
  • Two published checkpoints from the same run, no two-stage training:
    • anima-lllite-any-test-like-1-step1000.safetensors — milder effect.
    • anima-lllite-any-test-like-1-step2000.safetensors — noticeably stronger effect.
    • If the cond is too dominant, lower the inference-time strength (--lllite_multiplier / ComfyUI strength) or restrict the active range with start/end percent.
  • Dataset (14,000 pairs total):
    • Image set 1 (2,000 target images, reused from the other models in this repo) → 5 conditioning variants per image = 10,000 pairs:
      1. Lineart (same generation as the lineart model).
      2. HED scribble with augmentation (same as the scribble model).
      3. PiDiNet scribble with augmentation (same as the scribble model).
      4. Grayscale, augmentation pattern A.
      5. Grayscale, augmentation pattern B.
    • Image set 2 (additional 2,000 target images) → 2 grayscale conditioning variants per image = 4,000 pairs.
  • Conditioning augmentation:
    • Lineart / scribble branches: in addition to each model's native augmentation, apply a light extra pass of brightness / contrast jitter and Gaussian blur (significantly weaker than the grayscale branch), plus random color inversion at 50% probability.
    • Grayscale branches: random HSV jitter, random brightness / contrast (with a small chance of near-binarization extreme contrast), Gaussian blur, and random color inversion. Parameters (a code sketch applying them follows this list):
      P_HSV_JITTER = 0.5
      P_BLUR = 0.75
      P_EXTREME_CONTRAST = 0.1   # near-binarization extreme contrast
      H_SHIFT_RANGE = (-60, 60)  # OpenCV H is 0-179, modulo shift
      S_SCALE_RANGE = (0.0, 2.0)
      V_SCALE_RANGE = (0.5, 1.5)
      BRIGHTNESS_RANGE = (-128, 128)         # 8-bit offset
      CONTRAST_RANGE = (0.3, 3.0)
      EXTREME_CONTRAST_RANGE = (3.0, 10.0)
      BLUR_RADIUS_RANGE = (0.0, 10.0)        # Gaussian sigma
      P_INVERT = 0.25
      
  • Per-model hyperparameters (note: this run deviates from the common setup in several places):
    • --learning_rate 2e-4
    • --max_train_epochs 32 configured, but published checkpoints are taken at 1,000 and 2,000 steps.
    • --discrete_flow_shift 4.0 (higher than the 1.0 used by the other four models)
    • --lllite_target_layers self_attn_q_pre
    • --lllite_cond_resblocks 6 (deepest conditioning1 trunk among the released models, to absorb the heterogeneous cond distribution)
    • Batch size 32, set via the dataset TOML (vs. 6 in the common setup)
    • --gradient_checkpointing enabled (the other four models keep it disabled)
    • --seed 42, adamw8bit, bf16, --attn_mode flash, --cache_latents_to_disk --cache_text_encoder_outputs_to_disk (same as common setup).
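A sketch of the grayscale augmentation using the parameters above is given below; the order of operations (color-space jitter before the grayscale conversion) and the mid-gray pivot for the contrast scaling are assumptions:

```python
import random
import cv2
import numpy as np

# Parameter block from the section above.
P_HSV_JITTER, P_BLUR, P_EXTREME_CONTRAST, P_INVERT = 0.5, 0.75, 0.1, 0.25
H_SHIFT_RANGE, S_SCALE_RANGE, V_SCALE_RANGE = (-60, 60), (0.0, 2.0), (0.5, 1.5)
BRIGHTNESS_RANGE, CONTRAST_RANGE = (-128, 128), (0.3, 3.0)
EXTREME_CONTRAST_RANGE, BLUR_RADIUS_RANGE = (3.0, 10.0), (0.0, 10.0)

def grayscale_augment(bgr: np.ndarray) -> np.ndarray:
    """bgr: uint8 color target image; returns the augmented grayscale cond."""
    img = bgr.copy()
    # Random HSV jitter (OpenCV H is 0-179; the hue shift wraps modulo 180).
    if random.random() < P_HSV_JITTER:
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
        hsv[..., 0] = (hsv[..., 0] + random.uniform(*H_SHIFT_RANGE)) % 180
        hsv[..., 1] *= random.uniform(*S_SCALE_RANGE)
        hsv[..., 2] *= random.uniform(*V_SCALE_RANGE)
        img = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
    # Random brightness / contrast, occasionally pushed to near-binarization.
    contrast = (random.uniform(*EXTREME_CONTRAST_RANGE)
                if random.random() < P_EXTREME_CONTRAST
                else random.uniform(*CONTRAST_RANGE))
    gray = (gray - 127.5) * contrast + 127.5 + random.uniform(*BRIGHTNESS_RANGE)
    gray = np.clip(gray, 0, 255)
    # Gaussian blur with a random sigma.
    if random.random() < P_BLUR:
        sigma = random.uniform(*BLUR_RADIUS_RANGE)
        if sigma > 0:
            gray = cv2.GaussianBlur(gray, (0, 0), sigma)
    # Random inversion.
    if random.random() < P_INVERT:
        gray = 255.0 - gray
    return gray.astype(np.uint8)
```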

Usage

See the inference section of the training guide for anima_minimal_inference_control_net_lllite.py. Architecture metadata is embedded in each .safetensors, so you normally only need to point --lllite_weights at the file and pass a --control_image.
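A minimal invocation might look like the sketch below. Only --lllite_weights and --control_image are confirmed above; the remaining arguments (base DiT / VAE / text-encoder paths, prompt, sampler settings) are elided and should be filled in per the guide:

```bash
python anima_minimal_inference_control_net_lllite.py \
  --lllite_weights anima-lllite-lineart-1.safetensors \
  --control_image lineart.png
# plus model paths and prompt arguments as described in the training guide
```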

An experimental ComfyUI node is available at kohya-ss/ComfyUI-Anima-LLLite.

License

These weights follow the same license as the Anima base model. Please refer to the Anima model card for terms of use.

A copy of the CircleStone Labs Non-Commercial License is included in this repository as LICENSE.


Credits

  • ControlNet-LLLite (original SDXL implementation) and Anima port — kohya-ss.
  • Lineart conditioning generated using tori29umai's FramePack lineart LoRA (distribution page: https://note.com/tori29umai/n/n3447ca5b1437). Thanks to とりにく (tori29umai) for releasing the LoRA.
  • Depth conditioning generated with Depth Anything V2.
  • Pose conditioning generated with easy_dwpose, a wrapper around DWPose.
  • Fake scribble conditioning generated with HED and PiDiNet detectors provided by controlnet_aux (originals: HED, PiDiNet).
  • any-test like is inspired by 月須和・那々's anytest (an SDXL ControlNet), in particular the publicly shared idea of training with multiple grayscale conditioning patterns. This Anima LLLite model is an independent experimental reproduction attempt, not anytest itself.