elix3r/LTX-2.3-22b-AV-LoRA-talking-head

Overview

This is the first community audio-visual (AV) LoRA for LTX-Video 2.3, trained using the joint audio-video cross-attention architecture of the LTX-2.3 22B model. The LoRA enables talking head video generation with synchronized lip sync and internalized voice characteristics from a reference character.

This release is a character-specific implementation and reference pipeline. The weights demonstrate a working AV LoRA trained on a custom dataset. The methodology, dataset structure, caption format, and training config are fully documented and reusable for training your own character-specific AV LoRA.


What It Does

  • Generates talking head videos with synchronized lip sync from a reference image
  • Internalizes voice characteristics without requiring external audio input at inference time
  • Preserves character identity across unseen reference images and backgrounds

Demo Results (v1)

  • Lip sync: accurate and consistent
  • Identity preservation: locks in at step 1250, improves linearly to step 2000
  • Voice characteristics: internalized from training data
  • Known limitations: slight audio buzz artifacts, occasional eye blinking inconsistency, seed-dependent output quality

How To Use

Requirements
  • ComfyUI (example workflows are included in this repository)
  • LTX-2.3 model weights
  • Power Lora Loader node

Loading the LoRA

Load LTX-2.3-22b-AV-LoRA-talking-head-v1.safetensors via the Power Lora Loader node in ComfyUI.

Set LoRA strength to 1.0.

Recommended Inference Settings

Parameter       Value
-------------   -----------------------------
Resolution      1280x736
FPS             24
Video length    Any (10+ seconds recommended)
LoRA strength   1.0
Trigger word    OHWXPERSON
CFG scale       1.0

Note: 1280x736 @ 24fps is recommended for image-to-video inference. For image + audio to video inference, use 1280x704 @ 25fps to match the training distribution.
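The two mode-dependent settings above can be captured in a small lookup. This is a minimal sketch (a hypothetical helper, not part of this repository) that returns the recommended resolution and FPS for each inference mode:

```python
# Hypothetical helper: recommended generation settings per inference mode.
RECOMMENDED_SETTINGS = {
    # image-to-video: matches the recommended inference settings table
    "i2v": {"width": 1280, "height": 736, "fps": 24},
    # image + audio to video: matches the training distribution
    "i2v_audio": {"width": 1280, "height": 704, "fps": 25},
}

def settings_for(mode: str) -> dict:
    """Return the recommended generation settings for a given inference mode."""
    try:
        return RECOMMENDED_SETTINGS[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(RECOMMENDED_SETTINGS)}")

print(settings_for("i2v_audio"))
```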

Prompt Format

Include the trigger word OHWXPERSON and end the prompt with the speech transcript:

OHWXPERSON, [visual description]. The person is talking, and he says: "[transcript]"
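A minimal sketch of assembling that prompt programmatically (the helper name is hypothetical; the template follows the format above):

```python
# Hypothetical helper: build an inference prompt from a visual description
# and a speech transcript, following the documented format.
TRIGGER = "OHWXPERSON"

def build_prompt(visual_description: str, transcript: str) -> str:
    return (
        f'{TRIGGER}, {visual_description}. '
        f'The person is talking, and he says: "{transcript}"'
    )

print(build_prompt(
    "a man in a dark studio, soft key light, looking at the camera",
    "Welcome back to the channel.",
))
```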

Training Your Own AV LoRA

This section documents the full pipeline so you can train a character-specific AV LoRA for your own subject.

Pipeline Overview

Reference Images
      |
      v
Flux.1 Kontext / Flux.2 Klein     -- Image generation
      |
      v
Fish Audio S2 Pro                 -- Voice cloning + TTS
      |
      v
LTX-Video 2.3                     -- Talking head video generation
      |
      v
LTX-2 trainer                     -- AV LoRA training
      |
      v
Trained AV LoRA weights

Step 1 -- Generate Reference Images

Use Flux Kontext in ComfyUI to generate consistent reference images of your character across varied poses, angles, lighting conditions, and expressions.

[KONTEXT WORKFLOW]

Key settings used in this project:

  • Flux Kontext dev Q6_K GGUF
  • Sampler: res_3s + res_2m (RES4LYF)
  • FluxGuidance: 1
  • denoise: 1

Step 2 -- Clone the Voice

Use the Fish Audio S2 Pro model with a 10-15 second reference audio clip of your target voice. The model supports [pause], [short pause], and [emphasis] tags for pacing control.

Generate TTS audio for each clip's script using the cloned voice.

Step 3 -- Generate Training Clips

Use LTX-2.3 in ComfyUI to generate talking head clips from your reference images.

[LTX-2.3 IMAGE + AUDIO TO VIDEO WORKFLOW]

Dataset requirements:

  • 25-30 clips minimum
  • Resolution: 1280x704
  • FPS: 25
  • Length: 6-10 seconds per clip after trimming
  • Variety: front facing, 3/4 angles, side profile, different backgrounds, multiple emotions
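As a back-of-the-envelope check on the requirements above, the total footage works out as follows (plain arithmetic, using this project's 26-clip dataset as the example count):

```python
# Total footage and frame count for a dataset meeting the requirements above.
clips = 26               # this project used 26 clips (25-30 recommended)
seconds_per_clip = (6, 10)
fps = 25

min_s = clips * seconds_per_clip[0]
max_s = clips * seconds_per_clip[1]
print(f"footage: {min_s}-{max_s} s "
      f"({min_s * fps}-{max_s * fps} frames at {fps} fps)")
```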

Prompt format for each clip:

[scene description]. Mouth partially open during speech with only the front teeth partially visible, lips moving naturally without fully exposing all teeth. Smooth continuous motion, cinematic, realistic, sharp focus on subject. The person is talking, and he says: "[transcript]"

Background complexity directly impacts lip sync quality. Simple and dark backgrounds produce the best results. Complex backgrounds with many competing elements reduce lip sync accuracy.

Step 4 -- Prepare the Dataset

Structure your dataset folder as follows:

ohwxperson_dataset_v1/
  clip_001.mp4          # video with embedded audio from LTX-2.3
  clip_002.mp4
  ...
  CAPTIONS.json

Caption format in CAPTIONS.json:

{
  "captions": [
    {
      "file": "clip_001.mp4",
      "caption": "[VISUAL] OHWXPERSON, [visual description of scene, pose, clothing, background]. [SPEECH] OHWXPERSON speaks in a [voice description]: \"[exact transcript]\""
    }
  ]
}

A reference CAPTIONS.json from this project is included in this repository.
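The caption file can be generated from a list of per-clip metadata. The sketch below is a hypothetical script (not the repository's tooling) that emits a CAPTIONS.json in the format shown above:

```python
# Hypothetical script: build CAPTIONS.json from (file, visual, voice, transcript)
# tuples, following the documented caption format.
import json

def make_caption(visual: str, voice: str, transcript: str) -> str:
    return (
        f"[VISUAL] OHWXPERSON, {visual}. "
        f'[SPEECH] OHWXPERSON speaks in a {voice}: "{transcript}"'
    )

entries = [
    ("clip_001.mp4",
     "standing in a dark studio, black shirt, facing the camera",
     "calm baritone voice",
     "This is the first training clip."),
]

captions = {
    "captions": [
        {"file": f, "caption": make_caption(v, vo, t)} for f, v, vo, t in entries
    ]
}

with open("CAPTIONS.json", "w", encoding="utf-8") as fh:
    json.dump(captions, fh, indent=2, ensure_ascii=False)
```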

Step 5 -- Train with ltx-trainer

Recommended training configuration:

model:
  model_path: ltx-2.3-22b-dev.safetensors
  text_encoder_path: gemma
  training_mode: lora

lora:
  rank: 32
  alpha: 32
  target_modules: [to_k, to_q, to_v, to_out.0]

training_strategy:
  name: text_to_video
  with_audio: true
  first_frame_conditioning_p: 0.5

optimization:
  steps: 2000
  learning_rate: 1.0e-04
  batch_size: 1
  gradient_accumulation_steps: 1
  optimizer_type: adamw
  scheduler_type: linear
  mixed_precision_mode: bf16
  enable_gradient_checkpointing: true

validation:
  interval: 250
  inference_steps: 30
  guidance_scale: 4.0
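A quick sanity check on the config above: with 2000 steps, batch size 1, and no gradient accumulation over a 26-clip dataset, the run makes roughly 77 passes over the data and triggers 8 validation runs (plain arithmetic from the values in the config):

```python
# Sanity-check arithmetic for the training config above.
steps = 2000
dataset_size = 26        # clips in this project's dataset
batch_size = 1
grad_accum = 1
val_interval = 250

samples_seen = steps * batch_size * grad_accum
epochs = samples_seen / dataset_size
validations = steps // val_interval

print(f"~{epochs:.0f} passes over the dataset, {validations} validation runs")
```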

Training Details

Parameter         Value
---------------   -------------------------------------------------
Base model        LTX-Video 2.3 22B
Training mode     LoRA
LoRA rank         32
LoRA alpha        32
Steps             2000
Learning rate     1e-4
Batch size        1
Mixed precision   bf16
Dataset size      26 clips
Peak VRAM usage   77.08 GB
Training time     ~7.8 hours
Training cost     ~$5.33 (GCP Spot G4 instance, RTX PRO 6000 96GB)
Identity lock     Step 1250

Known Limitations (v1)

  • Slight audio buzz artifacts present in outputs
  • Eye blinking occasionally inconsistent (can be fixed by manual prompting)
  • Output quality is seed dependent -- sweep 3-5 seeds per generation
  • Character-specific weights -- lip sync and voice are tied to the trained character
  • Best results at 1280x736 @ 24fps
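The seed-sweep advice above can be sketched as a small loop. `generate_video` is a hypothetical stand-in for your actual ComfyUI or API call:

```python
# Minimal seed-sweep sketch: generate with several seeds and keep all
# outputs for manual comparison, since quality is seed dependent.
import random

def generate_video(prompt: str, seed: int) -> str:
    # Placeholder: in practice, queue a ComfyUI workflow with this seed.
    return f"output_seed{seed}.mp4"

def seed_sweep(prompt: str, n_seeds: int = 4) -> list[str]:
    seeds = random.sample(range(2**31), n_seeds)   # 3-5 seeds recommended
    return [generate_video(prompt, s) for s in seeds]

outputs = seed_sweep('OHWXPERSON, ... The person is talking, and he says: "Hello."')
print(outputs)
```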

v2 Roadmap

  • Audio preprocessing with MelBand Roformer before training to eliminate buzz artifacts
  • Explicit eye blinking captions and dedicated blinking clips in dataset
  • Extended training to 2500-3000 steps
  • Larger and more diverse dataset

Files

File                                              Description
-----------------------------------------------   ----------------------------------------------------
LTX-2.3-22b-AV-LoRA-talking-head-v1.safetensors   Final trained LoRA weights (v1)
CAPTIONS.json                                     Reference caption file for dataset structure
ohwxperson_av_lora.yaml                           Full training configuration
flux_kontext_clownsharkextended.json              Flux Kontext workflow for generating reference images
LTX-2-3-I2V.json                                  LTX-Video 2.3 Image to Video workflow
LTX-2-3-I2V-Custom-Audio.json                     LTX-Video 2.3 Image + Custom Audio to Video workflow

Citation

If you use this model or methodology in your work, please credit this repository.


License

The LoRA weights are released for research and personal use. Commercial use requires separate permission.
