
LTX 2.3 Pipeline

LTX 2.3 is a 22-billion-parameter DiT (Diffusion Transformer) audio-video foundation model from Lightricks. Unlike Scope’s other pipelines, which generate video only, LTX 2.3 produces synchronized audio and video in a single model pass. The scope-ltx-2 plugin wraps the model with FP8 quantization and CPU-resident block streaming so it fits on a 24 GB GPU.
LTX 2.3 ships as a separately installed plugin, not a built-in pipeline. See the LTX 2.3 guide for installation steps.

At a glance

| Property | Value |
|---|---|
| Base Model | Lightricks DiT 22B (distilled v3, FP8) |
| Estimated VRAM | ~22 GB (24 GB GPU recommended) |
| Text Encoder | Gemma 3 12B (FP8) |
| Audio Support | Yes (synchronized audio-video) |
| LoRA Support | LTX 2.3 LoRAs (permanent merge only) |
| IC-LoRA Support | Yes (structural control via reference video) |
| ID-LoRA Support | Yes (identity-driven talking-head) |
| VACE Support | No |
| T2V / I2V / V2V | Yes / Yes / Yes (via IC-LoRA) |

Output constraints

LTX 2.3 has two hard constraints that distinguish it from Scope’s other pipelines.

Resolution

Both height and width are snapped to the nearest multiple of 32. The default is 384 × 320 (height × width). Generation is faster at smaller resolutions; visual quality improves at larger ones, with the usual tradeoff against GPU headroom and frame rate.

Frame count

num_frames must follow the pattern 8×K + 1 (9, 17, 25, 33, …, 257). Values that do not match are snapped to the nearest valid count. The default is 129, the minimum is 9, and the maximum is 257. The 8×K + 1 pattern comes from the video VAE, which downsamples temporally by a factor of 8, plus one anchor frame.
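Both snapping rules can be sketched in a few lines. This is illustrative, not the plugin's actual code; in particular, how the plugin breaks rounding ties is an implementation detail assumed here:

```python
def snap_resolution(value):
    """Snap a requested height or width to the nearest multiple of 32."""
    return max(32, round(value / 32) * 32)

def snap_num_frames(n):
    """Snap a requested frame count to the nearest valid 8*K + 1 in [9, 257]."""
    k = round((n - 1) / 8)
    return min(max(8 * k + 1, 9), 257)
```

For example, a requested 390 × 330 becomes 384 × 320, and 120 frames snaps to 121.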

Parameters

All parameters come from the plugin’s LTX2Config schema.
| Parameter | Default | Range | Description |
|---|---|---|---|
| height | 384 | multiples of 32 | Output height in pixels |
| width | 320 | multiples of 32 | Output width in pixels |
| num_frames | 129 | 9 to 257 (8×K+1) | Frames per inference batch |
| num_steps | 8 | 1 to 20 | Euler denoising steps |
| schedule | distilled | see Schedules | Sigma schedule |
| frame_rate | 24.0 | positive float | Output frame rate (metadata) |
| randomize_seed | true | boolean | Randomize seed between chunks for varied output |
| ffn_chunk_size | 4096 | null to disable | FFN chunk size for memory tuning |
| i2v_image | none | file or asset | Optional first-frame reference image |
| i2v_strength | 1.0 | 0.0 to 1.0 | First-frame conditioning strength (0 disables) |
| control_strength | 1.0 | 0.0 to 1.0 | Video mode guide conditioning strength |
| audio_input | none | audio file | Audio reference or driving track |
| audio_mode | driving | driving or id_lora | Audio input semantics (see Audio modes) |
| identity_guidance_scale | 3.0 | 0.0 to 20.0 | ID-LoRA identity amplification |
| lora_merge_strategy | permanent_merge | fixed | Only permanent merge is supported for FP8 |
| sigmas | none | descending list | Custom sigma schedule (API only; overrides num_steps and schedule) |
| realtime_pacing_slack | 0.0 | ≥ 0 | Fraction ahead of wall-clock before throttling |

Schedules

LTX 2.3 supports five sigma schedules.
| Schedule | Description |
|---|---|
| distilled | Pre-trained 8-step schedule. Falls back to linear for other step counts. |
| linear | Linear ramp from 1.0 to 0.0. |
| cosine | Spends more steps in the mid-noise range. |
| linear_quadratic | Two-phase linear-then-quadratic schedule. |
| beta | Beta-distribution schedule. |
The distilled schedule is the default and gives the best quality at 8 steps. Other schedules are useful when you change num_steps or want different denoising trajectories.
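The linear schedule is the easiest to picture: a descending ramp from 1.0 to 0.0. A minimal sketch (whether the plugin emits num_steps or num_steps + 1 boundary values is an assumption here):

```python
def linear_sigmas(num_steps):
    """Descending linear ramp from 1.0 to 0.0 (num_steps + 1 boundaries)."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]
```

A custom descending list of this shape is also what the API-only sigmas parameter accepts, which overrides both num_steps and schedule.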

Generation modes

Text mode (default)

Generates video and audio directly from the prompt with no other input. Audio is generated jointly with video.

Image-to-video

Set i2v_image to a reference image. The first frame of the output is conditioned on the image, scaled by i2v_strength. At i2v_strength = 1.0 the first frame matches the reference; at 0.0 the conditioning is disabled and the pipeline falls back to text-only.

Video mode (for IC-LoRAs)

Drives structural or stylistic conditioning with a reference video. Enabled by selecting a matching IC-LoRA and switching the input mode. control_strength scales the guide conditioning. See the IC-LoRA catalog below.

Audio modes

The audio_mode parameter determines what the audio_input does.

driving (default)

Input audio drives the video. Output audio equals the input; no audio diffusion runs. Use this for lip-syncing to a specific voiceover, song, or audio track.

id_lora

Input audio is a speaker identity reference (about 5 seconds of the subject speaking). Generated audio matches the voice, and the model produces lip-synced video of the subject. Requires the ID-LoRA weights, which download automatically with the base model. See the LTX 2.3 guide for the full workflow.

Memory architecture

The pipeline orchestrates GPU memory across four components to keep a 22B model under 24 GB:
  1. Gemma 3 12B text encoder (~13 GB) loads on GPU, encodes the prompt, then offloads to CPU.
  2. Video VAE + Audio VAE (~1 GB total) stay resident on GPU.
  3. Transformer (~23 GB) is mostly CPU-resident: non-block layers persist on GPU, while transformer blocks stream from CPU with double-buffered async transfers and prefetching.
  4. Between generations, streaming state persists. Full teardown only happens when the text encoder reloads on a prompt change.
To reduce memory further on tighter GPUs, lower ffn_chunk_size (for example 2048 or 1024) or reduce num_frames.
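The double-buffered streaming idea can be sketched without any GPU code: while block i runs, block i + 1 is uploaded into the spare buffer on a background thread. This is an illustrative sketch only, with a plain function call standing in for the plugin's async CUDA copies, not its actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def stream_blocks(cpu_blocks, run_block):
    """Run each block in order, prefetching the next into a spare buffer."""
    buffers = [None, None]                       # double buffer: active + loading
    with ThreadPoolExecutor(max_workers=1) as pool:

        def upload(block):                       # stands in for a CPU->GPU copy
            return dict(block)

        pending = pool.submit(upload, cpu_blocks[0])
        outputs = []
        for i in range(len(cpu_blocks)):
            buffers[i % 2] = pending.result()    # wait for the current block
            if i + 1 < len(cpu_blocks):          # start prefetching the next one
                pending = pool.submit(upload, cpu_blocks[i + 1])
            outputs.append(run_block(buffers[i % 2]))
    return outputs
```

Because the upload of block i + 1 overlaps the compute of block i, transfer latency is largely hidden as long as compute time per block exceeds copy time.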

Prompting

LTX 2.3 accepts plain descriptive prompts and supports structured channel tags when using ID-LoRA.

Scene prompts

For general audio-video generation, a descriptive prompt works:
"A time-lapse of a flower blooming in a sunlit meadow, birds chirping in the background."

Channel-tagged prompts (ID-LoRA mode)

For identity-driven talking-head generation, use channel tags to separate visual, speech, and sound content:
[VISUAL]: A close-up of a person speaking in a park.
[SPEECH]: Hello world.
[SOUNDS]: Birds chirping.
[VISUAL] controls scene content, [SPEECH] controls what the subject says, and [SOUNDS] controls environmental audio.
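If you assemble these prompts programmatically, a small helper keeps the tag format consistent. The helper itself is hypothetical (not part of the plugin); only the [VISUAL]/[SPEECH]/[SOUNDS] tag format comes from the docs above:

```python
def channel_prompt(visual, speech=None, sounds=None):
    """Build a channel-tagged ID-LoRA prompt; empty channels are omitted."""
    parts = [f"[VISUAL]: {visual}"]
    if speech:
        parts.append(f"[SPEECH]: {speech}")
    if sounds:
        parts.append(f"[SOUNDS]: {sounds}")
    return "\n".join(parts)
```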

IC-LoRA catalog

IC-LoRAs (In-Context LoRAs) condition generation on a reference video to enforce spatial structure, motion, or stylistic transformation. Unlike standard LoRAs that change style, IC-LoRAs direct what the output looks like frame by frame. Place IC-LoRA .safetensors files in ~/.daydream-scope/models/lora/ and select them from the LoRA picker in video mode. The reference_downscale_factor is read automatically from each file’s metadata.
See the LTX 2.3 guide for step-by-step usage.

Official (Lightricks)

Union Control

  • File: ltx-2.3-22b-ic-lora-union-control-ref0.5.safetensors
  • Source: Lightricks/LTX-2.3-22b-IC-LoRA-Union-Control
  • Purpose: Unified structural control combining Canny edge, depth map, and pose skeleton conditioning in one checkpoint.
  • Reference downscale: 2 (reference at 0.5x output resolution)
  • Use cases: Animate depth maps into photorealistic videos, retarget character motion from pose skeletons, generate videos from edge-based storyboards.
  • How to use: Prepare a control signal video (depth, Canny, or pose) at half the target resolution. Feed it as the reference input. The model follows the structural guidance while the text prompt controls style.

Motion Track Control

  • File: ltx-2.3-22b-ic-lora-motion-track-control-ref0.5.safetensors
  • Source: Lightricks/LTX-2.3-22b-IC-LoRA-Motion-Track-Control
  • Purpose: Directs object motion using sparse spline-based trajectories rendered as trails of circles. Supports single or multiple simultaneous paths.
  • Reference downscale: 2 (reference at 0.5x output resolution)
  • Use cases: Direct object paths in product shots, character scenes, or nature footage. Guide multiple objects along independent paths in one generation. Extract motion tracks from existing videos with point trackers like SpatialTrackerV2 and replay them in new scenes.
  • How to use: Provide a reference video with colored spline overlays indicating desired motion. Start with 3 to 4 keypoints per track and add more only if the interpolated path does not match your intent.
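Densifying 3 to 4 sparse keypoints into per-frame positions is the core of preparing such a track; the circle-trail rendering and exact overlay format are Lightricks' conventions and not reproduced here. A minimal sketch using linear interpolation (the trained LoRA may expect smoother splines):

```python
def track_positions(keypoints, num_frames):
    """Linearly interpolate sparse (x, y) keypoints into per-frame positions."""
    segments = len(keypoints) - 1
    positions = []
    for f in range(num_frames):
        t = f / (num_frames - 1) * segments      # position along the whole path
        i = min(int(t), segments - 1)            # which segment we are in
        u = t - i                                # fraction within that segment
        (x0, y0), (x1, y1) = keypoints[i], keypoints[i + 1]
        positions.append((x0 + (x1 - x0) * u, y0 + (y1 - y0) * u))
    return positions
```

Each returned position would then be drawn as a circle on the corresponding reference frame, leaving a trail that the LoRA reads as the desired motion.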

Community

Anime2Real

  • File: ltx23_anime2real_rank64_v1_4500.safetensors
  • Author: Alissonerdx
  • Source: Alissonerdx/LTX-LoRAs
  • Purpose: Converts anime-style video into photorealistic footage while preserving motion, composition, and scene layout.
  • LoRA rank: 64
  • Use cases: Turn animated content into live-action aesthetic for mashups or concept visualization. Bridge storyboard animation and live-action previsualization. Create realistic versions of anime scenes.
  • How to use: Feed the anime video as the reference input. Prompt with a description of the realistic scene you want. The model translates the anime style to photorealism while following the reference motion.

Inpaint (Masked T2V)

  • File: ltx23_inpaint_masked_t2v_rank128_v1_10000steps.safetensors
  • Author: Alissonerdx
  • Source: Alissonerdx/LTX-LoRAs
  • Purpose: Text-guided video inpainting. Mask a region in a reference video and generate new content to fill it based on a text prompt.
  • LoRA rank: 128
  • Use cases: Replace objects or characters in a scene (for example, swap a car for a truck), remove unwanted elements and fill the region with context-appropriate content, or add new elements guided by text.
  • How to use: The mask must be embedded into the guide video (not passed as a separate channel). The trained mask format uses 8x8 block patterns. The 10000-step checkpoint favors mask adherence; a 2500-step variant in the same repo favors prompt adherence.
The official Lightricks inpainting IC-LoRA was removed from HuggingFace. This is a community-trained alternative.

Edit Anything

  • File: ltx23_edit_anything_global_rank128_v1_9000steps_adamw.safetensors
  • Author: Alissonerdx
  • Source: Alissonerdx/LTX-LoRAs
  • Purpose: Experimental global video editing LoRA for add, remove, replace, and style-conversion edits driven by text on a reference video.
  • Use cases: Add, remove, or replace subjects and objects in an existing video while preserving the rest of the scene. Apply global style conversions (for example, turn footage into a watercolor or Ghibli-style look). Generate synthetic edit-pair datasets for downstream training.
  • How to use: Feed the source video as the reference input and write an action-first, spatially grounded prompt following one of the trained patterns:
    • Add: Add a/an [subject] with [attributes], [location in the scene].
    • Remove: Remove the [subject] [location or identifying description].
    • Replace: Replace the [original subject] [location] with a/an [new subject] with [attributes].
    • Convert: Convert the video into a [style name] style.
Increase the scale/strength to 1.1-1.4 for better results.

Ungrade

  • File: ltx-2.3-22b-ic-lora-ungrade.safetensors
  • Author: oumoumad
  • Source: oumoumad/LTX-2.3-22b-IC-LoRA-Ungrade
  • Purpose: Strips color grading and contrast adjustments from footage, returning a neutral ungraded look.
  • Use cases: Normalize the look of clips shot with different cameras or color profiles before further processing. Create a consistent baseline color space across mixed-source footage.
  • How to use: Feed the color-graded video as the reference input. The model outputs a version with color grading removed, preserving motion and composition.

Refocus

  • File: ltx-2.3-22b-ic-lora-refocus.safetensors
  • Author: oumoumad
  • Source: oumoumad/LTX-2.3-22b-IC-LoRA-ReFocus
  • Purpose: Restores focus to out-of-focus or soft footage, sharpening details while keeping motion consistent.
  • Use cases: Sharpen footage shot with incorrect focus, recover detail from lens blur or shallow depth-of-field artifacts, or enhance the perceived quality of low-resolution webcam footage.
  • How to use: Feed the blurry or soft video as the reference input. The model outputs a refocused version with enhanced sharpness.

Uncompress

  • File: ltx-2.3-22b-ic-lora-uncompress.safetensors
  • Author: oumoumad
  • Source: oumoumad/LTX-2.3-22b-IC-LoRA-Uncompress
  • Purpose: Reverses compression artifacts (blocking, banding, mosquito noise) from heavily compressed video.
  • Use cases: Restore quality to aggressively H.264 or H.265 compressed video, clean up streaming artifacts from low-bitrate sources like YouTube or Zoom recordings, improve visual quality of low-bitrate archival footage.
  • How to use: Feed the compressed video as the reference input. The model outputs an artifact-free version while preserving motion.

Outpaint

  • File: ltx-2.3-22b-ic-lora-outpaint.safetensors
  • Author: oumoumad
  • Source: oumoumad/LTX-2.3-22b-IC-LoRA-Outpaint
  • Purpose: Extends the canvas of a video by generating new content in regions marked as pure black. Fill sides, top, bottom, or any combination.
  • Use cases: Convert 4:3 footage to 16:9 by outpainting the sides. Extend vertical video to landscape for desktop viewing. Add headroom or environmental context to tightly framed shots. Create cinematic letterbox-to-widescreen conversions.
  • How to use: Letterbox your source video to the target canvas with black bars (RGB 0, 0, 0) where you want new content. The model fills those regions with temporally consistent content.
Dark source footage can confuse the “generate here” sentinel. Apply a gamma 2.0 correction before feeding the model (which lifts dark content while keeping black bars at 0), then apply inverse gamma 0.5 to the output.
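The key property of the gamma trick is that a power curve lifts dark pixels while mapping 0 exactly to 0, so the black "generate here" bars survive. A per-pixel sketch, assuming the note's "gamma 2.0" means out = in^(1/2.0) (the convention under which gamma > 1 lifts dark values):

```python
def apply_gamma(frame, gamma):
    """Power curve on 8-bit pixel values; pure black (0) stays exactly 0."""
    return [round(255 * (v / 255) ** (1 / gamma)) for v in frame]
```

Pre-process with apply_gamma(frame, 2.0) and post-process the output with apply_gamma(frame, 0.5); the round trip is not bit-exact because of 8-bit quantization, but the black bars are preserved exactly.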

Cameraman

  • File: LTX2.3-22B_IC-LoRA-Cameraman_v1_10500.safetensors
  • Author: Cseti
  • Source: Cseti/LTX2.3-22B_IC-LoRA-Cameraman_v1
  • Purpose: Replicates camera movements (zoom, pan, tilt, orbit) from a reference video into newly generated content.
  • LoRA rank: 32
  • Training data: 77 video pairs across motion types.
  • Supported motions: zoom in, zoom out, tilt up, tilt down, pan left/right, orbit CW/CCW, compound (for example zoom + tilt).
  • Use cases: Transfer the camera work from a real-world reference clip to a generated scene. Recreate specific dolly, crane, or handheld motion in AI-generated content. Keep camera motion consistent across multiple generated clips.
  • How to use: Provide a reference video carrying the desired camera motion and a text prompt describing the new scene. No trigger word required. Strength 0.7 to 1.0 is recommended. If the motion transfer feels too subtle, describe the movement explicitly in the prompt.

See also

LTX 2.3 Guide

Install the plugin, run it locally or in the cloud, and use ID-LoRA and IC-LoRAs

Pipelines Overview

Compare LTX 2.3 with Scope’s other pipelines

Using LoRAs

General LoRA installation and management in Scope

System Requirements

Hardware requirements across pipelines