LTX 2.3 Pipeline
LTX 2.3 is a 22-billion-parameter DiT (Diffusion Transformer) audio-video foundation model from Lightricks. Unlike Scope’s other pipelines, which generate video only, LTX 2.3 produces synchronized audio and video in a single model pass. The scope-ltx-2 plugin wraps the model with FP8 quantization and CPU-resident block streaming so it fits on a 24 GB GPU.

LTX 2.3 ships as a separately installed plugin, not a built-in pipeline. See the LTX 2.3 guide for installation steps.
At a glance
| | |
|---|---|
| Base Model | Lightricks DiT 22B (distilled v3, FP8) |
| Estimated VRAM | ~22 GB (24 GB GPU recommended) |
| Text Encoder | Gemma 3 12B (FP8) |
| Audio Support | Yes (synchronized audio-video) |
| LoRA Support | LTX 2.3 LoRAs (permanent merge only) |
| IC-LoRA Support | Yes (structural control via reference video) |
| ID-LoRA Support | Yes (identity-driven talking-head) |
| VACE Support | No |
| T2V / I2V / V2V | Yes / Yes / Yes (via IC-LoRA) |
Output constraints
LTX 2.3 has two hard constraints that distinguish it from Scope’s other pipelines.

Resolution
height and width are both snapped to the nearest multiple of 32. The default is 384 x 320. Generation is faster at smaller resolutions; visual quality improves at larger ones, with the usual tradeoff against GPU headroom and frame rate.
Frame count
num_frames must follow the pattern 8 x K + 1 (9, 17, 25, 33, …, 257). Values that do not match are snapped to the nearest valid count. The default is 129, the minimum is 9, and the maximum is 257.
The 8×K+1 pattern comes from the video VAE, which downsamples temporally by 8 plus one anchor frame.
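The snapping rules above can be sketched in a few lines. This is illustrative logic, not the plugin’s actual implementation:

```python
def snap_dim(value: int, multiple: int = 32) -> int:
    """Snap a height or width to the nearest multiple of 32."""
    return max(multiple, round(value / multiple) * multiple)

def snap_num_frames(n: int) -> int:
    """Clamp to [9, 257], then snap to the nearest 8*K + 1 frame count."""
    n = min(max(n, 9), 257)
    return 8 * round((n - 1) / 8) + 1
```

For example, a requested 100 frames snaps to 97, and a 500-pixel width snaps to 512.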
Parameters
All parameters come from the plugin’s LTX2Config schema.
| Parameter | Default | Range | Description |
|---|---|---|---|
| height | 384 | multiples of 32 | Output height in pixels |
| width | 320 | multiples of 32 | Output width in pixels |
| num_frames | 129 | 9 to 257 (8×K+1) | Frames per inference batch |
| num_steps | 8 | 1 to 20 | Euler denoising steps |
| schedule | distilled | see Schedules | Sigma schedule |
| frame_rate | 24.0 | positive float | Output frame rate (metadata) |
| randomize_seed | true | boolean | Randomize seed between chunks for varied output |
| ffn_chunk_size | 4096 | null to disable | FFN chunk size for memory tuning |
| i2v_image | none | file or asset | Optional first-frame reference image |
| i2v_strength | 1.0 | 0.0 to 1.0 | First-frame conditioning strength (0 disables) |
| control_strength | 1.0 | 0.0 to 1.0 | Video-mode guide conditioning strength |
| audio_input | none | audio file | Audio reference or driving track |
| audio_mode | driving | driving or id_lora | Audio input semantics (see Audio modes) |
| identity_guidance_scale | 3.0 | 0.0 to 20.0 | ID-LoRA identity amplification |
| lora_merge_strategy | permanent_merge | fixed | Only permanent merge is supported for FP8 |
| sigmas | none | descending list | Custom sigma schedule (API only; overrides num_steps and schedule) |
| realtime_pacing_slack | 0.0 | ≥ 0 | Fraction ahead of wall-clock before throttling |
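As a hedged illustration, a parameter set overriding a few of these defaults might look like the following. The names come from the table above, but the surrounding request structure depends on how you invoke Scope, so treat the shape as an assumption:

```python
# Hypothetical parameter payload; names match the LTX2Config schema,
# but the wrapping request format depends on your Scope client.
ltx_params = {
    "height": 480,          # snapped to a multiple of 32
    "width": 480,
    "num_frames": 97,       # must match 8*K + 1
    "num_steps": 8,
    "schedule": "distilled",
    "frame_rate": 24.0,
    "randomize_seed": True,
}
```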
Schedules
LTX 2.3 supports five sigma schedules.

| Schedule | Description |
|---|---|
| distilled | Pre-trained 8-step schedule. Falls back to linear for other step counts. |
| linear | Linear ramp from 1.0 to 0.0. |
| cosine | Spends more steps in the mid-noise range. |
| linear_quadratic | Two-phase linear-then-quadratic schedule. |
| beta | Beta-distribution schedule. |
Use the sigmas parameter (API only) when the built-in schedules don’t suit your num_steps or you want different denoising trajectories.
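The built-in schedules are computed inside the plugin. As a sketch of what a valid custom sigmas list looks like, here is a descending linear ramp; the list length of num_steps + 1 is an assumption about how the Euler sampler pairs consecutive sigmas per step:

```python
def linear_sigmas(num_steps: int) -> list[float]:
    """Descending 1.0 -> 0.0 ramp. The num_steps + 1 length is an
    assumption about how the sampler consumes the list."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]
```

For example, `linear_sigmas(4)` yields `[1.0, 0.75, 0.5, 0.25, 0.0]`.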
Generation modes
Text mode (default)
Generates video and audio directly from the prompt with no other input. Audio is generated jointly with video.

Image-to-video
Set i2v_image to a reference image. The first frame of the output is conditioned on the image, scaled by i2v_strength. At i2v_strength = 1.0 the first frame matches the reference; at 0.0 the conditioning is disabled and the pipeline falls back to text-only.
Video mode (for IC-LoRAs)
Drives structural or stylistic conditioning with a reference video. Enabled by selecting a matching IC-LoRA and switching the input mode. control_strength scales the guide conditioning. See the IC-LoRA catalog below.
Audio modes
The audio_mode parameter determines what the audio_input does.
driving (default)
Input audio drives the video. Output audio equals the input; no audio diffusion runs. Use this for lip-syncing to a specific voiceover, song, or audio track.
id_lora
Input audio is a speaker identity reference (about 5 seconds of the subject speaking). Generated audio matches the voice, and the model produces lip-synced video of the subject. Requires the ID-LoRA weights, which download automatically with the base model. See the LTX 2.3 guide for the full workflow.
Memory architecture
The pipeline orchestrates GPU memory across four components to keep a 22B model under 24 GB:

- Gemma 3 12B text encoder (~13 GB) loads on GPU, encodes the prompt, then offloads to CPU.
- Video VAE + Audio VAE (~1 GB total) stay resident on GPU.
- Transformer (~23 GB total) stays CPU-resident. Non-block layers persist on GPU; transformer blocks stream from CPU with double-buffered async transfers and prefetching.
- Between generations, streaming state persists. Full teardown only happens when the text encoder reloads on a prompt change.
If you run out of memory, lower ffn_chunk_size (for example 2048 or 1024) or reduce num_frames.
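The block-streaming idea can be illustrated with a toy double-buffered loop. The real implementation uses CUDA streams and pinned memory, so this Python sketch only shows the overlap pattern, not GPU code:

```python
from concurrent.futures import ThreadPoolExecutor

def stream_blocks(blocks, load, run):
    """Toy double-buffered streaming: while block i executes, block i+1
    is already being transferred in the background.  `load` stands in
    for a CPU->GPU copy, `run` for executing a transformer block."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load, blocks[0])   # start first transfer
        outputs = []
        for i, _ in enumerate(blocks):
            current = future.result()           # wait for this block's transfer
            if i + 1 < len(blocks):
                future = pool.submit(load, blocks[i + 1])  # prefetch next
            outputs.append(run(current))        # compute overlaps next transfer
        return outputs
```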
Prompting
LTX 2.3 accepts plain descriptive prompts, and supports structured channel tags when using ID-LoRA.

Scene prompts
For general audio-video generation, a plain descriptive prompt works.

Channel-tagged prompts (ID-LoRA mode)
For identity-driven talking-head generation, use channel tags to separate visual, speech, and sound content. [VISUAL] controls scene content, [SPEECH] controls what the subject says, and [SOUNDS] controls environmental audio.
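A tagged prompt can be assembled mechanically. The bracketed tag names come from the docs above; the single-space joining and tag order are assumptions:

```python
def channel_prompt(visual: str, speech: str, sounds: str) -> str:
    """Join the three ID-LoRA channels into one tagged prompt string.
    Tag names are documented; exact whitespace is an assumption."""
    return f"[VISUAL] {visual} [SPEECH] {speech} [SOUNDS] {sounds}"
```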
IC-LoRA catalog
IC-LoRAs (In-Context LoRAs) condition generation on a reference video to enforce spatial structure, motion, or stylistic transformation. Unlike standard LoRAs that change style, IC-LoRAs direct what the output looks like frame by frame. Place IC-LoRA .safetensors files in ~/.daydream-scope/models/lora/ and select them from the LoRA picker in video mode. The reference_downscale_factor is read automatically from each file’s metadata.
See the LTX 2.3 guide for step-by-step usage.
Official (Lightricks)
Union Control
- File: ltx-2.3-22b-ic-lora-union-control-ref0.5.safetensors
- Source: Lightricks/LTX-2.3-22b-IC-LoRA-Union-Control
- Purpose: Unified structural control combining Canny edge, depth map, and pose skeleton conditioning in one checkpoint.
- Reference downscale: 2 (reference at 0.5x output resolution)
- Use cases: Animate depth maps into photorealistic videos, retarget character motion from pose skeletons, generate videos from edge-based storyboards.
- How to use: Prepare a control signal video (depth, Canny, or pose) at half the target resolution. Feed it as the reference input. The model follows the structural guidance while the text prompt controls style.
Motion Track Control
- File: ltx-2.3-22b-ic-lora-motion-track-control-ref0.5.safetensors
- Source: Lightricks/LTX-2.3-22b-IC-LoRA-Motion-Track-Control
- Purpose: Directs object motion using sparse spline-based trajectories rendered as trails of circles. Supports single or multiple simultaneous paths.
- Reference downscale: 2 (reference at 0.5x output resolution)
- Use cases: Direct object paths in product shots, character scenes, or nature footage. Guide multiple objects along independent paths in one generation. Extract motion tracks from existing videos with point trackers like SpatialTrackerV2 and replay them in new scenes.
- How to use: Provide a reference video with colored spline overlays indicating desired motion. Start with 3 to 4 keypoints per track and add more only if the interpolated path does not match your intent.
Community
Anime2Real
- File: ltx23_anime2real_rank64_v1_4500.safetensors
- Author: Alissonerdx
- Source: Alissonerdx/LTX-LoRAs
- Purpose: Converts anime-style video into photorealistic footage while preserving motion, composition, and scene layout.
- LoRA rank: 64
- Use cases: Turn animated content into live-action aesthetic for mashups or concept visualization. Bridge storyboard animation and live-action previsualization. Create realistic versions of anime scenes.
- How to use: Feed the anime video as the reference input. Prompt with a description of the realistic scene you want. The model translates the anime style to photorealism while following the reference motion.
Inpaint (Masked T2V)
- File: ltx23_inpaint_masked_t2v_rank128_v1_10000steps.safetensors
- Author: Alissonerdx
- Source: Alissonerdx/LTX-LoRAs
- Purpose: Text-guided video inpainting. Mask a region in a reference video and generate new content to fill it based on a text prompt.
- LoRA rank: 128
- Use cases: Replace objects or characters in a scene (for example, swap a car for a truck), remove unwanted elements and fill the region with context-appropriate content, or add new elements guided by text.
- How to use: The mask must be embedded into the guide video (not passed as a separate channel). The trained mask format uses 8x8 block patterns. The 10000-step checkpoint favors mask adherence; a 2500-step variant in the same repo favors prompt adherence.
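Since the mask must be baked into the guide video in 8×8 blocks, one way to prepare a frame is to snap the mask rectangle outward to block boundaries before painting it in. The flat fill value here is an assumption; check the LoRA’s release notes for the trained mask convention:

```python
import numpy as np

def embed_block_mask(frame, y0, x0, y1, x1, fill=255):
    """Paint a mask rectangle into a guide frame, expanded outward so its
    edges land on 8-pixel block boundaries.  `fill` is an assumed value."""
    out = frame.copy()
    ys, xs = (y0 // 8) * 8, (x0 // 8) * 8        # snap start down
    ye, xe = -(-y1 // 8) * 8, -(-x1 // 8) * 8    # snap end up (ceiling)
    out[ys:ye, xs:xe] = fill
    return out
```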
Edit Anything
- File: ltx23_edit_anything_global_rank128_v1_9000steps_adamw.safetensors
- Author: Alissonerdx
- Source: Alissonerdx/LTX-LoRAs
- Purpose: Experimental global video editing LoRA for add, remove, replace, and style-conversion edits driven by text on a reference video.
- Use cases: Add, remove, or replace subjects and objects in an existing video while preserving the rest of the scene. Apply global style conversions (for example, turn footage into a watercolor or Ghibli-style look). Generate synthetic edit-pair datasets for downstream training.
- How to use: Feed the source video as the reference input and write an action-first, spatially grounded prompt following one of the trained patterns:
- Add: Add a/an [subject] with [attributes], [location in the scene].
- Remove: Remove the [subject] [location or identifying description].
- Replace: Replace the [original subject] [location] with a/an [new subject] with [attributes].
- Convert: Convert the video into a [style name] style.
Ungrade
- File: ltx-2.3-22b-ic-lora-ungrade.safetensors
- Author: oumoumad
- Source: oumoumad/LTX-2.3-22b-IC-LoRA-Ungrade
- Purpose: Strips color grading and contrast adjustments from footage, returning a neutral ungraded look.
- Use cases: Normalize the look of clips shot with different cameras or color profiles before further processing. Create a consistent baseline color space across mixed-source footage.
- How to use: Feed the color-graded video as the reference input. The model outputs a version with color grading removed, preserving motion and composition.
Refocus
- File: ltx-2.3-22b-ic-lora-refocus.safetensors
- Author: oumoumad
- Source: oumoumad/LTX-2.3-22b-IC-LoRA-ReFocus
- Purpose: Restores focus to out-of-focus or soft footage, sharpening details while keeping motion consistent.
- Use cases: Sharpen footage shot with incorrect focus, recover detail from lens blur or shallow depth-of-field artifacts, or enhance the perceived quality of low-resolution webcam footage.
- How to use: Feed the blurry or soft video as the reference input. The model outputs a refocused version with enhanced sharpness.
Uncompress
- File: ltx-2.3-22b-ic-lora-uncompress.safetensors
- Author: oumoumad
- Source: oumoumad/LTX-2.3-22b-IC-LoRA-Uncompress
- Purpose: Reverses compression artifacts (blocking, banding, mosquito noise) from heavily compressed video.
- Use cases: Restore quality to aggressively H.264 or H.265 compressed video, clean up streaming artifacts from low-bitrate sources like YouTube or Zoom recordings, improve visual quality of low-bitrate archival footage.
- How to use: Feed the compressed video as the reference input. The model outputs an artifact-free version while preserving motion.
Outpaint
- File: ltx-2.3-22b-ic-lora-outpaint.safetensors
- Author: oumoumad
- Source: oumoumad/LTX-2.3-22b-IC-LoRA-Outpaint
- Purpose: Extends the canvas of a video by generating new content in regions marked as pure black. Fill sides, top, bottom, or any combination.
- Use cases: Convert 4:3 footage to 16:9 by outpainting the sides. Extend vertical video to landscape for desktop viewing. Add headroom or environmental context to tightly framed shots. Create cinematic letterbox-to-widescreen conversions.
- How to use: Letterbox your source video to the target canvas with black bars (RGB 0, 0, 0) where you want new content. The model fills those regions with temporally consistent content.
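Preparing the guide video amounts to centering each frame on a pure-black canvas. Here is a per-frame sketch with NumPy; the centering choice is mine, and you can bias the bars to one side instead:

```python
import numpy as np

def letterbox_frame(frame, target_h, target_w):
    """Center a frame on a pure-black (RGB 0,0,0) canvas of the target
    size; the black regions are what the Outpaint IC-LoRA fills in."""
    h, w = frame.shape[:2]
    canvas = np.zeros((target_h, target_w) + frame.shape[2:], dtype=frame.dtype)
    y, x = (target_h - h) // 2, (target_w - w) // 2
    canvas[y:y + h, x:x + w] = frame
    return canvas
```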
Cameraman
- File: LTX2.3-22B_IC-LoRA-Cameraman_v1_10500.safetensors
- Author: Cseti
- Source: Cseti/LTX2.3-22B_IC-LoRA-Cameraman_v1
- Purpose: Replicates camera movements (zoom, pan, tilt, orbit) from a reference video into newly generated content.
- LoRA rank: 32
- Training data: 77 video pairs across motion types.
- Supported motions: zoom in, zoom out, tilt up, tilt down, pan left/right, orbit CW/CCW, compound (for example zoom + tilt).
- Use cases: Transfer the camera work from a real-world reference clip to a generated scene. Recreate specific dolly, crane, or handheld motion in AI-generated content. Keep camera motion consistent across multiple generated clips.
- How to use: Provide a reference video carrying the desired camera motion and a text prompt describing the new scene. No trigger word required. Strength 0.7 to 1.0 is recommended. If the motion transfer feels too subtle, describe the movement explicitly in the prompt.
See also
LTX 2.3 Guide
Install the plugin, run it locally or in the cloud, and use ID-LoRA and IC-LoRAs
Pipelines Overview
Compare LTX 2.3 with Scope’s other pipelines
Using LoRAs
General LoRA installation and management in Scope
System Requirements
Hardware requirements across pipelines