LTX 2.3 Pipeline
LTX 2.3 is a 22-billion-parameter DiT (Diffusion Transformer) audio-video foundation model from Lightricks. Unlike Scope’s other pipelines, which generate video only, LTX 2.3 produces synchronized audio and video in a single model pass. The scope-ltx-2 plugin wraps the model with FP8 quantization and CPU-resident block streaming so it fits on a 24 GB GPU.

LTX 2.3 ships as a separately installed plugin, not a built-in pipeline. See the LTX 2.3 guide for installation steps.
At a glance
| | |
|---|---|
| Base Model | Lightricks DiT 22B (distilled v3, FP8) |
| Estimated VRAM | ~22 GB (24 GB GPU recommended) |
| Text Encoder | Gemma 3 12B (FP8) |
| Audio Support | Yes (synchronized audio-video) |
| LoRA Support | LTX 2.3 LoRAs (permanent merge only) |
| IC-LoRA Support | Yes (structural control via reference video) |
| ID-LoRA Support | Yes (identity-driven talking-head) |
| VACE Support | No |
| T2V / I2V / V2V | Yes / Yes / Yes (via IC-LoRA) |
Output constraints
LTX 2.3 has two hard constraints that distinguish it from Scope’s other pipelines.

Resolution
height and width are both snapped to the nearest multiple of 32. The default is 384 x 320. Generation is faster at smaller resolutions; visual quality improves at larger ones, with the usual tradeoff against GPU headroom and frame rate.
Frame count
num_frames must follow the pattern 8 x K + 1 (9, 17, 25, 33, …, 257). Values that do not match are snapped to the nearest valid count. The default is 129, the minimum is 9, and the maximum is 257.
The 8×K+1 pattern comes from the video VAE, which downsamples temporally by 8 plus one anchor frame.
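The snapping rules above can be sketched in a few lines. This is illustrative logic, not the plugin’s actual implementation:

```python
def snap_dim(value: int, multiple: int = 32) -> int:
    """Snap a height or width to the nearest multiple of 32."""
    return max(multiple, round(value / multiple) * multiple)

def snap_num_frames(n: int) -> int:
    """Clamp to [9, 257], then snap to the nearest 8*K + 1 frame count."""
    n = min(max(n, 9), 257)
    return 8 * round((n - 1) / 8) + 1
```

For example, a requested 100 frames snaps to 97, and a 500-pixel width snaps to 512.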
Parameters
All parameters come from the plugin’s LTX2Config schema.
| Parameter | Default | Range | Description |
|---|---|---|---|
| height | 384 | multiples of 32 | Output height in pixels |
| width | 320 | multiples of 32 | Output width in pixels |
| num_frames | 129 | 9 to 257 (8×K+1) | Frames per inference batch |
| num_steps | 8 | 1 to 20 | Euler denoising steps |
| schedule | distilled | see Schedules | Sigma schedule |
| frame_rate | 24.0 | positive float | Output frame rate (metadata) |
| randomize_seed | true | boolean | Randomize seed between chunks for varied output |
| ffn_chunk_size | 4096 | null to disable | FFN chunk size for memory tuning |
| i2v_image | none | file or asset | Optional first-frame reference image |
| i2v_strength | 1.0 | 0.0 to 1.0 | First-frame conditioning strength (0 disables) |
| control_strength | 1.0 | 0.0 to 1.0 | Video-mode guide conditioning strength |
| audio_input | none | audio file | Audio reference or driving track |
| audio_mode | driving | driving or id_lora | Audio input semantics (see Audio modes) |
| identity_guidance_scale | 3.0 | 0.0 to 20.0 | ID-LoRA identity amplification |
| lora_merge_strategy | permanent_merge | fixed | Only permanent merge is supported for FP8 |
| sigmas | none | descending list | Custom sigma schedule (API only; overrides num_steps and schedule) |
| realtime_pacing_slack | 0.0 | ≥ 0 | Fraction ahead of wall-clock before throttling |
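As a hedged illustration, a parameter set overriding a few of these defaults might look like the following. The names come from the table above, but the surrounding request structure depends on how you invoke Scope, so treat the shape as an assumption:

```python
# Hypothetical parameter payload; names match the LTX2Config schema,
# but the wrapping request format depends on your Scope client.
ltx_params = {
    "height": 480,          # snapped to a multiple of 32
    "width": 480,
    "num_frames": 97,       # must match 8*K + 1
    "num_steps": 8,
    "schedule": "distilled",
    "frame_rate": 24.0,
    "randomize_seed": True,
}
```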
Schedules
LTX 2.3 supports five sigma schedules.

| Schedule | Description |
|---|---|
| distilled | Pre-trained 8-step schedule. Falls back to linear for other step counts. |
| linear | Linear ramp from 1.0 to 0.0. |
| cosine | Spends more steps in the mid-noise range. |
| linear_quadratic | Two-phase linear-then-quadratic schedule. |
| beta | Beta-distribution schedule. |
Use the sigmas parameter (API only) when the built-in schedules don’t suit your num_steps or you want different denoising trajectories.
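The built-in schedules are computed inside the plugin. As a sketch of what a valid custom sigmas list looks like, here is a descending linear ramp; the list length of num_steps + 1 is an assumption about how the Euler sampler pairs consecutive sigmas per step:

```python
def linear_sigmas(num_steps: int) -> list[float]:
    """Descending 1.0 -> 0.0 ramp. The num_steps + 1 length is an
    assumption about how the sampler consumes the list."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]
```

For example, `linear_sigmas(4)` yields `[1.0, 0.75, 0.5, 0.25, 0.0]`.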
Generation modes
Text mode (default)
Generates video and audio directly from the prompt with no other input. Audio is generated jointly with video.

Image-to-video
Set i2v_image to a reference image. The first frame of the output is conditioned on the image, scaled by i2v_strength. At i2v_strength = 1.0 the first frame matches the reference; at 0.0 the conditioning is disabled and the pipeline falls back to text-only.
Video mode (for IC-LoRAs)
Drives structural or stylistic conditioning with a reference video. Enabled by selecting a matching IC-LoRA and switching the input mode. control_strength scales the guide conditioning. See the IC-LoRA catalog below.
Audio modes
The audio_mode parameter determines what the audio_input does.
driving (default)
Input audio drives the video. Output audio equals the input; no audio diffusion runs. Use this for lip-syncing to a specific voiceover, song, or audio track.
id_lora
Input audio is a speaker identity reference (about 5 seconds of the subject speaking). Generated audio matches the voice, and the model produces lip-synced video of the subject. Requires the ID-LoRA weights, which download automatically with the base model. See the LTX 2.3 guide for the full workflow.
Memory architecture
The pipeline orchestrates GPU memory across four components to keep a 22B model under 24 GB:

- Gemma 3 12B text encoder (~13 GB) loads on GPU, encodes the prompt, then offloads to CPU.
- Video VAE + Audio VAE (~1 GB total) stay resident on GPU.
- Transformer (~23 GB total) stays CPU-resident. Non-block layers persist on GPU; transformer blocks stream from CPU with double-buffered async transfers and prefetching.
- Between generations, streaming state persists. Full teardown only happens when the text encoder reloads on a prompt change.
If you run out of memory, lower ffn_chunk_size (for example 2048 or 1024) or reduce num_frames.
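The block-streaming idea can be illustrated with a toy double-buffered loop. The real implementation uses CUDA streams and pinned memory, so this Python sketch only shows the overlap pattern, not GPU code:

```python
from concurrent.futures import ThreadPoolExecutor

def stream_blocks(blocks, load, run):
    """Toy double-buffered streaming: while block i executes, block i+1
    is already being transferred in the background.  `load` stands in
    for a CPU->GPU copy, `run` for executing a transformer block."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load, blocks[0])   # start first transfer
        outputs = []
        for i, _ in enumerate(blocks):
            current = future.result()           # wait for this block's transfer
            if i + 1 < len(blocks):
                future = pool.submit(load, blocks[i + 1])  # prefetch next
            outputs.append(run(current))        # compute overlaps next transfer
        return outputs
```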
Prompting
LTX 2.3 accepts plain descriptive prompts, and supports structured channel tags when using ID-LoRA.

Scene prompts
For general audio-video generation, a plain descriptive prompt works.

Channel-tagged prompts (ID-LoRA mode)
For identity-driven talking-head generation, use channel tags to separate visual, speech, and sound content. [VISUAL] controls scene content, [SPEECH] controls what the subject says, and [SOUNDS] controls environmental audio.
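A tagged prompt can be assembled mechanically. The bracketed tag names come from the docs above; the single-space joining and tag order are assumptions:

```python
def channel_prompt(visual: str, speech: str, sounds: str) -> str:
    """Join the three ID-LoRA channels into one tagged prompt string.
    Tag names are documented; exact whitespace is an assumption."""
    return f"[VISUAL] {visual} [SPEECH] {speech} [SOUNDS] {sounds}"
```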
IC-LoRA catalog
IC-LoRAs (In-Context LoRAs) condition generation on a reference video to enforce spatial structure, motion, or stylistic transformation. Unlike standard LoRAs that change style, IC-LoRAs direct what the output looks like frame by frame. Place IC-LoRA .safetensors files in ~/.daydream-scope/models/lora/ and select them from the LoRA picker in video mode. The reference_downscale_factor is read automatically from each file’s metadata.
See the LTX 2.3 guide for step-by-step usage.
Official (Lightricks)
Union Control
- File: ltx-2.3-22b-ic-lora-union-control-ref0.5.safetensors
- Source: Lightricks/LTX-2.3-22b-IC-LoRA-Union-Control
- Purpose: Unified structural control combining Canny edge, depth map, and pose skeleton conditioning in one checkpoint.
- Reference downscale: 2 (reference at 0.5x output resolution)
- Use cases: Animate depth maps into photorealistic videos, retarget character motion from pose skeletons, generate videos from edge-based storyboards.
- How to use: Prepare a control signal video (depth, Canny, or pose) at half the target resolution. Feed it as the reference input. The model follows the structural guidance while the text prompt controls style.
Motion Track Control
- File: ltx-2.3-22b-ic-lora-motion-track-control-ref0.5.safetensors
- Source: Lightricks/LTX-2.3-22b-IC-LoRA-Motion-Track-Control
- Purpose: Directs object motion using sparse spline-based trajectories rendered as trails of circles. Supports single or multiple simultaneous paths.
- Reference downscale: 2 (reference at 0.5x output resolution)
- Use cases: Direct object paths in product shots, character scenes, or nature footage. Guide multiple objects along independent paths in one generation. Extract motion tracks from existing videos with point trackers like SpatialTrackerV2 and replay them in new scenes.
- How to use: Provide a reference video with colored spline overlays indicating desired motion. Start with 3 to 4 keypoints per track and add more only if the interpolated path does not match your intent.
Community
Anime2Real
- File: ltx23_anime2real_rank64_v1_4500.safetensors
- Author: Alissonerdx
- Source: Alissonerdx/LTX-LoRAs
- Purpose: Converts anime-style video into photorealistic footage while preserving motion, composition, and scene layout.
- LoRA rank: 64
- Use cases: Turn animated content into live-action aesthetic for mashups or concept visualization. Bridge storyboard animation and live-action previsualization. Create realistic versions of anime scenes.
- How to use: Feed the anime video as the reference input. Prompt with a description of the realistic scene you want. The model translates the anime style to photorealism while following the reference motion.
Inpaint (Masked T2V)
- File: ltx23_inpaint_masked_t2v_rank128_v1_10000steps.safetensors
- Author: Alissonerdx
- Source: Alissonerdx/LTX-LoRAs
- Purpose: Text-guided video inpainting. Mask a region in a reference video and generate new content to fill it based on a text prompt.
- LoRA rank: 128
- Use cases: Replace objects or characters in a scene (for example, swap a car for a truck), remove unwanted elements and fill the region with context-appropriate content, or add new elements guided by text.
- How to use: The mask must be embedded into the guide video (not passed as a separate channel). The trained mask format uses 8x8 block patterns. The 10000-step checkpoint favors mask adherence; a 2500-step variant in the same repo favors prompt adherence.
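Since the mask must be baked into the guide video in 8×8 blocks, one way to prepare a frame is to snap the mask rectangle outward to block boundaries before painting it in. The flat fill value here is an assumption; check the LoRA’s release notes for the trained mask convention:

```python
import numpy as np

def embed_block_mask(frame, y0, x0, y1, x1, fill=255):
    """Paint a mask rectangle into a guide frame, expanded outward so its
    edges land on 8-pixel block boundaries.  `fill` is an assumed value."""
    out = frame.copy()
    ys, xs = (y0 // 8) * 8, (x0 // 8) * 8        # snap start down
    ye, xe = -(-y1 // 8) * 8, -(-x1 // 8) * 8    # snap end up (ceiling)
    out[ys:ye, xs:xe] = fill
    return out
```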
Edit Anything
- File: ltx23_edit_anything_global_rank128_v1_9000steps_adamw.safetensors
- Author: Alissonerdx
- Source: Alissonerdx/LTX-LoRAs
- Purpose: Experimental global video editing LoRA for add, remove, replace, and style-conversion edits driven by text on a reference video.
- Use cases: Add, remove, or replace subjects and objects in an existing video while preserving the rest of the scene. Apply global style conversions (for example, turn footage into a watercolor or Ghibli-style look). Generate synthetic edit-pair datasets for downstream training.
- How to use: Feed the source video as the reference input and write an action-first, spatially grounded prompt following one of the trained patterns:
- Add: Add a/an [subject] with [attributes], [location in the scene].
- Remove: Remove the [subject] [location or identifying description].
- Replace: Replace the [original subject] [location] with a/an [new subject] with [attributes].
- Convert: Convert the video into a [style name] style.
Ungrade
- File: ltx-2.3-22b-ic-lora-ungrade.safetensors
- Author: oumoumad
- Source: oumoumad/LTX-2.3-22b-IC-LoRA-Ungrade
- Purpose: Strips color grading and contrast adjustments from footage, returning a neutral ungraded look.
- Use cases: Normalize the look of clips shot with different cameras or color profiles before further processing. Create a consistent baseline color space across mixed-source footage.
- How to use: Feed the color-graded video as the reference input. The model outputs a version with color grading removed, preserving motion and composition.
Refocus
- File: ltx-2.3-22b-ic-lora-refocus.safetensors
- Author: oumoumad
- Source: oumoumad/LTX-2.3-22b-IC-LoRA-ReFocus
- Purpose: Restores focus to out-of-focus or soft footage, sharpening details while keeping motion consistent.
- Use cases: Sharpen footage shot with incorrect focus, recover detail from lens blur or shallow depth-of-field artifacts, or enhance the perceived quality of low-resolution webcam footage.
- How to use: Feed the blurry or soft video as the reference input. The model outputs a refocused version with enhanced sharpness.
Uncompress
- File: ltx-2.3-22b-ic-lora-uncompress.safetensors
- Author: oumoumad
- Source: oumoumad/LTX-2.3-22b-IC-LoRA-Uncompress
- Purpose: Reverses compression artifacts (blocking, banding, mosquito noise) from heavily compressed video.
- Use cases: Restore quality to aggressively H.264 or H.265 compressed video, clean up streaming artifacts from low-bitrate sources like YouTube or Zoom recordings, improve visual quality of low-bitrate archival footage.
- How to use: Feed the compressed video as the reference input. The model outputs an artifact-free version while preserving motion.
Outpaint
- File: ltx-2.3-22b-ic-lora-outpaint.safetensors
- Author: oumoumad
- Source: oumoumad/LTX-2.3-22b-IC-LoRA-Outpaint
- Purpose: Extends the canvas of a video by generating new content in regions marked as pure black. Fill sides, top, bottom, or any combination.
- Use cases: Convert 4:3 footage to 16:9 by outpainting the sides. Extend vertical video to landscape for desktop viewing. Add headroom or environmental context to tightly framed shots. Create cinematic letterbox-to-widescreen conversions.
- How to use: Letterbox your source video to the target canvas with black bars (RGB 0, 0, 0) where you want new content. The model fills those regions with temporally consistent content.
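Preparing the guide video amounts to centering each frame on a pure-black canvas. Here is a per-frame sketch with NumPy; the centering choice is mine, and you can bias the bars to one side instead:

```python
import numpy as np

def letterbox_frame(frame, target_h, target_w):
    """Center a frame on a pure-black (RGB 0,0,0) canvas of the target
    size; the black regions are what the Outpaint IC-LoRA fills in."""
    h, w = frame.shape[:2]
    canvas = np.zeros((target_h, target_w) + frame.shape[2:], dtype=frame.dtype)
    y, x = (target_h - h) // 2, (target_w - w) // 2
    canvas[y:y + h, x:x + w] = frame
    return canvas
```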
Cameraman
- File: LTX2.3-22B_IC-LoRA-Cameraman_v1_10500.safetensors
- Author: Cseti
- Source: Cseti/LTX2.3-22B_IC-LoRA-Cameraman_v1
- Purpose: Replicates camera movements (zoom, pan, tilt, orbit) from a reference video into newly generated content.
- LoRA rank: 32
- Training data: 77 video pairs across motion types.
- Supported motions: zoom in, zoom out, tilt up, tilt down, pan left/right, orbit CW/CCW, compound (for example zoom + tilt).
- Use cases: Transfer the camera work from a real-world reference clip to a generated scene. Recreate specific dolly, crane, or handheld motion in AI-generated content. Keep camera motion consistent across multiple generated clips.
- How to use: Provide a reference video carrying the desired camera motion and a text prompt describing the new scene. No trigger word required. Strength 0.7 to 1.0 is recommended. If the motion transfer feels too subtle, describe the movement explicitly in the prompt.
See also
LTX 2.3 Guide
Install the plugin, run it locally or in the cloud, and use ID-LoRA and IC-LoRAs
Pipelines Overview
Compare LTX 2.3 with Scope’s other pipelines
Using LoRAs
General LoRA installation and management in Scope
System Requirements
Hardware requirements across pipelines