Building a real-time AI avatar on a single GPU with LTX 2.3
In this tutorial you will build a real-time conversational AI avatar on top of LTX 2.3 in about five minutes. You type a question, the avatar answers on video, and the whole round trip takes 5 to 7 seconds. The language model that writes the reply, the video model that animates the face, and the audio all run together on a single GPU. Here is what you are building:
Live LTX-2 Avatar workflow
Open the complete workflow on Daydream. Install it into Scope in one click, then adjust any node you like.
Why build your own avatar
Real-time conversational avatars are a crowded category. Tavus ships Phoenix-4 at sub-500ms end to end. Beyond Presence ships Genesis at around 250ms speech-to-avatar. D-ID moved on from talking-photo gimmicks to Visual AI Agents. All three let you spin up a talking head behind an API, and if a hosted service is the right fit for your product, they are good ones.

They are also all cloud SaaS. Your face, your voice, and your prompts travel to someone else's servers and get logged there. Per-minute pricing eats into unit economics on anything that runs continuously. And you do not get to choose the model or the pipeline: it is their language model, their avatar, their compute.

The workflow in this tutorial is the alternative. The weights sit on your disk, inference runs on your GPU, every node is swappable, and the WebRTC stream goes from your browser straight to your own process. Nothing leaves the box.
Why not a lipsync tool
MuseTalk, LivePortrait, and Hallo3 run locally too, so privacy is not the difference here. What sets them apart is what they animate: a head on a still image. MuseTalk (Tencent's 30 fps real-time lipsync, MIT-licensed) drives the mouth. LivePortrait (Kuaishou's portrait animator, what most ComfyUI workflows reach for in 2026) and Hallo3 (Fudan's video-diffusion talking heads, CVPR 2025) drive the full face and head pose. In every case the body and the scene around it stay frozen, and each one needs a separate text-to-speech step to produce the audio first.

LTX 2.3 is different. It is Lightricks's 22-billion-parameter open-weights audio-video model, and it generates the full video and the audio together from a single prompt. Joint generation at that scale is rare in open weights. The avatar can gesture, glance, react, and respond to its own lighting. Reach for this workflow when you want an actual avatar rather than a talking face stitched onto a photo.
Prerequisites
This workflow runs the LTX 2.3 video model and a local language model side by side. The language model is small, so the hardware bar is the same as for LTX 2.3 on its own.
- An NVIDIA GPU with 24 GB VRAM or more (the demo was recorded on an H100, and the workflow also runs on a 5090; a quick VRAM check follows this list)
- CUDA 12.8 or higher
- Python 3.12 or higher
- A HuggingFace access token with read permissions (see HuggingFace Auth)
- Daydream Scope installed and running (desktop app or local install)
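Before installing anything, you can sanity-check the GPU side from Python. A minimal check, assuming PyTorch is already installed; the 24 GB threshold mirrors the requirement above:

```python
import torch

# Confirm a CUDA-capable GPU is visible and report its VRAM.
assert torch.cuda.is_available(), "No CUDA GPU detected"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.1f} GB VRAM")
if vram_gb < 24:
    print("Warning: this workflow expects 24 GB of VRAM or more")
```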
Install the plugins
Two plugins provide the nodes you need: `scope-ltx-2` provides the `ltx2` audio-video pipeline, and `scope-llm` provides the Local LLM node. From your Scope directory, install both:
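One way to do this, assuming both plugins install as Python packages straight from GitHub; the repository URLs below are assumptions, and the Plugins guide has the canonical steps:

```bash
# Assumed repository URLs -- check the Plugins guide for the exact commands.
pip install git+https://github.com/daydreamlive/scope-ltx-2.git
pip install git+https://github.com/daydreamlive/scope-llm.git
```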
You can also install both plugins from the desktop app: open Settings → Plugins, paste the GitHub URL, and click Install (see the Plugins guide). The LTX 2.3 guide covers `scope-ltx-2` in more depth, including running it in the cloud.
Run the workflow
With both plugins installed, loading the avatar takes a few clicks.
Open the workflow
Go to the Live LTX-2 Avatar workflow on Daydream and click Install Workflow. Scope imports the full graph, checks that the required plugins and LoRA are available, and prompts you to download anything that is missing.
Let the weights download
On first run, LTX 2.3 downloads roughly 28 GB of weights, and the talking-head LoRA and the language model download alongside them. This takes a few minutes depending on your connection. Later runs start from the cached files.
How the seven nodes fit together
The workflow is seven nodes. Here is what each one does, in the order data flows through them.

- Primitive (text). Holds the question you type. A Primitive node is just a value, in this case a string, that you can wire into anything downstream.
- Local LLM. The node from the `scope-llm` plugin. It runs `SmolLM2-360M-Instruct` locally and turns your question into a short spoken reply (see the sketch after this list). Its `system_prompt` keeps answers first-person, conversational, and under 12 words, and tells the model never to echo the question back. Its `template` parameter wraps the reply in the fixed prompt the video model expects (more on that below).
- LoRA. Loads `elix3r/LTX-2.3-22b-AV-LoRA-talking-head` and permanently merges it into the LTX 2.3 weights. This adapter is what turns a general audio-video model into a talking-head model.
- Media (reference image). A still portrait, wired into the pipeline's `i2v_image` input (labeled I2V Reference Image in the UI). Whatever face you put here becomes the avatar. The demo uses a public-domain photo of Einstein from Wikimedia Commons.
- Media (idle-loop video). A short pre-recorded ping-pong clip of the avatar, wired into `idle_loop_path` (labeled Idle Loop Clip). It plays between answers so the screen never sits frozen.
- LTX 2.3 pipeline. The `ltx2` node, running LTX 2.3 22B distilled in FP8. It takes the templated reply, the reference image, the LoRA-merged weights, and the idle clip, then generates video and audio jointly.
- Sink. Sends the finished video and audio out over WebRTC to your browser.
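To make the Local LLM node's job concrete, here is roughly what it does, written against the Hugging Face transformers API. This is a conceptual sketch of the node's behavior, not Scope's internal code, and the template string at the end is an illustrative stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model the node runs, under its Hugging Face repo id.
MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Mirrors the node's system_prompt: first-person, conversational,
# under 12 words, never echo the question back.
messages = [
    {
        "role": "system",
        "content": "Answer in first person, conversationally, in under "
        "12 words. Never repeat the question.",
    },
    {"role": "user", "content": "How are you doing?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(
    input_ids, max_new_tokens=40, do_sample=True, temperature=0.7
)
answer = tokenizer.decode(
    output[0][input_ids.shape[-1]:], skip_special_tokens=True
)

# The node's template parameter then wraps the reply for the video model.
# This template string is illustrative, not the workflow's actual default.
template = 'OHWXPERSON looks at the camera and says: "{answer}"'
print(template.format(answer=answer.strip()))
```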
The prompt template
The avatar LoRA was trained on a fixed prompt shape, so the Local LLM node's `template` parameter wraps every reply before it reaches the video model:
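The exact wording ships with the workflow; its shape, with the trigger token and the reply slot in place, looks something like this (illustrative, check the node's `template` field for the actual string):

```
A video of OHWXPERSON speaking to the camera. OHWXPERSON says: "{answer}"
```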
`{answer}` is where the language model's reply lands. `OHWXPERSON` is the trigger token the talking-head LoRA was trained on: it tells the model to apply the talking-head behavior, and whatever image you wired into the reference Media node becomes the face.
For a question like “How are you doing?”, the language model might produce any of these:
- “I’m doing great, thanks for asking!”
- “I’m a local AI avatar running on Daydream Scope.”
- “My favorite equation in the history of math is the Pythagorean Theorem.”
Whichever one comes back drops into `{answer}`, and the video model speaks it.
Hardware and performance
The demo was recorded on an H100, and the workflow runs on a 5090 as well. There is one catch on a 32 GB card: at FP8 the text encoder does not fit alongside the rest of the workflow, so the plugin offloads text conditioning to the CPU. That adds a few seconds of latency every time the prompt changes. Between prompts, the video streams normally.

Int4 quantization is in progress. Early results suggest the whole workflow fits into 32 GB without the CPU offload, which would let a 5090 run the pipeline end to end with no pause when the prompt changes.
- LTX 2.3 22B distilled (FP8) on disk: ~23.5 GB
- Chunk length: 121 frames at 25 fps (4.84 seconds)
- Audio: synced, 48 kHz
- Denoising: 8 steps at ~0.3 seconds each, about 2.5 to 3 seconds total
- Total cycle per chunk: 3 to 4 seconds
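Those numbers mean the pipeline runs faster than real time: each cycle produces more video than the wall-clock time it takes to generate. A quick check of the arithmetic:

```python
frames, fps = 121, 25
chunk_seconds = frames / fps            # 4.84 s of video per chunk
steps, step_seconds = 8, 0.3
denoise_seconds = steps * step_seconds  # ~2.4 s of denoising
cycle_seconds = 3.5                     # midpoint of the 3-4 s cycle
print(f"chunk: {chunk_seconds:.2f} s, denoise: {denoise_seconds:.1f} s")
print(f"real-time factor: {chunk_seconds / cycle_seconds:.2f}x")  # ~1.38x
```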
Where the seams show
LTX 2.3 is not autoregressive. Each chunk is generated independently from the reference image, not from whatever was on screen a moment ago. Between answers, the workflow plays the short idle-loop clip so the avatar does not sit frozen, but the idle loop and the generated chunks do not share end frames. You can spot a small seam at every transition between idle and speaking.

Autoregressive audio-video models are the real fix, and they are close. Omniforcing and Mutual Forcing are two that are coming soon. When a model at LTX 2.3's quality ships with that architecture, you swap the pipeline node, drop the idle-loop Media node you no longer need, and the seams go with it. The rest of the graph stays exactly as it is.
Make it your own
Every node in this workflow is swappable. Once you have it running, try the following:

- Change the face. Drop a different portrait into the reference Media node. Any clear, front-facing photo works, since `OHWXPERSON` carries the talking-head behavior and the image carries the identity.
- Change how the avatar talks. Swap `SmolLM2-360M-Instruct` for any LLM you like, whether it runs locally or behind an API, and rewrite the `system_prompt` to change the avatar's personality.
- Add voice input. Wire a `Whisper-small` speech-to-text node in front of the Local LLM node so you can talk to the avatar instead of typing (see the sketch after this list).
- Train a LoRA on yourself. Swap the talking-head LoRA for one trained on your own face to get a personalized avatar.
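For the voice-input swap, the transcription step itself is only a few lines. A minimal sketch with the openai-whisper package, run outside Scope just to show what the node would do (`question.wav` is a placeholder file name):

```python
import whisper

# Transcribe a recorded question with the small Whisper model.
model = whisper.load_model("small")
result = model.transcribe("question.wav")
question = result["text"].strip()
print(question)  # this text would feed the Local LLM node's input
```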
See also
Using LTX 2.3
Install the LTX 2.3 plugin, run it locally or in the cloud, and use ID-LoRA and IC-LoRAs
LTX 2.3 Reference
The full parameter schema for the `ltx2` pipeline, including frame count, schedules, and audio modes
Workflow Builder
Build and wire node graphs by hand in Scope’s visual editor
Using LoRAs
Download, install, and manage LoRA adapters in Scope