Building a real-time AI avatar on a single GPU with LTX 2.3
In this tutorial you will build a real-time conversational AI avatar on top of LTX 2.3 in about five minutes. You type a question, the avatar answers on video, and the whole round trip takes 5 to 7 seconds. The language model that writes the reply, the video model that animates the face, and the audio all run together on a single GPU. Here is what you are building:
Live LTX-2 Avatar workflow
Open the complete workflow on Daydream. Install it into Scope in one click, then adjust any node you like.
Why build your own avatar
Real-time conversational avatars are a crowded category. Tavus ships Phoenix-4 at sub-500ms end to end. Beyond Presence ships Genesis at around 250ms speech-to-avatar. D-ID moved on from talking-photo gimmicks to Visual AI Agents. All three let you spin up a talking head behind an API, and if a hosted service is the right fit for your product, they are good ones.

They are also all cloud SaaS. Your face, your voice, and your prompts travel to someone else's servers and get logged there. Per-minute pricing eats into unit economics on anything that runs continuously. And you do not get to choose the model or the pipeline: it is their language model, their avatar, their compute.

The workflow in this tutorial is the alternative. The weights sit on your disk, inference runs on your GPU, every node is swappable, and the WebRTC stream goes from your browser straight to your own process. Nothing leaves the box.
Why not a lipsync tool
MuseTalk, LivePortrait, and Hallo3 run locally too, so privacy is not the difference here. What sets them apart is what they animate: a head on a still image. MuseTalk (Tencent's 30 fps real-time lipsync, MIT-licensed) drives the mouth. LivePortrait (Kuaishou's portrait animator, what most ComfyUI workflows reach for in 2026) and Hallo3 (Fudan's video-diffusion talking heads, CVPR 2025) drive the full face and head pose. In every case the body and the scene around it stay frozen, and each one needs a separate text-to-speech step to produce the audio first.

LTX 2.3 is different. It is Lightricks's 22-billion-parameter open-weights audio-video model, and it generates the full video and the audio together from a single prompt. Joint generation at that scale is rare in open weights. The avatar can gesture, glance, react, and respond to its own lighting. Reach for this workflow when you want an actual avatar rather than a talking face stitched onto a photo.
Prerequisites
This workflow runs the LTX 2.3 video model and a local language model side by side. The language model is small, so the hardware bar is the same as for LTX 2.3 on its own.
- An NVIDIA GPU with 24 GB VRAM or more (the demo was recorded on an H100, and the workflow also runs on a 5090; a quick VRAM check follows this list)
- CUDA 12.8 or higher
- Python 3.12 or higher
- A HuggingFace access token with read permissions (see HuggingFace Auth)
- Daydream Scope installed and running (desktop app or local install)
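Before installing anything, you can sanity-check the GPU side from Python. A minimal check, assuming PyTorch is already installed; the 24 GB threshold mirrors the requirement above:

```python
import torch

# Confirm a CUDA-capable GPU is visible and report its VRAM.
assert torch.cuda.is_available(), "No CUDA GPU detected"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.1f} GB VRAM")
if vram_gb < 24:
    print("Warning: this workflow expects 24 GB of VRAM or more")
```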
Install the plugins
Two plugins provide the nodes you need: `scope-ltx-2` provides the `ltx2` audio-video pipeline, and `scope-llm` provides the Local LLM node. From your Scope directory, install both:
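One way to do this, assuming both plugins install as Python packages straight from GitHub; the repository URLs below are assumptions, and the Plugins guide has the canonical steps:

```bash
# Assumed repository URLs -- check the Plugins guide for the exact commands.
pip install git+https://github.com/daydreamlive/scope-ltx-2.git
pip install git+https://github.com/daydreamlive/scope-llm.git
```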
You can also install both plugins from the desktop app: open Settings → Plugins, paste the GitHub URL, and click Install (see the Plugins guide). The LTX 2.3 guide covers `scope-ltx-2` in more depth, including running it in the cloud.
Run the workflow
With both plugins installed, loading the avatar takes a few clicks.
Open the workflow
Go to the Live LTX-2 Avatar workflow on Daydream and click Install Workflow. Scope imports the full graph, checks that the required plugins and LoRA are available, and prompts you to download anything that is missing.
Let the weights download
On first run, LTX 2.3 downloads roughly 28 GB of weights, and the talking-head LoRA and the language model download alongside them. This takes a few minutes depending on your connection. Later runs start from the cached files.
How the seven nodes fit together
The workflow is seven nodes. Here is what each one does, in the order data flows through them.

- Primitive (text). Holds the question you type. A Primitive node is just a value, in this case a string, that you can wire into anything downstream.
- Local LLM. The node from the `scope-llm` plugin. It runs `SmolLM2-360M-Instruct` locally and turns your question into a short spoken reply (see the sketch after this list). Its `system_prompt` keeps answers first-person, conversational, and under 12 words, and tells the model never to echo the question back. Its `template` parameter wraps the reply in the fixed prompt the video model expects (more on that below).
- LoRA. Loads `elix3r/LTX-2.3-22b-AV-LoRA-talking-head` and permanently merges it into the LTX 2.3 weights. This adapter is what turns a general audio-video model into a talking-head model.
- Media (reference image). A still portrait, wired into the pipeline's `i2v_image` input (labeled I2V Reference Image in the UI). Whatever face you put here becomes the avatar. The demo uses a public-domain photo of Einstein from Wikimedia Commons.
- Media (idle-loop video). A short pre-recorded ping-pong clip of the avatar, wired into `idle_loop_path` (labeled Idle Loop Clip). It plays between answers so the screen never sits frozen.
- LTX 2.3 pipeline. The `ltx2` node, running LTX 2.3 22B distilled in FP8. It takes the templated reply, the reference image, the LoRA-merged weights, and the idle clip, then generates video and audio jointly.
- Sink. Sends the finished video and audio out over WebRTC to your browser.
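To make the Local LLM node's job concrete, here is roughly what it does, written against the Hugging Face transformers API. This is a conceptual sketch of the node's behavior, not Scope's internal code, and the template string at the end is an illustrative stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model the node runs, under its Hugging Face repo id.
MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Mirrors the node's system_prompt: first-person, conversational,
# under 12 words, never echo the question back.
messages = [
    {
        "role": "system",
        "content": "Answer in first person, conversationally, in under "
        "12 words. Never repeat the question.",
    },
    {"role": "user", "content": "How are you doing?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(
    input_ids, max_new_tokens=40, do_sample=True, temperature=0.7
)
answer = tokenizer.decode(
    output[0][input_ids.shape[-1]:], skip_special_tokens=True
)

# The node's template parameter then wraps the reply for the video model.
# This template string is illustrative, not the workflow's actual default.
template = 'OHWXPERSON looks at the camera and says: "{answer}"'
print(template.format(answer=answer.strip()))
```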
The prompt template
The avatar LoRA was trained on a fixed prompt shape, so the Local LLM node's `template` parameter wraps every reply before it reaches the video model:
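The exact wording ships with the workflow; its shape, with the trigger token and the reply slot in place, looks something like this (illustrative, check the node's `template` field for the actual string):

```
A video of OHWXPERSON speaking to the camera. OHWXPERSON says: "{answer}"
```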
`{answer}` is where the language model's reply lands. `OHWXPERSON` is the trigger token the talking-head LoRA was trained on: it tells the model to apply the talking-head behavior, and whatever image you wired into the reference Media node becomes the face.
For a question like “How are you doing?”, the language model might produce any of these:
- “I’m doing great, thanks for asking!”
- “I’m a local AI avatar running on Daydream Scope.”
- “My favorite equation in the history of math is the Pythagorean Theorem.”
Whichever one comes back drops into `{answer}`, and the video model speaks it.
Hardware and performance
The demo was recorded on an H100, and the workflow runs on a 5090 as well. There is one catch on a 32 GB card: at FP8 the text encoder does not fit alongside the rest of the workflow, so the plugin offloads text conditioning to the CPU. That adds a few seconds of latency every time the prompt changes. Between prompts, the video streams normally.

Int4 quantization is in progress. Early results suggest the whole workflow fits into 32 GB without the CPU offload, which would let a 5090 run the pipeline end to end with no pause when the prompt changes.
- LTX 2.3 22B distilled (FP8) on disk: ~23.5 GB
- Chunk length: 121 frames at 25 fps (4.84 seconds)
- Audio: synced, 48 kHz
- Denoising: 8 steps at ~0.3 seconds each, about 2.5 to 3 seconds total
- Total cycle per chunk: 3 to 4 seconds
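Those numbers mean the pipeline runs faster than real time: each cycle produces more video than the wall-clock time it takes to generate. A quick check of the arithmetic:

```python
frames, fps = 121, 25
chunk_seconds = frames / fps            # 4.84 s of video per chunk
steps, step_seconds = 8, 0.3
denoise_seconds = steps * step_seconds  # ~2.4 s of denoising
cycle_seconds = 3.5                     # midpoint of the 3-4 s cycle
print(f"chunk: {chunk_seconds:.2f} s, denoise: {denoise_seconds:.1f} s")
print(f"real-time factor: {chunk_seconds / cycle_seconds:.2f}x")  # ~1.38x
```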
Where the seams show
LTX 2.3 is not autoregressive. Each chunk is generated independently from the reference image, not from whatever was on screen a moment ago. Between answers, the workflow plays the short idle-loop clip so the avatar does not sit frozen, but the idle loop and the generated chunks do not share end frames. You can spot a small seam at every transition between idle and speaking.

Autoregressive audio-video models are the real fix, and they are close. Omniforcing and Mutual Forcing are two that are coming soon. When a model at LTX 2.3's quality ships with that architecture, you swap the pipeline node, drop the idle-loop Media node you no longer need, and the seams go with it. The rest of the graph stays exactly as it is.
Make it your own
Every node in this workflow is swappable. Once you have it running, try the following:

- Change the face. Drop a different portrait into the reference Media node. Any clear, front-facing photo works, since `OHWXPERSON` carries the talking-head behavior and the image carries the identity.
- Change how the avatar talks. Swap `SmolLM2-360M-Instruct` for any LLM you like, whether it runs locally or behind an API, and rewrite the `system_prompt` to change the avatar's personality.
- Add voice input. Wire a `Whisper-small` speech-to-text node in front of the Local LLM node so you can talk to the avatar instead of typing (see the sketch after this list).
- Train a LoRA on yourself. Swap the talking-head LoRA for one trained on your own face to get a personalized avatar.
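For the voice-input swap, the transcription step itself is only a few lines. A minimal sketch with the openai-whisper package, run outside Scope just to show what the node would do (`question.wav` is a placeholder file name):

```python
import whisper

# Transcribe a recorded question with the small Whisper model.
model = whisper.load_model("small")
result = model.transcribe("question.wav")
question = result["text"].strip()
print(question)  # this text would feed the Local LLM node's input
```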
See also
Using LTX 2.3
Install the LTX 2.3 plugin, run it locally or in the cloud, and use ID-LoRA and IC-LoRAs
LTX 2.3 Reference
The full parameter schema for the `ltx2` pipeline, including frame count, schedules, and audio modes
Workflow Builder
Build and wire node graphs by hand in Scope’s visual editor
Using LoRAs
Download, install, and manage LoRA adapters in Scope