
# Build Your Real-Time AI Avatar

> Build a real-time conversational AI avatar on a single GPU with LTX 2.3 in Scope

# Building a real-time AI avatar on a single GPU with LTX 2.3

In this tutorial you will build a real-time conversational AI avatar on top of LTX 2.3 in about five minutes. You type a question, the avatar answers on video, and the whole round trip takes 5 to 7 seconds. The language model that writes the reply, the video model that animates the face, and the audio all run together on a single GPU.

Here is what you are building:

<video controls autoPlay muted loop playsInline className="w-full aspect-[1920/1192] rounded-xl" src="https://mintcdn.com/dd/8oGZGzUG7FPrsIZQ/images/scope/tutorials/ltx-2-avatar/avatar.mp4?fit=max&auto=format&n=8oGZGzUG7FPrsIZQ&q=85&s=7de4a75e4856f2b79754343be9a0f0d7" data-path="images/scope/tutorials/ltx-2-avatar/avatar.mp4" />

<Card title="Live LTX-2 Avatar workflow" icon="diagram-project" href="https://app.daydream.live/workflows/rafal/live-ltx-2-avatar">
  Open the complete workflow on Daydream. Install it into Scope in one click, then adjust any node you like.
</Card>

This is a [Workflow Builder](/scope/guides/workflow-builder) project, so there is no code to write. You install a plugin or two, load the workflow, and press **Play**. The sections below walk through what every node does, so you can change the face, the voice, or the model once it is running.

***

## Why build your own avatar

Real-time conversational avatars are a crowded category. [Tavus](https://tavus.io/) ships Phoenix-4 at sub-500ms end to end. [Beyond Presence](https://www.beyondpresence.ai/) ships Genesis at around 250ms speech-to-avatar. [D-ID](https://www.d-id.com/) moved on from talking-photo gimmicks to Visual AI Agents. All three let you spin up a talking head behind an API, and if a hosted service is the right fit for your product, they are good ones.

They are also all cloud SaaS. Your face, your voice, and your prompts travel to someone else's servers and get logged there. Per-minute pricing eats into unit economics on anything that runs continuously. And you do not get to choose the model or the pipeline: it is their language model, their avatar, their compute.

The workflow in this tutorial is the alternative. The weights sit on your disk, inference runs on your GPU, every node is swappable, and the WebRTC stream goes from your browser straight to your own process. Nothing leaves the box.

### Why not a lipsync tool

[MuseTalk](https://github.com/TMElyralab/MuseTalk), [LivePortrait](https://github.com/KwaiVGI/LivePortrait), and [Hallo3](https://github.com/fudan-generative-vision/hallo3) run locally too, so privacy is not the difference here. What sets them apart is what they animate: a head on a still image. MuseTalk (Tencent's 30 fps real-time lipsync, MIT-licensed) drives the mouth. LivePortrait (Kuaishou's portrait animator, what most ComfyUI workflows reach for in 2026) and Hallo3 (Fudan's video-diffusion talking heads, CVPR 2025) drive the full face and head pose. In every case the body and the scene around it stay frozen, and each one needs a separate text-to-speech step to produce the audio first.

LTX 2.3 is different. It is Lightricks's 22-billion-parameter open-weights audio-video model, and it generates the full video and the audio together from a single prompt. Joint generation at that scale is rare in open weights. The avatar can gesture, glance, react, and respond to its own lighting. Reach for this workflow when you want an actual avatar rather than a talking face stitched onto a photo.

***

## Prerequisites

This workflow runs the LTX 2.3 video model and a local language model side by side. The language model is small, so the hardware bar is the same as for LTX 2.3 on its own.

* An NVIDIA GPU with 24 GB VRAM or more (the demo was recorded on an H100, and the workflow also runs on a 5090)
* CUDA 12.8 or higher
* Python 3.12 or higher
* A HuggingFace access token with **read** permissions (see [HuggingFace Auth](/scope/guides/huggingface))
* Daydream Scope installed and running ([desktop app or local install](/scope/getting-started/quickstart))
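
If you want to check the GPU and Python requirements from a script before installing anything, a minimal sketch could look like the one below. It is not part of Scope. The function names are hypothetical, the 24,000 MiB cutoff is an assumption (some nominal 24 GB cards report slightly less than 24,576 MiB), and it does not check the CUDA version. It relies on the `nvidia-smi` command that ships with the NVIDIA driver.

```python
import subprocess
import sys

# Assumption: approximate cutoff for a nominal 24 GB card, in MiB.
MIN_VRAM_MIB = 24_000
MIN_PYTHON = (3, 12)


def parse_gpu_memory(csv_text: str) -> list[tuple[str, int]]:
    """Parse `name, memory.total` rows (memory in MiB) printed by nvidia-smi."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name is harmless.
        name, memory_mib = line.rsplit(",", 1)
        gpus.append((name.strip(), int(memory_mib.strip())))
    return gpus


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


def meets_requirements(gpus, python_version=sys.version_info[:2]) -> bool:
    """True if Python is new enough and at least one GPU has enough VRAM."""
    has_big_gpu = any(memory >= MIN_VRAM_MIB for _, memory in gpus)
    return tuple(python_version) >= MIN_PYTHON and has_big_gpu
```

On a machine with an NVIDIA driver installed, `meets_requirements(query_gpus())` runs the whole check. `query_gpus()` raises an exception if `nvidia-smi` is missing or fails.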

You also need two plugins, and neither one ships with Scope by default. [scope-ltx-2](https://github.com/daydreamlive/scope-ltx-2) provides the `ltx2` audio-video pipeline, and [scope-llm](https://github.com/leszko/scope-llm) provides the **Local LLM** node. From your Scope directory, install both:

```bash theme={null}
uv run daydream-scope install https://github.com/daydreamlive/scope-ltx-2
uv run daydream-scope install https://github.com/leszko/scope-llm
```

<Note>
  You can also install both plugins from the desktop app: open **Settings** → **Plugins**, paste the GitHub URL, and click **Install** (see the [Plugins guide](/scope/guides/plugins)). The [LTX 2.3 guide](/scope/guides/ltx-2-3#install-the-plugin) covers `scope-ltx-2` in more depth, including running it in the cloud.
</Note>

***

## Run the workflow

With both plugins installed, loading the avatar takes a few clicks.

<Steps>
  <Step title="Open the workflow">
    Go to the [Live LTX-2 Avatar workflow](https://app.daydream.live/workflows/rafal/live-ltx-2-avatar) on Daydream and click **Install Workflow**. Scope imports the full graph, checks that the required plugins and LoRA are available, and prompts you to download anything that is missing.
  </Step>

  <Step title="Let the weights download">
    On first run, LTX 2.3 downloads roughly 28 GB of weights, and the talking-head LoRA and the language model download alongside them. This takes a few minutes depending on your connection. Later runs start from the cached files.
  </Step>

  <Step title="Press Play">
    Type a question into the **Primitive** node and press **Play**. The avatar answers a few seconds later, with video and audio in sync.
  </Step>
</Steps>

<Tip>
  Prefer to build the graph by hand? Every node in the next section is available from the node picker in [Workflow Builder](/scope/guides/workflow-builder). Wiring them up yourself is a good way to learn how the pieces connect.
</Tip>

***

## How the seven nodes fit together

The workflow is seven nodes. Here is what each one does, in the order data flows through them.

1. **Primitive (text).** Holds the question you type. A Primitive node is just a value, in this case a string, that you can wire into anything downstream.

2. **Local LLM.** The node from the [scope-llm](https://github.com/leszko/scope-llm) plugin. It runs `SmolLM2-360M-Instruct` locally and turns your question into a short spoken reply. Its `system_prompt` keeps answers first-person, conversational, and under 12 words, and tells the model never to echo the question back. Its `template` parameter wraps the reply in the fixed prompt the video model expects (more on that below).

3. **LoRA.** Loads [`elix3r/LTX-2.3-22b-AV-LoRA-talking-head`](https://huggingface.co/elix3r/LTX-2.3-22b-AV-LoRA-talking-head) and permanently merges it into the LTX 2.3 weights. This adapter is what turns a general audio-video model into a talking-head model.

4. **Media (reference image).** A still portrait, wired into the pipeline's `i2v_image` input (labeled **I2V Reference Image** in the UI). Whatever face you put here becomes the avatar. The demo uses a public-domain photo of Einstein from Wikimedia Commons.

5. **Media (idle-loop video).** A short pre-recorded ping-pong clip of the avatar, wired into `idle_loop_path` (labeled **Idle Loop Clip**). It plays between answers so the screen never sits frozen.

6. **LTX 2.3 pipeline.** The `ltx2` node, running LTX 2.3 22B distilled in FP8. It takes the templated reply, the reference image, the LoRA-merged weights, and the idle clip, then generates video and audio jointly.

7. **Sink.** Sends the finished video and audio out over WebRTC to your browser.
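
To make the data flow concrete, the seven nodes can be written down as a small dependency graph. This is only an illustration: the node ids are made up for this sketch, and it is not Scope's workflow file format. Each entry maps a node to the nodes it reads from, and the standard library's `graphlib` produces an order in which every node comes after its inputs.

```python
from graphlib import TopologicalSorter

# Hypothetical node ids. Each key maps to the set of nodes it reads from.
DEPENDENCIES = {
    "primitive_text": set(),
    "local_llm": {"primitive_text"},
    "lora": set(),
    "media_reference_image": set(),
    "media_idle_loop": set(),
    "ltx2": {"local_llm", "lora", "media_reference_image", "media_idle_loop"},
    "sink": {"ltx2"},
}


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the nodes in an order where inputs always precede consumers."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

The four source nodes can appear in any relative order, but `ltx2` always comes after all of its inputs and `sink` always comes last.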

### The prompt template

The avatar LoRA was trained on a fixed prompt shape, so the Local LLM node's `template` parameter wraps every reply before it reaches the video model:

```
OHWXPERSON, a portrait of a person facing the camera.
The person is talking, and he says: "{answer}"
```

`{answer}` is where the language model's reply lands. `OHWXPERSON` is the trigger token the [talking-head LoRA](https://huggingface.co/elix3r/LTX-2.3-22b-AV-LoRA-talking-head) was trained on: it tells the model to apply the talking-head behavior, and whatever image you wired into the reference Media node becomes the face.

Depending on the question, the language model might produce replies like these:

* "I'm doing great, thanks for asking!"
* "I'm a local AI avatar running on Daydream Scope."
* "My favorite equation in the history of math is the Pythagorean Theorem."

Each one gets dropped into `{answer}`, and the video model speaks it.
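
The substitution itself is plain string templating. The sketch below shows the idea in Python. It is not the scope-llm plugin's code: `build_video_prompt` and `is_short_enough` are hypothetical helpers, and the word-count check only mirrors the "under 12 words" rule from the system prompt.

```python
# The template text from above, with {answer} as the placeholder.
TEMPLATE = (
    "OHWXPERSON, a portrait of a person facing the camera.\n"
    'The person is talking, and he says: "{answer}"'
)

# Mirrors the system prompt's rule that replies stay under 12 words.
MAX_WORDS = 12


def is_short_enough(answer: str) -> bool:
    """True if the reply has fewer than MAX_WORDS whitespace-separated words."""
    return len(answer.split()) < MAX_WORDS


def build_video_prompt(answer: str) -> str:
    """Drop the language model's reply into the fixed prompt shape."""
    return TEMPLATE.format(answer=answer.strip())


reply = "I'm doing great, thanks for asking!"
print(is_short_enough(reply))
print(build_video_prompt(reply))
```

Only the template is parsed for placeholders, so a reply that happens to contain curly braces is inserted as-is.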

***

## Hardware and performance

The demo was recorded on an H100, and the workflow runs on a 5090 as well. There is one catch on a 32 GB card: at FP8 the text encoder does not fit alongside the rest of the workflow, so the plugin offloads text conditioning to the CPU. That adds a few seconds of latency every time the prompt changes. Between prompts, the video streams normally.

<Note>
  Int4 quantization is in progress. Early results suggest the whole workflow fits into 32 GB without the CPU offload, which would let a 5090 run the pipeline end to end with no pause when the prompt changes.
</Note>

Once the model is warm, the numbers look like this:

* **LTX 2.3 22B distilled (FP8) on disk:** \~23.5 GB
* **Chunk length:** 121 frames at 25 fps (4.84 seconds)
* **Audio:** synced, 48 kHz
* **Denoising:** 8 steps at \~0.3 seconds each, about 2.5 to 3 seconds total
* **Total cycle per chunk:** 3 to 4 seconds

The FP8 model fits on a 24 GB card with the LoRA merged in. **Frame Count**, **Denoising Steps**, and the other settings are exposed as knobs on the LTX 2.3 node, so you can trade quality against latency. The stream stays smooth because of the margin between chunk length and cycle time: each chunk takes 3 to 4 seconds to generate but 4.84 seconds to play, so generation stays ahead of playback.
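
The arithmetic behind those numbers is easy to reproduce. The sketch below uses only the figures listed above. The variable names are illustrative, and the per-step time is approximate, which is why the product of steps and step time (about 2.4 seconds) lands slightly under the quoted denoising total.

```python
FRAMES_PER_CHUNK = 121
FPS = 25
DENOISING_STEPS = 8
SECONDS_PER_STEP = 0.3        # approximate, from the list above
CYCLE_SECONDS = (3.0, 4.0)    # quoted total cycle per chunk (fast, slow)

# How long one chunk takes to play back.
playback_seconds = FRAMES_PER_CHUNK / FPS

# Rough time spent in the denoising loop alone.
denoising_seconds = DENOISING_STEPS * SECONDS_PER_STEP

# Spare time per chunk at the slow and fast ends of the quoted cycle range.
headroom_min = playback_seconds - max(CYCLE_SECONDS)
headroom_max = playback_seconds - min(CYCLE_SECONDS)


def keeps_up(cycle_seconds: float, chunk_seconds: float) -> bool:
    """Generation stays ahead only if a chunk is made faster than it plays."""
    return cycle_seconds < chunk_seconds


print(f"playback per chunk: {playback_seconds:.2f} s")
print(f"denoising estimate: {denoising_seconds:.2f} s")
print(f"headroom per chunk: {headroom_min:.2f} to {headroom_max:.2f} s")
print(all(keeps_up(c, playback_seconds) for c in CYCLE_SECONDS))
```

If you raise **Denoising Steps** or lower **Frame Count**, rerun the same calculation with your own measured cycle time: once the cycle exceeds the chunk's playback time, playback will catch up with generation.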

***

## Where the seams show

LTX 2.3 is not autoregressive. Each chunk is generated independently from the reference image, not from whatever was on screen a moment ago. Between answers, the workflow plays the short idle-loop clip so the avatar does not sit frozen, but the idle loop and the generated chunks do not share end frames. You can spot a small seam at every transition between idle and speaking.
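
A ping-pong clip avoids a seam inside the idle loop itself by playing its frames forward and then backward. This tutorial does not say how the demo's idle clip was produced, so the sketch below only shows the general frame-ordering technique, with a hypothetical helper name. It does nothing about the seam between the idle loop and a generated chunk.

```python
def ping_pong_indices(frame_count: int) -> list[int]:
    """Frame order for one ping-pong cycle: forward, then back again.

    The first and last frames are not repeated on the way back, so the
    cycle can be looped without showing any frame twice in a row.
    """
    if frame_count < 1:
        raise ValueError("frame_count must be at least 1")
    forward = list(range(frame_count))
    backward = list(range(frame_count - 2, 0, -1))
    return forward + backward


print(ping_pong_indices(4))
```

Repeating that index sequence gives a loop where every frame is followed by one of its neighbors in the original clip.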

Autoregressive audio-video models are the real fix, and they are close. [Omniforcing](https://omniforcing.com/) and [Mutual Forcing](https://huggingface.co/papers/2604.25819) are two that are coming soon. When a model at LTX 2.3's quality ships with that architecture, you swap the pipeline node, drop the idle-loop Media node you no longer need, and the seams go with it. The rest of the graph stays exactly as it is.

***

## Make it your own

Every node in this workflow is swappable. Once you have it running, try the following:

* **Change the face.** Drop a different portrait into the reference Media node. Any clear, front-facing photo works, since `OHWXPERSON` carries the talking-head behavior and the image carries the identity.
* **Change how the avatar talks.** Swap `SmolLM2-360M-Instruct` for any LLM you like, whether it runs locally or behind an API, and rewrite the `system_prompt` to change the avatar's personality.
* **Add voice input.** Wire a `Whisper-small` speech-to-text node in front of the Local LLM node so you can talk to the avatar instead of typing.
* **Train a LoRA on yourself.** Swap the talking-head LoRA for one trained on your own face to get a personalized avatar.

Everything here is open source: [Daydream Scope](https://github.com/daydreamlive/scope), the two plugins, and the [workflow](https://app.daydream.live/workflows/rafal/live-ltx-2-avatar) itself. Fork it, break it, and tell us what you build.

***

## See also

<CardGroup cols={2}>
  <Card title="Using LTX 2.3" icon="film" href="/scope/guides/ltx-2-3">
    Install the LTX 2.3 plugin, run it locally or in the cloud, and use ID-LoRA and IC-LoRAs
  </Card>

  <Card title="LTX 2.3 Reference" icon="layer-group" href="/scope/reference/pipelines/ltx-2-3">
    The full parameter schema for the `ltx2` pipeline, including frame count, schedules, and audio modes
  </Card>

  <Card title="Workflow Builder" icon="diagram-project" href="/scope/guides/workflow-builder">
    Build and wire node graphs by hand in Scope's visual editor
  </Card>

  <Card title="Using LoRAs" icon="wand-magic-sparkles" href="/scope/guides/loras">
    Download, install, and manage LoRA adapters in Scope
  </Card>
</CardGroup>
