Real-time AI video generation has been a long-standing goal in the creative technology space. In December 2023, StreamDiffusion - co-created by Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Masayoshi Tomizuka, and Kurt Keutzer - emerged as the first widely adopted framework to actually make it possible. What started as a research paper has, in less than two years, gone from theory to a show-ready production tool powering live concerts, immersive installations, and interactive design pipelines. It's especially popular in the TouchDesigner community, where the StreamDiffusionTD plugin by dotsimulate lets artists run StreamDiffusion on a local GPU inside any TouchDesigner project. Built for speed and responsiveness, StreamDiffusion makes creative AI interactive in real time, enabling a new content format that's fun, immersive, and engaging.
By combining stream batching, residual classifier-free guidance (RCFG), stochastic similarity filtering (SSF), asynchronous I/O, and GPU acceleration hooks like TensorRT and xFormers, it achieves ~91 FPS img2img on an RTX 4090, with community demos pushing 100+ FPS for 1-step SD-Turbo generation.
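To see what that looks like in practice, here is a minimal img2img loop in the spirit of the project's README. The checkpoint name, the t_index_list values, and the exact method signatures are illustrative and may differ between StreamDiffusion versions, so treat this as a sketch rather than canonical usage.

```python
import torch
from diffusers import AutoencoderTiny, StableDiffusionPipeline
from diffusers.utils import load_image

from streamdiffusion import StreamDiffusion
from streamdiffusion.image_utils import postprocess_image

# Load a standard diffusers SD 1.5-family checkpoint onto the GPU (name illustrative)
pipe = StableDiffusionPipeline.from_pretrained("KBlueLeaf/kohaku-v2.1").to(
    device=torch.device("cuda"), dtype=torch.float16
)

# Wrap it in StreamDiffusion with a shortened denoising schedule
stream = StreamDiffusion(pipe, t_index_list=[32, 45], torch_dtype=torch.float16)

# Fuse the LCM-LoRA for few-step denoising and swap in the tiny autoencoder
stream.load_lcm_lora()
stream.fuse_lora()
stream.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(
    device=pipe.device, dtype=pipe.dtype
)
pipe.enable_xformers_memory_efficient_attention()

prompt = "neon city, cinematic lighting"
stream.prepare(prompt)

# Warm up the stream batch, then feed frames as fast as they arrive
init_image = load_image("path/to/frame.png").resize((512, 512))  # placeholder frame source
for _ in range(2):
    stream(init_image)

while True:
    x_output = stream(init_image)
    frame = postprocess_image(x_output, output_type="pil")[0]
```

The t_index_list selects which timesteps of the full schedule are actually denoised; with two entries, each frame needs only two UNet passes, and the techniques described below keep those passes as cheap and as parallel as possible.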
StreamDiffusion is a pipeline-level optimization that dramatically accelerates existing diffusion models so they can run interactively. Its architecture is designed to remove bottlenecks in traditional diffusion pipelines:
Stream batching: Reorders the denoising process so multiple frames or steps can be processed in parallel (see the first sketch after this list).
Residual Classifier-Free Guidance (RCFG): Computes guidance information once and reuses it, cutting redundant calculations.
Stochastic Similarity Filtering (SSF): Skips rendering entirely when frames are similar enough to the previous output, saving GPU cycles (see the second sketch after this list).
Async I/O and KV caching: Smooths data flow and reuses pre-computed values for faster iteration.
GPU acceleration hooks: Integrates TensorRT, xFormers, and tiny autoencoders to squeeze more performance from hardware.
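To make the stream batching idea concrete, here is a conceptual sketch rather than the library's internal code. The denoise_batch function is a hypothetical stand-in for one batched UNet plus scheduler step; the point is that incoming frames sit in a rolling buffer at different stages of denoising, and every call advances all of them at once.

```python
from collections import deque

N_STEPS = 2  # length of the shortened denoising schedule, e.g. t_index_list=[32, 45]

def make_stream(denoise_batch):
    """Conceptual stream batcher. `denoise_batch(latents, step_indices)` is a
    hypothetical stand-in for one batched denoiser call that advances each
    latent from its own current step to the next one."""
    buffer = deque()  # each slot: [latent, steps_completed]

    def push(new_latent):
        buffer.appendleft([new_latent, 0])           # newest frame enters the pipeline
        latents = [lat for lat, _ in buffer]
        steps = [done for _, done in buffer]
        advanced = denoise_batch(latents, steps)     # one batched call per incoming frame
        for slot, lat in zip(buffer, advanced):
            slot[0] = lat
            slot[1] += 1
        if buffer[-1][1] == N_STEPS:                 # oldest frame is fully denoised
            return buffer.pop()[0]
        return None                                   # pipeline still filling up

    return push
```

Once the buffer is full, throughput approaches one finished frame per batched call, at the cost of a couple of frames of latency.

And here is a hedged sketch of the similarity filtering idea. The exact probability schedule StreamDiffusion uses may differ, but the mechanism is the same: compare the new input against the previous one and, when they are nearly identical, skip the diffusion pass with high probability and reuse the cached output.

```python
import random
import torch
import torch.nn.functional as F

class SimilarityFilter:
    """Conceptual stochastic similarity filter (illustrative, not the library's code)."""

    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.prev_input = None
        self.cached_output = None

    def __call__(self, frame: torch.Tensor, run_pipeline):
        if self.prev_input is not None and self.cached_output is not None:
            sim = F.cosine_similarity(
                frame.flatten().float(), self.prev_input.flatten().float(), dim=0
            ).item()
            # The closer the similarity is to 1.0, the more likely this frame is skipped.
            skip_prob = max(0.0, (sim - self.threshold) / (1.0 - self.threshold))
            if random.random() < skip_prob:
                return self.cached_output          # reuse the last result, no GPU work
        self.prev_input = frame
        self.cached_output = run_pipeline(frame)    # full diffusion pass
        return self.cached_output
```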
The result: an order-of-magnitude performance boost compared to standard Stable Diffusion pipelines, plus up to ~2.4× better energy efficiency in some configurations.

StreamDiffusion can also be used in combination with a wide range of AI models and extensions to further improve output video quality and enhance artistic control:
Canny ControlNet: Add hard edge definition (see the sketch after this list)
HED ControlNet: Add soft edge definition
Depth ControlNet: Add structural definition through depth map
Color ControlNet: Add color control definition
TensorRT Acceleration: Compile models into optimized inference engines for extra speed
LoRAs: Apply specific artistic styles through additional model training
IPAdapters: Apply specific artistic styles through a single image
StreamV2V: Improve temporal consistency by leveraging a feature bank - using information from previous frames to inform the generation of the current frame.
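The ControlNet variants above all follow the same pattern: derive a conditioning image from the incoming frame and feed it alongside the prompt. How that is wired into the real-time loop depends on the StreamDiffusion fork or plugin you use, so the sketch below only shows the Canny case with the plain diffusers API, with illustrative model IDs.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Load a Canny ControlNet next to an SD 1.5-family checkpoint (IDs illustrative)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Turn the incoming frame into a hard-edge map that the ControlNet can follow
frame = cv2.imread("input_frame.png")               # placeholder camera / TouchDesigner frame
edges = cv2.Canny(frame, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe(
    "neon wireframe cityscape",
    image=control_image,
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
```

A depth or HED ControlNet swaps only the preprocessor and the ControlNet checkpoint; the rest of the call stays the same.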
StreamDiffusion is not tied to a single model. It supports:
SD-Turbo for ultra-low latency
SDXL-Turbo for higher fidelity
SD 1.5 and community checkpoints
Future video-first models
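Since the wrapper sits on top of a standard diffusers pipeline, switching backbones mostly comes down to which checkpoint you load and how many denoising indices you keep. The following is a hedged sketch; the index values and the SD-Turbo loading path are assumptions that may need adjusting for your version.

```python
import torch
from diffusers import StableDiffusionPipeline
from streamdiffusion import StreamDiffusion

# SD-Turbo: distilled for very few steps, so a single index can be enough (assumed setup)
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16
).to("cuda")
stream = StreamDiffusion(pipe, t_index_list=[0], torch_dtype=torch.float16)

# SD 1.5 or a community checkpoint: pair it with the LCM-LoRA and a couple of indices
# pipe = StableDiffusionPipeline.from_pretrained("<your-checkpoint>", torch_dtype=torch.float16).to("cuda")
# stream = StreamDiffusion(pipe, t_index_list=[32, 45], torch_dtype=torch.float16)
# stream.load_lcm_lora(); stream.fuse_lora()
```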
It also works with ControlNets, LoRAs, and pre/post-processing operators for advanced control, including depth maps, masking for in-painting, and multi-ControlNet blending.
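For multi-ControlNet blending specifically, the diffusers API accepts a list of ControlNets with per-network conditioning scales, and integrations that expose this inside StreamDiffusion follow the same shape. Model IDs, file paths, and blend weights below are illustrative.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Blend depth and canny guidance by passing both ControlNets as a list
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

depth_map = load_image("depth_frame.png")   # placeholder conditioning images
edge_map = load_image("canny_frame.png")

image = pipe(
    "abstract architecture, volumetric light",
    image=[depth_map, edge_map],                  # one conditioning image per ControlNet
    controlnet_conditioning_scale=[0.7, 0.4],     # blend weights
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
```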
Creative industries are moving from offline rendering toward low-latency, interactive AI pipelines. StreamDiffusion makes this shift practical, but running it locally requires:
Windows + RTX-class GPU (4090 recommended)
Driver/CUDA installation and configuration
Session stability for long-running live performances
These requirements make hosted services and API integrations an attractive alternative, offering pre-warmed models, predictable latency, and reduced setup time.