By John | July 4, 2026
You've seen the headline: "输入一句话,AI 全自动帮你做短视频" — type one sentence and AI builds an entire short video. No editing. No timeline. No Premiere Pro-induced existential crisis.
That's Pixelle-Video, an Apache 2.0-licensed open-source engine from AIDC-AI (now under ATH-MaaS) that's racked up 24,000+ GitHub stars and 3,500+ forks in its first few months. But here's the thing — almost every tutorial assumes you're clicking buttons in a browser. What if you want to run it headless on Ubuntu, trigger generation with a single CLI command, and keep your AI character looking consistent across every video?
I spent the weekend spelunking through the codebase. Here's everything I found.
The pipeline is genuinely impressive:
Topic/Text → LLM Scripting → Image Prompt Gen → Media Generation (ComfyUI/API)
↓
Final MP4 ← BGM Mixing ← Frame Composition ← TTS Voiceover
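In plain English: you feed it a topic like "How do black holes evaporate?" and it writes a script with an LLM, turns each scene into an image prompt, generates the media through ComfyUI or an API, synthesizes a TTS voiceover, composes the frames, and mixes in background music to produce the final MP4.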
All of this happens automatically. The default output is vertical 1080×1920 — perfect for TikTok and YouTube Shorts.
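To make the data flow concrete, here is a minimal sketch of those stages in plain Python. Every function in it is a hypothetical stand-in that only passes placeholder strings along; none of these names come from the Pixelle-Video codebase, and the real engine does far more at each step.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Scene:
    narration: str
    image_prompt: str = ""
    media: str = ""
    audio: str = ""


async def write_script(topic: str, n_scenes: int) -> list[Scene]:
    # Stand-in for the LLM scripting step: one narration line per scene.
    return [Scene(narration=f"{topic} (part {i + 1})") for i in range(n_scenes)]


async def render_scene(scene: Scene, prompt_prefix: str) -> Scene:
    # Stand-ins for image prompt generation, media generation, and TTS.
    scene.image_prompt = f"{prompt_prefix}, {scene.narration}"
    scene.media = f"image<{scene.image_prompt}>"
    scene.audio = f"tts<{scene.narration}>"
    return scene


async def build_video(topic: str, n_scenes: int = 3,
                      prompt_prefix: str = "flat illustration") -> dict:
    scenes = await write_script(topic, n_scenes)
    # Scenes are independent, so they can be rendered concurrently.
    scenes = await asyncio.gather(*(render_scene(s, prompt_prefix) for s in scenes))
    # Stand-in for frame composition and BGM mixing.
    return {"resolution": (1080, 1920), "scenes": list(scenes), "bgm": "mixed"}


if __name__ == "__main__":
    video = asyncio.run(build_video("How do black holes evaporate?"))
    print(len(video["scenes"]), "scenes at", video["resolution"])
```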
Here's the tech stack breakdown:
| Layer | Technology | Role |
|---|---|---|
| Web UI | Streamlit (web/app.py) | Browser-based control panel |
| API | FastAPI (api/app.py) | REST endpoints on port 8000 |
| Core Engine | PixelleVideoCore in pixelle_video/service.py | Orchestrates everything |
| LLM | OpenAI-compatible SDK | Scriptwriting, prompt generation |
| Media | ComfyUI (local) or RunningHub (cloud) | Image/video generation |
| TTS | Edge TTS + ComfyUI workflows | Voice synthesis with cloning |
| Templates | HTML/CSS (1080×1920, 1920×1080) | Frame layout and rendering |
| Video | FFmpeg + Playwright | Composition, rendering, concatenation |
The critical insight: PixelleVideoCore is completely decoupled from the web UI. It's a standalone Python class with an async generate_video() method. The Streamlit UI and FastAPI are just wrappers around it.
The good news: Pixelle-Video already ships with everything you need for headless operation. Here are three approaches ranked from simplest to most powerful.
The first approach, a direct Python script, is the cleanest. Write a small script that imports PixelleVideoCore directly:
#!/usr/bin/env python3
"""pixelle-cli.py: headless Pixelle-Video CLI for Ubuntu"""
import asyncio
import argparse

from pixelle_video.service import PixelleVideoCore


async def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("text", help="Topic or script for video generation")
    parser.add_argument("--mode", default="generate",
                        choices=["generate", "fixed"])
    parser.add_argument("--scenes", type=int, default=5)
    parser.add_argument("--template", default="1080x1920/image_default.html")
    parser.add_argument("--prompt-prefix",
                        default="consistent character: a wise old sage in a library, "
                                "warm lighting, detailed illustration style")
    parser.add_argument("--ref-audio", help="Reference audio for voice cloning")
    parser.add_argument("--output", help="Output path for video")
    args = parser.parse_args()

    core = PixelleVideoCore()
    await core.initialize()

    result = await core.generate_video(
        text=args.text,
        mode=args.mode,
        n_scenes=args.scenes,
        frame_template=args.template,
        prompt_prefix=args.prompt_prefix,
        ref_audio=args.ref_audio,
        output_path=args.output,
    )

    print(f"✅ Video generated: {result.video_path}")
    print(f"   Duration: {result.duration:.1f}s")

    await core.cleanup()


if __name__ == "__main__":
    asyncio.run(main())
Usage:
uv run python pixelle-cli.py "How do quantum computers work?" \
--scenes 5 \
--prompt-prefix "consistent character: professor with glasses, lab coat, clean illustration style"
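If you want to render several topics in one go, a thin wrapper can build the same command line for each topic. This is a minimal sketch that assumes the script above is saved as pixelle-cli.py in the current directory. build_command and run_batch are illustrative names of my own, and the only flags used are the ones that script defines.

```python
import shlex
import subprocess


def build_command(topic, scenes=5, prompt_prefix=None, ref_audio=None):
    """Return the argv list for one run of pixelle-cli.py."""
    cmd = ["uv", "run", "python", "pixelle-cli.py", topic, "--scenes", str(scenes)]
    if prompt_prefix:
        cmd += ["--prompt-prefix", prompt_prefix]
    if ref_audio:
        cmd += ["--ref-audio", ref_audio]
    return cmd


def run_batch(topics, dry_run=True, **options):
    """Print one command per topic, or execute them when dry_run is False."""
    for topic in topics:
        cmd = build_command(topic, **options)
        if dry_run:
            # shlex.join quotes each argument so the line can be pasted into a shell.
            print(shlex.join(cmd))
        else:
            # check=True stops the batch on the first failed render.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_batch(["How do quantum computers work?", "Why is the sky blue?"], scenes=5)
```

Passing the arguments as a list keeps topics with spaces or quotes intact without any manual escaping.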
The second approach uses the REST API. Start the API in the background, then hit it with curl:
# Terminal 1: Start headless API
uv run python api/app.py --host 127.0.0.1 --port 8000
# Terminal 2: Generate video via API
curl -X POST http://localhost:8000/api/video/generate/sync \
-H "Content-Type: application/json" \
-d '{
"text": "The history of sushi in 60 seconds",
"mode": "generate",
"n_scenes": 5,
"frame_template": "1080x1920/image_default.html",
"prompt_prefix": "consistent character: Japanese chef, warm kitchen, anime art style"
}'
This returns a JSON response whose video_url field points to the generated MP4.
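The same request can be sent from Python using only the standard library. This sketch reuses the endpoint and JSON fields from the curl call and assumes the response carries the video_url field described here. The function names and the generous timeout are my own choices, not part of Pixelle-Video.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/video/generate/sync"


def build_request(text, n_scenes=5, prompt_prefix=None, url=API_URL):
    """Build (but do not send) the POST request for a synchronous render."""
    payload = {
        "text": text,
        "mode": "generate",
        "n_scenes": n_scenes,
        "frame_template": "1080x1920/image_default.html",
    }
    if prompt_prefix:
        payload["prompt_prefix"] = prompt_prefix
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(text, timeout=1800, **kwargs):
    """Send the request and return video_url (field name assumed from the text)."""
    request = build_request(text, **kwargs)
    # The sync endpoint blocks until rendering finishes, hence the long timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body.get("video_url")
```

Calling generate("The history of sushi in 60 seconds", n_scenes=5) with the API running should then hand back the URL of the finished video, provided the response shape matches that assumption.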
The third approach, for production use, is Docker: modify docker-compose.yml to skip the Streamlit web container, or start only the API container as shown here:
# Clone and configure
git clone https://github.com/ATH-MaaS/Pixelle-Video.git
cd Pixelle-Video
cp config.example.yaml config.yaml
# Edit config.yaml with your API keys
# Start only the API container
docker compose up api -d
# Generate video via curl
curl -X POST http://localhost:8000/api/video/generate/async \
-H "Content-Type: application/json" \
-d '{"text": "5 life lessons from stoicism", "mode": "generate", "n_scenes": 6}'
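An async endpoint usually means you have to check back for the result. The post does not cover how Pixelle-Video reports task status, so the sketch below keeps that part abstract: fetch_status is a callable you supply yourself, and the "completed" and "failed" status values are assumptions rather than documented API behavior.

```python
import time


def wait_for_task(fetch_status, interval=5.0, timeout=1800.0, sleep=time.sleep):
    """Call fetch_status() until it reports a terminal state or the timeout passes.

    fetch_status is supplied by the caller and must return a dict with a
    "status" key; the terminal values checked here are assumed, not documented.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_status()
        if state.get("status") in ("completed", "failed"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)
```

Injecting sleep as a parameter makes the loop easy to test without real waiting.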
This is where it gets interesting. Pixelle-Video doesn't have a built-in "character memory" system, but it exposes three powerful levers for maintaining visual consistency:
The first lever is prompt_prefix, your character's DNA. The prompt_prefix gets prepended to every image prompt, so if your prefix is consistent and descriptive, your character stays consistent:
# Good — specific character description
prompt_prefix: "A young woman with short silver hair, round glasses, wearing a
navy blue lab coat, cartoon illustration style, Pixar-inspired, same character
in every image"
# Not so good — too vague
prompt_prefix: "Minimalist black-and-white matchstick figure style illustration"
The key: describe your character like you're filling out a police sketch form. Hair color, eye shape, signature accessory, art style — lock it all down.
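One way to keep that description locked down is to store the traits as structured data and render the prefix from them, so a typo or reordering never slips in between runs. This is a small sketch of that idea. The Character class and its field names are my own invention, not something Pixelle-Video provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Character:
    subject: str
    hair: str
    accessory: str
    outfit: str
    art_style: str

    def to_prompt_prefix(self) -> str:
        """Join the fixed traits in a stable order so every scene gets the same text."""
        parts = [
            f"{self.subject} with {self.hair}",
            self.accessory,
            f"wearing {self.outfit}",
            self.art_style,
            "same character in every image",
        ]
        return ", ".join(part.strip() for part in parts if part.strip())


narrator = Character(
    subject="A young woman",
    hair="short silver hair",
    accessory="round glasses",
    outfit="a navy blue lab coat",
    art_style="cartoon illustration style",
)
print(narrator.to_prompt_prefix())
```

The resulting string can be pasted into config.yaml or passed as the prompt_prefix value on each run.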
The second lever is Pixelle-Video's Digital Human extension module, which takes an image (your character) and generates a talking-head video synced to your TTS audio. This is the closest thing to a "persistent character" in the current release.
The workflow lives in workflows/runninghub/ and workflows/selfhost/ — look for the digital human pipelines.
The third lever is reference audio. For voice consistency, pass a ref_audio clip:
uv run python pixelle-cli.py "Today's tech news..." \
--ref-audio "/home/john/voice-samples/my-narrator.wav"
The TTS engine will clone that voice across all scenes. Combined with a locked prompt_prefix, you get the same face and the same voice — your AI persona is born.
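Before handing a clip to the pipeline, it can save a failed run to confirm the file is a readable WAV of reasonable length. The sketch below uses Python's standard wave module. The function names and the 3-second minimum are arbitrary illustrative choices of mine, not Pixelle-Video requirements.

```python
import wave


def describe_wav(path):
    """Return basic properties of a WAV file; wave.Error is raised if it cannot be parsed."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate if rate else 0.0,
        }


def check_reference(path, min_seconds=3.0):
    """Reject clips shorter than min_seconds (an illustrative threshold only)."""
    info = describe_wav(path)
    if info["seconds"] < min_seconds:
        raise ValueError(f"{path}: only {info['seconds']:.1f}s of audio")
    return info
```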
Two more extension modules that help with character consistency:
For a persistent character workflow: generate one high-quality character image → use it as seed → run Image-to-Video for each scene → same face, different actions.
# Prerequisites
sudo apt update && sudo apt install -y ffmpeg curl fonts-noto-cjk
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc
# Clone Pixelle-Video
git clone https://github.com/ATH-MaaS/Pixelle-Video.git
cd Pixelle-Video
# Configure
cp config.example.yaml config.yaml
# Edit config.yaml:
# - Set LLM provider (OpenAI, Qwen, DeepSeek, or local Ollama)
# - Choose ComfyUI URL (local) or RunningHub API key (cloud)
# - Set image style prompt_prefix
# Install dependencies
uv sync
uv run playwright install --with-deps chromium
# Test your setup
uv run python -c "
import asyncio
from pixelle_video.service import PixelleVideoCore
async def test():
    core = PixelleVideoCore()
    await core.initialize()
    print('✅ Pixelle-Video core initialized successfully')
    await core.cleanup()

asyncio.run(test())
"
Pixelle-Video supports two media generation paths:
| | Local ComfyUI | RunningHub Cloud |
|---|---|---|
| Cost | Free (your electricity) | Pay-per-generation (~$0.01-0.10/video) |
| GPU | Required (6GB+ VRAM recommended) | None needed |
| Speed | Depends on your hardware | Consistent cloud performance |
| Privacy | Everything stays local | Images processed in cloud |
| Setup | Install ComfyUI + workflows + models | Just an API key |
My recommendation for integrated GPUs like the Radeon 800M: the iGPU is probably too tight for ComfyUI (Flux wants at least 6GB of VRAM), so RunningHub is the pragmatic choice. Set runninghub_api_key in config.yaml and use the runninghub/ workflows.
The same team also built Pixelle-MCP, which exposes ComfyUI workflows as MCP (Model Context Protocol) tools. This means AI agents can directly call video generation workflows without touching the Pixelle-Video web UI at all.
# Install and run Pixelle-MCP CLI
uvx pixelle@latest
# Or: pip install pixelle && pixelle
This is arguably the most "headless" approach: a unified CLI that bridges LLMs and ComfyUI, running at http://localhost:9004 with the MCP endpoint at http://localhost:9004/pixelle/mcp.

To recap the character-consistency recipe: prompt_prefix + Digital Human extension + reference audio cloning.

Questions? Found a better CLI approach? Drop a comment below, or fork the repo and ship a PR. The Pixelle-Video team is actively accepting contributions.
Built and tested on Ubuntu 26.04 (Resolute Raccoon) with kernel 7.0.0-22-generic.