Three seconds. That's all it takes. Record a three-second clip of anyone speaking, in any of 14 languages, and Confucius4-TTS will clone that voice and make it speak fluently in all the others. No reference transcript needed. No fine-tuning. And the entire thing is open-source under Apache 2.0, which permits commercial use as long as you keep the license and attribution notices.
NetEase Youdao released Confucius4-TTS on June 23, 2026 under their "Ziyue 4.0" (子曰4.0) initiative, and the open-source TTS landscape just got a serious shake-up.

Let's cut through the hype. The TTS space is crowded — Fish-Speech, GPT-SoVITS, CosyVoice, ElevenLabs, you name it. So what does Confucius4-TTS actually bring to the table that we haven't seen before?
1. No Reference Text Required (This Is the Big One)
Almost every other voice cloning model in existence (CosyVoice, Fish-Speech, OmniVoice, VoxCPM2) requires you to provide a transcript of what the reference speaker is saying. Confucius4-TTS doesn't. You feed it a 3-second WAV file and it figures out the rest. This is what the project calls "Unconstrained Voice Cloning," and it's not just marketing: the benchmark tables on GitHub back it up. Every competitor marked with a "†" requires reference text; Confucius4-TTS sits alone without that dagger across multiple benchmarks.
2. Cross-Lingual Without the Accent
If you've ever used a multilingual TTS system, you know the pain: your cloned English voice suddenly develops a heavy Chinese accent when speaking Japanese, or vice versa. Confucius4-TTS explicitly tackles this. The 14 supported languages — Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay, and Vietnamese — all maintain consistent voice character without the typical "accent bleed." More languages are promised soon.
3. Emotion Transfer Built In
This isn't just about tone matching. The model extracts emotional features from the reference audio — intonation, prosody, rhythm — and carries them across languages. A happy-sounding Chinese speaker will sound happy in German too. An angry English clip produces angry-sounding Korean. This is a subtle but powerful feature that most open-source TTS models either ignore or handle poorly.

Under the hood, Confucius4-TTS packs a 1.3B parameter model using a "speech encoder + LLM" architecture. Here's the pipeline:
This is a significant departure from traditional TTS pipelines that rely on autoregressive vocoders. Flow matching, popularized by models like Stable Diffusion 3, gives Confucius4-TTS better control over generation quality and speed.
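To make the flow-matching idea concrete: such models generate output by integrating a velocity field that carries a noise sample toward a data sample. The snippet below is a toy sketch, not Confucius4-TTS code. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the velocity field is the analytic straight-line field toward a known target, standing in for what a trained network would predict.

```python
import math


def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the toy field below stays finite
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Velocity of the straight-line path x_t = (1 - t) * x0 + t * target.

    Assumption for illustration only: a real model learns this field from
    data instead of being handed the target.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.8, -1.2, 0.3]     # stand-in for a random starting sample
target = [0.1, 0.5, -0.4]    # stand-in for acoustic features
velocity = make_toy_velocity(target)

few_steps = euler_sample(noise, velocity, num_steps=4)
many_steps = euler_sample(noise, velocity, num_steps=32)
print(few_steps)
print(many_steps)
```

With this exact toy field, any step count lands on the target. A learned field is only approximate, which is why the number of integration steps becomes a speed-versus-quality knob in real systems.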
The 54GB complete resource package includes everything: T2S model weights, S2A model, tokenizer, speaker encoder checkpoints, and configuration files. You can run it locally, offline, forever — no cloud API calls needed once deployed.
Requirements: Python 3.10, CUDA 12.6. A GPU with decent VRAM is recommended (the model is 1.3B parameters after all).
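Before downloading 54GB, it can help to confirm the machine matches those requirements. The check below is a minimal sketch. It assumes the project runs on PyTorch, which the source does not state, and `check_environment` is a hypothetical helper, not part of the repository.

```python
import sys


def check_environment(required_python=(3, 10)):
    """Report Python and (if PyTorch is present) CUDA details as a dict."""
    report = {
        "python_version": "%d.%d" % sys.version_info[:2],
        "python_ok": tuple(sys.version_info[:2]) == tuple(required_python),
        "torch_installed": False,
        "cuda_available": False,
        "cuda_version": None,
        "gpu_memory_gb": None,
    }
    try:
        import torch  # assumption: the project depends on PyTorch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    report["cuda_version"] = torch.version.cuda  # None on CPU-only builds
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["gpu_memory_gb"] = round(props.total_memory / 1e9, 1)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_version` prints something other than 12.6, check the repository's own installation notes before assuming it will work.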
The GitHub README contains extensive benchmark results across four evaluation suites. Here's the honest picture:
| Benchmark | Confucius4-TTS result (WER, lower is better) | Closest competitor |
|---|---|---|
| CV3-eval (en→zh) | WER 6.71 — beats F5-TTS, Spark-TTS, CosyVoice2 | CosyVoice3+DiffRO edges ahead (5.16) |
| X-Voice (de→zh) | WER 2.86 — competitive with X-Voice (3.07) | OmniVoice wins SIM (0.691 vs 0.569) |
| Seed-TTS-eval (English) | WER 1.49 — competitive | Qwen3-TTS slightly better (1.24) |
| MiniMax (German) | WER 0.47 — beats ElevenLabs (0.57) | FishAudio S2 close at 0.55 |
The pattern is clear: Confucius4-TTS is highly competitive but not dominant. Its real edges are cloning without reference text and cross-lingual consistency, not raw accuracy numbers. On the MiniMax multilingual test, the model absolutely crushes ElevenLabs on Thai (WER 1.56 vs 73.94) and Vietnamese (1.61 vs 73.42), showing where its true multilingual strength lies.
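For readers unfamiliar with the metric behind those numbers, here is a minimal sketch of word error rate. It assumes the tables report the standard definition as a percentage: word-level edit distance between a reference transcript and the transcript of the synthesized audio, divided by the reference length. The benchmarks' own transcription and text-normalization steps are not described in the source and are not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of five gives 20% WER.
print(word_error_rate("the quick brown fox jumps",
                      "the quick brown box jumps"))  # 20.0
```

On this scale, a WER of 1.56 means roughly one or two word errors per hundred words, while a WER above 70 means most words came out wrong.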

Here's where Confucius4-TTS fits in the 2026 open-source TTS ecosystem:
The 3-second requirement is genuinely industry-leading. Most competitors need 5-20 seconds of clean audio. And the fact that you don't need to provide what the speaker is actually saying? That's a workflow game-changer.
Let's be real — nothing is perfect:
54GB is heavy. For a 1.3B model, that's a lot of disk. This isn't running on your Raspberry Pi. You'll want a proper GPU setup.
85% similarity is self-reported. The "85% similarity, 97% accuracy" numbers come from NetEase's internal testing. Third-party independent evaluation is still thin. Early community testers on X/Twitter report "natural and fluent" results but note that "100% reproduction of nuanced timbre isn't achievable yet."
14 languages isn't 50. Fish-Speech covers ~50 languages; Confucius4-TTS is at 14 with "more coming soon." If you need niche language support today, you may need to look elsewhere.
448 GitHub stars as of late June 2026. For context, GPT-SoVITS has 45K+. The community is still nascent. That means fewer third-party tutorials, forks, and integrations — for now.
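To put the 54GB figure above in perspective, here is a back-of-the-envelope sketch of how much disk 1.3B parameters need on their own. The storage precisions are assumptions for illustration; the source does not say which precision the checkpoints use or how the package breaks down by component.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weights_size_gb(num_params, dtype):
    """Raw weight size in decimal gigabytes, ignoring file-format overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 1_300_000_000   # the 1.3B figure from the article
PACKAGE_GB = 54              # the full resource package size from the article

for dtype in ("fp32", "fp16"):
    size = weights_size_gb(NUM_PARAMS, dtype)
    share = 100.0 * size / PACKAGE_GB
    print(f"{dtype}: {size:.1f} GB, about {share:.0f}% of the package")
```

Even at full 32-bit precision, 1.3B parameters account for only a small slice of the download, so most of the 54GB has to come from the other components in the package rather than from a single 1.3B checkpoint.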
Want to try it right now? Here's the five-minute path:
```bash
# 1. Clone
git clone https://github.com/netease-youdao/Confucius4-TTS.git
cd Confucius4-TTS

# 2. Environment
conda create -n confuciustts python=3.10 -y
conda activate confuciustts
pip install -r requirements.txt

# 3. Clone a voice (3-second WAV → synthesized speech)
python example.py \
  --prompt_wav path/to/reference.wav \
  --text "Hello, this is a test of zero-shot voice cloning." \
  --lang en \
  --out output.wav \
  --config config/inference_config.yaml
```
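If you want the same reference voice in several languages, the command above can be wrapped in a small script. This is a sketch under assumptions: it uses only the flags shown in the quick start, the helpers `build_command` and `synthesize_all` are hypothetical, and the `zh` and `de` language codes are guesses modeled on the `en` example, so check the repository for the accepted values.

```python
import shlex
import subprocess


def build_command(prompt_wav, text, lang, out,
                  config="config/inference_config.yaml"):
    """Build the example.py invocation from the quick start as an argv list."""
    return [
        "python", "example.py",
        "--prompt_wav", prompt_wav,
        "--text", text,
        "--lang", lang,
        "--out", out,
        "--config", config,
    ]


def synthesize_all(prompt_wav, lines, dry_run=True):
    """Run (or just print) one synthesis command per language.

    `lines` maps a language code to the text to speak in that language.
    """
    commands = []
    for lang, text in lines.items():
        cmd = build_command(prompt_wav, text, lang, f"output_{lang}.wav")
        commands.append(cmd)
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if example.py fails
    return commands


if __name__ == "__main__":
    synthesize_all(
        "path/to/reference.wav",
        {
            "en": "Hello, this is a test of zero-shot voice cloning.",
            "zh": "你好，这是一次零样本语音克隆测试。",  # assumed language code
            "de": "Hallo, dies ist ein Test.",          # assumed language code
        },
    )
```

The dry run prints the commands without executing them; pass `dry_run=False` from inside the cloned repository once a single manual run works.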
Or try the online demo without installing anything.
Confucius4-TTS matters beyond the feature checklist. It's another data point in the accelerating trend of Chinese tech companies going all-in on open-source AI. Following DeepSeek, Alibaba's Qwen, and ByteDance's various releases, NetEase Youdao is betting that Apache 2.0 + commercial freedom + zero-friction cloning is the right formula to win developer mindshare.
For content creators, this means one thing: the barrier to multilingual content creation just dropped again. Dubbing short dramas for international audiences, creating multilingual educational content, building digital humans that speak 14 languages natively — all of this just got cheaper and simpler.
The 3-second, no-transcript cloning is the killer feature. Everything else is table stakes. Whether Confucius4-TTS builds the community it deserves will depend on how well the model generalizes beyond the benchmarks, and on how aggressively NetEase iterates. For now, it's absolutely worth a weekend project.