Audiobooks are a $20+ billion industry, yet converting your own books, research papers, or documents into spoken audio has remained surprisingly locked down. Commercial services charge per-title fees, proprietary TTS APIs demand per-character billing, and the open-source tooling has historically fallen short of natural-sounding narration.
Enter the Qwen Audiobook Converter by WhiskeyCoder — an open-source, MIT-licensed Python tool that takes any PDF, EPUB, DOCX, DOC, or TXT file and turns it into a high-quality MP3 audiobook using Alibaba's Qwen3 TTS voice model. With 744 GitHub stars, 102 forks, and a Reddit post that hit 149 upvotes (96% positive) within hours of launch, it's quickly becoming the go-to solution for local, private audiobook generation.

At its core, this is a single-file Python script (851 lines, audiobook_converter.py) that bridges two worlds:
http://127.0.0.1:7860), then stitches the audio chunks into a complete MP3 audiobook.

The design philosophy is refreshingly straightforward: drop a file, run one command, get an audiobook. No cloud accounts, no API keys, no monthly subscriptions.
| Stat | Value |
|---|---|
| Stars | ⭐ 744 |
| Forks | 102 |
| License | MIT (free for commercial use) |
| Main script | audiobook_converter.py (851 lines) |
| Backend model | Qwen3-TTS-12Hz-1.7B-CustomVoice |
| Supported formats | PDF, EPUB, DOCX, DOC, TXT |
The conversion pipeline is deceptively simple but packs smart engineering:

The script supports five input formats with dedicated parsers:
The chunker is the secret to maintaining natural pacing across long documents:
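As a rough illustration of the idea, a sentence-aware chunker could look like the sketch below. It assumes a budget of about 1,200 words per chunk (the figure quoted in the performance notes) and breaks only at sentence boundaries so the narration never stops mid-sentence. The function name, the regex, and the splitting rule are illustrative assumptions, not the script's actual code.

```python
import re
from typing import List


def chunk_text(text: str, max_words: int = 1200) -> List[str]:
    """Group whole sentences into chunks of roughly max_words words.

    Hypothetical helper: a sentence longer than the budget is kept whole
    as its own chunk rather than being cut in the middle.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if n == 0:
            continue
        # Close the current chunk if this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries rather than at a fixed character offset is the design choice that matters here: each request to the TTS model then starts and ends at a natural pause.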
Each chunk is sent sequentially (1 worker, to avoid rate limiting) to the Qwen3 TTS server with a 300-second timeout and 3 retry attempts. The 1.7B model processes each chunk in ~4-5 minutes.
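The sequential send-with-retry behaviour can be sketched as follows. This is a minimal illustration, not the script's code: `send_chunk` is a hypothetical callable standing in for whatever client talks to the TTS server, "3 retry attempts" is read here as three attempts in total, and the linear backoff is an added assumption.

```python
import time
from typing import Callable, Iterable, List


def synthesize_with_retry(
    send_chunk: Callable[[str, float], bytes],  # hypothetical TTS client call
    text: str,
    timeout_s: float = 300.0,
    max_attempts: int = 3,
    backoff_s: float = 5.0,
) -> bytes:
    """Call send_chunk(text, timeout_s) until it succeeds or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send_chunk(text, timeout_s)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # assumed linear backoff
    raise RuntimeError(f"chunk failed after {max_attempts} attempts") from last_error


def convert_sequentially(
    send_chunk: Callable[[str, float], bytes], chunks: Iterable[str]
) -> List[bytes]:
    # One chunk at a time, mirroring the single-worker design.
    return [synthesize_with_retry(send_chunk, chunk) for chunk in chunks]
```

Passing the client in as a callable keeps the retry logic independent of how the server is actually reached, which also makes it easy to exercise with a fake that fails on purpose.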
Processed WAV chunks are concatenated using pydub + FFmpeg, then exported as a single MP3 file at 128kbps bitrate (~1MB per minute of audio). Temporary chunk files are automatically cleaned up.
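For readers who want to see the assembly step in code, here is a minimal sketch using pydub's `AudioSegment` API. The function names and the size estimator are illustrative assumptions; only the 128 kbps MP3 target and the pydub + FFmpeg pairing come from the project description.

```python
from pathlib import Path
from typing import Iterable


def assemble_mp3(chunk_paths: Iterable[Path], out_path: Path, bitrate: str = "128k") -> None:
    """Concatenate WAV chunks in order and export one MP3.

    Requires pydub and an FFmpeg binary on the PATH.
    """
    # Imported lazily so the size estimator below works without pydub installed.
    from pydub import AudioSegment

    combined = AudioSegment.empty()
    for path in chunk_paths:
        combined += AudioSegment.from_wav(str(path))
    combined.export(str(out_path), format="mp3", bitrate=bitrate)


def estimated_mp3_megabytes(duration_s: float, bitrate_kbps: int = 128) -> float:
    """Constant-bitrate size: kilobits per second * seconds / 8 bits per byte."""
    return bitrate_kbps * 1000 * duration_s / 8 / 1_000_000
```

At 128 kbps, one minute of audio works out to 0.96 MB, which is where the "~1MB per minute" rule of thumb comes from.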
Already-processed chunks are cached with a 30-day TTL. If you re-run the converter (e.g., adjusting settings), it skips completed chunks — crucial for long novels where you don't want to regenerate 80+ chunks from scratch.
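A chunk cache with a 30-day time-to-live could be sketched like this. How the real script keys and stores its cache is not described here, so the hashing scheme (text plus voice name), the file layout, and every function name below are assumptions made for illustration.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_S = 30 * 24 * 60 * 60  # 30 days, in seconds


def cache_path(cache_dir: Path, text: str, voice: str) -> Path:
    # Assumed key: hash of voice + text, so a different narrator gets a new entry.
    digest = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.wav"


def load_cached(
    cache_dir: Path, text: str, voice: str, now: Optional[float] = None
) -> Optional[bytes]:
    """Return cached audio if present and younger than the TTL, else None."""
    path = cache_path(cache_dir, text, voice)
    if not path.exists():
        return None
    age_s = (time.time() if now is None else now) - path.stat().st_mtime
    return path.read_bytes() if age_s <= CACHE_TTL_S else None


def store_cached(cache_dir: Path, text: str, voice: str, audio: bytes) -> Path:
    """Write audio for this chunk and return the file it was saved to."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, text, voice)
    path.write_bytes(audio)
    return path
```

A converter loop would call `load_cached` before each TTS request and `store_cached` after each success, so an interrupted run picks up at the first chunk without a fresh cache entry.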
The converter connects to Qwen3 TTS's versatile model ecosystem, offering multiple ways to control the narration voice:

Uses pre-built speakers baked into the 1.7B CustomVoice model. The default speaker is Ryan (male, clear professional narrator). Nine speakers are available:
| Speaker | Style | Best For |
|---|---|---|
| Ryan | Male, clear & professional | Default, general narration |
| Serena | Female, warm & friendly | Fiction, novels |
| Aiden | Male, energetic | Adventure, thrillers |
| Dylan | Male, calm | Non-fiction, meditation |
| Eric | Male, expressive | Character dialogue |
| Ono_anna | Female, Japanese accent | Multilingual works |
| Sohee | Female, Korean accent | Multilingual works |
| Uncle_fu | Male, Chinese accent | Multilingual works |
| Vivian | Female, versatile | General narration |
Provide a 10-30 second WAV reference of any voice. The Qwen model automatically transcribes it using its built-in Whisper engine (no manual text input needed), then clones the voice's timbre, pace, and style. The developer demonstrated cloning everything from Patrick Stewart to SpongeBob from just a 5-second sample.

    python audiobook_converter.py --voice-clone --voice-sample patrick_stewart.wav
Describe the voice you want in natural language, and the 1.7B VoiceDesign model generates it:
Train a custom voice identity using 15-60 audio samples. This creates a persistent voice that can be reused across projects.
Reuse voice designs across multiple books once you've crafted the perfect narrator.
The Qwen3 TTS model family, released open-weight by Alibaba's Qwen team, is what makes this project possible. The converter always uses the 1.7B parameter variant — the largest and highest quality:
| Model | Parameters | Purpose |
|---|---|---|
| CustomVoice | 1.7B | 9+ pre-built speakers with style instructions |
| VoiceDesign | 1.7B | Generate voice from text description |
| Base | 1.7B | Voice cloning + LoRA training |
Key technical specs:
Getting started takes about 10 minutes:
Easiest: Use Pinokio — one-click install for Qwen3 TTS, auto-launches at http://127.0.0.1:7860.
Manual:

    # 1. Install and start the Qwen3 TTS demo server on port 7860
    pip install qwen-tts-demo
    qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 7860

    # 2. Get the converter and its dependencies
    git clone https://github.com/WhiskeyCoder/Qwen3-Audiobook-Converter.git
    cd Qwen3-Audiobook-Converter
    pip install -r requirements.txt
    sudo apt-get install ffmpeg  # Linux

    # 3. Drop a book into the input folder and convert it
    cp my_book.pdf book_to_convert/
    python audiobook_converter.py
    # Output: audiobooks/my_book.mp3

    # 4. Optional: narrate with a cloned voice
    python audiobook_converter.py --voice-clone --voice-sample my_voice.wav
The reference audio is automatically transcribed — no manual transcription needed.
Each 1,200-word chunk takes ~4-5 minutes with the 1.7B model. A typical 100,000-word novel splits into ~83 chunks, requiring ~5.5-7 hours of total processing time.
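The arithmetic behind that estimate is easy to reproduce for a book of any length. The helper below is a back-of-the-envelope sketch: the function name is made up, and the chunk size and per-chunk timings are simply the figures quoted above.

```python
import math
from typing import Tuple


def estimate_conversion(
    total_words: int,
    words_per_chunk: int = 1200,
    minutes_per_chunk: Tuple[float, float] = (4.0, 5.0),
) -> Tuple[int, float, float]:
    """Return (chunk count, low estimate in hours, high estimate in hours)."""
    # Round up: a partial final chunk still needs its own TTS request.
    chunks = math.ceil(total_words / words_per_chunk)
    low_h, high_h = (chunks * m / 60 for m in minutes_per_chunk)
    return chunks, low_h, high_h


chunks, low_h, high_h = estimate_conversion(100_000)
print(f"{chunks} chunks, about {low_h:.1f} to {high_h:.1f} hours")
# prints "84 chunks, about 5.6 to 7.0 hours"
```

Rounding up gives 84 chunks rather than 83, since the last partial chunk still has to be synthesized; the total stays within the quoted 5.5 to 7 hour range.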
| Resource | Requirement |
|---|---|
| RAM | 4GB+ recommended |
| GPU | CUDA-capable GPU for TTS (NVIDIA recommended) |
| Storage | ~100MB per hour of audiobook output |
| Python | 3.8+ |
| FFmpeg | Required for audio assembly |
The 30-day cache means you can tweak settings and re-run without re-processing completed chunks. The script can also resume after an interruption: if your server restarts mid-conversion, already-processed chunks are preserved.
The project's roadmap hints at exciting improvements:
The Qwen Audiobook Converter represents something significant: the democratization of audiobook production. With a $1,000 GPU, an open-source TTS model comparable to proprietary systems, and 851 lines of Python, anyone can build a personal audiobook pipeline that rivals commercial services.
For language learners, it means textbooks read in native accents. For indie authors, it means audiobook editions without paying Audible's production fees. For readers with visual impairments, it means instant access to any document in audio form.
And it's all running locally — no data leaves your machine, no per-minute billing, no internet dependency once the models are downloaded.
The community has spoken with 744 stars and counting. This isn't just a tool — it's a glimpse at the future of how we'll consume written content.