Qwen Audiobook Converter: Turn Any PDF or EPUB into a Narrated Audiobook with Open-Source AI

Audiobooks are a $20+ billion industry, yet converting your own books, research papers, or documents into spoken audio has remained surprisingly locked down. Commercial services charge per-title fees, proprietary TTS APIs demand per-character billing, and the open-source tooling has historically fallen short of natural-sounding narration.

Enter the Qwen Audiobook Converter by WhiskeyCoder — an open-source, MIT-licensed Python tool that takes any PDF, EPUB, DOCX, DOC, or TXT file and turns it into a high-quality MP3 audiobook using Alibaba's Qwen3 TTS voice model. With 744 GitHub stars, 102 forks, and a Reddit post that hit 149 upvotes (96% positive) within hours of launch, it's quickly becoming the go-to solution for local, private audiobook generation.


What Is the Qwen Audiobook Converter?

At its core, this is a single-file Python script (851 lines, audiobook_converter.py) that bridges two worlds:

Document parsing — it reads and extracts text from PDFs, EPUB ebooks, Word documents, and plain text files.
AI voice synthesis — it sends that text in smartly chunked pieces to a locally-running Qwen3 TTS server (Gradio API at http://127.0.0.1:7860), then stitches the audio chunks into a complete MP3 audiobook.

The design philosophy is refreshingly straightforward: drop a file, run one command, get an audiobook. No cloud accounts, no API keys, no monthly subscriptions.

Stat	Value
Stars	⭐ 744
Forks	102
License	MIT (free for commercial use)
Main script	`audiobook_converter.py` (851 lines)
Backend model	Qwen3-TTS-12Hz-1.7B-CustomVoice
Supported formats	PDF, EPUB, DOCX, DOC, TXT

Under the Hood: Architecture & Pipeline

The conversion pipeline has five stages, each handling one concern:

Figure: the data pipeline from documents to audio waves

Step 1: Text Extraction

The script supports five input formats with dedicated parsers:

PDF — uses PyPDF2 for text extraction from selectable-text PDFs
EPUB — uses ebooklib to parse the internal HTML content
DOCX/DOC — python-docx and docx2txt for Word documents
TXT — direct file read
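
As a rough illustration of the per-format dispatch described above, the sketch below maps file extensions to extractor functions. It is not the project's actual code: the function names and the `EXTRACTORS` table are hypothetical, legacy `.doc` handling is left out, and the third-party libraries are imported lazily so the plain-text path runs without them.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def extract_pdf(path: Path) -> str:
    from PyPDF2 import PdfReader  # lazy import: only needed for PDFs
    reader = PdfReader(str(path))
    # extract_text() can return None or "" for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_epub(path: Path) -> str:
    import ebooklib  # lazy import: only needed for EPUBs
    from ebooklib import epub
    book = epub.read_epub(str(path))
    docs = book.get_items_of_type(ebooklib.ITEM_DOCUMENT)
    return "\n".join(
        html_to_text(doc.get_content().decode("utf-8", "ignore")) for doc in docs
    )


def extract_docx(path: Path) -> str:
    import docx  # lazy import: provided by the python-docx package
    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def extract_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical dispatch table: file extension -> extractor function
EXTRACTORS = {
    ".pdf": extract_pdf,
    ".epub": extract_epub,
    ".docx": extract_docx,
    ".txt": extract_txt,
}


def extract_text(path) -> str:
    path = Path(path)
    try:
        extractor = EXTRACTORS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported format: {path.suffix}") from None
    return extractor(path)
```

Keeping the parsers behind one `extract_text` entry point means the later stages never need to know which format the text came from.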

Step 2: Smart Chunking

The chunker is the secret to maintaining natural pacing across long documents:

1,200-1,500 words per chunk (configurable): each chunk takes roughly 4-5 minutes to synthesize and, at the 157-170 WPM narration speed cited below, plays for about 7-10 minutes
Sentence boundary detection — never splits mid-sentence
Page number cleaning — strips standalone page numbers that confuse TTS
Whitespace normalization — fixes PDF extraction artifacts
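
The cleaning and chunking rules above can be sketched in a few lines. This is a minimal illustration, not the script's real chunker: the function names are hypothetical, the page-number pattern is an assumed heuristic (it would also drop a line consisting only of a year such as "1984"), and the sentence splitter is deliberately naive about abbreviations.

```python
import re
from typing import List


def clean_text(raw: str) -> str:
    # Drop lines that hold nothing but a page number, e.g. "42" or "- 42 -".
    lines = [
        line
        for line in raw.splitlines()
        if not re.fullmatch(r"\s*-?\s*\d{1,4}\s*-?\s*", line)
    ]
    # Collapse every run of whitespace (line breaks included) to one space.
    return re.sub(r"\s+", " ", " ".join(lines)).strip()


def chunk_sentences(text: str, max_words: int = 1200) -> List[str]:
    # Naive sentence split: break after ., ! or ? when whitespace follows.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk before it would exceed the word budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because whole sentences are the smallest unit, a single sentence longer than `max_words` still ends up in its own oversized chunk rather than being cut mid-sentence.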

Step 3: Voice Synthesis via Gradio API

Each chunk is sent sequentially (1 worker, to avoid rate limiting) to the Qwen3 TTS server with a 300-second timeout and 3 retry attempts. The 1.7B model processes each chunk in ~4-5 minutes.
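
The retry behaviour can be pictured with the sketch below. The `synthesize` callable is a hypothetical stand-in for whatever client call the script makes to the Gradio server, and it is assumed to enforce its own 300-second timeout; the linear backoff between attempts is also an assumption, not something the project documents.

```python
import time
from typing import Callable


def synthesize_with_retry(
    synthesize: Callable[[str], bytes],
    chunk: str,
    attempts: int = 3,
    backoff_seconds: float = 5.0,
) -> bytes:
    """Call synthesize(chunk), retrying on any exception up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return synthesize(chunk)
        except Exception as exc:  # timeouts, connection resets, server errors
            last_error = exc
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"chunk failed after {attempts} attempts") from last_error
```

Processing chunks one at a time through a wrapper like this keeps failure handling local: one bad chunk surfaces as a single error instead of silently producing a gap in the audio.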

Step 4: Audio Assembly

Processed WAV chunks are concatenated using pydub + FFmpeg, then exported as a single MP3 file at 128kbps bitrate (~1MB per minute of audio). Temporary chunk files are automatically cleaned up.
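
A minimal sketch of this assembly step, assuming pydub and FFmpeg are installed, might look like the following. The function names are hypothetical; the small size estimator is included to show where the "about 1MB per minute" figure comes from.

```python
from pathlib import Path
from typing import Iterable


def estimate_mp3_megabytes(minutes: float, kbps: int = 128) -> float:
    # kilobits/s -> bytes/s -> bytes for the whole duration -> MB (10**6 bytes)
    return kbps * 1000 / 8 * 60 * minutes / 1_000_000


def assemble_mp3(
    wav_paths: Iterable[Path],
    out_path: Path,
    bitrate: str = "128k",
    cleanup: bool = True,
) -> None:
    from pydub import AudioSegment  # lazy import: needs pydub plus FFmpeg

    paths = list(wav_paths)
    combined = AudioSegment.empty()
    for wav in paths:
        combined += AudioSegment.from_wav(str(wav))  # append in reading order
    combined.export(str(out_path), format="mp3", bitrate=bitrate)
    if cleanup:
        # Remove the temporary chunk files only after a successful export.
        for wav in paths:
            wav.unlink()
```

At 128kbps the estimator gives 0.96MB per minute, or about 57.6MB per hour of finished audio.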

Step 5: Intelligent Caching

Already-processed chunks are cached with a 30-day TTL. If you re-run the converter (e.g., adjusting settings), it skips completed chunks — crucial for long novels where you don't want to regenerate 80+ chunks from scratch.
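
One plausible way to build such a cache is to key each chunk by a hash of its text and treat files older than 30 days as expired. The sketch below is an assumption about the approach, not the project's implementation; in particular, including the voice name in the key (so that switching narrators regenerates audio) is a design choice made here for illustration.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days


def cache_key(chunk_text: str, voice: str) -> str:
    # Same text + same voice -> same key; any change yields a new key.
    return hashlib.sha256(f"{voice}\n{chunk_text}".encode("utf-8")).hexdigest()


def cached_audio(cache_dir: Path, key: str, now: Optional[float] = None) -> Optional[Path]:
    """Return the cached WAV path, or None if it is missing or expired."""
    path = cache_dir / f"{key}.wav"
    if not path.exists():
        return None
    age = (time.time() if now is None else now) - path.stat().st_mtime
    if age > CACHE_TTL_SECONDS:
        path.unlink()  # expired: remove it so it gets regenerated
        return None
    return path


def store_audio(cache_dir: Path, key: str, wav_bytes: bytes) -> Path:
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.wav"
    path.write_bytes(wav_bytes)
    return path
```

On a re-run, the converter would call `cached_audio` before each synthesis request and only hit the TTS server on a miss.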

Five Voice Modes for Every Use Case

The converter connects to Qwen3 TTS's versatile model ecosystem, offering multiple ways to control the narration voice:

Figure: multiple voice profiles and audio waveforms

1. Custom Voice (Default — One-Click)

Uses pre-built speakers baked into the 1.7B CustomVoice model. The default speaker is Ryan (male, clear professional narrator). Nine speakers are available:

Speaker	Style	Best For
Ryan	Male, clear & professional	Default, general narration
Serena	Female, warm & friendly	Fiction, novels
Aiden	Male, energetic	Adventure, thrillers
Dylan	Male, calm	Non-fiction, meditation
Eric	Male, expressive	Character dialogue
Ono_anna	Female, Japanese accent	Multilingual works
Sohee	Female, Korean accent	Multilingual works
Uncle_fu	Male, Chinese accent	Multilingual works
Vivian	Female, versatile	General narration

2. Voice Clone

Provide a WAV reference of any voice; 10-30 seconds is the recommended length, although the developer demonstrated cloning everything from Patrick Stewart to SpongeBob from samples as short as 5 seconds. The Qwen model automatically transcribes the reference using its built-in Whisper engine (no manual text input needed), then clones the voice's timbre, pace, and style.

python audiobook_converter.py --voice-clone --voice-sample patrick_stewart.wav

3. Voice Design

Describe the voice you want in natural language, and the 1.7B VoiceDesign model generates it:

"Speak in a clear, professional narrator voice suitable for reading audiobooks."
"A gravelly old man with dry wit, reading a mystery novel."

4. LoRA Training (Advanced)

Train a custom voice identity using 15-60 audio samples. Creates a persistent voice that can be reused across projects.

5. Saved Designs

Reuse voice designs across multiple books once you've crafted the perfect narrator.

Why Qwen3 TTS Changes the Game

The Qwen3 TTS model family, released open-weight by Alibaba's Qwen team, is what makes this project possible. The converter always uses the 1.7B parameter variant — the largest and highest quality:

Model	Parameters	Purpose
CustomVoice	1.7B	9+ pre-built speakers with style instructions
VoiceDesign	1.7B	Generate voice from text description
Base	1.7B	Voice cloning + LoRA training

Key technical specs:

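12Hz tokenizer: the "12Hz" in the model name refers to the frame rate of the speech tokenizer (roughly 12 token frames per second of audio), not the sample rate of the output
~157-170 WPM narration speed: at or slightly above the 150-160 WPM that professional audiobooks typically target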
Style instruction support — you can prompt "speak naturally and clearly, as if reading a dramatic book to an adult audience" to control pacing and emotion
Multi-language support — English, Chinese, Japanese, Korean, French, German, Spanish, Portuguese, Russian

Setting It Up: From Zero to Audiobook

Getting started takes about 10 minutes:

1. Install Qwen3 TTS

Easiest: Use Pinokio — one-click install for Qwen3 TTS, auto-launches at http://127.0.0.1:7860.

Manual:

pip install -U qwen-tts  # this package provides the qwen-tts-demo command
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 7860
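
Before starting a long conversion, it can help to confirm that something is actually listening at the server address. The check below is a small hypothetical helper, not part of the converter; it only tests that the TCP port accepts connections, not that the model has finished loading.

```python
import socket
from urllib.parse import urlparse


def server_is_listening(url: str = "http://127.0.0.1:7860", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, host unreachable
        return False


if __name__ == "__main__":
    print("TTS server reachable:", server_is_listening())
```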

2. Install the Converter

git clone https://github.com/WhiskeyCoder/Qwen3-Audiobook-Converter.git
cd Qwen3-Audiobook-Converter
pip install -r requirements.txt
sudo apt-get install ffmpeg  # Linux

3. Convert Your First Book

cp my_book.pdf book_to_convert/
python audiobook_converter.py
# Output: audiobooks/my_book.mp3

4. Voice Cloning (Optional)

python audiobook_converter.py --voice-clone --voice-sample my_voice.wav

The reference audio is automatically transcribed — no manual transcription needed.

Performance & Real-World Considerations

Processing Speed

Each 1,200-word chunk takes ~4-5 minutes with the 1.7B model. A typical 100,000-word novel splits into ~83 chunks, requiring ~5.5-7 hours of total processing time.
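
The arithmetic behind these figures is easy to reproduce for a book of any length. The helper below is an illustrative sketch whose defaults are taken from the numbers quoted in this article; the function name and return format are hypothetical.

```python
import math


def estimate_job(
    total_words: int,
    words_per_chunk: int = 1200,
    minutes_per_chunk=(4, 5),   # processing time per chunk, low and high
    wpm=(157, 170),             # narration speed range
    mb_per_audio_minute: float = 0.96,  # 128kbps MP3
):
    chunks = math.ceil(total_words / words_per_chunk)
    processing_hours = tuple(chunks * m / 60 for m in minutes_per_chunk)
    # Faster narration means shorter audio, so the highest WPM gives the low bound.
    audio_minutes = (total_words / max(wpm), total_words / min(wpm))
    return {
        "chunks": chunks,
        "processing_hours": processing_hours,
        "audio_hours": tuple(m / 60 for m in audio_minutes),
        "mp3_megabytes": tuple(m * mb_per_audio_minute for m in audio_minutes),
    }


if __name__ == "__main__":
    print(estimate_job(100_000))
```

For 100,000 words this rounds up to 84 chunks (the ~83 above is the unrounded quotient), about 5.6 to 7 hours of processing, and roughly 9.8 to 10.6 hours of finished audio.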

System Requirements

Resource	Requirement
RAM	4GB+ recommended
GPU	CUDA-capable GPU for TTS (NVIDIA recommended)
Storage	~60MB per hour of finished MP3 at 128kbps (temporary WAV chunks need additional space during conversion)
Python	3.8+
FFmpeg	Required for audio assembly
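
Two of these requirements can be verified from Python itself before a run. This is a hypothetical pre-flight check, not something the converter ships; GPU detection is left out because it depends on which deep learning framework the TTS server uses.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Report whether the interpreter and FFmpeg meet the stated requirements."""
    return {
        "python_ok": sys.version_info >= min_python,
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'NO'}")
```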

Caching

The 30-day cache means you can tweak settings and re-run without re-processing completed chunks. It also makes interrupted runs recoverable: if your server restarts mid-conversion, already-processed chunks are preserved, and re-running the script continues from the first chunk that is not yet cached.

Limitations

No OCR — image-based PDFs (scanned books) won't work. Use OCR software first.
Sequential processing — limited to 1 concurrent worker to avoid Gradio rate limiting
GPU required for speed — CPU inference on the 1.7B TTS model would be significantly slower
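
Since scanned PDFs yield little or no extractable text, a simple heuristic can flag them before any TTS time is spent. The sketch below is an assumption about how one might do this, with hypothetical thresholds; it takes the per-page strings that a PDF library has already extracted.

```python
from typing import Optional, Sequence


def looks_scanned(
    page_texts: Sequence[Optional[str]],
    min_chars_per_page: int = 20,
    threshold: float = 0.8,
) -> bool:
    """Guess whether a PDF is image-only from its per-page extracted text."""
    if not page_texts:
        return True  # nothing extracted at all
    # Count pages whose extracted text is missing or nearly empty.
    sparse = sum(
        1 for text in page_texts if len((text or "").strip()) < min_chars_per_page
    )
    return sparse / len(page_texts) >= threshold
```

If the check returns True, running the file through OCR software first is likely to give better results than converting it directly.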

The Road Ahead

The project's roadmap hints at exciting improvements:

GUI interface — drag-and-drop audiobook conversion
Chapter detection — automatic chapter splitting in output
Multiple output formats — M4B (audiobook standard), OGG, FLAC
Real-time preview — hear a sample before committing to full conversion
Progress persistence — survive server restarts mid-conversion
Batch voice switching — apply different voices per chapter
Voice quality enhancement — post-processing for even cleaner output

The Bigger Picture

The Qwen Audiobook Converter represents something significant: the democratization of audiobook production. With a $1,000 GPU, an open-source TTS model comparable to proprietary systems, and 851 lines of Python, anyone can build a personal audiobook pipeline that rivals commercial services.

For language learners, it means textbooks read in native accents. For indie authors, it means audiobook editions without paying Audible's production fees. For visually impaired readers, it means on-demand access to any document in audio form.

And it's all running locally — no data leaves your machine, no per-minute billing, no internet dependency once the models are downloaded.

The community has spoken with 744 stars and counting. This isn't just a tool — it's a glimpse at the future of how we'll consume written content.