Qwen Audiobook Converter: Turn Any PDF or EPUB into a Narrated Audiobook with Open-Source AI

Audiobooks are a $20+ billion industry, yet converting your own books, research papers, or documents into spoken audio has remained surprisingly locked down. Commercial services charge per-title fees, proprietary TTS APIs demand per-character billing, and the open-source tooling has historically fallen short of natural-sounding narration.

Enter the Qwen Audiobook Converter by WhiskeyCoder — an open-source, MIT-licensed Python tool that takes any PDF, EPUB, DOCX, DOC, or TXT file and turns it into a high-quality MP3 audiobook using Alibaba's Qwen3 TTS voice model. With 744 GitHub stars, 102 forks, and a Reddit post that hit 149 upvotes (96% positive) within hours of launch, it's quickly becoming the go-to solution for local, private audiobook generation.


What Is the Qwen Audiobook Converter?

At its core, this is a single-file Python script (851 lines, audiobook_converter.py) that bridges two worlds:

  1. Document parsing — it reads and extracts text from PDFs, EPUB ebooks, Word documents, and plain text files.
  2. AI voice synthesis — it sends that text in smartly chunked pieces to a locally-running Qwen3 TTS server (Gradio API at http://127.0.0.1:7860), then stitches the audio chunks into a complete MP3 audiobook.

The design philosophy is refreshingly straightforward: drop a file, run one command, get an audiobook. No cloud accounts, no API keys, no monthly subscriptions.

Stat               Value
Stars              ⭐ 744
Forks              102
License            MIT (free for commercial use)
Main script        audiobook_converter.py (851 lines)
Backend model      Qwen3-TTS-12Hz-1.7B-CustomVoice
Supported formats  PDF, EPUB, DOCX, DOC, TXT

Under the Hood: Architecture & Pipeline

The conversion pipeline runs in five stages:

[Figure: data pipeline from documents to audio]

Step 1: Text Extraction

The script supports five input formats with dedicated parsers:

  • PDF — uses PyPDF2 for text extraction from selectable-text PDFs
  • EPUB — uses ebooklib to parse the internal HTML content
  • DOCX/DOC — python-docx and docx2txt for Word documents
  • TXT — direct file read
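
The per-format parsers above could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: the function names and the extension-to-parser table are hypothetical, legacy .doc handling is left out, and the third-party libraries are imported lazily so that only the parser you use has to be installed.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Reduce EPUB chapter HTML to plain text (naive: keeps script/style text)."""
    parser = _TextOnly()
    parser.feed(html)
    return " ".join(part.strip() for part in parser.parts if part.strip())


def extract_pdf(path):
    from PyPDF2 import PdfReader  # third-party, imported lazily

    reader = PdfReader(str(path))
    # extract_text() may return nothing for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_epub(path):
    import ebooklib  # third-party, imported lazily
    from ebooklib import epub

    book = epub.read_epub(str(path))
    docs = book.get_items_of_type(ebooklib.ITEM_DOCUMENT)
    return "\n".join(
        html_to_text(doc.get_content().decode("utf-8", "ignore")) for doc in docs
    )


def extract_docx(path):
    import docx2txt  # third-party, imported lazily

    return docx2txt.process(str(path))


def extract_txt(path):
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical dispatch table: file extension -> parser function
EXTRACTORS = {
    ".pdf": extract_pdf,
    ".epub": extract_epub,
    ".docx": extract_docx,
    ".txt": extract_txt,
}


def extract_text(path):
    """Pick a parser from the file extension and return the document text."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported format: {suffix!r}")
    return EXTRACTORS[suffix](path)
```

Keeping the parsers behind one extract_text() entry point means the rest of the pipeline never needs to know which format it was given.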

Step 2: Smart Chunking

The chunker is the secret to maintaining natural pacing across long documents:

  • 1,200-1,500 words per chunk (configurable): at ~157-170 WPM that is roughly 7-10 minutes of audio per chunk, and each chunk takes ~4-5 minutes to synthesize
  • Sentence boundary detection — never splits mid-sentence
  • Page number cleaning — strips standalone page numbers that confuse TTS
  • Whitespace normalization — fixes PDF extraction artifacts
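
The chunking rules above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the script's real implementation: the function names are made up, sentence detection is a simple punctuation regex (it will mis-split abbreviations such as "Dr."), and a "page number" is taken to mean a line containing only 1-4 digits.

```python
import re


def clean_text(raw):
    """Drop standalone page-number lines, then collapse runs of whitespace."""
    lines = [
        line for line in raw.splitlines()
        if not re.fullmatch(r"\s*\d{1,4}\s*", line)  # assumed page-number pattern
    ]
    return re.sub(r"\s+", " ", " ".join(lines)).strip()


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def chunk_text(raw, max_words=1200):
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversize chunk,
    because sentences are never split.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(clean_text(raw)):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, chunk_text("First sentence here.\n12\nSecond one follows!   Third  is last?", max_words=6) drops the stray "12" line and returns two chunks, with the break falling between the second and third sentences.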

Step 3: Voice Synthesis via Gradio API

Each chunk is sent sequentially (1 worker, to avoid rate limiting) to the Qwen3 TTS server with a 300-second timeout and 3 retry attempts. The 1.7B model processes each chunk in ~4-5 minutes.
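
A sequential send loop with a timeout and retries might look like the sketch below. The article does not give the Gradio endpoint name or its parameters, so the actual request is hidden behind a hypothetical send(text, timeout) callable that you would implement yourself (for instance around the gradio_client package). The linear back-off between attempts is also an assumption, not something the source describes.

```python
import time


def synthesize_with_retry(send, text, attempts=3, timeout=300,
                          backoff=2.0, sleep=time.sleep):
    """Call send(text, timeout) up to `attempts` times before giving up.

    `send` is a hypothetical callable wrapping the real TTS request; it
    should raise an exception on failure or timeout.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send(text, timeout)
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < attempts:
                sleep(backoff * attempt)  # assumed back-off: 2 s, then 4 s
    raise RuntimeError(f"chunk failed after {attempts} attempts") from last_error


def synthesize_all(send, chunks, **kwargs):
    """One worker: chunks are processed strictly in order, one at a time."""
    return [synthesize_with_retry(send, chunk, **kwargs) for chunk in chunks]
```

One thing to watch: a 300-second timeout leaves little headroom if a chunk really does take close to 5 minutes, so smaller chunks or a longer timeout may be worth trying on slower hardware.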

Step 4: Audio Assembly

Processed WAV chunks are concatenated using pydub + FFmpeg, then exported as a single MP3 file at 128kbps bitrate (~1MB per minute of audio). Temporary chunk files are automatically cleaned up.
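
With pydub, the assembly step can be approximated as below. This is a sketch, not the project's code: it assumes the chunk files carry zero-padded names (for example chunk_0001.wav) so that sorting by name restores reading order, and the helper names are hypothetical. The second function simply checks the "~1MB per minute" figure from the bitrate.

```python
from pathlib import Path


def assemble_mp3(chunk_dir, out_path, bitrate="128k"):
    """Concatenate every WAV in chunk_dir (sorted by name) into one MP3."""
    from pydub import AudioSegment  # third-party; MP3 export needs FFmpeg

    combined = AudioSegment.empty()
    for wav_path in sorted(Path(chunk_dir).glob("*.wav")):
        combined += AudioSegment.from_wav(str(wav_path))
    combined.export(str(out_path), format="mp3", bitrate=bitrate)
    return out_path


def mp3_megabytes(minutes, kbps=128):
    """Approximate size of a constant-bitrate MP3, in megabytes (10^6 bytes)."""
    bytes_per_second = kbps * 1000 / 8
    return bytes_per_second * 60 * minutes / 1_000_000
```

At 128 kbps, mp3_megabytes(1) comes to 0.96 MB, in line with the per-minute figure above, and an hour of audio works out to about 57.6 MB.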

Step 5: Intelligent Caching

Already-processed chunks are cached with a 30-day TTL. If you re-run the converter (e.g., adjusting settings), it skips completed chunks — crucial for long novels where you don't want to regenerate 80+ chunks from scratch.
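
One way to get this behaviour is a content-addressed cache with a time-to-live, sketched below. The article does not describe the real cache layout, so the file naming, the choice to hash the voice name together with the chunk text, and the use of file modification time for expiry are all assumptions.

```python
import hashlib
import time
from pathlib import Path

TTL_SECONDS = 30 * 24 * 3600  # 30 days, as stated above


def cache_key(text, voice):
    """Hash the voice and the chunk text so each combination gets its own file."""
    return hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()


def cached_audio(cache_dir, text, voice, now=None):
    """Return the cached WAV path, or None if it is missing or older than the TTL."""
    path = Path(cache_dir) / f"{cache_key(text, voice)}.wav"
    if not path.exists():
        return None
    current = time.time() if now is None else now
    if current - path.stat().st_mtime > TTL_SECONDS:
        path.unlink()  # expired: delete so the chunk is regenerated
        return None
    return path


def store_audio(cache_dir, text, voice, wav_bytes):
    """Write a finished chunk into the cache and return its path."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{cache_key(text, voice)}.wav"
    path.write_bytes(wav_bytes)
    return path
```

Because the voice is part of the key in this sketch, switching narrators regenerates every chunk, while changes that do not affect the audio (such as the output bitrate) reuse the cache. Whether the real script draws the line the same way is not stated in the source.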


Five Voice Modes for Every Use Case

The converter connects to Qwen3 TTS's versatile model ecosystem, offering multiple ways to control the narration voice:

1. Custom Voice (Default — One-Click)

Uses pre-built speakers baked into the 1.7B CustomVoice model. The default speaker is Ryan (male, clear professional narrator). Nine speakers are available:

Speaker    Style                        Best For
Ryan       Male, clear & professional   Default, general narration
Serena     Female, warm & friendly      Fiction, novels
Aiden      Male, energetic              Adventure, thrillers
Dylan      Male, calm                   Non-fiction, meditation
Eric       Male, expressive             Character dialogue
Ono_anna   Female, Japanese accent      Multilingual works
Sohee      Female, Korean accent        Multilingual works
Uncle_fu   Male, Chinese accent         Multilingual works
Vivian     Female, versatile            General narration

2. Voice Clone

Provide a 10-30 second WAV reference of any voice. The Qwen model automatically transcribes it using its built-in Whisper engine (no manual text input needed), then clones the voice's timbre, pace, and style. The developer demonstrated cloning everything from Patrick Stewart to SpongeBob from samples as short as 5 seconds, so the 10-30 second range reads as a recommendation rather than a hard minimum.

python audiobook_converter.py --voice-clone --voice-sample patrick_stewart.wav
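
Before starting a long conversion, it can be worth checking that the reference clip falls in the suggested 10-30 second range. The helper below is a hypothetical pre-flight check, not part of the converter. It uses only the standard-library wave module, which reads uncompressed PCM WAV files.

```python
import wave


def clip_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path, min_s=10.0, max_s=30.0):
    """Return (ok, seconds), where ok is True if the clip is within the range."""
    seconds = clip_seconds(path)
    return min_s <= seconds <= max_s, seconds
```

Calling check_reference_clip("my_voice.wav") returns a flag plus the measured duration, which you could print as a warning instead of treating a short clip as an error.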

3. Voice Design

Describe the voice you want in natural language, and the 1.7B VoiceDesign model generates it:

  • "Speak in a clear, professional narrator voice suitable for reading audiobooks."
  • "A gravelly old man with dry wit, reading a mystery novel."

4. LoRA Training (Advanced)

Train a custom voice identity using 15-60 audio samples. Creates a persistent voice that can be reused across projects.

5. Saved Designs

Reuse voice designs across multiple books once you've crafted the perfect narrator.


Why Qwen3 TTS Changes the Game

The Qwen3 TTS model family, released open-weight by Alibaba's Qwen team, is what makes this project possible. The converter always uses the 1.7B parameter variant — the largest and highest quality:

Model        Parameters  Purpose
CustomVoice  1.7B        9+ pre-built speakers with style instructions
VoiceDesign  1.7B        Generate voice from text description
Base         1.7B        Voice cloning + LoRA training

Key technical specs:

  • 12Hz frame rate: the "12Hz" in the model name refers to roughly 12 audio frames generated per second of speech, not to the audio sample rate
  • ~157-170 WPM narration speed: at or slightly above the 150-160 WPM target for professional audiobooks
  • Style instruction support — you can prompt "speak naturally and clearly, as if reading a dramatic book to an adult audience" to control pacing and emotion
  • Multi-language support — English, Chinese, Japanese, Korean, French, German, Spanish, Portuguese, Russian
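
The narration-speed figure makes it easy to estimate how long the finished audio will be. The following is plain arithmetic on the numbers quoted above; the function names are illustrative only.

```python
def audio_minutes(words, wpm):
    """Minutes of narration for a given word count at a given speaking rate."""
    return words / wpm


def audio_range_minutes(words, slow_wpm=157, fast_wpm=170):
    """(shortest, longest) running time across the quoted 157-170 WPM range."""
    return audio_minutes(words, fast_wpm), audio_minutes(words, slow_wpm)
```

For a 100,000-word novel, audio_range_minutes(100_000) gives roughly 588 to 637 minutes, or about 10 hours of listening.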

Setting It Up: From Zero to Audiobook

Getting started takes about 10 minutes:

1. Install Qwen3 TTS

Easiest: Use Pinokio — one-click install for Qwen3 TTS, auto-launches at http://127.0.0.1:7860.

Manual:

pip install qwen-tts-demo
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 7860

2. Install the Converter

git clone https://github.com/WhiskeyCoder/Qwen3-Audiobook-Converter.git
cd Qwen3-Audiobook-Converter
pip install -r requirements.txt
sudo apt-get install ffmpeg  # Linux
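
Two things need to be in place before the first conversion: FFmpeg on the PATH and a TTS server listening on port 7860. A small pre-flight check along these lines can save a failed run. It is a hypothetical helper, not something shipped with the repository, and it only tests that the port accepts connections, not that the model has finished loading.

```python
import shutil
import socket


def has_ffmpeg(binary="ffmpeg"):
    """True if the FFmpeg executable can be found on the PATH."""
    return shutil.which(binary) is not None


def server_listening(host="127.0.0.1", port=7860, timeout=2.0):
    """True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("FFmpeg found:", has_ffmpeg())
    print("TTS server reachable:", server_listening())
```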

3. Convert Your First Book

cp my_book.pdf book_to_convert/
python audiobook_converter.py
# Output: audiobooks/my_book.mp3

4. Voice Cloning (Optional)

python audiobook_converter.py --voice-clone --voice-sample my_voice.wav

The reference audio is automatically transcribed — no manual transcription needed.


Performance & Real-World Considerations

Processing Speed

Each 1,200-word chunk takes ~4-5 minutes with the 1.7B model. A typical 100,000-word novel splits into ~83 chunks, requiring ~5.5-7 hours of total processing time.
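
The estimate above is straightforward to reproduce for a book of any length. The sketch below rounds the chunk count up, since a final partial chunk still has to be synthesized, which is why it reports 84 chunks rather than ~83 for 100,000 words. The function name is illustrative.

```python
import math


def estimate_processing(total_words, words_per_chunk=1200,
                        minutes_per_chunk=(4, 5)):
    """Return (chunks, min_hours, max_hours) for sequential processing."""
    chunks = math.ceil(total_words / words_per_chunk)
    fastest, slowest = minutes_per_chunk
    return chunks, chunks * fastest / 60, chunks * slowest / 60
```

estimate_processing(100_000) returns 84 chunks and a range of 5.6 to 7.0 hours, consistent with the figures quoted above.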

System Requirements

Resource  Requirement
RAM       4GB+ recommended
GPU       CUDA-capable NVIDIA GPU recommended for TTS
Storage   ~100MB per hour of audiobook output (the 128kbps MP3 alone is ~58MB per hour)
Python    3.8+
FFmpeg    Required for audio assembly

Caching

The 30-day cache means you can tweak settings and re-run without re-processing completed chunks. The script also supports resume from interruption — if your server restarts mid-conversion, already-processed chunks are preserved.

Limitations

  • No OCR — image-based PDFs (scanned books) won't work. Use OCR software first.
  • Sequential processing — limited to 1 concurrent worker to avoid Gradio rate limiting
  • GPU required for speed — CPU inference on the 1.7B TTS model would be significantly slower
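
Because scanned PDFs yield little or no extractable text, a simple heuristic can flag them before any time is spent on synthesis. The check below is an assumption-laden sketch, not a feature of the converter: the 50-character and 80% thresholds are arbitrary starting points, and page_texts stands for whatever per-page strings your PDF parser returns.

```python
def looks_scanned(page_texts, min_chars_per_page=50, max_empty_ratio=0.8):
    """Guess whether a PDF is image-only from its per-page extracted text.

    A page counts as empty when it has fewer than min_chars_per_page
    non-whitespace-trimmed characters; the document is flagged when at
    least max_empty_ratio of its pages are empty.
    """
    if not page_texts:
        return True  # nothing extracted at all
    empty = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return empty / len(page_texts) >= max_empty_ratio
```

If the check returns True, running the file through OCR software first is the safer route.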

The Road Ahead

The project's roadmap hints at exciting improvements:

  • GUI interface — drag-and-drop audiobook conversion
  • Chapter detection — automatic chapter splitting in output
  • Multiple output formats — M4B (audiobook standard), OGG, FLAC
  • Real-time preview — hear a sample before committing to full conversion
  • Progress persistence — survive server restarts mid-conversion
  • Batch voice switching — apply different voices per chapter
  • Voice quality enhancement — post-processing for even cleaner output

The Bigger Picture

The Qwen Audiobook Converter represents something significant: the democratization of audiobook production. With a $1,000 GPU, an open-source TTS model comparable to proprietary systems, and 851 lines of Python, anyone can build a personal audiobook pipeline that rivals commercial services.

For language learners, it means textbooks read in native accents. For indie authors, it means audiobook editions without paying Audible's production fees. For the visually impaired, it means instant access to any document in audio form.

And it's all running locally — no data leaves your machine, no per-minute billing, no internet dependency once the models are downloaded.

The community has spoken with 744 stars and counting. This isn't just a tool — it's a glimpse at the future of how we'll consume written content.

