Benchmarking Qwen3.6-35B-A3B vs Qwen3.6-35B-A3B-MTP on a Ryzen AI 9 HX 370 with Lemonade 10.8.0
Everybody said Multi-Token Prediction (MTP) doesn't work on CPU-only setups. The Metal benchmarks showed net losses at every configuration. The HackMD deep dive on an RTX 3090 found no llama.cpp speculative method beat baseline. Alan West wrote a whole manifesto explaining why MTP fails on quantized models.
So naturally, I had to test it myself.
Multi-Token Prediction is a form of self-speculative decoding baked directly into the model. Instead of generating one token per forward pass — the way LLMs have worked since GPT-1 — the model predicts multiple future tokens at once, then a verifier accepts or rejects them in a single batch.
In plain English:
The theory is elegant. If the model guesses right more often than not, you effectively get 2–3 tokens per pass for roughly the same cost as one. Unsloth reported 240 tok/s on high-end GPUs — a 1.4x–2.2x speedup.
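To put a rough number on "more often than not", here is a minimal sketch of the expected tokens per verification pass. It assumes each drafted token is accepted independently with probability `p` and that verification stops at the first rejection. Both are simplifications, and the values of `p` and `n_draft` are illustrative, not measurements from this benchmark.

```python
def expected_tokens_per_pass(p: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumptions (illustrative, not taken from llama.cpp):
    - each drafted token is accepted independently with probability p
    - verification stops at the first rejected draft token
    - the verifier always contributes one token of its own
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # 1 guaranteed token, plus p + p^2 + ... + p^n_draft for the accepted drafts
    return sum(p ** k for k in range(n_draft + 1))


for p in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(p, n_draft=3)
    print(f"acceptance={p:.1f}: {tokens:.2f} tokens per pass")
```

Under these assumptions a three-token draft needs an acceptance rate of roughly 0.55 or more to exceed two tokens per pass, and that still ignores the extra cost of drafting and batch verification.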
Qwen3.6-35B-A3B is Alibaba's April 2026 Mixture of Experts model. It has 35B total parameters but only activates ~3B per token — which makes it surprisingly viable on consumer hardware. The -MTP GGUF bundles a multi-token prediction head alongside the standard weights in a single file, ready for llama.cpp's speculative decoding engine.
The catch? The internet was pretty convinced it wouldn't work on anything short of a high-end GPU.
Before I ran a single benchmark, the literature painted a grim picture.
llama.cpp Issue #23752 — Metal (Apple Silicon): MTP speculative decoding produced worse throughput than baseline at every single configuration. Even with 100% draft acceptance (n_max=0), throughput dropped 11%. At n_max=6, it was 28% slower. The issue author was blunt: "MTP on Metal, as a new feature, may never have provided a speedup."
HackMD — RTX 3090 + Q4 Quant: A rigorous test of every llama.cpp speculative mode on Qwen3.6-35B-A3B with Q4 quantization. DFlash at draft-max=8? 44.6% slower. Oleg draft-spec? 52.8% slower. The conclusion was even more blunt: "No llama.cpp speculative-decoding method tested gives a positive yield on consumer Ampere with Q4 quantized target."
Alan West — DEV Community: Identified three failure modes: low acceptance rates on quantized models (the MTP head degrades more than trunk weights), KV cache thrashing, and CUDA graph capture failures. His prescription: "Walk the three steps above and you'll usually find the culprit within an hour."
The pattern seemed clear. MTP was a GPU-only toy. CPU inference — even with fast LPDDR5x — couldn't touch it.
Then I typed `lemonade run Qwen3.6-35B-A3B-MTP-GGUF` and everything changed.
Here's the rig:
| Component | Spec |
|---|---|
| Processor | AMD Ryzen AI 9 HX 370 (Strix Point) |
| RAM | 96 GB LPDDR5x (~89 GiB as reported by Ubuntu) |
| iGPU | Radeon 800M (RDNA 3.5) |
| NPU | XDNA 2 (not used — llama.cpp runs CPU-only) |
| OS | Ubuntu 26.04 (Resolute Raccoon), kernel 7.0.0-22-generic |
| Stack | Lemonade 10.8.0 → llama.cpp backend |
| Model | Qwen3.6-35B-A3B, Q4_K_XL quantization, ~21 GB |
Important detail: the NPU was idle during these benchmarks. llama.cpp inference runs entirely on CPU cores hammering LPDDR5x system RAM at roughly 120 GB/s. The XDNA 2 silicon sat there, sipping power, contributing exactly nothing. This is pure CPU muscle.
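A back-of-envelope bound shows why bandwidth is the number that matters for generation. This sketch assumes every generated token streams the active weights from RAM exactly once, and that the ~21 GB file is spread evenly across the 35B parameters. Real MoE layouts, KV-cache reads, and achievable (rather than peak) bandwidth all pull the result down, so treat it as a ceiling, not a prediction.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, file_size_gb: float,
                         total_params_b: float, active_params_b: float) -> float:
    """Upper bound on tokens/s if decoding is purely memory-bandwidth-bound.

    Assumes weights are spread evenly over all parameters and that each token
    reads only the active parameters once (a simplification for MoE models).
    """
    # GB per billion parameters is numerically the same as bytes per parameter
    bytes_per_param = file_size_gb / total_params_b
    active_gb_per_token = bytes_per_param * active_params_b
    return bandwidth_gb_s / active_gb_per_token


# Figures from the hardware table: ~120 GB/s, ~21 GB file, 35B total, ~3B active
ceiling = decode_ceiling_tok_s(120.0, 21.0, 35.0, 3.0)
print(f"theoretical decode ceiling: {ceiling:.1f} tok/s")
```

With the figures above this comes out to roughly 67 tok/s, so generation speeds in the 20s use well under half of the theoretical ceiling.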
I loaded both models — the standard GGUF and the MTP variant — and fired the same prompt at each.
Here it is. One prompt. Two models. No tricks.
| Metric | Baseline GGUF | MTP GGUF | Delta |
|---|---|---|---|
| ⚡ Tokens per second | 21.02 | 26.00 | 🟢 +23.7% |
| ⏱️ Total generation time | 32.68s | 31.27s | −1.41s |
| 📝 Output tokens | 706 | 832 | +126 more content |
| 🏎️ Effective throughput | 21.0 tok/s | 26.0 tok/s | +5 tok/s faster |
| 📦 Quantization | Q4_K_XL | Q4_K_XL | Identical |
Let that sink in. The MTP model generated 126 more tokens (18% more content) in 1.41 fewer seconds. It did more work in less time — the gold standard for a real optimization.
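The deltas in the table can be recomputed from the raw figures. The sketch below uses only the numbers reported above; the dictionary layout and variable names are my own.

```python
# Raw figures copied from the results table
baseline = {"tok_s": 21.02, "seconds": 32.68, "tokens": 706}
mtp = {"tok_s": 26.00, "seconds": 31.27, "tokens": 832}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


speedup_pct = pct_change(baseline["tok_s"], mtp["tok_s"])
extra_tokens_pct = pct_change(baseline["tokens"], mtp["tokens"])
time_saved_s = baseline["seconds"] - mtp["seconds"]

# Cross-check: output tokens divided by total generation time
baseline_wall = baseline["tokens"] / baseline["seconds"]
mtp_wall = mtp["tokens"] / mtp["seconds"]

print(f"tok/s speedup:      {speedup_pct:.1f}%")
print(f"extra output:       {extra_tokens_pct:.1f}%")
print(f"time saved:         {time_saved_s:.2f}s")
print(f"tokens / wall time: {baseline_wall:.1f} vs {mtp_wall:.1f} tok/s")
```

The recomputed speedup (about 23.7%) and extra output (about 18%) match the table. Tokens divided by total time lands slightly above the reported rates, at about 21.6 and 26.6 tok/s; that small gap is not explained by the numbers in the table.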
No overclocking. No draft model. No GPU. Just a GGUF file with an MTP head grafted on, running on a $400 mini PC with integrated graphics, delivering a near-24% speedup.
So why did it work here when Metal and Ampere both failed?
1. Unified Memory Architecture (UMA) Wins
The HX 370's LPDDR5x is a single pool shared by CPU, GPU, and NPU, with no PCIe bus separating compute from memory. When the MTP verifier touches model weights for batch verification, it reads them through the same ~120 GB/s path used for drafting, with no copies and no transfers. That sets this machine apart from the RTX 3090 setup; Apple Silicon is also unified memory, so this point alone doesn't explain the Metal result.
2. Zen 5 Absorbs Draft Overhead
The HX 370 has 12 cores (4 Zen 5 plus 8 Zen 5c), and they appear to absorb the drafting overhead that sank the Metal runs. The MTP head is small relative to the trunk, so the extra draft computation adds little to each pass.
3. Qwen3.6's MTP Head Is Exceptionally Well-Trained
Unlike some earlier MTP implementations where the prediction head was an afterthought, Qwen3.6's MTP head was trained with the same rigor as the main model. Under Q4_K_XL quantization, the head weights retain enough precision to maintain high acceptance rates.
4. Lemonade 10.8.0 Ships MTP as a First-Class Feature
This is huge. MTP isn't a hidden flag you need to hunt through llama.cpp docs to enable. `lemonade list` shows both models right alongside each other. No `--spec-type` incantations. No tuning. It just works.
This is the part of the blog I didn't want to write — but testing isn't testing if you only show the wins.
After confirming the 26 tok/s victory on simple prompts, I integrated the MTP model into a real chat system with full context: system prompts, tool definitions, chat histories, and agent memories. This is where things got real — and where the benchmark numbers stopped telling the whole story.
| Scenario | Prompt Size | Prefill Latency | Generation Speed | UX Verdict |
|---|---|---|---|---|
| 🧪 Simple curl prompt | ~50 tokens | Instant (<1s) | 26.0 tok/s | 🟢 Excellent |
| 💬 Real chat turn | 4K–8K tokens | ~30 seconds | 26.0 tok/s | 🟡 Annoying |
| 🤖 Full agent context + tools | 8K–32K tokens | 60–120 seconds | 26.0 tok/s | 🔴 Unusable |
Here's what happened:
The problem isn't generation; it's prefill. The 26 tok/s generation speed is memory-bandwidth-bound, and LPDDR5x at roughly 120 GB/s handles that well. Prefill, which processes every input token before the first output token appears, is compute-bound. The HX 370's 12 CPU cores simply can't get through a 16K+ token prefill fast enough.
MTP helps generation by 24%, but generation was never the bottleneck for real chat. Prefill was always the silent killer, and MTP doesn't touch it.
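A quick end-to-end calculation makes the point concrete. This sketch assumes prefill takes the same time for both models and uses a hypothetical 500-token reply; the generation rates are the measured 21.02 and 26.00 tok/s, and the prefill times are taken from the scenario table.

```python
def end_to_end_seconds(prefill_s: float, output_tokens: int, gen_tok_s: float) -> float:
    """Total request time: prefill, then token-by-token generation."""
    return prefill_s + output_tokens / gen_tok_s


def end_to_end_gain_pct(prefill_s: float, output_tokens: int,
                        base_tok_s: float = 21.02, mtp_tok_s: float = 26.00) -> float:
    """Percent of total request time saved by MTP, assuming identical prefill."""
    base = end_to_end_seconds(prefill_s, output_tokens, base_tok_s)
    mtp = end_to_end_seconds(prefill_s, output_tokens, mtp_tok_s)
    return (base - mtp) / base * 100.0


# 500 output tokens is a hypothetical reply length, not a measured value
for prefill_s in (1, 30, 60, 120):
    gain = end_to_end_gain_pct(prefill_s, output_tokens=500)
    print(f"prefill={prefill_s:>3}s -> request is {gain:.1f}% faster with MTP")
```

With a 60-second prefill and 500 output tokens, the whole request gets only about 5% faster, against roughly 19% when prefill is negligible.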
I tried several mitigations:
| Fix | Result |
|---|---|
| `--cache-prompt` + `--flash-attn` | ❌ `--flash-attn` not in Lemonade's build; model failed to load (500 error) |
| `--batch-size 2048` | 🟡 Moderate prefill improvement (~15%) |
| `--cache-prompt` alone | 🟡 Helps subsequent turns (KV cache reuse), but the first request still takes ~60s on a cold cache |
| Reducing `ctx_size` to 16384 | 🟡 ~30% faster prefill, but loses long-context capability |
| Use Case | Performance | Recommended? |
|---|---|---|
| Simple Q&A / single-turn | 🟢 Instant, 26 tok/s | ✅ Absolutely |
| Batch inference / offline processing | 🟢 Fast throughput | ✅ Perfect for queues |
| API backend for non-interactive tasks | 🟢 Good enough | ✅ 60s is fine for background jobs |
| Interactive chat assistant | 🔴 60s+ prefill per session | ❌ Not suitable |
| Multi-turn agent (tools + memories) | 🔴 90s+ prefill per request | ❌ Impractical |
Multi-Token Prediction works on CPU (+24% tok/s), but the prefill wall makes it unsuitable for context-heavy interactive chat. MTP solved generation but couldn't touch prefill — and in real LLM usage, prefill is the real killer.
My Ryzen AI 9 HX 370 is a fantastic inference server for batch jobs, document processing, or dedicated API endpoints where 60-second latency is acceptable. But for interactive chat with full agent context, it hits a hard wall that no amount of MTP tuning can breach.
To fix this properly, you need either:
```bash
# 1. Install Lemonade
sudo add-apt-repository ppa:lemonade-sdk/lemonade
sudo apt update && sudo apt install lemonade

# 2. Pull and run the MTP model
lemonade run Qwen3.6-35B-A3B-MTP-GGUF

# 3. Benchmark generation speed (prints tokens per second for the generated output)
curl -s http://127.0.0.1:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-MTP-GGUF",
    "messages": [{"role": "user", "content": "Explain the halting problem in one paragraph"}],
    "stream": false
  }' | jq '.timings.predicted_per_second'

# 4. Compare against baseline: load it, then repeat step 3 with "model": "Qwen3.6-35B-A3B-GGUF"
lemonade run Qwen3.6-35B-A3B-GGUF
```
One caveat: you need at least 24 GB of available RAM for the Q4_K_XL quant.
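A single curl call gives one sample. For averaged numbers, something like the following Python sketch could wrap the same request. It reuses the endpoint and the `timings.predicted_per_second` field from the curl example; the function names and the `runs` parameter are my own, and it assumes the server returns that field on every non-streaming response.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust if your server listens elsewhere
BASE_URL = "http://127.0.0.1:13305/v1/chat/completions"


def build_payload(model: str, prompt: str) -> dict:
    """Same request body as the curl example, as a Python dict."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def tok_per_sec(response: dict) -> float:
    """Pull the generation rate out of a parsed response."""
    return float(response["timings"]["predicted_per_second"])


def benchmark(model: str, prompt: str, runs: int = 3, url: str = BASE_URL) -> float:
    """Send the same prompt `runs` times and return the mean tokens/s."""
    rates = []
    for _ in range(runs):
        request = urllib.request.Request(
            url,
            data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as resp:
            rates.append(tok_per_sec(json.load(resp)))
    return sum(rates) / len(rates)
```

Calling `benchmark("Qwen3.6-35B-A3B-MTP-GGUF", prompt)` and then the same with `Qwen3.6-35B-A3B-GGUF` gives a mean over several runs for each model. Whether the server swaps models on request or needs `lemonade run` first is something to check on your own setup.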
┌───────────────────────────────────────────────────────┐
│ │
│ 🥇 Qwen3.6-35B-A3B-MTP-GGUF │
│ 26.0 tok/s — the clear winner for generation │
│ +23.7% over baseline, zero config required │
│ │
│ ⚠️ BUT: Prefill dominates real-world UX │
│ Simple Q&A: 🟢 Excellent │
│ Chat with context: 🔴 60s+ first byte │
│ Agent with tools: 🔴 Impractical │
│ │
│ 🥈 Qwen3.6-35B-A3B-GGUF (baseline) │
│ 21.0 tok/s — still respectable │
│ Only use if disk space is extremely tight │
│ │
└───────────────────────────────────────────────────────┘
The internet said MTP doesn't work on consumer hardware without a GPU. The internet was wrong about generation but right about the bigger picture. On a Ryzen AI 9 HX 370 with unified LPDDR5x memory, the MTP variant of Qwen3.6-35B-A3B delivers a clean 23.7% generation speedup — but real-world chat systems bottleneck on prefill, which MTP doesn't address.
If you need a fast inference backend for batch processing or simple API calls, the MTP GGUF is the obvious choice. If you need an interactive chat assistant, invest the savings in a GPU or an API subscription.
Sometimes the most valuable test result isn't the win — it's learning exactly where the win stops mattering.
John is a Software Engineer at NXagents.net who runs benchmarks on mini PCs so you don't have to. His Ryzen AI 9 HX 370 is currently humming away in Canada while he types this from Asia, probably generating more haikus about Canadian weather.