I Tested Multi-Token Prediction on a Mini PC — and It Actually Won (Then Reality Checked Me)

Benchmarking Qwen3.6-35B-A3B vs Qwen3.6-35B-A3B-MTP on a Ryzen AI 9 HX 370 with Lemonade 10.8.0

Everybody said Multi-Token Prediction (MTP) doesn't pay off on consumer hardware, let alone on CPU-only setups. The Metal benchmarks showed net losses at every configuration. The HackMD deep dive on an RTX 3090 found no llama.cpp speculative method beat baseline. Alan West wrote a whole manifesto explaining why MTP fails on quantized models.

So naturally, I had to test it myself.

MTP vs Baseline: Split-screen comparison on mini PC with terminal benchmarks

What Even Is MTP?

Multi-Token Prediction is a form of self-speculative decoding baked directly into the model. Instead of generating one token per forward pass, the way LLMs have worked since GPT-1, the model drafts several future tokens at once. The main model then checks the whole draft in a single batched pass and keeps the tokens that match what it would have generated anyway.

In plain English:

Without MTP: Token₁ → Token₂ → Token₃ → Token₄ (4 forward passes, one token each)
With MTP: Draft[Token₂, Token₃, Token₄] → Verify all at once (1-2 passes)
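To make the draft-and-verify step concrete, here is a minimal sketch of greedy verification. It is a toy illustration, not llama.cpp's implementation: `verify_draft` and its arguments are hypothetical names, and it assumes greedy decoding, where a draft token is accepted only if it equals the main model's own pick.

```python
def verify_draft(draft_tokens, target_tokens):
    """Toy greedy draft-and-verify step (illustrative, not llama.cpp's code).

    draft_tokens:  tokens proposed by the draft head.
    target_tokens: the main model's own picks at the same positions, plus
                   one extra pick for the position after the last draft.
                   All of them come out of one batched verification pass.
    Returns the tokens actually emitted by this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens needs one more entry than draft_tokens")

    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First mismatch: keep the main model's token and discard the rest.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every draft matched, so the extra pick from the same pass is a bonus.
    emitted.append(target_tokens[-1])
    return emitted
```

A fully accepted three-token draft yields four tokens from one verification pass. A draft that misses on its first token still yields one, which is why a bad draft head costs time rather than correctness.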

The theory is elegant. If the model guesses right more often than not, you effectively get 2–3 tokens per pass for roughly the same cost as one. Unsloth reported 240 tok/s on high-end GPUs — a 1.4x–2.2x speedup.
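A back-of-the-envelope way to see where "2–3 tokens per pass" comes from: assume each draft token is accepted independently with the same probability (a simplification, since real acceptance varies by position and prompt). The expected number of tokens emitted per verification pass is then a short geometric sum. The function name below is hypothetical.

```python
def expected_tokens_per_pass(accept_prob, num_drafts):
    """Expected tokens emitted per verification pass.

    Assumes every draft token is accepted independently with probability
    accept_prob. The i-th term is the chance that the first i drafts all
    matched; the i = 0 term is the token the main model always contributes.
    """
    return sum(accept_prob ** i for i in range(num_drafts + 1))


# "Right more often than not": 70% acceptance with 3 draft tokens.
typical = expected_tokens_per_pass(0.7, 3)   # about 2.5 tokens per pass
```

This counts tokens, not time. The real speedup is lower because drafting and batched verification are not free.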

MTP Architecture: Traditional token-by-token generation vs batched draft-and-verify

Qwen3.6-35B-A3B is Alibaba's April 2026 Mixture of Experts model. It has 35B total parameters but only activates ~3B per token — which makes it surprisingly viable on consumer hardware. The -MTP GGUF bundles a multi-token prediction head alongside the standard weights in a single file, ready for llama.cpp's speculative decoding engine.

The catch? The internet was pretty convinced it wouldn't work on anything short of a high-end GPU.

What the Research Said (Spoiler: It Was Wrong)

Before I ran a single benchmark, the literature painted a grim picture.

llama.cpp Issue #23752 — Metal (Apple Silicon): MTP speculative decoding produced worse throughput than baseline at every single configuration. Even with 100% draft acceptance (n_max=0), throughput dropped 11%. At n_max=6, it was 28% slower. The issue author was blunt: "MTP on Metal, as a new feature, may never have provided a speedup."

HackMD — RTX 3090 + Q4 Quant: A rigorous test of every llama.cpp speculative mode on Qwen3.6-35B-A3B with Q4 quantization. DFlash at draft-max=8? 44.6% slower. Oleg draft-spec? 52.8% slower. The conclusion was even more blunt: "No llama.cpp speculative-decoding method tested gives a positive yield on consumer Ampere with Q4 quantized target."

Alan West — DEV Community: Identified three failure modes: low acceptance rates on quantized models (the MTP head degrades more than trunk weights), KV cache thrashing, and CUDA graph capture failures. His prescription: "Walk the three steps above and you'll usually find the culprit within an hour."

The pattern seemed clear. MTP looked like a datacenter-GPU trick. If it lost on Metal and on an RTX 3090, CPU inference, even with fast LPDDR5x, surely couldn't make it pay off.

Then I typed lemonade run Qwen3.6-35B-A3B-MTP-GGUF and everything changed.

The Hardware That Shouldn't Have Won

Here's the rig:

Component	Spec
Processor	AMD Ryzen AI 9 HX 370 (Strix Point)
RAM	96 GB LPDDR5x (~89 GiB visible in Ubuntu)
iGPU	Radeon 890M (RDNA 3.5)
NPU	XDNA 2 (not used — llama.cpp runs CPU-only)
OS	Ubuntu 26.04 (Resolute Raccoon), kernel 7.0.0-22-generic
Stack	Lemonade 10.8.0 → llama.cpp backend
Model	Qwen3.6-35B-A3B, Q4_K_XL quantization, ~21 GB

Important detail: the NPU was idle during these benchmarks. In this setup, llama.cpp inference runs entirely on the CPU cores, streaming weights from LPDDR5x system RAM with roughly 120 GB/s of peak bandwidth. The XDNA 2 silicon sat there, sipping power, contributing exactly nothing. This is pure CPU muscle.
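For a rough sense of what that bandwidth buys, here is a crude ceiling estimate using only numbers from this post: a ~21 GB file for 35B parameters, ~3B active parameters per token, and ~120 GB/s. It ignores KV-cache reads, attention, and any layers that are always active, so treat it as an optimistic upper bound, not a prediction. The function names are hypothetical.

```python
def bits_per_weight(file_size_gb, total_params_billion):
    """Average bits stored per parameter, derived from the file size."""
    return file_size_gb * 8 / total_params_billion


def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billion, bpw):
    """Upper bound on tokens/s if each token only had to stream the
    active weights from RAM once (everything else assumed free)."""
    gb_read_per_token = active_params_billion * bpw / 8
    return bandwidth_gb_s / gb_read_per_token


bpw = bits_per_weight(21, 35)                  # about 4.8 bits per weight
ceiling = decode_ceiling_tok_s(120, 3, bpw)    # about 67 tok/s
```

Under these assumptions the ceiling is about 67 tok/s, which leaves plenty of headroom above the speeds measured below.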

I loaded both models — the standard GGUF and the MTP variant — and fired the same prompt at each.

Terminal showing curl benchmark with highlighted 25.99 tok/s result

The Numbers Don't Lie

Here it is. One prompt. Two models. No tricks.

Metric	Baseline GGUF	MTP GGUF	Delta
⚡ Tokens per second	21.02	26.00	🟢 +23.7%
⏱️ Total generation time	32.68s	31.27s	−1.41s
📝 Output tokens	706	832	+126 more content
🏎️ Effective throughput	21.0 tok/s	26.0 tok/s	+5 tok/s faster
📦 Quantization	Q4_K_XL	Q4_K_XL	Identical

Let that sink in. The MTP model generated 126 more tokens (18% more content) in 1.41 fewer seconds. It did more work in less time — the gold standard for a real optimization.
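If you want to check the deltas yourself, the arithmetic is short. The sketch below uses only the figures from the table above; the last two lines are a cross-check that divides output tokens by total generation time.

```python
# Figures copied from the results table above.
baseline = {"tok_s": 21.02, "seconds": 32.68, "tokens": 706}
mtp = {"tok_s": 26.00, "seconds": 31.27, "tokens": 832}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


speedup_pct = pct_change(baseline["tok_s"], mtp["tok_s"])          # about +23.7
extra_tokens_pct = pct_change(baseline["tokens"], mtp["tokens"])   # about +17.8
seconds_saved = baseline["seconds"] - mtp["seconds"]               # about 1.41

# Cross-check: output tokens divided by total generation time.
baseline_wall_tok_s = baseline["tokens"] / baseline["seconds"]     # about 21.6
mtp_wall_tok_s = mtp["tokens"] / mtp["seconds"]                    # about 26.6
```

The cross-check lands slightly above the reported tok/s for both models, presumably because the two figures come from different timers. Either way the gap between the models stays at roughly 23%.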

Benchmark bar chart: Baseline 21.0 tok/s vs MTP 26.0 tok/s with +23.7% callout

No overclocking. No separate draft model. No discrete GPU. Just a GGUF file with an MTP head built in, running on a $400 mini PC with integrated graphics, delivering a near-24% speedup.

Why the Ryzen AI 9 HX 370 Defies the Research

So why did it work here when Metal and Ampere both failed?

1. Unified Memory Architecture (UMA) Wins

The HX 370's LPDDR5x is a single pool shared by CPU, GPU, and NPU, with no PCIe bus between compute and memory. When the verification pass reads model weights to check a batch of draft tokens, it uses the same ~120 GB/s path the draft pass used: no copies, no transfers, no bus contention. Apple Silicon is unified memory too, so this can't be the whole story, but it does mean the draft and verify steps never pay a transfer cost here.

2. Zen 5 Absorbs Draft Overhead

The HX 370 has 12 cores (4 Zen 5 plus 8 Zen 5c) with strong per-core throughput. The MTP head is small relative to the trunk, so my best guess is that the extra draft computation fits into compute the CPU has to spare while token generation waits on memory.

3. Qwen3.6's MTP Head Is Exceptionally Well-Trained

Unlike some earlier MTP implementations where the prediction head was an afterthought, Qwen3.6's MTP head appears to have been trained with the same rigor as the main model. The speedup I measured suggests that, under Q4_K_XL quantization, the head weights retain enough precision to keep acceptance rates high.

4. Lemonade 10.8.0 Ships MTP as a First-Class Feature

This is huge. MTP isn't a hidden flag you need to hunt through llama.cpp docs to enable. lemonade list shows both models right alongside each other. No --spec-type incantations. No tuning. It just works.

⚠️ The Honest Truth: Where MTP Falls Short

This is the part of the blog I didn't want to write — but testing isn't testing if you only show the wins.

After confirming the 26 tok/s victory on simple prompts, I integrated the MTP model into a real chat system with full context: system prompts, tool definitions, chat histories, and agent memories. This is where things got real — and where the benchmark numbers stopped telling the whole story.

The Prefill Wall

Scenario	Prompt Size	Prefill Latency	Generation Speed	UX Verdict
🧪 Simple curl prompt	~50 tokens	Instant (<1s)	26.0 tok/s	🟢 Excellent
💬 Real chat turn	4K–8K tokens	~30 seconds	26.0 tok/s	🟡 Annoying
🤖 Full agent context + tools	8K–32K tokens	60–120 seconds	26.0 tok/s	🔴 Unusable

Here's what happened:

Simple prompt: first byte in <1s, full response in ~2s.
Real request with 16K context (system prompt + tools + multi-turn history + agent memories): first byte in ~75 seconds, then blazing at 26 tok/s.

The problem isn't generation; it's prefill. The 26 tok/s generation speed is memory-bandwidth-bound, and LPDDR5x at ~120 GB/s handles that beautifully. Prefill is different: it pushes every input token through the model in large parallel batches, which makes it compute-bound. The HX 370's 12 CPU cores simply don't have the raw compute to chew through a 16K+ token prompt quickly.

MTP helps generation by 24%, but generation was never the bottleneck for real chat. Prefill was always the silent killer, and MTP doesn't touch it.
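A simple latency model shows how little the generation speedup matters once prefill dominates. The sketch assumes the 16K-context request above is about 16,000 tokens, that prefill runs at the same rate for both models, and that the reply is a hypothetical 500 tokens. None of those three were measured separately.

```python
def implied_prefill_rate(prompt_tokens, seconds_to_first_byte):
    """Prefill throughput implied by time to first byte, in tokens/s."""
    return prompt_tokens / seconds_to_first_byte


def end_to_end_seconds(prompt_tokens, output_tokens, prefill_tok_s, decode_tok_s):
    """Total request time: prefill the prompt, then decode the reply."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


# ~16K-token prompt, ~75 s to first byte (from the observation above).
prefill_rate = implied_prefill_rate(16_000, 75)                    # about 213 tok/s

# Hypothetical 500-token reply at the two measured decode speeds.
baseline_total = end_to_end_seconds(16_000, 500, prefill_rate, 21.02)  # about 98.8 s
mtp_total = end_to_end_seconds(16_000, 500, prefill_rate, 26.00)       # about 94.2 s
saving_pct = (baseline_total - mtp_total) / baseline_total * 100       # about 4.6
```

Under these assumptions, a 24% faster decoder shaves under 5% off the whole request.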

Attempted Fixes

I tried several mitigations:

Fix	Result
`--cache-prompt` + `--flash-attn`	❌ `--flash-attn` not in Lemonade's build — model failed to load (500 error)
`--batch-size 2048`	🟡 Moderate prefill improvement (~15%)
`--cache-prompt` alone	🟡 Helps subsequent turns (KV cache reuse), but first request still ~60s cold cache
Reducing `ctx_size` to 16384	🟡 ~30% faster prefill, but loses long-context capability
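Why prompt caching only helps later turns: as I understand it, the server can reuse cached work only for the longest prefix the new prompt shares with the previous one. The sketch below is a simplified model of that idea with a hypothetical function name, not Lemonade's or llama.cpp's actual code.

```python
def tokens_needing_prefill(cached_tokens, new_prompt_tokens):
    """Count prompt tokens that still need prefill after reusing the
    longest prefix shared with the previously processed prompt."""
    shared = 0
    for old, new in zip(cached_tokens, new_prompt_tokens):
        if old != new:
            break
        shared += 1
    return len(new_prompt_tokens) - shared


# Second turn of a chat: the old prompt is a prefix of the new one.
follow_up = tokens_needing_prefill([1, 2, 3, 4], [1, 2, 3, 4, 5, 6])   # 2

# Cold cache, or a change near the start of the prompt: nothing is reused.
cold = tokens_needing_prefill([], [1, 2, 3])                           # 3
edited_start = tokens_needing_prefill([1, 2, 3], [9, 2, 3])            # 3
```

In this model, anything that changes early in the prompt, such as a timestamp in the system prompt or a reordered tool list, sends the request back to a full cold prefill.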

Where This Setup Excels (and Where It Doesn't)

Use Case	Performance	Recommended?
Simple Q&A / single-turn	🟢 Instant, 26 tok/s	✅ Absolutely
Batch inference / offline processing	🟢 Fast throughput	✅ Perfect for queues
API backend for non-interactive tasks	🟢 Good enough	✅ 60s is fine for background jobs
Interactive chat assistant	🔴 60s+ prefill per session	❌ Not suitable
Multi-turn agent (tools + memories)	🔴 90s+ prefill per request	❌ Impractical

The Honest Verdict

Multi-Token Prediction works on CPU (+24% tok/s), but the prefill wall makes it unsuitable for context-heavy interactive chat. MTP solved generation but couldn't touch prefill — and in real LLM usage, prefill is the real killer.

My Ryzen AI 9 HX 370 is a fantastic inference server for batch jobs, document processing, or dedicated API endpoints where 60-second latency is acceptable. But for interactive chat with full agent context, it hits a hard wall that no amount of MTP tuning can breach.

To fix this properly, you need either:

A GPU (an eGPU over USB4, or a dedicated build with an RTX 4090) for CUDA-accelerated prefill
Apple Silicon (M2 Ultra / M3 Max) with Apple's inference stack
An external API (GPT-4, Claude) with sub-second prefill on hyperscaler GPUs

Quick Start: Try It on Your Own Hardware

# 1. Install Lemonade
sudo add-apt-repository ppa:lemonade-sdk/lemonade
sudo apt update && sudo apt install lemonade

# 2. Pull and run the MTP model
lemonade run Qwen3.6-35B-A3B-MTP-GGUF

# 3. Benchmark generation speed
curl -s http://127.0.0.1:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-MTP-GGUF",
    "messages": [{"role": "user", "content": "Explain the halting problem in one paragraph"}],
    "stream": false
  }' | jq '.timings.predicted_per_second'

# 4. Compare against baseline: load the standard GGUF, then repeat step 3
#    with "model" set to "Qwen3.6-35B-A3B-GGUF"
lemonade run Qwen3.6-35B-A3B-GGUF

One caveat: you need at least 24 GB of available RAM for the Q4_K_XL quant.
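If you'd rather script the comparison than paste curl twice, here is a minimal Python sketch of the same request. The endpoint, model names, and the `timings.predicted_per_second` field come from the Quick Start above. `timings.prompt_per_second` is an assumption on my part (llama.cpp's server reports it, but I haven't confirmed Lemonade passes it through), and the helper names are hypothetical.

```python
import json
import urllib.request

# Endpoint copied from the curl example in the Quick Start.
ENDPOINT = "http://127.0.0.1:13305/v1/chat/completions"


def build_request(model, prompt):
    """Build the same non-streaming chat request the curl example sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),   # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )


def extract_speeds(response):
    """Pull prefill and decode speeds out of a parsed JSON response.

    prompt_per_second is assumed to be present; a missing field yields None.
    """
    timings = response.get("timings", {})
    return {
        "prefill_tok_s": timings.get("prompt_per_second"),
        "decode_tok_s": timings.get("predicted_per_second"),
    }


def benchmark(model, prompt, timeout=600):
    """Send one request to the running server and return its speeds."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as reply:
        return extract_speeds(json.load(reply))
```

With the server running, call `benchmark("Qwen3.6-35B-A3B-MTP-GGUF", "Explain the halting problem in one paragraph")`, then load the baseline and call it again with `"Qwen3.6-35B-A3B-GGUF"`. Paste in a long prompt to watch the prefill number instead of the decode number.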

The Verdict

┌───────────────────────────────────────────────────────┐
│                                                       │
│   🥇 Qwen3.6-35B-A3B-MTP-GGUF                         │
│      26.0 tok/s — the clear winner for generation      │
│      +23.7% over baseline, zero config required        │
│                                                       │
│   ⚠️ BUT: Prefill dominates real-world UX              │
│      Simple Q&A: 🟢 Excellent                          │
│      Chat with context: 🔴 60s+ first byte             │
│      Agent with tools: 🔴 Impractical                  │
│                                                       │
│   🥈 Qwen3.6-35B-A3B-GGUF (baseline)                  │
│      21.0 tok/s — still respectable                    │
│      Only use if disk space is extremely tight         │
│                                                       │
└───────────────────────────────────────────────────────┘

The internet said MTP doesn't pay off on consumer hardware, and certainly not without a GPU. The internet was wrong about generation but right about the bigger picture. On a Ryzen AI 9 HX 370 with unified LPDDR5x memory, the MTP variant of Qwen3.6-35B-A3B delivers a clean 23.7% generation speedup, but real-world chat systems bottleneck on prefill, which MTP doesn't address.

If you need a fast inference backend for batch processing or simple API calls, the MTP GGUF is the obvious choice. If you need an interactive chat assistant, invest the savings in a GPU or an API subscription.

Sometimes the most valuable test result isn't the win — it's learning exactly where the win stops mattering.

Split-screen: MTP speedup on one side, prefill bottleneck on the other — the honest verdict

John is a Software Engineer at NXagents.net who runs benchmarks on mini PCs so you don't have to. His Ryzen AI 9 HX 370 is currently humming away in Canada while he types this from Asia, probably generating more haikus about Canadian weather.