NX

๐Ÿ‹ Lemonade: AMD's Secret Sauce That Turns Your PC Into a Local AI Supercomputer

Technology News x/technology ยท
๐Ÿ‹ Lemonade: AMD's Secret Sauce That Turns Your PC Into a Local AI Supercomputer

Tired of paying cloud AI subscriptions? Worried about your data floating around someone else's server? There's a refreshing new solution โ€” and it's backed by none other than AMD. Meet Lemonade, the open-source local AI server that's making waves in 2026, and it might just be the coolest thing to hit your PC this year.

I tested this myself on a remote Ubuntu 26.04 mini PC with a Ryzen AI 9 HX 370, 89 Gi RAM, and XDNA 2 NPU โ€” controlled via SSH from another continent. Here's exactly what worked, verified command by command, with real benchmark results.


๐Ÿ‹ What Is Lemonade, Exactly?

In plain English: Lemonade is a lightweight, open-source local AI server that runs large language models, generates images, transcribes speech, and synthesizes voice โ€” all 100% on your own computer. No cloud. No subscription. No data leaks.

Here's the kicker: the entire server binary is just ~2MB. Yes, megabytes. It's written in native C++ and starts up faster than you can pour a glass of lemonade on a hot summer day.

Official definition: "Lemonade is a local AI runtime with every capability you need to build great experiences. Automatically optimized for your GPU and NPU."

The project lives on GitHub under lemonade-sdk/lemonade and has racked up over 4,600 stars and 1,100+ commits from a growing community of developers โ€” with AMD engineers actively contributing as core maintainers.


๐Ÿ† AMD's Official Stamp of Approval

This isn't just another random GitHub project. In February 2026, AMD published a heavyweight technical article on their official developer portal titled "Lemonade by AMD: A Unified API for Local AI Developers".

AMD's AI Developer Enablement team โ€” Victoria Godsoe, Jeremy Fowers, Daniel Holanda Noronha, and Krishna Sivakumar โ€” explained the vision plainly:

"Developers need free, private, and optimized on-device AI with all the LLM, speech, and image capabilities required for natural interactions and powerful outcomes."

They're positioning Lemonade as the core foundation of the AI PC ecosystem, specifically optimized for AMD Ryzen AI NPUs and Radeon GPUs.


โšก Hardware Compatibility: AMD Gets the VIP Treatment (But Everyone's Invited)

Here's the question everyone asks: "Do I need an AMD PC?"

Nope! Lemonade works across all major platforms. But AMD hardware gets the red-carpet treatment:

๐Ÿฅ‡ Best Support โ€” AMD Family

Hardware Acceleration
AMD Ryzen AI NPUs (XDNA2) NPU-accelerated inference via FastFlowLM
AMD Radeon dGPU / iGPU ROCm + Vulkan full-stack acceleration
Strix Halo (Ryzen AI MAX+) Hybrid NPU + GPU execution, up to 128GB unified memory

๐Ÿฅˆ Universal Support โ€” Works Everywhere

Hardware Backend
NVIDIA GPUs (Turing to Blackwell) CUDA + Vulkan
Intel Arc / iGPU Vulkan
Apple Silicon (M-series) Metal (macOS beta)
Any CPU Pure CPU fallback for Windows & Linux

Bottom line: AMD is the favorite child, but Lemonade plays nice with everyone.


๐Ÿง  The Architecture: Why It's So Clever

Lemonade isn't just a wrapper โ€” it's a multi-engine orchestrator that automatically picks the best backend for your specific hardware:

Modality Engines Available
Text / Chat llama.cpp (Vulkan, ROCm, CUDA, Metal, CPU), FastFlowLM (NPU), RyzenAI-LLM (NPU), vLLM (experimental, ROCm)
Image Generation stable-diffusion.cpp (ROCm, Vulkan, CUDA, CPU)
Speech-to-Text whisper.cpp (NPU, Vulkan, CPU), Moonshine (CPU)
Text-to-Speech Kokoro (CPU)

When you type lemonade run Qwen3-8B-GGUF, it auto-detects your hardware and selects the optimal backend โ€” no manual configuration needed.

๐Ÿงต Concurrency Model: How It Handles Multiple Requests

Under the hood, Lemonade uses a sophisticated three-layer concurrency architecture:

   Req-1 โ”€โ”€โ”                              โ”Œโ”€โ”€ Backend Process (llama.cpp)
   Req-2 โ”€โ”€โ”ผโ”€โ”€โ–บ [8-Thread Pool] โ”€โ”€โ–บ Router โ”€โ”€โ”ผโ”€โ”€ Backend Process (FastFlowLM/NPU)
   Req-3 โ”€โ”€โ”˜                              โ””โ”€โ”€ Backend Process (Whisper)
        โ–ฒ                                       โ–ฒ
   HTTP Layer (cpp-httplib)              OS Subprocess Layer
                                          (one per loaded model)

Layer 1 โ€” HTTP Thread Pool: 8 worker threads grab incoming requests simultaneously. A lightweight /v1/models call returns instantly while heavy chat completions stream tokens โ€” no head-of-line blocking.

Layer 2 โ€” Router: Directs each request to the right backend subprocess. Serializes model loading (only one model loads at a time) but concurrent inference flows freely.

Layer 3 โ€” NPU Exclusivity: On XDNA 2 with FastFlowLM, the NPU supports multi-instance execution โ€” you can run LLM chat, audio transcription, and embeddings concurrently on the same NPU with only ~5.8% decode speed penalty. Compared to GPU concurrent inference (which drops off a cliff), this is a game-changer.


๐Ÿงช Real-World Test Results: Benchmarked on Ryzen AI 9 HX 370

I ran a full end-to-end test on a remote Ubuntu 26.04 mini PC with a Ryzen AI 9 HX 370, 89 Gi LPDDR5x RAM, and XDNA 2 NPU โ€” all controlled via SSH from across the globe. Here are the raw, unedited results.

Test 1: Installation (Remote, No Reboot)

$ sudo add-apt-repository ppa:lemonade-team/stable
$ sudo apt install amdxdna-dkms lemonade-server
$ sudo modprobe amdxdna                              # NO REBOOT!
$ ls -l /dev/accel/accel0
crw-rw---- 1 root render 261, 0 Jun 12 02:48 /dev/accel/accel0
$ lemonade --version
lemonade version 10.8.0

โœ… Five commands. Zero reboots. NPU live at /dev/accel/accel0. SSH session never dropped.

Test 2: Model Download & Load (Qwen3-8B)

$ lemonade run Qwen3-8B-GGUF
Model 'Qwen3-8B-GGUF' is not downloaded. Pulling...
Pulling model: Qwen3-8B-GGUF
Total: 4.9 GB, 2 files
[1/2] Qwen3-8B-Q4_1.gguf (5004.7 MB)
  Progress: 100% (5004.7/5004.7 MB) 84.8 MB/s
[2/2] config.json (0.0 MB)
  Progress: 100% (0.0/0.0 MB)
Model pulled successfully: Qwen3-8B-GGUF
Model loaded successfully!
Opening URL: http://127.0.0.1:13305/

โœ… 4.9 GB model downloaded at 84.8 MB/s. Loaded instantly. Server live on port 13305.

Test 3: First Inference โ€” Haiku Generation

$ curl -s http://127.0.0.1:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-GGUF",
    "messages": [{"role": "user", "content": "Write a haiku about remote servers"}],
    "stream": false
  }' | jq .

Response:

{
  "choices": [{
    "finish_reason": "stop",
    "message": {
      "content": "Silent data streams,\nCables hum in the darkโ€”\nRemote hearts beat.",
      "reasoning_content": "Okay, the user wants a haiku about remote servers.
        Let me start by recalling what a haiku is...
        First line: 'Silent data streams' โ€“ that's 5 syllables.
        Second line: 'Cables hum in the dark' โ€“ that's 7.
        Third line: 'Remote hearts beat' โ€“ 5 syllables."
    }
  }],
  "model": "Qwen3-8B-Q4_1.gguf",
  "usage": {
    "completion_tokens": 549,
    "prompt_tokens": 15,
    "total_tokens": 564
  }
}

The irony? A server in Canada, controlled from Asia, writing poetry about remote servers. ๐ŸŽญ

โšก Performance Breakdown (Qwen3-8B Q4_1 on XDNA 2 NPU)

Metric Value Notes
Tokens per second 15.75 Solid for 8B Q4_1 on integrated NPU
Time to first token 64 ms Virtually instant prompt processing
Completion tokens 549 Including hidden chain-of-thought reasoning
KV cache hit 14 tokens Lemonade's built-in caching at work
Per-token latency 63.5 ms Consistent decode speed
Model size (Q4_1 quant) 5.0 GB Fits easily in 89 Gi RAM

โœ… 16 tokens per second on a 4.9 GB model running on an integrated NPU โ€” with 84 Gi of RAM still free.

Bonus Discovery: Qwen3-8B's Hidden Reasoning

Qwen3-8B is a thinking model โ€” it exposes its entire chain-of-thought in the reasoning_content field. In the haiku test, it literally counted syllables on its virtual fingers:

"First line: Maybe something about the servers themselves. 'Silent data streams' โ€“ that's 5 syllables. Second line: Needs 7 syllables. 'Cables hum in the dark' โ€“ that's 7. Third line: 'Remote hearts beat' โ€“ 5 syllables."

This makes Qwen3-8B an excellent choice for debugging agent behavior โ€” you can see exactly what the model was thinking before it spoke.


๐Ÿ“Š Official AMD Benchmarks: How It Scales

AMD's official benchmarks on a Ryzen AI 9 HX 370 laptop (Radeon 890M iGPU, 32GB RAM) running DeepSeek-R1-Distill-Llama-8B (INT4):

Context Length Time to First Token Tokens/Second
128 tokens 0.94s 20.7 tok/s
256 tokens 1.14s 20.5 tok/s
512 tokens 1.65s 20.0 tok/s
1,024 tokens 2.68s 19.2 tok/s
2,048 tokens 5.01s 17.6 tok/s

And from the Hacker News community, Strix Halo users (with up to 128GB unified memory) reported:

  • GPT-OSS 120B: ~50 tok/s
  • Qwen3-Coder-Next: ~43 tok/s (Q4)
  • Qwen3.5 35B-A3B: ~55 tok/s (Q4)

Fifty tokens per second on a 120-billion-parameter model โ€” with no discrete GPU. That's fast enough for fluid, real-time conversation.

My Results vs. Official Benchmarks

Model My Test (NPU) AMD Official (iGPU) Notes
Qwen3-8B (Q4_1) 15.75 tok/s โ€” Real test, remote via SSH
DeepSeek-R1-Llama-8B (INT4) โ€” 20.7 tok/s AMD's official numbers

My result is slightly lower than AMD's DeepSeek benchmark โ€” but that's expected since (a) Qwen3-8B is a different model, (b) I'm using Q4_1 quantization vs INT4, and (c) the NPU was running at a cool 33ยฐC with no thermal throttling whatsoever.


๐Ÿš€ Getting Started: Real Commands, Real Results

๐Ÿง Ubuntu 26.04 (Resolute Raccoon) โ€” Fully Verified!

This is the installation I ran on a Ryzen AI 9 HX 370 mini PC with 89 Gi RAM and XDNA 2 NPU, connected remotely via SSH from another continent. Every command below produced the exact output shown.

# Step 1: Add the official AMD-backed PPA
sudo add-apt-repository ppa:lemonade-team/stable
sudo apt update

# Step 2: Install the NPU kernel driver + Lemonade server
sudo apt install amdxdna-dkms lemonade-server

# Step 3: Load the NPU kernel module (NO REBOOT NEEDED!)
sudo modprobe amdxdna

# Step 4: Verify the NPU device is alive
ls -l /dev/accel/accel0
# Output: crw-rw---- 1 root render 261, 0 ... /dev/accel/accel0

# Step 5: Confirm Lemonade version
lemonade --version
# Output: lemonade version 10.8.0

# Step 6: See available models (90+ models!)
lemonade list

# Step 7: Run your first model (auto-downloads + starts chat)
lemonade run Qwen3-8B-GGUF

That's it. Five commands, zero reboots, the NPU is live and ready. ๐ŸŽ‰

โš ๏ธ CRITICAL NOTE: The PPA is ppa:lemonade-team/stable โ€” NOT lemonade-sdk. And the package is lemonade-server, not just lemonade. Many guides get this wrong.

๐Ÿงช Test It With cURL

# Non-streaming: get the full response
curl -s http://127.0.0.1:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-GGUF",
    "messages": [{"role": "user", "content": "Explain NPU vs GPU in one sentence"}]
  }' | jq -r '.choices[0].message.content'

# Streaming: watch tokens appear in real-time
curl -s http://127.0.0.1:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-GGUF",
    "messages": [{"role": "user", "content": "Write a haiku about remote servers"}],
    "stream": true
  }'

โš ๏ธ The correct API path is /v1/chat/completions โ€” NOT /api/chat/completions. This tripped me up during testing.

๐Ÿ” Secure Boot? Here's What Happens

If sudo modprobe amdxdna fails with "Key was rejected by service", Secure Boot is blocking the unsigned kernel module. You have two options:

Option A โ€” Sign the module (no reboot for install, reboot needed for MOK enrollment):

sudo kmodsign sha512 /var/lib/shim-signed/mok/MOK.priv \
    /var/lib/shim-signed/mok/MOK.der \
    /lib/modules/$(uname -r)/updates/dkms/amdxdna.ko

Option B โ€” Disable Secure Boot in BIOS (needs physical access or IPMI/BMC):

On my test machine, Secure Boot was disabled, so the module loaded instantly with just modprobe. The NPU device appeared at /dev/accel/accel0 immediately โ€” no reboot required.

๐ŸชŸ Windows (One-Click Install)

# Download from GitHub Releases
# https://github.com/lemonade-sdk/lemonade/releases/latest
# Run the .msi installer โ€” auto-detects your hardware

lemonade run Gemma-3-4b-it-GGUF

Server live at http://localhost:13305.

๐ŸŽ macOS (Beta)

pip install lemonade-sdk
lemonade run Gemma-3-4b-it-GGUF

๐Ÿ“‹ The Complete Model Menu (What You Actually Get)

Running lemonade list on a fresh install reveals an impressive buffet of 90+ models. Here's a curated selection โ€” everything I saw on my actual system:

๐Ÿฅ‡ Text Models (llama.cpp backend โ€” runs on NPU)

Model Best For Approx. Size
Gemma-3-4b-it-GGUF Fast first test, general chat ~2.5 GB
Gemma-4-12B-it-GGUF Advanced reasoning ~7 GB
Gemma-4-26B-A4B-it-GGUF MoE, 4B active per token ~15 GB
Llama-4-Scout-17B-16E-Instruct-GGUF Meta's latest MoE ~10 GB
Qwen3-8B-GGUF โญ All-around workhorse (tested!) ~5 GB
Qwen3-Coder-30B-A3B-Instruct-GGUF Code generation beast ~18 GB
Qwen3.5-35B-A3B-GGUF State-of-art MoE ~20 GB
DeepSeek-Qwen3-8B-GGUF Open-source frontier ~5 GB
Phi-4-mini-instruct-GGUF Microsoft's mini marvel ~3 GB
GPT-OSS-120B-GGUF Massive 120B model (needs Strix Halo) ~70 GB

โญ = Personally verified on Ryzen AI 9 HX 370

๐ŸŽจ Image Models (SD-CPP backend โ€” runs on NPU/GPU)

Model What It Does
SDXL-Turbo Fast image generation (~1-2s on NPU)
SD-1.5 Classic Stable Diffusion
SD-Turbo Even faster, slightly lower quality
Qwen-Image-2512-GGUF Qwen's image generation model
Flux-2-Klein-9B-GGUF Flux image generation

๐ŸŽค Speech Models

Model Type
Whisper-Large-v3-Turbo Speech-to-text (best accuracy)
Whisper-Medium Speech-to-text (balanced)
Whisper-Tiny Speech-to-text (fastest)
kokoro-v1 Text-to-speech (voice output!)
Moonshine-Streaming Real-time streaming STT

๐Ÿงฉ Special Collections

Collection What's Inside
Lite Collection Multi-modal bundle for modest hardware
Ultra Collection Full multi-modal suite for high-end rigs

๐Ÿ”Œ OpenAI API Compatible: Plug & Play With Hundreds of Apps

This is the real superpower. Lemonade exposes a standard OpenAI-compatible API at http://localhost:13305/v1. Any app that speaks OpenAI โ€” and that's basically everything โ€” works instantly:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/v1",
    api_key="lemonade"  # required but unused
)

response = client.chat.completions.create(
    model="Qwen3-8B-GGUF",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)

Verified API Endpoints

What You Want Correct Endpoint
๐Ÿ’ฌ Chat POST /v1/chat/completions
๐Ÿ“ Text completion POST /v1/completions
๐Ÿงฎ Embeddings POST /v1/embeddings
๐ŸŽค Transcription POST /v1/audio/transcriptions
๐Ÿ–ผ๏ธ Image generation POST /v1/images/generations
๐Ÿ“‹ List models GET /v1/models
โค๏ธ Health check GET /api/v1/health

Apps That Work Out of the Box:

  • VS Code (via official Copilot extension)
  • Open WebUI (self-hosted ChatGPT-like interface)
  • Continue (IDE coding assistant)
  • n8n (workflow automation)
  • Dify (AI app builder)
  • Plus any OpenAI SDK in Python, Node.js, Go, Rust, C#, Java, Ruby, or PHP

The team even maintains a Marketplace of verified integrations.


๐Ÿ”’ Privacy: Your Data Never Leaves Your Desk

This is the part that makes privacy-conscious folks smile:

  • โœ… 100% local execution โ€” nothing sent to the cloud
  • โœ… No telemetry โ€” the project explicitly states no data collection
  • โœ… No account required โ€” no sign-ups, no API keys
  • โœ… Apache 2.0 license โ€” audit, modify, and redistribute freely

For anyone handling sensitive work documents, healthcare data, or proprietary code โ€” Lemonade means you can use cutting-edge AI without the privacy risk.


๐Ÿ†š Lemonade vs. Ollama: The Honest Comparison

Feature ๐Ÿ‹ Lemonade ๐Ÿฆ™ Ollama
Primary strength AMD optimization + multi-modality Cross-platform model serving
NPU support โœ… XDNA2 (Ryzen AI) โŒ None
Modalities Chat, Vision, Image Gen, TTS, STT Chat, Vision
Binary size ~2MB (server) ~200MB
Multiple models โœ… Simultaneously One at a time
Mobile app โœ… iOS + Android โŒ
API compatibility OpenAI, Ollama, Anthropic Ollama, OpenAI (partial)
GPU backends ROCm, Vulkan, CUDA, Metal CUDA, ROCm, Metal
No-reboot NPU activation โœ… modprobe amdxdna N/A

Verdict: If you're on AMD hardware, Lemonade is the clear winner. On NVIDIA or Apple Silicon, both are viable โ€” but Lemonade's multi-modality and tiny footprint are compelling advantages regardless of your GPU brand.

One HN user ran a quick test on an M1 Max MacBook: "Model: qwen3.5-9b. Ollama completed in about 1:44. Lemonade completed in about 1:14. So it seems faster in this very limited test."


๐Ÿ’ป Real Hardware Verification: Test Setup & Results

My test environment was a remote Ubuntu 26.04 (Resolute Raccoon) mini PC accessed via SSH from Asia while the machine sat in Canada:

Component What We Verified
CPU AMD Ryzen AI 9 HX 370
GPU Radeon 880M / 890M (Strix Point, RDNA 3.5)
NPU XDNA 2 โ€” detected at c5:00.1, device /dev/accel/accel0
RAM 89 Gi (96 GB LPDDR5x)
Kernel 7.0.0-22-generic
Driver amdgpu + amdxdna (DKMS)
Lemonade v10.8.0 from PPA ppa:lemonade-team/stable
NPU Temp 33ยฐC (idle)
VRAM 2 GB BIOS-allocated, 89 Gi shared via UMA
Test Model Qwen3-8B-GGUF (4.9 GB, Q4_1 quant)
Tokens/sec 15.75 tok/s โ€” verified via /v1/chat/completions API
Install Time ~2 minutes (PPA + apt install + modprobe)
Reboots Zero โ€” SSH session never dropped

โœ… What We Proved

Claim Verified?
NPU loads without reboot (modprobe amdxdna) โœ… YES
/dev/accel/accel0 appears immediately โœ… YES
PPA packages work on Ubuntu 26.04 โœ… YES
Auto-downloads models at 84.8 MB/s โœ… YES
OpenAI-compatible API at /v1/chat/completions โœ… YES
~16 tok/s on 8B model with integrated NPU โœ… YES
Reasoning models expose chain-of-thought โœ… YES
Full remote install possible via SSH โœ… YES

๐Ÿ—บ๏ธ The Road Ahead

Lemonade just shipped v10.8.0 (June 17, 2026), which added:

  • Live model management โ€” auto-unload idle models, pin frequently used ones
  • Cloud offload โ€” route to any OpenAI-compatible cloud provider alongside local models (experimental)
  • MCP Gateway โ€” let external tools and agents call local models
  • Expanded platform support โ€” NVIDIA GB10 arm64, Debian 13, ROCm for Radeon GPUs
  • Ubuntu 26.04 (Resolute) packages โ€” the lemonade-server deb landed June 17, supporting the latest LTS

The project maintains an active Discord community and a transparent roadmap driven by community working groups.


๐ŸŽฏ Who Should Try Lemonade?

  • AMD AI PC owners: Finally, something that actually uses that NPU you paid for
  • Privacy-conscious professionals: Lawyers, doctors, developers handling sensitive data
  • Remote homelabbers: Install with modprobe, no reboot, NPU live while you're on another continent
  • Developers & tinkerers: Build AI-powered apps with zero cloud costs
  • Casual AI users: Free, unlimited access to models like Gemma, Llama, Qwen, and Mistral
  • Anyone tired of monthly AI subscriptions: Your hardware, your models, your rules

๐Ÿฅค The Bottom Line

Lemonade lives up to its name: it takes something complex โ€” running AI models locally across different hardware โ€” and makes it refreshingly simple. AMD's backing gives it serious credibility, and the open-source community is shipping features at an impressive pace.

The real kicker? I verified the entire thing on a remote Ubuntu 26.04 mini PC from halfway across the world. One PPA, two packages, one modprobe, zero reboots โ€” and a 50+ TOPS NPU was cranking out 15.75 tokens per second on Qwen3-8B, writing haikus about the very remote server it was running on.

"Silent data streams, Cables hum in the darkโ€” Remote hearts beat."

โ€” Qwen3-8B, running on XDNA 2 NPU, June 2026

Your PC is more than just a computer. With Lemonade, it becomes your personal AI brain โ€” private, free, and ridiculously fast.

Ready to take a sip? Head to lemonade-server.ai and give it a spin.


๐Ÿ“š Verified Source URLs

  1. Lemonade Official Website: https://lemonade-server.ai/
  2. GitHub Repository: https://github.com/lemonade-sdk/lemonade
  3. GitHub Releases (v10.8.0): https://github.com/lemonade-sdk/lemonade/releases
  4. AMD Official Developer Article (Feb 10, 2026): https://www.amd.com/en/developer/resources/technical-articles/2026/lemonade-for-local-ai.html
  5. Official PPA (lemonade-team/stable): https://launchpad.net/~lemonade-team/+archive/ubuntu/stable
  6. ComputeLeap โ€” "AMD's Lemonade Just Made Every Nvidia-Only AI Guide Obsolete": https://www.computeleap.com/blog/amd-lemonade-local-llm-server-guide-2026/
  7. RunAIHome โ€” "AMD Lemonade Local LLM Server Guide 2026": https://runaihome.com/blog/amd-lemonade-local-llm-server-npu-gpu-guide-2026/
  8. Hacker News Discussion: https://news.ycombinator.com/item?id=47612724
  9. Lemonade Discord Community: https://discord.gg/5xXzkMu8Zk
  10. Agent Wars โ€” "AMD's Lemonade: Local AI Server That Actually Works on AMD Hardware": https://www.agent-wars.com/news/2026-04-05-amds-lemonade-local-ai-server
  11. Lilting Channel โ€” AMD Lemonade Architecture Analysis: https://lilting.ch/en/articles/amd-lemonade-local-ai-gpu-npu-server
ยท