Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

8.3 relevance
Score Breakdown
technical depth: 9
novelty: 8
actionability: 8
community: 7
strategic: 7
personal: 10

Scored daily by a customisable AI persona to surface the most relevant engineering leadership news.

Open-source LLM inference speedup that directly matches the reader's interests.

2026-05-16 devtools Hacker News (100+)
Fast, lossless LLM inference via dual-view diffusion decoding. - chiennv2000/orthrus
Summary

Orthrus is a dual-view diffusion framework that achieves up to 7.8x tokens per forward pass on Qwen3 models while guaranteeing an output distribution identical to the base LLM's. By sharing the KV cache natively and fine-tuning only 16% of parameters, it outperforms speculative decoding methods such as EAGLE-3 with minimal memory overhead. Native vLLM and SGLang integration is planned.
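To make the "tokens per forward pass" figure concrete, here is a minimal arithmetic sketch. It is not Orthrus code: the function name `forward_passes` and the 1024-token example are assumptions for illustration, and 7.8 is the "up to" figure from the summary, so real workloads may average lower. It counts forward passes only and says nothing about wall-clock latency, since a single pass may cost more than a plain autoregressive step.

```python
import math


def forward_passes(num_tokens: int, tokens_per_forward: float) -> int:
    """Return the forward passes needed to emit num_tokens.

    tokens_per_forward is the average number of tokens produced per pass
    (1.0 for plain autoregressive decoding). Hypothetical helper, used
    only to illustrate the metric quoted in the summary.
    """
    if num_tokens < 0 or tokens_per_forward <= 0:
        raise ValueError("num_tokens must be >= 0 and tokens_per_forward > 0")
    return math.ceil(num_tokens / tokens_per_forward)


# Plain autoregressive decoding: one token per pass.
baseline = forward_passes(1024, 1.0)

# Best case quoted in the summary: 7.8 tokens per pass.
best_case = forward_passes(1024, 7.8)

print(baseline, best_case)  # 1024 132
```

Under these assumptions, a 1024-token completion drops from 1024 passes to 132 in the best case.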

Key Takeaway

Explore Orthrus for lossless parallel token generation if you're using Qwen3 models or seeking alternatives to speculative decoding.

Why it matters

For developers deploying LLMs in production, Orthrus offers a drop-in acceleration method that preserves output quality and reduces latency without redundant memory costs, directly benefiting agentic and high-throughput inference pipelines.