Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

8.3 relevance

Open-source LLM inference speedup; directly matches the reader's interests.

2026-05-16 devtools Hacker News (100+)

chiennv2000/orthrus: Fast, lossless LLM inference via dual-view diffusion decoding.

Summary

Orthrus is a dual-view diffusion decoding framework that reports up to 7.8× tokens per forward pass on Qwen3 models while keeping the output distribution identical to that of the base LLM (lossless). It shares the KV cache natively and fine-tunes only 16% of parameters, and it is reported to outperform speculative decoding methods such as EAGLE-3 with minimal memory overhead. Native vLLM and SGLang integration is planned.
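To make "lossless" concrete: the summary does not say how Orthrus verifies its parallel proposals, so the sketch below is not Orthrus's algorithm. It shows the standard accept/reject rule from speculative sampling, which is one well-known way to emit tokens proposed by a cheaper distribution while keeping the output distribution exactly equal to the target model's. The function name and the toy distributions are hypothetical, chosen only for illustration.

```python
def speculative_output_distribution(p, q):
    """Exact distribution of one emitted token under accept/reject sampling.

    p: target (base model) probabilities over the vocabulary.
    q: proposal (draft) probabilities over the same vocabulary.
    Illustrative only; not taken from the Orthrus codebase.
    """
    # A token x is proposed with probability q(x) and accepted with
    # probability min(1, p(x)/q(x)), so the joint mass is min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        # p and q are identical, so proposals are never rejected.
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


# Toy three-token vocabulary (assumed values).
target = [0.5, 0.3, 0.2]
draft = [0.2, 0.2, 0.6]
out = speculative_output_distribution(target, draft)
```

Here `out` matches `target` up to floating-point error even though the proposal distribution is quite different. A worse proposal lowers the acceptance rate, and with it the number of tokens gained per forward pass, but it does not change what the model outputs.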

Key Takeaway

Explore Orthrus for lossless parallel token generation if you're using Qwen3 models or seeking alternatives to speculative decoding.

Why it matters

For developers deploying LLMs in production, Orthrus offers a drop-in acceleration method that preserves output quality and reduces latency without redundant memory costs, directly benefiting agentic and high-throughput inference pipelines.