Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4 Multi-Token Prediction

Source: infoq.com

Summary

Google's LiteRT-LM, built on LiteRT (formerly TensorFlow Lite), delivers up to 2.2x faster on-device inference for Gemma 4 by natively supporting multi-token prediction (MTP) drafters with memory-local speculative decoding. In the reported benchmarks, prefill and decode run 1.8x to 3.7x faster than with llama.cpp, MLX, Cactus, and ONNX, and the Gemma 4 E2B model uses only 607 MB on Apple mobile CPUs. The runtime also adds Swift and JavaScript APIs, session management for KV cache persistence, and agentic features such as constrained decoding and function calling.
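
To make the speculative decoding idea concrete, here is a minimal sketch of the generic greedy variant in plain Python. It is not LiteRT-LM code and does not use its API: the names target_next, draft_next, and speculative_decode are hypothetical, and both "models" are toy functions standing in for the full Gemma model and its multi-token prediction drafter. The drafter proposes several tokens cheaply, the target checks them, and the longest matching prefix is kept, so the output is identical to ordinary greedy decoding while the target is invoked fewer times.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token


def target_next(ctx: List[int]) -> int:
    """Hypothetical toy 'large' model: deterministic next token."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Hypothetical toy drafter: agrees with the target except when
    the context length is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target verifies the proposal. A real runtime scores all k
        #    positions in one batched forward pass; this sketch simulates
        #    that pass with a loop and counts it once.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong token
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the same pass yields a bonus token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, passes = speculative_decode(target_next, draft_next, prompt, 16)
    print("same as greedy:", tokens == greedy_decode(target_next, prompt, 16))
    print("target verification passes:", passes, "for 16 tokens")
```

The speedup in a real system depends on how often the drafter's guesses are accepted and on how cheap the drafter is relative to the target; the toy acceptance pattern here is an assumption chosen only to show the mechanism.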

Author

Sergio De Simone