Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4 Multi-Token Prediction
Google LiteRT-LM achieves up to 2.2x faster local inference with multi-token prediction, which is novel and directly relevant to ML inference optimization.
Google's LiteRT-LM, built on LiteRT (formerly TensorFlow Lite), delivers up to 2.2x faster on-device inference for Gemma 4 by natively supporting multi-token prediction drafters with memory-local speculative decoding. Benchmarks show prefill and decode running 1.8x to 3.7x faster than llama.cpp, MLX, Cactus, and ONNX, while the Gemma 4 E2B model uses only 607 MB on Apple mobile CPUs. The runtime also adds Swift and JavaScript APIs, session management for KV cache persistence, and agentic features such as constrained decoding and function calling.
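To make the drafter idea concrete, here is a minimal sketch of greedy speculative decoding in plain Python. It does not use the LiteRT-LM API, and it is not the Gemma 4 multi-token prediction drafter. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_drafter` are hypothetical, and the "models" are toy functions over integer tokens. The sketch only shows the general pattern: a cheap drafter proposes several tokens, the large model checks them, and the matching prefix is accepted.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # context -> greedy next token


def greedy_decode(target_next: NextTokenFn, prompt: List[Token],
                  max_new_tokens: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next: NextTokenFn, draft_next: NextTokenFn,
                       prompt: List[Token], max_new_tokens: int,
                       draft_len: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < goal:
        # 1. The drafter cheaply proposes a short run of tokens.
        draft: List[Token] = []
        for _ in range(min(draft_len, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))

        # 2. The target verifies the run. A real runtime would score all
        #    draft positions in one batched forward pass; this toy calls
        #    the target per position but counts the round once.
        rounds += 1
        accepted: List[Token] = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            accepted.append(expected)
            if proposed != expected:
                break  # keep the target's correction, drop the rest
        else:
            # Every draft token matched: take one bonus target token.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens, rounds


# Toy stand-ins (assumptions for illustration only).
def toy_target(context: List[Token]) -> Token:
    return (context[-1] + 1) % 10  # counts upward modulo 10


def toy_drafter(context: List[Token]) -> Token:
    # Agrees with the target except after token 2, where it guesses wrong.
    return 0 if context[-1] == 2 else (context[-1] + 1) % 10


if __name__ == "__main__":
    out, rounds = speculative_decode(toy_target, toy_drafter, [0], 12)
    print(out, "verification rounds:", rounds)
```

Because only tokens the target itself would have produced are kept, the output matches plain greedy decoding; the potential speedup comes from needing fewer target passes than generated tokens when the drafter is usually right.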