[GitHub Trending] lyogavin/airllm

8.1 relevance

70B inference on a single 4GB GPU; highly novel and actionable for AI/ML infrastructure optimization.

AI/ML github.com

AirLLM 70B inference with single 4GB GPU.

Summary

AirLLM reduces inference memory usage through layer-wise decomposition, which lets 70B LLMs run on a single 4GB GPU and Llama 3.1 405B on 8GB of VRAM without quantization, distillation, or pruning. Separately, an optional compression mode based on block-wise quantization delivers up to 3x faster inference with minimal accuracy loss; it is set to 4-bit or 8-bit via the `compression` parameter. Among recent releases, v2.11.0 adds Qwen2.5 support, and v2.10.1 adds CPU inference and support for non-sharded models.
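To make "block-wise quantization" concrete, here is a minimal sketch in plain Python. It is a generic absmax scheme written for illustration, not AirLLM's implementation: the function names, the 8-bit signed range, and the tiny block size are all assumptions chosen to keep the example readable. The idea it shows is that each block of weights gets its own scale, so one large outlier does not wipe out the precision of small weights elsewhere.

```python
from typing import List, Tuple


def quantize_blockwise(values: List[float], block_size: int) -> Tuple[List[int], List[float]]:
    """Hypothetical helper: absmax-quantize to signed 8-bit codes, one scale per block."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # Map the largest magnitude in this block to 127; avoid dividing by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(codes: List[int], scales: List[float], block_size: int) -> List[float]:
    """Hypothetical helper: rebuild approximate floats from codes and per-block scales."""
    return [code * scales[i // block_size] for i, code in enumerate(codes)]


def max_abs_error(original: List[float], restored: List[float]) -> float:
    return max(abs(o - r) for o, r in zip(original, restored))


# Illustrative weights: four small values followed by four large ones.
weights = [0.02, -0.01, 0.03, 0.00, 5.0, -4.0, 2.5, 1.0]

# Block-wise: two blocks of four, each with its own scale.
codes, scales = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

# Single global scale (one block of eight) for comparison.
g_codes, g_scales = quantize_blockwise(weights, block_size=8)
g_restored = dequantize_blockwise(g_codes, g_scales, block_size=8)

# Error on the small weights only, where the difference shows up.
small_error_blockwise = max_abs_error(weights[:4], restored[:4])
small_error_global = max_abs_error(weights[:4], g_restored[:4])
print(small_error_blockwise, small_error_global)
```

In this sketch the small weights are reconstructed far more accurately with per-block scales than with a single global scale, which is the general reason block-wise schemes can keep accuracy loss low at 8-bit or 4-bit widths.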

Author

lyogavin