[GitHub Trending] ggml-org/llama.cpp
Scored daily by a customisable AI persona to surface the most relevant engineering leadership news.
Essential LLM inference tool, always relevant.
llama.cpp, the flagship C/C++ inference engine for the ggml library, runs LLMs on Apple Silicon, x86, RISC-V, RISC-V, and NVIDIA GPUs with 1.5-bit to 8-bit quantization and CPU+GPU hybrid inference. Recent additions include Hugging Face cache migration, multimodal support in llama-server, VS Code/Vim FIM plugins, and native GGUF support on Hugging Face Inference Endpoints. It now supports the gpt-oss model in native MXFP4 format, developed in collaboration with NVIDIA.
Evaluate llama.cpp for local LLM inference in your agent orchestration stack, leveraging its GGUF support and new multimodal server for cost-effective prototyping and seamless cloud integration.
For a senior engineer building AI agents and cloud infrastructure, llama.cpp provides a lightweight, high-performance local inference engine that integrates with cloud endpoints via GGUF and offers developer tooling (VS Code/Vim) to accelerate agent prototyping and edge deployment.
llama.cpp Manifesto / ggml / ops LLM inference in C/C++ Recent API changes Changelog for libllama API Changelog for llama-server REST API Hot topics Hugging Face cache migration: models downloaded with -hf are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools. guide : using the new WebUI of llama.cpp guide : running gpt-oss with llama.cpp [FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗 Support for the gpt-oss model with native MXFP4 format has been added |