[GitHub Trending] ggml-org/llama.cpp

9.2 relevance

Essential LLM inference tool, always relevant.

2026-05-18 ai/ml GitHub Trending

LLM inference in C/C++.

Summary

llama.cpp, the flagship C/C++ inference engine for the ggml library, runs LLMs on Apple Silicon, x86, RISC-V, and NVIDIA GPUs with 1.5-bit to 8-bit quantization and CPU+GPU hybrid inference. Recent additions include Hugging Face cache migration, multimodal support in llama-server, VS Code/Vim FIM plugins, and native GGUF support on Hugging Face Inference Endpoints. It now also supports the gpt-oss model in native MXFP4 format, a feature added in collaboration with NVIDIA.
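To put the 1.5-bit to 8-bit quantization range in perspective, the following is a rough back-of-the-envelope sketch in Python, not anything taken from llama.cpp itself. It estimates the memory needed for model weights alone as parameters × bits per weight ÷ 8. The function name `weight_memory_gib` and the 7B parameter count are illustrative assumptions. Real quantized files also store per-block scales and metadata, so actual sizes will be somewhat larger than these figures.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB.

    Assumption: every weight costs exactly `bits_per_weight` bits.
    Real quantization formats add per-block scales, so treat the
    result as a lower bound rather than an exact file size.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model, chosen for illustration
    for bits in (1.5, 4, 8, 16):
        size = weight_memory_gib(n_params, bits)
        print(f"{bits:>4} bits/weight: {size:.2f} GiB")
```

The estimate ignores the KV cache and activation buffers, which grow with context length and matter when deciding how many layers to offload in CPU+GPU hybrid inference.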

Key Takeaway

Evaluate llama.cpp for local LLM inference in your agent orchestration stack: the multimodal llama-server supports low-cost prototyping, and the same GGUF models can be served from Hugging Face Inference Endpoints when you move to the cloud.

Why it matters

For a senior engineer building AI agents and cloud infrastructure, llama.cpp provides a lightweight, high-performance local inference engine that integrates with cloud endpoints via GGUF and offers developer tooling (VS Code/Vim) to accelerate agent prototyping and edge deployment.

Full Article

llama.cpp

Manifesto / ggml / ops

LLM inference in C/C++

Recent API changes

- Changelog for libllama API
- Changelog for llama-server REST API

Hot topics

- Hugging Face cache migration: models downloaded with -hf are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.
- guide : using the new WebUI of llama.cpp
- guide : running gpt-oss with llama.cpp
- [FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗
- Support for the gpt-oss model with native MXFP4 format has been added
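The cache migration note above can be explored with a small Python sketch that locates the standard Hugging Face hub cache and lists the model folders in it. It follows the resolution order documented for the huggingface_hub library (`HF_HUB_CACHE`, then `HF_HOME`, then `XDG_CACHE_HOME`, then `~/.cache/huggingface/hub`). Whether llama.cpp honors each of these variables in the same way is an assumption, not something the text above confirms. The helper names `hf_hub_cache_dir` and `list_cached_repos` are hypothetical.

```python
import os
from pathlib import Path


def hf_hub_cache_dir(env=None) -> Path:
    """Resolve the Hugging Face hub cache directory.

    Assumption: mirrors the order documented for huggingface_hub;
    llama.cpp's own lookup may differ in detail.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    xdg = env.get("XDG_CACHE_HOME")
    base = Path(xdg).expanduser() if xdg else Path.home() / ".cache"
    return base / "huggingface" / "hub"


def list_cached_repos(cache_dir: Path) -> list[str]:
    """Return the names of cached model folders (prefixed 'models--')."""
    if not cache_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in cache_dir.iterdir()
        if p.is_dir() and p.name.startswith("models--")
    )


if __name__ == "__main__":
    cache = hf_hub_cache_dir()
    print(f"Hub cache: {cache}")
    for name in list_cached_repos(cache):
        print(f"  {name}")
```

Running this before and after a download with -hf is one way to check whether a model landed in the shared cache, so other HF tools can reuse it instead of downloading it again.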