UPSTREAM PR #17425: server : add Anthropic Messages API support#279

Open
loci-dev wants to merge 1 commit into main from
upstream-PR17425-branch_QuickEdge-feature/anthropic-api-support
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17425

claude-code

Summary

This PR adds full Anthropic Messages API compatibility to llama-server, enabling it to act as a drop-in replacement for applications that use the Anthropic API. The implementation converts Anthropic's request format to the OpenAI-compatible internal format, reusing the existing inference pipeline without modifying core llama.cpp functionality.

Motivation

  • Enables llama.cpp to serve as a local/self-hosted alternative to Anthropic's Claude API
  • Allows Claude Code and other Anthropic-compatible clients to work with llama-server
  • Provides feature parity with both OpenAI and Anthropic API formats

Features Implemented

Endpoints:

  • POST /v1/messages - Chat completions with streaming support
  • POST /v1/messages/count_tokens - Token counting for prompts
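As an illustrative sketch of what a client would send to the new endpoint (field names follow the public Anthropic Messages API; the model name and `max_tokens` value here are assumptions, not values from this PR):

```python
import json

# Minimal Anthropic-style request body for POST /v1/messages.
# "local-model" and max_tokens=256 are illustrative placeholders.
payload = {
    "model": "local-model",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
        {"role": "user", "content": "Hello!"},
    ],
}

body = json.dumps(payload)
print(body)
```

The same JSON body, minus generation parameters, would also be valid input for `/v1/messages/count_tokens`.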

Functionality:

  • Streaming with proper Anthropic SSE event types (message_start, content_block_delta, etc.)
  • Tool use (function calling) with tool_use/tool_result content blocks
  • Vision support with image content blocks (base64 and URL)
  • System prompts and multi-turn conversations
  • Extended thinking parameter support
  • Permissive validation with sensible defaults (e.g., max_tokens defaults to 4096)
  • Spec-compliant responses (no timing fields that break Anthropic clients)
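To make the streaming behavior concrete, here is a simplified sketch of the Anthropic SSE event sequence (event names follow the public Anthropic streaming spec; the payload shapes are heavily abbreviated and the helper names are assumptions, not code from this PR):

```python
import json

def anthropic_sse_events(text_chunks):
    """Yield (event_name, payload) pairs in the Anthropic streaming order:
    message_start, content_block_start, repeated content_block_delta,
    content_block_stop, message_delta, message_stop. Payloads simplified."""
    yield "message_start", {"type": "message_start",
                            "message": {"role": "assistant", "content": []}}
    yield "content_block_start", {"type": "content_block_start", "index": 0,
                                  "content_block": {"type": "text", "text": ""}}
    for chunk in text_chunks:
        yield "content_block_delta", {"type": "content_block_delta", "index": 0,
                                      "delta": {"type": "text_delta",
                                                "text": chunk}}
    yield "content_block_stop", {"type": "content_block_stop", "index": 0}
    yield "message_delta", {"type": "message_delta",
                            "delta": {"stop_reason": "end_turn"}}
    yield "message_stop", {"type": "message_stop"}

def to_sse(event, data):
    # Each SSE frame names its event type and carries a JSON payload.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

Note the named `event:` line on every frame, which distinguishes Anthropic-style SSE from the OpenAI streaming format (where frames are plain `data:` lines).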

Architecture:

  • Adapter pattern: Anthropic format → OpenAI-compatible internal format
  • Zero changes to core inference or ggml
  • Minimally intrusive changes (4 deletions across all files)
  • Clean separation via OAICOMPAT_TYPE_ANTHROPIC enum
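The adapter idea above can be sketched in a few lines. This is a simplified Python illustration of the format conversion, not the PR's actual C++ code; the function name and the exact field handling are assumptions:

```python
def anthropic_to_openai(body):
    """Map an Anthropic Messages request onto an OpenAI-style
    chat-completions request (simplified sketch)."""
    messages = []
    system = body.get("system")
    if system:
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI expects it as the first chat message.
        messages.append({"role": "system", "content": system})
    for msg in body.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):
            # Anthropic content blocks -> flatten the text blocks
            content = "".join(b.get("text", "") for b in content
                              if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})
    return {
        "messages": messages,
        # Permissive default mirroring the behavior described above
        "max_tokens": body.get("max_tokens", 4096),
        "stream": body.get("stream", False),
    }
```

After this conversion the request flows through the existing OpenAI-compatible pipeline unchanged, which is why no core inference code needs to be touched.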

Testing

Test Coverage:

  • 24 comprehensive tests in test_anthropic_api.py
  • Tests cover: basic messages, streaming, tools, vision, token counting, parameters, error handling, content block indices
  • Test results: 23 passed, 1 skipped (vision test requires multimodal model)

@loci-dev loci-dev force-pushed the main branch 25 times, most recently from 0f2d111 to 8e531a0 Compare November 25, 2025 14:09
@loci-dev loci-dev force-pushed the upstream-PR17425-branch_QuickEdge-feature/anthropic-api-support branch from b7c322b to aa6192d Compare November 25, 2025 16:41
@loci-review

loci-review bot commented Nov 25, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #279 - Anthropic Messages API Support

Analysis Scope: Comparison of version a752fcbb (target) vs aab9b31c (baseline)
Project: llama.cpp
Changes: +1,712 additions, -6 deletions across 8 files


Performance Impact Assessment

The performance analysis reveals zero measurable impact on core inference functionality. All binaries show 0.0% power consumption change, with no detectable modifications to performance-critical functions. The code changes implement an API adapter layer in the server component without touching inference paths.

Power Consumption Analysis:
All 16 binaries maintain identical power consumption profiles. The largest binaries show no variation: libllama.so (228,844 nJ), llama-tts (285,154 nJ), llama-cvector-generator (278,999 nJ), and llama-run (245,370 nJ) remain unchanged. Floating-point deltas under 0.5 nJ represent measurement precision limits rather than actual changes.

Function-Level Analysis:
No functions exhibit measurable response time or throughput changes. Core inference functions (llama_decode, llama_encode, llama_tokenize) maintain identical execution characteristics. The tokenization pipeline, memory management (llama_memory_clear, llama_kv_cache operations), and batch processing (llama_batch_init, llama_batch_get_one) show no modifications.

Tokens Per Second Impact:
Zero impact on inference throughput. The changes add request format conversion logic (anthropic_params_from_json) and response formatting (to_json_anthropic) in the server layer, executing before and after inference. These operations add approximately 0.1-0.5 ms per request, which is negligible compared to typical inference times. No functions in the tokenization or inference pipeline were modified.

Code Implementation:
The PR implements Anthropic Messages API compatibility through an adapter pattern that converts between API formats. New endpoints (/v1/messages, /v1/messages/count_tokens) route through existing inference infrastructure. The implementation adds OAICOMPAT_TYPE_ANTHROPIC enum handling and Anthropic-specific SSE formatting without modifying core llama.cpp or ggml libraries.

Conclusion:
The changes are isolated to the API server layer with no performance impact on inference operations. Tokens per second remains unaffected as no tokenization or decoding functions were modified.

@loci-dev loci-dev force-pushed the main branch 30 times, most recently from d516828 to 0a006e7 Compare December 2, 2025 04:15