UPSTREAM PR #17425: server : add Anthropic Messages API support#279

Open
loci-dev wants to merge 1 commit into main from
upstream-PR17425-branch_QuickEdge-feature/anthropic-api-support
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17425

claude-code

Summary

This PR adds full Anthropic Messages API compatibility to llama-server, enabling it to act as a drop-in replacement for applications that use the Anthropic API. The implementation converts Anthropic's request format to the OpenAI-compatible internal format, reusing the existing inference pipeline without modifying core llama.cpp functionality.

Motivation

  • Enables llama.cpp to serve as a local/self-hosted alternative to Anthropic's Claude API
  • Allows Claude Code and other Anthropic-compatible clients to work with llama-server
  • Provides feature parity with both OpenAI and Anthropic API formats

Features Implemented

Endpoints:

  • POST /v1/messages - Chat completions with streaming support
  • POST /v1/messages/count_tokens - Token counting for prompts
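As an illustrative sketch of what a client would send to the new endpoint (field names follow the public Anthropic Messages API; the model name and `max_tokens` value here are assumptions, not values from this PR):

```python
import json

# Minimal Anthropic-style request body for POST /v1/messages.
# "local-model" and max_tokens=256 are illustrative placeholders.
payload = {
    "model": "local-model",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
        {"role": "user", "content": "Hello!"},
    ],
}

body = json.dumps(payload)
print(body)
```

The same JSON body, minus generation parameters, would also be valid input for `/v1/messages/count_tokens`.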

Functionality:

  • Streaming with proper Anthropic SSE event types (message_start, content_block_delta, etc.)
  • Tool use (function calling) with tool_use/tool_result content blocks
  • Vision support with image content blocks (base64 and URL)
  • System prompts and multi-turn conversations
  • Extended thinking parameter support
  • Permissive validation with sensible defaults (e.g., max_tokens defaults to 4096)
  • Spec-compliant responses (no timing fields that break Anthropic clients)
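To make the streaming behavior concrete, here is a simplified sketch of the Anthropic SSE event sequence (event names follow the public Anthropic streaming spec; the payload shapes are heavily abbreviated and the helper names are assumptions, not code from this PR):

```python
import json

def anthropic_sse_events(text_chunks):
    """Yield (event_name, payload) pairs in the Anthropic streaming order:
    message_start, content_block_start, repeated content_block_delta,
    content_block_stop, message_delta, message_stop. Payloads simplified."""
    yield "message_start", {"type": "message_start",
                            "message": {"role": "assistant", "content": []}}
    yield "content_block_start", {"type": "content_block_start", "index": 0,
                                  "content_block": {"type": "text", "text": ""}}
    for chunk in text_chunks:
        yield "content_block_delta", {"type": "content_block_delta", "index": 0,
                                      "delta": {"type": "text_delta",
                                                "text": chunk}}
    yield "content_block_stop", {"type": "content_block_stop", "index": 0}
    yield "message_delta", {"type": "message_delta",
                            "delta": {"stop_reason": "end_turn"}}
    yield "message_stop", {"type": "message_stop"}

def to_sse(event, data):
    # Each SSE frame names its event type and carries a JSON payload.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

Note the named `event:` line on every frame, which distinguishes Anthropic-style SSE from the OpenAI streaming format (where frames are plain `data:` lines).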

Architecture:

  • Adapter pattern: Anthropic format → OpenAI-compatible internal format
  • Zero changes to core inference or ggml
  • Minimally intrusive changes (4 deletions across all files)
  • Clean separation via OAICOMPAT_TYPE_ANTHROPIC enum
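The adapter idea above can be sketched in a few lines. This is a simplified Python illustration of the format conversion, not the PR's actual C++ code; the function name and the exact field handling are assumptions:

```python
def anthropic_to_openai(body):
    """Map an Anthropic Messages request onto an OpenAI-style
    chat-completions request (simplified sketch)."""
    messages = []
    system = body.get("system")
    if system:
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI expects it as the first chat message.
        messages.append({"role": "system", "content": system})
    for msg in body.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):
            # Anthropic content blocks -> flatten the text blocks
            content = "".join(b.get("text", "") for b in content
                              if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})
    return {
        "messages": messages,
        # Permissive default mirroring the behavior described above
        "max_tokens": body.get("max_tokens", 4096),
        "stream": body.get("stream", False),
    }
```

After this conversion the request flows through the existing OpenAI-compatible pipeline unchanged, which is why no core inference code needs to be touched.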

Testing

Test Coverage:

  • 24 comprehensive tests in test_anthropic_api.py
  • Tests cover: basic messages, streaming, tools, vision, token counting, parameters, error handling, content block indices
  • Test results: 23 passed, 1 skipped (vision test requires multimodal model)

@loci-dev loci-dev force-pushed the main branch 25 times, most recently from 0f2d111 to 8e531a0 Compare November 25, 2025 14:09
@loci-dev loci-dev force-pushed the upstream-PR17425-branch_QuickEdge-feature/anthropic-api-support branch from b7c322b to aa6192d Compare November 25, 2025 16:41
@loci-review

loci-review bot commented Nov 25, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #279 - Anthropic Messages API Support

Analysis Scope: Comparison of version a752fcbb (target) vs aab9b31c (baseline)
Project: llama.cpp
Changes: +1,712 additions, -6 deletions across 8 files


Performance Impact Assessment

The performance analysis reveals zero measurable impact on core inference functionality. All binaries show 0.0% power consumption change, with no detectable modifications to performance-critical functions. The code changes implement an API adapter layer in the server component without touching inference paths.

Power Consumption Analysis:
All 16 binaries maintain identical power consumption profiles. The largest binaries show no variation: libllama.so (228,844 nJ), llama-tts (285,154 nJ), llama-cvector-generator (278,999 nJ), and llama-run (245,370 nJ) remain unchanged. Floating-point deltas under 0.5 nJ represent measurement precision limits rather than actual changes.

Function-Level Analysis:
No functions exhibit measurable response time or throughput changes. Core inference functions (llama_decode, llama_encode, llama_tokenize) maintain identical execution characteristics. The tokenization pipeline, memory management (llama_memory_clear, llama_kv_cache operations), and batch processing (llama_batch_init, llama_batch_get_one) show no modifications.

Tokens Per Second Impact:
Zero impact on inference throughput. The changes add request format conversion logic (anthropic_params_from_json) and response formatting (to_json_anthropic) in the server layer, executing before and after inference. These operations add approximately 0.1-0.5 ms per request, which is negligible compared to typical inference times. No functions in the tokenization or inference pipeline were modified.

Code Implementation:
The PR implements Anthropic Messages API compatibility through an adapter pattern that converts between API formats. New endpoints (/v1/messages, /v1/messages/count_tokens) route through existing inference infrastructure. The implementation adds OAICOMPAT_TYPE_ANTHROPIC enum handling and Anthropic-specific SSE formatting without modifying core llama.cpp or ggml libraries.

Conclusion:
The changes are isolated to the API server layer with no performance impact on inference operations. Tokens per second remains unaffected as no tokenization or decoding functions were modified.

@loci-dev loci-dev force-pushed the main branch 30 times, most recently from d516828 to 0a006e7 Compare December 2, 2025 04:15