
Conversation


mostlygeek (Owner) commented on Aug 27, 2025

This allows llama-swap to support the /infill endpoint provided by llama-server. The endpoint is very fast for code infilling when used with llama.vscode.

Requests to /infill will also show up on the Activities page:

[screenshot: Activities page listing /infill requests]

Summary by CodeRabbit

  • New Features

    • Added reranking endpoints (/rerank, /reranking, plus existing /v1/rerank, /v1/reranking).
    • Introduced /infill endpoint for code infilling.
    • Enhanced metrics reporting to use timing data when available, improving token counts, throughput, and duration accuracy.
  • Documentation

    • Reorganized endpoint docs to highlight llama-server (llama.cpp) support.
    • Documented new /infill and reranking endpoints under llama-server.
    • Removed OpenAI API reranking entries to reduce confusion.


coderabbitai bot commented Aug 27, 2025

Walkthrough

Adds new routes (/rerank, /reranking, /infill) to the proxy and documents a dedicated llama-server endpoints section in README; refactors metrics middleware to introduce a MetricsRecorder, handle streaming vs non-streaming responses, and parse optional timings fields to override usage-based metrics.

Changes

Cohort / File(s) and summary:

  • Documentation (README.md): Reflowed endpoint docs: removed the OpenAI rerank bullet, added a dedicated "✅ llama-server (llama.cpp) supported endpoints" section listing v1/rerank, v1/reranking, /rerank, and the new /infill (code infilling).
  • Proxy routing (proxy/proxymanager.go): Registers new routes /reranking, /rerank, and /infill (wired to the existing proxyOAIHandler); reorders/aliases the rerank endpoints and updates the embeddings comment to include reranking.
  • Metrics middleware (proxy/metrics_middleware.go): Adds an exported MetricsRecorder type; switches to Content-Type-based streaming detection; splits response handling into processStreamingResponse and processNonStreamingResponse; parses timings JSON (overriding token counts, throughput, and duration) with a fallback to usage and elapsed time; emits metrics via metricsMonitor.addMetrics.
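
For orientation, a minimal sketch of how the new route aliases might be wired up, assuming a gin router (as llama-swap uses) and a placeholder standing in for the existing proxyOAIHandler; this is illustrative, not the actual proxymanager.go code:

package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.New()

	// placeholder for proxyOAIHandler, which forwards the request to the upstream server
	oai := func(c *gin.Context) { c.JSON(http.StatusOK, gin.H{"ok": true}) }

	// embeddings and reranking endpoints, including the new non-/v1 aliases
	for _, p := range []string{"/v1/rerank", "/v1/reranking", "/rerank", "/reranking"} {
		r.POST(p, oai)
	}

	// new: code infilling endpoint provided by llama-server
	r.POST("/infill", oai)

	_ = r.Run(":8080")
}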

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant P as Proxy (proxymanager)
  participant H as proxyOAIHandler
  participant M as MetricsMiddleware / MetricsRecorder
  participant MM as MetricsMonitor

  C->>P: HTTP request (/v1/chat/... | /v1/rerank | /reranking | /infill)
  Note over P: New routes include /rerank, /reranking, /infill
  P->>H: Forward request
  activate H
  H->>M: Response passes through metrics middleware
  activate M
  alt response Content-Type == text/event-stream (streaming)
    M->>M: processStreamingResponse: scan SSE "data:" lines, parse JSON chunks
  else non-streaming
    M->>M: processNonStreamingResponse: read full body JSON
  end
  M->>M: parseAndRecordMetrics: check `timings` ? use timings : use usage + elapsed
  M->>MM: addMetrics(Timestamp, Model, InputTokens, OutputTokens, PPS, TPS, DurationMs)
  deactivate M
  H-->>P: Response
  deactivate H
  P-->>C: Response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes






coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
proxy/metrics_middleware.go (1)

72-101: Auto-detect SSE to record metrics for /infill without stream=true.

Right now, isStreaming is set solely from the request JSON’s stream flag. /infill streams by default and many clients (e.g., llama.vscode) don’t set stream=true, so metrics parsing will treat the SSE body as non-JSON and skip recording. Add a header-based fallback to detect text/event-stream responses.

Apply this diff to keep defaults non-negative:

-	tokensPerSecond := -1.0
-	promptPerSecond := -1.0
+	tokensPerSecond := 0.0
+	promptPerSecond := 0.0

And add SSE detection (outside this hunk) to the writer:

// import "strings"

func (w *MetricsResponseWriter) WriteHeader(statusCode int) {
	// Detect SSE from response headers
	if ct := w.ResponseWriter.Header().Get("Content-Type"); strings.Contains(ct, "text/event-stream") {
		w.metricsRecorder.isStreaming = true
	}
	w.ResponseWriter.WriteHeader(statusCode)
}

func (w *MetricsResponseWriter) Write(b []byte) (int, error) {
	// Late header detection fallback
	if !w.metricsRecorder.isStreaming {
		if ct := w.ResponseWriter.Header().Get("Content-Type"); strings.Contains(ct, "text/event-stream") {
			w.metricsRecorder.isStreaming = true
		}
	}
	n, err := w.ResponseWriter.Write(b)
	if err != nil {
		return n, err
	}
	w.body = append(w.body, b...)
	return n, nil
}

Optionally, also treat path-based defaults as streaming:

// In MetricsMiddleware(...) before creating writer:
isStreaming := gjson.GetBytes(bodyBytes, "stream").Bool()
if strings.HasSuffix(c.Request.URL.Path, "/infill") {
	isStreaming = true
}

If you want, I can open a follow-up PR with these changes and a unit test for /infill SSE.

🧹 Nitpick comments (3)
README.md (2)

23-25: Add missing /reranking alias to match the router.

The proxy registers both /rerank and /reranking, but the README only lists /rerank. Add /reranking for accuracy.

Apply this diff:

-  - `v1/rerank`, `v1/reranking`, `/rerank`
+  - `v1/rerank`, `v1/reranking`, `/rerank`, `/reranking`

25-25: Clarify that requests to /infill must include a model when going through llama-swap.

llama-server’s /infill does not require a model field, but llama-swap needs it to select the upstream model; otherwise the proxy returns 400. A one‑line note here will prevent confusion for users coming from llama.vscode or direct llama.cpp usage. (github.com)

Would you like me to propose a minimal README snippet with an example curl?
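
For reference, a rough sketch of what such a request could look like, shown here as a small Go program; the model name and the exact /infill request fields (input_prefix, input_suffix) are assumptions to verify against the llama-server docs:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// "model" is the llama-swap routing key; the other fields follow llama-server's
	// /infill API as best understood here and should be double-checked against its docs.
	body := []byte(`{
		"model": "qwen2.5-coder",
		"input_prefix": "def add(a, b):\n    return ",
		"input_suffix": "\n\nprint(add(1, 2))\n"
	}`)

	resp, err := http.Post("http://localhost:8080/infill", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}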

proxy/metrics_middleware.go (1)

95-101: Timings override LGTM, minor robustness nit.

Overriding tokens and rates with timings.* is correct for llama-server. Consider guarding against missing timing fields (leave defaults if zero) to avoid misleading zeros when a server omits one of them.
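
For illustration, a rough sketch of such a guard using gjson, as the middleware already does; the variable names mirror the surrounding code but are assumptions here, not copied from it:

// import "github.com/tidwall/gjson"

func overrideRates(jsonData gjson.Result, tokensPerSecond, promptPerSecond *float64) {
	// only override the defaults when the timing field is present and non-zero
	if v := jsonData.Get("timings.predicted_per_second"); v.Exists() && v.Float() > 0 {
		*tokensPerSecond = v.Float()
	}
	if v := jsonData.Get("timings.prompt_per_second"); v.Exists() && v.Float() > 0 {
		*promptPerSecond = v.Float()
	}
}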

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between c55d0cc and 916b210.

📒 Files selected for processing (3)
  • README.md (1 hunks)
  • proxy/metrics_middleware.go (1 hunks)
  • proxy/proxymanager.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
proxy/metrics_middleware.go (2)
proxy/proxymanager_test.go (2)
  • TestProxyManager_MiddlewareWritesMetrics_NonStreaming (749-782)
  • TestProxyManager_MiddlewareWritesMetrics_Streaming (784-817)
proxy/metrics_monitor.go (4)
  • ID (12-21)
  • Metrics (24-26)
  • mu (33-38)
  • mp (54-66)
🔇 Additional comments (3)
proxy/proxymanager.go (2)

194-196: Comment and grouping LGTM.

“Support embeddings and reranking” label matches the new routes; no behavior change.


197-202: Rerank aliases look good.

Registering /rerank and /reranking alongside /v1/rerank and /v1/reranking matches common llama-server aliases.

proxy/metrics_middleware.go (1)

88-91: Usage override LGTM.

Using usage.prompt_tokens and usage.completion_tokens when present is correct and keeps backward compatibility.


coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
proxy/metrics_middleware.go (2)

70-96: TokensPerSecond defaults to -1 without timings; breaks existing tests and dashboards. Add a fallback computation.

When timings are absent, TokensPerSecond remains -1. Tests assert > 0.0 and dashboards expect non-negative values.

 	// timings data
 	tokensPerSecond := -1.0
 	promptPerSecond := -1.0
 	durationMs := int(time.Since(rec.startTime).Milliseconds())
@@
 	if timings.Exists() {
 		inputTokens = int(jsonData.Get("timings.prompt_n").Int())
 		outputTokens = int(jsonData.Get("timings.predicted_n").Int())
 		promptPerSecond = jsonData.Get("timings.prompt_per_second").Float()
 		tokensPerSecond = jsonData.Get("timings.predicted_per_second").Float()
 		durationMs = int(jsonData.Get("timings.prompt_ms").Float() + jsonData.Get("timings.predicted_ms").Float())
-	}
+	} else {
+		// Fallback: derive rates from elapsed time and usage counts
+		secs := float64(durationMs) / 1000.0
+		if secs <= 0 {
+			secs = 0.001 // avoid div-by-zero and keep tests/assertions happy
+		}
+		if outputTokens > 0 {
+			tokensPerSecond = float64(outputTokens) / secs
+		} else {
+			tokensPerSecond = 0.0
+		}
+		if inputTokens > 0 {
+			promptPerSecond = float64(inputTokens) / secs
+		} else {
+			promptPerSecond = 0.0
+		}
+	}

27-27: Typo in user-facing error message ("ready" → "read").

-			pm.sendErrorResponse(c, http.StatusBadRequest, "could not ready request body")
+			pm.sendErrorResponse(c, http.StatusBadRequest, "could not read request body")
🧹 Nitpick comments (5)
proxy/metrics_middleware.go (5)

58-63: Make streaming detection case-insensitive and add a body-based fallback.

Some upstreams don’t set Content-Type reliably; also header values may vary in casing. Fall back to detecting SSE markers in the body to avoid missing metrics for streaming responses.

-		// check for streaming response
-		if strings.Contains(c.Writer.Header().Get("Content-Type"), "text/event-stream") {
-			writer.metricsRecorder.processStreamingResponse(writer.body)
-		} else {
-			writer.metricsRecorder.processNonStreamingResponse(writer.body)
-		}
+		// check for streaming response
+		ct := strings.ToLower(c.Writer.Header().Get("Content-Type"))
+		if strings.Contains(ct, "text/event-stream") ||
+			bytes.Contains(writer.body, []byte("\ndata:")) ||
+			bytes.HasPrefix(bytes.TrimSpace(writer.body), []byte("data:")) {
+			writer.metricsRecorder.processStreamingResponse(writer.body)
+		} else {
+			writer.metricsRecorder.processNonStreamingResponse(writer.body)
+		}

41-43: Minor wording nit: “modelID” → “model ID”.

Improves user-facing message clarity.

-			pm.sendErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("could not find real modelID for %s", requestedModel))
+			pm.sendErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("could not find real model ID for %s", requestedModel))

15-20: Consider keeping MetricsRecorder unexported unless it’s part of the public API.

Reduces package surface; export only if it’s referenced outside proxy.

-type MetricsRecorder struct {
+type metricsRecorder struct {
 	metricsMonitor *MetricsMonitor
 	realModelName  string
 	//	isStreaming    bool
 	startTime time.Time
 }

Note: update the composite literal at Line 49 accordingly.


111-144: SSE parsing is line-based and assumes single-line JSON payloads.

Some servers can split one event across multiple data: lines. Consider accumulating consecutive data: lines into one buffer per event before JSON parse.

I can provide a small helper that rebuilds SSE events from multi-line data: frames if you’d like.
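
For reference, a rough sketch of such a helper (the function name and exact placement are assumptions):

// imports assumed: bufio, bytes, strings
//
// Rebuilds SSE event payloads: consecutive "data:" lines belong to one event and,
// per the SSE format, are joined with newlines; a blank line terminates the event.
// Each returned slice is one event payload ready for JSON parsing.
func splitSSEEvents(body []byte) [][]byte {
	var events [][]byte
	var lines []string

	flush := func() {
		if len(lines) > 0 {
			events = append(events, []byte(strings.Join(lines, "\n")))
			lines = nil
		}
	}

	scanner := bufio.NewScanner(bytes.NewReader(body))
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, "data:"):
			lines = append(lines, strings.TrimSpace(strings.TrimPrefix(line, "data:")))
		case line == "":
			flush() // blank line ends the current event
		}
	}
	flush() // final event if the stream did not end with a blank line
	return events
}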


164-171: Bound buffered response size to avoid excessive memory use on long streams.

We only need the tail containing metrics; bound to last N bytes (e.g., 64 KiB).

 	n, err := w.ResponseWriter.Write(b)
@@
-	w.body = append(w.body, b...)
+	w.body = append(w.body, b...)
+	// keep only the last 64 KiB to limit memory usage
+	const maxBuf = 64 << 10
+	if len(w.body) > maxBuf {
+		w.body = w.body[len(w.body)-maxBuf:]
+	}
 	return n, nil
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 916b210 and 8d1a069.

📒 Files selected for processing (1)
  • proxy/metrics_middleware.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
proxy/metrics_middleware.go (2)
proxy/metrics_monitor.go (2)
  • MetricsMonitor (33-38)
  • ID (12-21)
proxy/proxymanager_test.go (2)
  • TestProxyManager_MiddlewareWritesMetrics_Streaming (784-817)
  • TestProxyManager_MiddlewareWritesMetrics_NonStreaming (749-782)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: run-tests
🔇 Additional comments (2)
proxy/metrics_middleware.go (2)

98-106: LGTM on metrics emission.

Fields align with TokenMetrics and respect timings override when present.


58-63: Double-check metrics assertion tests

I ran searches for tests asserting TokensPerSecond > 0 and DurationMs > 0, but didn’t find any direct matches in the codebase. To be sure these changes haven’t introduced a regression, please:

  • Run the full test suite and confirm that any existing tests covering the metrics middleware still pass.
  • Verify that streaming responses still yield TokensPerSecond > 0 in their tests.
  • Verify that non-streaming responses still yield TokensPerSecond > 0 in their tests.
  • Confirm tests asserting DurationMs > 0 still behave as expected.

mostlygeek merged commit 57803fd into main on Aug 27, 2025.
3 checks passed
mostlygeek deleted the support-infill branch on October 10, 2025 05:14.