
Conversation


mostlygeek (Owner) commented on Aug 27, 2025

This allows llama-swap to support the /infill endpoint provided by llama-server. The endpoint is very fast for code infilling when used with llama.vscode.

Requests to /infill will also show up on the Activities page:

[screenshot: Activities page listing /infill requests]

Summary by CodeRabbit

  • New Features

    • Added reranking endpoints (/rerank, /reranking, plus existing /v1/rerank, /v1/reranking).
    • Introduced /infill endpoint for code infilling.
    • Enhanced metrics reporting to use timing data when available, improving token counts, throughput, and duration accuracy.
  • Documentation

    • Reorganized endpoint docs to highlight llama-server (llama.cpp) support.
    • Documented new /infill and reranking endpoints under llama-server.
    • Removed OpenAI API reranking entries to reduce confusion.


coderabbitai bot commented Aug 27, 2025

Walkthrough

Adds new routes (/rerank, /reranking, /infill) to the proxy and documents a dedicated llama-server endpoints section in README; refactors metrics middleware to introduce a MetricsRecorder, handle streaming vs non-streaming responses, and parse optional timings fields to override usage-based metrics.

Changes

Cohort / File(s) and summary:

  • Documentation (README.md): Reflowed endpoint docs: removed the OpenAI rerank bullet, added a dedicated "✅ llama-server (llama.cpp) supported endpoints" section listing v1/rerank, v1/reranking, /rerank, and the new /infill (code infilling).
  • Proxy routing (proxy/proxymanager.go): Registers new routes /reranking, /rerank, and /infill (wired to the existing proxyOAIHandler); reorders/aliases the rerank endpoints and updates the embeddings comment to include reranking.
  • Metrics middleware (proxy/metrics_middleware.go): Adds an exported MetricsRecorder type; switches to Content-Type-based streaming detection; splits response handling into processStreamingResponse and processNonStreamingResponse; parses timings JSON (overriding token counts, throughput, and duration) with a fallback to usage and elapsed time; emits metrics via metricsMonitor.addMetrics.
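
For orientation, a minimal sketch of how the new route aliases might be wired up, assuming a gin router (as llama-swap uses) and a placeholder standing in for the existing proxyOAIHandler; this is illustrative, not the actual proxymanager.go code:

package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.New()

	// placeholder for proxyOAIHandler, which forwards the request to the upstream server
	oai := func(c *gin.Context) { c.JSON(http.StatusOK, gin.H{"ok": true}) }

	// embeddings and reranking endpoints, including the new non-/v1 aliases
	for _, p := range []string{"/v1/rerank", "/v1/reranking", "/rerank", "/reranking"} {
		r.POST(p, oai)
	}

	// new: code infilling endpoint provided by llama-server
	r.POST("/infill", oai)

	_ = r.Run(":8080")
}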

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant P as Proxy (proxymanager)
  participant H as proxyOAIHandler
  participant M as MetricsMiddleware / MetricsRecorder
  participant MM as MetricsMonitor

  C->>P: HTTP request (/v1/chat/... | /v1/rerank | /reranking | /infill)
  Note over P: New routes include /rerank, /reranking, /infill
  P->>H: Forward request
  activate H
  H->>M: Response passes through metrics middleware
  activate M
  alt response Content-Type == text/event-stream (streaming)
    M->>M: processStreamingResponse: scan SSE "data:" lines, parse JSON chunks
  else non-streaming
    M->>M: processNonStreamingResponse: read full body JSON
  end
  M->>M: parseAndRecordMetrics: check `timings` ? use timings : use usage + elapsed
  M->>MM: addMetrics(Timestamp, Model, InputTokens, OutputTokens, PPS, TPS, DurationMs)
  deactivate M
  H-->>P: Response
  deactivate H
  P-->>C: Response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes






coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
proxy/metrics_middleware.go (1)

72-101: Auto-detect SSE to record metrics for /infill without stream=true.

Right now, isStreaming is set solely from the request JSON’s stream flag. /infill streams by default and many clients (e.g., llama.vscode) don’t set stream=true, so metrics parsing will treat the SSE body as non-JSON and skip recording. Add a header-based fallback to detect text/event-stream responses.

Apply this diff to keep defaults non-negative:

-	tokensPerSecond := -1.0
-	promptPerSecond := -1.0
+	tokensPerSecond := 0.0
+	promptPerSecond := 0.0

And add SSE detection (outside this hunk) to the writer:

// import "strings"

func (w *MetricsResponseWriter) WriteHeader(statusCode int) {
	// Detect SSE from response headers
	if ct := w.ResponseWriter.Header().Get("Content-Type"); strings.Contains(ct, "text/event-stream") {
		w.metricsRecorder.isStreaming = true
	}
	w.ResponseWriter.WriteHeader(statusCode)
}

func (w *MetricsResponseWriter) Write(b []byte) (int, error) {
	// Late header detection fallback
	if !w.metricsRecorder.isStreaming {
		if ct := w.ResponseWriter.Header().Get("Content-Type"); strings.Contains(ct, "text/event-stream") {
			w.metricsRecorder.isStreaming = true
		}
	}
	n, err := w.ResponseWriter.Write(b)
	if err != nil {
		return n, err
	}
	w.body = append(w.body, b...)
	return n, nil
}

Optionally, also treat path-based defaults as streaming:

// In MetricsMiddleware(...) before creating writer:
isStreaming := gjson.GetBytes(bodyBytes, "stream").Bool()
if strings.HasSuffix(c.Request.URL.Path, "/infill") {
	isStreaming = true
}

If you want, I can open a follow-up PR with these changes and a unit test for /infill SSE.

🧹 Nitpick comments (3)
README.md (2)

23-25: Add missing /reranking alias to match the router.

The proxy registers both /rerank and /reranking, but the README only lists /rerank. Add /reranking for accuracy.

Apply this diff:

-  - `v1/rerank`, `v1/reranking`, `/rerank`
+  - `v1/rerank`, `v1/reranking`, `/rerank`, `/reranking`

25-25: Clarify that requests to /infill must include a model when going through llama-swap.

llama-server’s /infill does not require a model field, but llama-swap needs it to select the upstream model; otherwise the proxy returns 400. A one‑line note here will prevent confusion for users coming from llama.vscode or direct llama.cpp usage. (github.com)

Would you like me to propose a minimal README snippet with an example curl?
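
For reference, a rough sketch of what such a request could look like, shown here as a small Go program; the model name and the exact /infill request fields (input_prefix, input_suffix) are assumptions to verify against the llama-server docs:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// "model" is the llama-swap routing key; the other fields follow llama-server's
	// /infill API as best understood here and should be double-checked against its docs.
	body := []byte(`{
		"model": "qwen2.5-coder",
		"input_prefix": "def add(a, b):\n    return ",
		"input_suffix": "\n\nprint(add(1, 2))\n"
	}`)

	resp, err := http.Post("http://localhost:8080/infill", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}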

proxy/metrics_middleware.go (1)

95-101: Timings override LGTM, minor robustness nit.

Overriding tokens and rates with timings.* is correct for llama-server. Consider guarding against missing timing fields (leave defaults if zero) to avoid misleading zeros when a server omits one of them.
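
For illustration, a rough sketch of such a guard using gjson, as the middleware already does; the variable names mirror the surrounding code but are assumptions here, not copied from it:

// import "github.com/tidwall/gjson"

func overrideRates(jsonData gjson.Result, tokensPerSecond, promptPerSecond *float64) {
	// only override the defaults when the timing field is present and non-zero
	if v := jsonData.Get("timings.predicted_per_second"); v.Exists() && v.Float() > 0 {
		*tokensPerSecond = v.Float()
	}
	if v := jsonData.Get("timings.prompt_per_second"); v.Exists() && v.Float() > 0 {
		*promptPerSecond = v.Float()
	}
}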

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between c55d0cc and 916b210.

📒 Files selected for processing (3)
  • README.md (1 hunks)
  • proxy/metrics_middleware.go (1 hunks)
  • proxy/proxymanager.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
proxy/metrics_middleware.go (2)
proxy/proxymanager_test.go (2)
  • TestProxyManager_MiddlewareWritesMetrics_NonStreaming (749-782)
  • TestProxyManager_MiddlewareWritesMetrics_Streaming (784-817)
proxy/metrics_monitor.go (4)
  • ID (12-21)
  • Metrics (24-26)
  • mu (33-38)
  • mp (54-66)
🔇 Additional comments (3)
proxy/proxymanager.go (2)

194-196: Comment and grouping LGTM.

“Support embeddings and reranking” label matches the new routes; no behavior change.


197-202: Rerank aliases look good.

Registering /rerank and /reranking alongside /v1/rerank and /v1/reranking matches common llama-server aliases.

proxy/metrics_middleware.go (1)

88-91: Usage override LGTM.

Using usage.prompt_tokens and usage.completion_tokens when present is correct and keeps backward compatibility.


coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
proxy/metrics_middleware.go (2)

70-96: TokensPerSecond defaults to -1 without timings; breaks existing tests and dashboards. Add a fallback computation.

When timings are absent, TokensPerSecond remains -1. Tests assert > 0.0 and dashboards expect non-negative values.

 	// timings data
 	tokensPerSecond := -1.0
 	promptPerSecond := -1.0
 	durationMs := int(time.Since(rec.startTime).Milliseconds())
@@
 	if timings.Exists() {
 		inputTokens = int(jsonData.Get("timings.prompt_n").Int())
 		outputTokens = int(jsonData.Get("timings.predicted_n").Int())
 		promptPerSecond = jsonData.Get("timings.prompt_per_second").Float()
 		tokensPerSecond = jsonData.Get("timings.predicted_per_second").Float()
 		durationMs = int(jsonData.Get("timings.prompt_ms").Float() + jsonData.Get("timings.predicted_ms").Float())
-	}
+	} else {
+		// Fallback: derive rates from elapsed time and usage counts
+		secs := float64(durationMs) / 1000.0
+		if secs <= 0 {
+			secs = 0.001 // avoid div-by-zero and keep tests/assertions happy
+		}
+		if outputTokens > 0 {
+			tokensPerSecond = float64(outputTokens) / secs
+		} else {
+			tokensPerSecond = 0.0
+		}
+		if inputTokens > 0 {
+			promptPerSecond = float64(inputTokens) / secs
+		} else {
+			promptPerSecond = 0.0
+		}
+	}

27-27: Typo in user-facing error message ("ready" → "read").

-			pm.sendErrorResponse(c, http.StatusBadRequest, "could not ready request body")
+			pm.sendErrorResponse(c, http.StatusBadRequest, "could not read request body")
🧹 Nitpick comments (5)
proxy/metrics_middleware.go (5)

58-63: Make streaming detection case-insensitive and add a body-based fallback.

Some upstreams don’t set Content-Type reliably; also header values may vary in casing. Fall back to detecting SSE markers in the body to avoid missing metrics for streaming responses.

-		// check for streaming response
-		if strings.Contains(c.Writer.Header().Get("Content-Type"), "text/event-stream") {
-			writer.metricsRecorder.processStreamingResponse(writer.body)
-		} else {
-			writer.metricsRecorder.processNonStreamingResponse(writer.body)
-		}
+		// check for streaming response
+		ct := strings.ToLower(c.Writer.Header().Get("Content-Type"))
+		if strings.Contains(ct, "text/event-stream") ||
+			bytes.Contains(writer.body, []byte("\ndata:")) ||
+			bytes.HasPrefix(bytes.TrimSpace(writer.body), []byte("data:")) {
+			writer.metricsRecorder.processStreamingResponse(writer.body)
+		} else {
+			writer.metricsRecorder.processNonStreamingResponse(writer.body)
+		}

41-43: Minor wording nit: “modelID” → “model ID”.

Improves user-facing message clarity.

-			pm.sendErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("could not find real modelID for %s", requestedModel))
+			pm.sendErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("could not find real model ID for %s", requestedModel))

15-20: Consider keeping MetricsRecorder unexported unless it’s part of the public API.

Reduces package surface; export only if it’s referenced outside proxy.

-type MetricsRecorder struct {
+type metricsRecorder struct {
 	metricsMonitor *MetricsMonitor
 	realModelName  string
 	//	isStreaming    bool
 	startTime time.Time
 }

Note: update the composite literal at Line 49 accordingly.


111-144: SSE parsing is line-based and assumes single-line JSON payloads.

Some servers can split one event across multiple data: lines. Consider accumulating consecutive data: lines into one buffer per event before JSON parse.

I can provide a small helper that rebuilds SSE events from multi-line data: frames if you’d like.
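
For reference, a rough sketch of such a helper (the function name and exact placement are assumptions):

// imports assumed: bufio, bytes, strings
//
// Rebuilds SSE event payloads: consecutive "data:" lines belong to one event and,
// per the SSE format, are joined with newlines; a blank line terminates the event.
// Each returned slice is one event payload ready for JSON parsing.
func splitSSEEvents(body []byte) [][]byte {
	var events [][]byte
	var lines []string

	flush := func() {
		if len(lines) > 0 {
			events = append(events, []byte(strings.Join(lines, "\n")))
			lines = nil
		}
	}

	scanner := bufio.NewScanner(bytes.NewReader(body))
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, "data:"):
			lines = append(lines, strings.TrimSpace(strings.TrimPrefix(line, "data:")))
		case line == "":
			flush() // blank line ends the current event
		}
	}
	flush() // final event if the stream did not end with a blank line
	return events
}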


164-171: Bound buffered response size to avoid excessive memory use on long streams.

We only need the tail containing metrics; bound to last N bytes (e.g., 64 KiB).

 	n, err := w.ResponseWriter.Write(b)
@@
-	w.body = append(w.body, b...)
+	w.body = append(w.body, b...)
+	// keep only the last 64 KiB to limit memory usage
+	const maxBuf = 64 << 10
+	if len(w.body) > maxBuf {
+		w.body = w.body[len(w.body)-maxBuf:]
+	}
 	return n, nil
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 916b210 and 8d1a069.

📒 Files selected for processing (1)
  • proxy/metrics_middleware.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
proxy/metrics_middleware.go (2)
proxy/metrics_monitor.go (2)
  • MetricsMonitor (33-38)
  • ID (12-21)
proxy/proxymanager_test.go (2)
  • TestProxyManager_MiddlewareWritesMetrics_Streaming (784-817)
  • TestProxyManager_MiddlewareWritesMetrics_NonStreaming (749-782)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: run-tests
🔇 Additional comments (2)
proxy/metrics_middleware.go (2)

98-106: LGTM on metrics emission.

Fields align with TokenMetrics and respect timings override when present.


58-63: Double-check metrics assertion tests

I ran searches for tests asserting TokensPerSecond > 0 and DurationMs > 0, but didn’t find any direct matches in the codebase. To be sure these changes haven’t introduced a regression, please:

  • Run the full test suite and confirm that any existing tests covering the metrics middleware still pass.
  • Verify that streaming responses still yield TokensPerSecond > 0 in their tests.
  • Verify that non-streaming responses still yield TokensPerSecond > 0 in their tests.
  • Confirm tests asserting DurationMs > 0 still behave as expected.

mostlygeek merged commit 57803fd into main on Aug 27, 2025.
3 checks passed
mostlygeek deleted the support-infill branch on October 10, 2025 05:14.