Support llama-server's /infill endpoint #272
Conversation
Walkthrough

Adds new routes (/rerank, /reranking, /infill) to the proxy and documents a dedicated llama-server endpoints section in the README; refactors the metrics middleware to introduce a MetricsRecorder, handle streaming vs. non-streaming responses, and parse optional timings data.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant C as Client
participant P as Proxy (proxymanager)
participant H as proxyOAIHandler
participant M as MetricsMiddleware / MetricsRecorder
participant MM as MetricsMonitor
C->>P: HTTP request (/v1/chat/... | /v1/rerank | /reranking | /infill)
Note over P: New routes include /rerank, /reranking, /infill
P->>H: Forward request
activate H
H->>M: Response passes through metrics middleware
activate M
alt response Content-Type == text/event-stream (streaming)
M->>M: processStreamingResponse: scan SSE "data:" lines, parse JSON chunks
else non-streaming
M->>M: processNonStreamingResponse: read full body JSON
end
M->>M: parseAndRecordMetrics: check `timings` ? use timings : use usage + elapsed
M->>MM: addMetrics(Timestamp, Model, InputTokens, OutputTokens, PPS, TPS, DurationMs)
deactivate M
H-->>P: Response
deactivate H
P-->>C: Response
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
proxy/metrics_middleware.go (1)
72-101: Auto-detect SSE to record metrics for /infill without stream=true.

Right now, isStreaming is set solely from the request JSON's stream flag. /infill streams by default and many clients (e.g., llama.vscode) don't set stream=true, so metrics parsing will treat the SSE body as non-JSON and skip recording. Add a header-based fallback to detect text/event-stream responses.

Apply this diff to keep defaults non-negative:

```diff
-	tokensPerSecond := -1.0
-	promptPerSecond := -1.0
+	tokensPerSecond := 0.0
+	promptPerSecond := 0.0
```

And add SSE detection (outside this hunk) to the writer:

```go
// import "strings"

func (w *MetricsResponseWriter) WriteHeader(statusCode int) {
	// Detect SSE from response headers
	if ct := w.ResponseWriter.Header().Get("Content-Type"); strings.Contains(ct, "text/event-stream") {
		w.metricsRecorder.isStreaming = true
	}
	w.ResponseWriter.WriteHeader(statusCode)
}

func (w *MetricsResponseWriter) Write(b []byte) (int, error) {
	// Late header detection fallback
	if !w.metricsRecorder.isStreaming {
		if ct := w.ResponseWriter.Header().Get("Content-Type"); strings.Contains(ct, "text/event-stream") {
			w.metricsRecorder.isStreaming = true
		}
	}
	n, err := w.ResponseWriter.Write(b)
	if err != nil {
		return n, err
	}
	w.body = append(w.body, b...)
	return n, nil
}
```

Optionally, also treat path-based defaults as streaming:

```go
// In MetricsMiddleware(...) before creating writer:
isStreaming := gjson.GetBytes(bodyBytes, "stream").Bool()
if strings.HasSuffix(c.Request.URL.Path, "/infill") {
	isStreaming = true
}
```

If you want, I can open a follow-up PR with these changes and a unit test for /infill SSE.
🧹 Nitpick comments (3)
README.md (2)
23-25: Add missing /reranking alias to match the router.

The proxy registers both /rerank and /reranking, but the README only lists /rerank. Add /reranking for accuracy.

Apply this diff:

```diff
- - `v1/rerank`, `v1/reranking`, `/rerank`
+ - `v1/rerank`, `v1/reranking`, `/rerank`, `/reranking`
```
25-25: Clarify that requests to /infill must include a model when going through llama-swap.

llama-server's /infill does not require a model field, but llama-swap needs it to select the upstream model; otherwise the proxy returns 400. A one-line note here will prevent confusion for users coming from llama.vscode or direct llama.cpp usage. (github.com)

Would you like me to propose a minimal README snippet with an example curl?
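As a rough illustration of the point, a minimal sketch of an /infill request sent through llama-swap; the listen address and model name are placeholders, and the input_prefix/input_suffix field names follow llama-server's /infill request format as commonly documented, so double-check them against your llama-server version:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// The "model" field is what llama-swap uses to pick the upstream instance;
	// llama-server on its own would accept the request without it.
	body := []byte(`{
		"model": "qwen2.5-coder",
		"input_prefix": "def add(a, b):\n    ",
		"input_suffix": "\n"
	}`)

	resp, err := http.Post("http://localhost:8080/infill", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```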
proxy/metrics_middleware.go (1)
95-101: Timings override LGTM, minor robustness nit.

Overriding tokens and rates with timings.* is correct for llama-server. Consider guarding against missing timing fields (leave defaults if zero) to avoid misleading zeros when a server omits one of them.
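A minimal sketch of one way to guard those reads, reusing the gjson result names that appear in the diffs in this review (jsonData, timings); this is illustrative, not the PR's actual code:

```go
// Only override the defaults when the individual timing fields exist,
// so a partially populated timings object doesn't produce misleading zeros.
// (jsonData and timings are gjson.Result values, as elsewhere in this file.)
if timings.Exists() {
	if v := jsonData.Get("timings.prompt_n"); v.Exists() {
		inputTokens = int(v.Int())
	}
	if v := jsonData.Get("timings.predicted_n"); v.Exists() {
		outputTokens = int(v.Int())
	}
	if v := jsonData.Get("timings.prompt_per_second"); v.Exists() {
		promptPerSecond = v.Float()
	}
	if v := jsonData.Get("timings.predicted_per_second"); v.Exists() {
		tokensPerSecond = v.Float()
	}
}
```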
📒 Files selected for processing (3)
- README.md (1 hunks)
- proxy/metrics_middleware.go (1 hunks)
- proxy/proxymanager.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
proxy/metrics_middleware.go (2)
- proxy/proxymanager_test.go (2)
  - TestProxyManager_MiddlewareWritesMetrics_NonStreaming (749-782)
  - TestProxyManager_MiddlewareWritesMetrics_Streaming (784-817)
- proxy/metrics_monitor.go (4)
  - ID (12-21)
  - Metrics (24-26)
  - mu (33-38)
  - mp (54-66)
🔇 Additional comments (3)
proxy/proxymanager.go (2)
194-196: Comment and grouping LGTM.

The "Support embeddings and reranking" label matches the new routes; no behavior change.
197-202: Rerank aliases look good.

Registering /rerank and /reranking alongside /v1/rerank and /v1/reranking matches common llama-server aliases.

proxy/metrics_middleware.go (1)
88-91: Usage override LGTM.

Using usage.prompt_tokens and usage.completion_tokens when present is correct and keeps backward compatibility.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
proxy/metrics_middleware.go (2)
70-96: TokensPerSecond defaults to -1 without timings; breaks existing tests and dashboards. Add a fallback computation.

When timings are absent, TokensPerSecond remains -1. Tests assert > 0.0 and dashboards expect non-negative values.

```diff
 	// timings data
 	tokensPerSecond := -1.0
 	promptPerSecond := -1.0
 	durationMs := int(time.Since(rec.startTime).Milliseconds())
@@
 	if timings.Exists() {
 		inputTokens = int(jsonData.Get("timings.prompt_n").Int())
 		outputTokens = int(jsonData.Get("timings.predicted_n").Int())
 		promptPerSecond = jsonData.Get("timings.prompt_per_second").Float()
 		tokensPerSecond = jsonData.Get("timings.predicted_per_second").Float()
 		durationMs = int(jsonData.Get("timings.prompt_ms").Float() + jsonData.Get("timings.predicted_ms").Float())
-	}
+	} else {
+		// Fallback: derive rates from elapsed time and usage counts
+		secs := float64(durationMs) / 1000.0
+		if secs <= 0 {
+			secs = 0.001 // avoid div-by-zero and keep tests/assertions happy
+		}
+		if outputTokens > 0 {
+			tokensPerSecond = float64(outputTokens) / secs
+		} else {
+			tokensPerSecond = 0.0
+		}
+		if inputTokens > 0 {
+			promptPerSecond = float64(inputTokens) / secs
+		} else {
+			promptPerSecond = 0.0
+		}
+	}
```
27-27: Typo in user-facing error message ("ready" → "read").

```diff
- pm.sendErrorResponse(c, http.StatusBadRequest, "could not ready request body")
+ pm.sendErrorResponse(c, http.StatusBadRequest, "could not read request body")
```
🧹 Nitpick comments (5)
proxy/metrics_middleware.go (5)
58-63: Make streaming detection case-insensitive and add a body-based fallback.

Some upstreams don't set Content-Type reliably; also header values may vary in casing. Fall back to detecting SSE markers in the body to avoid missing metrics for streaming responses.

```diff
-	// check for streaming response
-	if strings.Contains(c.Writer.Header().Get("Content-Type"), "text/event-stream") {
-		writer.metricsRecorder.processStreamingResponse(writer.body)
-	} else {
-		writer.metricsRecorder.processNonStreamingResponse(writer.body)
-	}
+	// check for streaming response
+	ct := strings.ToLower(c.Writer.Header().Get("Content-Type"))
+	if strings.Contains(ct, "text/event-stream") ||
+		bytes.Contains(writer.body, []byte("\ndata:")) ||
+		bytes.HasPrefix(bytes.TrimSpace(writer.body), []byte("data:")) {
+		writer.metricsRecorder.processStreamingResponse(writer.body)
+	} else {
+		writer.metricsRecorder.processNonStreamingResponse(writer.body)
+	}
```
41-43: Minor wording nit: "modelID" → "model ID".

Improves user-facing message clarity.

```diff
- pm.sendErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("could not find real modelID for %s", requestedModel))
+ pm.sendErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("could not find real model ID for %s", requestedModel))
```
15-20: Consider keeping MetricsRecorder unexported unless it's part of the public API.

Reduces package surface; export only if it's referenced outside proxy.

```diff
-type MetricsRecorder struct {
+type metricsRecorder struct {
 	metricsMonitor *MetricsMonitor
 	realModelName  string //
 	isStreaming    bool
 	startTime      time.Time
 }
```

Note: update the composite literal at Line 49 accordingly.
111-144: SSE parsing is line-based and assumes single-line JSON payloads.

Some servers can split one event across multiple data: lines. Consider accumulating consecutive data: lines into one buffer per event before the JSON parse.
I can provide a small helper that rebuilds SSE events from multi-line data: frames if you’d like.
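For illustration, a minimal sketch of such a helper, independent of the middleware's actual types; the function name and the choice to join frames with newlines are assumptions, following the SSE convention:

```go
// import "strings"

// assembleSSEEvents splits a raw SSE body into events (separated by blank
// lines) and joins the payloads of consecutive "data:" lines within each
// event, so a JSON chunk split across several data: lines parses as one unit.
func assembleSSEEvents(body []byte) []string {
	var events []string
	for _, event := range strings.Split(string(body), "\n\n") {
		var parts []string
		for _, line := range strings.Split(event, "\n") {
			if strings.HasPrefix(line, "data:") {
				parts = append(parts, strings.TrimSpace(strings.TrimPrefix(line, "data:")))
			}
		}
		if len(parts) > 0 {
			events = append(events, strings.Join(parts, "\n"))
		}
	}
	return events
}
```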
164-171: Bound buffered response size to avoid excessive memory use on long streams.

We only need the tail containing metrics; bound to the last N bytes (e.g., 64 KiB).

```diff
 	n, err := w.ResponseWriter.Write(b)
@@
-	w.body = append(w.body, b...)
+	w.body = append(w.body, b...)
+	// keep only the last 64 KiB to limit memory usage
+	const maxBuf = 64 << 10
+	if len(w.body) > maxBuf {
+		w.body = w.body[len(w.body)-maxBuf:]
+	}
 	return n, nil
```
📒 Files selected for processing (1)
- proxy/metrics_middleware.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
proxy/metrics_middleware.go (2)
- proxy/metrics_monitor.go (2)
  - MetricsMonitor (33-38)
  - ID (12-21)
- proxy/proxymanager_test.go (2)
  - TestProxyManager_MiddlewareWritesMetrics_Streaming (784-817)
  - TestProxyManager_MiddlewareWritesMetrics_NonStreaming (749-782)
🔇 Additional comments (2)
proxy/metrics_middleware.go (2)
98-106: LGTM on metrics emission.

Fields align with TokenMetrics and respect the timings override when present.
58-63: Double-check metrics assertion tests.

I ran searches for tests asserting TokensPerSecond > 0 and DurationMs > 0, but didn't find any direct matches in the codebase. To be sure these changes haven't introduced a regression, please (a sketch of the assertion shape follows the list):

- Run the full test suite and confirm that any existing tests covering the metrics middleware still pass.
- Verify that streaming responses still yield TokensPerSecond > 0 in their tests.
- Verify that non-streaming responses still yield TokensPerSecond > 0 in their tests.
- Confirm that tests asserting DurationMs > 0 still behave as expected.
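For reference, a minimal sketch of the kind of assertion in question; the helper itself is hypothetical (the actual tests are TestProxyManager_MiddlewareWritesMetrics_Streaming / _NonStreaming in proxy/proxymanager_test.go), and the TokenMetrics field names follow the addMetrics call described above:

```go
// import "testing"

// assertPositiveRates is a hypothetical helper illustrating the assertions
// discussed here; it is not part of the repository's test code.
func assertPositiveRates(t *testing.T, m TokenMetrics) {
	t.Helper()
	if m.TokensPerSecond <= 0 {
		t.Errorf("expected TokensPerSecond > 0, got %f", m.TokensPerSecond)
	}
	if m.DurationMs <= 0 {
		t.Errorf("expected DurationMs > 0, got %d", m.DurationMs)
	}
}
```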
This allows using llama-swap to support the /infill endpoint that is part of llama-server. This endpoint is very fast for code infilling when used with llama.vscode.

Requests to /infill will also show up on the Activities page:

Summary by CodeRabbit
New Features
Documentation