
UPSTREAM PR #18110: server: (router) allow child process to report status via stdout#595

Open
loci-dev wants to merge 1 commit into main from upstream-PR18110-branch_ngxson-xsn/router_cmd_stdout

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18110

If the router listens on an address other than 127.0.0.1, the child process will fail to report its status back to the router.

This change replaces the HTTP-based reporting mechanism entirely, using a pipe (stdout) instead.

@loci-review

loci-review bot commented Dec 16, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #595

Overview

This PR refactors the router-child IPC mechanism from HTTP-based to stdout pipe-based communication. Analysis shows no performance impact on inference paths. The changes affect only the model loading initialization sequence, which occurs once per model instance startup.

Key Findings

Inference Performance Impact: None

No changes detected in tokenization or inference functions. The following critical functions remain unmodified:

  • llama_decode
  • llama_encode
  • llama_tokenize
  • ggml_mul_mat
  • llama_build_graph

Tokens per second: No impact expected. The refactored code executes only during model initialization, not during token generation. Request proxying and inference hot paths are unchanged.

Startup Performance Improvement

The modified server_models::load() and setup_child_server() functions show reduced latency:

  • Eliminated HTTP client instantiation overhead (approximately 1,000,000 ns, i.e. ~1 ms, per model load)
  • Removed JSON serialization and HTTP POST request (approximately 2,000,000 ns, ~2 ms, per operation)
  • Replaced with a stdout write operation (approximately 100,000 ns, ~0.1 ms)

Net improvement: approximately 2,900,000 ns (~2.9 ms) per child process startup.

Power Consumption Analysis

All analyzed binaries show negligible change:

  • build.bin.libllama.so: 0 nJ change (186068 nJ baseline)
  • build.bin.llama-run: 0 nJ change (222960 nJ baseline)
  • build.bin.llama-cvector-generator: -1 nJ change (255554 nJ baseline)
  • build.bin.llama-tts: 0 nJ change (259957 nJ baseline)

Remaining 12 binaries show 0 nJ change. Total power consumption difference across all binaries: -1 nJ (negligible).

Modified Functions

The changes affect non-inference code paths:

  • server_models::load(): Adds stdout parsing logic with strstr() overhead (approximately 2000 ns per log line, one-time during startup)
  • server_models::setup_child_server(): Replaces HTTP POST with stdout write
  • Removed post_router_models_status HTTP endpoint handler

Code Changes

The PR implements a protocol change for status reporting between router and child processes. The modification eliminates network stack usage for local IPC, replacing it with direct pipe communication. This addresses a configuration bug where routers listening on non-localhost addresses prevented child status reporting.

@loci-dev loci-dev force-pushed the main branch 27 times, most recently from e02e9be to 9f1f66d on December 19, 2025 at 11:08
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 006b713 to 51e2c27 on December 25, 2025 at 04:21