server: add --stop-idle-seconds for router mode #19380

Closed
leonardcser wants to merge 2 commits into ggml-org:master from leonardcser:unload-idle-seconds

leonardcser commented Feb 5, 2026

Summary

  • Adds --stop-idle-seconds CLI flag that fully terminates idle model subprocesses in router mode after N seconds of inactivity
  • On next request, ensure_model_loaded() re-spawns the process automatically (when --models-autoload is enabled)
  • Unlike --sleep-idle-seconds, which frees VRAM/RAM while keeping the child process running, this flag kills the subprocess entirely
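
To illustrate the behavior described above, here is a minimal Python sketch of the idle-stop idea. It is not the code in this PR: the class name `IdleStopper` and the `spawn` and `stop` callbacks are hypothetical stand-ins for whatever the router uses to start and terminate a model subprocess, and the sketch assumes autoload is enabled so that any request re-spawns a stopped model.

```python
import time
from typing import Callable, Dict, List


class IdleStopper:
    """Tracks last-use times and stops model processes that stay idle.

    Hypothetical sketch: `spawn` and `stop` are placeholders supplied by the
    caller, not functions from the actual server code.
    """

    def __init__(
        self,
        stop_idle_seconds: float,
        spawn: Callable[[str], None],
        stop: Callable[[str], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.stop_idle_seconds = stop_idle_seconds
        self.spawn = spawn
        self.stop = stop
        self.clock = clock
        # Only running models have an entry here.
        self.last_used: Dict[str, float] = {}

    def ensure_loaded(self, name: str) -> None:
        """Call on every request: re-spawn if stopped, then record activity."""
        if name not in self.last_used:
            self.spawn(name)
        self.last_used[name] = self.clock()

    def reap_idle(self) -> List[str]:
        """Stop every running model that has been idle for the full timeout."""
        now = self.clock()
        idle = [
            name
            for name, last in self.last_used.items()
            if now - last >= self.stop_idle_seconds
        ]
        for name in idle:
            self.stop(name)
            del self.last_used[name]
        return idle
```

A periodic timer would call `reap_idle()`, and the request path would call `ensure_loaded()` before forwarding. The clock is injectable so the timeout can be exercised without real waiting.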

Refs #19379, follow-up to #18189

Test plan

  • Build and verify --stop-idle-seconds appears in --help
  • Run router mode with --stop-idle-seconds 5 --models-dir <dir>, send a request, wait >5s, verify process is terminated
  • Send another request and verify the model re-spawns via autoload
  • Verify existing router tests still pass
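
The "wait >5s, verify process is terminated" step can be flaky if it relies on a fixed sleep. One way to script it is a small polling helper such as the sketch below. The helper name `wait_until` is hypothetical and not part of the existing test suite; the check passed to it (for example, whether the child process is still listed) is left to the caller.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float,
    interval: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False if the deadline passed.
    The predicate is always evaluated at least once.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With --stop-idle-seconds 5, a call such as `wait_until(process_is_gone, timeout=10)` would cover the idle window plus some slack, where `process_is_gone` is a caller-supplied check.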

leonardcser marked this pull request as ready for review February 5, 2026 23:35
leonardcser changed the title from "server: add --unload-idle-seconds for router mode" to "server: add --stop-idle-seconds for router mode" Feb 5, 2026
github-actions bot added the python (python script changes) label Feb 5, 2026
ngxson (Contributor) commented Feb 6, 2026

I disagree with this feature; it sounds redundant to me.

The auto-sleep functionality added in #18228 should already allow model weights to be unloaded after a timeout, while also allowing static endpoints like /props and /models to be accessed without waking up the server. The current PR is missing that, and adding it would just duplicate a lot of code.

If the server instance needs to be unloaded, it's up to the downstream application to decide when to unload it.

ngxson closed this Feb 6, 2026
leonardcser (Author) commented

Thanks for the feedback.

apunkt commented Mar 13, 2026

> If the server instance needs to be unloaded, it's up to the downstream application to decide when to unload it.

Agreed. However, unloading from the downstream app currently still leaves the process on the GPU, as mentioned in #19379.

Labels: examples, python, server
