UPSTREAM PR #18228: server: add auto-sleep after N seconds of idle #640
Closed
Mirrored from ggml-org/llama.cpp#18228
Sleeping on Idle
The server supports an automatic sleep mode that activates after a specified period of inactivity (no incoming tasks). This feature, introduced in PR #18228, can be enabled using the --sleep-idle-seconds command-line argument. It works seamlessly in both single-model and multi-model configurations.

When the server enters sleep mode, the model and its associated memory (including the KV cache) are unloaded from RAM to conserve resources. Any new incoming task will automatically trigger the model to reload.

Note that the following endpoints are exempt from being considered as incoming tasks. They do not trigger model reloading and do not reset the idle timer:
- GET /health
- GET /props

Implementation
The implementation of this feature consists of 3 main parts:
- server_queue sleeping state
- server_context sleeping state
- server_res_generator hook

The main loop inside server_queue acts as a watchdog timer (so we can avoid spawning a dedicated thread just for the watchdog). Once the idle timeout elapses, it signals server_context to unload the model.

server_res_generator hooks into every incoming request and asks server_queue to resume if it is in the sleeping state. Note that some requests, such as /health, bypass this check (they can only access read-only data of server_context).

When asked to resume, server_queue signals server_context to reload the model, then unblocks server_res_generator so it can proceed with the rest of the request.