server: improve slots scheduling for n_cmpl #18789

Merged
ngxson merged 13 commits into ggml-org:master from ngxson:xsn/n_cmpl_sync_barrier
Jan 15, 2026

Conversation

@ngxson
Contributor

@ngxson ngxson commented Jan 12, 2026

Ref: #18663 (comment)

This PR introduces a scheduling mechanism inspired by a thread barrier, which allows launching n_cmpl slots at the same time.

I tested with repeated requests to /v1/completions using the following payload:

{
    "prompt": "I believe the meaning of life is",
    "stream": false,
    "n": 3,
    "n_predict": 100,
    "id_slot": 0
}

And so far it works correctly.
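The barrier idea can be sketched roughly as follows. This is a minimal illustration with made-up names (slot_t, try_launch_together), not the actual server code: a parent task with n child completions only starts once all the required slots can be acquired together, so every sequence begins decoding in the same batch.

```cpp
#include <cassert>
#include <vector>

struct slot_t {
    bool busy = false;
};

// Try to acquire `n` free slots atomically; on failure nothing is taken,
// and the caller defers the whole task (the "barrier" behavior).
static bool try_launch_together(std::vector<slot_t> & slots, int n) {
    std::vector<slot_t *> free_slots;
    for (auto & s : slots) {
        if (!s.busy) {
            free_slots.push_back(&s);
        }
        if ((int) free_slots.size() == n) {
            break;
        }
    }
    if ((int) free_slots.size() < n) {
        return false; // not enough slots, defer the whole task
    }
    for (auto * s : free_slots) {
        s->busy = true; // all-or-nothing: mark only when all are available
    }
    return true;
}
```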

@ngxson ngxson requested a review from ggerganov as a code owner January 12, 2026 17:17
@github-actions github-actions bot added examples python python script changes server labels Jan 12, 2026
@ggerganov
Member

I found a failure case with this sample script, using the same server as in the previous comment:

#!/bin/bash
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 3, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 2, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 1, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 3, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 2, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 1, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &

The server gets stuck with several waiting tasks and does not proceed further:

0.02.599.299 D srv  update_slots: run slots completed
0.02.599.300 D que    start_loop: waiting for new tasks
0.02.599.300 D que    start_loop: processing new tasks
0.02.599.311 D que    start_loop: processing task, id = 19
0.02.599.312 D srv  process_sing: failed to reserve 3 slots, defer task, id_task = 19
0.02.599.312 D que         defer: defer task, id = 19
0.02.599.313 D que    start_loop: processing task, id = 16
0.02.599.313 D srv  process_sing: requested slot is reserved for another task (task_id_next = 4), defer task, id_task = 16
0.02.599.313 D que         defer: defer task, id = 16
0.02.599.314 D que    start_loop: processing task, id = 13
0.02.599.314 D srv  process_sing: requested slot is reserved for another task (task_id_next = 4), defer task, id_task = 13
0.02.599.314 D que         defer: defer task, id = 13
0.02.599.315 D que    start_loop: processing task, id = 34
0.02.599.315 D que    start_loop: update slots
0.02.599.315 I srv  update_slots: all slots are idle
0.02.599.315 D que    start_loop: waiting for new tasks
0.02.599.337 D srv  update_chat_: Parsing chat message:  and welcome to our
0.02.599.338 D Parsing input with format Content-only:  and welcome to our
0.02.599.360 D Parsed message: {"role":"assistant","content":" and welcome to our"}
0.02.599.483 D srv          stop: all tasks already finished, no need to cancel
0.02.599.490 D res  remove_waiti: remove task 11 from waiting list. current waiting = 24 (before remove)
0.02.599.490 D res  remove_waiti: remove task 10 from waiting list. current waiting = 23 (before remove)
0.02.599.491 D res  remove_waiti: remove task 8 from waiting list. current waiting = 22 (before remove)
0.02.599.491 D srv          stop: all tasks already finished, no need to cancel

@ggerganov
Member

Appears to be working correctly with the latest version. Looking at the implementation.

@ngxson
Contributor Author

ngxson commented Jan 13, 2026

I found the root cause: in the case where a task reserves a specific slot and the task is deferred, we must pop the exact task that is reserved for that specific slot.

The old logic only popped a single task at the front of queue_tasks_deferred when a slot was released. Say we have n_slots slots: if the reserved task sits at the (n_slots + 1)-th position in queue_tasks_deferred, there is a chance it will never be popped.

An alternative is to pop all tasks from queue_tasks_deferred whenever a slot is freed, but I think that would perform worse. Since only a single slot is freed when pop_deferred_task is called, in theory only one deferred task can proceed anyway.
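The fix described above, popping the exact deferred task that a freed slot is reserved for, could look roughly like this. The names (pop_deferred_by_id, task_t) are illustrative, not the real server_queue code:

```cpp
#include <cassert>
#include <deque>

struct task_t {
    int id;
};

// Scan the deferred queue for the task reserved for the freed slot,
// instead of only inspecting the front of the queue.
static bool pop_deferred_by_id(std::deque<task_t> & deferred, int id, task_t & out) {
    for (auto it = deferred.begin(); it != deferred.end(); ++it) {
        if (it->id == id) {
            out = *it;          // found the task reserved for this slot
            deferred.erase(it); // remove it regardless of its position
            return true;
        }
    }
    return false; // the reserved task is not in the deferred queue
}
```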

@ngxson
Contributor Author

ngxson commented Jan 13, 2026

Yet another alternative fix: instead of storing slot.task_id_next as a task ID, we can store a std::unique_ptr<const server_task> task_next, so the task won't be deferred at all. task_next will be launched whenever the current slot is released.

I'll see if this is actually better or not.
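That alternative could look roughly like this. The types here (slot_s, server_task_s) are simplified stand-ins, not the real server_slot / server_task: the slot owns the next task directly, so releasing it can immediately promote the queued task instead of going through the deferred queue.

```cpp
#include <cassert>
#include <memory>

struct server_task_s {
    int id;
};

struct slot_s {
    std::unique_ptr<const server_task_s> task;      // currently running task
    std::unique_ptr<const server_task_s> task_next; // queued follow-up task

    // Releasing the slot promotes task_next to task (or clears it if none).
    void release() {
        task = std::move(task_next);
    }
};
```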

break;
}
}
unreserve_slots(task.id_target);
Member

Can we unreserve automatically inside slot.release()? To avoid forgetting an unreserve in some branch of the logic.

Contributor Author

I don't think that's possible, because this implementation relies on the assumption that released slots preserve the next task ID, so that the scheduler can launch that specific next task on that freed slot.

But your point does make the idea of having next_task more appealing. In that case, slot.release() would move next_task to task, and the scheduler would only be responsible for making sure all linked tasks start at the same time.

Comment on lines +1792 to +1816
if (task.is_parent()) {
    // if this is a parent task, we want to make sure parent + all child tasks can be launched at the same time

    // the current slot must be either reserved for this task, or free (checked above)
    GGML_ASSERT(slot->task_id_next == -1 || slot->task_id_next == task.id);
    slot->task_id_next = task.id;

    // need to reserve n_children more slots
    if (try_reserve_child_slots(*slot, task.n_children, task.id)) {
        // all required slots have been reserved, safe to proceed
        int task_id = task.id;
        if (!launch_slots_with_child_tasks(*slot, std::move(task))) {
            SRV_ERR("failed to launch slots with child tasks, id_task = %d\n", task_id);
            // task must be dropped on error
            break;
        }
        break;
    } else {
        // failed to reserve all required slots, we defer this task for processing later
        SRV_DBG("failed to reserve %d slots, defer task, id_task = %d\n", task.n_children + 1, task.id);
        queue_tasks.defer(std::move(task));
        // note: the current slot + child slots are already reserved for this task
        break;
    }
}
Member

@ggerganov ggerganov Jan 13, 2026


I am wondering if we can significantly simplify the "reserve" logic by containing it fully within this block. Something like:

// pseudo code
if (task.is_parent()) {
    // note `try_reserve()` does not need to mutate the slots
    auto reserve_info = try_reserve(slot, task);

    if (reserve_info.ok()) {
        launch_task_with_children(slot, std::move(task), reserve_info);
    } else {
        defer(std::move(task));
    }

    // could be moved in the destructor of `reserve_info` too
    // also, this might just be a noop in this approach since we didn't mutate the slots
    unreserve(reserve_info);

    break;
}

I feel like there is no need to maintain the reservation state of the slots beyond attempting to launch the parent and the children.
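A minimal sketch of what a non-mutating try_reserve() could look like, under the assumptions of the pseudocode above (the names slot2_t, reserve_info_t, try_reserve are illustrative, not the actual implementation): the function only inspects the slots and returns which ones could be used, so nothing is marked and there is no reservation state to undo.

```cpp
#include <cassert>
#include <vector>

struct slot2_t {
    bool busy = false;
};

struct reserve_info_t {
    std::vector<int> idxs; // indices of free slots found
    bool ok(int n) const { return (int) idxs.size() >= n; }
};

// Inspect-only reservation check: the slots are taken by const reference,
// so "unreserving" afterwards is naturally a no-op.
static reserve_info_t try_reserve(const std::vector<slot2_t> & slots, int n) {
    reserve_info_t info;
    for (int i = 0; i < (int) slots.size() && (int) info.idxs.size() < n; ++i) {
        if (!slots[i].busy) {
            info.idxs.push_back(i);
        }
    }
    return info;
}
```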

Contributor Author

@ngxson ngxson Jan 13, 2026


I thought about this approach at first, but it won't work well if the server receives a large number of mixed n_cmpl and normal tasks.

Because a normal completion task only needs one slot, it will always be launched as soon as a slot is free. But n_cmpl requires multiple slots, so those tasks may end up being deferred until all normal tasks are done. Eventually, I think we still need some way to tell the scheduler that "this slot must be paused until this (parent) task is launched".

Also, if multiple tasks (either n_cmpl or normal) want to use the same slot, having a notion of a reserved slot could also ensure the slot is not assigned to a random task, which could cause the cache to be cleared.

Member

@ggerganov ggerganov Jan 14, 2026


Because normal completion task only need one slot, it will always be launched when a slot is free. But n_cmpl requires multiple slots, so they may end up being deferred until all normal tasks are done.

This is true, but it would mainly happen in cases where the llama-server is not configured appropriately for its clients. Even with the reservation logic, if a long single-slot generation task starts processing, then all incoming n_cmpl == n_slot requests would stall until the long single-slot task finishes. This rather means that the incoming requests are not directed to a server properly configured to handle them efficiently, i.e. it's a problem of the application config. Either allocate more slots, or prevent flooding the server with tasks that fight with each other for slots/context.

I am a bit worried that the reservation logic is too complex for what it achieves, so I prefer to simplify. Since this functionality has almost no usage atm, it would be better to keep it simple. I want to integrate it with parallel FIM in llama.vscode and llama.vim and start using it. With time, if we see that this is not sufficient or if more use cases appear, we can extend with reservation. WDYT?

Contributor Author

hmm ok, makes sense then. I'll update the PR with the simpler approach

Contributor Author

I implemented that in the last commit. I tested it with your script, plus some non-n_cmpl tasks inserted in the middle, and it still works.

Member

@ggerganov ggerganov left a comment

Thanks, working well on my end.

Comment on lines +1751 to +1760
if (task.is_parent()) {
    int res = launch_slots_with_parent_task(*slot, std::move(task));
    if (res == 2) {
        SRV_ERR("failed to launch slots with parent task, id_task = %d\n", id_task);
        break; // drop the task
    } else if (res == 1) {
        SRV_DBG("not enough slots, defer task, id_task = %d\n", id_task);
        queue_tasks.defer(std::move(task));
        break;
    }
Member

The double std::move(task) looks a bit weird, but it works. We can decompose the launch_slots_with_parent_task() function into try(const server_task & task) and launch(server_task && task) to make it cleaner.
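A minimal sketch of the suggested split, with hypothetical names (task3_t, can_launch, launch, schedule) standing in for the real code: the check takes a const reference and cannot invalidate the task, so the task is moved exactly once, and only on the path that actually consumes it.

```cpp
#include <cassert>
#include <utility>

struct task3_t {
    int id;
    int n_children;
};

static int free_count = 2; // stand-in for the number of free slots

// Non-consuming check: takes a const reference, leaves the task untouched.
static bool can_launch(const task3_t & task) {
    return task.n_children + 1 <= free_count;
}

// Consuming launch: only called after the check succeeded.
static void launch(task3_t && task) {
    free_count -= task.n_children + 1;
}

static bool schedule(task3_t && task) {
    if (!can_launch(task)) {
        return false; // caller defers; `task` was never moved from
    }
    launch(std::move(task)); // single move, on the success path only
    return true;
}
```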

Contributor Author

ah yeah, you're right: task can potentially be invalidated at queue_tasks.defer(std::move(task)) because it was moved into launch_slots_with_parent_task earlier. I fixed it in the last commit by breaking it into 2 steps:

  • get_free_slots: get N free slots for the child tasks
  • launch_slots_with_parent_task: launch the parent task + child tasks
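The two steps above could be sketched roughly like this, using simplified stand-in types (cslot_t, ctask_t) rather than the actual server code: step 1 collects free slots without consuming the task; step 2 takes ownership only if step 1 found enough of them.

```cpp
#include <cassert>
#include <vector>

struct cslot_t {
    bool busy = false;
};

// Step 1: collect up to `n` free slots; does not touch the task at all.
static std::vector<cslot_t *> get_free_slots(std::vector<cslot_t> & slots, int n) {
    std::vector<cslot_t *> out;
    for (auto & s : slots) {
        if (!s.busy && (int) out.size() < n) {
            out.push_back(&s);
        }
    }
    return out; // may contain fewer than n entries
}

struct ctask_t {
    int n_children;
};

// Step 2: consume the task only when enough slots were found.
// (The parent's own slot is left out of this sketch for brevity.)
static bool schedule_parent(std::vector<cslot_t> & slots, ctask_t && task) {
    auto free_slots = get_free_slots(slots, task.n_children);
    if ((int) free_slots.size() < task.n_children) {
        return false; // defer: `task` was never moved from
    }
    for (auto * s : free_slots) {
        s->busy = true; // launch parent + children together
    }
    return true;
}
```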

@ngxson
Contributor Author

ngxson commented Jan 15, 2026

The last CI run failed due to missing curl; rebased onto latest master now. I'll merge this once the CI is green.

@ngxson ngxson merged commit a04c2b0 into ggml-org:master Jan 15, 2026
78 of 79 checks passed
MaheshJakkala pushed a commit to MaheshJakkala/llama.cpp that referenced this pull request Mar 15, 2026
* server : make sure children tasks are scheduled to launch with parent

* fix

* add comment pointing to this PR

* fix

* clean up

* more debug messages

* add pop_deferred_task with specific ID version

* improve the logic

* simple approach

* no double move

* correct return type of launch_slots_with_parent_task