server: improve slots scheduling for n_cmpl #18789

Merged
ngxson merged 13 commits into ggml-org:master from ngxson:xsn/n_cmpl_sync_barrier
Jan 15, 2026

Conversation

@ngxson
Contributor

@ngxson ngxson commented Jan 12, 2026

Ref: #18663 (comment)

This PR introduces a scheduling mechanism inspired by a thread barrier, which allows launching n_cmpl slots at the same time.

I tested with repeated requests to /v1/completions using the following payload:

{
    "prompt": "I believe the meaning of life is",
    "stream": false,
    "n": 3,
    "n_predict": 100,
    "id_slot": 0
}

And so far it works correctly.
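The barrier idea can be sketched roughly as follows. This is a minimal illustration with made-up names (slot_t, try_launch_together), not the actual server code: a parent task with n child completions only starts once all the required slots can be acquired together, so every sequence begins decoding in the same batch.

```cpp
#include <cassert>
#include <vector>

struct slot_t {
    bool busy = false;
};

// Try to acquire `n` free slots atomically; on failure nothing is taken,
// and the caller defers the whole task (the "barrier" behavior).
static bool try_launch_together(std::vector<slot_t> & slots, int n) {
    std::vector<slot_t *> free_slots;
    for (auto & s : slots) {
        if (!s.busy) {
            free_slots.push_back(&s);
        }
        if ((int) free_slots.size() == n) {
            break;
        }
    }
    if ((int) free_slots.size() < n) {
        return false; // not enough slots, defer the whole task
    }
    for (auto * s : free_slots) {
        s->busy = true; // all-or-nothing: mark only when all are available
    }
    return true;
}
```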

@ngxson ngxson requested a review from ggerganov as a code owner January 12, 2026 17:17
@github-actions github-actions bot added examples python python script changes server labels Jan 12, 2026
@ggerganov
Member

I found a failure case with this sample script, using the same server as in the previous comment:

#!/bin/bash
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 3, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 2, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 1, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 3, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 2, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 1, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &

The server gets stuck with several waiting tasks and does not proceed further:

0.02.599.299 D srv  update_slots: run slots completed
0.02.599.300 D que    start_loop: waiting for new tasks
0.02.599.300 D que    start_loop: processing new tasks
0.02.599.311 D que    start_loop: processing task, id = 19
0.02.599.312 D srv  process_sing: failed to reserve 3 slots, defer task, id_task = 19
0.02.599.312 D que         defer: defer task, id = 19
0.02.599.313 D que    start_loop: processing task, id = 16
0.02.599.313 D srv  process_sing: requested slot is reserved for another task (task_id_next = 4), defer task, id_task = 16
0.02.599.313 D que         defer: defer task, id = 16
0.02.599.314 D que    start_loop: processing task, id = 13
0.02.599.314 D srv  process_sing: requested slot is reserved for another task (task_id_next = 4), defer task, id_task = 13
0.02.599.314 D que         defer: defer task, id = 13
0.02.599.315 D que    start_loop: processing task, id = 34
0.02.599.315 D que    start_loop: update slots
0.02.599.315 I srv  update_slots: all slots are idle
0.02.599.315 D que    start_loop: waiting for new tasks
0.02.599.337 D srv  update_chat_: Parsing chat message:  and welcome to our
0.02.599.338 D Parsing input with format Content-only:  and welcome to our
0.02.599.360 D Parsed message: {"role":"assistant","content":" and welcome to our"}
0.02.599.483 D srv          stop: all tasks already finished, no need to cancel
0.02.599.490 D res  remove_waiti: remove task 11 from waiting list. current waiting = 24 (before remove)
0.02.599.490 D res  remove_waiti: remove task 10 from waiting list. current waiting = 23 (before remove)
0.02.599.491 D res  remove_waiti: remove task 8 from waiting list. current waiting = 22 (before remove)
0.02.599.491 D srv          stop: all tasks already finished, no need to cancel

@ggerganov
Member

Appears to be working correctly with the latest version. Looking at the implementation.

@ngxson
Contributor Author

ngxson commented Jan 13, 2026

I found the root cause: in the case where a task reserves a specific slot and the task is deferred, we must pop the exact task that is reserved for that specific slot.

The old logic only popped a single task at the front of queue_tasks_deferred when a slot was released. Say we have n_slots slots: if the reserved task sits at the (n_slots + 1)-th position in queue_tasks_deferred, there is a chance it will never be popped.

An alternative is to pop all tasks from queue_tasks_deferred whenever a slot is freed, but I think that would perform worse. Since only a single slot is freed when pop_deferred_task is called, in theory only one deferred task can proceed anyway.
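The fix described above, popping the exact deferred task that a freed slot is reserved for, could look roughly like this. The names (pop_deferred_by_id, task_t) are illustrative, not the real server_queue code:

```cpp
#include <cassert>
#include <deque>

struct task_t {
    int id;
};

// Scan the deferred queue for the task reserved for the freed slot,
// instead of only inspecting the front of the queue.
static bool pop_deferred_by_id(std::deque<task_t> & deferred, int id, task_t & out) {
    for (auto it = deferred.begin(); it != deferred.end(); ++it) {
        if (it->id == id) {
            out = *it;          // found the task reserved for this slot
            deferred.erase(it); // remove it regardless of its position
            return true;
        }
    }
    return false; // the reserved task is not in the deferred queue
}
```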

@ngxson
Contributor Author

ngxson commented Jan 13, 2026

Yet another alternative fix: instead of storing slot.task_id_next as a task ID, we can store a std::unique_ptr<const server_task> task_next, so the task won't be deferred at all. task_next will be launched whenever the current slot is released.

I'll see if this is actually better or not.
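That alternative could look roughly like this. The types here (slot_s, server_task_s) are simplified stand-ins, not the real server_slot / server_task: the slot owns the next task directly, so releasing it can immediately promote the queued task instead of going through the deferred queue.

```cpp
#include <cassert>
#include <memory>

struct server_task_s {
    int id;
};

struct slot_s {
    std::unique_ptr<const server_task_s> task;      // currently running task
    std::unique_ptr<const server_task_s> task_next; // queued follow-up task

    // Releasing the slot promotes task_next to task (or clears it if none).
    void release() {
        task = std::move(task_next);
    }
};
```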

break;
}
}
unreserve_slots(task.id_target);
Member

Can we unreserve automatically inside slot.release()? To avoid forgetting an unreserve in some branch of the logic.

Contributor Author

I don't think that's possible, because this implementation relies on the assumption that released slots preserve the next task ID, so that the scheduler can launch that specific next task on that freed slot.

But your point does make the idea of having next_task more appealing. In that case, slot.release() would move next_task to task, and the scheduler would only be responsible for making sure all linked tasks start at the same time.

Comment on lines +1792 to +1816
if (task.is_parent()) {
    // if this is a parent task, we want to make sure parent + all child tasks can be launched at the same time

    // the current slot must be either reserved for this task, or free (checked above)
    GGML_ASSERT(slot->task_id_next == -1 || slot->task_id_next == task.id);
    slot->task_id_next = task.id;

    // need to reserve n_children more slots
    if (try_reserve_child_slots(*slot, task.n_children, task.id)) {
        // all required slots have been reserved, safe to proceed
        int task_id = task.id;
        if (!launch_slots_with_child_tasks(*slot, std::move(task))) {
            SRV_ERR("failed to launch slots with child tasks, id_task = %d\n", task_id);
            // task must be dropped on error
            break;
        }
        break;
    } else {
        // failed to reserve all required slots, we defer this task for processing later
        SRV_DBG("failed to reserve %d slots, defer task, id_task = %d\n", task.n_children + 1, task.id);
        queue_tasks.defer(std::move(task));
        // note: the current slot + child slots are already reserved for this task
        break;
    }
}
Member

@ggerganov ggerganov Jan 13, 2026


I am wondering if we can significantly simplify the "reserve" logic by containing it fully within this block. Something like:

// pseudo code
if (task.is_parent()) {
    // note `try_reserve()` does not need to mutate the slots
    auto reserve_info = try_reserve(slot, task);

    if (reserve_info.ok()) {
        launch_task_with_children(slot, std::move(task), reserve_info);
    } else {
        defer(std::move(task));
    }

    // could be moved in the destructor of `reserve_info` too
    // also, this might just be a noop in this approach since we didn't mutate the slots
    unreserve(reserve_info);

    break;
}

I feel like there is no need to maintain the reservation state of the slots beyond attempting to launch the parent and the children.
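A minimal sketch of what a non-mutating try_reserve() could look like, under the assumptions of the pseudocode above (the names slot2_t, reserve_info_t, try_reserve are illustrative, not the actual implementation): the function only inspects the slots and returns which ones could be used, so nothing is marked and there is no reservation state to undo.

```cpp
#include <cassert>
#include <vector>

struct slot2_t {
    bool busy = false;
};

struct reserve_info_t {
    std::vector<int> idxs; // indices of free slots found
    bool ok(int n) const { return (int) idxs.size() >= n; }
};

// Inspect-only reservation check: the slots are taken by const reference,
// so "unreserving" afterwards is naturally a no-op.
static reserve_info_t try_reserve(const std::vector<slot2_t> & slots, int n) {
    reserve_info_t info;
    for (int i = 0; i < (int) slots.size() && (int) info.idxs.size() < n; ++i) {
        if (!slots[i].busy) {
            info.idxs.push_back(i);
        }
    }
    return info;
}
```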

Contributor Author

@ngxson ngxson Jan 13, 2026


I thought about this approach at first, but it won't work well if the server receives a large number of mixed n_cmpl and normal tasks.

Because a normal completion task only needs one slot, it will always be launched as soon as a slot is free. But n_cmpl requires multiple slots, so those tasks may end up being deferred until all normal tasks are done. Eventually, I think we still need some way to tell the scheduler that "this slot must be paused until this (parent) task is launched".

Also, if multiple tasks (either n_cmpl or normal) want to use the same slot, having a notion of a reserved slot could also ensure the slot is not assigned to a random task, which could cause the cache to be cleared.

Member

@ggerganov ggerganov Jan 14, 2026


Because normal completion task only need one slot, it will always be launched when a slot is free. But n_cmpl requires multiple slots, so they may end up being deferred until all normal tasks are done.

This is true, but it would mainly happen in cases where the llama-server is not configured appropriately for its clients. Even with the reservation logic, if a long single-slot generation task starts processing, then all incoming n_cmpl == n_slot requests would stall until the long single-slot task finishes. This rather means that the incoming requests are not directed to a server properly configured to handle them efficiently, i.e. it's a problem of the application config. Either allocate more slots, or prevent flooding the server with tasks that fight with each other for slots/context.

I am a bit worried that the reservation logic is too complex for what it achieves, so I prefer to simplify. Since this functionality has almost no usage atm, it would be better to keep it simple. I want to integrate it with parallel FIM in llama.vscode and llama.vim and start using it. With time, if we see that this is not sufficient or if more use cases appear, we can extend with reservation. WDYT?

Contributor Author

hmm ok, makes sense then. I'll update the PR with the simpler approach

Contributor Author

I implemented that in the last commit. I tested it with your script, plus some non-n_cmpl tasks inserted in the middle, and it still works.

Member

@ggerganov ggerganov left a comment

Thanks, working well on my end.

Comment on lines +1751 to +1760
if (task.is_parent()) {
    int res = launch_slots_with_parent_task(*slot, std::move(task));
    if (res == 2) {
        SRV_ERR("failed to launch slots with parent task, id_task = %d\n", id_task);
        break; // drop the task
    } else if (res == 1) {
        SRV_DBG("not enough slots, defer task, id_task = %d\n", id_task);
        queue_tasks.defer(std::move(task));
        break;
    }
Member

The double std::move(task) looks a bit weird, but it works. We can decompose the launch_slots_with_parent_task() function into try(const server_task & task) and launch(server_task && task) to make it cleaner.
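A minimal sketch of the suggested split, with hypothetical names (task3_t, can_launch, launch, schedule) standing in for the real code: the check takes a const reference and cannot invalidate the task, so the task is moved exactly once, and only on the path that actually consumes it.

```cpp
#include <cassert>
#include <utility>

struct task3_t {
    int id;
    int n_children;
};

static int free_count = 2; // stand-in for the number of free slots

// Non-consuming check: takes a const reference, leaves the task untouched.
static bool can_launch(const task3_t & task) {
    return task.n_children + 1 <= free_count;
}

// Consuming launch: only called after the check succeeded.
static void launch(task3_t && task) {
    free_count -= task.n_children + 1;
}

static bool schedule(task3_t && task) {
    if (!can_launch(task)) {
        return false; // caller defers; `task` was never moved from
    }
    launch(std::move(task)); // single move, on the success path only
    return true;
}
```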

Contributor Author

ah yeah, you're right: task can potentially be invalidated at queue_tasks.defer(std::move(task)) because it was moved into launch_slots_with_parent_task earlier. I fixed it in the last commit by breaking it into 2 steps:

  • get_free_slots: get N free slots for the child tasks
  • launch_slots_with_parent_task: launch the parent task + child tasks
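The two steps above could be sketched roughly like this, using simplified stand-in types (cslot_t, ctask_t) rather than the actual server code: step 1 collects free slots without consuming the task; step 2 takes ownership only if step 1 found enough of them.

```cpp
#include <cassert>
#include <vector>

struct cslot_t {
    bool busy = false;
};

// Step 1: collect up to `n` free slots; does not touch the task at all.
static std::vector<cslot_t *> get_free_slots(std::vector<cslot_t> & slots, int n) {
    std::vector<cslot_t *> out;
    for (auto & s : slots) {
        if (!s.busy && (int) out.size() < n) {
            out.push_back(&s);
        }
    }
    return out; // may contain fewer than n entries
}

struct ctask_t {
    int n_children;
};

// Step 2: consume the task only when enough slots were found.
// (The parent's own slot is left out of this sketch for brevity.)
static bool schedule_parent(std::vector<cslot_t> & slots, ctask_t && task) {
    auto free_slots = get_free_slots(slots, task.n_children);
    if ((int) free_slots.size() < task.n_children) {
        return false; // defer: `task` was never moved from
    }
    for (auto * s : free_slots) {
        s->busy = true; // launch parent + children together
    }
    return true;
}
```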

@ngxson
Contributor Author

ngxson commented Jan 15, 2026

The last CI run failed due to missing curl; rebased onto latest master now. I'll merge this once the CI is green.

@ngxson ngxson merged commit a04c2b0 into ggml-org:master Jan 15, 2026
78 of 79 checks passed
MaheshJakkala pushed a commit to MaheshJakkala/llama.cpp that referenced this pull request Mar 15, 2026
* server : make sure children tasks are scheduled to launch with parent

* fix

* add comment pointing to this PR

* fix

* clean up

* more debug messages

* add pop_deferred_task with specific ID version

* improve the logic

* simple approach

* no double move

* correct return type of launch_slots_with_parent_task