server: improve slots scheduling for n_cmpl (#18789)
Conversation
I found a failure case with this sample script, using the same server from the previous comment:

```bash
#!/bin/bash
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 3, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 2, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 1, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 3, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 2, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 1, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
```

The server gets stuck with several waiting tasks and does not proceed further.
Appears to be working correctly with the latest version. Looking at the implementation.
I found the root cause: in the case where a task reserves a specific slot and that task is deferred, we must pop the exact task that is reserved for that specific slot. The old logic only pops a single task at the front of the queue. An alternative is to pop all tasks from the queue.
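The "pop the exact task for a reserved slot" idea can be sketched roughly as follows (hypothetical names and fields, not the actual server implementation; the real task queue is more involved):

```cpp
#include <deque>
#include <optional>

// minimal stand-in for the server task type (hypothetical fields)
struct server_task {
    int id      = -1;
    int id_slot = -1; // slot this task is reserved for, or -1 for any
};

// sketch of a deferred-task queue that can pop the task reserved for a
// specific slot, instead of only popping from the front
struct deferred_queue {
    std::deque<server_task> tasks;

    // pop the first deferred task that targets the given slot
    std::optional<server_task> pop_for_slot(int id_slot) {
        for (auto it = tasks.begin(); it != tasks.end(); ++it) {
            if (it->id_slot == id_slot) {
                server_task task = *it;
                tasks.erase(it);
                return task;
            }
        }
        return std::nullopt;
    }
};
```

With a plain front-pop, a task bound to a busy slot can block tasks whose slots are already free; popping by slot ID avoids that.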
Yet another alternative fix, instead of having …. I'll see if this is actually better or not.
tools/server/server-context.cpp (outdated):

```cpp
            break;
        }
    }
    unreserve_slots(task.id_target);
```
Can we unreserve automatically inside `slot.release()`, to avoid forgetting an unreserve in some branch of the logic?
I don't think it's possible, because this implementation works based on the assumption that released slots must preserve the next task ID, so that the scheduler can launch that specific next task for that free slot.
But your point does make the idea of having `next_task` more appealing. In this case, `slot.release()` will move `next_task` into `task`, and the scheduler is only responsible for making sure all linked tasks can start at the same time.
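A minimal sketch of that `next_task` idea (hypothetical, simplified types; the real `server_slot` carries much more state):

```cpp
#include <utility>

// hypothetical stand-in for a server task
struct server_task {
    int id = -1;
    bool valid() const { return id != -1; }
};

// sketch: a released slot promotes its reserved next task instead of
// going idle, so the reservation never needs a separate unreserve step
struct server_slot {
    server_task task;      // currently running task
    server_task next_task; // task reserved to run next on this slot

    void release() {
        // move the reserved next task (if any) into the active task;
        // the scheduler then only has to start all linked tasks together
        task      = std::move(next_task);
        next_task = server_task{};
    }
};
```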
tools/server/server-context.cpp (outdated):

```cpp
if (task.is_parent()) {
    // if this is a parent task, we want to make sure parent + all child tasks can be launched at the same time

    // the current slot must be either reserved for this task, or free (checked above)
    GGML_ASSERT(slot->task_id_next == -1 || slot->task_id_next == task.id);
    slot->task_id_next = task.id;

    // need to reserve n_children more slots
    if (try_reserve_child_slots(*slot, task.n_children, task.id)) {
        // all required slots have been reserved, safe to proceed
        int task_id = task.id;
        if (!launch_slots_with_child_tasks(*slot, std::move(task))) {
            SRV_ERR("failed to launch slots with child tasks, id_task = %d\n", task_id);
            // task must be dropped on error
            break;
        }
        break;
    } else {
        // failed to reserve all required slots, we defer this task for processing later
        SRV_DBG("failed to reserve %d slots, defer task, id_task = %d\n", task.n_children + 1, task.id);
        queue_tasks.defer(std::move(task));
        // note: the current slot + child slots are already reserved for this task
        break;
    }
}
```
I am wondering if we can significantly simplify the "reserve" logic by containing it fully within this block. Something like:
```cpp
// pseudo code
if (task.is_parent()) {
    // note `try_reserve()` does not need to mutate the slots
    auto reserve_info = try_reserve(slot, task);
    if (reserve_info.ok()) {
        launch_task_with_children(slot, std::move(task), reserve_info);
    } else {
        defer(std::move(task));
    }
    // could be moved in the destructor of `reserve_info` too
    // also, this might just be a noop in this approach since we didn't mutate the slots
    unreserve(reserve_info);
    break;
}
```

I feel like there is no need to maintain the reservation state of the slots beyond attempting to launch the parent and the children.
I thought about this approach at first, but it won't work well if the server receives a large number of mixed n_cmpl and normal tasks.
Because a normal completion task only needs one slot, it will always be launched when a slot is free. But n_cmpl requires multiple slots, so those tasks may end up being deferred until all normal tasks are done. Eventually, I think we still need some way to tell the scheduler that "this slot must be paused until this (parent) task is launched".
Also, if we have multiple tasks (either n_cmpl or normal) that want to use the same slot, having a notion of a reserved slot could also make sure the slot is not assigned to a random task, which could cause the cache to be cleared.
> Because normal completion task only need one slot, it will always be launched when a slot is free. But n_cmpl requires multiple slots, so they may end up being deferred until all normal tasks are done.
This is true, but it would mainly happen in cases where the llama-server is not configured appropriately for its clients. Even with the reservation logic, if a long single-slot generation task starts processing, then all incoming n_cmpl == n_slot requests would stall until the long single-slot task finishes. This rather means that the incoming requests are not being directed to a server properly configured to handle them efficiently - i.e. it's a problem of the application config. Either allocate more slots, or avoid flooding the server with tasks that fight each other for slots/context.
I am a bit worried that the reservation logic is too complex for what it achieves, so I prefer to simplify. Since this functionality has almost no usage atm, it would be better to keep it simple. I want to integrate it with parallel FIM in llama.vscode and llama.vim and start using it. With time, if we see that this is not sufficient, or if more use cases appear, we can extend it with reservation. WDYT?
Hmm, ok, that makes sense. I'll update the PR with the simpler approach.
I implemented that in the last commit. I tested it with your script, plus some non-n_cmpl tasks inserted in the middle, and it still works.
ggerganov left a comment:
Thanks, working well on my end.
```cpp
if (task.is_parent()) {
    int res = launch_slots_with_parent_task(*slot, std::move(task));
    if (res == 2) {
        SRV_ERR("failed to launch slots with parent task, id_task = %d\n", id_task);
        break; // drop the task
    } else if (res == 1) {
        SRV_DBG("not enough slots, defer task, id_task = %d\n", id_task);
        queue_tasks.defer(std::move(task));
        break;
    }
```
The double `std::move(task)` looks a bit weird, but it works. We can decompose the `launch_slots_with_parent_task()` function into `try(const server_task & task)` and `launch(server_task && task)` to make it cleaner.
Ah yeah, you're right - `task` can potentially be invalidated at `queue_tasks.defer(std::move(task))` because it was moved into `launch_slots_with_parent_task` earlier. I fixed it in the last commit by breaking it into 2 steps:

- `get_free_slots`: get N free slots for the child tasks
- `launch_slots_with_parent_task`: launch parent task + child tasks
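A rough sketch of that two-step shape (hypothetical, simplified signatures - not the actual server code): step 1 only inspects the slots and never takes ownership of the task, so the task remains valid for deferring; step 2 is the single place where the task is consumed:

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// simplified stand-ins for the server types (hypothetical)
struct server_task { int id; int n_children; };
struct server_slot { bool is_free; };

// step 1: collect up to n_needed free slots; the task is not involved,
// so it cannot be invalidated here
static std::vector<server_slot *> get_free_slots(std::vector<server_slot> & slots, int n_needed) {
    std::vector<server_slot *> res;
    for (auto & s : slots) {
        if (s.is_free) {
            res.push_back(&s);
            if ((int) res.size() == n_needed) {
                break;
            }
        }
    }
    return res;
}

// step 2: only launches when enough slots were found, so the task is
// moved exactly once and never touched afterwards
static bool launch_slots_with_parent_task(std::vector<server_slot *> & slots, server_task && task) {
    if ((int) slots.size() < task.n_children + 1) {
        return false; // nothing was moved from; the caller can still defer the task
    }
    for (auto * s : slots) {
        s->is_free = false;
    }
    std::printf("launched task %d on %zu slots\n", task.id, slots.size());
    return true;
}
```

Splitting check and launch this way removes the need for the second `std::move(task)` in the defer branch.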
The last CI run failed due to missing curl; rebased to the latest master now. I'll merge this once the CI is green.
* server : make sure children tasks are scheduled to launch with parent
* fix
* add comment pointing to this PR
* fix
* clean up
* more debug messages
* add pop_deferred_task with specific ID version
* improve the logic
* simple approach
* no double move
* correct return type of launch_slots_with_parent_task
Ref: #18663 (comment)
This PR introduces a scheduling mechanism inspired by a thread barrier, which allows launching `n_cmpl` slots at the same time.

I tested with repeated requests to `/v1/completions` using the following payload:

```json
{
    "prompt": "I believe the meaning of life is",
    "stream": false,
    "n": 3,
    "n_predict": 100,
    "id_slot": 0
}
```

And so far it works correctly.