Conversation
@JohannesGaessler Would be useful to rerun the 128 slot benchmark with this branch and see if this change makes a positive impact.
It doesn't seem to make a difference:
Sorry, I did the benchmark wrong. The default of
With the scheduling logic on master there is a regression as the number of slots increases; with the patch in this PR this regression is greatly reduced.
This is in line with my expectation. Thanks. So the results that we were discussing in the other thread were all obtained by front-loading all of the requests? In that case the stream slicing issue is not relevant at all, because all of the slots are running through the entire benchmark. Curious how the vllm bench looks with a request rate of 5?
Yes.
You mean a benchmark of vllm using the vllm tool? I didn't test it so far, but I think it's not going to be very interesting. As long as the throughput is high enough that a server is basically idle when the last request arrives at ~200 s, all servers are going to finish very close to each other unless they're stalling for some reason.
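A minimal sketch of where the "~200 s" figure comes from, assuming the benchmark sends 1000 requests at a steady rate of 5 requests/second (both numbers are illustrative; the actual request count was not stated in this thread):

```python
# Hypothetical arithmetic: when does the last request arrive?
# Assumes 1000 requests and a constant rate of 5 requests/second,
# i.e. a fixed inter-arrival gap of 1/rate seconds.
num_requests = 1000
request_rate = 5.0  # requests per second

# Request i arrives at time i / rate (request 0 at t = 0).
arrival_times = [i / request_rate for i in range(num_requests)]
last_arrival = arrival_times[-1]
print(f"last request arrives at {last_arrival:.1f} s")  # prints: last request arrives at 199.8 s
```

Under these assumptions the server only needs enough throughput to drain its queue faster than 5 req/s; once it is essentially idle at the moment the final request lands, total wall time is dominated by the arrival schedule rather than by the server, so differently-scheduled servers finish at nearly the same time.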
Ah yes - makes sense.
I think this would make the stream slicing issue discussed in #14924 (comment) less prominent.