scheduler proposal continuation #905
Conversation
ahg-g left a comment:
Thanks! I recommend adding an example of how the framework will execute the extension points. This will help align and clarify when each extension is executed.
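To make the suggestion above concrete, here is a minimal sketch of how a framework might execute the extension points in order within one scheduling cycle. All type and function names here are illustrative assumptions, not the proposal's actual interfaces.

```go
package main

import "fmt"

// Hypothetical types for illustration; the proposal's real names and
// signatures may differ.
type Endpoint struct{ Name string }

type ScoredEndpoint struct {
	Endpoint Endpoint
	Score    float64
}

// Three extension points, executed in order by the framework.
type Filter func(eps []Endpoint) []Endpoint
type Scorer func(eps []Endpoint) []ScoredEndpoint
type Picker func(scored []ScoredEndpoint) []Endpoint

// RunCycle sketches one scheduling cycle: filters narrow the candidate
// set, the scorer ranks it, and the picker selects the final endpoint(s).
func RunCycle(eps []Endpoint, filters []Filter, score Scorer, pick Picker) []Endpoint {
	for _, f := range filters {
		eps = f(eps)
	}
	return pick(score(eps))
}

func main() {
	eps := []Endpoint{{"pod-a"}, {"pod-b"}, {"pod-c"}}
	filters := []Filter{func(in []Endpoint) []Endpoint {
		out := []Endpoint{}
		for _, e := range in {
			if e.Name != "pod-c" { // toy filter: drop pod-c
				out = append(out, e)
			}
		}
		return out
	}}
	scores := map[string]float64{"pod-a": 0.2, "pod-b": 0.9} // toy scores
	score := func(in []Endpoint) []ScoredEndpoint {
		out := []ScoredEndpoint{}
		for _, e := range in {
			out = append(out, ScoredEndpoint{e, scores[e.Name]})
		}
		return out
	}
	pick := func(s []ScoredEndpoint) []Endpoint { // pick the top-scored endpoint
		best := s[0]
		for _, c := range s[1:] {
			if c.Score > best.Score {
				best = c
			}
		}
		return []Endpoint{best.Endpoint}
	}
	fmt.Println(RunCycle(eps, filters, score, pick)[0].Name) // pod-b
}
```

The point of the sketch is only the ordering of the extension points; the real proposal defines them as interfaces on a SchedulerProfile rather than bare function types.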
```go
type EndpointState struct {
	// storage is per Scheduling Cycle, and so has no thread-safe concerns.
	// TODO should think if the above is true or should we use sync map for thread safety.
```
this will be the case unless the extension interfaces are changed to evaluate one pod at a time and the framework launches those evaluations in parallel.
how about running profiles in parallel? running different scorers in parallel?
if we eventually have support for Python scorers, for example, we might want to parallelize those scorers.
I think this is very low priority atm, so maybe it's not worth discussing at this point.
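If scorers did run in parallel, a fan-out of this shape would avoid any need for a sync map: each goroutine writes to its own slot of a results slice. This is a sketch under assumed types, not the proposal's interface.

```go
package main

import (
	"fmt"
	"sync"
)

type Endpoint struct{ Name string }

// Scorer is a hypothetical per-endpoint scoring function; the name is
// illustrative, not the proposal's actual interface.
type Scorer func(Endpoint) float64

// scoreParallel runs every scorer concurrently and sums the results.
// Each goroutine writes only to its own slot of the results slice, so
// no mutex or sync.Map is needed.
func scoreParallel(scorers []Scorer, ep Endpoint) float64 {
	results := make([]float64, len(scorers))
	var wg sync.WaitGroup
	for i, s := range scorers {
		wg.Add(1)
		go func(i int, s Scorer) {
			defer wg.Done()
			results[i] = s(ep)
		}(i, s)
	}
	wg.Wait()
	total := 0.0
	for _, r := range results {
		total += r
	}
	return total
}

func main() {
	scorers := []Scorer{
		func(e Endpoint) float64 { return 0.5 },                  // e.g. a queue-depth scorer
		func(e Endpoint) float64 { return float64(len(e.Name)) }, // toy second scorer
	}
	fmt.Println(scoreParallel(scorers, Endpoint{"pod-a"})) // 5.5
}
```

Per-cycle state would remain safe under this pattern as long as each scorer only reads shared state and writes its own result.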
> Picker selects the endpoint(s) from the provided list of scored endpoints. Picker MUST return at least one endpoint.
>
> ### PostSchedule
I would like a PreRequest extension point (similar to PostSchedule, but I think calling it PreRequest is more accurate). The use cases are:
- Prefix cache affinity. Once a server is picked for a request, update the prefix indexes from the request prompt.
- LoRA affinity or some other form of session affinity. Record the server(s) a LoRA adapter is scheduled to, so the following requests can be scheduled on the same server.

Why not just use PostResponse? The issue is latency. Suppose you have 2 requests for the same LoRA adapter, at t1 and t2. Request1 was sent to server1 and response1 came back at t3 (taking longer than t2-t1). Then for Request2 you lose the affinity.
What about request failures? Request failures will lead to an inaccurate recording of the affinity. However, the impact is negligible so long as the failure rate is low (which should generally be expected).
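A minimal sketch of the LoRA-affinity use case above, assuming a hypothetical PreRequest hook invoked right after an endpoint is picked (the hook name and signature are not part of the current proposal):

```go
package main

import "fmt"

// Hypothetical request type; fields are illustrative.
type Request struct {
	Prompt      string
	LoraAdapter string
}

// loraAffinity records which server each adapter was last scheduled to,
// so subsequent requests can prefer the same server immediately, without
// waiting for a response or a metrics refresh.
type loraAffinity struct {
	adapterToServer map[string]string
}

// PreRequest would be invoked after an endpoint is picked but before the
// request is forwarded to it.
func (l *loraAffinity) PreRequest(req Request, pickedServer string) {
	l.adapterToServer[req.LoraAdapter] = pickedServer
}

func main() {
	plugin := &loraAffinity{adapterToServer: map[string]string{}}
	plugin.PreRequest(Request{Prompt: "hi", LoraAdapter: "sql-lora"}, "pod-a")
	// A second request for the same adapter can now be steered to pod-a
	// even before the first response returns.
	fmt.Println(plugin.adapterToServer["sql-lora"]) // pod-a
}
```

The prefix-cache variant would be analogous, updating prefix indexes from `req.Prompt` instead of the adapter map.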
I'll split my answer into two parts:
- your request sounds to me like you're looking for an extension point in the director (the requestcontrol layer) rather than in the scheduler. the scheduler is responsible for selecting the endpoint(s); once a selection is made, the caller (the director) can do anything it wants with this information, like what you requested.
- the behavior you're describing sounds incorrect to me. the fact that the scheduler selected an endpoint is not enough to conclude that the next requests should be directed to the same server (in both examples you provided). the server might be non-responsive, and the scheduler might have picked it according to metrics that were collected before it became unavailable. this may lead to many subsequent requests being sent to the same non-responsive server. if the server did NOT receive the request or did NOT handle the request successfully, we shouldn't assume it has the relevant cache.
I'm not sure the LoRA use case is worth supporting. It makes the assumption that LoRA loading is zero cost and happens on the fly, in response to requests. I'm unsure if this is always the case.
If not, then the first request choosing some Pod meant that it already had the LoRA available and loaded, which means this is still the case when the second request comes. Hope I understood the use case correctly...
Similarly for session affinity, do you see multiple identical requests being sent in parallel without receiving a response? Might not be likely for a chat-like scenario, but maybe there are others where it makes sense.
Regarding prefix, I understand it would currently benefit from both callbacks (optimistically recording the prefix based on the prompt once the target is selected, then validating and adding the response data in PostResponse). This is an implementation choice with tradeoffs (e.g., failure vs latency).
- this optimizes for the case of independent requests having the same exact prefix. How common is that?
- the PostResponse hook is called multiple times: first when the headers are received, and then again for every body segment (one or more, depending on buffering). Headers are received soon after Prefill is done (typically much shorter than Decode processing). In that case, I'm not sure the window for receiving (the rare?) same-prefix independent requests makes a good argument for adding another plugin point inside the scheduler.
> your request sounds to me like you're looking for an extension point in the director rather than in scheduler (in requestcontrol layer). the scheduler is responsible for selecting the endpoint(s). once a selection was made the caller can do with this information anything like what you requested (director is the caller).

The Director sounds like a good place to have that PreRequest extension point, given it also hosts PostResponse.

> the server might be non-responsive and scheduler might picked it according to metrics that were collected before its unavailability.

This issue should be solved by health checks. If an endpoint is down, it should be removed from the eligible endpoints to pick from. This is a general issue, isn't it? Regardless of the "affinity" behavior, if a failed endpoint exists in the data layer, the scheduler will pick it and receive an error anyway.
> this optimizes for the case of independent requests having the same exact prefix. How common is that?

Depends on the use case; for example, independent requests could share the same system prompt or RAG documents.
> It seems to me that we might be optimizing for a relatively narrow time window

This is optimizing the head-of-line blocking problem, which is quite important. Imagine you have a new system prompt and receive many concurrent requests containing that new shared system prompt. Without this optimization, you can spread them across many servers, losing the benefit of prefix affinity. The same applies to LoRA adapters.
Currently we implement LoRA affinity by refreshing the loaded LoRA adapters via metrics every 50ms. At higher QPS, we see LoRA adapters spread (losing affinity), so even tighter metrics freshness is required.
You are right that if the first request is very fast to respond, then this is not a big problem. However, we cannot control that, as it's not uncommon for inference requests to take seconds or even longer to finish.
Should we follow up with adding a PreRequest extension point in the director? @nirrozenbaum @elevran
@liu-cong I suggest keeping this proposal scoped to the scheduler; I'll open a new proposal that deals with more general pluggability soon, with a section for each layer (and then we can add PreRequest as a plugin in the requestcontrol layer). I will tag all participants on this PR in the new one as well.
as @ahg-g wrote in the other comment, this PR seems to be very close to complete and merge, so maybe it's better to open a general one rather than continue iterating on this one.
That makes sense.
Do we have an up-to-date picture of the extension points and their names? Did we settle on names? Just want to ensure we are in agreement about where in the scheduling flow the extension points are run.
yes. I'll push the most up-to-date names, including an updated diagram, soon (either later today or tomorrow).
I may be missing context but don't we need a

As a followup, I think we should convert this into a generic pluggability proposal with sections for each layer. This will allow us to define the common parts as we discussed on the other PR.

++ agreed 👍
```go
// ProcessProfileResults would know to simply log the result of ShadowBoxing
// profile, and do nothing else with it.
ProcessProfileResults(request *Request, profileResults map[string][]*ScoredEndpoint) map[string][]*Endpoint
}
```
a question I was thinking about:
since we agreed that the Scheduler IS aware of the http protocol (the Request struct represents it), what are your thoughts about having ProcessResults get all profile results but return only one?
in llm-d, for example, since the scheduler is aware of http, we can set the prefill header at this point and return only the decode selection.
and in general I think it makes sense to return one Result, i.e., the results of one profile (return []*Endpoint and optionally an array of backups).
@elevran @ahg-g
I was thinking that this will be done by the PostSchedule plugin. Here is my thinking of how we will compose with the default behavior:
- We will have a PostSchedule plugin, named `DestinationEndpoint` for example, that ships with GIE and sets the endpoints as per the epp protocol (https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol#destination-endpoint). It will operate on the list of endpoints returned from `ProcessResults`.
- In the p/d case:
  - `ProcessResults` returns the endpoints for the decoder (and so the `DestinationEndpoint` plugin will set them as per the epp <-> proxy protocol), while the prefill endpoint is stored in CycleState.
  - Another PostSchedule plugin is added, call it `DestinationPrefillEndpoint`, which sets the prefill endpoint in a header.

This maintains the separation of concerns and composes with the non-p/d case nicely.
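The composition described above could look roughly like this. The plugin names come from the comment; `x-prefill-endpoint` and the CycleState key are made up for the sketch (`x-gateway-destination-endpoint` is the epp protocol's header).

```go
package main

import "fmt"

type Endpoint struct{ Address string }

// CycleState as a simple key/value bag, for illustration only.
type CycleState map[string]any

// PostSchedule plugins mutate the outgoing request headers after a
// scheduling decision is made.
type PostSchedule interface {
	PostSchedule(state CycleState, picked []Endpoint, headers map[string]string)
}

// DestinationEndpoint sets the picked (decode) endpoint per the
// epp <-> proxy destination protocol.
type DestinationEndpoint struct{}

func (DestinationEndpoint) PostSchedule(_ CycleState, picked []Endpoint, h map[string]string) {
	h["x-gateway-destination-endpoint"] = picked[0].Address
}

// DestinationPrefillEndpoint reads the prefill endpoint stashed in
// CycleState by the P/D ProcessResults step and sets it as a header.
type DestinationPrefillEndpoint struct{}

func (DestinationPrefillEndpoint) PostSchedule(s CycleState, _ []Endpoint, h map[string]string) {
	if p, ok := s["prefill-endpoint"].(Endpoint); ok {
		h["x-prefill-endpoint"] = p.Address
	}
}

func main() {
	state := CycleState{"prefill-endpoint": Endpoint{"10.0.0.2:8000"}}
	headers := map[string]string{}
	plugins := []PostSchedule{DestinationEndpoint{}, DestinationPrefillEndpoint{}}
	for _, p := range plugins {
		p.PostSchedule(state, []Endpoint{{"10.0.0.1:8000"}}, headers)
	}
	fmt.Println(headers["x-gateway-destination-endpoint"]) // 10.0.0.1:8000
	fmt.Println(headers["x-prefill-endpoint"])             // 10.0.0.2:8000
}
```

In the non-p/d case only `DestinationEndpoint` is registered and the CycleState key is simply absent, which is what makes the two cases compose.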
I followed everything except for the CycleState part, where I probably didn't understand your intention.
in the current proposal the ProcessResults signature is:

```go
ProcessProfileResults(request *Request, profileResults map[string][]*ScoredEndpoint) map[string][]*Endpoint
```

looking at the return value here, we get back a map from profile-name to its results (set of endpoints).
let's assume we have PostSchedule in the director (or, as @liu-cong named it, PreRequest, which I agree is more descriptive/accurate; plus, having PreRequest and PostResponse sounds to me like better terminology than PostSchedule and PostResponse).
using this plugin we should be able to set the headers of the destination (decode) + prefill.
can you explain what you meant in the CycleState part?

> ProcessResults returns the endpoints for the decoder (and so DestinationEndpoint plugin will set them as per the epp<-> proxy protocol), while the prefill endpoint is stored in CycleState.

CycleState scope is within the scheduler, so it's not returned back to the requestcontrol layer.
> looking at the return value here - we get back a map from profile-name -> its results (set of endpoints).

Oh, I thought we return a single final list that in the normal case a PreRequest plugin uses to set the destination endpoints.

> CycleState scope is within the scheduler, so it's not returned back to the requestcontrol layer.
>
> can you explain what you meant in the CycleState part?

I assumed CycleState is scoped across all extension points executed for a request. If we do that, then what I was proposing is to use CycleState as the communication channel between the P/D-specific ProcessResults and PreRequest plugins, to communicate the list of prefill servers to be set in the headers. Again, this was assuming that ProcessResults returned a list of endpoints, not a map.
Another approach is to continue to have ProcessResults return a map, but the decode endpoints should have a key that the default PreRequest plugin understands in both the default and p/d case (in the p/d case, ProcessResults can set the key for the decode profile endpoints to whatever value is expected by the default PreRequest plugin). And the additional PreRequest plugin operates on the prefill endpoints to set them as a header.
I am trying to avoid having the scheduling layer implement parts of the epp <-> proxy protocol.
In summary, we will have two PreRequest plugins: the first ships with GIE and implements the destination endpoint protocol; from the map[string][]*Endpoint it uses a well-known key to look up the list of endpoints it will use. The second ships with llm-d and implements the vllm protocol for setting the prefill endpoints as a header.
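The second (well-known key) approach might be sketched as follows. The key and header names other than `x-gateway-destination-endpoint` are assumptions for illustration.

```go
package main

import "fmt"

type Endpoint struct{ Address string }

// primaryKey is the well-known map key the default PreRequest plugin
// reads; in the p/d case, ProcessResults stores the decode endpoints
// under it.
const primaryKey = "primary"

// defaultPreRequest implements the destination endpoint protocol from
// the well-known key.
func defaultPreRequest(results map[string][]Endpoint, headers map[string]string) {
	if eps, ok := results[primaryKey]; ok && len(eps) > 0 {
		headers["x-gateway-destination-endpoint"] = eps[0].Address
	}
}

// prefillPreRequest is the llm-d specific plugin operating on the
// prefill profile's endpoints; the header name is assumed.
func prefillPreRequest(results map[string][]Endpoint, headers map[string]string) {
	if eps, ok := results["prefill"]; ok && len(eps) > 0 {
		headers["x-prefill-endpoint"] = eps[0].Address
	}
}

func main() {
	results := map[string][]Endpoint{
		primaryKey: {{"10.0.0.1:8000"}}, // decode endpoints under the well-known key
		"prefill":  {{"10.0.0.2:8000"}},
	}
	headers := map[string]string{}
	defaultPreRequest(results, headers)
	prefillPreRequest(results, headers)
	fmt.Println(headers["x-gateway-destination-endpoint"], headers["x-prefill-endpoint"])
}
```

This keeps the scheduling layer out of the epp <-> proxy protocol: only the PreRequest plugins know about headers.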
I prefer the second approach, but I still think we should scope CycleState to be across all extension points executed on a request. We can define CycleState in the common pkg.
I'm also ok with cycle state per request, although I'd prefer avoiding it. I just think we need to pay special attention that the field is not abused, because it becomes extremely easy to communicate information via a generic cycle state rather than well-defined interfaces. I would have a cycle state per layer during the request lifecycle to verify that the interfaces between the layers are well defined.
and if we agree on this I think the proposal is ready for finalization, as we converged on everything
Yes, we have a few followups as well related to converting this into a generic extensibility proposal with subsections for various layers, defining a common pkg for types and interfaces, grouping the plugin implementations under one directory.
> we should scope CycleState to be across all extension points executed on a request

Just calling this out: if we do this, we are stating that the EPP architecture is no longer made up of independent subsystems that could be broken out into their own lib, and instead tying them all together.
I don't particularly love that. The hope was that the director/request control would be the mortar where we handle the grey area. But I'm not willing to die on this hill; more just calling it out.
Discussed offline with Kellen. I think we need to strike a balance between our goal of maintaining a reusable scheduling library and a design that works well across the layers we define in EPP.
My recommendation is to define common types, like CycleState and Request, that all layers depend on. This doesn't prevent the scheduling library from being reusable in a different context.
I don't think we want to prevent plugins from implementing extensions across subsystems if it makes sense to do so. Each plugin should declare its dependencies, and if someone wants to use the scheduling library in a different context, they can disable the plugins that do that if they are incompatible with their system, or implement in their system the behavior expected by those other extensions (e.g., adding state to CycleState that is expected by the scheduling parts of the plugin).
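A common-package CycleState along those lines might look like the sketch below. The namespaced key convention is an assumption, shown as one way to keep cross-layer data flow explicit even with a single per-request state.

```go
package main

import "fmt"

// StateKey names a piece of per-request state. By convention (assumed
// here), keys are namespaced per layer, e.g. "scheduling/...".
type StateKey string

// CycleState is scoped across all extension points of one request.
type CycleState struct {
	data map[StateKey]any
}

func NewCycleState() *CycleState { return &CycleState{data: map[StateKey]any{}} }

func (c *CycleState) Write(k StateKey, v any) { c.data[k] = v }

func (c *CycleState) Read(k StateKey) (any, bool) {
	v, ok := c.data[k]
	return v, ok
}

func main() {
	state := NewCycleState()
	// A scheduling-layer plugin writes under its own namespace...
	state.Write("scheduling/prefill-endpoints", []string{"10.0.0.2:8000"})
	// ...and a requestcontrol-layer PreRequest plugin declares a
	// dependency on that key and reads it.
	if v, ok := state.Read("scheduling/prefill-endpoints"); ok {
		fmt.Println(v.([]string)[0])
	}
}
```

Reusing the scheduling library elsewhere would then mean either disabling plugins that read cross-layer keys or populating those keys from the host system.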
I agree with Kellen (at least with his initial comment) that it would be good NOT to have one CycleState per request, as it may cause issues of bypassing the interfaces that we define between the layers.
as I stated in previous comments:

> I just think we need to pay special attention that the field is not abused, cause it becomes extremely easy to communicate information in a generic cycle state rather than well defined interfaces. I would have a cycle state per layer during the request lifecycle to verify the interfaces between the layers are well defined.

I agree that we need to balance, as @ahg-g mentioned.
let's discuss this more as we make progress with more extension points in more layers.
elevran left a comment:
/lgtm
@elevran: changing LGTM is restricted to collaborators
one nit /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g, elevran, nirrozenbaum.
@ahg-g what is the nit?
> - Cycle State
>
> ProfileSelect will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call. Only a single ProfileSelect function may be defined per scheduler.
> ProfilePick will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call.
I thought we wanted to return a boolean
ideally it would be good.
the point is that the next ProfilePick iteration depends on the current profile execution result, so that's not always possible. for example in P/D (let's assume it's enabled):
- on the first iteration we run D and get the results.
- on the second iteration we inspect the results of D, do a check of the prefix cache, and only then decide whether P should run or not.

this means we cannot decide upfront, as a boolean, whether an additional iteration should be called.
this is the original reason I haven't added a bool to this part.
see the code example from llm-d here.
this code is going to change to use SchedulerProfiles instead of two schedulers, so this check should be embedded in the profile selection logic.
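The iterative contract described above (call ProfilePick until it returns an empty map) can be sketched like this. Types are simplified stand-ins, and the "prefix cache check" is faked.

```go
package main

import "fmt"

type Endpoint struct{ Name string }
type SchedulerProfile struct{ Name string }

// ProfilePick receives the results accumulated so far and returns the
// next profiles to run; an empty map means no further iterations.
type ProfilePick func(results map[string][]Endpoint) map[string]*SchedulerProfile

func runProfiles(pick ProfilePick, runProfile func(*SchedulerProfile) []Endpoint) map[string][]Endpoint {
	results := map[string][]Endpoint{}
	for {
		profiles := pick(results)
		if len(profiles) == 0 { // implicit "stop" signal
			return results
		}
		for name, p := range profiles {
			results[name] = runProfile(p)
		}
	}
}

func main() {
	// Toy P/D handler: run decode first, then decide on prefill based on
	// the decode result (here: a fake prefix-cache check that always
	// decides prefill is needed).
	pick := func(results map[string][]Endpoint) map[string]*SchedulerProfile {
		if _, decodeDone := results["decode"]; !decodeDone {
			return map[string]*SchedulerProfile{"decode": {Name: "decode"}}
		}
		if _, prefillDone := results["prefill"]; !prefillDone {
			return map[string]*SchedulerProfile{"prefill": {Name: "prefill"}}
		}
		return nil // third call: nothing left to run
	}
	run := func(p *SchedulerProfile) []Endpoint { return []Endpoint{{p.Name + "-pod"}} }
	results := runProfiles(pick, run)
	fmt.Println(len(results), results["decode"][0].Name) // 2 decode-pod
}
```

Note the third call to `pick` that only signals termination; the boolean variant debated below would let some handlers skip it at the cost of a second return value.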
Without a boolean, you always need an extra call to ProfilePick. With a boolean, in some cases you will execute exactly the right number of iterations; in others you will need an extra call that returns false with a nil profile.
Are we allowing multiple ProfileSelect functions to be defined per scheduler?
Additionally, the boolean makes it much clearer to the implementer how multiple selection is handled. Otherwise a user may skim the documentation and miss the line that specifies how multiple runs should be handled.
IIRC, we wanted a boolean "call again" as an explicit return, whereas an empty slice is more of an implicit convention.
Both options are equally expressive:
callAgain == (len(profiles) > 0)
ProfilePick always gets called a second time in the example. The difference is only between the second call returning an empty slice or a boolean to indicate a third call is not needed.
At least that's the way I understand it...
They provide the same functionality, completely agreed there.
I'm just suggesting that a return type of `...) map[string]*SchedulerProfile` vs `...) (selectedProfiles map[string]*SchedulerProfile, rerun bool)` allows an implementer to more easily intuit the functionality of the system, or ask themselves 'what does the rerun bool do?' and then go to the documentation and learn. The implicit method, in contrast, requires a user to be aware of a somewhat 'secret' behavior.
Sure, we will document it, but I don't think all users read documentation exhaustively, and with the added benefit of no forced extra call to ProfilePick it seems slightly better to have a bool, imo.
I personally think a bool return value here has inconsistent behavior:
in some scenarios, we know that no extra call is needed.
in other scenarios, like the one I described above, we will do an extra call but get back an empty profiles map.
IMO it's better to define a behavior that is consistent in all cases, rather than having an inconsistent bool that may confuse more than help.
btw, I do expect whoever wants to implement a plugin to read its documentation.
the function godoc is only two lines, not an exhaustive description.

> Are we allowing multiple ProfileSelect functions to be defined per scheduler?

only one ProfileHandler plugin is allowed, meaning only one ProfileSelect is allowed.
this is still not validated in code, but it is documented (it should also be validated in next PRs).
Sorry, posted now
* scheduler proposal continuation
* removed functions in comments
* embed SchedulerConfig into Scheduler
* add request struct
* final updates to proposal
* minor change
* profile handler docs
* revert
* linter
* Apply suggestions from code review

Signed-off-by: Nir Rozenbaum <[email protected]>
Co-authored-by: Etai Lev Ran <[email protected]>
This is a continuation of the scheduler interface work started in #845.
This PR currently updates only the interfaces.go file; further changes to the other proposal files will follow.