scheduler proposal continuation #905
@@ -14,13 +14,22 @@ The Scheduling Subsystem is a framework used to implement scheduling algorithms. | |
| - The entry & exit points should be defined by the framework, acting as the API surface of the system | ||
| - Multiple scheduling 'profiles' should be able to be run for a single request. | ||
| - They can be conditionally dependent on previous runs, or in parallel | ||
| - Plugin state is managed by the plugin itself | ||
| - State management | ||
| - State per request: This is managed by what we are calling CycleState and its lifecycle is tied to the request. | ||
| Cycle state is created internally by the Scheduler per request and its pointer is passed as argument. | ||
| - State managed by the plugin struct itself: The lifecycle of this state is tied to the plugin, and since plugins will be instantiated once, | ||
| it is a state that plugins can use across requests (like prefix-cache index). | ||
| - State managed by the data layer: each endpoint will be associated with state (currently metrics) that a data layer plugin can add to it. | ||
| A data layer plugin could be one that scrapes v1/models from the endpoint for example. | ||
|
|
||
| ## Definitions | ||
| - **Scheduling Framework** - The system created to allow for a pluggable scheduling algorithm. | ||
| - **Scheduling Profile** - A named, specific set of Filter(s), Scorer(s), & Picker used to select endpoints. | ||
| - **Scheduler** - An extensible implementation of a scheduling algorithm. Including logic to select Scheduling Profiles, the Scheduling Profiles themselves, & logic to interpret the result. | ||
| - **Scheduling Cycle** - A single run of a Scheduler through the Scheduling Framework. | ||
| - **Scheduler Profile** - A named, specific set of Filter(s), Scorer(s), & Picker used to select endpoints. | ||
| - **Scheduler Profile Run** - A one-time run of the Scheduler Profile's filters, scorers, and picker for a given request. | ||
| - **Scheduler** - An extensible implementation of a scheduling algorithm. Including logic to select Scheduler Profiles iteratively, | ||
| the Scheduler Profiles themselves, & logic to interpret the result. | ||
| - **Scheduling Cycle** - A single run of a Scheduler through the Scheduling Framework. A scheduling cycle includes one or | ||
| more Scheduler Profile runs. | ||
| - **Plugin** - Implementation of framework-defined interface(s) to add or extend logic across the framework. | ||
|
|
||
| ## Proposal | ||
|
|
@@ -33,23 +42,24 @@ The Scheduling System can loosely be defined into 3 sections: | |
| - A *configuration API* to define the Scheduler, Profile(s), & the plugins used within those profiles | ||
|
|
||
| A sketch of the System, with extension points is here: | ||
| <img src="./images/scheduler_subsystem.svg" alt="Scheduling Algorithm" width="1000" /> | ||
| <img src="./images/scheduler_cycle.png" alt="Scheduling Algorithm" width="1000" /> | ||
|
|
||
| Describing the interface extension points & flow is the simplest way to convey the intent of what the framework should enable: | ||
|
|
||
| ### PreSchedule | ||
| ### ProfileSelect (or ProfilePick) | ||
|
|
||
| PreSchedule is the entry point into the scheduling cycle (called by the framework). PreSchedule, selects profiles conditionally based on: | ||
| ProfileSelect is the entry point into the scheduling cycle (called by the framework). | ||
| ProfileSelect selects profiles conditionally based on: | ||
|
|
||
| - Request data | ||
| - Results | ||
| - Results of previously executed SchedulerProfiles | ||
| - Cycle State | ||
|
|
||
| PreSchedule will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call. Only a single PreSchedule function may be defined per scheduler. | ||
| ProfileSelect will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call. Only a single ProfileSelect function may be defined per scheduler. | ||
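The repeated-call contract above can be sketched as a driver loop. This is a hypothetical, simplified sketch: the stand-in types and the two-profile prefill/decode sequence are illustrative only, and the real interfaces also thread a context, the Request, and CycleState through these calls.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the proposal's types.
type SchedulerProfile struct{ Name string }
type ScoredEndpoint struct {
	Address string
	Score   float64
}

// pickProfiles stands in for the profile-selection extension point: it is
// called with the results so far and returns the profiles to run next, or
// an empty map once the cycle should end. Here: "prefill" first, then
// "decode", then stop (an illustrative two-profile sequence).
func pickProfiles(profiles map[string]*SchedulerProfile, results map[string][]*ScoredEndpoint) map[string]*SchedulerProfile {
	if _, ran := results["prefill"]; !ran {
		return map[string]*SchedulerProfile{"prefill": profiles["prefill"]}
	}
	if _, ran := results["decode"]; !ran {
		return map[string]*SchedulerProfile{"decode": profiles["decode"]}
	}
	return nil // empty => cycle ends
}

// runProfile stands in for one Scheduler Profile run (Filter -> Score -> Pick).
func runProfile(p *SchedulerProfile) []*ScoredEndpoint {
	return []*ScoredEndpoint{{Address: p.Name + "-endpoint", Score: 1.0}}
}

// RunCycle keeps calling pickProfiles until no profiles are returned,
// mirroring the repeated ProfileSelect calls described above.
func RunCycle(profiles map[string]*SchedulerProfile) map[string][]*ScoredEndpoint {
	results := map[string][]*ScoredEndpoint{}
	for {
		picked := pickProfiles(profiles, results)
		if len(picked) == 0 {
			break
		}
		for name, p := range picked {
			results[name] = runProfile(p)
		}
	}
	return results
}

func main() {
	profiles := map[string]*SchedulerProfile{"prefill": {Name: "prefill"}, "decode": {Name: "decode"}}
	fmt.Println(len(RunCycle(profiles))) // 2
}
```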
|
|
||
| ### Profile Cycle | ||
| ### Scheduler Profile Run | ||
|
|
||
| The profile cycle consists of 3 defined functions `Filter`, `Score`, & `Pick` | ||
| The SchedulerProfile run consists of 3 defined phases: `Filter`, `Score`, & `Pick` | ||
|
|
||
| *Profile Constraints* | ||
| - A profile can have any number of `Filter` plugins registered (including zero) | ||
|
|
@@ -61,16 +71,16 @@ The profile cycle consists of 3 defined functions `Filter`, `Score`, & `Pick` | |
| Filter runs before any scoring, and removes endpoints that are not fit for selection. The framework will return an error to the client if the endpoints are filtered to zero. | ||
|
|
||
| #### Score | ||
| Score applies a score to each remaining endpoint provided. Scorers SHOULD keep their score values in a normalized range: [0-1]. Any weighting should be added at the SchedulingProfile configuration level. | ||
| Score applies a score to each remaining endpoint provided. Scorers SHOULD keep their score values in a normalized range: [0-1]. Any weighting should be added at the SchedulerProfile configuration level. | ||
|
|
||
| #### Pick | ||
| Picker selects the endpoint(s) from the provided list of scored endpoints. Picker MUST return at least one endpoint. | ||
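A minimal sketch of one profile run under the constraints above, using hypothetical stand-ins for the three phases (a health-based filter, a queue-length scorer normalized to [0-1], and an argmax picker). None of these specific plugins are defined by the proposal; the real interfaces also take a context and CycleState.

```go
package main

import "fmt"

// Hypothetical endpoint shape for this sketch.
type Endpoint struct {
	Address  string
	Healthy  bool
	QueueLen int
}

// filterHealthy: a Filter removes endpoints unfit for selection.
func filterHealthy(endpoints []*Endpoint) []*Endpoint {
	out := make([]*Endpoint, 0, len(endpoints))
	for _, e := range endpoints {
		if e.Healthy {
			out = append(out, e)
		}
	}
	return out
}

// scoreByQueue: a Scorer keeping scores in the normalized [0-1] range
// (shorter queue => higher score).
func scoreByQueue(endpoints []*Endpoint) map[*Endpoint]float64 {
	scores := map[*Endpoint]float64{}
	for _, e := range endpoints {
		scores[e] = 1.0 / float64(1+e.QueueLen)
	}
	return scores
}

// pickMax: a Picker MUST return at least one endpoint; here, the argmax.
func pickMax(scores map[*Endpoint]float64) *Endpoint {
	var best *Endpoint
	for e, s := range scores {
		if best == nil || s > scores[best] {
			best = e
		}
	}
	return best
}

// RunProfile wires the three phases in order: Filter -> Score -> Pick.
func RunProfile(endpoints []*Endpoint) *Endpoint {
	remaining := filterHealthy(endpoints)
	if len(remaining) == 0 {
		return nil // the framework would surface an error to the client
	}
	return pickMax(scoreByQueue(remaining))
}

func main() {
	endpoints := []*Endpoint{
		{Address: "10.0.0.1", Healthy: true, QueueLen: 4},
		{Address: "10.0.0.2", Healthy: true, QueueLen: 1},
		{Address: "10.0.0.3", Healthy: false, QueueLen: 0},
	}
	fmt.Println(RunProfile(endpoints).Address) // 10.0.0.2
}
```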
|
|
||
|
|
||
| ### PostSchedule | ||
|
Contributor
I would like a PreRequest extension point (similar to the PostSchedule but I think calling it PreRequest is more accurate). The use cases are:
Why not just use PostResponse? The issue is latency. Suppose you have 2 requests for the same LoRA adapter at t1 and t2. Request1 was sent to server1 and response1 was back at t3 (taking longer than t2-t1). Then for Request2 you lose the affinity.
What about request failures? Request failures will lead to an inaccurate recording of the affinity. However, the impact is negligible so long as the failure rate is low (which should generally be expected).
Contributor
Author
I'll split my answer into two parts:
Contributor
I'm not sure the LoRA use case is worth supporting. It makes the assumption that LoRA loading is zero cost and happens on the fly, in response to requests. Unsure if this is always the case. Regarding prefix, I understand that it would currently benefit from both callbacks (optimistic prefix recording based on the prompt once selection of the target is done, and then validating and adding the response data in PostResponse). This is an implementation choice with tradeoffs (e.g., failure vs latency).
Contributor
Director sounds like a good place to have that
This issue should be solved by health checks. If an endpoint is down it should be removed from the eligible endpoints to pick from. This is a general issue, isn't it? Regardless of the "affinity" behavior, if a failed endpoint exists in the data layer, the scheduler will pick it and receive an error anyway.
Contributor
Depends on the use case, for example, independent requests could share the same system prompt or RAG documents.
Contributor
This is optimizing the head-of-line blocking problem, which is quite important. Imagine you have a new system prompt and get many concurrent requests containing that new shared system prompt. Without this optimization, you can spread them across many servers, losing the benefit of prefix affinity. The same applies to LoRA adapters. Currently we implement LoRA affinity by refreshing the loaded LoRA adapters via metrics every 50ms. At higher QPS, we see LoRA adapters spread (losing affinity), so even tighter metrics freshness is required. You are right that if the first request is very fast to respond, then this is not a big problem. However we cannot control that, as it's not uncommon to see inference requests take seconds or even longer to finish.
Contributor
Should we follow up with adding a
Contributor
Author
@liu-cong I suggest keeping this proposal scoped to the scheduler, and I'll open a new proposal that deals with more general pluggability soon, with a section for each layer (and then we can add PreRequest as a plugin in the requestcontrol layer). I will tag all participants on this PR in the new one as well.
Contributor
That makes sense. Do we have an up-to-date picture of the extension points and their names? Did we settle on names? Just want to ensure we are in agreement about where in the scheduling flow the extension points are run.
Contributor
Author
Yes. I'll push the most up-to-date names, including an updated diagram, soon (either later today or tomorrow).
||
| PostSchedule receives the output of the result(s) of the scheduling cycle(s) and makes sense of the data to be consumed by the calling system. | ||
| ### ProcessProfilesResults | ||
| ProcessProfilesResults receives the result(s) of the scheduler profile run(s) and makes sense of the data to be consumed by the calling system. | ||
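The ShadowBoxing example from the interface comments can be sketched as follows. The profile names ("production", "shadowbox") and the simplified Endpoint type are hypothetical; the real signature also receives the Request and returns a map rather than a single list.

```go
package main

import "fmt"

// Simplified stand-in for the proposal's Endpoint type.
type Endpoint struct{ Address string }

// ProcessProfileResults sketch: results of the hypothetical "shadowbox"
// profile are only logged, while the "production" profile's endpoints are
// returned for consumption by the calling system.
func ProcessProfileResults(profileResults map[string][]*Endpoint) []*Endpoint {
	for name, endpoints := range profileResults {
		if name != "production" {
			fmt.Printf("profile %q (shadow): %d endpoint(s) selected, result discarded\n", name, len(endpoints))
		}
	}
	return profileResults["production"]
}

func main() {
	final := ProcessProfileResults(map[string][]*Endpoint{
		"production": {{Address: "10.0.0.1"}},
		"shadowbox":  {{Address: "10.0.0.9"}},
	})
	fmt.Println(final[0].Address) // 10.0.0.1
}
```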
|
||
|
|
||
| ### PostResponse | ||
| ### PostResponse (Out of Scheduler and mentioned here for completeness only) | ||
|
||
| PostResponse is a special case extension that can optionally be implemented by a plugin that needs to augment its state based on response or request data. This should only be implemented for plugins that need to update state outside of the scheduling cycle. PostResponse is run at the time of processing a response. | ||
|
|
||
| ## ConfigurationAPI | ||
|
|
||
@@ -22,82 +22,120 @@ import ( | |
| scheduling "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" | ||
| ) | ||
|
|
||
| // READER NOTE: Currently CycleState is assumed to have appropriate request data rather than making a new object. | ||
|
|
||
| // Plugin is the parent type for all the scheduling framework plugins. | ||
| type Plugin interface { | ||
| Name() string | ||
| } | ||
|
|
||
| type Endpoint struct { | ||
| State EndpointState | ||
| Score float64 | ||
| } | ||
|
|
||
| type EndpointState struct { | ||
| // storage is per Scheduling Cycle, and so has no thread-safe concerns. | ||
| storage map[string]any //nolint:unused | ||
| // TODO should think if the above is true or should we use sync map for thread safety. | ||
|
||
| storage map[string]any | ||
| } | ||
|
|
||
| type SchedulingResult struct { | ||
| results map[string][]Endpoint //nolint:unused | ||
| // Request is a structured representation of the fields we parse out of the Request body. | ||
| type Request struct { | ||
| // RequestId is the Envoy generated Id for the request being processed | ||
| RequestId string | ||
| // TargetModel is the final target model after traffic split. | ||
| TargetModel string | ||
| // Prompt is the prompt that was sent in the request body. | ||
| Prompt string | ||
| // Headers is a map of the request headers. | ||
| Headers map[string]string | ||
| } | ||
|
|
||
| // Scheduler is the implementation of a... scheduler. | ||
| // The scheduler object is created at startup using the provided configuration. | ||
| type Scheduler interface { | ||
| // PreSchedule selects scheduling profiles through the implemented | ||
| // logic, and returns: | ||
| // - profiles - A subset of the registered scheduling profiles to be run | ||
| PreSchedule(request map[string]any, data scheduling.CycleState, results map[string][]Endpoint) map[string]SchedulingProfile | ||
| // ScoredEndpoint encapsulates Endpoint with its Score. | ||
| // The lifecycle of an endpoint is typically different than a lifecycle of a request. | ||
| // This is intended to be used only internally by Scheduler logic and/or scheduler plugins within the lifecycle of the request. | ||
| // When returning the selected Endpoint(s) out of the Scheduler, an Endpoint is returned without the score. | ||
| type ScoredEndpoint struct { | ||
| Endpoint | ||
| Score float64 | ||
| } | ||
|
|
||
| // PostSchedule receives the output of the result(s) of the scheduling cycle(s) | ||
| // and makes sense of the data to be consumed by the calling system. | ||
| // For example: suppose you have 2 profiles ShadowBoxing Profile & Production Profile. | ||
| // PostSchedule would know to simply log the result of ShadowBoxing | ||
| // profile, and do nothing else with it. | ||
| PostSchedule(profileResults map[string][]Endpoint) SchedulingResult | ||
| type Scheduler struct { | ||
| SchedulerConfig | ||
| } | ||
|
|
||
| // SchedulerConfig is the struct that maps to the configuration file that should be further discussed. | ||
| // the configuration file should include the multi profile plugin as well as the profiles with their plugins. | ||
| // TODO should update the configuration file example.yaml to discuss its structure. | ||
| type SchedulerConfig struct { | ||
| // exactly one MultiProfilePlugin instance is required. | ||
| multiProfilePlugin MultiProfilePlugin | ||
| // map from profile name to its set of plugins. | ||
| profiles map[string]*SchedulerProfile | ||
| } | ||
|
|
||
| // SchedulingProfile is used to describe a profile that will | ||
| // SchedulerProfile is used to describe a profile that will | ||
| // run for a given scheduling cycle. | ||
| type SchedulingProfile struct { | ||
| // Name of the profile. | ||
| Name string | ||
| // Filters lists all Filter plugins associated with this Profile. Filters | ||
| // are optional. | ||
| Filters []Filter | ||
| // Scorers lists all Score plugins associated with this Profile. Scorers | ||
| // are optional. | ||
| Scorers map[Scorer]int | ||
| type SchedulerProfile struct { | ||
| // Filters lists all Filter plugins associated with this Profile. | ||
| // Filters are optional. | ||
| filters []Filter | ||
| // Scorers lists all Score plugins associated with this Profile. | ||
| // Scorers are optional. | ||
| scorers []*WeightedScorer | ||
| // Picker returns the function that picks the endpoint(s). Picker is required. | ||
| Picker Picker | ||
| picker Picker | ||
| } | ||
|
|
||
| // Filter runs before any scoring, and remove endpoints that are not fit for | ||
| // selection. The framework will return an error to the client if the endpoints | ||
| // are filtered to zero. | ||
| // Plugin is the parent type for all the scheduling framework plugins. | ||
| type Plugin interface { | ||
| Name() string | ||
| } | ||
|
|
||
| // MultiProfilePlugin defines the interface for handling multi SchedulerProfile instances. | ||
|
||
| type MultiProfilePlugin interface { | ||
| Plugin | ||
| // PickProfiles picks the SchedulingProfile objects to run from a list of candidate profiles, | ||
| // while taking into consideration the request properties | ||
| // and the previously executed SchedulerProfile runs along with their results. | ||
| // returns: | ||
| // - profiles - A subset of the registered scheduling profiles to be run in the next iteration | ||
| PickProfiles(request *Request, profiles map[string]*SchedulerProfile, executionResults map[string][]*ScoredEndpoint) map[string]*SchedulerProfile | ||
|
|
||
| // ProcessProfileResults handles the outcome of each selected profile. | ||
| // It may aggregate results, log test profile outputs, or apply custom logic. | ||
| // For example: suppose you have 2 profiles ShadowBoxing Profile & Production Profile. | ||
| // ProcessProfileResults would know to simply log the result of ShadowBoxing | ||
| // profile, and do nothing else with it. | ||
|
||
| ProcessProfileResults(request *Request, profileResults map[string][]*ScoredEndpoint) map[string][]*Endpoint | ||
| } | ||
|
Contributor
Author
A question I was thinking about: in llm-d, for example, since the scheduler is aware of HTTP, we can set the prefill header at this point and return only the decode selection.
Contributor
I was thinking that this will be done by the
This maintains the separation of concern and composes with the non p/d case nicely.
Contributor
Author
I followed everything except for the CycleState part, in which I probably didn't understand your intention.
ProcessProfileResults(request *Request, profileResults map[string][]*ScoredEndpoint) map[string][]*Endpoint
Looking at the return value here - we get back a map from profile name -> its results (set of endpoints). let's assume we have can you explain what you meant in the CycleState part?
CycleState is scoped within the scheduler, so it is not returned back to the requestcontrol layer.
Contributor
Oh, I thought we return a single final list that in the normal case a PreRequest plugin uses to set the destination endpoints.
I assumed CycleState is scoped across all extension points executed for a request. If we do that, then what I was proposing is to use CycleState as the communication channel between the P/D specific ProcessResults and PreRequest plugins to communicate the list of prefill servers to be set in the headers. Again, this was assuming that ProcessResults returned a list of endpoints, not a map. Another approach is to continue to have ProcessResult return a map, but the decode endpoints should have a key that the default PreRequest plugin understands for both the default and p/d case (in the p/d case, ProcessResult can set the key for the decode profile endpoints to whatever value is expected by the default PreRequest plugin). And the additional PreRequest plugin operates on the prefill endpoints to set them as a header. I am trying to avoid having the scheduling layer implement parts of the epp <-> proxy protocol. In summary, we will have two PreRequest plugins: The first ships with IGE and implements the destination endpoint protocol, from the
Contributor
I prefer the second approach, but I still think we should scope CycleState to be across all extension points executed on a request. We can define CycleState in the common pkg.
Contributor
Author
I'm also ok with cycle state per request, although I'd prefer avoiding it. I just think we need to pay special attention that the field is not abused, cause it becomes extremely easy to communicate information in a generic cycle state rather than well defined interfaces. I would have a cycle state per layer during the request lifecycle to verify the interfaces between the layers are well defined.
Contributor
Yes, we have a few followups as well related to converting this into a generic extensibility proposal with subsections for various layers, defining a common pkg for types and interfaces, grouping the plugin implementations under one directory.
Collaborator
Just calling this out: if we do this, we are stating that the EPP architecture is no longer made up of independent subsystems that could be broken out into their own lib, and are instead tying them all together. I don't particularly love that. The hope was that the director/request control would be the mortar where we would handle the grey area. But I'm not willing to die on this hill; more just calling it out.
Contributor
Discussed offline with Kellen. I think we need to strike a balance between our goal of maintaining a reusable scheduling library and a design that works well across the layers we define in EPP. My recommendation is to define common types, like CycleState and Request, that all layers depend on. This doesn't prevent the scheduling library from being reusable in a different context. I don't think we want to prevent plugins from implementing extensions across subsystems if it makes sense to do so. Each plugin should declare its dependencies, and if someone wants to use the scheduling library in a different context, then they can disable the plugins that do that if they are incompatible with their system, or implement in their system the behavior expected by those other extensions (e.g., adding state to CycleState that is expected by the scheduling parts of the plugin).
Contributor
Author
I agree with Kellen (at least with his initial comment) that it would be good NOT to have one CycleState per request, as it may cause some issues of bypassing interfaces that we define between the layers.
I agree that we need to balance, as @ahg-g mentioned.
||
|
|
||
| // Filter runs before any scoring, and removes endpoints that are not fit for selection. | ||
| // The framework will return an error to the client if the endpoints are filtered to zero. | ||
| type Filter interface { | ||
| Plugin | ||
| Filter(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint | ||
| Filter(ctx context.Context, request *Request, state *scheduling.CycleState, endpoints []*Endpoint) []*Endpoint | ||
| } | ||
|
|
||
| // Scorer applies a score to each remaining endpoint provided. Scorers SHOULD | ||
| // keep their score values in a normalized range: [0-1]. Any weighting should | ||
| // be added at the SchedulingProfile configuration level. | ||
| // Scorer applies a score to each remaining endpoint provided. | ||
| // Scorers SHOULD keep their score values in a normalized range: [0-1]. | ||
| // Any weighting should be added at the SchedulerProfile configuration level. | ||
| type Scorer interface { | ||
| Plugin | ||
| Score(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint | ||
| Score(ctx context.Context, request *Request, state *scheduling.CycleState, endpoints []*Endpoint) []*ScoredEndpoint | ||
| } | ||
|
|
||
| // WeightedScorer is a struct that encapsulates a scorer with its weight. | ||
| // We need this struct in order to be able to keep scorers in profile as a slice instead of a map. | ||
| // This is very useful for having a generic AddPlugin function that registers a plugin to all its extension points. | ||
| // Using a map is much less convenient for this purpose. | ||
| type WeightedScorer struct { | ||
| Scorer | ||
| weight int | ||
| } | ||
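A sketch of how per-scorer weights might be applied when aggregating normalized scores at the profile level. The stand-in type uses a plain score function to keep the sketch self-contained, whereas the proposal's WeightedScorer embeds the Scorer interface; the scorers and weights shown are hypothetical.

```go
package main

import "fmt"

// Simplified stand-in: a score function plus its integer weight.
type WeightedScorer struct {
	Score  func(endpoint string) float64
	Weight int
}

// CombineScores sums weight * normalized score per endpoint, i.e. the
// weighting is applied at the profile level rather than inside scorers.
func CombineScores(scorers []WeightedScorer, endpoints []string) map[string]float64 {
	totals := map[string]float64{}
	for _, ws := range scorers {
		for _, e := range endpoints {
			totals[e] += float64(ws.Weight) * ws.Score(e)
		}
	}
	return totals
}

func main() {
	scorers := []WeightedScorer{
		// hypothetical scorers, both keeping outputs in [0-1]
		{Weight: 3, Score: func(e string) float64 {
			if e == "a" {
				return 1.0
			}
			return 0.5
		}},
		{Weight: 1, Score: func(e string) float64 { return 0.5 }},
	}
	fmt.Println(CombineScores(scorers, []string{"a", "b"})["a"]) // 3.5
}
```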
|
|
||
| // Picker selects the endpoint(s) from the provided list of scored endpoints. | ||
| // Picker MUST return one endpoint at minimum. | ||
| type Picker interface { | ||
| Plugin | ||
| Pick(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint | ||
| Pick(ctx context.Context, state *scheduling.CycleState, endpoints []*ScoredEndpoint) []*ScoredEndpoint | ||
| } | ||
|
|
||
| // PostResponse is NOT part of the scheduler subsystem but is specified here for completeness only. | ||
| type PostResponse interface { | ||
| Plugin | ||
| PostResponse(ctx context.Context, request map[string]any, response map[string]any) | ||
| PostResponse(ctx context.Context, request *Request, response map[string]any) | ||
| } | ||