Conversation

@nirrozenbaum
Contributor

This is a continuation of the scheduler interface work started in #845.
This PR currently updates only the interfaces.go file; further changes to the other proposal files will follow.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 31, 2025
@netlify

netlify bot commented May 31, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit f8a14d3
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6845a6f9081a2200080e8821
😎 Deploy Preview https://deploy-preview-905--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 31, 2025
@k8s-ci-robot k8s-ci-robot requested review from Jeffwan and robscott May 31, 2025 21:06
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 31, 2025
Contributor

@ahg-g ahg-g left a comment

Thanks! I recommend adding an example of how the framework will execute the extension points. This will help align and clarify when each extension is executed.
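To illustrate the kind of execution example being requested, here is a minimal sketch of how a framework like this might run the extension points in order during one scheduling cycle (Filter → Score → Pick). All type and function names below are illustrative assumptions, not the proposal's actual API.

```go
package main

import "fmt"

type Endpoint struct{ Name string }

type ScoredEndpoint struct {
	Endpoint Endpoint
	Score    float64
}

// Filter removes ineligible endpoints (hypothetical signature).
type Filter func(eps []Endpoint) []Endpoint

// Scorer assigns a score to one endpoint; scores from all scorers are summed.
type Scorer func(ep Endpoint) float64

// Picker selects at least one endpoint from the scored list.
type Picker func(scored []ScoredEndpoint) []ScoredEndpoint

func runCycle(eps []Endpoint, filters []Filter, scorers []Scorer, pick Picker) []ScoredEndpoint {
	// 1. Filters run first and narrow the candidate set.
	for _, f := range filters {
		eps = f(eps)
	}
	// 2. Scorers run on the remaining candidates; scores accumulate.
	scored := make([]ScoredEndpoint, 0, len(eps))
	for _, ep := range eps {
		total := 0.0
		for _, s := range scorers {
			total += s(ep)
		}
		scored = append(scored, ScoredEndpoint{Endpoint: ep, Score: total})
	}
	// 3. The picker makes the final selection (must return >= 1 endpoint).
	return pick(scored)
}

func main() {
	eps := []Endpoint{{"pod-a"}, {"pod-b"}, {"pod-c"}}
	dropC := Filter(func(eps []Endpoint) []Endpoint {
		out := eps[:0]
		for _, ep := range eps {
			if ep.Name != "pod-c" {
				out = append(out, ep)
			}
		}
		return out
	})
	preferB := Scorer(func(ep Endpoint) float64 {
		if ep.Name == "pod-b" {
			return 1.0
		}
		return 0.0
	})
	pickBest := Picker(func(scored []ScoredEndpoint) []ScoredEndpoint {
		best := scored[0]
		for _, s := range scored[1:] {
			if s.Score > best.Score {
				best = s
			}
		}
		return []ScoredEndpoint{best}
	})
	result := runCycle(eps, []Filter{dropC}, []Scorer{preferB}, pickBest)
	fmt.Println(result[0].Endpoint.Name) // prints "pod-b"
}
```

The point of the sketch is the ordering: each extension point sees only the output of the previous one, which is what makes the timing of each plugin's execution unambiguous.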


type EndpointState struct {
// storage is per Scheduling Cycle, and so has no thread-safety concerns.
// TODO: confirm the above is true, or use a sync.Map for thread safety.
Contributor

this will be the case unless the extension interfaces are changed to evaluate one pod at a time, with the framework launching those evaluations in parallel.

Contributor Author

@nirrozenbaum nirrozenbaum Jun 3, 2025

how about running profiles in parallel? or running different scorers in parallel?
if we eventually have support for python scorers, for example, we might want to parallelize those scorers.

I think this is very low priority atm, so maybe it's not worth discussing it at this point.

@nirrozenbaum nirrozenbaum force-pushed the scheduler-framework branch from 92a3711 to 802a806 Compare June 3, 2025 08:14
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2025
Signed-off-by: Nir Rozenbaum <[email protected]>
Picker selects the endpoint(s) from the provided list of scored endpoints. Picker MUST return at least one endpoint.


### PostSchedule
Contributor

I would like a PreRequest extension point (similar to the PostSchedule but I think calling it PreRequest is more accurate). The use cases are:

  1. Prefix cache affinity. Once a server is picked for a request, update the prefix indexes from the request prompt.
  2. LoRA affinity, or some other form of session affinity. Record the server(s) a LoRA adapter is scheduled on so that the following requests can be scheduled on the same server.

Why not just use PostResponse? The issue is latency. Suppose you have 2 requests for the same LoRA adapter at t1 and t2. Request1 was sent to server1, and response1 came back at t3 (taking longer than t2-t1). Then for Request2 you lose the affinity.

What about request failures? Request failures will lead to an inaccurate recording of the affinity. However, the impact is negligible so long as the failure rate is low (which should generally be expected).
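The LoRA-affinity use case above can be sketched as a small plugin that records the adapter-to-server mapping at request-send time, so the mapping is visible before the first response returns. The interface and names here are illustrative assumptions, not the actual plugin API.

```go
package main

import (
	"fmt"
	"sync"
)

// AdapterAffinity records which endpoint a LoRA adapter was last routed to.
// A PreRequest-style hook updates it as soon as an endpoint is picked,
// before the request is forwarded (and long before PostResponse fires).
type AdapterAffinity struct {
	mu      sync.RWMutex
	servers map[string]string // adapter name -> endpoint address
}

func NewAdapterAffinity() *AdapterAffinity {
	return &AdapterAffinity{servers: make(map[string]string)}
}

// PreRequest runs right after scheduling, before the request is sent.
func (a *AdapterAffinity) PreRequest(adapter, endpoint string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.servers[adapter] = endpoint
}

// Lookup lets a scorer or filter prefer the recorded endpoint.
func (a *AdapterAffinity) Lookup(adapter string) (string, bool) {
	a.mu.RLock()
	defer a.mu.RUnlock()
	ep, ok := a.servers[adapter]
	return ep, ok
}

func main() {
	aff := NewAdapterAffinity()
	// Request1 for "lora-x" is scheduled to pod-a at t1; record immediately.
	aff.PreRequest("lora-x", "pod-a")
	// Request2 for "lora-x" arrives at t2, before response1 returns at t3:
	// the affinity is already visible, unlike with a PostResponse-only hook.
	if ep, ok := aff.Lookup("lora-x"); ok {
		fmt.Println("route to", ep)
	}
}
```

This is exactly the latency argument above: the mapping is written at pick time rather than at response time, so concurrent requests between t1 and t3 still see the affinity.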

Contributor Author

I'll split my answer into two parts:

  1. your request sounds to me like you're looking for an extension point in the director (in the requestcontrol layer) rather than in the scheduler. the scheduler is responsible for selecting the endpoint(s); once a selection is made, the caller (the director) can do anything it wants with this information, like what you requested.
  2. the behavior you're describing sounds incorrect to me. the fact that the scheduler selected an endpoint is not enough to conclude that the next requests should be directed to the same server (in both examples you provided). the server might be non-responsive, and the scheduler might have picked it according to metrics that were collected before it became unavailable. this may lead to many subsequent requests being sent to the same non-responsive server. If the server did NOT receive the request or did NOT handle it successfully, we shouldn't assume it has the relevant cache.

Contributor

I'm not sure the LoRA use case is worth supporting. It assumes that LoRA loading is zero cost and happens on the fly, in response to requests. I'm unsure if this is always the case.
If not, then the first request choosing some Pod means it already had the LoRA adapter available and loaded, which is still the case when the second request comes. Hope I understood the described use case correctly...
Similarly for session affinity, do you see multiple identical requests being sent in parallel without receiving a response? That might not be likely for a chat-like scenario, but maybe there are others where it makes sense.

Regarding prefix, I understand that it currently would benefit from both callbacks (optimistic prefix accounting based on the prompt once the target selection is done, then validating and adding the response data in PostResponse). This is an implementation choice with tradeoffs (e.g., failure vs latency).

  • this optimizes for the case of independent requests having the same exact prefix. How common is that?
  • the PostResponse hook is called multiple times: first when the headers are received and then again for every body segment (one or more, depending on buffering). Headers are received soon after Prefill is done (typically much shorter than Decode processing). In that case, I'm not sure the window for receiving the (rare?) same-prefix independent requests makes a good argument for adding another plugin point inside the scheduler.

Contributor

your request sounds to me like you're looking for an extension point in the director (in the requestcontrol layer) rather than in the scheduler. the scheduler is responsible for selecting the endpoint(s); once a selection is made, the caller (the director) can do anything it wants with this information, like what you requested.

Director sounds like a good place to have that PreRequest extension point given it also hosts the PostResponse.

the server might be non-responsive, and the scheduler might have picked it according to metrics that were collected before it became unavailable.

This issue should be solved by health checks. If an endpoint is down, it should be removed from the eligible endpoints to pick from. This is a general issue, isn't it? Regardless of the "affinity" behavior, if a failed endpoint exists in the data layer, the scheduler will pick it and receive an error anyway.

Contributor

this optimizes for the case of independent requests having the same exact prefix. How common is that?

Depends on the use case, for example, independent requests could share the same system prompt or RAG documents.

Contributor

It seems to me that we might be optimizing for a relatively narrow time window

This is optimizing for the head-of-line blocking problem, which is quite important. Imagine you have a new system prompt and get many concurrent requests containing that new shared system prompt. Without this optimization, you can spread them across many servers, losing the benefit of prefix affinity. The same applies to a LoRA adapter.

Currently we implement LoRA affinity by refreshing the loaded LoRA adapters via metrics every 50ms. At higher QPS, we see LoRA adapters spread (losing affinity). So an even tighter metrics freshness is required.

You are right that if the first request is very fast to respond, then this is not a big problem. However, we cannot control that, as it's not uncommon for inference requests to take seconds or even longer to finish.

Contributor

Should we follow up with adding a PreRequest extension point in the director? @nirrozenbaum @elevran

Contributor Author

@liu-cong I suggest keeping this proposal scoped to the scheduler; I'll soon open a new proposal that deals with more general pluggability, with a section for each layer (and then we can add PreRequest as a plugin in the requestcontrol layer). I'll tag all participants of this PR in the new one as well.
as @ahg-g wrote in the other comment, this PR seems very close to complete and merge, so it's probably better to open a general one rather than keep iterating on this one.

Contributor

That makes sense.

Do we have an up-to-date picture of the extension points and their names? Did we settle on names? Just want to ensure we are in agreement about where in the scheduling flow the extension points are run.

Contributor Author

yes. I’ll push the most up-to-date names soon (either later today or tomorrow), including an updated diagram.

Signed-off-by: Nir Rozenbaum <[email protected]>
Signed-off-by: Nir Rozenbaum <[email protected]>
@elevran
Contributor

elevran commented Jun 4, 2025

Should we then make the picker an extension across profile calls, basically collapsing Picker and ProcessProfileResults into one extension called Picker?

I may be missing context, but don't we need a Picker per profile run and then something to consolidate the results from multiple profiles (e.g., a Picker in both the P and D profiles, but returning only D in ProcessProfileResults)?

@ahg-g
Contributor

ahg-g commented Jun 4, 2025

As a followup, I think we should convert this into a generic pluggability proposal with sections for each layer. This will allow us to define the common parts as we discussed on the other PR.

@nirrozenbaum
Contributor Author

As a followup, I think we should convert this into a generic pluggability proposal with sections for each layer. This will allow us to define the common parts as we discussed on the other PR.

++ agreed 👍

@nirrozenbaum nirrozenbaum changed the base branch from main to release-0.1 June 5, 2025 16:25
@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 5, 2025
@nirrozenbaum nirrozenbaum changed the base branch from release-0.1 to main June 5, 2025 16:29
@k8s-ci-robot k8s-ci-robot removed do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 5, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 5, 2025
// ProcessProfileResults would know to simply log the result of ShadowBoxing
// profile, and do nothing else with it.
ProcessProfileResults(request *Request, profileResults map[string][]*ScoredEndpoint) map[string][]*Endpoint
}
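The comment in the excerpt above describes a ProcessProfileResults implementation that only logs one profile's results. A minimal sketch of that behavior, with assumed simplified types (the real Request/Endpoint types differ), might look like:

```go
package main

import "fmt"

type Request struct{ ID string }
type Endpoint struct{ Name string }
type ScoredEndpoint struct {
	Endpoint *Endpoint
	Score    float64
}

// processProfileResults logs the ShadowBoxing profile's results and drops
// them, passing all other profiles' endpoints through unchanged.
func processProfileResults(req *Request, profileResults map[string][]*ScoredEndpoint) map[string][]*Endpoint {
	out := make(map[string][]*Endpoint, len(profileResults))
	for profile, scored := range profileResults {
		if profile == "ShadowBoxing" {
			fmt.Printf("request %s: shadow profile scored %d endpoints\n", req.ID, len(scored))
			continue // log only; not part of the returned results
		}
		eps := make([]*Endpoint, 0, len(scored))
		for _, s := range scored {
			eps = append(eps, s.Endpoint)
		}
		out[profile] = eps
	}
	return out
}

func main() {
	res := processProfileResults(&Request{ID: "r1"}, map[string][]*ScoredEndpoint{
		"default":      {{Endpoint: &Endpoint{Name: "pod-a"}, Score: 0.9}},
		"ShadowBoxing": {{Endpoint: &Endpoint{Name: "pod-b"}, Score: 0.1}},
	})
	fmt.Println(len(res["default"]), len(res["ShadowBoxing"])) // prints: 1 0
}
```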
Contributor Author

@nirrozenbaum nirrozenbaum Jun 5, 2025

a question I was thinking about -
since we agreed that the Scheduler IS aware of the http protocol (the Request struct represents it), what are your thoughts about having ProcessResults get all profile results but return only one?

in llm-d for example - since the scheduler is aware of http, we can set the prefill header at this point and return only the decode selection.
and in general I think it makes sense to return one Result, i.e., the results of one profile (return []*Endpoint and optionally an array of backups)
@elevran @ahg-g

Contributor

I was thinking that this will be done by the PostSchedule plugin. Here is my thinking of how we will compose with the default behavior:

This maintains the separation of concern and composes with the non p/d case nicely.

Contributor Author

@nirrozenbaum nirrozenbaum Jun 6, 2025

I followed everything except for the CycleState part, where I probably didn't understand your intention.
in the current proposal, the ProcessProfileResults signature is:

ProcessProfileResults(request *Request, profileResults map[string][]*ScoredEndpoint) map[string][]*Endpoint

looking at the return value here - we get back a map from profile-name -> its results (set of endpoints).

let's assume we have PostSchedule in the director (or, as @liu-cong named it, PreRequest, which I agree is more descriptive/accurate; plus, having PreRequest and PostResponse sounds to me like better terminology than PostSchedule and PostResponse).
using this plugin we should be able to set the headers for the destination (decode) + prefill.

can you explain what you meant in the CycleState part?

ProcessResults returns the endpoints for the decoder (and so the DestinationEndpoint plugin will set them as per the epp <-> proxy protocol), while the prefill endpoint is stored in CycleState.

CycleState's scope is within the scheduler, so it does not return back to the requestcontrol layer.

Contributor

@ahg-g ahg-g Jun 6, 2025

looking at the return value here - we get back a map from profile-name -> its results (set of endpoints).

Oh, I thought we returned a single final list that, in the normal case, a PreRequest plugin uses to set the destination endpoints.

CycleState's scope is within the scheduler, so it does not return back to the requestcontrol layer.

can you explain what you meant in the CycleState part?

I assumed CycleState is scoped across all extension points executed for a request. If we do that, then what I was proposing is to use CycleState as the communication channel between the P/D-specific ProcessResults and PreRequest plugins, to communicate the list of prefill servers to be set in the headers. Again, this was assuming that ProcessResults returned a list of endpoints, not a map.

Another approach is to continue to have ProcessResults return a map, but the decode endpoints should have a key that the default PreRequest plugin understands in both the default and p/d cases (in the p/d case, ProcessResults can set the key for the decode profile endpoints to whatever value is expected by the default PreRequest plugin). An additional PreRequest plugin then operates on the prefill endpoints to set them as a header.

I am trying to avoid having the scheduling layer implement parts of the epp <-> proxy protocol.

In summary, we will have two PreRequest plugins: the first ships with IGE and implements the destination endpoint protocol; from the map[string][]*Endpoint it uses a well-known key to look up the list of endpoints it will use. The second ships with llm-d and implements the vLLM protocol for setting the prefill endpoints as a header.
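The well-known-key convention described here can be sketched roughly as follows. The key names, header names, and function signatures are illustrative assumptions only, not the actual protocol.

```go
package main

import "fmt"

// destinationKey is a hypothetical well-known key the default PreRequest
// plugin reads; in the p/d case, ProcessResults re-keys the decode
// profile's endpoints under it.
const destinationKey = "destination"

type Endpoint struct{ Address string }

// Default PreRequest plugin: implements the destination-endpoint protocol
// by looking up the well-known key.
func setDestination(results map[string][]*Endpoint, headers map[string]string) {
	if eps, ok := results[destinationKey]; ok && len(eps) > 0 {
		headers["x-destination-endpoint"] = eps[0].Address
	}
}

// llm-d-style PreRequest plugin: sets the prefill endpoint as a header.
func setPrefill(results map[string][]*Endpoint, headers map[string]string) {
	if eps, ok := results["prefill"]; ok && len(eps) > 0 {
		headers["x-prefill-endpoint"] = eps[0].Address
	}
}

func main() {
	// Output of a hypothetical p/d ProcessResults call.
	results := map[string][]*Endpoint{
		destinationKey: {{Address: "decode-pod:8000"}},
		"prefill":      {{Address: "prefill-pod:8000"}},
	}
	headers := map[string]string{}
	setDestination(results, headers) // default plugin, key-based lookup
	setPrefill(results, headers)     // p/d plugin, prefill header
	fmt.Println(headers["x-destination-endpoint"], headers["x-prefill-endpoint"])
}
```

The design point this illustrates: the scheduler only produces a keyed map, and both PreRequest plugins translate it into protocol headers, so the scheduling layer never touches the epp <-> proxy protocol itself.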

Contributor

I prefer the second approach, but I still think we should scope CycleState to be across all extension points executed on a request. We can define CycleState in the common pkg.

Contributor Author

@nirrozenbaum nirrozenbaum Jun 6, 2025

I’m also ok with cycle state per request, although I’d prefer avoiding it. I just think we need to pay special attention that the field is not abused, because it becomes extremely easy to communicate information via a generic cycle state rather than through well-defined interfaces. I would have a cycle state per layer during the request lifecycle to verify that the interfaces between the layers are well defined.

Contributor

and if we agree on this I think the proposal is ready for finalization, as we converged on everything

Yes, and we have a few followups related to converting this into a generic extensibility proposal with subsections for the various layers, defining a common pkg for types and interfaces, and grouping the plugin implementations under one directory.

Collaborator

we should scope CycleState to be across all extension points executed on a request

Just calling this out, but if we do this, we are stating that the EPP architecture is no longer made up of independent subsystems that could be broken out into their own libraries; instead we are tying them all together.

I don't particularly love that. The hope was that the director/request-control layer would be the mortar where we handle the grey areas. But I'm not willing to die on this hill; I'm more just calling it out.

Contributor

@ahg-g ahg-g Jun 9, 2025

Discussed offline with Kellen, I think we need to strike a balance between our goal of maintaining a reusable scheduling library and a design that works well across the layers we define in EPP.

My recommendation is to define common types, like CycleState and Request that all layers depend on. This doesn't prevent the scheduling library from being reusable in a different context.

I don't think we want to prevent plugins from implementing extensions across subsystems if it makes sense to do so. Each plugin should declare its dependencies, and if someone wants to use the scheduling library in a different context, they can disable the plugins that do that if they are incompatible with their system, or implement in their system the behavior expected by those other extensions (e.g., adding state to CycleState that is expected by the scheduling parts of the plugin).

Contributor Author

I agree with Kellen (at least with his initial comment) that it would be good NOT to have one CycleState per request, as it may cause issues of bypassing the interfaces that we define between the layers.
as I stated in previous comments:

I just think we need to pay special attention that the field is not abused, because it becomes extremely easy to communicate information via a generic cycle state rather than through well-defined interfaces. I would have a cycle state per layer during the request lifecycle to verify that the interfaces between the layers are well defined.

I agree that we need to balance as @ahg-g mentioned.
let's discuss this more as we make progress with more extension points in more layers.

Signed-off-by: Nir Rozenbaum <[email protected]>
Signed-off-by: Nir Rozenbaum <[email protected]>
Signed-off-by: Nir Rozenbaum <[email protected]>
Signed-off-by: Nir Rozenbaum <[email protected]>
Signed-off-by: Nir Rozenbaum <[email protected]>
@nirrozenbaum nirrozenbaum changed the title [WIP] scheduler proposal continuation scheduler proposal continuation Jun 8, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2025
Contributor

@elevran elevran left a comment

/lgtm

@k8s-ci-robot
Contributor

@elevran: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ahg-g
Contributor

ahg-g commented Jun 9, 2025

one nit

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, elevran, nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2025
@k8s-ci-robot k8s-ci-robot merged commit 4c319af into kubernetes-sigs:main Jun 9, 2025
7 checks passed
@nirrozenbaum nirrozenbaum deleted the scheduler-framework branch June 9, 2025 16:32
@nirrozenbaum
Contributor Author

one nit

/lgtm /approve

@ahg-g what is the nit?
I can't find the extra comment.

- Cycle State

ProfileSelect will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call. Only a single ProfileSelect function may be defined per scheduler.
ProfilePick will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call.
Contributor

I thought we wanted to return a boolean

Contributor Author

ideally that would be good.
the point is that the next ProfilePick iteration depends on the current profile's execution result, so that's not always possible. for example in P/D (let's assume it's enabled):

  • on the first iteration we run D and get its results.
  • on the second iteration we inspect the results of D, do some prefix-cache check, and only then decide whether P should run or not.

this means we cannot decide upfront whether an additional iteration will be needed.
this is the original reason I haven't added a bool to this part.
see the code example from llm-d here.
this code is going to change to use SchedulerProfiles instead of two schedulers, so this check should be embedded in the profile selection logic.
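The iterative convention being debated (keep calling the profile handler until it returns an empty map) can be sketched as below. The type names and the "decode"/"prefill" profile names are illustrative assumptions based on the P/D example, not the actual API.

```go
package main

import "fmt"

type SchedulerProfile struct{ Name string }
type Result struct{ Endpoints []string }

// pickProfiles decides what to run next based on the results so far.
// For P/D: run the decode profile first; then, only if the (stand-in)
// prefix-cache check says so, run the prefill profile. An empty/nil
// return means no further iterations.
func pickProfiles(results map[string]*Result, needPrefill func(*Result) bool) map[string]*SchedulerProfile {
	if _, ran := results["decode"]; !ran {
		return map[string]*SchedulerProfile{"decode": {Name: "decode"}}
	}
	if _, ran := results["prefill"]; !ran && needPrefill(results["decode"]) {
		return map[string]*SchedulerProfile{"prefill": {Name: "prefill"}}
	}
	return nil
}

func main() {
	results := map[string]*Result{}
	needPrefill := func(*Result) bool { return true } // stand-in for a prefix-cache check
	for {
		profiles := pickProfiles(results, needPrefill)
		if len(profiles) == 0 {
			break // the "empty map" convention ends the loop
		}
		for name := range profiles {
			// stand-in for actually running the profile's scheduling cycle
			results[name] = &Result{Endpoints: []string{name + "-pod"}}
		}
	}
	fmt.Println(len(results)) // decode and prefill both ran
}
```

This shows the tradeoff under discussion: the framework needs one final call that returns an empty map, whereas a rerun bool could sometimes end the loop a call earlier but cannot always be decided upfront.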

Contributor

@ahg-g ahg-g Jun 9, 2025

Without a boolean, you always need an extra call to ProfilePick. With a boolean, in some cases you will execute exactly the right number of iterations; in others you will still need an extra call that returns false with a nil profile.

Collaborator

Are we allowing multiple ProfileSelect functions to be defined per scheduler?

Additionally, the boolean makes it much clearer to the implementer how multiple selection is handled. Otherwise a user may skim the documentation and miss the line that specifies how multiple runs should be handled.

Contributor

IIRC, we wanted a boolean "call again" as an explicit return, whereas an empty slice is more of an implicit convention.
Both options are equally expressive:
call again == (len(profiles) > 0)
ProfilePick always gets called a second time in the example. The difference is only between the second call returning an empty slice or a boolean indicating that a third call is not needed.
At least that's the way I understand it...

Collaborator

They provide the same functionality, completely agreed there.

I'm just suggesting that a return type of ...) map[string]*SchedulerProfile vs ...) (selectedProfiles map[string]*SchedulerProfile, rerun bool) allows an implementer to more easily intuit the functionality of the system, or to ask themselves 'what does the rerun bool do?' and then go to the documentation and learn. With the implicit method, the user needs to be aware of a somewhat 'secret' behavior.

Sure, we will document it, but I don't think all users read documentation exhaustively, and with the added benefit of no forced extra call to ProfilePick it seems slightly better to have a bool imo.

Contributor Author

@nirrozenbaum nirrozenbaum Jun 10, 2025

I personally think a bool return value here has inconsistent behavior.
in some scenarios, we know that no extra call is needed.
in other scenarios, like the one I described above, we will do an extra call but get back an empty profiles map.
IMO it's better to define behavior that is consistent with its definition in all cases, rather than having an inconsistent bool that may confuse more than help.

btw, I do expect whoever wants to implement a plugin to read its documentation.
the function godoc is only two lines, not an exhaustive description.

Are we allowing multiple ProfileSelect functions to be defined per scheduler?

only one ProfileHandler plugin is allowed, which means only one ProfileSelect is allowed.
this is still not validated in code but is documented (it should also be validated in next PRs).

@ahg-g
Contributor

ahg-g commented Jun 9, 2025

one nit
/lgtm /approve

@ahg-g what is the nit? I can't find the extra comment.

Sorry, posted now

rlakhtakia pushed a commit to rlakhtakia/gateway-api-inference-extension that referenced this pull request Jun 11, 2025
* scheduler proposal continuation

Signed-off-by: Nir Rozenbaum <[email protected]>

* removed functions in comments

Signed-off-by: Nir Rozenbaum <[email protected]>

* embed SchedulerConfig into Scheduler

Signed-off-by: Nir Rozenbaum <[email protected]>

* add request struct

Signed-off-by: Nir Rozenbaum <[email protected]>

* final updates to proposal

Signed-off-by: Nir Rozenbaum <[email protected]>

* minor change

Signed-off-by: Nir Rozenbaum <[email protected]>

* profile handler docs

Signed-off-by: Nir Rozenbaum <[email protected]>

* revert

Signed-off-by: Nir Rozenbaum <[email protected]>

* linter

Signed-off-by: Nir Rozenbaum <[email protected]>

* Apply suggestions from code review

Co-authored-by: Etai Lev Ran <[email protected]>

---------

Signed-off-by: Nir Rozenbaum <[email protected]>
Co-authored-by: Etai Lev Ran <[email protected]>