Simplification and refactoring of the Flow Control layer #889

@shmuelk

The Flow Control component is a critical part of the Endpoint Picker (EPP). It enables the EPP to throttle workloads and thereby prevent overcommitting Model Server resources.

It was designed to be highly scalable, and the benchmarks run in PR kubernetes-sigs/gateway-api-inference-extension#2539 confirm that it performs well and scales.

The benchmark further showed that one “Shard Processor” is more than enough to support reasonably large loads. Multiple “Shard Processors” help somewhat only under very large loads, which do not seem practical for a single EPP. Consider a load of 50,000 queued items being processed at 1,000 per second: although the processing speed is high, the queue still takes fifty seconds to drain, which is fundamentally unacceptable. Such a scenario means the InferencePool is significantly under-resourced, and dealing with it is outside the scope of the EPP.

Furthermore, the benchmark showed that at lower loads, having more than one “Shard Processor” is detrimental to performance.

As such, we believe that a single “Shard Processor” is more than enough to support the loads we expect to see.

In addition, the Flow Control component runs multiple goroutines, most of which are active only from time to time. This requires mutex locks at several levels to maintain consistency and prevent crashes. It also complicates the code, because the order in which the locks are acquired and released is critical to avoiding deadlocks.

This issue proposes a set of gradual changes aiming to make Flow Control simpler without negatively impacting its performance or correctness.

The goal is to have a system where:

1. The notion of shards is removed.
2. A single goroutine / processor:
   a. Processes the set of priority bands and queues.
   b. Performs Garbage Collection and evicts requests that have been in the queue for too long on a periodic basis.
   c. Modifies the Flow Control configuration as needed by adding priority bands and queues.
3. Mutex locks will be removed (since only one goroutine is processing).
4. Channels are added, as needed, to enable the “Processor” to receive configuration update requests.
5. The existing Flow Control configuration will be updated to remove the initial Shard Configuration and the initial set of priority bands.
6. All priority bands will be added and removed using the InferenceObjective CRD.

This goal will be implemented as a set of PRs, in order to make it easier to review and comment. Benchmarks and tests will be run to compare the updated code base with the original code base.

This issue has been copied from the Inference Gateway project
