feat: Add support to enable backoff when rate limited by the Github API #6644
Conversation
Signed-off-by: andrewhibbert <[email protected]>
zroubalik left a comment
This is a great addition. I wonder whether we should try to add this capability to all scalers that use an HTTP client to obtain metrics?
WDYT @kedacore/keda-maintainers
/run-e2e github
Hey, just wondering if there is any update on this?
I think this is nice, but I'm worried that if the waiting time is long, it'll look like an issue in KEDA (nothing will happen) when it's actually the desired behaviour.
pkg/scalers/github_runner_scaler.go (Outdated)
```go
func (s *githubRunnerScaler) GetMetricsAndActivity(ctx context.Context, metricName string) ([]external_metrics.ExternalMetricValue, bool, error) {
	queueLen, err := s.GetWorkflowQueueLength(ctx)
	if s.metadata.enableBackoff {
		wait, waitDuration := s.shouldWaitForRateLimit()
```
Let's introduce some debug logs here, announcing that we are waiting. If someone is debugging why KEDA doesn't scale, it can be discovered from logs easily
There are debug logs in shouldWaitForRateLimit() currently. Are these sufficient or are you looking for something different?
That's totally true, mb
I've been thinking about this and I'm not fully sure we should wait on a 429. What will happen if a user burns the hourly quota in just a few minutes? In that scenario, the process will be stuck here waiting and waiting. From the operator metrics pov, the loop will be blocked, but from the metrics server pov, we will open more and more goroutines which will be stuck here.
Maybe we can include a max waiting time of, IDK, 15 seconds. Another option that comes to my mind is to use the current implementation you have done to skip requests while the rate limit has been exceeded, to reduce the load.
@kedacore/keda-core-contributors ?
Would it make sense for the scaler to cache the last metric value and return that while it's waiting to stop being rate-limited? That way the KEDA stack won't be growing with sleeping goroutines.
> Maybe we can include a max waiting time of, IDK, 15 seconds. Another option that comes to my mind is to use the current implementation you have done to skip requests while the rate limit has been exceeded, to reduce the load.
I like this approach.
Hi, I have made a change so that it returns the previous queue length rather than rechecking while it is rate limited, following the best practices (https://docs.github.com/en/rest/using-the-rest-api/best-practices-for-using-the-rest-api?apiVersion=2022-11-28#handle-rate-limit-errors-appropriately), which say not to try again until after the limit has reset. Let me know if you do want a maximum waiting time; I think it should be much longer, or configurable, to reduce the likelihood of a ban from the API.
JorTurFer left a comment
Nice improvement! Just a minor nit inline
Hi @andrewhibbert, I see there is still a merge conflict.
Signed-off-by: andrewhibbert <[email protected]>
Fixed merge conflicts
/run-e2e github
Signed-off-by: andrewhibbert <[email protected]>
Pull Request Overview
This PR adds support for backing off when rate limited by the GitHub API, implementing GitHub's recommended best practices for handling rate limits. The implementation respects both Retry-After headers and X-RateLimit-Reset headers to determine appropriate backoff periods.
- Adds `enableBackoff` configuration option to GitHub runner scaler
- Implements rate limit detection and backoff logic, using the previous queue length during rate limit periods
- Updates rate limit error handling to use a structured RateLimit type instead of inline parsing
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/scalers/github_runner_scaler.go | Core implementation of rate limit backoff logic and RateLimit struct |
| pkg/scalers/github_runner_scaler_test.go | Test coverage for rate limit backoff functionality |
| tests/scalers/github_runner/github_runner_test.go | Integration test configuration update to enable backoff |
| CHANGELOG.md | Documentation of the new feature |
```go
remaining, _ := strconv.Atoi(header.Get("X-RateLimit-Remaining"))
reset, _ := strconv.ParseInt(header.Get("X-RateLimit-Reset"), 10, 64)
resetTime := time.Unix(reset, 0)

if retryAfterStr := header.Get("Retry-After"); retryAfterStr != "" {
	if retrySeconds, err := strconv.Atoi(retryAfterStr); err == nil {
		retryAfterTime = time.Now().Add(time.Duration(retrySeconds) * time.Second)
```
Copilot AI commented on Sep 18, 2025
Error handling is ignored when parsing rate limit headers. If these headers contain invalid values, the rate limit logic could behave unexpectedly. Consider handling parsing errors or adding validation.
```diff
-remaining, _ := strconv.Atoi(header.Get("X-RateLimit-Remaining"))
-reset, _ := strconv.ParseInt(header.Get("X-RateLimit-Reset"), 10, 64)
-resetTime := time.Unix(reset, 0)
-if retryAfterStr := header.Get("Retry-After"); retryAfterStr != "" {
-	if retrySeconds, err := strconv.Atoi(retryAfterStr); err == nil {
-		retryAfterTime = time.Now().Add(time.Duration(retrySeconds) * time.Second)
+remainingStr := header.Get("X-RateLimit-Remaining")
+remaining, err := strconv.Atoi(remainingStr)
+if err != nil {
+	s.logger.V(1).Info(fmt.Sprintf("Invalid X-RateLimit-Remaining header value: %q, error: %v. Defaulting to 0.", remainingStr, err))
+	remaining = 0
+}
+resetStr := header.Get("X-RateLimit-Reset")
+reset, err := strconv.ParseInt(resetStr, 10, 64)
+if err != nil {
+	s.logger.V(1).Info(fmt.Sprintf("Invalid X-RateLimit-Reset header value: %q, error: %v. Defaulting to 0.", resetStr, err))
+	reset = 0
+}
+resetTime := time.Unix(reset, 0)
+if retryAfterStr := header.Get("Retry-After"); retryAfterStr != "" {
+	if retrySeconds, err := strconv.Atoi(retryAfterStr); err == nil {
+		retryAfterTime = time.Now().Add(time.Duration(retrySeconds) * time.Second)
+	} else {
+		s.logger.V(1).Info(fmt.Sprintf("Invalid Retry-After header value: %q, error: %v. Ignoring.", retryAfterStr, err))
```
```go
if queueLen, err := scaler.GetWorkflowQueueLength(context.Background()); err != nil {
	fmt.Println(err)
	t.Fail()
} else if queueLen != 1 {
	fmt.Printf("Expected queue length of 1 got %d\n", queueLen)
	t.Fail()
}
```
Copilot AI commented on Sep 18, 2025
Use t.Errorf() or t.Fatalf() instead of fmt.Println() and t.Fail() for better test error reporting. This provides clearer test failure messages and follows Go testing conventions.
```go
if queueLen, err := scaler.GetWorkflowQueueLength(context.Background()); err != nil {
	fmt.Println(err)
	t.Fail()
} else if queueLen != 1 {
	fmt.Printf("Expected queue length of 1 after rate limit backoff got %d\n", queueLen)
	t.Fail()
}
```
Copilot AI commented on Sep 18, 2025
Use t.Errorf() or t.Fatalf() instead of fmt.Println() and t.Fail() for better test error reporting. This provides clearer test failure messages and follows Go testing conventions.
Adds support for backing off when a rate limit error occurs, as described in GitHub's guidance on handling rate limit errors appropriately.
Fixes #6643
Relates to kedacore/keda-docs#1557