
Conversation

@andrewhibbert
Contributor

@andrewhibbert andrewhibbert commented Mar 24, 2025

Adds support for backing off when a rate limit error occurs, as per GitHub's guidance on handling rate limit errors appropriately (a minimal sketch of this header handling follows the checklist below):

  • If the retry-after response header is present, you should not retry your request until that many seconds have elapsed.
  • If the x-ratelimit-remaining header is 0, you should not make another request until after the time specified by the x-ratelimit-reset header. The x-ratelimit-reset header is in UTC epoch seconds.
  • [N/A] When introducing a new scaler, I agree with the scaling governance policy
  • I have verified that my change is according to the deprecations & breaking changes policy
  • Tests have been added
  • Changelog has been updated and is aligned with our changelog requirements
  • [N/A] A PR is opened to update our Helm chart (repo) (if applicable, i.e. when deployment manifests are modified)
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)
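
For reference, here is a minimal, self-contained sketch of the header handling described in the two bullets above; the helper name computeRateLimitWait and the example values are illustrative only, not code from this PR:

package main

import (
    "fmt"
    "net/http"
    "strconv"
    "time"
)

// computeRateLimitWait derives a backoff duration from GitHub's rate-limit
// headers: Retry-After wins when present; otherwise, if the remaining quota
// is 0, wait until X-RateLimit-Reset (UTC epoch seconds).
func computeRateLimitWait(header http.Header, now time.Time) time.Duration {
    if retryAfter := header.Get("Retry-After"); retryAfter != "" {
        if seconds, err := strconv.Atoi(retryAfter); err == nil {
            return time.Duration(seconds) * time.Second
        }
    }
    remaining, err := strconv.Atoi(header.Get("X-RateLimit-Remaining"))
    if err != nil || remaining > 0 {
        return 0
    }
    reset, err := strconv.ParseInt(header.Get("X-RateLimit-Reset"), 10, 64)
    if err != nil {
        return 0
    }
    if wait := time.Unix(reset, 0).Sub(now); wait > 0 {
        return wait
    }
    return 0
}

func main() {
    h := http.Header{}
    h.Set("X-RateLimit-Remaining", "0")
    h.Set("X-RateLimit-Reset", strconv.FormatInt(time.Now().Add(90*time.Second).Unix(), 10))
    fmt.Println("would back off for:", computeRateLimitWait(h, time.Now()))
}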

Fixes #6643

Relates to kedacore/keda-docs#1557

@andrewhibbert andrewhibbert requested a review from a team as a code owner March 24, 2025 16:39
@andrewhibbert andrewhibbert changed the title from "Implement backoff when receive rate limit errors" to "feat: Add support to enable backoff when rate limited by the Github API" on Mar 24, 2025
Member

@zroubalik zroubalik left a comment

This is a great addition. I wonder whether we should try to add this capability to all scalers that use an HTTP client to obtain metrics?

WDYT @kedacore/keda-maintainers

@zroubalik
Member

zroubalik commented Mar 26, 2025

/run-e2e github
Update: You can check the progress here

@andrewhibbert
Contributor Author

Hey, just wondering if there is any update on this?

@JorTurFer
Member

I think this is nice, but I'm worried about cases where the waiting time is very long, as it'll look like an issue in KEDA (nothing will happen) when it's actually the desired behaviour

func (s *githubRunnerScaler) GetMetricsAndActivity(ctx context.Context, metricName string) ([]external_metrics.ExternalMetricValue, bool, error) {
    queueLen, err := s.GetWorkflowQueueLength(ctx)
    if s.metadata.enableBackoff {
        wait, waitDuration := s.shouldWaitForRateLimit()
Member

Let's introduce some debug logs here, announcing that we are waiting. If someone is debugging why KEDA doesn't scale, they can then discover it from the logs easily

Contributor Author

There are debug logs in shouldWaitForRateLimit() currently. Are these sufficient or are you looking for something different?

Member

That's totally true, mb

Member

I've been thinking about this and I'm not fully sure whether we should wait on a 429. What will happen if a user burns the hourly quota in just a few minutes? In that scenario, the process will get stuck here waiting and waiting. From the operator metrics PoV, the loop will be blocked, but from the metrics server PoV, we will open more and more goroutines which will get stuck here.
Maybe we can include a max waiting time of, IDK, 15 seconds. Another option that comes to mind is to use the current implementation you have done to skip requests while the rate limit has been exceeded, to reduce the load.
@kedacore/keda-core-contributors ?
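
As a rough sketch of the capped-wait idea above (the 15-second cap and the name maxRateLimitWait are assumptions for illustration, not part of this PR):

package main

import (
    "fmt"
    "time"
)

// maxRateLimitWait is a hypothetical upper bound so a scaler loop never
// blocks for the full duration of a long rate-limit window.
const maxRateLimitWait = 15 * time.Second

// cappedWait clamps the wait suggested by the rate-limit headers.
func cappedWait(suggested time.Duration) time.Duration {
    if suggested > maxRateLimitWait {
        return maxRateLimitWait
    }
    if suggested < 0 {
        return 0
    }
    return suggested
}

func main() {
    fmt.Println(cappedWait(45 * time.Minute)) // 15s
    fmt.Println(cappedWait(3 * time.Second))  // 3s
}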

Member

@wozniakjan wozniakjan Aug 18, 2025

Would it make sense for the scaler to cache the last metric value and return that while it's waiting for the rate limit to clear? That way the KEDA stack won't keep growing with sleeping goroutines.

Member

Maybe we can include a max waiting time of, IDK, 15 seconds. Another option that comes to mind is to use the current implementation you have done to skip requests while the rate limit has been exceeded, to reduce the load.

I like this approach.

Contributor Author

@andrewhibbert andrewhibbert Aug 29, 2025

Hi, I have made a change so that it returns the previous queue length rather than rechecking whilst it is rate limited, following the best practices (https://docs.github.com/en/rest/using-the-rest-api/best-practices-for-using-the-rest-api?apiVersion=2022-11-28#handle-rate-limit-errors-appropriately), which say not to try again until the limit has reset. Let me know, however, if you do want a maximum waiting time; I think it should be much longer or configurable to reduce the likelihood of a ban from the API
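
A minimal sketch of that behaviour, using assumed names (rateLimitedScaler, previousQueueLength, errRateLimited) rather than the PR's actual identifiers:

package main

import (
    "context"
    "fmt"
    "time"
)

// errRateLimited is an illustrative error carrying the time at which the
// GitHub rate limit resets.
type errRateLimited struct{ resetAt time.Time }

func (e errRateLimited) Error() string { return "github API rate limit exceeded" }

// rateLimitedScaler is an illustrative stand-in for the GitHub runner scaler.
type rateLimitedScaler struct {
    previousQueueLength int64                                    // last successfully fetched value
    rateLimitResetTime  time.Time                                // no API calls before this instant
    fetchQueueLength    func(ctx context.Context) (int64, error) // the real API call
}

// QueueLength serves the cached value while the rate limit is in effect and
// only queries the API again once the reset time has passed.
func (s *rateLimitedScaler) QueueLength(ctx context.Context) (int64, error) {
    if time.Now().Before(s.rateLimitResetTime) {
        return s.previousQueueLength, nil
    }
    queueLen, err := s.fetchQueueLength(ctx)
    if rl, ok := err.(errRateLimited); ok {
        // Remember when we may call the API again and fall back to the cache.
        s.rateLimitResetTime = rl.resetAt
        return s.previousQueueLength, nil
    }
    if err != nil {
        return 0, err
    }
    s.previousQueueLength = queueLen
    return queueLen, nil
}

func main() {
    calls := 0
    s := &rateLimitedScaler{
        fetchQueueLength: func(ctx context.Context) (int64, error) {
            calls++
            if calls == 1 {
                return 5, nil
            }
            // Simulate a 429 on every later call for the next hour.
            return 0, errRateLimited{resetAt: time.Now().Add(time.Hour)}
        },
    }
    for i := 0; i < 3; i++ {
        n, _ := s.QueueLength(context.Background())
        fmt.Println("queue length:", n) // 5, 5, 5
    }
}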

Member

@JorTurFer JorTurFer left a comment

Nice improvement! Just a minor nit inline

@andrewhibbert andrewhibbert requested a review from zroubalik June 10, 2025 11:49
@rickbrouwer
Member

Hi @andrewhibbert, I see there is still a merge conflict.

@andrewhibbert
Contributor Author

Fixed merge conflicts

@rickbrouwer
Member

rickbrouwer commented Aug 1, 2025

/run-e2e github
Update: You can check the progress here

@andrewhibbert andrewhibbert requested a review from a team as a code owner August 29, 2025 14:06
@keda-automation keda-automation requested a review from a team August 29, 2025 14:06
@zroubalik zroubalik requested a review from Copilot September 18, 2025 09:47
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds support for backing off when rate limited by the GitHub API, implementing GitHub's recommended best practices for handling rate limits. The implementation respects both Retry-After headers and X-RateLimit-Reset headers to determine appropriate backoff periods.

  • Adds enableBackoff configuration option to GitHub runner scaler
  • Implements rate limit detection and backoff logic using previous queue length during rate limit periods
  • Updates rate limit error handling to use structured RateLimit type instead of inline parsing
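
As an illustration only, a structured type along these lines could hold the parsed rate-limit state; the field names here are assumptions and the PR's actual RateLimit type may differ:

package main

import (
    "fmt"
    "time"
)

// RateLimit is an illustrative shape for the parsed rate-limit state.
type RateLimit struct {
    Remaining  int       // from X-RateLimit-Remaining
    ResetTime  time.Time // from X-RateLimit-Reset (UTC epoch seconds)
    RetryAfter time.Time // derived from Retry-After; zero if the header was absent
}

// Exceeded reports whether requests should be skipped at the given instant.
func (r RateLimit) Exceeded(now time.Time) bool {
    if !r.RetryAfter.IsZero() && now.Before(r.RetryAfter) {
        return true
    }
    return r.Remaining == 0 && now.Before(r.ResetTime)
}

func main() {
    rl := RateLimit{Remaining: 0, ResetTime: time.Now().Add(30 * time.Minute)}
    fmt.Println("skip requests:", rl.Exceeded(time.Now())) // true
}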

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File | Description
pkg/scalers/github_runner_scaler.go | Core implementation of rate limit backoff logic and RateLimit struct
pkg/scalers/github_runner_scaler_test.go | Test coverage for rate limit backoff functionality
tests/scalers/github_runner/github_runner_test.go | Integration test configuration update to enable backoff
CHANGELOG.md | Documentation of the new feature


Comment on lines +487 to +493
remaining, _ := strconv.Atoi(header.Get("X-RateLimit-Remaining"))
reset, _ := strconv.ParseInt(header.Get("X-RateLimit-Reset"), 10, 64)
resetTime := time.Unix(reset, 0)

if retryAfterStr := header.Get("Retry-After"); retryAfterStr != "" {
    if retrySeconds, err := strconv.Atoi(retryAfterStr); err == nil {
        retryAfterTime = time.Now().Add(time.Duration(retrySeconds) * time.Second)
Copilot AI Sep 18, 2025

Error handling is ignored when parsing rate limit headers. If these headers contain invalid values, the rate limit logic could behave unexpectedly. Consider handling parsing errors or adding validation.

Suggested change

Replace:

remaining, _ := strconv.Atoi(header.Get("X-RateLimit-Remaining"))
reset, _ := strconv.ParseInt(header.Get("X-RateLimit-Reset"), 10, 64)
resetTime := time.Unix(reset, 0)
if retryAfterStr := header.Get("Retry-After"); retryAfterStr != "" {
    if retrySeconds, err := strconv.Atoi(retryAfterStr); err == nil {
        retryAfterTime = time.Now().Add(time.Duration(retrySeconds) * time.Second)

with:

remainingStr := header.Get("X-RateLimit-Remaining")
remaining, err := strconv.Atoi(remainingStr)
if err != nil {
    s.logger.V(1).Info(fmt.Sprintf("Invalid X-RateLimit-Remaining header value: %q, error: %v. Defaulting to 0.", remainingStr, err))
    remaining = 0
}
resetStr := header.Get("X-RateLimit-Reset")
reset, err := strconv.ParseInt(resetStr, 10, 64)
if err != nil {
    s.logger.V(1).Info(fmt.Sprintf("Invalid X-RateLimit-Reset header value: %q, error: %v. Defaulting to 0.", resetStr, err))
    reset = 0
}
resetTime := time.Unix(reset, 0)
if retryAfterStr := header.Get("Retry-After"); retryAfterStr != "" {
    if retrySeconds, err := strconv.Atoi(retryAfterStr); err == nil {
        retryAfterTime = time.Now().Add(time.Duration(retrySeconds) * time.Second)
    } else {
        s.logger.V(1).Info(fmt.Sprintf("Invalid Retry-After header value: %q, error: %v. Ignoring.", retryAfterStr, err))

Comment on lines +808 to +814
if queueLen, err := scaler.GetWorkflowQueueLength(context.Background()); err != nil {
    fmt.Println(err)
    t.Fail()
} else if queueLen != 1 {
    fmt.Printf("Expected queue length of 1 got %d\n", queueLen)
    t.Fail()
}
Copilot AI Sep 18, 2025

Use t.Errorf() or t.Fatalf() instead of fmt.Println() and t.Fail() for better test error reporting. This provides clearer test failure messages and follows Go testing conventions.
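
For instance, the first check above could be rewritten roughly like this (a sketch of the suggested convention, intended to slot into the existing test function, not the PR's final code):

if queueLen, err := scaler.GetWorkflowQueueLength(context.Background()); err != nil {
    // t.Fatalf records the error and stops this test immediately.
    t.Fatalf("GetWorkflowQueueLength failed: %v", err)
} else if queueLen != 1 {
    // t.Errorf records the failure but lets the test continue.
    t.Errorf("expected queue length of 1, got %d", queueLen)
}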

Comment on lines +821 to +827
if queueLen, err := scaler.GetWorkflowQueueLength(context.Background()); err != nil {
    fmt.Println(err)
    t.Fail()
} else if queueLen != 1 {
    fmt.Printf("Expected queue length of 1 after rate limit backoff got %d\n", queueLen)
    t.Fail()
}
Copilot AI Sep 18, 2025

Use t.Errorf() or t.Fatalf() instead of fmt.Println() and t.Fail() for better test error reporting. This provides clearer test failure messages and follows Go testing conventions.

@andrewhibbert andrewhibbert closed this by deleting the head repository Oct 15, 2025
