Skip to content

fix(loki.process): Make limit stage shutdown cancelable#6215

Merged
kalleep merged 1 commit into
grafana:mainfrom
boinger:fix/loki-process-limit-shutdown-hang
May 18, 2026
Merged

fix(loki.process): Make limit stage shutdown cancelable#6215
kalleep merged 1 commit into
grafana:mainfrom
boinger:fix/loki-process-limit-shutdown-hang

Conversation

@boinger
Copy link
Copy Markdown
Contributor

@boinger boinger commented May 9, 2026

Pull Request Details

When loki.process has stage.limit { drop = false }, the stage's
shouldThrottle calls rateLimiter.Wait, which blocks until the next
token. The Run goroutine only exits its range loop between iterations,
so a goroutine inside Wait doesn't observe close(in). With a low
rate, EntryHandler.Stop hangs for the full token interval.

This adds an optional Stopper interface (Stop()) called by
Pipeline.Stop before wg.Wait. limitStage cancels its internal
context to unblock Wait. Pipeline itself implements Stopper so
nested inner pipelines (via matcherStage) forward recursively.

Scope

The by-label code path (stage.limit { by_label_name = "..." }) is
not affected; validateLimitConfig requires drop = true with
by_label_name, so that path uses rl.Allow() and never reaches a
Wait call.

Issue(s) fixed by this Pull Request

None. Goroutine leak triggered only on shutdown with non-trivial rate limits.

Notes to the Reviewer

Repro: stage.limit { rate = 1, burst = 1, drop = false }, sustained
input, SIGTERM. Without the fix, alloy hangs.

Regression tests in limit_test.go and match_test.go share an
assertPipelineStopsPromptly helper using channel-based synchronization:
send one entry, receive it (signaling the first iteration completed),
send a second entry that blocks Wait, assert EntryHandler.Stop
returns within 2s. The match_test.go case exercises
matcherStage.Stop forwarding through its inner Pipeline.

Verified with go test -race -count=1 ./internal/component/loki/process/...
and make lint.

PR Checklist

  • Documentation added (n/a)
  • Tests updated
  • Config converters updated (n/a)

@boinger boinger requested a review from a team as a code owner May 9, 2026 20:13
@kalleep
Copy link
Copy Markdown
Contributor

kalleep commented May 11, 2026

Hey thanks for the pr, ideally you would create an issue before writing a pr and please keep that in mind in the future.

Yes this is a bug but there is a way easier solution that create an additional optional interface for pipeline stages. Stage interface includes a Cleanup function that is called whenever a pipeline is stopped and we should use that instead.

@boinger
Copy link
Copy Markdown
Contributor Author

boinger commented May 11, 2026

Thanks. I will open issues first in the future; I appreciate the steering.

I considered the Cleanup approach first; the current shutdown order has Cleanup running after wg.Wait, so a goroutine stuck in rateLimiter.Wait prevents wg.Wait from returning and Cleanup never gets called.

Should I reorder shutdown to call Cleanup before wg.Wait, or is there a path I'm missing?
Maybe call Cleanup in parallel with wg.Wait so it can serve as a cancellation signal?

Either option shifts the Cleanup invariant for all stages. Most are no-ops, but metric.go's Cleanup mutates state, so I want to make sure I'm not missing a constraint.

@kalleep
Copy link
Copy Markdown
Contributor

kalleep commented May 12, 2026

I took another look and it's not actually safe to use Cleanup, that would break metrics stage because the actions need's to happen after we don't get any more entries.

Then I would prefer to have an optional interface for calling Stop() on a stage instead of BindContext, the latter is not that clear IMO. Stop should be called before we wait on wg. I am fine with adding comment where we think comments are needed but as for the other pr this is really excessive and just generates lots of noise, I am pretty sure you are doing most of these pr:s using AI. And while it's fine to use AI be mindful that it will consume our (alloy maintainers) time to review them. Hence why we ideally want issues before we get a bunch of pr:s so we can triage them.

rateLimiter.Wait in shouldThrottle blocks until the next token, and
the Run goroutine's range loop only observes close(in) between
iterations. EntryHandler.Stop hangs for the full token interval at
low rates.

Add an optional Stopper interface, called by Pipeline.Stop before
wg.Wait. limitStage cancels its internal context to unblock Wait;
Pipeline implements Stopper so matchers' inner pipelines forward.
@boinger boinger force-pushed the fix/loki-process-limit-shutdown-hang branch from e94765d to 3bc7aca Compare May 16, 2026 19:26
@boinger
Copy link
Copy Markdown
Contributor Author

boinger commented May 16, 2026

Reworked per your Stop() direction. One note: Pipeline implements Stopper so matcherStage.Stop just forwards via m.pipeline.Stop(). Two regression tests added (TestLimitWaitPipelineShutdown and TestMatchNestedLimitShutdown), both using channel-based sync rather than time.Sleep.

Copy link
Copy Markdown
Contributor

@kalleep kalleep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@kalleep kalleep merged commit 20717b5 into grafana:main May 18, 2026
47 of 48 checks passed
@boinger boinger deleted the fix/loki-process-limit-shutdown-hang branch May 18, 2026 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants