Set up queue alerting for rocm #7512
base: main
Conversation
Force-pushed from add528a to 6a60c63
def get_queues() -> List[Dict[str, Any]]:
    # %7B%7D = encoded {}
    url = (
        "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D"
Does this API continue to work? I suspect that it might not, given the bot protection that we have turned on, and it will need the auth part like Dr.CI
Ah, I can see that it sets HUD_API_TOKEN automatically if that env var is available, so this is good
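To make the auth behavior concrete, here is a minimal sketch of how a queue fetch might attach a token from the `HUD_API_TOKEN` env var when it is available. The header name and the exact function body are assumptions for illustration; the PR's actual `get_queues` may differ.

```python
import json
import os
import urllib.request
from typing import Any, Dict, List

# Endpoint from the diff above; %7B%7D is an URL-encoded empty {} parameter set.
HUD_URL = (
    "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D"
)


def get_queues() -> List[Dict[str, Any]]:
    request = urllib.request.Request(HUD_URL)
    # Attach the token only when the env var is set, mirroring the review
    # thread; the Authorization header name is an assumption here.
    token = os.environ.get("HUD_API_TOKEN")
    if token:
        request.add_header("Authorization", token)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Without the token set, the request would go out unauthenticated and could be rejected by the bot protection mentioned above.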
# even if it is the same rule
AWS_ALERT_RULES = [
    AWSAlertRule(
        machines=[
Does this work with a regex like linux.rocm.*? It seems unwieldy to maintain this list here
It would have to query the list of all runners; otherwise it doesn't know what to close when the queue is gone. Maybe I can do that in a separate PR and you can tell me if you like that better?
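The regex idea floated above could be sketched roughly like this: match known queue labels against a pattern instead of hardcoding machine names. The function and label names here are illustrative assumptions, not the PR's actual code, and as the reply notes the real problem is obtaining the full label list in the first place.

```python
import re
from typing import List


def match_machines(pattern: str, queue_labels: List[str]) -> List[str]:
    # Return the labels that fully match the rule's pattern, so one regex
    # rule can stand in for an explicit machines=[...] list.
    compiled = re.compile(pattern)
    return [label for label in queue_labels if compiled.fullmatch(label)]


labels = ["linux.rocm.gpu.2", "linux.rocm.gpu.4", "linux.4xlarge"]
print(match_machines(r"linux\.rocm\..*", labels))
# → ['linux.rocm.gpu.2', 'linux.rocm.gpu.4']
```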
Set up queue alerting for rocm. Alerts will go through the AWS alerting system instead of our usual test-infra issue system.
Each machine gets its own alert.
The Slack channel can be subscribed and get messages using
/github subscribe pytorch/alerting-infra issues +label:"Team:rocm-queue"
I don't want to do this through Grafana since I haven't figured out a way to do it per machine without setting up a ton of alerts. Also, this way rocm can view and change the config themselves.
TODO: query for list of machines instead of hardcoding?
Testing:
Some unit tests; also tested locally by returning a fake result for the queue query and checked that it created and closed issues, and messaged me in a Slack channel I created for testing.