@clee2000 clee2000 commented Nov 24, 2025

Set up queue alerting for ROCm. The alerts go through the AWS alerting system instead of our usual test-infra issue system.

Each machine gets its own alert.

To receive these alerts in a Slack channel, subscribe the channel with /github subscribe pytorch/alerting-infra issues +label:"Team:rocm-queue"

I don't want to do this through Grafana since I haven't figured out a way to do it per machine without setting up a ton of alerts. Also, this way ROCm can view and change the config themselves.
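As a rough illustration of the per-machine idea, here is a minimal sketch of one rule covering several machines while each machine is evaluated on its own. Everything in it is an assumption made for illustration: the `QueueRule` class, its threshold fields, the machine names, and the `machine_type`, `count`, and `avg_queue_s` keys are hypothetical and are not taken from the PR's actual `AWSAlertRule`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class QueueRule:
    """Hypothetical rule shape; the real AWSAlertRule fields are not shown here."""

    machines: List[str]
    max_queue_size: int  # assumed threshold: number of queued jobs
    max_queue_minutes: float  # assumed threshold: average wait in minutes


def firing_machines(rule: QueueRule, queues: List[Dict[str, Any]]) -> List[str]:
    """Return the machines in `rule` whose queue exceeds either threshold."""
    # Index the query rows by machine so each machine is checked on its own.
    by_machine = {row["machine_type"]: row for row in queues}
    firing = []
    for machine in rule.machines:
        row = by_machine.get(machine)
        if row is None:
            continue  # no queued jobs reported for this machine
        too_many = row["count"] > rule.max_queue_size
        too_slow = row["avg_queue_s"] / 60 > rule.max_queue_minutes
        if too_many or too_slow:
            firing.append(machine)
    return firing
```

Because the function returns one entry per machine, a caller could open or close one alert per returned name instead of one alert for the whole rule.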

TODO: query for list of machines instead of hardcoding?

Testing:
Added some unit tests. I also tested locally by returning a fake result for the queue query, and checked that it created and closed issues and messaged me in a Slack channel I created for testing.

@meta-cla meta-cla bot added the CLA Signed label Nov 24, 2025
@clee2000 clee2000 force-pushed the csl/rocm_queue_alert branch from add528a to 6a60c63 Compare November 24, 2025 19:18
@clee2000 clee2000 marked this pull request as ready for review November 26, 2025 17:32
@clee2000 clee2000 requested a review from a team November 26, 2025 17:35
def get_queues() -> List[Dict[str, Any]]:
    # %7B%7D = encoded {}
    url = (
        "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D"
Contributor

Will this API continue to work? I suspect it might not, given the bot protection that we have turned on, and it will need the auth part like Dr.CI.

Contributor

Ah, I can see that it sets HUD_API_TOKEN automatically if that env var is available, so this is good
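For readers unfamiliar with the pattern discussed here, a minimal sketch of building the query URL and attaching a token only when `HUD_API_TOKEN` is set might look like the following. The header name `X-Example-Auth-Token` and the function `build_queue_request` are placeholders invented for this sketch; the header that HUD actually expects is not shown in this thread.

```python
import json
import os
from typing import Dict, Optional, Tuple
from urllib.parse import quote

HUD_QUEUE_URL = "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label"
# Placeholder header name (assumption): the real header is not shown in this thread.
AUTH_HEADER = "X-Example-Auth-Token"


def build_queue_request(token: Optional[str] = None) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for the queue query, adding auth only if a token exists."""
    if token is None:
        token = os.environ.get("HUD_API_TOKEN")
    # json.dumps({}) is "{}", which quote() percent-encodes to "%7B%7D".
    params = quote(json.dumps({}))
    url = f"{HUD_QUEUE_URL}?parameters={params}"
    headers = {AUTH_HEADER: token} if token else {}
    return url, headers
```

The returned pair can be passed to any HTTP client, for example `urllib.request.Request(url, headers=headers)`.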

# even if it is the same rule
AWS_ALERT_RULES = [
    AWSAlertRule(
        machines=[
Contributor

Does this work with a regex like linux.rocm.*? It seems unwieldy to maintain this list here.

Contributor Author

@clee2000 clee2000 Dec 2, 2025

It would have to query the list of all runners; otherwise it doesn't know what to close when the queue is gone. Maybe I can do that in a separate PR, and you can tell me if you like that better?
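One way to read this reply: a machine with no queue is simply absent from the queue query result, so a pattern alone cannot say which machines it covers. A minimal sketch of the reconciliation, assuming a list of all runner names is available, is below. The function `reconcile`, its parameters, and the machine names in the usage note are hypothetical and not part of the PR.

```python
import re
from typing import Iterable, Set, Tuple


def reconcile(
    pattern: str,
    all_runners: Iterable[str],
    queued: Set[str],
    open_alerts: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (to_open, to_close) for the runners that fully match `pattern`."""
    regex = re.compile(pattern)
    # The full runner list is what turns a pattern into a concrete set of machines.
    covered = {name for name in all_runners if regex.fullmatch(name)}
    # Open an alert for covered machines that are queueing and have no alert yet.
    to_open = (covered & queued) - open_alerts
    # Close alerts for covered machines that no longer appear in the queue result.
    to_close = (covered & open_alerts) - queued
    return to_open, to_close
```

For example, with runners `linux.rocm.example.a` and `linux.rocm.example.b`, a queue on the first and an open alert on the second, the pattern `linux\.rocm\..*` would open an alert for the first and close the alert for the second.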

Labels

ciflow/rocm, CLA Signed, module: rocm

3 participants