Set up queue alerting for rocm #7512
base: main
Conversation
Force-pushed from add528a to 6a60c63
def get_queues() -> List[Dict[str, Any]]:
    # %7B%7D = encoded {}
    url = (
        "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D"
Does this API continue to work? I suspect that it might not, given the bot protection that we have turned on, and it will need the auth part like Dr.CI
Ah, I can see that it sets HUD_API_TOKEN automatically if that env var is available, so this is good
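To make the auth behavior concrete, here is a minimal sketch of how a queue fetch might attach a token from the `HUD_API_TOKEN` env var when it is available. The header name and the exact function body are assumptions for illustration; the PR's actual `get_queues` may differ.

```python
import json
import os
import urllib.request
from typing import Any, Dict, List

# Endpoint from the diff above; %7B%7D is an URL-encoded empty {} parameter set.
HUD_URL = (
    "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D"
)


def get_queues() -> List[Dict[str, Any]]:
    request = urllib.request.Request(HUD_URL)
    # Attach the token only when the env var is set, mirroring the review
    # thread; the Authorization header name is an assumption here.
    token = os.environ.get("HUD_API_TOKEN")
    if token:
        request.add_header("Authorization", token)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Without the token set, the request would go out unauthenticated and could be rejected by the bot protection mentioned above.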
# even if it is the same rule
AWS_ALERT_RULES = [
    AWSAlertRule(
        machines=[
Does this work with a regex like linux.rocm.*? It seems unwieldy to maintain this list here
It would have to query the list of all runners; otherwise it doesn't know what to close when the queue is gone. Maybe I can do that in a separate PR and you can tell me if you like that better?
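The regex idea floated above could be sketched roughly like this: match known queue labels against a pattern instead of hardcoding machine names. The function and label names here are illustrative assumptions, not the PR's actual code, and as the reply notes the real problem is obtaining the full label list in the first place.

```python
import re
from typing import List


def match_machines(pattern: str, queue_labels: List[str]) -> List[str]:
    # Return the labels that fully match the rule's pattern, so one regex
    # rule can stand in for an explicit machines=[...] list.
    compiled = re.compile(pattern)
    return [label for label in queue_labels if compiled.fullmatch(label)]


labels = ["linux.rocm.gpu.2", "linux.rocm.gpu.4", "linux.4xlarge"]
print(match_machines(r"linux\.rocm\..*", labels))
# → ['linux.rocm.gpu.2', 'linux.rocm.gpu.4']
```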
Set up queue alerting for rocm. Alerts will go through the AWS alerting system instead of our usual test-infra issue system.
Each machine gets its own alert.
The Slack channel can be subscribed and get messages using
/github subscribe pytorch/alerting-infra issues +label:"Team:rocm-queue"
I don't want to do this through Grafana since I haven't figured out a way to do it per machine without setting up a ton of alerts. Also, this way rocm can view and change the config themselves.
TODO: query for list of machines instead of hardcoding?
Testing:
Some unit tests; also tested locally by returning a fake result for the queue query and checked that it created and closed issues, and messaged me in a Slack channel I created for testing.