-
Notifications
You must be signed in to change notification settings - Fork 13
Refactor deployments #196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor deployments #196
Conversation
| dtype: str | torch.dtype = "bfloat16" | ||
|
|
||
|
|
||
| @ray.remote(num_cpus=2, num_gpus=0, max_restarts=-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Do we want to have
num_cpushardcoded? - Would it necessarily be bad to have a finite
max_restartsvalue (e.g 5)? What happens if we didn't have it set? It might make it easier to catch silent failures if the actor ends up eventually dying
| def purge(self): | ||
|
|
||
| for deployment in self.deployments.values(): | ||
| deployment.delete() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like deployment.delete() shouldn't handle exceptions itself. From a robustness standpoint, I think it's probably important for the node to know if an attempt to delete something failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.) What do you mean by "know" 2.) When would we need to know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Be aware whether an exception was raised in
deployment.delete() - When calling
deployment.delete()raises an exception
|
|
||
| del self.nodes[node] | ||
| node = self.nodes.pop(node_id) | ||
| node.purge() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to what I commented regarding deployment.delete(), I think any exceptions that happen in purge should propagate up here, and at the very least be logged as an error before proceeding
No description provided.