[Resources] Add cpus in resource specification#1622
Conversation
cpu in resourcescpu in resource specification
concretevitamin
left a comment
There was a problem hiding this comment.
This is awesome @WoosukKwon! Had a quick scan and the approach makes sense to me.
Quick question: are we deferring adding CPU as a scheduling constraint?
E.g., what happens if we launch a node with 8 CPUs, then keep sky launch -c existing --cpu 4 for many times? Or, for --cpu 16 do we correctly throw an error?
(I think it's okay to defer, just checking.)
|
@concretevitamin Thanks for the quick feedback. Yes, I'd like to defer it. |
|
Love it!! Two questions:
|
|
Hey @dongreenberg, thanks for your questions!
Yes. It will work on the Python API in the same way.
They all have recent Intel x86-64 CPUs (i.e., either Ice Lake or Cascade Lake CPUs), and they all have 4 GB RAM per 1 vCPU. More specifically,
The OS might be different across the clouds and the VM images you select. |
|
Amazing, that's perfect. You guys are the best! |
concretevitamin
left a comment
There was a problem hiding this comment.
Very nice @WoosukKwon. Did a pass and left some comments.
-
Tests: consider spelling out the manual tests. Also, should we add some dryrun unit tests to make sure
sky.optimize()works for things likesky.Resources(..., cpu=...)? -
Should we warn + ignore
--cpuinsky launch -c existing --cpu x?
concretevitamin
left a comment
There was a problem hiding this comment.
Thanks @WoosukKwon. Some final comments & reminder to add some tests to maybe test_optimizer_dryruns.py.
| # 2. Change the __setstate__ method to handle the new fields. | ||
| # 3. Modify the to_config method to handle the new fields. | ||
| # If any fields changed, increment the version. For backward compatibility, | ||
| # modify the __setstate__ method to handle the old version. |
There was a problem hiding this comment.
Makes sense. Was thinking this comment is a guide for "all the places we need to change when we add a new field". For just back compat, it seems __setstate__ is the only required change. cc @Michaelvll to double check
| return (_make([AWS.get_default_instance_type()]), | ||
| fuzzy_candidate_list) | ||
| # Return a default instance type with the given number of vCPUs. | ||
| default_instance_type = AWS.get_default_instance_type( |
There was a problem hiding this comment.
Do you mind clarifying what you meant?
Currently, the Cloud interface includes a shallow method, which does not show the default type:
@classmethod
def get_default_instance_type(cls,
cpu: Optional[str] = None) -> Optional[str]:
return service_catalog.get_default_instance_type(cpu=cpu, clouds='aws')
concretevitamin
left a comment
There was a problem hiding this comment.
Thanks for the awesome contribution @WoosukKwon. A major update to our Resources in a long time. LGTM.
PS: consider some basic smoke tests (e.g., minimal) before merging?
|
@concretevitamin This PR has passed all the smoke tests. |
|
Missed this before: shall we add the new field to https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html? |
|
Good catch. Added. |
|
Woo!!! Congrats and thanks gang!! Can't wait to test it out! |
Closes #387
This PR introduces the ability for users to specify resource requirements in terms of the number of vCPUs in the
Resourcesclass and theresourcessection in the SkyPilot yaml.Semantics
cpusparameter to select an appropriate VM size from the cloud's default VM family (i.e., m6i for AWS, Dv5 for Azure, and n2-standard for GCP).For example,
sky launch --cpus 8orResources(cpus=8): SkyPilot will find the VMs that have exactly 8 vCPUs. Namely,sky launch --cpus 12+orResources(cpus='12+'): SkyPilot will consider the VMs that have 12 or more vCPUs. Namely,(Note: The selected three instance types have 16 vCPUs instead of 12 vCPUs, because the default instance families of the clouds do not include any VM with 12-15 vCPUs.)
cpusparameter to choose the VM of the right size in the instance family. For example,sky launch --gpus T4 --cpus 8+orResources(accelerators='T4', cpus='8+'): SkyPilot will choose VMs that have 1 T4 GPU along with 8 or more vCPUs. Namely,The same semantics applies to
cpunode,gpunode,tpunode, andspot launch.Tested (run the relevant ones):
pytest tests/test_smoke.pypytest tests/test_smoke.py::test_fill_in_the_namebash tests/backward_comaptibility_tests.sh