Adding SkyPilot example for FlexGen #1

Merged
Michaelvll merged 9 commits into main from skypilot-example
Mar 9, 2023

@Michaelvll
Owner

This PR adds a SkyPilot example for the FlexGen benchmark, which makes the benchmark more reproducible and more convenient to manage.

Several future TODOs for SkyPilot:

  1. Use memory filtering in the resources section ([Resources] Add memory in resources skypilot-org/skypilot#1746) to make the example easier to run on different clouds, i.e.

         resources:
           cpus: 32+
           memory_gb: 200+
           accelerators: T4

  2. Add support for changing the disk type used by the instance, so that we can run the commands that require high-performance SSD disks.
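
As a rough illustration of what the memory filtering in the first TODO would buy, here is a minimal Python sketch of matching `32+` / `200+` style requirements against a list of instance types. It is not SkyPilot's implementation: the `InstanceType` class, the `satisfies` and `filter_instances` helpers, and the two `hypothetical-*` catalog entries are all made up for this example. Only `n1-highmem-32` (32 vCPUs, 208 GB of RAM) comes from the task file discussed in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    """Hypothetical catalog entry; not a SkyPilot data structure."""
    name: str
    cpus: int
    memory_gb: float


def satisfies(value: float, spec: str) -> bool:
    """Check a value against a spec: "32+" means at least 32, "32" means exactly 32."""
    if spec.endswith("+"):
        return value >= float(spec[:-1])
    return value == float(spec)


def filter_instances(catalog, cpus: str, memory_gb: str):
    """Keep only the instance types that meet both the CPU and memory specs."""
    return [
        inst for inst in catalog
        if satisfies(inst.cpus, cpus) and satisfies(inst.memory_gb, memory_gb)
    ]


# Illustrative catalog: the first two entries are invented for the example.
CATALOG = [
    InstanceType("hypothetical-small", cpus=8, memory_gb=52),
    InstanceType("hypothetical-wide", cpus=64, memory_gb=128),
    InstanceType("n1-highmem-32", cpus=32, memory_gb=208),
]

matches = filter_instances(CATALOG, cpus="32+", memory_gb="200+")
print([inst.name for inst in matches])
```

With requirements expressed this way, the task file would no longer need to pin a cloud-specific `instance_type`, which is the portability gain the TODO is after.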

Tested:

  • sky launch -c flexgen --use-spot --detach-setup ./flexgen/apps/task.yaml

@Michaelvll changed the title from "Adding SkyPilot example" to "Adding SkyPilot example for FlexGen" on Mar 7, 2023

@concretevitamin concretevitamin left a comment

Looks great! I can launch this easily.

One UX comment, the progress bar & cursor around the following section seems a bit weird. It doesn't show one line; and the cursor is not at the bottom line. Is it expected?

...
(task, pid=19775) Max sequence length: 456, Pad to sequences length: 512
(task, pid=19775) Init weights begin.
(task, pid=19775) Load the pre-trained pytorch weights of opt-30b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)l-00007-of-00007.bin: 100%|██████████| 822M/822M [00:11<00:00, 71.0MB/s]
(task, pid=19775) 0002-of-00007.bin:  10%|█         | 1.03G/9.87G [00:11<01:08, 128MB/s]1<01:11, 124MB/s]
(task, pid=19775) 0002-of-00007.bin:  74%|███████▍  | 7.30G/9.87G [00:57<00:17, 150MB/s]57<00:19, 134MB/s]]
(task, pid=19775) 0006-of-00007.bin:  80%|████████  | 7.92G/9.87G [00:57<00:12, 162MB/s]56<00:12, 173MB/s]
(task, pid=19775) Downloading (…)l-00002-of-00007.bin:  74%|███████▍  | 7.34G/9.87G [00:57<00:14, 170MB/s]]
(task, pid=19775) 0002-of-00007.bin:  75%|███████▍  | 7.37G/9.87G [00:57<00:13, 182MB/s]57<00:10, 177MB/s]
Downloading (…)l-00002-of-00007.bin:  76%|███████▌  | 7.49G/9.87G [01:02<02:14, 17.7MB/s]9<00:56, 42.6MB/s]
(task, pid=19775) 0006-of-00007.bin:  82%|████████▏ | 8.10G/9.87G [01:01<01:20, 22.0MB/s]8<00:11, 162MB/s]]
(task, pid=19775) 0005-of-00007.bin:  77%|███████▋  | 7.63G/9.87G [01:01<01:47, 20.7MB/s]8<00:17, 131MB/s]]
Downloading (…)l-00003-of-00007.bin:  83%|████████▎ | 8.16G/9.87G [01:02<01:29, 19.0MB/s]8<00:11, 156MB/s]
(task, pid=19775) Downloading (…)l-00004-of-00007.bin:  50%|████▉     | 4.91G/9.87G [00:57<01:14, 67.0MB/s]
(task, pid=19775) 0004-of-00007.bin:  51%|█████     | 5.02G/9.87G [01:01<03:51, 20.9MB/s]9<01:31, 53.0MB/s]
Downloading (…)l-00001-of-00007.bin:  33%|███▎      | 3.23G/9.79G [01:01<06:27, 16.9MB/s]8<02:25, 45.6MB/s]

Maybe it's due to the parallel download. I don't think it needs to be fixed in this PR.

README.md Outdated
```
sky launch -c flexgen --detach-setup flexgen/apps/task.yaml
```
Note that you can replace the run section with any FlexGen command. You can log into the cluster running the job with `ssh flexgen` and terminate the cluster with `sky down flexgen`.

Suggested change
Note that you can replace the run section with any FlexGen command. You can log into the cluster running the job with `ssh flexgen` and terminate the cluster with `sky down flexgen`.
You can then log into the cluster running the job with `ssh flexgen` for monitoring. Once the job has finished, the cluster will be automatically terminated due to the `--down` flag.
To run any other FlexGen command, you can edit [`flexgen/apps/task.yaml`](./flexgen/apps/task.yaml) and replace the `run` section.

I added `--down` and a sentence to explain it. Wdyt? We can also keep the original version of manually running `sky down`. The advantage of the original version is that the job seems to run for a long time, so people may want to Ctrl-C in the middle and terminate the cluster manually. Using autodown showcases a good feature, however.

Owner Author

How about we mention `--down` in the text instead of adding it to the command, as follows:

You can then log into the cluster running the job with `ssh flexgen` for monitoring. Once the job has finished, you can terminate the cluster with `sky down flexgen`, or pass the `--down` flag to the command above to have the cluster terminate itself automatically.

Reason:

  1. With `--down`, if the user detaches from the log, they will not be able to find the log once the cluster has been automatically terminated.
  2. Adding `--down` makes the launch command longer, which may not look good.

Open to discussions : )
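
To make the two options under discussion concrete, here is a small Python sketch that assembles both variants of the command. It only uses flags that appear in this thread (`-c`, `--detach-setup`, `--down`) plus `sky down`; the helper names `build_launch_command` and `build_teardown_command` are invented for illustration, and the sketch only prints the commands rather than running them.

```python
import shlex


def build_launch_command(cluster: str, task_yaml: str, autodown: bool = False):
    """Assemble the `sky launch` argument list discussed in this thread."""
    args = ["sky", "launch", "-c", cluster, "--detach-setup"]
    if autodown:
        # With --down, the cluster terminates itself once the job finishes,
        # so its logs are no longer reachable afterwards.
        args.append("--down")
    args.append(task_yaml)
    return args


def build_teardown_command(cluster: str):
    """Manual alternative: terminate the cluster explicitly when done."""
    return ["sky", "down", cluster]


manual = build_launch_command("flexgen", "flexgen/apps/task.yaml")
auto = build_launch_command("flexgen", "flexgen/apps/task.yaml", autodown=True)

print(shlex.join(manual))
print(shlex.join(build_teardown_command("flexgen")))
print(shlex.join(auto))
```

The manual variant keeps the launch command short and leaves the cluster (and its logs) around until `sky down flexgen` is run; the autodown variant trades that for not having to remember the teardown step.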

Makes sense. I think either the current version or the original sky down version is fine, up to you.

@Michaelvll
Owner Author

> One UX comment, the progress bar & cursor around the following section seems a bit weird. It doesn't show one line; and the cursor is not at the bottom line. Is it expected?

Yea.. that is a pretty annoying problem. It is indeed due to the parallel download, and I don't have a solution right now. Maybe we can leave it for the future.
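
One plausible mechanism for the leftover fragments in the log above (for example the trailing `1<01:11, 124MB/s]`) can be reproduced with a small simulation. This is a sketch under assumptions, not a diagnosis of the actual FlexGen or SkyPilot output path: it assumes each progress bar redraws itself by emitting a carriage return followed by its current text, and that several bars share one output line. The `render` function is a toy terminal written for this example and handles only `\r` and `\n`; the bar texts are made up.

```python
def render(stream: str) -> list[str]:
    """Render text like a very simple terminal that honors only \\r and \\n."""
    lines = [[]]
    col = 0
    for ch in stream:
        if ch == "\n":
            # Newline: start a fresh line at column 0.
            lines.append([])
            col = 0
        elif ch == "\r":
            # Carriage return: go back to column 0 without erasing anything.
            col = 0
        else:
            row = lines[-1]
            if col < len(row):
                row[col] = ch  # overwrite an existing character
            else:
                row.append(ch)  # extend the line
            col += 1
    return ["".join(row) for row in lines]


# Two bars from parallel downloads redraw the same line one after the other.
long_text = "Downloading shard-2:  74% [00:57<00:19, 134MB/s]"
short_text = "shard-6:  80% [00:57<00:12]"

rendered = render("\r" + long_text + "\r" + short_text)[0]
print(rendered)
```

Because the second bar is shorter than the first, it overwrites only the beginning of the line, and the tail of the first bar stays visible after it. That matches the shape of the stray suffixes in the pasted log, although the real output also involves per-line prefixes such as `(task, pid=19775)`, which this toy model ignores.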

README.md Outdated
sky check
```
sky launch -c flexgen --detach-setup flexgen/apps/task.yaml
You can now use a single command to automatically launch the benchmark on any cloud:

Suggested change
You can now use a single command to automatically launch the benchmark on any cloud:
You can now use a single command to launch the benchmark on any cloud, which automatically finds a region (in the cheapest-price order) with availability for the requested GPUs:

ditto

# Specify the resources required for this job.
resources:
  accelerators: T4:1
  instance_type: n1-highmem-32 # On GCP with 1 T4 GPU and more than 200GB of RAM.

Maybe ok to ship first and see what reports we get. We should expect a non-GCP user to fail at the sky launch ... command, however.

@Michaelvll merged commit 173b410 into main on Mar 9, 2023
@Michaelvll deleted the skypilot-example branch on March 9, 2023 at 07:28