[Docs] Update start/install.md #5398
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,13 +19,15 @@ uv pip install "sglang[all]>=0.4.5" --find-links https://flashinfer.ai/whl/cu124 | |
| - SGLang currently uses torch 2.5, so you need to install flashinfer for torch 2.5. If you want to install flashinfer separately, please refer to [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Please note that the FlashInfer pypi package is called `flashinfer-python` instead of `flashinfer`. | ||
|
|
||
| - If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root using either of the following solutions: | ||
|
|
||
| 1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable. | ||
| 2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above. | ||
|
|
||
| - If you encounter `ImportError: cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'`, use the version of `transformers` specified in [pyproject.toml](https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml). Currently, that means running `pip install transformers==4.48.3`. | ||
|
|
||
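A possible shell sketch of solution 1 for the `CUDA_HOME` error above. It assumes the CUDA toolkit is installed under `/usr/local` (as `cuda` or `cuda-<version>`); the helper name `find_cuda_home` is hypothetical and not part of SGLang or FlashInfer.

```bash
# Print the first CUDA install root found under the given directory.
# Assumption: toolkits are installed as <root>/cuda or <root>/cuda-<version>.
find_cuda_home() {
  root="${1:-/usr/local}"
  for candidate in "$root"/cuda "$root"/cuda-*; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Only set CUDA_HOME when the user has not already exported it.
if [ -z "${CUDA_HOME:-}" ]; then
  if found="$(find_cuda_home)"; then
    export CUDA_HOME="$found"
  else
    echo "No CUDA toolkit found under /usr/local; set CUDA_HOME manually" >&2
  fi
fi
```

If several versions are installed, the glob picks the first match in sort order, so exporting the exact path by hand is the safer choice on such machines.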
| ## Method 2: From source | ||
| ``` | ||
|
|
||
| ```bash | ||
| # Use the last release branch | ||
| git clone -b v0.4.5 https://github.com/sgl-project/sglang.git | ||
| cd sglang | ||
|
|
@@ -40,7 +42,7 @@ If you want to develop SGLang, it is recommended to use docker. Please refer to | |
|
|
||
| Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead: | ||
|
|
||
| ``` | ||
| ```bash | ||
| # Use the last release branch | ||
| git clone -b v0.4.5 https://github.com/sgl-project/sglang.git | ||
| cd sglang | ||
|
|
@@ -53,6 +55,7 @@ pip install -e "python[all_hip]" | |
| ``` | ||
|
|
||
| ## Method 3: Using docker | ||
|
|
||
| The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker). | ||
| Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens). | ||
|
|
||
|
|
@@ -104,13 +107,14 @@ drun v0.4.5-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 10 | |
| <summary>More</summary> | ||
|
|
||
| 1. Option 1: For single node serving (typically when the model size fits into GPUs on one node) | ||
|
|
||
| Execute command `kubectl apply -f docker/k8s-sglang-service.yaml`, to create k8s deployment and service, with llama-31-8b as example. | ||
|
|
||
| 2. Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as `DeepSeek-R1`) | ||
| Modify the LLM model path and arguments as necessary, then execute command `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml`, to create two nodes k8s statefulset and serving service. | ||
| </details> | ||
|
|
||
| Modify the LLM model path and arguments as necessary, then execute command `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml`, to create two nodes k8s statefulset and serving service. | ||
|
Contributor (Author): Same issue as above |
||
|
|
||
| </details> | ||
|
|
||
| ## Method 6: Run on Kubernetes or Clouds with SkyPilot | ||
|
|
||
|
|
@@ -141,6 +145,7 @@ run: | | |
| --host 0.0.0.0 \ | ||
| --port 30000 | ||
| ``` | ||
|
|
||
| </details> | ||
|
|
||
| ```bash | ||
|
|
@@ -150,10 +155,12 @@ HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml | |
| # Get the HTTP API endpoint | ||
| sky status --endpoint 30000 sglang | ||
| ``` | ||
|
|
||
| 3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve). | ||
| </details> | ||
|
|
||
| ## Common Notes | ||
|
|
||
| - [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub. | ||
| - If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`. | ||
| - The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install "sglang[srt]"`. `srt` is the abbreviation of SGLang runtime. | ||
|
|
||
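As a rough illustration of the FlashInfer fallback described in the Common Notes, the sketch below assembles a server launch command with and without the two fallback flags. The flags come from the note itself; the `python3 -m sglang.launch_server --model-path` entry point is assumed here because it is not shown in this diff, and `MODEL_PATH`, `USE_FALLBACK`, and `build_launch_cmd` are hypothetical names used only for this example.

```bash
# Placeholder model path; replace with a real Hugging Face ID or local path.
MODEL_PATH="${MODEL_PATH:-<your-model-path>}"

# Print the launch command; append the fallback flags when USE_FALLBACK=1.
build_launch_cmd() {
  cmd="python3 -m sglang.launch_server --model-path $MODEL_PATH"
  if [ "${USE_FALLBACK:-0}" = "1" ]; then
    cmd="$cmd --attention-backend triton --sampling-backend pytorch"
  fi
  printf '%s\n' "$cmd"
}

USE_FALLBACK=0 build_launch_cmd   # default FlashInfer backend
USE_FALLBACK=1 build_launch_cmd   # Triton attention + PyTorch sampling
```

The function only prints the command, so it can be inspected before running it on a GPU machine.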
This newline is required indeed 🥲
Why is it required 🤔?
See website: https://docs.sglang.ai/start/install.html#method-5-using-kubernetes
Options 1 & 2 are similar to two small headings. `Execute...` and `Modify...` are its description. Check the difference here.
Before:
Option 1: For single node serving (typically when the model size fits into GPUs on one node) Execute command `kubectl apply -f docker/k8s-sglang-service.yaml`, to create k8s deployment and service, with llama-31-8b as example.
Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as DeepSeek-R1) Modify the LLM model path and arguments as necessary, then execute command `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml`, to create two nodes k8s statefulset and serving service.
After modified (an unordered list might be more suitable):
Option 1: For single node serving (typically when the model size fits into GPUs on one node)
Execute command `kubectl apply -f docker/k8s-sglang-service.yaml`, to create k8s deployment and service, with llama-31-8b as example.
Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as DeepSeek-R1)
Modify the LLM model path and arguments as necessary, then execute command `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml`, to create two nodes k8s statefulset and serving service.
good, I see!
One more concern is making each command easy to copy/paste; we could improve it further like this:
Option 1: For single node serving (typically when the model size fits into GPUs on one node)
Execute the following command to create k8s deployment and service, with llama-31-8b as example.
Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as DeepSeek-R1)
Modify the LLM model path and arguments as necessary, then execute the following command to create two nodes k8s statefulset and serving service.
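One way to keep both commands copy/paste friendly is to give each its own block. The sketch below wraps the two `kubectl apply` commands from this thread in a small helper that prints the command by default, so it can be checked before it touches a cluster. The manifest paths are the ones quoted above; `DRY_RUN` and `deploy_sglang` are hypothetical names introduced only for this example.

```bash
# Print commands by default; set DRY_RUN=0 to actually apply the manifest.
DRY_RUN="${DRY_RUN:-1}"

deploy_sglang() {
  case "${1:-}" in
    single) manifest="docker/k8s-sglang-service.yaml" ;;         # Option 1
    multi)  manifest="docker/k8s-sglang-distributed-sts.yaml" ;; # Option 2
    *) echo "usage: deploy_sglang single|multi" >&2; return 2 ;;
  esac
  if [ "$DRY_RUN" = "1" ]; then
    echo "kubectl apply -f $manifest"
  else
    kubectl apply -f "$manifest"
  fi
}

deploy_sglang single
deploy_sglang multi
```

For the multi-node option, edit the model path and arguments in the manifest before running with `DRY_RUN=0`.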