
Commit 162bb65

Authored by: mht-sharma, gshtras, mawong-amd, charlifu, maleksan85
Merging ROCM/vllm main (#3)
* Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters (vllm-project#114)
* Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters
* Adding HTTP headers
* Add distributed executor backend to benchmark scripts (vllm-project#118)
* Add weight padding for moe (vllm-project#119)
* add weight padding for moe
* enable padding by default
* fix linter
* fix linter
* fix linter
* using envs.py
* fix linter
* [BugFix] Fix navi build after many custom for MI kernels added (vllm-project#116)
* fix navi build
* Created dummy kernels of unsupported on Navi to avoid function not found crashes at runtime
* replacing ifdefs on host code with those on kernels
* refactoring code to avoid unsupported call on Navi
* syntactic change
* import statements fix
* moving env variables to envs.py
* style fixes
* cosmetic changes for isort
* removed extra include
* moving use_skinny to be member

---------

Co-authored-by: lcskrishna <[email protected]>
Co-authored-by: maleksan85 <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>

* add empty_cache() after each padding (vllm-project#120)
* [FIX] Gradlib OOM on Navi and sometimes on MI (vllm-project#124)
* add memory clean up after every shape and parameter to reduce cache invalidation buffers
* small typo
* syntax change

---------

Co-authored-by: maleksan85 <[email protected]>

* save shape when fp8 solution not found (vllm-project#123)

Co-authored-by: Gregory Shtrasberg <[email protected]>

* Fix unit test for moe by adding padding (vllm-project#128)
* fix test_moe
* fix linter
* Llama3.1 (vllm-project#129)
* Add support for a rope extension method (vllm-project#6553)
* [BugFix] Fix RoPE error in Llama 3.1 (vllm-project#6693)

---------

Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>

* chat/completions endpoint (vllm-project#121)
* Initial implementation of chat/completions endpoint and its streaming variant
* Reusing datatypes from the openai entrypoints
* Response role from arg
* Added models endpoint and model validation from the request
* Optimize custom all reduce (vllm-project#130)
* First version
* Revert error. While there, add missing finalize.
* Use the correct defaults for ROCm. Increase sampling area to capture crossover.
* Scope end_sync as well.
* Guard only volatile keyword for ifndef USE_ROCM
* Document crossover
* Add BF16 support to custom PA (vllm-project#133)
* tightened atol for custom PA; enable supported head size, block sizes in testing
* update num_blocks and num_iters in benchmark PA to realistic settings
* move to generic b16 type
* bf16 first port
* enabled all bf16 tests, set atol for bf16
* enable custom PA for bf16 as well as block size 32 and head size 64
* fix cast to zero in custom PA reduce
* py linter fixes
* clang format fixes
* div round up clang-format

---------

Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>

* Making check for output match in original types. It saves some memory. (vllm-project#135)

Co-authored-by: maleksan85 <[email protected]>

* Make CAR ROCm 6.1 compatible. (vllm-project#137)
* remove scoping
* while there fix a typo
* while there remove unused variable
* Car revert (vllm-project#140)
* Per @iotamudelta suggestion until the deadlocks issue is better understood:
  Revert "Make CAR ROCm 6.1 compatible. (vllm-project#137)"
  This reverts commit 4d2dda6.
* Per @iotamudelta suggestion until the deadlocks issue is better understood:
  Revert "Optimize custom all reduce (vllm-project#130)"
  This reverts commit 636ff01.

---------

Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Matt Wong <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: lcskrishna <[email protected]>
Co-authored-by: maleksan85 <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: iotamudelta <[email protected]>
Co-authored-by: sanyalington <[email protected]>
1 parent aeedfff commit 162bb65

File tree: 21 files changed (+1280, -702 lines)


CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ set(PYTHON_SUPPORTED_VERSIONS "3.8" "3.9" "3.10" "3.11")
 set(CUDA_SUPPORTED_ARCHS "7.0;7.5;8.0;8.6;8.9;9.0")
 
 # Supported AMD GPU architectures.
-set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100")
+set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101")
 
 #
 # Supported/expected torch versions for CUDA/ROCm.

benchmarks/benchmark_latency.py

Lines changed: 32 additions & 21 deletions
@@ -19,27 +19,30 @@ def main(args: argparse.Namespace):
 
     # NOTE(woosuk): If the request cannot be processed in a single batch,
     # the engine will automatically process the request in multiple batches.
-    llm = LLM(model=args.model,
-              speculative_model=args.speculative_model,
-              num_speculative_tokens=args.num_speculative_tokens,
-              tokenizer=args.tokenizer,
-              quantization=args.quantization,
-              quantized_weights_path=args.quantized_weights_path,
-              tensor_parallel_size=args.tensor_parallel_size,
-              trust_remote_code=args.trust_remote_code,
-              dtype=args.dtype,
-              enforce_eager=args.enforce_eager,
-              kv_cache_dtype=args.kv_cache_dtype,
-              quantization_param_path=args.quantization_param_path,
-              device=args.device,
-              ray_workers_use_nsight=args.ray_workers_use_nsight,
-              worker_use_ray=args.worker_use_ray,
-              use_v2_block_manager=args.use_v2_block_manager,
-              enable_chunked_prefill=args.enable_chunked_prefill,
-              download_dir=args.download_dir,
-              block_size=args.block_size,
-              disable_custom_all_reduce=args.disable_custom_all_reduce,
-              gpu_memory_utilization=args.gpu_memory_utilization)
+    llm = LLM(
+        model=args.model,
+        speculative_model=args.speculative_model,
+        num_speculative_tokens=args.num_speculative_tokens,
+        tokenizer=args.tokenizer,
+        quantization=args.quantization,
+        quantized_weights_path=args.quantized_weights_path,
+        tensor_parallel_size=args.tensor_parallel_size,
+        trust_remote_code=args.trust_remote_code,
+        dtype=args.dtype,
+        enforce_eager=args.enforce_eager,
+        kv_cache_dtype=args.kv_cache_dtype,
+        quantization_param_path=args.quantization_param_path,
+        device=args.device,
+        ray_workers_use_nsight=args.ray_workers_use_nsight,
+        worker_use_ray=args.worker_use_ray,
+        use_v2_block_manager=args.use_v2_block_manager,
+        enable_chunked_prefill=args.enable_chunked_prefill,
+        download_dir=args.download_dir,
+        block_size=args.block_size,
+        disable_custom_all_reduce=args.disable_custom_all_reduce,
+        gpu_memory_utilization=args.gpu_memory_utilization,
+        distributed_executor_backend=args.distributed_executor_backend,
+    )
 
     sampling_params = SamplingParams(
         n=args.n,

@@ -237,5 +240,13 @@ def run_to_completion(profile_dir: Optional[str] = None):
                         help='the fraction of GPU memory to be used for '
                         'the model executor, which can range from 0 to 1.'
                         'If unspecified, will use the default value of 0.9.')
+    parser.add_argument(
+        '--distributed-executor-backend',
+        choices=['ray', 'mp', 'torchrun'],
+        default=None,
+        help='Backend to use for distributed serving. When more than 1 GPU '
+        'is used, on CUDA this will be automatically set to "ray" if '
+        'installed or "mp" (multiprocessing) otherwise. On ROCm, this is '
+        'instead set to torchrun by default.')
     args = parser.parse_args()
     main(args)
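
For reference, the same keyword can be passed when constructing LLM directly, outside the benchmark script. A minimal sketch, assuming this ROCm fork is installed; the model name and sampling settings are placeholders, not values taken from the diff:

from vllm import LLM, SamplingParams

# Placeholder model; any model supported by this build works here.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=1,
    # None lets vLLM pick the backend; per the new help text, multi-GPU runs
    # default to "ray"/"mp" on CUDA and "torchrun" on ROCm in this fork.
    distributed_executor_backend=None,
)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, max_tokens=16))
print(outputs[0].outputs[0].text)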

benchmarks/benchmark_throughput.py

Lines changed: 13 additions & 2 deletions
@@ -79,6 +79,7 @@ def run_vllm(
     enable_prefix_caching: bool,
     enable_chunked_prefill: bool,
     max_num_batched_tokens: int,
+    distributed_executor_backend: Optional[str],
     gpu_memory_utilization: float = 0.9,
     worker_use_ray: bool = False,
     download_dir: Optional[str] = None,

@@ -104,6 +105,7 @@ def run_vllm(
         download_dir=download_dir,
         enable_chunked_prefill=enable_chunked_prefill,
         max_num_batched_tokens=max_num_batched_tokens,
+        distributed_executor_backend=distributed_executor_backend,
     )
 
     # Add the requests to the engine.

@@ -229,8 +231,9 @@ def main(args: argparse.Namespace):
             args.max_model_len, args.enforce_eager, args.kv_cache_dtype,
             args.quantization_param_path, args.device,
             args.enable_prefix_caching, args.enable_chunked_prefill,
-            args.max_num_batched_tokens, args.gpu_memory_utilization,
-            args.worker_use_ray, args.download_dir)
+            args.max_num_batched_tokens, args.distributed_executor_backend,
+            args.gpu_memory_utilization, args.worker_use_ray,
+            args.download_dir)
     elif args.backend == "hf":
         assert args.tensor_parallel_size == 1
         elapsed_time = run_hf(requests, args.model, tokenizer, args.n,

@@ -384,6 +387,14 @@ def main(args: argparse.Namespace):
                         type=str,
                         default=None,
                         help='Path to save the throughput results in JSON format.')
+    parser.add_argument(
+        '--distributed-executor-backend',
+        choices=['ray', 'mp', 'torchrun'],
+        default=None,
+        help='Backend to use for distributed serving. When more than 1 GPU '
+        'is used, on CUDA this will be automatically set to "ray" if '
+        'installed or "mp" (multiprocessing) otherwise. On ROCm, this is '
+        'instead set to torchrun by default.')
     args = parser.parse_args()
     if args.tokenizer is None:
         args.tokenizer = args.model
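
The help text added in both benchmark scripts describes a platform-dependent default. That rule can be restated as the sketch below; the function and its arguments are illustrative only and are not how vLLM chooses the backend internally:

from typing import Optional

def default_executor_backend(world_size: int, on_rocm: bool,
                             ray_installed: bool) -> Optional[str]:
    """Illustrative restatement of the --distributed-executor-backend help text."""
    if world_size <= 1:
        return None  # single GPU: no distributed executor needed
    if on_rocm:
        return "torchrun"  # ROCm builds default to torchrun in this fork
    return "ray" if ray_installed else "mp"  # CUDA: prefer Ray when available

Passing the flag explicitly on the command line overrides this default in both benchmark scripts.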

benchmarks/kernels/benchmark_paged_attention.py

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@
 from vllm._custom_C import paged_attention_custom
 from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, create_kv_caches_with_random
 
-NUM_BLOCKS = 1024
+NUM_BLOCKS = 1024 * 1024
 PARTITION_SIZE = 256
 
 

@@ -176,7 +176,7 @@ def run_cuda_benchmark(num_iters: int, profile: bool = False) -> float:
     if do_profile:
         latency = run_benchmark(num_iters=1, profile=True)
     else:
-        latency = run_benchmark(num_iters=100, profile=False)
+        latency = run_benchmark(num_iters=1000, profile=False)
     print(f"Kernel running time: {latency * 1000000:.3f} us")
 
 
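
On the paged-attention benchmark change: growing NUM_BLOCKS from 1024 to 1024 * 1024 (and num_iters from 100 to 1000) follows the commit note about moving the benchmark to realistic settings, presumably because a 1024-block pool is small enough to sit in on-chip cache and flatter kernel latency. A back-of-the-envelope footprint check is sketched below; the block size, head count, head size, and dtype width are assumptions for illustration, not values fixed by this diff:

# Rough KV-cache footprint of the benchmark block pool (illustrative parameters).
num_blocks = 1024 * 1024   # NUM_BLOCKS after this change
block_size = 16            # tokens per block (assumed)
num_kv_heads = 8           # assumed
head_size = 128            # assumed
dtype_bytes = 2            # fp16 / bf16
kv_factor = 2              # separate key and value caches

footprint_gib = (num_blocks * block_size * num_kv_heads * head_size *
                 dtype_bytes * kv_factor) / 2**30
print(f"Single-layer KV cache: ~{footprint_gib:.0f} GiB")  # ~64 GiB with these assumptions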