Merging ROCM/vllm main #3

mht-sharma · 2024-08-15T16:33:49Z

Merging ROCM/vllm main


* fix navi build
* Created dummy kernels of unsupported on Navi to avoid function not found crashes at runtime
* replacing ifdefs on host code with those on kernels
* refactoring code to avoid unsupported call on Navi
* syntactic change
* import statements fix
* moving env variables to envs.py
* style fixes
* cosmetic changes for isort
* removed extra include
* moving use_skinny to be member

---------

Co-authored-by: lcskrishna
Co-authored-by: maleksan85
Co-authored-by: Gregory Shtrasberg


* tightened atol for custom PA; enable supported head size, block sizes in testing
* update num_blocks and num_iters in benchmark PA to realistic settings
* move to generic b16 type
* bf16 first port
* enabled all bf16 tests, set atol for bf16
* enable custom PA for bf16 as well as block size 32 and head size 64
* fix cast to zero in custom PA reduce
* py linter fixes
* clang format fixes
* div round up clang-format

---------

Co-authored-by: Charlie Fu
Co-authored-by: Gregory Shtrasberg


* Per @iotamudelta suggestion until the deadlocks issue is better understood
  Revert "Make CAR ROCm 6.1 compatible. (#137)"
  This reverts commit 4d2dda6.
* Per @iotamudelta suggestion until the deadlocks issue is better understood
  Revert "Optimize custom all reduce (#130)"
  This reverts commit 636ff01.

gshtras and others added 15 commits August 2, 2024 14:26

Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters (#114)

3e480e9

* Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters
* Adding HTTP headers

Add distributed executor backend to benchmark scripts (#118)

42b1b9a

Add weight padding for moe (#119)

5fac73f

* add weight padding for moe
* enable padding by default
* fix linter
* fix linter
* fix linter
* using envs.py
* fix linter

add empty_cache() after each padding (#120)

98f31cd

[FIX] Gradlib OOM on Navi and sometimes on MI (#124)

30f12f0

* add memory clean up after every shape and parameter to reduce cache invalidation buffers
* small typo
* syntax change

---------

Co-authored-by: maleksan85

save shape when fp8 solution not found (#123)

8608888

Co-authored-by: Gregory Shtrasberg

Fix unit test for moe by adding padding (#128)

f49dff3

* fix test_moe
* fix linter

Llama3.1 (#129)

dd1a208

* Add support for a rope extension method (vllm-project#6553)
* [BugFix] Fix RoPE error in Llama 3.1 (vllm-project#6693)

---------

Co-authored-by: Simon Mo
Co-authored-by: Woosuk Kwon

chat/completions endpoint (#121)

674da1d

* Initial implementation of chat/completions endpoint and its streaming variant
* Reusing datatypes from the openai entrypoints
* Response role from arg
* Added models endpoint and model validation from the request

Optimize custom all reduce (#130)

636ff01

* First version
* Revert error. While there, add missing finalize.
* Use the correct defaults for ROCm. Increase sampling area to capture crossover.
* Scope end_sync as well.
* Guard only volatile keyword for ifndef USE_ROCM
* Document crossover

Making check for output match in original types. It saves some memory. (#135)

4132cbe

Co-authored-by: maleksan85

Make CAR ROCm 6.1 compatible. (#137)

4d2dda6

* remove scoping
* while there fix a typo
* while there remove unused variable

mht-sharma merged commit 162bb65 into mht-sharma:tgi-vllm-rocm Aug 15, 2024


7 participants
