@daquexian daquexian commented Jul 7, 2023

Add a prefetching strategy: `*5+3` means the weights of the first 5 layers always remain in GPU memory, and when the i-th layer executes, the weights of the (i+3)-th layer are prefetched asynchronously into GPU memory (and dropped once the (i+3)-th layer finishes executing).

The old stream strategy syntax like `*5+` is now an abbreviation for `*5+0`.
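To make the schedule concrete, here is a pure-Python model of which layers' weights are resident under a `*{pinned}+{ahead}` strategy (the function name is mine, and this is a sketch of the schedule described above, not the actual implementation):

```python
def resident_layers(pinned, ahead, layer_idx, n_layers):
    """Which layers' weights are in GPU memory while `layer_idx` executes,
    under a `*{pinned}+{ahead}` strategy. A pure-Python model of the
    schedule, not the actual implementation."""
    resident = set(range(min(pinned, n_layers)))  # always-pinned layers
    # Streamed layers: layer j+ahead is prefetched while layer j runs and
    # dropped once it finishes, so during `layer_idx` the streamed layers
    # still resident (or in flight) are layer_idx .. layer_idx + ahead.
    for j in range(layer_idx, min(layer_idx + ahead + 1, n_layers)):
        if j >= pinned:
            resident.add(j)
    return resident

# e.g. `*5+3` while layer 7 runs: layers 0-4 pinned, 7-10 streamed
print(sorted(resident_layers(5, 3, 7, 32)))
# -> [0, 1, 2, 3, 4, 7, 8, 9, 10]
```

Under this model, `*10+10` keeps up to 21 layers resident at peak (10 pinned plus 11 streamed), about the same as `*20+0` (20 pinned plus 1 streamed), which is consistent with the near-identical GPU memory figures in the benchmarks below.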

The following shows the overlap between the computation CUDA stream (blue) and the memcpy CUDA stream (green):

*(screenshot: profiler timeline of the two streams)*

However, at the same memory budget (e.g. `*10+10` vs `*20+0`), prefetching doesn't necessarily speed up inference compared to the old stream strategy, because the memcpy is much slower than the computation and cannot be fully overlapped. Here are some benchmarks of the 7B world model (bf16, RWKV_JIT_ON=1, RWKV_CUDA_ON=0, A100 80G):

| Strategy | GPU Mem | Time |
| --- | --- | --- |
| `*10+0` | 6306MB | 0.7756s |
| `*15+0` | 8382MB | 0.6054s |
| `*10+10` | 10502MB | 0.6912s |
| `*15+5` | 10498MB | 0.5655s |
| `*20+0` | 10456MB | 0.7067s |
| No stream | 15184MB | 0.0567s |

7B world model (fp16, RWKV_JIT_ON=1, RWKV_CUDA_ON=1, A100 80G):

| Strategy | GPU Mem | Time |
| --- | --- | --- |
| `*10+0` | 6346MB | 0.6043s |
| `*15+0` | 8422MB | 0.5046s |
| `*10+10` | 10602MB | 0.5930s |
| `*15+5` | 10600MB | 0.4961s |
| `*20+0` | 10498MB | 0.5532s |
| No stream | 15184MB | 0.0195s |

BTW, it may be helpful to have a prepare API that prefetches the weights manually, to reduce the time spent in forward in some scenarios.
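A minimal sketch of what such an API could look like (the `WeightPrefetcher` name, the `prepare`/`wait` methods, and the threading-based loading are all my own illustration, not part of this PR; a real implementation would presumably enqueue the copies on the memcpy CUDA stream instead):

```python
import threading

class WeightPrefetcher:
    """Hypothetical prepare/forward pairing: prepare() starts loading
    weights in the background so a later forward() waits less."""

    def __init__(self, load_fn, layer_ids):
        self._load_fn = load_fn      # e.g. copies one layer's weights to GPU
        self._layer_ids = layer_ids  # streamed layers to warm up
        self._thread = None

    def prepare(self):
        # Kick off the prefetch asynchronously; returns immediately.
        self._thread = threading.Thread(
            target=lambda: [self._load_fn(i) for i in self._layer_ids])
        self._thread.start()

    def wait(self):
        # forward() would call this before touching the streamed weights.
        if self._thread is not None:
            self._thread.join()
            self._thread = None

# Usage: prefetch layers 10-12 while the caller does other work
loaded = []
p = WeightPrefetcher(loaded.append, [10, 11, 12])
p.prepare()   # overlaps with whatever runs between prepare and forward
p.wait()      # inside forward(): the weights are now resident
print(loaded) # -> [10, 11, 12]
```

The point is that the host-to-device copies overlap with whatever the caller does between `prepare` and `forward`, instead of starting only when `forward` is entered.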
