@MasterSkepticista
Two changes:

  • Use sizeof(floatX) for the element size; otherwise the bandwidth calculation is off by 2x when using bf16.
  • Count 3C memory ops instead of 4C: for each input token there are 2C reads and C writes.
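
The arithmetic behind both changes can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function name, the `bytes_per_element` parameter (standing in for `sizeof(floatX)`), and the shape `B, T, C = 8, 1024, 768` are assumptions made for the example. The "before" case assumes the old formula counted 4C accesses at 4 bytes each, which is inferred from the description above.

```python
def effective_bandwidth_gbs(num_tokens, channels, bytes_per_element,
                            time_ms, accesses_per_channel=3):
    """Effective bandwidth in GB/s.

    Each token touches `accesses_per_channel * channels` elements
    (3C = 2C reads + C writes in the corrected accounting).
    """
    bytes_moved = num_tokens * accesses_per_channel * channels * bytes_per_element
    seconds = time_ms * 1e-3
    return bytes_moved / seconds / 1e9


# Hypothetical problem shape (not stated in this PR).
B, T, C = 8, 1024, 768
time_ms = 0.0483  # kernel time taken from the benchmark output

# Assumed old accounting: 4C accesses, 4 bytes per element.
before = effective_bandwidth_gbs(B * T, C, bytes_per_element=4,
                                 time_ms=time_ms, accesses_per_channel=4)
# Corrected accounting: 3C accesses, 2 bytes per element (bf16).
after = effective_bandwidth_gbs(B * T, C, bytes_per_element=2, time_ms=time_ms)

print(f"before: {before:.2f} GB/s")
print(f"after:  {after:.2f} GB/s")
print(f"ratio:  {before / after:.2f}x")
```

Under these assumptions the two fixes compound: 2x from the element size and 4/3 from the access count, so the old figure overstates bandwidth by a factor of 8/3. That is close to the ratio between the two reported numbers (2082.91 / 782.40, about 2.66).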

On an RTX 3090 (936 GB/s peak memory bandwidth), the reported figure drops from well above the hardware peak to below it:

# Before
...
block_size 1024 | time 0.0483 ms | bandwidth 2082.91 GB/s

# After
...
block_size 1024 | time 0.0483 ms | bandwidth 782.40 GB/s
