Commit 214ca64
Upgrade cutlass to v3.8.0 with commit 833f699 (NVIDIA#131)
* Handle MNK Sm90{Row, Col}Reduction problem shapes (NVIDIA#1803)
* add is_last_tile
* Improve sm90 mixed dtype kernel (NVIDIA#1883)
* Add GMMA shape m64n40k16 (NVIDIA#1864)
* Add all supported GMMA shapes (NVIDIA#1890)
* add maximum support (NVIDIA#1833)
* fix typo (NVIDIA#1853)
* fix by adding public (NVIDIA#1753)
* added mapping for bf16 to torch::kBFloat16 (NVIDIA#1843)
Co-authored-by: Haicheng Wu <[email protected]>
* Fix README (NVIDIA#1658)
* Fix README
* Improve README
---------
Co-authored-by: Haicheng Wu <[email protected]>
* Adjusting code indentation (NVIDIA#1639)
* Include of regular_tile_iterator.h fixed for NVRTC (NVIDIA#1765)
* Include of regular_tile_iterator.h fixed for NVRTC
* More includes fixed for NVRTC
* Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (NVIDIA#1569)
fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
* remove redundant hardcoded packing configs in mixed dtype gemm (NVIDIA#1894)
Co-authored-by: Siyuan Fu <[email protected]>
* fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (NVIDIA#1856)
* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation
* add "print" template for subbyte_reference<T>
* Add a print for the uint{x}b_t type. (NVIDIA#1871)
* Refactor some GroupedGEMM logic (NVIDIA#1899)
* feat: support kFactor 8 used in mma tensor op tile iterator (NVIDIA#1512)
* Update publications (NVIDIA#1912)
* remove restriction of stride == kernel in nhwc_pooling (NVIDIA#1896)
* fix undefined in device code error (NVIDIA#1880)
* Fix the race condition of mixed-input gemm when writing the registers (NVIDIA#1931)
* move two warpgroup_wait
* merge main
---------
Co-authored-by: Siyuan Fu <[email protected]>
* Fix `cutlass` python library with cuda `12.6.2.post1` (NVIDIA#1942)
* Fix `cutlass` python library with cuda `12.6.2.post1`
Previously we had this error:
```
File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
_version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```
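The failure comes from splitting the CUDA version string on `.` and calling `int()` on every field, which breaks on suffix fields such as `post1`. Below is a minimal sketch of a more tolerant parse, assuming a hypothetical helper `parse_version_tuple`; it illustrates the idea only and is not the actual change made to `operation.py`.
```
import re

def parse_version_tuple(version: str) -> tuple:
    """Parse version strings such as '12.6.2.post1' into a tuple of ints.

    Hypothetical helper for illustration: fields without a leading run of
    digits (e.g. 'post1') are skipped instead of being passed to int().
    """
    parts = []
    for field in version.split("rc")[0].split("."):
        match = re.match(r"\d+", field)  # keep only the leading digits, if any
        if match:
            parts.append(int(match.group()))
    return tuple(parts)

print(parse_version_tuple("12.6.2.post1"))  # (12, 6, 2) -- no ValueError
print(parse_version_tuple("12.4.0rc1"))     # (12, 4, 0)
```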
* Update sm90_utils.py
* Update generator.py
* Update python/cutlass_library/generator.py
Co-authored-by: Jack Kosaian <[email protected]>
* Update python/cutlass_library/sm90_utils.py
Co-authored-by: Jack Kosaian <[email protected]>
---------
Co-authored-by: Jack Kosaian <[email protected]>
* add {uint4, uint2, int2} => {fp16, bf16} conversion (NVIDIA#1966)
* Improve mixed dtype GEMM (NVIDIA#1972)
* update
* fix a typo
* fix a typo that fails the compiling when ElementScale is not the same as MmaType (NVIDIA#1977)
* Fix CuTe README Typo (NVIDIA#1951)
* Fix Typo (NVIDIA#1962)
* 3.6.0 update (NVIDIA#2005)
* 3.6.0 update
* doc and swap stuff
---------
Co-authored-by: yuzhai <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* Update CHANGELOG.md
* Update 0x_gemm_tutorial.md (NVIDIA#1982)
Shouldn't this be BLK_M, BLK_**K**, k
* fix bug: arch/mma_sm60.h Mma<2,2,1> calculates incorrect results (NVIDIA#1989)
* fix mem fence (NVIDIA#2030)
Co-authored-by: yuzhai <[email protected]>
* Add half->int8 saturate conversion to guarantee a valid range (NVIDIA#1983)
* Add half->int8 saturate conversion to guarantee a valid range
* add gpu only macro
---------
Co-authored-by: Haicheng Wu <[email protected]>
* Add vector-types back to platform.h (NVIDIA#2026)
* Fix typo in library_defaults.py (NVIDIA#2024)
* Fix Typos (NVIDIA#2021)
* Fix Typo
* Fix Typo
* Add Line Break (NVIDIA#2020)
* Blockwise Scaling for FP8 (NVIDIA#1932)
* F8 Blockwise Scaling
* two more NumProducerThreadEvents
---------
Co-authored-by: Haicheng Wu <[email protected]>
* fix assertion in integer_subbytes.h (NVIDIA#1961)
* CUTLASS 3.7 (NVIDIA#2045)
* CUTLASS 3.7
* clean up changelog
---------
Co-authored-by: yuzhai <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* update 3.7 docs (NVIDIA#2051)
* update docs
* update docs
* update docs
* update docs
* update docs
---------
Co-authored-by: yuzhai <[email protected]>
* CUTLASS 3.8 Release (NVIDIA#2059)
* CUTLASS 3.8 Release
* update
* Update README.md
* Revert "Update README.md"
This reverts commit b353e36.
* update
* update
---------
Co-authored-by: Haicheng Wu <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* fix cuda 12.6 issues (NVIDIA#2066)
* fix a readme broken link (NVIDIA#2069)
* Update README.md
* Groupwise scaling along M for FP8 gemm (NVIDIA#2037)
* FP8 groupwise scaling along M
* small updates
---------
Co-authored-by: zl <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* bugfix generic-k code in top-k with softmax (NVIDIA#1993)
* bugfix generic-k code in top-k with softmax
* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp
Co-authored-by: Ali Hassani <[email protected]>
* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu
Co-authored-by: Ali Hassani <[email protected]>
---------
Co-authored-by: Ali Hassani <[email protected]>
* [EVT] Add support for Row/Col broadcast PtrArray (NVIDIA#2033)
* Add group support to EVT row/col broadcast.
* small modifications
---------
Co-authored-by: Haicheng Wu <[email protected]>
* v3.8.0 update (NVIDIA#2082)
* 3.8 update
* fix Markus' name
---------
Co-authored-by: yuzhai <[email protected]>
* [WA] Fix compilation errors
---------
Co-authored-by: Saagar Jha <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
Co-authored-by: Sergey Klevtsov <[email protected]>
Co-authored-by: Tri Dao <[email protected]>
Co-authored-by: Xinyu Yang <[email protected]>
Co-authored-by: sijialou <[email protected]>
Co-authored-by: Bogumil Sapinski Mobica <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
Co-authored-by: Lei Mao <[email protected]>
Co-authored-by: 103yiran <[email protected]>
Co-authored-by: MaxAkaAltmer <[email protected]>
Co-authored-by: 侯奇 <[email protected]>
Co-authored-by: Lain <[email protected]>
Co-authored-by: Siyuan Fu <[email protected]>
Co-authored-by: Caleb_Du <[email protected]>
Co-authored-by: LiYu Lu <[email protected]>
Co-authored-by: azhurkevich <[email protected]>
Co-authored-by: chenwei <[email protected]>
Co-authored-by: Wenlei Bao <[email protected]>
Co-authored-by: LiuQiang <[email protected]>
Co-authored-by: dan_the_3rd <[email protected]>
Co-authored-by: Jack Kosaian <[email protected]>
Co-authored-by: Yujia Zhai <[email protected]>
Co-authored-by: yuzhai <[email protected]>
Co-authored-by: Andrew O'Neill <[email protected]>
Co-authored-by: Dongxu.Wang <[email protected]>
Co-authored-by: ZZK <[email protected]>
Co-authored-by: Driss Guessous <[email protected]>
Co-authored-by: ZincCat <[email protected]>
Co-authored-by: Manish Gupta <[email protected]>
Co-authored-by: bobliao <[email protected]>
Co-authored-by: mihir-awatramani <[email protected]>
Co-authored-by: Liang <[email protected]>
Co-authored-by: zl <[email protected]>
Co-authored-by: Tadej Ciglarič <[email protected]>
Co-authored-by: Ali Hassani <[email protected]>
Co-authored-by: Josh Fromm <[email protected]>
2,224 files changed (+321,604 / -113,201 lines)