Releases: NVIDIA/cutlass
CUTLASS 2.1
Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Computes complex matrix products on matrices stored as disjoint real and imaginary parts
- SDK Examples of Planar Complex GEMMs
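To illustrate what "disjoint real and imaginary parts" means, here is a minimal host-side reference sketch of a planar complex matrix product. It is not the CUTLASS kernel or its API: `PlanarMatrix` and `planar_complex_gemm` are hypothetical names, and the row-major layout is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type (not a CUTLASS API): a row-major matrix whose real and
// imaginary parts live in two separate real-valued planes.
struct PlanarMatrix {
    std::size_t rows, cols;
    std::vector<float> real;  // real parts, rows * cols elements
    std::vector<float> imag;  // imaginary parts, rows * cols elements
    PlanarMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), real(r * c, 0.0f), imag(r * c, 0.0f) {}
};

// Reference C = A * B using (a + bi)(c + di) = (ac - bd) + (ad + bc)i.
PlanarMatrix planar_complex_gemm(const PlanarMatrix& A, const PlanarMatrix& B) {
    assert(A.cols == B.rows);
    PlanarMatrix C(A.rows, B.cols);
    for (std::size_t m = 0; m < A.rows; ++m) {
        for (std::size_t n = 0; n < B.cols; ++n) {
            float acc_re = 0.0f, acc_im = 0.0f;
            for (std::size_t k = 0; k < A.cols; ++k) {
                float ar = A.real[m * A.cols + k], ai = A.imag[m * A.cols + k];
                float br = B.real[k * B.cols + n], bi = B.imag[k * B.cols + n];
                acc_re += ar * br - ai * bi;  // real plane of the product
                acc_im += ar * bi + ai * br;  // imaginary plane of the product
            }
            C.real[m * B.cols + n] = acc_re;
            C.imag[m * B.cols + n] = acc_im;
        }
    }
    return C;
}
```

Because each plane is an ordinary real matrix, the inner loop consists only of real multiply-accumulates, which is presumably what lets the operation map onto real-valued Tensor Core instructions.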
BLAS-style host-side API added to CUTLASS Library
- API to launch compiled kernel instances for GEMM and planar complex GEMM
Minor enhancements and bug fixes
CUTLASS 2.0
Substantially refactored for:
- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code
Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler
Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
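As a rough illustration of the 8-bit integer case, the sketch below multiplies signed 8-bit operands and accumulates in 32 bits. The 32-bit accumulator is an assumption for this example (the notes above do not state the accumulator type), and `dot_s8_s32` is a hypothetical helper, not a CUTLASS function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: dot product of int8 operands with an int32 accumulator.
// Widening before the multiply keeps each product from overflowing 8 bits.
std::int32_t dot_s8_s32(const std::vector<std::int8_t>& a,
                        const std::vector<std::int8_t>& b) {
    assert(a.size() == b.size());
    std::int32_t acc = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        acc += static_cast<std::int32_t>(a[k]) * static_cast<std::int32_t>(b[k]);
    }
    return acc;
}
```

A GEMM with 8-bit operands is this same inner product repeated for every output element; the narrow operands reduce memory traffic while the wide accumulator preserves the result.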
Coverage of existing CUTLASS functionality
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through the WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs
Note: a host compiler supporting C++11 or greater is required.
CUTLASS 1.3.3
Final tagged release of CUTLASS 1.x branch.
CUTLASS 1.3.2
Performance enhancement for Volta Tensor Cores TN layout
- Fixed a performance defect with indirect access to a pointer array for the Volta Tensor Cores TN arrangement.
CUTLASS 1.3.0
CUTLASS 1.3 adds efficient GEMM kernels targeting Volta Tensor Cores via the mma.sync instruction added in CUDA 10.1.
CUTLASS 1.2
CUTLASS 1.2.0 (2018-10-26)
- Parallelized reductions across threadblocks ("Split-K")
- Improved IGEMM performance
- Batched strided WMMA GEMMs
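The "Split-K" idea listed above can be sketched on the host as follows: the K dimension is divided into slices, each slice produces a partial product, and a second phase sums the partials. This is a serial reference under assumed row-major layouts, not CUTLASS's kernel; `split_k_gemm` and its parameters are hypothetical. On a GPU, each slice would be handled by different threadblocks running in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: row-major C = A * B with K split into `slices` parts.
std::vector<float> split_k_gemm(const std::vector<float>& A,  // M x K
                                const std::vector<float>& B,  // K x N
                                std::size_t M, std::size_t N, std::size_t K,
                                std::size_t slices) {
    assert(slices > 0);
    assert(A.size() == M * K && B.size() == K * N);
    const std::size_t k_per_slice = (K + slices - 1) / slices;  // ceiling division

    // Phase 1: every slice computes its own partial M x N result.
    std::vector<std::vector<float>> partial(slices, std::vector<float>(M * N, 0.0f));
    for (std::size_t s = 0; s < slices; ++s) {
        const std::size_t k_begin = std::min(K, s * k_per_slice);
        const std::size_t k_end = std::min(K, k_begin + k_per_slice);
        for (std::size_t m = 0; m < M; ++m)
            for (std::size_t n = 0; n < N; ++n)
                for (std::size_t k = k_begin; k < k_end; ++k)
                    partial[s][m * N + n] += A[m * K + k] * B[k * N + n];
    }

    // Phase 2: reduce the partial results across slices.
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t s = 0; s < slices; ++s)
        for (std::size_t i = 0; i < M * N; ++i)
            C[i] += partial[s][i];
    return C;
}
```

Splitting K tends to help when M and N are small relative to K, since the output tiles alone would not provide enough independent work to occupy the GPU.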
CUTLASS 1.1
CUTLASS 1.1.0 release adds:
- Documentation
- Examples
- Turing Features
- Batched Strided GEMM
- Threadblock rasterization strategies
- Extended CUTLASS Core components
- Enhanced CUTLASS utilities
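For readers unfamiliar with the batched strided GEMM mentioned above, the following host-side sketch shows the addressing scheme: all batches share one allocation per operand, and batch `b` starts at a fixed stride from batch `b - 1`. The function name, parameter order, and row-major layout are assumptions for illustration and do not reproduce the CUTLASS interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: C_b = A_b * B_b for b in [0, batch_count), where
// matrix b of each operand begins at offset b * stride within its buffer.
void batched_strided_gemm(const std::vector<float>& A,
                          const std::vector<float>& B,
                          std::vector<float>& C,
                          std::size_t M, std::size_t N, std::size_t K,
                          std::size_t batch_count,
                          std::size_t stride_A, std::size_t stride_B,
                          std::size_t stride_C) {
    if (batch_count == 0) return;
    assert(A.size() >= (batch_count - 1) * stride_A + M * K);
    assert(B.size() >= (batch_count - 1) * stride_B + K * N);
    assert(C.size() >= (batch_count - 1) * stride_C + M * N);
    for (std::size_t b = 0; b < batch_count; ++b) {
        const float* a = A.data() + b * stride_A;  // start of A_b (M x K)
        const float* bm = B.data() + b * stride_B; // start of B_b (K x N)
        float* c = C.data() + b * stride_C;        // start of C_b (M x N)
        for (std::size_t m = 0; m < M; ++m) {
            for (std::size_t n = 0; n < N; ++n) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < K; ++k)
                    acc += a[m * K + k] * bm[k * N + n];
                c[m * N + n] = acc;
            }
        }
    }
}
```

Using a stride instead of an array of pointers means the location of every batch can be computed from the batch index alone, so a single launch can cover all batches.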
CUTLASS 1.0.1
CUTLASS 1.0.1.
Intra-threadblock reduction added for small threadblock tile sizes
- sgemm_64x128x16, sgemm_128x128x16, sgemm_128x64x16, sgemm_128x32x16, sgemm_64x64x16, sgemm_64x32x16
- igemm_32x32x128
GEMM K residue handled during the prologue, prior to the mainloop
Replaced the Google Test copy with a submodule. Use git submodule init.
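The K residue item in this release can be pictured with a simple dot product: when K is not a multiple of the tile size, the leftover elements are consumed first, so the main loop only ever sees full tiles. This is a sketch of the general technique under that assumption; `dot_residue_first` and `tile_k` are hypothetical names, and the actual CUTLASS prologue is not shown in these notes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: dot product that handles the K residue before the
// main loop, so the main loop iterates over whole tiles only.
float dot_residue_first(const std::vector<float>& a,
                        const std::vector<float>& b,
                        std::size_t tile_k) {
    assert(tile_k > 0);
    assert(a.size() == b.size());
    const std::size_t K = a.size();
    const std::size_t residue = K % tile_k;  // elements that do not fill a tile

    float acc = 0.0f;
    std::size_t k = 0;

    // Prologue: consume the partial tile (possibly empty).
    for (; k < residue; ++k) acc += a[k] * b[k];

    // Mainloop: K - residue is a multiple of tile_k, so every iteration
    // processes exactly tile_k elements with no per-element bounds check.
    for (; k < K; k += tile_k)
        for (std::size_t i = 0; i < tile_k; ++i)
            acc += a[k + i] * b[k + i];

    return acc;
}
```

Handling the irregular part up front keeps the hot loop free of boundary conditionals, which is the usual motivation for this arrangement.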
CUTLASS 1.0.0
CUTLASS v1.0.0
CUTLASS 0.1.1
Final patch of CUTLASS v0.1.