Releases: NVIDIA/cutlass

CUTLASS 2.1

09 Apr 23:48
e33d90b

Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores

  • Computes complex matrix products on matrices stored as disjoint real and imaginary parts
  • SDK Examples of Planar Complex GEMMs
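The planar layout described above keeps the real and imaginary parts of each matrix in separate buffers, so a complex product reduces to four real products. The following is a minimal reference sketch in plain Python, not CUTLASS code; the function names `matmul` and `planar_complex_gemm` are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Plain real-valued matrix product on lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def planar_complex_gemm(a_re, a_im, b_re, b_im):
    """Complex product C = A * B with operands stored as separate planes.

    (A_re + i A_im)(B_re + i B_im)
        = (A_re B_re - A_im B_im) + i (A_re B_im + A_im B_re)
    """
    rr = matmul(a_re, b_re)
    ii = matmul(a_im, b_im)
    ri = matmul(a_re, b_im)
    ir = matmul(a_im, b_re)
    c_re = [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(rr, ii)]
    c_im = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(ri, ir)]
    return c_re, c_im
```

The result is again returned as two planes, which matches the "disjoint real and imaginary parts" storage of the inputs.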

BLAS-style host-side API added to CUTLASS Library

  • API to launch compiled kernel instances for GEMM and planar complex GEMM

Minor enhancements and bug fixes

CUTLASS 2.0

22 Nov 17:40
7c0cd26

Substantially refactored for

  • Better performance, particularly for native Turing Tensor Cores
  • Robust and durable templates spanning the design space
  • Encapsulated functionality embodying modern C++11 programming techniques
  • Optimized containers and data types for efficient, generic, portable device code

Updates to:

  • Quick start guide
  • Documentation
  • Utilities
  • CUTLASS Profiler

Native Turing Tensor Cores

  • Efficient GEMM kernels targeting Turing Tensor Cores
  • Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality

  • GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
  • Volta Tensor Cores through native mma.sync and through the WMMA API
  • Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
  • Batched GEMM operations
  • Complex-valued GEMMs

Note: a host compiler supporting C++11 or greater is required.

CUTLASS 1.3.3

18 Nov 19:34
b5cab17

Final tagged release of CUTLASS 1.x branch.

CUTLASS 1.3.2

10 Jul 18:42
b5cab17

Performance enhancement for Volta Tensor Cores TN layout

  • Fixed a performance defect with indirect access to the pointer array for the Volta Tensor Cores TN arrangement.

CUTLASS 1.3.0

20 Mar 17:53
877bdca

CUTLASS 1.3 adds efficient GEMM kernels targeting Volta Tensor Cores via the mma.sync instruction added in CUDA 10.1.

CUTLASS 1.2

26 Oct 22:02
ed2ed4d

CUTLASS 1.2.0 (2018-10-26)

  • Parallelized reductions across threadblocks ("Split-K")
  • Improved IGEMM performance
  • Batched strided WMMA GEMMs
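As a rough illustration of the "Split-K" idea listed above: the K dimension is partitioned into slices, each slice produces a partial product independently (the work a separate threadblock would do), and a final reduction sums the partials. This is a plain Python sketch under that assumption, not the CUTLASS implementation; `split_k_gemm` and its `slices` parameter are hypothetical names.

```python
def split_k_gemm(a, b, slices):
    """Compute C = A * B by splitting the K dimension into `slices` chunks."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + slices - 1) // slices  # ceil(k / slices)

    # Each partial product covers one K range; on a GPU these would run
    # concurrently in different threadblocks.
    partials = []
    for s in range(slices):
        k0, k1 = s * chunk, min((s + 1) * chunk, k)
        partials.append([[sum(a[i][kk] * b[kk][j] for kk in range(k0, k1))
                          for j in range(n)] for i in range(m)])

    # Reduction step: sum the partial results element-wise.
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(m)]
```

Splitting K mainly helps when M and N are small relative to K, since the output tiles alone would otherwise expose too little parallelism.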

CUTLASS 1.1

28 Sep 20:02
6877595

CUTLASS 1.1.0 release adds:

  • Documentation
  • Examples
  • Turing Features
  • Batched Strided GEMM
  • Threadblock rasterization strategies
  • Extended CUTLASS Core components
  • Enhanced CUTLASS utilities
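For readers unfamiliar with the batched strided GEMM mentioned in the list above, the sketch below shows the general addressing scheme: all matrices of a batch live in one flat buffer, and batch `i` starts at offset `i * stride`. It is a plain Python illustration assuming row-major storage, not the CUTLASS interface; the function name and parameter names are hypothetical.

```python
def batched_strided_gemm(a, b, m, n, k, batch_count,
                         stride_a, stride_b, stride_c):
    """Run `batch_count` independent (m x k) * (k x n) products.

    `a`, `b` are flat row-major buffers; each batch begins at
    batch_index * stride in its buffer. Assumes stride_c >= m * n.
    """
    c = [0] * (batch_count * stride_c)
    for batch in range(batch_count):
        a0 = batch * stride_a
        b0 = batch * stride_b
        c0 = batch * stride_c
        for i in range(m):
            for j in range(n):
                c[c0 + i * n + j] = sum(
                    a[a0 + i * k + kk] * b[b0 + kk * n + j]
                    for kk in range(k))
    return c
```

A single base pointer plus a stride per operand avoids building an array of per-batch pointers, which is the main appeal of the strided form.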

CUTLASS 1.0.1

26 Jun 21:00
cf0301e

CUTLASS 1.0.1.

Intra-threadblock reduction added for small threadblock tile sizes

  • sgemm_64x128x16, sgemm_128x128x16, sgemm_128x64x16, sgemm_128x32x16, sgemm_64x64x16, sgemm_64x32x16
  • igemm_32x32x128
  • GEMM K residue handled during prologue prior to mainloop
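The K-residue item above can be pictured with a one-dimensional analogue: when K is not a multiple of the tile size, the leftover elements are consumed first, so every mainloop iteration processes a full tile with no bounds checks. The sketch below is a plain Python illustration of that ordering on a dot product, not CUTLASS code; `dot_residue_first` and `tile` are hypothetical names.

```python
def dot_residue_first(x, y, tile):
    """Dot product that handles the K % tile leftover before the main loop."""
    k = len(x)
    residue = k % tile

    # Prologue: the only place a partial tile is handled.
    acc = sum(x[i] * y[i] for i in range(residue))

    # Mainloop: every iteration covers exactly `tile` elements.
    for start in range(residue, k, tile):
        acc += sum(x[start + i] * y[start + i] for i in range(tile))
    return acc
```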

Replaced the bundled Google Test copy with a git submodule. Use git submodule init (followed by git submodule update) to fetch it.

CUTLASS 1.0.0

16 May 20:47
6f0d271

CUTLASS v1.0.0

CUTLASS 0.1.1

16 May 20:46
8437724

Final patch of CUTLASS v0.1.