Step-by-step optimization of matrix multiplication, implemented in CUDA. For an explanation of each kernel, see siboehm.com/CUDA-MMM. This repo is inspired by wangzyon/NVIDIA_SGEMM_PRACTICE.
Running the kernels on an NVIDIA A6000, GFLOPs at matrix size 4092x4092:
| Kernel              | GFLOPs  | Performance relative to cuBLAS |
| ------------------- | ------- | ------------------------------ |
| 1: Naive            | 307.2   | 1.3%                           |
| 2: GMEM Coalescing  | 1987.2  | 8.4%                           |
| 3: SMEM Blocktiling | 2981.3  | 12.6%                          |
| 4: 1D Warptiling    | 8508.3  | 36.0%                          |
| 5: 2D Warptiling    | 16319.0 | 69.0%                          |
| 6: Vectorize        | 19281.4 | 81.5%                          |
| 0: cuBLAS           | 23663.6 | 100.0%                         |
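For orientation, a minimal sketch of what the starting point (the naive kernel, 1 in the table) looks like: one thread per output element, with row-major `M x K`, `K x N`, and `M x N` matrices assumed here. Names and signature are illustrative; see siboehm.com/CUDA-MMM for the actual kernels.

```cuda
// Sketch of a naive SGEMM kernel: C = alpha * A @ B + beta * C.
// Each thread computes exactly one element of C; all loads go
// straight to global memory, which is why it reaches only ~1% of cuBLAS.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
  const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x; // row of C
  const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y; // col of C
  if (x < M && y < N) {
    float tmp = 0.0f;
    for (int i = 0; i < K; ++i) {
      tmp += A[x * K + i] * B[i * N + y];
    }
    C[x * N + y] = alpha * tmp + beta * C[x * N + y];
  }
}
```

Each subsequent kernel in the table keeps this interface but restructures the memory accesses (coalescing, shared-memory tiling, register tiling, vectorized loads).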
- Install dependencies: CUDA toolkit, Python (+ Seaborn), CMake, Ninja. See environment.yml.
- Configure the NVCC compilation parameters: look up your GPU's compute capability, then set it in `CMakeLists.txt`, e.g. for compute capability 8.6:

```cmake
set_target_properties(sgemm PROPERTIES CUDA_ARCHITECTURES 86)
```
- Build and run:

```bash
# Build with CMake + Ninja, then run a kernel by number
mkdir build && cd build && cmake .. -GNinja && ninja
./sgemm <kernel number>
```
For profiling, download NVIDIA Nsight Compute.
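With Nsight Compute installed, a single run can be profiled from the command line; the kernel number below is just an example:

```
# Profile kernel 6 and write the report to profile.ncu-rep
ncu -o profile ./sgemm 6
```

The resulting report can be opened in the Nsight Compute GUI to inspect memory throughput and occupancy per kernel.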