Step-by-step optimization of matrix multiplication, implemented in CUDA. For an explanation of each kernel, see siboehm.com/CUDA-MMM. This repo is inspired by wangzyon/NVIDIA_SGEMM_PRACTICE.
Running the kernels on an NVIDIA A100 (Ampere):
GFLOPs/s at matrix size 4092x4092 (rows sorted by performance, not kernel number):
| Kernel                              | GFLOPs/s | Performance relative to cuBLAS |
|-------------------------------------|---------:|-------------------------------:|
| 1: Naive                            |    226.9 |                            1.5% |
| 2: GMEM Coalescing                  |   2516.7 |                           17.0% |
| 3: SMEM Caching                     |   4158.3 |                           28.1% |
| 4: 1D Blocktiling                   |   8162.2 |                           55.2% |
| 5: 2D Blocktiling                   |  11355.8 |                           76.7% |
| 8: Avoid Bank Conflicts (Offset)    |  11646.9 |                           78.7% |
| 7: Avoid Bank Conflicts (Linearize) |  11923.9 |                           80.6% |
| 6: Vectorized Mem Access            |  12088.9 |                           81.7% |
| 9: Autotuning                       |  12717.4 |                           86.0% |
| 0: cuBLAS                           |  14792.5 |                          100.0% |
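
To make the starting point concrete, below is a minimal sketch of what kernel 1 (Naive) does: one thread per element of C, computing C = alpha\*A@B + beta\*C with every operand read straight from global memory. This is an illustrative version assuming row-major storage, not the repo's exact code; see the blog post linked above for the real kernels.

```cuda
// Sketch of a naive SGEMM kernel: each thread computes one element of C.
// Assumes row-major A (MxK), B (KxN), C (MxN).
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  const int col = blockIdx.y * blockDim.y + threadIdx.y;

  if (row < M && col < N) {
    float acc = 0.0f;
    // Dot product of row `row` of A with column `col` of B,
    // fetched element by element from global memory.
    for (int i = 0; i < K; ++i) {
      acc += A[row * K + i] * B[i * N + col];
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
  }
}
```

Each later kernel keeps this interface and removes one bottleneck at a time: kernel 2 remaps the thread-to-element assignment so warps read consecutive global-memory addresses, kernel 3 stages tiles of A and B in shared memory, kernels 4 and 5 have each thread compute multiple results to raise arithmetic intensity, and so on.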
- Install dependencies: CUDA toolkit, Python (+ Seaborn), CMake, Ninja. See environment.yml.
- Configure NVCC compilation parameters. Look up your GPU's compute capability (NVIDIA lists them at https://developer.nvidia.com/cuda-gpus), then set it in CMakeLists.txt, e.g. for compute capability 8.0 (A100):
```cmake
set(CUDA_COMPUTE_CAPABILITY 80)
```
- Build:
```bash
make
```
- Run one of the kernels (example after this list):
```bash
DEVICE=<device_id> ./sgemm <kernel number>
```
- Profiling via NVIDIA Nsight Compute (ncu):
```bash
make profile KERNEL=<kernel number>
```
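
For example, assuming a single-GPU machine (device 0) and the kernel numbering from the table above:

```bash
# Run kernel 1 (Naive) on GPU 0
DEVICE=0 ./sgemm 1

# Profile kernel 6 (Vectorized Mem Access) with Nsight Compute;
# reading GPU performance counters may require elevated permissions.
make profile KERNEL=6
```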