Step-by-step optimization of matrix multiplication, implemented in CUDA. For an explanation of each kernel, see siboehm.com/CUDA-MMM. This repo is inspired by wangzyon/NVIDIA_SGEMM_PRACTICE.
Running the kernels on an NVIDIA A100 (Ampere), GFLOPs/s at matrix size 4092x4092 (rows are ordered by performance, not by kernel number):
| Kernel | GFLOPs/s | Performance relative to cuBLAS |
|---|---|---|
| 1: Naive | 292 | 1.7% |
| 2: GMEM Coalescing | 3115.7 | 17.8% |
| 3: SMEM Caching | 5448.6 | 31.1% |
| 4: 1D Warptiling | 10345.5 | 59.0% |
| 5: 2D Warptiling | 14126.6 | 80.6% |
| 8: Avoid Bank Conflicts (Offset) | 15056.9 | 85.9% |
| 7: Avoid Bank Conflicts (Linearize) | 15157.5 | 86.5% |
| 6: Vectorized Mem Access | 15334.9 | 87.5% |
| 9: Autotuning | 15664.8 | 89.4% |
| 0: cuBLAS | 17521.2 | 100.0% |
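As a reference point for the table above, kernel 1 assigns one thread per output element of C. A minimal sketch of such a naive kernel, assuming row-major A (MxK), B (KxN), and C (MxN) and the convention C = αAB + βC; the repo's actual kernel may differ in naming and indexing:

```
// Naive SGEMM sketch: each thread computes one element of C = alpha*A*B + beta*C.
// Every thread reads a full row of A and column of B from global memory,
// which is why this kernel reaches only a small fraction of peak throughput.
__global__ void sgemm_naive(int M, int N, int K, float alpha, const float *A,
                            const float *B, float beta, float *C) {
  const int row = blockIdx.y * blockDim.y + threadIdx.y; // row of C
  const int col = blockIdx.x * blockDim.x + threadIdx.x; // column of C
  if (row < M && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
      acc += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
  }
}
```

The later kernels in the table improve on this baseline step by step: coalescing global-memory accesses, caching tiles in shared memory, and computing multiple results per thread.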
- Install dependencies: CUDA toolkit, Python (with Seaborn for plotting), CMake, Ninja. See `environment.yml`.
- Configure the NVCC compilation parameters: look up your GPU's compute capability, then edit `CMakeLists.txt` and change `set(CUDA_COMPUTE_CAPABILITY 80)` accordingly (80 corresponds to the Ampere A100).
- Build:
```
mkdir build && cd build && cmake .. -GNinja && ninja
```
- Run a kernel, selecting the GPU via the `DEVICE` environment variable:
```
DEVICE=<device_id> ./sgemm <kernel number>
```
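To find the `<device_id>` and compute capability of each GPU in your machine, a small standalone helper (not part of this repo) can query the CUDA runtime; compile it with `nvcc`:

```
// List visible GPUs with their device id and compute capability.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device %d: %s, compute capability %d.%d\n", i, prop.name,
           prop.major, prop.minor);
  }
  return 0;
}
```

The `major`/`minor` pair is what goes into `CUDA_COMPUTE_CAPABILITY` in `CMakeLists.txt` (e.g. 8.0 becomes 80).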
For profiling, download NVIDIA Nsight Compute.