Introduction: This repository compiles an extensive list of frameworks, libraries, and software for optimizing matrix-matrix multiplication (A * B = C). It serves as a comprehensive resource for developers and researchers interested in high-performance computing, numerical analysis, and the optimization of matrix operations.
- Fundamental Theories and Concepts
- General Optimization Techniques
- Frameworks
- Libraries
- Development Software: Debugging and Profiling
- University Courses & Tutorials
- Selected Papers
- Lecture Notes
- Blogs
- Other Learning Resources
- Tiny Examples
- How to Contribute
- License
- Acknowledgments
- How To Optimize Gemm: A guide and tutorial on optimizing GEMM operations.
- GEMM: From Pure C to SSE Optimized Micro Kernels: An in-depth look into optimizing GEMM from basic C to SSE.
- BLIS: A software framework for instantiating high-performance BLAS-like dense linear algebra libraries.
- Created by SHPC at UT Austin (formerly the FLAME group).
- BLISlab: A framework for experimenting with and learning about BLIS-like GEMM algorithms.
- NVIDIA CUTLASS 3.3: NVIDIA's collection of CUDA C++ template abstractions for developing high-performance GEMM kernels.
- Google gemmlowp: A small, self-contained low-precision GEMM library by Google.
- Eigen: A C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
- MAGMA (Matrix Algebra on GPU and Multicore Architectures): A collection of next-generation linear algebra libraries for heterogeneous computing.
- LAPACK: A software library for numerical linear algebra.
- OpenBLAS: An optimized BLAS library based on GotoBLAS2.
- Created by Xianyi Zhang.
- Intel MKL: Intel's Math Kernel Library offering highly optimized, threaded, and vectorized functions for mathematical operations.
- ARM Compute Library: A collection of low-level machine learning functions optimized for Arm® Cortex®-A and Arm® Neoverse® CPUs and Arm® Mali™ GPUs.
- NumPy: A Python library for scientific computing with a focus on array operations.
- SciPy: A Python library for scientific computing with a focus on linear algebra.
- TensorFlow: An open-source software library for machine learning.
- PyTorch: An open-source software library for machine learning.
- NVIDIA cuBLAS: NVIDIA's implementation of the BLAS (Basic Linear Algebra Subprograms) on top of its CUDA runtime.
- NVIDIA cuSPARSE: NVIDIA's library for sparse matrix operations on CUDA.
- cutlass_fpA_intB_gemm: A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer.
- libFLAME: A high-performance dense linear algebra library, the result of the FLAME methodology for systematically developing dense linear algebra libraries.
- ViennaCL: A free, open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP backends, including switching between them at runtime.
- CUSP: A C++ Templated Sparse Matrix Library.
- Boost uBLAS: A C++ template class library that provides BLAS level 1, 2, and 3 functionality for dense, packed, and sparse matrices. Its design unifies mathematical notation via operator overloading with efficient code generation via expression templates.
- CUV: A C++ template and Python library that makes it easy to use NVIDIA CUDA.
- Armadillo: A high quality linear algebra library (matrix maths) for the C++ language, aiming towards a good balance between speed and ease of use.
- Blaze: A high performance C++ math library.
- Memcheck (Valgrind): A memory error detector.
- Intel VTune Profiler: A performance analysis tool for Linux, Windows, Android, and macOS.
- gprof: A performance analysis tool for Unix applications.
- FPChecker: A tool for detecting floating-point accuracy problems.
- HPCToolkit: An integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the nation's largest supercomputers.
- MegPeak: A tool for measuring peak processor compute throughput; currently supports Arm, x86, and OpenCL-capable GPUs.
- HLS Tutorial and Deep Learning Accelerator Design Lab 1
- UCSB: CS 240A: Applied Parallel Computing
- UC Berkeley: CS267
- UT Austin: EE382 System-on-Chip (SoC) Design
- UT Austin (Flame): LAFF-On Programming for High Performance
- BLIS: A Framework for Rapidly Instantiating BLAS Functionality. FG Van Zee, RA Van De Geijn. 2015.
- Anatomy of High-Performance Many-Threaded Matrix Multiplication. TM Smith, R Van De Geijn, M Smelyanskiy, JR Hammond, FG Van Zee. 2014.
- Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor. Z Xianyi, W Qian, Z Yunquan. 2012.
- High-performance implementation of the level-3 BLAS. K Goto, R Van De Geijn. 2008.
- Anatomy of high-performance matrix multiplication. K Goto, RA Van De Geijn. 2008.
- ORNL: CUDA C++ Exercise: Basic Linear Algebra Kernels: GEMM Optimization Strategies
- Stanford: BLAS-level CPU Performance in 100 Lines of C
- Purdue: Optimizing matrix multiplication
- NJIT: Optimize Matrix Multiplication
- Optimizing Matrix Multiplication
- GEMM caching
- Matrix Multiplication on CPU
- Optimizing matrix multiplication: cache + OpenMP
- Tuning matrix multiplication (GEMM) for Intel GPUs
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Building a FAST matrix multiplication algorithm
- Matrix-Matrix Product Experiments with BLAZE
- The OpenBLAS Project and Matrix Multiplication Optimization (Chinese)
- Step by step optimization of cuda sgemm (Chinese)
- OpenBLAS gemm from scratch (Chinese)
- The Proper Approach to CUDA for Beginners: How to Optimize GEMM (Chinese)
- ARMv7 4x4kernel Optimization Practice (Chinese)
- NVIDIA Developer Blog: New cuBLAS 12.0 Features and Matrix Multiplication Performance on NVIDIA Hopper GPUs.
- Matrix Multiplication Background User's Guide: A guide to matrix multiplication performance on NVIDIA GPUs.
- Triton: A programming language for writing highly efficient GPU code.
- perf-book: "Performance Analysis and Tuning on Modern CPU" by Denis Bakhvalov, et al.
- SGEMM_CUDA: Step-by-step optimization of matrix multiplication, implemented in CUDA.
- simple-gemm: Collection of simple GEMM implementations.
- YHs_Sample: A CUDA implementation of GEMM.
- how-to-optimize-gemm: A row-major matmul optimization tutorial.
- GEMM: Fast Matrix Multiplication Implementation in C.
- BLIS.jl: A low-level Julia wrapper for BLIS typed interface.
- blis_apple: A BLIS library for Apple M1.
- DGEMM on Int8 Tensor Core: A library that intercepts cuBLAS DGEMM function calls and executes ozIMMU instead.
- chgemm: An int8 GEMM project.
If you have suggestions for adding or removing resources, please feel free to open a pull request or create an issue.
This work is shared under the MIT License.
Special thanks to all the contributors and maintainers of the resources listed in this repository.