Introduction: This repository compiles an extensive list of frameworks, libraries, and software for optimizing matrix-matrix multiplication (A * B = C). It serves as a comprehensive resource for developers and researchers interested in high-performance computing, numerical analysis, and the optimization of matrix operations.
- Fundamental Theories and Concepts
- General Optimization Techniques
- Frameworks
- Libraries
- Development Software: Debugging and Profiling
- University Courses & Tutorials
- Selected Papers
- Lecture Notes
- Blogs
- Other Learning Resources
- Tiny Examples
- How to Contribute
- License
- Acknowledgments
- How To Optimize Gemm: A guide and tutorial on optimizing GEMM operations.
- GEMM: From Pure C to SSE Optimized Micro Kernels: An in-depth look into optimizing GEMM from basic C to SSE.
- BLIS: A software framework for instantiating high-performance BLAS-like dense linear algebra libraries.
- Created by SHPC at UT Austin (formerly the FLAME group).
- BLISlab: A framework for experimenting with and learning about BLIS-like GEMM algorithms.
- NVIDIA CUTLASS 3.3: NVIDIA's collection of CUDA C++ template abstractions for developing high-performance GEMM kernels.
- Google gemmlowp: A small, self-contained low-precision GEMM library by Google.
- Eigen: A C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
- MAGMA (Matrix Algebra on GPU and Multicore Architectures): A collection of next-generation linear algebra libraries for heterogeneous computing.
- LAPACK: A software library for numerical linear algebra.
- OpenBLAS: An optimized BLAS library based on GotoBLAS2.
- Created by Xianyi Zhang.
- Intel MKL: Intel's Math Kernel Library offering highly optimized, threaded, and vectorized functions for mathematical operations.
- ARM Compute Library: A collection of low-level machine learning functions optimized for Arm® Cortex®-A and Arm® Neoverse® CPUs and Arm® Mali™ GPUs.
- NumPy: A Python library for scientific computing with a focus on array operations.
- SciPy: A Python library for scientific computing with a focus on linear algebra.
- TensorFlow: An open-source software library for machine learning.
- PyTorch: An open-source software library for machine learning.
- NVIDIA cuBLAS: NVIDIA's implementation of the BLAS (Basic Linear Algebra Subprograms) on top of its CUDA runtime.
- NVIDIA cuSPARSE: NVIDIA's library for sparse matrix operations on CUDA.
- cutlass_fpA_intB_gemm: A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer.
- libFLAME: A high-performance dense linear algebra library, the result of the FLAME methodology for systematically developing dense linear algebra libraries.
- ViennaCL: A free, open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP backends, including switching between them at runtime.
- CUSP: A C++ Templated Sparse Matrix Library.
- Boost uBLAS: A C++ template class library that provides BLAS level 1, 2, and 3 functionality for dense, packed, and sparse matrices. Its design unifies mathematical notation via operator overloading with efficient code generation via expression templates.
- CUV: A C++ template and Python library that makes it easy to use NVIDIA CUDA.
- Armadillo: A high quality linear algebra library (matrix maths) for the C++ language, aiming towards a good balance between speed and ease of use.
- Blaze: A high performance C++ math library.
- Memcheck (Valgrind): A memory error detector.
- Intel VTune Profiler: A performance analysis tool for Linux, Windows, Android, and macOS.
- gprof: A performance analysis tool for Unix applications.
- FPChecker: A tool for detecting floating-point accuracy problems.
- HPCToolkit: An integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the nation's largest supercomputers.
- MegPeak: A tool for measuring peak processor compute throughput; currently supports Arm, x86, and OpenCL-capable GPUs.
- HLS Tutorial and Deep Learning Accelerator Design Lab 1
- UCSB: CS 240A: Applied Parallel Computing
- UC Berkeley: CS267
- UT Austin: EE382 System-on-Chip (SoC) Design
- UT Austin (Flame): LAFF-On Programming for High Performance
- BLIS: A Framework for Rapidly Instantiating BLAS Functionality. FG Van Zee, RA Van De Geijn. 2015.
- Anatomy of High-Performance Many-Threaded Matrix Multiplication. TM Smith, R Van De Geijn, M Smelyanskiy, JR Hammond, FG Van Zee. 2014.
- Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor. Z Xianyi, W Qian, Z Yunquan. 2012.
- High-performance implementation of the level-3 BLAS. K Goto, R Van De Geijn. 2008.
- Anatomy of high-performance matrix multiplication. K Goto, RA Van De Geijn. 2008.
- ORNL: CUDA C++ Exercise: Basic Linear Algebra Kernels: GEMM Optimization Strategies
- Stanford: BLAS-level CPU Performance in 100 Lines of C
- Purdue: Optimizing matrix multiplication
- NJIT: Optimize Matrix Multiplication
- Optimizing Matrix Multiplication
- GEMM caching
- Matrix Multiplication on CPU
- Optimizing matrix multiplication: cache + OpenMP
- Tuning matrix multiplication (GEMM) for Intel GPUs
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Building a FAST matrix multiplication algorithm
- Matrix-Matrix Product Experiments with BLAZE
- The OpenBLAS Project and Matrix Multiplication Optimization (Chinese)
- Step by step optimization of cuda sgemm (Chinese)
- OpenBLAS gemm from scratch (Chinese)
- The Proper Approach to CUDA for Beginners: How to Optimize GEMM (Chinese)
- ARMv7 4x4kernel Optimization Practice (Chinese)
- NVIDIA Developer Blog: New cuBLAS 12.0 Features and Matrix Multiplication Performance on NVIDIA Hopper GPUs.
- Matrix Multiplication Background User's Guide: A guide to matrix multiplication performance on NVIDIA GPUs.
- Triton: A programming language for writing highly efficient GPU code.
- perf-book: "Performance Analysis and Tuning on Modern CPU" by Denis Bakhvalov, et al.
- SGEMM_CUDA: Step-by-step optimization of matrix multiplication, implemented in CUDA.
- simple-gemm: Collection of simple GEMM implementations.
- YHs_Sample: A CUDA implementation of GEMM.
- how-to-optimize-gemm: A row-major matmul optimization tutorial.
- GEMM: Fast Matrix Multiplication Implementation in C.
- BLIS.jl: A low-level Julia wrapper for BLIS typed interface.
- blis_apple: A BLIS library for Apple M1.
- DGEMM on Int8 Tensor Core: A library that intercepts cuBLAS DGEMM function calls and executes ozIMMU instead.
- chgemm: An int8 GEMM project.
If you have suggestions for adding or removing resources, please feel free to open a pull request or create an issue.
This work is shared under the MIT License.
Special thanks to all the contributors and maintainers of the resources listed in this repository.