A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software.

awesome-gemm

Introduction: This repository compiles an extensive list of frameworks, libraries, and software for optimizing matrix-matrix multiplication (A * B = C). It serves as a comprehensive resource for developers and researchers interested in high-performance computing, numerical analysis, and the optimization of matrix operations.
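For context, the baseline that every project in this list improves on is the textbook triple loop. A minimal sketch in pure Python (illustrative only; real libraries implement this in optimized C, assembly, or CUDA):

```python
def gemm_naive(A, B):
    """Naive matrix-matrix multiplication C = A * B.

    A is m x k, B is k x n, and the result C is m x n.
    This is the O(m*n*k) reference algorithm that optimized
    GEMM libraries reorder, block, vectorize, and parallelize.
    """
    m, k = len(A), len(A[0])
    k2, n = len(B), len(B[0])
    assert k == k2, "inner dimensions must match"
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += A[i][p] * B[p][j]
            C[i][j] = s
    return C
```

Optimized implementations keep this arithmetic but restructure it: loop reordering for memory locality, cache blocking, SIMD vectorization, and multi-threading.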

Fundamental Theories and Concepts

General Optimization Techniques

Frameworks

  • BLIS: A software framework for instantiating high-performance BLAS-like dense linear algebra libraries.
  • BLISlab: A framework for experimenting with and learning about BLIS-like GEMM algorithms.

Libraries

  • NVIDIA CUTLASS 3.3: NVIDIA's CUDA C++ template library for developing high-performance GEMM kernels.
  • Google gemmlowp: A small, self-contained low-precision GEMM library by Google.
  • Eigen: A C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
  • MAGMA (Matrix Algebra on GPU and Multicore Architectures): A collection of next-generation linear algebra libraries for heterogeneous computing.
  • LAPACK: A software library for numerical linear algebra.
  • OpenBLAS: An optimized BLAS library based on GotoBLAS2.
  • Intel MKL: Intel's Math Kernel Library offering highly optimized, threaded, and vectorized functions for mathematical operations.
  • ARM Compute Library: A collection of low-level machine learning functions optimized for Arm® Cortex®-A, Arm® Neoverse®, and Arm® Mali™ GPU architectures.
  • NumPy: A Python library for scientific computing with a focus on array operations.
  • SciPy: A Python library for scientific computing, built on NumPy, that includes optimized linear algebra routines.
  • TensorFlow: An open-source software library for machine learning.
  • PyTorch: An open-source software library for machine learning.
  • NVIDIA cuBLAS: NVIDIA's implementation of the BLAS (Basic Linear Algebra Subprograms) on top of its CUDA runtime.
  • NVIDIA cuSPARSE: NVIDIA's library for sparse matrix operations on CUDA.
  • cutlass_fpA_intB_gemm: A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer.
  • libFLAME: A high-performance dense linear algebra library that is the result of the FLAME methodology for systematically developing dense linear algebra libraries.
  • ViennaCL: A free, open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP backends, with switching between them at runtime.
  • CUSP: A C++ Templated Sparse Matrix Library.
  • Boost uBLAS: A C++ template class library providing BLAS level 1, 2, and 3 functionality for dense, packed, and sparse matrices. Its design unifies mathematical notation via operator overloading with efficient code generation via expression templates.
  • CUV: A C++ template and Python library that makes it easy to use NVIDIA CUDA.
  • Armadillo: A high quality linear algebra library (matrix maths) for the C++ language, aiming towards a good balance between speed and ease of use.
  • Blaze: A high performance C++ math library.

Development Software: Debugging and Profiling

  • Memcheck (Valgrind): A memory error detector.
  • Intel VTune Profiler: A performance analysis tool for Linux, Windows, Android, and macOS.
  • gprof: A performance analysis tool for Unix applications.
  • FPChecker: A tool for detecting floating-point accuracy problems.
  • HPCToolkit: An integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the nation's largest supercomputers.
  • MegPeak: A tool for measuring processor peak compute throughput; supports Arm, x86, and OpenCL-driven GPUs.

University Courses & Tutorials

Selected Papers

Lecture Notes

Blogs

Other Learning Resources

Tiny Examples

  • SGEMM_CUDA: Step-by-step optimization of matrix multiplication, implemented in CUDA.
  • simple-gemm: Collection of simple GEMM implementations.
  • YHs_Sample: A CUDA implementation of GEMM.
  • how-to-optimize-gemm: A row-major matmul optimization tutorial.
  • GEMM: Fast Matrix Multiplication Implementation in C.
  • BLIS.jl: A low-level Julia wrapper for BLIS typed interface.
  • blis_apple: A BLIS library for Apple M1.
  • DGEMM on Int8 Tensor Core: A library that intercepts cuBLAS DGEMM function calls and executes ozIMMU instead.
  • chgemm: An int8 GEMM project.

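The tutorials above typically progress from the naive triple loop toward cache blocking (tiling), where the loops are split over small tiles so each inner loop's working set fits in cache. A hedged Python sketch of that idea (the block size `bs` is illustrative, not tuned):

```python
def gemm_blocked(A, B, bs=4):
    """Cache-blocked (tiled) GEMM computing C = A * B.

    The i/p/j loops are split into tiles of size bs so that the
    submatrices touched by the inner loops stay small and reusable,
    which is the core locality trick behind optimized GEMM kernels.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, bs):
        for p0 in range(0, k, bs):
            for j0 in range(0, n, bs):
                # Multiply one bs x bs tile of A by one tile of B,
                # accumulating into the corresponding tile of C.
                for i in range(i0, min(i0 + bs, m)):
                    for p in range(p0, min(p0 + bs, k)):
                        a_ip = A[i][p]
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += a_ip * B[p][j]
    return C
```

In pure Python the blocking yields no speedup, but the same loop structure, combined with vector registers and packed tile buffers, is what BLIS-style kernels and the CUDA tutorials in this list implement for real performance.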
How to Contribute

If you have suggestions for adding or removing resources, please feel free to open a pull request or create an issue.

License

This work is shared under the MIT License.

Acknowledgments

Special thanks to all the contributors and maintainers of the resources listed in this repository.
