Stars
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
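The operation such a library fuses into one kernel is simple to state. A minimal PyTorch sketch of a W4A16-style matmul, with the dequantization kept separate for clarity (my illustration, not BitBLAS's API; the unpacked int4 layout and per-channel parameters are assumptions):

    import torch

    def w4a16_matmul_reference(x, w_int4, scales, zeros):
        # w_int4: (out, in) with values in [0, 15], kept unpacked for clarity;
        # a real kernel reads packed int4 and fuses this dequant into the GEMM.
        # scales, zeros: assumed per-output-channel quantization params, (out, 1)
        w = (w_int4.float() - zeros) * scales  # dequantize
        return x @ w.t()

    x = torch.randn(8, 64)                     # fp32 here so the sketch runs on CPU;
    w_int4 = torch.randint(0, 16, (128, 64))   # the real kernels work in fp16
    scales = torch.rand(128, 1) * 0.1
    zeros = torch.full((128, 1), 8.0)
    y = w4a16_matmul_reference(x, w_int4, scales, zeros)  # (8, 128)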
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
A pared-down flash-attention implementation using CUTLASS, intended to be instructive.
FP8 flash attention implemented on the Ada architecture using the CUTLASS repository.
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Flash Hyperbolic Attention in ~[...] lines of CUDA
Mixed precision training from scratch with Tensors and CUDA
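The recipe this usually means: keep fp32 master weights, run forward and backward through a low-precision copy, scale the loss so small gradients survive fp16, then unscale and update in fp32. A minimal sketch under those assumptions (not this repo's code; the toy regression problem and constants are made up):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.half if device == "cuda" else torch.float  # fp16 needs a GPU

    master_w = 0.01 * torch.randn(64, 1, device=device)      # fp32 master weights
    x = torch.randn(256, 64, device=device, dtype=dtype)
    target = torch.randn(256, 1, device=device, dtype=dtype)
    loss_scale, lr = 1024.0, 1e-3

    for step in range(100):
        w = master_w.clone().to(dtype).requires_grad_(True)  # low-precision copy
        loss = ((x @ w - target) ** 2).mean()
        (loss * loss_scale).backward()                       # scaled backward pass
        grad = w.grad.float() / loss_scale                   # unscale in fp32
        if torch.isfinite(grad).all():                       # skip step on overflow
            master_w -= lr * grad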
Flash Attention in ~100 lines of CUDA (forward pass only)
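The trick that fits the forward pass in so few lines is the online softmax: K and V are processed in tiles while a running row max and running denominator are maintained, so the full score matrix never exists. A PyTorch sketch of that math (single head, no masking; tile size and names are mine):

    import torch

    def flash_attn_forward(q, k, v, tile=64):
        # q: (Lq, d), k, v: (Lk, d); computes softmax(q @ k.T / sqrt(d)) @ v
        # one K/V tile at a time, so the Lq x Lk score matrix is never materialized
        scale = q.shape[-1] ** -0.5
        m = torch.full((q.shape[0], 1), float("-inf"))  # running row max
        l = torch.zeros(q.shape[0], 1)                  # running softmax denom
        o = torch.zeros_like(q)                         # running output
        for j in range(0, k.shape[0], tile):
            s = (q @ k[j:j + tile].t()) * scale         # scores for this tile
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)                    # tile's softmax numerator
            alpha = torch.exp(m - m_new)                # rescales old accumulators
            l = alpha * l + p.sum(dim=-1, keepdim=True)
            o = alpha * o + p @ v[j:j + tile]
            m = m_new
        return o / l

    q, k, v = (torch.randn(128, 32) for _ in range(3))
    ref = torch.softmax(q @ k.t() / 32 ** 0.5, dim=-1) @ v
    assert torch.allclose(flash_attn_forward(q, k, v), ref, atol=1e-5)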
Flash Attention in raw CUDA C, beating PyTorch.
PyTorch half-precision GEMM library with optional fused bias and optional ReLU/GELU.
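What the fusion buys: the bias add and activation happen while the GEMM result is still in registers, instead of a round trip through global memory. An unfused PyTorch reference for what such a kernel computes (hypothetical names, not this library's API):

    import torch

    def gemm_bias_act_reference(x, w, bias=None, act=None):
        # x: (M, K), w: (N, K); a fused kernel does all of this in one pass
        y = x @ w.t()
        if bias is not None:
            y = y + bias                        # epilogue step 1: bias add
        if act == "relu":
            y = torch.relu(y)                   # epilogue step 2: activation
        elif act == "gelu":
            y = torch.nn.functional.gelu(y)
        return y

    # fp32 inputs here so the sketch runs on CPU; the library itself is fp16
    y = gemm_bias_act_reference(torch.randn(4, 8), torch.randn(16, 8),
                                bias=torch.randn(16), act="gelu")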
FlashInfer: Kernel Library for LLM Serving
PeaBrane / mamba-tiny
Forked from johnma2006/mamba-minimal. A simple, minimal implementation of the Mamba SSM in one PyTorch file. More efficient than using for loops, but probably less efficient than using associative scans.
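The trade-off it mentions is concrete: the SSM update is a linear recurrence h[t] = a[t] * h[t-1] + b[t], which a Python loop evaluates in O(L) sequential steps, while an associative scan reaches the same result in O(log L) parallel rounds. A simplified diagonal sketch of both (my illustration, not this repo's code):

    import torch

    def scan_loop(a, b):
        # a, b: (L, d); h[t] = a[t] * h[t-1] + b[t] with h[-1] = 0
        h = torch.zeros(a.shape[-1])
        out = []
        for t in range(a.shape[0]):
            h = a[t] * h + b[t]
            out.append(h)
        return torch.stack(out)

    def scan_associative(a, b):
        # Hillis-Steele scan over the operator
        # (a1, b1) ∘ (a2, b2) = (a1 * a2, a2 * b1 + b2): the span doubles
        # each round, so only O(log L) sequential steps are needed.
        a, b = a.clone(), b.clone()
        L, span = a.shape[0], 1
        while span < L:
            a_prev = torch.ones_like(a)    # identity element for t < span
            b_prev = torch.zeros_like(b)
            a_prev[span:], b_prev[span:] = a[:-span], b[:-span]
            b = a * b_prev + b
            a = a * a_prev
            span *= 2
        return b

    a, b = torch.rand(64, 4) * 0.9, torch.randn(64, 4)
    assert torch.allclose(scan_loop(a, b), scan_associative(a, b), atol=1e-5)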
💻 Computer Systems: A Programmer's Perspective, lab assignment solutions.
How to optimize common algorithms in CUDA.
This project covers convolution operator optimization on GPUs, including GEMM-based (implicit GEMM) convolution.
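The GEMM-based formulation works by laying each receptive-field patch out as a column (im2col), which turns the convolution into one matrix multiply with the reshaped filters; an implicit GEMM kernel computes the same product but indexes into the input on the fly instead of materializing the patch matrix. A PyTorch sketch of the explicit version (stride 1, no padding; names are mine):

    import torch
    import torch.nn.functional as F

    def conv2d_as_gemm(x, weight):
        # x: (N, C, H, W), weight: (K, C, R, S)
        N, C, H, W = x.shape
        K, _, R, S = weight.shape
        cols = F.unfold(x, (R, S))        # im2col: (N, C*R*S, P) patch matrix
        out = weight.view(K, -1) @ cols   # the GEMM: (K, C*R*S) x (C*R*S, P)
        return out.view(N, K, H - R + 1, W - S + 1)

    x, w = torch.randn(2, 3, 16, 16), torch.randn(8, 3, 3, 3)
    assert torch.allclose(conv2d_as_gemm(x, w), F.conv2d(x, w), atol=1e-4)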
Fast and memory-efficient exact attention
🚀 AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Vi…
Mirrored from https://bitbucket.org/VictorEijkhout/hpc-book-and-course/ by https://githgmirror.com/
All PDFs of Victor Eijkhout's Art of HPC books and courses.
Demonstration of various hardware effects.
Assembler and decompiler for NVIDIA (Maxwell, Pascal, Volta, Turing, Ampere) GPUs.