hwchen2017 · Northeastern University

Showing results

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Python 362 29 Updated Oct 3, 2024

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

C++ 134 9 Updated Oct 4, 2024

Tile primitives for speedy kernels

Cuda 1,516 58 Updated Oct 3, 2024

Cute_exercise

Cuda 5 2 Updated Jul 30, 2024

A stripped-down flash-attention implementation built with CUTLASS, intended as a teaching example.

Cuda 29 1 Updated Aug 12, 2024

FP8 flash attention for the Ada architecture, implemented with the CUTLASS repository.

Cuda 46 3 Updated Aug 12, 2024

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

C++ 20 2 Updated Sep 7, 2024

Flash Hyperbolic Attention in ~[...] lines of CUDA

Cuda 12 1 Updated Apr 16, 2024

Mixed precision training from scratch with Tensors and CUDA

Python 18 1 Updated May 14, 2024

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 572 50 Updated Apr 7, 2024
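
For orientation, the following is a minimal NumPy sketch of the tiled, online-softmax forward pass that repositories like the one above implement in CUDA. It is an illustrative rewrite, not code from the repository; the shapes, block size, and variable names are arbitrary assumptions.

import numpy as np

def flash_attention_forward(Q, K, V, block_size=32):
    """Tiled attention forward pass with an online (streaming) softmax.

    Q, K, V: (seq_len, head_dim) arrays. Returns softmax(Q K^T / sqrt(d)) V
    without materializing the full seq_len x seq_len score matrix.
    """
    seq_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    O = np.zeros_like(Q)                     # output accumulator
    m = np.full(seq_len, -np.inf)            # running row maximum
    l = np.zeros(seq_len)                    # running softmax denominator

    for start in range(0, seq_len, block_size):
        Kb = K[start:start + block_size]     # key tile, shape (B, d)
        Vb = V[start:start + block_size]     # value tile, shape (B, d)

        S = (Q @ Kb.T) * scale               # partial scores, (seq_len, B)
        m_new = np.maximum(m, S.max(axis=1)) # updated row maximum
        P = np.exp(S - m_new[:, None])       # partial probabilities
        corr = np.exp(m - m_new)             # rescale factor for old accumulators

        l = l * corr + P.sum(axis=1)
        O = O * corr[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]

# Check against a naive reference implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_forward(Q, K, V), ref)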

CPSC524 Final Project

Cuda 3 Updated Dec 16, 2023

Flash Attention in raw Cuda C beating PyTorch

Cuda 12 Updated May 14, 2024

PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU.

Cuda 25 1 Updated Aug 26, 2024

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

C++ 13 17 Updated Aug 31, 2023

FlashInfer: Kernel Library for LLM Serving

Cuda 1,216 115 Updated Oct 5, 2024

Simple, minimal implementation of the Mamba SSM in one PyTorch file. More efficient than using for loops, but probably less efficient than using associative scans.

Python 95 14 Updated Apr 19, 2024
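
The for-loop versus associative-scan trade-off mentioned in the entry above concerns how the SSM recurrence is evaluated. Below is a small NumPy sketch of the sequential-loop form, with the associative combine noted in a comment; it is an illustration of the recurrence, not code from the repository, and the shapes are arbitrary assumptions.

import numpy as np

# The recurrence a selective SSM evaluates is h[t] = a[t] * h[t-1] + b[t]
# (elementwise, per state channel). A for loop takes O(T) sequential steps;
# an associative scan with combine ((a1, b1), (a2, b2)) -> (a1*a2, a2*b1 + b2)
# computes the same states in O(log T) parallel depth.
def scan_loop(a, b):
    """Sequential scan: a, b have shape (T, state_dim); returns all h[t]."""
    h = np.zeros(a.shape[1])
    hs = np.empty_like(a)
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        hs[t] = h
    return hs

# Example: T = 8 steps, 4-dimensional state.
rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, size=(8, 4))    # per-step decay factors
b = rng.standard_normal((8, 4))           # per-step inputs
hs = scan_loop(a, b)                      # hs[t] holds the state after step t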

💻 Computer Systems: A Programmer's Perspective, Lab Assignments Solutions

C 185 85 Updated Oct 3, 2019

How to optimize common algorithms in CUDA.

Cuda 1,479 122 Updated Oct 5, 2024

This project covers convolution operator optimization on GPUs, including GEMM-based (implicit GEMM) convolution.

C++ 16 Updated Sep 15, 2024

Fast and memory-efficient exact attention

Python 13,640 1,250 Updated Oct 5, 2024
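
A hedged usage sketch for the library above, assuming its flash_attn Python package and the flash_attn_func entry point; the shape convention and dtype requirements noted in the comments are from memory and may differ by version.

# Shapes are (batch, seqlen, num_heads, head_dim); inputs must be
# fp16 or bf16 tensors resident on a CUDA device.
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)   # same shape as q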

🚀 AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Vi…

2,664 305 Updated Aug 14, 2024

Mirror from https://bitbucket.org/VictorEijkhout/hpc-book-and-course/ by https://githgmirror.com/

TeX 10 4 Updated Aug 18, 2020

All PDFs of Victor Eijkhout's Art of HPC books and courses.

498 53 Updated Apr 12, 2024

Demonstration of various hardware effects.

C++ 2,828 159 Updated Feb 29, 2024

Assembler and decompiler for NVIDIA (Maxwell, Pascal, Volta, Turing, Ampere) GPUs.

Python 66 8 Updated Feb 23, 2023

Machine learning, in numpy

Python 15,304 3,711 Updated Oct 29, 2023

Jack's Jax Utilities

Python 6 1 Updated Mar 26, 2022