narain1

Starred repositories

16 stars written in Cuda

How to optimize some algorithms in CUDA.

Cuda 1,480 122 Updated Oct 5, 2024

🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.

Cuda 1,229 133 Updated Oct 5, 2024

Learn CUDA Programming, published by Packt

Cuda 997 234 Updated Dec 30, 2023

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 572 50 Updated Apr 7, 2024

UNet diffusion model in pure CUDA

Cuda 565 28 Updated Jun 28, 2024

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Cuda 272 64 Updated Sep 8, 2024

CUDA Learning guide

Cuda 214 21 Updated Jun 20, 2024

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda 181 15 Updated May 28, 2024

Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS

Cuda 171 13 Updated Jun 18, 2024

Cuda 93 12 Updated Sep 26, 2024

High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.

Cuda 82 3 Updated Jul 13, 2024

LLM training in simple, raw C/CUDA

Cuda 81 6 Updated May 1, 2024

Flash Attention in raw CUDA C, beating PyTorch

Cuda 12 Updated May 14, 2024

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda 3 Updated Mar 19, 2023