narain1

💭

I may be slow to respond.

Narain narain1

breakfix in progress

18 followers · 187 following

Starred repositories

16 stars written in Cuda

BBuf / how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

Cuda 1,480 122 Updated Oct 5, 2024

DefTruth / CUDA-Learn-Notes

🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.

Cuda 1,229 133 Updated Oct 5, 2024

PacktPublishing / Learn-CUDA-Programming

Learn CUDA Programming, published by Packt

Cuda 997 234 Updated Dec 30, 2023

tspeterkim / flash-attention-minimal

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 572 50 Updated Apr 7, 2024

clu0 / unet.cu

UNet diffusion model in pure CUDA

Cuda 565 28 Updated Jun 28, 2024

Bruce-Lee-LY / cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Cuda 272 64 Updated Sep 8, 2024

CisMine / Parallel-Computing-Cuda-C

CUDA Learning guide

Cuda 214 21 Updated Jun 20, 2024

usyd-fsalab / fp6_llm

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda 181 15 Updated May 28, 2024

66RING / tiny-flash-attention

Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS

Cuda 171 13 Updated Jun 18, 2024

wangsiping97 / FastGEMV

High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.

Cuda 82 3 Updated Jul 13, 2024

gevtushenko / llm.c

Forked from karpathy/llm.c

LLM training in simple, raw C/CUDA

Cuda 81 6 Updated May 1, 2024

dawn-chu / EECS-368-Programming-Massively-Parallel-Processors-with-CUDA

Cuda 19 8 Updated May 17, 2016

kilianhae / FlashAttention.C

Flash Attention in raw CUDA C, beating PyTorch

Cuda 12 Updated May 14, 2024

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda 3 Updated Mar 19, 2023

Lists (3)

comeback

cpp

cuda