GPU
How to optimize some algorithms in CUDA.
Fast and memory-efficient exact attention
A series on GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, s…
Convolutional Neural Network with CUDA (MNIST 99.23%)
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
Singular Binarized Neural Network based on GPU Bit Operations (see our SC-19 paper)
An unofficial CUDA assembler, for all generations of SASS, hopefully :)
Source code examples from the Parallel Forall Blog
CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.
[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Transformer-related optimization, including BERT and GPT