Stars
Video-Infinity generates long videos quickly using multiple GPUs without extra training.
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
A high-throughput and memory-efficient inference and serving engine for LLMs
Making large AI models cheaper, faster and more accessible
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
How to optimize some algorithms in CUDA.
Fast and memory-efficient exact attention
TinySTL is a subset of STL (some containers and algorithms are cut) and also a superset of STL (some other containers and algorithms are added)
AKG (Auto Kernel Generator) is an optimizer for operators in Deep Learning Networks, which provides the ability to automatically fuse ops with specific patterns.
A domain specific language to express machine learning workloads.
PlaidML is a framework for making deep learning work everywhere.
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
Development repository for the Triton language and compiler
uploadcare / pillow-simd
Forked from python-pillow/Pillow
The friendly PIL fork
CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.
[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl