Stars
Standalone Flash Attention v2 kernel without libtorch dependency
Fast and memory-efficient exact attention
whutbd / cuda-learn-note
Forked from DefTruth/CUDA-Learn-Notes
🎉 CUDA notes / collection of frequently asked interview questions / C++ notes; personal notes, updated irregularly: sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
`std::execution`, the proposed C++ framework for asynchronous and parallel programming.
How to optimize some algorithms in CUDA.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
A generative speech model for daily dialogue.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
🦄 🎃 👻 Enhanced edition of V2Ray routing rule files, usable as a replacement for the official V2Ray geoip.dat and geosite.dat; applicable to V2Ray, Xray-core, mihomo (Clash-Meta), hysteria, Trojan-Go, and leaf.
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
Matplot++: A C++ Graphics Library for Data Visualization 📊🗾
ISO/IEC JTC1 SC22 WG21 paper scheduling and management
Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes, Streams, HyperLogLogs,…
MySQL Server, the world's most popular open source database, and MySQL Cluster, a real-time, open source transactional database.
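Flare is a modern C++ development framework widely deployed in production in the Tencent Ads backend, comprising base libraries, RPC, various clients, and more. Its main characteristics are ease of use and low long-tail latency.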
A fast and lightweight fully featured OCI runtime and C library for running containers
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.