Starred repositories
A tool for bandwidth measurements on NVIDIA GPUs.
Open deep learning compiler stack for CPU, GPU and specialized accelerators
Simple samples for TensorRT programming
Fast and memory-efficient exact attention
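📚 A summary of basic C/C++ technical interview knowledge, covering language, libraries, data structures, algorithms, systems, networking, and linking and loading libraries, plus interview experience, recruitment, and referral information. This repository is a summary of the basic knowledge of recruiting job seekers and beginners in the direction of C/C++ technology, in…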
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
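Use ChatGPT to summarize the arXiv papers. Accelerates the whole research workflow: uses ChatGPT for full-paper summarization, professional translation, polishing, reviewing, and review responses.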
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…
The C++ Core Guidelines are a set of tried-and-true guidelines, rules, and best practices about coding in C++
Multi-GPU dynamic scheduler using PGAS style cross-GPU communication
Official Implementation of "Accel-GNN: High-Performance GPU Accelerator Design for Graph Neural Networks"
Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on PaddlePaddle
Enterprise graph machine learning framework for billion-scale graphs for ML scientists and data scientists.
Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
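AISystem mainly refers to AI systems, including full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks.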
Artifact for OSDI'23: MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms.
stdgpu: Efficient STL-like Data Structures on the GPU
🎃 GPU load-balancing library for regular and irregular computations.