Peking University
Stars
Super-Efficient RLHF Training of LLMs with Parameter Reallocation
A large-scale simulation framework for LLM inference
Collective communications library with various primitives for multi-machine training.
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
The Artifact of NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering
GIF encoder based on libimagequant (pngquant). Squeezes maximum possible quality from the awful GIF format.
A pupil in the computer world. (Felix Fu)
Modular and structured prompt caching for low-latency LLM inference
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Efficient research work environment setup for computer science and general workflow for Deep Learning experiments
Ring attention implementation with flash attention
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Scalable and Efficient Serverless Deployment for Large AI Models.
SpotServe: Serving Generative Large Language Models on Preemptible Instances
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
SGLang is a fast serving framework for large language models and vision language models.
A low-latency & high-throughput serving engine for LLMs
Efficient and easy multi-instance LLM serving
Translates Python documentation into Chinese for convenient reference. In short, this repo collects Python docs and does its best to translate them into Chinese.
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.