lessw2020's repositories
  • pingpong Public

    Integrating the pingpong kernel into PyTorch

    Cuda MIT License Updated Jun 11, 2024
  • A native PyTorch Library for large model training

    Python BSD 3-Clause "New" or "Revised" License Updated Jun 10, 2024
  • largefiles Public

    Updated Jun 7, 2024
  • gitlfs Public

    C Updated Jun 7, 2024
  • tf32_gemm Public

    Forked from bertmaher/tf32_gemm

    Example of binding a TF32 CUTLASS GEMM kernel to PyTorch

    Python Updated Jun 7, 2024
  • UVM_Tensor Public

    Experimental: CUDA Unified Virtual Memory (UVM) based tensors for PyTorch

    C++ MIT License Updated May 31, 2024
  • SpeeD Public

    Forked from kaiwang960112/SpeeD

    SpeeD: A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

    Python Apache License 2.0 Updated May 28, 2024
  • An implementation of FAdam (Fisher Adam) in PyTorch

    Python 14 MIT License Updated May 28, 2024
  • ColossalAI Public

    Forked from hpcaitech/ColossalAI

    Making large AI models cheaper, faster and more accessible

    Python Apache License 2.0 Updated May 16, 2024
  • apex_nvidia Public

    Forked from NVIDIA/apex

    A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch

    Python BSD 3-Clause "New" or "Revised" License Updated May 13, 2024
  • fp6_llm Public

    Forked from usyd-fsalab/fp6_llm

    Efficient GPU support for LLM inference with 6-bit quantization (FP6).

    Cuda Apache License 2.0 Updated May 9, 2024
  • actnn Public

    Forked from ucbrise/actnn

    ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training

    Python MIT License Updated May 8, 2024
  • An easy-to-understand TensorOp Matmul tutorial

    C++ Apache License 2.0 Updated May 6, 2024
  • pytorch_fork Public

    Forked from pytorch/pytorch

    Tensors and Dynamic neural networks in Python with strong GPU acceleration

    Python Other Updated May 6, 2024
  • GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

    Cuda MIT License Updated May 1, 2024
  • torchtune Public

    Forked from pytorch/torchtune

    A Native-PyTorch Library for LLM Fine-tuning

    Python BSD 3-Clause "New" or "Revised" License Updated Apr 30, 2024
  • This repository contains the experimental PyTorch native float8 training UX

    Python BSD 3-Clause "New" or "Revised" License Updated Apr 29, 2024
  • In-progress CUDA kernels

    Cuda 2 MIT License Updated Apr 28, 2024
  • nvcomp Public

    Forked from NVIDIA/nvcomp

    Repository for nvCOMP docs and examples. nvCOMP is a library for fast lossless compression/decompression on the GPU that can be downloaded from https://developer.nvidia.com/nvcomp.

    C++ Other Updated Apr 25, 2024
  • spacebyte Public

    Forked from kjslag/spacebyte

    A byte-level decoder architecture that matches the performance of tokenized Transformers.

    Jupyter Notebook Updated Apr 23, 2024
  • cutlass_local Public

    Forked from NVIDIA/cutlass

    CUDA Templates for Linear Algebra Subroutines

    C++ Other Updated Apr 19, 2024
  • megalodon Public

    Forked from XuezheMax/megalodon

    Reference implementation of Megalodon 7B model

    Cuda MIT License Updated Apr 18, 2024
  • vllm Public

    Forked from vllm-project/vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

    Python Apache License 2.0 Updated Apr 15, 2024
  • Step-by-step optimization of CUDA SGEMM

    Cuda Updated Apr 13, 2024
  • marlin-kernel Public

    Forked from IST-DASLab/marlin

    FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

    Python Apache License 2.0 Updated Apr 9, 2024
  • tau_graph Public

    Forked from pytorch/PiPPy

    Pipeline Parallelism for PyTorch

    Python BSD 3-Clause "New" or "Revised" License Updated Apr 9, 2024
  • Custom kernels in Triton language for accelerating LLMs

    Python 8 MIT License Updated Apr 5, 2024
  • PyTorch checkpointing

    MIT License Updated Apr 4, 2024