- Seattle, WA USA
- https://www.linkedin.com/in/lessw2020
torchtitan_oss Public
Forked from pytorch/torchtitan
A native PyTorch Library for large model training
Python BSD 3-Clause "New" or "Revised" License Updated Jun 10, 2024
tf32_gemm Public
Forked from bertmaher/tf32_gemm
Example of binding a TF32 CUTLASS GEMM kernel to PyTorch
Python Updated Jun 7, 2024
UVM_Tensor Public
experimental - CUDA Unified Virtual Memory based tensors with PyTorch
C++ MIT License Updated May 31, 2024
SpeeD Public
Forked from kaiwang960112/SpeeD
SpeeD: A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training
Python Apache License 2.0 Updated May 28, 2024
FAdam_PyTorch Public
an implementation of FAdam (Fisher Adam) in PyTorch
ColossalAI Public
Forked from hpcaitech/ColossalAI
Making large AI models cheaper, faster and more accessible
Python Apache License 2.0 Updated May 16, 2024
apex_nvidia Public
Forked from NVIDIA/apex
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
Python BSD 3-Clause "New" or "Revised" License Updated May 13, 2024
fp6_llm Public
Forked from usyd-fsalab/fp6_llm
An efficient GPU support for LLM inference with 6-bit quantization (FP6).
Cuda Apache License 2.0 Updated May 9, 2024
actnn Public
Forked from ucbrise/actnn
ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
Python MIT License Updated May 8, 2024
MatmulPingPong Public
Forked from KnowingNothing/MatmulTutorial
A Easy-to-understand TensorOp Matmul Tutorial
C++ Apache License 2.0 Updated May 6, 2024
pytorch_fork Public
Forked from pytorch/pytorch
<forked> Tensors and Dynamic neural networks in Python with strong GPU acceleration
Python Other Updated May 6, 2024
dietgpu Public
Forked from facebookresearch/dietgpu
GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.
Cuda MIT License Updated May 1, 2024
torchtune Public
Forked from pytorch/torchtune
A Native-PyTorch Library for LLM Fine-tuning
Python BSD 3-Clause "New" or "Revised" License Updated Apr 30, 2024
float8_experimental Public
Forked from pytorch-labs/float8_experimental
This repository contains the experimental PyTorch native float8 training UX
Python BSD 3-Clause "New" or "Revised" License Updated Apr 29, 2024
nvcomp Public
Forked from NVIDIA/nvcomp
Repository for nvCOMP docs and examples. nvCOMP is a library for fast lossless compression/decompression on the GPU that can be downloaded from https://developer.nvidia.com/nvcomp.
C++ Other Updated Apr 25, 2024
spacebyte Public
Forked from kjslag/spacebyte
A byte-level decoder architecture that matches the performance of tokenized Transformers.
Jupyter Notebook Updated Apr 23, 2024
cutlass_local Public
Forked from NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
C++ Other Updated Apr 19, 2024
megalodon Public
Forked from XuezheMax/megalodon
Reference implementation of Megalodon 7B model
Cuda MIT License Updated Apr 18, 2024
vllm Public
Forked from vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Python Apache License 2.0 Updated Apr 15, 2024
NVIDIA_SGEMM_PRACTICE Public
Forked from wangzyon/NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
Cuda Updated Apr 13, 2024
marlin-kernel Public
Forked from IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
Python Apache License 2.0 Updated Apr 9, 2024
tau_graph Public
Forked from pytorch/PiPPy
Pipeline Parallelism for PyTorch
Python BSD 3-Clause "New" or "Revised" License Updated Apr 9, 2024
Custom kernels in Triton language for accelerating LLMs