lessw2020's repositories
  • pingpong Public

    Integrating the pingpong kernel into PyTorch

    Cuda MIT License Updated Jun 11, 2024
  • A native PyTorch Library for large model training

    Python BSD 3-Clause "New" or "Revised" License Updated Jun 10, 2024
  • largefiles Public

    Updated Jun 7, 2024
  • gitlfs Public

    C Updated Jun 7, 2024
  • tf32_gemm Public

    Forked from bertmaher/tf32_gemm

    Example of binding a TF32 CUTLASS GEMM kernel to PyTorch

    Python Updated Jun 7, 2024
  • UVM_Tensor Public

    Experimental: CUDA Unified Virtual Memory (UVM) based tensors for PyTorch

    C++ MIT License Updated May 31, 2024
  • SpeeD Public

    Forked from kaiwang960112/SpeeD

    SpeeD: A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

    Python Apache License 2.0 Updated May 28, 2024
  • An implementation of FAdam (Fisher Adam) in PyTorch

    Python 14 MIT License Updated May 28, 2024
  • ColossalAI Public

    Forked from hpcaitech/ColossalAI

    Making large AI models cheaper, faster and more accessible

    Python Apache License 2.0 Updated May 16, 2024
  • apex_nvidia Public

    Forked from NVIDIA/apex

    A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch

    Python BSD 3-Clause "New" or "Revised" License Updated May 13, 2024
  • fp6_llm Public

    Forked from usyd-fsalab/fp6_llm

    Efficient GPU support for LLM inference with 6-bit quantization (FP6).

    Cuda Apache License 2.0 Updated May 9, 2024
  • actnn Public

    Forked from ucbrise/actnn

    ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training

    Python MIT License Updated May 8, 2024
  • An easy-to-understand TensorOp Matmul tutorial

    C++ Apache License 2.0 Updated May 6, 2024
  • pytorch_fork Public

    Forked from pytorch/pytorch

    Tensors and Dynamic neural networks in Python with strong GPU acceleration

    Python Other Updated May 6, 2024
  • GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

    Cuda MIT License Updated May 1, 2024
  • torchtune Public

    Forked from pytorch/torchtune

    A Native-PyTorch Library for LLM Fine-tuning

    Python BSD 3-Clause "New" or "Revised" License Updated Apr 30, 2024
  • This repository contains the experimental PyTorch native float8 training UX

    Python BSD 3-Clause "New" or "Revised" License Updated Apr 29, 2024
  • In-progress CUDA kernels

    Cuda 2 MIT License Updated Apr 28, 2024
  • nvcomp Public

    Forked from NVIDIA/nvcomp

    Repository for nvCOMP docs and examples. nvCOMP is a library for fast lossless compression/decompression on the GPU that can be downloaded from https://developer.nvidia.com/nvcomp.

    C++ Other Updated Apr 25, 2024
  • spacebyte Public

    Forked from kjslag/spacebyte

    A byte-level decoder architecture that matches the performance of tokenized Transformers.

    Jupyter Notebook Updated Apr 23, 2024
  • cutlass_local Public

    Forked from NVIDIA/cutlass

    CUDA Templates for Linear Algebra Subroutines

    C++ Other Updated Apr 19, 2024
  • megalodon Public

    Forked from XuezheMax/megalodon

    Reference implementation of Megalodon 7B model

    Cuda MIT License Updated Apr 18, 2024
  • vllm Public

    Forked from vllm-project/vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

    Python Apache License 2.0 Updated Apr 15, 2024
  • Step-by-step optimization of CUDA SGEMM

    Cuda Updated Apr 13, 2024
  • marlin-kernel Public

    Forked from IST-DASLab/marlin

    FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

    Python Apache License 2.0 Updated Apr 9, 2024
  • tau_graph Public

    Forked from pytorch/PiPPy

    Pipeline Parallelism for PyTorch

    Python BSD 3-Clause "New" or "Revised" License Updated Apr 9, 2024
  • Custom kernels in Triton language for accelerating LLMs

    Python 8 MIT License Updated Apr 5, 2024
  • PyTorch checkpointing

    MIT License Updated Apr 4, 2024