
Simple (fast) transformer inference in PyTorch with torch.compile + lit-llama code

HDCharles/lit-llama

To download weights, follow download_weights.md. I used OpenLLaMA weights.
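
For illustration only (download_weights.md is the authoritative path), fetching the OpenLLaMA 7B checkpoint from the Hugging Face Hub could look roughly like the sketch below; the repo id and target directory are assumptions, not the repo's exact layout.

```python
# Hypothetical sketch, not the flow from download_weights.md: fetch the OpenLLaMA 7B
# checkpoint from the Hugging Face Hub, then convert it as download_weights.md describes.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openlm-research/open_llama_7b",  # assumed HF repo id for OpenLLaMA 7B
    local_dir="checkpoints/open-llama/7B",    # assumed target directory
)
```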

Model definition is in model.py; generation code is in generate.py.

To run the benchmark:

time python generate.py --prompt "Hello, my name is" --max_new_tokens 200 --num_samples 10 --fake false --compile true
Current benchmark results
7B model, 6-token prompt / 200 new tokens, bf16
eager:
Time for inference 2: 9.29 sec total, 21.53 tokens/sec
Bandwidth achieved: 290.09 GB/s
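
As a rough sanity check on these bandwidth numbers (assuming "Bandwidth achieved" is weight bytes read per generated token times tokens/sec, and that the "7B" checkpoint has roughly 6.7B parameters):

```python
# Rough sanity check: a "7B" LLaMA-style model has ~6.74e9 parameters, i.e. about
# 13.5 GB of weights in bf16 (2 bytes each), and every generated token has to read
# all of them, so tokens/sec * model size should roughly match the reported bandwidth.
model_gb = 6.74e9 * 2 / 1e9            # ~13.5 GB of bf16 weights (ignores kv-cache traffic)
print(f"{21.53 * model_gb:.0f} GB/s")  # ~290 GB/s, in line with the eager number above
```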

torch.compile per block:
Time for inference 2: 5.03 sec total, 39.74 tokens/sec
Bandwidth achieved: 535.52 GB/s

Note: This still seems primarily overhead-bound (also, cudagraphs is segfaulting for me right now...). See the sketch below for what "per block" means here.
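
For reference, per-block compilation means something like the following sketch: wrap each transformer block in torch.compile rather than the whole model. The model.transformer.h attribute path follows the lit-llama model definition; adjust it if model.py differs.

```python
import torch
import torch.nn as nn

# Hedged sketch of per-block compilation: each transformer block gets its own
# compiled graph instead of compiling the entire model at once.
def compile_per_block(model: nn.Module) -> nn.Module:
    blocks = model.transformer.h          # nn.ModuleList of transformer blocks (assumed name)
    for i, block in enumerate(blocks):
        blocks[i] = torch.compile(block)  # swap in the compiled wrapper in place
    return model
```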

- torch.compile on the whole model generates gibberish right now... My suspicion is that it's because we're not guarding on attributes of modules. (Yes indeed, it's now fixed.)

- cudagraphs doesn't currently work because I needed to add in-place mutation to fix whole-graph compilation issues

torch.compile on whole model:
Time for inference 3: 2.66 sec total, 75.13 tokens/sec
Bandwidth achieved: 1012.55 GB/s
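
The whole-model variant is conceptually just torch.compile applied to the full forward. A minimal, self-contained sketch, with a toy module standing in for the real Transformer (generate.py compiles the actual lit-llama model instead):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model, just to show whole-model compilation.
class TinyModel(nn.Module):
    def __init__(self, vocab: int = 32000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(idx))

model = TinyModel().eval()
compiled = torch.compile(model, fullgraph=True)  # fail loudly on graph breaks

with torch.inference_mode():
    logits = compiled(torch.randint(0, 32000, (1, 6)))  # 6-token prompt, as in the benchmark
```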

Refactored to also compile the sampling code (and also updated PyTorch):
Time for inference 3: 1.16 sec total, 86.02 tokens/sec
Bandwidth achieved: 1159.24 GB/s
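
"Also compile the sampling code" means folding the temperature/top-k sampling step into the compiled decode function, so a single graph covers both the forward pass and next-token selection. A hedged sketch (function and argument names are illustrative, not the exact generate.py API):

```python
import torch

# Illustrative decode step that is compiled together with its sampling code;
# `model` is assumed to return logits of shape (batch, seq_len, vocab_size).
@torch.compile
def decode_one_token(model, idx, temperature: float = 0.8, top_k: int = 200):
    logits = model(idx)[:, -1] / temperature                         # logits for the last position
    v, _ = torch.topk(logits, top_k)
    logits = logits.masked_fill(logits < v[:, [-1]], float("-inf"))  # keep only the top-k logits
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)                   # sample the next token id
```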

After enabling cudagraphs, changing the kv-cache update to a scatter instead of a full copy, and some Inductor optimizations (such as reinplacing index_put):

Time for inference 5: 0.94 sec total, 106.79 tokens/sec
Bandwidth achieved: 1439.13 GB/s
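
A sketch of the kv-cache change, under the assumption that the cache is a preallocated (batch, heads, max_seq_len, head_dim) buffer: rather than copying the full cache every step, the new key/value is written into its slot with an index_put-style in-place update (which Inductor can now re-inplace, per the note above). Names and shapes below are assumptions, not the repo's exact code.

```python
import torch

# Hypothetical kv-cache update: write only the current positions into a preallocated
# cache via advanced-indexing assignment (an index_put under the hood), instead of
# materializing a full copy of the cache on every decode step.
def update_kv_cache(k_cache, v_cache, k_new, v_new, input_pos):
    # k_cache/v_cache: (batch, n_heads, max_seq_len, head_dim), allocated once up front
    # k_new/v_new:     (batch, n_heads, len(input_pos), head_dim) for this step
    # input_pos:       1-D LongTensor of sequence positions being written
    k_cache[:, :, input_pos] = k_new
    v_cache[:, :, input_pos] = v_new
    return k_cache, v_cache

# Example: one decode step writing position 6 of a 256-slot cache.
B, H, S, D = 1, 32, 256, 128
k_cache, v_cache = torch.zeros(B, H, S, D), torch.zeros(B, H, S, D)
update_kv_cache(k_cache, v_cache, torch.randn(B, H, 1, D), torch.randn(B, H, 1, D),
                torch.tensor([6]))
```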

Note: Running on an A100 80GB, albeit power-limited to 330 watts. Empirically, peak bandwidth appears to be about 1700 GB/s.
