RLLaMA is a pure Rust implementation of LLaMA large language model inference.
- Supports either `f16` or `f32` weights.
- LLaMA-7B, LLaMA-13B, and LLaMA-30B are all confirmed working. LLaMA-65B likely works, but I haven't found a big enough computer to run it.
- Multithreaded, hand-optimized CPU inference.
- OpenCL support for GPU inference.
The current performance is as follows:

```
Pure Rust implementations:

LLaMA-7B:  AMD Ryzen 3950X:  552ms / token  f16
LLaMA-7B:  AMD Ryzen 3950X: 1008ms / token  f32
LLaMA-13B: AMD Ryzen 3950X: 1029ms / token  f16
LLaMA-13B: AMD Ryzen 3950X: 1930ms / token  f32
LLaMA-30B: AMD Ryzen 5950X: 2112ms / token  f16

OpenCL (all use f16):

LLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  247ms / token  (OpenCL on GPU)
LLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X:  680ms / token  (OpenCL on CPU)
LLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti: <I ran out of GPU memory :(>
LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1232ms / token  (OpenCL on CPU)
LLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X: 4098ms / token  (OpenCL on CPU)
```
Scroll to the bottom of this README.md to see benchmarks over time.
You can install with the `cargo` tool. RLLaMA uses intrinsics extensively, and you likely need to enable the relevant CPU features to build the executable:

```
RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama
```
There is a `.cargo/config.toml` inside this repository that will enable these features if you build manually from this Git repository instead.
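For reference, a `.cargo/config.toml` that enables these features could look roughly like this (a sketch, not necessarily the exact contents of the repository's file):

```toml
# Sketch of a .cargo/config.toml enabling the same CPU features as the
# RUSTFLAGS invocation above; the repository's actual file may differ.
[build]
rustflags = ["-C", "target-feature=+sse2,+avx,+fma,+avx2"]
```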
Refer to https://github.com/facebookresearch/llama/ for how to obtain the weights. As of now, you need to be approved to get them.

For LLaMA-7B, make sure you have these files:
* 7B/consolidated.00.pth
* 7B/params.json
* tokenizer.model
The `consolidated.00.pth` is actually a zip file. You need to unzip it:

```
$ cd 7B
$ unzip consolidated.00.pth
$ mv consolidated consolidated.00
```
If you are using a larger model like LLaMA-13B, then you can skip the last step of renaming the `consolidated` directory.
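After these steps, your files should be laid out roughly like this (a sketch inferred from the file list above and the paths used in the commands below):

```
LLaMA/
├── tokenizer.model
└── 7B/
    ├── params.json
    ├── consolidated.00.pth   # the original zip file
    └── consolidated.00/      # unzipped weights (renamed from "consolidated")
```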
You should now be ready to generate some text.
Run LLaMA-7B with the weights cast to 16-bit floats:

```
rllama --tokenizer-model /path/to/tokenizer.model \
       --model-path /path/to/LLaMA/7B \
       --param-path /path/to/LLaMA/7B/params.json \
       --f16 \
       --prompt "The meaning of life is"
```
Use `rllama --help` to see all the options.
To turn on OpenCL, use the `opencl` Cargo feature:

```
RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama --features opencl
```
Then run it, selecting an OpenCL device:

```
rllama --tokenizer-model /path/to/tokenizer.model \
       --model-path /path/to/LLaMA/7B \
       --param-path /path/to/LLaMA/7B/params.json \
       --opencl-device 0 \
       --prompt "The meaning of life is"
```
With the `opencl` feature there is also another argument, `--opencl-device`, that takes a number. That number selects the Nth OpenCL device found on the system. You can see the available devices in the output when you run the program.
Weights are always cast to 16-bit floats for OpenCL.
This is a hobby thing for me so don't expect updates or help.
- Some other CPU implementations use quantization to reduce the size of the weights and generally speed everything up a lot. `rllama` does not have this (see the sketch after this list for what quantization means).
- I've heard there is something called Tensor Cores on NVIDIA GPUs. They are not accessible with OpenCL, but they might be accessible from Vulkan with an extension.
- More sophisticated token sampling. I saw some comments on Hacker News that the samplers included in Facebook's reference code are kinda garbage and that you can get much better results with good defaults and things like repetition penalty.
- There is an initial start-up time as the program has to pass through the initial prompt. I don't know if this start-up time can be eliminated completely, but it could be cached on disk. This would help use cases where you have a standard prompt to prime the text generation that you reuse many times.
- Stanford released an instruct-finetuned LLaMA-7B; once I find the weights, I'd like to try making a chat-like command-line interface.
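To illustrate the quantization idea from the list above: the weights are stored in a small integer type together with a scale factor, trading a little precision for much less memory and bandwidth. A minimal sketch of symmetric 8-bit quantization (hypothetical code, not rllama's; the function names are made up):

```rust
// Minimal illustrative sketch of symmetric 8-bit weight quantization.
// Not part of rllama; all names here are hypothetical.

/// Quantize f32 weights to i8 with a single per-tensor scale:
/// the largest magnitude maps to 127.
fn quantize_q8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0_f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = weights.iter().map(|&w| (w / scale).round() as i8).collect();
    (quantized, scale)
}

/// Recover approximate f32 weights from the quantized form.
fn dequantize_q8(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}

fn main() {
    let weights = [0.12_f32, -0.50, 0.33, 0.0];
    let (q, scale) = quantize_q8(&weights);
    println!("quantized: {q:?}, scale: {scale}");
    println!("dequantized: {:?}", dequantize_q8(&q, scale));
}
```

Real implementations typically quantize in small blocks with one scale per block rather than per tensor, which preserves accuracy much better; the sketch above only shows the core idea.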
I'm trying to track that I'm making this faster and not slower.
For generating a 50-token sequence:

```
cargo run --release -- \
    --model-path /LLaMA/13B \
    --param-path /LLaMA/13B/params.json \
    --tokenizer-path /LLaMA/tokenizer.model \
    --prompt "Computers are pretty complica" --max-seq-len 50
```
```
# commit c9c861d199bd2d87d7e883e3087661c1e287f6c4  (13 March 2023)

LLaMA-7B:  AMD Ryzen 3950X: 1058ms / token
LLaMA-13B: AMD Ryzen 3950X: 2005ms / token

# commit 63d27dba9091823f8ba11a270ab5790d6f597311  (13 March 2023)
# This one has one part of the transformer moved to GPU as a type of smoke test

LLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  567ms / token
LLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X:  956ms / token
LLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  987ms / token
LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1706ms / token

# commit 35b0c372a87192761e17beb421699ea5ad4ac1ce  (13 March 2023)
# I moved some attention stuff to OpenCL too.

LLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  283ms / token
LLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X:  679ms / token
LLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti: <ran out of GPU memory>
LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1226ms / token

# commit de5dd592777b3a4f5a9e8c93c8aeef25b9294364  (15 March 2023)
# The matrix multiplication on GPU is now much faster. It didn't have that much
# effect overall, but I got a modest improvement on LLaMA-7B GPU.

LLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  247ms / token
LLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X:  680ms / token
LLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti: <ran out of GPU memory>
LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1232ms / token
LLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X: 4098ms / token

# commit 3d0afcf24309f28ec540ed7645c35400a865ad6f
# I've been focusing on making the ordinary non-OpenCL CPU implementation
# faster, and I got some gains, most importantly from multithreading.
# There is Float16 support now, so I've added f16/f32 to these tables:

LLaMA-7B:  AMD Ryzen 3950X:  552ms / token  f16
LLaMA-7B:  AMD Ryzen 3950X: 1008ms / token  f32
LLaMA-13B: AMD Ryzen 3950X: 1029ms / token  f16
LLaMA-13B: AMD Ryzen 3950X: 1930ms / token  f32
LLaMA-30B: AMD Ryzen 5950X: 2112ms / token  f16
```