Update README.md, add a nice animation.
Noeda committed Mar 18, 2023
1 parent cfad4b1 commit db0f22e
Showing 4 changed files with 55 additions and 48 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock


101 changes: 54 additions & 47 deletions README.md
# RLLaMA

RLLaMA is a pure Rust implementation of [LLaMA large language model inference](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/). It grew out of an attempt to get the LLaMA model running on a pure Rust CPU implementation, inspired by [ggml](https://github.com/ggerganov/ggml), an amazing CPU implementation that could run GPT-J 6B models.

## Supported features

* Use either `f16` or `f32` weights.
* LLaMA-7B, LLaMA-13B and LLaMA-30B are all confirmed working. LLaMA-65B
likely works but I haven't found a big enough computer to run it.
* Multithreaded, hand-optimized CPU inference.
* OpenCL support for GPU inference.

## Performance

The current performance is as follows:

```
LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1232ms / token (OpenCL on CPU)
LLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X: 4098ms / token (OpenCL on CPU)
```
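For a sense of scale: 1232 ms per token is roughly 0.8 tokens per second, and 4098 ms per token is roughly 0.24 tokens per second.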

Scroll to the bottom of this README.md to see benchmarks over time.

## Screenshot

![Screenshot of RLLaMA in action](rllama.gif)

## Install

As of March 18, `rllama` is available on `crates.io` and can be installed with the `cargo` tool. You will need nightly Rust, since the implementation uses unstable features; check that `cargo --version` reports a nightly toolchain. RLLaMA uses AVX2 intrinsics extensively, even in the OpenCL codepath, so it only runs on AMD64 at this time, and you likely need to enable these CPU features explicitly to build the executable:

```
RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama
```
There is a `.cargo/config.toml` inside this repository that will enable these
features if you install manually from this Git repository instead.
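For reference, a minimal sketch of what such a `.cargo/config.toml` can look like (the file in this repository is authoritative; this only shows the general shape):

```shell
$ cat .cargo/config.toml
[build]
rustflags = ["-C", "target-feature=+sse2,+avx,+fma,+avx2"]
```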

## LLaMA weights

Refer to https://github.com/facebookresearch/llama/ for how to obtain the weights. As of now, you need to be approved to get them.

For LLaMA-7B, make sure you have these files:

* `7B/consolidated.00.pth`
* `7B/params.json`
* `tokenizer.model`
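Assuming the layout above, a quick sanity check could look like this (paths are illustrative):

```shell
$ ls LLaMA
7B  tokenizer.model
$ ls LLaMA/7B
consolidated.00.pth  params.json
```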

The `consolidated.00.pth` file is actually a zip file. RLLaMA has a Python unpickler that understands the `.pth` files used by PyTorch, but it does not unzip them automatically, so you need to unzip the file yourself:

```shell
$ cd LLaMA
$ cd 7B
$ unzip consolidated.00.pth
# For LLaMA-7B, rename consolidated to consolidated.00
# For the larger models, the number is there already so no need to do this step.
$ mv consolidated consolidated.00
```

If you are using a larger model like LLaMA-13B, then you can skip the last step
of renaming the `consolidated` directory.
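As a hypothetical illustration for LLaMA-13B, which ships two already-numbered weight shards:

```shell
$ cd LLaMA/13B
$ unzip consolidated.00.pth
$ unzip consolidated.01.pth
# The unpacked directories already carry their numbers, so no renaming is needed.
```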

You should now be ready to generate some text.

## Example

By default, rllama uses the weights in whatever precision they are in the source files. You can use the `--f16` command line argument to cast the largest weight matrices to 16-bit floats. For example, run LLaMA-7B with some weights cast to 16-bit floats:
```shell
rllama --tokenizer-model /path/to/tokenizer.model \
--model-path /path/to/LLaMA/7B \
--param-path /path/to/LLaMA/7B/params.json \
--f16 \
--prompt "The meaning of life is"
```

You can use `--temperature`, `--top-p` and `--top-k` to adjust token sampler settings. There is also a `--repetition-penalty` setting: 1.0 means no penalty, and values smaller than 1.0 penalize tokens that already appear in the context by scaling their output values with `x*(repetition_penalty^num_occurrences)` before `softmax()` is applied to the output probabilities. You can also use `--prompt-file` to read the prompt from a file instead of the command line.

If you are running from a clone of this repository rather than an installed binary, replace `rllama` with `cargo run --release --`. Use `rllama --help` to see all the options.
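For illustration, a hypothetical invocation combining these sampler flags (the values here are arbitrary examples, not recommendations):

```shell
rllama --tokenizer-model /path/to/tokenizer.model \
       --model-path /path/to/LLaMA/7B \
       --param-path /path/to/LLaMA/7B/params.json \
       --temperature 0.8 \
       --top-p 0.9 \
       --top-k 40 \
       --repetition-penalty 0.95 \
       --prompt "The meaning of life is"
```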

## How to turn on OpenCL

Use the `opencl` Cargo feature when installing:

```
RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama --features opencl
```

```
rllama --tokenizer-model /path/to/tokenizer.model \
--model-path /path/to/LLaMA/7B \
--param-path /path/to/LLaMA/7B/params.json \
--opencl-device 0 \
--prompt "The meaning of life is"
```

With the `opencl` feature there is also another argument, `--opencl-device`, that takes a number. That number selects the Nth OpenCL device found on the system. You can see the available devices in the output when you run the program (e.g. in the animation above).
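If you want to inspect the devices before running, a generic tool such as `clinfo` (a separate system utility, not part of rllama) can list them:

```shell
# -l prints a compact list of OpenCL platforms and their devices.
$ clinfo -l
```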

Weights are always cast to 16-bit floats for OpenCL.

## Notes and future plans

This is a hobby thing for me, so don't expect updates or help.

* Stanford released some instruct-finetuned LLaMA-7B weights; once I find them, I'd like to try making a chat-like command-line interface.

## Benchmarks

I'm keeping track of these numbers to make sure I'm making this faster and not slower.

Binary file added rllama.gif
Binary file removed rllama.png
