Update import instructions to use convert and quantize tooling from llama.cpp submodule (ollama#2247)
jmorganca committed Feb 5, 2024
1 parent b538dc3 commit b9f91a0
Showing 1 changed file with 39 additions and 69 deletions.
docs/import.md (39 additions, 69 deletions)
@@ -15,7 +15,7 @@ FROM ./mistral-7b-v0.1.Q4_0.gguf
(Optional) Many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:

```
FROM ./q4_0.bin
FROM ./mistral-7b-v0.1.Q4_0.gguf
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```

@@ -37,55 +37,69 @@ ollama run example "What is your favourite condiment?"

## Importing (PyTorch & Safetensors)

### Supported models
> Importing from PyTorch and Safetensors is a longer process than importing from GGUF. Improvements that make it easier are a work in progress.
Ollama supports a set of model architectures, with support for more coming soon:
### Setup

- Llama & Mistral
- Falcon & RW
- BigCode
First, clone the `ollama/ollama` repo:

To view a model's architecture, check the `config.json` file in its HuggingFace repo. You should see an entry under `architectures` (e.g. `LlamaForCausalLM`).
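
One quick way to check this from a shell (a sketch only, assuming the model repository has already been downloaded and the command is run inside it; adjust the path as needed):

```shell
# Print the declared architecture(s) from the model's config.json
python3 -c "import json; print(json.load(open('config.json'))['architectures'])"
# e.g. ['MistralForCausalLM']
```
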
```
git clone git@github.com:ollama/ollama.git ollama
cd ollama
```

### Step 1: Clone the HuggingFace repository (optional)
and then fetch its `llama.cpp` submodule:

If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.
```shell
git submodule init
git submodule update llm/llama.cpp
```

Next, install the Python dependencies:

```
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
cd Mistral-7B-Instruct-v0.1
python3 -m venv llm/llama.cpp/.venv
source llm/llama.cpp/.venv/bin/activate
pip install -r llm/llama.cpp/requirements.txt
```

### Step 2: Convert and quantize to a `.bin` file (optional, for PyTorch and Safetensors)
Then build the `quantize` tool:

```
make -C llm/llama.cpp quantize
```
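
This should leave a `quantize` binary at `llm/llama.cpp/quantize`, the path used in the quantize step below. A quick optional check (not part of the original instructions):

```shell
# Confirm the build produced the binary invoked later in this guide
ls -l llm/llama.cpp/quantize
```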

If the model is in PyTorch or Safetensors format, a [Docker image](https://hub.docker.com/r/ollama/quantize) with the tooling required to convert and quantize models is available.
### Clone the HuggingFace repository (optional)

First, install [Docker](https://www.docker.com/get-started/).
If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.

Next, to convert and quantize your model, run:
Install [Git LFS](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage), verify it's installed, and then clone the model's repository:

```
docker run --rm -v .:/model ollama/quantize -q q4_0 /model
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 model
```
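
As an optional sanity check (not part of the original steps), you can confirm that the weights were actually downloaded rather than left as Git LFS pointer files:

```shell
cd model
git lfs ls-files   # entries marked with "*" have their content present locally
ls -lh             # weight files should be several gigabytes, not ~130-byte pointers
cd ..
```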

This will output two files into the directory:
### Convert the model

- `f16.bin`: the model converted to GGUF
- `q4_0.bin`: the model quantized to 4 bits (Ollama will use this file to create the Ollama model)
> Note: some model architectures require using specific convert scripts. For example, Qwen models require running `convert-hf-to-gguf.py` instead of `convert.py`.
### Step 3: Write a `Modelfile`
```
python llm/llama.cpp/convert.py ./model --outtype f16 --outfile converted.bin
```
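
For the architectures mentioned in the note above (e.g. Qwen), the call would look roughly like the following. This is a sketch only; flags can differ between `llama.cpp` revisions, so check the script's `--help` first:

```shell
# Hypothetical invocation of the HF-specific converter instead of convert.py
python llm/llama.cpp/convert-hf-to-gguf.py ./model --outtype f16 --outfile converted.bin
```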

Next, create a `Modelfile` for your model:
### Quantize the model

```
FROM ./q4_0.bin
llm/llama.cpp/quantize converted.bin quantized.bin q4_0
```

(Optional) Many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:
### Step 3: Write a `Modelfile`

Next, create a `Modelfile` for your model:

```
FROM ./q4_0.bin
FROM quantized.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```

@@ -149,47 +149,3 @@ The quantization options are as follows (from highest to lowest levels of quantization; an example invocation follows the list):
- `q6_K`
- `q8_0`
- `f16`
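
For example, to produce a higher-fidelity build with one of the options above, only the last argument of the earlier `quantize` invocation changes (a sketch reusing `converted.bin` from the conversion step; substitute your own f16 GGUF file):

```shell
llm/llama.cpp/quantize converted.bin quantized-q8_0.bin q8_0
```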

## Manually converting & quantizing models

### Prerequisites

Start by cloning the `llama.cpp` repo to your machine in another directory:

```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
```

Next, install the Python dependencies:

```
pip install -r requirements.txt
```

Finally, build the `quantize` tool:

```
make quantize
```

### Convert the model

Run the correct conversion script for your model architecture:

```shell
# LlamaForCausalLM or MistralForCausalLM
python convert.py <path to model directory>

# FalconForCausalLM
python convert-falcon-hf-to-gguf.py <path to model directory>

# GPTBigCodeForCausalLM
python convert-starcoder-hf-to-gguf.py <path to model directory>
```

### Quantize the model

```
quantize <path to model dir>/ggml-model-f32.bin <path to model dir>/q4_0.bin q4_0
```
