Fine tune and infer llama3 with cpu #1037

Open
SidneyLann opened this issue Sep 18, 2024 · 15 comments

Comments

@SidneyLann

import logging
import os
import json
import torch
from datasets import load_from_disk
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000

# Defining the configuration for the base model, LoRA and training

config = {
    "hugging_face_username": "Shekswess",
    "model_config": {
        "base_model": os.path.join(DATA_HOME, "model_root/model_en"),  # The base model
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),  # The fine-tuned model
        "max_seq_length": MAX_SEQ_LENGTH,  # The maximum sequence length
        "dtype": torch.float16,  # The data type
        "load_in_4bit": True,  # Load the model in 4-bit
    },
    "lora_config": {
        "r": 16,  # The LoRA rank: 8, 16, 32, 64
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],  # The target modules
        "lora_alpha": 16,  # The alpha value for LoRA
        "lora_dropout": 0,  # The dropout value for LoRA
        "bias": "none",  # The bias for LoRA
        "use_gradient_checkpointing": True,  # Use gradient checkpointing
        "use_rslora": False,  # Use rank-stabilized LoRA
        "use_dora": False,  # Use DoRA
        "loftq_config": None,  # The LoftQ configuration
    },
    "training_dataset": {
        "name": os.path.join(DATA_HOME, "dataset_gen"),  # The dataset name (huggingface/datasets)
        "split": "train",  # The dataset split
        "input_field": "prompt",  # The input field
    },
    "training_config": {
        "per_device_train_batch_size": 1,  # The batch size
        "gradient_accumulation_steps": 1,  # The gradient accumulation steps
        "warmup_steps": 5,  # The warmup steps
        "max_steps": 0,  # The maximum steps (0 if the epochs are defined)
        "num_train_epochs": 1,  # The number of training epochs (0 if the maximum steps are defined)
        "learning_rate": 2e-4,  # The learning rate
        "fp16": not torch.cuda.is_bf16_supported(),  # Use fp16
        "bf16": torch.cuda.is_bf16_supported(),  # Use bf16
        "logging_steps": 1,  # The logging steps
        "optim": "adamw_8bit",  # The optimizer
        "weight_decay": 0.01,  # The weight decay
        "lr_scheduler_type": "linear",  # The learning rate scheduler
        "seed": 42,  # The seed
        "output_dir": "outputs",  # The output directory
    },
}

# Loading the model and the tokenizer for the model

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = config.get("model_config").get("base_model"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dtype = config.get("model_config").get("dtype"),
    load_in_4bit = config.get("model_config").get("load_in_4bit"),
)

# Setup for QLoRA/LoRA PEFT of the base model

model = FastLanguageModel.get_peft_model(
    model,
    r = config.get("lora_config").get("r"),
    target_modules = config.get("lora_config").get("target_modules"),
    lora_alpha = config.get("lora_config").get("lora_alpha"),
    lora_dropout = config.get("lora_config").get("lora_dropout"),
    bias = config.get("lora_config").get("bias"),
    use_gradient_checkpointing = config.get("lora_config").get("use_gradient_checkpointing"),
    random_state = 42,
    use_rslora = config.get("lora_config").get("use_rslora"),
    use_dora = config.get("lora_config").get("use_dora"),
    loftq_config = config.get("lora_config").get("loftq_config"),
)

# Loading the training dataset

dataset_train = load_from_disk(config.get("training_dataset").get("name"))['train']

# Setting up the trainer for the model

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_train,
    dataset_text_field = config.get("training_dataset").get("input_field"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dataset_num_proc = 1,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = config.get("training_config").get("per_device_train_batch_size"),
        gradient_accumulation_steps = config.get("training_config").get("gradient_accumulation_steps"),
        warmup_steps = config.get("training_config").get("warmup_steps"),
        max_steps = config.get("training_config").get("max_steps"),
        num_train_epochs = config.get("training_config").get("num_train_epochs"),
        learning_rate = config.get("training_config").get("learning_rate"),
        fp16 = config.get("training_config").get("fp16"),
        bf16 = config.get("training_config").get("bf16"),
        logging_steps = config.get("training_config").get("logging_steps"),
        optim = config.get("training_config").get("optim"),
        weight_decay = config.get("training_config").get("weight_decay"),
        lr_scheduler_type = config.get("training_config").get("lr_scheduler_type"),
        seed = 42,
        output_dir = config.get("training_config").get("output_dir"),
    ),
)

# Training the model

trainer_stats = trainer.train()

# Saving the trainer stats

with open(os.path.join(DATA_HOME, "outputs/trainer_stats_gen.json"), "w") as f:
    json.dump(trainer_stats, f, indent=4)

# Locally saving the model and pushing it to the Hugging Face Hub (only LoRA adapters)

model.save_pretrained(config.get("model_config").get("finetuned_model"))

Can I amend this code to run on the CPU?

@danielhanchen
Contributor

You should convert to GGUF for CPU inference - or you can use direct HF inference
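
For GGUF, Unsloth can export directly after training. A rough sketch, reusing the model and tokenizer from the script above; the output path and quantization method are illustrative choices, not taken from your setup:

# Export the merged model to GGUF for llama.cpp-style CPU inference.
# NOTE: output directory and quantization method are illustrative.
model.save_pretrained_gguf(
    os.path.join(DATA_HOME, "model_root/model_gen_gguf"),
    tokenizer,
    quantization_method = "q4_k_m",  # a common CPU-friendly 4-bit quantization
)

The resulting .gguf file can then be loaded by llama.cpp (or llama-cpp-python) on a CPU-only machine.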

@SidneyLann
Author

Hi, any links for reference?

@Linguiniotta

There are instructions in the wiki for converting to GGUF, but is it possible to fine-tune / train with a TPU or CPU? I get an error when importing unsloth's FastLanguageModel. I maxed out my GPU quota on Kaggle lol.

Installation / Import
# https://github.com/unslothai/unsloth/issues/998
!pip install --quiet pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install --quiet "torch==2.4.0" "xformers==0.0.27.post2" triton torchvision torchaudio
!pip install --quiet "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
%%time
from unsloth import FastLanguageModel
from accelerate import Accelerator
Error
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File :1

File /usr/local/lib/python3.10/site-packages/unsloth/__init__.py:83
     80 pass
     82 # Torch 2.4 has including_emulation
---> 83 major_version, minor_version = torch.cuda.get_device_capability()
     84 SUPPORTS_BFLOAT16 = (major_version >= 8)
     86 old_is_bf16_supported = torch.cuda.is_bf16_supported

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:451, in get_device_capability(device)
    438 def get_device_capability(device: Optional[_device_t] = None) -> Tuple[int, int]:
    439     r"""Get the cuda capability of a device.
    440 
    441     Args:
   (...)
    449         tuple(int, int): the major and minor cuda capability of the device
    450     """
--> 451     prop = get_device_properties(device)
    452     return prop.major, prop.minor

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:465, in get_device_properties(device)
    455 def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
    456     r"""Get the properties of a device.
    457 
    458     Args:
   (...)
    463         _CudaDeviceProperties: the properties of the device
    464     """
--> 465     _lazy_init()  # will define _get_device_properties
    466     device = _get_device_index(device, optional=True)
    467     if device < 0 or device >= device_count():

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:314, in _lazy_init()
    312 if "CUDA_MODULE_LOADING" not in os.environ:
    313     os.environ["CUDA_MODULE_LOADING"] = "LAZY"
--> 314 torch._C._cuda_init()
    315 # Some of the queued calls may reentrantly call _lazy_init();
    316 # we need to just return without initializing in that case.
    317 # However, we must not let any *other* threads in!
    318 _tls.is_initializing = True

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

@SidneyLann
Author

Where are the instructions for using llama.cpp to load the GGUF and run inference?

@Linguiniotta

It is in their GH :) https://github.com/ggerganov/llama.cpp#usage

@SidneyLann
Author

import os
import sys
import json
import torch
from datasets import load_dataset
from unsloth import FastLanguageModel

INSTRUCTION = "does the user input content contain bus?"
DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000
SEQ_START_IDX = 512

config = {
    "model_config": {
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),
        "max_seq_length": MAX_SEQ_LENGTH,
        "dtype": torch.float16,
        "load_in_4bit": True,
    }
}

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=config.get("model_config").get("finetuned_model"),
    max_seq_length=config.get("model_config").get("max_seq_length"),
    dtype=config.get("model_config").get("dtype"),
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
)

FastLanguageModel.for_inference(model)
dataset_path = sys.argv[1]
dateStr=dataset_path[-8:]
files = [os.path.join(dataset_path, f) for f in os.listdir(dataset_path)]
fileCount = 0
genCount = 0
exceptFileName = ''
for fileName in files:
    file_size = os.path.getsize(fileName)
    fileCount = fileCount + 1
    print('fileCount: ', fileCount, genCount, file_size, dateStr, fileName, exceptFileName)
    if file_size < 8192:
        continue
    genCount = genCount + 1

    with open(fileName) as f:
        content = f.read()
        print("content Size is :", len(content))
        if len(content) > MAX_SEQ_LENGTH+SEQ_START_IDX:
            content = content[SEQ_START_IDX:MAX_SEQ_LENGTH+SEQ_START_IDX]
        inputs = tokenizer(
            [
                f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAt date {dateStr}, {INSTRUCTION}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>"""
            ], return_tensors="pt").to("cuda")
        try:
            outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
        except:
            exceptFileName = fileName
            continue
        outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        print(outputs[0])

How can I amend this code to use llama.cpp to run inference from the GGUF?

@Linguiniotta

You are still using the Unsloth model. Convert it to GGUF first, THEN infer.
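
If you want to keep a Python loop like the one above, llama-cpp-python can load the GGUF on the CPU. A rough sketch only; the GGUF path, context size and thread count are illustrative:

from llama_cpp import Llama  # pip install llama-cpp-python

# Load the converted GGUF once, before the file loop (path is illustrative).
llm = Llama(
    model_path="model_gen.q4_k_m.gguf",
    n_ctx=5000,    # roughly match MAX_SEQ_LENGTH from the training script
    n_threads=8,   # number of CPU threads; tune for your machine
)

# Inside the loop, build the same Llama 3 prompt and call the model instead of
# tokenizer(...)/model.generate(...); dateStr, INSTRUCTION and content are the
# variables from your existing loop.
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"At date {dateStr}, {INSTRUCTION}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
output = llm(prompt, max_tokens=512)
print(output["choices"][0]["text"])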

@SidneyLann
Author

SidneyLann commented Sep 25, 2024

I had converted it, but I don't know how to use llama.cpp the way I use unsloth to run inference.

@danielhanchen
Contributor

@SidneyLann Another option is to use HuggingFace CPU directly after finetuning:

    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")
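
Generation is then the standard transformers call, for example (the prompt below is only a placeholder):

    # Placeholder prompt; reuse your own Llama 3 chat template here.
    inputs = tokenizer(
        ["<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>"],
        return_tensors = "pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
    print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])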

@SidneyLann
Author

SidneyLann commented Oct 1, 2024

I can't run inference with the lora_model that was fine-tuned on the GPU; it still uses the GPU. Must I fine-tune on the CPU to do CPU inference?

@danielhanchen
Contributor

@SidneyLann You need to save the LoRA adapter (finetuned by CPU or GPU) then load it on a CPU only machine - it should work!

@SidneyLann
Author

My machine has one GPU that is busy with other tasks. Can't I do CPU inference on this machine? Can't I configure it with a flag?

@danielhanchen
Contributor

@SidneyLann Yes, you can set device_map = "cpu" in the loading call, for example, to force it onto the CPU

@SidneyLann
Author

config = {
    "model_config": {
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen110"),
        "max_seq_length": 5000,
        "dtype": torch.float32,
        "load_in_4bit": True,
        "device_map": "cpu",
    }
}

model_name = config.get("model_config").get("finetuned_model")
device_map = config.get("model_config").get("device_map")
model = AutoPeftModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
    device_map=device_map,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer(
    [
        f"""<|begin_of_text|>......<|eot_id|>"""
    ], return_tensors = "pt").to(device_map)
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens = True)
print(outputs[0])

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.
......
File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 468, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 566, in matmul_4bit
assert quant_state is not None
AssertionError

quant_state is None; why doesn't .to(device_map) work?

@danielhanchen
Contributor

@SidneyLann Actually you're correct - bitsandbytes only works on GPU :(

Have you considered exporting to GGUF / llama.cpp / Ollama for inference?

Another way is to use load_in_4bit = False
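
A rough sketch of that non-quantized CPU path; the adapter directory and dtype are illustrative:

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Illustrative adapter path; point this at your finetuned_model directory.
model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model",
    load_in_4bit = False,          # skip bitsandbytes entirely
    device_map = "cpu",            # keep everything off the GPU
    torch_dtype = torch.float32,   # fp32 is the safe default on CPU
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")

inputs = tokenizer("<|begin_of_text|>......<|eot_id|>", return_tensors = "pt")
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])

Note this loads the full-precision weights into system RAM, so expect a much larger memory footprint than the 4-bit path.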
