Fine tune and infer llama3 with cpu #1037

Open
SidneyLann opened this issue Sep 18, 2024 · 15 comments

Comments

@SidneyLann

import logging
import os
import json
import torch
from datasets import load_from_disk
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000

# Defining the configuration for the base model, LoRA and training

config = {
    "hugging_face_username": "Shekswess",
    "model_config": {
        "base_model": os.path.join(DATA_HOME, "model_root/model_en"),  # The base model
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),  # The fine-tuned model
        "max_seq_length": MAX_SEQ_LENGTH,  # The maximum sequence length
        "dtype": torch.float16,  # The data type
        "load_in_4bit": True,  # Load the model in 4-bit
    },
    "lora_config": {
        "r": 16,  # The LoRA rank: 8, 16, 32, 64
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],  # The target modules
        "lora_alpha": 16,  # The alpha value for LoRA
        "lora_dropout": 0,  # The dropout value for LoRA
        "bias": "none",  # The bias for LoRA
        "use_gradient_checkpointing": True,  # Use gradient checkpointing
        "use_rslora": False,  # Use rank-stabilized LoRA
        "use_dora": False,  # Use DoRA
        "loftq_config": None,  # The LoftQ configuration
    },
    "training_dataset": {
        "name": os.path.join(DATA_HOME, "dataset_gen"),  # The dataset name (huggingface/datasets)
        "split": "train",  # The dataset split
        "input_field": "prompt",  # The input field
    },
    "training_config": {
        "per_device_train_batch_size": 1,  # The batch size
        "gradient_accumulation_steps": 1,  # The gradient accumulation steps
        "warmup_steps": 5,  # The warmup steps
        "max_steps": 0,  # The maximum steps (0 if the epochs are defined)
        "num_train_epochs": 1,  # The number of training epochs (0 if the maximum steps are defined)
        "learning_rate": 2e-4,  # The learning rate
        "fp16": not torch.cuda.is_bf16_supported(),  # Use fp16
        "bf16": torch.cuda.is_bf16_supported(),  # Use bf16
        "logging_steps": 1,  # The logging steps
        "optim": "adamw_8bit",  # The optimizer
        "weight_decay": 0.01,  # The weight decay
        "lr_scheduler_type": "linear",  # The learning rate scheduler
        "seed": 42,  # The seed
        "output_dir": "outputs",  # The output directory
    },
}

# Loading the model and the tokenizer for the model

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = config.get("model_config").get("base_model"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dtype = config.get("model_config").get("dtype"),
    load_in_4bit = config.get("model_config").get("load_in_4bit"),
)

# Setup for QLoRA/LoRA PEFT of the base model

model = FastLanguageModel.get_peft_model(
    model,
    r = config.get("lora_config").get("r"),
    target_modules = config.get("lora_config").get("target_modules"),
    lora_alpha = config.get("lora_config").get("lora_alpha"),
    lora_dropout = config.get("lora_config").get("lora_dropout"),
    bias = config.get("lora_config").get("bias"),
    use_gradient_checkpointing = config.get("lora_config").get("use_gradient_checkpointing"),
    random_state = 42,
    use_rslora = config.get("lora_config").get("use_rslora"),
    use_dora = config.get("lora_config").get("use_dora"),
    loftq_config = config.get("lora_config").get("loftq_config"),
)

# Loading the training dataset

dataset_train = load_from_disk(config.get("training_dataset").get("name"))['train']

# Setting up the trainer for the model

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_train,
    dataset_text_field = config.get("training_dataset").get("input_field"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dataset_num_proc = 1,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = config.get("training_config").get("per_device_train_batch_size"),
        gradient_accumulation_steps = config.get("training_config").get("gradient_accumulation_steps"),
        warmup_steps = config.get("training_config").get("warmup_steps"),
        max_steps = config.get("training_config").get("max_steps"),
        num_train_epochs = config.get("training_config").get("num_train_epochs"),
        learning_rate = config.get("training_config").get("learning_rate"),
        fp16 = config.get("training_config").get("fp16"),
        bf16 = config.get("training_config").get("bf16"),
        logging_steps = config.get("training_config").get("logging_steps"),
        optim = config.get("training_config").get("optim"),
        weight_decay = config.get("training_config").get("weight_decay"),
        lr_scheduler_type = config.get("training_config").get("lr_scheduler_type"),
        seed = 42,
        output_dir = config.get("training_config").get("output_dir"),
    ),
)

# Training the model

trainer_stats = trainer.train()

# Saving the trainer stats

with open(os.path.join(DATA_HOME, "outputs/trainer_stats_gen.json"), "w") as f:
    json.dump(trainer_stats, f, indent=4)

# Locally saving the model and pushing it to the Hugging Face Hub (only LoRA adapters)

model.save_pretrained(config.get("model_config").get("finetuned_model"))

Can I amend this code to run on the CPU?

@danielhanchen
Contributor

You should convert to GGUF for CPU inference - or you can use direct HF inference
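
For GGUF, Unsloth can export directly after training. A rough sketch, reusing the model and tokenizer from the script above; the output path and quantization method are illustrative choices, not taken from your setup:

# Export the merged model to GGUF for llama.cpp-style CPU inference.
# NOTE: output directory and quantization method are illustrative.
model.save_pretrained_gguf(
    os.path.join(DATA_HOME, "model_root/model_gen_gguf"),
    tokenizer,
    quantization_method = "q4_k_m",  # a common CPU-friendly 4-bit quantization
)

The resulting .gguf file can then be loaded by llama.cpp (or llama-cpp-python) on a CPU-only machine.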

@SidneyLann
Author

Hi, any links for reference?

@Linguiniotta

There are instructions in the wiki for converting to GGUF, but is it possible to fine-tune / train with a TPU or CPU? I get an error when importing unsloth's FastLanguageModel. I maxed out my GPU quota on Kaggle lol.

Installation / Import
# https://github.com/unslothai/unsloth/issues/998
!pip install --quiet pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install --quiet "torch==2.4.0" "xformers==0.0.27.post2" triton torchvision torchaudio
!pip install --quiet "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
%%time
from unsloth import FastLanguageModel
from accelerate import Accelerator
Error
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File :1

File /usr/local/lib/python3.10/site-packages/unsloth/__init__.py:83
     80 pass
     82 # Torch 2.4 has including_emulation
---> 83 major_version, minor_version = torch.cuda.get_device_capability()
     84 SUPPORTS_BFLOAT16 = (major_version >= 8)
     86 old_is_bf16_supported = torch.cuda.is_bf16_supported

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:451, in get_device_capability(device)
    438 def get_device_capability(device: Optional[_device_t] = None) -> Tuple[int, int]:
    439     r"""Get the cuda capability of a device.
    440 
    441     Args:
   (...)
    449         tuple(int, int): the major and minor cuda capability of the device
    450     """
--> 451     prop = get_device_properties(device)
    452     return prop.major, prop.minor

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:465, in get_device_properties(device)
    455 def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
    456     r"""Get the properties of a device.
    457 
    458     Args:
   (...)
    463         _CudaDeviceProperties: the properties of the device
    464     """
--> 465     _lazy_init()  # will define _get_device_properties
    466     device = _get_device_index(device, optional=True)
    467     if device < 0 or device >= device_count():

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:314, in _lazy_init()
    312 if "CUDA_MODULE_LOADING" not in os.environ:
    313     os.environ["CUDA_MODULE_LOADING"] = "LAZY"
--> 314 torch._C._cuda_init()
    315 # Some of the queued calls may reentrantly call _lazy_init();
    316 # we need to just return without initializing in that case.
    317 # However, we must not let any *other* threads in!
    318 _tls.is_initializing = True

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

@SidneyLann
Author

Where are the instructions for using llama.cpp to load the GGUF and run inference?

@Linguiniotta

It is in their GH :) https://github.com/ggerganov/llama.cpp#usage

@SidneyLann
Author

import os
import sys
import json
import torch
from datasets import load_dataset
from unsloth import FastLanguageModel

INSTRUCTION = "does the user input content contain bus?"
DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000
SEQ_START_IDX = 512

config = {
    "model_config": {
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),
        "max_seq_length": MAX_SEQ_LENGTH,
        "dtype": torch.float16,
        "load_in_4bit": True,
    }
}

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=config.get("model_config").get("finetuned_model"),
    max_seq_length=config.get("model_config").get("max_seq_length"),
    dtype=config.get("model_config").get("dtype"),
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
)

FastLanguageModel.for_inference(model)
dataset_path = sys.argv[1]
dateStr=dataset_path[-8:]
files = [os.path.join(dataset_path, f) for f in os.listdir(dataset_path)]
fileCount = 0
genCount = 0
exceptFileName = ''
for fileName in files:
    file_size = os.path.getsize(fileName)
    fileCount = fileCount + 1
    print('fileCount: ', fileCount, genCount, file_size, dateStr, fileName, exceptFileName)
    if file_size < 8192:
        continue
    genCount = genCount + 1

    with open(fileName) as f:
        content = f.read()
        print("content Size is :", len(content))
        if len(content) > MAX_SEQ_LENGTH+SEQ_START_IDX:
            content = content[SEQ_START_IDX:MAX_SEQ_LENGTH+SEQ_START_IDX]
        inputs = tokenizer(
            [
                f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAt date {dateStr}, {INSTRUCTION}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>"""
            ], return_tensors="pt").to("cuda")
        try:
            outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
        except:
            exceptFileName = fileName
            continue
        outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        print(outputs[0])

How can I amend this code to use llama.cpp to run inference from the GGUF?

@Linguiniotta

You are still using the Unsloth model. Convert it to GGUF first, THEN infer.
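
If you want to keep a Python loop like the one above, llama-cpp-python can load the GGUF on the CPU. A rough sketch only; the GGUF path, context size and thread count are illustrative:

from llama_cpp import Llama  # pip install llama-cpp-python

# Load the converted GGUF once, before the file loop (path is illustrative).
llm = Llama(
    model_path="model_gen.q4_k_m.gguf",
    n_ctx=5000,    # roughly match MAX_SEQ_LENGTH from the training script
    n_threads=8,   # number of CPU threads; tune for your machine
)

# Inside the loop, build the same Llama 3 prompt and call the model instead of
# tokenizer(...)/model.generate(...); dateStr, INSTRUCTION and content are the
# variables from your existing loop.
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"At date {dateStr}, {INSTRUCTION}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
output = llm(prompt, max_tokens=512)
print(output["choices"][0]["text"])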

@SidneyLann
Author

SidneyLann commented Sep 25, 2024

I had converted it, but I don't know how to use llama.cpp the way I use unsloth to run inference.

@danielhanchen
Contributor

@SidneyLann Another option is to use HuggingFace CPU directly after finetuning:

    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")
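
Generation is then the standard transformers call, for example (the prompt below is only a placeholder):

    # Placeholder prompt; reuse your own Llama 3 chat template here.
    inputs = tokenizer(
        ["<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>"],
        return_tensors = "pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
    print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])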

@SidneyLann
Author

SidneyLann commented Oct 1, 2024

I can't run inference with the lora_model that was fine-tuned on the GPU; it still uses the GPU. Must I fine-tune on the CPU to do CPU inference?

@danielhanchen
Contributor

@SidneyLann You need to save the LoRA adapter (finetuned by CPU or GPU) then load it on a CPU only machine - it should work!

@SidneyLann
Author

My machine has one GPU that is busy with other tasks. Can't I do CPU inference on this machine? Can't I configure it with a flag?

@danielhanchen
Contributor

@SidneyLann Yes, you can set device_map = "cpu" in the loading call, for example, to force it onto the CPU

@SidneyLann
Author

config = {
    "model_config": {
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen110"),
        "max_seq_length": 5000,
        "dtype": torch.float32,
        "load_in_4bit": True,
        "device_map": "cpu",
    }
}

model_name = config.get("model_config").get("finetuned_model")
device_map = config.get("model_config").get("device_map")
model = AutoPeftModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
    device_map=device_map,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer(
    [
        f"""<|begin_of_text|>......<|eot_id|>"""
    ], return_tensors = "pt").to(device_map)
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens = True)
print(outputs[0])

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.
......
File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 468, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 566, in matmul_4bit
assert quant_state is not None
AssertionError

quant_state is None; why doesn't .to(device_map) work?

@danielhanchen
Contributor

@SidneyLann Actually you're correct - bitsandbytes only works on GPU :(

Have you considered exporting to GGUF / llama.cpp / Ollama for inference?

Another way is to use load_in_4bit = False
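
A rough sketch of that non-quantized CPU path; the adapter directory and dtype are illustrative:

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Illustrative adapter path; point this at your finetuned_model directory.
model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model",
    load_in_4bit = False,          # skip bitsandbytes entirely
    device_map = "cpu",            # keep everything off the GPU
    torch_dtype = torch.float32,   # fp32 is the safe default on CPU
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")

inputs = tokenizer("<|begin_of_text|>......<|eot_id|>", return_tensors = "pt")
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])

Note this loads the full-precision weights into system RAM, so expect a much larger memory footprint than the 4-bit path.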
