RuntimeError: Internal: could not parse ModelProto from /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct/tokenizer.model #34017

Open
Itime-ren opened this issue Oct 8, 2024 · 1 comment
Labels
bug, Core: Tokenization

Comments

@Itime-ren

System Info

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in #24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Traceback (most recent call last):
  File "/Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 479, in <module>
    main()
  File "/Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 457, in main
    write_tokenizer(
  File "/Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 367, in write_tokenizer
    tokenizer = tokenizer_class(input_tokenizer_path)
  File "/home/transformers/src/transformers/models/llama/tokenization_llama_fast.py", line 157, in __init__
    super().__init__(
  File "/home/transformers/src/transformers/tokenization_utils_fast.py", line 132, in __init__
    slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
  File "/home/transformers/src/transformers/models/llama/tokenization_llama.py", line 171, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/transformers/src/transformers/models/llama/tokenization_llama.py", line 198, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct/tokenizer.model
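
This error means sentencepiece could not deserialize tokenizer.model as a ModelProto. Llama 3.x releases ship a tiktoken-style tokenizer.model (one base64-encoded token and its rank per line) rather than a sentencepiece model, so the slow-tokenizer code path above is bound to fail on it. A minimal sketch to confirm which format the file is in (the tiktoken-format description is an assumption about the Meta checkpoint, not something stated in this traceback):

import sentencepiece as spm

path = "/Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct/tokenizer.model"
sp = spm.SentencePieceProcessor()
try:
    sp.Load(path)  # succeeds only for a serialized SentencePiece ModelProto
    print("sentencepiece model, vocab size:", sp.GetPieceSize())
except RuntimeError:
    # a tiktoken-style file is plain text: "<base64 token> <rank>" per line
    with open(path, "rb") as f:
        print("not a ModelProto; first line:", f.readline()[:60])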

Who can help?

@ArthurZucker @itazap

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

python3 /Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
  --input_dir /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct \
  --model_size 1B \
  --output_dir /Data_disk/meta_llama/meta_llama3.2/out
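
Note that this invocation relies on the script's default tokenizer handling, which assumes a sentencepiece model; that is exactly the code path that fails above. Recent versions of convert_llama_weights_to_hf.py accept a --llama_version flag that switches Llama 3.x checkpoints to the tiktoken-based tokenizer converter. Assuming your checkout includes that flag, a sketch of the adjusted command:

python3 /Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
  --input_dir /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct \
  --model_size 1B \
  --output_dir /Data_disk/meta_llama/meta_llama3.2/out \
  --llama_version 3.2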

Expected behavior

The conversion completes and writes safetensors files to the output directory.

Itime-ren added the bug label Oct 8, 2024
LysandreJik added the Core: Tokenization label Oct 8, 2024
@LysandreJik
Member

Hey @Itime-ren, what's the content of /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct?

If you're trying to use Llama 3.2 1B Instruct, why not use the meta-llama/Llama-3.2-1B-Instruct repo on the Hugging Face Hub, which is already transformers-compatible?
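
For reference, a minimal sketch of loading that repo directly (repo id assumed to be meta-llama/Llama-3.2-1B-Instruct, a gated repo that requires accepting Meta's license on the Hub), which downloads safetensors weights with no manual conversion step:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed repo id; gated on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # fetches safetensors shards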
