Added tokens #343
Hello @cyanic-selkie, I do not believe adding arbitrary tokens to a vocabulary/tokenizer is supported yet by the tokenizers dependency. A method does exist to overwrite the special tokens mapping: it allows passing a special token mapping file. The tokenizer creation should be updated to allow passing such a file. This use-case sounds rather generic, so I will work on pushing some changes later this week. I will probably follow the same pattern as the Python library: calling a method named `add_tokens` after tokenizer initialization.
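For reference, a rough sketch of the construction-time override being described; the constructor name `from_file_with_special_token_mapping`, its argument order, and the mapping file layout are assumptions here rather than confirmed rust_tokenizers API:

```rust
extern crate anyhow;

use rust_tokenizers::tokenizer::T5Tokenizer;

fn main() -> anyhow::Result<()> {
    // A special token mapping file would contain entries such as
    // {"unk_token": "<unk>", "pad_token": "<pad>", "eos_token": "</s>"}
    let tokenizer = T5Tokenizer::from_file_with_special_token_mapping(
        "spiece.model",           // sentencepiece protobuf vocabulary
        false,                    // lower_case
        "special_token_map.json", // the mapping file discussed above
    )?;
    let _ = tokenizer;
    Ok(())
}
```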
@cyanic-selkie could you please share an example of the models you are trying to use that leverage `added_tokens.json`?
Sure thing @guillaume-be! Here is one. I don't know of any others though, since I never used them.
Hello @cyanic-selkie, I have opened a PR exposing `add_tokens`. The following is now possible with the proposed changes at #354; is this what you had in mind?

```rust
extern crate anyhow;
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_bert::pipelines::generation_utils::GenerateConfig;
use rust_bert::resources::{RemoteResource, ResourceProvider};
use rust_bert::t5::{T5ConfigResources, T5Generator, T5ModelResources, T5VocabResources};
fn main() -> anyhow::Result<()> {
// Set-up model
let model_resource = Box::new(RemoteResource::from_pretrained(T5ModelResources::T5_SMALL));
let config_resource = Box::new(RemoteResource::from_pretrained(T5ConfigResources::T5_SMALL));
let vocab_resource = RemoteResource::from_pretrained(T5VocabResources::T5_SMALL);
let generate_config = GenerateConfig {
model_resource,
config_resource,
..Default::default()
};
let mut tokenizer = TokenizerOption::from_file(
ModelType::T5,
vocab_resource.get_local_path()?.to_str().unwrap(),
None,
false,
None,
None,
)?;
tokenizer.add_tokens(&["<sep>", "<hl>"]);
let _t5_generator = T5Generator::new_with_tokenizer(generate_config, tokenizer)?;
Ok(())
}
```

Also note that the text generation pipeline does not currently support T5 models.
Indeed! Being able to instantiate models/pipelines with a custom tokenizer instance seems ideal.
I've noticed that, however, it seemed to just work (although the tokenizer didn't, so perhaps it did not, in fact, work). My understanding was that generation (whether prefix or conditional) is abstracted away behind the `LanguageGenerator` trait. Out of curiosity, where would the implementation break?
I'm guessing having a …
Also, to add to this, I stumbled upon another model using the `added_tokens.json` file. Thank you for adding this feature so quickly!
Hello, Yes - the implementation with T5 would fail at runtime because of the code at rust-bert/src/pipelines/text_generation.rs, line 225 (commit b40adb8).
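As a loose illustration of why this only surfaces at runtime (a self-contained sketch, not the actual rust-bert source; the enum and function here are hypothetical): pipeline construction dispatches on a model-type enum, so any variant compiles, and unsupported ones fail only once the pipeline is built.

```rust
// Hypothetical sketch, not rust-bert code: dispatch on a model-type enum
// accepts every variant at compile time; unsupported variants only error
// out when the pipeline is constructed.
#[derive(Debug, Clone, Copy)]
enum SketchModelType {
    Gpt2,
    Bart,
    T5,
}

fn new_text_generator(model_type: SketchModelType) -> Result<&'static str, String> {
    match model_type {
        SketchModelType::Gpt2 => Ok("GPT-2 generator"),
        SketchModelType::Bart => Ok("BART generator"),
        // Anything the pipeline does not handle surfaces here, at runtime.
        other => Err(format!("{other:?} is not supported by this pipeline")),
    }
}

fn main() {
    assert!(new_text_generator(SketchModelType::Gpt2).is_ok());
    assert!(new_text_generator(SketchModelType::T5).is_err());
}
```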
I just tried adding the `[HL]` token, and encoding with it seems to behave as expected:

```rust
extern crate anyhow;
use rust_bert::bart::{BartMergesResources, BartVocabResources};
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_tokenizers::tokenizer::TruncationStrategy;
use std::mem;
use rust_bert::resources::{RemoteResource, ResourceProvider};
fn main() -> anyhow::Result<()> {
// Set-up model
let vocab_resource = RemoteResource::from_pretrained(BartVocabResources::DISTILBART_CNN_6_6);
let merges_resource = RemoteResource::from_pretrained(BartMergesResources::DISTILBART_CNN_6_6);
let mut tokenizer = TokenizerOption::from_file(
ModelType::Bart,
vocab_resource.get_local_path()?.to_str().unwrap(),
Some(merges_resource.get_local_path()?.to_str().unwrap()),
false,
None,
None,
)?;
tokenizer.add_tokens(&["[HL]"]);
let mut output = tokenizer.encode_list(
&["This is a [HL] test"],
128,
&TruncationStrategy::DoNotTruncate,
0,
);
let token_ids = mem::take(&mut output[0].token_ids);
println!("{token_ids:?}");
Ok(())
}
```
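As a quick follow-up check, the ids could be decoded back to confirm that `[HL]` survives the round trip. This is a sketch only; it assumes `TokenizerOption::decode` takes the ids plus flags for skipping special tokens and cleaning up tokenization spaces:

```rust
// Hypothetical continuation of the snippet above; the decode signature is an
// assumption. Decoding should reproduce the added [HL] token verbatim.
let decoded = tokenizer.decode(&token_ids, false, true);
println!("{decoded}");
```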
I just tried it again, and you appear to be correct. I don't know what I did or thought last night when I tried it. In any case, when #354 gets merged, as far as I'm concerned, you may consider this issue closed.
Fixed by #354
Hi!
I am trying to use a T5 model for text generation. It required adding a few lines of code to the pipeline, but it basically works out of the box.
The issue I came across while trying to use a custom model from Hugging Face is that I can't use its added_tokens.json file. Since T5 uses a sentencepiece protobuf, amending it would be a nuisance. Are there any plans to support this feature? Is there a simple workaround I could use?
Thank you.
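For context, an `added_tokens.json` file in the Hugging Face ecosystem is a flat JSON map from each added token to its vocabulary id. A minimal sketch of inspecting one in Rust, assuming `serde_json` as a dependency:

```rust
use std::collections::HashMap;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // added_tokens.json maps token strings to vocabulary ids, e.g.
    // {"<hl>": 32100, "<sep>": 32101}
    let raw = fs::read_to_string("added_tokens.json")?;
    let added_tokens: HashMap<String, i64> = serde_json::from_str(&raw)?;
    for (token, id) in &added_tokens {
        println!("{token} -> {id}");
    }
    Ok(())
}
```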