
Added tokens #343

Closed
cyanic-selkie opened this issue Mar 26, 2023 · 8 comments
@cyanic-selkie

Hi!

I am trying to use a T5 model for text generation. It required adding a few lines of code to the pipeline, but it basically works out of the box.

The issue I came across while trying to use a custom model from Hugging Face is that I can't use its added_tokens.json file. Since T5 stores its vocabulary in a sentencepiece protobuf, amending it would be a nuisance.

Are there any plans to support this feature? Is there a simple workaround I could use?

Thank you.

extern crate anyhow;

use rust_bert::pipelines::common::ModelType;
use rust_bert::pipelines::text_generation::{TextGenerationConfig, TextGenerationModel};
use rust_bert::resources::RemoteResource;

fn main() -> anyhow::Result<()> {
    // Resources for the custom model (URLs abbreviated in this example).
    let model = Box::new(RemoteResource::new("model/rust_model.ot", "cache/model"));
    let vocab = Box::new(RemoteResource::new("model/spiece.model", "cache/vocab"));
    let config = Box::new(RemoteResource::new("model/config.json", "cache/config"));

    let generate_config = TextGenerationConfig {
        model_type: ModelType::T5,
        model_resource: model,
        config_resource: config,
        vocab_resource: vocab,
        ..Default::default()
    };

    let _model = TextGenerationModel::new(generate_config)?;
    Ok(())
}
@guillaume-be (Owner)

Hello @cyanic-selkie,

I do not believe adding arbitrary tokens to a vocabulary/tokenizer is supported yet by the tokenizers dependency. A method exists to overwrite the special tokens mapping:
https://github.com/guillaume-be/rust-tokenizers/blob/823c125442c7b8f7a8d6b79af610ce7640279275/main/src/tokenizer/t5_tokenizer.rs#L86

This allows passing a special_token_mapping file defining/overwriting the list of tokens that should be handled as special tokens by the tokenization algorithm. The tokens given in this file, however, need to be included in the vocabulary; otherwise the tokenizer would fail (a mapping to the token index to use would not be found). Note that these tokens probably cannot be added to the sentencepiece proto file, since I assume they do not have an associated probability/score and are instead new special tokens (the Python library mentions that these are indeed handled differently: https://huggingface.co/docs/transformers/v4.27.2/en/internal/tokenization_utils#transformers.SpecialTokensMixin.add_tokens)
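
In the meantime, the existing mechanism would look something like the sketch below. The file paths are placeholders, and the JSON field names reflect my reading of the SpecialTokenMap struct in rust-tokenizers, so treat both as assumptions:

extern crate anyhow;

use rust_tokenizers::tokenizer::T5Tokenizer;

fn main() -> anyhow::Result<()> {
    // special_token_mapping.json could look like:
    // {"unk_token": "<unk>", "additional_special_tokens": ["<sep>", "<hl>"]}
    //
    // Caveat: every token listed in the mapping must already exist in
    // spiece.model, otherwise no token index can be resolved and the
    // tokenizer returns an error.
    let _tokenizer = T5Tokenizer::from_file_with_special_token_mapping(
        "path/to/spiece.model",
        false, // lower_case
        "path/to/special_token_mapping.json",
    )?;
    Ok(())
}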

The tokenizer creation should be updated to allow passing such a file. This use case sounds rather generic, so I will work on pushing some changes later this week. I will probably follow the same pattern as the Python library: a method called add_tokens, invoked after tokenizer initialization, taking a HashMap<String, i64> and an is_special_token boolean flag indicating whether these tokens should also be registered as special tokens.

@guillaume-be (Owner)

@cyanic-selkie could you please share an example of a model you are trying to use that leverages added_tokens.json? I'd like to use it for testing purposes to ensure the implementation is correct.

@cyanic-selkie (Author)

cyanic-selkie commented Apr 14, 2023

> @cyanic-selkie could you please share an example of a model you are trying to use that leverages added_tokens.json? I'd like to use it for testing purposes to ensure the implementation is correct.

Sure thing @guillaume-be! Here is one. I don't know of any others, though, since I've never used them.

@guillaume-be (Owner)

guillaume-be commented Apr 16, 2023

Hello @cyanic-selkie,

I have opened a PR exposing add_tokens on the Tokenizer directly and on the TokenizerOption for convenience. I have also added a new_with_tokenizer method to most pipelines, allowing pipelines to be created with a custom tokenizer, which is probably what you should do in your use case. As opposed to the Transformers Python library, added_tokens.json is not ingested automatically if found at instantiation (this would cause a significant number of breaking changes, and the crate generally relies on direct file pointers rather than directories for resources). Instead, you would create the tokenizer and call add_tokens to add extra tokens as needed.

The following is now possible with the proposed changes at #354; is this what you had in mind?

extern crate anyhow;

use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_bert::pipelines::generation_utils::GenerateConfig;
use rust_bert::resources::{RemoteResource, ResourceProvider};
use rust_bert::t5::{T5ConfigResources, T5Generator, T5ModelResources, T5VocabResources};

fn main() -> anyhow::Result<()> {
    // Set up model resources
    let model_resource = Box::new(RemoteResource::from_pretrained(T5ModelResources::T5_SMALL));
    let config_resource = Box::new(RemoteResource::from_pretrained(T5ConfigResources::T5_SMALL));
    let vocab_resource = RemoteResource::from_pretrained(T5VocabResources::T5_SMALL);

    let generate_config = GenerateConfig {
        model_resource,
        config_resource,
        ..Default::default()
    };
    let mut tokenizer = TokenizerOption::from_file(
        ModelType::T5,
        vocab_resource.get_local_path()?.to_str().unwrap(),
        None,
        false,
        None,
        None,
    )?;
    tokenizer.add_tokens(&["<sep>", "<hl>"]);

    let _t5_generator = T5Generator::new_with_tokenizer(generate_config, tokenizer)?;

    Ok(())
}

Also note that TextGenerationModel is used for causal/prefix generation and cannot be used for text-to-text / conditional generation models like T5. The code example above illustrates how to use the T5 generator directly instead.
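
Continuing the example above (the generator is renamed without the leading underscore since it is now used), conditional generation would look roughly like this. The exact generate signature and output type have shifted between versions, so treat the call shape as an assumption:

use rust_bert::pipelines::generation_utils::LanguageGenerator;

// Conditional (text-to-text) generation: the full input is the condition,
// not a prefix to be continued. Depending on the crate version, `generate`
// may return a Result that needs `?`.
let t5_generator = T5Generator::new_with_tokenizer(generate_config, tokenizer)?;
let output = t5_generator.generate(Some(&["translate English to German: This is a test."]), None);
for sentence in output {
    println!("{}", sentence.text);
}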

@cyanic-selkie (Author)

> The following is now possible with the proposed changes at #354; is this what you had in mind?

Indeed! Being able to instantiate models/pipelines with a custom tokenizer instance seems ideal.

> Also note that TextGenerationModel is used for causal/prefix generation and cannot be used for text-to-text / conditional generation models like T5.

I noticed that; however, it seemed to just work (although the tokenizer didn't, so perhaps it did not in fact work).

My understanding was that generation (whether prefix or conditional) is abstracted away behind the LanguageGenerator trait anyway.

Out of curiosity, where would the implementation break?

> The code example above illustrates how to use the T5 generator directly instead.

I'm guessing having a ConditionalTextGeneration pipeline wouldn't be of much use then?

> @cyanic-selkie could you please share an example of a model you are trying to use that leverages added_tokens.json? I'd like to use it for testing purposes to ensure the implementation is correct.

Also, to add to this, I stumbled upon another model using the added_tokens.json file. This time it is a BART model. I played around with the latest rust_tokenizers commit (guillaume-be/rust-tokenizers@29a9911) and it seems not to work when adding the [HL] token, but it does work when adding an <HL> token. Just thought I'd mention it.

Thank you for adding this feature so quickly!

@guillaume-be (Owner)

Hello,

Yes - TextGenerationModel is a high-level wrapper (similar to the Python version) that allows adding a prefix (or automatically adds a prefix for XLNet). The LanguageGenerator should be used directly for conditional generation.

The implementation with T5 would fail at runtime because the ModelType is not supported by this pipeline:

_ => Err(RustBertError::InvalidConfigurationError(format!(
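
The fragment above is truncated; paraphrased, the shape of the check is roughly the following (a sketch, not the exact source):

use rust_bert::pipelines::common::ModelType;
use rust_bert::RustBertError;

// Paraphrased shape of the dispatch that rejects T5 (and other
// text-to-text models) in TextGenerationModel:
fn check_supported(model_type: ModelType) -> Result<(), RustBertError> {
    match model_type {
        ModelType::GPT2 | ModelType::XLNet | ModelType::Reformer => Ok(()),
        _ => Err(RustBertError::InvalidConfigurationError(format!(
            "{model_type:?} is not supported for causal text generation"
        ))),
    }
}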

I just tried adding the [HL] token for BART and it worked fine; do you have an example to help me reproduce?

extern crate anyhow;

use rust_bert::bart::{BartMergesResources, BartVocabResources};
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_tokenizers::tokenizer::TruncationStrategy;
use std::mem;

use rust_bert::resources::{RemoteResource, ResourceProvider};

fn main() -> anyhow::Result<()> {
    // Set up tokenizer resources
    let vocab_resource = RemoteResource::from_pretrained(BartVocabResources::DISTILBART_CNN_6_6);
    let merges_resource = RemoteResource::from_pretrained(BartMergesResources::DISTILBART_CNN_6_6);

    let mut tokenizer = TokenizerOption::from_file(
        ModelType::Bart,
        vocab_resource.get_local_path()?.to_str().unwrap(),
        Some(merges_resource.get_local_path()?.to_str().unwrap()),
        false,
        None,
        None,
    )?;
    tokenizer.add_tokens(&["[HL]"]);
    let mut output = tokenizer.encode_list(
        &["This is a [HL] test"],
        128,
        &TruncationStrategy::DoNotTruncate,
        0,
    );
    let token_ids = mem::take(&mut output[0].token_ids);
    println!("{token_ids:?}");

    Ok(())
}
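
If the added token is registered correctly, the printed ids should contain a single id for [HL] rather than a sequence of sub-word ids.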

@cyanic-selkie (Author)

I just tried it again and you appear to be correct. I don't know what I did or thought last night when I tried it.

In any case, when #354 gets merged, as far as I'm concerned, you may consider this issue closed.

@guillaume-be (Owner)

Fixed by #354
