
Added tokens #343

Closed
cyanic-selkie opened this issue Mar 26, 2023 · 8 comments
@cyanic-selkie

Hi!

I am trying to use a T5 model for text generation. It required adding a few lines of code to the pipeline, but it basically works out of the box.

The issue I came across while trying to use a custom model from Hugging Face is that I can't use its added_tokens.json file. Since T5 stores its vocabulary in a sentencepiece protobuf, amending it would be a nuisance.

Are there any plans to support this feature? Is there a simple workaround I could use?

Thank you.

extern crate anyhow;

use rust_bert::pipelines::common::ModelType;
use rust_bert::pipelines::text_generation::{TextGenerationConfig, TextGenerationModel};
use rust_bert::resources::RemoteResource;

fn main() -> anyhow::Result<()> {
    // Resources for the custom model (URLs abbreviated in this example).
    let model = Box::new(RemoteResource::new("model/rust_model.ot", "cache/model"));
    let vocab = Box::new(RemoteResource::new("model/spiece.model", "cache/vocab"));
    let config = Box::new(RemoteResource::new("model/config.json", "cache/config"));

    let generate_config = TextGenerationConfig {
        model_type: ModelType::T5,
        model_resource: model,
        config_resource: config,
        vocab_resource: vocab,
        ..Default::default()
    };

    let _model = TextGenerationModel::new(generate_config)?;
    Ok(())
}
@guillaume-be (Owner)

Hello @cyanic-selkie,

I do not believe adding arbitrary tokens to a vocabulary/tokenizer is supported yet by the tokenizers dependency. A method exists to overwrite the special tokens mapping:
https://github.com/guillaume-be/rust-tokenizers/blob/823c125442c7b8f7a8d6b79af610ce7640279275/main/src/tokenizer/t5_tokenizer.rs#L86

This allows passing a special_token_mapping file defining/overwriting the list of tokens that should be handled as special tokens by the tokenization algorithm. The tokens given in this file, however, need to be included in the vocabulary; otherwise the tokenizer would fail (a mapping to the token index to use would not be found). Note that these tokens probably cannot be added to the sentencepiece proto file, since I assume they do not have an associated probability/score and are instead new special tokens (the Python library mentions that these are indeed handled differently: https://huggingface.co/docs/transformers/v4.27.2/en/internal/tokenization_utils#transformers.SpecialTokensMixin.add_tokens)
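
In the meantime, the existing mechanism would look something like the sketch below. The file paths are placeholders, and the JSON field names reflect my reading of the SpecialTokenMap struct in rust-tokenizers, so treat both as assumptions:

extern crate anyhow;

use rust_tokenizers::tokenizer::T5Tokenizer;

fn main() -> anyhow::Result<()> {
    // special_token_mapping.json could look like:
    // {"unk_token": "<unk>", "additional_special_tokens": ["<sep>", "<hl>"]}
    //
    // Caveat: every token listed in the mapping must already exist in
    // spiece.model, otherwise no token index can be resolved and the
    // tokenizer returns an error.
    let _tokenizer = T5Tokenizer::from_file_with_special_token_mapping(
        "path/to/spiece.model",
        false, // lower_case
        "path/to/special_token_mapping.json",
    )?;
    Ok(())
}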

The tokenizer creation should be updated to allow passing such a file. This use case sounds rather generic, so I will work on pushing some changes later this week. I will probably follow the same pattern as the Python library: a method called add_tokens, invoked after tokenizer initialization, taking a HashMap<String, i64> and an is_special_token boolean flag indicating whether these tokens should also be registered as special tokens.

@guillaume-be (Owner)

@cyanic-selkie could you please share an example of a model you are trying to use that leverages added_tokens.json? I'd like to use it for testing purposes to ensure the implementation is correct.

@cyanic-selkie (Author)

cyanic-selkie commented Apr 14, 2023

> @cyanic-selkie could you please share an example of a model you are trying to use that leverages added_tokens.json? I'd like to use it for testing purposes to ensure the implementation is correct.

Sure thing @guillaume-be! Here is one. I don't know of any others, though, since I've never used them.

@guillaume-be (Owner)

guillaume-be commented Apr 16, 2023

Hello @cyanic-selkie,

I have opened a PR exposing add_tokens on the Tokenizer directly and on the TokenizerOption for convenience. I have also added a new_with_tokenizer method to most pipelines, allowing pipelines to be created with a custom tokenizer, which is probably what you should do in your use case. As opposed to the Transformers Python library, added_tokens.json is not ingested automatically if found at instantiation (this would cause a significant number of breaking changes, and the crate generally relies on direct file pointers rather than directories for resources). Instead, you would create the tokenizer and call add_tokens to add extra tokens as needed.

The following is now possible with the proposed changes at #354; is this what you had in mind?

extern crate anyhow;

use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_bert::pipelines::generation_utils::GenerateConfig;
use rust_bert::resources::{RemoteResource, ResourceProvider};
use rust_bert::t5::{T5ConfigResources, T5Generator, T5ModelResources, T5VocabResources};

fn main() -> anyhow::Result<()> {
    // Set up model resources
    let model_resource = Box::new(RemoteResource::from_pretrained(T5ModelResources::T5_SMALL));
    let config_resource = Box::new(RemoteResource::from_pretrained(T5ConfigResources::T5_SMALL));
    let vocab_resource = RemoteResource::from_pretrained(T5VocabResources::T5_SMALL);

    let generate_config = GenerateConfig {
        model_resource,
        config_resource,
        ..Default::default()
    };
    let mut tokenizer = TokenizerOption::from_file(
        ModelType::T5,
        vocab_resource.get_local_path()?.to_str().unwrap(),
        None,
        false,
        None,
        None,
    )?;
    tokenizer.add_tokens(&["<sep>", "<hl>"]);

    let _t5_generator = T5Generator::new_with_tokenizer(generate_config, tokenizer)?;

    Ok(())
}

Also note that TextGenerationModel is used for causal/prefix generation and cannot be used for text-to-text / conditional generation models like T5. The code example above illustrates how to use the T5 generator directly instead.
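
Continuing the example above (the generator is renamed without the leading underscore since it is now used), conditional generation would look roughly like this. The exact generate signature and output type have shifted between versions, so treat the call shape as an assumption:

use rust_bert::pipelines::generation_utils::LanguageGenerator;

// Conditional (text-to-text) generation: the full input is the condition,
// not a prefix to be continued. Depending on the crate version, `generate`
// may return a Result that needs `?`.
let t5_generator = T5Generator::new_with_tokenizer(generate_config, tokenizer)?;
let output = t5_generator.generate(Some(&["translate English to German: This is a test."]), None);
for sentence in output {
    println!("{}", sentence.text);
}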

@cyanic-selkie (Author)

> The following is now possible with the proposed changes at #354; is this what you had in mind?

Indeed! Being able to instantiate models/pipelines with a custom tokenizer instance seems ideal.

> Also note that TextGenerationModel is used for causal/prefix generation and cannot be used for text-to-text / conditional generation models like T5.

I noticed that; however, it seemed to just work (although the tokenizer didn't, so perhaps it did not in fact work).

My understanding was that generation (whether prefix or conditional) is abstracted away behind the LanguageGenerator trait anyway.

Out of curiosity, where would the implementation break?

> The code example above illustrates how to use the T5 generator directly instead.

I'm guessing having a ConditionalTextGeneration pipeline wouldn't be of much use then?

> @cyanic-selkie could you please share an example of a model you are trying to use that leverages added_tokens.json? I'd like to use it for testing purposes to ensure the implementation is correct.

Also, to add to this, I stumbled upon another model using the added_tokens.json file. This time it is a BART model. I played around with the latest rust_tokenizers commit (guillaume-be/rust-tokenizers@29a9911) and it seems not to work when adding the [HL] token, but it does work when adding an <HL> token. Just thought I'd mention it.

Thank you for adding this feature so quickly!

@guillaume-be (Owner)

Hello,

Yes - TextGenerationModel is a high-level wrapper (similar to the Python version) that allows adding a prefix (or automatically adds a prefix for XLNet). The LanguageGenerator should be used directly for conditional generation.

The implementation with T5 would fail at runtime because the ModelType is not supported by this pipeline:

_ => Err(RustBertError::InvalidConfigurationError(format!(
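
The fragment above is truncated; paraphrased, the shape of the check is roughly the following (a sketch, not the exact source):

use rust_bert::pipelines::common::ModelType;
use rust_bert::RustBertError;

// Paraphrased shape of the dispatch that rejects T5 (and other
// text-to-text models) in TextGenerationModel:
fn check_supported(model_type: ModelType) -> Result<(), RustBertError> {
    match model_type {
        ModelType::GPT2 | ModelType::XLNet | ModelType::Reformer => Ok(()),
        _ => Err(RustBertError::InvalidConfigurationError(format!(
            "{model_type:?} is not supported for causal text generation"
        ))),
    }
}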

I just tried adding the [HL] token for BART and it worked fine; do you have an example to help me reproduce?

extern crate anyhow;

use rust_bert::bart::{BartMergesResources, BartVocabResources};
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_tokenizers::tokenizer::TruncationStrategy;
use std::mem;

use rust_bert::resources::{RemoteResource, ResourceProvider};

fn main() -> anyhow::Result<()> {
    // Set up tokenizer resources
    let vocab_resource = RemoteResource::from_pretrained(BartVocabResources::DISTILBART_CNN_6_6);
    let merges_resource = RemoteResource::from_pretrained(BartMergesResources::DISTILBART_CNN_6_6);

    let mut tokenizer = TokenizerOption::from_file(
        ModelType::Bart,
        vocab_resource.get_local_path()?.to_str().unwrap(),
        Some(merges_resource.get_local_path()?.to_str().unwrap()),
        false,
        None,
        None,
    )?;
    tokenizer.add_tokens(&["[HL]"]);
    let mut output = tokenizer.encode_list(
        &["This is a [HL] test"],
        128,
        &TruncationStrategy::DoNotTruncate,
        0,
    );
    let token_ids = mem::take(&mut output[0].token_ids);
    println!("{token_ids:?}");

    Ok(())
}
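
If the added token is registered correctly, the printed ids should contain a single id for [HL] rather than a sequence of sub-word ids.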

@cyanic-selkie (Author)

I just tried it again and you appear to be correct. I don't know what I did or thought last night when I tried it.

In any case, when #354 gets merged, as far as I'm concerned, you may consider this issue closed.

@guillaume-be (Owner)

Fixed by #354
