Fix convert_token_type_ids_from_sequences for fast tokenizers #4503

n1t0 · 2020-05-21T17:10:40Z

Before this fix, calling convert_token_type_ids_from_sequences on a PreTrainedTokenizerFast ended up in the generic version from tokenizer_utils, so the type IDs for the special tokens were not included.
There is currently no way to get this information from the Rust tokenizers, so we reuse the implementation from the original Python tokenizers. Tests are added as well.
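
To illustrate the difference described above, here is a minimal standalone sketch. It is not the code from this PR: the function names and the special-token IDs (`CLS_ID`, `SEP_ID`) are hypothetical, and the "generic" function is only an assumption about what a version that ignores special tokens would return. The BERT-style layout assumed here is `[CLS] A [SEP]` for one sequence and `[CLS] A [SEP] B [SEP]` for a pair.

```python
from typing import List, Optional

# Placeholder special-token IDs, chosen only for illustration.
CLS_ID = 101
SEP_ID = 102


def generic_token_type_ids(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Hypothetical generic version: covers the raw tokens only."""
    if token_ids_1 is None:
        return [0] * len(token_ids_0)
    return [0] * len(token_ids_0) + [1] * len(token_ids_1)


def bert_style_token_type_ids(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """BERT-style version: also assigns a type ID to [CLS] and each [SEP]."""
    cls = [CLS_ID]
    sep = [SEP_ID]
    # First segment: [CLS] + A + [SEP], all type 0.
    first = [0] * len(cls + token_ids_0 + sep)
    if token_ids_1 is None:
        return first
    # Second segment: B + [SEP], all type 1.
    return first + [1] * len(token_ids_1 + sep)


pair_generic = generic_token_type_ids([7, 8], [9])
pair_bert = bert_style_token_type_ids([7, 8], [9])
print(pair_generic)  # [0, 0, 1]
print(pair_bert)     # [0, 0, 0, 0, 1, 1]
```

For a pair, the BERT-style result is three entries longer than the generic one, matching the three special tokens that the generic version leaves out.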

Thanks @dirkgr for reporting this.

codecov-commenter · 2020-05-21T17:16:42Z

Codecov Report

Merging #4503 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4503      +/-   ##
==========================================
+ Coverage   77.83%   77.86%   +0.02%     
==========================================
  Files         123      123              
  Lines       20514    20526      +12     
==========================================
+ Hits        15968    15982      +14     
+ Misses       4546     4544       -2

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/transformers/tokenization_bert.py | `95.00% <100.00%> (+0.12%)` | ⬆️ |
| src/transformers/tokenization_roberta.py | `94.52% <100.00%> (+0.49%)` | ⬆️ |
| src/transformers/modeling_tf_utils.py | `88.66% <0.00%> (+0.16%)` | ⬆️ |
| src/transformers/file_utils.py | `73.85% <0.00%> (+0.41%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a086527...795f44a.

LysandreJik

LGTM, thanks @n1t0

Fix convert_token_type_ids_from_sequences for fast tokenizers

795f44a

n1t0 force-pushed the fix-convert-typeids-fast branch from 733529d to 795f44a Compare May 21, 2020 17:11

n1t0 requested review from LysandreJik and mfuntowicz May 21, 2020 17:12

mfuntowicz approved these changes May 21, 2020

LysandreJik approved these changes May 22, 2020

LysandreJik merged commit 35df911 into master May 22, 2020

LysandreJik deleted the fix-convert-typeids-fast branch May 22, 2020 16:45
