Clarify use of unk_token in slow tokenizers' docstrings #9875

ethch18 · 2021-01-28T20:23:16Z

What does this PR do?

Currently, the docstrings for slow tokenizers' tokenize() method claim that unknown tokens will be left in place, in contrast to the fast tokenizers' behavior. In reality, both convert unknown tokens to unk_token.
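
For illustration only, here is a toy sketch of the behavior the corrected docstrings describe. It is not the transformers implementation: the `tokenize` function, the greedy WordPiece-style matching, and the tiny vocabulary are all hypothetical, chosen just to show a word with no match in the vocabulary coming back as unk_token rather than being left in place.

```python
def tokenize(text, vocab, unk_token="[UNK]"):
    """Toy greedy longest-match tokenizer (hypothetical, for illustration).

    Any whitespace-separated word that cannot be fully split into pieces
    found in `vocab` is replaced by `unk_token`.
    """
    tokens = []
    for word in text.split():
        pieces = []
        start = 0
        while start < len(word):
            end = len(word)
            piece = None
            # Try the longest remaining substring first, then shrink it.
            while start < end:
                candidate = word[start:end]
                if start > 0:
                    # Continuation pieces carry a "##" prefix.
                    candidate = "##" + candidate
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                # No piece matched: the whole word becomes unk_token.
                pieces = [unk_token]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens


# Made-up vocabulary, assumed only for this example.
vocab = {"hello", "token", "##izer", "##s"}
result = tokenize("hello tokenizers 🤗", vocab)
print(result)
```

The emoji is not in the toy vocabulary, so it is converted to unk_token, while the known words are split into pieces as usual.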

Fixes #9714

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@LysandreJik

LysandreJik

LGTM! Thanks!

Clarify use of unk_token in tokenizer docstrings

30727d2

LysandreJik approved these changes Jan 29, 2021


LysandreJik merged commit 99b9aff into huggingface:master Jan 29, 2021

Qbiwan pushed a commit to Qbiwan/transformers that referenced this pull request Jan 31, 2021

Clarify use of unk_token in tokenizer docstrings (huggingface#9875)

fbceea3
