
Use a regular expression to tokenize lexicon.txt #2676

Merged
danpovey merged 1 commit into kaldi-asr:master on Sep 4, 2018

Conversation

@alumae (Contributor) commented Sep 4, 2018

Some UTF-8 characters (for example Š) are interpreted in latin-1 as containing whitespace. For example, Š consists of bytes c5 and a0 in UTF-8, but a0 corresponds to non-breaking space in latin-1. This means that words containing Š are not tokenized correctly when reading lexicon.txt.

This change ensures that lexicon.txt is tokenized using only space or tab characters.
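Below is a minimal sketch of the problem and of the regex-based fix, assuming Python 3 and the latin-1 reading convention used by the Kaldi lexicon scripts; the word "Šokk" and the exact split call are illustrative, not the literal diff:

import re

# "Š" is the UTF-8 byte pair c5 a0; decoded as latin-1 it becomes "Å" followed by
# U+00A0, the non-breaking space.
line = "Šokk S o k k".encode("utf-8").decode("latin-1")

# The default split() treats U+00A0 as whitespace, so the word is torn apart:
line.split()                # ['Å', 'okk', 'S', 'o', 'k', 'k']

# Splitting only on ASCII space and tab keeps the word as one token:
re.split(r"[ \t]+", line)   # ['Å\xa0okk', 'S', 'o', 'k', 'k']

Re-encoding the first token back to latin-1 bytes recovers the original UTF-8 sequence for "Šokk", which is why reading the lexicon as latin-1 works as long as the tokenizer never splits inside a word.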

@danpovey (Contributor) commented Sep 4, 2018

Thanks. @xiaohui-zhang could you please make a PR for the silprob version of that script? I'm concerned @alumae may not be online.

@danpovey merged commit 7a5398e into kaldi-asr:master on Sep 4, 2018
danpovey added a commit to danpovey/kaldi that referenced this pull request on Sep 4, 2018
@xiaohui-zhang (Contributor) commented Sep 6, 2018 via email

@danpovey (Contributor) commented Sep 6, 2018

@xiaohui-zhang it's OK now, I did it...

@xiaohui-zhang (Contributor) commented Sep 6, 2018 via email

Skaiste pushed a commit to Skaiste/idlak that referenced this pull request on Sep 26, 2018
… tokenizing lexicon (kaldi-asr#2676)

Skaiste pushed a commit to Skaiste/idlak that referenced this pull request on Sep 26, 2018