
Use a regular expression to tokenize lexicon.txt #2676

Merged
danpovey merged 1 commit into kaldi-asr:master on Sep 4, 2018

Conversation

@alumae (Contributor) commented Sep 4, 2018

Some UTF-8 characters (for example Š) are interpreted in latin-1 as containing whitespace. For example, Š consists of bytes c5 and a0 in UTF-8, but a0 corresponds to non-breaking space in latin-1. This means that words containing Š are not tokenized correctly when reading lexicon.txt.

This change ensures that lexicon.txt is tokenized using only space or tab characters.
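Below is a minimal sketch of the problem and of the regex-based fix, assuming Python 3 and the latin-1 reading convention used by the Kaldi lexicon scripts; the word "Šokk" and the exact split call are illustrative, not the literal diff:

import re

# "Š" is the UTF-8 byte pair c5 a0; decoded as latin-1 it becomes "Å" followed by
# U+00A0, the non-breaking space.
line = "Šokk S o k k".encode("utf-8").decode("latin-1")

# The default split() treats U+00A0 as whitespace, so the word is torn apart:
line.split()                # ['Å', 'okk', 'S', 'o', 'k', 'k']

# Splitting only on ASCII space and tab keeps the word as one token:
re.split(r"[ \t]+", line)   # ['Å\xa0okk', 'S', 'o', 'k', 'k']

Re-encoding the first token back to latin-1 bytes recovers the original UTF-8 sequence for "Šokk", which is why reading the lexicon as latin-1 works as long as the tokenizer never splits inside a word.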

@danpovey (Contributor) commented Sep 4, 2018

Thanks. @xiaohui-zhang could you please make a PR for the silprob version of that script? I'm concerned @alumae may not be online.

@danpovey merged commit 7a5398e into kaldi-asr:master on Sep 4, 2018
danpovey added a commit to danpovey/kaldi that referenced this pull request on Sep 4, 2018
@xiaohui-zhang (Contributor) commented Sep 6, 2018 via email

@danpovey (Contributor) commented Sep 6, 2018

@xiaohui-zhang it's OK now, I did it...

@xiaohui-zhang (Contributor) commented Sep 6, 2018 via email

Skaiste pushed a commit to Skaiste/idlak that referenced this pull request on Sep 26, 2018
… tokenizing lexicon (kaldi-asr#2676)

Skaiste pushed a commit to Skaiste/idlak that referenced this pull request on Sep 26, 2018