[BUG] Tokenizer Fails on CommonVoice Japanese #575

NataliaShmueli · 2023-02-17T22:15:09Z

Debugging checklist

[x] Have you updated to latest MFA version?
[x] Have you tried rerunning the command with the --clean flag?

Describe the issue
A clear and concise description of what the bug is.
The tokenizer failed on the Japanese CommonVoice corpus, and it also failed when I ran it on a single speaker. When I moved that speaker's recordings into a folder I named JaTest, it worked. This issue only happens with CommonVoice, so it might be related to the length of the folder name, which was originally dbc3652a5a930b462947cfb0c88dd9ddb3ebe1c0cde73e7a020831c266f57ae464867e65ee452b1dbf2d034a39db03bab2773545ad809e2a2d209ed613492af8
For Reproducing your issue
Please fill out the following:

Corpus structure
- What language is the corpus in?
- Japanese
- How many files/speakers?
- 1518
- Are you using lab files or TextGrid files for input?
- .lab
Dictionary
- Are you using a dictionary from MFA? If so, which one?
- N/A
- If it's a custom dictionary, what is the phoneset?
- N/A
Acoustic model
- If you're using an acoustic model, is it one downloaded through MFA? If so, which one?
- japanese_mfa
- If it's a model you've trained, what data was it trained on?
- N/A

Log file
Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).
ja.log

Desktop (please complete the following information):

OS: [e.g. Windows, OSX, Linux]
Windows
Version [e.g. MacOSX 10.15, Ubuntu 20.04, Windows 10, etc]
10
Any other details about the setup (Cloud, Docker, etc)

Additional context
Add any other context about the problem here.
TL;DR might be an issue with the length or naming scheme of folders.

mmcauliffe · 2023-02-17T22:20:26Z

Yeah, so Windows has a maximum path length of 260 characters (https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#maximum-path-length-limitation), so if you have Common Voice nested in some deep folder structure, you'll hit this. You can move the directory somewhere closer to the drive root (e.g. C:/common_voice_jp) and it should work. I'll think about ways that MFA could get around it, but it is ultimately a Windows issue.

For reference, the path I use for it is D:\Data\speech\model_training_corpora\japanese\common_voice_ja
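
To check whether a corpus is running into the limit described above, a minimal sketch along these lines can list the paths that are too long. It is not part of MFA; `find_long_paths` and `WINDOWS_MAX_PATH` are hypothetical names used only for illustration. It also only measures the corpus as stored on disk, so any longer paths a tool creates while processing would not show up here.

```python
import os

# Classic Windows MAX_PATH value from the Microsoft page linked above.
# The 260 characters include the terminating null, so a path of 260 or
# more visible characters is already too long.
WINDOWS_MAX_PATH = 260


def find_long_paths(root, limit=WINDOWS_MAX_PATH):
    """Return (length, path) pairs for entries under root whose
    absolute path has at least `limit` characters, longest first."""
    too_long = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) >= limit:
                too_long.append((len(full), full))
    return sorted(too_long, reverse=True)


if __name__ == "__main__":
    # Hypothetical corpus location; replace with the real corpus root.
    for length, path in find_long_paths(r"C:\common_voice_jp"):
        print(length, path)
```

If the script prints nothing for the corpus itself, the long paths may only appear in working directories created during a run, which this sketch does not inspect.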

NataliaShmueli · 2023-02-17T22:31:11Z

Strangely enough, I don't think this has ever been an issue for training or aligning. I checked the length online and it was only 181 characters at most:

K:\Training_Models\Spoken\Japanese\CommonVoice\cv\ja\1af9f4b197c3b75b95b91661651d490a1ce31d182b462702bc7613842a00146835a16b7d7d28c1e0e8e366c41216e786cf8c155fcbdcaab3f8f7d99b4a9c09fe

NataliaShmueli · 2023-02-19T06:38:52Z

One more thing: it also refuses to tokenize corpora with Japanese folder names. I had a dataset folder named in Katakana, and renaming it to Romaji made it work. Not a major issue though!
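
As a quick way to spot folders that might need the renaming workaround above, the sketch below flags path components containing non-ASCII characters. The thread does not establish why such names fail, so this is only a detection aid; `non_ascii_parts` is a hypothetical helper, not an MFA function, and the example path is made up.

```python
from pathlib import PureWindowsPath


def non_ascii_parts(path):
    """Return the components of a Windows-style path that contain
    non-ASCII characters (for example Katakana folder names)."""
    return [part for part in PureWindowsPath(path).parts
            if not part.isascii()]


if __name__ == "__main__":
    # Hypothetical path with a Katakana folder name.
    print(non_ascii_parts(r"K:\Corpora\コーパス\speaker1"))
```

`PureWindowsPath` is used so the path is split the Windows way regardless of the operating system the check runs on.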

NataliaShmueli added the bug label Feb 17, 2023

NataliaShmueli assigned mmcauliffe Feb 17, 2023

mmcauliffe mentioned this issue Feb 17, 2023

2.2.4 #576

Merged

mmcauliffe closed this as completed in #576 Mar 7, 2023
