[BUG] Tokenizer Fails on CommonVoice Japanese #575

Closed
NataliaShmueli opened this issue Feb 17, 2023 · 3 comments · Fixed by #576

Comments

@NataliaShmueli

Debugging checklist

[x] Have you updated to latest MFA version?
[x] Have you tried rerunning the command with the --clean flag?

Describe the issue
A clear and concise description of what the bug is.
The tokenizer failed on the Japanese CommonVoice corpus, and it also failed when I tried it on a single speaker. When I finally moved that speaker's recordings to a folder that I named JaTest, it worked. This issue only happens with CommonVoice, so it might be related to the length of the folder name, which was originally dbc3652a5a930b462947cfb0c88dd9ddb3ebe1c0cde73e7a020831c266f57ae464867e65ee452b1dbf2d034a39db03bab2773545ad809e2a2d209ed613492af8
For Reproducing your issue
Please fill out the following:

  1. Corpus structure
    • What language is the corpus in?
    • Japanese
    • How many files/speakers?
    • 1518
    • Are you using lab files or TextGrid files for input?
    • .lab
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one?
    • N/A
    • If it's a custom dictionary, what is the phoneset?
    • N/A
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one?
    • japanese_mfa
    • If it's a model you've trained, what data was it trained on?
    • N/A

Log file
Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).
ja.log

Desktop (please complete the following information):

  • OS: [e.g. Windows, OSX, Linux]
  • Windows
  • Version [e.g. MacOSX 10.15, Ubuntu 20.04, Windows 10, etc]
  • 10
  • Any other details about the setup (Cloud, Docker, etc)

Additional context
Add any other context about the problem here.
TL;DR: this might be an issue with the length or naming scheme of the folders.

@mmcauliffe
Member

mmcauliffe commented Feb 17, 2023

Yeah, so Windows has a maximum path length of 260 characters (https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#maximum-path-length-limitation), so if you have Common Voice nested in some deep folder structure, you'll hit this. You can move the directory somewhere closer to the drive root (e.g. C:/common_voice_jp) and it should work. I'll think about ways that MFA could get around it, but it is ultimately a Windows issue.

For reference, the path I use for it is D:\Data\speech\model_training_corpora\japanese\common_voice_ja
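
To see whether a corpus is close to the limit described above, a short script can walk the folder and list the longest paths. This is a minimal sketch and not part of MFA: the function name `find_long_paths` and its `headroom` parameter are made up for illustration, and the check assumes the classic 260-character MAX_PATH limit (which counts the terminating null, so a 260-character path is already too long).

```python
import os

# Classic Windows MAX_PATH. The limit includes the terminating null character,
# so the longest usable path is 259 characters.
WINDOWS_MAX_PATH = 260


def find_long_paths(root, limit=WINDOWS_MAX_PATH, headroom=0):
    """Return (length, path) pairs under root whose absolute path length,
    plus an optional headroom, reaches the limit. Longest paths come first.

    `headroom` is a hypothetical safety margin for anything appended to the
    path later (for example, a file name inside a speaker folder).
    """
    offenders = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) + headroom >= limit:
                offenders.append((len(full), full))
    return sorted(offenders, reverse=True)
```

Calling `find_long_paths(corpus_dir, headroom=40)` on the corpus folder would then flag entries that leave fewer than 40 characters to spare; the value 40 is arbitrary and only shows how the margin is used.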

@NataliaShmueli
Author

NataliaShmueli commented Feb 17, 2023

Strangely enough, I don't think this has ever been an issue for training/aligning. I checked the length online, and it was only 181 characters at most.

K:\Training_Models\Spoken\Japanese\CommonVoice\cv\ja\1af9f4b197c3b75b95b91661651d490a1ce31d182b462702bc7613842a00146835a16b7d7d28c1e0e8e366c41216e786cf8c155fcbdcaab3f8f7d99b4a9c09fe
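
The 181-character figure can be reproduced with a few lines of arithmetic. The sketch below swaps the 128-character hash folder name for a same-length placeholder to keep it short, and it assumes the 260-character MAX_PATH limit, which includes the terminating null. It only measures the input folder; whatever else is appended to this path during tokenization is not stated in the thread.

```python
MAX_PATH = 260  # Windows limit; includes the terminating null character

# Folder path from the comment above, with the 128-character hash folder
# name replaced by a same-length placeholder.
prefix = "K:\\Training_Models\\Spoken\\Japanese\\CommonVoice\\cv\\ja\\"
speaker_dir = prefix + "x" * 128

usable = MAX_PATH - 1                      # characters available for the path itself
headroom = usable - len(speaker_dir) - 1   # minus one for the separator before a file name

print(len(speaker_dir))  # 181
print(headroom)          # 77
```

So the speaker folder on its own is under the limit, with 77 characters left for a file name inside it.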

@mmcauliffe mmcauliffe mentioned this issue Feb 17, 2023
@NataliaShmueli
Author

One more thing: it's refusing to tokenize corpora with Japanese names. I had a dataset folder named in Katakana, and renaming it to Romaji made it work. Not a major issue though!
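
As a rough way to spot folders that might trigger the naming problem described above, the path components can be checked for non-ASCII characters. This is a hedged sketch, not an MFA feature: `non_ascii_parts` is a hypothetical helper, the example folder name is invented, and the thread does not confirm that non-ASCII characters in general (rather than something specific to the Katakana name) are the cause.

```python
from pathlib import PureWindowsPath


def non_ascii_parts(path):
    """Return the components of a Windows-style path that contain non-ASCII characters.

    PureWindowsPath accepts both forward slashes and backslashes and does not
    touch the file system, so this also runs on non-Windows machines.
    """
    return [part for part in PureWindowsPath(path).parts if not part.isascii()]


# Hypothetical corpus location with a Katakana folder name.
print(non_ascii_parts("K:/corpora/コーパス/speaker1"))
```

Any component the helper returns is a candidate for renaming to Romaji, which is the workaround reported in the comment.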
