Debugging checklist
[x] Have you updated to the latest MFA version?
[x] Have you tried rerunning the command with the --clean flag?
Describe the issue
A clear and concise description of what the bug is.
The tokenizer failed on Japanese CommonVoice. It also failed when I tried it on a single speaker. It only worked once I moved that speaker's test recordings to a folder that I named JaTest. This issue only happens with CommonVoice, so it might be related to the length of the folder name, which was originally dbc3652a5a930b462947cfb0c88dd9ddb3ebe1c0cde73e7a020831c266f57ae464867e65ee452b1dbf2d034a39db03bab2773545ad809e2a2d209ed613492af8
For Reproducing your issue
Please fill out the following:
Corpus structure
What language is the corpus in?
Japanese
How many files/speakers?
1518
Are you using lab files or TextGrid files for input?
.lab
Dictionary
Are you using a dictionary from MFA? If so, which one?
N/A
If it's a custom dictionary, what is the phoneset?
N/A
Acoustic model
If you're using an acoustic model, is it one download through MFA? If so, which one?
japanese_mfa
If it's a model you've trained, what data was it trained on?
N/A
Log file
Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA). ja.log
Desktop (please complete the following information):
OS: [e.g. Windows, OSX, Linux]
Windows
Version [e.g. MacOSX 10.15, Ubuntu 20.04, Windows 10, etc]
10
Any other details about the setup (Cloud, Docker, etc)
Additional context
Add any other context about the problem here.
TL;DR: this might be an issue with the length or naming scheme of the folders.
Yeah, so Windows has a maximum path length of 260 characters (https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#maximum-path-length-limitation), so if you have Common Voice nested in some deep folder structure, you'll hit this. You can move the directory somewhere closer to the drive root (e.g. C:/common_voice_jp) and it should work. I'll think about ways that MFA could get around it, but it is ultimately a Windows issue.
For reference, the path I use for it is D:\Data\speech\model_training_corpora\japanese\common_voice_ja
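To see whether a corpus is close to that limit before moving it, the paths can be measured directly. The following is only a rough sketch and not part of MFA: `find_long_paths` and `WINDOWS_MAX_PATH` are hypothetical names, and it only measures the corpus folder itself, not any files a tool might write elsewhere while processing it.

```python
import os

# Classic Windows MAX_PATH value; per the Microsoft docs this count
# includes the terminating null character.
WINDOWS_MAX_PATH = 260


def find_long_paths(root, limit=WINDOWS_MAX_PATH):
    """Return (length, path) pairs for entries whose absolute path reaches `limit`.

    Hypothetical helper for illustration; results are sorted longest first.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) >= limit:
                hits.append((len(full), full))
    return sorted(hits, reverse=True)
```

Calling `find_long_paths(r"C:\common_voice_jp")` returns an empty list if nothing in that folder reaches the limit. Passing a smaller `limit` shows how much headroom is left.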
Strangely enough, I don't think this has ever been an issue for training/aligning. I checked the length online and it was only 181 characters at most.
One more thing: it also refuses to tokenize corpora with Japanese folder names. I had a dataset folder named in Katakana, and renaming it to Romaji made it work. Not a major issue though!
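For anyone hitting the same thing, a quick way to spot which folders might need renaming is to look for non-ASCII path components. This is a minimal sketch, assuming the problem is tied to non-ASCII characters in the path (the thread does not confirm the actual cause); `non_ascii_parts` is a hypothetical helper, and the folder name in the usage note is made up.

```python
from pathlib import PureWindowsPath


def non_ascii_parts(path):
    """Return the components of a Windows-style path that contain non-ASCII characters.

    Hypothetical helper for illustration; it inspects the path string only
    and does not touch the file system.
    """
    return [part for part in PureWindowsPath(path).parts if not part.isascii()]
```

For a path such as `D:/Data/コーパス/clips`, this returns the single component `コーパス`, which would be the folder to try renaming. An all-ASCII path gives an empty list.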