Incorrect behaviour of generating OOV during validation. #363

Closed
AndreyBocharnikov opened this issue Dec 3, 2021 · 4 comments

@AndreyBocharnikov

Hello, and thank you for your excellent work.

I have a paired txt-wav sample with the text "dear customer,welcome to our ship." Because the space around the comma is missing, the "word" customer,welcome is not in the dictionary downloaded via mfa model download dictionary english, so it should be reported as OOV, but it is not. After running mfa validate wrong_sample english english, the log at /root/Documents/MFA/wrong_sample/validate.log says "There were no missing words from the dictionary", which seems to be a bug. That this "word" is not being taken into account can be seen with mfa align wrong_sample english english wrong_sample_result: the resulting phonemes are "D IH1 R", then a blank for more than a second, and then "T UW1 AW1 ER0 SH IH1 P".

The wrong alignment itself can be worked around with mfa g2p english_g2p wrong_sample wrong_sample_g2p, which does generate customer,welcome K AH1 S T AH0 M ER0 W EH1 L K AH0 M, but the fact that mfa validate doesn't generate an OOV file for that sample seems wrong.
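
To illustrate what I expected validation to report, here is a minimal sketch of an OOV check. It is not MFA's implementation: the helper names load_dictionary_words and find_oovs are made up for this example, the tokenization (split on whitespace, strip punctuation from the edges only) is an assumption, and the toy dictionary entries are illustrative rather than copied from english.dict.

```python
def load_dictionary_words(lines):
    """Collect the first whitespace-separated field of each non-empty line."""
    words = set()
    for line in lines:
        fields = line.split()
        if fields:
            words.add(fields[0].lower())
    return words


def find_oovs(transcript, dictionary_words):
    """Return tokens (edge punctuation stripped) that are missing from the dictionary."""
    oovs = []
    for token in transcript.lower().split():
        # Only punctuation at the edges is removed, so "customer,welcome" stays one token.
        word = token.strip(".,!?;:\"")
        if word and word not in dictionary_words:
            oovs.append(word)
    return oovs


# Toy dictionary in "word PHONE PHONE ..." form (illustrative entries, an assumption).
toy_dict = [
    "dear D IH1 R",
    "customer K AH1 S T AH0 M ER0",
    "welcome W EH1 L K AH0 M",
    "to T UW1",
    "our AW1 ER0",
    "ship SH IH1 P",
]
words = load_dictionary_words(toy_dict)
print(find_oovs("dear customer,welcome to our ship.", words))
```

Under these assumptions the fused token customer,welcome is the only one flagged, and adding the space after the comma makes the list empty.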

With love, looking forward to your reply :)

@AndreyBocharnikov
Author

I found out that my dataset contains more words that are not in /root/Documents/MFA/pretrained_models/dictionary/english.dict, but oovs_found.txt was still not generated after running mfa validate dataset english english.

I installed MFA following the installation page in the documentation:
conda create -n aligner -c conda-forge montreal-forced-aligner
mfa model download acoustic english
mfa model download dictionary english
and then aligned the dataset with mfa align dataset english english dataset_result.
The dataset is in the correct format, with the right number of speakers and utterances.
Am I doing something wrong?

The following words are not in english.dict, so they were not transcribed to phonemes. Instead, the resulting .TextGrid file has an empty "" interval with a long duration (~1 second) in place of each OOV word:
LOUDING, PIGTAIL'S, BUNBURYED, TV, EEYORE, MIDYEAR, PEPER, PIGLIT
Hope this helps.
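
For anyone who wants to spot these gaps without opening every file, here is a rough sketch of how the long empty intervals could be found once a tier has been read into (start, end, label) tuples. Reading the .TextGrid itself is not shown; the function name find_blank_gaps, the 0.5 second threshold, and the sample tier are all assumptions made for illustration, not values from my data.

```python
def find_blank_gaps(intervals, min_duration=0.5):
    """Return (start, end) pairs for empty-label intervals lasting at least min_duration seconds."""
    return [
        (start, end)
        for start, end, label in intervals
        if label.strip() == "" and end - start >= min_duration
    ]


# Hypothetical phone tier: "D IH1 R", then a blank of more than a second, then "T UW1".
phones = [
    (0.00, 0.10, "D"),
    (0.10, 0.20, "IH1"),
    (0.20, 0.30, "R"),
    (0.30, 1.50, ""),
    (1.50, 1.60, "T"),
    (1.60, 1.70, "UW1"),
]
print(find_blank_gaps(phones))
```

Genuine pauses between words would be flagged as well, so the result is only a list of candidates to check against the transcript.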

Please fix it or tell me what I am doing wrong.

@mmcauliffe
Member

Yeah that seems weird, I'll take a look. Is this with the most recent version (released last night)?

@Hocine958

Same here: OOV words are not detected by mfa validate, and the aligned phonemes are just an empty string spanning the OOV word's time interval.
Of the versions I tested, the problem is present in 2.0.0b7 and 2.0.0b8.
I also tested 2.0.0b4, where validation worked but alignment did not.

@AndreyBocharnikov
Author

I ran cat /root/miniconda3/envs/aligner/lib/python3.9/site-packages/montreal_forced_aligner/_version.py
and it returned:
# coding: utf-8
# file generated by setuptools_scm
# don't change, don't track in version control
version = '2.0.0b7'
version_tuple = (2, 0, 0)
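
Note that version_tuple drops the beta suffix, so only the version string tells b7 apart from b8. As a small sketch, the string can be pulled out of the file contents and compared against the versions reported as affected in this thread. The function name read_version and the AFFECTED set are my own naming for this example; the set only reflects what was reported above, not an official list.

```python
import re

# Versions reported in this thread as not detecting OOVs (an assumption, not an official list).
AFFECTED = {"2.0.0b7", "2.0.0b8"}


def read_version(version_file_text):
    """Extract the version string from the text of a setuptools_scm-style _version.py."""
    match = re.search(
        r"^version\s*=\s*['\"]([^'\"]+)['\"]", version_file_text, re.MULTILINE
    )
    return match.group(1) if match else None


sample = """# coding: utf-8
# file generated by setuptools_scm
# don't change, don't track in version control
version = '2.0.0b7'
version_tuple = (2, 0, 0)
"""
version = read_version(sample)
print(version, version in AFFECTED)
```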
