
Accuracy difference #3

Closed
kbramhendra opened this issue Oct 11, 2022 · 5 comments

Comments

@kbramhendra

Hi,
I am trying to use the Riva decoder as a substitute for the k2 decoder. With the Riva decoder I am not getting the same accuracy as with the k2 decoder: there are many more deletions, at least 12%. I experimented with different hyperparameters such as acoustic_scale and max_active_states, but the results do not change much. I have also tried different topologies (eesen, compact), and it is the same with all of them. Can you please help with this?

@messiaen

@kbramhendra Thanks for the report. I've passed this on to our team internally.

@galv
Collaborator

galv commented Oct 19, 2022

@kbramhendra what value have you set for "max_expand"?

We have it set to 10 here:

config.online_opts.lattice_postprocessor_opts.max_expand = 10

I have noticed that disabling this (setting it to 0) does improve WER. However, I am not certain whether this is your issue based on what you have told me.

This option exists to account for an explosion of the state space during the depth-first search in the "word alignment" algorithm, which can happen in rare circumstances.

Now, word alignment isn't necessary, strictly speaking, so I could consider disabling it, but I am still looking into what the "right" solution is here.
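
For reference, a minimal sketch of turning the limit off; the construction of config is assumed to match this repository's usual setup, and only the max_expand attribute path shown above is taken from this thread:

# Assumed: `config` is the same decoder config object as above (its setup is omitted here).
# Setting max_expand to 0 removes the cap on state-space expansion during word alignment,
# at the cost of potentially very slow worst cases.
config.online_opts.lattice_postprocessor_opts.max_expand = 0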

galv added a commit to galv/kaldi that referenced this issue Oct 19, 2022
For CTC models using word pieces or graphemes, there is not enough
positional information to use the word alignment.

I tried marking every unit as "singleton" in word_boundary.txt, but this
explodes the state space very often. See:

nvidia-riva/riva-asrlib-decoder#3

With the "_" character in CTC models that predict word pieces, we at the
very least know which word pieces begin a word and which ones are in the
middle or at the end of a word, but the algorithm would still need to be
rewritten, especially since "blank" is not a silence phoneme (it can
appear within words, not only between them).

I did look into using the lexicon-based word alignment. I don't have a
specific complaint about it, but I did get a strange error where it
couldn't create a final state at all in the output lattice, which
caused Connect() to output an empty lattice. This may be because I
wasn't quite sure how to handle the blank token. I treat it as its own
phoneme because of limitations in TransitionInformation, but this
doesn't really make sense.

Needless to say, while the CTM outputs of the cuda decoder will be
correct from a WER point of view, their timestamps won't be correct;
but they probably never were in the first place for CTC models.
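
As an aside, a hedged illustration of what "marking every unit as singleton" means in Kaldi's word_boundary.txt format; the unit names are invented for the example, and this is not the file the author actually used:

# Illustration only: write a word_boundary.txt that assigns every CTC unit the
# "singleton" word-position category (each unit treated as a complete word).
# The blank token gets "nonword", the closest Kaldi category, even though
# blank does not behave like a silence phone.
units = ["<blk>", "_the", "_cat", "s", "_sat"]
with open("word_boundary.txt", "w", encoding="utf-8") as f:
    for unit in units:
        category = "nonword" if unit == "<blk>" else "singleton"
        f.write(f"{unit} {category}\n")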
galv added a commit that referenced this issue Oct 19, 2022
@galv
Collaborator

galv commented Oct 19, 2022

@kbramhendra can you check if the branch in #4 fixes your issue? I believe it should based on my own internal testing.

To disable word alignment, make sure not to set config.online_opts.lattice_postprocessor_opts.word_boundary_rxfilename to anything other than the empty string. See here:

3aaa0e1#diff-c80f4904c78bc561ce1235944f91d4847817445e810c4d7a0064453503e0c7f3L160

Basically, word alignment would sometimes fail to complete when the max_expand option was set. The returned lattice would then be missing paths that were present in the input lattice. Sometimes these missing paths were the best-cost paths, and sometimes not even a single path had been completed by the time the max_expand limit took effect. This explains why the errors (at least for me) were only deletions and substitutions, not insertions.
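
A minimal sketch of the recommended setting, assuming the same config object as in the earlier comment (only the attribute path is taken from this thread):

# Leave word_boundary_rxfilename empty so the word-alignment step is skipped
# entirely once the branch in #4 is used; max_expand then no longer affects
# accuracy, since it only applies during word alignment.
config.online_opts.lattice_postprocessor_opts.word_boundary_rxfilename = ""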

galv added a commit to galv/kaldi that referenced this issue Oct 20, 2022
galv added a commit that referenced this issue Oct 20, 2022
@kbramhendra
Author

kbramhendra commented Oct 20, 2022

@galv Thank you very much for taking the time to help with this. Yes, it did help, and the deletions are reduced significantly now; there is only a ~1.5% difference between k2 and Riva (in deletions only). I will check whether anything is missing on my side. I highly appreciate your help, I was stuck on this for some time. It's a great help.

galv closed this as completed in 06dec3f on Oct 21, 2022
jtrmal pushed a commit to kaldi-asr/kaldi that referenced this issue Dec 13, 2022
* Remove unused variable.

* cudadecoder: Make word alignment optional.

galv added a commit to galv/kaldi that referenced this issue Dec 13, 2022
@jinggaizi

@kbramhendra Hi, I am also trying to use this decoder tool. How can I use it for CTC+TLG?
