[scripts] filter segment duration in vad_to_segments.sh #2447

Merged (3 commits) on May 26, 2018


francoishernandez
Contributor

I ran into the following issue: some of the subsegments created by vad_to_segments.sh were very short (e.g. 20 ms), which caused problems further down the pipeline.

I fixed it by filtering the subsegments by duration before writing them. I chose a threshold of 0.25, but we could perhaps add a --min-subsegment-length option.

@@ -58,7 +58,9 @@ if [ $stage -le 0 ]; then

   for n in `seq $nj`; do
     cat $sdata/$n/subsegments
-  done | sort > $data/subsegments || exit 1;
+  done | sort | \
+    awk '{if (! (NF != 4 || $4 - $3 <= 0.25)) { print $0 }}' \
Contributor

Sure, but I think it would make more sense to make the minimum length configurable as an option to the script.
You can pass it into awk using e.g. -v m=$min_duration.
Also, I'd prefer that you write the condition as an && expression instead of ! (... || ...).
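
To illustrate the suggestion, here is a minimal sketch of the filter with the threshold passed through `-v` and the condition rewritten as an `&&` expression. By De Morgan's law, `!(NF != 4 || $4 - $3 <= m)` is the same test as `NF == 4 && $4 - $3 > m`. The function name `filter_subsegments`, the sample lines, and the assumed field layout (`<subsegment-id> <utt-id> <start> <end>`) are illustrative assumptions, not the code that was merged.

```shell
#!/usr/bin/env bash
# Assumed threshold; 0.25 is the value hard-coded in the first version of the patch.
min_duration=0.25

# Keep only well-formed lines (4 fields) whose duration (end - start)
# exceeds the threshold. Hypothetical helper, not part of the Kaldi script.
filter_subsegments() {
  awk -v m="$min_duration" 'NF == 4 && ($4 - $3) > m'
}

# Made-up sample input: one too-short segment, one valid, one malformed.
printf '%s\n' \
  'utt1-0001 utt1 0.00 0.02' \
  'utt1-0002 utt1 0.50 1.75' \
  'utt1-0003 utt1 2.00' | filter_subsegments
```

Quoting `"$min_duration"` in the `-v` assignment keeps the call well-formed even if the variable is empty.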

Contributor Author

Sorry, the conditional was indeed bad. I fixed it and now pass the duration in as an option.

@@ -26,6 +27,7 @@ if [ $# -ne 2 ]; then
echo " --stage (0|1) # start script from part-way through"
echo " --cmd (run.pl|queue.pl...) # specify how to run the sub-processes"
echo " --segmentation-opts '--opt1 opt1val --opt2 opt2val' # options for segmentation.pl"
echo " --min-duration <m> # filtering out any generated subsegment with lower duration
Contributor

I'd change the comment to:
# min duration in seconds for segments (smaller ones are discarded)
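
As a rough sketch of how a `--min-duration` option with that meaning might be accepted, the snippet below hand-rolls the parsing so that it is self-contained. Kaldi scripts typically source `utils/parse_options.sh` for this instead. The function name `parse_min_duration` and the default of 0.25 are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan the arguments for --min-duration <m> and print
# the resulting value. Not the mechanism used by the real script.
parse_min_duration() {
  local min_duration=0.25  # assumed default: min duration in seconds for segments
  while [ $# -gt 0 ]; do
    case "$1" in
      --min-duration)
        # The option needs a value after it.
        [ $# -ge 2 ] || { echo "missing value for --min-duration" >&2; return 1; }
        min_duration="$2"
        shift 2
        ;;
      *)
        shift  # ignore positional arguments and other options in this sketch
        ;;
    esac
  done
  echo "$min_duration"
}

# Example call with made-up positional arguments.
parse_min_duration --min-duration 0.5 data/train exp/vad
```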

@danpovey
Contributor

I just realized that (I think) Vimal's VAD code has minimum durations configurable in its algorithms.
Is that what you are working on top of here? I'm just wondering why it produced such short segments. @vimalmanohar and @mmaciej2, perhaps you could comment?

@francoishernandez
Contributor Author

This happened when using the 'basic' energy-based VAD.

@danpovey
Contributor

Oh, OK. I think Vimal's tools output segments directly; this script wouldn't be involved.

@danpovey danpovey merged commit d6d49d0 into kaldi-asr:master May 26, 2018