-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
One to two frames are discarded at the end of the signal #32
Comments
Thank you very much for the detailed explanation, and the fix for it. But pySTOI has been used for years now, and the unit tests to compare to Matlab's STOI were OK, so I won't be accepting the change. It does not seem feasible for me to change the source code of a metric that has been used for a long time. I hope you understand. |
I understand To get an idea of how big the error can be in a realistic setting, I took a clean speech audio file and corrupted the last two frames with noise. The files are attached at the end of this message (couldn't figure out how to embed them directly, is it even possible?). The clean sentence can be heard as "she ha" (it's a cropped "she had"). The noise at the end makes it difficult to understand the vowel "a" such that the corrupted sentence is heard as "she h". Currently the corrupted file obtains the same score as the clean file i.e. from pystoi.stoi import stoi
import soundfile as sf
x, fs = sf.read('x.wav') # clean
y, fs = sf.read('y.wav') # corrupted
print(stoi(x, x, 10000)) # 0.9999999999999982
print(stoi(x, y, 10000)) # 0.9999999999999982 currently, 0.9262179481416362 with the fix Obviously the difference gets smaller as the length of the signal increases but ignoring this is still conceptually wrong.
Yet this was not detected, even though the difference in this example is much larger than the Ultimately the decision is yours. If you decide not to fix this I would at the very least mention it in README.md and link to this issue, since again this makes this implementation different from the papers. |
It is significant, and I appreciate your efforts to debug our indexing mistake. I'll give it a bit more thought. |
Hey @philgzl, I actually ran the tests again after #34, and it fails now. You can use this to check it. sudo apt update
sudo apt install octave octave-signal
pip install oct2py python -m pytest tests/test_python_octave.py
python -m pytest tests/test_stoi_octave.py In the mean time, I think I'll revert the commit. |
Sorry, I didn't pay close enough attention to the issue, but actually the indexing was right in the past. I suggest you test against the real Matlab implementation, and if there is a problem we'll fix it. |
Indeed, my bad. The last frames are actually also being discarded in the MATLAB implementation. Example: Signal length 8, frame length 4, hop length 2. Should fit 3 frames, with start indexes [1, 3, 5] in MATLAB. But this MATLAB line returns [1, 3]. So I guess we just need to accept this has always been the case. |
Thank you very much for your contributions anyway ! |
On two occasions, what seems to be remnants of a faulty MATLAB translation is causing up to 2 frames to be discarded at the end of the signal.
In
remove_silent_frames
:pystoi/pystoi/utils.py
Lines 148 to 151 in 9ff1cfa
The indexing variable should iterate
range(0, len(x) - framelen + 1, hop)
instead. Currently the last frame is discarded if the signal fits an integer number of frames. The corresponding MATLAB line is:pystoi/tests/matlab/removeSilentFrames.m
Line 6 in 9ff1cfa
But in MATLAB the variable can reach the end value as opposed to Python's
range
, so here we need the+1
.Similarly in
stft
:pystoi/pystoi/utils.py
Lines 98 to 99 in 9ff1cfa
The indexing variable should iterate
range(0, len(x) - win_size + 1, hop)
for the same reason. The corresponding MATLAB line is:pystoi/tests/matlab/stdft.m
Line 3 in 9ff1cfa
Note that since
remove_silent_frames
removes trailing samples if they do not fit a frame, the subsequent_overlap_and_add
always produces a signal that fits an integer number of frames. Because of that,stft
ALWAYS discards one frame! In conclusion, the last two frames are discarded if the signal fits an integer number of frames, else only the last one.This can be easily verified:
A breakpoint here shows the shape of
x_spec
is(257, 29)
instead of(257, 31)
, i.e. two frames are discarded. Fixing the two indexings produces the expected shape.The text was updated successfully, but these errors were encountered: