Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One to two frames are discarded at the end of the signal #32

Closed
philgzl opened this issue Apr 12, 2023 · 7 comments · Fixed by #33
Closed

One to two frames are discarded at the end of the signal #32

philgzl opened this issue Apr 12, 2023 · 7 comments · Fixed by #33

Comments

@philgzl
Copy link
Contributor

philgzl commented Apr 12, 2023

On two occasions, what seems to be remnants of a faulty MATLAB translation is causing up to 2 frames to be discarded at the end of the signal.

  1. In remove_silent_frames:

    pystoi/pystoi/utils.py

    Lines 148 to 151 in 9ff1cfa

    x_frames = np.array(
    [w * x[i:i + framelen] for i in range(0, len(x) - framelen, hop)])
    y_frames = np.array(
    [w * y[i:i + framelen] for i in range(0, len(x) - framelen, hop)])

    The indexing variable should iterate range(0, len(x) - framelen + 1, hop) instead. Currently the last frame is discarded if the signal fits an integer number of frames. The corresponding MATLAB line is:
    frames = 1:K:(length(x)-N);

    But in MATLAB the variable can reach the end value as opposed to Python's range, so here we need the +1.

  2. Similarly in stft:

    pystoi/pystoi/utils.py

    Lines 98 to 99 in 9ff1cfa

    stft_out = np.array([np.fft.rfft(w * x[i:i + win_size], n=fft_size)
    for i in range(0, len(x) - win_size, hop)])

    The indexing variable should iterate range(0, len(x) - win_size + 1, hop) for the same reason. The corresponding MATLAB line is:
    frames = 1:K:(length(x)-N);

Note that since remove_silent_frames removes trailing samples if they do not fit a frame, the subsequent _overlap_and_add always produces a signal that fits an integer number of frames. Because of that, stft ALWAYS discards one frame! In conclusion, the last two frames are discarded if the signal fits an integer number of frames, else only the last one.

This can be easily verified:

import numpy as np
from pystoi.stoi import stoi

n = 256 + 30*128  # exactly 31 frames
x = np.random.randn(n)
stoi(x, x, 10000)  # raises the RuntimeWarning about not enough frames

A breakpoint here shows the shape of x_spec is (257, 29) instead of (257, 31), i.e. two frames are discarded. Fixing the two indexings produces the expected shape.

@mpariente
Copy link
Owner

Thank you very much for the detailed explanation, and the fix for it.

But pySTOI has been used for years now, and the unit tests to compare to Matlab's STOI were OK, so I won't be accepting the change.

It does not seem feasible for me to change the source code of a metric that has been used for a long time. I hope you understand.

@philgzl
Copy link
Contributor Author

philgzl commented Apr 13, 2023

I understand pystoi has been used by many researchers and libraries, and that any update that would change the results can have large consequences. However pystoi is still extensively used today, and I believe this is a significant issue that differs from the original implementation used in the papers.

To get an idea of how big the error can be in a realistic setting, I took a clean speech audio file and corrupted the last two frames with noise. The files are attached at the end of this message (couldn't figure out how to embed them directly, is it even possible?). The clean sentence can be heard as "she ha" (it's a cropped "she had"). The noise at the end makes it difficult to understand the vowel "a" such that the corrupted sentence is heard as "she h". Currently the corrupted file obtains the same score as the clean file i.e. 0.999 even though the last vowel is hard to understand. The fix brings down the score to 0.926 which is more accurate. That's a 7.4 percentage points difference, which is significant IMO.

from pystoi.stoi import stoi
import soundfile as sf

x, fs = sf.read('x.wav')  # clean
y, fs = sf.read('y.wav')  # corrupted

print(stoi(x, x, 10000))  # 0.9999999999999982
print(stoi(x, y, 10000))  # 0.9999999999999982 currently, 0.9262179481416362 with the fix

Obviously the difference gets smaller as the length of the signal increases but ignoring this is still conceptually wrong.

the unit tests to compare to Matlab's STOI were OK

Yet this was not detected, even though the difference in this example is much larger than the RTOL and ATOL values in the test files. The tests in tests/test_stoi.py only compare the outputs from white noise signals with a fixed length and no silences, which would explain why this was not detected if no other tests were ran.

Ultimately the decision is yours. If you decide not to fix this I would at the very least mention it in README.md and link to this issue, since again this makes this implementation different from the papers.

audio_files.tar.gz

@mpariente
Copy link
Owner

It is significant, and I appreciate your efforts to debug our indexing mistake.

I'll give it a bit more thought.

@mpariente
Copy link
Owner

mpariente commented Dec 29, 2023

Hey @philgzl,

I actually ran the tests again after #34, and it fails now.

You can use this to check it.

sudo apt update 
sudo apt install octave octave-signal 
pip install oct2py
python -m pytest tests/test_python_octave.py
python -m pytest tests/test_stoi_octave.py

In the mean time, I think I'll revert the commit.

@mpariente mpariente reopened this Dec 29, 2023
@mpariente
Copy link
Owner

Sorry, I didn't pay close enough attention to the issue, but actually the indexing was right in the past. range(0, len(x) - win_size + 1, hop) is equivalent to this 1:K:(length(x)-N) because of the starting index.

I suggest you test against the real Matlab implementation, and if there is a problem we'll fix it.

@philgzl
Copy link
Contributor Author

philgzl commented Dec 29, 2023

Indeed, my bad. The last frames are actually also being discarded in the MATLAB implementation.

Example: Signal length 8, frame length 4, hop length 2. Should fit 3 frames, with start indexes [1, 3, 5] in MATLAB. But this MATLAB line returns [1, 3].

image

So I guess we just need to accept this has always been the case.

@philgzl philgzl closed this as completed Dec 29, 2023
@mpariente
Copy link
Owner

Thank you very much for your contributions anyway !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants