One to two frames are discarded at the end of the signal #32

philgzl · 2023-04-12T09:38:43Z

On two occasions, what seems to be remnants of a faulty MATLAB translation is causing up to 2 frames to be discarded at the end of the signal.

In remove_silent_frames:

pystoi/pystoi/utils.py

Lines 148 to 151 in 9ff1cfa

    
           x_frames = np.array( 
        
               [w * x[i:i + framelen] for i in range(0, len(x) - framelen, hop)]) 
        
           y_frames = np.array( 
        
               [w * y[i:i + framelen] for i in range(0, len(x) - framelen, hop)])

The indexing variable should iterate range(0, len(x) - framelen + 1, hop) instead. Currently the last frame is discarded if the signal fits an integer number of frames. The corresponding MATLAB line is:

pystoi/tests/matlab/removeSilentFrames.m

Line 6 in 9ff1cfa

frames = 1:K:(length(x)-N);

But in MATLAB the variable can reach the end value as opposed to Python's range, so here we need the +1.

Similarly in stft:

pystoi/pystoi/utils.py

Lines 98 to 99 in 9ff1cfa

    
           stft_out = np.array([np.fft.rfft(w * x[i:i + win_size], n=fft_size) 
        
                               for i in range(0, len(x) - win_size, hop)])

The indexing variable should iterate range(0, len(x) - win_size + 1, hop) for the same reason. The corresponding MATLAB line is:

pystoi/tests/matlab/stdft.m

Line 3 in 9ff1cfa

frames = 1:K:(length(x)-N);

Note that since remove_silent_frames removes trailing samples if they do not fit a frame, the subsequent _overlap_and_add always produces a signal that fits an integer number of frames. Because of that, stft ALWAYS discards one frame! In conclusion, the last two frames are discarded if the signal fits an integer number of frames, else only the last one.

This can be easily verified:

import numpy as np
from pystoi.stoi import stoi

n = 256 + 30*128  # exactly 31 frames
x = np.random.randn(n)
stoi(x, x, 10000)  # raises the RuntimeWarning about not enough frames

A breakpoint here shows the shape of x_spec is (257, 29) instead of (257, 31), i.e. two frames are discarded. Fixing the two indexings produces the expected shape.

The text was updated successfully, but these errors were encountered:

mpariente · 2023-04-13T12:27:11Z

Thank you very much for the detailed explanation, and the fix for it.

But pySTOI has been used for years now, and the unit tests to compare to Matlab's STOI were OK, so I won't be accepting the change.

It does not seem feasible for me to change the source code of a metric that has been used for a long time. I hope you understand.

philgzl · 2023-04-13T19:50:34Z

I understand pystoi has been used by many researchers and libraries, and that any update that would change the results can have large consequences. However pystoi is still extensively used today, and I believe this is a significant issue that differs from the original implementation used in the papers.

To get an idea of how big the error can be in a realistic setting, I took a clean speech audio file and corrupted the last two frames with noise. The files are attached at the end of this message (couldn't figure out how to embed them directly, is it even possible?). The clean sentence can be heard as "she ha" (it's a cropped "she had"). The noise at the end makes it difficult to understand the vowel "a" such that the corrupted sentence is heard as "she h". Currently the corrupted file obtains the same score as the clean file i.e. 0.999 even though the last vowel is hard to understand. The fix brings down the score to 0.926 which is more accurate. That's a 7.4 percentage points difference, which is significant IMO.

from pystoi.stoi import stoi
import soundfile as sf

x, fs = sf.read('x.wav')  # clean
y, fs = sf.read('y.wav')  # corrupted

print(stoi(x, x, 10000))  # 0.9999999999999982
print(stoi(x, y, 10000))  # 0.9999999999999982 currently, 0.9262179481416362 with the fix

Obviously the difference gets smaller as the length of the signal increases but ignoring this is still conceptually wrong.

the unit tests to compare to Matlab's STOI were OK

Yet this was not detected, even though the difference in this example is much larger than the RTOL and ATOL values in the test files. The tests in tests/test_stoi.py only compare the outputs from white noise signals with a fixed length and no silences, which would explain why this was not detected if no other tests were ran.

Ultimately the decision is yours. If you decide not to fix this I would at the very least mention it in README.md and link to this issue, since again this makes this implementation different from the papers.

audio_files.tar.gz

mpariente · 2023-04-20T13:47:25Z

It is significant, and I appreciate your efforts to debug our indexing mistake.

I'll give it a bit more thought.

mpariente · 2023-12-29T16:37:48Z

Hey @philgzl,

I actually ran the tests again after #34, and it fails now.

You can use this to check it.

sudo apt update 
sudo apt install octave octave-signal 
pip install oct2py

python -m pytest tests/test_python_octave.py
python -m pytest tests/test_stoi_octave.py

In the mean time, I think I'll revert the commit.

mpariente · 2023-12-29T16:44:43Z

Sorry, I didn't pay close enough attention to the issue, but actually the indexing was right in the past. range(0, len(x) - win_size + 1, hop) is equivalent to this 1:K:(length(x)-N) because of the starting index.

I suggest you test against the real Matlab implementation, and if there is a problem we'll fix it.

philgzl · 2023-12-29T17:21:22Z

Indeed, my bad. The last frames are actually also being discarded in the MATLAB implementation.

Example: Signal length 8, frame length 4, hop length 2. Should fit 3 frames, with start indexes [1, 3, 5] in MATLAB. But this MATLAB line returns [1, 3].

So I guess we just need to accept this has always been the case.

mpariente · 2023-12-30T16:15:23Z

Thank you very much for your contributions anyway !

philgzl mentioned this issue Apr 12, 2023

Fix remove_silent_frames and stft indexing #33

Merged

mpariente closed this as completed in #33 May 1, 2023

mpariente reopened this Dec 29, 2023

philgzl closed this as completed Dec 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

One to two frames are discarded at the end of the signal #32

One to two frames are discarded at the end of the signal #32

philgzl commented Apr 12, 2023

mpariente commented Apr 13, 2023

philgzl commented Apr 13, 2023

mpariente commented Apr 20, 2023

mpariente commented Dec 29, 2023 •

edited

Loading

mpariente commented Dec 29, 2023

philgzl commented Dec 29, 2023

mpariente commented Dec 30, 2023

One to two frames are discarded at the end of the signal #32

One to two frames are discarded at the end of the signal #32

Comments

philgzl commented Apr 12, 2023

mpariente commented Apr 13, 2023

philgzl commented Apr 13, 2023

mpariente commented Apr 20, 2023

mpariente commented Dec 29, 2023 • edited Loading

mpariente commented Dec 29, 2023

philgzl commented Dec 29, 2023

mpariente commented Dec 30, 2023

mpariente commented Dec 29, 2023 •

edited

Loading