Weird STOI Output #20

Open
nanaChang opened this issue Jul 24, 2020 · 7 comments

Comments

@nanaChang

Hi,

Recently I have been evaluating some signals by computing the STOI of each one with this package, using the pystoi.stoi.stoi function. When I pass two identical signals as ref_signal and processed_signal, it outputs exactly 1, as expected. However, when I replace the processed signal with microphone signals that I recorded with and without background music playing, the STOI of the signal with background music is always higher, which makes no sense.
I'm wondering whether I'm using the function incorrectly, or whether something is wrong with my audio files or with my understanding of STOI.

I've uploaded my audio files, along with the code I use to evaluate STOI, here:
https://github.com/nanaChang/stoiCheckFile

Thank you!

@mpariente
Owner

I didn't check your files, but are you sure they are completely in sync with each other?
If they are aligned, this result is weird indeed.

@nanaChang
Author

Hi,

Thanks for replying!
I'm pretty sure they are all aligned correctly. There should be a tiny delay between the reference signal and the signal received by the microphone, given the travel time from the speaker to my microphone array, but I placed the speaker and the microphones quite close to each other (0.3 meters apart), so I think this hardly affects the result.
Anyway, to account for possible misalignment, I recalculated the STOI with the reference signal delayed by 14 frames to minimize the effect of the travel time, but the results are similar to the original ones.
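
The size of that travel-time delay can be checked with a quick calculation. The sketch below is only illustrative: the speed of sound (343 m/s) and the 16 kHz sample rate are assumptions, since the thread does not state the recording rate, and both helper names are hypothetical. Under those assumptions, 0.3 m corresponds to roughly 14 samples.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at about 20 degrees Celsius


def propagation_delay_samples(distance_m, sample_rate_hz):
    """Return the acoustic travel time over distance_m, rounded to whole samples."""
    delay_seconds = distance_m / SPEED_OF_SOUND_M_S
    return round(delay_seconds * sample_rate_hz)


def delay_signal(signal, n_samples):
    """Delay a signal by prepending zeros, keeping the original length."""
    if n_samples <= 0:
        return list(signal)
    return ([0.0] * n_samples + list(signal))[:len(signal)]


# 16 kHz is an assumed sample rate, not a value taken from the thread.
delay = propagation_delay_samples(0.3, 16000)
print(delay)
```

At other sample rates the same distance maps to a different number of samples, so the shift has to be recomputed for the actual rate of the recordings.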

Thank you again for reviewing my issues!

@mpariente
Owner

That's counter-intuitive. Do you have Matlab by any chance? The code is unit tested, but maybe something weird is happening, I don't know.

@nanaChang
Author

I just ran the Matlab test code and got 0.1973 for the signal with background music and 0.1105 for the signal without background music.
I wonder whether all my signals are so noisy that STOI cannot distinguish between the cases with and without background music. Still, I wanted to bring this up, since I have about 100 signals with and without background music, and almost all of them have a higher STOI when BGM is present.

Thank you!

@mpariente
Owner

Thanks a lot for running the tests in Matlab!
This is indeed a very interesting observation; @chtaal might have an explanation for it.

I don't have any intuition as to why this would be the case, sorry.

@chtaal

chtaal commented Jul 30, 2020

Your scores are below 0.4, which basically means STOI says the speech is not intelligible. Have a look at Fig. 4 in http://cas.et.tudelft.nl/pubs/Taal2011_1.pdf, which shows real listening test scores vs. STOI predictions.

You have to call STOI with a clean signal and a distorted (less intelligible) version of the SAME speech signal. The signals have to be time-aligned. Based on the file size, it seems that 'refSpeech.wav' might not be the same time-aligned speech signal as the one used in audio_withBGM.wav. I think it would make more sense to use audio_withoutBGM.wav as the reference signal (assuming it's 100% intelligible) and audio_withBGM.wav as the distorted version.
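
One way to check the time-alignment requirement above is to estimate the offset between the two signals by cross-correlation and trim them to a common length before scoring. This is a minimal sketch, not the method used by anyone in this thread: `estimate_delay`, `align`, and `score_with_pystoi` are hypothetical helper names, the brute-force search is only practical for small `max_lag` values, and the final call follows the `stoi(clean, degraded, fs, extended=False)` usage shown in the pystoi README.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) of `degraded` relative to `reference`
    that maximises their cross-correlation over [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += r * degraded[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(reference, degraded, lag):
    """Drop the leading offset and trim both signals to a common length."""
    if lag > 0:
        degraded = degraded[lag:]
    elif lag < 0:
        reference = reference[-lag:]
    n = min(len(reference), len(degraded))
    return list(reference[:n]), list(degraded[:n])


def score_with_pystoi(reference, degraded, sample_rate_hz):
    """Score aligned signals with pystoi (needs numpy and pystoi installed)."""
    import numpy as np
    from pystoi import stoi
    clean = np.asarray(reference, dtype=float)
    noisy = np.asarray(degraded, dtype=float)
    return stoi(clean, noisy, sample_rate_hz, extended=False)
```

A typical use would be `lag = estimate_delay(ref, deg, max_lag)` followed by `score_with_pystoi(*align(ref, deg, lag), fs)`, where `ref` and `deg` contain the same utterance.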

@nanaChang
Author

Hi @chtaal !

Thank you for replying!
The reason the files audio_withBGM.wav and audio_withoutBGM.wav are much larger than refSpeech.wav is that they were recorded with a 4-channel microphone array, so they are about 4 times larger than the 1-channel refSpeech.wav. However, I used only channel 2 when calculating STOI, so I suppose this shouldn't be a problem.
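
Picking one channel out of a multi-channel recording can be sketched as follows. This assumes the samples are interleaved frame by frame, as in WAV data, and that "channel 2" means the second channel, i.e. index 1 when counting from zero; both points are assumptions, and `extract_channel` is a hypothetical helper rather than part of pystoi.

```python
def extract_channel(interleaved, n_channels, channel_index):
    """Return one channel from interleaved samples (frame-major order)."""
    if not 0 <= channel_index < n_channels:
        raise ValueError("channel_index out of range")
    return list(interleaved[channel_index::n_channels])


# Two frames of a 4-channel recording: [ch0, ch1, ch2, ch3, ch0, ch1, ...]
frames = [10, 20, 30, 40, 11, 21, 31, 41]
second_channel = extract_channel(frames, n_channels=4, channel_index=1)
print(second_channel)
```

The extracted channel has one sample per frame, so its length matches a mono recording of the same duration and sample rate.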

I tried evaluating STOI with a couple of my processed audio files as the processed signal and audio_withoutBGM.wav as the reference, as you suggested, and the trend is now more similar to the PESQ results (thank you again for this helpful suggestion!). Still, it feels odd to me, since audio_withoutBGM.wav is the raw signal on which I want to test my beamforming algorithm. If this file is taken as the reference signal instead of refSpeech.wav, how can I assess the audio quality of my beamforming algorithm when the raw microphone signal is the reference?

Thank you again!
