
Complete noob questions - 1) model purpose? 2) pre-trained weights? 3) other languages? #10

Closed
taewookim opened this issue Apr 5, 2018 · 4 comments

@taewookim

Excuse my complete noob-ness

  1. Is the model trying to accurately determine whether the video (i.e., the shape of the lips) and the audio are synced?

  2. Are there any pre-trained weights I can download to run it?

  3. Assuming my Q1 is correct: has anyone tested whether this model can accurately detect audio/video synchronization for non-English languages?

@astorfi
Owner

astorfi commented Apr 7, 2018

@taewookim

  1. Yes, ideally the method should be able to do so (a rough sketch of the decision rule is below).
  2. No. Unfortunately, due to data privacy concerns, the trained weights have not been released. The dataset itself, however, is public: the BBC-Oxford 'Lip Reading in the Wild' (LRW) dataset.
  3. A similar model, without the 3D convolution operations and online pair selection, has been proposed and implemented under the title Out of time: automated lip sync in the wild. We compared our method with that work but did not take our evaluation that far.
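
For concreteness, a minimal sketch of that sync decision, assuming two trained embedding networks that map speech features and mouth-region frames into a shared space; `audio_net` and `visual_net` below are hypothetical stand-ins, not part of this repo's API:

```python
import numpy as np

def sync_score(audio_embedding, visual_embedding):
    """Distance between L2-normalized audio and visual embeddings.

    A smaller distance means the two streams are more likely in sync.
    """
    a = audio_embedding / np.linalg.norm(audio_embedding)
    v = visual_embedding / np.linalg.norm(visual_embedding)
    return float(np.linalg.norm(a - v))

# Hypothetical usage -- audio_net / visual_net stand in for the two trained
# CNN towers described in the paper and are NOT functions from this repo:
#   distance = sync_score(audio_net(speech_features), visual_net(mouth_frames))
#   is_synced = distance < THRESHOLD  # threshold tuned on a validation set
```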

@taewookim
Author

taewookim commented Apr 8, 2018

Thank you @astorfi.
Regarding Q3: have you ever run the model on videos where the speakers are speaking a non-English language? The model doesn't have to be super accurate, but I was wondering if it is 'good enough' to detect audio spoofing in videos of non-English speakers.

Suppose a spoofer was attempting to bypass a system that uses face and speech recognition. He would hold up a video containing the victim's face and voice, recorded on, say, an iPad. He would hide from the detection camera (to defeat facial recognition) and would use his own voice, not the voice from the iPad (to defeat the speech recognition system).

A simple solution might be to look at the time offsets of the spoken words and compare them with the time offsets of the lip movements. Of course, this isn't perfect, but it's at least somewhere to start. Any idea what part of your code I could modify to detect this?
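
One way to prototype that offset check, sketched below assuming you can already extract a per-video-frame audio-energy signal and a mouth-openness signal of the same length (e.g. from lip landmarks); none of the names below come from this codebase:

```python
import numpy as np

def estimate_av_lag(audio_energy, mouth_openness):
    """Estimate the lag (in video frames) between speech and lip motion.

    audio_energy   : 1-D array, per-frame audio energy (assumed precomputed
                     from the waveform)
    mouth_openness : 1-D array, per-frame mouth opening, e.g. the distance
                     between upper- and lower-lip landmarks (assumed input)

    Returns the lag at the cross-correlation peak and the normalized peak
    value. A lag far from zero, or a weak peak, is a crude hint that the
    voice does not belong to the face in the video.
    """
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-8)
    m = (mouth_openness - mouth_openness.mean()) / (mouth_openness.std() + 1e-8)
    xcorr = np.correlate(a, m, mode="full")         # full cross-correlation
    lag = int(np.argmax(xcorr)) - (len(m) - 1)      # 0 = aligned; sign = who leads
    peak = float(xcorr.max()) / len(a)              # normalized peak correlation
    return lag, peak
```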

@astorfi
Owner

astorfi commented Apr 8, 2018

No, I personally did not run it on a non-English dataset, but the paper I mentioned (Out of time: automated lip sync in the wild) did. As for the question you are asking, unfortunately I am not an expert.

@taewookim
Author

Thank you.
