In the absence of searchable transcripts, many interesting YouTube videos, podcasts, lectures and talks are hard to explore, quote and summarize. ScribeSalad is a multilingual open data project gathering over 550k YouTube video transcripts that discuss social and political issues, psychology, history and scientific topics ranging from biology and mathematics to artificial intelligence: TEDx, Yale courses, MIT lectures, National Geographic, The Joe Rogan Experience, Big Think, IQ squared, Jordan B. Peterson talks, Tim Ferris, Jocko Podcast and more. This project is a first step towards making great content more accessible and helping inspiring speakers, storytellers, interviewers and scientists be better heard.
- A-C: Aba & Preach, AI lectures & talks, Alexander Amini, Amanda Cerny, Answer the internet, Bill Burr, Big Think, Biographics, Bite-sized Philosophy, BrightInsight, Chris D'Elia, Coffee Break, Coffeezilla, Comics explained, Conan O’Brien Needs A Friend, Cracked, CrashCourse
- D-I: Dan Carlin, Dose Of Truth, Fire of learning, Future of Life Institute, H3 podcast, Harvard_University, History Hyenas clips, Hugo Larochelle, IQ squared
- J: Jocko Podcast, Joe Rogan Clips, Joe Rogan Experience, Joe Rogan MMA Show, Joma Tech, Jordan B. Peterson, Jordan Peterson Fan Clips, Jordan Peterson clips, Jubilee
- K-M: Kurzgesagt, Lang Focus, Lex Fridman, Mark Normand, MIT courses, More Chris D'Elia, Motivation Madness
- N-R: National Geographic, NativLang, Nerd writer, Nobel minds, No Presh Network, NowYouSeeIt, Pop Culture Detective, RT Documentaries, Rubin Report, Russell Brand
- S-V: Skavlan, Siraj Raval, Storytellers, TED, The Linguistics Channel, The Monday Morning Podcast, Theo Von, Theo Von Clips, TheSchoolOfLife, ThinkBigAnimation, TigerBellyClips, Tim Ferris, TFATK, TwoCents, Visual politik
- W-Y: Wendover Productions, WhatIf, WhitneyCummings, Wired, Wisecrack, Wolfram, YCombinator, Yale Courses, YangSpeaks, YannLeCun, YannicKilcher, Yeagerists, Your Mom's House Podcast
Arabic (ar), French (fr), German (de), Spanish (es), Russian (ru), Turkish (tr), Portuguese (pt), Italian (it), Japanese (ja), Korean (ko)
Some of the transcripts come from YouTube (subtitles uploaded by the video's owner), while the rest are generated automatically using a high-accuracy large-vocabulary continuous speech recognition system (~90% accuracy in clean conditions: no background noise, no heavy accents and good-quality audio).
The transcripts are identified by the IDs of the corresponding YouTube videos, and each one is available in three formats: text, vtt (Web Video Text Tracks Format) and srt (SubRip Subtitle Format).
To open the original video, replace "ID" in https://www.youtube.com/watch?v=ID with the transcript filename.
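As a minimal sketch of that substitution, the Python helper below builds the video URL from a transcript path. It assumes each file is named `<video ID>.<extension>` (for example `.txt`, `.vtt` or `.srt`); the function name `video_url` and the sample path are hypothetical and not part of the repository.

```python
from pathlib import Path


def video_url(transcript_path: str) -> str:
    """Return the YouTube watch URL for a transcript file.

    Assumption: the file is named "<video ID>.<extension>", so the
    ID is the filename with its final extension removed.
    """
    # Path.stem drops the directory part and the last extension only.
    video_id = Path(transcript_path).stem
    return f"https://www.youtube.com/watch?v={video_id}"


if __name__ == "__main__":
    # Hypothetical path and ID, used only for illustration.
    print(video_url("transcripts/abc123XYZ_-.srt"))
```

If the files turn out to be laid out differently, only the line that extracts `video_id` would need to change.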
This is an open data project: feel free to fork this repository and to download, share and use any of the transcripts.
- Cleaning up transcripts: removing fillers (hum, ah, etc.) and repetitions.
- Topic modeling: automatically discovering the abstract "topics" that occur in each transcript.
- Speaker identification: who spoke when, and for how long?
- Creating a search engine: exploring subjects by speaker, topic, channel, etc.
- Multilingual transcripts: translating all transcripts to other languages.
- More channels & more videos.
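
The first item above, cleaning up transcripts, could start from something like the following sketch. It is not the project's implementation: the filler list, the regular expressions and the function name `clean_transcript` are assumptions chosen for illustration, and the approach is deliberately naive (it also collapses legitimate repetitions such as "had had" and removes "hum" when it is used as a real word).

```python
import re

# Assumed filler words; extend or trim this list to suit the transcripts.
FILLERS = re.compile(r"\b(?:um+|uh+|ah+|hum+)\b,?\s*", re.IGNORECASE)

# A word immediately repeated one or more times ("I I", "is is is").
REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_transcript(text: str) -> str:
    """Remove simple fillers and immediate word repetitions."""
    text = FILLERS.sub("", text)      # drop fillers and a trailing comma
    text = REPEATS.sub(r"\1", text)   # keep one copy of a repeated word
    # Collapse any double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()


if __name__ == "__main__":
    print(clean_transcript("um so I I think that, uh, this is is great"))
```

A more robust version would likely work on the timed vtt or srt cues rather than on plain text, so that timestamps stay aligned after words are removed.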