Unlocking Audio/Video Content with Speech Recognition. Behrooz Chitsaz, Director, IP Strategy, Microsoft Research; Frank Seide, Lead Researcher, Microsoft Research; Kit Thambiratnam, Researcher, Microsoft Research
Microsoft Research
Multimedia Research: speech, search, video summarization, semantic extraction, face identification, object recognition, visual search, 3D modeling
Speech Applications. Speech as interface: mobile access, search, automation, PC application, web service, text input, dictation. Speech as 1st-class content: indexing, search, metadata extraction, advertising, transcription, meeting notes, closed caption, Voic, translation, translating phone.
Searching Media Today. Meta-data: surrounding & anchor text, URL; top-N lists, collaborative filtering; editorial meta-data. File content itself: keyword search in audio track using speech recognition.
Demo
Speech recognition. Spectral analysis turns the audio ("Hello World") into observations $o_1..o_T$. Matching (decoding) performs time alignment and outputs the most likely hypothesis $\hat{W} = \arg\max_{w_1..w_N} p(o_1..o_T \mid w_1..w_N)\, P(w_1..w_N)$. Knowledge sources: acoustic models $p(o_1..o_T \mid \text{phoneme})$, dictionary $P(\text{phonemes} \mid w)$, grammar (language model) $P(w_1..w_N)$.
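The decoding rule on this slide can be sketched in a few lines. This is a minimal illustration, not the recognizer's actual search: the candidate word sequences and their scores below are made-up values, and the names `acoustic_loglik`, `lm_logprob` and `decode` are hypothetical. Working in log space turns the product of acoustic likelihood and language-model probability into a sum.

```python
import math

# Hypothetical scores for illustration only. A real recognizer computes
# log p(o_1..o_T | w_1..w_N) from acoustic models and the dictionary,
# and log P(w_1..w_N) from the language model.
acoustic_loglik = {
    ("hello", "world"): -12.0,
    ("yellow", "world"): -11.5,
    ("hello", "word"): -12.4,
}
lm_logprob = {
    ("hello", "world"): math.log(1e-3),
    ("yellow", "world"): math.log(1e-5),
    ("hello", "word"): math.log(1e-4),
}


def decode(hypotheses):
    """Return the hypothesis maximizing log p(O|W) + log P(W)."""
    return max(hypotheses, key=lambda w: acoustic_loglik[w] + lm_logprob[w])


best = decode(list(acoustic_loglik))
print(" ".join(best))
```

Here "yellow world" fits the audio slightly better, but the language model makes "hello world" the most likely hypothesis overall. A real decoder searches a very large hypothesis space rather than enumerating a fixed list.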
Speech recognition in a nutshell. Acoustic models $p(o_1..o_T \mid \text{phoneme})$, dictionary $P(\text{phonemes} \mid w)$, grammar (language model) $P(w_1..w_N)$. Speech recordings + full manual transcripts.
Speech recognition: dictionary $P(\text{phonemes} \mid w)$. Example entries:
...
microscope   m:s ay:n k:n r:n ax:n s:n k:n ow:n p:e
microsecond  m:s ay:n k:n r:n ax:n s:n eh:n k:n ax:n n:n d:e
microsecond  m:s ay:n k:n r:n ow:n s:n eh:n k:n ax:n n:n d:e
microsoft    m:s ay:n k:n r:n ax:n s:n ao:n f:n t:e
microsoft    m:s ay:n k:n r:n ow:n s:n ao:n f:n t:e
…
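Dictionary entries like the ones on this slide map naturally onto a word-to-pronunciations table. The sketch below assumes one entry per line, a whitespace-separated word followed by `phoneme:tag` units, and treats the tag after the colon as an opaque label, since the slide does not define it. `parse_dictionary` and `RAW` are hypothetical names.

```python
from collections import defaultdict

# Entries copied from the slide, one per line (assumed layout).
RAW = """\
microscope m:s ay:n k:n r:n ax:n s:n k:n ow:n p:e
microsecond m:s ay:n k:n r:n ax:n s:n eh:n k:n ax:n n:n d:e
microsecond m:s ay:n k:n r:n ow:n s:n eh:n k:n ax:n n:n d:e
microsoft m:s ay:n k:n r:n ax:n s:n ao:n f:n t:e
microsoft m:s ay:n k:n r:n ow:n s:n ao:n f:n t:e
"""


def parse_dictionary(text):
    """Map each word to a list of pronunciations.

    Each pronunciation is a list of (phoneme, tag) pairs; the tag is kept
    as-is because its meaning is not stated on the slide.
    """
    lexicon = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        word, *units = line.split()
        pronunciation = [tuple(unit.split(":", 1)) for unit in units]
        lexicon[word].append(pronunciation)
    return dict(lexicon)


lexicon = parse_dictionary(RAW)
for word, variants in lexicon.items():
    print(word, len(variants))
```

Keeping a list per word lets a single word carry several pronunciation variants, as "microsecond" and "microsoft" do here.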
Speech recognition: grammar (language model) $P(w_1..w_N)$. Example word sequences:
this is a
this is about
this is absolutely
this is accomplished
this is actually
is a barnyard
is a barometer
is a baseball
is a baseless
is a baseline
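The three-word sequences on this slide suggest a trigram language model, where $P(w_1..w_N)$ is approximated by a product of $P(w_i \mid w_{i-2}, w_{i-1})$. The sketch below is an unsmoothed count-based estimate over a made-up toy corpus; it is an assumption for illustration, not the model used in the talk, and practical models add smoothing for unseen sequences.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = [
    "this is a baseline",
    "this is a baseball",
    "this is about speech",
    "this is actually a barometer",
]


def train_trigrams(sentences):
    """Count trigrams and their two-word histories, with sentence padding."""
    tri, bi = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    return tri, bi


def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 if history unseen."""
    history = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / history if history else 0.0


tri, bi = train_trigrams(corpus)
p_a = trigram_prob(tri, bi, "this", "is", "a")
print(p_a)
```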
Challenges: speaker accent, background noise, reverberation, vocabulary, language
lattice-based indexing “into this bank account”
lattice-based indexing “into this bank account” expected benefits from indexing lattices: – alternative recognition candidates → recall++ – confidence scores → precision++ – (time information → user experience)
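The three benefits listed above can be illustrated with a small inverted index built from lattice edges instead of a single best transcript. The edges, times and posterior values below are invented for the phrase on the slide, and `Posting`, `build_index` and `search` are hypothetical names; this is a sketch under those assumptions, not the indexing scheme from the talk.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    doc_id: str
    start: float       # seconds, lets a player jump to the hit
    end: float
    confidence: float  # word posterior from the lattice


# Hypothetical lattice edges: (word, start, end, posterior).
lattice = [
    ("into", 0.00, 0.30, 0.80),
    ("in", 0.00, 0.15, 0.20),
    ("to", 0.15, 0.30, 0.20),
    ("this", 0.30, 0.55, 0.90),
    ("the", 0.30, 0.55, 0.10),
    ("bank", 0.55, 0.90, 0.70),
    ("banked", 0.55, 0.90, 0.30),
    ("account", 0.90, 1.40, 0.95),
]


def build_index(doc_id, edges):
    """Index every lattice edge, including lower-ranked alternatives."""
    index = defaultdict(list)
    for word, start, end, confidence in edges:
        index[word].append(Posting(doc_id, start, end, confidence))
    return index


def search(index, word, min_confidence=0.0):
    """Return postings for a word, best first, above a confidence floor."""
    hits = [p for p in index.get(word, []) if p.confidence >= min_confidence]
    return sorted(hits, key=lambda p: p.confidence, reverse=True)


index = build_index("clip1", lattice)
print(search(index, "banked"))
print(search(index, "banked", min_confidence=0.5))
```

Indexing the alternative "banked" lets a query find it even though it is not the top candidate (recall), while the confidence floor filters weak hits (precision), and the stored times support jumping to the matching position.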
Vocabulary Adaptation. [Diagram labels: speech, recognizer, word statistics, metadata, NP extraction (from NLC group), web query builder, queries, Bing search, docs, base dict, base LM, adapt dictionary, adapt language model, adapted dict, adapted LM]
Architectural decisions
Azure integration. Components: SQL Server(s), web server(s), media server(s), video RSS feed.
1. Submit audio/video to index
2. Get back AIB
3. Import AIB in SQL
4. Search/retrieve results
Cloud computing made simple. Windows Azure + PowerShell = cloud computing at your fingertips. Demo: media content submission
Microsoft Research – Tell us if you are interested – Visit us:
Thank you! Questions?