Published by Kimberly French. Modified over 9 years ago.
1
Unlocking Audio/Video Content with Speech Recognition
Behrooz Chitsaz, Director, IP Strategy, Microsoft Research, behroozc@microsoft.com
Frank Seide, Lead Researcher, Microsoft Research, fseide@microsoft.com
Kit Thambiratnam, Researcher, Microsoft Research, kit@microsoft.com
2
Microsoft Research
3
Multimedia Research
Speech Search Video summarization Semantic extraction Face identification Object recognition Visual search 3D Modeling
4
Speech Applications
Speech as interface
– Mobile access, Search, Automation, PC application, Web service, Text input, Dictation
Speech as 1st class content
– Indexing, Search, Metadata extraction, Advertising
– Transcription, Meeting notes, Closed caption, Voicemail
– Translation, Translating phone
5
Searching Media Today
meta-data
– surrounding & anchor text, URL
– top-N lists, collaborative filtering
– editorial meta-data
file content itself
– keyword search in audio track using speech recognition
6
Demo
7
Speech recognition
Spectral Analysis, then Matching (Decoding): time alignment, most likely hypothesis
Input: observations $o_1..o_T$. Output: word sequence $\hat{W}$, e.g. “Hello World”
$\hat{W} = \arg\max_{w_1..w_N} p(o_1..o_T \mid w_1..w_N)\, P(w_1..w_N)$
Acoustic Models: $p(o_1..o_T \mid \text{phoneme})$
Dictionary: $P(\text{phonemes} \mid w)$
Grammar (Language Model): $P(w_1..w_N)$
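The decision rule on this slide can be illustrated with a toy example. This is only a sketch: the three candidate hypotheses and their log scores are made up for illustration, and a real decoder searches a far larger hypothesis space rather than enumerating a fixed list.

```python
# Hypothetical log scores for three candidate transcriptions (illustration only).
# A real recognizer derives these from its acoustic model and language model.
acoustic_logprob = {          # log p(o_1..o_T | w_1..w_N)
    "hello world": -12.0,
    "hollow world": -11.5,
    "hello word": -12.3,
}
lm_logprob = {                # log P(w_1..w_N)
    "hello world": -2.0,
    "hollow world": -6.0,
    "hello word": -4.5,
}


def decode(hypotheses):
    """Pick the hypothesis maximizing log p(O|W) + log P(W)."""
    return max(hypotheses, key=lambda w: acoustic_logprob[w] + lm_logprob[w])


best = decode(list(acoustic_logprob))
print(best)
```

Note that "hollow world" has the best acoustic score in this toy data, but the language model prior tips the combined score toward "hello world".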
8
Speech recognition
speech recognition in a nutshell
Acoustic Models: $p(o_1..o_T \mid \text{phoneme})$
Dictionary: $P(\text{phonemes} \mid w)$
Grammar (Language Model): $P(w_1..w_N)$
Speech recordings + full manual transcripts
9
Speech recognition
Acoustic Models: $p(o_1..o_T \mid \text{phoneme})$
Dictionary: $P(\text{phonemes} \mid w)$
Grammar (Language Model): $P(w_1..w_N)$
Dictionary entries:
...
microscope m:s ay:n k:n r:n ax:n s:n k:n ow:n p:e
microsecond m:s ay:n k:n r:n ax:n s:n eh:n k:n ax:n n:n d:e
microsecond m:s ay:n k:n r:n ow:n s:n eh:n k:n ax:n n:n d:e
microsoft m:s ay:n k:n r:n ax:n s:n ao:n f:n t:e
microsoft m:s ay:n k:n r:n ow:n s:n ao:n f:n t:e
…
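As a rough sketch of how such dictionary entries might be loaded, the snippet below parses the lines shown on the slide into a word-to-pronunciations map. The function name `parse_dictionary` is hypothetical, and the reading of the `:s`, `:n`, `:e` suffixes as position tags that can be dropped is an assumption; the slide does not define them.

```python
from collections import defaultdict

# Entries copied from the slide; one pronunciation variant per line.
ENTRIES = """\
microscope m:s ay:n k:n r:n ax:n s:n k:n ow:n p:e
microsecond m:s ay:n k:n r:n ax:n s:n eh:n k:n ax:n n:n d:e
microsecond m:s ay:n k:n r:n ow:n s:n eh:n k:n ax:n n:n d:e
microsoft m:s ay:n k:n r:n ax:n s:n ao:n f:n t:e
microsoft m:s ay:n k:n r:n ow:n s:n ao:n f:n t:e
"""


def parse_dictionary(text):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, *units = line.split()
        # Assumption: the part after ':' is a tag we can ignore here.
        phonemes = [unit.split(":")[0] for unit in units]
        lexicon[word].append(phonemes)
    return dict(lexicon)


lexicon = parse_dictionary(ENTRIES)
print(len(lexicon["microsoft"]))
```

Words with several lines, such as "microsoft", end up with one entry per pronunciation variant.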
10
Speech recognition
Acoustic Models: $p(o_1..o_T \mid \text{phoneme})$
Dictionary: $P(\text{phonemes} \mid w)$
Grammar (Language Model): $P(w_1..w_N)$
Language model entries:
...
-0.8790 this is a
-2.3045 this is about
-3.1858 this is absolutely
-5.2820 this is accomplished
-1.9542 this is actually
...
-5.8492 is a barnyard
-5.1004 is a barometer
-4.2270 is a baseball
-5.4292 is a baseless
-4.4304 is a baseline
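To show how entries like these could be combined into a score for a word sequence, here is a minimal sketch. It assumes the numbers are base-10 log probabilities of the third word given the two preceding words, which the slide does not state, and it leaves out back-off and sentence-start terms that a real language model would include. The name `trigram_score` is hypothetical.

```python
# A few entries from the slide, keyed by (w1, w2, w3).
# Assumption: each value is log10 P(w3 | w1, w2).
TRIGRAM_LOG10 = {
    ("this", "is", "a"): -0.8790,
    ("this", "is", "about"): -2.3045,
    ("is", "a", "baseball"): -4.2270,
    ("is", "a", "baseline"): -4.4304,
}


def trigram_score(words):
    """Sum the log10 trigram scores over every three-word window."""
    total = 0.0
    for i in range(2, len(words)):
        total += TRIGRAM_LOG10[tuple(words[i - 2:i + 1])]
    return total


score = trigram_score("this is a baseball".split())
alt = trigram_score("this is a baseline".split())
print(score, alt)
```

Under these assumptions "this is a baseball" scores slightly higher than "this is a baseline", because its final trigram has the larger log probability.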
11
Challenges
– Speaker accent
– Background noise
– Reverberation
– Vocabulary
– Language
12
lattice-based indexing “into this bank account”
13
lattice-based indexing
“into this bank account”
expected benefits from indexing lattices:
– alternative recognition candidates: recall++
– confidence scores: precision++
– (time information: user experience)
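The three expected benefits can be made concrete with a small sketch of a word-level inverted index built from lattice arcs. Everything here is hypothetical: the arc list, its posterior values, the document id `clip-1`, and the function names are invented for illustration and do not describe the system presented in the talk.

```python
from collections import defaultdict

# Hypothetical lattice arcs: (word, start_sec, end_sec, posterior).
# Competing arcs over the same time span are alternative recognition candidates.
ARCS = [
    ("into", 0.00, 0.30, 0.80),
    ("in", 0.00, 0.15, 0.20),
    ("to", 0.15, 0.30, 0.20),
    ("this", 0.30, 0.50, 0.90),
    ("bank", 0.50, 0.85, 0.70),
    ("thank", 0.50, 0.85, 0.30),
    ("account", 0.85, 1.40, 0.95),
]


def build_index(doc_id, arcs):
    """Index every arc, not just the best path, so alternatives stay searchable."""
    index = defaultdict(list)
    for word, start, end, posterior in arcs:
        index[word].append((doc_id, start, end, posterior))
    return index


def search(index, word, min_confidence=0.0):
    """Return hits for a word, filtered and ranked by posterior confidence."""
    hits = [hit for hit in index.get(word, []) if hit[3] >= min_confidence]
    return sorted(hits, key=lambda hit: hit[3], reverse=True)


index = build_index("clip-1", ARCS)
print(search(index, "thank"))        # found only because alternatives are indexed
print(search(index, "thank", 0.5))   # dropped by the confidence threshold
```

Indexing the low-scoring alternative "thank" helps recall, the `min_confidence` threshold trades that back for precision, and the stored start and end times let a player jump to the matching position.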
14
Vocabulary Adaptation (from NLC group)
[Diagram. Labels: Speech, Metadata, Word statistics, NP extraction, Web query builder, Queries, Bing Search, Docs, Base Dict, Base LM, Adapt Dictionary, Adapt Language Model, Adapted Dict, Adapted LM, Recognizer]
15
Architectural decisions
16
Azure integration
1. Submit audio/video to index
2. Get back AIB
3. Import AIB in SQL
4. Search/Retrieve results
Components: SQL Server(s), Web server(s), Media server(s), video RSS feed
17
Cloud computing made simple
Windows Azure + PowerShell = Cloud computing at your fingertips
Demo: media content submission
18
Microsoft Research
– Tell us if you are interested: mmms@microsoft.com
– Visit us:
http://research.microsoft.com/mavis
http://research.microsoft.com
http://twitter.com/MSFTResearch
http://www.facebook.com/microsoftresearch#
http://www.flickr.com/photos/msr_redmond/
19
Thank you! Questions?