Are We Ready? A Look at the State of the Art in Speech-to-text Applications Marie Meteer August 2007
Overview
Speech Recognition: The State of the Art
- A look back at where it came from
- Elements of the models
- State of the art performance
Applications: Making them work
- Call Center Analytics
- Voicemail Transcription
- Needles in Haystacks
- Multimedia search
BBN Technology’s Speech Milestones
[Timeline figure. Years shown: 1976, 1982, 1986, 1992, 1994, 1995, 1998, 2000, 2002, 2003, 2004, 2005. Milestones shown:]
- Early continuous speech recognizer using natural language understanding
- Early adopter of statistical hidden Markov models
- Introduced context dependent phonetic units
- First software-only, real-time, large-vocabulary, speaker-independent, continuous speech recognizer
- Pioneered statistical language understanding and data extraction
- First 40,000 word real time speech recognizer
- Rough'n'Ready prototype system for browsing audio
- Audio Indexer System – 1st generation
- Broadcast Monitoring System delivered to U.S. Gov’t. – 2nd generation
- AVOKE STX 1.0 introduced
- DARPA EARS Program Award
- AVOKE STX 2.0 with Domain Development Tools
- Exceeded DARPA EARS targets
Progress in Speech Recognition: 1990’s
[Figure: Word Error Rate (%) by year, 1987 to 1998, y-axis from 1 to 80. Tasks plotted: Call Home, SWBD Conversational Telephone, Broadcast News, Resource Management, WSJ 64K Vocab, WSJ 5K Vocab, Airline Task, Resource Mgt Spkr Dep., Connected Digits.]
DARPA EARS for ASR Performance
[Figure: Word Error Rate Goals. Word error rate (10 to 60) by year (2002, 2003, 2005, 2007), with ceiling and floor targets for Broadcast news and Telephony. Callout: "BBN’s 2003 Performance Exceeds".]
Elements of a Speech Model
Dictionary
- List of all the words and their pronunciations, the sequence of “phonemes” that make up the word
  Example: Real Networks > R-IY-L N-EH-T-W-ER-K-S
- Dictionary tool automatically creates phonetic pronunciations for most words
Acoustic Model
- Captures the relationship between the sounds and the phonemes
- Specific to a language (e.g. English, Spanish) and a channel (e.g. telephony, broadcast)
Domain Model
- Captures the sequences of words in the language using a “tri-gram” model, that is, the likelihood of a word given the two previous words
- Can be as general as “Conversational” or as specific as “Technology”
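The “tri-gram” idea in the Domain Model can be sketched in a few lines. This is a minimal illustration, not the recognizer's actual implementation: it uses unsmoothed counts, and the function names and the three-sentence corpus are made up for the example. A production model would add smoothing so that unseen word sequences do not get a probability of zero.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories from tokenized sentences."""
    trigram_counts = Counter()
    history_counts = Counter()
    for words in sentences:
        # Pad so the first real word also has a two-word history.
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    return trigram_counts, history_counts


def trigram_probability(word, history, trigram_counts, history_counts):
    """Unsmoothed estimate of P(word | two previous words)."""
    history = tuple(history)
    total = history_counts[history]
    if total == 0:
        return 0.0  # history never seen in training
    return trigram_counts[history + (word,)] / total


# Hypothetical domain text, standing in for real domain modeling data.
corpus = [
    "real networks streams video".split(),
    "real networks streams audio".split(),
    "real networks sells software".split(),
]
tri, hist = train_trigram(corpus)
p = trigram_probability("streams", ("real", "networks"), tri, hist)
print(f"P(streams | real networks) = {p:.3f}")
```

In this toy corpus “streams” follows “real networks” in two of three sentences, so the estimate is 2/3. A “Technology” domain model would see such sequences often; a “Conversational” one would not, which is why domain text matters.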
Model Requirements
Acoustic Data
- Minimum of 50-100 hours of transcribed data
- English Broadcast News model trained on 1600 hours of broadcast news data
- Training data must be a precise transcription with the corresponding audio file (including partial words, “um”, laughter, etc.)
Domain Modeling Data
- Text data, either transcribed from audio or taken off the web
- Does not have to be as precise as for acoustic modeling
- Has to model both the vocabulary and the “style” of speaking
Dictionary
- Phonetic pronunciations of all of the words
Word Accuracy
Recognition performance varies based on audio quality and domain
Within News, factors include:
- Speaker
- Audio quality
- Background music
Across Domains, factors include:
- Speaking style
- Out of vocabulary rate

SPEAKER                          ACCURACY (%)
Male Anchor                      82
Female Anchor                    76
Non-native over the telephone    53
Commercial                       55

DOMAIN                           ACCURACY (%)
News                             74.5
Movie Reviews                    77.8
Technology                       79.4
Gaming                           59.45
Religion                         68.2
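For readers who want to see how a word-level score is computed, here is a minimal sketch of word error rate using word-level edit distance. The slides do not say exactly how the accuracy figures were scored, so treating accuracy as 1 minus word error rate is an assumption, and the function name and example sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missed
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up reference and recognizer output for illustration.
wer = word_error_rate("give me a call at the office",
                      "give me call at the office")
print(f"WER = {wer:.3f}, accuracy (assumed 1 - WER) = {1 - wer:.3f}")
```

Note that insertions can push word error rate above 100%, so an accuracy defined this way can go negative on very poor audio.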
Document Retrieval Accuracy
To correctly retrieve a document, a search term only has to be found once in the document.
The table below reports document retrieval accuracy for words occurring two or more times in the document, compared with overall word accuracy.
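The reasoning behind “only has to be found once” can be made concrete with a small calculation. The sketch below assumes each occurrence of a term is recognized independently with probability equal to the overall word accuracy. That independence assumption is ours, not the presentation's, and it is optimistic, since errors tend to cluster on the same hard words and speakers.

```python
def retrieval_probability(word_accuracy, occurrences):
    """Chance that at least one of `occurrences` spoken instances of a
    term is recognized, assuming independent errors (a simplification)."""
    return 1.0 - (1.0 - word_accuracy) ** occurrences


# 0.745 is the News word accuracy from the Word Accuracy slide.
for k in (1, 2, 3):
    print(k, round(retrieval_probability(0.745, k), 3))
```

Under this assumption, a term spoken twice in a News document would be found about 93% of the time even though single-word accuracy is 74.5%.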
Markets and Applications Consumer Search (video search) Government Intelligence Call Center Recording Broadcast Monitoring & Retrieval (audio/video publication) Digital Asset Production Enterprise Search (webcasts, corp info)
AVOKE Caller Experience Analytics
Breakthrough Caller Experience Analytics
- The Only True End-to-End Solution: From dialing to termination
- Multiple Techniques To Extract Understanding: Prompt and speech recognition, telephony data, and human annotation
- Data-Driven Insights: With drill-down to listen for root cause
- Zero Integration: No on-site hardware or software
To Manage & Optimize Contact Processes
- Improve Operational Visibility
- Reduce Agent Time by 15-30+%
- Boost First Call Resolution
- Eliminate Customer Dis-Satisfiers
Full Text & Keyword Search Search for words spoken by callers or agents View call with full text of caller and call center – including all IVR(s), queue(s) and agent(s)
Voicemail Transcription
Requirements
- Near real time transcription
- High accuracy, especially on names
- Frequently very noisy conditions (e.g. a non-native speaker calling on a cell phone from a street corner in Germany)
Solution
- Speech recognition automates a “first pass”
- Human correction provides accuracy
- Full human transcription on poor quality calls
Voicemail Solution? Human in the loop
Example message: “Hi Tom. I can’t make the meeting but I’m available to call in. Give me a call at 101-555-1212. Thanks.”
1. Phone message is left
2. Speech recognizer produces a rough transcript
3. Transcribers fix the output of the speech recognizer
4. Correct transcription goes back to the server
Result: High Quality, Lower Cost
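One way to picture the human-in-the-loop flow is as a routing decision made after the recognizer's first pass. The sketch below is hypothetical: the slides do not say how poor quality calls are detected, so the `confidence` score, the 0.5 threshold, and all names here are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Voicemail:
    message_id: str
    rough_transcript: str  # first-pass output of the speech recognizer
    confidence: float      # hypothetical recognizer confidence in [0, 1]


def route(voicemail, min_confidence=0.5):
    """Pick the human step for a message after the recognizer's first pass.

    Low-confidence (poor quality) calls are transcribed from scratch;
    the rest go to a transcriber who only fixes the rough transcript.
    The threshold is an assumed tuning parameter, not a published value.
    """
    if voicemail.confidence < min_confidence:
        return "full_human_transcription"
    return "human_correction"


clear_call = Voicemail("vm-001", "hi tom i can't make the meeting", 0.86)
noisy_call = Voicemail("vm-002", "hi um the the street", 0.21)
print(route(clear_call))
print(route(noisy_call))
```

The cost saving comes from the first branch being rare: correcting a mostly right transcript is faster than typing one from audio.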
Custom Applications: Broadcast Monitoring
Screen callouts:
- Automatic transcription of Arabic speech from BBN Audio Indexer
- Automatic translation of Arabic transcript from Language Weaver MT
Real-time streaming video (<5 min delay)
- Continuous 24/7 video encoding and streaming
- Real-time access to incoming video stream
Synchronized transcription and translation
- Provides random access to spoken content in either language
30-day cache of recent video automatically maintained
- Seek by date and time to any position in the cache
- Search by keyword or by example in either language
- Retrieve from cache and/or filter incoming stream with alerts
Export video segments and stills to PowerPoint, Word
- Include selections of transcription and translation
Zero-maintenance design
- No onsite administration required
MultiMedia Search
Problem: Search engines have historically had very little to work with in terms of properly discovering and indexing multimedia content.
Opportunity: The value of multimedia content is “trapped” inside the files, out of view of search engines. Titles and tags miss key concepts within the files:
…let’s look at the overall picture not just Obama and and Clinton Brett how do you assess the overall dynamics of what's happened over the course of the last three months how big -- victory for the president how big a defeat for the Democrat well it it. He would have been a bigger defeat it was a victory. This is this is -- reprieve cents for the president it's only as bill pointed out for months worth of funding. And it's and this issue's going to come up again in the Democrats are going to continue to try to impose restrictions on the with a president for a just war -- vote to be funded completely which is what. We're just talking about so. This is just justices have a battle he wanted that's that's nice for him but there's another one coming in just a few months. And of course what we have now is this whole idea that is taken hold and it's it's out there in the in the public parlance about September being in the big month not helpful to the president's cause -- -- for prisoners efforts you know we're not going to -- all the troops on the ground until next month and then visiting get to bounce of the summer to try to fix the situation. Probably unrealistic which in September's going to be a tough month of. ...
Multimedia Consumption
- Automatic extraction of key terms and concepts for tagging and categorization
- Patent-pending “Snippet” navigation technology enables users to jump to relevant segments of the clip
- Social media integration drives RSS subscription, bookmarking, etc.
- Full text output enables related content presentation
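Jumping to a relevant segment depends on the transcript being time-aligned with the audio. The sketch below shows the general idea only and is not the patent-pending “Snippet” technology: it assumes the recognizer output is available as (start time in seconds, word) pairs, and the function name and sample data are made up.

```python
def find_snippets(timed_words, term, context=2):
    """Return (start_time, snippet_text) for each occurrence of `term`.

    `timed_words` is a list of (start_seconds, word) pairs; `context`
    is how many words to show on each side of the match.
    """
    term = term.lower()
    hits = []
    for i, (_, word) in enumerate(timed_words):
        if word.lower() == term:
            lo = max(0, i - context)
            hi = min(len(timed_words), i + context + 1)
            text = " ".join(w for _, w in timed_words[lo:hi])
            # Start playback at the first word of the snippet.
            hits.append((timed_words[lo][0], text))
    return hits


# Hypothetical time-aligned recognizer output.
timed_words = [
    (0.0, "the"), (0.4, "president"), (1.1, "won"), (1.5, "a"),
    (1.8, "reprieve"), (2.6, "in"), (2.9, "September"),
]
print(find_snippets(timed_words, "september"))
```

A player would seek to the returned start time, so the viewer lands just before the search term is spoken rather than at the top of the clip.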
Multimedia Discovery
Example: EveryZing Media Merchandising indexes the full contents of FoxSports multimedia files. As a result, EveryZing is able to significantly increase the number of keyword results.
Great discovery leads to increased consumption and enhanced monetization opportunities.

Search Term          EveryZing Results    FoxSports Results    EveryZing Increase
Manny Ramirez        22                   7                    214%
Yankees              281                  111                  153%
Manchester United    21                   2                    950%
Golf                 214                  170                  25%
Federer              45                   15                   200%
David Beckham        36                   17                   111%
Tom Brady            53                   31                   71%
Summary
- Speech recognition takes an inaccessible data structure (audio) and turns it into an accessible one (text)
- It’s far from perfect, but it’s a big jump from nothing
- Takeaway: It’s the task that matters. Find the right role, and speech recognition works
  (Corollary: A good prompt is worth two years of research)
Media Merchandising Solutions
Thank you!
Marie Meteer, VP of Speech and NLP