Are We Ready? A Look at the State of the Art in Speech-to-text Applications
Marie Meteer, August 2007
www.everyzing.com

Presentation transcript:


Overview
  Speech Recognition: The State of the Art
    A look back at where it came from
    Elements of the models
    State of the art performance
  Applications: Making them work
    Call Center Analytics
    Voicemail Transcription
    Needles in Haystacks
    Multimedia search

BBN Technologies' Speech Milestones
  [Timeline graphic. Years shown: 1976, 1982, 1986, 1992, 1994, 1995, 1998, 2000, 2002, 2003, 2004, 2005. The pairing of years with milestones is not preserved in the transcript; milestones are listed in transcript order.]
    Rough'n'Ready prototype system for browsing audio
    Pioneered statistical language understanding and data extraction
    Introduced context dependent phonetic units
    Early adopter of statistical hidden Markov models
    DARPA EARS Program Award
    Exceeded DARPA EARS targets
    Early continuous speech recognizer using natural language understanding
    First 40,000 word real time speech recognizer
    AVOKE STX 1.0 introduced
    Audio Indexer System – 1st generation Broadcast Monitoring System delivered to U.S. Gov't. – 2nd generation
    AVOKE STX 2.0 with Domain Development Tools
    First software-only, real-time, large-vocabulary, speaker-independent, continuous speech recognizer

Progress in Speech Recognition, 1990's
  [Chart: word error rate (%) by year, 1987 to 1998, with y-axis ticks from 1 to 80. Tasks plotted: Connected Digits, Resource Management (speaker dependent), Airline Task, WSJ 5K vocabulary, WSJ 64K vocabulary, Resource Management, Broadcast News, SWBD conversational telephone, Call Home.]

DARPA EARS for ASR Performance
  BBN's 2003 Performance Exceeds Word Error Rate Goals
  [Chart: word error rate (y-axis ticks 10 to 60) by year (2002, 2003, 2005, 2007), showing ceiling and floor goals for broadcast news and for telephony.]

Elements of a Speech Model
  Dictionary
    List of all the words and their pronunciations, that is, the sequence of "phonemes" that make up each word
    Example: Real Networks: R-IY-L N-EH-T-W-ER-K-S
    Dictionary tool automatically creates phonetic pronunciations for most words
  Acoustic Model
    Captures the relationship between the sounds and the phonemes
    Specific to a language (e.g. English, Spanish) and a channel (e.g. telephony, broadcast)
  Domain Model
    Captures the sequences of words in the language using a "tri-gram" model, that is, the likelihood of a word given the two previous words
    Can be as general as "Conversational" or as specific as "Technology"
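
The tri-gram idea in the Domain Model can be sketched in a few lines of Python. This is an illustration only, not the EveryZing or BBN implementation: the function names `train_trigram` and `trigram_prob`, the toy corpus, and the `<s>` / `</s>` padding tokens are all assumptions made for the example, and the estimate is a plain relative frequency with no smoothing.

```python
from collections import Counter


def train_trigram(sentences):
    """Count word triples and their two-word contexts (hypothetical helper)."""
    tri = Counter()  # (w1, w2, w3) -> count
    ctx = Counter()  # (w1, w2) -> number of times it precedes some word
    for sentence in sentences:
        # Pad so the first real word also has a two-word context.
        words = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(2, len(words)):
            tri[(words[i - 2], words[i - 1], words[i])] += 1
            ctx[(words[i - 2], words[i - 1])] += 1
    return tri, ctx


def trigram_prob(tri, ctx, w1, w2, w3):
    """Relative-frequency estimate of P(w3 | w1 w2); 0.0 for unseen contexts."""
    seen = ctx[(w1, w2)]
    if seen == 0:
        return 0.0
    return tri[(w1, w2, w3)] / seen


# Toy "Technology" domain text, invented for this example.
corpus = [
    "real networks released a player",
    "real networks released a codec",
    "real networks is a company",
]
tri, ctx = train_trigram(corpus)
p = trigram_prob(tri, ctx, "real", "networks", "released")
print(f"P(released | real networks) = {p:.3f}")
```

In the toy corpus, "released" follows "real networks" in two of the three sentences, so the estimate is 2/3. A production model would typically add smoothing or back-off so that unseen word sequences do not receive probability zero.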

Model Requirements
  Acoustic Data
    Minimum of 50-100 hours transcribed data
    English Broadcast News transcribed on 1600 hours of broadcast news data
    Training data must be a precise transcription with corresponding audio file (including partial words, "um", laugh, etc.)
  Domain Modeling Data
    Text data, either transcribed from audio or off the web
    Does not have to be as precise as for acoustic modeling
    Has to model both the vocabulary and "style" of speaking
  Dictionary
    Phonetic pronunciations of all of the words

Word Accuracy
  Recognition performance varies based on audio quality and domain
  Within News, factors include:
    Speaker
    Audio quality
    Background music
  Across Domains:
    Speaking style
    Out-of-vocabulary rate

  SPEAKER                          ACCURACY (%)
  Male Anchor                      82
  Female Anchor                    76
  Non-native over the telephone    53
  Commercial                       55

  DOMAIN                           ACCURACY (%)
  News                             74.5
  Movie Reviews                    77.8
  Technology                       79.4
  Gaming                           59.45
  Religion                         68.2
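
The slides do not say how these accuracy figures were scored. A common convention, assumed here, is word accuracy = 100% minus word error rate, where the error count is the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer output. The sketch below follows that assumption; the function names and the example sentences are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance between a reference and a hypothesis."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = cur[j - 1] + 1    # extra word in the output
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Accuracy in percent, defined here as 100 * (1 - errors / reference length)."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference transcript is empty")
    return 100.0 * (1.0 - word_errors(reference, hypothesis) / n)


reference = "give me a call at the office"
hypothesis = "give me call at the office"
print(f"errors: {word_errors(reference, hypothesis)}")
print(f"accuracy: {word_accuracy(reference, hypothesis):.1f}%")
```

Under this definition a single dropped word in a seven-word reference costs about 14 points of accuracy, and a hypothesis with many inserted words can score below zero.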

Document Retrieval Accuracy
  To correctly retrieve a document, a search term only has to be found once in the document.
  The table below reports on document retrieval accuracy based on words occurring 2 or more times in the document, compared with overall word accuracy.
  [Table not preserved in the transcript.]
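
A rough way to see why retrieval can be more accurate than raw word accuracy: if each occurrence of a search term were recognized independently with probability equal to the word accuracy, the chance of catching at least one of k occurrences would be 1 - (1 - accuracy)^k. The independence assumption is mine, not the presentation's, and it is optimistic, since a word missing from the dictionary is missed every time it is spoken. The function name below is hypothetical.

```python
def prob_term_found(word_accuracy, occurrences):
    """Chance that at least one occurrence is recognized, assuming independent errors."""
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 1:
        raise ValueError("occurrences must be at least 1")
    return 1.0 - (1.0 - word_accuracy) ** occurrences


# 0.745 is the News word accuracy from the earlier slide, expressed as a fraction.
for k in (1, 2, 3):
    print(f"{k} occurrence(s): {prob_term_found(0.745, k):.3f}")
```

Even under this simplified model, a term spoken twice in a document is far more likely to be retrievable than the per-word accuracy alone would suggest.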

Markets and Applications
  Consumer Search (video search)
  Government Intelligence
  Call Center Recording
  Broadcast Monitoring & Retrieval (audio/video publication)
  Digital Asset Production
  Enterprise Search (webcasts, corp info)

AVOKE Caller Experience Analytics
  Breakthrough Caller Experience Analytics
    The Only True End-to-End Solution: from dialing to termination
    Multiple Techniques To Extract Understanding: prompt and speech recognition, telephony data, and human annotation
    Data-Driven Insights: with drill-down to listen for root cause
    Zero Integration: no on-site hardware or software
  To Manage & Optimize Contact Processes
    Improve Operational Visibility
    Reduce Agent Time by 15-30+%
    Boost First Call Resolution
    Eliminate Customer Dis-Satisfiers

Full Text & Keyword Search
  Search for words spoken by callers or agents
  View call with full text of caller and call center, including all IVR(s), queue(s) and agent(s)

Voicemail Transcription
  Requirements
    Near real time transcription
    High accuracy, especially on names
    Frequently very noisy conditions (non-native speaker calling on a cell phone from a street corner in Germany)
  Solution
    Speech recognition automates a "first pass"
    Human correction provides accuracy
    Full human transcription on poor quality calls

Voicemail Solution? Human in the loop
  Example message: "Hi Tom. I can't make the meeting but I'm available to call in. Give me a call at 101-555-1212. Thanks."
  Phone message is left
  Speech recognizer produces a rough transcript
  Transcribers fix the output of the speech recognizer
  Correct transcription goes back to the server
  Result: High Quality, Lower Cost
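
The human-in-the-loop flow can be sketched as a simple routing decision. The slides do not say how a call is judged "poor quality"; the sketch below assumes the recognizer reports a confidence score between 0 and 1 and that a fixed threshold is used. The `Voicemail` class, the `route` function, the queue names and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Voicemail:
    audio_id: str
    rough_transcript: str   # first-pass output of the speech recognizer
    confidence: float       # assumed recognizer confidence, 0.0 to 1.0


def route(message, min_confidence=0.5):
    """Pick the human workflow for a message after the recognizer's first pass."""
    if message.confidence < min_confidence:
        # Poor-quality call: the rough transcript is not worth editing.
        return "full_human_transcription"
    # Usable first pass: a transcriber only corrects the recognizer output.
    return "human_correction"


clear_call = Voicemail("vm-001", "hi tom i can't make the meeting", 0.82)
noisy_call = Voicemail("vm-002", "hi um the the", 0.21)
for vm in (clear_call, noisy_call):
    print(vm.audio_id, "->", route(vm))
```

The cost saving on the slide comes from the first branch being rare: most messages only need correction, which is faster than transcribing from scratch.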

Custom Applications: Broadcast Monitoring
  Automatic translation of Arabic transcript from Language Weaver MT
  Automatic transcription of Arabic speech from BBN Audio Indexer
  Real-time streaming video (<5 min delay)
    Continuous 24/7 video encoding and streaming
    Real-time access to incoming video stream
  Synchronized transcription and translation
    Provides random access to spoken content in either language
  30-day cache of recent video automatically maintained
    Seek by date and time to any position in the cache
  Search by keyword or by example in either language
    Retrieve from cache and/or filter incoming stream with alerts
  Export video segments and stills to PowerPoint, Word
    Include selections of transcription and translation
  Zero-maintenance design
    No onsite administration required

MultiMedia Search
  Problem: Search engines have historically had very little to work with in terms of properly discovering and indexing multimedia content.
  Opportunity: The value of multimedia content is “trapped” inside the files, out of view of search engines. Titles and tags miss key concepts within the files:

  …let’s look at the overall picture not just Obama and and Clinton Brett how do you assess the overall dynamics of what's happened over the course of the last three months how big -- victory for the president how big a defeat for the Democrat well it it. He would have been a bigger defeat it was a victory. This is this is -- reprieve cents for the president it's only as bill pointed out for months worth of funding. And it's and this issue's going to come up again in the Democrats are going to continue to try to impose restrictions on the with a president for a just war -- vote to be funded completely which is what. We're just talking about so. This is just justices have a battle he wanted that's that's nice for him but there's another one coming in just a few months. And of course what we have now is this whole idea that is taken hold and it's it's out there in the in the public parlance about September being in the big month not helpful to the president's cause -- -- for prisoners efforts you know we're not going to -- all the troops on the ground until next month and then visiting get to bounce of the summer to try to fix the situation. Probably unrealistic which in September's going to be a tough month of. ...

Multimedia Consumption
  Automatic extraction of key terms and concepts for tagging, categorization
  Patent-pending "Snippet" navigation technology enables users to jump to relevant segments of the clip
  Social media integration drives RSS subscription, bookmarking, etc.
  Full text output enables related content presentation

Multimedia Discovery Example: FoxSports.com
  EveryZing Media Merchandising indexes the full contents of FoxSports multimedia files.
  As a result, EveryZing is able to significantly increase the number of keyword results.
  Great discovery leads to increased consumption and enhanced monetization opportunities.

  Search Term          EveryZing Results   FoxSports Results   EveryZing Increase
  Manny Ramirez        22                  7                   214%
  Yankees              281                 111                 153%
  Manchester United    21                  2                   950%
  Golf                 214                 170                 25%
  Federer              45                  15                  200%
  David Beckham        36                  17                  111%
  Tom Brady            53                  31                  71%
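
The "EveryZing Increase" column is consistent with a percentage increase over the FoxSports result count, that is, 100 × (EveryZing − FoxSports) / FoxSports. The short sketch below recomputes it from the counts on the slide; the function name `percent_increase` is my own, and the formula is inferred from the numbers rather than stated in the presentation.

```python
def percent_increase(new_count, old_count):
    """Percentage increase of new_count over old_count."""
    if old_count == 0:
        raise ValueError("old_count must be non-zero")
    return 100.0 * (new_count - old_count) / old_count


# (EveryZing results, FoxSports results) as reported on the slide.
results = {
    "Manny Ramirez": (22, 7),
    "Yankees": (281, 111),
    "Manchester United": (21, 2),
    "Golf": (214, 170),
    "Federer": (45, 15),
    "David Beckham": (36, 17),
    "Tom Brady": (53, 31),
}

for term, (everyzing, foxsports) in results.items():
    print(f"{term}: {percent_increase(everyzing, foxsports):.1f}%")
```

The recomputed values match the slide to within a percentage point; for example, Golf works out to about 25.9%, which the slide shows as 25%.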

Summary
  Speech recognition takes an inaccessible data structure (audio) and turns it into an accessible one (text)
  It's far from perfect, but it's a big jump from nothing
  Take away: It's the task that matters. Find the right role, and speech recognition works
  (Corollary: A good prompt is worth two years of research)

Media Merchandising Solutions
  Thank you!
  Marie Meteer, VP of Speech and NLP
  mmeteer@everyzing.com
  www.everyzing.com