1 DUTIE Speech: Determining Utility Thresholds for Information Extraction from Speech
John Makhoul, Rich Schwartz, Alex Baron, Ivan Bulyko, Long Nguyen, Lance Ramshaw, Dave Stallard, Bing Xiang

2 Objective
 Estimate the speech recognition accuracy required to support utility, in the form of question answering (QA)
 Follow-on to the earlier DUTIE study on text
– Entities and relations were extracted into a database, which human subjects then used for a QA task
– Measured human QA performance as a function of information extraction (IE) scores
 Extension to speech recognition
– Measure the effect of speech recognition errors on IE scores
– Assuming the same relation between IE scores and QA, infer the effect of speech recognition on QA performance

3 Original DUTIE Study with Text Input
 Two databases: one from fully automatic IE, one from manual annotation
– Each populated with entities, relations, and co-reference links
– 946 articles
 The two databases were blended to produce a continuum of database qualities, as measured by
– Entity Value Score (EVS)
– Relation Value Score (RVS)
 For each blended database, measured human performance
– QA performance
– Time taken to answer each question, in seconds

4 DUTIE Results
 Need to cut the IE error rate roughly in half to achieve 70% QA performance

5 Relative QA Performance vs. EVS
 Same results, rescaled by the QA performance achieved with perfect IE scores

6 DUTIE Speech Corpus
 The DUTIE speech corpus consists of 946 articles with 34.7 hours of audio data in total
– Same articles as in the original DUTIE study
– 15.5 hours of TDT broadcast news data
ABC, CNN, PRI, VOA (Jan.–June 1998, Oct.–Dec. 2000)
MNB, NBC (Oct.–Dec. 2000)
– 19.2 hours of newswire text read aloud, recorded at LDC
APW, NYT (Feb.–June 1998, Oct.–Dec. 2000)

7 DUTIE Speech Process
 Speech recognition
– Takes audio; outputs text in SNOR format
– Run at four different levels of accuracy
 Punctuation
– Takes recognition output; adds periods and commas
– Two methods: forced alignment vs. automatic punctuation
 Information extraction (IE)
– Takes punctuated text and finds entities and relations
– Produces ACE Program Format (APF) XML
 Scoring IE
– Compares test and reference APFs and computes the Entity Value Score and Relation Value Score
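To make the data flow concrete, here is a minimal Python sketch of how the four stages chain together. Every function name below is a hypothetical stand-in, not one of BBN's actual tools, and the bodies are stubs.

```python
# Hypothetical stand-ins for the four pipeline stages described above;
# none of these names correspond to actual BBN tooling.

def recognize(audio_path: str, system: str) -> str:
    """ASR stage: audio in, SNOR-format text out (one of four accuracy levels)."""
    raise NotImplementedError

def punctuate(snor_text: str, method: str) -> str:
    """Punctuation stage: adds periods/commas; method is 'forced' or 'auto'."""
    raise NotImplementedError

def extract_ie(text: str) -> str:
    """IE stage: punctuated text in, ACE Program Format (APF) XML out."""
    raise NotImplementedError

def score_ie(test_apf: str, ref_apf: str) -> tuple[float, float]:
    """Scoring stage: returns (Entity Value Score, Relation Value Score)."""
    raise NotImplementedError

def dutie_speech_run(audio_path, ref_apf, system="RT04", method="auto"):
    text = punctuate(recognize(audio_path, system), method)
    return score_ie(extract_ie(text), ref_apf)
```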

8 Block Diagram

9 Speech Recognition
 Four systems were used to produce a range of word error rates
– System I: BBN RT04 stand-alone 10xRT system, with heavily weighted DUTIE text in language model training (cheating)
– System II: BBN RT04 stand-alone 10xRT system, with normally weighted DUTIE text in language model training (some cheating)
– System III: BBN RT02 system (fair)
– System IV: BBN RT02 system, with decreased grammar weight in decoding (degraded)

10 Sentence Boundary Detection Model
 Sentence boundaries included periods, question marks, and exclamation points
 Use a 3-gram LM to compute the probability of a sentence boundary at each word position [Stolcke 1996]
 Training data
– TDT3 closed captions (12M words)
– HUB4 transcripts (120M words)
– Gigaword news articles from 2000 (100M words)
 Use Viterbi to find the most likely sequence of tags (see the sketch below)
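A toy sketch of that Viterbi search, in the hidden-event style of [Stolcke 1996]: a boundary token `<b>` (standing for period/question mark/exclamation point collectively) may occur between words, and we search for the boundary sequence that maximizes the LM score. The caller must supply `lm_logprob`, a 3-gram log-probability over words plus `<b>`; the system's actual LM and tag set are not reproduced here.

```python
import math

def tag_boundaries(words, lm_logprob):
    """Viterbi over hidden boundary events. lm_logprob(token, (h1, h2)) must
    return the 3-gram log-probability of `token` given a two-token history;
    words and the boundary token "<b>" share one vocabulary."""
    beams = {("<s>", "<s>"): 0.0}   # history -> best log-prob so far
    back = []                        # per word: history -> (prev history, boundary?)
    for w in words:
        new_beams, choices = {}, {}
        for hist, lp in beams.items():
            for boundary in (False, True):
                score, h = lp, hist
                for tok in (["<b>", w] if boundary else [w]):
                    score += lm_logprob(tok, h)
                    h = (h[1], tok)
                if score > new_beams.get(h, -math.inf):
                    new_beams[h], choices[h] = score, (hist, boundary)
        beams = new_beams
        back.append(choices)
    # Backtrace from the best final history.
    best, bounds = max(beams, key=beams.get), []
    for choices in reversed(back):
        best, b = choices[best]
        bounds.append(b)
    return bounds[::-1]   # bounds[i] is True iff a boundary precedes words[i]
```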

11 Automatic Punctuation Results
 The 3-gram word LM gives a near-state-of-the-art period error rate (state of the art is 60%, as reported at RT-04)
 Punctuation performance is sensitive to WER (in part because the LM is trained on errorless text)
 Further improvements are possible with new models or prosodic features
[Table: period error rate (%) vs. WER (%) for state-of-the-art ASR; the numeric values did not survive in this transcript]

12 Reference Punctuation
 Tokenize the reference into words labeled with punctuation triplets
1) Punctuation attached to the beginning of the word
2) Punctuation attached to the end of the word
3) Unattached punctuation (e.g., hyphens) to the right of the word
 Align reference and hypothesis words
 Attach each reference word's punctuation to the hypothesis word it is aligned to
Ref text: Hello, I'm looking for a size ten shoe. I prefer black, and don't care about price.
ASR out:  JELLO I'M LOOKING FOR * SHOE I PREFER * AND DON'T CARE ABOUT PRICE
Output:   JELLO, I'M LOOKING FOR SHOE. I PREFER, AND DON'T CARE ABOUT PRICE.
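A minimal Python sketch of this transfer, with difflib standing in for whatever word aligner the study actually used. For brevity it handles only trailing punctuation (triplet case 2), and it carries punctuation from a deleted reference word to the nearest preceding aligned word, which is what reproduces the slide's example.

```python
import difflib, re

def transfer_punctuation(ref_tokens, hyp_words):
    """ref_tokens carry punctuation ("Hello,", "shoe."); hyp_words are bare
    ASR words. Only trailing punctuation is transferred; punctuation on a
    deleted reference word sticks to the nearest earlier aligned word."""
    core = lambda t: re.sub(r"[^0-9A-Za-z]", "", t).upper()
    sm = difflib.SequenceMatcher(a=[core(t) for t in ref_tokens],
                                 b=[core(w) for w in hyp_words], autojunk=False)
    out = list(hyp_words)
    last_j = 0   # hypothesis word that punctuation currently sticks to
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        for k, i in enumerate(range(i1, i2)):
            if op in ("equal", "replace") and j1 + k < j2:
                last_j = j1 + k   # this ref word is aligned to hyp word last_j
            m = re.search(r"(\W+)$", ref_tokens[i])
            if m:
                out[last_j] += m.group(1)
    return out

# Reproduces the slide's Output line exactly:
print(" ".join(transfer_punctuation(
    "Hello, I'm looking for a size ten shoe. I prefer black, and don't care about price.".split(),
    "JELLO I'M LOOKING FOR SHOE I PREFER AND DON'T CARE ABOUT PRICE".split())))
```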

13 Information Extraction
 Finds entities and the relations between them
 Identifies entities by character offset intervals in the input text file
– Character offsets are defined literally: all whitespace and punctuation is counted!
 Produces an ACE Program Format (APF) XML expression
Example input text [the APF XML markup did not survive in this transcript]:
MOSCOW (AP) _ Presidents Leonid Kuchma of Ukraine and Boris Yeltsin of Russia signed an economic cooperation plan Friday ``We have covered the entire list of questions and discussed how we will be tackling them,'' Yeltsin was quoted as saying.
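Because offsets are literal file positions, every space, parenthesis, and underscore counts. A quick check against the sample sentence above; treating the END offset as inclusive follows common ACE charseq usage, but take that convention as an assumption here.

```python
# Literal character offsets: count every whitespace and punctuation character.
text = ("MOSCOW (AP) _ Presidents Leonid Kuchma of Ukraine and "
        "Boris Yeltsin of Russia signed an economic cooperation plan Friday")
start = text.index("Leonid Kuchma")
end = start + len("Leonid Kuchma") - 1   # inclusive END, per ACE convention
print(start, end, text[start:end + 1])   # -> 25 37 Leonid Kuchma
```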

14 Scoring IE, Part I
 The IE scoring program compares the character offset intervals of entities in the reference and test APFs
– Requires 30% overlap
 Problem #1: Character offsets in the reference APFs reflect all whitespace formatting in the original text file
– But the recognizer output has different character offsets, so the test offsets will be wrong
 Solution (sketched below):
1. Align the words in the reference and the test
2. Based on this alignment, compute a character offset mapping between reference and test
3. Change the character positions in the test APF using the mapping
4. Compute the IE scores
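A rough sketch of steps 1–2, again with difflib standing in for whatever aligner was actually used. The mapping goes from test (recognizer-output) positions to reference positions, so step 3 can rewrite the test APF's offsets into reference coordinates before scoring; positions inside inserted hypothesis words get no mapping.

```python
import difflib, re

def test_to_ref_offsets(ref_text: str, hyp_text: str) -> dict[int, int]:
    """Map character positions in the recognizer output (hyp_text) to the
    corresponding positions in the original reference text, via word
    alignment on case-normalized words."""
    ref = [(m.group().upper(), m.start()) for m in re.finditer(r"\S+", ref_text)]
    hyp = [(m.group().upper(), m.start()) for m in re.finditer(r"\S+", hyp_text)]
    sm = difflib.SequenceMatcher(a=[w for w, _ in ref],
                                 b=[w for w, _ in hyp], autojunk=False)
    mapping = {}
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("equal", "replace"):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                (rw, rs), (hw, hs) = ref[i], hyp[j]
                for k in range(min(len(rw), len(hw))):
                    mapping[hs + k] = rs + k   # test position -> reference position
    return mapping
```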

15 Scoring IE, Part II
 Problem #2: The IE scoring program compares only character offset intervals, not the words in them
– So it may ignore word errors within a name: "George Hush" vs. "George Bush"
 Solution: Modify the scoring program to require a match of the alphanumeric characters in the test and reference character intervals
– Modification courtesy of George Doddington
– Requires 50% content overlap
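The slide does not spell out how the 50% content overlap is computed. One plausible reading, sketched below, is position-wise agreement of the alphanumeric characters; this is an illustration, not a reconstruction of Doddington's actual modification.

```python
import re

def content_overlap_ok(ref_span: str, test_span: str, threshold: float = 0.5) -> bool:
    """Require the alphanumeric content of the two extents to agree, not just
    their offsets. Position-wise agreement is one plausible reading of the
    slide's '50% content overlap' requirement."""
    a = re.sub(r"[^0-9A-Za-z]", "", ref_span).upper()
    b = re.sub(r"[^0-9A-Za-z]", "", test_span).upper()
    if not a or not b:
        return a == b
    agree = sum(x == y for x, y in zip(a, b))   # matches over the shorter string
    return agree / max(len(a), len(b)) >= threshold
```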

16 Detailed Results
[Table: Entity Value Score (%) and Relation Value Score (%) at each WER level (0.0% reference transcript plus the four ASR systems), under punctuation conditions marked with X: all punctuation correct, correct periods, correct commas, automatic periods. The WER values and numeric scores did not survive in this transcript.]

17 Effect of Punctuation on Entity Value Score
 Sentence boundaries are required, but their exact locations are not critical (the loss is only 2.8% relative even at a 62% period error rate)
 Losing commas results in a 9.5% reduction in Entity score
– Shows the importance of appositives to IE ("George W. Bush, President of the United States, said this morning …")

18 Entity Value Score as a Function of WER
 The effect of WER on Entity score is linear
 Automatic punctuation loses 13.5% relative to reference punctuation

19 Relation Value Score as a Function of WER
 Automatic punctuation loses 25% relative to reference punctuation

20 Relation Between WER and IE Scores
 Entity Value Score (EVS) and Relation Value Score (RVS) are linear functions of WER
 Automatic punctuation has a multiplicative effect on the scores
 Relative QA is then predicted as a function of EVS (a rough sketch follows)
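A back-of-the-envelope rendering of this prediction chain. The functional form (linear in WER, multiplicative for punctuation) is from the slides, but the slope and the QA-vs-EVS curve are placeholders the reader would read off the plots; the 13.5% figure cited in the docstring is the automatic-punctuation loss reported earlier.

```python
def predict_relative_qa(wer_pct, evs_ref, evs_slope, punct_factor, qa_of_evs):
    """wer_pct: word error rate in percent. evs_ref: EVS at 0% WER with
    reference punctuation. evs_slope: EVS points lost per WER point
    (placeholder: read off the EVS-vs-WER plot). punct_factor: multiplicative
    loss for automatic punctuation, e.g. 1 - 0.135 for a 13.5% relative loss.
    qa_of_evs: the DUTIE relative-QA-vs-EVS curve."""
    evs = max(0.0, evs_ref - evs_slope * wer_pct) * punct_factor
    return qa_of_evs(evs)
```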

21 Predicted Relative QA vs. WER and EVS(ref)
 At 12% WER with today's IE, we get 33% of maximum QA performance
– Near zero at 25% WER (e.g., for non-English)
 With half the IE error rate, half the WER, and half the loss from punctuation, we estimate 72% of maximum QA

22 Conclusions
 IE scores degrade linearly with WER
 Sentence boundaries are required, but their exact locations are not critical
 Commas are important for IE
 With current technology (e.g., 12% WER and 60% EVS on text), we can achieve only 33% of maximum QA performance
 If the IE error rate and WER were cut in half, and the loss due to commas were also halved, QA performance could increase to over 70% of maximum