Data Mining, Information Extraction and Search in Spoken Documents

Presentation transcript:

Data Mining, Information Extraction and Search in Spoken Documents Julia Hirschberg CS 4706 12/2/2018

Today
- Data mining from text
- Searching audio data instead of text
- Information extraction from spoken documents
- Speech data mining

Data Mining
- Discovery of trends and patterns across very large datasets, usually for decision-making purposes
  - Fraud detection in banking, telephony
  - Stock market
  - Indications of demographic disasters
  - New causes of diseases
- …finding things you don’t know you’re looking for
- Information retrieval vs. ‘mining for nuggets’

Data Mining in Computational Linguistics
- Finding lexical co-occurrence information
- Finding parallel text corpora on the web for MT
- Finding ‘new’ topics in news stories (TDT task)
- Exploring citation links: networks of influence
- Information extraction, e.g. find mutual acquaintances

Snowball (Agichtein et al. ’01)
- Seed set of patterns, e.g.
  - Norman Mailer, 59 → <firstname> <lastname>, <age>
  - the 59-year-old Mailer → the <age>-year-old <lastname>
- Find more patterns by looking for, e.g., Mailer close to 59
  - Mailer turned 59 last week.
  - Though Mailer is 59…
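
The bootstrapping loop on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not the Snowball implementation: the regular expressions, the helper names `extract` and `induce_patterns`, and the "Roth" sentence are assumptions added for the example, and the pattern and tuple confidence scoring that a real system needs is left out.

```python
import re

# Toy corpus. The Mailer sentences follow the slide; the Roth sentence is invented.
corpus = [
    "Norman Mailer, 59, spoke on Tuesday.",
    "the 59-year-old Mailer declined to comment.",
    "Mailer turned 59 last week.",
    "Though Mailer is 59, he still writes daily.",
    "Roth turned 68 last week.",
]

NAME = r"(?P<lastname>[A-Z][a-z]+)"
AGE = r"(?P<age>\d{1,3})"

# Seed patterns: "<firstname> <lastname>, <age>" and "the <age>-year-old <lastname>".
seed_patterns = [
    rf"[A-Z][a-z]+ {NAME}, {AGE}\b",
    rf"the {AGE}-year-old {NAME}",
]

def extract(patterns, sentences):
    """Apply every pattern to every sentence and collect (lastname, age) pairs."""
    pairs = set()
    for pattern in patterns:
        for sentence in sentences:
            for m in re.finditer(pattern, sentence):
                pairs.add((m.group("lastname"), m.group("age")))
    return pairs

def induce_patterns(pairs, sentences):
    """Where a known lastname occurs shortly before its age, keep the text
    in between as a new pattern and replace the name and age with slots."""
    new_patterns = set()
    for last, age in pairs:
        for sentence in sentences:
            m = re.search(rf"{re.escape(last)}(.{{1,20}}?){re.escape(age)}\b", sentence)
            if m:
                new_patterns.add(rf"{NAME}{re.escape(m.group(1))}{AGE}\b")
    return new_patterns

seeds = extract(seed_patterns, corpus)      # pairs found by the seed patterns
learned = induce_patterns(seeds, corpus)    # e.g. "<lastname> turned <age>"
found = extract(learned, corpus)            # learned patterns reach new pairs
```

Here the seed patterns only find the Mailer pair, while the induced "turned" pattern also picks up the invented Roth sentence, which is the point of the bootstrapping step.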

But Searching Audio Data is Harder
- Large amounts of audio data available: on the web, in company archives, in our homes
- We have tools supporting random access to text, but for audio we’re limited to serial search
- How can we develop methods to search audio as easily as text?

Applications
- Searching online TV and radio news and archives
  - Library of Congress
- Searching a/v archives, movies
- Searching trial recordings and legislative sessions
- Searching meetings, customer care exchanges, focus groups
- Telephone calls and voicemail

Current Approach
- Train/adapt a speech recognizer for the corpus
- Produce an ASR transcript
- Segment spoken ‘documents’ into sentences, turns, topics
- Index (errorful) transcripts for Information Retrieval and link to audio via timestamps
- Enables audio search by content
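
The indexing step can be illustrated with a small inverted index that maps each recognized word to the recordings and time offsets where it occurs. This is a sketch under assumptions: the file names, word timings, and the functions `build_index` and `search` are hypothetical, and a real system would add ranking and some tolerance for recognition errors.

```python
from collections import defaultdict

# Hypothetical ASR output: (word, start time in seconds) for each recording.
asr_output = {
    "news_0930.wav": [("tonight", 0.4), ("the", 0.9), ("truth", 1.1), ("about", 1.5),
                      ("the", 1.8), ("budget", 2.0), ("surplus", 2.5)],
    "voicemail_17.wav": [("call", 0.3), ("me", 0.6), ("about", 0.8),
                         ("the", 1.0), ("budget", 1.2)],
}

def build_index(transcripts):
    """Inverted index: word -> list of (recording id, start time)."""
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index

def search(index, query):
    """Return (recording id, earliest hit time) for recordings containing
    every query word, so playback can start near the first match."""
    terms = query.lower().split()
    if not terms:
        return []
    hits_per_term = [index.get(term, []) for term in terms]
    docs = set.intersection(*({doc for doc, _ in hits} for hits in hits_per_term))
    return sorted(
        (doc, min(t for hits in hits_per_term for d, t in hits if d == doc))
        for doc in docs
    )

index = build_index(asr_output)
both_words = search(index, "budget surplus")
one_word = search(index, "budget")
```

Because each posting keeps a timestamp, a text hit can be turned directly into a playback position in the audio, which is what makes content-based audio search possible.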

Some Examples
- SpeechBot: searching internet broadcasts
- Google Voice Search: search audio by voice (not yet)
- SCANMail: searching voicemail

Information Extraction and QA from Speech
- DARPA GALE project: improve information gathering from text, speech, translations
- Current domain: newswire and news broadcasts in English, Arabic, and Mandarin
- 3 competing teams
- ASR/MT bakeoffs
- ‘Distillation’ evaluations
- QA
- User studies
- Requires identification and annotation of information and ‘formatting’ in speech

Sample Distillation Questions
- List facts about <event>
- Find people who are mutual acquaintances of <person1> and <person2>
- Identify persons arrested from <organization> and give their name and role in that organization
- Produce a biography of <person>
- Provide information on <organization>
- Find statements made by or attributed to <person> about <topic>
- How did <country> react to <event>?

Nightingale Architecture
[Architecture diagram. Component labels: Automatic Annotation, Distillation, Speaker modeling, Information assimilation, ASR, MT, Audio diarization, Prosodic metadata, Target Language, Source Language, Punctuation, Capitalization, Info repository, Linguistic structure, Prosodic analysis, Names, Relations, Intelligence delivery, Topic modeling]

Information Annotation
Spoken documents…
- Lack many cues found in text documents
  - Format (sentences, turns, paragraphs)
- Include spontaneous speech phenomena which are difficult for ASR and NLP technologies to handle
  - Disfluencies, fragments
- Contain errors
Annotation can turn a weakness into a strength

From an ASR Transcript
aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

To Speaker Segmentation (Diarization)
Speaker: 0 - aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston
Speaker: 1 - the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

Add Speaker Role Labels
Anchor - aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston
Reporter - the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

Perform Sentence Detection and Punctuation
Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston.
Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.

Detect Story Boundaries
Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston.
Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.

Detect Disfluencies (and Keep/Remove)
Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston.
Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.
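
As a rough illustration of this step, the sketch below tags two easy kinds of disfluency in a token sequence: filled pauses such as "uh" and immediate word repetitions such as "was was". It is a rule-based toy under stated assumptions, not the method used in the lecture's systems; the filler list and the function names `tag_disfluencies` and `remove_disfluencies` are invented for the example, and practical detectors typically combine lexical and prosodic features in a trained classifier.

```python
# Assumed list of filled pauses; a real system would learn these from data.
FILLED_PAUSES = {"uh", "um", "er", "eh"}

def tag_disfluencies(tokens):
    """Label each token as FILLER, REPEAT (first copy of an immediate
    repetition), or OK."""
    tags = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in FILLED_PAUSES:
            tags.append("FILLER")
        elif i + 1 < len(tokens) and tokens[i + 1].lower() == word:
            tags.append("REPEAT")
        else:
            tags.append("OK")
    return tags

def remove_disfluencies(tokens):
    """Keep only the tokens tagged OK."""
    return [tok for tok, tag in zip(tokens, tag_disfluencies(tokens)) if tag == "OK"]

# Fragment of the ASR transcript from the slides.
tokens = "the u s was was told local own boss good evening uh from the university".split()
tags = tag_disfluencies(tokens)
cleaned = " ".join(remove_disfluencies(tokens))
```

Keeping the tags separate from the removal step mirrors the slide title: an application can choose to keep the disfluencies, remove them, or simply mark them.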

Detect Named Entities
Anchor - Aides tonight in Boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by Milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening from the University of Massachusetts in Boston.
Reporter - The site of the widely anticipated first of eight between vice president Al Gore and Governor George W. Bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from Jim Lehrer of P.B.S. N.B.C.'s David Gregory is here with Governor Bush. Claire Shipman is covering the vice president Claire you begin tonight please.

Resolve References
Anchor - Aides tonight in Boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by Milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening from the University of Massachusetts in Boston.
Reporter - The site of the widely anticipated first of eight between vice president Al Gore and Governor George W. Bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from Jim Lehrer of P.B.S. N.B.C.'s David Gregory is here with Governor Bush [Governor George W. Bush]. Claire Shipman is covering the vice president Claire [Claire Shipman] you begin tonight please.

Speech Data Mining
How does it differ from text data mining?
- Must handle errorful transcription
- Lacks (reliable) formatting
- Contains spontaneous speech phenomena
We need to bring additional sources to bear on the problem

Maskey et al. 2004: Improving Proper Name Transcription in Voicemail
- How can we improve transcription of proper names without increasing the size of the ASR lexicon?
- Use meta-data available at runtime to hypothesize caller’s and callee’s names
  - Caller ID string: “cname”
  - Name of mailbox owner: “mname”

Corpus
- Scanmail corpus: 100 hours of voicemail messages collected from the voicemail boxes of 140 AT&T employees
- Manually transcribed, with caller ID (“cname”) and mailbox owner (“mname”) parts bracketed
- Approximately gender balanced
- ~12% of messages by non-native speakers
- 238 randomly selected messages for testing, the rest (~10,000 messages) for training
- In the test set, 317 word tokens were caller names and 219 were mailbox owner names

Approach
- Create a class-based language model
- Create a name network to give instances for the classes of the model
- At runtime, replace the classes in the class-based language model with the appropriate name networks, identified from the cname and mname of the call
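
One way to picture the runtime substitution is a class-based bigram model in which a name is scored as P(class | previous word) × P(name | class), with the within-class distribution swapped in per message. The sketch below is a toy under loud assumptions: the bigram probabilities are made up, the uniform `name_network` is a placeholder, and `word_prob` is a hypothetical helper; it is not the recognizer or the network format used by Maskey et al.

```python
# Toy class-based bigram model. All probabilities are invented for illustration.
bigram = {
    ("hi", "<mname>"): 0.6,
    ("hi", "there"): 0.4,
    ("is", "<cname>"): 0.7,
    ("is", "me"): 0.3,
}

def name_network(variants):
    """Placeholder name network: a uniform distribution over name variants
    (an assumption; the slides estimate these probabilities from data)."""
    return {variant: 1.0 / len(variants) for variant in variants}

def word_prob(prev, word, networks):
    """P(word | prev): the ordinary bigram probability plus, for each class,
    P(class | prev) * P(word | class) from that message's name network."""
    prob = bigram.get((prev, word), 0.0)
    for cls, members in networks.items():
        prob += bigram.get((prev, cls), 0.0) * members.get(word, 0.0)
    return prob

# Networks chosen at runtime from the call's metadata (hypothetical values).
networks = {
    "<mname>": name_network(["alex", "alexander"]),  # mailbox owner
    "<cname>": name_network(["jones"]),              # caller ID string
}

p_owner = word_prob("hi", "alex", networks)
p_caller = word_prob("is", "jones", networks)
p_other_name = word_prob("is", "smith", networks)
```

Only the names plausible for this particular call receive probability mass, which is how the approach avoids enlarging the recognizer's general lexicon.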

Name Network
- To get values for “mname” and “cname”, an internal AT&T employee directory listing (~40,000 people) was used
- “cname” created from variations of static titles (Miss, Mr), full first names and nicknames (Alexander, Alex), and last names (Jones)
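
A small sketch of how such variations might be enumerated for one directory entry follows. The title list comes from the slide's examples, but the nickname table, the particular combinations generated, and the function `name_variants` are assumptions made for illustration; the slides do not specify the exact set of variants.

```python
TITLES = ["mr", "miss"]                 # examples given on the slide
NICKNAMES = {"alexander": ["alex"]}     # hypothetical nickname table

def name_variants(first, last):
    """Enumerate spoken forms of a name: first name or nickname alone,
    first name or nickname plus last name, title plus last name, last name alone."""
    first, last = first.lower(), last.lower()
    variants = {last}
    for given in [first] + NICKNAMES.get(first, []):
        variants.add(given)
        variants.add(f"{given} {last}")
    for title in TITLES:
        variants.add(f"{title} {last}")
    return sorted(variants)

variants = name_variants("Alexander", "Jones")
```

Each variant would become one path through the name network for that caller or mailbox owner.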

Name Network
- Probability within class: estimated from the training corpus
- Probability within first names: estimated from the AT&T directory listing

Experimental Results
- Word Error Rate (WER) improvement small
  - Absolute reduction of 0.6%
- Name Error Rate (NER) improvement significant
  - Absolute reduction of 20%
- Large reduction in NER important:
  - Getting a name right is important to business users
  - Scanmail users expressed a strong desire for the system to recognize their own names correctly
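
For readers unfamiliar with the metrics, WER is the word-level edit distance between the reference and the recognizer output, divided by the number of reference words. The sketch below computes it with the standard dynamic program. The `name_error_rate` function is a deliberately simplified assumption (the fraction of reference name tokens missing from the hypothesis) and may differ from the definition used in the paper; the example sentences are invented.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn the word list ref into the word list hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def name_error_rate(ref_names, hypothesis):
    """Simplified assumption: share of reference name tokens that do not
    appear anywhere in the hypothesis."""
    hyp_words = set(hypothesis.split())
    missed = sum(1 for name in ref_names if name not in hyp_words)
    return missed / len(ref_names)

reference = "hi alex this is jones call me back"
hypothesis = "hi alice this is jones call me"
example_wer = wer(reference, hypothesis)                          # 2 errors over 8 words
example_ner = name_error_rate(["alex", "jones"], hypothesis)      # 1 of 2 names missed
```

The example also shows why the two numbers can move differently: names are a small share of all word tokens, so fixing them changes WER only slightly while changing the name error rate a great deal.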

Next Class
- HTK Toolkit and HW5 (Fadi Biadsy)