CS 4705 Corpus Linguistics and Machine Learning Techniques

Review
What do we know so far?
– Words (stems and affixes, roots and templates, …)
– POS (e.g. nouns, verbs, adverbs, adjectives, determiners, articles, …)
– Named Entities (e.g. person names)
– Ngrams (simple word sequences)
– Syntactic constituents (NPs, VPs, Ss, …)

What useful things can we do with only this knowledge?
– Find sentence boundaries, abbreviations
– Find Named Entities (person names, company names, telephone numbers, addresses, …)
– Find topic boundaries and classify articles into topics
– Identify a document's author and their opinion on the topic, pro or con
– Answer simple questions (factoids)
– Do simple summarization/compression

But first, we need corpora…
Online collections of text and speech. Some examples:
– Brown Corpus
– Wall Street Journal and AP News
– ATIS, Broadcast News
– TDT
– Switchboard, CallHome
– TRAINS, FM Radio, BDC Corpus
– the Hansard parallel corpus of French and English
– and many private research collections
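
Several of these corpora are bundled with the NLTK toolkit; below is a minimal sketch of loading one of them, assuming NLTK and its Brown Corpus data package are installed.

```python
# A minimal sketch of loading the Brown Corpus via NLTK (an assumption:
# the slides do not prescribe a particular toolkit).
import nltk
nltk.download("brown", quiet=True)   # fetch the corpus data once

from nltk.corpus import brown

print(len(brown.words()))            # ~1.16M word tokens
print(brown.tagged_sents()[0])       # first sentence as (word, POS) pairs
print(brown.categories())            # news, fiction, religion, ...
```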

Next, we pose a question… the dependent variable
Binary questions:
– Is this word followed by a sentence boundary or not? A topic boundary?
– Does this word begin a person name? End one?
– Should this word or sentence be included in a summary?
Other classification:
– Is this document about medical issues? Politics? Religion? Sports? …
Predicting continuous variables:
– How loud or high should this utterance be produced?

Finding a suitable corpus and preparing it for analysis
Which corpora can answer my question?
– Do I need to get them labeled to do so?
Dividing the corpus into training and test corpora:
– To develop a model, we need a training corpus
  – an overly narrow corpus doesn't generalize
  – an overly general corpus doesn't reflect the task or domain
– To demonstrate how general our model is, we need a test corpus to evaluate the model
  – development test set vs. held-out test set
– To evaluate our model we must choose an evaluation metric
  – accuracy
  – precision, recall, F-measure, …
  – cross-validation
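
These choices are easy to see in code. A minimal sketch using scikit-learn (an illustrative choice, not the course's tool), where load_labeled_corpus is a hypothetical helper standing in for whatever corpus preparation produces feature vectors X and binary labels y:

```python
# A minimal sketch of the train/test methodology and evaluation metrics.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

X, y = load_labeled_corpus()   # hypothetical helper: vectors + labels

# Hold out a test set the model never sees during development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
pred = clf.predict(X_test)

print(accuracy_score(y_test, pred))
# F-measure combines precision (P) and recall (R): F1 = 2PR / (P + R)
print(precision_recall_fscore_support(y_test, pred, average="binary"))

# Cross-validation: average performance over 10 train/test folds
# instead of trusting a single split.
print(cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean())
```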

Then we build the model…
Again, identify the dependent variable: what do we want to predict or classify?
– Does this word begin a person name? Is this word within a person name?
Identify the independent variables: what features might help to predict the dependent variable?
– What is this word's POS? What is the POS of the word before it? After it?
– Is this word capitalized? Is it followed by a '.'?
– How far is this word from the beginning of its sentence?
Extract the values of each variable from the corpus by some automatic means (see the sketch after the table below)

A Sample Feature Vector for Sentence Ending Detection WordIDPOSCap?, After?Dist/SbegEnd? ClintonNyn1n wonVnn2n easilyAdvny3n butConjnn4n

An Example: Finding Caller Names in Voicemail (SCANMail)
Motivated by interviews, surveys and usage logs of heavy users:
– Hard to scan new msgs to find those you need to deal with quickly
– Hard to find the msg you want in the archive
– Hard to locate the information you want in any msg
How could we help?

SCANMail Architecture
[Architecture diagram showing the Caller, the SCANMail system, and the Subscriber]

Corpus Collection
– Recordings collected from 138 AT&T Labs employees' mailboxes
– 100 hours; 10K msgs; 2500 speakers
– Gender balanced; 12% non-native speakers
– Mean message duration 36.4 secs, median 30.0 secs
– Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telnos)
– Also recognized using an ASR engine

Transcription and Bracketing
[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [.hn ] I guess there's some [.hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [.hn ] anyway they had this idea [ cos ] since I think J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [.hn ] well J2 actually offered to take J home with her and then would she would meet you back at the synagogue at [ Time: five thirty ] to pick her up [.hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [.hn ] I wanted to know how you feel before I tell her one way or the other so call me [.hn ] right away cos I have to get back to her in about an hour so [.hn ] okay [ Closing: bye ] [.nhn ] [.onhk ]
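
Annotations in this bracketed format can be recovered mechanically. A small sketch, assuming the "[ Label: text ]" syntax shown above, where filled pauses like [ um ] and noise marks like [.hn ] carry no label and are skipped:

```python
# A sketch of pulling "[ Label: text ]" annotations out of a bracketed
# transcript; unlabeled brackets ([ um ], [.hn ], ...) are ignored.
import re

PATTERN = re.compile(r"\[\s*(\w+):\s*([^\[\]]+?)\s*\]")

transcript = "[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ]"
for label, text in PATTERN.findall(transcript):
    print(f"{label}: {text!r}")
# Greeting: 'hi R'
# CallerID: "it's me"
```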

SCANMail Demo
mail/
Audix extension: demo
Audix password: (null)

Information Extraction (Martin Jansche and Steve Abney)
Goals: extract key information from msgs to present in headers
Approach:
– Supervised learning from transcripts (phone #'s, caller self-ids)
– Combine machine learning techniques with simpler alternatives, e.g. hand-crafted rules
– Two-stage approaches
– Features exploit structure of key elements (e.g. length of phone numbers) and of surrounding context (e.g. self-ids tend to occur at the beginning of a msg)

Telephone Number Identification
– Rules convert all numbers to a standard digit format
– Predict start of phone number with rules
  – This step over-generates
  – Prune with a decision-tree classifier
– Best features:
  – Position in msg
  – Lexical cues
  – Length of digit string
– Performance:
  – .94 F on human-labeled transcripts
  – .95 F on ASR
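
A toy sketch of this two-stage design. The rule, the features, and the training data below are invented for illustration; the real system's rules and feature set were richer:

```python
# Two-stage sketch: rules over-generate candidate digit runs, then a
# decision-tree classifier prunes them using the features listed above.
from sklearn.tree import DecisionTreeClassifier

def digit_runs(words):
    """Rule stage (over-generates): every maximal run of digit tokens."""
    runs, start = [], None
    for i, w in enumerate(words + ["<eom>"]):   # sentinel closes a final run
        if w.isdigit():
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    return runs

def feats(words, start, end):
    return [start / len(words),                      # position in msg
            sum(len(w) for w in words[start:end])]   # length of digit string

msg = "call me back at 9 7 3 5 5 5 0 1 2 3 i am in room 12".split()
runs = digit_runs(msg)                               # [(4, 14), (18, 19)]

# Prune stage, fit here on one toy message just to show the shape of the
# data; label 1 marks the run that really is a phone number.
X = [feats(msg, s, e) for s, e in runs]
y = [1, 0]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print([msg[s:e] for (s, e), keep in zip(runs, tree.predict(X)) if keep])
```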

Caller Self-Identifications
– Predict start of id with a classifier
  – 97% of ids begin 1-7 words into the msg
– Then predict the length of the phrase
  – Majority are only 2-4 words long
– Avoid the risk of relying on correct speech recognition for names
– Best cues to the end of the phrase are a few common words
  – 'I', 'could', 'please'
  – No actual names: they over-fit the data
– Performance:
  – .71 F on human-labeled transcripts
  – .70 F on ASR
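
A heavily simplified sketch of the same predict-start-then-predict-length idea. The cue sets below are invented for illustration; the actual system learned both decisions with trained classifiers:

```python
# Simplified sketch: find a self-id start near the front of the msg, then
# end the phrase at a common function word rather than at a (sparse,
# ASR-unreliable) name.
START_CUES = {"it's", "this"}                  # invented illustration
END_CUES = {"i", "could", "please", "give"}    # common words, no names

def guess_self_id(words, max_start=7, max_len=4):
    for start in range(min(max_start, len(words))):
        if words[start].lower() in START_CUES:
            for end in range(start + 1, min(start + 1 + max_len, len(words))):
                if words[end].lower() in END_CUES:
                    return words[start:end]
    return None

print(guess_self_id("hi it's me give me a call right away".split()))
# ["it's", 'me']
```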

Introduction to Weka