
1 CS 4705 Corpus Linguistics and Machine Learning Techniques

2 Review
What do we know about so far?
–Words (stems and affixes, roots and templates, …)
–POS (e.g. nouns, verbs, adverbs, adjectives, determiners, articles, …)
–Named Entities (e.g. Person Names)
–Ngrams (simple word sequences)
–Syntactic Constituents (NPs, VPs, Ss, …)

3 What useful things can we do with only this knowledge?
Find sentence boundaries, abbreviations
Find Named Entities (person names, company names, telephone numbers, addresses, …)
Find topic boundaries and classify articles into topics
Identify a document's author and their opinion on the topic, pro or con
Answer simple questions (factoids)
Do simple summarization/compression

4 But first, we need corpora…
Online collections of text and speech
Some examples
–Brown Corpus
–Wall Street Journal and AP News
–ATIS, Broadcast News
–TDT
–Switchboard, Call Home
–TRAINS, FM Radio, BDC Corpus
–Hansards' parallel corpus of French and English
–And many private research collections

5 Next, we pose a question… the dependent variable
Binary questions:
–Is this word followed by a sentence boundary or not?
–A topic boundary?
–Does this word begin a person name? End one?
–Should this word or sentence be included in a summary?
Other classification:
–Is this document about medical issues? Politics? Religion? Sports? …
Predicting continuous variables:
–How loud or high should this utterance be produced?
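
To make the three kinds of questions concrete, here is a minimal Python sketch of how each becomes a dependent variable paired with some features; the field names and example values are invented for illustration, not drawn from any corpus.

```python
# Each training example pairs a feature description with a dependent variable.

# Binary question: is this word followed by a sentence boundary?
token_examples = [
    ({"word": "Clinton", "pos": "N"}, "n"),   # label: n = not a boundary
    ({"word": "but",     "pos": "Conj"}, "n"),
]

# Other classification: what topic is this document about?
doc_examples = [
    ({"doc_id": "d1"}, "politics"),           # label drawn from a fixed topic set
    ({"doc_id": "d2"}, "sports"),
]

# Continuous prediction: how loud should this utterance be produced?
utt_examples = [
    ({"utt_id": "u1"}, 62.5),                 # target is a real number (e.g. dB)
]
```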

6 Finding a suitable corpus and preparing it for analysis
Which corpora can answer my question?
–Do I need to get them labeled to do so?
Dividing the corpus into training and test corpora
–To develop a model, we need a training corpus
   overly narrow corpus: doesn't generalize
   overly general corpus: doesn't reflect task or domain
–To demonstrate how general our model is, we need a test corpus to evaluate the model
   Development test set vs. held-out test set
–To evaluate our model we must choose an evaluation metric
   Accuracy
   Precision, recall, F-measure, …
   Cross-validation
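
The evaluation choices on this slide can be made concrete with a short, self-contained Python sketch: accuracy and precision/recall/F-measure for a binary question, plus a simple k-fold split for cross-validation. The example labels and the fold scheme are illustrative only.

```python
# Accuracy, precision/recall/F-measure, and a simple k-fold split.

def precision_recall_f(gold, predicted, positive="y"):
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

def accuracy(gold, predicted):
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)

def k_fold_indices(n_examples, k=10):
    """Split example indices into k folds; each fold serves once as the test set."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for test in folds:
        train = [i for i in range(n_examples) if i not in set(test)]
        yield train, test

# Example: gold vs. predicted "sentence boundary?" labels for eight tokens.
gold      = ["n", "n", "y", "n", "y", "n", "n", "y"]
predicted = ["n", "y", "y", "n", "n", "n", "n", "y"]
print(accuracy(gold, predicted))             # 0.75
print(precision_recall_f(gold, predicted))   # (0.667, 0.667, 0.667)
```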

7 Then we build the model…
Again, identify the dependent variable: what do we want to predict or classify?
–Does this word begin a person name? Is this word within a person name?
Identify the independent variables: what features might help to predict the dependent variable?
–What is this word's POS? What is the POS of the word before it? After it?
–Is this word capitalized? Is it followed by a '.'?
–How far is this word from the beginning of its sentence?
Extract the values of each variable from the corpus by some automatic means

8 A Sample Feature Vector for Sentence Ending Detection

WordID   POS   Cap?  "," After?  Dist/Sbeg  End?
Clinton  N     y     n           1          n
won      V     n     n           2          n
easily   Adv   n     y           3          n
but      Conj  n     n           4          n
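
A minimal sketch of how the feature values in this table might be extracted automatically (slides 7-8) for the fragment "Clinton won easily, but …". The hard-coded POS lookup stands in for a real tagger; the feature names mirror the table columns, and the End? column is the dependent variable, so it is not computed here.

```python
# Extract per-word features for sentence-ending detection, as on slides 7-8.
# POS tags are hard-coded for this one example; in practice they would come
# from a tagger run over the corpus.

words = ["Clinton", "won", "easily", ",", "but"]
pos_of = {"Clinton": "N", "won": "V", "easily": "Adv", "but": "Conj"}

def feature_vector(words, i):
    """Feature values for the word at position i (1-based distance from sentence start)."""
    w = words[i]
    return {
        "word": w,
        "pos": pos_of.get(w, "?"),
        "cap": "y" if w[0].isupper() else "n",
        "comma_after": "y" if i + 1 < len(words) and words[i + 1] == "," else "n",
        "dist_from_sbeg": sum(1 for t in words[: i + 1] if t not in {",", "."}),
    }

for i, w in enumerate(words):
    if w in {",", "."}:          # punctuation tokens are not classified themselves
        continue
    print(feature_vector(words, i))
# {'word': 'Clinton', 'pos': 'N', 'cap': 'y', 'comma_after': 'n', 'dist_from_sbeg': 1}
# ... and so on, matching the rows of the table above.
```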

9 An Example: Finding Caller Names in Voicemail (SCANMail)
Motivated by interviews, surveys and usage logs of heavy users:
–Hard to scan new msgs to find those you need to deal with quickly
–Hard to find the msg you want in the archive
–Hard to locate information you want in any msg
How could we help?

10 SCANMail Architecture
[Architecture diagram: Caller → SCANMail → Subscriber]

11 Corpus Collection
Recordings collected from 138 AT&T Labs employees' mailboxes
100 hours; 10K msgs; 2500 speakers
Gender balanced; 12% non-native speakers
Mean message duration 36.4 secs, median 30.0 secs
Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telnos)
Also recognized using an ASR engine

12 Transcription and Bracketing [ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [.hn ] I guess there's some [.hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [.hn ] anyway they had this idea [ cos ] since I think J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [.hn ] well J2 actually offered to take J home with her and then would she

13 would meet you back at the synagogue at [ Time: five thirty ] to pick her up [.hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [.hn ] I wanted to know how you feel before I tell her one way or the other so call me [.hn ] right away cos I have to get back to her in about an hour so [.hn ] okay [ Closing: bye [.nhn ] [.onhk ]

14 SCANMail Demo
http://www.avatarweb.com/scanmail/
Audix extension: demo
Audix password: (null)

15 Information Extraction (Martin Jansche and Steve Abney)
Goals: extract key information from msgs to present in headers
Approach:
–Supervised learning from transcripts (phone #'s, caller self-ids)
–Combine Machine Learning techniques with simpler alternatives, e.g. hand-crafted rules
–Two-stage approaches

16 –Features exploit structure of key elements (e.g. length of phone numbers) and of surrounding context (e.g. self-ids tend to occur at beginning of msg)

17 Telephone Number Identification
Rules convert all numbers to standard digit format
Predict start of phone number with rules
–This step over-generates
–Prune with decision-tree classifier
Best features:
–Position in msg
–Lexical cues
–Length of digit string
Performance:
–.94 F on human-labeled transcripts
–.95 F on ASR
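
The two-stage idea on this slide, roughly sketched in Python: permissive rules propose candidate digit strings, and a pruning step filters them using position in the msg, lexical cues, and digit-string length. The regular expression, thresholds, and cue words below are illustrative stand-ins, not the rules and decision-tree classifier actually used in SCANMail.

```python
import re

# Stage 1 (rules, deliberately over-generating): after numbers have been
# normalized to digits, propose every run of digits/spaces/hyphens as a
# candidate phone number.
CANDIDATE = re.compile(r"\d[\d\s\-]{5,}\d")

def candidate_spans(transcript):
    return [(m.start(), m.group()) for m in CANDIDATE.finditer(transcript)]

# Stage 2 (pruning): score candidates with the kinds of features on the slide.
# A real system would train a decision tree on labeled msgs; this hand-written
# filter only stands in for it.
def keep(span_start, span_text, transcript, lexical_window=30):
    digits = re.sub(r"\D", "", span_text)
    context = transcript[max(0, span_start - lexical_window):span_start].lower()
    return (
        7 <= len(digits) <= 11                      # length of digit string
        and span_start > len(transcript) * 0.3      # position in msg
        and any(cue in context for cue in ("call", "number", "reach"))  # lexical cues
    )

msg = ("hi it's me about the 3 o'clock meeting on the 14 th "
       "give me a call back at 9 7 3 5 5 5 0 1 2 3 thanks bye")
for start, text in candidate_spans(msg):
    if keep(start, text, msg):
        print("phone number candidate kept:", text)
```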

18 Caller Self-Identifications
Predict start of id with classifier
–97% of id's begin 1-7 words into msg
Then predict length of phrase
–Majority are only 2-4 words long
Avoid risk of relying on correct speech recognition for names
Best cues to end of phrase are a few common words
–'I', 'could', 'please'
–No actual names: they over-fit the data
Performance:
–.71 F on human-labeled
–.70 F on ASR
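
A toy sketch of the two predictions described here: guess where the self-identification starts (it nearly always begins within the first few words) and then how far it extends, using only common words rather than names as end cues. The greeting list, cue list, and length cap are invented for illustration, not the trained models.

```python
# Two-stage caller self-id spotting: (1) pick a start position near the top of
# the msg, (2) predict the phrase length using common-word cues to its end.

GREETINGS = {"hi", "hello", "hey", "good", "morning", "yeah"}
END_CUES = {"i", "could", "please", "just", "calling"}   # common words, no names

def find_self_id(words, max_start=7, max_len=4):
    # Stage 1: the id usually begins 1-7 words into the msg, right after a greeting.
    start = 0
    while start < min(max_start, len(words)) and words[start].lower() in GREETINGS:
        start += 1
    # Stage 2: extend the phrase until a common end-of-id cue word, up to max_len words.
    end = start
    while end < min(start + max_len, len(words)) and words[end].lower() not in END_CUES:
        end += 1
    return " ".join(words[start:end])

msg = "hi this is Mary Smith I was calling about tomorrow".split()
print(find_self_id(msg))   # -> "this is Mary Smith"
```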

19 Introduction to Weka
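
Weka expects training data in its ARFF file format, so a natural first step is to write feature vectors such as those on slide 8 to an .arff file it can load. Below is a minimal sketch; the relation name, attribute names, and output file name are arbitrary choices for this example.

```python
# Write the slide-8 feature vectors to an ARFF file that Weka can load
# (e.g. in the Explorer GUI or with its command-line classifiers).
rows = [
    ("Clinton", "N",    "y", "n", 1, "n"),
    ("won",     "V",    "n", "n", 2, "n"),
    ("easily",  "Adv",  "n", "y", 3, "n"),
    ("but",     "Conj", "n", "n", 4, "n"),
]

header = """@relation sentence_end_detection
@attribute word string
@attribute pos {N,V,Adv,Conj}
@attribute cap {y,n}
@attribute comma_after {y,n}
@attribute dist_from_sbeg numeric
@attribute end_of_sentence {y,n}
@data
"""

with open("sentence_end.arff", "w") as f:
    f.write(header)
    for word, pos, cap, comma, dist, end in rows:
        f.write(f"'{word}',{pos},{cap},{comma},{dist},{end}\n")
```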

