Extraction as Classification


Extraction as Classification

What is "Information Extraction"?
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT -- For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels -- the coveted code behind the Windows operating system -- to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

IE output (feeding QA for the end user):
NAME               TITLE    ORGANIZATION
Bill Gates         CEO      Microsoft
Bill Veghte        VP       Microsoft
Richard Stallman   founder  Free Software Foundation

What is "Information Extraction"?
As a family of techniques: Information Extraction = segmentation + classification + association + clustering.

October 14, 2002, 4:00 a.m. PT -- For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels -- the coveted code behind the Windows operating system -- to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

Entity mentions found by segmentation + classification (aka Named Entity Recognition): Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation.

Landscape of IE Tasks (1/4): Degree of Formatting
– Text paragraphs without formatting, e.g.: "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."
– Grammatical sentences and some formatting & links
– Non-grammatical snippets, rich formatting & links
– Tables

Landscape of IE Tasks (3/4): Complexity of the Extraction Task
– Closed set (e.g. U.S. states): "He was born in Alabama…", "The big Wyoming sky…"
– Regular set (e.g. U.S. phone numbers): "Phone: (413)", "The CALD main office can be reached at"
– Complex pattern (e.g. U.S. postal addresses): "University of Arkansas, P.O. Box 140, Hope, AR", "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio"
– Ambiguous patterns, needing context and many sources of evidence, e.g. word patterns for person names: "…was among the six houses sold by Hope Feldman that year.", "Pawel Opalinski, Software Engineer at WhizBang Labs."

Landscape of IE Tasks (4/4): Single Field vs. Record
Example text: "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."
– Single entity ("named entity" extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
– Binary relationship: Relation: Person-Title (Person: Jack Welch, Title: CEO); Relation: Company-Location (Company: General Electric, Location: Connecticut)
– N-ary record: Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt)

Models for NER (example sentence: "Abraham Lincoln was born in Kentucky.")
– Lexicons: is the candidate phrase a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
– Classify pre-segmented candidates: a classifier decides "which class?" for each candidate
– Sliding window: classify each window of tokens, trying alternate window sizes
– Boundary models: classifiers predict BEGIN and END positions
– Token tagging: find the most likely state sequence over the tokens; this is often treated as a structured prediction problem, classifying tokens sequentially (HMMs, CRFs, …)
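The sliding-window model above can be sketched in a few lines. This is a minimal illustration, not any system from the lecture: the toy score_window function (a two-name dictionary lookup) stands in for a learned classifier, and all names in it are hypothetical.

```python
# Sliding-window NER sketch: slide windows of several sizes over the
# token sequence and ask a classifier whether each window is an entity.
# score_window is a toy dictionary lookup standing in for a learned
# classifier (assumption).

FIRST_NAMES = {"abraham"}
LAST_NAMES = {"lincoln"}

def score_window(tokens):
    """Toy stand-in for a learned classifier: score a candidate window."""
    if len(tokens) == 2 and tokens[0].lower() in FIRST_NAMES \
            and tokens[1].lower() in LAST_NAMES:
        return 1.0
    return 0.0

def sliding_window_ner(tokens, max_len=3, threshold=0.5):
    """Try alternate window sizes at every start position."""
    spans = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            window = tokens[start:start + length]
            if len(window) == length and score_window(window) >= threshold:
                spans.append((start, start + length, " ".join(window)))
    return spans

tokens = "Abraham Lincoln was born in Kentucky .".split()
print(sliding_window_ner(tokens))  # [(0, 2, 'Abraham Lincoln')]
```

A real sliding-window extractor would replace score_window with a trained model over window-internal and context features.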

MUC-7
– Last Message Understanding Conference (forerunner to ACE), about 1998
– 200 articles in the development set (newswire text about aircraft accidents)
– 200 articles in the final test (launch events)
– Names of: persons, organizations, locations, dates, times, currency & percentage

MUC-7 tasks:
– TE: Template Elements (attributes)
– TR: Template Relations (binary relations)
– ST: Scenario Template (events)
– CO: coreference
– NE: named entity recognition

MUC-7 NE systems: LTG, Identifinder (HMMs), MENE+Proteus, Manitoba (NB-filtered names), NetOwl (commercial rule-based system).

Borthwick, Sterling, Agichtein, Grishman: the MENE system


Borthwick et al: MENE system
Simple idea: tag every token:
– 4 tags per field, for x = person, organization, …: x_start, x_continue, x_end, x_unique; also "other"
– Learner: maxent/logistic regression, regularized by dropping rare features
– To extract: compute P(y|x_i) for each token x_i, then use Viterbi to find the most likely consistent sequence (continue follows start; end follows continue or start; …)
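The decoding step can be sketched as a small Viterbi over the tag set, with transitions constrained so that continue follows start and end follows start or continue. This is a sketch under assumptions: the per-token probabilities are made-up numbers standing in for the real maxent model's P(y|x_i), and only the person tags (plus "other") are shown.

```python
import math

# Viterbi decoding over MENE-style tags with consistency constraints.
# token_probs is a list of dicts tag -> P(y|x_i) (toy numbers below).

TAGS = ["other", "per_start", "per_continue", "per_end", "per_unique"]

def allowed(prev, cur):
    """Transition constraints: continue/end must extend an open name;
    anything else may only follow a closed position."""
    if cur in ("per_continue", "per_end"):
        return prev in ("per_start", "per_continue")
    return prev not in ("per_start", "per_continue")

def viterbi(token_probs):
    V = [{t: math.log(token_probs[0].get(t, 1e-9)) for t in TAGS}]
    back = []
    for probs in token_probs[1:]:
        row, ptr = {}, {}
        for cur in TAGS:
            cands = [(V[-1][p] + math.log(probs.get(cur, 1e-9)), p)
                     for p in TAGS if allowed(p, cur)]
            row[cur], ptr[cur] = max(cands)
        V.append(row)
        back.append(ptr)
    # the sequence may not end inside an open name
    tag = max((t for t in TAGS if t not in ("per_start", "per_continue")),
              key=lambda t: V[-1][t])
    path = [tag]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# toy P(y|x_i) for the tokens "Bill Gates is CEO" (assumed numbers)
probs = [
    {"per_start": 0.6, "per_unique": 0.3, "other": 0.1},
    {"per_end": 0.7, "other": 0.3},
    {"other": 0.9},
    {"other": 0.8},
]
print(viterbi(probs))  # ['per_start', 'per_end', 'other', 'other']
```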

Viterbi in MENE

Borthwick et al: MENE system
Features g(h,f) for the loglinear model:
– g is a function of the "history" h (features of the token) and the "future" f (predicted class)
– Lexical features combine the identity of a token in a window with the predicted class, e.g. g(h,f) = [token_-1(h) = "Mr." and f = person_unique]
– Section features combine the section name with the class
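A lexical feature of this kind is just a binary indicator on (history, future) pairs. The sketch below is illustrative, not MENE's code; the tuple encoding of the history is an assumption.

```python
# One MENE-style lexical feature: fires when the token at a window
# offset matches a given word AND the predicted class matches.
# The history is encoded as (token list, current position) -- an
# assumption made for this sketch.

def lexical_feature(history, future, offset=-1, token="Mr.",
                    cls="person_unique"):
    """g(h,f) = [token_offset(h) == token and f == cls]"""
    words, pos = history
    i = pos + offset
    in_window = 0 <= i < len(words)
    return 1 if in_window and words[i] == token and future == cls else 0

words = ["Mr.", "Smith", "laughed"]
print(lexical_feature((words, 1), "person_unique"))  # 1: "Mr." precedes
print(lexical_feature((words, 2), "person_unique"))  # 0: "Smith" precedes
```

The loglinear model assigns each such indicator a learned weight and sums the weights of the features that fire.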

Borthwick et al: MENE system
Features g(h,f) for the loglinear model:
– Dictionary features: match each multi-word dictionary d to the text; for token sequences, record d_start, d_cont, d_end, d_uniq
– Combine these values with the category, e.g. g(h,f) = [places_0(h) = "places_uniq" and f = organization_start]

Token        places dictionary   celebrity dictionary
Pittsburgh   places_uniq         celeb_other
coach        places_other        celeb_other
Tomlin       places_other        celeb_uniq
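The dictionary-matching step can be sketched as follows. This is a toy illustration of the idea, not MENE's implementation: the PLACES entries and the helper name are hypothetical, and the tags follow the d_start/d_cont/d_end/d_uniq scheme above.

```python
# Match a multi-word dictionary against the token sequence and record
# per-token values (name_start, name_cont, name_end, name_uniq,
# name_other). The maxent model would then combine these values with
# the predicted class. PLACES is a toy dictionary (assumption).

PLACES = {("new", "york"), ("pittsburgh",)}

def dictionary_tags(tokens, entries, name):
    tags = [name + "_other"] * len(tokens)
    for i in range(len(tokens)):
        for entry in entries:
            n = len(entry)
            if tuple(t.lower() for t in tokens[i:i + n]) == entry:
                if n == 1:
                    tags[i] = name + "_uniq"
                else:
                    tags[i] = name + "_start"
                    for j in range(i + 1, i + n - 1):
                        tags[j] = name + "_cont"
                    tags[i + n - 1] = name + "_end"
    return tags

print(dictionary_tags("Pittsburgh coach Tomlin".split(), PLACES, "places"))
# ['places_uniq', 'places_other', 'places_other']
```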

Dictionaries in MENE

Borthwick et al: MENE system
Features g(h,f) for the loglinear model:
– External system features: run someone else's system s on the text; for token sequences, record sx_start, sx_cont, …
– Combine these values with the category, e.g. g(h,f) = [proteus_0(h) = "places_uniq" and f = organization_start]

MENE results (dry run)

MENE learning curves

Largest U.S. Cable Operator Makes Bid for Walt Disney
By ANDREW ROSS SORKIN
The Comcast Corporation, the largest cable television operator in the United States, made a $54.1 billion unsolicited takeover bid today for The Walt Disney Company, the storied family entertainment colossus. If successful, Comcast's audacious bid would once again reshape the entertainment landscape, creating a new media behemoth that would combine the power of Comcast's powerful distribution channels to some 21 million subscribers in the nation with Disney's vast library of content and production assets. Those include its ABC television network, ESPN and other cable networks, and the Disney and Miramax movie studios.
Note the mix of longer names ("The Comcast Corporation", "The Walt Disney Company") and short names ("Comcast", "Disney").

LTG system
Another MUC-7 competitor.
– Hand-coded rules for "easy" cases (amounts, etc.)
– A process of repeated tagging and "matching" for hard cases:
– Sure-fire (high-precision) rules for names where the type is clear ("Phillip Morris, Inc.", "The Walt Disney Company")
– Partial matches to sure-fire rules are filtered with a maxent classifier (candidate filtering) using contextual information, etc.
– Higher-recall rules, avoiding conflicts with partial-match output ("Phillip Morris announced today…", "Disney's …")
– A final partial-match & filter step on titles, with a different learned filter
– Exploits discourse/context information
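The sure-fire-then-higher-recall idea can be sketched as a two-pass cascade. This is a toy under assumptions: the single regex stands in for LTG's hand-coded rule set, and the learned partial-match filter described above is omitted.

```python
import re

# Two-pass cascade sketch: a high-precision "sure-fire" rule catches
# names whose type is clear ("X, Inc."); a higher-recall second pass
# then accepts bare occurrences of those same names elsewhere in the
# text. The pattern is a toy stand-in for LTG's rules (assumption).

SURE_FIRE_ORG = re.compile(r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*), Inc\.")

def extract_orgs(text):
    # pass 1: sure-fire rules, optimized for precision
    orgs = set(SURE_FIRE_ORG.findall(text))
    # pass 2: higher recall -- bare mentions of names found in pass 1
    mentions = []
    for org in orgs:
        mentions += [m.start() for m in re.finditer(re.escape(org), text)]
    return orgs, sorted(mentions)

text = ("Philip Morris, Inc. reported earnings. "
        "Philip Morris announced the results today.")
orgs, mentions = extract_orgs(text)
print(orgs)          # {'Philip Morris'}
print(len(mentions)) # 2: the sure-fire hit plus the bare mention
```

The design point is that the second pass only fires where it cannot conflict with the high-precision output, which is what lets its rules be loose.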

LTG Results

MUC-7 NE systems: LTG, Identifinder (HMMs), MENE+Proteus, Manitoba (NB-filtered names), NetOwl (commercial rule-based system).

Jansche & Abney paper

"He was a grammarian, and could doubtless see further into the future than others." -- J.R.R. Tolkien, "Farmer Giles of Ham"

echo golf echo x-ray | tr "ladyfinger orthodontics rx-"

Background on JA paper

SCAN: Search & Summarization for Audio Collections (AT&T Labs)

Why IE from personal voicemail?
– A unified interface for email, voicemail, fax, … requires uniform headers (Sender, Time, Subject, …); headers are key for a uniform interface
– Independently, voicemail access is slow: it is useful to have fast access to the important parts of a message (contact number, caller)

Background on JA, cont'd
Quick review of Huang, Zweig & Padmanabhan (IBM Yorktown), "Information Extraction from Voicemail":
– Goal: find the identity and contact number of callers in voicemail (NER + role classification)
– Evaluated three systems on roughly 5,000 labeled, manually transcribed messages:
– Baseline system: 200 hand-coded rules based on "trigger phrases"
– State-of-the-art Ratnaparkhi-style maxent tagger: lexical unigrams, bigrams, dictionary features for names, numbers, and "trigger phrases", plus feature selection
– Poor results: on manually transcribed data, F1 in the 80s for both tasks (good!); on ASR data, F1 about 50% for caller names and 80% for contact numbers, even with a very loose performance metric; the best learning method barely beat the baseline rule-based system

What's interesting in this paper
– How and when do we use ML?
– Robust information extraction: generalizing from manual transcripts (i.e., human-produced written versions of voicemail) to automatic (ASR) transcripts
– The place of hand-coding vs. learning in information extraction: how to break up the task; where and how to use engineering
Pipeline: candidate generator → candidate phrase → learned filter → extracted phrase

Voicemail corpus
– About 10,000 manually transcribed and annotated voice messages used for evaluation
– Not quite the usual NER task: we only want the caller's name

Observation: caller phrases are short and near the beginning of the message.

Caller-phrase extraction
– Propose start positions i_1, …, i_N
– Use a learned decision tree to pick the best i
– Propose end positions i+j_1, i+j_2, …, i+j_M
– Use a learned decision tree to pick the best j
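The two-stage propose-then-pick scheme can be sketched as below. This is a toy under assumptions: the trigger words and the simple scoring rules stand in for the paper's learned decision trees, and it exploits the observation that caller phrases are short and near the start of the message.

```python
# Caller-phrase extraction sketch: propose candidate start positions
# (right after trigger words near the start of the message), pick one,
# then pick an end position. The pick_* rules are hand-written
# stand-ins for the learned decision trees (assumption).

TRIGGERS = {"is", "it's", "from"}

def propose_starts(tokens, max_pos=15):
    # caller phrases are short and near the beginning of the message
    return [i + 1 for i, t in enumerate(tokens[:max_pos])
            if t.lower() in TRIGGERS]

def pick_start(starts):
    # stand-in for the start decision tree: prefer the earliest candidate
    return min(starts) if starts else None

def pick_end(tokens, start, max_len=4):
    # stand-in for the end decision tree: stop at the first
    # non-capitalized token, capped at max_len tokens
    j = start
    while j < len(tokens) and j - start < max_len and tokens[j][0].isupper():
        j += 1
    return j

def extract_caller(tokens):
    start = pick_start(propose_starts(tokens))
    if start is None:
        return None
    return " ".join(tokens[start:pick_end(tokens, start)])

msg = "Hi Jane this is Bill Veghte calling about the meeting".split()
print(extract_caller(msg))  # Bill Veghte
```

In the paper, the two decision trees score the proposed positions using features such as position in the message and surrounding words; the if-statements above only mimic that shape.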

Baseline (HZP; Collins log-linear)
IE as tagging, similar to Borthwick:
– Pr(tag_i | word_i, word_{i-1}, …, word_{i+1}, …, tag_{i-1}, …) estimated via a maxent model
– Beam search to find the best tag sequence given the word sequence (we'll talk more about this next week)
– Features of the model are words, word pairs, word pair + tag trigrams, …

Hi     there  it's   Bill          and          …
other  other  other  caller_start  caller_cont  other

Performance

Observation: caller names are really short and near the beginning of the message.

What about ASR transcripts?

Extracting phone numbers
Phase 1: a hand-coded grammar proposes candidate phone numbers
– Not too hard, due to the limited vocabulary
– Optimized for recall (96%), not precision (30%)
Phase 2: a learned decision tree filters the candidates
– Uses length, position, and a few context features
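The two-phase design can be sketched as a loose pattern followed by a filter. This is illustrative only: a single regex stands in for the hand-coded grammar, and a length check stands in for the learned decision tree; the thresholds are assumptions.

```python
import re

# Phone-number extraction sketch. Phase 1: a deliberately loose,
# high-recall pattern proposes digit-string candidates. Phase 2: a
# filter (stand-in for the learned decision tree) rejects most of
# them; a real filter would also use position and context features.

CANDIDATE = re.compile(r"(?:\d[\s-]?){7,11}")  # loose: optimize recall

def propose_numbers(text):
    return [(m.start(), m.group().strip()) for m in CANDIDATE.finditer(text)]

def filter_numbers(candidates):
    accepted = []
    for pos, cand in candidates:
        digits = re.sub(r"\D", "", cand)
        # stand-in for the decision tree: 7- or 10-digit strings only
        if len(digits) in (7, 10):
            accepted.append(cand)
    return accepted

text = "call me back at 412 268 7000 it's about order 12345"
print(filter_numbers(propose_numbers(text)))  # ['412 268 7000']
```

The split mirrors the numbers quoted above: phase 1 alone has high recall but low precision, and the filter is what restores precision.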

Results

Their Conclusions