Natural Language Processing Information Extraction Jim Martin (slightly modified by Jason Baldridge)

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Advertisements

Progress update Lin Ziheng. System overview 2 Components – Connective classifier Features from Pitler and Nenkova (2009): – Connective: because – Self.
Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.
Sequence Classification: Chunking Shallow Processing Techniques for NLP Ling570 November 28, 2011.
1 Relational Learning of Pattern-Match Rules for Information Extraction Presentation by Tim Chartrand of A paper bypaper Mary Elaine Califf and Raymond.
Chunk Parsing CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)
Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods William W. Cohen, Sunita Sarawagi.
Shallow Parsing CS 4705 Julia Hirschberg 1. Shallow or Partial Parsing Sometimes we don’t need a complete parse tree –Information extraction –Question.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 21 11/8/2011.
Sequence Classification: Chunking & NER Shallow Processing Techniques for NLP Ling570 November 23, 2011.
Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011.
CS4705.  Idea: ‘extract’ or tag particular types of information from arbitrary text or transcribed speech.
Stemming, tagging and chunking Text analysis short of parsing.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.
1 I256: Applied Natural Language Processing Marti Hearst Sept 25, 2006.
Part of speech (POS) tagging
Learning syntactic patterns for automatic hypernym discovery Rion Snow, Daniel Jurafsky and Andrew Y. Ng Prepared by Ang Sun
1 Natural Language Processing for the Web Prof. Kathleen McKeown 722 CEPSR, Office Hours: Wed, 1-2; Tues 4-5 TA: Yves Petinot 719 CEPSR,
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Winter 2003/4Pls – syntax – Catriel Beeri1 SYNTAX Syntax: form, structure The syntax of a pl: The set of its well-formed programs The rules that define.
Context Free Grammars Reading: Chap 12-13, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Paul Tarau, based on Rada.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Probabilistic Parsing Reading: Chap 14, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Paul Tarau, based on Rada.
ENTITY EXTRACTION: RULE-BASED METHODS “I’m booked on the train leaving from Paris at 6 hours 31” Rule: Location (Token string = “from”) ({DictionaryLookup=location}):loc.
Ling 570 Day 17: Named Entity Recognition Chunking.
Named Entity Tagging Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, Tom Mitchell for slides.
©2003 Paula Matuszek CSC 9010: Information Extraction Dr. Paula Matuszek (610) Fall, 2003.
Lecture 13 Information Extraction Topics Name Entity Recognition Relation detection Temporal and Event Processing Template Filling Readings: Chapter 22.
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language.
Context Free Grammars Reading: Chap 9, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Rada Mihalcea.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language.
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
©2012 Paula Matuszek CSC 9010: Information Extraction Overview Dr. Paula Matuszek (610) Spring, 2012.
Relation Extraction (RE) via Supervised Classification See: Jurafsky & Martin SLP book, Chapter 22 Exploring Various Knowledge in Relation Extraction.
Natural Language Processing Vasile Rus
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
CSC 594 Topics in AI – Natural Language Processing
CSC 594 Topics in AI – Natural Language Processing
Statistical NLP: Lecture 3
CRF &SVM in Medication Extraction
Relation Extraction CSCI-GA.2591
Named Entity Tagging Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, Tom Mitchell for slides.
Business Process Measures
Robust Semantics, Information Extraction, and Information Retrieval
Chunking—Practical Exercise
CSCI 5832 Natural Language Processing
CSCE 590 Web Scraping – Information Retrieval
CSCI 5832 Natural Language Processing
CSCI 5832 Natural Language Processing
Social Knowledge Mining
LING 388: Computers and Language
CSCI 5832 Natural Language Processing
CSCI 5832 Natural Language Processing
CSCI 5832 Natural Language Processing
Natural Language - General
Introduction Task: extracting relational facts from text
Lecture 13 Information Extraction
4. Computational Problem Solving
CSCI 5832 Natural Language Processing
Chunk Parsing CS1573: AI Application Development, Spring 2003
CSCI 5832 Natural Language Processing
Named Entity Tagging Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, Tom Mitchell for slides.
CS246: Information Retrieval
CSCI 5832 Natural Language Processing
PolyAnalyst Web Report Training
Rachit Saluja 03/20/2019 Relation Extraction with Matrix Factorization and Universal Schemas Sebastian Riedel, Limin Yao, Andrew.
Extracting Why Text Segment from Web Based on Grammar-gram
Presentation transcript:

Natural Language Processing Information Extraction Jim Martin (slightly modified by Jason Baldridge)

9/26/2016 Speech and Language Processing - Jurafsky and Martin 2 Information Extraction  Turns out that these partial/parsing and chunking methods for syntax, given us the means to address shallow semantic problems as well  Figure out the entities (the players, props, instruments, locations, etc.) in a text  Figure out how they’re related  Figure out what kind of events/activities they’re all up to  And do each of those tasks in a loosely-coupled data-driven manner

9/26/2016 Speech and Language Processing - Jurafsky and Martin 3 Information Extraction  Ordinary newswire text is often used in typical examples.  And there’s an argument that there are useful applications there  But the real interest/money is in specialized domains  Bioinformatics  Electronic medical records  Stock market analysis  Intelligence analysis  Social media

9/26/2016 Speech and Language Processing - Jurafsky and Martin 4 Example App

9/26/2016 Speech and Language Processing - Jurafsky and Martin 5 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

9/26/2016 Speech and Language Processing - Jurafsky and Martin 6 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

9/26/2016 Speech and Language Processing - Jurafsky and Martin 7 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

9/26/2016 Speech and Language Processing - Jurafsky and Martin 8 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

9/26/2016 Speech and Language Processing - Jurafsky and Martin 9 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

9/26/2016 Speech and Language Processing - Jurafsky and Martin 10 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

9/26/2016 Speech and Language Processing - Jurafsky and Martin 11 NER  Find and classify all the named entities in a text.  What’s a named entity?  A reference to an entity via the mention of its name.  Colorado Rockies  This is a subset of the possible mentions...  Rockies, the team, it, they...  Find means identify the exact span of the mention.  Classify means determine the category of the entity being referred to.

9/26/2016 Speech and Language Processing - Jurafsky and Martin 12 NER Approaches  As with partial parsing and chunking there are two basic approaches (and hybrids)  Rule-based (regular expressions)  Lists of names  Patterns to match things that look like names  Patterns to match the environments that classes of names tend to occur in.  ML-based approaches  Get annotated training data  Extract features  Train systems to replicate the annotation

9/26/2016 Speech and Language Processing - Jurafsky and Martin 13 ML Approach

9/26/2016 Speech and Language Processing - Jurafsky and Martin 14 Encoding for Sequence Labeling  We can use the same IOB encoding here that we used for chunking:  For N classes we have 2*N+1 tags  An I and B for each class and a O for outside any class.  Each token in a text gets a tag.

9/26/2016 Speech and Language Processing - Jurafsky and Martin 15 NER Features

9/26/2016 Speech and Language Processing - Jurafsky and Martin 16 NER as Sequence Labeling

9/26/2016 Speech and Language Processing - Jurafsky and Martin 17 NER Evaluation  Suppose you employ this scheme. What’s the best way to measure performance.  Probably not the per-tag accuracy we used for POS tagging.  Why? It’s not measuring what we care about We need a metric that looks at the chunks not the tags

9/26/2016 Speech and Language Processing - Jurafsky and Martin 18 Example  Suppose we were looking for Location mentions.  If the system simply said O all the time it would do pretty well on a per-label basis since most words reside outside any location.

9/26/2016 Speech and Language Processing - Jurafsky and Martin 19 Precision/Recall/F  Precision: The fraction of entities the system returned that were right  “Right” means the boundaries and the label are correct given some labeled test set.  Recall: The fraction of entities that system got from those that it should have gotten.  F: Simple harmonic mean of those two numbers.

9/26/2016 Speech and Language Processing - Jurafsky and Martin 20 Entity-level evaluation  We need to evaluation P/R/F at the entity level.  But we may not care equally about all kinds of entities  So we might weight them differently in the evaluation routine.  Or consider the P/R/F for each one separately

9/26/2016 Speech and Language Processing - Jurafsky and Martin 21 Relations  Once you have captured the entities in a text you might want to ascertain how they relate to one another.  Here we’re just talking about explicitly stated relations

9/26/2016 Speech and Language Processing - Jurafsky and Martin 22 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

9/26/2016 Speech and Language Processing - Jurafsky and Martin 23 Relation Types  As with named entities, the list of relations is application specific. For generic news texts...

9/26/2016 Speech and Language Processing - Jurafsky and Martin 24 Relations  By relation we really mean sets of tuples.  Think about populating a database.

9/26/2016 Speech and Language Processing - Jurafsky and Martin 25 Relation Analysis  We can divide relation analysis into two parts  Determining if 2 entities are related  And if they are, classifying the relation  There are 2 reasons to do this  Cutting down on training time for classification by eliminating most pairs  Producing separate feature-sets that are appropriate for each task.

9/26/2016 Speech and Language Processing - Jurafsky and Martin 26 Relation Analysis  Let’s just worry about named entities within the same sentence

9/26/2016 Speech and Language Processing - Jurafsky and Martin 27 Features  We can group the features (for both tasks) into three categories  Features of the named entities involved  Features derived from the words between and around the named entities  Features derived from the syntactic environment that governs the two entities

9/26/2016 Speech and Language Processing - Jurafsky and Martin 28 Features  Features of the entities  Their types  Concatenation of the types  Headwords of the entities  George Washington Bridge  Words in the entities  Features between and around  Particular positions to the left and right of the entities  +/- 1, 2, 3  Bag of words between

9/26/2016 Speech and Language Processing - Jurafsky and Martin 29 Features  Syntactic environment  Constituent path through the tree from one to the other  Base syntactic chunk sequence from one to the other  Dependency path

9/26/2016 Speech and Language Processing - Jurafsky and Martin 30 Example  For the following example, we’re interested in the possible relation between American Airlines and Tim Wagner.  American Airlines, a unit AMR, immediately matched the move, spokesman Tim Wagner said.