Information Extraction

Extract meaningful information from text, without fully understanding everything!

Basic idea:
- Define domain-specific templates
- Simple and reliable linguistic processing
- Recognize known types of entities and relations
- Fill templates with the recognized information
Example

"4 Apr., Dallas - Early last evening, a tornado swept through northwest Dallas. The twister occurred without warning at about 7:15 pm and destroyed two mobile homes. The Texaco station at 102 Main St. was also severely damaged, but no injuries were reported."

Extracted template:
  Event:    tornado
  Date:     4/3/97
  Time:     19:15
  Location: "northwest Dallas" : Texas : USA
  Damage:   "mobile homes" (2), "Texaco station" (1)
  Injuries: none
Pipeline Overview

Tokenization & Tagging -> Sentence Analysis -> Pattern Extraction -> Merging -> Template Generation

Raw text:
  4 Apr., Dallas - Early last evening, a tornado swept through northwest ...
Tokenization & Tagging:
  Early/ADV last/ADJ evening/NN:time ,/, a/DT tornado/NN:weather swept/VBD ...
Sentence Analysis:
  Early last evening: adv-phrase:time; a tornado: noun-group:subject; swept: verb-group; ...
Pattern Extraction:
  tornado swept -> Event: tornado; through northwest Dallas -> Loc: "northwest Dallas"; causing extensive damage -> Damage
Template Generation:
  Event: tornado; Date: 4/3/97; Time: 19:15; Location: "northwest Dallas" : Texas : USA; ...
MUC: Message Understanding Conference

A "competitive" conference with predefined tasks for research groups to address.

Tasks (MUC-7):
- Named Entities: extract typed entities from text
- Equivalence Classes: solve coreference
- Attributes: fill in attributes of entities
- Facts: extract logical relations between entities
- Events: extract descriptions of events from text
Tokenization & Tagging
- Tokenization & POS tagging
- Also lexical semantic information, such as "time", "location", "weather", "person", etc.

Sentence Analysis
- Shallow parsing for phrase types
- Use tagging & semantics to tag phrases
- Note phrase heads
Pattern Extraction

Find domain-specific relations between text units, typically using lexical triggers and relation-specific patterns to recognize relations (a sketch follows below):

  Concept:     Damaged-Object
  Trigger:     destroyed
  Position:    direct-object
  Constraints: physical-thing

  "... and [destroyed] [two mobile homes]"
  => Damaged-Object = "two mobile homes"
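A minimal, runnable sketch of this trigger/position/constraints matching, assuming sentences have already been shallow-parsed into a role -> (text, semantic class) mapping. The dict encoding and the PATTERNS table are illustrative, not from any specific MUC system.

```python
# One extraction pattern, mirroring the slide's fields.
PATTERNS = [
    {"concept": "Damaged-Object",
     "trigger": "destroyed",             # lexical trigger (the verb)
     "position": "direct-object",        # syntactic role of the filler
     "constraints": {"physical-thing"}}, # allowed semantic classes
]

def extract(clause):
    """clause: {'verb': str, 'direct-object': (text, semclass), ...}"""
    found = []
    for p in PATTERNS:
        if clause.get("verb") == p["trigger"]:
            filler = clause.get(p["position"])
            if filler and filler[1] in p["constraints"]:
                found.append((p["concept"], filler[0]))
    return found

clause = {"verb": "destroyed",
          "direct-object": ("two mobile homes", "physical-thing")}
print(extract(clause))  # [('Damaged-Object', 'two mobile homes')]
```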
Learning Extraction Patterns

It is very difficult to predefine extraction patterns, and the work must be redone for each new domain. Hence, corpus-based approaches are indicated.

Some methods:
- AutoSlog (1992): "syntactic" learning
- PALKA (1995): "conceptual" learning
- CRYSTAL (1995): covering algorithm
AutoSlog (Lehnert 1992)

Patterns based on recognizing "concepts":
- Concept: what concept to recognize
- Trigger: a word indicating an occurrence
- Position: what syntactic role the concept will take in the sentence
- Constraints: what type of entity to allow
- Enabling conditions: constraints on the linguistic context
Example:
  Concept:             Event-Time
  Trigger:             "at"
  Position:            prep-phrase-object
  Constraints:         time
  Enabling conditions: post-verb

  "The twister occurred without warning [at] [about 7:15 pm] and destroyed two mobile homes."
  => Event-Time = 19:15
Learning Patterns

- Supervised: training data is text annotated with the patterns to be extracted from it
- Knowledge: 13 general syntactic patterns

Algorithm:
1. Find a sentence containing the target noun phrase ("two mobile homes")
2. Partially parse the sentence to find syntactic relations
3. Try all linguistic patterns to find a match
4. Generate a concept pattern from the match
Linguistic Patterns

Identify domain-specific thematic roles based on syntactic structure, e.g., active-voice-verb followed by target as direct object (a sketch follows below):
- Concept = target concept
- Trigger = verb of active-voice-verb
- Position = direct-object
- Constraints = semantic class of target
- Enabling conditions = active-voice
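A hedged sketch of how this one linguistic pattern could generate a concept pattern from a matched clause. The clause representation, semantic-class argument, and function name are assumptions for illustration, not AutoSlog's actual data structures.

```python
# One of the 13 linguistic patterns: active-voice verb whose direct
# object is the target noun phrase.

def generate_concept(clause, target_np, target_class, concept_name):
    """clause: {'verb': str, 'voice': 'active'|'passive', 'direct-object': str}"""
    if clause["voice"] == "active" and clause["direct-object"] == target_np:
        return {"concept": concept_name,
                "trigger": clause["verb"],       # e.g. "destroyed"
                "position": "direct-object",
                "constraints": {target_class},   # e.g. "physical-thing"
                "enabling": "active-voice"}
    return None  # this linguistic pattern did not match; try the next one

clause = {"verb": "destroyed", "voice": "active",
          "direct-object": "two mobile homes"}
print(generate_concept(clause, "two mobile homes", "physical-thing",
                       "Damaged-Object"))
```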
More Examples

- <victim> was murdered
- <perpetrator> bombed
- <perpetrator> attempted to kill
- was aimed at <target>

Some bad extraction patterns occur (e.g., "is" as a trigger), so a human review process is needed.
CRYSTAL

Handles complex syntactic patterns using a "covering" algorithm (a toy sketch follows below):
- Generate the most specific possible pattern for each occurrence of a target in the corpus
- Loop:
  - Find the most specific unifier of the most similar patterns C & C', generating a new pattern P
  - If P has less than ε error on the corpus, replace C and C' with P
- Continue until no new patterns can be added
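A toy, runnable caricature of the covering loop. Real CRYSTAL patterns are rich syntactic structures and the algorithm unifies the most similar pair first; here patterns are flat slot -> value dicts, unification keeps only the shared slots, and any pair whose generalization stays under ε error is merged.

```python
def matches(pattern, instance):
    return all(instance.get(k) == v for k, v in pattern.items())

def error_rate(pattern, corpus):
    # fraction of instances covered by the pattern that are negatives
    hits = [label for inst, label in corpus if matches(pattern, inst)]
    return 0.0 if not hits else hits.count(False) / len(hits)

def unify(p1, p2):
    # most specific generalization: keep the slots the patterns agree on
    return {k: v for k, v in p1.items() if p2.get(k) == v}

def crystal(corpus, epsilon=0.1):
    # start with the most specific pattern for every positive instance
    patterns = [dict(inst) for inst, label in corpus if label]
    merged = True
    while merged:
        merged = False
        for i in range(len(patterns)):
            for j in range(i + 1, len(patterns)):
                p = unify(patterns[i], patterns[j])
                if p and error_rate(p, corpus) < epsilon:
                    patterns = [q for k, q in enumerate(patterns)
                                if k not in (i, j)] + [p]
                    merged = True
                    break
            if merged:
                break
    return patterns

corpus = [({"verb": "destroyed", "obj-class": "physical-thing"}, True),
          ({"verb": "damaged",   "obj-class": "physical-thing"}, True),
          ({"verb": "reported",  "obj-class": "abstract"},       False)]
print(crystal(corpus))  # -> [{'obj-class': 'physical-thing'}]
```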
Merging

"Motor Vehicles International Corp. announced a major management shake-up ... MVI said the CEO has resigned ... The Big 10 auto maker is attempting to regain market share ... It will announce losses ... A company spokesman said they are moving their operations ... MVI, the first company to announce such a move since the passage of the new international trade agreement, is facing increasing demands from unionized workers ..."
Coreference Resolution

Many different kinds of linguistic phenomena are involved:
- Proper names
- Aliases (MVI)
- Definite NPs (the Big 10 auto maker)
- Pronouns (it, they)
- Appositives (", the first company to ...")

Errors from previous phases may be amplified here.
Learning to Merge

Treat coreference as a classification task: should this pair of entities be linked?

Methodology (a sketch of the pairing step follows below):
- Training corpus: manually link all coreferential expressions
- Each possible pair is a training example: positive if the pair is linked, negative if not
- Create a feature vector for each example
- Use your favorite learning algorithm
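A small sketch of the pairing step, assuming mentions come annotated with gold coreference chains. The (a, b) pair stands in for a real feature vector; the kinds of features actually used are listed on the next slide.

```python
from itertools import combinations

def pair_examples(mentions, chains):
    """mentions: ordered mention ids; chains: list of sets of coreferent ids."""
    chain_of = {m: i for i, chain in enumerate(chains) for m in chain}
    examples = []
    for a, b in combinations(mentions, 2):
        ca, cb = chain_of.get(a), chain_of.get(b)
        label = ca is not None and ca == cb   # positive iff same chain
        examples.append(((a, b), label))      # replace (a, b) with features
    return examples

mentions = ["MVI", "the Big 10 auto maker", "it", "a spokesman"]
chains = [{"MVI", "the Big 10 auto maker", "it"}]
for pair, label in pair_examples(mentions, chains):
    print(pair, label)
```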
MLR (1995)

66 features were used, in 4 categories:
- Lexical features of each phrase (e.g., do they overlap?)
- Grammatical role of each phrase (e.g., subject, direct-object)
- Semantic classes of each phrase (e.g., physical-thing, company)
- Relative positions of the phrases (e.g., X one sentence after Y)

Learning algorithm: decision trees (C4.5)
C4.5

- Incrementally build a decision tree from labeled training examples
- At each stage, choose the "best" attribute to split the dataset on, e.g., using info-gain to compare features (a sketch follows after this list)
- After building the complete tree, prune the leaves to prevent overfitting
- Use statistical tests to determine whether enough examples fall in each leaf bin; if not, prune!
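A minimal sketch of the info-gain criterion used to pick the split attribute (C4.5 proper uses gain ratio, a normalized variant); the toy dataset is illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """examples: list of (features_dict, label) pairs."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

data = [({"same-sentence": True},  "coref"),
        ({"same-sentence": True},  "coref"),
        ({"same-sentence": False}, "not"),
        ({"same-sentence": False}, "coref")]
print(info_gain(data, "same-sentence"))  # ~0.31 bits
```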
[Figure: example C4.5 decision tree. The root test f1 splits 40 training examples into 25 and 15; further tests f2 and f3 split these into leaves labeled C1, C2, C2, C1.]
RESOLVE (1995)

C4.5 with 8 complex features:
- NAME-{1,2}: does the reference include a name?
- JV-CHILD-{1,2}: does the reference refer to part of a joint venture?
- ALIAS: does one reference contain an alias for the other?
- BOTH-JV-CHILD: do both refer to part of a joint venture?
- COMMON-NP: do both contain a common NP?
- SAME-SENTENCE: are both in the same sentence?
[Figure: decision tree learned by RESOLVE]
RESOLVE Results

50 texts, leave-one-out cross-validation:

  System     Recall   Precision
  Unpruned   85.4%    87.6%
  Pruned     80.1%    92.4%
  Manual     67.7%    94.4%
Full System: FASTUS (1996)

Input Text -> Pattern Recognition -> Coreference Resolution -> Partial Templates -> Template Merger -> Output Template
Pattern Recognition

Multiple passes of finite-state methods (a toy sketch follows below):

  "John Smith, 47, was named president of ABC Corp."
  Word level:   Pers-Name  Num  Aux  V  N  P  Org-Name
  Phrase level: Poss-N-Group    V-Group
  Event level:  Domain-Event
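A toy sketch of cascaded finite-state passes, encoded here as regular expressions over a string of token tags. The tag and group names loosely follow the slide; FASTUS's actual transducers are far richer than this.

```python
import re

# One tag per token of: "John Smith, 47, was named president of ABC Corp."
tags = "Pers-Name Num Aux V N P Org-Name"

# Pass 1: group basic phrases.
tags = re.sub(r"Pers-Name Num", "N-Group", tags)  # "John Smith, 47"
tags = re.sub(r"Aux V", "V-Group", tags)          # "was named"

# Pass 2: recognize a domain event over the grouped phrases.
if re.fullmatch(r"N-Group V-Group N P Org-Name", tags):
    print("Domain-Event: person named to position at organization")
```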
Partially-Instantiated Templates

  Person: _______        Person: John Smith
  Pos:    President      Start:  _______
  Org:    ABC Corp.      End:    _______

Domain-dependent!!
The Next Sentence...

"He replaces Mike Jones."

Coreference analysis: He = John Smith

  Person: Mike Jones     Person: John Smith
  Pos:    _______        Start:  _______
  Org:    _______        End:    _______
Unification

Unify the new template with preceding template(s), if possible (a sketch follows below):

  Person: Mike Jones     Person: John Smith
  Pos:    President      Start:  _______
  Org:    ABC Corp.      End:    _______
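A minimal sketch of slot-level template unification, under the convention (an assumption here) that None marks an empty slot: unification fills empty slots and fails on conflicting fillers.

```python
def unify_templates(t1, t2):
    merged = dict(t1)
    for slot, value in t2.items():
        if value is None:
            continue                    # nothing to add for this slot
        if merged.get(slot) in (None, value):
            merged[slot] = value        # fill an empty slot, or agree
        else:
            return None                 # conflicting fillers: cannot unify
    return merged

prev = {"Person": None, "Pos": "President", "Org": "ABC Corp."}
new  = {"Person": "Mike Jones", "Pos": None, "Org": None}
print(unify_templates(prev, new))
# {'Person': 'Mike Jones', 'Pos': 'President', 'Org': 'ABC Corp.'}
```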
Principle of Least Commitment

Idea: maintain options as long as possible. E.g., in parsing, maintain a lattice structure:

  The/DT  committee/NN1  heads/NN2|VBZ  announced/VBD  that/CSub ...
  N-GRP: "The committee heads"
  Event: Announce, Actor: committee heads
Principle of Least Commitment (cont.)

The same prefix, with a different continuation (a lattice sketch follows below):

  The/DT  committee/NN1  heads/NN2|VBZ  ABC's/NNpos  recruitment/NN  effort/NN
  N-GRP: "The committee"    N-GRP: "ABC's recruitment effort"
  Event -> Head: committee, Effort: ABC's recruitment
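A small sketch of a least-commitment token lattice: the ambiguous token keeps both tags, and a later pass (here a crude lookahead, standing in for the domain-pattern phase where real disambiguation happens) commits to one reading.

```python
# Each token carries the set of tags still considered possible.
lattice = [("The", {"DT"}), ("committee", {"NN"}),
           ("heads", {"NN", "VBZ"}),              # noun or verb: keep both
           ("announced", {"VBD"}), ("that", {"CSub"})]

def disambiguate(lattice):
    out = []
    for i, (word, tags) in enumerate(lattice):
        nxt = lattice[i + 1][1] if i + 1 < len(lattice) else set()
        if tags == {"NN", "VBZ"}:
            # if a finite verb follows, "heads" closes the noun group;
            # otherwise "heads" itself must be the verb
            tags = {"NN"} if nxt & {"VBD", "VBZ"} else {"VBZ"}
        out.append((word, tags))
    return out

print(disambiguate(lattice))  # 'heads' resolves to {'NN'} here
```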
More Least Commitment

Maintain multiple coreference hypotheses:
- Disambiguate only when creating domain-events, when more information is available
- Too many possibilities? Use a beam search algorithm: maintain the k "best" hypotheses at every stage (a sketch follows below)
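A sketch of beam search over coreference hypotheses, where a hypothesis is a list of (mention, antecedent) links with a running score; the scoring function below is a toy stub standing in for a learned model.

```python
def beam_search(mentions, score, k=3):
    beam = [([], 0.0)]                          # (links so far, total score)
    for m in mentions:
        extended = []
        for links, s in beam:
            # candidate antecedents: all earlier mentions, or None (new entity)
            antecedents = [prev for prev, _ in links] + [None]
            for ant in antecedents:
                extended.append((links + [(m, ant)], s + score(m, ant)))
        # keep only the k best hypotheses at every stage
        beam = sorted(extended, key=lambda h: h[1], reverse=True)[:k]
    return beam[0][0]                           # links of the best hypothesis

def score(mention, antecedent):                 # toy stub, arbitrary weights
    return 0.8 if antecedent is not None else 0.5

print(beam_search(["MVI", "the Big 10 auto maker", "it"], score))
```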