Information Extraction

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Advertisements

1 I256: Applied Natural Language Processing Marti Hearst Nov 15, 2006.
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Search Engines and Information Retrieval
Information Extraction CS 652 Information Extraction and Integration.
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan.
CS4705.  Idea: ‘extract’ or tag particular types of information from arbitrary text or transcribed speech.
Basi di dati distribuite Prof. M.T. PAZIENZA a.a
Introduction to Information Extraction Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan
Using Information Extraction for Question Answering Done by Rani Qumsiyeh.
Information Extraction from the World Wide Web CSE 454 Based on Slides by William W. Cohen Carnegie Mellon University Andrew McCallum University of Massachusetts.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst October 11, 2004.
I256 Applied Natural Language Processing Fall 2009 Lecture 13 Information Extraction (1) Barbara Rosario.
Information Extraction from Biomedical Text Jerry R. Hobbs Artificial Intelligence Center SRI International.
Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dörre, Peter Gerstl, and Roland Seiffert Presented By: Jake Happs,
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/2010 Overview of NLP tasks (text pre-processing)
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
Introduction to Text Mining
March 1, 2009 Dr. Muhammed Al-Mulhem 1 ICS 482 Natural Language Processing INTRODUCTION Muhammed Al-Mulhem March 1, 2009.
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
School of Engineering and Computer Science Victoria University of Wellington COMP423 Intelligent agents.
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
Information Extraction Junichi Tsujii Graduate School of Science University of Tokyo Japan Ronen Feldman Bar Ilan University Israel.
AQUAINT Kickoff Meeting – December 2001 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
December 2005CSA3180: Information Extraction I1 CSA3180: Natural Language Processing Information Extraction 1 – Introduction Information Extraction Named.
Information Extraction Yunyao Li EECS /SI /29/2006.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Search Engines and Information Retrieval Chapter 1.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
Lecture 12: 22/6/1435 Natural language processing Lecturer/ Kawther Abas 363CS – Artificial Intelligence.
Authors: Ting Wang, Yaoyong Li, Kalina Bontcheva, Hamish Cunningham, Ji Wang Presented by: Khalifeh Al-Jadda Automatic Extraction of Hierarchical Relations.
December 2005CSA3180: Information Extraction I1 CSA2050: Natural Language Processing Information Extraction Named Entities IE Systems MUC Finite State.
CSC 594 Topics in AI – Text Mining and Analytics
Using Text Mining and Natural Language Processing for Health Care Claims Processing Cihan ÜNAL
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
NLP And The Semantic Web Dainis Kiusals COMS E6125 Spring 2010.
Finite State Parsing & Information Extraction CMSC Intro to NLP January 10, 2006.
Some Work on Information Extraction at IRL Ganesh Ramakrishnan IBM India Research Lab.
10. Parsing with Context-free Grammars -Speech and Language Processing- 발표자 : 정영임 발표일 :
©2003 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek (610)
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
India Research Lab © Copyright IBM Corporation 2006 Entity Annotation using operations on the Inverted Index Ganesh Ramakrishnan, with Sreeram Balakrishnan.
Artificial Intelligence Research Center Pereslavl-Zalessky, Russia Program Systems Institute, RAS.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Trevor Crum 04/23/2014 *Slides modified from Shamil Mustafayev’s 2013 presentation * 1.
Data Mining: Text Mining
Classifying Semantic Relations in Bioscience Texts Barbara Rosario Marti Hearst SIMS, UC Berkeley Supported by NSF DBI
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
CS 4705 Lecture 17 Semantic Analysis: Robust Semantics.
©2012 Paula Matuszek CSC 9010: Information Extraction Overview Dr. Paula Matuszek (610) Spring, 2012.
1 Question Answering and Logistics. 2 Class Logistics  Comments on proposals will be returned next week and may be available as early as Monday  Look.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
NATURAL LANGUAGE PROCESSING
Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Shamil Mustafayev 04/16/
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Vincent Granville, Ph.D. Co-Founder, DSC
Information Extraction
Introduction to Information Extraction
Social Knowledge Mining
Machine Learning in Natural Language Processing
CSE 635 Multimedia Information Retrieval
Natural Language Processing
CS246: Information Retrieval
Presentation transcript:

Information Extraction Adapted from slides by Junichi Tsujii, Ronen Feldman and others

Most Data are Unstructured (Text) or Semi-Structured… Email Insurance claims News articles Web pages Patent portfolios … Customer complaint letters Contracts Transcripts of phone calls with customers Technical documents … Examples are: Letters from customers, email correspondence, recordings of phone calls with customers, contracts, technical documentation, patents, etc. With ever dropping prices of mass storage, companies collect more and more of such data. But what can we get from this data? That’s where text mining comes in. The goal of text mining is to extract knowledge from this ninety percent unstructured masses of text. Text data mining has become more and more important… (Adapted from J. Dorre et al. “Text Mining: Finding Nuggets in Mountains of Textual Data”)

Application Tasks of NLP (1)Information Retrieval/Detection To search and retrieve documents in response to queries for information (2)Passage Retrieval To search and retrieve part of documents in response to queries for information (3)Information Extraction To extract information that fits pre-defined database schemas or templates, specifying the output formats (4) Question/Answering Tasks To answer general questions by using texts as knowledge base: Fact retrieval, combination of IR and IE (5)Text Understanding To understand texts as people do: Artificial Intelligence

Information Extraction: A Pragmatic Approach Let application requirements drive semantic analysis Identify the types of entities that are relevant to a particular task Identify the range of facts that one is interested in for those entities Ignore everything else

IE definitions Entity: an object of interest such as a person or organization Attribute: A property of an entity such as name, alias, descriptor or type Fact: A relationship held between two or more entities such as Position of Person in Company Event: An activity involving several entities such as terrorist act, airline crash, product information

IE accuracy typical figures by information type Entity: 90-98% Attribute: 80% Fact: 60-70% Event: 50-60%

MUC conferences MUC 1 to MUC 7 1987 to 1997 Topics: Naval operations (2) Terrorist Activity (2) Joint venture and microelectronics Management changes Space Vehicles and Missile launches

MUC and Scenario Templates Define a set of “interesting entities” Persons, organizations, locations… Define a complex scenario involving interesting events and relations over entities Example: management succession: persons, companies, positions, reasons for succession This collection of entities and relations is called a “scenario template.”

Problems with Scenario Template Encouraged development of highly domain specific ontologies, rule systems, heuristics, etc. Most of the effort expended on building a scenario template system was not directly applicable to a different scenario template.

Addressing the Problem Address a large number of smaller, more focused scenario templates (Event-99) Develop a more systematic ground-up approach to semantics by focusing on elementary entities, relations, and events (ACE)

The ACE Evaluation The ACE program – challenge of extracting content from human language. Research effort directed to master first the extraction of “entities” Then the extraction of “relations” among these entities Finally the extraction of “events” that are causally related sets of relations After two years, top systems successfully capture well over 50 % of the value at the entity level

The ACE Program “Automated Content Extraction” Develop core information extraction technology by focusing on extracting specific semantic entities and relations over a very wide range of texts. Corpora: Newswire and broadcast transcripts, but broad range of topics and genres. Third person reports Interviews Editorials Topics: foreign relations, significant events, human interest, sports, weather Discourage highly domain- and genre-dependent solutions

Applications of IE Routing of information Infrastructure for IR and categorization (higher level features) Event based summarization Automatic creation of databases and knowledge bases

Where would IE be useful? Semi-structured text Generic documents like news articles Most of the information in the doc is centered around a set of easily identifiable entities

The Problem Date Time: Start - End Location Speaker Person

What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION Courtesy of William W. Cohen

What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… IE NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. Courtesy of William W. Cohen

What is “Information Extraction” segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Bill Veghte VP Richard Stallman founder Free Software Foundation aka “named entity extraction” Courtesy of William W. Cohen

What is “Information Extraction” segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Bill Veghte VP Richard Stallman founder Free Software Foundation Courtesy of William W. Cohen

What is “Information Extraction” segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Bill Veghte VP Richard Stallman founder Free Software Foundation Courtesy of William W. Cohen

What is “Information Extraction” segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… * Microsoft Corporation CEO Bill Gates Microsoft Gates Bill Veghte VP Richard Stallman founder Free Software Foundation * NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Richard Stallman founder Free Soft.. * * Courtesy of William W. Cohen

Landscape of IE Tasks: Single Field/Record Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. Single entity Binary relationship N-ary record Person: Jack Welch Relation: Person-Title Person: Jack Welch Title: CEO Relation: Succession Company: General Electric Title: CEO Out: Jack Welsh In: Jeffrey Immelt Person: Jeffrey Immelt How many components of the output of your system? Single entity… It is more difficult to get high accuracy on the right than on the left, because each consituent has to be right, and errors compound. 90%^5=60% Relation: Company-Location Company: General Electric Location: Connecticut Location: Connecticut “Named entity” extraction

Landscape of IE Techniques Classify Pre-segmented Candidates Abraham Lincoln was born in Kentucky. Classifier which class? Lexicons Sliding Window Abraham Lincoln was born in Kentucky. Classifier which class? Try alternate window sizes: Abraham Lincoln was born in Kentucky. member? Alabama Alaska … Wisconsin Wyoming Boundary Models Abraham Lincoln was born in Kentucky. Classifier which class? BEGIN END Finite State Machines Abraham Lincoln was born in Kentucky. Most likely state sequence? Context Free Grammars Abraham Lincoln was born in Kentucky. NNP V P NP PP VP S Most likely parse? Courtesy of William W. Cohen

IE with Hidden Markov Models Given a sequence of observations: Yesterday Pedro Domingos spoke this example sentence. and a trained HMM: person name location name background Find the most likely state sequence: (Viterbi) Yesterday Pedro Domingos spoke this example sentence. Any words said to be generated by the designated “person name” state extract as a person name: Person name: Pedro Domingos

HMM for Segmentation Simplest Model: One state per entity type

Discriminative Approaches Yesterday Pedro Domingos spoke this example sentence. Is this phrase (X) a name? Y=1 (yes); Y=0 (no) Learn from many examples to predict Y from X parameters Maximum Entropy, Logistic Regression: Features (e.g., is the phrase capitalized?) More sophisticated: Consider dependency between different labels (e.g. Conditional Random Fields)

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990 TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990

Based on finite states automata (FSA) FASTUS Based on finite states automata (FSA) 1.Complex Words: Recognition of multi-words and proper names set up new Twaiwan dallors 2.Basic Phrases: Simple noun groups, verb groups and particles a Japanese trading house had set up 3.Complex phrases: Complex noun groups and verb groups production of 20, 000 iron and metal wood clubs 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. [company] [set up] [Joint-Venture] with 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990

Information Extraction ………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. Name of the Venture: Yaxing Benz Products: buses and bus chassis Location: Yangzhou,China Companies involved: (1)Name: X? Country: German (2)Name: Y? Country: China

Information Extraction A German vehicle-firm executive was stabbed to death …. ………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. Crime-Type: Murder Type: Stabbing The killed: Name: Jurgen Pfrang Age: 51 Profession: Deputy general manager Location: Nanjing, China Different template for crimes

Interpretation of Texts (1)Information Retrieval/Detection User System (2)Passage Retrieval (3)Information Extraction (4) Question/Answering Tasks (5)Text Understanding

Characterization of Texts Queries IR System Collection of Texts

Characterization of Texts Interpretation Knowledge Characterization of Texts Queries IR System Collection of Texts

Characterization of Texts Interpretation Knowledge Characterization of Texts Passage IR System Queries Collection of Texts

Characterization of Texts Interpretation Knowledge Characterization of Texts Passage IR System IE System Queries Templates Structures of Sentences NLP Collection of Texts Texts

Interpretation Knowledge IE System Templates Texts

IE as compromise NLP Predefined Knowledge IE as compromise NLP Interpretation IE System General Framework of NLP/NLU Templates Texts Predefined

Performance Evaluation (1)Information Retrieval/Detection Rather clear A bit vague Very vague (2)Passage Retrieval (3)Information Extraction (4) Question/Answering Tasks (5)Text Understanding

Collection of Documents Query N N: Correct Documents M:Retrieved Documents C: Correct Documents that are actually retrieved Collection of Documents M C Precision: Recall: C M N F-Value: P R P+R 2P・R

Collection of Documents Query N N: Correct Templates M:Retrieved Templates C: Correct Templates that are actually retrieved Collection of Documents M C Precision: Recall: C M N F-Value: P R P+R 2P・R More complicated due to partially filled templates

Framework of IE IE as compromise NLP

General Framework of NLP Difficulties of NLP (1) Robustness: Incomplete Knowledge General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Predefined Aspects of Information Semantic Analysis Incomplete Domain Knowledge Interpretation Rules Context processing Interpretation

General Framework of NLP Difficulties of NLP (1) Robustness: Incomplete Knowledge General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Predefined Aspects of Information Semantic Analysis Incomplete Domain Knowledge Interpretation Rules Context processing Interpretation

Approaches for building IE systems Knowledge Engineering Approach Rules crafted by linguists in cooperation with domain experts Most of the work done by insoecting a set of relevant documents

Approaches for building IE systems Automatically trainable systems Techniques based on statistics and almost no linguistic knowledge Language independent Main input – annotated corpus Small effort for creating rules, but crating annotated corpus laborious

Techniques in IE (1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted (2) Ambiguities: Ignoring irrelevant ambiguities Simpler NLP techniques (3) Robustness: Coping with Incomplete dictionaries (open class words) Ignoring irrelevant parts of sentences (4) Adaptation Techniques: Machine Learning, Trainable systems

General Framework of NLP 95 % F-Value 90 Part of Speech Tagger FSA rules Statistic taggers General Framework of NLP Local Context Statistical Bias Open class words: Named entity recognition (ex) Locations Persons Companies Organizations Position names Morphological and Lexical Processing Syntactic Analysis Domain Dependent Semantic Anaysis Domain specific rules: <Word><Word>, Inc. Mr. <Cpt-L>. <Word> Machine Learning: HMM, Decision Trees Rules + Machine Learning Context processing Interpretation

General Framework of NLP Based on finite states automata (FSA) FASTUS General Framework of NLP Based on finite states automata (FSA) 1.Complex Words: Recognition of multi-words and proper names Morphological and Lexical Processing 2.Basic Phrases: Simple noun groups, verb groups and particles Syntactic Analysis 3.Complex phrases: Complex noun groups and verb groups 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. Semantic Anaysis Context processing Interpretation 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.

General Framework of NLP Based on finite states automata (FSA) FASTUS General Framework of NLP Based on finite states automata (FSA) 1.Complex Words: Recognition of multi-words and proper names Morphological and Lexical Processing 2.Basic Phrases: Simple noun groups, verb groups and particles Syntactic Analysis 3.Complex phrases: Complex noun groups and verb groups 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. Semantic Anaysis Context processing Interpretation 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.

General Framework of NLP Based on finite states automata (FSA) FASTUS General Framework of NLP Based on finite states automata (FSA) 1.Complex Words: Recognition of multi-words and proper names Morphological and Lexical Processing 2.Basic Phrases: Simple noun groups, verb groups and particles Syntactic Analysis 3.Complex phrases: Complex noun groups and verb groups 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. Semantic Analysis Context processing Interpretation 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.

Computationally more complex, Less Efficiency Chomsky Hierarchy Hierarchy of Grammar of Automata Regular Grammar Finite State Automata Context Free Grammar Push Down Automata Context Sensitive Grammar Linear Bounded Automata Type 0 Grammar Turing Machine Computationally more complex, Less Efficiency

Computationally more complex, Less Efficiency Chomsky Hierarchy Hierarchy of Grammar of Automata Regular Grammar Finite State Automata Context Free Grammar Push Down Automata Context Sensitive Grammar Linear Bounded Automata Type 0 Grammar Turing Machine Computationally more complex, Less Efficiency A B n

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

Pattern-maching PN ’s (ADJ)* N P Art (ADJ)* N {PN ’s/ Art}(ADJ)* N(P Art (ADJ)* N)* 1 ’s PN Art 2 ADJ N ’s Art 3 John’s interesting book with a nice cover P 4 PN

General Framework of NLP Based on finite states automata (FSA) FASTUS General Framework of NLP Based on finite states automata (FSA) 1.Complex Words: Recognition of multi-words and proper names Morphological and Lexical Processing 2.Basic Phrases: Simple noun groups, verb groups and particles Syntactic Analysis 3.Complex phrases: Complex noun groups and verb groups 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. Semantic Analysis Context processing Interpretation 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 “metal wood” clubs a month. 1.Complex words 2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location Attachment Ambiguities are not made explicit

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 “metal wood” clubs a month. {{ }} 1.Complex words 2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location a Japanese tea house a [Japanese tea] house a Japanese [tea house]

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 “metal wood” clubs a month. 1.Complex words 2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location Structural Ambiguities of NP are ignored

Example of IE: FASTUS(1993) Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 “metal wood” clubs a month. 2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location 3.Complex Phrases

Example of IE: FASTUS(1993) [COMPNY] said Friday it [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] and [COMPNY] to produce [PRODUCT] to be supplied to [LOCATION]. [JOINT-VENTURE], [COMPNY], capitalized at 20 million [CURRENCY-UNIT] [START] production in [TIME] with production of 20,000 [PRODUCT] a month. 2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location 3.Complex Phrases Some syntactic structures like …

Example of IE: FASTUS(1993) [COMPNY] said Friday it [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] to produce [PRODUCT] to be supplied to [LOCATION]. [JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME] with production of [PRODUCT] a month. 2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location 3.Complex Phrases Syntactic structures relevant to information to be extracted are dealt with.

Syntactic variations GM set up a joint venture with Toyota. GM announced it was setting up a joint venture with Toyota. GM signed an agreement setting up a joint venture with Toyota. GM announced it was signing an agreement to set up a joint venture with Toyota.

GM set up a joint venture with Toyota. Syntactic variations GM set up a joint venture with Toyota. GM announced it was setting up a joint venture with Toyota. GM signed an agreement setting up a joint venture with Toyota. GM announced it was signing an agreement to set up a joint venture with Toyota. S NP VP V N GM signed agreement setting up [SET-UP] GM plans to set up a joint venture with Toyota. GM expects to set up a joint venture with Toyota.

GM set up a joint venture with Toyota. Syntactic variations GM set up a joint venture with Toyota. GM announced it was setting up a joint venture with Toyota. GM signed an agreement setting up a joint venture with Toyota. GM announced it was signing an agreement to set up a joint venture with Toyota. S [SET-UP] NP VP GM V set up GM plans to set up a joint venture with Toyota. GM expects to set up a joint venture with Toyota.

Example of IE: FASTUS(1993) The attachment positions of PP are determined at this stage. Irrelevant parts of sentences are ignored. [COMPNY] [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] to produce [PRODUCT] to be supplied to [LOCATION]. [JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME] with production of [PRODUCT] a month. 3.Complex Phrases 4.Domain Events [COMPANY][SET-UP][JOINT-VENTURE]with[COMPNY] [COMPANY][SET-UP][JOINT-VENTURE] (others)* with[COMPNY]

Complications caused by syntactic variations Relative clause The mayor, who was kidnapped yesterday, was found dead today. [NG] Relpro {NG/others}* [VG] {NG/others}*[VG] [NG] Relpro {NG/others}* [VG]

Complications caused by syntactic variations Relative clause The mayor, who was kidnapped yesterday, was found dead today. [NG] Relpro {NG/others}* [VG] {NG/others}*[VG] [NG] Relpro {NG/others}* [VG]

Complications caused by syntactic variations Relative clause The mayor, who was kidnapped yesterday, was found dead today. [NG] Relpro {NG/others}* [VG] {NG/others}*[VG] [NG] Relpro {NG/others}* [VG] Basic patterns Surface Pattern Generator Patterns used by Domain Event Relative clause construction Passivization, etc.

Based on finite states automata (FSA) FASTUS Based on finite states automata (FSA) NP, who was kidnapped, was found. 1.Complex Words: 2.Basic Phrases: 3.Complex phrases: 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. Piece-wise recognition of basic templates Reconstructing information carried via syntactic structures by merging basic templates 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA) FASTUS Based on finite states automata (FSA) NP, who was kidnapped, was found. 1.Complex Words: 2.Basic Phrases: 3.Complex phrases: 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. Piece-wise recognition of basic templates 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event. Reconstructing information carried via syntactic structures by merging basic templates

Based on finite states automata (FSA) FASTUS Based on finite states automata (FSA) NP, who was kidnapped, was found. 1.Complex Words: 2.Basic Phrases: 3.Complex phrases: 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. Piece-wise recognition of basic templates 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event. Reconstructing information carried via syntactic structures by merging basic templates

Based on finite states automata (FSA) FASTUS Based on finite states automata (FSA) NP, who was kidnapped, was found. 1.Complex Words: 2.Basic Phrases: 3.Complex phrases: 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. Piece-wise recognition of basic templates 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event. Reconstructing information carried via syntactic structures by merging basic templates

Current state of the arts of IE Carefully constructed IE systems F-60 level (interannotater agreement: 60-80%) Domain: telegraphic messages about naval operation (MUC-1:87, MUC-2:89) news articles and transcriptions of radio broadcasts Latin American terrorism (MUC-3:91, MUC-4:1992) News articles about joint ventures (MUC-5, 93) News articles about management changes (MUC-6, 95) News articles about space vehicle (MUC-7, 97) Handcrafted rules (named entity recognition, domain events, etc) Automatic learning from texts: Supervised learning : corpus preparation Non-supervised, or controlled learning