Empirical Methods in Information Extraction - Claire Cardie. Natural Language Processing Lab, Kyoung-Soo Han. 1999. 11. 2.



Empirical Methods in Information Extraction [Cardie97]

Contents
- Introduction
- The Architecture of an Information Extraction System
- The Role of Corpus-Based Language Learning Algorithms
- Learning Extraction Patterns
- Coreference Resolution and Template Generation
- Future Directions

Introduction (1/2)
- Information extraction (IE) systems
  - are inherently domain specific
  - take as input an unrestricted text and summarize it with respect to a prespecified topic or domain of interest (Figure 1)
  - skim a text to find relevant sections and then focus only on those sections
- MUC performance evaluation
  - recall = (# correct slot-fillers in output template) / (# slot-fillers in answer key)
  - precision = (# correct slot-fillers in output template) / (# slot-fillers in output template)
- Applications
  - analyzing terrorist activities, business joint ventures, medical patient records, ...
  - building knowledge bases from web pages, job-listing databases from newsgroups / web sites / advertisements, weather-forecast databases from web pages, ...
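The recall and precision definitions above can be checked with a few lines of Python; the slot names in the example are hypothetical, standing in for a MUC-style answer key.

```python
def score_template(output_slots, answer_key_slots):
    """Score one template; slot-fillers are (slot, filler) pairs."""
    correct = len(set(output_slots) & set(answer_key_slots))
    recall = correct / len(answer_key_slots)      # correct / answer-key slots
    precision = correct / len(output_slots)       # correct / output slots
    return recall, precision

# Hypothetical example: 2 of 3 answer-key fillers found, plus 1 wrong filler.
answer_key = [("perpetrator", "FMLN"), ("victim", "mayor"),
              ("location", "San Salvador")]
output = [("perpetrator", "FMLN"), ("victim", "mayor"),
          ("location", "Bogota")]
r, p = score_template(output, answer_key)
# r == 2/3 and p == 2/3
```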

Introduction (2/2)
- Problems in today's IE systems
  - accuracy
    - the errors of an automated IE system are due to its relatively shallow understanding of the input text
    - such errors are difficult to track down and to correct
  - portability
    - IE systems are domain specific by nature
    - manually modifying and adding domain-specific linguistic knowledge to an existing NLP system is slow and error-prone
- We will see that empirical methods for IE are corpus-based machine learning algorithms.

The Architecture of an IE System (1/2)
- Approaches to IE in the early days
  - traditional NLP techniques vs. keyword-matching techniques
- Standard architecture for IE systems (Figure 2)
  - tokenization and tagging
    - tag each word with respect to part of speech and possibly semantic class
  - sentence analysis
    - one or more stages of syntactic analysis
    - identify noun/verb groups, prepositional phrases, subjects, objects, conjunctions, ...
    - identify semantic entities relevant to the extraction topic
    - the system need only perform partial parsing
      - looks for fragments of text that can be reliably recognized
      - ambiguity-resolution decisions can be postponed

The Architecture of an IE System (2/2)
- Standard architecture for IE systems (continued)
  - extraction
    - the first entirely domain-specific component
    - identifies domain-specific relations among relevant entities in the text
  - merging
    - coreference resolution (anaphora resolution): determines whether a phrase refers to an existing entity or introduces a new one
    - determines the implicit subjects of all verb phrases (discourse-level inference)
  - template generation
    - determines the number of distinct events in the text
    - maps the individually extracted pieces of information onto each event
    - produces output templates
    - the best place to apply domain-specific constraints
    - some slots require set fills, or require normalization of their fillers
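The stage ordering described on these two slides can be sketched as a pipeline of functions. All the function names and the trivial stage bodies below are hypothetical stand-ins for real components, meant only to show how output flows from one stage to the next.

```python
# Hypothetical stand-ins for the five stages of the standard IE architecture.
def tokenize_and_tag(text):
    # Real systems attach POS (and possibly semantic-class) tags here.
    return [(tok, "UNK") for tok in text.split()]

def analyze_sentence(tagged):
    # Partial parsing: keep only fragments recognized reliably
    # (here, crudely, capitalized tokens stand in for noun groups).
    return {"noun_groups": [t for t, _ in tagged if t[0].isupper()]}

def extract(fragments):
    # Domain-specific stage: relate relevant entities (here, pass them through).
    return [{"entity": ng} for ng in fragments["noun_groups"]]

def merge(entities):
    # Coreference resolution collapses mentions of the same object.
    seen, merged = set(), []
    for e in entities:
        if e["entity"] not in seen:
            seen.add(e["entity"])
            merged.append(e)
    return merged

def generate_templates(entities):
    # One output template per distinct event; this toy assumes a single event.
    return [{"event": 1, "slots": entities}]

def ie_pipeline(text):
    return generate_templates(merge(extract(analyze_sentence(tokenize_and_tag(text)))))
```

Note how only `extract` onward is domain specific, which is exactly the portability boundary the article draws.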

The Role of Corpus-Based Language Learning Algorithms (1/3)
- Q: How have researchers used empirical methods in NLP to improve the accuracy and portability of IE systems?
  - A: corpus-based language learning algorithms have been used to improve individual components of the IE system.
- For language tasks that are domain-independent and syntactic
  - annotated corpora already exist
  - POS tagging, partial parsing, word sense disambiguation (WSD)
  - the importance of WSD for the IE task remains unclear
- NL learning techniques are more difficult to apply to subsequent stages of IE
  - learning extraction patterns, coreference resolution, template generation

The Role of Corpus-Based Language Learning Algorithms (2/3)
- The problems of applying empirical methods
  - no corpora annotated with the appropriate semantic and domain-specific supervisory information
    - a corpus for IE = texts paired with their output templates (answer keys)
    - the output templates say nothing about which occurrence of a string is responsible for the extraction
    - they provide no direct means for learning patterns to extract symbols that need not appear anywhere in the text (set fills)
  - the semantic and domain-specific language-processing skills require the output of earlier levels of analysis (tagging and partial parsing)
    - this complicates generating the training examples
    - whenever the behavior of these earlier modules changes, new training examples must be generated and the learning algorithms for later stages must be retrained
    - learning algorithms must deal with noise caused by errors from earlier components
  - new algorithms need to be developed

The Role of Corpus-Based Language Learning Algorithms (3/3)
- Data-driven nature of corpus-based approaches
  - accuracy
    - when the training data is derived from the same type of texts that the IE system is to process, the acquired language skills are automatically tuned to that corpus, increasing the accuracy of the system
  - portability
    - because each NLU skill is learned automatically rather than being manually coded, that skill can be moved quickly from one IE system to another by retraining the appropriate component

Learning Extraction Patterns (1/5)
- The role for empirical methods in the extraction phase
  - knowledge acquisition: automate the acquisition of good extraction patterns
- AutoSlog [Riloff 1993]
  - learns extraction patterns in the form of domain-specific concept node definitions for use with the CIRCUS parser (Figure 3)
  - learns concept node definitions via a one-shot learning algorithm
  - background knowledge: a small set of general linguistic patterns (approximately 13)
  - requires a human feedback loop, which filters out bad extraction patterns
  - accuracy: 98%, portability: 5 hours
  - a critical step toward building IE systems that are trainable entirely by end users (Figure 4)

Empirical Methods in Information Extraction[Cardie97]11 Learning Extraction Patterns(2/5) Given: a noun phrase to be extracted 1. Find the sentence from which the noun phrase originated. 2. Present the sentence to the partial parser for processing. 3. Apply the linguistic patterns in order. 4. When a pattern applies, generate a concept node definition from the matched constituents, their context, the concept type provided in the annotation for the target noun phrase, and the predefined semantic class for the filler. followed by = Concept = of > Trigger = “ of >” Position = direct-object Constraints = (( of >)) Enabling Conditions = ((active-voice)) AutoSlog’s Learning Algorithm

Learning Extraction Patterns (3/5)
- PALKA [Kim & Moldovan 1995]
  - background knowledge
    - concept hierarchy
      - a set of keywords that can be used to trigger each pattern
      - a set of generic semantic case frame definitions for each type of information to be extracted
    - semantic class lexicon
- CRYSTAL [Soderland 1995]
  - triggers comprise a much more detailed specification of linguistic context
  - employs a covering algorithm
  - medical diagnosis domain
  - precision: 50-80%, recall: 45-75%

Empirical Methods in Information Extraction[Cardie97]13 Learning Extraction Patterns(4/5) 1. Begin by generating the most specific concept node possible for every phrase to be extracted in the training texts. 2. For each concept node C 2.1. Find the most similar concept node C’ Relax the constrains of each just enough to unify C and C’ Test the new extraction pattern P against the training corpus. If (error rate < threshold) then Add P; Replace C and C’ else stop. CRYSTAL’s Learning Algorithm

Learning Extraction Patterns (5/5)
- Comparison
  - AutoSlog: general to specific; human feedback
  - PALKA: generalization and specialization; automated feedback; requires more background knowledge
  - CRYSTAL: specific to general (covering algorithm); automated feedback; requires more background knowledge
- Research issues
  - handling set fills
  - type of the extracted information
  - evaluation: determining which method for learning extraction patterns will give the best results in a new extraction domain

Coreference Resolution and Template Generation (1/3)
- Discourse processing is a major weakness of existing IE systems
  - generating good heuristics is challenging
  - heuristics typically assume fully parsed sentences as input
  - they must take into account the accumulated errors of earlier phases
  - they must be able to handle the myriad forms of coreference across different domains
- The coreference problem as a classification task (Figure 5)
  - given two phrases and the context in which they occur,
  - classify the phrases with respect to whether or not they refer to the same object
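The classification framing above amounts to pairwise instance generation: every pair of phrases becomes one training example labeled coreferent or not. The two features below are hypothetical stand-ins for the dozens used by systems such as MLR and RESOLVE.

```python
from itertools import combinations

def make_instances(phrases, chains):
    """Turn annotated phrases into pairwise classification examples.

    phrases: list of (phrase_text, sentence_index)
    chains:  list of sets of phrase indices that corefer (the gold annotation)
    """
    coreferent = {frozenset(p) for chain in chains
                  for p in combinations(chain, 2)}
    instances = []
    for i, j in combinations(range(len(phrases)), 2):
        (t1, s1), (t2, s2) = phrases[i], phrases[j]
        features = {
            "string_match": t1.lower() == t2.lower(),  # hypothetical feature
            "sentence_distance": abs(s1 - s2),         # hypothetical feature
        }
        instances.append((features, frozenset((i, j)) in coreferent))
    return instances

phrases = [("IBM", 0), ("the company", 1), ("IBM", 2)]
chains = [{0, 1, 2}]
data = make_instances(phrases, chains)
# 3 pairs, all labeled coreferent; a decision-tree learner (e.g., C4.5)
# would then be trained on such (features, label) pairs.
```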

Coreference Resolution and Template Generation (2/3)
- MLR [Aone & Bennett 1995]
  - uses the C4.5 decision tree induction system
  - tested on a Japanese corpus for the business joint-venture domain
  - uses an automatically generated data set
  - 66 domain-independent features
  - evaluated using data sets derived from 250 texts
  - recall: %, precision: 83-88%
- RESOLVE [McCarthy & Lehnert 1995]
  - uses the C4.5 decision tree induction system
  - tested on the English corpus for the business joint-venture domain (MUC-5)
  - uses a manually generated, noise-free data set
  - includes domain-specific features
  - evaluated using data sets derived from 50 texts
  - recall: 80-85%, precision: 87-92%

Coreference Resolution and Template Generation (3/3)
- The results for coreference resolution are promising
  - it is possible to develop automatically trainable coreference systems that compete favorably with manually designed systems
  - specially designed learning algorithms need not be developed
  - symbolic ML techniques offer a mechanism for evaluating the usefulness of different knowledge sources
- Still, much research remains to be done
  - additional types of anaphors using a variety of feature sets
  - the role of domain-specific information in coreference resolution
  - the relative effect of errors from the preceding phases of text analysis
- Trainable systems that tackle merging and template generation
  - TTG [Dolan 1991], Wrap-Up [Soderland & Lehnert 1994]
  - generate a series of decision trees

Future Directions
- Unsupervised learning algorithms
  - a means of sidestepping the lack of large annotated corpora
- Techniques that allow end users to quickly train IE systems
  - through interaction with the system over time
  - without intervention by NLP system developers

Figure 1: IE System in the Domain of Natural Disasters

Figure 2: Architecture for an IE System

Figure 3: Concept Node for Extracting "Damage" Information

Concept node definition: a domain-specific semantic case frame (one slot per frame)
- Concept: the type of concept to be recognized
- Trigger: the word that activates the pattern
- Position: the syntactic position where the concept is expected to be found
- Constraints: selectional restrictions that apply to any potential instance of the concept
- Enabling Conditions: constraints on the linguistic context of the triggering word that must be satisfied before the pattern is activated

Figure 4: Learning Information Extraction Patterns

Figure 5: A Machine Learning Approach to Coreference Resolution