CSA4050: Advanced Topics in NLP
Information Extraction I
What is Information Extraction?
November 2003



Sources
R. Gaizauskas and Y. Wilks, Information Extraction: Beyond Document Retrieval. Technical Report CS-97-10, Department of Computer Science, University of Sheffield, 1997.

What is Information Extraction?
IE: the analysis of unrestricted text in order to extract information about pre-specified types of entity, relationship and event.
Typically, the text is newspaper text or a newswire feed.
Typically, the pre-specified structure is a class-like object with different data fields.

An Example of Information Extraction
Text:
19 March - A bomb went off near a power tower in San Salvador, leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb, allegedly detonated by urban guerilla commandos, blew up a power tower in the northwestern part of San Salvador.
Template Structure:
  Incident Type: bombing
  Date: March 19
  Location: San Salvador
  Perpetrator: Urban Guerilla Commandos
  Target: power tower
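The template above can be pictured as a small record type. The following is a minimal Python sketch, assuming a hypothetical class name `IncidentTemplate` with field names derived from the slot labels; it illustrates the "class-like object with data fields" idea and is not the format of any particular IE system.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class IncidentTemplate:
    """Hypothetical record mirroring the slots of the example template."""
    incident_type: Optional[str] = None
    date: Optional[str] = None
    location: Optional[str] = None
    perpetrator: Optional[str] = None
    target: Optional[str] = None


# Slot values copied by hand from the example; a real system would
# have to fill them automatically from the text.
filled = IncidentTemplate(
    incident_type="bombing",
    date="March 19",
    location="San Salvador",
    perpetrator="Urban Guerilla Commandos",
    target="power tower",
)

# Slots still set to None are the ones that have not been filled.
unfilled_slots = [name for name, value in asdict(filled).items() if value is None]
```

Leaving every field optional reflects the fact that a text may simply not mention a value for some slot.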

Different levels of structure can be envisaged:
- Named Entities
- Relationships
- Events
- Scenarios

Examples of Named Entities
People
- John Smith, J. Smith, Smith, John, Mr. Smith
Locations
- EU, The Hague, SLT, Piazza Tuta
Organisations
- IBM, The Mizzi Group, University of Malta
Numerical Quantities
- Lm 10, forty per cent, 40%, $10

Examples of Relationships between Named Entities
George Bush_1 is [President_2 of the United States_3]_4
- nation(3)
- president(1,3)
- coref(1,4)
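One way to read the annotations above is as a small set of facts over numbered mentions. The sketch below is an assumption about representation only: the dictionary `mentions`, the tuple layout of `facts` and the helper `describe` are hypothetical names, used to show how the indices tie each predicate back to a span of text.

```python
# Numbered mentions from the example sentence.
mentions = {
    1: "George Bush",
    2: "President",
    3: "the United States",
    4: "President of the United States",
}

# Each fact is (predicate, argument indices), as listed on the slide.
facts = [
    ("nation", (3,)),
    ("president", (1, 3)),
    ("coref", (1, 4)),
]


def describe(fact):
    """Render a fact with the mention text substituted for each index."""
    predicate, args = fact
    return f"{predicate}({', '.join(mentions[i] for i in args)})"


readable = [describe(f) for f in facts]
```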

Examples of Events
Financial Events
- Takeover bids
- Changes of management
Socio/Political Events
- Terrorist attacks
- Traffic accidents
Geographical Events
- Natural Disasters

Some Differences between IE and IR
Information Extraction (IE):
- IE extracts relevant information from documents.
- IE emerged from research into rule-based systems in computational linguistics (CL).
- IE is typically based on some kind of linguistic analysis of the source text.
Information Retrieval (IR):
- IR retrieves relevant documents from a collection.
- IR is mostly influenced by information theory, probability, and statistics.
- IR typically uses a bag-of-words model of the source text.

Why Linguistic Analysis is Necessary
Active/passive distinction
- BNC Holdings named Ms G. Torretta to succeed Mr. N. Andrews as new chairperson.
- Nicholas Andrews was named by Gina Torretta as chairperson of BNC Holdings.
Use of different phrases to mean the same thing
- Ms. Gina Torretta took the helm at BNC Holdings. She succeeds Nick Andrews.
- G Torretta succeeds N Andrews as chairperson at BNC Holdings.
Establishing coreferences
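To see why a bag-of-words view is not enough here, consider the following minimal Python sketch. The two sentences are constructed for illustration (simplified from the examples above, not taken from the slides), and `bag_of_words` is a hypothetical helper. The sentences describe opposite successions, yet their bags are identical, so a model that sees only word counts cannot tell who succeeded whom.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lower-case, split on whitespace, and count the words (order is discarded)."""
    return Counter(sentence.lower().split())


# Constructed example: same words, opposite meaning.
a = "Torretta succeeds Andrews as chairperson"
b = "Andrews succeeds Torretta as chairperson"

same_bag = bag_of_words(a) == bag_of_words(b)
same_text = a == b
```

Here `same_bag` is true while `same_text` is false, which is the gap that linguistic analysis (who is the subject, who is the object) is meant to close.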

Brief History
- N. Sager, Linguistic String Project: automatically induced information formats for radiology reports
- 1970s: R. Schank, Scripts
- 1982: G. DeJong, FRUMP: "Sketchy Scripts" used to process UPI newswire stories in domains such as earthquakes and labour strikes; systematic evaluation
- J-P Zarri: analysis of historical texts by translating text into a semantic metalanguage
- 1986: ATRANS (S. Lytinen et al): script-based system for analysis of money transfer messages between banks
- 1992: Carnegie Group, JASPER: skims company press releases to fill in templates concerning earnings and dividends

Message Understanding Conferences
Conferences aimed at comparing the performance of a number of systems working on IE from naval messages.
Sponsored by DARPA and organised by the US Naval Command centre, San Diego.
- Progressively more difficult tasks.
- Progressively more refined evaluation measures.

MUC Tasks
MUC1: tactical naval operations reports on ship sightings and engagements. No task definition; no evaluation criteria.
MUC3: newswire stories about terrorist attacks. 18-slot templates to be filled. Formal evaluation criteria supplied.
MUC6: specific subtasks including named entity recognition, coreference identification, and scenario template extraction.

IE Subtasks
Named Entity recognition (NE)
- Finds and classifies names, places etc.
Coreference Resolution (CO)
- Identifies identity relations between entities in texts.
Template Element construction (TE)
- Adds descriptive information to NE results (using CO).
Template Relation construction (TR)
- Finds relations between TE entities.
Scenario Template production (ST)
- Fits TE and TR results into specified event scenarios.

Evaluation: the IR Starting Point
[Figure: diagram of the "selected" and "target" sets, with regions labelled false pos, true pos and false neg.]

Evaluation Metrics
Starting points are those used for IR, namely recall and precision.

                Relevant          Not Relevant
Retrieved       tp (true pos)     fp (false pos)
Not Retrieved   fn (false neg)    tn (true neg)

IR Measures: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
Precision P = tp/(tp + fp)
Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
Recall R = tp/(tp + fn)
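The two formulas can be checked with a few lines of Python. This is a minimal sketch: the function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the slides specify.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of retrieved items that are relevant: tp / (tp + fp)."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of relevant items that are retrieved: tp / (tp + fn)."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


# Hypothetical counts: 8 relevant docs retrieved, 2 irrelevant docs
# retrieved, 12 relevant docs missed.
tp, fp, fn = 8, 2, 12
p = precision(tp, fp)  # 8 / 10
r = recall(tp, fn)     # 8 / 20
```

Note that tn (true negatives) appears in neither formula, which is why both measures stay meaningful when the collection is dominated by irrelevant documents.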

F-Measure
Whatever method is chosen to establish P and R, there is a trade-off between them.
For this reason researchers often use a measure which combines the two.
F = 1/(α/P + (1 - α)/R) is commonly used, where α is a factor which determines the weighting between P and R.
When α = 0.5 the formula reduces to the harmonic mean of P and R: F = 2PR/(P + R).
Clearly F is weighted towards P as α approaches 1.
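The behaviour of α can be verified numerically. The sketch below implements the formula as stated; the function name `f_measure`, the example values of P and R, and the choice to return 0.0 when either input is zero (where the formula is undefined) are assumptions made for illustration.

```python
def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    """Weighted harmonic mean F = 1 / (alpha/P + (1 - alpha)/R)."""
    if p <= 0.0 or r <= 0.0:
        return 0.0  # assumed convention; the formula is undefined at 0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


p, r = 0.8, 0.4
balanced = f_measure(p, r)              # alpha = 0.5
harmonic = 2 * p * r / (p + r)          # 2PR / (P + R)
towards_p = f_measure(p, r, alpha=0.9)  # weighted towards P
towards_r = f_measure(p, r, alpha=0.1)  # weighted towards R
```

With these values `balanced` and `harmonic` coincide, and `towards_r < balanced < towards_p`, since P is the larger of the two inputs.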

Harmonic Mean
[Figure: comparison of the arithmetic, geometric and harmonic means of x and y.]
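The three means named on this slide can be compared numerically. The following minimal sketch uses hypothetical helper names and example values; for positive inputs the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean, and the gap widens as x and y move apart.

```python
import math


def arithmetic_mean(x: float, y: float) -> float:
    return (x + y) / 2


def geometric_mean(x: float, y: float) -> float:
    return math.sqrt(x * y)


def harmonic_mean(x: float, y: float) -> float:
    return 2 * x * y / (x + y)


# Deliberately unbalanced values, e.g. high precision with low recall.
x, y = 0.9, 0.1
am = arithmetic_mean(x, y)
gm = geometric_mean(x, y)
hm = harmonic_mean(x, y)
```

For an evaluation measure this matters because the harmonic mean stays low unless both inputs are reasonably high, so a system cannot score well by maximising only one of them.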

Evaluation Metrics for IE
For IE, these measures need to be related to the activity of slot-filling:
- Slot fills can be correct, partially correct, incorrect, missing, or spurious.
- These differences permit the introduction of finer-grained measures of correctness that include overgeneration, undergeneration, and substitution.

Recall
Recall is a measure of how much relevant information a system has extracted from text.
It is the ratio of how much information is actually extracted against how much information there is to be extracted, i.e.
Recall = (count of facts extracted) / (count of possible facts)

Precision
Precision is a measure of how accurate a system is in extracting information.
It is the ratio of how much correct information is actually extracted against how much information is extracted, i.e.
Precision = (count of correct facts extracted) / (count of facts extracted)
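The slot-fill categories listed earlier (correct, partially correct, incorrect, missing, spurious) can be turned into scores along the following lines. The slides do not give the formulas, so everything here is an assumption: the sketch follows definitions commonly cited for MUC-style scoring, in which a partially correct fill earns half credit, and the names `SlotCounts` and `score` as well as the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SlotCounts:
    correct: int
    partial: int
    incorrect: int
    missing: int
    spurious: int

    @property
    def possible(self) -> int:
        """Fills in the answer key: what the system should have produced."""
        return self.correct + self.partial + self.incorrect + self.missing

    @property
    def actual(self) -> int:
        """Fills the system actually produced."""
        return self.correct + self.partial + self.incorrect + self.spurious


def _ratio(numerator: float, denominator: float) -> float:
    # Assumed convention: a score with an empty denominator is 0.0.
    return numerator / denominator if denominator else 0.0


def score(c: SlotCounts) -> dict:
    credit = c.correct + 0.5 * c.partial  # half credit for partial fills (assumed)
    attempted = c.correct + c.partial + c.incorrect
    return {
        "recall": _ratio(credit, c.possible),
        "precision": _ratio(credit, c.actual),
        "overgeneration": _ratio(c.spurious, c.actual),
        "undergeneration": _ratio(c.missing, c.possible),
        "substitution": _ratio(c.incorrect + 0.5 * c.partial, attempted),
    }


# Hypothetical counts for one scored template.
counts = SlotCounts(correct=6, partial=2, incorrect=1, missing=1, spurious=2)
scores = score(counts)
```

For these counts there are 10 possible and 11 actual fills, so recall is 7/10 and precision is 7/11.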

Bare Bones Architecture (from Appelt and Israel 1999)
[Figure: architecture diagram.]
Main stages: Tokenisation, Morphological & Lexical Processing, Syntactic Analysis, Discourse Analysis
Associated components: Word segmentation, POS Tagging, Word Sense Tagging, Preparsing, Parsing, Coreference

Generic IE System (Hobbs 1993)
[Figure: system diagram with the modules Text Zoner, Preprocessor, Filter, Preparser, Parser, Fragment Combiner, Semantic Interpreter, Lexical Disambiguator, Coreference Resolution, Template Generator.]

Large Scale IE (LaSIE)
General-purpose IE research system geared towards MUC-6 tasks.
Pipelined system with three principal processing tasks:
- Lexical preprocessing
- Parsing and semantic interpretation
- Discourse interpretation

LaSIE: Processing Stages
Lexical preprocessing: reads, tokenises, and tags raw input text.
Parsing and semantic interpretation: chart parser; best-parse selection; construction of predicate/argument structure.
Discourse interpretation: adds information from the predicate-argument representation to a world model in the form of a hierarchically structured semantic net.

LaSIE Parse Forest
It is rare that the analysis contains a unique, spanning parse.
Selection of the best parse is carried out by choosing that sequence of non-overlapping, semantically interpretable categories that covers the most words and consists of the fewest constituents.
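The selection criterion just described can be sketched as a small dynamic programme over token positions. This is an illustration of the stated criterion only, not LaSIE's actual implementation: the function name `select_best_sequence`, the `(start, end, category)` span layout and the example spans are hypothetical, and the candidate spans are assumed to have already been filtered down to semantically interpretable categories.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, category)


def select_best_sequence(n_tokens: int, spans: List[Span]) -> List[Span]:
    """Pick non-overlapping spans covering the most words, then the fewest spans."""
    # best[i] = (words covered, -spans used, chosen spans) for tokens[:i]
    best = [(0, 0, [])]
    for i in range(1, n_tokens + 1):
        # Option 1: leave token i-1 uncovered.
        candidate = best[i - 1]
        # Option 2: end a candidate span exactly at position i.
        for span in spans:
            start, end, _ = span
            if end == i and 0 <= start < end:
                covered, neg_count, chosen = best[start]
                option = (covered + (end - start), neg_count - 1, chosen + [span])
                # Compare on (coverage, -count): more words first, then fewer spans.
                if option[:2] > candidate[:2]:
                    candidate = option
        best.append(candidate)
    return best[n_tokens][2]


# Hypothetical parse forest over a 6-token sentence.
forest = [
    (0, 1, "N"), (1, 2, "N"), (0, 2, "NP"),
    (2, 3, "V"), (3, 6, "NP"), (2, 6, "VP"),
]
chosen = select_best_sequence(6, forest)
```

In this example several sequences cover all six words, and the tie is broken in favour of NP followed by VP because it uses two constituents rather than three or four.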

LaSIE Discourse Model

Example Applications of IE
- Finance
- Medicine
- Law
- Police
- Academic Research

Future Trends
- Better performance: higher precision & recall
- User (not expert) defined IE: minimisation of the role of the expert
- Integration with other technologies (e.g. IR)
- Multilingual IE