CSA4050: Advanced Topics in NLP
Information Extraction I: What is Information Extraction?
November 2003
Sources
– R. Gaizauskas and Y. Wilks, Information Extraction: Beyond Document Retrieval. Technical Report CS-97-10, Department of Computer Science, University of Sheffield, 1997.
What is Information Extraction?
IE is the analysis of unrestricted text in order to extract information about pre-specified types of entities, relationships and events. Typically the text is newspaper text or a newswire feed, and the pre-specified structure is a class-like object with a number of data fields.
An Example of Information Extraction
Source text: "19 March – A bomb went off near a power tower in San Salvador, leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb (allegedly detonated by urban guerrilla commandos) blew up a power tower in the northwestern part of San Salvador."
Filled template:
– Incident Type: bombing
– Date: March 19
– Location: San Salvador
– Perpetrator: Urban Guerilla Commandos
– Target: power tower
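The "class-like object with data fields" can be sketched in code. This is purely illustrative (the slot names are taken from the slide; the `render` helper is a hypothetical convenience, not part of any MUC format):

```python
# Illustrative only: a filled template represented as a dictionary
# mapping slot names to extracted fill strings.
template = {
    "incident_type": "bombing",
    "date": "March 19",
    "location": "San Salvador",
    "perpetrator": "Urban Guerilla Commandos",
    "target": "power tower",
}

def render(t):
    """Print the template in a readable slot-per-line form."""
    return "\n".join(f"{slot}: {fill}" for slot, fill in t.items())

print(render(template))
```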
Different levels of structure can be envisaged:
– Named Entities
– Relationships
– Events
– Scenarios
Examples of Named Entities
– People: John Smith, J. Smith, Smith, John, Mr. Smith
– Locations: EU, The Hague, SLT, Piazza Tuta
– Organisations: IBM, The Mizzi Group, University of Malta
– Numerical Quantities: Lm 10, forty per cent, 40%, $10
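A minimal sketch of how rule-based NE recognition might tag quantities and person names like those above. The regular expressions here are toy assumptions for illustration, not a real NE grammar:

```python
import re

# Toy patterns keyed by entity type; each is an assumption, not a
# pattern from any actual IE system.
PATTERNS = [
    ("MONEY",   re.compile(r"\$\d+|Lm\s?\d+")),            # e.g. "$10", "Lm 10"
    ("PERCENT", re.compile(r"\d+%|\w+ per cent")),         # e.g. "40%", "forty per cent"
    ("PERSON",  re.compile(r"(?:Mr|Ms)\.\s[A-Z]\w+")),     # e.g. "Mr. Smith"
]

def tag_entities(text):
    """Return (type, matched string) pairs for every pattern hit."""
    hits = []
    for ne_type, pat in PATTERNS:
        hits += [(ne_type, m.group()) for m in pat.finditer(text)]
    return hits

print(tag_entities("Mr. Smith paid Lm 10, about forty per cent of $10."))
```

Real NE recognisers combine such patterns with gazetteers and contextual rules; the point here is only the shape of the task: find spans and classify them.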
Examples of Relationships between Named Entities
[George Bush]1 is [[President]2 of [the United States]3]4
– nation(3)
– president(1,3)
– coref(1,4)
Examples of Events
– Financial Events: takeover bids; changes of management
– Socio/Political Events: terrorist attacks; traffic accidents
– Geographical Events: natural disasters
Some Differences between IE and IR
– IE extracts relevant information from documents; Information Retrieval (IR) retrieves relevant documents from a collection.
– IE emerged from research into rule-based systems in computational linguistics; IR is mostly influenced by information theory, probability, and statistics.
– IE is typically based on some kind of linguistic analysis of the source text; IR typically uses a bag-of-words model of the source text.
Why Linguistic Analysis is Necessary
– Active/passive distinction:
  "BNC Holdings named Ms G. Torretta to succeed Mr. N. Andrews as new chairperson."
  "Nicholas Andrews was named by Gina Torretta as chairperson of BNC Holdings."
– Use of different phrases to mean the same thing:
  "Ms. Gina Torretta took the helm at BNC Holdings. She succeeds Nick Andrews."
  "G. Torretta succeeds N. Andrews as chairperson at BNC Holdings."
– Establishing coreferences.
Brief History
– 1960–80: N. Sager, Linguistic String Project: automatically induced information formats for radiology reports.
– 1970s: R. Schank: scripts.
– 1982: G. DeJong, FRUMP: "sketchy scripts" used to process UPI newswire stories in domains such as earthquakes and labour strikes; systematic evaluation.
– 1983: J.-P. Zarri: analysis of historical texts by translating text into a semantic metalanguage.
– 1986: ATRANS (S. Lytinen et al.): script-based system for analysis of money-transfer messages between banks.
– 1992: Carnegie Group, JASPER: skims company press releases to fill in templates concerning earnings and dividends.
Message Understanding Conferences
Conferences aimed at comparing the performance of a number of systems working on IE from naval messages. Sponsored by DARPA and organised by the US Naval Command centre, San Diego.
– Progressively more difficult tasks.
– Progressively more refined evaluation measures.
MUC Tasks
– MUC-1: tactical naval operations reports on ship sightings and engagements. No task definition; no evaluation criteria.
– MUC-3: newswire stories about terrorist attacks. Templates with 18 slots to be filled. Formal evaluation criteria supplied.
– MUC-6: specific subtasks including named entity recognition, coreference identification, and scenario template extraction.
IE Subtasks
– Named Entity recognition (NE): finds and classifies names, places, etc.
– Coreference resolution (CO): identifies identity relations between entities in texts.
– Template Element construction (TE): adds descriptive information to NE results (using CO).
– Template Relation construction (TR): finds relations between TE entities.
– Scenario Template production (ST): fits TE and TR results into specified event scenarios.
Evaluation: the IR Starting Point
[Diagram: two overlapping sets, "selected" and "target"; the overlap holds the true positives, the rest of "selected" the false positives, and the rest of "target" the false negatives.]
Evaluation Metrics
Starting points are those used for IR, namely recall and precision.

                  Relevant          Not Relevant
  Retrieved       tp (true pos)     fp (false pos)
  Not Retrieved   fn (false neg)    tn (true neg)
IR Measures: Precision and Recall
– Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved); P = tp / (tp + fp)
– Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant); R = tp / (tp + fn)
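The two definitions translate directly into code. The counts below are hypothetical, chosen only to exercise the formulas:

```python
def precision(tp, fp):
    """Fraction of retrieved items that are relevant: tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant items that were retrieved: tp / (tp + fn)."""
    return tp / (tp + fn)

# Hypothetical counts: 8 relevant docs retrieved, 2 irrelevant docs
# retrieved, 4 relevant docs missed.
print(precision(8, 2))  # 0.8
print(recall(8, 4))     # 8/12 ~ 0.667
```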
F-Measure
Whatever method is chosen to establish P and R, there is a trade-off between them. For this reason researchers often use a measure which combines the two:
  F = 1 / (α/P + (1-α)/R)
where α is a factor which determines the weighting between P and R. When α = 0.5 the formula reduces to the harmonic mean 2PR/(P+R). Clearly F is weighted towards P as α approaches 1.
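A short sketch of the weighted F-measure as defined above (the sample P and R values are arbitrary):

```python
def f_measure(p, r, alpha=0.5):
    """Weighted harmonic combination F = 1 / (alpha/P + (1-alpha)/R)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

p, r = 0.8, 0.4
# With alpha = 0.5 this reduces to the harmonic mean 2PR/(P+R).
print(f_measure(p, r))        # 1 / (0.625 + 1.25) ~ 0.533
# Raising alpha weights F towards precision.
print(f_measure(p, r, 0.9))
```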
Harmonic Mean
   x    y   arithmetic mean   geometric mean   harmonic mean
  40   60         50                49               48
  30   70         50                46               42
  20   80         50                40               32
Evaluation Metrics for IE
For IE, these measures need to be related to the activity of slot-filling:
– Slot fills can be correct, partially correct, incorrect, missing, or spurious.
– These differences permit the introduction of finer-grained measures of correctness that include overgeneration, undergeneration, and substitution.
Recall
Recall is a measure of how much relevant information a system has extracted from the text. It is the ratio of how much information is actually extracted to how much information there is to be extracted, i.e.
  recall = count of facts extracted / count of possible facts
Precision
Precision is a measure of how accurate a system is in extracting information. It is the ratio of how much correct information is actually extracted to how much information is extracted in total, i.e.
  precision = count of correct facts extracted / count of facts extracted
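Slot-fill scoring can be sketched by combining the fill categories from the previous slide with these two ratios. The half-credit for partial fills is an assumption modelled loosely on MUC-style scoring, not an exact reproduction of any official scorer:

```python
def score(correct, partial, incorrect, spurious, missing):
    """Precision/recall over slot fills; partial fills count half
    (an assumed convention, loosely MUC-style)."""
    credit = correct + 0.5 * partial
    extracted = correct + partial + incorrect + spurious   # what the system produced
    possible = correct + partial + incorrect + missing     # what the key contains
    return credit / extracted, credit / possible

# Hypothetical tallies for a run over one document set.
p, r = score(correct=6, partial=2, incorrect=1, spurious=1, missing=2)
print(p, r)
```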
Bare Bones Architecture (from Appelt and Israel 1999)
– Tokenisation: word segmentation
– Morphological & Lexical Processing: POS tagging; word sense tagging
– Syntactic Analysis: preparsing; parsing
– Discourse Analysis: coreference
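The pipeline shape of this architecture can be sketched as function composition. The stage names follow the slide; every function body is a deliberately naive stub (splitting on whitespace, tagging capitalised tokens), included only to show how each stage consumes the previous stage's output:

```python
def tokenise(text):
    """Word segmentation (naive whitespace split)."""
    return text.split()

def lexical_processing(tokens):
    """Stub POS/lexical tagging: capitalised tokens tagged NOUN."""
    return [(t, "NOUN" if t[0].isupper() else "WORD") for t in tokens]

def syntactic_analysis(tagged):
    """Stand-in for preparsing/parsing: wrap the tagged sequence."""
    return {"parse": tagged}

def discourse_analysis(parse):
    """Stub discourse stage: collect candidate entity mentions."""
    return {"entities": [t for t, tag in parse["parse"] if tag == "NOUN"]}

def pipeline(text):
    return discourse_analysis(syntactic_analysis(lexical_processing(tokenise(text))))

print(pipeline("BNC Holdings named Ms Torretta as chairperson"))
```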
Generic IE System (Hobbs 1993)
Text Zoner → Preprocessor → Filter → Preparser → Parser → Fragment Combiner → Semantic Interpreter → Lexical Disambiguator → Coreference Resolution → Template Generator
Large Scale IE: LaSIE
General-purpose IE research system geared towards MUC-6 tasks. Pipelined system with three principal processing stages:
– Lexical preprocessing
– Parsing and semantic interpretation
– Discourse interpretation
LaSIE: Processing Stages
– Lexical preprocessing: reads, tokenises, and tags raw input text.
– Parsing and semantic interpretation: chart parser; best-parse selection; construction of predicate/argument structure.
– Discourse interpretation: adds information from the predicate-argument representation to a world model in the form of a hierarchically structured semantic net.
LaSIE Parse Forest
It is rare that the analysis contains a unique spanning parse. Selection of the best parse is carried out by choosing the sequence of non-overlapping, semantically interpretable categories that covers the most words and consists of the fewest constituents.
LaSIE Discourse Model
Example Applications of IE
– Finance
– Medicine
– Law
– Police
– Academic research
Future Trends
– Better performance: higher precision and recall.
– User-defined (not expert-defined) IE: minimisation of the role of the expert.
– Integration with other technologies (e.g. IR).
– Multilingual IE.