Information Extraction: What It Is, How to Do It, Where It’s Going. Douglas E. Appelt, Artificial Intelligence Center, SRI International
Some URLs to Visit l » ANLP-97 tutorial on information extraction » Many WWW links –Research sites and literature –Resources for building systems l » An IE System for Power PC Macintoshes » Uses TIPSTER technology –TIPSTER architecture –Common Pattern Specification Language –It’s free » Comes with a complete English name recognizer
Information Extraction: Situating IE l Text Manipulation: grep l Information Retrieval l Information Extraction l Text Understanding
Text Understanding l No predetermined specification of semantic or communicative areas of interest l No clearly defined criteria of success l Representation of meaning must be sufficiently general to capture all of the meaning of the text and the author’s intentions.
Information Extraction l Information of interest is delimited and pre-specified l Fixed, predefined representation of information l Clear criteria of success are at least possible l Corollary Features » Small portion of text is relevant » Often, only a portion of a relevant sentence is relevant » Targeted at relatively large corpora
Applications l Information Retrieval (routing queries) l Indexing for Information Retrieval l Filter for IR Output l Direct Presentation to the User: highlighting l Summarization l Construction of databases and knowledge bases
Evaluation Metrics l MUC Evaluations l Precision and Recall » Recall: percentage of the possible answers that were found » Precision: percentage of the provided answers that are correct » F-measure: weighted harmonic mean of recall and precision l Is there an F-60 barrier?
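The three metrics can be written out as a short sketch. The function names and the `beta` weighting parameter are assumptions chosen for illustration; the formula is the usual weighted F-measure, F = (β² + 1)PR / (β²P + R), and is not taken from any particular scoring program.

```python
def precision(correct: int, provided: int) -> float:
    """Fraction of the answers the system provided that are correct."""
    return correct / provided if provided else 0.0


def recall(correct: int, possible: int) -> float:
    """Fraction of the possible (answer-key) items that the system found."""
    return correct / possible if possible else 0.0


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    """
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * p * r / (b2 * p + r)
```

For example, a system that returns 40 answers, 30 of them correct, against a key of 50 has precision 0.75 and recall 0.60, which gives an F-measure of about 0.67 with equal weighting.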
A Bare Bones Extraction System Tokenizer Morphological and Lexical Processing Parsing Domain Semantics
Flesh for the Bones Tokenizer Morphological and Lexical Processing Parsing Domain Semantics Text Sectionizing And Filtering Part of Speech and Word Sense Tagging Coreference Merging Partial Results
The IE Approach - KISS l Keep it Simple, Stupid » Finite-state language models » Fragment processing » Simple semantics –Propositional –Small number of propositions –Often represented by templates l Use heuristics » Missing Information » Make favorable recall/precision tradeoffs
Two Approaches to Extraction Systems l Knowledge Engineering Approach » Grammars constructed by hand » Domain patterns discovered by introspection and corpus examination » Laborious tuning and hill-climbing l Learning and Statistical Approach » Apply statistical methods where possible » Learn rules from annotated corpora » Learn from interaction with user
Knowledge Engineering Approach l Advantages » Skilled computational linguists can build good systems quickly » Best performing practical systems have so far been handcrafted. l Disadvantages » Very laborious development process » Difficult to port systems to new domains » Requires expertise
Learning-Statistical Approach l Advantages » Domain portability is straightforward » Minimal expertise required for customization » Rule acquisition is data driven - complete coverage of examples l Disadvantages » Training data may not exist and may be difficult or expensive to obtain » Highest performing systems are still hand-crafted
A Combined Approach l Use statistical methods on modules where training data exists and high accuracy can be achieved » Part-of-speech tagging » Name recognition » Coreference l Use knowledge engineering when training data is sparse and human ingenuity is required » Domain Processing
Lexical Processing: Named Entity Recognition l Named Entities are targets of extraction in many domains » Companies » Other organizations » People » Locations » Dates, times, currency l Impossible or impractical to list all possible named entities in a lexicon
The List Fallacy l Comprehensive lexical resources do not necessarily result in improved extraction performance » Some entities are so new they’re not on any lists » Rare senses cause problems - “has been” as a noun » Names often overlap with other names and ordinary words –“Dallas” can be the name of a person –“Dollar” is the name of a town l Solutions » Part-of-speech tagging » Recognition from context
Knowledge Engineering vs. Statistical Models l Knowledge Engineering » SRI, SRA, Isoquest » Performance –1996: F –1998: F l Statistical Models » BBN, NYU (1998) » Performance –1997: F 93 –1998: F Hand-coding reduces the error rate by 50%.
Knowledge Engineering Name Recognition l Identify some names explicitly in lexicon l Identify parts of names with lexical features l Write rules that recognize names » Use capitalization in English » Recognize names based on internal structure –“Mumble Mumble City” Location –“Mumble Mumble GmbH” Company » Exceptions for common “gotchas” –“Yesterday IBM announced…” –“General Electric” is a company, not a general l Many complex rules are the result
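A minimal sketch of what such hand-written rules might look like, assuming a tiny lexicon and two internal-structure rules. The lexicon entries, the list of sentence-initial "gotcha" words, and the function name `recognize_names` are all hypothetical; a real rule set would be far larger.

```python
import re

KNOWN_COMPANIES = {"IBM", "General Electric"}  # hypothetical explicit lexicon
NOT_NAMES = {"Yesterday", "Today", "The"}      # capitalized only by position

CAP = r"[A-Z][A-Za-z&.]*"                      # one capitalized word
SUFFIX_RULES = [
    (re.compile(rf"\b((?:{CAP}\s+)+City)\b"), "Location"),
    (re.compile(rf"\b((?:{CAP}\s+)+GmbH)\b"), "Company"),
]


def recognize_names(sentence):
    """Return a sorted list of (name, type) pairs found in the sentence."""
    found = []
    # Rule 1: names listed explicitly in the lexicon.
    for name in KNOWN_COMPANIES:
        if re.search(rf"\b{re.escape(name)}\b", sentence):
            found.append((name, "Company"))
    # Rule 2: capitalized sequences ending in a telltale word.
    for pattern, label in SUFFIX_RULES:
        for match in pattern.finditer(sentence):
            words = match.group(1).split()
            # Exception: drop leading words that are capitalized only
            # because they start the sentence.
            while words and words[0] in NOT_NAMES:
                words = words[1:]
            if len(words) > 1:
                found.append((" ".join(words), label))
    return sorted(found)
```

Even this toy version needs an exception list, which hints at why complete rule sets become large and intricate.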
Statistical Model Name Recognition Hidden Markov Models Start End Name Not-a-name
Statistical Model Name Recognition l Transitions are probabilistic l Training » Annotate a corpus » Estimate transition probabilities given words (and/or their features) » $P(s_i \mid s_{i-1}, w_i)$ l Application » Compute the maximum-likelihood path through the network for the input text. » The Viterbi algorithm
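The application step can be sketched as a small Viterbi search over two states. Everything numeric here is an assumption: the probability table is set by hand for illustration, whereas a real system would estimate it from an annotated corpus, and the only word feature used is capitalization.

```python
import math

STATES = ("NAME", "OTHER")

# Hypothetical P(state | previous state, word feature); not trained values.
PROBS = {
    ("START", "cap"):   {"NAME": 0.3,  "OTHER": 0.7},
    ("START", "lower"): {"NAME": 0.05, "OTHER": 0.95},
    ("NAME", "cap"):    {"NAME": 0.9,  "OTHER": 0.1},
    ("NAME", "lower"):  {"NAME": 0.1,  "OTHER": 0.9},
    ("OTHER", "cap"):   {"NAME": 0.8,  "OTHER": 0.2},
    ("OTHER", "lower"): {"NAME": 0.02, "OTHER": 0.98},
}


def feature(word):
    """Reduce a word to the single feature this sketch conditions on."""
    return "cap" if word[:1].isupper() else "lower"


def viterbi(words):
    """Return the most probable state sequence for the given words."""
    if not words:
        return []
    # best[s] = (log-probability, path) of the best sequence ending in s
    best = {"START": (0.0, [])}
    for word in words:
        feat = feature(word)
        new_best = {}
        for cur in STATES:
            candidates = [
                (score + math.log(PROBS[(prev, feat)][cur]), path + [cur])
                for prev, (score, path) in best.items()
            ]
            new_best[cur] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]
```

With these made-up numbers, a sentence-initial capitalized word is more likely to be tagged OTHER than NAME, which is the statistical counterpart of the "Yesterday IBM announced" exception in a hand-written rule set.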
Training Data l The amount of data needed is not onerous (diminishing returns at 100,000 words) l Annotation can be done by non-linguist native speakers l Training also works (with some degradation) for upper-case-only and punctuation-free texts.
Interesting Aside l NYU trained a statistical model using as word “features” whether various other name recognition systems tagged that word as part of a name. l Result: Better than human performance! » System achieved F » Experienced humans: F
Parsing in IE Systems l Some IE Systems have attempted full parsing » NYU Pre-1996 Proteus System » SRI Tacitus System l Attempts to adapt to the IE task » Fragment interpretation » Limitation of search l Statistical Parsing? » No real systems exist yet
Problems with Full Parsing l The search space becomes prohibitively large for long sentences. » The system is slow. Rapid development and testing of rules becomes impossible. l “Full Parse” heuristic » It is often possible with a comprehensive grammar to span the sentence with a highly improbable parse when the actual analysis is outside of the grammar, or lost in the search space.
The IE Approach to Parsing l Analyze sentences as simple constituents that can be described with a finite-state grammar » Noun Groups, Verb Groups, Particles » Ignore prepositional attachment » Ignore clause boundaries l Parser consists of one or more finite-state transducers mapping words into simple constituents
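One way to picture such a transducer is a regular expression run over the part-of-speech tags of a sentence. This is a sketch under stated assumptions: the tag names, the two patterns (optional determiner, adjectives, nouns for a noun group; auxiliaries plus a verb for a verb group), and the function name `chunk` are illustrative, not the grammar of any actual system.

```python
import re

# One character per token so a regular expression can scan the tag sequence.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "AUX": "X", "VERB": "V"}

# Group 1: noun group (D? A* N+).  Group 2: verb group (X* V).
GROUP_PATTERN = re.compile(r"(D?A*N+)|(X*V)")


def chunk(tagged):
    """Map (word, tag) pairs to (phrase, label) pairs; label is NG, VG or None."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    result, pos = [], 0
    for m in GROUP_PATTERN.finditer(codes):
        # Tokens between groups (prepositions and so on) are left unattached.
        result.extend((w, None) for w, _ in tagged[pos:m.start()])
        phrase = " ".join(w for w, _ in tagged[m.start():m.end()])
        label = "NG" if m.group(1) is not None else "VG"
        result.append((phrase, label))
        pos = m.end()
    result.extend((w, None) for w, _ in tagged[pos:])
    return result
```

Note that the preposition is passed through without any attachment decision, in line with the "ignore prepositional attachment" point above.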
A Finite-State Fragment Parse [A. C. Nielson Co.]NG [said]VG [George Garrick]NG, 40 years old, [president]NG of [Information Resources, Inc.]NG 's [London-based European Information Services operation]NG [will become]VG [president]NG and [chief operating officer]NG of [Nielson Marketing Research USA]NG [a unit]NG of [Dun & Bradstreet]NG.
Handling Difficult Cases l Relative Clauses » Use nondeterminism to connect single subject to multiple clauses l VP Conjunction » Use nondeterminism to connect single subject to multiple verb phrases l Appositives » Handle only domain-relevant cases l Prepositional Attachment » Handle only domain-relevant cases
An Application Domain l Identify domain-relevant objects l Identify properties of those objects l Identify relationships among domain-relevant objects l Identify relevant events involving domain objects
The Molecular Approach l Standard approach l High precision, low recall approach » Read texts » Identify common, domain relevant patterns signaling properties, events, and relationships » Build rules to cover those » Move to less frequent, less reliable patterns
The Atomic Approach l Aims for high recall, low precision » Determine features of application-relevant entity types » Determine features of application-relevant event and relation types » Every occurrence of a phrase with the relevant feature triggers a candidate event/relation » Merge candidate relations to obtain more fully instantiated event/relation descriptions » Filter using application-specific criteria
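The trigger, merge, and filter steps can be sketched with candidate events represented as partially filled dictionaries. The slot names, the greedy first-fit merging strategy, and the function names are assumptions made for illustration; the slide does not specify how merging is done.

```python
def compatible(a, b):
    """Two partial descriptions can merge if no shared slot disagrees."""
    return all(a[slot] == b[slot] for slot in a.keys() & b.keys())


def merge_candidates(candidates):
    """Greedily fold each candidate into the first compatible description."""
    merged = []
    for cand in candidates:
        for existing in merged:
            if compatible(existing, cand):
                existing.update(cand)
                break
        else:
            merged.append(dict(cand))
    return merged


def filter_events(events, required):
    """Keep only descriptions that fill every application-required slot."""
    return [e for e in events if set(required).issubset(e)]
```

For instance, two candidates triggered in the same text, one carrying only a union and one carrying only a company, merge into a single more fully instantiated description, while a candidate of a conflicting type stays separate and can be dropped by the filter.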
Appropriateness l Appropriate when » Relevant entities have easily determined types » Only one or a small number of relations can hold of an entity with a given type » Relevant events and relations are symmetrical. l Examples » Labor negotiations » MUC-5 Microelectronics l Heavy reliance on merging of partial information (even within sentence)
Is There a Barrier?
Where is the Upper Bound? l Experience suggests that, for a MUC-like task with MUC scoring, it is unrealistic to expect to achieve more than about F 65 on a blind test. (F 70 on training data) l About 75% of human performance.
Reasons for the Limits l There is a long tail of increasingly rare domain-relevant expressions l A barrier of inherently hard linguistic phenomena » Complex coordination » Collective-distributive reference » Multiple interacting phenomena in the same sentence » Hard inferences required l Limits of heuristic tradeoffs are reached
Improve Information Retrieval l Routing task: » Build a quick extraction system for a topic. » IR system picks 2000 texts » Rescore by using extraction system to evaluate the text for relevance » Return the 1000 top texts l Results: 12 improve, 4 same, 5 worse l Best results when training data is sparse l More testing and evaluation needed.
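The rescoring step might look like the sketch below. The slide does not say how the extraction result is combined with the IR score, so the additive formula, the `weight` parameter, and the `extract` callable (standing in for the quick extraction system) are all assumptions.

```python
def rerank(ir_results, extract, keep=1000, weight=1.0):
    """Rescore IR output using an extraction system and return the top ids.

    ir_results: iterable of (doc_id, ir_score, text) tuples from the IR system.
    extract:    callable returning the list of items extracted from a text.
    """
    rescored = []
    for doc_id, ir_score, text in ir_results:
        hits = len(extract(text))
        # Assumed combination: IR score plus a bonus per extracted item.
        rescored.append((ir_score + weight * hits, doc_id))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in rescored[:keep]]
```

In the setting described above, `ir_results` would hold the 2000 texts picked by the IR system and `keep` would stay at 1000.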
Topic Oriented Summarization l Extract information of interest l Generate NL summary of extracted data l Generation can be in a different language, enabling cross-language access to key information.
Process Many Documents Quickly l Exploit redundancy in corpora to get higher recall from merging of multiple descriptions of the same event. » Analyze data from multiple news feeds l Annotating text for training language models » Need to identify names in speech (broadcast news) » Train class bigram on 100 million words of training data. » Because automatic name annotation is almost as good as human annotation, automatic annotation of training data is feasible.
Make Limits More Quickly Attainable l Automatic learning of rules from examples l Application of "open domain" extraction systems » Build general rules for a very broad domain, like "business and economic news" » Quickly customize rules from library for a specific application » Used a prototype to generate extraction systems for routing queries in a half-day.