Information Extraction

Extract meaningful information from text, without fully understanding everything!

Basic idea:
- Define domain-specific templates
- Simple and reliable linguistic processing
- Recognize known types of entities and relations
- Fill templates with the recognized information
Example

"4 Apr., Dallas - Early last evening, a tornado swept through northwest Dallas. The twister occurred without warning at about 7:15 pm and destroyed two mobile homes. The Texaco station at 102 Main St. was also severely damaged, but no injuries were reported."

Extracted template:
  Event:    tornado
  Date:     4/3/97
  Time:     19:15
  Location: "northwest Dallas" : Texas : USA
  Damage:   "mobile homes" (2), "Texaco station" (1)
  Injuries: none
Pipeline Overview

Tokenization & Tagging -> Sentence Analysis -> Pattern Extraction -> Merging -> Template Generation

Raw text:
  4 Apr., Dallas - Early last evening, a tornado swept through northwest ...
Tokenization & Tagging:
  Early/ADV last/ADJ evening/NN:time ,/, a/DT tornado/NN:weather swept/VBD ...
Sentence Analysis:
  Early last evening: adv-phrase:time; a tornado: noun-group:subject; swept: verb-group; ...
Pattern Extraction:
  tornado swept -> Event: tornado; through northwest Dallas -> Loc: "northwest Dallas"; causing extensive damage -> Damage
Template Generation:
  Event: tornado; Date: 4/3/97; Time: 19:15; Location: "northwest Dallas" : Texas : USA; ...
MUC: Message Understanding Conference

A "competitive" conference with predefined tasks for research groups to address.

Tasks (MUC-7):
- Named Entities: extract typed entities from text
- Equivalence Classes: solve coreference
- Attributes: fill in attributes of entities
- Facts: extract logical relations between entities
- Events: extract descriptions of events from text
Tokenization & Tagging
- Tokenization & POS tagging
- Also lexical semantic information, such as "time", "location", "weather", "person", etc.

Sentence Analysis
- Shallow parsing for phrase types
- Use tagging & semantics to tag phrases
- Note phrase heads
Pattern Extraction

Find domain-specific relations between text units, typically using lexical triggers and relation-specific patterns to recognize relations (a sketch follows below):

  Concept:     Damaged-Object
  Trigger:     destroyed
  Position:    direct-object
  Constraints: physical-thing

  "... and [destroyed] [two mobile homes]"
  => Damaged-Object = "two mobile homes"
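A minimal, runnable sketch of this trigger/position/constraints matching, assuming sentences have already been shallow-parsed into a role -> (text, semantic class) mapping. The dict encoding and the PATTERNS table are illustrative, not from any specific MUC system.

```python
# One extraction pattern, mirroring the slide's fields.
PATTERNS = [
    {"concept": "Damaged-Object",
     "trigger": "destroyed",             # lexical trigger (the verb)
     "position": "direct-object",        # syntactic role of the filler
     "constraints": {"physical-thing"}}, # allowed semantic classes
]

def extract(clause):
    """clause: {'verb': str, 'direct-object': (text, semclass), ...}"""
    found = []
    for p in PATTERNS:
        if clause.get("verb") == p["trigger"]:
            filler = clause.get(p["position"])
            if filler and filler[1] in p["constraints"]:
                found.append((p["concept"], filler[0]))
    return found

clause = {"verb": "destroyed",
          "direct-object": ("two mobile homes", "physical-thing")}
print(extract(clause))  # [('Damaged-Object', 'two mobile homes')]
```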
Learning Extraction Patterns

It is very difficult to predefine extraction patterns, and the work must be redone for each new domain. Hence, corpus-based approaches are indicated.

Some methods:
- AutoSlog (1992): "syntactic" learning
- PALKA (1995): "conceptual" learning
- CRYSTAL (1995): covering algorithm
AutoSlog (Lehnert 1992)

Patterns based on recognizing "concepts":
- Concept: what concept to recognize
- Trigger: a word indicating an occurrence
- Position: what syntactic role the concept will take in the sentence
- Constraints: what type of entity to allow
- Enabling conditions: constraints on the linguistic context
Example:
  Concept:             Event-Time
  Trigger:             "at"
  Position:            prep-phrase-object
  Constraints:         time
  Enabling conditions: post-verb

  "The twister occurred without warning [at] [about 7:15 pm] and destroyed two mobile homes."
  => Event-Time = 19:15
Learning Patterns

- Supervised: training data is text annotated with the patterns to be extracted from it
- Knowledge: 13 general syntactic patterns

Algorithm:
1. Find a sentence containing the target noun phrase ("two mobile homes")
2. Partially parse the sentence to find syntactic relations
3. Try all linguistic patterns to find a match
4. Generate a concept pattern from the match
Linguistic Patterns

Identify domain-specific thematic roles based on syntactic structure, e.g., active-voice-verb followed by target as direct object (a sketch follows below):
- Concept = target concept
- Trigger = verb of active-voice-verb
- Position = direct-object
- Constraints = semantic class of target
- Enabling conditions = active-voice
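A hedged sketch of how this one linguistic pattern could generate a concept pattern from a matched clause. The clause representation, semantic-class argument, and function name are assumptions for illustration, not AutoSlog's actual data structures.

```python
# One of the 13 linguistic patterns: active-voice verb whose direct
# object is the target noun phrase.

def generate_concept(clause, target_np, target_class, concept_name):
    """clause: {'verb': str, 'voice': 'active'|'passive', 'direct-object': str}"""
    if clause["voice"] == "active" and clause["direct-object"] == target_np:
        return {"concept": concept_name,
                "trigger": clause["verb"],       # e.g. "destroyed"
                "position": "direct-object",
                "constraints": {target_class},   # e.g. "physical-thing"
                "enabling": "active-voice"}
    return None  # this linguistic pattern did not match; try the next one

clause = {"verb": "destroyed", "voice": "active",
          "direct-object": "two mobile homes"}
print(generate_concept(clause, "two mobile homes", "physical-thing",
                       "Damaged-Object"))
```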
More Examples

- <victim> was murdered
- <perpetrator> bombed
- <perpetrator> attempted to kill
- was aimed at <target>

Some bad extraction patterns occur (e.g., "is" as a trigger), so a human review process is needed.
CRYSTAL

Handles complex syntactic patterns using a "covering" algorithm (a toy sketch follows below):
- Generate the most specific possible pattern for each occurrence of a target in the corpus
- Loop:
  - Find the most specific unifier of the most similar patterns C & C', generating a new pattern P
  - If P has less than ε error on the corpus, replace C and C' with P
- Continue until no new patterns can be added
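A toy, runnable caricature of the covering loop. Real CRYSTAL patterns are rich syntactic structures and the algorithm unifies the most similar pair first; here patterns are flat slot -> value dicts, unification keeps only the shared slots, and any pair whose generalization stays under ε error is merged.

```python
def matches(pattern, instance):
    return all(instance.get(k) == v for k, v in pattern.items())

def error_rate(pattern, corpus):
    # fraction of instances covered by the pattern that are negatives
    hits = [label for inst, label in corpus if matches(pattern, inst)]
    return 0.0 if not hits else hits.count(False) / len(hits)

def unify(p1, p2):
    # most specific generalization: keep the slots the patterns agree on
    return {k: v for k, v in p1.items() if p2.get(k) == v}

def crystal(corpus, epsilon=0.1):
    # start with the most specific pattern for every positive instance
    patterns = [dict(inst) for inst, label in corpus if label]
    merged = True
    while merged:
        merged = False
        for i in range(len(patterns)):
            for j in range(i + 1, len(patterns)):
                p = unify(patterns[i], patterns[j])
                if p and error_rate(p, corpus) < epsilon:
                    patterns = [q for k, q in enumerate(patterns)
                                if k not in (i, j)] + [p]
                    merged = True
                    break
            if merged:
                break
    return patterns

corpus = [({"verb": "destroyed", "obj-class": "physical-thing"}, True),
          ({"verb": "damaged",   "obj-class": "physical-thing"}, True),
          ({"verb": "reported",  "obj-class": "abstract"},       False)]
print(crystal(corpus))  # -> [{'obj-class': 'physical-thing'}]
```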
Merging

"Motor Vehicles International Corp. announced a major management shake-up ... MVI said the CEO has resigned ... The Big 10 auto maker is attempting to regain market share ... It will announce losses ... A company spokesman said they are moving their operations ... MVI, the first company to announce such a move since the passage of the new international trade agreement, is facing increasing demands from unionized workers ..."
Coreference Resolution

Many different kinds of linguistic phenomena are involved:
- Proper names
- Aliases (MVI)
- Definite NPs (the Big 10 auto maker)
- Pronouns (it, they)
- Appositives (", the first company to ...")

Errors from previous phases may be amplified here.
Learning to Merge

Treat coreference as a classification task: should this pair of entities be linked?

Methodology (a sketch of the pairing step follows below):
- Training corpus: manually link all coreferential expressions
- Each possible pair is a training example: positive if the pair is linked, negative if not
- Create a feature vector for each example
- Use your favorite learning algorithm
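A small sketch of the pairing step, assuming mentions come annotated with gold coreference chains. The (a, b) pair stands in for a real feature vector; the kinds of features actually used are listed on the next slide.

```python
from itertools import combinations

def pair_examples(mentions, chains):
    """mentions: ordered mention ids; chains: list of sets of coreferent ids."""
    chain_of = {m: i for i, chain in enumerate(chains) for m in chain}
    examples = []
    for a, b in combinations(mentions, 2):
        ca, cb = chain_of.get(a), chain_of.get(b)
        label = ca is not None and ca == cb   # positive iff same chain
        examples.append(((a, b), label))      # replace (a, b) with features
    return examples

mentions = ["MVI", "the Big 10 auto maker", "it", "a spokesman"]
chains = [{"MVI", "the Big 10 auto maker", "it"}]
for pair, label in pair_examples(mentions, chains):
    print(pair, label)
```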
MLR (1995)

66 features were used, in 4 categories:
- Lexical features of each phrase (e.g., do they overlap?)
- Grammatical role of each phrase (e.g., subject, direct-object)
- Semantic classes of each phrase (e.g., physical-thing, company)
- Relative positions of the phrases (e.g., X one sentence after Y)

Learning algorithm: decision trees (C4.5)
C4.5

- Incrementally build a decision tree from labeled training examples
- At each stage, choose the "best" attribute to split the dataset on, e.g., using info-gain to compare features (a sketch follows after this list)
- After building the complete tree, prune the leaves to prevent overfitting
- Use statistical tests to determine whether enough examples fall in each leaf bin; if not, prune!
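A minimal sketch of the info-gain criterion used to pick the split attribute (C4.5 proper uses gain ratio, a normalized variant); the toy dataset is illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """examples: list of (features_dict, label) pairs."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

data = [({"same-sentence": True},  "coref"),
        ({"same-sentence": True},  "coref"),
        ({"same-sentence": False}, "not"),
        ({"same-sentence": False}, "coref")]
print(info_gain(data, "same-sentence"))  # ~0.31 bits
```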
[Figure: example C4.5 decision tree. The root test f1 splits 40 training examples into 25 and 15; further tests f2 and f3 split these into leaves labeled C1, C2, C2, C1.]
RESOLVE (1995)

C4.5 with 8 complex features:
- NAME-{1,2}: does the reference include a name?
- JV-CHILD-{1,2}: does the reference refer to part of a joint venture?
- ALIAS: does one reference contain an alias for the other?
- BOTH-JV-CHILD: do both refer to part of a joint venture?
- COMMON-NP: do both contain a common NP?
- SAME-SENTENCE: are both in the same sentence?
[Figure: decision tree learned by RESOLVE]
RESOLVE Results

50 texts, leave-one-out cross-validation:

  System     Recall   Precision
  Unpruned   85.4%    87.6%
  Pruned     80.1%    92.4%
  Manual     67.7%    94.4%
Full System: FASTUS (1996)

Input Text -> Pattern Recognition -> Coreference Resolution -> Partial Templates -> Template Merger -> Output Template
Pattern Recognition

Multiple passes of finite-state methods (a toy sketch follows below):

  "John Smith, 47, was named president of ABC Corp."
  Word level:   Pers-Name  Num  Aux  V  N  P  Org-Name
  Phrase level: Poss-N-Group    V-Group
  Event level:  Domain-Event
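A toy sketch of cascaded finite-state passes, encoded here as regular expressions over a string of token tags. The tag and group names loosely follow the slide; FASTUS's actual transducers are far richer than this.

```python
import re

# One tag per token of: "John Smith, 47, was named president of ABC Corp."
tags = "Pers-Name Num Aux V N P Org-Name"

# Pass 1: group basic phrases.
tags = re.sub(r"Pers-Name Num", "N-Group", tags)  # "John Smith, 47"
tags = re.sub(r"Aux V", "V-Group", tags)          # "was named"

# Pass 2: recognize a domain event over the grouped phrases.
if re.fullmatch(r"N-Group V-Group N P Org-Name", tags):
    print("Domain-Event: person named to position at organization")
```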
Partially-Instantiated Templates

  Person: _______        Person: John Smith
  Pos:    President      Start:  _______
  Org:    ABC Corp.      End:    _______

Domain-dependent!!
The Next Sentence...

"He replaces Mike Jones."

Coreference analysis: He = John Smith

  Person: Mike Jones     Person: John Smith
  Pos:    _______        Start:  _______
  Org:    _______        End:    _______
Unification

Unify the new template with preceding template(s), if possible (a sketch follows below):

  Person: Mike Jones     Person: John Smith
  Pos:    President      Start:  _______
  Org:    ABC Corp.      End:    _______
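A minimal sketch of slot-level template unification, under the convention (an assumption here) that None marks an empty slot: unification fills empty slots and fails on conflicting fillers.

```python
def unify_templates(t1, t2):
    merged = dict(t1)
    for slot, value in t2.items():
        if value is None:
            continue                    # nothing to add for this slot
        if merged.get(slot) in (None, value):
            merged[slot] = value        # fill an empty slot, or agree
        else:
            return None                 # conflicting fillers: cannot unify
    return merged

prev = {"Person": None, "Pos": "President", "Org": "ABC Corp."}
new  = {"Person": "Mike Jones", "Pos": None, "Org": None}
print(unify_templates(prev, new))
# {'Person': 'Mike Jones', 'Pos': 'President', 'Org': 'ABC Corp.'}
```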
Principle of Least Commitment

Idea: maintain options as long as possible. E.g., in parsing, maintain a lattice structure:

  The/DT  committee/NN1  heads/NN2|VBZ  announced/VBD  that/CSub ...
  N-GRP: "The committee heads"
  Event: Announce, Actor: committee heads
Principle of Least Commitment (cont.)

The same prefix, with a different continuation (a lattice sketch follows below):

  The/DT  committee/NN1  heads/NN2|VBZ  ABC's/NNpos  recruitment/NN  effort/NN
  N-GRP: "The committee"    N-GRP: "ABC's recruitment effort"
  Event -> Head: committee, Effort: ABC's recruitment
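A small sketch of a least-commitment token lattice: the ambiguous token keeps both tags, and a later pass (here a crude lookahead, standing in for the domain-pattern phase where real disambiguation happens) commits to one reading.

```python
# Each token carries the set of tags still considered possible.
lattice = [("The", {"DT"}), ("committee", {"NN"}),
           ("heads", {"NN", "VBZ"}),              # noun or verb: keep both
           ("announced", {"VBD"}), ("that", {"CSub"})]

def disambiguate(lattice):
    out = []
    for i, (word, tags) in enumerate(lattice):
        nxt = lattice[i + 1][1] if i + 1 < len(lattice) else set()
        if tags == {"NN", "VBZ"}:
            # if a finite verb follows, "heads" closes the noun group;
            # otherwise "heads" itself must be the verb
            tags = {"NN"} if nxt & {"VBD", "VBZ"} else {"VBZ"}
        out.append((word, tags))
    return out

print(disambiguate(lattice))  # 'heads' resolves to {'NN'} here
```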
More Least Commitment

Maintain multiple coreference hypotheses:
- Disambiguate only when creating domain-events, when more information is available
- Too many possibilities? Use a beam search algorithm: maintain the k "best" hypotheses at every stage (a sketch follows below)
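A sketch of beam search over coreference hypotheses, where a hypothesis is a list of (mention, antecedent) links with a running score; the scoring function below is a toy stub standing in for a learned model.

```python
def beam_search(mentions, score, k=3):
    beam = [([], 0.0)]                          # (links so far, total score)
    for m in mentions:
        extended = []
        for links, s in beam:
            # candidate antecedents: all earlier mentions, or None (new entity)
            antecedents = [prev for prev, _ in links] + [None]
            for ant in antecedents:
                extended.append((links + [(m, ant)], s + score(m, ant)))
        # keep only the k best hypotheses at every stage
        beam = sorted(extended, key=lambda h: h[1], reverse=True)[:k]
    return beam[0][0]                           # links of the best hypothesis

def score(mention, antecedent):                 # toy stub, arbitrary weights
    return 0.8 if antecedent is not None else 0.5

print(beam_search(["MVI", "the Big 10 auto maker", "it"], score))
```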