Information Extraction


1 Information Extraction
Extract meaningful information from text, without fully understanding everything!
Basic idea:
- Define domain-specific templates
- Simple and reliable linguistic processing
- Recognize known types of entities and relations
- Fill templates with recognized information
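A domain template can be pictured as a fixed set of slots that the extractor fills. A minimal sketch in Python (the class and slot names are illustrative only, loosely following the tornado example on the next slide):

```python
# A minimal sketch of a domain-specific template: a fixed set of named
# slots, each filled (or left empty) by the extraction pipeline.
# Slot names follow the tornado example below; they are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WeatherEventTemplate:
    event: Optional[str] = None        # e.g. "tornado"
    date: Optional[str] = None         # e.g. "4/3/97"
    time: Optional[str] = None         # e.g. "19:15"
    location: Optional[str] = None     # e.g. "northwest Dallas"
    damage: List[str] = field(default_factory=list)
    injuries: Optional[str] = None

# The extractor fills the slots it can recognize and leaves the rest empty.
template = WeatherEventTemplate(event="tornado", location="northwest Dallas")
print(template)
```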

2 Example
4 Apr. Dallas - Early last evening, a tornado swept through northwest Dallas. The twister occurred without warning at about 7:15 pm and destroyed two mobile homes. The Texaco station at 102 Main St. was also severely damaged, but no injuries were reported.
Extracted template:
Event: tornado
Date: 4/3/97
Time: 19:15
Location: “northwest Dallas” : Texas : USA
Damage: “mobile homes” (2), “Texaco station” (1)
Injuries: none

3 Pipeline: Tokenization & Tagging → Sentence Analysis → Pattern Extraction → Merging → Template Generation
Stage by stage on the running example (diagram):
Input: 4 Apr. Dallas – Early last evening, a tornado swept through northwest ...
Tokenization & Tagging: Early/ADV last/ADJ evening/NN:time ,/, a/DT tornado/NN:weather swept/VBD ...
Sentence Analysis: Early last evening → adv-phrase:time; a tornado → noun-group:subject; swept → verb-group ...
Pattern Extraction: tornado swept → Event: tornado; through northwest Dallas → Loc: “northwest Dallas”; causing extensive damage → Damage
Merging & Template Generation: Event: tornado; Date: 4/3/97; Time: 19:15; Location: “northwest Dallas” : Texas : USA; ...

4 MUC: Message Understanding Conference
“Competitive” conference with predefined tasks for research groups to address.
Tasks (MUC-7):
- Named Entities: extract typed entities from text
- Equivalence Classes: solve coreference
- Attributes: fill in attributes of entities
- Facts: extract logical relations between entities
- Events: extract descriptions of events from text

5 Tokenization & Tagging
Tokenization & POS tagging, plus lexical semantic information such as “time”, “location”, “weather”, “person”, etc.
Sentence Analysis:
- Shallow parsing for phrase types
- Use tagging & semantics to tag phrases
- Note phrase heads
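A rough sketch of these two stages in Python using NLTK (the slides do not name a toolkit, so NLTK is purely an assumption; the regex chunk grammar is illustrative):

```python
# Sketch of tokenization, POS tagging, and shallow parsing with NLTK.
# May require: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

sentence = "Early last evening, a tornado swept through northwest Dallas."
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # POS tagging

# Shallow parsing: chunk noun groups with a simple regex grammar and
# note the phrase head (here, taken as the last noun in the chunk).
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    words = [w for w, _ in subtree.leaves()]
    print("noun group:", words, "head:", words[-1])
```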

6 Pattern Extraction
Find domain-specific relations between text units.
Typically use lexical triggers and relation-specific patterns to recognize relations:
Concept: Damaged-Object
Trigger: destroyed
Position: direct-object
Constraints: physical-thing
... and [ destroyed ] [ two mobile homes ] → Damaged-Object = “two mobile homes”
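A minimal sketch of how such a trigger/position/constraints pattern might be applied to an already-analyzed sentence (the phrase representation and the apply_pattern helper are illustrative, not the system's actual code):

```python
# Sketch of a lexically triggered extraction pattern, assuming the sentence
# has already been reduced to labelled phrases by the earlier stages.
pattern = {
    "concept": "Damaged-Object",
    "trigger": "destroyed",          # lexical trigger (verb)
    "position": "direct-object",     # where the filler is found
    "constraints": {"physical-thing"},
}

phrases = [
    {"text": "a tornado",        "role": "subject",       "sem": "weather"},
    {"text": "destroyed",        "role": "verb",          "sem": None},
    {"text": "two mobile homes", "role": "direct-object", "sem": "physical-thing"},
]

def apply_pattern(pattern, phrases):
    """Fire the pattern if its trigger occurs and the constrained slot matches."""
    triggered = any(p["role"] == "verb" and p["text"] == pattern["trigger"]
                    for p in phrases)
    if not triggered:
        return None
    for p in phrases:
        if p["role"] == pattern["position"] and p["sem"] in pattern["constraints"]:
            return {pattern["concept"]: p["text"]}
    return None

print(apply_pattern(pattern, phrases))   # {'Damaged-Object': 'two mobile homes'}
```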

7 Learning Extraction Patterns
- Very difficult to predefine extraction patterns
- Must be redone for each new domain
- Hence, corpus-based approaches are indicated
Some methods:
- AutoSlog (1992) – “syntactic” learning
- PALKA (1995) – “conceptual” learning
- CRYSTAL (1995) – covering algorithm

8 AutoSlog (Lehnert 1992)
Patterns based on recognizing “concepts”:
- Concept: what concept to recognize
- Trigger: a word indicating an occurrence
- Position: what syntactic role the concept filler will take in the sentence
- Constraints: what type of entity to allow
- Enabling conditions: constraints on the linguistic context
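One possible way to represent such a concept node as a small data structure (a sketch only; the field values follow the Event-Time example on the next slide):

```python
# Illustrative representation of an AutoSlog-style concept node with the
# five fields listed above. Field names and values are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConceptNode:
    concept: str                  # what concept to recognize
    trigger: str                  # word indicating an occurrence
    position: str                 # syntactic role of the filler
    constraints: str              # required semantic type of the filler
    enabling_conditions: Optional[str] = None   # constraints on the context

event_time = ConceptNode(
    concept="Event-Time",
    trigger="at",
    position="prep-phrase-object",
    constraints="time",
    enabling_conditions="post-verb",
)
print(event_time)
```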

9 Example Concept Pattern
Concept: Event-Time
Trigger: “at”
Position: prep-phrase-object
Constraints: time
Enabling conditions: post-verb

The twister occurred without warning at about 7:15 pm and destroyed two mobile homes. → Event-Time = 19:15

10 Learning Patterns
Supervised: training data is text with the patterns to be extracted from it.
Knowledge: 13 general syntactic patterns.
Algorithm:
1. Find a sentence with the target noun phrase, e.g. “two mobile homes”
2. Partially parse the sentence to find syntactic relations
3. Try all linguistic patterns to find a match
4. Generate a concept pattern from the match
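A sketch of the matching step, showing only two of the thirteen general syntactic patterns; the parse representation and the learn_pattern helper are illustrative assumptions, not AutoSlog's actual code:

```python
# Sketch of the AutoSlog learning step: given a target noun phrase and a
# shallow parse of its sentence, try general syntactic patterns and emit a
# concept pattern from the first one that matches.
def learn_pattern(target, target_class, parse):
    # parse: {"subject": ..., "verb": ..., "voice": ..., "direct_object": ...}
    if parse["voice"] == "passive" and parse["subject"] == target:
        # matches the "<x> was <verb>" pattern, e.g. "victim was murdered"
        return {"trigger": parse["verb"], "position": "subject",
                "constraints": target_class, "enabling": "passive-voice"}
    if parse["voice"] == "active" and parse["direct_object"] == target:
        # matches the "<verb> <x>" pattern, e.g. "destroyed two mobile homes"
        return {"trigger": parse["verb"], "position": "direct-object",
                "constraints": target_class, "enabling": "active-voice"}
    return None

parse = {"subject": "a tornado", "verb": "destroyed",
         "voice": "active", "direct_object": "two mobile homes"}
print(learn_pattern("two mobile homes", "physical-thing", parse))
```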

11 Linguistic Patterns
Identify domain-specific thematic roles based on syntactic structure.
Example: active-voice-verb followed by target = direct object
- Concept = target concept
- Trigger = verb of the active-voice-verb
- Position = direct-object
- Constraints = semantic class of target
- Enabling conditions = active-voice

12 More Examples
- <victim> was murdered
- <perpetrator> bombed
- <perpetrator> attempted to kill
- was aimed at <target>
Some bad extraction patterns occur (e.g., “is” as a trigger), hence a human review process.

13 CRYSTAL
Complex syntactic patterns. Uses a “covering” algorithm:
1. Generate the most specific possible patterns for all occurrences of targets in the corpus
2. Loop: find the most specific unifier of the most similar patterns C and C′, generating a new pattern P
3. If P has less than ε error on the corpus, replace C and C′ with P
4. Continue until no new patterns can be added
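A sketch of this covering loop, with the similarity, unification, and error functions left as placeholders (real CRYSTAL unifies syntactic and semantic constraints, which is not reproduced here):

```python
# Sketch of a CRYSTAL-style covering loop over extraction patterns.
def crystal(seed_patterns, corpus, epsilon, similarity, unify, error):
    patterns = list(seed_patterns)   # one maximally specific pattern per target occurrence
    changed = True
    while changed:
        changed = False
        # Consider candidate pairs from most to least similar.
        pairs = sorted(((similarity(c1, c2), i, j)
                        for i, c1 in enumerate(patterns)
                        for j, c2 in enumerate(patterns) if i < j),
                       reverse=True)
        for _, i, j in pairs:
            p = unify(patterns[i], patterns[j])   # most specific common generalization
            if p is not None and error(p, corpus) < epsilon:
                # Replace the pair with its generalization and start over.
                patterns = [c for k, c in enumerate(patterns) if k not in (i, j)]
                patterns.append(p)
                changed = True
                break
    return patterns
```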

14 Merging
Motor Vehicles International Corp. announced a major management shake-up ...
MVI said the CEO has resigned ...
The Big 10 auto maker is attempting to regain market share ...
It will announce losses ...
A company spokesman said they are moving their operations ...
MVI, the first company to announce such a move since the passage of the new international trade agreement, is facing increasing demands from unionized workers ...

15 Coreference Resolution
Many different kinds of linguistic phenomena:
- Proper names
- Aliases (MVI)
- Definite NPs (the Big 10 auto maker)
- Pronouns (it, they)
- Appositives (, the first company to ...)
Errors of previous phases may be amplified.

16 Learning to Merge
Treat coreference as a classification task: should this pair of entities be linked?
Methodology:
- Training corpus: manually link all coreferential expressions
- Each possible pair is a training example: positive if the two are linked, negative if not
- Create a feature vector for each example
- Use your favorite learning algorithm
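A sketch of how the pairwise training examples could be generated (mention ids, gold chains, and the featurize callback are illustrative assumptions; the feature set on the next slide used 66 features in 4 categories):

```python
# Sketch: turn manually linked mentions into pairwise training data.
from itertools import combinations

def make_training_pairs(mentions, gold_chains, featurize):
    """mentions: mention ids in document order.
    gold_chains: list of sets of mention ids that corefer.
    featurize(m1, m2): returns a feature vector for the pair."""
    linked = {frozenset(pair) for chain in gold_chains
              for pair in combinations(chain, 2)}
    examples = []
    for m1, m2 in combinations(mentions, 2):
        label = 1 if frozenset((m1, m2)) in linked else 0
        examples.append((featurize(m1, m2), label))
    return examples
```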

17 MLR (1995)
66 features were used, in 4 categories:
- Lexical features of each phrase (e.g., do they overlap?)
- Grammatical role of each phrase (e.g., subject, direct-object)
- Semantic classes of each phrase (e.g., physical-thing, company)
- Relative positions of the phrases (e.g., X one sentence after Y)
Learner: decision-tree learning (C4.5)

18 C4.5
Incrementally build a decision tree from labeled training examples:
- At each stage choose the “best” attribute to split the dataset (e.g., use info-gain to compare features)
- After building the complete tree, prune the leaves to prevent overfitting
- Use statistical tests to determine whether enough examples are in the leaf bins; if not – prune!
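A sketch of the information-gain criterion used to pick the splitting attribute (C4.5 proper uses the gain-ratio variant; the tiny dataset below is invented purely for illustration):

```python
# Sketch of information gain for choosing the "best" split attribute.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attribute):
    """examples: list of (feature_dict, label) pairs."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

# Invented toy data: two binary features, binary coreference label.
data = [({"alias": 1, "same_sentence": 0}, 1),
        ({"alias": 0, "same_sentence": 0}, 0),
        ({"alias": 1, "same_sentence": 1}, 1),
        ({"alias": 0, "same_sentence": 1}, 0)]
best = max(["alias", "same_sentence"], key=lambda a: info_gain(data, a))
print(best)   # "alias" separates the labels perfectly here
```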

19 C4.5 Example Tree
Diagram: the root tests f1 on 40 training examples, splitting them into 25 and 15; these are tested further on f2 and f3, with leaves labeled C1, C2, C2, C1.

20 RESOLVE (1995)
C4.5 with 8 complex features:
- NAME-{1,2}: does the reference include a name?
- JV-CHILD-{1,2}: does the reference refer to part of a joint venture?
- ALIAS: does one reference contain an alias of the other?
- BOTH-JV-CHILD: do both refer to part of a joint venture?
- COMMON-NP: do both contain a common NP?
- SAME-SENTENCE: are both in the same sentence?

21 Decision Tree

22 RESOLVE Results
50 texts, leave-one-out cross-validation:
System     Recall   Precision
Unpruned   85.4%    87.6%
Pruned     80.1%    92.4%
Manual     67.7%    94.4%

23 Full System: FASTUS (1996)
Input Text → Pattern Recognition → Coreference Resolution → Partial Templates → Template Merger → Output Template

24 Pattern Recognition
Multiple passes of finite-state methods:
John Smith, 47, was named president of ABC Corp.
- Token level: Pers-Name, Num, Aux, V, N, P, Org-Name
- Phrase level: Poss-N-Group, V-Group
- Domain level: Domain-Event

25 Partially-Instantiated Templates
Template 1: Person = _______, Pos = President, Org = ABC Corp.
Template 2: Person = John Smith, Start = _______, End = _______
Domain-dependent!!

26 The Next Sentence ...
He replaces Mike Jones.
Coreference analysis: He = John Smith
New partial templates:
Template 1: Person = Mike Jones, Pos = ________, Org = ________
Template 2: Person = John Smith, Start = ________, End = ________

27 Unification
Unify the new templates with the preceding template(s), if possible ...
Template 1: Person = Mike Jones, Pos = President, Org = ABC Corp.
Template 2: Person = John Smith, Start = ________, End = ________
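A sketch of slot-level unification: two partial templates combine only if their filled slots do not conflict (the dictionaries and slot names are illustrative, echoing the example above):

```python
# Sketch: unify a new partial template with a preceding one.
def unify_templates(t1, t2):
    merged = dict(t1)
    for slot, value in t2.items():
        if value is None:
            continue
        if merged.get(slot) not in (None, value):
            return None          # conflicting fillers: the templates do not unify
        merged[slot] = value
    return merged

prev = {"person": None, "position": "President", "org": "ABC Corp."}
new = {"person": "Mike Jones", "position": None, "org": None}
print(unify_templates(prev, new))
# {'person': 'Mike Jones', 'position': 'President', 'org': 'ABC Corp.'}
```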

28 Principle of Least Commitment
Idea: maintain options as long as possible.
E.g., parsing – maintain a lattice structure:
The committee heads announced that ...
Lattice tags: The/DT committee/NN1 heads/NN2|VBZ announced/VBD that/CSub
Reading chosen here: N-GRP “The committee heads” → Event: Announce, Actor: Committee heads

29 Principle of Least Commitment
Same idea, same lattice start, different continuation:
The committee heads ABC’s recruitment effort.
Lattice tags: The/DT committee/NN1 heads/NN2|VBZ ABC’s/NNpos recruitment/NN ...
Reading chosen here: two N-GRPs (“The committee”, “ABC’s recruitment effort”) → Event – Head: Committee, Effort: ABC’s recruitment

30 More Least Commitment
Maintain multiple coreference hypotheses:
- Disambiguate when creating domain events, when more information is available
- Too many possibilities? Use a beam search algorithm: maintain the k ‘best’ hypotheses at every stage
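A sketch of the beam-search idea: after each mention, extend every surviving hypothesis with each compatible antecedent choice, score, and keep only the k best (extend and score are placeholders, not the system's actual code):

```python
# Sketch of beam search over coreference hypotheses.
def beam_search(mentions, extend, score, k):
    """extend(hypothesis, mention) -> list of extended hypotheses.
    score(hypothesis) -> number, higher is better."""
    beam = [()]                            # start with one empty hypothesis
    for mention in mentions:
        candidates = [h2 for h in beam for h2 in extend(h, mention)]
        candidates.sort(key=score, reverse=True)
        beam = candidates[:k]              # keep only the k best hypotheses
    return beam
```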

