Information Extraction
Shallow Processing Techniques for NLP (Ling570)
December 5, 2011
Roadmap
- Information extraction: definition & motivation
- General approach
- Relation extraction techniques: fully supervised; lightly supervised
Information Extraction
Motivation: unstructured data is hard to manage, assess, and analyze, while many tools exist for manipulating structured data:
- Databases: efficient search, aggregation
- Statistical analysis packages
- Decision support systems
Goal: given unstructured information in texts, convert it to structured data, e.g. to populate databases.
IE Example
Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.
Lead airline: United Airlines
Amount: $6
Effective date: 2006-10-26
Follower: American Airlines
Other Applications
Shared tasks: Message Understanding Conferences (MUC)
- Joint ventures/mergers: CompanyA, CompanyB, CompanyC, amount, type
- Terrorist incidents: incident type, actor, victim, date, location
General:
- PriceGrabber
- Stock market analysis
- Apartment finders
Information Extraction Components
Incorporates a range of techniques and tools:
- FSTs (pattern extraction)
- Supervised learning: classification, sequence labeling
- Semi- and unsupervised learning: clustering, bootstrapping
- Part-of-speech tagging, named entity recognition, chunking, parsing
Information Extraction Process
Typically a pipeline of major components.
First step: Named Entity Recognition (NER)
- Often application-specific
- Generic types: Person, Organization, Location
- Specific types: from genes to college courses
- Identifies NE mentions: specific instances
Example: Pre-NER
Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.
Example: Post-NER
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
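The bracketed notation above can be read mechanically. A minimal sketch (not part of the course materials) that parses this [TAG mention] format into (label, mention) pairs:

```python
import re

# Matches "[TAG text]" spans; upper-cases the tag so "org" and
# "ORG" normalize to the same label.
TAGGED = re.compile(r"\[\s*(\w+)\s+([^\]]+)\]")

def extract_mentions(text):
    """Return (label, mention) pairs from bracket-annotated text."""
    return [(m.group(1).upper(), m.group(2).strip())
            for m in TAGGED.finditer(text)]

sample = ("Citing high fuel prices, [ORG United Airlines] said "
          "[TIME Friday] it has increased fares by [MONEY $6] per round trip.")
print(extract_mentions(sample))
# [('ORG', 'United Airlines'), ('TIME', 'Friday'), ('MONEY', '$6')]
```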
Information Extraction: Reference Resolution
Builds on NER:
- Clusters entity mentions referring to the same thing, creating entity equivalence classes
- Includes anaphora resolution, e.g. pronoun resolution (it, he, she, ...; one; this, that)
- Also handles non-anaphoric, general mentions
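The equivalence classes can be built with union-find once pairwise coreference links are available. A minimal sketch, assuming the pairwise links are already given (the mention strings are illustrative):

```python
# Build entity equivalence classes from coreferent mention pairs
# using union-find with path compression.
def entity_classes(mentions, coref_pairs):
    parent = {m: m for m in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in coref_pairs:
        parent[find(a)] = find(b)         # merge the two classes

    classes = {}
    for m in mentions:
        classes.setdefault(find(m), set()).add(m)
    return sorted(sorted(c) for c in classes.values())

mentions = ["United Airlines", "it", "United", "American Airlines"]
links = [("United Airlines", "it"), ("United Airlines", "United")]
print(entity_classes(mentions, links))
# [['American Airlines'], ['United', 'United Airlines', 'it']]
```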
Relation Detection & Classification
Find and classify relations among entities in text: a mix of application-specific and generic relations.
Generic relation types: family, employment, part-whole, membership, geospatial.
Generic relations in the example:
- United is part of UAL Corp.
- American Airlines is part of AMR Corp.
- Tim Wagner is an employee of American Airlines
Domain-specific relations: United serves Chicago; United serves Denver; etc.
Event Detection & Classification
Find key events:
- Fare increase by United
- Matching fare increase by American
- Reporting of the fare increases
How are they cued? Instances of "said" and "citing".
Temporal Expressions: Recognition and Analysis
Temporal expression recognition finds temporal expressions in text, e.g. Friday and Thursday in the example.
Others:
- Date expressions: absolute (days of the week, holidays); relative (two days from now, next year)
- Clock times: noon, 3:30pm
Temporal analysis: map expressions to points/spans on a timeline; anchoring, e.g. Friday and Thursday are tied to the example's dateline.
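Anchoring a bare weekday name to a dateline can be sketched with simple calendar arithmetic. This is a minimal illustration, assuming the story's dateline is the Friday 2006-10-27 (consistent with the example's effective date of 2006-10-26 for "Thursday") and that a weekday name refers to the most recent such day on or before the dateline:

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def anchor_weekday(name, dateline):
    """Map a weekday name to the nearest such day <= the dateline."""
    target = WEEKDAYS.index(name.lower())
    delta = (dateline.weekday() - target) % 7
    return dateline - timedelta(days=delta)

dateline = date(2006, 10, 27)                # a Friday
print(anchor_weekday("Friday", dateline))    # 2006-10-27
print(anchor_weekday("Thursday", dateline))  # 2006-10-26
```

Real temporal analyzers also need tense and context (a future-tense "Friday" points forward); this sketch only covers the backward-looking case in the example.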
Template Filling
Texts describe stereotypical situations in a domain:
- Identify texts evoking template situations
- Fill template slots based on the text
In the example, the fare raise is the stereotypical event.
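A template is just a fixed set of named slots. A minimal sketch with illustrative slot names (not a standard schema), filled from the values extracted earlier:

```python
# A fare-raise template as a dict of slots; None marks "unfilled".
FARE_RAISE_TEMPLATE = {
    "lead_airline": None,
    "amount": None,
    "effective_date": None,
    "follower": None,
}

def fill_template(template, extracted):
    """Copy the template, filling any slot present in `extracted`."""
    filled = dict(template)
    for slot in filled:
        if slot in extracted:
            filled[slot] = extracted[slot]
    return filled

extracted = {"lead_airline": "United Airlines", "amount": "$6",
             "effective_date": "2006-10-26", "follower": "American Airlines"}
print(fill_template(FARE_RAISE_TEMPLATE, extracted))
```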
Information Extraction Steps
1. Named entity recognition
2. Reference resolution
3. Relation detection & classification
4. Event detection & classification
5. Temporal expression recognition & analysis
6. Template filling
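The steps above can be sketched as a pipeline of stages sharing one document record. The stage bodies here are empty stubs with illustrative names; a real system plugs in actual taggers and classifiers at each step:

```python
# Each stage reads the shared document dict and returns an
# extended copy; stubs only mark which field each stage adds.
def ner(doc):       return {**doc, "entities": []}
def coref(doc):     return {**doc, "entity_classes": []}
def relations(doc): return {**doc, "relations": []}
def events(doc):    return {**doc, "events": []}
def temporal(doc):  return {**doc, "times": []}
def templates(doc): return {**doc, "templates": []}

def run_pipeline(text, stages):
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

doc = run_pipeline("United has a hub at O'Hare.",
                   [ner, coref, relations, events, temporal, templates])
print(sorted(doc))
```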
Relation Detection & Classification
Common Relations
Relations and Entities from Example
Supervised Approaches to Relation Analysis
Two-stage process:
1. Relation detection: detect whether a relation holds between two entities
   - Positive examples: drawn from labeled training data
   - Negative examples: generated from within-sentence entity pairs with no relation
2. Relation classification: label each related pair of entities by type (a multi-class classification task)
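The negative-example generation for the detection stage can be sketched directly: every unordered within-sentence entity pair that is not in the gold relations becomes a negative example. The entity names are illustrative:

```python
from itertools import combinations

def training_pairs(entities, gold_relations):
    """Label every within-sentence entity pair: 1 if the pair is
    in the gold relation set, 0 otherwise (a negative example)."""
    gold = {frozenset(p) for p in gold_relations}
    pairs = []
    for a, b in combinations(entities, 2):
        label = 1 if frozenset((a, b)) in gold else 0
        pairs.append(((a, b), label))
    return pairs

ents = ["United", "UAL Corp.", "Tim Wagner"]
gold = [("United", "UAL Corp.")]
print(training_pairs(ents, gold))
# [(('United', 'UAL Corp.'), 1), (('United', 'Tim Wagner'), 0),
#  (('UAL Corp.', 'Tim Wagner'), 0)]
```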
Feature Design
Which features can we use to classify relations?
Named entity features:
- Named entity types of the candidate arguments
- Concatenation of the two entity types
- Head words of the arguments
- Bag-of-words from the arguments
Context word features:
- Bag-of-words/bigrams between the entities (with or without stemming)
- Words and stems before/after the entities
- Distance in words, and number of entities, between the arguments
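Several of these features fall out of simple span arithmetic over the tokenized sentence. A minimal sketch, assuming each mention is given as a (start, end, type) token span with the first mention preceding the second:

```python
def relation_features(tokens, m1, m2):
    """Slide-style features for a candidate entity pair.
    m1/m2: (start, end, type) token spans, m1 before m2."""
    s1, e1, t1 = m1
    s2, e2, t2 = m2
    between = tokens[e1:s2]                  # words between mentions
    return {
        "type1": t1, "type2": t2,
        "type_concat": t1 + "_" + t2,        # concatenated NE types
        "head1": tokens[e1 - 1],             # last token as head proxy
        "head2": tokens[e2 - 1],
        "bow_between": sorted(set(between)), # bag of words between
        "word_distance": s2 - e1,
    }

toks = "American Airlines a unit of AMR Corp.".split()
print(relation_features(toks, (0, 2, "ORG"), (5, 7, "ORG")))
```

Using the final token as the head word is a rough proxy; real systems take the syntactic head from a parse.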
Feature Design
Features for relation classification: syntactic structure can signal relations, e.g. an appositive signals a part-of relation.
Feature Design
Syntactic structure features, derived from syntactic analysis ranging from base-phrase chunking to full constituent parsing:
- Presence of particular constructions
- Base-phrase chunk paths
- Bag of chunk heads
- Dependency/constituency paths
- Distance between arguments
How can we represent syntax?
- Binary features: detection of construction patterns
- Structure features: encode paths through trees
Some Example Features
Lightly Supervised Approaches
- Supervised approaches are highly effective, but require a large manually annotated corpus for training
- Unsupervised approaches are interesting, but may not yield results consistent with the desired categories
Lightly supervised approaches:
- Build on small seed sets of patterns that capture relations of interest, e.g. manually constructed regular expressions
- Iteratively generalize and refine the patterns
Bootstrapping Relations
Create seed patterns and instances. Example: finding airline hub cities.
Initial pattern: / * has a hub at * /
In a large corpus, this finds examples like:
- Delta has a hub at LaGuardia.
- Milwaukee-based Midwest has a hub at KCI.
- American Airlines has a hub at the San Juan airport.
Instances: (Delta, LaGuardia), (Midwest, KCI), (American, San Juan)
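The seed pattern above translates almost directly into a regular expression. A minimal sketch with deliberately loose capture groups, since this is the high-recall, low-precision starting point the next slides refine:

```python
import re

# Loose seed pattern: anything before "has a hub at" is the
# candidate airline, anything up to the period is the candidate hub.
SEED = re.compile(r"(.+?) has a hub at (.+?)\.")

corpus = [
    "Delta has a hub at LaGuardia.",
    "Milwaukee-based Midwest has a hub at KCI.",
]
tuples = [m.groups() for s in corpus for m in [SEED.search(s)] if m]
print(tuples)
# [('Delta', 'LaGuardia'), ('Milwaukee-based Midwest', 'KCI')]
```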
Bootstrapping Relations
Issues:
False alarms: some matches aren't valid relations.
- The catheter has a hub at the proximal end.
- A star topology often has a hub at the center.
Fix? Constrain the argument types: /[ORG] has a hub at [LOC]/
Misses: some instances are narrowly missed by the pattern.
- No-frills rival easyJet, which has established a hub at Liverpool.
- Ryanair also has a continental hub at Charleroi airport.
Fix? Relax the patterns (but that reintroduces the false-alarm issue), and add new high-precision patterns.
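The typed fix can be sketched by matching over NE-tagged text rather than raw text, assuming the bracketed [ORG ...]/[LOC ...] annotation used in the earlier example slides (the exact pattern and tag format here are illustrative):

```python
import re

# Require an ORG before and a LOC after "hub at", and allow a few
# intervening words so "also has a continental hub at" still matches.
PATTERN = re.compile(
    r"\[ORG ([^\]]+)\][^.]*?\bhub at (?:the )?\[LOC ([^\]]+)\]")

tagged = [
    "[ORG Ryanair] also has a continental hub at [LOC Charleroi] airport.",
    "The catheter has a hub at the proximal end.",  # no ORG/LOC: filtered
]
hits = [m.groups() for t in tagged for m in [PATTERN.search(t)] if m]
print(hits)  # [('Ryanair', 'Charleroi')]
```

This shows why the two fixes interact: the relaxed middle context recovers the miss, while the NE-type constraint keeps the catheter sentence out.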
Bootstrapping
How can we obtain more high-quality patterns? Human labeling is costly (see above), so instead use the tuples identified by the seed patterns: find other occurrences of those tuples and extract patterns from the occurrences.
E.g. tuple: (Ryanair, Charleroi), with "hub".
Occurrences:
- Budget airline Ryanair, which uses Charleroi as a hub
- All flights in and out of Ryanair's Belgian hub at Charleroi
Extracted patterns:
- /[ORG], which uses [LOC] as a hub/
- /[ORG]'s hub at [LOC]/
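Turning a new occurrence of a known tuple into a candidate pattern amounts to replacing the tuple's arguments with typed slots. A minimal sketch (real systems generalize the surrounding context too, rather than keeping it verbatim):

```python
def induce_pattern(sentence, org, loc):
    """If both arguments occur, replace them with typed slots to
    form a candidate pattern string; otherwise return None."""
    if org in sentence and loc in sentence:
        return sentence.replace(org, "[ORG]").replace(loc, "[LOC]")
    return None

occ = "All flights in and out of Ryanair's Belgian hub at Charleroi"
print(induce_pattern(occ, "Ryanair", "Charleroi"))
# All flights in and out of [ORG]'s Belgian hub at [LOC]
```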
Bootstrapping Issues
- How do we represent patterns?
- How do we assess the patterns?
- How do we assess the tuples?
Pattern Representation
General approaches:
- Regular expressions: specific, high precision
- Machine-learning-style feature vectors: more flexible, higher recall
Components of the representation:
- Context before the first entity mention
- Context between the entity mentions
- Context after the last entity mention
- Order of the arguments
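The feature-vector style can be sketched as exactly those four components: before/between/after contexts plus argument order, instead of one literal regular expression. The spans and window size below are illustrative assumptions:

```python
def pattern_repr(tokens, arg1_span, arg2_span):
    """Represent a pattern occurrence as its context components.
    arg spans are (start, end) token indices."""
    (s1, e1), (s2, e2) = sorted([arg1_span, arg2_span])
    return {
        "before": tuple(tokens[max(0, s1 - 2):s1]),  # 2-word window
        "between": tuple(tokens[e1:s2]),
        "after": tuple(tokens[e2:e2 + 2]),
        "order": "arg1_first" if arg1_span < arg2_span else "arg2_first",
    }

toks = "Budget airline Ryanair , which uses Charleroi as a hub".split()
print(pattern_repr(toks, (2, 3), (6, 7)))
```

Two occurrences can then match partially (e.g. same "between" context, different "before" context), which is what gives this representation its higher recall.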
Assessing Patterns & Tuples
Issue: no gold standard! Can use the seed patterns instead.
Balance two main factors:
- Performance on seed tuples: hits and misses
- Productivity in the full collection: finds
Use a conservative strategy to minimize semantic drift, e.g. Sydney has a ferry hub at Circular Quay.
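One common way to combine the two factors is an RlogF-style score (an assumption here, not necessarily the scheme the course used): precision against known-good tuples (hits/finds), weighted by how productive the pattern is (log of finds):

```python
import math

def pattern_conf(hits, finds):
    """Score a pattern by precision on known tuples times the
    log of its productivity in the full collection."""
    if finds == 0:
        return 0.0
    return (hits / finds) * math.log2(finds)

# A precise, productive pattern outranks a productive, noisy one.
print(pattern_conf(8, 10) > pattern_conf(2, 10))  # True
```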
Relation Analysis Evaluation
Approaches:
- Standardized, labeled data sets: precision, recall, F-measure
  - 'Labeled' precision: tests both detection and classification
  - 'Unlabeled' precision: tests only detection
- No fully labeled corpus: evaluate the retrieved tuples (e.g. DB contents), checked by human assessors
  - Intuition: we don't care how many times we find the same tuple
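The tuple-level evaluation can be sketched as set-based precision/recall/F against an assessor-approved set; working over sets means duplicate extractions of the same tuple count once, matching the "DB contents" intuition. The tuples below are illustrative:

```python
def prf(extracted, gold):
    """Precision, recall, and F1 over unique extracted tuples."""
    ext, gold = set(extracted), set(gold)
    tp = len(ext & gold)
    p = tp / len(ext) if ext else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

ext = [("Delta", "LaGuardia"), ("Midwest", "KCI"), ("catheter", "end")]
gold = [("Delta", "LaGuardia"), ("Midwest", "KCI"), ("American", "San Juan")]
print(prf(ext, gold))  # ≈ (0.667, 0.667, 0.667)
```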