1 Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

2 Roadmap Information extraction: Definition & Motivation General approach Relation extraction techniques: Fully supervised Lightly supervised

3 Information Extraction Motivation: Unstructured data hard to manage, assess, analyze

4 Information Extraction Motivation: Unstructured data hard to manage, assess, analyze Many tools for manipulating structured data: Databases: efficient search, agglomeration Statistical analysis packages Decision support systems

5 Information Extraction Motivation: Unstructured data hard to manage, assess, analyze Many tools for manipulating structured data: Databases: efficient search, agglomeration Statistical analysis packages Decision support systems Goal: Given unstructured information in texts Convert to structured data E.g., populate DBs

6 IE Example Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. Lead airline: Amount: Effective Date: Follower:

7 IE Example Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. Lead airline: United Airlines Amount: $6 Effective Date: 2006-10-26 Follower: American Airlines
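The filled template maps naturally onto a structured record. A minimal sketch in Python, assuming we represent the template as a small dataclass (the type and field names are illustrative, taken from the slot labels above):

```python
from dataclasses import dataclass

@dataclass
class FareRaiseTemplate:
    """Hypothetical record for the fare-raise event extracted above."""
    lead_airline: str
    amount: str
    effective_date: str   # ISO date resolved from "Thursday" + article dateline
    follower: str

# The slots filled from the example text:
record = FareRaiseTemplate(
    lead_airline="United Airlines",
    amount="$6",
    effective_date="2006-10-26",
    follower="American Airlines",
)
print(record)
```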

8 Other Applications Shared tasks: Message Understanding Conferences (MUC)

9 Other Applications Shared tasks: Message Understanding Conferences (MUC) Joint ventures/mergers:

10 Other Applications Shared tasks: Message Understanding Conferences (MUC) Joint ventures/mergers: CompanyA, CompanyB, CompanyC, amount, type Terrorist incidents

11 Other Applications Shared tasks: Message Understanding Conferences (MUC) Joint ventures/mergers: CompanyA, CompanyB, CompanyC, amount, type Terrorist incidents Incident type, Actor, Victim, Date, Location General:

12 Other Applications Shared tasks: Message Understanding Conferences (MUC) Joint ventures/mergers: CompanyA, CompanyB, CompanyC, amount, type Terrorist incidents Incident type, Actor, Victim, Date, Location General: Pricegrabber Stock market analysis Apartment finder

13 Information Extraction Components Incorporates range of techniques, tools

14 Information Extraction Components Incorporates range of techniques, tools FSTs (pattern extraction) Supervised learning: Classification Sequence labeling

15 Information Extraction Components Incorporates range of techniques, tools FSTs (pattern extraction) Supervised learning: Classification Sequence labeling Semi-, un-supervised learning: clustering, bootstrapping

16 Information Extraction Components Incorporates range of techniques, tools FSTs (pattern extraction) Supervised learning: Classification Sequence labeling Semi-, un-supervised learning: clustering, bootstrapping Part-of-Speech tagging Named Entity Recognition Chunking Parsing

17 Information Extraction Process Typically pipeline of major components

18 Information Extraction Process Typically pipeline of major components First step: Named Entity Recognition (NER) Often application-specific Generic types: Person, Organization, Location Specific: genes to college courses

19 Information Extraction Process Typically pipeline of major components First step: Named Entity Recognition (NER) Often application-specific Generic types: Person, Organization, Location Specific: genes to college courses Identify NE mentions: specific instances

20 Example: Pre-NER Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

21 Example: post-NER Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
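As an aside, generic NER tags like these can be produced with a modern off-the-shelf tagger. A sketch using spaCy, assuming its small English model is installed (spaCy's label set, e.g. ORG, MONEY, DATE, GPE, differs slightly from the tags on the slide):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Citing high fuel prices, United Airlines said Friday it has "
          "increased fares by $6 per round trip to some cities also "
          "served by lower-cost carriers.")

# Print each named-entity mention in the slide's bracket notation.
for ent in doc.ents:
    print(f"[{ent.label_} {ent.text}]")
```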

22 Information Extraction: Reference Resolution Builds on NER Clusters entity mentions that refer to the same entity Creates entity equivalence classes

23 Information Extraction: Reference Resolution Builds on NER Clusters entity mentions that refer to the same entity Creates entity equivalence classes Includes anaphora resolution E.g. pronoun resolution (it, he, she…; one; this, that)

24 Information Extraction: Reference Resolution Builds on NER Clusters entity mentions that refer to the same entity Creates entity equivalence classes Includes anaphora resolution E.g. pronoun resolution (it, he, she…; one; this, that) Also non-anaphoric, general mentions

25 Example: post-NER Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

26 Example: post-NER Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

27 Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text

28 Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text Mix of application-specific and generic relations Generic relations:

29 Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text Mix of application-specific and generic relations Generic relations: Family, employment, part-whole, membership, geospatial

30 Example: post-NER Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

31 Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text Mix of application-specific and generic relations Generic relations: Family, employment, part-whole, membership, geospatial Generic relations: United is part-of UAL Corp. American Airlines is part-of AMR Tim Wagner is an employee of American Airlines

32 Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text Mix of application-specific and generic relations Generic relations: Family, employment, part-whole, membership, geospatial Generic relations: United is part-of UAL Corp. American Airlines is part-of AMR Tim Wagner is an employee of American Airlines Domain-specific relations: United serves Chicago; United serves Denver, etc.

33 Example: post-NER Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

34 Event Detection & Classification Event detection and classification: Find key events

35 Event Detection & Classification Event detection and classification: Find key events Fare increase by United Matching fare increase by American Reporting of fare increases How cued?

36 Event Detection & Classification Event detection and classification: Find key events Fare increase by United Matching fare increase by American Reporting of fare increases How cued? Instances of “said” and “cite”

37 Temporal Expressions: Recognition and Analysis Temporal expression recognition Finds temporal events in text

38 Example: post-NER Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

39 Temporal Expressions: Recognition and Analysis Temporal expression recognition Finds temporal events in text E.g. Friday, Thursday in example text

40 Temporal Expressions: Recognition and Analysis Temporal expression recognition Finds temporal events in text E.g. Friday, Thursday in example text Others: Date expressions: Absolute: days of week, holidays Relative: two days from now, next year Clock times: noon, 3:30pm

41 Temporal Expressions: Recognition and Analysis Temporal expression recognition Finds temporal events in text E.g. Friday, Thursday in example text Others: Date expressions: Absolute: days of week, holidays Relative: two days from now, next year Clock times: noon, 3:30pm Temporal analysis: Map expressions to points/spans on timeline Anchor: e.g. Friday, Thursday: tied to example dateline
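A sketch of the anchoring step: resolving a relative expression such as "Thursday" against the article's dateline with plain datetime arithmetic (the dateline date is assumed here; it matches the effective date filled in on slide 7):

```python
from datetime import date, timedelta

def most_recent_weekday(anchor: date, weekday: int) -> date:
    """Most recent occurrence of `weekday` on or before `anchor`
    (Monday=0 ... Sunday=6)."""
    return anchor - timedelta(days=(anchor.weekday() - weekday) % 7)

dateline = date(2006, 10, 27)               # "Friday", the assumed dateline
print(most_recent_weekday(dateline, 3))     # "Thursday" -> 2006-10-26
```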

42 Template Filling Texts describe stereotypical situations in domain Identify texts evoking template situations Fill slots in templates based on text

43 Template Filling Texts describe stereotypical situations in domain Identify texts evoking template situations Fill slots in templates based on text In example: fare-raise is a stereotypical event

44 Information Extraction Steps Named Entity Recognition Reference Resolution Relation Detection & Classification Event Detection & Classification Temporal Expression Recognition & Analysis Template-filling

45 Relation Detection & Classification

46 Common Relations

47 Relations and Entities from Example

48 Supervised Approaches to Relation Analysis Two stage process:

49 Supervised Approaches to Relation Analysis Two stage process: Relation detection: Detect whether a relation holds between two entities

50 Supervised Approaches to Relation Analysis Two stage process: Relation detection: Detect whether a relation holds between two entities Positive examples:

51 Supervised Approaches to Relation Analysis Two stage process: Relation detection: Detect whether a relation holds between two entities Positive examples: Drawn from labeled training data Negative examples:

52 Supervised Approaches to Relation Analysis Two stage process: Relation detection: Detect whether a relation holds between two entities Positive examples: Drawn from labeled training data Negative examples: Generated from within-sentence entity pairs w/no relation Relation classification: Label each related pair of entities by type Multi-class classification task
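A minimal sketch of stage one's training-data construction under the scheme above: every within-sentence pair of entity mentions is a candidate, pairs present in the gold annotation are positives, and the rest are negatives (the data structures here are hypothetical):

```python
from itertools import combinations

def candidate_pairs(sentence_entities):
    """All within-sentence entity-mention pairs are relation candidates."""
    return list(combinations(sentence_entities, 2))

def build_training_examples(sentence_entities, gold_relations):
    """Stage 1 training data: pairs annotated in the gold data are
    positive; all other within-sentence pairs serve as negatives."""
    examples = []
    for e1, e2 in candidate_pairs(sentence_entities):
        label = 1 if (e1, e2) in gold_relations else 0
        examples.append(((e1, e2), label))
    return examples

# Hypothetical sentence: four mentions, one annotated part-of relation.
ents = ["United", "UAL Corp.", "Thursday", "Chicago"]
gold = {("United", "UAL Corp.")}
for pair, label in build_training_examples(ents, gold):
    print(pair, "positive" if label else "negative")
```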

53 Feature Design Which features can we use to classify relations?

54 Example: post-NER Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

55 Common Relations

56 Feature Design Which features can we use to classify relations?

57 Feature Design Which features can we use to classify relations? Named entity features: Named entity types of candidate arguments Concatenation of two entity types Head words of arguments Bag-of-words from arguments

58 Feature Design Which features can we use to classify relations? Named entity features: Named entity types of candidate arguments Concatenation of two entity types Head words of arguments Bag-of-words from arguments Context word features: Bag-of-words/bigrams between entities (+/- stemmed) Words and stems before/after entities Distance in words, # entities between arguments
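A sketch of how several of the features above might be assembled for one candidate pair; token indices and NE types are assumed to come from earlier pipeline stages, and the head-word heuristic is deliberately crude:

```python
def relation_features(tokens, e1, e2):
    """e1/e2: (start, end, ne_type) spans over `tokens`, with e1 before e2.
    Returns a dict of named-entity and context-word features."""
    between = tokens[e1[1]:e2[0]]
    return {
        "e1_type": e1[2],
        "e2_type": e2[2],
        "type_concat": e1[2] + "_" + e2[2],
        "e1_head": tokens[e1[1] - 1],        # crude head: last word of span
        "e2_head": tokens[e2[1] - 1],
        "word_before_e1": tokens[e1[0] - 1] if e1[0] > 0 else "<S>",
        "word_after_e2": tokens[e2[1]] if e2[1] < len(tokens) else "</S>",
        "words_between": set(between),       # bag of words between mentions
        "distance": e2[0] - e1[1],           # distance in tokens
    }

toks = "United , a unit of UAL Corp.".split()
print(relation_features(toks, (0, 1, "ORG"), (5, 7, "ORG")))
```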

59 Feature Design Features for relation classification Syntactic structure features can signal relations E.g. appositive signals part-of relation

60 Feature Design Features for relation classification Syntactic structure features can signal relations E.g. appositive signals part-of relation

61 Feature Design Syntactic structure features Derived from syntactic analysis From base-phrase chunking to full constituent parse

62 Feature Design Syntactic structure features Derived from syntactic analysis From base-phrase chunking to full constituent parse Features: Presence of construction Chunk base-phrase paths Bag-of-chunk heads Dependency/constituency path Distance between arguments How can we represent syntax?

63 Feature Design Syntactic structure features Derived from syntactic analysis From base-phrase chunking to full constituent parse Features: Presence of construction Chunk base-phrase paths Bag-of-chunk heads Dependency/constituency path Distance between arguments How can we represent syntax? Binary features: Detection of construction patterns

64 Feature Design Syntactic structure features Derived from syntactic analysis From base-phrase chunking to full constituent parse Features: Presence of construction Chunk base-phrase paths Bag-of-chunk heads Dependency/constituency path Distance between arguments How can we represent syntax? Binary features: Detection of construction patterns Structure features: Encode paths through trees
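As one concrete structure feature, a sketch of the chunk path between two arguments, assuming a base-phrase chunker has already run (the chunking shown is hypothetical):

```python
def chunk_path(chunks, i, j):
    """Path of base-phrase chunk tags from argument chunk i to chunk j,
    e.g. 'NP -> NP -> PP -> NP', a common appositive/part-whole signal."""
    return " -> ".join(tag for tag, _ in chunks[i:j + 1])

# Hypothetical chunking of "American Airlines, a unit of AMR Corp."
chunks = [("NP", "American Airlines"), ("NP", "a unit"),
          ("PP", "of"), ("NP", "AMR Corp.")]
print(chunk_path(chunks, 0, 3))   # NP -> NP -> PP -> NP
```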

65 Some Example Features

66 Lightly Supervised Approaches Supervised approaches highly effective, but

67 Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training

68 Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training Unsupervised approaches interesting, but

69 Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training Unsupervised approaches interesting, but May not yield results consistent with desired categories Lightly supervised approaches:

70 Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training Unsupervised approaches interesting, but May not yield results consistent with desired categories Lightly supervised approaches: Build on small seed sets of patterns that capture relations of interest E.g. manually constructed regular expressions

71 Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training Unsupervised approaches interesting, but May not yield results consistent with desired categories Lightly supervised approaches: Build on small seed sets of patterns that capture relations of interest E.g. manually constructed regular expressions Iteratively generalize and refine patterns to optimize

72 Bootstrapping Relations Create seed patterns and instances Example: finding airline hub cities Initial pattern:

73 Bootstrapping Relations Create seed patterns and instances Example: finding airline hub cities Initial pattern: / * has a hub at * / In a large corpus finds examples like: Delta has a hub at LaGuardia. Milwaukee-based Midwest has a hub at KCI. American Airlines has a hub at the San Juan airport.

74 Bootstrapping Relations Create seed patterns and instances Example: finding airline hub cities Initial pattern: / * has a hub at * / In a large corpus finds examples like: Delta has a hub at LaGuardia. Milwaukee-based Midwest has a hub at KCI. American Airlines has a hub at the San Juan airport. Instances: (Delta, LaGuardia), (Midwest, KCI), (American, San Juan)
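The seed pattern can be written as an actual regular expression. A sketch (the capture groups are deliberately loose, which is exactly the problem the next slides address):

```python
import re

# Loose seed pattern: capture whatever surrounds "has a hub at".
seed = re.compile(r"(.+?) has a hub at (.+?)[.\n]")

corpus = [
    "Delta has a hub at LaGuardia.",
    "Milwaukee-based Midwest has a hub at KCI.",
    "American Airlines has a hub at the San Juan airport.",
]
instances = [m.groups() for text in corpus for m in seed.finditer(text)]
print(instances)
# [('Delta', 'LaGuardia'), ('Milwaukee-based Midwest', 'KCI'),
#  ('American Airlines', 'the San Juan airport')]
```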

75 Bootstrapping Relations Issues:

76 Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix?

77 Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix? /[ORG] has a hub at [LOC]/ Misses: Some instances are narrowly missed by pattern No-frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport. Fix?

78 Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix? /[ORG] has a hub at [LOC]/ Misses: Some instances are narrowly missed by pattern No-frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport. Fix? Relax patterns

79 Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix? /[ORG] has a hub at [LOC]/ Misses: Some instances are narrowly missed by pattern No-frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport. Fix? Relax patterns – see issue 1

80 Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix? /[ORG] has a hub at [LOC]/ Misses: Some instances are narrowly missed by pattern No-frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport. Fix? Relax patterns – see issue 1 Add new high-precision patterns

81 Bootstrapping How can we obtain more high quality patterns?

82 Bootstrapping How can we obtain more high quality patterns? Human labeling?

83 Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences

84 Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences E.g. tuple: Ryanair, Charleroi and hub

85 Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences E.g. tuple: Ryanair, Charleroi and hub Occurrences: Budget airline Ryanair, which uses Charleroi as a hub All flights in and out of Ryanair’s Belgian hub at Charleroi Patterns:

86 Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences E.g. tuple: Ryanair, Charleroi and hub Occurrences: Budget airline Ryanair, which uses Charleroi as a hub All flights in and out of Ryanair’s Belgian hub at Charleroi Patterns: /[ORG], which uses [LOC] as a hub/

87 Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences E.g. tuple: Ryanair, Charleroi and hub Occurrences: Budget airline Ryanair, which uses Charleroi as a hub All flights in and out of Ryanair’s Belgian hub at Charleroi Patterns: /[ORG], which uses [LOC] as a hub/ /[ORG]’s hub at [LOC]/
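A sketch of that induction step: take a sentence containing a known tuple and abstract the mentions into typed slots (plain string substitution stands in for a real pattern learner, which would also generalize away tokens like "Belgian"):

```python
def induce_pattern(sentence, org, loc):
    """Turn a sentence containing a known (org, loc) tuple into a
    candidate pattern by replacing the mentions with typed slots."""
    if org in sentence and loc in sentence:
        return sentence.replace(org, "[ORG]").replace(loc, "[LOC]")
    return None

occurrences = [
    "Budget airline Ryanair, which uses Charleroi as a hub",
    "All flights in and out of Ryanair's Belgian hub at Charleroi",
]
for occ in occurrences:
    print(induce_pattern(occ, "Ryanair", "Charleroi"))
# Budget airline [ORG], which uses [LOC] as a hub
# All flights in and out of [ORG]'s Belgian hub at [LOC]
```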

89 Bootstrapping Issues Issues: How do we represent patterns? How do we assess the patterns? How do we assess the tuples?

90 Pattern Representation General approaches:

91 Pattern Representation General approaches: Regular expressions Specific, high precision

92 Pattern Representation General approaches: Regular expressions Specific, high precision Machine learning style feature vectors More flexible, higher recall

93 Pattern Representation General approaches: Regular expressions Specific, high precision Machine learning style feature vectors More flexible, higher recall Components in representation: Context before first entity mention Context between entity mentions Context after last entity mention Order of arguments
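A sketch of the feature-vector style, with the four components just listed (the field layout is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class RelationPattern:
    """Feature-vector style pattern: contexts around the two mentions."""
    before: tuple   # words before the first entity mention
    between: tuple  # words between the mentions
    after: tuple    # words after the last entity mention
    arg_order: str  # e.g. "ORG-LOC" vs. "LOC-ORG"

p = RelationPattern(
    before=("budget", "airline"),
    between=(",", "which", "uses"),
    after=("as", "a", "hub"),
    arg_order="ORG-LOC",
)
print(p)
```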

94 Assessing Patterns & Tuples Issue:

95 Assessing Patterns & Tuples Issue: No gold standard! Can use seed patterns

96 Assessing Patterns & Tuples Issue: No gold standard! Can use seed patterns Balance two main factors: Performance on seed tuples Hits and misses Productivity in full collection Finds

97 Assessing Patterns & Tuples Issue: No gold standard! Can use seed patterns Balance two main factors: Performance on seed tuples Hits and misses Productivity in full collection Finds Use conservative strategy to minimize semantic drift E.g. Sydney has a ferry hub at Circular Quay.
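One concrete way to balance performance (hits over finds on the seed tuples) against productivity (finds in the full collection) is a confidence score in the style of Riloff and Jones's RlogF metric; a sketch:

```python
import math

def rlogf_confidence(hits, finds):
    """Confidence of a pattern: precision on seed tuples (hits/finds)
    weighted by its productivity (log of total matches in the corpus)."""
    if finds == 0:
        return 0.0
    return (hits / finds) * math.log(finds)

# A precise but unproductive pattern vs. a productive, noisier one.
print(rlogf_confidence(hits=2, finds=2))    # ~0.693
print(rlogf_confidence(hits=30, finds=50))  # ~2.347
```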

98 Relation Analysis Evaluation Approaches: Standardized, labeled data sets

99 Relation Analysis Evaluation Approaches: Standardized, labeled data sets Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

100 Relation Analysis Evaluation Approaches: Standardized, labeled data sets Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection No fully labeled corpus

101 Relation Analysis Evaluation Approaches: Standardized, labeled data sets Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection No fully labeled corpus Evaluate retrieved tuples (e.g. DB contents) Intuition: Don’t care how many times we find the same tuple

102 Relation Analysis Evaluation Approaches: Standardized, labeled data sets Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection No fully labeled corpus Evaluate retrieved tuples (e.g. DB contents) Intuition: Don’t care how many times we find the same tuple Checked by human assessors
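A sketch of that tuple-level evaluation: the extracted tuples form a set (duplicates collapse), and precision is estimated from a human-assessed random sample (the assessor is mocked here with a toy function):

```python
import random

def estimated_precision(extracted_tuples, assess, sample_size=100):
    """Estimate precision by having assessors judge a random sample.
    `assess` stands in for the human judgment (tuple -> True/False)."""
    pool = list(extracted_tuples)
    sample = random.sample(pool, min(sample_size, len(pool)))
    correct = sum(1 for t in sample if assess(t))
    return correct / len(sample)

# Hypothetical extracted DB and a toy "assessor".
db = {("Delta", "LaGuardia"), ("Ryanair", "Charleroi"),
      ("catheter", "proximal end")}
print(estimated_precision(db, assess=lambda t: t[0] != "catheter"))
```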

