1
1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University of Washington
2
2 Web Information Extraction: …cities such as Chicago… => Chicago ∈ City, via the pattern "C such as x => x ∈ C" [Hearst, 1992]. …Edison invented the light bulb… => (Edison, light bulb) ∈ Invented, via "x V y => (x, y) ∈ V". E.g., KnowItAll [Etzioni et al., 2005], TextRunner [Banko et al., 2007], others [Pasca et al., 2007]
3
3 Identifying correct extractions: …mayors of major cities such as Giuliani… => Giuliani ∈ City (an error). Supervised IE hand-labels examples of each concept, which is not possible on the Web (far too many concepts) => Unsupervised IE (UIE). How can we automatically identify correct extractions for any concept without hand-labeled data?
4
4 KnowItAll Hypothesis (KH) Extractions that occur more frequently in distinct sentences in the corpus are more likely to be correct. Repetitions of the same error are relatively rare …mayors of major cities such as Giuliani… …hotels in popular cities such as Marriot.… Misinformation is the exception rather than the rule “Elvis killed JFK” – 200 hits “Oswald killed JFK” – 3000 hits
5
5 Redundancy KH can identify many correct statements because the Web is highly redundant – same facts repeated many times, in many ways – e.g., “Edison invented the light bulb” – 10,000 hits (but leveraging the KH is a little tricky => probabilistic model) Thesis: We can identify correct extractions without labeled data using a probabilistic model of redundancy.
6
6 Outline: 1) Background 2) KH as a general problem structure (Monotonic Feature Model) 3) URNS model (how does probability increase with repetition?) 4) Challenge: the "long tail" (unsupervised language models)
7
7 Classical Supervised Learning: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y). [Figure: labeled points in the (x_1, x_2) feature space, with a "?" marking an unlabeled query point]
8
8 Semi-Supervised Learning (SSL): learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x). [Figure: the same (x_1, x_2) space with added unlabeled points]
9
9 Monotonic Features: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x). [Figure: unlabeled points in the (x_1, x_2) space]
10
10 Monotonic Features: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x), where P(y=1 | x_1) increases with x_1.
11
11 Common Structure
Task | Monotonic Feature
UIE | "C such as x" [Etzioni et al., 2005]
Word Sense Disambiguation | "plant and animal species" [Yarowsky, 1995]
Information Retrieval | search query [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification | topic word, e.g. "politics" [McCallum & Nigam, 1999; Gliozzo, 2005]
Named Entity Recognition | contains("Mr.") [Collins & Singer, 1998]
12
12 Isn't this just ___? The MF model is provably distinct from standard smoothness assumptions in SSL (the cluster assumption and the manifold assumption) => MFs can complement other methods. Unlike co-training, the MF model requires neither labeled data nor pre-defined "views".
13
13 Theoretical Results: one MF implies PAC-learnability without labeled data when the MF is conditionally independent of the other features and is minimally informative (a corollary to the co-training theorem [Blum and Mitchell, 1998]). MFs provide more information than labels about unlabeled examples as the feature space grows: as the number of features increases, the information gain due to MFs stays constant, while the information gain due to labeled examples falls (under assumptions).
14
14 Classification with the MF Model (MFA): given MFs and unlabeled data, (1) use the MFs to produce noisy labels, (2) train any classifier.
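A minimal Python sketch of this two-step recipe (the synthetic data, the quantile thresholds for noisy labeling, and the choice of logistic regression are illustrative assumptions; the thesis does not tie MFA to a particular classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unlabeled examples: column 0 plays the role of the monotonic feature
# (e.g., extraction frequency); the remaining columns are ordinary features.
X = rng.normal(size=(1000, 5))
X[:, 0] = rng.exponential(scale=2.0, size=1000)   # the monotonic feature

def mfa_train(X, mf_index=0, pos_quantile=0.8, neg_quantile=0.2):
    """MFA sketch: use the monotonic feature to produce noisy labels,
    then train any off-the-shelf classifier on all features."""
    mf = X[:, mf_index]
    hi, lo = np.quantile(mf, [pos_quantile, neg_quantile])
    pos = mf >= hi          # high MF value -> noisy positive label
    neg = mf <= lo          # low MF value  -> noisy negative label
    X_train = np.vstack([X[pos], X[neg]])
    y_train = np.concatenate([np.ones(pos.sum()), np.zeros(neg.sum())])
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)

clf = mfa_train(X)
print(clf.predict_proba(X[:5])[:, 1])   # P(y=1 | x) for a few examples
```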
15
15 Experimental Results: 20 Newsgroups dataset (MF: newsgroup name), compared against two SSL baselines (NB + EM, LP), without labeled data.
16
16 Experimental Results: MFA-SSL provides a 15% error reduction for 100-400 labeled examples; MFA-BOTH provides a 31% error reduction for 0-800 labeled examples.
17
17 Bad News: confusable MFs. For more complex tasks, monotonicity is insufficient. Example: City extractions. MF: extraction frequency with, e.g., "cities such as x"; but the same frequency is also an MF for: has skyscrapers, has an opera house, located on Earth, …
Extraction | MF value
New York | 1488
Chicago | 999
Los Angeles | 859
… | …
Twisp | 1
Northeast | 1
18
18 Performance of MFA in UIE
19
19 MFA for SSL in UIE
20
20 Outline: 1) Background 2) KH as a general problem structure (Monotonic Feature Model) 3) URNS model (how does probability increase with repetition?) 4) Challenge: the "long tail" (unsupervised language models)
21
21 Redundancy: Single Pattern. Consider a single pattern suggesting C, e.g., "countries such as x". If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?
22
22 “…countries such as Saudi Arabia…” “…countries such as the United States…” “…countries such as Saudi Arabia…” “…countries such as Japan…” “…countries such as Africa…” “…countries such as Japan…” “…countries such as the United Kingdom…” “…countries such as Iraq…” “…countries such as Afghanistan…” “…countries such as Australia…” C = Country n = 10 occurrences Redundancy: Single Pattern
23
23 Naïve Model: Noisy-Or (C = Country, n = 10)
Extraction | k
Saudi Arabia | 2
Japan | 2
United States | 1
Africa | 1
United Kingdom | 1
Iraq | 1
Afghanistan | 1
Australia | 1
p = probability the pattern yields a correct extraction, e.g., p = 0.9
P_noisy-or(x ∈ C | x seen k times) = 1 − (1 − p)^k, i.e., 0.99 for k = 2 and 0.9 for k = 1 [Agichtein & Gravano, 2000; Lin et al. 2003]
Noisy-or ignores: sample size (n) and the distribution of C.
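A small Python sketch of the two scoring rules discussed on this and the following slides (p = 0.9 as on the slide; the large-sample counts come from the next slides, and the resulting P_freq values are purely illustrative, not a reproduction of the figure):

```python
def p_noisy_or(k, p=0.9):
    # Noisy-or: P(x in C | x seen k times) = 1 - (1 - p)^k.
    # Ignores the sample size n and the distribution of C.
    return 1 - (1 - p) ** k

def p_freq(k, n, p=0.9):
    # Sample-size-normalized variant from the following slides:
    # P_freq(x in C | x seen k times) = 1 - (1 - p)^(k/n)
    return 1 - (1 - p) ** (k / n)

# Small sample (n = 10): noisy-or matches the slide (0.9 for k=1, 0.99 for k=2).
print(p_noisy_or(1), p_noisy_or(2))

# Large sample (n ~ 50,000): noisy-or gives ~1.0 to frequent extractions and
# 0.9 to every singleton error, while p_freq at least distinguishes them.
for name, k in [("United States", 3899), ("China", 1999), ("Atlantic Ocean", 1)]:
    print(f"{name:15s} noisy-or={p_noisy_or(k):.6f}  p_freq={p_freq(k, 50000):.6f}")
```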
24
24 Needed in Model: Sample Size. With C = Country and n ≈ 50,000, frequent correct extractions (United States, k = 3899; China, k = 1999) get P_noisy-or = 0.9999…, but every singleton error (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean; k = 1) still gets 0.9, just as in the n = 10 sample. As sample size increases, noisy-or becomes inaccurate.
25
25 Needed in Model: Distribution of C. A first fix normalizes by sample size: P_freq(x ∈ C | x seen k times) = 1 − (1 − p)^(k/n).
26
26 With P_freq (C = Country, n ≈ 50,000), the singleton errors (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean) drop to ≈ 0.05, while United States (k = 3899) and China (k = 1999) remain at 0.9999….
27
27 But the probability that x ∈ C depends on the distribution of C: with the same n ≈ 50,000 and the same P_freq values (0.9999… for frequent extractions, ≈ 0.05 for singletons), C = City (New York 1488, Chicago 999, …, El Estor 1, Nikki Ragaz 1, Villegas 1, Northeastwards 1) and C = Country behave very differently, because their tails contain different proportions of correct extractions.
28
28 My solution: the URNS model. An urn for C = City contains labeled balls (Tokyo, U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K.); each occurrence of "…cities such as Tokyo…" corresponds to a draw from the urn.
29
29 Urn – Formal Definition: C – the set of unique target labels; E – the set of unique error labels; num(C) – the distribution of target labels; num(E) – the distribution of error labels.
30
30 Urn Example (urn for C = City, balls: U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K., …): distribution of target labels num(C) = {2, 2, 1, 1, 1}; distribution of error labels num(E) = {2, 1}.
31
31 Computing Probabilities: if an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?
32
32 Computing Probabilities: given that an extraction x appears k times in n draws from the urn (with replacement), the probability that x ∈ C is
P(x ∈ C | k of n) = Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^(n−k) / Σ_{r' ∈ num(C) ∪ num(E)} (r'/s)^k (1 − r'/s)^(n−k),
where s is the total number of balls in the urn.
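A Python sketch of this computation, assuming a uniform prior over which unique label the observed extraction is (each label's binomial likelihood is weighted equally, and the binomial coefficient is omitted because it cancels in the ratio). The example urn is the one from the "Urn Example" slide.

```python
def urns_probability(k, n, num_C, num_E):
    """Probability that a label seen k times in n draws (with replacement)
    is a target label, for an urn with target repetition counts num_C and
    error repetition counts num_E (s = total number of balls)."""
    s = sum(num_C) + sum(num_E)
    def likelihood(r):
        # Chance that a specific label with r copies is drawn exactly k times
        # in n draws; the binomial coefficient cancels, so it is left out.
        return (r / s) ** k * (1 - r / s) ** (n - k)
    target = sum(likelihood(r) for r in num_C)
    error = sum(likelihood(r) for r in num_E)
    return target / (target + error)

# Urn from the "Urn Example" slide: num(C) = {2, 2, 1, 1, 1}, num(E) = {2, 1}.
num_C, num_E = [2, 2, 1, 1, 1], [2, 1]
for k in (1, 2, 3):
    print(k, round(urns_probability(k, n=10, num_C=num_C, num_E=num_E), 3))
```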
33
33 URNS without labeled data. Needed: num(C) and num(E), assumed to be Zipf-distributed (frequency of the ith element ∝ i^(−z)). With these assumptions, the Zipfian parameters for any class C can be learned from unlabeled data alone.
34
34 URNS without labeled data: the observed frequency distribution is a mixture – with probability p a draw comes from C's Zipf distribution, and with probability 1 − p from E's Zipf distribution. p and num(E) are constant across classes C for a given pattern, so num(C) can be learned from unlabeled data.
35
35 Probabilities Assigned by URNS (n ≈ 50,000): for C = City, frequent extractions (New York 1488, Chicago 999) get P_URNS = 0.9999… and singletons (El Estor, Nikki Ragaz, Villegas, Cres, Northeastwards) get ≈ 0.63; for C = Country, frequent extractions (United States 3899, China 1999) get 0.9999… while singletons (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean, New Zeland) get ≈ 0.03.
36
36 Probability Accuracy: URNS's probabilities are 15-22x closer to optimal.
37
37 Sensitivity Analysis: URNS assumes num(E) and p are constant. If we alter the parameter choices substantially, URNS still outperforms noisy-or and PMI by at least 8x. It is most sensitive to p; p ≈ 0.85 is relatively consistent across randomly selected classes from WordNet (solvents, devices, thinkers, relaxants, mushrooms, mechanisms, resorts, flies, tones, machines, …).
38
38 Multiple Extraction Patterns (multiple urns): target label frequencies are correlated across urns, while error label frequencies can be uncorrelated.
Phrase | Hits
"Omaha and other cities" | 950
"Illinois and other cities" | 24,400
"cities such as Omaha" | 930
"cities such as Illinois" | 6
39
39 Benefits from Multiple Urns: using multiple urns reduces error by 29%.
K | Precision (single) | Precision (multiple)
10 | 1.0 | –
20 | 0.9875 | 1.0
50 | 0.925 | 0.955
100 | 0.8375 | 0.845
200 | 0.7075 | 0.71
40
40 URNS vs. MFA
41
41 URNS + MFA in SSL: MFA-SSL (URNS) reduces error by 6%, on average.
42
42 URNS: learnable from unlabeled data (with assumptions). All URNS parameters can be learned from unlabeled data alone [Theorem 20]. URNS implies PAC learnability from unlabeled data alone [Theorem 21], even with confusable MFs (i.e., even without conditional independence).
43
43 Parameters Learnable (1): the URNS model can be expressed as a compound Poisson process mixture. The task is to learn the power-law distributions g_C(·) and g_E(·) from their sum; g_C(·) + g_E(·) can be learned given enough samples [Loh, 1993].
44
44 Parameters Learnable (2): assume that sufficiently high-frequency elements are only targets and sufficiently low-frequency elements are only errors; then g_C(·) and g_E(·) can be recovered from their sum g_C(·) + g_E(·).
45
45 Outline: 1) Background 2) KH as a general problem structure (Monotonic Feature Model) 3) URNS model (how does probability increase with repetition?) 4) Challenge: the "long tail" (unsupervised language models)
46
46 Challenge: the "long tail". Frequent extractions tend to be correct, e.g., (Bloomberg, New York City); sparse extractions are a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington) and (Ronald McDonald, McDonaldland).
47
47 Mayor McCheese
48
48 Assessing Sparse Extractions. Strategy: 1) model how common extractions occur in text; 2) rank sparse extractions by fit to the model.
49
49 The Distributional Hypothesis: terms in the same class tend to appear in similar contexts.
Context | Hits with Chicago | Hits with Twisp
"cities including __" | 42,000 | 1
"__ and other cities" | 37,900 | 0
"__ hotels" | 2,000,000 | 1,670
"mayor of __" | 657,000 | 82
50
50 Unsupervised Language Models: precomputed (scalable), handle sparsity.
51
51 Baseline: context vectors. From sentences such as "… cities such as Chicago, Boston, …", "But Chicago isn't the best …", and "… Los Angeles and Chicago.", build a vector of context counts for each extraction (contexts like "such as x, Boston", "But x isn't the", "Angeles and x."), then compute dot products between the vectors of common and sparse extractions [cf. Ravichandran et al. 2005].
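A small Python sketch of this baseline (the toy corpus and the window size are illustrative assumptions):

```python
from collections import Counter

def context_vector(sentences, term, window=3):
    """Count the contexts (up to `window` tokens on each side, with the
    term replaced by 'x') in which `term` appears."""
    contexts = Counter()
    for s in sentences:
        tokens = s.split()
        for i, tok in enumerate(tokens):
            if tok.strip(".,") == term:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                contexts[f"{left} x {right}"] += 1
    return contexts

def dot(v1, v2):
    return sum(v1[c] * v2[c] for c in v1 if c in v2)

corpus = [
    "cities such as Chicago , Boston ,",
    "But Chicago isn't the best",
    "Los Angeles and Chicago .",
    "cities such as Twisp ,",
]
chicago = context_vector(corpus, "Chicago")
twisp = context_vector(corpus, "Twisp")
print(dot(chicago, twisp))   # near-zero overlap for the sparse extraction,
                             # which is what motivates the HMM summaries
```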
52
52 HMM Compresses Context Vectors: instead of Twisp's sparse context-count vector (… 0 0 0 1 …), the HMM provides a dense "distributional summary" HMM(Twisp) over hidden states t = 1 … N (e.g., 0.14, 0.01, …, 0.06). It is compact (efficient: 10-50x less data retrieved) and dense (accurate: 23-46% error reduction).
53
53 Experimental Results. Task: ranking sparse TextRunner extractions. Metric: area under the precision-recall curve. Language models reduce the missing area by 39% over the nearest competitor.
Method | Headquartered | Merged | Average
Frequency | 0.710 | 0.784 | 0.713
PL | 0.651 | 0.851 | 0.785
LM | 0.810 | 0.908 | 0.851
54
54 Summary of Thesis Formalization of Monotonic Features (MFs) One MF enables PAC Learnability from unlabeled data alone [Corollary 4.1] MFs provide greater information gain vs. labels as feature space increases in size [Theorem 8] The MF model is formally distinct from other SSL approaches [Theorems 9 and 10] MF model is insufficient when “subconcepts” are present [Proposition 12]
55
55 Summary: MFs (Continued). MFA: a general SSL algorithm for MFs. Given MFs, MFA's performance is equivalent to a state-of-the-art SSL algorithm given 160 labeled examples [Table 2.1]. Even when MFs are not given, MFA can detect MFs in SSL, reducing error by 16% [Figure 2.5]. MFA is not effective for UIE [Table 2.2 & Figure 2.6].
56
56 Summary: URNS. URNS: a formal model of redundancy in IE. It describes how probability increases with MF value [Proposition 13] and models corroboration among multiple extraction mechanisms (multiple urns) [Proposition 14].
57
57 URNS Theoretical Results. Uniform Special Case (USC): odds in the USC increase exponentially with repetition [Theorem 15]; error decreases exponentially when parameters are known [Theorem 16]. Zipfian Case (ZC): closed-form expression for ZC probability given the parameters, and for the odds given repetitions [Theorem 17]; error in the ZC is bounded above by K / n^(1−ε) for any ε > 0 when parameters are known [Theorem 19].
58
58 URNS Theoretical Results (cont.). Zipfian Case (ZC): in the ZC, with probability 1 − δ, the parameters of URNS can be estimated with error approaching 0, given sufficient data [Theorem 20]. In the ZC, URNS guarantees PAC learnability given only unlabeled data, provided the MF is sufficiently informative and a "separability" criterion is met in the concept space [Theorem 21].
59
59 URNS Experimental Results. Supervised learning [Table 3.3]: 19% error reduction over noisy-or, 10% over logistic regression, comparable performance to SVM. Semi-supervised IE [Figure 3.4]: 6% error reduction over LP. Unsupervised IE [Figure 3.2]: 1500% error reduction over noisy-or, 2200% over PMI. Improved efficiency [Table 3.2]: 8x faster than PMI.
60
60 Other Applications of URNS: estimating extraction precision and recall [Table 3.7]; identifying synonymous objects and relations (RESOLVER) [Yates & Etzioni, 2007]; identifying functional relations in text [Ritter et al., 2008].
61
61 Assessing Sparse Extractions. Hidden Markov Model assessor (HMM-T): 23-46% error reduction over context vectors on the typechecking task [Table 4.1]; 28% error reduction over context vectors on sparse unary extractions [Table 4.2]; 10-50x more efficient than context vectors. Sparse extraction assessment with language models: 39% error reduction over previous work [Table 4.3]; massively more scalable than previous techniques.
62
62 Acknowledgements: Oren Etzioni Mike Cafarella Pedro Domingos Susan Dumais Eric Horvitz Alan Ritter Stef Schoenmackers Stephen Soderland Dan Weld
65
65 Web IE without labeled examples. Extraction is sometimes "easy": generic extraction patterns, e.g., …cities such as Chicago… => Chicago ∈ City via "C such as x => x ∈ C" [Hearst, 1992]. But most sentences are "tough": "We walked the tree-lined streets of the bustling metropolis that is Atlanta." Extracting Atlanta ∈ City requires syntactic parsing (Atlanta -> is -> metropolis) and subclass discovery (metropolis(x) => city(x)), which are challenging and difficult to scale, e.g., [Collins, 1997; Snow & Ng 2006].
66
66 Web IE without labeled examples. Extraction is sometimes "easy": …cities such as Chicago… => Chicago ∈ City via "C such as x => x ∈ C" [Hearst, 1992]. But most sentences are "tough": "We walked the tree-lined streets of the bustling metropolis that is Atlanta." Yet "cities such as Atlanta" gets 21,600 hits.
67
67 Web IE without labeled examples. …cities such as Chicago… => Chicago ∈ City via "C such as x => x ∈ C" [Hearst, 1992]. …Bloomberg, mayor of New York City… => (Bloomberg, New York City) ∈ Mayor via "x, C of y => (x, y) ∈ C". The scale and redundancy of the Web makes a multitude of facts "easy" to extract.
68
68 http://www.cs.washington.edu/research/textrunner/ [Banko et al., 2007] TextRunner Search
69
69 But… extraction patterns make errors: "Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from 1964-1971, and…". Task: assess which extractions are correct, without hand-labeled examples, at Web scale. Thesis: "We can assess extraction correctness by leveraging redundancy and probabilistic models."
70
70 Outline: 1) Motivation 2) Background on Web IE 3) Estimating extraction correctness – URNS model of redundancy [Downey et al., IJCAI 2005] (Distinguished Paper Award) 4) Challenge: the "long tail" 5) Machine learning generalization
71
71 Redundancy – Two Intuitions: 1) repetition, 2) multiple patterns.
Phrase | Hits
"Chicago and other cities" | 94,400
"Illinois and other cities" | 23,100
"cities such as Chicago" | 42,500
"cities such as Illinois" | 7
Goal: a formal model of these intuitions. Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C?
72
72 Redundancy: Single Pattern. Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C? Consider a single pattern suggesting C, e.g., "countries such as x". If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?
73
73 “…countries such as Saudi Arabia…” “…countries such as the United States…” “…countries such as Saudi Arabia…” “…countries such as Japan…” “…countries such as Africa…” “…countries such as Japan…” “…countries such as the United Kingdom…” “…countries such as Iraq…” “…countries such as Afghanistan…” “…countries such as Australia…” C = Country n = 10 occurrences Redundancy: Single Pattern
74
74 Naïve Model: Noisy-Or (C = Country, n = 10)
Extraction | k
Saudi Arabia | 2
Japan | 2
United States | 1
Africa | 1
United Kingdom | 1
Iraq | 1
Afghanistan | 1
Australia | 1
p = probability the pattern yields a correct extraction, e.g., p = 0.9
P_noisy-or(x ∈ C | x seen k times) = 1 − (1 − p)^k, i.e., 0.99 for k = 2 and 0.9 for k = 1 [Agichtein & Gravano, 2000; Lin et al. 2003]
Noisy-or ignores: sample size (n) and the distribution of C.
75
75 Needed in Model: Sample Size. With C = Country and n ≈ 50,000, frequent correct extractions (United States, k = 3899; China, k = 1999) get P_noisy-or = 0.9999…, but every singleton error (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean; k = 1) still gets 0.9, just as in the n = 10 sample. As sample size increases, noisy-or becomes inaccurate.
76
76 Needed in Model: Distribution of C. A first fix normalizes by sample size: P_freq(x ∈ C | x seen k times) = 1 − (1 − p)^(k/n).
77
77 With P_freq (C = Country, n ≈ 50,000), the singleton errors (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean) drop to ≈ 0.05, while United States (k = 3899) and China (k = 1999) remain at 0.9999….
78
78 But the probability that x ∈ C depends on the distribution of C: with the same n ≈ 50,000 and the same P_freq values (0.9999… for frequent extractions, ≈ 0.05 for singletons), C = City (New York 1488, Chicago 999, …, El Estor 1, Nikki Ragaz 1, Villegas 1, Northeastwards 1) and C = Country behave very differently, because their tails contain different proportions of correct extractions.
79
79 My solution: the URNS model. An urn for C = City contains labeled balls (Tokyo, U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K.); each occurrence of "…cities such as Tokyo…" corresponds to a draw from the urn.
80
80 Urn – Formal Definition: C – the set of unique target labels; E – the set of unique error labels; num(C) – the distribution of target labels; num(E) – the distribution of error labels.
81
81 Urn Example (urn for C = City, balls: U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K., …): distribution of target labels num(C) = {2, 2, 1, 1, 1}; distribution of error labels num(E) = {2, 1}.
82
82 Computing Probabilities: if an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?
83
83 Computing Probabilities: given that an extraction x appears k times in n draws from the urn (with replacement), the probability that x ∈ C is
P(x ∈ C | k of n) = Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^(n−k) / Σ_{r' ∈ num(C) ∪ num(E)} (r'/s)^k (1 − r'/s)^(n−k),
where s is the total number of balls in the urn.
84
84 Multiple Extraction Patterns (multiple urns): target label frequencies are correlated across urns, while error label frequencies can be uncorrelated.
Phrase | Hits
"Chicago and other cities" | 94,400
"Illinois and other cities" | 23,100
"cities such as Chicago" | 42,500
"cities such as Illinois" | 7
85
85 URNS without labeled data. Needed: num(C) and num(E), assumed to be Zipf-distributed (frequency of the ith element ∝ i^(−z)). With these assumptions, the Zipfian parameters for any class C can be learned from unlabeled data alone.
86
86 URNS without labeled data: the observed frequency distribution is a mixture – with probability p a draw comes from C's Zipf distribution, and with probability 1 − p from E's Zipf distribution. p and num(E) are constant across classes C for a given pattern, so num(C) can be learned from unlabeled data.
87
87 Probabilities Assigned by URNS (n ≈ 50,000): for C = City, frequent extractions (New York 1488, Chicago 999) get P_URNS = 0.9999… and singletons (El Estor, Nikki Ragaz, Villegas, Cres, Northeastwards) get ≈ 0.63; for C = Country, frequent extractions (United States 3899, China 1999) get 0.9999… while singletons (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean, New Zeland) get ≈ 0.03.
88
88 Probability Accuracy: URNS's probabilities are 15-22x closer to optimal.
89
89 Scalability: computation is efficient. Continuous Zipf and Poisson approximations give a closed-form expression for P(x ∈ C | evidence). Compared with Pointwise Mutual Information (PMI) [Etzioni et al. 2005], which is computed with search engine hit counts (inspired by [Turney, 2000]), URNS requires no hit-count queries (~8x faster).
90
90 URNS: Contributions. A probabilistic model of redundancy that is accurate without hand-labeled examples (15-22x improvement in accuracy) and scalable (8x faster) [Downey et al., IJCAI 2005].
91
91 1) Motivation 2) Background on Web IE 3) Estimating extraction correctness 4) Challenge: The “long tail” Language models to the rescue [Downey et al., ACL 2007] 5) Machine learning generalization Outline
92
92 Challenge: the "long tail". Frequent extractions tend to be correct, e.g., (Bloomberg, New York City); sparse extractions are a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington) and (Ronald McDonald, McDonaldland).
93
93 Mayor McCheese
94
94 Assessing Sparse Extractions. Strategy: 1) model how common extractions occur in text; 2) rank sparse extractions by fit to the model. Unsupervised language models: precomputed (scalable), handle sparsity.
95
95 The “distributional hypothesis”: Instances of the same relationship tend to appear in similar contexts. …David B. Shaver was elected as the new mayor of Pickerington, Ohio. http://www.law.capital.edu/ebriefsarchive/Summer2004/ClassActionsLeft.asp …Mike Bloomberg was elected as the new mayor of New York City. http://www.queenspress.com/archives/coverstories/2001/issue52/coverstory.htmwww.queenspress.com/archives/coverstories/2001/issue52/coverstory.htm Assessing Sparse Extractions
96
96 Type checking. Type errors are common: "Alexander the Great conquered Egypt…" => (Great, Egypt) ∈ Conquered; "Locally acquired malaria is now uncommon…" => (Locally, malaria) ∈ Acquired.
97
97 Baseline: context vectors. From sentences such as "… cities such as Chicago, Boston, …", "But Chicago isn't the best …", and "… Los Angeles and Chicago.", build a vector of context counts for each extraction (contexts like "such as x, Boston", "But x isn't the", "Angeles and x."), then compute dot products between the vectors of common and sparse extractions [cf. Ravichandran et al. 2005].
98
98 Baseline: context vectors (2). Miami has a large context vector with many non-zero counts (over contexts such as "when he visited X", "he visited X and", "visited X and other", "X and other cities"), while Twisp's vector is almost entirely zeros. Problems: the vectors are large, and their intersections are sparse.
99
99 Hidden Markov Model (HMM). [Figure: hidden states t_i, t_{i+1}, t_{i+2}, t_{i+3} emitting the words "cities such as Seattle"] States are unobserved; words are observed. Hidden states t_i ∈ {1, …, N}, with N fairly small. Train on unlabeled data: P(t_i | w_i = w) is an N-dimensional distributional summary of w; compare extractions using KL divergence.
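A Python sketch of the comparison step only: the HMM training itself (e.g., Baum-Welch over the unlabeled corpus) is not shown, so the P(t | w) summaries below are hypothetical, and the KL direction and the small N are illustrative choices.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the N hidden states."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical distributional summaries P(t | w) over N = 5 hidden states,
# of the kind an already-trained HMM might produce.
summary = {
    "Chicago":      [0.70, 0.10, 0.05, 0.05, 0.10],
    "Pickerington": [0.60, 0.15, 0.10, 0.05, 0.10],
    "Star Wars":    [0.05, 0.05, 0.70, 0.15, 0.05],
}

seed = summary["Chicago"]
for word in ("Pickerington", "Star Wars"):
    print(word, round(kl_divergence(summary[word], seed), 3))
# The sparse city ends up much closer to the seed city than the non-city does.
```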
100
100 HMM Compresses Context Vectors: instead of Twisp's sparse context-count vector (… 0 0 0 1 …), the distributional summary P(t | Twisp) is a dense distribution over hidden states t = 1 … N (e.g., 0.14, 0.01, …, 0.06). It is compact (efficient: 10-50x less data retrieved) and dense (accurate: 23-46% error reduction).
101
101 Example: is Pickerington of the same type as Chicago? Chicago occurs overwhelmingly in contexts like ", Illinois" while Pickerington's few occurrences are in contexts like ", Ohio", so the two context vectors share no non-zero entries => context vectors say no, the dot product is 0!
102
102 Example (continued): the HMM generalizes across "Chicago, Illinois" and "Pickerington, Ohio".
103
103 Experimental Results. Task: ranking sparse TextRunner extractions. Metric: area under the precision-recall curve. Language models reduce the missing area by 39% over the nearest competitor.
Method | Headquartered | Merged | Average
Frequency | 0.710 | 0.784 | 0.713
PL | 0.651 | 0.851 | 0.785
LM | 0.810 | 0.908 | 0.851
104
104 REALM: Contributions. No hand-labeled data; scalability (language models are precomputed => can be queried at interactive speed); improved accuracy over previous work [Downey et al., ACL 2007].
105
105 1) Motivation 2) Background on Web IE 3) Estimating extraction correctness 4) Challenge: The “long tail” 5) Machine learning generalization Monotonic Features [Downey et al., 2008 (submitted)] Outline
106
106 Common Structure
Task | Hint | Bootstrap
Web IE | "x, C of y" | Distributional Hypothesis
Word Sense Disambiguation | "plant and animal species" | one sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval | search query | pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification | topic word, e.g. "politics" | semi-supervised learning [McCallum & Nigam, 1999; Gliozzo, 2005]
107
107 Common Structure
Task | Hint | Bootstrap
Web IE | "x, C of y" | Distributional Hypothesis
Word Sense Disambiguation | "plant and animal species" | one sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval | search query | pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification | topic word, e.g. "politics" | bag-of-words and EM [McCallum & Nigam, 1999; Gliozzo, 2005]
The shared structure: classification of examples x = (x_1, …, x_d) into classes y ∈ {0, 1}, where the "hint" is the identity of a monotonic feature x_i such that P(y = 1 | x_i) increases strictly monotonically with x_i.
108
108 Classical Supervised Learning: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y). [Figure: labeled points in the (x_1, x_2) feature space, with a "?" marking an unlabeled query point]
109
109 Semi-Supervised Learning (SSL): learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x). [Figure: the same (x_1, x_2) space with added unlabeled points]
110
110 Monotonic Features: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x). [Figure: unlabeled points in the (x_1, x_2) space]
111
111 Monotonic Features: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x). [Figure: the same points, partially labeled using x_1]
112
112 Monotonic Features: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x). [Figure: the same points, fully labeled using x_1]
113
113 Exploiting MF Structure: 1. No labeled data, MFs given (MA): with noisy labels from the MFs, train any classifier. 2. Labeled data, no MFs given (MA-SSL): detect MFs from the labeled data, then run MA (see the sketch below). 3. Labeled data and MFs given (MA-BOTH): run MA with the given and detected MFs.
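A Python sketch of the MF-detection step in MA-SSL. The Spearman-correlation test and its threshold are illustrative assumptions (the thesis's own detector may differ); they are used here because a strictly monotonic P(y=1 | x_i) implies a positive rank correlation between x_i and y.

```python
import numpy as np
from scipy.stats import spearmanr

def detect_mfs(X_labeled, y, threshold=0.4):
    """Return indices of candidate monotonic features: those whose Spearman
    rank correlation with the labels exceeds the threshold."""
    candidates = []
    for i in range(X_labeled.shape[1]):
        rho, _ = spearmanr(X_labeled[:, i], y)
        if not np.isnan(rho) and rho >= threshold:
            candidates.append(i)
    return candidates

# MA-SSL sketch: detect MFs on the small labeled set, then hand them to the
# MA / MFA recipe (noisy labels from the detected MFs) on the unlabeled set.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 4))
X[:, 0] += 1.5 * y            # feature 0 behaves monotonically with the label
print(detect_mfs(X, y))       # typically [0]
```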
114
114 20 Newsgroups dataset Task: Given text, determine newsgroup of origin (MFs: newsgroup name) Without labeled data: Experimental Results
115
115 Experimental Results: MA-SSL provides a 15% error reduction for 100-400 labeled examples; MA-BOTH provides a 31% error reduction for 0-800 labeled examples.
116
116 Co-training Requires labeled examples and known views Semi-supervised smoothness assumptions Cluster assumption Manifold assumption …both provably distinct from MF structure Relationship to other approaches
117
117 Summary of Results. Best known methods for IE without labeled data: probabilities of correctness (URNS), with massive improvements in accuracy (15-22x); handling sparse data (language models), vastly more scalable than previous work, with accuracy wins (39% error reduction). Generalization beyond IE: the Monotonic Feature abstraction is widely applicable, with accuracy wins in document classification.
118
118 Conclusions and Future Work: from IE to Web IE. But we still need: a coherent knowledge base – is MayorOf(Chicago, Daley) the same "Chicago" as Starred-in(Chicago, Zeta-Jones)? Future work: entity resolution, schema discovery. Improved accuracy and coverage: currently we ignore character/document features, recursive structure, etc.; future work: more sophisticated language models (e.g., PCFGs).
119
119 Thanks! Acknowledgements: Oren Etzioni Mike Cafarella Pedro Domingos Susan Dumais Eric Horvitz Stef Schoenmackers Dan Weld
120
120 Self-Supervised Learning
Setting | Input Examples | Output
Supervised | Labeled | Classifier
Semi-supervised | Labeled & Unlabeled | Classifier
Self-supervised | Unlabeled | Classifier
Unsupervised | Unlabeled | Clustering
121
121 Future Work. Language modeling for IE: REALM is simple and ignores character- or document-level features, Web structure, and recursive structure (PCFGs). Goal: "x won an Oscar for playing a villain…" – what is P(x)? From facts to knowledge: entity resolution and inference.
122
122 Named Entity Location Lexical Statistics improve state of the art [Downey et al., IJCAI 2007] Modeling Web Search Characterizing user behavior [Downey et al., SIGIR 2007] (poster) [Liebling et al., 2008] (submitted) Predictive models [Downey et al., IJCAI 2007] Other Work
123
123 Web Fact-Finding Who has won three or more Academy Awards?
124
124 Web Fact-Finding. Problems: the user has to pick the right words, often a tedious process: "world foosball champion in 1998" – 0 hits; "world foosball champion" 1998 – 2 hits, no answer. What if I could just ask for P(x) in "x was world foosball champion in 1998"? How far can language modeling and the distributional hypothesis take us?
125
125 KnowItAll Hypothesis vs. Distributional Hypothesis. [Table: hit counts for Miami, Twisp, and Star Wars in the contexts "X soundtrack", "he visited X and", "cities such as X", "X and other cities", "X lodging"]
126
126 KnowItAll Hypothesis vs. Distributional Hypothesis (continued). [Table: the same hit counts for Miami, Twisp, and Star Wars across the five contexts]
127
127 TextRunner results for "invent", retrieved in real time and ranked by frequency. REALM improves precision of the top 20 extractions by an average of 90%.
128
128 Improving TextRunner: Example (1), "headquartered", top 10.
REALM (precision 100%): Tarantella, Santa Cruz; International Business Machines Corporation, Armonk; Mirapoint, Sunnyvale; ALD, Sunnyvale; PBS, Alexandria; General Dynamics, Falls Church; Jupitermedia Corporation, Darien; Allegro, Worcester; Trolltech, Oslo; Corbis, Seattle.
TextRunner (precision 40%): company, Palo Alto; held company, Santa Cruz; storage hardware and software, Hopkinton; Northwestern Mutual, Tacoma; 1997, New York City; Google, Mountain View; PBS, Alexandria; Linux provider, Raleigh; Red Hat, Raleigh; TI, Dallas.
129
129 Improving TextRunner: Example (2), "conquered", top 10.
REALM (precision 90%): Arabs, Rhodes; Arabs, Istanbul; Assyrians, Mesopotamia; Great, Egypt; Assyrians, Kassites; Arabs, Samarkand; Manchus, Outer Mongolia; Vandals, North Africa; Arabs, Persia; Moors, Lagos.
TextRunner (precision 60%): Great, Egypt; conquistador, Mexico; Normans, England; Arabs, North Africa; Great, Persia; Romans, part; Romans, Greeks; Rome, Greece; Napoleon, Egypt; Visigoths, Suevi Kingdom.
130
130 Previous n-gram technique (1): 1) form a context vector for each extracted argument from sentences such as "… cities such as Chicago, Boston, …", "But Chicago isn't the best", "… Los Angeles and Chicago." (contexts like "such as _, Boston", "But _ isn't the", "Angeles and _."); 2) compute dot products between extractions and seeds in this space [cf. Ravichandran et al. 2005].
131
131 Previous n-gram technique (2): Miami has a large context vector with many non-zero counts (over contexts such as "when he visited X", "he visited X and", "visited X and other", "X and other cities"), while Twisp's vector is almost entirely zeros. Problems: the vectors are large, and their intersections are sparse.
132
132 Compressing Context Vectors: instead of Miami's large context-count vector, the latent state distribution P(t | Miami) is a dense distribution over hidden states t = 1 … N (e.g., 0.14, 0.01, …, 0.06). It is compact (efficient: 10-50x less data retrieved) and dense (accurate: 23-46% error reduction).
133
133 Example: N-Grams on Sparse Data. Is Pickerington of the same type as Chicago? Chicago occurs overwhelmingly in contexts like ", Illinois" while Pickerington's few occurrences are in contexts like ", Ohio", so the two context vectors share no non-zero entries => n-grams say no, the dot product is 0!
134
134 Example: HMM-T on Sparse Data. The HMM generalizes across "Chicago, Illinois" and "Pickerington, Ohio".
135
135 HMM-T Limitations: learning iterations take time proportional to (corpus size × T^(k+1)), where T is the number of latent states and k is the HMM order. We use limited values, T = 20 and k = 3: sufficient for typechecking (Santa Clara is a city), but too coarse for relation assessment (Santa Clara is where Intel is headquartered).
136
136 The REALM Architecture: two steps for assessing R(arg1, arg2). Typechecking: ensure arg1 and arg2 are of the proper type for R (catches errors like MayorOf(Intel, Santa Clara)); leverages all occurrences of each argument. Relation Assessment: ensure R actually holds between arg1 and arg2 (catches errors like MayorOf(Giuliani, Seattle)). Both steps use pre-computed language models => scales to Open IE.
137
137 Type checking isn’t enough NY Mayor Giuliani toured downtown Seattle. Want: How do arguments behave in relation to each other? Relation Assessment
138
138 REL-GRAMS (1). N-gram language model: P(w_i, w_{i−1}, …, w_{i−k}). But arg1 and arg2 are often far apart => k must be large (inaccurate).
139
139 REL-GRAMS (2). Relational Language Model (REL-GRAMS): for any two arguments e_1, e_2, model P(w_i, w_{i−1}, …, w_{i−k} | w_i = e_1, e_1 near e_2). k can be small, yet REL-GRAMS still captures entity relationships. Mitigate sparsity with the BM25 metric (from IR). Combine with HMM-T by multiplying ranks, as sketched below.
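A Python sketch of the rank-combination step; the scores and the assumption that a higher score is better are illustrative, and only the "multiply the ranks" idea comes from the slide.

```python
def combine_by_rank_product(scores_a, scores_b):
    """Combine two rankings of the same extractions by multiplying each
    item's rank positions (rank 1 = best); lower products rank higher."""
    def ranks(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {item: i + 1 for i, item in enumerate(ordered)}
    ra, rb = ranks(scores_a), ranks(scores_b)
    return sorted(scores_a, key=lambda item: ra[item] * rb[item])

# Hypothetical HMM-T and REL-GRAMS scores for three candidate extractions.
hmm_t = {"(Intel, Santa Clara)": 0.90, "(Giuliani, Seattle)": 0.80,
         "(Bloomberg, New York City)": 0.85}
relgrams = {"(Intel, Santa Clara)": 0.20, "(Giuliani, Seattle)": 0.30,
            "(Bloomberg, New York City)": 0.90}
print(combine_by_rank_product(hmm_t, relgrams))
```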
140
140 Experiments. Task: re-rank sparse TextRunner extractions for Conquered, Founded, Headquartered, Merged. REALM vs.: TextRunner (TR) – frequency ordering (equivalent to PMI [Etzioni et al., 2005] and URNS [Downey et al., 2005]); Pattern Learning (PL) – based on Snowball [Agichtein 2000]; and HMM-T and REL-GRAMS in isolation.
141
141 Learning num(C) and num(E) from untagged data is an ill-posed problem: num(C) can vary wildly with C (e.g., countries vs. cities vs. mayors). Assume: 1) consistent precision of a single co-occurrence, e.g., in a randomly drawn phrase "C such as x", x ∈ C about p of the time (0.9 for [Etzioni, 2005]); 2) num(E) is constant for all C; 3) num(C) is Zipf. Then estimate num(C) from untagged data using EM [Downey et al. 2005] (also: multiple contexts). A simplified sketch of the Zipf assumption follows.
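A simplified Python sketch of the Zipf assumption only: it builds urn contents num(C) and num(E) whose i-th most frequent label has mass proportional to i^(-z). The parameter values are hypothetical and the EM fitting of z itself is not shown; the counts produced here are the kind of input the urn probability sketch earlier would consume.

```python
import numpy as np

def zipf_num(num_unique, z, total_balls):
    """Zipf-shaped urn contents: the i-th most frequent label receives a
    share of total_balls proportional to i ** (-z) (rounded, at least 1)."""
    weights = np.arange(1, num_unique + 1, dtype=float) ** (-z)
    counts = np.maximum(1, np.round(total_balls * weights / weights.sum()))
    return counts.astype(int).tolist()

# Hypothetical parameters (in the thesis, the exponent for num(C) is learned
# with EM, while p and the error distribution are held fixed across classes):
num_C = zipf_num(num_unique=500, z=1.2, total_balls=45000)   # target labels
num_E = zipf_num(num_unique=5000, z=1.0, total_balls=5000)   # error labels

print(num_C[:5], num_E[:5])   # the few most frequent target / error labels
```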
142
142 URNS without labeled data: p = P(x ∈ C) in "C such as x" is assumed ≈ 0.9 (so errors are drawn with probability 1 − p); the error distribution is assumed large, with Zipf parameter 1.0.
143
143 URNS without labeled data (continued): num(C) can vary wildly (e.g., cities vs. countries) and is learned from unlabeled data using EM.
144
144 Distributional Similarity. Naïve approach – find sentences containing seed1 & seed2 or arg1 & arg2 and compare the context distributions P(w_b, …, w_e | seed1, seed2) and P(w_b, …, w_e | arg1, arg2). But e − b can be large: many parameters and sparse data => inaccuracy. [Figure: sentence windows w_b … w_e surrounding the seed pair and the argument pair]
145
145 http://www.cs.washington.edu/research/textrunner/ TextRunner Search
146
146 Large textual corpora are redundant, and we can use this observation to bootstrap extraction and classification models from minimally labeled, or even completely unlabeled data. Thesis
147
147 Monotonic Features. Supervised classification task: feature space X of d-tuples x = (x_1, …, x_d); binary output space Y = {0, 1}. Inputs: labeled examples D_L = {(x, y)} ~ P(x, y). Output: a concept c: X -> {0, 1} that approximates P(y | x).
148
148 Monotonic Features. Semi-supervised classification task: feature space X of d-tuples x = (x_1, …, x_d); binary output space Y = {0, 1}. Inputs: labeled examples D_L = {(x, y)} ~ P(x, y) (smaller) and unlabeled examples D_U = {(x)} ~ P(x). Output: a concept c: X -> {0, 1} that approximates P(y | x).
149
149 Monotonic Features. Semi-supervised classification task: feature space X of d-tuples x = (x_1, …, x_d); binary output space Y = {0, 1}. Inputs: labeled examples D_L = {(x, y)} ~ P(x, y), unlabeled examples D_U = {(x)} ~ P(x), and monotonic features M ⊆ {1, …, d} (potentially empty) such that P(y=1 | x_i) increases strictly monotonically with x_i for all i ∈ M. Output: a concept c: X -> {0, 1} that approximates P(y | x).
150
150 URNS without labeled data. Problem: num(C) can vary wildly, e.g., cities vs. countries. Assume: num(C) and num(E) are Zipf distributed (frequency of the ith element ∝ i^(−z)); p and num(E) are independent of C. Then learn num(C) from unlabeled data alone with Expectation Maximization.
151
151 20 Newsgroups dataset Task: Given text, determine newsgroup of origin (MFs: newsgroup name) Without labeled data: Experimental Results
152
152 HMM Type-checking: typecheck each argument by comparing the HMM's distributional summaries, and rank arguments in ascending order of f(arg).
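The definition of f(arg) is not reproduced on this slide, so the sketch below uses one plausible instantiation as an assumption: f is the minimum KL divergence between the candidate argument's HMM summary and the summaries of a few seed arguments of the right type.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def typecheck_rank(candidates, seed_summaries):
    """Rank candidate arguments in ascending order of f(arg), where f is
    taken here to be the minimum KL divergence from the candidate's HMM
    summary to any seed's summary (an assumed stand-in for the slide's f)."""
    f = {arg: min(kl(summary, s) for s in seed_summaries)
         for arg, summary in candidates.items()}
    return sorted(candidates, key=f.get)
```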
153
153 Classical Supervised Learning: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y). [Figure: labeled points in the (x_1, x_2) feature space, with a "?" marking an unlabeled query point]
154
154 Semi-supervised Learning (SSL): learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x). [Figure: the same (x_1, x_2) space with added unlabeled points]
155
155 Self-supervised Learning: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given unlabeled examples (x). [Figure: unlabeled points in the (x_1, x_2) space]
156
156 Self-supervised Learning: learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given unlabeled examples (x), where the system labels its own examples. [Figure: the same points, now with system-assigned labels]
157
157 Self-Supervised Learning
Setting | Input Examples | Output
Supervised | Labeled | Classifier
Semi-supervised | Labeled & Unlabeled | Classifier
Self-supervised | Unlabeled | Classifier
Unsupervised | Unlabeled | Clustering
158
158 Monotonic Features. Supervised classification task: feature space X of d-tuples x = (x_1, …, x_d); binary output space Y = {0, 1}. Inputs: labeled examples D_L = {(x, y)} ~ P(x, y). Output: a concept c: X -> {0, 1} that approximates P(y | x).
159
159 Monotonic Features. Semi-supervised classification task: feature space X of d-tuples x = (x_1, …, x_d); binary output space Y = {0, 1}. Inputs: labeled examples D_L = {(x, y)} ~ P(x, y) (smaller) and unlabeled examples D_U = {(x)} ~ P(x). Output: a concept c: X -> {0, 1} that approximates P(y | x).
160
160 Monotonic Features. Semi-supervised classification task: feature space X of d-tuples x = (x_1, …, x_d); binary output space Y = {0, 1}. Inputs: labeled examples D_L = {(x, y)} ~ P(x, y), unlabeled examples D_U = {(x)} ~ P(x), and monotonic features M ⊆ {1, …, d} (potentially empty) such that P(y=1 | x_i) increases strictly monotonically with x_i for all i ∈ M. Output: a concept c: X -> {0, 1} that approximates P(y | x).