
1 Autonomous Web-scale Information Extraction. Doug Downey. Advisor: Oren Etzioni. Department of Computer Science and Engineering, Turing Center, University of Washington.

2 Web Information Extraction. "…cities such as Chicago…" => Chicago ∈ City; C such as x => x ∈ C [Hearst, 1992]. "…Edison invented the light bulb…" => (Edison, light bulb) ∈ Invented; x V y => (x, y) ∈ V. E.g., KnowItAll [Etzioni et al., 2005], TextRunner [Banko et al., 2007], others [Pasca et al., 2007].

3 Identifying correct extractions. "…mayors of major cities such as Giuliani…" => Giuliani ∈ City. Supervised IE: hand-label examples of each concept. Not possible on the Web (far too many concepts) => Unsupervised IE (UIE). How can we automatically identify correct extractions for any concept without hand-labeled data?

4 KnowItAll Hypothesis (KH). Extractions that occur more frequently in distinct sentences in the corpus are more likely to be correct. Repetitions of the same error are relatively rare: "…mayors of major cities such as Giuliani…", "…hotels in popular cities such as Marriott…". Misinformation is the exception rather than the rule: "Elvis killed JFK" – 200 hits; "Oswald killed JFK" – 3000 hits.

5 Redundancy. KH can identify many correct statements because the Web is highly redundant: the same facts are repeated many times, in many ways (e.g., "Edison invented the light bulb" – 10,000 hits), but leveraging the KH is a little tricky => probabilistic model. Thesis: We can identify correct extractions without labeled data using a probabilistic model of redundancy.

6 Outline. 1) Background. 2) KH as a general problem structure: Monotonic Feature Model. 3) URNS model: how does probability increase with repetition? 4) Challenge: the "long tail": unsupervised language models.

7 Classical Supervised Learning. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y). [Figure: examples plotted in the feature space x_1 vs. x_2.]

8 Semi-Supervised Learning (SSL). Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x). [Figure: feature space x_1 vs. x_2.]

9 Monotonic Features. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x). [Figure: feature space x_1 vs. x_2.]

10 Monotonic Features. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x). P(y=1 | x_1) increases with x_1. [Figure: feature space x_1 vs. x_2.]

11 Common Structure (task: monotonic feature).
- UIE: "C such as x" [Etzioni et al., 2005]
- Word Sense Disambiguation: "plant and animal species" [Yarowsky, 1995]
- Information Retrieval: search query [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
- Document Classification: topic word, e.g. "politics" [McCallum & Nigam, 1999; Gliozzo, 2005]
- Named Entity Recognition: contains("Mr.") [Collins & Singer, 1998]

12 Isn't this just ___? The MF model is provably distinct from standard smoothness assumptions in SSL (the cluster assumption and the manifold assumption) => MFs can complement other methods. Unlike co-training, the MF model doesn't require labeled data or pre-defined "views".

13 Theoretical Results. One MF implies PAC-learnability without labeled data, when the MF is conditionally independent of the other features and is minimally informative (a corollary to the co-training theorem [Blum and Mitchell, 1998]). MFs provide more information than labels about unlabeled examples as the feature space grows: as the number of features increases, the information gain due to MFs stays constant, while the information gain due to labeled examples falls (under assumptions).

14 Classification with the MF Model. MFA: given MFs and unlabeled data, use the MFs to produce noisy labels, then train any classifier.
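
One way to read that recipe, as a minimal sketch (Python, not the thesis code; the median threshold and the choice of logistic regression are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def mfa_train(X, mf_column, threshold=None):
        """Train a classifier from unlabeled data using one monotonic feature (MF).

        X         : (n_examples, n_features) array of unlabeled examples.
        mf_column : index of the monotonic feature; P(y=1|x) rises with its value.
        threshold : MF value above which an example gets a noisy positive label
                    (defaults to the median, an arbitrary illustrative choice).
        """
        mf = X[:, mf_column]
        if threshold is None:
            threshold = np.median(mf)
        noisy_y = (mf > threshold).astype(int)   # noisy labels from the MF alone

        # Train on the remaining features so the classifier can generalize
        # beyond the MF itself.
        other = np.delete(X, mf_column, axis=1)
        return LogisticRegression(max_iter=1000).fit(other, noisy_y)

In the 20 Newsgroups experiments that follow, the MF would be something like the count of the newsgroup's name in a document; any other classifier could replace the logistic regression.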

15 Experimental Results. 20 Newsgroups dataset (MF: newsgroup name) vs. two SSL baselines (NB + EM, LP), without labeled data. [Results figure.]

16 Experimental Results. MFA-SSL provides a 15% error reduction for 100-400 labeled examples. MFA-BOTH provides a 31% error reduction for 0-800 labeled examples.

17 Bad News: confusable MFs. For more complex tasks, monotonicity is insufficient. Example: City extractions. MF: extraction frequency with, e.g., "cities such as x" ... but that is also an MF for: has skyscrapers, has an opera house, located on Earth, … (Extraction: MF value) New York: 1488; Chicago: 999; Los Angeles: 859; …; Twisp: 1; Northeast: 1.

18 Performance of MFA in UIE. [Figure.]

19 MFA for SSL in UIE. [Figure.]

20 Outline. 1) Background. 2) KH as a general problem structure: Monotonic Feature Model. 3) URNS model: how does probability increase with repetition? 4) Challenge: the "long tail": unsupervised language models.

21 Redundancy: Single Pattern. Consider a single pattern suggesting C, e.g., "countries such as x". If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

22 Redundancy: Single Pattern. C = Country, n = 10 occurrences: "…countries such as Saudi Arabia…", "…countries such as the United States…", "…countries such as Saudi Arabia…", "…countries such as Japan…", "…countries such as Africa…", "…countries such as Japan…", "…countries such as the United Kingdom…", "…countries such as Iraq…", "…countries such as Afghanistan…", "…countries such as Australia…"

23 Naïve Model: Noisy-Or. P_noisy-or(x ∈ C | x seen k times) = 1 − (1 − p)^k, where p is the probability that the pattern yields a correct extraction, e.g., p = 0.9 [Agichtein & Gravano, 2000; Lin et al., 2003]. For C = Country, n = 10: Saudi Arabia and Japan have k = 2 (P_noisy-or = 0.99); United States, Africa, United Kingdom, Iraq, Afghanistan, and Australia have k = 1 (P_noisy-or = 0.9). Noisy-or ignores the sample size (n) and the distribution of C.
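
The formula is easy to compute directly; a small sketch (Python), with p and k as on the slide:

    def p_noisy_or(k, p=0.9):
        """P(x in C | x seen k times) under the noisy-or model: 1 - (1 - p)^k."""
        return 1.0 - (1.0 - p) ** k

    print(p_noisy_or(2))   # 0.99, the value for Saudi Arabia and Japan above
    print(p_noisy_or(1))   # 0.9, the value for the single-occurrence extractions
    # The weakness exploited on the next slides: the estimate never uses n,
    # so k = 5 and k = 3899 both saturate at essentially 1.0.
    print(p_noisy_or(5), p_noisy_or(3899))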

24 Needed in Model: Sample Size. As the sample size increases, noisy-or becomes inaccurate. With C = Country and n ≈ 50,000, United States (k = 3899) and China (k = 1999) get P_noisy-or = 0.9999…, but extractions seen only once (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean) still get P_noisy-or = 0.9, exactly as in the n = 10 sample.

25 Needed in Model: Distribution of C. Under noisy-or, with C = Country and n ≈ 50,000, United States (k = 3899) and China (k = 1999) score 0.9999… and single-occurrence extractions (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean) score 0.9. Proposed fix: P_freq(x ∈ C | x seen k times) = 1 − (1 − p)^(k/n).

26 Needed in Model: Distribution of C. Under P_freq, with C = Country and n ≈ 50,000, United States (k = 3899) and China (k = 1999) still score 0.9999…, while the single-occurrence extractions (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean) now drop to 0.05.

27 Needed in Model: Distribution of C. The probability that x ∈ C depends on the distribution of C. C = City, n ≈ 50,000: New York (k = 1488) and Chicago (k = 999) score 0.9999…, while El Estor, Nikki Ragaz, Villegas, and Northeastwards (k = 1) score 0.05. C = Country, n ≈ 50,000: United States (k = 3899) and China (k = 1999) score 0.9999…, while OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, and Atlantic Ocean (k = 1) score 0.05.

28 My solution: the URNS Model. An urn for C = City contains labeled balls: Tokyo, U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K.; each occurrence of "…cities such as Tokyo…" corresponds to a draw from the urn.

29 Urn: Formal Definition. C: set of unique target labels. E: set of unique error labels. num(C): distribution of target labels. num(E): distribution of error labels.

30 Urn Example. Urn for C = City with balls U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K., …: distribution of target labels num(C) = {2, 2, 1, 1, 1}; distribution of error labels num(E) = {2, 1}.

31 Computing Probabilities. If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

32 Computing Probabilities. Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C? P(x ∈ C | k appearances in n draws) = Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^(n−k) / Σ_{r' ∈ num(C) ∪ num(E)} (r'/s)^k (1 − r'/s)^(n−k), where s is the total number of balls in the urn.
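
A sketch of that computation (Python); the ratio form assumes the binomial expression above, in which the binomial coefficients cancel:

    def p_in_class(k, n, num_C, num_E):
        """Estimate P(x in C | x appears k times in n draws with replacement).

        num_C, num_E : repetition counts for target and error labels, i.e. the
                       urn's num(C) and num(E); s is the total ball count.
        """
        s = sum(num_C) + sum(num_E)

        def weight(r):
            return (r / s) ** k * (1 - r / s) ** (n - k)

        target = sum(weight(r) for r in num_C)
        error = sum(weight(r) for r in num_E)
        return target / (target + error)

    # Toy urn from the example slide: num(C) = {2, 2, 1, 1, 1}, num(E) = {2, 1}.
    print(p_in_class(k=2, n=10, num_C=[2, 2, 1, 1, 1], num_E=[2, 1]))

For Web-scale n the terms underflow, so a real implementation would work in log space (or use the continuous approximations mentioned later in the deck).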

33 URNS without labeled data. Needed: num(C) and num(E). Both are assumed to be Zipf-distributed: the frequency of the ith element is proportional to i^(−z). With assumptions, the Zipfian parameters for any class C can be learned from unlabeled data alone.
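
To make the Zipf assumption concrete, here is a small simulation (illustrative Python; the urn sizes, exponent, and 90/10 split are made-up parameters, not learned values): build num(C) and num(E) with frequency proportional to i^(−z), then draw n labels and tally repetition counts k.

    import numpy as np

    def zipf_counts(num_elements, z, total):
        """Integer repetition counts with frequency of the i-th element ~ i^(-z)."""
        weights = np.arange(1, num_elements + 1, dtype=float) ** (-z)
        return np.maximum(1, np.round(total * weights / weights.sum())).astype(int)

    rng = np.random.default_rng(0)
    num_C = zipf_counts(1000, z=1.0, total=9000)   # target labels (~90% of balls)
    num_E = zipf_counts(1000, z=1.0, total=1000)   # error labels (~10% of balls)

    # One ball per repetition; error label ids are offset past the target ids.
    balls = np.concatenate([np.repeat(np.arange(len(num_C)), num_C),
                            np.repeat(len(num_C) + np.arange(len(num_E)), num_E)])
    draws = rng.choice(balls, size=50_000, replace=True)
    label_ids, k = np.unique(draws, return_counts=True)   # each extraction's k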

34 URNS without labeled data. The observed frequency distribution is a mixture: with probability p a draw comes from the target Zipf distribution (C), and with probability 1 − p from the error Zipf distribution (E). p and the error distribution are constant across C for a given pattern, so num(C) can be learned from unlabeled data!

35 Probabilities Assigned by URNS. C = City, n ≈ 50,000: New York (k = 1488) and Chicago (k = 999) get P_URNS = 0.9999…, while El Estor, Nikki Ragaz, Villegas Cres, and Northeastwards (k = 1) get 0.63. C = Country, n ≈ 50,000: United States (k = 3899) and China (k = 1999) get 0.9999…, while New Zealand, OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, and Atlantic Ocean (k = 1) get 0.03.

36 Probability Accuracy. URNS's probabilities are 15-22x closer to optimal. [Figure.]

37 Sensitivity Analysis. URNS assumes num(E) and p are constant. If we alter the parameter choices substantially, URNS still outperforms noisy-or and PMI by at least 8x. It is most sensitive to p; p ≈ 0.85 is relatively consistent across randomly selected classes from WordNet (solvents, devices, thinkers, relaxants, mushrooms, mechanisms, resorts, flies, tones, machines, …).

38 Multiple Extraction Patterns. Multiple urns: target label frequencies are correlated across urns, while error label frequencies can be uncorrelated. Phrase: hits. "Omaha and other cities": 950; "Illinois and other cities": 24,400; "cities such as Omaha": 930; "cities such as Illinois": 6.

39 Benefits from Multiple Urns. Precision at K (K: single urn / multiple urns): 10: 1.0 / 1.0; 20: 0.9875 / 1.0; 50: 0.925 / 0.955; 100: 0.8375 / 0.845; 200: 0.7075 / 0.71. Using multiple urns reduces error by 29%.

40 URNS vs. MFA. [Figure.]

41 URNS + MFA in SSL. MFA-SSL (URNS) reduces error by 6%, on average.

42 URNS: Learnable from unlabeled data. All URNS parameters can be learned from unlabeled data alone [Theorem 20]. URNS implies PAC learnability from unlabeled data alone [Theorem 21], even with confusable MFs (i.e., even without conditional independence), with assumptions.

43 Parameters Learnable (1). We can express the URNS model as a compound Poisson process. The mixture g_C(λ) + g_E(λ) can be learned, given enough samples [Loh, 1993]. Task: learn the power-law distributions g_C(λ) and g_E(λ) from their sum.

44 Parameters Learnable (2). Assume that at sufficiently high frequencies only target elements appear, and at sufficiently low frequencies only errors appear. Then the sum g_C(λ) + g_E(λ) reduces to g_C(λ) in the high-frequency regime and to g_E(λ) in the low-frequency regime, so the two components can be recovered separately.
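
A rough sketch of that separation (Python; the frequency cutoffs and the count data are illustrative assumptions, and a log-log least-squares fit stands in for the estimator in the thesis):

    import numpy as np

    def fit_power_law_exponent(frequencies):
        """Fit z in freq(rank) ~ rank^(-z) by least squares in log-log space."""
        freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
        ranks = np.arange(1, len(freqs) + 1)
        slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
        return -slope

    # Hypothetical repetition counts (k values) for the extractions of one class.
    observed_k = np.array([3899, 1999, 840, 530, 310, 180, 95, 60, 33, 20,
                           9, 6, 4, 3, 2, 2, 1, 1, 1, 1, 1, 1])

    high = observed_k[observed_k >= 20]   # assumed to be almost all targets
    low = observed_k[observed_k <= 2]     # assumed to be almost all errors
    z_C = fit_power_law_exponent(high)    # rough exponent for g_C
    z_E = fit_power_law_exponent(low)     # rough exponent for g_E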

45 Outline. 1) Background. 2) KH as a general problem structure: Monotonic Feature Model. 3) URNS model: how does probability increase with repetition? 4) Challenge: the "long tail": unsupervised language models.

46 Challenge: the "long tail". Frequent extractions tend to be correct, e.g., (Bloomberg, New York City). The long tail of sparse extractions is a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington) and (Ronald McDonald, McDonaldland).

47 Mayor McCheese. [Image.]

48 Assessing Sparse Extractions. Strategy: 1) model how common extractions occur in text; 2) rank sparse extractions by fit to the model.

49 The Distributional Hypothesis. Terms in the same class tend to appear in similar contexts. Context: hits with Chicago / hits with Twisp. "cities including __": 42,000 / 1. "__ and other cities": 37,900 / 0. "__ hotels": 2,000,000 / 1,670. "mayor of __": 657,000 / 82.

50 Unsupervised Language Models. Precomputed (scalable); handle sparsity.

51 Baseline: context vectors. From occurrences such as "…cities such as Chicago, Boston,…", "But Chicago isn't the best…", and "…Los Angeles and Chicago.", build a vector of context counts for Chicago (e.g., "such as x, Boston", "But x isn't the", "Angeles and x."). Compute dot products between the vectors of common and sparse extractions [cf. Ravichandran et al. 2005].
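
A toy version of that baseline (Python; the window size and sentences are illustrative): collect the word windows around each occurrence of a term, count them, and compare two terms by the dot product of their count vectors.

    from collections import Counter

    def context_vector(sentences, term, window=2):
        """Count the word windows around each occurrence of `term`, with the
        term itself replaced by a placeholder x."""
        contexts = Counter()
        for s in sentences:
            words = s.split()
            for i, w in enumerate(words):
                if w == term:
                    left = words[max(0, i - window):i]
                    right = words[i + 1:i + 1 + window]
                    contexts[" ".join(left + ["x"] + right)] += 1
        return contexts

    def dot(v1, v2):
        return sum(count * v2[ctx] for ctx, count in v1.items())

    sents = ["cities such as Chicago , Boston",
             "But Chicago is not the best",
             "Los Angeles and Chicago .",
             "cities such as Twisp , Boston"]
    print(dot(context_vector(sents, "Chicago"), context_vector(sents, "Twisp")))  # 1

At scale these vectors are huge and sparse extractions rarely share contexts with common ones, which motivates the compressed HMM summary on the next slide.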

52 HMM Compresses Context Vectors. Twisp's sparse context vector (… 0 0 0 1) is replaced by HMM(Twisp), a distribution over hidden states t = 1, 2, …, N (e.g., 0.14, 0.01, …, 0.06). The HMM provides a "distributional summary": compact (efficient: 10-50x less data retrieved) and dense (accurate: 23-46% error reduction).

53 Experimental Results. Task: ranking sparse TextRunner extractions. Metric: area under the precision-recall curve (Headquartered / Merged / Average): Frequency 0.710 / 0.784 / 0.713; PL 0.651 / 0.851… / 0.785; LM 0.810 / 0.908 / 0.851. Language models reduce missing area by 39% over the nearest competitor.

54 Summary of Thesis. Formalization of Monotonic Features (MFs): one MF enables PAC learnability from unlabeled data alone [Corollary 4.1]; MFs provide greater information gain vs. labels as the feature space increases in size [Theorem 8]; the MF model is formally distinct from other SSL approaches [Theorems 9 and 10]; the MF model is insufficient when "subconcepts" are present [Proposition 12].

55 Summary: MFs (Continued). MFA: a general SSL algorithm for MFs. Given MFs, MFA's performance is equivalent to a state-of-the-art SSL algorithm with 160 labeled examples [Table 2.1]. Even when MFs are not given, MFA can detect MFs in SSL, reducing error by 16% [Figure 2.5]. MFA is not effective for UIE [Table 2.2 & Figure 2.6].

56 Summary: URNS. URNS: a formal model of redundancy in IE. It describes how probability increases with MF value [Proposition 13] and models corroboration among multiple extraction mechanisms (multiple urns) [Proposition 14].

57 URNS Theoretical Results. Uniform Special Case (USC): odds in the USC increase exponentially with repetition [Theorem 15]; error decreases exponentially when parameters are known [Theorem 16]. Zipfian Case (ZC): closed-form expression for ZC probability given parameters, and for odds given repetitions [Theorem 17]; error in the ZC is bounded above by K / n^(1−ε) for any ε > 0 when parameters are known [Theorem 19].

58 URNS Theoretical Results (cont.). Zipfian Case (ZC): in the ZC, with probability 1 − δ, the parameters of URNS can be estimated with error approaching 0, given sufficient data [Theorem 20]. In the ZC, URNS guarantees PAC learnability given only unlabeled data, provided the MF is sufficiently informative and a "separability" criterion is met in the concept space [Theorem 21].

59 URNS Experimental Results. Supervised learning [Table 3.3]: 19% error reduction over noisy-or; 10% error reduction over logistic regression; comparable performance to SVM. Semi-supervised IE [Figure 3.4]: 6% error reduction over LP. Unsupervised IE [Figure 3.2]: 1500% error reduction over noisy-or; 2200% error reduction over PMI. Improved efficiency [Table 3.2]: 8x faster than PMI.

60 Other Applications of URNS. Estimating extraction precision and recall [Table 3.7]. Identifying synonymous objects and relations (RESOLVER) [Yates & Etzioni, 2007]. Identifying functional relations in text [Ritter et al., 2008].

61 Assessing Sparse Extractions. Hidden Markov Model assessor (HMM-T): error reduction of 23-46% over context vectors on the typechecking task [Table 4.1]; error reduction of 28% over context vectors on sparse unary extractions [Table 4.2]; 10-50x more efficient than context vectors. Sparse extraction assessment with language models: error reduction of 39% over previous work [Table 4.3]; massively more scalable than previous techniques.

62 Acknowledgements: Oren Etzioni, Mike Cafarella, Pedro Domingos, Susan Dumais, Eric Horvitz, Alan Ritter, Stef Schoenmackers, Stephen Soderland, Dan Weld.


65 Web IE without labeled examples. Extraction is sometimes "easy" with generic extraction patterns: "…cities such as Chicago…" => Chicago ∈ City; C such as x => x ∈ C [Hearst, 1992]. But most sentences are "tough": "We walked the tree-lined streets of the bustling metropolis that is Atlanta." Extracting Atlanta ∈ City requires syntactic parsing (Atlanta -> is -> metropolis) and subclass discovery (metropolis(x) => city(x)): challenging and difficult to scale, e.g. [Collins, 1997; Snow & Ng 2006].

66 Web IE without labeled examples. Extraction is sometimes "easy" with generic extraction patterns: "…cities such as Chicago…" => Chicago ∈ City; C such as x => x ∈ C [Hearst, 1992]. But most sentences are "tough": "We walked the tree-lined streets of the bustling metropolis that is Atlanta." Yet "cities such as Atlanta" – 21,600 hits.

67 Web IE without labeled examples. Extraction is sometimes "easy" with generic extraction patterns: "…cities such as Chicago…" => Chicago ∈ City; C such as x => x ∈ C [Hearst, 1992]. "…Bloomberg, mayor of New York City…" => (Bloomberg, New York City) ∈ Mayor; x, C of y => (x, y) ∈ C. The scale and redundancy of the Web makes a multitude of facts "easy" to extract.

68 TextRunner Search. http://www.cs.washington.edu/research/textrunner/ [Banko et al., 2007]

69 But… extraction patterns make errors: "Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from 1964-1971, and…" Task: assess which extractions are correct, without hand-labeled examples, at Web scale. Thesis: "We can assess extraction correctness by leveraging redundancy and probabilistic models."

70 Outline. 1) Motivation. 2) Background on Web IE. 3) Estimating extraction correctness: the URNS model of redundancy [Downey et al., IJCAI 2005] (Distinguished Paper Award). 4) Challenge: the "long tail". 5) Machine learning generalization.

71 Redundancy: Two Intuitions. 1) Repetition; 2) multiple patterns. Phrase: hits. "Chicago and other cities": 94,400; "Illinois and other cities": 23,100; "cities such as Chicago": 42,500; "cities such as Illinois": 7. Goal: a formal model of these intuitions. Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C?

72 Redundancy: Single Pattern. Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C? Consider a single pattern suggesting C, e.g., "countries such as x". If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

73 Redundancy: Single Pattern. C = Country, n = 10 occurrences: "…countries such as Saudi Arabia…", "…countries such as the United States…", "…countries such as Saudi Arabia…", "…countries such as Japan…", "…countries such as Africa…", "…countries such as Japan…", "…countries such as the United Kingdom…", "…countries such as Iraq…", "…countries such as Afghanistan…", "…countries such as Australia…"

74 Naïve Model: Noisy-Or. P_noisy-or(x ∈ C | x seen k times) = 1 − (1 − p)^k, where p is the probability that the pattern yields a correct extraction, e.g., p = 0.9 [Agichtein & Gravano, 2000; Lin et al., 2003]. For C = Country, n = 10: Saudi Arabia and Japan have k = 2 (P_noisy-or = 0.99); United States, Africa, United Kingdom, Iraq, Afghanistan, and Australia have k = 1 (P_noisy-or = 0.9). Noisy-or ignores the sample size (n) and the distribution of C.

75 Needed in Model: Sample Size. As the sample size increases, noisy-or becomes inaccurate. With C = Country and n ≈ 50,000, United States (k = 3899) and China (k = 1999) get P_noisy-or = 0.9999…, but extractions seen only once (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean) still get P_noisy-or = 0.9, exactly as in the n = 10 sample.

76 Needed in Model: Distribution of C. Under noisy-or, with C = Country and n ≈ 50,000, United States (k = 3899) and China (k = 1999) score 0.9999… and single-occurrence extractions (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean) score 0.9. Proposed fix: P_freq(x ∈ C | x seen k times) = 1 − (1 − p)^(k/n).

77 Needed in Model: Distribution of C. Under P_freq, with C = Country and n ≈ 50,000, United States (k = 3899) and China (k = 1999) still score 0.9999…, while the single-occurrence extractions (OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean) now drop to 0.05.

78 Needed in Model: Distribution of C. The probability that x ∈ C depends on the distribution of C. C = City, n ≈ 50,000: New York (k = 1488) and Chicago (k = 999) score 0.9999…, while El Estor, Nikki Ragaz, Villegas, and Northeastwards (k = 1) score 0.05. C = Country, n ≈ 50,000: United States (k = 3899) and China (k = 1999) score 0.9999…, while OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, and Atlantic Ocean (k = 1) score 0.05.

79 My solution: the URNS Model. An urn for C = City contains labeled balls: Tokyo, U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K.; each occurrence of "…cities such as Tokyo…" corresponds to a draw from the urn.

80 Urn: Formal Definition. C: set of unique target labels. E: set of unique error labels. num(C): distribution of target labels. num(E): distribution of error labels.

81 Urn Example. Urn for C = City with balls U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K., …: distribution of target labels num(C) = {2, 2, 1, 1, 1}; distribution of error labels num(E) = {2, 1}.

82 Computing Probabilities. If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

83 Computing Probabilities. Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C? P(x ∈ C | k appearances in n draws) = Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^(n−k) / Σ_{r' ∈ num(C) ∪ num(E)} (r'/s)^k (1 − r'/s)^(n−k), where s is the total number of balls in the urn.

84 Multiple Extraction Patterns. Multiple urns: target label frequencies are correlated across urns, while error label frequencies can be uncorrelated. Phrase: hits. "Chicago and other cities": 94,400; "Illinois and other cities": 23,100; "cities such as Chicago": 42,500; "cities such as Illinois": 7.

85 URNS without labeled data. Needed: num(C) and num(E). Both are assumed to be Zipf-distributed: the frequency of the ith element is proportional to i^(−z). With assumptions, the Zipfian parameters for any class C can be learned from unlabeled data alone.

86 URNS without labeled data. The observed frequency distribution is a mixture: with probability p a draw comes from the target Zipf distribution (C), and with probability 1 − p from the error Zipf distribution (E). p and the error distribution are constant across C for a given pattern, so num(C) can be learned from unlabeled data!

87 Probabilities Assigned by URNS. C = City, n ≈ 50,000: New York (k = 1488) and Chicago (k = 999) get P_URNS = 0.9999…, while El Estor, Nikki Ragaz, Villegas Cres, and Northeastwards (k = 1) get 0.63. C = Country, n ≈ 50,000: United States (k = 3899) and China (k = 1999) get 0.9999…, while New Zealand, OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, and Atlantic Ocean (k = 1) get 0.03.

88 Probability Accuracy. URNS's probabilities are 15-22x closer to optimal. [Figure.]

89 Scalability. Computation is efficient: continuous Zipf and Poisson approximations give a closed-form expression for P(x ∈ C | evidence). Compared with Pointwise Mutual Information (PMI) [Etzioni et al. 2005]: PMI is computed with search engine hit counts (inspired by [Turney, 2000]), whereas URNS requires no hit-count queries (~8x faster).

90 URNS: Contributions. A probabilistic model of redundancy. Accurate without hand-labeled examples (15-22x improvement in accuracy). Scalable (8x faster). [Downey et al., IJCAI 2005]

91 Outline. 1) Motivation. 2) Background on Web IE. 3) Estimating extraction correctness. 4) Challenge: the "long tail": language models to the rescue [Downey et al., ACL 2007]. 5) Machine learning generalization.

92 Challenge: the "long tail". Frequent extractions tend to be correct, e.g., (Bloomberg, New York City). The long tail of sparse extractions is a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington) and (Ronald McDonald, McDonaldland).

93 Mayor McCheese. [Image.]

94 Assessing Sparse Extractions. Strategy: 1) model how common extractions occur in text; 2) rank sparse extractions by fit to the model. Unsupervised language models: precomputed (scalable); handle sparsity.

95 Assessing Sparse Extractions. The "distributional hypothesis": instances of the same relationship tend to appear in similar contexts. "…David B. Shaver was elected as the new mayor of Pickerington, Ohio." (http://www.law.capital.edu/ebriefsarchive/Summer2004/ClassActionsLeft.asp) "…Mike Bloomberg was elected as the new mayor of New York City." (http://www.queenspress.com/archives/coverstories/2001/issue52/coverstory.htm)

96 Type checking. Type errors are common: "Alexander the Great conquered Egypt…" => (Great, Egypt) ∈ Conquered; "Locally acquired malaria is now uncommon…" => (Locally, malaria) ∈ Acquired.

97 Baseline: context vectors (1). From occurrences such as "…cities such as Chicago, Boston,…", "But Chicago isn't the best…", and "…Los Angeles and Chicago.", build a vector of context counts for Chicago (e.g., "such as x, Boston", "But x isn't the", "Angeles and x."). Compute dot products between the vectors of common and sparse extractions [cf. Ravichandran et al. 2005].

98 Baseline: context vectors (2). Miami: (…, 7, 125, 15, 13, …) and Twisp: (…, 0, 0, 0, 1, …) over contexts such as "when he visited X", "he visited X and", "visited X and other", "X and other cities". Problems: the vectors are large, and their intersections are sparse.

99 Hidden Markov Model (HMM). [Figure: hidden states t_i, t_{i+1}, t_{i+2}, t_{i+3} generating the words w_i, w_{i+1}, w_{i+2}, w_{i+3} = "cities such as Seattle".] States are unobserved; words are observed. Hidden states t_i ∈ {1, …, N} (N fairly small). Train on unlabeled data. P(t_i | w_i = w) is an N-dimensional distributional summary of w. Compare extractions using KL divergence.
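
A sketch of the comparison step (Python), assuming the HMM has already been trained so that each word w comes with its state distribution P(t | w); the vectors below are made-up numbers, not trained values:

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        """KL(p || q) between two distributions over the N hidden states."""
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    # Hypothetical distributional summaries P(t | w) with N = 4 states.
    summary = {
        "Chicago":   [0.70, 0.20, 0.05, 0.05],
        "Twisp":     [0.65, 0.25, 0.05, 0.05],
        "Star Wars": [0.05, 0.05, 0.60, 0.30],
    }

    # Rank candidates by divergence from a seed known to be a city:
    # Twisp looks city-like, Star Wars does not.
    for cand in ("Twisp", "Star Wars"):
        print(cand, kl_divergence(summary["Chicago"], summary[cand]))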

100 HMM Compresses Context Vectors. Twisp's sparse context vector (… 0 0 0 1) is replaced by P(t | Twisp), the latent state distribution P(t | w) over t = 1, 2, …, N (e.g., 0.14, 0.01, …, 0.06): compact (efficient: 10-50x less data retrieved) and dense (accurate: 23-46% error reduction).

101 Example. Is Pickerington of the same type as Chicago? Chicago occurs in contexts like "Chicago, Illinois" but not "…, Ohio"; Pickerington occurs in "Pickerington, Ohio" but not "…, Illinois" => context vectors say no, the dot product is 0!

102 Example. The HMM generalizes: "Chicago, Illinois" and "Pickerington, Ohio" lead to similar hidden-state summaries.

103 Experimental Results. Task: ranking sparse TextRunner extractions. Metric: area under the precision-recall curve (Headquartered / Merged / Average): Frequency 0.710 / 0.784 / 0.713; PL 0.651 / 0.851… / 0.785; LM 0.810 / 0.908 / 0.851. Language models reduce missing area by 39% over the nearest competitor.

104 REALM: Contributions. No hand-labeled data. Scalability: language models are precomputed and can be queried at interactive speed. Improved accuracy over previous work. [Downey et al., ACL 2007]

105 Outline. 1) Motivation. 2) Background on Web IE. 3) Estimating extraction correctness. 4) Challenge: the "long tail". 5) Machine learning generalization: Monotonic Features [Downey et al., 2008 (submitted)].

106 Common Structure (task / hint / bootstrap).
- Web IE / "x, C of y" / Distributional Hypothesis
- Word Sense Disambiguation / "plant and animal species" / one sense per context, one sense per discourse [Yarowsky, 1995]
- Information Retrieval / search query / pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
- Document Classification / topic word, e.g. "politics" / semi-supervised learning [McCallum & Nigam, 1999; Gliozzo, 2005]

107 Common Structure (task / hint / bootstrap).
- Web IE / "x, C of y" / Distributional Hypothesis
- Word Sense Disambiguation / "plant and animal species" / one sense per context, one sense per discourse [Yarowsky, 1995]
- Information Retrieval / search query / pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
- Document Classification / topic word, e.g. "politics" / bag-of-words and EM [McCallum & Nigam, 1999; Gliozzo, 2005]
In each case: classification of examples x = (x_1, …, x_d) into classes y ∈ {0, 1}, given the identity of a monotonic feature x_i such that P(y = 1 | x_i) increases strictly monotonically with x_i.

108 Classical Supervised Learning. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y). [Figure: feature space x_1 vs. x_2.]

109 Semi-Supervised Learning (SSL). Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x). [Figure: feature space x_1 vs. x_2.]

110 Monotonic Features. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x). [Figure: feature space x_1 vs. x_2.]

111 Monotonic Features. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x). [Figure: feature space x_1 vs. x_2.]

112 Monotonic Features. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given a monotonic feature x_1 and unlabeled examples (x). [Figure: feature space x_1 vs. x_2.]

113 Exploiting MF Structure. 1. No labeled data, MFs given (MFA): with noisy labels from the MFs, train any classifier. 2. Labeled data, no MFs given (MFA-SSL): detect MFs from the labeled data, then run MFA; a sketch of one way to do the detection step follows. 3. Labeled data and MFs given (MFA-BOTH): run MFA with the given and detected MFs.
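
One plausible way to implement the "detect MFs from labeled data" step (an illustrative Python sketch, not the thesis procedure): score each feature by its rank correlation with the label on the small labeled set and keep the most monotonic ones.

    import numpy as np
    from scipy.stats import spearmanr

    def detect_monotonic_features(X_labeled, y, top_k=1, min_corr=0.3):
        """Return indices of features whose values rise most consistently with y.

        Spearman rank correlation stands in for MF detection here; top_k and
        min_corr are illustrative settings, not values from the thesis.
        """
        scores = []
        for j in range(X_labeled.shape[1]):
            corr, _ = spearmanr(X_labeled[:, j], y)
            scores.append(0.0 if np.isnan(corr) else corr)
        ranked = np.argsort(scores)[::-1]
        return [int(j) for j in ranked[:top_k] if scores[j] >= min_corr]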

114 Experimental Results. 20 Newsgroups dataset. Task: given text, determine the newsgroup of origin (MFs: newsgroup name). Without labeled data. [Results figure.]

115 Experimental Results. MFA-SSL provides a 15% error reduction for 100-400 labeled examples. MFA-BOTH provides a 31% error reduction for 0-800 labeled examples.

116 Relationship to other approaches. Co-training: requires labeled examples and known views. Semi-supervised smoothness assumptions (the cluster assumption and the manifold assumption): both are provably distinct from MF structure.

117 Summary of Results. Best known methods for IE without labeled data: probabilities of correctness (URNS), with massive improvements in accuracy (15-22x); handling sparse data (language models), vastly more scalable than previous work, with accuracy wins (39% error reduction). Generalization beyond IE: the Monotonic Feature abstraction is widely applicable, with accuracy wins in document classification.

118 Conclusions and Future Work. IE => Web IE. But we still need: a coherent knowledge base (is MayorOf(Chicago, Daley) the same "Chicago" as Starred-in(Chicago, Zeta-Jones)? Future work: entity resolution, schema discovery); improved accuracy and coverage (currently we ignore character/document features, recursive structure, etc. Future work: more sophisticated language models, e.g. PCFGs).

119 Thanks! Acknowledgements: Oren Etzioni, Mike Cafarella, Pedro Domingos, Susan Dumais, Eric Horvitz, Stef Schoenmackers, Dan Weld.

120 Self-Supervised Learning. Input examples => output: supervised: labeled => classifier; semi-supervised: labeled & unlabeled => classifier; self-supervised: unlabeled => classifier; unsupervised: unlabeled => clustering.

121 Future Work. Language modeling for IE: REALM is simple and ignores character- or document-level features, Web structure, and recursive structure (PCFGs). Goal: "x won an Oscar for playing a villain…": what is P(x)? From facts to knowledge: entity resolution and inference.

122 Other Work. Named entity location: lexical statistics improve the state of the art [Downey et al., IJCAI 2007]. Modeling Web search: characterizing user behavior [Downey et al., SIGIR 2007 (poster); Liebling et al., 2008 (submitted)]; predictive models [Downey et al., IJCAI 2007].

123 Web Fact-Finding. Who has won three or more Academy Awards?

124 Web Fact-Finding. Problems: the user has to pick the right words, often a tedious process: "world foosball champion in 1998" – 0 hits; "world foosball champion" 1998 – 2 hits, no answer. What if I could just ask for P(x) in "x was world foosball champion in 1998"? How far can language modeling and the distributional hypothesis take us?

125 KnowItAll Hypothesis / Distributional Hypothesis. [Figure: counts for Miami, Twisp, and Star Wars in contexts such as "X soundtrack", "he visited X and", "cities such as X", "X and other cities", "X lodging".]

126 KnowItAll Hypothesis / Distributional Hypothesis. [Figure repeated from the previous slide.]

127 TextRunner, "invent", in real time, ranked by frequency. REALM improves precision of the top 20 extractions by an average of 90%.

128 Improving TextRunner: Example (1). "headquartered", top 10. TextRunner (TR precision: 40%): company, Palo Alto; held company, Santa Cruz; storage hardware and software, Hopkinton; Northwestern Mutual, Tacoma; 1997, New York City; Google, Mountain View; PBS, Alexandria; Linux provider, Raleigh; Red Hat, Raleigh; TI, Dallas. REALM (precision: 100%): Tarantella, Santa Cruz; International Business Machines Corporation, Armonk; Mirapoint, Sunnyvale; ALD, Sunnyvale; PBS, Alexandria; General Dynamics, Falls Church; Jupitermedia Corporation, Darien; Allegro, Worcester; Trolltech, Oslo; Corbis, Seattle.

129 Improving TextRunner: Example (2). "conquered", top 10. TextRunner (TR precision: 60%): Great, Egypt; conquistador, Mexico; Normans, England; Arabs, North Africa; Great, Persia; Romans, part; Romans, Greeks; Rome, Greece; Napoleon, Egypt; Visigoths, Suevi Kingdom. REALM (precision: 90%): Arabs, Rhodes; Arabs, Istanbul; Assyrians, Mesopotamia; Great, Egypt; Assyrians, Kassites; Arabs, Samarkand; Manchus, Outer Mongolia; Vandals, North Africa; Arabs, Persia; Moors, Lagos.

130 Previous n-gram technique (1). 1) Form a context vector for each extracted argument: from occurrences such as "…cities such as Chicago, Boston,…", "But Chicago isn't the best…", and "…Los Angeles and Chicago.", count contexts like "such as x, Boston", "But x isn't the", "Angeles and x.". 2) Compute dot products between extractions and seeds in this space [cf. Ravichandran et al. 2005].

131 Previous n-gram technique (2). Miami: (…, 7, 125, 15, 13, …) and Twisp: (…, 0, 0, 0, 1, …) over contexts such as "when he visited X", "he visited X and", "visited X and other", "X and other cities". Problems: the vectors are large, and their intersections are sparse.

132 Compressing Context Vectors. Miami's context vector (…, 7, 125, 15, 13, …) is replaced by P(t | Miami), the latent state distribution P(t | w) over t = 1, 2, …, N (e.g., 0.14, 0.01, …, 0.06): compact (efficient: 10-50x less data retrieved) and dense (accurate: 23-46% error reduction).

133 Example: N-Grams on Sparse Data. Is Pickerington of the same type as Chicago? Chicago occurs in contexts like "Chicago, Illinois" but not "…, Ohio"; Pickerington occurs in "Pickerington, Ohio" but not "…, Illinois" => n-grams say no, the dot product is 0!

134 Example: HMM-T on Sparse Data. The HMM generalizes: "Chicago, Illinois" and "Pickerington, Ohio" lead to similar hidden-state summaries.

135 HMM-T Limitations. Learning iterations take time proportional to (corpus size × T^(k+1)), where T is the number of latent states and k is the HMM order. We use limited values, T = 20 and k = 3: sufficient for typechecking (Santa Clara is a city), but too coarse for relation assessment (Santa Clara is where Intel is headquartered).
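
For scale, with the values used here T^(k+1) = 20^4 = 160,000, so each learning iteration costs on the order of 160,000 operations per corpus position; raising T or k even slightly quickly becomes impractical.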

136 The REALM Architecture. Two steps for assessing R(arg1, arg2). Typechecking: ensure arg1 and arg2 are of the proper type for R (rules out MayorOf(Intel, Santa Clara)); leverages all occurrences of each argument. Relation Assessment: ensure R actually holds between arg1 and arg2 (rules out MayorOf(Giuliani, Seattle)). Both steps use pre-computed language models => scales to Open IE.

137 Relation Assessment. Type checking isn't enough: "NY Mayor Giuliani toured downtown Seattle." Want: how do the arguments behave in relation to each other?

138 REL-GRAMS (1). N-gram language model: P(w_i, w_{i-1}, …, w_{i-k}). arg1 and arg2 are often far apart => large k (inaccurate).

139 REL-GRAMS (2). Relational Language Model (REL-GRAMS): for any two arguments e_1, e_2: P(w_i, w_{i-1}, …, w_{i-k} | w_i = e_1, e_1 near e_2). k can be small; REL-GRAMS still captures entity relationships. Mitigate sparsity with the BM25 metric (from IR). Combine with HMM-T by multiplying ranks.
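
The final combination is just a product of ranks; a small sketch (Python; the two input rankings are hypothetical):

    def combine_by_rank_product(ranking_a, ranking_b):
        """Merge two rankings of the same extractions by multiplying their
        1-based ranks; smaller products rank higher."""
        rank_a = {x: i + 1 for i, x in enumerate(ranking_a)}
        rank_b = {x: i + 1 for i, x in enumerate(ranking_b)}
        items = set(ranking_a) | set(ranking_b)
        worst = len(items) + 1   # assumption: an unranked item gets the worst rank
        score = {x: rank_a.get(x, worst) * rank_b.get(x, worst) for x in items}
        return sorted(items, key=lambda x: score[x])

    hmm_t = ["(IBM, Armonk)", "(Trolltech, Oslo)", "(1997, New York City)"]
    rel_grams = ["(Trolltech, Oslo)", "(IBM, Armonk)", "(1997, New York City)"]
    print(combine_by_rank_product(hmm_t, rel_grams))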

140 Experiments. Task: re-rank sparse TextRunner extractions for Conquered, Founded, Headquartered, and Merged. REALM vs.: TextRunner (TR), frequency ordering (equivalent to PMI [Etzioni et al, 2005] and URNS [Downey et al, 2005]); Pattern Learning (PL), based on Snowball [Agichtein 2000]; HMM-T and REL-GRAMS in isolation.

141 Learning num(C) and num(E). From untagged data this is an ill-posed problem: num(C) can vary wildly with C (e.g., countries vs. cities vs. mayors). Assume: 1) consistent precision of a single co-occurrence, e.g., in a randomly drawn phrase "C such as x", x ∈ C about p of the time (0.9 for [Etzioni, 2005]); 2) num(E) is constant for all C; 3) num(C) is Zipf. Estimate num(C) from untagged data using EM [Downey et al. 2005]. (Also: multiple contexts.)

142 URNS without labeled data. p = P(x ∈ C) for a phrase "C such as x", assumed ≈ 0.9; the remaining 1 − p of draws come from the error distribution, assumed large with Zipf parameter 1.0.

143 URNS without labeled data. num(C) can vary wildly (e.g., cities vs. countries); it is learned from unlabeled data using EM.

144 Distributional Similarity. Naïve approach: find sentences containing seed1 & seed2 or arg1 & arg2, and compare the context distributions P(w_b, …, w_e | seed1, seed2) and P(w_b, …, w_e | arg1, arg2), where the window runs from position b to position e around the two arguments. But e − b can be large: many parameters and sparse data => inaccuracy.

145 TextRunner Search. http://www.cs.washington.edu/research/textrunner/

146 Thesis. Large textual corpora are redundant, and we can use this observation to bootstrap extraction and classification models from minimally labeled, or even completely unlabeled, data.

147 Monotonic Features. Supervised classification task: a feature space X of d-tuples x = (x_1, …, x_d); a binary output space Y = {0, 1}; inputs: labeled examples D_L = {(x, y)} ~ P(x, y); output: a concept c: X -> {0, 1} that approximates P(y | x).

148 Monotonic Features. Semi-supervised classification task: a feature space X of d-tuples x = (x_1, …, x_d); a binary output space Y = {0, 1}; inputs: labeled examples D_L = {(x, y)} ~ P(x, y) (smaller) and unlabeled examples D_U = {(x)} ~ P(x); output: a concept c: X -> {0, 1} that approximates P(y | x).

149 Monotonic Features. Semi-supervised classification task: a feature space X of d-tuples x = (x_1, …, x_d); a binary output space Y = {0, 1}; inputs: labeled examples D_L = {(x, y)} ~ P(x, y) (potentially empty!), unlabeled examples D_U = {(x)} ~ P(x), and monotonic features M ⊆ {1, …, d} such that P(y=1 | x_i) increases strictly monotonically with x_i for all i ∈ M; output: a concept c: X -> {0, 1} that approximates P(y | x).

150 URNS without labeled data. Problem: num(C) can vary wildly (e.g., cities vs. countries). Assume: num(C) and num(E) are Zipf distributed (frequency of the ith element ∝ i^(−z)); p and num(E) are independent of C. Learn num(C) from unlabeled data alone, with Expectation Maximization.

151 Experimental Results. 20 Newsgroups dataset. Task: given text, determine the newsgroup of origin (MFs: newsgroup name). Without labeled data. [Results figure.]

152 HMM Type-checking. Typecheck each argument by comparing the HMM's distributional summaries: rank arguments in ascending order of f(arg). [Definition of f shown on slide.]

153 Classical Supervised Learning. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y). [Figure: feature space x_1 vs. x_2.]

154 Semi-supervised Learning (SSL). Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x). [Figure: feature space x_1 vs. x_2.]

155 Self-supervised Learning. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given unlabeled examples (x). [Figure: feature space x_1 vs. x_2.]

156 Self-supervised Learning. Learn a function from x = (x_1, …, x_d) to y ∈ {0, 1} given unlabeled examples (x), where the system labels its own examples. [Figure: feature space x_1 vs. x_2.]

157 Self-Supervised Learning. Input examples => output: supervised: labeled => classifier; semi-supervised: labeled & unlabeled => classifier; self-supervised: unlabeled => classifier; unsupervised: unlabeled => clustering.

158 Monotonic Features. Supervised classification task: a feature space X of d-tuples x = (x_1, …, x_d); a binary output space Y = {0, 1}; inputs: labeled examples D_L = {(x, y)} ~ P(x, y); output: a concept c: X -> {0, 1} that approximates P(y | x).

159 Monotonic Features. Semi-supervised classification task: a feature space X of d-tuples x = (x_1, …, x_d); a binary output space Y = {0, 1}; inputs: labeled examples D_L = {(x, y)} ~ P(x, y) (smaller) and unlabeled examples D_U = {(x)} ~ P(x); output: a concept c: X -> {0, 1} that approximates P(y | x).

160 Monotonic Features. Semi-supervised classification task: a feature space X of d-tuples x = (x_1, …, x_d); a binary output space Y = {0, 1}; inputs: labeled examples D_L = {(x, y)} ~ P(x, y) (potentially empty!), unlabeled examples D_U = {(x)} ~ P(x), and monotonic features M ⊆ {1, …, d} such that P(y=1 | x_i) increases strictly monotonically with x_i for all i ∈ M; output: a concept c: X -> {0, 1} that approximates P(y | x).

