A Probabilistic Model of Redundancy in Information Extraction. Doug Downey, Oren Etzioni, Stephen Soderland. University of Washington, Department of Computer Science and Engineering.
2 Information Extraction and the Future of Web Search
3 Motivation for Web IE What universities have active biotech research and in what departments? What percentage of the reviews of the Thinkpad T-40 are positive? The answer is not on any single Web page!
4 Review: Unsupervised Web IE Goal: Extract information on any subject automatically.
5 Review: Extraction Patterns Generic extraction patterns (Hearst ’92): “…Cities such as Boston, Los Angeles, and Seattle…” (“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C). But the pattern alone errs: “Detailed information for several countries such as maps, …” Requiring ProperNoun(head(NP)) filters such cases, yet errors remain: “I listen to pretty much all music but prefer country such as Garth Brooks”
6 Binary Extraction Patterns R(I1, I2) <= “I1, R of I2” Instantiated pattern: Ceo(Person, Company) <= “<Person>, CEO of <Company>” Examples: “…Jeff Bezos, CEO of Amazon…” “…Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from , and…” “…Matt Damon, star of The Bourne Supremacy…”
7 Review: Unsupervised Web IE Goal: Extract information on any subject automatically → generic extraction patterns. Generic patterns can make mistakes → redundancy.
8 Redundancy in Information Extraction In large corpora, the same fact is often asserted multiple times: “…and the rolling hills surrounding Sun Belt cities such as Atlanta” “Atlanta is a city with a large number of museums, theatres…” “…has offices in several major metropolitan cities including Atlanta” Given a term x and a set of sentences about a class C, what is the probability that x ∈ C?
9 Redundancy – Two Intuitions 1) Repetition 2) Multiple extraction mechanisms
Phrase | Hits
“Atlanta and other cities” | 980
“cities such as Atlanta” | 5860
“Canada and other cities” | 286
“cities such as Canada” | 7
Goal: A formal model of these intuitions.
10 Outline 1. Modeling redundancy – the problem 2. URNS model 3. Parameter estimation for URNS 4. Experimental results 5. Summary
11 1. Modeling Redundancy – The Problem Consider a single extraction pattern: “C such as x” Given a term x and a set of sentences about a class C, what is the probability that x ∈ C?
12 1. Modeling Redundancy – The Problem Consider a single extraction pattern: “C such as x” If an extraction x appears k times in a set of n sentences containing this pattern, what is the probability that x ∈ C?
13 Modeling with k Country(x) extractions, n = 10: “…countries such as Saudi Arabia…” “…countries such as the United States…” “…countries such as Saudi Arabia…” “…countries such as Japan…” “…countries such as Africa…” “…countries such as Japan…” “…countries such as the United Kingdom…” “…countries such as Iraq…” “…countries such as Afghanistan…” “…countries such as Australia…”
14 Modeling with k Country(x) extractions, n = 10, with count k for each label: Saudi Arabia, Japan, United States, Africa, United Kingdom, Iraq, Afghanistan, Australia. Noisy-Or Model: P(x ∈ C | k) = 1 − (1 − p)^k, where p is the probability that a single sentence is true (here p = 0.9). Important: – Sample size (n) – Distribution of C } Noisy-or ignores these.
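A minimal sketch of the noisy-or computation just described; the counts are read off the n = 10 Country(x) sample above.

```python
def noisy_or(k: int, p: float = 0.9) -> float:
    """Noisy-or estimate: probability that x is in C given k
    supporting sentences, each true with probability p."""
    return 1.0 - (1.0 - p) ** k

# Counts from the n = 10 Country(x) sample above.
counts = {"Saudi Arabia": 2, "Japan": 2, "United States": 1, "Africa": 1}
for label, k in counts.items():
    print(f"{label}: {noisy_or(k):.2f}")
```

Note that the error “Africa” already receives probability 0.90, which is exactly what the next slides probe: noisy-or looks only at k.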
15 Needed in Model: Sample Size Country(x) extractions, n ≈ 50,000: Japan, Norway, Israil, OilWatch Africa, Religion, Paraguay, Chicken Mole, Republics of Kenya, Atlantic Ocean, New Zeland, … With p = 0.9, noisy-or assigns probability at least 0.9 to every extraction, including the errors. As sample size increases, noisy-or becomes inaccurate.
16–17 Needed in Model: Distribution of C Country(x) extractions, n ≈ 50,000: Japan, Norway, Israil, OilWatch Africa, Religion, Paraguay, Chicken Mole, Republics of Kenya, Atlantic Ocean, New Zeland, … Noisy-or assigns 0.9 to a singleton extraction regardless of n, but given the actual distribution of Country(x) the probability that a singleton is a true country is closer to 0.05.
18 Needed in Model: Distribution of C Country(x) extractions, n ≈ 50,000: Japan, Norway, Israil, OilWatch Africa, … (singleton probability ≈ 0.05) City(x) extractions, n ≈ 50,000: Toronto, Belgrade, Lacombe, Kent County, Nikki Ragaz, Villegas Cres, Northeastwards, … The probability that x ∈ C depends on the distribution of C: City has a long tail of rare true members, so a singleton city extraction is more likely correct than a singleton country extraction.
19 Outline 1. Modeling redundancy – the problem 2. URNS model 3. Parameter estimation for URNS 4. Experimental results 5. Summary
20 2. The URNS Model – Single Urn
21 2. The URNS Model – Single Urn An urn for City(x) contains balls labeled: U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K.
22 2. The URNS Model – Single Urn Each extraction is a draw from the urn: “…cities such as Tokyo…” corresponds to drawing a ball labeled Tokyo.
23 Single Urn – Formal Definition C – set of unique target labels E – set of unique error labels num(b) – number of balls labeled by b ∈ C ∪ E num(B) – distribution giving the number of balls for each label b ∈ B
24 Single Urn Example Urn for City(x): U.K., Sydney, Cairo, Tokyo, Atlanta, Yakima, Utah, U.K. num(“Atlanta”) = 2 num(C) = {2, 2, 1, 1, 1} num(E) = {2, 1} (estimated from data)
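A sketch of the single-urn posterior implied by the definitions above (my formulation via Bayes' rule over labels; it answers the question posed on the next slide):

```python
from math import comb

def p_target(k: int, n: int, num_C: list, num_E: list) -> float:
    """P(x in C | x's label was drawn exactly k times in n draws
    with replacement), marginalizing over which label x is."""
    s = sum(num_C) + sum(num_E)  # total balls in the urn
    def p_k(r):  # P(a label with r balls is drawn exactly k times)
        q = r / s
        return comb(n, k) * q ** k * (1 - q) ** (n - k)
    target = sum(p_k(r) for r in num_C)
    error = sum(p_k(r) for r in num_E)
    return target / (target + error)

# Urn from the example slide: num(C) = {2,2,1,1,1}, num(E) = {2,1}.
print(p_target(k=2, n=10, num_C=[2, 2, 1, 1, 1], num_E=[2, 1]))
```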
25 Single Urn: Computing Probabilities If an extraction x appears k times in a set of n sentences containing a pattern, what is the probability that x ∈ C?
26 Single Urn: Computing Probabilities Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?
27 Uniform Special Case Consider the case where num(c_i) = R_C and num(e_j) = R_E for all c_i ∈ C, e_j ∈ E. Then, with s = |C| R_C + |E| R_E total balls, using a Poisson approximation: P(x ∈ C | k appearances in n draws) ≈ 1 / (1 + (|E|/|C|) (R_E/R_C)^k e^{n (R_C − R_E) / s}). Odds increase exponentially with k, but decrease exponentially with n.
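A sketch of this uniform special case as reconstructed above; the parameter values below are illustrative, not from the talk.

```python
from math import exp

def p_uniform(k, n, C_size, E_size, R_C, R_E):
    """Poisson-approximation posterior for the uniform special case."""
    s = C_size * R_C + E_size * R_E            # total balls
    odds = (C_size / E_size) * (R_C / R_E) ** k * exp(-n * (R_C - R_E) / s)
    return odds / (1.0 + odds)

# Illustrative values: odds grow exponentially with k...
print(p_uniform(k=1, n=10_000, C_size=300, E_size=10_000, R_C=10, R_E=1))
print(p_uniform(k=5, n=10_000, C_size=300, E_size=10_000, R_C=10, R_E=1))
# ...and shrink exponentially with n for fixed k.
print(p_uniform(k=1, n=50_000, C_size=300, E_size=10_000, R_C=10, R_E=1))
```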
28 The URNS Model – Multiple Urns Correlation across extraction mechanisms is higher for elements of C than for elements of E.
29 Outline 1. Modeling redundancy – the problem 2. URNS model 3. Parameter estimation for URNS 4. Experimental results 5. Summary
30 3. Parameter Estimation for URNS Simplifying assumptions: – Assume that num(C) and num(E) are Zipf distributed: the frequency of the ith most repeated label in C is proportional to i^(−z_C). – Then num(C) and num(E) are characterized by five parameters: z_C, z_E, |C|, |E|, and the extraction precision p.
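A sketch of how the five parameters pin down the urn, assuming p fixes the fraction of balls carrying target labels (the concrete sizes below are illustrative). Fractional ball counts are harmless, since only the proportions matter when drawing with replacement.

```python
def zipf_urn(size: int, z: float, total_balls: float) -> list:
    """Ball counts (possibly fractional) for `size` labels whose
    frequencies follow a Zipf law with exponent z."""
    weights = [i ** -z for i in range(1, size + 1)]
    scale = total_balls / sum(weights)
    return [w * scale for w in weights]

# Five parameters: z_C, z_E, |C|, |E|, p.
n_balls, p = 100_000, 0.9
num_C = zipf_urn(size=300, z=1.0, total_balls=p * n_balls)
num_E = zipf_urn(size=10_000, z=1.05, total_balls=(1 - p) * n_balls)
```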
31 Parameter Estimation Supervised learning: – Differential Evolution (maximizing conditional likelihood) Unsupervised learning: – Growing interest in IE without hand-tagged training data (e.g. DIPRE; Snowball; KnowItAll; Riloff and Jones 1999; Lin, Yangarber, and Grishman 2003) – How to estimate num(C) and num(E)?
32 Unsupervised Parameter Estimation Unsupervised learning: – EM, with additional assumptions: |E| = 1,000,000; z_E = 1; p is given (p = 0.9 for KnowItAll patterns)
33 EM Process EM for unsupervised IE: – E-Step: Assign probabilities to extracted facts using URNS. – M-Step: 1. Estimate z_C by linear regression on a log-log scale. 2. Set |C| equal to the expected number of true labels extracted, plus unseen true labels (using Good-Turing estimation).
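A schematic sketch of one EM iteration, reusing the zipf_urn and p_target sketches above; the regression and Good-Turing steps are simplified stand-ins for the steps named on the slide, not the talk's exact procedure.

```python
import math

def em_iteration(counts: dict, n: int, C_size: float, z_C: float,
                 E_size: int = 1_000_000, z_E: float = 1.0, p: float = 0.9):
    """One EM iteration over extraction counts {label: k}."""
    # E-step: probability each extraction is true under current params.
    num_C = zipf_urn(int(C_size), z_C, total_balls=p * n)
    num_E = zipf_urn(E_size, z_E, total_balls=(1 - p) * n)
    probs = {x: p_target(k, n, num_C, num_E) for x, k in counts.items()}

    # M-step 1: re-fit z_C as the negated slope of a least-squares line
    # on the log-log rank/frequency plot of the extractions.
    ks = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ks) + 1)]
    ys = [math.log(k) for k in ks]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    z_C = -slope

    # M-step 2: |C| = expected number of true labels seen, plus a
    # Good-Turing style allowance for unseen labels (simplified here
    # to the probability mass on singleton extractions).
    seen = sum(probs.values())
    unseen = sum(probs[x] for x, k in counts.items() if k == 1)
    return probs, seen + unseen, z_C
```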
34 Outline 1. Modeling redundancy – the problem 2. URNS model 3. Parameter estimation for URNS 4. Experimental results 5. Summary
35 4. Experimental Results Previous approach: PMI (in KnowItAll, inspired by Turney, 2001): PMI(“<x> hotels”, “Tacoma”) = Hits(“Tacoma hotels”) / Hits(“Tacoma”) – Expensive: several hit-count queries per extraction – Using URNS improves efficiency by ~8x – ‘Bootstrapped’ training data not representative – Probabilities are polarized (Naïve Bayes)
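An illustrative sketch of the PMI check; the ratio form follows KnowItAll's published PMI definition, and the hit count for “Tacoma” alone is a made-up placeholder (only the 42,000 figure appears later in the talk).

```python
def pmi(discriminator_hits: int, instance_hits: int) -> float:
    """KnowItAll-style PMI: hits for the discriminator phrase containing
    the instance, divided by hits for the instance alone."""
    return discriminator_hits / instance_hits if instance_hits else 0.0

# "Tacoma hotels" = 42,000 hits (quoted later); "Tacoma" count is made up.
print(pmi(discriminator_hits=42_000, instance_hits=2_000_000))
```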
36 Unsupervised Likelihood Performance
37 URNS Robust to Parameter Changes
Parameter setting | URNS improvement over noisy-or | over PMI
z_E = 1, |E| = 10^6, p = 0.9 (defaults) | …x | 19x
z_E = 1.1 | 15x | 19x
z_E = 0.9 | 14x | 18x
|E| = … | …x | 18x
|E| = … | …x | 18x
p = 0.95 | 9x | 12x
p = 0.80 | 8x | 11x
38 Classification Accuracy Loss functions: cost(False +) = cost(False –), and cost(False +) = 9 × cost(False –).
39 Supervised Results URNS outperforms noisy-or by 19% and logistic regression by 10%, but outperforms SVM by less than 1%.
40 Modeling Redundancy – Summary Given a term x and a set of sentences about a class C, what is the probability that x ∈ C?
41 Modeling Redundancy – Summary – URNS model of redundancy in text classification – Parameter learning algorithms – Substantially improved performance for unsupervised IE
42 Pattern Learning City => – cities such as <x> – <x> and other cities – cities including <x> – <x> is a city, etc. But what about: – <x> hotels – headquartered in <x> – the greater <x> area, etc.
45 Pattern Learning (PL) Seed instances (Moscow, Cleveland, London, Mexico City) are sent to a web search engine, which returns context strings, e.g. “…near the city of Cleveland you can find the…”. A pattern is any substring of a context string that includes the seed. From a large collection of context strings, keep the “best” patterns, e.g. “the city of <x>”. Repeat as desired.
47 Which patterns are “best”? Both precision and recall are important, but hard to measure. Estimate precision as (c + k) / (c + n + m) and recall as c / S, where: – The pattern is found for c target seeds and n non-target seeds. – S is the total number of target seeds. – k/m is a prior estimate of pattern precision (so an unseen pattern scores k/m).
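A short sketch of this scoring scheme as reconstructed above; the Laplace-style precision estimate is implied by the k/m prior, and the default prior values here are hypothetical.

```python
def pattern_scores(c: int, n: int, S: int, k: float = 1.0, m: float = 2.0):
    """Estimated precision and recall for a learned pattern.

    c: target seeds the pattern was found for
    n: non-target seeds it was found for
    S: total number of target seeds
    k/m: prior estimate of pattern precision (hypothetical default 0.5)
    """
    precision = (c + k) / (c + n + m)  # Laplace-corrected precision
    recall = c / S                     # fraction of target seeds covered
    return precision, recall

print(pattern_scores(c=3, n=0, S=4))  # e.g. "the city of <x>" on 4 seeds
```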
48 Patterns as Extractors and Discriminators
49 KnowItAll “City” → execute domain-independent extractors (e.g. “cities such as <x>”) against a web search engine → parse web pages → compute PMI with domain-independent discriminators (e.g. “Tacoma and other cities” has 80 hits) → output City(“Tacoma”) with a probability.
50 KnowItAll with Pattern Learning “City” → execute domain-independent and learned extractors (e.g. “headquartered in <x>”) against a web search engine → parse web pages → compute PMI with domain-independent or learned discriminators (e.g. “Tacoma hotels” has 42,000 hits) → output City(“Tacoma”) with a probability.
51 KnowItAll with Pattern Learning Experiment 1 evaluates the learned extractors; Experiment 2 evaluates the learned discriminators.
52 Experiment 1: Learned patterns as extractors Baseline – KnowItAll with domain-independent extractors. Baseline+PL – KnowItAll with both domain-independent and learned extractors. In both cases, domain-independent discriminators. We compare coverage, i.e. the number of instances extracted at a fixed level of precision (0.90).
53 Experiment 1: Learned patterns as extractors (Coverage charts for City and Film.) Adding PL improves coverage by 50% to 80%.
54 Experiment 1: Learned patterns as extractors
Pattern | Correct Extractions | Precision
the cities of | … | …
headquartered in | … | …
for the city of | … | …
in the movie | … | …
the movie starring | … | …
movie review of | … | …
56 Experiment 2: Learned patterns as discriminators Baseline – Uses domain-independent discriminators. Baseline+PL – Uses both domain-independent and learned discriminators. We compare the classification accuracy of the two methods (the fraction of extractions correctly classified as positive or negative) after running two discriminators on each of 300 extractions. Adding PL reduces classification errors by 28% to 35%.
57 Selecting discriminators In Experiment 2, for each extraction we executed a fixed pair of discriminators, chosen as those with the highest precision. This approach can be improved.
60 Selecting discriminators The baseline ordering can be improved in several ways: – Both precision and recall matter for accuracy. – Discriminators can perform better on some extractions than on others, e.g. rare extractions: a high-precision but rare discriminator might falsely return a PMI of zero (e.g. “cities such as Fort Calhoun” has 0 hits), while a more prevalent discriminator could improve accuracy on rare facts (e.g. “Fort Calhoun hotels” has 20 hits). – The system should prioritize uncertain extractions.
61 The Discriminator Selection Problem Goal: given a set of extractions and discriminators, find a policy that maximizes expected accuracy – known as “active classification.” Assume discriminators are conditionally independent (as in Guo, 2002). The general optimization problem is NP-hard, but the MU heuristic is optimal in important special cases and improves performance in practice.
62 The MU Heuristic Greedily choose the action with maximal marginal utility (MU). We can compute MU given: – the discriminator’s precision and recall (adjusted according to the extraction’s hit count) – the system’s current belief in the extraction (similar to Etzioni 1991).
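A schematic sketch of the greedy loop, under my assumption (not the talk's exact formula) that marginal utility is the expected drop in classification error from one more discriminator query; the discriminator statistics and beliefs below are made up.

```python
def bayes_update(belief: float, fired: bool, tpr: float, fpr: float) -> float:
    """Posterior belief that x is in C after one discriminator query,
    assuming discriminators are conditionally independent."""
    if fired:
        num, den = tpr * belief, tpr * belief + fpr * (1 - belief)
    else:
        num = (1 - tpr) * belief
        den = num + (1 - fpr) * (1 - belief)
    return num / den

def marginal_utility(belief: float, tpr: float, fpr: float) -> float:
    """Assumed MU: expected reduction in classification error (with a
    0.5 decision threshold) from running one more discriminator."""
    err = lambda b: min(b, 1 - b)
    p_fire = tpr * belief + fpr * (1 - belief)
    exp_err = (p_fire * err(bayes_update(belief, True, tpr, fpr))
               + (1 - p_fire) * err(bayes_update(belief, False, tpr, fpr)))
    return err(belief) - exp_err

# Greedy policy: always query the (extraction, discriminator) pair with
# the highest marginal utility. All numbers here are illustrative.
beliefs = {"Tacoma": 0.6, "Fort Calhoun": 0.5}
discs = {"<x> hotels": (0.8, 0.1), "cities such as <x>": (0.6, 0.01)}
pick = max(((x, d) for x in beliefs for d in discs),
           key=lambda xd: marginal_utility(beliefs[xd[0]], *discs[xd[1]]))
print(pick)
```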
63 Experiment 3: Testing the MU Heuristic As in Experiment 2, the Baseline and Baseline+PL configurations execute two discriminators (ordered by precision) on each of 300 extractions. The MU configurations are constrained to execute the same total number of discriminators (600), but can dynamically choose to execute the discriminator and extraction with the highest marginal utility.
64 Experiment 3: Testing the MU Heuristic Ordering by MU further reduces classification errors by 19% to 35%, for a total error reduction of 47% to 53%.
65 Summary Pattern Learning – Increased coverage by 50% to 80%. – Decreased errors by 28% to 35%. Theoretical model (MU heuristic) – Decreased errors by an additional 19% to 35%.
66 Extensions to PL Complex patterns – Syntax (Snow and Ng 2004), classifiers (Snowball) – Tend to require good training data Iteration (patterns -> seeds -> patterns -> …) – (Brin 1998; Agichtein and Gravano 2000; Riloff 1999) – Scope creep… URNS?
67 Backup
68 Future Work – Additional experiments: normalization/negative evidence – don’t mistake cities for countries, etc. (e.g. Lin et al. 2003, Thelen & Riloff 2002) – Learning extraction patterns (e.g. DIPRE, Snowball) – Other applications, e.g. PMI applied to synonymy (Turney, 2001)
69 URNS adjusts for sample size and the distribution of C and E (table comparing P_noisy-or and P_URNS across settings of k, p, n, and R_C/R_E).
70 When is URNS effective? URNS works when the confusion region is small.
71 The URNS Model – Multiple Urns Correlation between counts for different extractors is informative:
Phrase | Hits
“Atlanta and other cities” | 980
“cities such as Atlanta” | 5860 (6x)
“Canada and other cities” | 286
“cities such as Canada” | 7 (0.02x)
“Texas and other cities” | 4710
“cities such as Texas” | … (…x)
72 Multi-urn Assumptions Modeling the urns: – z_C, z_E, |C|, |E| the same for all urns. – Different extraction precisions p. Modeling correlation between urns: – Relative frequencies are perfectly correlated for elements of C, and for some elements of E. – The remaining elements of E appear for only one kind of extraction mechanism.
73 Multi-urn Assumptions Let A(x, k, m) denote the event that extraction x is seen k times in urn m; we wish to compute P(x ∈ C | A(x, k_1, 1), …, A(x, k_M, M)). With our assumptions, we can obtain this expression in closed form.
74 Recall – Distribution of C Country(x) extractions, n ≈ 50,000: Japan, Norway, Israil, OilWatch Africa, … (singleton probability ≈ 0.05) City(x) extractions, n ≈ 50,000: Toronto, Belgrade, Lacombe, Kent County, Nikki Ragaz, Villegas Cres, Northeastwards, … The probability that x ∈ C depends on the distribution of C.
75 Untagged Data A mixture of samples from num(C) and num(E). Challenge: estimate num(C) and num(E).
76 Related Work Redundancy in IE – Heuristics/noisy-or models (e.g. Riloff & Jones 1999; Brin 1998; Agichtein & Gravano 2000; Lin et al. 2003) – Supervised models (Skounakis & Craven, 2003) – These do not model n, num(C), num(E) BLOG models (Milch et al. 2004) – Our focus is on IE/text classification; we give algorithms and experimental results
77 Related Work (cont.) CRFs for confidence estimation (Culotta & McCallum, 2004) – Our interest is in combining evidence from multiple extractions.
78 Supervised Results Deviation from the ideal log-likelihood.