Machine Reading of Web Text Oren Etzioni Turing Center University of Washington

2 Rorschach Test

3 Rorschach Test for CS

4 Moore’s Law?

5 Storage Capacity?

6 Number of Web Pages?

7 Number of Facebook Users?

9 Turing Center Foci
- Scale MT to 49,000,000 language pairs
  - 2,500,000-word translation graph
  - P(V → F → C)?
  - PanImages
- Accumulate knowledge from the Web
- A new paradigm for Web search

10 Outline
1. A New Paradigm for Search
2. Open Information Extraction
3. Tractable Inference
4. Conclusions

11 Web Search in 2020?
- Type keywords into a search box?
- Social or "human-powered" search?
- The Semantic Web?
- What about our technology exponentials?
"The best way to predict the future is to invent it!"

12 Intelligent Search
- Instead of merely retrieving Web pages, read 'em!
- Machine Reading = Information Extraction (IE) + tractable inference
- IE(sentence) = who did what? → speaker(Alon Halevy, UW)
- Inference = uncover implicit information → Will Alon visit Seattle?

13 Application: Information Fusion
- What kills bacteria?
- What west-coast nanotechnology companies are hiring?
- Compare Obama's "buzz" versus Hillary's.
- What is a quiet, inexpensive, 4-star hotel in Vancouver?

14 Opine (Popescu & Etzioni, EMNLP '05)
Opinion mining: IE over product reviews
- Informative
- Abundant, but varied
- Textual
Summarize reviews without any prior knowledge of the product category

17 But "Reading" the Web is Tough
- Traditional IE is narrow: it has been applied to small, homogeneous corpora
- No parser achieves high accuracy at Web scale
- No named-entity taggers cover the Web's open-ended set of entities
- No supervised learning: hand-labeling doesn't scale
How about semi-supervised learning?

18 Semi-Supervised Learning
Few hand-labeled examples, but:
- Limit on the number of concepts
- Concepts are pre-specified
- Problematic for the Web
Alternative: self-supervised learning
- Learner discovers concepts on the fly
- Learner automatically labels examples per concept!

19 2. Open IE = Self-Supervised IE (Banko, Cafarella, Soderland, et al., IJCAI '07)

                Traditional IE                 Open IE
Input:          corpus + hand-labeled data     corpus
Relations:      specified in advance           discovered automatically
Complexity:     O(D * R)                       O(D)
                (D documents, R relations)
Text analysis:  parser + named-entity tagger   NP chunker

20 Extractor Overview (Banko & Etzioni, '08)
1. Use a simple model of relationships in English to label extractions
2. Bootstrap a general model of relationships in English sentences, encoded as a CRF
3. Decompose each sentence into one or more (NP1, VP, NP2) "chunks"
4. Use the CRF model to retain the relevant parts of each NP and VP
The extractor is relation-independent!

21 TextRunner Extraction
Extract a triple representing a binary relation, (Arg1, Relation, Arg2), from each sentence:
"Internet powerhouse, EBay, was originally founded by Pierre Omidyar."
→ (EBay, founded by, Pierre Omidyar)
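
To make the chunk-and-extract idea on the last two slides concrete, here is a minimal sketch using NLTK's off-the-shelf tokenizer, tagger, and regexp chunker. It is not the actual TextRunner CRF extractor, just the (NP1, VP, NP2) decomposition; the chunk grammar below is a rough, assumed heuristic.

```python
# A rough sketch of (NP1, VP, NP2) triple extraction -- NOT the actual
# TextRunner CRF extractor, just the chunk-decomposition idea.
# Requires: pip install nltk, plus the 'punkt' and
# 'averaged_perceptron_tagger' data packages.
import nltk

GRAMMAR = r"""
  NP: {<DT>?<JJ>*<NNP|NNPS|NN|NNS>+}             # simple noun phrases
  VP: {<VBD|VBN|VBZ|VBP|VB><RB>*<VBN>?<IN|TO>?}  # verb group + particle
"""

def extract_triples(sentence):
    """Return candidate (arg1, relation, arg2) triples from one sentence."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)
    chunks = [(st.label(), " ".join(w for w, _ in st.leaves()))
              for st in tree.subtrees() if st.label() in ("NP", "VP")]
    # Emit a triple for every adjacent NP, VP, NP run of chunks.
    return [(a1, rel, a2)
            for (l1, a1), (l2, rel), (l3, a2)
            in zip(chunks, chunks[1:], chunks[2:])
            if (l1, l2, l3) == ("NP", "VP", "NP")]

print(extract_triples(
    "Internet powerhouse, EBay, was originally founded by Pierre Omidyar."))
# expected: [('EBay', 'was originally founded by', 'Pierre Omidyar')]
```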

22 Numerous Extraction Challenges
- Drop non-essential info: "was originally founded by" → founded by
- Retain key distinctions: (EBay, founded by, Pierre) ≠ (EBay, founded, Pierre)
- Non-verb relationships: "George Bush, president of the U.S. …"
- Synonymy & aliasing: Albert Einstein = Einstein ≠ Einstein Bros.

23 TextRunner (the Web's 1st Open IE system)
1. Self-Supervised Learner: automatically labels example extractions & learns an extractor
2. Single-Pass Extractor: makes a single pass over the corpus, identifying extractions in each sentence
3. Query Processor: indexes extractions → enables queries at interactive speeds

24 TextRunner Demo

27 Triples (from a sample of 9 million Web pages)
- Triples: 11.3 million
- With well-formed relation: 9.3 million
- With well-formed entities: 7.8 million
  - Abstract: 6.8 million (79.2% correct), e.g., (fruit, contain, vitamins)
  - Concrete: 1.0 million (88.1% correct), e.g., (Oppenheimer, taught at, Berkeley)

28 3. Tractable Inference
Much of textual information is implicit:
I. Entity and predicate resolution
II. Probability of correctness
III. Composing facts to draw conclusions

29 I. Entity Resolution
Resolver (Yates & Etzioni, HLT '07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin '01)
(X, born in, 1941)      (M, born in, 1941)
(X, citizen of, US)     (M, citizen of, US)
(X, friend of, Joe)     (M, friend of, Mary)
P(X = M) ~ shared relations

30 Relation Synonymy
(1, R, 2)    (1, R', 2)
(2, R, 4)    (2, R', 4)
(4, R, 8)    (4, R', 8)
etc.
P(R = R') ~ shared argument pairs
- Unsupervised probabilistic model
- O(N log N) algorithm run on millions of docs
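
The intuition on these two slides can be sketched with a simple overlap score: the more (relation, other-argument) contexts two strings share, the more likely they name the same entity, and the analogous count of shared argument pairs works for relation synonymy. This Jaccard score and tiny triple store are illustrative assumptions, not Resolver's actual probabilistic model.

```python
# Illustrative sketch of Resolver's intuition: two strings are more
# likely synonyms the more (relation, other-arg) contexts they share.
# A toy overlap score, not the paper's unsupervised probabilistic model.

def contexts(triples, name):
    """All (relation, other_arg, slot) contexts in which `name` occurs."""
    ctx = set()
    for a1, rel, a2 in triples:
        if a1 == name:
            ctx.add((rel, a2, "arg1"))
        if a2 == name:
            ctx.add((rel, a1, "arg2"))
    return ctx

def synonymy_score(triples, x, y):
    """Jaccard overlap of shared contexts -- a crude proxy for P(x = y)."""
    cx, cy = contexts(triples, x), contexts(triples, y)
    return len(cx & cy) / len(cx | cy) if cx | cy else 0.0

triples = [
    ("X", "born in", "1941"), ("M", "born in", "1941"),
    ("X", "citizen of", "US"), ("M", "citizen of", "US"),
    ("X", "friend of", "Joe"), ("M", "friend of", "Mary"),
]
print(synonymy_score(triples, "X", "M"))  # 2 shared of 4 contexts -> 0.5
```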

31 II. Probability of Correctness
How likely is an extraction to be correct? Factors to consider include:
- Authoritativeness of the source
- Confidence in the extraction method
- Number of independent extractions

32 Counting Extractions
- Lexico-syntactic patterns (Hearst '92): "…cities such as Seattle, Boston, and…"
- Turney's PMI-IR (ACL '02): PMI ~ co-occurrence frequency → # of search results; # of results → confidence in class membership
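
A sketch of PMI-IR-style scoring under these ideas: how often does a candidate co-occur with a class-indicative Hearst pattern, relative to how often it occurs at all? Here `hits()` is a hypothetical stand-in for a search-engine result count, and the numbers are invented for the demo.

```python
# Sketch of PMI-IR-style class-membership scoring (after Turney, ACL '02).
# `hits()` is a hypothetical search-count API, hard-coded for the demo.
import math

def hits(query):
    # Stand-in for a search-engine result count (invented numbers).
    counts = {'"Seattle"': 5_000_000, '"cities such as Seattle"': 12_000}
    return counts.get(query, 1)

def pmi_score(candidate, pattern="cities such as"):
    """~log P(pattern & candidate) / P(candidate); higher = more city-like."""
    both = hits(f'"{pattern} {candidate}"')
    alone = hits(f'"{candidate}"')
    return math.log(both / alone)

print(pmi_score("Seattle"))  # compare scores across candidates to rank them
```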

33 Formal Problem Statement
If an extraction x appears k times in a set of n distinct sentences, each suggesting that x belongs to C, what is the probability that x ∈ C?
- C is a class ("cities") or a relation ("mayor of")
- Note: we only count distinct sentences!

34 Combinatorial Model ("Urns")
Odds increase exponentially with k, but decrease exponentially with n.
See Downey et al.'s IJCAI '05 paper for formal details.
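
For intuition, here is a sketch of the uniform special case: |C| correct labels, each extracted with per-sentence probability p_c, and |E| error labels with p_e < p_c (all parameter values below are invented; the full model in the paper handles non-uniform repetition rates). The binomial coefficient cancels, so the posterior reduces to a likelihood ratio, and the odds visibly grow exponentially in k and shrink as n grows:

```python
# Uniform special case of the Urns model: a Bayesian binomial mixture.
# Parameter values are illustrative, not taken from the paper.
def urns_posterior(k, n, num_c=1_000, num_e=10_000, p_c=0.01, p_e=0.0001):
    """P(x is correct | x appears k times in n distinct sentences)."""
    like_c = (p_c ** k) * ((1 - p_c) ** (n - k))  # likelihood if x in C
    like_e = (p_e ** k) * ((1 - p_e) ** (n - k))  # likelihood if x in E
    return num_c * like_c / (num_c * like_c + num_e * like_e)

print(urns_posterior(k=3, n=10))      # ~1.0: 3 hits in a small sample
print(urns_posterior(k=3, n=10_000))  # ~0.0: same 3 hits in a huge sample
```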

35 Performance (15x improvement): a self-supervised, domain-independent method

36 Urns is limited on "sparse" facts
- Sparse extractions are a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington) vs. (Ronald McDonald, McDonaldland)
- Extractions seen in many contexts tend to be correct, e.g., (Michael Bloomberg, New York City)

37 Language Models to the Rescue (Downey, Schoenmackers, Etzioni, ACL '07)
- Instead of only lexico-syntactic patterns, leverage all contexts of a particular entity
- Statistical "type check": does Pickerington "behave" like a city? Does Shaver "behave" like a mayor?
- Language model = HMM (built once per corpus)
- Project each string to a point in 20-dimensional space
- Measure the proximity of Pickerington to Seattle, Boston, etc.

38 III. Compositional Inference (work in progress: Schoenmackers, Etzioni, Weld)
Implicit information (2 + 2 = 4):
- TextRunner: (Turing, born in, London)
- WordNet: (London, part of, England)
- Rule: "born in" is transitive through "part of"
- Conclusion: (Turing, born in, England)
Mechanism: an MLN instantiated on the fly
Rules: learned from the corpus (future work)
Inference Demo
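
A toy forward-chaining sketch of the composition step on this slide; the actual mechanism is a Markov Logic Network instantiated on the fly, so this only shows the logical rule born_in(X, A) & part_of(A, B) => born_in(X, B) chased to a fixed point.

```python
# Tiny forward-chaining sketch of the composition rule:
# born_in(X, A) & part_of(A, B) => born_in(X, B).
# Illustrative only -- the real system uses an on-the-fly MLN.
def compose(born_in, part_of):
    derived = set(born_in)
    changed = True
    while changed:  # chase the rule to a fixed point
        changed = False
        new = {(x, b) for (x, a) in derived
                      for (a2, b) in part_of if a == a2}
        if not new <= derived:
            derived |= new
            changed = True
    return derived

textrunner = {("Turing", "London")}  # (Turing, born in, London)
wordnet = {("London", "England")}    # (London, part of, England)
print(compose(textrunner, wordnet))  # includes ('Turing', 'England')
```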

39 KnowItAll Family Tree
WebKB '99, Mulder '01, PMI-IR '01, KnowItAll '04, Urns '05, BE '05, KnowItNow '05, Opine '05, Woodward '06, TextRunner '07, Resolver '07, REALM '07, Inference '08

40 KnowItAll Team Michele Banko Michael Cafarella Doug Downey Alan Ritter Dr. Stephen Soderland Stefan Schoenmackers Prof. Dan Weld Mausam Alumni: Dr. Ana-Maria Popescu, Dr. Alex Yates, and others.

41 Related Work
- Sekine's "preemptive IE"
- Powerset
- Textual entailment
- AAAI '07 Symposium on "Machine Reading"
- Growing body of work on IE from the Web

42 4. Conclusions
Imagine search systems that operate over a (more) semantic space:
- keywords, documents → extractions
- TF-IDF, PageRank → relational models
- Web pages, hyperlinks → entities, relations
Reading the Web → a new search paradigm

44 Machine Reading = unsupervised understanding of text
Much is implicit → tractable inference is key!

45 HMM in more detail
Training: seek to maximize the probability of the corpus w given latent states t, using EM.
[HMM trellis: latent states t_i … t_{i+4} emitting words w_i … w_{i+4}, e.g., "cities such as Los Angeles"]
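
In standard notation, the objective the trellis depicts is the marginal likelihood of the corpus, which EM (Baum-Welch) locally maximizes:

```latex
P(w_1,\dots,w_N) \;=\; \sum_{t_1,\dots,t_N} \; \prod_{i=1}^{N} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```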

46 Using the HMM at Query Time
Given a set of extractions (Arg1, Rln, Arg2), let seeds = the most frequent Args for Rln.
1. The distribution over t is read from the HMM
2. Compute the KL divergence via f(arg, seeds)
3. For each extraction, average f over Arg1 & Arg2
4. Sort "sparse" extractions in ascending order
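
A sketch of the type-check step: compare an argument's latent-state distribution (read from the HMM) to the mean distribution of the seed arguments via KL divergence; smaller divergence means a better type match. The 4-state distributions below are invented stand-ins for the 20-dimensional ones in the talk.

```python
# Sketch of query-time type checking: KL divergence between an
# argument's HMM state distribution and the seeds' mean distribution.
# All distributions here are made-up 4-state stand-ins.
import math

def kl(p, q, eps=1e-9):
    """KL(p || q), smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def type_score(arg_dist, seed_dists):
    """Divergence from the average seed distribution; lower = more typical."""
    k = len(arg_dist)
    mean_seed = [sum(d[j] for d in seed_dists) / len(seed_dists)
                 for j in range(k)]
    return kl(arg_dist, mean_seed)

seattle = [0.7, 0.2, 0.05, 0.05]       # hypothetical city-like state usage
boston = [0.65, 0.25, 0.05, 0.05]
pickerington = [0.6, 0.3, 0.05, 0.05]  # sparse, but city-like
microsoft = [0.1, 0.1, 0.5, 0.3]       # not city-like

seeds = [seattle, boston]
print(type_score(pickerington, seeds))  # small KL: plausible city
print(type_score(microsoft, seeds))     # large KL: fails the type check
```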

47 Language Modeling & Open IE
- Self-supervised
- Illuminating phrases → full context
- Handles sparse extractions

48 Focus: Open IE on Web Text
Advantages:
- "Semantically tractable" sentences
- Redundancy
- Search engines
Challenges:
- Difficult, ungrammatical sentences
- Unreliable information
- Heterogeneous corpus

49 II. Probability of Correctness
How likely is an extraction to be correct?
- Distributional Hypothesis: "words that occur in the same contexts tend to have similar meanings"
- KnowItAll Hypothesis: extractions that occur more frequently in the same informative contexts are more likely to be correct

50 Argument "Type Checking" via HMM
- A relation's arguments are "typed": (Person, Mayor Of, City)
- Training: model the distribution of Person & City contexts in the corpus (Distributional Hypothesis)
- Query time: rank sparse triples by how well each argument's context distribution matches that of its type

51 Silly Example
Prefer (Shaver, Mayor of, Pickerington) over (Spice Girls, Mayor of, Microsoft) because Shaver's contexts are more like "other mayors'" than the Spice Girls', and Pickerington's contexts are more like "other cities'" than Microsoft's.

52 Utilizing HMMs to Check Types
Challenges:
- Argument types are not known
- Can't build a model for each argument type
- "Textual types" are fuzzy
Solution: train an HMM for the corpus using EM & bootstrapping.
REALM improves precision by 90%.

53 MLN
[Architecture diagram: knowledge bases (TextRunner, WordNet) + query formula → find best query → run query → find implied nodes & cliques → results; stages pass along the best KB + query, query results, and new nodes + cliques.]
Example query: Was Turing born in England? BornIn(Turing, England)?
- Inference rule: BornIn(X, city) -> BornIn(X, country), given In(city, country)
- TextRunner: "Turing was born in London" → BornIn(Turing, London)
- WordNet: "London is in England" → In(London, England)
- Inferred: BornIn(Turing, England). Yes! Turing was born in England!