Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison. Oren Etzioni et al. Prepared by Ang Sun, 2009-03-01.



1. Motivation
- Recall from previous papers, such as Snow et al. (2004) on hypernym/hyponym extraction, that their systems' recall is low.
- Motivation of this paper: how can we improve a system's recall and extraction rate without sacrificing precision?

2. KnowItAll: The Baseline System
- KnowItAll is an autonomous, domain-independent system that extracts facts, concepts, and relationships from the Web (Etzioni et al. 2004).
- Overview of KnowItAll's architecture:
  1) Generate: KnowItAll instantiates a set of eight domain-independent extraction patterns (cf. Hearst 1992) to generate candidate facts.
  2) Test: KnowItAll automatically tests the plausibility of the candidate facts it extracts using pointwise mutual information (PMI) statistics, computed by treating the Web as a massive corpus of text.

2. KnowItAll: The Baseline System (cont'd)
- Extractor: automatically instantiates domain-independent extraction patterns into class-specific patterns. The generic "such as" rule is:

    NP1 "such as" NPList2
      & head(NP1) = plural(Class1)
      & properNoun(head(each(NPList2)))
      => instanceOf(Class1, head(each(NPList2)))
      keywords: "plural(Class1) such as"

  For example, if Class1 is set to City, the rule spawns the search-engine query "cities such as", downloads the resulting Web pages, and extracts every proper noun immediately following that phrase as a potential city.
- Search Engine Interface (they actually used the Google API): for the above example, KnowItAll would
  1) automatically formulate queries based on its extraction rules; here it issues the search-engine query "cities such as";
  2) download in parallel all pages named in the engine's results;
  3) apply the Extractor to the appropriate sentences on each downloaded page (a minimal sketch of this Generate step follows).
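To make the Generate step concrete, here is a minimal Python sketch of instantiating the generic "such as" pattern for a class and pulling candidate proper nouns out of a sentence. The pluralization and the regex are illustrative simplifications, not KnowItAll's actual implementation.

import re

def pluralize(word):
    # Naive pluralization for the sketch; KnowItAll's real rules use richer morphology.
    return word[:-1] + "ies" if word.endswith("y") else word + "s"

def instantiate_pattern(class_name):
    # Turn the generic rule's keywords into a class-specific query phrase.
    return pluralize(class_name.lower()) + " such as"

def extract_candidates(sentence, keywords):
    # Crude proper-noun approximation: runs of capitalized words after the cue phrase.
    pattern = re.escape(keywords) + r"\s+((?:[A-Z][\w.-]*\s*,?\s*(?:and\s+)?)+)"
    match = re.search(pattern, sentence)
    if not match:
        return []
    return [np.strip() for np in re.split(r",|\band\b", match.group(1)) if np.strip()]

keywords = instantiate_pattern("City")  # -> "cities such as"
sentence = "We visited cities such as Paris, London and New York last year."
print(extract_candidates(sentence, keywords))  # ['Paris', 'London', 'New York']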

2. KnowItAll: The Baseline System (cont'd)
- Assessor: uses pointwise mutual information (PMI) to assess the likelihood that the Extractor's conjectures are correct, computing PMI(I, D) = |Hits(D + I)| / |Hits(I)| (see the sketch below), where:
  - I: an extracted instance (such as "New York").
  - D: an automatically generated discriminator phrase associated with the class (such as "city of" for the class City). Note that each class has multiple discriminators: they are generated automatically from class names and the keywords of extraction rules, and can also be derived from rules learned using RL (Rule Learning, explained later).
  - Hits: the number of hits the search engine returns for a query.
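A minimal sketch of the PMI computation, assuming a hypothetical hits(query) function that wraps a search engine's hit count; the "<I>" placeholder syntax in the discriminator is also an assumption of this sketch.

def pmi(instance, discriminator, hits):
    # PMI(I, D) = Hits(D + I) / Hits(I): how often the instance occurs with
    # the discriminator, relative to how often it occurs at all.
    both = hits(discriminator.replace("<I>", instance))  # e.g. "city of New York"
    alone = hits(instance)                               # e.g. "New York"
    return both / alone if alone else 0.0

# Toy hit counts standing in for real search-engine queries.
fake_hits = {"city of New York": 1200000, "New York": 50000000}.get
print(pmi("New York", "city of <I>", fake_hits))  # -> 0.024
# High PMI across several discriminators suggests a correct instance.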

3.1 Improving Recall - Method 1: Rule Learning
- Motivation: many of the best extraction rules for a domain do not match a generic pattern. E.g., "headquartered in <city>" is not among Hearst's patterns.
- Learning process:
  1) Start with a set of seed instances generated automatically by KnowItAll's bootstrapping phase.
  2) Extract context strings: given an occurrence of a seed instance i, extract w-k, w1-k, ..., w-1, <class>, w1, ..., wk, where <class> replaces i and names the class of i.
  3) Select the 'best' substrings and convert them into extraction rules.
- Rule evaluation and selection: evaluation is hard because there is no labeled data, and many of the most precise rules have low recall, and vice versa. They address these challenges with two heuristics:
  - H1: prefer substrings that appear in multiple context strings across different seeds. Discarding substrings that matched only a single seed eliminates 96% of the potential rules.
  - H2: penalize substrings whose rules would have many false positives, estimating a rule's precision with a Laplace-style correction, (c + k) / (c + n + m), where c is the number of times the rule matched a seed in the target class, n is the number of times it matched known members of other classes, and k/m is the prior estimate of a rule's precision, obtained by testing a sample of the learned rules' extractions with the KnowItAll Assessor. A sketch of this estimate follows.
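A sketch of the H2 ranking heuristic as an m-estimate consistent with the definitions above; the specific counts are illustrative.

def rule_precision(c, n, k, m):
    # Estimated precision = (c + k) / (c + n + m), where k/m is the prior.
    return (c + k) / (c + n + m)

# A rule that matched 10 target-class seeds and 1 known non-member,
# with a prior precision estimate of k/m = 2/4 = 0.5:
print(rule_precision(c=10, n=1, k=2, m=4))  # -> 0.8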

3.1 Improving Recall - Method 1: Rule Learning (cont'd)
- Contribution of H1: in experiments with the class City, H1 improved extraction rate substantially, increasing the average number of unique cities extracted by a factor of five.
- Contribution of H2: in experiments with the class City, ranking rules by H2 further increased the average number of unique instances extracted (by 63% over H1) and significantly improved average precision (from 0.32 to 0.58).
- Most productive rules learned: (table from the original slide not reproduced here).

3.2 Improving Recall - Method 2: Subclass Extraction
- Observation: many 'facts' involve subclasses (hyponyms) of a class rather than the class itself. For example, only a small fraction of the candidate scientists mentioned on the Web are referred to as "scientists"; many others are referred to as physicists, chemists, etc.
- Basic idea: sequentially instantiate the generic patterns with extracted subclasses.
- Subclass Extraction:
  - Use the same generic patterns as for instance extraction, but require a common-noun test (they use part-of-speech tags and capitalization to screen out proper nouns).
  - Use WordNet's hypernym taxonomy to verify candidate subclasses (see the sketch below).
  - Use PMI to assess them.
  - Use coordinate terms to enhance SE.
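A minimal sketch of the WordNet verification step, using NLTK's WordNet interface (assumes nltk and its WordNet data are installed; this is an illustration, not the paper's code).

from nltk.corpus import wordnet as wn

def is_subclass(candidate, class_name):
    # True if any sense of `candidate` has a sense of `class_name`
    # somewhere along one of its hypernym paths.
    targets = set(wn.synsets(class_name))
    for synset in wn.synsets(candidate):
        for path in synset.hypernym_paths():
            if targets & set(path):
                return True
    return False

print(is_subclass("physicist", "scientist"))  # True in standard WordNet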

3.3 Improving Recall - Method 3: List Extraction
- Motivation: beyond natural-language sentences, the Web contains numerous regularly formatted lists that enumerate many of the elements of a class.
- Finding and extracting list elements:
  1) Prepare seeds (high-probability instances already extracted by KnowItAll).
  2) Select a random subset of k seeds as keywords for a search-engine query.
  3) Download the documents corresponding to the engine's results.
  4) Repeat 5,000–10,000 times with different random subsets.
  5) Search each document for a regularly formatted list that includes the keywords.
- Assessing the accuracy of extracted instances:
  - Method 1: treat list membership as a feature and use a naive Bayes classifier to determine the accuracy of an instance.
  - Method 2: model each list as a uniform distribution, run EM to determine the probability that a randomly drawn element of a list is a class member, and then use naive Bayes to compute the probability of each instance.
  - Method 3: simply sort the instances by the number of lists in which they appear (sketched below).
  - Method 4 (LE + A): Method 3 plus PMI assessment.
  In their experiments, Method 3 outperformed Methods 1 and 2, and Method 4 worked best.
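A minimal sketch of assessment Method 3, ranking instances by how many lists they appear in; the lists themselves are illustrative.

from collections import Counter

lists = [
    ["Paris", "London", "Tokyo"],
    ["Paris", "Berlin", "Tokyo"],
    ["Paris", "Rome"],
]
# Count each instance once per list it occurs in.
counts = Counter(instance for lst in lists for instance in set(lst))
for instance, n in counts.most_common():
    print(instance, n)  # Paris 3, Tokyo 2, London 1, ...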

4. Experimental Results
- They conducted a series of experiments to evaluate the effectiveness of Subclass Extraction (SE), Rule Learning (RL), and List Extraction (LE) at increasing the recall of the baseline KnowItAll system on three classes: City, Scientist, and Film.
- They used the Google API as the search engine.
- They estimated the number of correct instances by manually tagging samples of the extracted instances, grouped by probability, and computed precision as the proportion of correct instances at or above a given probability (a small sketch of this computation follows).
- In the case of City, they also automatically marked instances as correct when they appeared in the Tipster Gazetteer.
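A small sketch of the precision computation described above, over illustrative (instance, probability, manually-tagged-correct) triples.

tagged = [("New York", 0.99, True), ("Paris", 0.95, True),
          ("such as", 0.80, False), ("Berlin", 0.75, True)]

def precision_at(threshold):
    # Proportion of correct instances at or above the probability threshold.
    kept = [correct for _, p, correct in tagged if p >= threshold]
    return sum(kept) / len(kept) if kept else 0.0

print(precision_at(0.9))  # -> 1.0
print(precision_at(0.7))  # -> 0.75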

4. Experimental Results (cont'd)
- For the City and Film classes, each method yielded some improvement over the baseline, but all were dominated by LE+A, which produced a 5.5-fold improvement and found virtually all the extractions found by the other methods.
- In the case of Scientist, SE's ability to extract subclasses made it the dominant method; SE is particularly powerful for general, naturally decomposable classes such as Plant, Animal, or Machine.

4. Experimental Results (cont'd)
- While recall is clearly enhanced, what impact do these methods have on the system's extraction rate?
- Search-engine queries (with a "courtesy wait" between queries) are the system's main bottleneck.
- They measure extraction rate as the number of unique instances extracted per search-engine query (see the sketch below), focusing on unique extractions because each method extracts "popular" instances multiple times.
- Table 4 shows that LE+A's enhancement to recall comes at a cost, since the PMI assessment consumes additional queries. In contrast, plain LE's extraction rate is two orders of magnitude better than that of any other method.
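A sketch of the extraction-rate metric as defined above; the counts are illustrative, not the paper's Table 4 numbers.

def extraction_rate(extractions, num_queries):
    # Unique instances per search-engine query; duplicates are collapsed
    # because popular instances are extracted many times.
    return len(set(extractions)) / num_queries

print(extraction_rate(["Paris", "Paris", "London", "Tokyo"], 2))  # -> 1.5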