KnowItNow: Fast, Scalable Information Extraction from the Web Michael J. Cafarella, Doug Downey, Stephen Soderland, Oren Etzioni.


KnowItNow: Fast, Scalable Information Extraction from the Web Michael J. Cafarella, Doug Downey, Stephen Soderland, Oren Etzioni

The Problem

Numerous NLP applications rely on search-engine queries to:
– Extract information from the Web.
– Compute statistics over the Web corpus.
Search engines are extremely helpful for several linguistic tasks, such as:
– Computing usage statistics.
– Finding a subset of Web documents to analyze in depth.

Problem With Search Engines

Search engines were not designed as building blocks for NLP applications. As a result:
– An NLP application is forced to issue literally millions of queries to search engines, increasing processing time and limiting scalability.
– Fetching Web documents is also time-consuming.
– Search engines limit programmatic queries to their services: Google has placed hard quotas on the number of daily queries a program can issue, and other engines force applications to introduce “courtesy waits” between queries.

Example of the Problem: KnowItAll

KnowItAll uses a generate-and-test architecture, extracting information in two stages:
– First, it utilizes a small set of domain-independent extraction patterns to generate candidate facts.
– Second, it automatically tests the plausibility of the candidate facts it extracts, using pointwise mutual information (PMI) statistics computed from search-engine hit counts.

1st Stage in KnowItAll

Take the generic pattern “NP1 such as NPList2”. It indicates that the head of each simple noun phrase (NP) in NPList2 is a member of the class named in NP1.
– Take as an example the pattern for the class City, and the sentence “We provide tours to cities such as Paris, London, and Berlin.”
– KnowItAll extracts three candidate cities from the sentence: Paris, London, and Berlin.
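The first stage can be sketched as a simple pattern match. This is an illustrative approximation, not KnowItAll's actual implementation: the real system runs a noun-phrase chunker over the text, while the crude regex below simply treats capitalized words as noun-phrase heads.

```python
import re

def extract_candidates(sentence, class_pattern):
    """Apply a KnowItAll-style 'NP1 such as NPList2' pattern.

    Assumption: NPs are approximated as single capitalized words;
    a real implementation would use an NP chunker instead.
    """
    # Match the class name, "such as", and a comma-separated list of names.
    match = re.search(
        class_pattern + r" such as ([A-Z]\w+(?:, [A-Z]\w+)*(?:,? and [A-Z]\w+)?)",
        sentence)
    if not match:
        return []
    # Split the noun-phrase list into individual candidate members.
    return re.split(r", and |, | and ", match.group(1))

sentence = "We provide tours to cities such as Paris, London, and Berlin."
print(extract_candidates(sentence, r"cities"))  # ['Paris', 'London', 'Berlin']
```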

2nd Stage in KnowItAll

KnowItAll needs to assess the likelihood of the information it found, e.g. verify that Paris is actually a city. It does so by computing the PMI between “Paris” and a set of k discriminator phrases that tend to have high mutual information with city names (such as “Paris is a city”). This requires at least k search-engine queries for every candidate extraction!
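The PMI test can be sketched as a ratio of search-engine hit counts: the fraction of an instance's occurrences that also match a discriminator phrase. The hit counts below are hypothetical illustrative numbers, not real measurements; the point is that each score costs one search-engine query per discriminator.

```python
def pmi_score(hits_discriminator_with_instance, hits_instance):
    """PMI-style score from hit counts, e.g.
    Hits("Paris is a city") / Hits("Paris").
    Each numerator requires its own search-engine query."""
    return hits_discriminator_with_instance / hits_instance

# Hypothetical hit counts (illustrative only, not real numbers):
score = pmi_score(1_200, 50_000)
print(score)  # 0.024
```

A candidate with a high score across several discriminators is accepted; with k discriminators, verifying each candidate costs at least k queries, which is exactly the bottleneck KnowItNow removes.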

The Solution

KnowItNow: a novel architecture for Information Extraction that does not depend on Web search-engine queries. It works in two stages, like KnowItAll:
– It uses a specialized search engine called the Binding Engine (or BE), which efficiently returns bindings in response to variabilized queries.
– It uses URNS, a combinatorial model that estimates the probability that each extraction is correct without issuing any additional search-engine queries.

The Binding Engine vs. The Traditional Engine

The Traditional Engine

Take the search query (“Cities such as”). Perform a traditional search-engine query, which returns a list of URLs. For each such URL:
– Obtain the document contents.
– Find the searched-for terms in the document text.
– Run the noun-phrase recognizer to determine whether the text found satisfies the linguistic type requirement.
– If it does, return the string.
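The loop above can be sketched as follows. The corpus, search function, fetcher, and recognizer here are toy stand-ins (assumptions for illustration), not a real search API; the structure of the loop is what matters.

```python
# Toy stand-ins for a search API, a document fetcher, and a recognizer:
corpus = {
    "http://example.com/a": "We provide tours to cities such as Paris.",
    "http://example.com/b": "Read more about cities such as 12345.",
}

def search(query):                      # returns URLs whose pages match
    return [u for u, text in corpus.items() if query in text]

def fetch(url):                         # simulates the slow document fetch
    return corpus[url]

def candidates_after(text, query):      # the word right after the query terms
    tail = text.split(query, 1)[1].strip()
    return [tail.split()[0].rstrip(".")] if tail else []

def is_noun_phrase(s):                  # crude recognizer: capitalized word
    return s[:1].isupper()

def traditional_pipeline(query):
    results = []
    for url in search(query):           # 1. issue the search query
        text = fetch(url)               # 2. fetch each document (a random disk seek)
        for cand in candidates_after(text, query):  # 3. find the terms
            if is_noun_phrase(cand):    # 4. check the linguistic type
                results.append(cand)
    return results

print(traditional_pipeline("cities such as"))  # ['Paris']
```

Step 2 is the bottleneck the next slide describes: one document fetch (and likely one random disk seek) per result, repeated across a large number of documents.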

Problems With the Traditional Engine

The search itself doesn’t take a long time, even if there are multiple search queries. The second stage, however, fetches a large number of documents, each fetch likely resulting in a random disk seek, so this stage executes slowly. This disk access is slow regardless of whether it happens on a locally cached copy or on a remote document server.

The Binding Engine

Why not use a table to store a list of terms and the documents containing them? The Binding Engine supports queries containing:
– Typed variables (such as NounPhrase).
– String-processing functions (such as “head(X)” or “ProperNoun(X)”).
– Standard query terms.
It processes a variable by returning every possible string in the corpus that has a matching type and that can be substituted for the variable while still satisfying the user’s query.

How Does the Binding Engine Work?

It uses a novel structure called the “neighborhood index”, an augmented inverted index.
– For each term in the corpus, the index keeps a list of documents in which the term appears and a list of positions where the term occurs.
– The index also keeps a list of left-hand and right-hand neighbors at each position: adjacent text strings that satisfy a recognizer, e.g. NounPhrase.
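A minimal sketch of the idea, under simplifying assumptions of my own (single-word right-hand neighbors only, and a crude capitalization-based NP recognizer); the real BE also stores left-hand neighbors and richer recognizer output, all computed once at index-build time.

```python
from collections import defaultdict

def is_np(token):
    """Stand-in NP recognizer (assumption: the real BE runs a full
    noun-phrase chunker while building the index)."""
    return token[:1].isupper()

def build_neighborhood_index(docs):
    """For each term: a posting (doc, position, right-hand NP or None).
    The NP neighbor is precomputed and stored alongside the posting."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = text.replace(",", "").rstrip(".").split()
        for i, tok in enumerate(tokens):
            right = tokens[i + 1] if i + 1 < len(tokens) else None
            right_np = right if right and is_np(right) else None
            index[tok.lower()].append((doc_id, i, right_np))
    return index

def bind(index, terms):
    """Answer '<terms> <NounPhrase>' with no document fetches:
    every binding is read straight out of the index."""
    bindings = []
    for doc_id, pos, _ in index[terms[0]]:
        # Check the remaining concrete terms occupy consecutive positions.
        if all(any(d == doc_id and p == pos + j for d, p, _ in index[t])
               for j, t in enumerate(terms[1:], 1)):
            # The stored right-hand neighbor of the last term is the binding.
            for d, p, np in index[terms[-1]]:
                if d == doc_id and p == pos + len(terms) - 1 and np:
                    bindings.append(np)
    return bindings

docs = {"d1": "We provide tours to cities such as Paris."}
idx = build_neighborhood_index(docs)
print(bind(idx, ["cities", "such", "as"]))  # ['Paris']
```

Because the NP neighbors sit inside the postings themselves, answering the variabilized query never touches the original documents.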

How Is the Binding Engine Better?

Let K be the number of concrete terms in the query, B the number of variable bindings found in the corpus, and N the number of documents in the corpus. BE returns bindings directly from the index, with no document fetches at query time. Expensive processing such as part-of-speech tagging or shallow syntactic parsing is performed only once, while building the index, and is not needed at query time.

How Is the Binding Engine Better?

[Chart: average CPU minutes to return the relevant bindings in response to a set of queries, for BE vs. Nutch (a private search engine).]

Disadvantages of the Binding Engine

It consumes a large amount of disk space, as parts of the corpus text are folded into the index several times: the neighborhood index requires roughly four times the disk space of a standard inverted index.

The URNS Model

We need a way to test that the extractions from the Binding Engine are correct. KnowItAll issues queries to search engines and uses the PMI model to verify extractions; PMI is effective, but it is also very slow.

How URNS Works

URNS is a probabilistic model that takes the form of a classic “balls-and-urns” model from combinatorics. Each extraction is modeled as a labeled ball in an urn. A label represents either an instance of the target class or relation, or represents an error.

How URNS Works

– C: the set of unique target labels; |C| is the number of unique target labels in the urn.
– E: the set of unique error labels; |E| is the number of unique error labels in the urn.
– num(b): the function giving the number of balls labeled by b, where b is an element of C ∪ E. num(B) is the multi-set giving the number of balls for each label b in a set B.

How URNS Works

The goal of an IE system is to discern which of the labels it extracts are in fact elements of C. Given that a particular label x was extracted k times in a set of n draws from the urn, what is the probability that x ∈ C?
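That question can be answered with Bayes' rule in the uniform special case of the model, where every target label has the same per-draw probability p_c and every error label the same p_e. This is a sketch of the model's idea under those assumptions, with hypothetical parameter values; the full URNS model instead integrates over distributions of label frequencies.

```python
from math import comb

def p_target(k, n, num_c, num_e, p_c, p_e):
    """P(x in C | x extracted k times in n draws), uniform urn:
    num_c target labels each drawn with probability p_c per draw,
    num_e error labels each drawn with probability p_e per draw.
    Bayes' rule over which kind of label x is."""
    binom = lambda p: comb(n, k) * p**k * (1 - p)**(n - k)
    target = num_c * binom(p_c)   # prior weight of target labels x likelihood
    error = num_e * binom(p_e)    # prior weight of error labels x likelihood
    return target / (target + error)

# Hypothetical parameters: error labels outnumber target labels
# but each is drawn far less often.
print(p_target(k=1, n=10_000, num_c=300, num_e=3000, p_c=1e-4, p_e=1e-5))
print(p_target(k=3, n=10_000, num_c=300, num_e=3000, p_c=1e-4, p_e=1e-5))
# A label seen once is probably an error; seen three times, probably correct.
```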

Alternative to URNS

Assume that items extracted more often are more likely to be true, i.e. simply rank extractions by frequency and treat the most frequent ones as true.

Experiments

– Recall: how many distinct extractions does each system return at high precision?
– Time: how long did each system take to produce and rank its extractions?
– Extraction rate: how many distinct high-quality extractions does the system return per minute? The extraction rate is simply recall divided by time.
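With hypothetical numbers (not the paper's measured results), the metric works out as:

```python
def extraction_rate(recall, minutes):
    """Extraction rate = distinct high-precision extractions per minute."""
    return recall / minutes

# Illustrative values only, not the paper's reported figures:
print(extraction_rate(10_000, 25))  # 400.0 extractions per minute
```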

KnowItNow vs. KnowItAll Tested on relation “Country”

KnowItNow vs. KnowItAll Tested on relation “CapitalOf”

KnowItNow vs. KnowItAll Tested on relation “Corp”

KnowItNow vs. KnowItAll Tested on relation “CeoOf”

KnowItNow vs. KnowItAll

Contributions

– A novel architecture for Information Extraction that does not depend on Web search-engine queries.
– Extracts tens of thousands of facts from the Web in minutes instead of days.
– KnowItNow’s extraction rate is two to three orders of magnitude greater than KnowItAll’s.