
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks

Panos Ipeirotis – New York University
Eugene Agichtein – Microsoft Research
Pranay Jain – Columbia University
Luis Gravano – Columbia University

2 Text-Centric Task I: Information Extraction

Information extraction applications extract structured relations from unstructured text.

Example document: "May , Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

An Information Extraction System (e.g., NYU's Proteus) turns such text into a table of Disease Outbreaks in The New York Times:

Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire

(An Information Extraction tutorial was given yesterday by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)

3 Text-Centric Task II: Metasearching

Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately.

Example document: "Friday June 16, NEW YORK (Forbes) - Starbucks Corp. may be next on the target list of CSPI, a consumer-health group that this week sued the operator of the KFC restaurant chain…"

A Content Summary Extractor processes such documents and produces a content summary:

Word      | Frequency
Starbucks | 102
consumer  | 215
soccer    | 1295
…         | …

Content Summary of Forbes.com:

Word      | Frequency
Starbucks | 103
consumer  | 216
soccer    | 1295
…         | …

4 Text-Centric Task III: Focused Resource Discovery

Identify web pages about a given topic (multiple techniques proposed: simple classifiers, focused crawlers, focused querying, …).

[Diagram: URL → Web Page → Classifier → Web Pages about Botany]

5 An Abstract View of Text-Centric Tasks

[Pipeline: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens

For the rest of the talk:

Task                   | Token
Information Extraction | Relation Tuple
Database Selection     | Word (+Frequency)
Focused Crawling       | Web Page about a Topic

6 Executing a Text-Centric Task

[Pipeline: Text Database → Extraction System → Output Tokens; 1. Retrieve documents from database, 2. Process documents, 3. Extract output tokens]

Similar to the relational world, there are two major execution paradigms:
- Scan-based: retrieve and process documents sequentially.
- Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results.

Unlike the relational world:
- Indexes are only "approximate": the index is on keywords, not on the tokens of interest.
- The choice of execution plan affects output completeness (not only speed) → the underlying data distribution dictates what is best.

7 Execution Plan Characteristics

[Pipeline diagram as before]

Execution plans have two main characteristics:
- Execution time
- Recall (fraction of tokens retrieved)

Question: How do we choose the fastest execution plan for reaching a target recall? "What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"

8 Outline

- Description and analysis of crawl- and query-based plans
  - Crawl-based: Scan, Filtered Scan
  - Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions

9 Scan

[Pipeline diagram: 1. Retrieve docs from database, 2. Process documents, 3. Extract output tokens]

Scan retrieves and processes documents sequentially (until reaching the target recall).

Execution time = |Retrieved Docs| · (R + P), where R is the time for retrieving a document and P the time for processing a document.

Question: How many documents does Scan retrieve to reach the target recall?

Filtered Scan uses a classifier to identify and process only promising documents (details in the paper).

10 Estimating Recall of Scan

Modeling Scan for token t: What is the probability of seeing t (with frequency g(t)) after retrieving S documents?
- It is a "sampling without replacement" process.
- After retrieving S documents, the frequency of token t follows a hypergeometric distribution.
- Recall for token t is the probability that the frequency of t in the S documents is > 0.
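A minimal sketch of this per-token computation (not from the talk; plain Python, standard library only). D is the database size, g the frequency g(t) of token t, and S the number of documents retrieved:

```python
from math import comb

def scan_recall(D: int, g: int, S: int) -> float:
    """Recall of Scan for one token t: the probability that t, which
    appears in g of the D documents, shows up at least once in a sample
    of S documents drawn without replacement (hypergeometric model)."""
    if S >= D:
        return 1.0
    # P(hypergeometric count == 0) = C(D - g, S) / C(D, S);
    # math.comb returns 0 when S > D - g, giving recall 1.0 as expected.
    return 1.0 - comb(D - g, S) / comb(D, S)
```

For example, scan_recall(182531, 5, 50000) estimates the chance of hitting a token that occurs in only five NYT articles after scanning 50,000 of them.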

11 Estimating Recall of Scan (continued)

Modeling Scan overall: multiple "sampling without replacement" processes, one for each token.
- Overall recall is the average recall across tokens.
- → We can compute the number of documents required to reach the target recall, and hence Execution time = |Retrieved Docs| · (R + P).
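Building on the scan_recall sketch above, the required number of documents can be found by binary search, since average recall is monotone in S (again an illustration, not the paper's exact procedure):

```python
def docs_needed(D: int, token_freqs: list[int], target: float) -> int:
    """Smallest number of documents S that Scan must retrieve so that
    average recall over all tokens reaches the target. Multiply the
    result by (R + P) to get Scan's predicted execution time."""
    def avg_recall(S: int) -> float:
        return sum(scan_recall(D, g, S) for g in token_freqs) / len(token_freqs)

    lo, hi = 0, D
    while lo < hi:                      # binary search over sample size
        mid = (lo + hi) // 2
        if avg_recall(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```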

12 Scan and Filtered Scan

[Pipeline diagram: 1. Retrieve docs from database, 2. Filter documents (Classifier), 3. Process documents, 4. Extract output tokens]

Scan retrieves and processes all documents (until reaching the target recall).

Filtered Scan uses a classifier to identify and process only promising documents (e.g., the Sports section of the NYT is unlikely to describe disease outbreaks). The classifier has selectivity σ ≤ 1: only a σ fraction of the documents pass the filter.

Execution time = |Retrieved Docs| · (R + F + P), where R is the time for retrieving a document, F the time for filtering it, and P the time for processing it.

Question: How many documents does (Filtered) Scan retrieve to reach the target recall?

13 Estimating Recall of Filtered Scan

Modeling Filtered Scan: the analysis is similar to Scan. The main difference is that the classifier rejects documents, which:
- Decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity) – documents rejected by the classifier shrink the pool.
- Decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall) – tokens in rejected documents are never seen.
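The hypergeometric sketch from before carries over with exactly these two adjustments (σ and r are assumed inputs; an illustration, not the paper's exact estimator):

```python
from math import comb

def filtered_scan_recall(D: int, g: int, S: int,
                         sigma: float, r: float) -> float:
    """Filtered Scan recall for one token: identical model to Scan, but
    run on the effective database of sigma*|D| classifier-accepted
    documents, with the token's frequency reduced to r*g(t)."""
    D_eff = max(1, round(sigma * D))        # docs that survive the classifier
    g_eff = min(round(r * g), D_eff)        # surviving occurrences of t
    if g_eff == 0:
        return 0.0
    S = min(S, D_eff)                       # can process at most D_eff docs
    return 1.0 - comb(D_eff - g_eff, S) / comb(D_eff, S)
```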

14 Outline

- Description and analysis of crawl- and query-based plans
  - Crawl-based: Scan, Filtered Scan
  - Query-based: Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions

15 Iterative Set Expansion

[Pipeline diagram: 1. Query database with seed tokens (e.g., [Ebola AND Zaire]), 2. Process retrieved documents, 3. Extract tokens from docs, 4. Augment seed tokens with the new tokens]

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.

Question: How many queries and how many documents does Iterative Set Expansion need to reach the target recall?

16 Querying Graph

The querying graph is a bipartite graph containing tokens and documents:
- Each token (transformed into a keyword query) retrieves documents.
- Documents contain tokens.

[Figure: bipartite graph with tokens t1, …, t5 on one side and documents d1, …, d5 on the other]

17 Using the Querying Graph for Analysis

We need to compute:
- The number of documents retrieved after sending Q tokens as queries (estimates time).
- The number of tokens that appear in the retrieved documents (estimates recall).

To estimate these, we need to compute:
- The degree distribution of the tokens discovered by retrieving documents.
- The degree distribution of the documents retrieved by the tokens.

(These are not the same as the degree distribution of a randomly chosen token or document – it is easier to discover documents and tokens with high degrees.)

An elegant analysis framework based on generating functions handles this – details in the paper.

18 Recall Limit: Reachability Graph

The reachability graph connects token t_i to token t_j when t_i, used as a query, retrieves a document that contains t_j (e.g., t1 retrieves document d1, which contains t2).

[Figure: querying graph over tokens t1, …, t5 and documents d1, …, d5, with the induced reachability graph over the tokens]

Upper recall limit: determined by the size of the biggest connected component.
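A small sketch of how one could compute this limit empirically (hypothetical input maps; the paper derives it analytically via generating functions): breadth-first search over the querying graph from the seed tokens, since any token outside the reachable set is undiscoverable by Iterative Set Expansion.

```python
from collections import deque

def reachable_tokens(seeds, docs_for_token, tokens_in_doc):
    """BFS over the bipartite querying graph, alternating between
    'query with token -> retrieved documents' and 'processed document ->
    contained tokens'. The returned set bounds ISE's achievable recall."""
    found_tokens, seen_docs = set(seeds), set()
    queue = deque(seeds)
    while queue:
        t = queue.popleft()
        for d in docs_for_token.get(t, ()):          # docs that query t retrieves
            if d not in seen_docs:
                seen_docs.add(d)
                for t2 in tokens_in_doc.get(d, ()):  # tokens extracted from d
                    if t2 not in found_tokens:
                        found_tokens.add(t2)
                        queue.append(t2)
    return found_tokens
```

With docs_for_token = {"t1": ["d1"]} and tokens_in_doc = {"d1": ["t1", "t2"]}, seeding with ["t1"] discovers t2, matching the slide's example.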

19 Automatic Query Generation

Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation. Automatic Query Generation avoids this problem by creating queries offline (using machine learning); the queries are designed to return documents rich in tokens.

20 Automatic Query Generation

[Pipeline diagram: 1. Generate, offline, queries that tend to retrieve documents with tokens, 2. Query database, 3. Process retrieved documents, 4. Extract tokens from docs]

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.

21 Estimating Recall of Automatic Query Generation

Query q retrieves g(q) documents with precision p(q):
- p(q)·g(q) useful documents (they contain tokens)
- (1 − p(q))·g(q) useless documents

We compute the total number of useful (and useless) documents retrieved. The analysis is then similar to Filtered Scan:
- The effective database size is |D_useful|.
- The sample size S is the number of useful documents retrieved.
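A sketch of the bookkeeping (illustrative only; it ignores overlap between query results, which the full analysis must account for):

```python
def aqg_doc_counts(queries):
    """Given (precision p(q), result size g(q)) pairs for the generated
    queries, split the retrieved documents into useful and useless ones.
    The useful count plays the role of the sample size S in the
    Filtered-Scan-style recall analysis over the |D_useful| pool."""
    useful = sum(p * g for p, g in queries)
    useless = sum((1.0 - p) * g for p, g in queries)
    return useful, useless
```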

22 Outline

- Description and analysis of crawl- and query-based plans
- Optimization strategy
- Experimental results and conclusions

23 Summary of Cost Analysis

Our analysis so far:
- Takes as input a target recall.
- Gives as output the time for each plan to reach the target recall (time = infinity if the plan cannot reach the target recall).

Time and recall depend on task-specific properties of the database:
- Token degree distribution
- Document degree distribution

Next, we show how to estimate the degree distributions on the fly.

24 Estimating Cost Model Parameters

Token and document degree distributions belong to known distribution families, so we can characterize the distributions with only a few parameters:

Task                         | Document Distribution | Token Distribution
Information Extraction       | Power-law             | Power-law
Content Summary Construction | Lognormal             | Power-law (Zipf)
Focused Resource Discovery   | Uniform               | Uniform
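For instance, a power-law exponent can be estimated from observed token degrees with a standard maximum-likelihood formula (a continuous-approximation sketch in the style of Clauset et al.; not necessarily the estimator used in the paper):

```python
import math

def fit_power_law_alpha(degrees, xmin: int = 1) -> float:
    """Maximum-likelihood estimate (continuous approximation) of the
    exponent alpha in p(x) ~ x^(-alpha), from observed degrees >= xmin:
    alpha ≈ 1 + n / sum(ln(x_i / (xmin - 0.5)))."""
    xs = [x for x in degrees if x >= xmin]
    if not xs:
        raise ValueError("no degrees at or above xmin")
    return 1.0 + len(xs) / sum(math.log(x / (xmin - 0.5)) for x in xs)
```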

25 Parameter Estimation

A naïve solution for parameter estimation:
- Start with a separate "parameter-estimation" phase.
- Perform random sampling on the database.
- Stop when cross-validation indicates high confidence.

We can do better than this! There is no need for a separate sampling phase: sampling is equivalent to executing the task, so we piggyback parameter estimation onto execution.

26 On-the-fly Parameter Estimation

- Pick the most promising execution plan for the target recall, assuming "default" parameter values.
- Start executing the task.
- Update the parameter estimates during execution.
- Switch plans if the updated statistics indicate it.

Important:
- Only Scan acts as "random sampling".
- All other execution plans need parameter adjustment (see paper).

[Figure: correct (but unknown) distribution vs. the initial default estimate and the updated estimate]
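A toy sketch of the optimizer's control loop (all names hypothetical; cost models are callables that return the predicted time for a plan, or infinity when the plan cannot reach the target recall):

```python
def choose_plan(cost_models, target_recall, params):
    """Pick the plan whose cost model predicts the smallest time for the
    target recall under the current parameter estimates."""
    return min(cost_models,
               key=lambda name: cost_models[name](target_recall, params))

def run_optimized(cost_models, executors, update_params,
                  target_recall, params):
    """Skeleton of on-the-fly optimization: execute in batches, refine the
    parameter estimates from what was seen, and re-pick the plan."""
    recall = 0.0
    while recall < target_recall:
        plan = choose_plan(cost_models, target_recall, params)  # may switch
        recall, evidence = executors[plan](target_recall)       # run a batch
        params = update_params(params, evidence, plan)          # adjust (see paper)
    return recall
```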

27 Outline Description and analysis of crawl- and query-based plans Optimization strategy Experimental results and conclusions

28 Correctness of Theoretical Analysis

Task: Disease Outbreaks (Snowball IE system), 182,531 documents from NYT, 16,921 tokens.

[Plot: solid lines show actual time; dotted lines show predicted time with correct parameters.]

29 Experimental Results (Information Extraction)

[Plot: solid lines show actual time; the green line shows time with the optimizer. Results are similar in the other experiments – see paper.]

30 Conclusions

- Common execution plans for multiple text-centric tasks.
- Analytic models for predicting the execution time and recall of various crawl- and query-based plans.
- Techniques for on-the-fly parameter estimation.
- An optimization framework that picks, on the fly, the fastest plan for a target recall.

31 Future Work

- Incorporate the precision and recall of the extraction system into the framework.
- Create a non-parametric optimizer (i.e., no assumptions about distribution families).
- Examine other text-centric tasks and analyze new execution plans.
- Create an adaptive, "next-K" optimizer.

32 Thank you!

Related work, by task and plan:

Task                         | Filtered Scan                       | Iterative Set Expansion          | Automatic Query Generation
Information Extraction       | Grishman et al., J. of Biomed. Inf. | Agichtein and Gravano, ICDE 2003 |
Content Summary Construction | -                                   | Callan et al., SIGMOD 1999       | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery   | Chakrabarti et al., WWW             | -                                | Cohen and Singer, AAAI WIBIS 1996

33 Overflow Slides

34 Experimental Results (IE, Headquarters)

Task: Company Headquarters (Snowball IE system), 182,531 documents from NYT, 16,921 tokens.

35 Experimental Results (Content Summaries)

Content Summary Extraction: 19,997 documents from 20newsgroups, 120,024 tokens.

36 Experimental Results (Content Summaries)

Content Summary Extraction: 19,997 documents from 20newsgroups, 120,024 tokens.

[Plot annotation: ISE is a cheap plan for low target recall but becomes the most expensive for high target recall.]

37 Experimental Results (Content Summaries)

Content Summary Extraction: 19,997 documents from 20newsgroups, 120,024 tokens.

[Plot annotation: the optimizer underestimated the recall of AQG and switched to ISE.]

38 Experimental Results (Information Extraction)

[Plot annotation: OPTIMIZED is faster than the "best" static plan: the optimizer overestimated Filtered Scan's recall, but once Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.]

39 Focused Resource Discovery

800,000 web pages, 12,000 tokens.