Merging Ranks from Heterogeneous Internet Sources
Hector Garcia-Molina and Luis Gravano, Stanford University

Users Have Many Available Information Sources
[Figure: a user poses the query "Houses near Palo Alto for around $300K" to multiple sources; Source 1 returns hits h11, h12, h13, ..., while another source returns nothing, and the user must assemble a single set of query results.]

Challenges
- Sources are too numerous
- Sources are heterogeneous (query language, model, results)
- Users want a single query result

Metasearcher
- Selects the good sources for a query
- Extracts and combines the query results from the sources

Text Sources Rank Query Results
[Figure: a text source, given the query "Distributed Databases", returns ranked documents with scores, e.g., Doc 1: 0.8, Doc 2: ...]

Structured Sources on the Internet Also Rank Results
A real-estate agent receives queries on Location and Price:
Q: "Houses with preferred location in Palo Alto and preferred price around $300K."

The Agent Ranks Its Houses Based on Its Own Scoring Function
Q: "Houses with preferred location in Palo Alto and preferred price around $300K."
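
For concreteness, a scoring function of the kind such an agent might use can be sketched as follows. This is a made-up illustration: the attribute weights, the partial credit for other locations, and the price normalization are all assumptions, not the agent's actual formula.

    def agent_score(house, pref_location="Palo Alto", pref_price=300_000):
        """Hypothetical source-side score in [0, 1], combining how well a
        house matches the preferred location and price."""
        loc = 1.0 if house["location"] == pref_location else 0.3  # assumed partial credit
        price = max(0.0, 1.0 - abs(house["price"] - pref_price) / pref_price)
        return 0.5 * loc + 0.5 * price

    print(agent_score({"location": "Palo Alto", "price": 330_000}))  # 0.95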

A Metasearcher Then Faces Two Problems
- Extracting the top objects from the underlying sources
- Merging the results from the various sources

Merging Query Results Is Easy with Enough Information
Given a record like <Location, Price, Source score>, the metasearcher ignores the Source score and computes its Target score from the Location and Price.
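
A minimal sketch of that merge step (the Target function below is an invented stand-in; the point is only that the metasearcher recomputes scores from the raw Location and Price attributes and never reuses the source's score):

    records = [
        {"location": "Palo Alto",  "price": 330_000, "source_score": 0.97},
        {"location": "Menlo Park", "price": 295_000, "source_score": 0.88},
    ]

    def target_score(r):
        # Recomputed from Location and Price only; r["source_score"] is ignored.
        loc = 1.0 if r["location"] == "Palo Alto" else 0.5
        price = max(0.0, 1.0 - abs(r["price"] - 300_000) / 300_000)
        return min(loc, price)

    merged = sorted(records, key=target_score, reverse=True)
    print([r["location"] for r in merged])  # ['Palo Alto', 'Menlo Park']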

Extracting the Top Objects from a Source Is Hard
The metasearcher's scoring function might be different from the source's!

We Want to Avoid Extracting All the Source's Contents
Assume a house h with:
- Source(Q, h) = 0 (worst for the source)
- Target(Q, h) = 1 (best for the metasearcher)
Problem: to be sure of finding h, the metasearcher would have to extract the source's entire contents.

The Example Query Is Not Manageable at the Agent
A query Q is manageable at a source if there exists some ε < 1 such that, for every object h:
Source(Q, h) ≥ Target(Q, h) − ε
[Figure: the (Source, Target) unit square from (0,0) to (1,1); for a manageable query, every object lies in the band where Target exceeds Source by at most ε.]
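
Operationally, manageability says the source's score can undershoot the metasearcher's Target score by at most ε. A tiny sketch of checking this condition over sampled (source, target) score pairs, with invented numbers:

    def is_manageable(pairs, eps):
        """pairs: (source_score, target_score) per object, all in [0, 1].
        Manageable means Source(Q, h) >= Target(Q, h) - eps for every h, with eps < 1."""
        return eps < 1 and all(s >= t - eps for s, t in pairs)

    # The slide's pathological house: source score 0 but target score 1.
    # No eps < 1 closes that gap, so the example query is not manageable.
    print(is_manageable([(0.0, 1.0)], eps=0.9))  # False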

Single-Attribute Queries Are More Likely to Be Manageable
Single-attribute queries for Q:
- Q1: Location = Palo Alto
- Q2: Price = $300K

The Example Becomes Tractable!
... if the top Target objects for Q are among the top Source objects for Q1 and Q2.

A Cover Bounds the Target Scores for Q
Single-attribute queries Q1, ..., Qm form a cover for Q if there exist thresholds g1, ..., gm and a bound G < 1 such that, for every object h:
if Target(Qi, h) ≤ gi for all i, then Target(Q, h) ≤ G
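
For example (an illustration with an assumed target function): if Target(Q, h) = max(Target(Q1, h), Target(Q2, h)), then Q1 and Q2 cover Q with G = max(g1, g2), since Target(Qi, h) ≤ gi for both i keeps the maximum at or below max(g1, g2); with Target = min, the tighter bound G = min(g1, g2) works.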

Having a Manageable Cover for a Query Is Sufficient...
Manageable cover for query Q at source S ⇒ "efficient" executions are possible at S

Having a Manageable Cover for a Query Is Sufficient... (Algorithm)
(1) Pick a manageable cover C = {Q1, ..., Qm} for Q at S
(2) For i = 1 to m: find εi for Qi
(3) Pick 0 ≤ g1, ..., gm, G < 1 for cover C
(4) For i = 1 to m:
(5)     Retrieve all objects t with Source(Qi, t) ≥ Gi = gi − εi
(6) Compute Target(Q, t) for all objects t retrieved
(7) If ∃ i such that Gi ≤ 0, then go to step (11)
(8) If Target(Q, t) ≤ G for all t retrieved, then:
(9)     Find new, lower 0 ≤ g1, ..., gm, G < 1 for C
(10)    Go to step (4)
(11) Output the retrieved objects with the highest Target score
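
Below is a minimal runnable sketch of steps (3) through (11) for a two-attribute source. Everything concrete in it is an assumption for illustration: objects are pairs of attribute scores in [0, 1], the metasearcher uses Target = Min (so G = min(g1, g2) is a valid cover bound), and the source's score undershoots each single-attribute Target score by exactly εi.

    EPS = [0.05, 0.05]  # assumed eps_i for the single-attribute queries Q1, Q2

    objects = [(0.9, 0.2), (0.8, 0.7), (0.1, 0.95), (0.5, 0.5), (0.3, 0.1)]

    def source_score(i, obj):
        """Assumed source behavior for Qi: the attribute score deflated by eps_i,
        which still satisfies Source(Qi, h) >= Target(Qi, h) - eps_i."""
        return max(0.0, obj[i] - EPS[i])

    def target(obj):
        """Metasearcher's Target for Q: Target = Min, so Target(Qi, h) <= gi
        for all i implies Target(Q, h) <= min(g1, g2) = G."""
        return min(obj)

    def extract_top(objects, g, shrink=0.5):
        """Steps (3)-(11): retrieve by source-score thresholds Gi = gi - eps_i,
        lower the thresholds until some retrieved object provably beats every
        object still at the source, then output the best retrieved objects."""
        while True:
            G = min(g)  # cover bound for Target = Min
            thresholds = [gi - eps for gi, eps in zip(g, EPS)]  # step (5)
            retrieved = {obj for obj in objects for i in (0, 1)
                         if source_score(i, obj) >= thresholds[i]}
            if any(t <= 0 for t in thresholds):  # step (7): source fully scanned
                break
            if any(target(obj) > G for obj in retrieved):  # step (8): top provably found
                break
            g = [gi * shrink for gi in g]  # step (9): lower the thresholds, retry
        best = max(target(obj) for obj in retrieved)  # step (11)
        return [obj for obj in retrieved if target(obj) == best]

    print(extract_top(objects, g=[0.9, 0.9]))  # -> [(0.8, 0.7)]

In this toy run the algorithm certifies the top object after retrieving only part of the source, which is the effect the performance slides below quantify.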

Algorithm to Extract Top Target Objects
[Figure: the (Q1, Q2) score space on [0, 1]; every object below the thresholds g1 and g2 in both dimensions is guaranteed to have Target(Q, h) ≤ G, so only the remaining region must be retrieved.]

Algorithm to Extract Top Target Objects (continued)
[Figure: the thresholds are lowered to g1' and g2', shrinking the unexplored region to Target(Q, h) ≤ G'; a retrieved object h' now has Target(Q, h') > G', so the top object has provably been found.]

Preliminary Performance Results for Our Algorithm
(10,000 objects, 4 query attributes, ε = 0)
- Target = Min: 14% of the objects retrieved
- Target = Max: 4% of the objects retrieved

Preliminary Performance Results for Our Algorithm
(10,000 objects, 4 query attributes, ε = 0.10)
- Target = Min: 25% of the objects retrieved
- Target = Max: 44% of the objects retrieved

Having a Manageable Cover for a Query Is Also Necessary...
No manageable cover for query Q at source S ⇒ efficient executions are impossible at S

A Manageable Cover Is Necessary: Proof
Consider a minimal cover Q1, Q2, Q3 for Q, with Q1 and Q2 manageable but Q3 not manageable. For any "efficient" execution, build an object h such that:
- h is not retrieved
- Target(Q, h) > G = max{Target(Q, o) | o retrieved}
Such an h shows the execution misses the true top object, so no efficient execution exists.

A Manageable Cover Is Necessary: Proof (continued)
[Figures: the (Q1, Q2, Q3) score space with thresholds g1, g2, g3; a hypothetical object h' with Target(Q, h') > G, and an unretrieved object h constructed so that Target(Q3, h) ≥ Target(Q, h') and Target(Q, h') > G, concluding Target(Q, h) > G.]

We Studied Two Metasearching Problems
- Extracting the top objects from the underlying sources
- Merging the results from the various sources

Related Work: Collection Fusion
- Voorhees et al.
- Callan/Lu/Croft
- Gauch/Wang