Winter Semester 2003/2004Selected Topics in Web IR and Mining7-1 7 Top-k Queries on Web Sources and Structured Data 7.1 Top-k Queries over Autonomous Web.

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Davide Mottin, Senjuti Basu Roy, Alice Marascu, Yannis Velegrakis, Themis Palpanas, Gautam Das A Probabilistic Optimization Framework for the Empty-Answer.
Web Information Retrieval
Chapter 5: Introduction to Information Retrieval
 Introduction  Views  Related Work  Preliminaries  Problems Discussed  Algorithm LPTA  View Selection Problem  Experimental Results.
Best-Effort Top-k Query Processing Under Budgetary Constraints
PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries Vagelis Hristidis University of California, San Diego Nick Koudas AT&T.
DISCOVER: Keyword Search in Relational Databases Vagelis Hristidis University of California, San Diego Yannis Papakonstantinou University of California,
Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik.
Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis Luis Gravano Computer Science Department Columbia.
Supporting Queries with Imprecise Constraints Ullas Nambiar Dept. of Computer Science University of California, Davis Subbarao Kambhampati Dept. of Computer.
Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg 1 SSTD 2011.
Improving Query Results using Answer Corroboration Amélie Marian Rutgers University.
6/15/20151 Top-k algorithms Finding k objects that have the highest overall grades.
Circumventing Data Quality Problems Using Multiple Join Paths Yannis Kotidis, Athens University of Economics and Business Amélie Marian, Rutgers University.
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
Answering Imprecise Queries over Autonomous Web Databases Ullas Nambiar Dept. of Computer Science University of California, Davis Subbarao Kambhampati.
Evaluating Top-k Queries over Web-Accessible Databases Nicolas Bruno Luis Gravano Amélie Marian Columbia University.
CS246 Ranked Queries. Junghoo "John" Cho (UCLA Computer Science)2 Traditional Database Query (Dept = “CS”) & (GPA > 3.5) Boolean semantics Clear boundary.
Top-k Query Processing and Optimization 198:541 (slides courtesy of Ihab F. Ilyas and Walid G. Aref)
Automated Ranking Of Database Query Results  Sanjay Agarwal - Microsoft Research  Surajit Chaudhuri - Microsoft Research  Gautam Das - Microsoft Research.
MPI Informatik 1/17 Oberseminar AG5 Result merging in a Peer-to-Peer Web Search Engine Supervisors: Speaker : Sergey Chernov Prof. Gerhard Weikum Christian.
Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International.
1 Evaluating top-k Queries over Web-Accessible Databases Paper By: Amelie Marian, Nicolas Bruno, Luis Gravano Presented By Bhushan Chaudhari University.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Ranking in DB Laks V.S. Lakshmanan Depf. of CS UBC.
Reverse Top-k Queries Akrivi Vlachou *, Christos Doulkeridis *, Yannis Kotidis #, Kjetil Nørvåg * *Norwegian University of Science and Technology (NTNU),
Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Richa Varshney.
Winter Semester 2003/2004Selected Topics in Web IR and Mining6-1 6 Rank Aggregation and Top-k Queries 6.1 Fagin‘s Threshold Algorithm 6.2 Rank Aggregation.
Efficient Processing of Top-k Spatial Preference Queries
Spatio-temporal Pattern Queries M. Hadjieleftheriou G. Kollios P. Bakalov V. J. Tsotras.
1University of Texas at Arlington.  Introduction  Motivation  Requirements  Paper’s Contribution.  Related Work  Overview of Ripple Join  Rank.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
Vector Space Models.
Combining Fuzzy Information: An Overview Ronald Fagin.
Presented by Suresh Barukula 2011csz  Top-k query processing means finding k- objects, that have highest overall grades.  A query in multimedia.
Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
NRA Top k query processing using Non Random Access Only sequential access Only sequential accessAlgorithm 1) 1) scan index lists in parallel; 2) 2) consider.
Automated Ranking Of Database Query Results  Sanjay Agarwal - Microsoft Research  Surajit Chaudhuri - Microsoft Research  Gautam Das - Microsoft Research.
Adaptive Processing of Top-k Queries in XML Amelie Marian, Sihem Amer-Yahia Nick Koudas, Divesh Srivastava Proceedings of the 21st International Conference.
Finding skyline on the fly HKU CS DB Seminar 21 July 2004 Speaker: Eric Lo.
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Optimal Aggregation Algorithms for Middleware By Ronald Fagin, Amnon Lotem, and Moni Naor.
Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer.
Top-k Query Processing Optimal aggregation algorithms for middleware Ronald Fagin, Amnon Lotem, and Moni Naor + Sushruth P. + Arjun Dasgupta.
Database Searching and Information Retrieval Presented by: Tushar Kumar.J Ritesh Bagga.
03/02/20061 Evaluating Top-k Queries Over Web-Accessible Databases Amelie Marian Nicolas Bruno Luis Gravano Presented By: Archana and Muhammed.
Automatic Categorization of Query Results Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang Sushruth Puttaswamy.
Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik.
Chapter 13: Query Processing
Neighborhood - based Tag Prediction
Supporting Ranking in Queries Score-based Paradigm
Seung-won Hwang, Kevin Chen-Chuan Chang
Probabilistic Data Management
Chapter 12: Query Processing
Top-k Query Processing
Spatio-temporal Pattern Queries
Probabilistic Ranking of Database Query Results
Rank Aggregation.
Laks V.S. Lakshmanan Depf. of CS UBC
Structure and Content Scoring for XML
Panagiotis G. Ipeirotis Luis Gravano
Structure and Content Scoring for XML
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Efficient Processing of Top-k Spatial Preference Queries
Relax and Adapt: Computing Top-k Matches to XPath Queries
VECTOR SPACE MODEL Its Applications and implementations
Presentation transcript:

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-1 7 Top-k Queries on Web Sources and Structured Data 7.1 Top-k Queries over Autonomous Web Sources 7.2 Ranking of SQL Query Results

Winter Semester 2003/2004Selected Topics in Web IR and Mining Computational Model for Top-k Queries over Web Sources Typical example: Address = „2590 Broadway“ and Price = $ 25 and Rating = 30 issued against mapquest.com, nytoday.com, zagat.com Major complication: some sources do not allow sorted access Major opportunity: sources can be accessed in parallel  extension/generalization of TA distinguish S-sources, R-sources, SR-sources

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-3 Source-Type-Aware TA For each R-source S i  S m+1.. S m+r set high i := 1 Scan SR- or S-sources S 1.. S m Choose SR- or S-source S i for next sorted access for object d retrieved from SR- or S-source L i do { E(d) := E(d)  {i}; high i := si(q,d); bestscore(d) := aggr{x1,..., xm) with xi := si(q,d) for i  E(d), high i for i  E(d); worstscore(d) := aggr{x1,..., xm) with xi := si(q,d) for i  E(d), 0 for i  E(d); }; Choose SR- or R-source Si for next random access for object d retrieved from SR- or R-source L i do { E(d) := E(d)  {i}; bestscore(d) := aggr{x1,..., xm) with xi := si(q,d) for i  E(d), high i for i  E(d); worstscore(d) := aggr{x1,..., xm) with xi := si(q,d) for i  E(d), 0 for i  E(d); }; current top-k := k docs with largest worstscore; worstmin k := minimum worstscore among current top-k; Stop when bestscore(d | d not in current top-k results)  worstmin k ; Return current top-k; in contrast to Fagin‘s TA, keep complete list of candidate objects

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-4 Strategies for Choosing the Source for Next Access for next sorted acccess: Escore(Si) := expected si value for next sorted access to Si (e.g.: high i ) rank(Si) := w i * Escore(Si) / c s (Si) // w i is weight of Si in aggr choose SR- or S-source with highest rank(Si) for next random acccess (probe): Escore(Si) := expected si value for next random access to Si (e.g.: (high i  low i ) / 2) rank(Si) := w i * Escore(Si) / c r (Si) choose SR- or S-source with highest rank(Si)

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-5 The Upper Strategy for Choosing Next Object and Source (Marian et al.: TODS 2004) for next random acccess: among all objects with E(d)  and R(d)  choose d‘ with the highest bestscore(d‘); if bestscore(d‘) < bestscore(d‘‘) for object d‘‘ with E(d‘‘)=  then perform sorted access next (i.e., don‘t probe d‘) else {  := bestscore(d‘)  worstmin-k; if  > 0 then { consider Si as „redundant“ for d‘ if for all Y  R(d‘)  {Si}  j  Y w j * high j + w i * high i     j  Y w j * high j   ; choose „non-redundant“ source with highest rank(Si) } else choose source with lowest c r (Si); }; Alternative for early stage of algorithm: could prioritize random access for objects in current top-k to improve worstmin-k idea: eagerly prove that candidate objects cannot qualify for top-k

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-6 The Parallel Strategy pUpper (Marian et al.: TODS 2004) idea: consider up to MPL(Si) parallel probes to the same R-source Si choose objects to be probed based on bestscore reduction and expected response time for next random acccess: probe-candidates := m objects d with E(d)  and R(d)  such that d is among the m highest values of bestscore(d); for each object d in probe-candidates do {  := bestscore(d)  worstmin-k; if  > 0 then { choose subset Y(d)  R(d) such that  j  Y w j * high j   and expected response time  Sj  Y(d) ( |{d‘|bestscore(d‘)>bestscore(d) and Y(d)  Y(d‘)  }| * c R (Sj) / MPL(Sj) ) is minimum }; }; enqueue probe(d) to queue(Si) for all Si  Y(d) with expected response time as priority;

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-7 Experimental Evaluation pTA: parallelized TA (with asynchronous probes, but same probe order as TA) real Web sources SR: superpages (Verizon yellow pages) R: subwaynavigator R: mapquest R: altavista R: zagat R: nytoday synthetic data from: A. Marian et al., TODS 2004

Winter Semester 2003/2004Selected Topics in Web IR and Mining Ranking of SQL Query Results motivated by applications such as: realtor databases (e.g., homeadvisor.msn.com) with attributes City, Datebuilt, Price, Sqft, Bedrooms, Bathrooms, Deck, Fenced, Culdesac, Pool, Spa,... SQL queries such as Select * From Homes Where City In (‚Redmond‘, ‚Bellevue‘) And Bedrooms >= 3 And Bathrooms >= 2 And Price < often return too many answers or empty answers Similar examples are car sales databases, customer support data, etc. Straightforward idea: apply Fagin‘s TA algorithm, treating indexed attributes as SR-sources and non-indexed attributes as R-sources, implemented in SQL Less obvious issue: similarity measures on numerical and categorical attributes

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-9 IDF Based Similarities for query condition Ai=qi with categorical attribute Ai the score of tuple t with t.Ai=tj is: si(tj,qi) := idf(tj) for qi=tj, 0 else with idf(tj) := log (|R| / |{t|t  R and t.Ai=tj}|) the total score for a query over multiple categorical attributes then is: s(t,q) :=  i si(t.Ai,qi) Example: q: Type=‚Convertible‘ And Make=‚Nissan‘ ranks non-Nissan convertibles higher than Nissan standard cars, 4WDs, etc. for (practically continuous) numerical attributes A (e.g., price) define: for conditions Ai In (qi 1,..., qi k ) set si(tj, qi) := max j=1..k si(tj, qi j )

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-10 QF Based Similarities idf can be misleading about importance (e.g., old houses are infrequent, new houses are usually looked for) Query frequency (qf), as reflected in the current workload, can indicate importance, too: si(tj,qi) := qf(tj) for qi=tj, 0 else Workloads can also reveal correlations in user preferences (e.g., frequent queries with Make In (‚Honda Accord‘, ‚Toyota Camry‘) or City In (‚Redmond‘, ‚Bellevue‘, ‚Kirkland‘)) W(t): set of previous queries with In predicate containing value t For query value q and tuple with value t define: si(q,t) := J(W(q),W(t))*qf(t) with Jaccard coefficient

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-11 Experimental Evaluation 4000 tuples about homes for sale in the Seattle Eastside area 80 queries each with 2-5 attributes specified human relevance assessment by 5 colleagues Ranking quality metric: weighted precision R for top n (n=10) with r i =1 if rank is relevant, 0 otherwise from: S. Agrawal et al., CIDR 2003

Winter Semester 2003/2004Selected Topics in Web IR and Mining7-12 Literature Amelie Marian, Nicolas Bruno, Luis Gravano: Evaluating Top-k Queries over Web-Accessible Databases, to appear in ACM TODS 2004 Nicolas Bruno, Luis Gravano, Amélie Marian: Evaluating Top-k Queries over Web-Accessible Databases, ICDE Conf Ronald Fagin, Amnon Lotem, Moni Naor: Optimal Aggregation Algorithms for Middleware, Journal of Computer and System Sciences Vol. 66, 2003 Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, Aristides Gionis: Automated Ranking of Database Query Results, CIDR Conf Panayiotis Tsaparas, Themistoklis Palpanas, Yannis Kotidis, Nick Koudas, Divesh Srivastava: Ranked Join Indices, ICDE Conf. 2003