• Introduction  • Views  • Related Work  • Preliminaries  • Problems Discussed  • Algorithm LPTA  • View Selection Problem  • Experimental Results


INTRODUCTION
• Answering top-k queries
  - An active research topic.
  - Goal: quickly retrieve the k highest-ranking tuples under a monotone ranking function defined on attributes of the underlying relations.
• Algorithms
  - Threshold Algorithm (TA), proposed by Fagin et al. and independently by Güntzer et al. and Nepal et al.

INTRODUCTION
• Materialized views
  - A database table that contains the result of a previously asked query; it is actually constructed and stored.
• Problem discussed
  - Find efficient methods of answering a query using a set of previously defined materialized views over the database.
• Why views?
  - They are relevant to a variety of data management problems.
  - They promise increased performance: views are materialized (incurring a space overhead) in the hope of speeding up some queries.

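INTRODUCTION
Views do not specify any selection conditions on the attributes they aim to rank.
Example (top-k):
• Relation R with columns tid, X1, X2, X3 (the table values are not recoverable from the slide).
• View V1: a top-5 query ranked by f1 = 2x1 + 5x2, stored as (tid, Score) pairs.
• View V2: a top-3 query ranked by f2 = x2 + 2x3, stored as (tid, Score) pairs.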

INTRODUCTION
Given a top-2 query defined by the function f3 = 3x1 + 10x2 + 5x3, we can apply a standard top-k algorithm (e.g., TA) to the data in R and obtain the answer to the query.
Using views instead?
• Feasibility
• Guarantee of an answer
• Speed of using R directly vs. using the views

RELATED WORK
• Multimedia context: uses ordered lists.
• Threshold Algorithm (TA)
  - Requires the scoring function to be monotone: for tuples t and u, if t[i] ≤ u[i] for all 1 ≤ i ≤ m, then Score_Q(t) ≤ Score_Q(u).
  - Requires that each attribute has an index mechanism that makes all tids accessible in sorted order.
  - A single random access is required to resolve all attributes of a tid.
• The paper focuses on additive scoring functions (which are monotone): Score_Q(t) = w_1 t[1] + w_2 t[2] + ... + w_m t[m].
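The TA requirements listed above (sorted access per attribute, random access by tid, a monotone additive score) can be sketched as follows. This is a minimal illustration, not the slides' own code: the function name `threshold_algorithm`, the in-memory dictionary standing in for R, and the sample rows are all assumptions made for the example.

```python
import heapq

def threshold_algorithm(rows, weights, k):
    """Top-k by additive score using TA-style sorted and random access.

    rows:    dict tid -> tuple of attribute values (stands in for R)
    weights: non-negative weights w_1..w_m, so the score is monotone
    """
    m = len(weights)

    def score(t):
        return sum(w * v for w, v in zip(weights, t))

    # One sorted list per attribute: tids by decreasing attribute value.
    lists = [sorted(rows, key=lambda tid, i=i: rows[tid][i], reverse=True)
             for i in range(m)]
    seen = {}
    for depth in range(len(rows)):
        last = []
        for i in range(m):
            tid = lists[i][depth]                 # sorted access
            last.append(rows[tid][i])
            if tid not in seen:
                seen[tid] = score(rows[tid])      # random access resolves all attributes
        threshold = score(last)                   # no unseen tuple can score higher
        top = heapq.nlargest(k, seen.items(), key=lambda p: p[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda p: p[1])

# Hypothetical data; weights follow f3 = 3x1 + 10x2 + 5x3 from the slides.
example_rows = {1: (10, 20, 30), 2: (90, 5, 5), 3: (50, 50, 50),
                4: (1, 99, 2), 5: (60, 70, 10)}
print(threshold_algorithm(example_rows, (3, 10, 5), 2))
```

The scan stops as soon as the k-th best score seen so far is at least the threshold, which is why monotonicity of the scoring function matters.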

RELATED WORK
Variants:
• TA-Sorted: lists are always accessed sequentially and no random accesses are performed.
• PREFER [Hristidis et al.]: stores multiple copies of R. To answer a new query it uses only one copy of the relation, the one closest to the new query.

PRELIMINARIES
• Consider a relation R with m numeric attributes X1, X2, ..., Xm.
• Dom_i = [lb_i, ub_i] is the domain of the i-th attribute.
• A tuple t is viewed as a numeric vector t = (t[1], t[2], ..., t[m]).
• Top-k ranking queries in SQL-like syntax:
    SELECT TOP [k] FROM R WHERE Range_Q ORDER BY Score_Q
• A query is expressed as a triple Q = (Score_Q, k, Range_Q):
  - Score_Q: a function that assigns a numeric score to any tuple t.
  - Range_Q: a Boolean function that defines a selection condition on the tuples of R.
• Semantics: the system retrieves the k tuples with the top scores among those satisfying the selection condition.
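To make the triple Q = (Score_Q, k, Range_Q) concrete, here is a small sketch that evaluates such a query naively by scanning R. The names `TopKQuery` and `evaluate`, and the sample relation, are hypothetical and chosen only for illustration.

```python
import heapq
from typing import Callable, NamedTuple, Sequence

class TopKQuery(NamedTuple):
    score: Callable[[Sequence[float]], float]    # Score_Q
    k: int
    in_range: Callable[[Sequence[float]], bool]  # Range_Q

def evaluate(query, relation):
    """Naive evaluation: filter by Range_Q, then keep the k highest Score_Q."""
    candidates = ((tid, query.score(t))
                  for tid, t in relation.items() if query.in_range(t))
    return heapq.nlargest(query.k, candidates, key=lambda p: p[1])

# Hypothetical relation with three attributes per tuple.
R = {1: (10, 20, 30), 2: (90, 5, 5), 3: (50, 50, 50),
     4: (1, 99, 2), 5: (60, 70, 10)}
q = TopKQuery(score=lambda t: 3 * t[0] + 10 * t[1] + 5 * t[2],
              k=2,
              in_range=lambda t: t[0] >= 10)
print(evaluate(q, R))
```

A full scan like this is the baseline that TA and LPTA try to avoid; it is shown only to pin down the query semantics.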

PRELIMINARIES
• Materialized ranking view (V): the materialized result of a previously executed top-k query, with its tuples ordered according to that query's scoring function.
• For a previously executed query Q' = (Score_Q', k', Range_Q'), the corresponding materialized ranking view V' is a set of k' pairs (tid, Score_Q'(tid)), ordered by decreasing value of Score_Q'(tid).

PROBLEMS
• Problem 1: Top-k query answering using views
  - Given a set U of views and a query Q, obtain an answer to Q by combining all the information conveyed by the views in U.
  - Solution: an algorithm named LPTA.
• Problem 2: View selection
  - Given a collection of views V = {V_1, V_2, ..., V_r} that includes the base views (thus r ≥ m) and a query Q, determine the most efficient subset U ⊆ V on which to execute Q.
  - This subset U is provided as input to LPTA.
  - The selection should identify a set of views that can answer the query and, if possible, answer it faster than running TA on the base views.

ALGORITHM LPTA
• An adaptation of the TA algorithm that answers top-k queries using multiple ranking views.
• Requires the scoring functions of the query and the views to be linear and additive.
• Uses sorted access on pairs (tid, Score_Q(tid)).
• Views and queries have the form V' = (Score_V', n, *) and Q = (Score_Q, k, *), respectively.
• Covered next: pseudocode, example, general approach.

ALGORITHM LPTA
Pseudocode:
1. Initialize the top-k buffer to empty.
2. Retrieve the tids from the views V1 and V2 in lock-step, in order of decreasing score.
3. Retrieve the corresponding tuple by random access on R.
4. Compute its score according to f3 and update the top-k buffer so that it contains the largest scores.
5. Check the stopping condition; if it does not hold, continue from step 2.
6. Once the stopping condition is satisfied, the top-k buffer holds the result.
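The lock-step loop in the pseudocode can be sketched as below. This is an illustrative reading, not the authors' implementation: the function name `lpta`, the `unseen_max` callback and the sample data are assumptions. The callback stands in for the linear program; in the demo it uses the closed form 1.5·s1 + 2.5·s2, which is a valid upper bound only because f3 = 1.5·f1 + 2.5·f2 holds identically for the slides' functions.

```python
import heapq

def lpta(views, relation, score_q, k, unseen_max):
    """Lock-step scan of ranking views with a pluggable stopping bound.

    views:      list of lists of (tid, view_score), each by decreasing score
    relation:   dict tid -> attribute tuple (random access on R)
    unseen_max: callable mapping the last score read from each view to an
                upper bound on score_q over all still-unseen tuples
    """
    seen = {}                                      # tid -> score_q(tuple)
    depth_limit = min(len(v) for v in views)
    for d in range(depth_limit):
        last_scores = []
        for view in views:
            tid, s = view[d]                       # sorted access on the view
            last_scores.append(s)
            if tid not in seen:
                seen[tid] = score_q(relation[tid]) # random access on R
        top = heapq.nlargest(k, seen.items(), key=lambda p: p[1])
        if len(top) == k and unseen_max(last_scores) <= top[-1][1]:
            return top, d + 1                      # answer and iterations used
    return heapq.nlargest(k, seen.items(), key=lambda p: p[1]), depth_limit

# Hypothetical relation; f1, f2, f3 are the functions used in the slides.
R = {1: (10, 20, 30), 2: (90, 5, 5), 3: (50, 50, 50),
     4: (1, 99, 2), 5: (60, 70, 10)}
f1 = lambda t: 2 * t[0] + 5 * t[1]
f2 = lambda t: t[1] + 2 * t[2]
f3 = lambda t: 3 * t[0] + 10 * t[1] + 5 * t[2]
V1 = sorted(((tid, f1(t)) for tid, t in R.items()), key=lambda p: -p[1])
V2 = sorted(((tid, f2(t)) for tid, t in R.items()), key=lambda p: -p[1])
print(lpta([V1, V2], R, f3, 2, lambda s: 1.5 * s[0] + 2.5 * s[1]))
```

Passing the bound as a callback keeps the scan logic separate from the way the bound is computed, so a real LP solver could be plugged in without touching the loop.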

ALGORITHM LPTA
Stopping condition:
• After the d-th iteration, let the entries read from V1 and V2 be (tid_1^d, s_1^d) and (tid_2^d, s_2^d), and let topk_min be the minimum score in the top-k buffer.
• At this point the unseen tuples have to satisfy the following inequalities (the domain of each attribute of R is [1, 100]):
    0 ≤ x1, x2, x3 ≤ 100
    2x1 + 5x2 ≤ s_1^d
    x2 + 2x3 ≤ s_2^d
• These inequalities define a convex region in three-dimensional space.
• unseen_max is the solution of the linear program that maximizes f3 = 3x1 + 10x2 + 5x3 over this region.
• The algorithm stops when unseen_max ≤ topk_min.
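The linear program behind unseen_max is small enough to solve by brute force for illustration. The sketch below enumerates the vertices of the feasible region with exact rational arithmetic; this technique is swapped in here in place of a real LP solver, takes exponential time in general, and is meant only for tiny instances such as this one. The function names `solve_square` and `unseen_max` and their parameters are assumptions.

```python
from fractions import Fraction
from itertools import combinations

def solve_square(a, b):
    """Solve the square system a x = b exactly; return None if singular."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]

def unseen_max(objective, view_weights, last_scores, lb=0, ub=100):
    """Maximise objective . x subject to lb <= x_i <= ub and
    view_weights[j] . x <= last_scores[j], by enumerating vertices."""
    n = len(objective)
    rows, rhs = [], []                       # every constraint as row . x <= rhs
    for i in range(n):
        unit = [1 if j == i else 0 for j in range(n)]
        rows.append(unit)
        rhs.append(ub)
        rows.append([-u for u in unit])
        rhs.append(-lb)
    for w, s in zip(view_weights, last_scores):
        rows.append(list(w))
        rhs.append(s)
    best = None
    for idx in combinations(range(len(rows)), n):
        x = solve_square([rows[i] for i in idx], [rhs[i] for i in idx])
        if x is None:
            continue                         # chosen constraints do not meet in a point
        if all(sum(c * v for c, v in zip(r, x)) <= b for r, b in zip(rows, rhs)):
            value = sum(c * v for c, v in zip(objective, x))
            best = value if best is None else max(best, value)
    return best

f3 = (3, 10, 5)
views = [(2, 5, 0), (0, 1, 2)]               # weights of f1 and f2
print(unseen_max(f3, views, (527, 219)))
print(unseen_max(f3, views, (299, 202)))
```

Under the constraints as stated on the slide, this sketch yields 1338 for (s_1^d, s_2^d) = (527, 219) and 953.5 for (299, 202); the second value is below a buffer minimum of 996, which is the situation in which the scan stops.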

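ALGORITHM LPTA
Example (top-k query answering using views):
• Relation R with columns tid, X1, X2, X3 (the table values are not recoverable from the slide).
• View V1: top-5 query with f1 = 2x1 + 5x2. View V2: top-3 query with f2 = x2 + 2x3.
• Query: Q = (f3, k, *) with f3 = 3x1 + 10x2 + 5x3 and k = 2.
• Tuples read in this iteration, with their f3 scores: (7, 1248) and (6, 996).
• Top-2 buffer: (7, 1248), (6, 996).
• The linear program with s_1^d = 527 and s_2^d = 219 gives unseen_max = 1338, which is greater than topk_min = 996, so the scan continues.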

ALGORITHM LPTA
Example, continued (next iteration):
• Views and query as before: f1 = 2x1 + 5x2 (V1, top-5), f2 = x2 + 2x3 (V2, top-3), Q = (f3, k, *) with f3 = 3x1 + 10x2 + 5x3.
• Tuples read in this iteration, with their f3 scores: (6, 996) and (4, 910).
• Top-2 buffer: (7, 1248), (6, 996).
• The linear program with s_1^d = 299 and s_2^d = 202 gives unseen_max ≤ topk_min, so the stopping condition is satisfied.

ALGORITHM LPTA
[Figure: general approach in two dimensions for a top-1 query over R(X1, X2). Views V1 and V2 list their entries (tid_i^1, s_i^1) through (tid_i^5, s_i^5); the unit square with corners O = (0,0), P = (1,0), R = (1,1), T = (0,1) shows V1, V2, Q and the stopping condition after reading tid_1^1 and tid_2^1.]

ALGORITHM LPTA
General approach, at iteration d:
• View V1 (f_V1 = 2x1 + 5x2): the entry read is (tid_1^d, s_1^d).
• View V2 (f_V2 = x2 + 2x3): the entry read is (tid_2^d, s_2^d).
• Query Q: f_Q = 3x1 + 10x2 + 5x3.
• Constraints on the unseen tuples:
    0 ≤ x1, x2, x3 ≤ 100
    2x1 + 5x2 ≤ s_1^d
    x2 + 2x3 ≤ s_2^d
• Stop when unseen_max ≤ topk_min.

ALGORITHM LPTA
[Figure: the same two-dimensional top-1 illustration one iteration later, after reading tid_1^2 and tid_2^2 from V1 and V2; the unit square with corners O = (0,0), P = (1,0), R = (1,1), T = (0,1) shows V1, V2, Q and the updated stopping condition.]

TA VS LPTA
• LPTA essentially becomes TA when the set of views U is equal to the set of base views.
• Execution cost: both use sequential as well as random accesses.
• Execution efficiency: I/O operations play the dominant role; they overshadow the cost of CPU operations such as updating the top-k buffer and testing the stopping condition.
• The two kinds of access are highly correlated: every sequential access incurs a random access.
• Determining factor: if d is the number of lock-step iterations and r the number of views, the running cost is O(dr).

VIEW SELECTION
• Given a collection of views Ѵ = {V_1, V_2, ..., V_r} that includes the base views, determine the most efficient subset U ⊆ Ѵ on which to execute the query Q.
• Topics: conceptual discussion; view selection in two dimensions; view selection in higher dimensions.

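VIEW SELECTION (2D)
[Figure: unit square with corners O = (0,0), P = (1,0), R = (1,1), T = (0,1) on axes X and Y, showing the view vectors V1 and V2, the query vector Q, the minimum top-k tuple, and the construction points A, A1, A'1, A'2, M, B, B2, B'1, B'2.]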

VIEW SELECTION (HIGHER DIMENSIONS)
For a set of views Ѵ = {V_1, V_2, ..., V_r} over an m-dimensional dataset and a query Q, the optimal execution of LPTA requires the use of a subset of views U ⊆ Ѵ such that |U| ≤ m.

COST ESTIMATION
• Compute histograms representing the distribution of scores along each view in U.
• Estimate topk_min from H_Q by determining the bucket that contains the k-th highest tuple.
• "Walk down" these histograms until the stopping condition is reached, checking the stopping condition by linear programming.
• When unseen_max < topk_min, perform a logarithmic search within the last bucket.
• Number of sorted accesses: ((d-1)n/b + n')r'.
• Running time of the algorithm: O((d-1) + log n').
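The step that estimates topk_min from the query-score histogram can be sketched as a walk from the highest bucket downward. The bucket representation (low, high, count) and the function name `estimate_topk_min_bucket` are assumptions; the slides do not say which point inside the bucket is taken as the estimate, so the sketch returns the bucket bounds only.

```python
def estimate_topk_min_bucket(buckets, k):
    """Locate the histogram bucket that holds the k-th highest score.

    buckets: list of (low, high, count) for the query-score histogram H_Q
    Returns (low, high) of that bucket, or None if fewer than k tuples exist.
    """
    remaining = k
    # Walk from the highest-scoring bucket downward, consuming counts.
    for low, high, count in sorted(buckets, key=lambda b: b[0], reverse=True):
        if count >= remaining:
            return low, high
        remaining -= count
    return None

# Hypothetical equi-width histogram over scores in [0, 1000).
H_Q = [(0, 250, 40), (250, 500, 30), (500, 750, 20), (750, 1000, 10)]
print(estimate_topk_min_bucket(H_Q, 15))
```

With the sample histogram, the 15th highest score falls in the bucket [500, 750), because the top bucket accounts for only 10 tuples.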

SELECT VIEWS
1. Initialize MinCost = ∞, MinCurCost = ∞ and U = { }.
2. For each view V ∈ Ѵ − U, compare the cost estimate for V with MinCurCost; if EstimateCost < MinCurCost, set MinV to V and MinCurCost to the EstimateCost of V.
3. Once all views have been examined, if MinCurCost < MinCost, add MinV to U.
4. These steps are repeated for all m attributes considered.

SELECT VIEWS
• SelectViews(Q, V), exhaustive: estimates the cost of all C(r, p) possible subsets of V and selects the one with minimum cost.
• Simple greedy heuristic: iterates over the set of views and, at each step, selects the view that reduces the total cost by the greatest amount.
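The greedy heuristic can be sketched as follows. The function name `select_views_greedy`, the `estimate_cost` callback (standing in for the histogram-based estimator) and the toy cost table are assumptions; the rule of stopping when no remaining view lowers the estimate is one reading of the slides.

```python
import math

def select_views_greedy(views, estimate_cost, max_views):
    """Greedily grow U, adding the view that lowers estimate_cost(U) most.

    estimate_cost: callable taking a tuple of views and returning a cost
    max_views:     upper limit on |U| (the slides repeat once per attribute)
    """
    chosen, min_cost = [], math.inf
    for _ in range(max_views):
        best_view, best_cost = None, math.inf
        for v in views:
            if v in chosen:
                continue
            cost = estimate_cost(tuple(chosen) + (v,))
            if cost < best_cost:
                best_view, best_cost = v, cost
        if best_view is None or best_cost >= min_cost:
            break                      # no remaining view improves the estimate
        chosen.append(best_view)
        min_cost = best_cost
    return chosen, min_cost

# Toy cost table, purely illustrative; missing subsets cost infinity.
costs = {("V1",): 50, ("V2",): 80, ("V3",): 60,
         ("V1", "V2"): 30, ("V1", "V3"): 45, ("V1", "V2", "V3"): 35}
estimate = lambda u: costs.get(tuple(sorted(u)), math.inf)
print(select_views_greedy(["V1", "V2", "V3"], estimate, 3))
```

Compared with the exhaustive variant, this calls the estimator a number of times that grows with r times the number of rounds rather than with the number of subsets, at the price of possibly missing the best subset.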

SELECT VIEWS
• SelectViewsSpherical(Q, V): has to solve the linear program just once and is very effective for highly restrictive data sets.
• SelectViewsByAngle: sorts the view vectors by increasing angle with the query vector and returns the top m views.
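Selection by angle is simple enough to sketch directly. The function name `select_views_by_angle` and the dictionary of weight vectors are assumptions; the vectors for V1, V2 and Q are the weights of f1, f2 and f3 from the slides, and X1, X2, X3 stand for the base views.

```python
import math

def select_views_by_angle(view_vectors, query_vector, m):
    """Return the names of the m views whose weight vectors make the
    smallest angle with the query's weight vector."""
    def angle(v):
        dot = sum(a * b for a, b in zip(v, query_vector))
        norm = math.hypot(*v) * math.hypot(*query_vector)
        # Clamp to [-1, 1] to guard against floating-point drift.
        return math.acos(max(-1.0, min(1.0, dot / norm)))
    return sorted(view_vectors, key=lambda name: angle(view_vectors[name]))[:m]

vectors = {"V1": (2, 5, 0), "V2": (0, 1, 2),
           "X1": (1, 0, 0), "X2": (0, 1, 0), "X3": (0, 0, 1)}
print(select_views_by_angle(vectors, (3, 10, 5), 3))
```

For these vectors, V1 is closest in angle to the query, followed by the base view on X2 and then V2.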

MORE GENERAL QUERIES & VIEWS
• Views that only materialize their top-k tuples: truncate the histograms.
• Accommodating range conditions: select the views that cover the range conditions, and truncate each attribute's histogram.

EXPERIMENTAL RESULTS
Real data: performance comparison of PREFER, LPTA and TA in two dimensions (2d) and three dimensions (3d). [Charts not recoverable from the slide.]

REFERENCES
• Answering Top-k Queries Using Views. Gautam Das, Dimitrios Gunopulos, Nick Koudas. aitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt