Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer.

Presentation transcript:

Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer Science University of Illinois at Urbana-Champaign VLDB 2006

2 Outline
Introduction
Ranking cube
Answering top-k queries by Ranking Cube
Ranking Fragments
Performance study
Discussion and Conclusions

3 Multi-Dimensional Ranking Analysis
Consider an online used car database R with the following attributes:
– Type (e.g., sedan, convertible)
– Maker (e.g., Ford, Hyundai)
– Color (e.g., red, silver)
– Price
– Mileage
Example queries, each adding one selection condition to the previous one:
  select top 10 * from R where type = 'convertible' order by price + mileage asc
  select top 10 * from R where type = 'convertible' and color = 'red' order by price + mileage asc
  select top 10 * from R where type = 'convertible' and color = 'red' and maker = 'Ford' order by price + mileage asc
Moving down this list (adding conditions) is a drill-down; moving up (removing conditions) is a roll-up.

4 OLAP with Ranking?
Data cube revisited
– Pre-compute multi-dimensional group-bys
– Traditional measures: SUM, COUNT, AVG
Materializing all top-k results is not feasible
– Different k values
– Various ranking functions, e.g., order by (price-20k)^2 + (mileage-10k)^2 asc
Our Proposal: Ranking Cube
– Semi-online computation with semi-offline materialization
– Support a broad class of ranking functions

5 More on Rank-Aware Data Cube
Given a relation R
– A_1, A_2, …, A_s are selection dimensions
– N_1, N_2, …, N_r are ranking dimensions
– {A_i} and {N_j} are not exclusive
Our goal: efficiently answer top-k queries in a multi-dimensional space
The ranking function f(x) satisfies:
1. Given a sub-domain of x, the extreme point x* can be computed
2. Given a sub-domain of x, the upper and lower bounds of f(x) can be computed
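The second requirement on f(x) can be illustrated for the two ranking functions that appear in these slides. The following is a minimal sketch, assuming a sub-domain is an axis-aligned box given as one (low, high) interval per ranking dimension; the function names `linear_bounds` and `sq_dist_bounds` are hypothetical and not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumption: a sub-domain (block) is an axis-aligned box,
# one (low, high) interval per ranking dimension.
Box = Sequence[Tuple[float, float]]


def linear_bounds(weights: Sequence[float], box: Box) -> Tuple[float, float]:
    """Lower and upper bound of f(x) = sum(w_i * x_i) over the box."""
    lo = sum(w * (l if w >= 0 else h) for w, (l, h) in zip(weights, box))
    hi = sum(w * (h if w >= 0 else l) for w, (l, h) in zip(weights, box))
    return lo, hi


def sq_dist_bounds(target: Sequence[float], box: Box) -> Tuple[float, float]:
    """Lower and upper bound of f(x) = sum((x_i - t_i)^2) over the box."""
    lo = 0.0
    hi = 0.0
    for t, (l, h) in zip(target, box):
        nearest = min(max(t, l), h)  # clamp the target into the interval
        farthest = l if abs(t - l) >= abs(t - h) else h
        lo += (t - nearest) ** 2
        hi += (t - farthest) ** 2
    return lo, hi


block = [(0.0, 0.25), (0.5, 0.75)]
print(linear_bounds((1, 1), block))       # (0.5, 1.0)
print(sq_dist_bounds((0.5, 0.5), block))  # (0.0625, 0.3125)
```

For both functions the point that attains the lower bound is also the extreme point x* within the box, which covers the first requirement.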

6 Rank-Aware Query Processing
Rank-aware materialization for linear functions
– Onion [Chang et al. SIGMOD’00], PREFER [Hristidis et al. SIGMOD’01], Robust Indexing [Xin et al. VLDB’06]
Rank-aware query transformation
– Map rank query to range query [Chaudhuri et al. VLDB’99, Bruno et al. TODS’02]
Rank-aware query optimization
– TA [Fagin et al. PODS’01], RankSQL [Li et al. SIGMOD’05], Boolean+Ranking [Zhang et al. SIGMOD’06]
Rank aggregate
– RankAgg [Li et al. SIGMOD’06], ObjectFinder [Kaushik et al. SIGMOD’06]
Rank query with joins
– Ranked Join indices [Tsaparas et al. ICDE’03], Rank-Join [Ilyas et al. VLDB’03, SIGMOD’04]
And more…

7 What’s New with Ranking Cube
An effort to enrich the data cube
– Inherits the power of multi-dimensional analysis
– A new rank-aware materialization that does not assume a particular (e.g., linear) ranking function
Top-k query processing based on Ranking Cube
– Transform a top-k query into a sequence of selection queries
– Block-level access instead of tuple-level access
– No modification needed in the DBMS

9 Ranking Cube
Intuition
– Given a ranking function, the ranking cube should be able to:
  Quickly locate the most promising data region
  Tell how many tuples are there, and which tuples
  Retrieve the data efficiently
Approach
– Step 1: Create logical block space for rank analysis
  Group geometrically close tuples into blocks by data partitioning (equi-depth, R-tree, or clustering)
  Compute a (logical) block ID for each block
  The logical block space constitutes the basis for data cubing
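Step 1 can be sketched with the simplest of the partitioning options listed, equi-depth. This is only an illustration under assumptions: each ranking dimension is cut independently into the same number of bins, and the block ID is a row-major combination of the per-dimension bin indices. The names `equi_depth_boundaries` and `block_id` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def equi_depth_boundaries(values: Sequence[float], bins: int) -> List[float]:
    """Cut points that split the values into `bins` groups of roughly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def block_id(point: Sequence[float], boundaries_per_dim: Sequence[Sequence[float]]) -> int:
    """Logical block ID: per-dimension bin indices combined in row-major order."""
    bid = 0
    for value, boundaries in zip(point, boundaries_per_dim):
        bid = bid * (len(boundaries) + 1) + bisect_right(boundaries, value)
    return bid


# Four tuples over ranking dimensions (N1, N2), two bins per dimension.
points = [(0.1, 0.9), (0.2, 0.2), (0.6, 0.4), (0.8, 0.7)]
boundaries = [equi_depth_boundaries([p[d] for p in points], 2) for d in range(2)]
print(boundaries)                                  # [[0.6], [0.7]]
print([block_id(p, boundaries) for p in points])   # [1, 0, 2, 3]
```

The block IDs computed here are what the later cubing step attaches to each tuple.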

10 Ranking Cube (cont.)
Approach (cont.)
– Step 1: Create logical block space for rank analysis
  Each tuple is associated with a (logical) block ID
– Step 2: Compute measures in ranking cube
  Group-by with selection dimensions
  Straightforward measure: logical block IDs, together with the list of tuple IDs (TIDs) inside each block
  Alternative measure: compressed version (discussed later)
– Step 3: Create physical block space for efficient I/O
  The size of a logical block differs from cuboid to cuboid because of the multi-dimensional selections
  Group nearby logical blocks into a physical block for efficient data retrieval

11 Constructing Ranking Cube
A1, A2: selection dimensions; N1, N2: ranking dimensions
[Figure labels: Create Logical Block Space (axes N1, N2); Expected Logical Block Size P; Generating Logical Block Dimension; Block table; Table for data cubing; Data Cubing; A cell in ranking cube; Measure in Ranking Cube]
Cuboid A1A2 (each row is a cell; the measure {B: TID} maps a logical block ID to the tuple IDs inside it):
A1   A2   {B: TID}
1    1    {1: 1,4} {5: 3}
1    2    {11: 2}
..   ..   …
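The cubing step on this slide can be sketched as a group-by over the selection dimensions that collects tuple IDs per logical block. This is a minimal sketch, not the authors' implementation: the input layout (TID, selection values, block ID) and the name `build_cuboid` are assumptions, and the sample rows are chosen to reproduce the cells shown in the slide's table.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

# Assumed input: (tid, {selection dimension: value}, logical block id)
Row = Tuple[int, Dict[str, int], int]


def build_cuboid(rows: Iterable[Row], selection_dims: Sequence[str]) -> Dict[tuple, Dict[int, List[int]]]:
    """Group rows by the chosen selection dimensions; the measure of each
    cell maps a logical block ID to the list of TIDs that fall in it."""
    cuboid: Dict[tuple, Dict[int, List[int]]] = defaultdict(lambda: defaultdict(list))
    for tid, selections, bid in rows:
        cell = tuple(selections[d] for d in selection_dims)
        cuboid[cell][bid].append(tid)
    return {cell: dict(blocks) for cell, blocks in cuboid.items()}


rows = [
    (1, {"A1": 1, "A2": 1}, 1),
    (2, {"A1": 1, "A2": 2}, 11),
    (3, {"A1": 1, "A2": 1}, 5),
    (4, {"A1": 1, "A2": 1}, 1),
]
print(build_cuboid(rows, ("A1", "A2")))
# {(1, 1): {1: [1, 4], 5: [3]}, (1, 2): {11: [2]}}
```

Calling `build_cuboid(rows, ("A1",))` on the same rows yields the one-dimensional cuboid A1, whose single cell holds all four TIDs.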

12 Constructing Ranking Cube (cont.)
The sizes of the TID lists in different cuboids are not balanced, because the dimensions have different cardinalities
Cuboid A1A2:
A1   A2   {B: TID}
1    1    {1: 1,4}
..   ..   …
Cuboid A1:
A1   {B: TID}
1    {1: 1,4}, {5: 3}
..   …
Logical block: original block partitions
Physical block: merge nearby logical blocks
Cuboid A1A2 with physical block ID B’:
A1   A2   B’   {B: TID}
1    1    1    {1: 1,4}, {5: 3}
1    1    3    {9: 17}
..   ..   ..   …
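The slide does not say how nearby logical blocks are chosen for merging. One layout that is consistent with the table (logical blocks 1 and 5 in physical block 1, logical block 9 in physical block 3) is a 4 x 4 logical grid with row-major IDs starting at 1, merged in 2 x 2 groups. The sketch below assumes exactly that layout; the grid size, merge factor, and function names are hypothetical.

```python
from typing import Dict, List


def physical_block_id(logical_id: int, grid: int = 4, merge: int = 2) -> int:
    """Map a row-major logical block ID (starting at 1) on a grid x grid layout
    to the ID of the merge x merge physical block that contains it."""
    row, col = divmod(logical_id - 1, grid)
    per_row = grid // merge
    return (row // merge) * per_row + (col // merge) + 1


def to_physical(cell_blocks: Dict[int, List[int]], grid: int = 4, merge: int = 2) -> Dict[int, Dict[int, List[int]]]:
    """Regroup one cell's {logical block: TIDs} measure by physical block."""
    physical: Dict[int, Dict[int, List[int]]] = {}
    for bid, tids in sorted(cell_blocks.items()):
        physical.setdefault(physical_block_id(bid, grid, merge), {})[bid] = tids
    return physical


cell = {1: [1, 4], 5: [3], 9: [17]}   # measure of cell (A1=1, A2=1)
print(to_physical(cell))
# {1: {1: [1, 4], 5: [3]}, 3: {9: [17]}}
```

A real ranking cube would pick the merge granularity per cuboid so that physical blocks have similar sizes; the fixed 2 x 2 merge here is only for illustration.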

14 Query Processing (1)
Data access methods
– Get physical block from Ranking Cube: clustered index on cell identifiers (A1, A2, B’)
– Get logical block from Block Table: clustered index on B
Ranking Cube table:
A1   A2   B’   {B: TID}
1    1    1    {1: 1,4}, {5: 3}
1    1    3    {9: 17}
..   ..   ..   …

15 Query Processing (2)
Example query: select top 2 * from R where A1=1 and A2=1 order by N1+N2 asc
– Locate the first logical block (b1)
  The target physical block (t1, t3, t4) is retrieved
  (t1, t4) is returned, t3 is buffered
– Locate the second logical block (b5)
  The target physical block (t1, t3, t4) is identified
  t3 was buffered, so it is returned directly
Query processing works on the logical block space
Data access works on the physical block space
Ranking Cube maintains the mapping between logical blocks and physical blocks
S list: maintains the current top answers
H list: maintains the best possible unseen answers
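The interplay of the S list and the H list can be sketched as a best-first search over logical blocks. This is a simplified illustration under assumptions, not the paper's algorithm: the H list is seeded with every block that has tuples in the requested cell (ordered by the lower bound of the ranking function), physical-block buffering is left out, and the search stops once the k-th answer in the S list is no worse than the best bound left in the H list. All names are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Box = Sequence[Tuple[float, float]]


def top_k(k: int,
          score: Callable[[Point], float],
          lower_bound: Callable[[Box], float],
          boxes: Dict[int, Box],
          cell_blocks: Dict[int, List[Tuple[str, Point]]]):
    """Return (S list, fetched block IDs) for an ascending top-k query."""
    # H list: best possible score of each not-yet-fetched block in this cell.
    h_list = [(lower_bound(boxes[bid]), bid) for bid in cell_blocks]
    heapq.heapify(h_list)
    s_list: List[Tuple[float, str]] = []   # current top answers, ascending
    fetched: List[int] = []
    while h_list:
        bound, bid = heapq.heappop(h_list)
        if len(s_list) == k and s_list[-1][0] <= bound:
            break  # no unseen tuple can beat the current k-th answer
        fetched.append(bid)
        s_list.extend((score(point), tid) for tid, point in cell_blocks[bid])
        s_list = sorted(s_list)[:k]
    return s_list, fetched


# Two logical blocks of cell (A1=1, A2=1); ranking function N1 + N2.
boxes = {1: ((0.0, 0.5), (0.0, 0.5)), 5: ((0.0, 0.5), (0.5, 1.0))}
cell = {1: [("t1", (0.125, 0.125)), ("t4", (0.25, 0.125))],
        5: [("t3", (0.25, 0.625))]}
answers, fetched = top_k(2, sum, lambda box: sum(l for l, _ in box), boxes, cell)
print([tid for _, tid in answers], fetched)   # ['t1', 't4'] [1]
```

With k = 2 the search stops after the first block, because the best possible score in block 5 (0.5) cannot beat the current second answer (0.375).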

16 Query Processing (3)
Determine the next logical blocks to be retrieved
– First logical block: determined by analyzing the ranking function
– Subsequent logical blocks
  Found among the neighboring blocks (for convex functions)
  Found by decomposing the space and analyzing each part (for other functions)
[Figure: first and second blocks visited for ranking function N1+N2, and for ranking function (N1-0.5)^2+(N2-0.5)^2]
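For a convex ranking function, the first block is the one containing the extreme point, and further candidates are its neighbors. The sketch below assumes an equal-width grid over [0, 1] in every ranking dimension and identifies a block by its tuple of per-dimension indices; both choices are simplifications for illustration, and the names `block_of` and `neighbors` are hypothetical.

```python
from typing import List, Sequence, Tuple


def block_of(point: Sequence[float], grid: int) -> Tuple[int, ...]:
    """Grid coordinates of the block containing `point` (values in [0, 1])."""
    return tuple(min(int(x * grid), grid - 1) for x in point)


def neighbors(block: Tuple[int, ...], grid: int) -> List[Tuple[int, ...]]:
    """Blocks that share a face with `block` inside a grid^d layout."""
    out = []
    for dim in range(len(block)):
        for step in (-1, 1):
            pos = block[dim] + step
            if 0 <= pos < grid:
                out.append(block[:dim] + (pos,) + block[dim + 1:])
    return out


# N1 + N2 is minimized at the origin; (N1-0.5)^2 + (N2-0.5)^2 at (0.5, 0.5).
print(block_of((0.0, 0.0), 4), neighbors(block_of((0.0, 0.0), 4), 4))
# (0, 0) [(1, 0), (0, 1)]
print(block_of((0.5, 0.5), 4), neighbors(block_of((0.5, 0.5), 4), 4))
# (2, 2) [(1, 2), (3, 2), (2, 1), (2, 3)]
```

In a full search, each neighbor would be inserted into the H list with the lower bound of the ranking function over that block.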

18 Ranking Fragments (1)
Curse of dimensionality?
[Figure: cuboid lattice over dimensions A, B, C, D, from the one-dimensional cuboids up to ABCD]
Partition dimensions into several groups
Materialize low-dimensional cuboids offline
Assemble high-dimensional cuboids online
Minimal Cubing Approach [Li et al. VLDB’04]

19 Ranking Fragments (2)
Materialized: Cuboid A1, Cuboid A2
To assemble: Cuboid A1A2
– Do not assemble the whole cuboid
– Assemble the required cells only
Requested cell in Cuboid A1A2:
– Cuboid A1: request (A1=1, B=b1), return {t1, t4, t10, …}
– Cuboid A2: request (A2=1, B=b1), return {t1, t4, t3, …}
– Merge the two lists for Cuboid A1A2: request (A1=1, A2=1, B=b1), return {t1, t4}
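The merge on this slide is an intersection of the TID lists returned by the materialized cuboids for the same logical block. A minimal sketch follows, assuming each list fits in memory and is non-empty as an argument list; the name `assemble_cell` is hypothetical.

```python
from typing import List, Sequence


def assemble_cell(fragment_lists: Sequence[Sequence[str]]) -> List[str]:
    """Intersect the TID lists obtained from each materialized cuboid for one
    logical block, keeping the order of the first list."""
    first, *rest = fragment_lists
    common = set(first).intersection(*rest)
    return [tid for tid in first if tid in common]


from_a1 = ["t1", "t4", "t10"]   # (A1=1, B=b1)
from_a2 = ["t1", "t4", "t3"]    # (A2=1, B=b1)
print(assemble_cell([from_a1, from_a2]))   # ['t1', 't4']
```

Only the requested cell and block are assembled, so the cost depends on the lengths of the two lists rather than on the size of the full A1A2 cuboid.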

20 Ranking Cube: Beyond the index
Decouple the cubing and partitioning modules
Unified logical block space for all cuboids
– Reduces computation and space compared with m-dimensional indices
– Makes the online fragment assembly easier
– Allows advanced partitioning methods for high-dimensional and structural data
Compressing the TID list
– Lossless compression: e.g., dictionary encoding, null suppression
– Lossy compression: e.g., Bloom filter
– High-level summary: e.g., count (mean) in each logical block
– Compressing across cuboids: e.g., correlation between cells
Block-level data access instead of tuple-level access

22 Experimental Results
Performance study
– Baseline: SQL Server
  Index on each dimension
– Query transformation [Chaudhuri et al. VLDB’99]
  Transform a ranked query to a range selection query
  Multi-dimensional index
– Ranking cube (fragment)
  Index on cube cells (cuboids)
  Index on block IDs (block tables)

23 Experiment Setting
Implementation details
– C# + MS SQL Server 2005
– Store all ranking cubes and block tables in SQL Server
Synthetic data sets
Real data set: Forest CoverType
– 12 selection dimensions, 3 ranking dimensions, 3.5M tuples

24 Execution Time w.r.t. K
Synthetic data
Default parameter setting

25 # Dimensions in Ranking Function
Synthetic data with 4 ranking dimensions
Partitioning was built on all 4 ranking dimensions
The number of ranking attributes in the queries is varied from 2 to 4

26 Number of Data Tuples
Vary the size of the database from 1M to 10M tuples
Very promising performance on large datasets

27 Ranking Fragments
Forest CoverType data
Partition selection dimensions into groups of size 3
Build ranking fragments on each group
Vary the fragment size

28 Space Usage
Build ranking fragments with group size 2
1. Space usage grows linearly with the number of selection dimensions
2. Most space is used to store the cell identifiers in the relational table
3. The space usage can be greatly reduced by storing the ranking cube outside the relational table

29 Conclusions and Future Work
OLAP with Ranking
– Ranking Cube as semi-offline materialization
– Ranked query processing by semi-online computation
Current status
– Extended to multi-relational ranked queries using multi-rank-cube
Future work
– Apply compression techniques
– Explore and compare different partitioning strategies
– Support more query types