ARCube: supporting ranking aggregate queries in partially materialized data cubes
SIGMOD 2008
Tianyi Wu (1), Dong Xin (2), Jiawei Han (1)
(1) University of Illinois, Urbana-Champaign, Urbana, IL, USA
(2) Microsoft Research, Redmond, WA, USA
Presenter: Chun Kit Chui (Kit)
Supervisor: Dr. Ben Kao

Outline
- Introduction
  - Traditional top-k queries
  - Aggregate queries
- What is AR-cube?
- The basic query execution framework using AR-cube
- Optimizations
  - Chunking
  - Scheduling
- Supporting general ranking functions
  - SUM, MIN, MAX, STDEV, VAR, MAD, AVG
- Experimental results

Traditional top-k queries
Traditional techniques for top-k analysis are often tailored to ranking functions on individual tuples.
- Each tuple has a ranking score that is an aggregation over multiple measure attributes.
- E.g. linear weighted sum.

Restaurant Review Database (R). Each tuple represents a restaurant and the scores it received, e.g. for food taste, service, price and ambiance (the measure attributes).

Restaurant ID | Taste | Service | Price | Ambiance | Ranking function
1 | 0.9 | 1.0 | 0.3 | 0.82 | 0.73
2 | 0.8 | 0.7 |  | 0.6 | 0.664
3 | 0.75 | 0.9 |  | 0.5 | 0.81
… | … | … | … | … | …

Traditional top-k queries
Traditional techniques for top-k analysis are often tailored to ranking functions on individual tuples.
- Each tuple has a ranking score that is an aggregation over multiple measure attributes.
- Example top-k query: find the top-rated restaurant in the Restaurant Review Database (R) according to a user-specified ranking function (here a linear weighted sum).

Example top-k query:
    SELECT * FROM R
    ORDER BY Taste*0.6 + Service*0.1 + Price*0.3 DESC
    LIMIT 1

- The ORDER BY expression is the ranking function; LIMIT gives the top-k parameter.
- Each tuple has its own ranking score. The top-ranked tuple is returned to the user.
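The top-k query above can be sketched in a few lines of Python. This is only an illustration: the helper names `score` and `top_k` are hypothetical, and the Price values of restaurants 2 and 3 are assumptions chosen so that the weighted sums match the ranking scores shown on the slide (0.664 and 0.81).

```python
import heapq

# Restaurant Review Database (R): id -> (taste, service, price).
# Row 1 is taken from the slide; the Price values of rows 2 and 3 are
# assumptions chosen to reproduce the slide's ranking scores.
R = {
    1: (0.9, 1.0, 0.3),
    2: (0.8, 0.7, 0.38),
    3: (0.75, 0.9, 0.9),
}

WEIGHTS = (0.6, 0.1, 0.3)  # weights from the ORDER BY clause


def score(measures):
    """Linear weighted sum over one tuple's measure attributes."""
    return sum(w * m for w, m in zip(WEIGHTS, measures))


def top_k(table, k):
    """Return the k highest-scoring (id, score) pairs, best first."""
    scored = ((rid, score(m)) for rid, m in table.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


print(top_k(R, 1))
```

Every tuple is scored independently here, which is what distinguishes this setting from the aggregate queries that follow.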

Aggregate queries
Car sales database (S). Time, Location and Type are dimension attributes; Sales is the measure attribute.

ID | Time | Location | Type | Sales
1 | 2007 | Chicago | Sedan | 13
2 | 2007 | Chicago | Pickup | 12
3 | 2007 | Vancouver | SUV | 10
4 | 2008 | Vancouver | Sedan | 37
5 | 2008 | Vancouver | SUV | 20
… | … | … | … | …

Each tuple shows the number of cars sold for a particular time, location and car type. E.g. 13 Sedan cars were sold in Chicago in 2007.

Aggregate queries
An aggregate query consists of 2 core parts:
- Group by
  - Defined on the dimension attributes.
  - Determines the dimension of the returned result (cuboid).
  - Defines the grouping of tuples: tuples that share the same values in those dimension attributes are grouped together.
- Aggregate measure
  - Defined on measure attributes.
  - Applied to the measure attributes of the records in each group.
  - E.g. SUM, MIN, MAX, AVG, STDEV, etc.

Example aggregate query on the car sales database (S):
    SELECT Time, Location, SUM(Sales) FROM S
    GROUP BY Time, Location
    ORDER BY SUM(Sales) DESC
    LIMIT 2

We want to group the tuples according to the Time and Location attributes. Since there are two group-by dimension attributes, the resulting cuboid is a 2-dimensional summary.

Aggregate queries
The group by also determines the number of cells in the returned result (cuboid).

Query result (2D cuboid):

Time | Location | Tuples | Sales
2007 | Chicago | 1, 2 | 13+12=25
2007 | Vancouver | 3 | 10
2008 | Vancouver | 4, 5 | 37+20=57
2008 | Chicago | ... | …
… | … | … | …

If we consider the years 2000 to 2009 and 30 locations, there will be 10*30=300 cells. The number of cells in the cuboid is exponential in the number of dimensions of the cuboid.

Aggregate queries
The SUM aggregate measure is defined on the Sales attribute and is applied to the measure attributes of the tuples in each group, giving the SUM(Sales) column of the query result (2D cuboid).

Aggregate queries
The number of cuboid cells is exponential in the number of dimension attributes.
- When the dimensionality is high, the returned cuboid is often gigantic (i.e. it has many cuboid cells).
- It is inefficient to compute the full cuboid.

Aggregate queries
Example top-k aggregate query:
    SELECT Time, Location, SUM(Sales) FROM S
    GROUP BY Time, Location
    ORDER BY SUM(Sales) DESC
    LIMIT 1

Top-k aggregate queries
- Return only the top-k ranked cells.
- The presentation is more comprehensible.
- Computation is potentially more efficient.

E.g. instead of knowing the total number of car sales in each year and each location, the manager would like to find the year and location with the largest number of car sales.
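As a baseline, the top-k aggregate query can be answered naively by materializing the whole cuboid and then ranking its cells. The sketch below does exactly that on the five rows shown for the car sales database; the function name `top_k_sum` is hypothetical, and this is the approach that AR-cube tries to avoid, not the paper's method.

```python
import heapq
from collections import defaultdict

# Car sales database (S): (id, time, location, type, sales), rows from the slide.
S = [
    (1, 2007, "Chicago", "Sedan", 13),
    (2, 2007, "Chicago", "Pickup", 12),
    (3, 2007, "Vancouver", "SUV", 10),
    (4, 2008, "Vancouver", "Sedan", 37),
    (5, 2008, "Vancouver", "SUV", 20),
]


def top_k_sum(rows, k):
    """GROUP BY (time, location), SUM(sales), ORDER BY SUM DESC, LIMIT k."""
    cells = defaultdict(int)
    for _id, time, location, _type, sales in rows:
        cells[(time, location)] += sales  # full cuboid is computed here
    return heapq.nlargest(k, cells.items(), key=lambda cell: cell[1])


print(top_k_sum(S, 1))  # [((2008, 'Vancouver'), 57)]
```

Every cell is computed even though only k of them are returned, which is the inefficiency that motivates the rest of the talk.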

Contributions
- Study the problem of top-k aggregate query processing.
- Propose the Aggregate Ranking cube (AR-Cube) for supporting top-k aggregate queries.
  - Novel partial cube.
  - Unified structure for supporting various aggregate measures (e.g. SUM, MAX, MIN, AVG, RANGE, STDEV, VAR, MAD).
- Basic query execution algorithm.
  - Thresholding technique.
- I/O optimizations.

Aggregate Ranking Cube (AR-cube)

Intuition
Query: find the top-1 populated city in the US.
- Naive approach: find the populations of all US cities, sort them in descending order of population, and return the first city in the list.
- The evaluation can be better if we have some more information to guide our search.
- If the state populations (i.e. higher-level statistics) are known, we can use them as a guide in our search for the top-1 populated city.
- Basic intuition: we should search the cities in the most populated states first.

Intuition
By checking the cities in the top states, we find:
- NYC: 8M
- LA: 4M
- Chicago: 3M

Since a state's population is the maximal possible population of any of its cities, the cities in 39 states (the right-hand columns below, marked PRUNED) can be pruned: no city in these states can have a population over 8M (the current top-1).

State | Population | State (PRUNED) | Population
California | 36M | Virginia | 7M
Texas | 23M | Washington | 6M
New York | 19M | Massachusetts | 6M
Florida | 18M | Indiana | 6M
Illinois | 12M | Arizona | 6M
Pennsylvania | 12M | Tennessee | 6M
Ohio | 11M | Missouri | 5M
Michigan | 10M | Maryland | 5M
Georgia | 9M | Wisconsin | 5M
N. Carolina | 9M | Minnesota | 5M
New Jersey | 8M | 29 more … | <5M

Intuition
This example demonstrates that:
- Some high-level statistics can guide our search for top-k results.
- The current top-k values can help us eliminate a lot of candidates.
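The pruning argument can be checked with a small sketch. The state populations are the ones listed on the slide (in millions); the 29 unlisted states are represented only by their count, and the name `prunable` is hypothetical.

```python
# State populations in millions, as listed on the slide (21 of the 50 states).
STATES = {
    "California": 36, "Texas": 23, "New York": 19, "Florida": 18,
    "Illinois": 12, "Pennsylvania": 12, "Ohio": 11, "Michigan": 10,
    "Georgia": 9, "N. Carolina": 9, "New Jersey": 8,
    "Virginia": 7, "Washington": 6, "Massachusetts": 6, "Indiana": 6,
    "Arizona": 6, "Tennessee": 6, "Missouri": 5, "Maryland": 5,
    "Wisconsin": 5, "Minnesota": 5,
}
UNLISTED_STATES = 29  # all below 5M according to the slide

CURRENT_TOP1 = 8  # NYC, 8M: the best city found so far


def prunable(states, threshold):
    """States whose total population (an upper bound for each of their
    cities) is below the current top-1 value."""
    return [name for name, pop in states.items() if pop < threshold]


pruned = prunable(STATES, CURRENT_TOP1)
print(len(pruned) + UNLISTED_STATES)  # 39
```

The state population plays the role that the guiding cuboids play later: an upper bound that is cheap to look up and that lets whole groups of candidates be skipped.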

Aggregate-Ranking Cube (AR-Cube)
AR-cube consists of:
- Guiding cuboids
  - Store high-level statistics to guide the search for promising candidate cells.
- Supporting cuboids
  - Verify the true aggregate values of the cuboid cells.
  - Contain inverted indices to support efficient online aggregation.

Unified structure to support various aggregate measures:
- Monotonic: SUM, COUNT, MAX, etc.
- Non-monotonic: AVG, STDDEV, RANGE, etc.

Guiding cuboid
Guiding cuboids store high-level statistics to guide the search for promising candidate cells. To define a guiding cuboid, we need to specify:
- Group by
  - Defined on dimension attributes.
  - Determines the grouping of tuples.
- Guiding measure
  - Defined on the measure attribute.
  - Applied to the tuples in each group.

Database (S):

Tid | A | B | C | Score
t1 | a1 | b1 | c3 | 63
t2 | a1 | b2 | c1 | 10
t3 | a1 | b2 | c3 | 50
t4 | a2 | b1 | c3 | 16
t5 | a2 | b2 | c1 | 52
t6 | a3 | b1 | c1 | 35
t7 | a3 | b1 | c2 | 40
t8 | a3 | b2 | c1 | 45

Guiding cuboid C_gd(A, SUM) (group-by dimension attribute A, guiding measure SUM):

A | SUM
a1 | 123
a2 | 68
a3 | 120

Since the group by is on attribute A and there are 3 distinct values in attribute A, there are 3 cells in total. The guiding measure is applied to the tuples in each group: t1, t2 and t3 are in cell a1, so their scores are summed (63+10+50=123) and stored in this cell.

Guiding cuboid
Further guiding cuboids over database (S), defined in the same way by a group by and a guiding measure:

C_gd(B, SUM):

B | SUM
b1 | 154
b2 | 157

C_gd(AB, SUM):

A, B | SUM
a1, b1 | 63
a1, b2 | 60
a2, b1 | 16
a2, b2 | 52
a3, b1 | 75
a3, b2 | 45
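A guiding cuboid with the SUM guiding measure is simply a group-by aggregation over the base table. The sketch below rebuilds the three cuboids shown above from database (S); the names `guiding_cuboid` and `COLUMN` are hypothetical and only SUM is handled.

```python
from collections import defaultdict

# Database (S) from the running example: (tid, A, B, C, score).
S = [
    ("t1", "a1", "b1", "c3", 63),
    ("t2", "a1", "b2", "c1", 10),
    ("t3", "a1", "b2", "c3", 50),
    ("t4", "a2", "b1", "c3", 16),
    ("t5", "a2", "b2", "c1", 52),
    ("t6", "a3", "b1", "c1", 35),
    ("t7", "a3", "b1", "c2", 40),
    ("t8", "a3", "b2", "c1", 45),
]
COLUMN = {"A": 1, "B": 2, "C": 3}  # position of each dimension in a row


def guiding_cuboid(rows, group_by):
    """Return {cell: SUM(score)} for the given group-by dimensions."""
    cuboid = defaultdict(int)
    for row in rows:
        cell = tuple(row[COLUMN[dim]] for dim in group_by)
        cuboid[cell] += row[4]
    return dict(cuboid)


print(guiding_cuboid(S, ["A"]))  # {('a1',): 123, ('a2',): 68, ('a3',): 120}
```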

Supporting cuboid
Supporting cuboids help verify the true aggregate values of the cuboid cells. To define a supporting cuboid, we need to specify:
- Group by
  - Defined on a dimension attribute.
  - Determines the grouping of tuples.
- The cuboid stores the raw values of the measure attribute: each cuboid cell g contains the inverted index of g.

C_sp(A) over database (S) (group-by dimension attribute A):

A | Inverted index
a1 | (t1, 63), (t2, 10), (t3, 50)
a2 | (t4, 16), (t5, 52)
a3 | (t6, 35), (t7, 40), (t8, 45)

C_sp(B):

B | Inverted index
b1 | (t1, 63), (t4, 16), (t6, 35), (t7, 40)
b2 | (t2, 10), (t3, 50), (t5, 52), (t8, 45)
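A supporting cuboid can be sketched as a dictionary from each dimension value to its inverted list of (tid, score) pairs. The names `supporting_cuboid` and `COLUMN` are hypothetical, and the on-disk layout of the lists is not modelled.

```python
from collections import defaultdict

# Database (S) from the running example: (tid, A, B, C, score).
S = [
    ("t1", "a1", "b1", "c3", 63),
    ("t2", "a1", "b2", "c1", 10),
    ("t3", "a1", "b2", "c3", 50),
    ("t4", "a2", "b1", "c3", 16),
    ("t5", "a2", "b2", "c1", 52),
    ("t6", "a3", "b1", "c1", 35),
    ("t7", "a3", "b1", "c2", 40),
    ("t8", "a3", "b2", "c1", 45),
]
COLUMN = {"A": 1, "B": 2, "C": 3}  # position of each dimension in a row


def supporting_cuboid(rows, dim):
    """Return {value of dim: [(tid, score), ...]} (one inverted list per cell)."""
    index = defaultdict(list)
    for row in rows:
        index[row[COLUMN[dim]]].append((row[0], row[4]))
    return dict(index)


print(supporting_cuboid(S, "A")["a1"])  # [('t1', 63), ('t2', 10), ('t3', 50)]
```

Intersecting the tid sets of two lists, for example those of a1 and b2, yields the tuples of the combined cell (a1, b2), which is how a candidate is verified later.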

AR-cube
Given a set of group-by's A_1, ..., A_D and a set of aggregate measures M, an AR-cube C(A_1, …, A_D; M) consists of D guiding cuboids C_gd(A_i, M), 1 <= i <= D, and D supporting cuboids C_sp(A_i).

Example: the AR-cube C(A, B; SUM) consists of
- Guiding cuboids: C_gd(A, SUM) with a1 = 123, a2 = 68, a3 = 120, and C_gd(B, SUM) with b1 = 154, b2 = 157.
- Supporting cuboids: C_sp(A) and C_sp(B), holding the inverted indices of a1, a2, a3 and of b1, b2.

Query Execution Models

Motivating example
Query:
- Top-1 cell
- Group-by (A, B)
- Aggregate measure: SUM

If C_gd(AB, SUM) is materialized, the answer can be returned very quickly. Otherwise, we have to compute the result with the help of the materialized guiding and supporting cuboids.

C_gd(AB, SUM) over database (S):

A, B | SUM
a1, b1 | 63
a1, b2 | 60
a2, b1 | 16
a2, b2 | 52
a3, b1 | 75
a3, b2 | 45

Motivating example
To compute the query with A, B as dimension attributes and SUM as the aggregate measure without C_gd(AB, SUM), we need the guiding cuboids with dimension attributes A and B and the SUM guiding measure:
- C_gd(A, SUM): a1 = 123, a2 = 68, a3 = 120
- C_gd(B, SUM): b1 = 154, b2 = 157

Query execution model
A candidate generation and verification framework:
- Candidate generation: the most promising candidate is generated by considering the high-level statistics (guiding cuboids).
- Verification: the true aggregate measure of the candidate is verified (supporting cuboids).
- Update and pruning: knowing the true aggregate measure of a candidate helps us refine the upper bounds of other candidates.

Flow: Sorted lists initialization → Candidate generation → Verification → Update sorted lists and pruning → back to Candidate generation.

Query execution model
Step 1. Sorted lists initialization
- We initialize one sorted list for each guiding cuboid.
- Each entry of a sorted list gives the largest aggregate a combined candidate cell could achieve (i.e. the aggregate bound of the cuboid cells).
- E.g. a1 = 123 means that the aggregate measures of the unseen cells with a1 as their value in attribute A are upper bounded by 123.

Sorted list A (bound): a1 = 123, a3 = 120, a2 = 68
Sorted list B (bound): b2 = 157, b1 = 154

Query execution model
Guiding cells: since the aggregate bounds stored in the cells of the sorted lists guide our search for the top-k candidates, we call these cells guiding cells.

Query execution model
Step 2. Candidate generation
- Intuition: generate the cell that is likely to have a large aggregate value.
- Generate the cell according to the top entries of the sorted lists. In this case, (a1, b2) is generated as the next promising candidate cell.

Sorted list A (bound): a1 = 123, a3 = 120, a2 = 68
Sorted list B (bound): b2 = 157, b1 = 154

Query execution model
Step 3. Verify the true aggregate value of the candidate cell
- Consult the supporting cuboids and fetch the corresponding inverted indices: a1 = (t1, 63), (t2, 10), (t3, 50) from C_sp(A) and b2 = (t2, 10), (t3, 50), (t5, 52), (t8, 45) from C_sp(B).
- Perform list intersection: t2 and t3 appear in both lists, so the true aggregate value of the cell (a1, b2) is 10+50 = 60.
- (a1, b2) = 60 is the current top-1 aggregate measure.

Query execution model
Step 4. Update the sorted lists
- Since we now know that (a1, b2) = 60, we can refine the aggregate bound of the unseen cells with a1 as their value in attribute A, i.e. (a1, *), to 123-60 = 63.
- Similarly, we update the aggregate bound of b2 in sorted list B to 157-60 = 97.
- After the update, the order of the entries in the sorted lists is also updated.

Sorted list A (bound): a3 = 120, a2 = 68, a1 = 63
Sorted list B (bound): b1 = 154, b2 = 97

Query execution model
With the upper bounds refined, we may use the current top-k threshold to prune some candidates.
- Pruning condition: if the bound of a cell in a sorted list is smaller than the threshold, then none of its combined candidates can have an aggregate measure larger than the threshold, and they can be pruned.
- If an entry in a sorted list is pruned, all the subsequent entries in the list can be pruned as well.

Sorted list A (bound): a3 = 120, a2 = 68, a1 = 63
Sorted list B (bound): b1 = 154, b2 = 97

In this case the threshold is 60; none of the cells has a bound below 60, so no cells can be pruned.

Query execution model
Candidate generation (second round): generate the next promising candidate cell according to the first entries of the sorted lists, i.e. (a3, b1).

Sorted list A (bound): a3 = 120, a2 = 68, a1 = 63
Sorted list B (bound): b1 = 154, b2 = 97

Query execution model
Verification (second round): retrieve the corresponding inverted indices from the supporting cuboids and verify the true aggregate value of the candidate cell. The lists of a3 = (t6, 35), (t7, 40), (t8, 45) and b1 = (t1, 63), (t4, 16), (t6, 35), (t7, 40) share t6 and t7, so (a3, b1) = 35+40 = 75, which replaces 60 as the current top-1 value.

Query execution model
Update (second round): refine the aggregate bounds.
- a3: 120-75 = 45
- b1: 154-75 = 79

Sorted list A (bound): a2 = 68, a1 = 63, a3 = 45
Sorted list B (bound): b2 = 97, b1 = 79

Query execution model
Pruning (second round): all aggregate bounds in sorted list A (68, 63, 45) are smaller than the current top-k threshold 75, so none of the candidates generated from them can have an aggregate measure larger than 75. They can therefore be pruned.

Query execution model
Since list A is now empty, no more candidates can be generated, and we can conclude that (a3, b1) = 75 is the top-1 aggregate. The algorithm terminates.
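The walkthrough above can be condensed into a minimal, in-memory sketch of the candidate generation and verification loop for two dimensions and the SUM measure. It is a simplification under stated assumptions, not the paper's implementation: scores are assumed non-negative (so that subtraction gives a valid bound), candidates are taken as the first unverified pair in sorted-list order, and the names `top_k_sum`, `bound_a`, `inv_a`, etc. are hypothetical.

```python
import heapq
from collections import defaultdict

# Database (S), restricted to the query dimensions: (tid, A, B, score).
S = [
    ("t1", "a1", "b1", 63), ("t2", "a1", "b2", 10), ("t3", "a1", "b2", 50),
    ("t4", "a2", "b1", 16), ("t5", "a2", "b2", 52), ("t6", "a3", "b1", 35),
    ("t7", "a3", "b1", 40), ("t8", "a3", "b2", 45),
]


def top_k_sum(rows, k=1):
    """Return (top-k cells of group-by (A, B) by SUM, number of verified cells)."""
    bound_a, bound_b = defaultdict(int), defaultdict(int)  # guiding cuboids
    inv_a, inv_b = defaultdict(dict), defaultdict(dict)    # supporting cuboids
    for tid, a, b, score in rows:
        bound_a[a] += score
        bound_b[b] += score
        inv_a[a][tid] = score
        inv_b[b][tid] = score
    bound_a, bound_b = dict(bound_a), dict(bound_b)
    verified = {}  # (a, b) -> true SUM, or None for an empty cell

    while True:
        # Current top-k threshold: k-th best verified value, if there is one.
        values = [v for v in verified.values() if v is not None]
        threshold = heapq.nlargest(k, values)[-1] if len(values) >= k else float("-inf")
        # Pruning: drop guiding cells whose bound is below the threshold.
        bound_a = {a: v for a, v in bound_a.items() if v >= threshold}
        bound_b = {b: v for b, v in bound_b.items() if v >= threshold}
        # Candidate generation: first unverified pair in sorted-list order.
        list_a = sorted(bound_a, key=bound_a.get, reverse=True)
        list_b = sorted(bound_b, key=bound_b.get, reverse=True)
        candidate = next(
            ((a, b) for a in list_a for b in list_b if (a, b) not in verified), None
        )
        if candidate is None:
            break
        a, b = candidate
        # Verification: intersect the two inverted lists.
        tids = inv_a[a].keys() & inv_b[b].keys()
        total = sum(inv_a[a][t] for t in tids)
        verified[candidate] = total if tids else None
        # Update: the remaining cells of a and b can hold at most the rest.
        bound_a[a] -= total
        bound_b[b] -= total

    ranked = [(cell, v) for cell, v in verified.items() if v is not None]
    return heapq.nlargest(k, ranked, key=lambda cv: cv[1]), len(verified)


print(top_k_sum(S, k=1))  # ([(('a3', 'b1'), 75)], 2)
```

On the running example only two of the six cells are verified before list A is emptied, which matches the trace on the slides. The order in which candidates are picked affects only how early pruning kicks in; the result stays exact because the loop stops only when no unverified, unpruned pair remains.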

Optimization

I/O optimization
Motivation
- Verifying candidates one by one is inefficient.
- For example, consider two consecutive candidate cells (a1, b1, c1, d1) and (a1, b1, c2, d1).
  - If the two cells are evaluated individually, 8 random accesses have to be performed to access the disk-resident inverted lists (4 per cell).
  - The inverted indices of a1, b1 and d1 are accessed repeatedly in the evaluation of the two cells.

Idea
- Temporarily store the fetched inverted lists in an in-memory buffer so that repeated accesses to a list do not require extra random disk accesses.
- To enhance list reuse, the authors propose:
  - Chunking (bulk processing) for intra-chunk list reuse.
  - Chunk scheduling to facilitate inter-chunk list reuse.
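The effect of the buffer can be illustrated by counting list fetches for the two cells in the example. The sketch assumes a buffer large enough to hold every fetched list (no eviction); `count_random_accesses` is a hypothetical helper.

```python
def count_random_accesses(cells, use_buffer):
    """Count inverted-list fetches needed to verify the cells in order."""
    buffer = set()  # identifiers of lists currently held in memory
    accesses = 0
    for cell in cells:
        for list_id in cell:
            if use_buffer and list_id in buffer:
                continue  # list already in memory, no disk access
            accesses += 1
            if use_buffer:
                buffer.add(list_id)
    return accesses


cells = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c2", "d1")]
print(count_random_accesses(cells, use_buffer=False))  # 8
print(count_random_accesses(cells, use_buffer=True))   # 5
```

With the buffer, only the list of c2 has to be fetched for the second cell, since a1, b1 and d1 are already in memory.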

Chunking
Basic idea: instead of verifying the candidates one by one, a chunk of candidate cells is processed at a time.

How to define the size of a chunk?
- It is defined according to the buffer size B.
- Adopt an equi-depth partition method and partition each sorted list into sublists, such that
- the total size of the inverted indices corresponding to each sublist does not exceed B/N, where N is the number of sorted lists.

Sorted list A (bound): a6 = 150, a12 = 115, a8 = 90, a1 = 84, a10 = 79, …, a2 = 18, a7 = 18, a3 = 13
Sorted list B (bound): b7 = 120, b3 = 95, b4 = 85, b6 = 82, …, b5 = 26, b9 = 22

The candidate space is the grid spanned by the entries of sorted list A and sorted list B.

Chunking
Chunking facilitates intra-chunk buffer reuse: to compute the 4 cells in the chunk formed by the sublists {a6, a12} and {b7, b3}, we need to fetch the inverted indices of a6, a12, b7 and b3 into the buffer, which requires only 4 random accesses (compared to 8 random accesses without chunking).

Chunking
To facilitate chunk pruning, we need an aggregate bound associated with each chunk that represents the upper bound of its cell aggregates.
- First, for each of the N sublists that form the chunk, obtain the maximum of the aggregate bounds in that sublist.
- Then the aggregate bound of the chunk is the minimum of those maximum values.

For example, the aggregate bound of the chunk formed by the sublists {a6, a12} and {b7, b3} (the red chunk in the figure) is
min{ max{a6, a12}, max{b7, b3} } = min{150, 120} = 120.
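The chunk bound is a one-line min-of-max computation. The sketch below reproduces the example with the bounds taken from the sorted lists on the slide; the name `chunk_bound` is hypothetical.

```python
def chunk_bound(sublists):
    """Aggregate bound of a chunk: the minimum, over its sublists,
    of the largest guiding-cell bound in each sublist."""
    return min(max(bounds.values()) for bounds in sublists)


# Chunk formed by the first sublist of each sorted list (bounds from the slide).
red_chunk = [
    {"a6": 150, "a12": 115},  # sublist of sorted list A
    {"b7": 120, "b3": 95},    # sublist of sorted list B
]
print(chunk_bound(red_chunk))  # 120
```

A whole chunk can be skipped once this value falls below the current top-k threshold.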

Inter-chunk list reuse
To facilitate inter-chunk list reuse, we have to consider the order in which chunks are visited. To maximize inter-chunk list reuse, we visit the chunks in axis order (buffer-guided scheduling).
- Advantage: inter-chunk list reuse is maximized.
- Disadvantage: the axis order does not prioritize the chunk with the largest aggregate bound (i.e. it does not visit the most promising cells first).

Example: the inverted indices of b3 and b7 are reused in the neighbouring chunk (the blue chunk in the figure).

Chunk Scheduling Methods
- Method 1 (top-k-guided scheduling): prioritize the chunk with the largest aggregate bound and generate promising candidates. Goal #1: aim for cell pruning.
- Method 2 (buffer-guided scheduling): traverse the space in axis order. Goal #2: maximize list reuse.
- Method 3 (hybrid scheduling): exploit the fact that contiguous chunks often share the same aggregate bound (Goals #1 and #2).

Hybrid scheduling
Basic idea
- Contiguous chunks often share the same aggregate bound.
- Based on the priority queue of top-k-guided scheduling, further group together the chunks with the same aggregate bound, and use buffer-guided scheduling to schedule the chunks within a group.
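One way to picture the hybrid order is as a two-level sort: by aggregate bound (descending) across groups, and by grid position within a group. The sketch below is only an illustration: the chunk bounds are made-up values, "axis order" is assumed to mean lexicographic order of the chunk's sublist indices, and `hybrid_schedule` is a hypothetical name.

```python
def hybrid_schedule(chunks):
    """chunks: list of (aggregate_bound, (i, j)), where (i, j) are the
    sublist indices of the chunk along sorted lists A and B.
    Returns the chunk positions in visiting order."""
    # Groups with larger bounds come first (top-k-guided); inside a group of
    # equal bounds, chunks are visited in axis order (buffer-guided).
    ordered = sorted(chunks, key=lambda chunk: (-chunk[0], chunk[1]))
    return [position for _bound, position in ordered]


# Hypothetical chunk bounds, for illustration only.
chunks = [
    (90, (1, 0)),
    (120, (0, 0)),
    (90, (1, 1)),
    (85, (0, 1)),
    (90, (0, 2)),
]
print(hybrid_schedule(chunks))  # [(0, 0), (0, 2), (1, 0), (1, 1), (0, 1)]
```

The three chunks with bound 90 are visited next to each other, so lists fetched for one of them have a chance of being reused by the next.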

Supporting various aggregate measures AVG, MAX, MIN, STDEV, VAR, MAD, SUM

General Measures
The AR-cube structure is able to support other aggregate measures.
- E.g. AVG, STDDEV, RANGE, MAD, etc.
- The query execution model is the same for all these aggregate measures.
- The only difference is the initialization and update of the aggregate bounds.

Table (columns only; cell contents not recoverable): Aggregate measure in the query | Guiding measure required | The aggregate bound (value in the sorted lists)

For example, the mean absolute deviation (MAD) of a set of values is always upper bounded by half of the range of the values.

Example
Given a query with MAD as the aggregate measure, we can use the guiding cuboids with MIN and MAX guiding measures to guide the computation.

C_gd(A, SUM, COUNT, MAX, MIN):

A | SUM | COUNT | MAX | MIN
a1 | 123 | 3 | 63 | 10
a2 | 68 | 2 | 52 | 16
a3 | 120 | 3 | 45 | 35

Sorted list A (aggregate bound of MAD):

A | Aggregate bound of MAD
a1 | (63-10)/2 = 26.5
a2 | (52-16)/2 = 18
a3 | (45-35)/2 = 5

- Only the guiding cuboids with guiding measures MAX and MIN are needed to support efficient processing.
- The aggregate bounds are computed as (MAX-MIN)/2. E.g. a1 = 26.5 means that any unseen cell with a1 as its value of attribute A has an aggregate measure (MAD) no greater than 26.5.
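The MAD bound can be checked numerically against database (S) of the running example. Recomputing MAX and MIN from the scores gives (52, 16) for a2 and (45, 35) for a3, hence bounds of 18 and 5. The helper names `mad` and `mad_bound` are hypothetical.

```python
# Scores of database (S), grouped by attribute A (from the running example).
SCORES_BY_A = {
    "a1": [63, 10, 50],
    "a2": [16, 52],
    "a3": [35, 40, 45],
}


def mad(values):
    """Mean absolute deviation around the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def mad_bound(values):
    """Upper bound on the MAD of any subset: half of the range (MAX - MIN)."""
    return (max(values) - min(values)) / 2


sorted_list_a = sorted(
    ((a, mad_bound(scores)) for a, scores in SCORES_BY_A.items()),
    key=lambda entry: entry[1],
    reverse=True,
)
print(sorted_list_a)  # [('a1', 26.5), ('a2', 18.0), ('a3', 5.0)]
```

The bound also holds for every sub-cell such as (a1, b2), because the values of a sub-cell are a subset of the guiding cell's values and therefore have a range that is no larger.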

General Query Execution
The query execution framework is the same except for:
- Aggregate-bound computation.
- Updating:
  - SUM, COUNT: subtraction.
  - MAX, MIN: using the inverted index.
  - The bounds are guaranteed to be monotonically decreasing.

Example (aggregate bound of MIN), with the inverted indices:

C_sp(A):
A | Inverted index
a1 | (t1, 5), (t2, 10), (t3, 50)
a2 | (t4, 16), (t5, 52)
a3 | (t6, 35), (t7, 40), (t8, 45)

C_sp(B):
B | Inverted index
b1 | (t1, 63), (t4, 16), (t6, 35), (t7, 40)
b2 | (t2, 10), (t3, 50), (t5, 41), (t8, 45)

Sorted list A (aggregate bound of MIN): a1 = 50 => 5, a2 = 52, a3 = 45
Sorted list B (aggregate bound of MIN): b1 = 63, b2 = 50 => 45

Suppose we have computed (a1, b2) and have to update the aggregate bound of MIN. Because t2 and t3 are already known to be members of the cell (a1, b2), the aggregate bounds of a1 and b2 can be refined to exclude the measure values of t2 and t3.
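The MIN update can be sketched as follows. The reading assumed here is that the MIN of any unseen cell is at most the largest score among the tuples of the guiding cell that have not yet been assigned to a verified cell; the inverted lists are copied from the slide, and `min_query_bound` is a hypothetical name.

```python
# Inverted lists as shown on the slide for this example.
C_SP_A = {
    "a1": [("t1", 5), ("t2", 10), ("t3", 50)],
    "a2": [("t4", 16), ("t5", 52)],
    "a3": [("t6", 35), ("t7", 40), ("t8", 45)],
}
C_SP_B = {
    "b1": [("t1", 63), ("t4", 16), ("t6", 35), ("t7", 40)],
    "b2": [("t2", 10), ("t3", 50), ("t5", 41), ("t8", 45)],
}


def min_query_bound(inverted_list, verified_tids=frozenset()):
    """Bound on the MIN of any unseen cell: the largest score among tuples
    not yet known to belong to a verified cell (None if none remain)."""
    remaining = [score for tid, score in inverted_list if tid not in verified_tids]
    return max(remaining) if remaining else None


verified = {"t2", "t3"}  # members of the verified cell (a1, b2)
print(min_query_bound(C_SP_A["a1"]), "=>", min_query_bound(C_SP_A["a1"], verified))  # 50 => 5
print(min_query_bound(C_SP_B["b2"]), "=>", min_query_bound(C_SP_B["b2"], verified))  # 50 => 45
```

Because tuples are only ever removed from the remaining set, the bound can only decrease, which is the monotonicity the slide refers to.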

Experiments

Experimental setup
Compare four different query execution algorithms:
- Tablescan: sequentially scans the data file and computes the top-k.
- The chunk-based query execution approaches HYBRID, BUFFER and TOPK, which use the hybrid, buffer-guided and top-k-guided scheduling methods respectively.

Implementation
- Platform: Pentium CPU, 3 GHz, with 1 GB RAM.
- OS: Windows XP.
- Coding: Java.

Synthetic data and queries.

Vary k
- When k = 1000, tablescan is faster than all the chunk-based algorithms, since the pruning power of the top-k threshold is no longer large.
- For k = 1, 10, 100, the chunk-based algorithms consistently outperform tablescan in terms of both disk accesses and execution time.
- Among the three chunk-based algorithms, HYBRID consumes fewer I/Os and is faster than the other two.

Vary k
BUFFER generally needs more disk accesses than TOPK, since its traversal path does not give any particular preference to promising candidates. It may visit chunks that could have been pruned by TOPK and HYBRID.

Performance w.r.t. query measures
- HYBRID is better than the other methods. Its number of I/Os is 1/15 of tablescan's for AVG, 1/27 for MAX, and 1/9 for VAR.
- These values hinge upon pruning effectiveness, in other words on the tightness of the aggregate bounds: a tighter bound gives a larger reduction.
- Tightness: MAX > AVG > VAR.

Performance w.r.t. query measures
MAX has the tightest aggregate bound because:
- A candidate's MAX value can be directly computed from the MAX values of its guiding cells.
- This favors the TOPK and HYBRID algorithms, which schedule the most promising cells first.

Example over database (S):
- C_gd(A, MAX): a1 = 63, a2 = 52, a3 = 45
- C_gd(B, MAX): b1 = 63, b2 = 52
- Sorted list A (bound): a1 = 63, a2 = 52, a3 = 45
- Sorted list B (bound): b1 = 63, b2 = 52

We do not need to access the supporting cuboids: the top-1 cell with the MAX aggregate measure must be (a1, b1).

Performance w.r.t. N
N denotes the number of guiding cuboids used to answer a query.
(Figure panels: N=3, N=4)

Performance w.r.t. N
(Figure panels: N=3, N=4)
- For k < 1,000, if the top-k threshold is reasonably large, many guiding cells (sublists) can be pruned, and thus the total number of chunks to be verified is not very sensitive to N.
- On the contrary, when the top-k threshold is small (because k is large), the total number of chunks to be verified grows exponentially.

Varying data characteristics
- When alpha is large, different cells are likely to have skewed aggregate scores. Conversely, when alpha approaches 0, different cells are likely to have more uniform aggregate scores.
- HYBRID favors more skewed score distributions: when the distribution is skewed, it becomes easier for the top-k threshold to prune more guiding cells at the tail of the sorted lists.

Varying data characteristics

TPC-H Benchmark
Use the dbgen module to generate a database and then extract the largest relation, lineitem.tbl.
- 2M tuples
- 15 attributes
  - One measure attribute, "extendedprice"
  - 14 dimension attributes
    - 6 attributes have cardinality below 10
    - 2 attributes have cardinality between 2400 and 2600
    - The rest have cardinality above 10000

Experiments on TPC-H Benchmark

Conclusion
- Proposed a novel cube structure, ARCube, for supporting efficient ranking aggregate query processing.
  - Guiding cuboids
  - Supporting cuboids
- A query execution framework has been developed based on the ARCube.
- I/O optimization techniques are presented.
- The efficiency of the proposed techniques is verified.

Experiments Synthetic (k = 1, 10, 100, 1000) SUM queries

Experiments
- Insensitive to the original dimensionality.
- Sensitive to the number of guiding cuboids.
  - Candidate space explosion.

Pruning power (TPC-H)
High to low:
- MAX: tight
- AVG: linear aggregate-bound
- VAR: quadratic aggregate-bound
- SUM: worse due to the containment relationship of parent-children cells

The AR-cube structure
Introduce the motivating example:
- The database
  - Dimension attributes
  - Measure attributes
- The AR-cube structure
  - Guiding cuboid
  - Supporting cuboid

Running Example Illustrate the query execution framework Explain the chunk based execution

Illustration of the Cube Structure
- The guiding and supporting cuboids form a lattice.
- Partial materialization approach.

Cube lattice (figure labels):
- Low-level: base table
- High-level
- Middle-level: combinatorial explosion
- Ranking queries can be guided by high-level cuboids
- Not materialized

Comparison of Techniques
Partial materialization approach. Approaches compared by online and offline cost:
- Full cube: online very fast; offline curse of dimensionality.
- No pre-computation: online aggregation is costly.
- Index on group-by's (rankagg).
- AR-Cube: online efficient aggregation and pruning; offline much smaller than a full cube.

Applications
Data warehousing and OLAP
- Dimensionality: each group-by produces a lot of cells, which are hard for users to digest.
- Top-ranked answers are interesting to data analysts.
- Examples:
  - Finding the locations with the top sales.
  - Returning the population groups with the largest standard deviation of income.
- Efficiency is particularly important in an OLAP environment.

Explorative data analysis
- Generate interesting results for users.
- Short response time.
