Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department.

Presentation transcript:

Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department of Computer Science North Dakota State University

Outline
- Introduction
  - Review of Aggregate Functions
  - Review of Iceberg Queries
- Algorithms of Aggregate Function Computation Using P-trees
  - SUM, COUNT, and AVERAGE
  - MAX, MIN, MEDIAN, RANK, and TOP-K
- Iceberg Query Operation Using P-trees
  - An Iceberg Query Example
- Performance Analysis
- Conclusion

Introduction
- The commonly used aggregate functions include COUNT, SUM, AVERAGE, MIN, MAX, MEDIAN, RANK, and TOP-K.
- There are three types of aggregate functions: distributive, algebraic, and holistic. In the definitions, T is a set of tuples and {S_i | i = 1...n} is a partition of T, that is, the union of all S_i is T and S_i ∩ S_j = {} for i ≠ j.
Distributive
- An aggregate function F is distributive if there is a function G such that F (T) = G ({F (S_i) | i = 1...n}). SUM, MIN, and MAX are distributive with G = F. COUNT is distributive with G = SUM.

Review of Aggregate Functions (Cont.)
Algebraic
- An aggregate function F is algebraic if there is an M-tuple valued function G and a function H such that F (T) = H ({G (S_i) | i = 1...n}). Average, Standard Deviation, MaxN, MinN, and Center_of_Mass are all algebraic.
Holistic
- An aggregate function F is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. Median, MostFrequent (also called the Mode), and Rank are common examples of holistic functions.
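The three classes can be checked on a small partition. The sketch below is only an illustration: the three-way split of the values is a hypothetical partition chosen for this example, and the values themselves are the # Product column used later in the talk.

```python
from statistics import median

# Hypothetical partition S_1, S_2, S_3 of T (the "# Product" values).
partitions = [[10, 5, 6], [7, 11], [9, 3]]
whole = [v for part in partitions for v in part]

# Distributive: SUM, MIN and MAX combine with G = F; COUNT combines with G = SUM.
sum_total = sum(sum(p) for p in partitions)
max_total = max(max(p) for p in partitions)
min_total = min(min(p) for p in partitions)
count_total = sum(len(p) for p in partitions)

# Algebraic: AVERAGE needs a fixed-size sub-aggregate G = (sum, count),
# which H then combines into sum / count.
sub_aggregates = [(sum(p), len(p)) for p in partitions]
average = sum(s for s, _ in sub_aggregates) / sum(c for _, c in sub_aggregates)

# Holistic: no fixed-size summary is enough for MEDIAN in general;
# here the median of the partition medians differs from the true median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(whole)

print(sum_total, max_total, min_total, count_total, average)
print(median_of_medians, true_median)
```

The last two values differ, which is the practical meaning of "holistic": the sub-aggregates would have to carry the data itself.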

Review of Iceberg Queries
- Iceberg queries compute an aggregate function over one or more grouping attributes and then eliminate the groups whose aggregate value is below a specified threshold.
- We use an example to review iceberg queries:
  SELECT Location, Product Type, Sum (# Product)
  FROM Relation Sales
  GROUP BY Location, Product Type
  HAVING Sum (# Product) >= T

Review of Iceberg Queries (Cont.)
- We illustrate the calculation in three steps.
- Step one: Generate the Location-list.
  SELECT Location, Sum (# Product)
  FROM Relation Sales
  GROUP BY Location
  HAVING Sum (# Product) >= T
- Step two: Generate the Product Type-list.
  SELECT Product Type, Sum (# Product)
  FROM Relation Sales
  GROUP BY Product Type
  HAVING Sum (# Product) >= T

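Review of Iceberg Queries (Cont.)
- Step three: Generate the Location & Product Type pair groups.
- Using the Location-list and the Product Type-list generated in the first two steps, we can eliminate many of the Location & Product Type pair groups according to the threshold T. Because # Product is never negative, a pair can reach T only if both its Location and its Product Type are in the lists.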
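The three steps can be sketched without P-trees first. This is a minimal illustration, not the thesis implementation: the rows are the Loc, Type, and # Product columns of relation Sales, T = 15 is the threshold used in the later example, and the helper name heavy is made up for this sketch.

```python
from collections import defaultdict

# (Loc, Type, # Product) rows of relation Sales.
sales = [
    ("New York", "Notebook", 10), ("Minneapolis", "Desktop", 5),
    ("New York", "Printer", 6), ("New York", "Notebook", 7),
    ("Minneapolis", "Notebook", 11), ("Chicago", "Desktop", 9),
    ("Minneapolis", "Fax", 3),
]
T = 15  # threshold

def heavy(rows, key):
    """Sum # Product per group and keep only groups with a total >= T."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return {group: total for group, total in totals.items() if total >= T}

loc_list = heavy(sales, lambda r: r[0])    # step one
type_list = heavy(sales, lambda r: r[1])   # step two

# Step three: only rows whose Loc and Type both survived can form a heavy pair.
candidates = [r for r in sales if r[0] in loc_list and r[1] in type_list]
result = heavy(candidates, lambda r: (r[0], r[1]))
print(loc_list, type_list, result)
```

The pruning is safe only because the summed attribute is non-negative; with negative values a light location could still contain a heavy pair.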

Algorithms of Aggregate Function Computation Using P-trees
- We use the data in relation Sales to illustrate the aggregate function algorithms.

  Id  Mon  Loc          Type      On line  # Product
  1   Jan  New York     Notebook  Y        10
  2   Jan  Minneapolis  Desktop   N        5
  3   Feb  New York     Printer   Y        6
  4   Mar  New York     Notebook  Y        7
  5   Mar  Minneapolis  Notebook  Y        11
  6   Mar  Chicago      Desktop   Y        9
  7   Apr  Minneapolis  Fax       N        3

Table 1. Relation Sales.

Algorithms of Aggregate Function Computation Using P-trees (Cont.)
- Table 2 shows the binary representation of the data in relation Sales. Each attribute is split into bit slices, one P-tree per bit position:

  Attribute   P-trees (high-order bit first)
  Mon         P0,3 P0,2 P0,1 P0,0
  Loc         P1,4 P1,3 P1,2 P1,1 P1,0
  Type        P2,2 P2,1 P2,0
  On line     P3,0
  # Product   P4,3 P4,2 P4,1 P4,0

Table 2. Binary Form of Sales.
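To make the bit-slice idea concrete, the sketch below builds the four # Product slices from Table 1. It rests on a simplifying assumption: each P-tree is modelled as a flat bit mask held in a Python int (bit r stands for row r), not as the compressed tree structure of the thesis. The helper names bit_slices and root_count are made up for this illustration.

```python
# "# Product" column of Table 1, Ids 1..7 (row 0 is Id 1).
products = [10, 5, 6, 7, 11, 9, 3]

def bit_slices(values, width):
    """Return {bit position i: mask}; bit r of the mask is bit i of values[r]."""
    return {
        i: sum(((v >> i) & 1) << row for row, v in enumerate(values))
        for i in range(width)
    }

def root_count(mask):
    """Number of 1 bits in a mask (stand-in for the P-tree RootCount)."""
    return bin(mask).count("1")

P4 = bit_slices(products, 4)  # P4[3] plays the role of P4,3, and so on
binary_rows = [format(v, "04b") for v in products]
counts = [root_count(P4[i]) for i in (3, 2, 1, 0)]

print(binary_rows)  # one 4-bit string per tuple
print(counts)       # root counts of P4,3 .. P4,0
```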

Algorithm of Aggregate Function COUNT
- COUNT function: No special function is needed for COUNT, because the P-tree RootCount function already provides the mechanism to implement it. Given a P-tree P_i, RootCount (P_i) returns the number of 1s in P_i.

Algorithm of Aggregate Function SUM
- SUM function: SUM totals a field of numerical values.
Algorithm 4.1. Evaluating sum () with P-trees.
  total = 0.00;
  For i = 0 to n {
    total = total + 2^i * RootCount (P_i);
  }
  Return total
Algorithm Sum Aggregate.

Algorithm of Aggregate Function SUM
- For example, to find the total number of products sold in relation Sales, we weight the root count of each # Product bit slice by its bit position:
  RootCount (P4,3) = 3, RootCount (P4,2) = 3, RootCount (P4,1) = 5, RootCount (P4,0) = 5
  2^3 * 3 + 2^2 * 3 + 2^1 * 5 + 2^0 * 5 = 51

Algorithm of Aggregate Function AVERAGE
- AVERAGE function: AVERAGE returns the mean value of a field. It is calculated from the functions COUNT and SUM: Average () = Sum () / Count ().
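The SUM, COUNT, and AVERAGE slides above can be sketched together. As an assumption of this sketch, a P-tree is a flat bit mask in a Python int rather than a compressed tree, and the names p_sum and p_count are illustrative, not from the thesis.

```python
products = [10, 5, 6, 7, 11, 9, 3]  # "# Product" column of relation Sales
N_ROWS = len(products)

def bit_slices(values, width):
    """Return {bit position i: mask}; bit r of the mask is bit i of values[r]."""
    return {
        i: sum(((v >> i) & 1) << row for row, v in enumerate(values))
        for i in range(width)
    }

def root_count(mask):
    """Number of 1 bits in a mask (stand-in for RootCount)."""
    return bin(mask).count("1")

def p_sum(slices):
    """Algorithm 4.1: add 2^i * RootCount(P_i) over all bit positions."""
    return sum((1 << i) * root_count(mask) for i, mask in slices.items())

def p_count(n_rows):
    """COUNT over all tuples: the root count of an all-ones mask."""
    return root_count((1 << n_rows) - 1)

P4 = bit_slices(products, 4)
total = p_sum(P4)
count = p_count(N_ROWS)
average = total / count
print(total, count, average)
```

No tuple is ever read individually: the sum comes from four population counts, one per bit slice.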

Algorithm of Aggregate Function MAX
- MAX function: MAX returns the largest value in a field.
Algorithm 4.2. Evaluating max () with P-trees.
  max = 0.00;
  c = 0;
  P_c is set to all 1s;
  For i = n to 0 {
    c = RootCount (P_c AND P_i);
    If (c >= 1) {
      P_c = P_c AND P_i;
      max = max + 2^i;
    }
  }
  Return max;
Algorithm Max Aggregate.

Algorithm of Aggregate Function MAX
Steps:
  1. P_c = P4,3; RootCount (P_c) = 3 >= 1, so bit 3 is 1.
  2. RootCount (P_c AND P4,2) = 0 < 1, so bit 2 is 0; P_c = P_c AND P'4,2.
  3. RootCount (P_c AND P4,1) = 2 >= 1, so bit 1 is 1; P_c = P_c AND P4,1.
  4. RootCount (P_c AND P4,0) = 1 >= 1, so bit 0 is 1.
  max = 2^3 * 1 + 2^2 * 0 + 2^1 * 1 + 2^0 * 1 = 11
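A runnable sketch of the MAX procedure follows. It assumes flat bit masks in Python ints in place of real P-trees; p_max and ids are names invented for this illustration.

```python
products = [10, 5, 6, 7, 11, 9, 3]  # "# Product" column of relation Sales
N_ROWS = len(products)

def bit_slices(values, width):
    """Return {bit position i: mask}; bit r of the mask is bit i of values[r]."""
    return {
        i: sum(((v >> i) & 1) << row for row, v in enumerate(values))
        for i in range(width)
    }

def p_max(slices, n_rows):
    """Walk from the high-order bit down, keeping rows that have the bit set
    whenever at least one candidate row has it. Returns (max, candidate mask)."""
    pc = (1 << n_rows) - 1          # P_c starts as all 1s
    result = 0
    for i in sorted(slices, reverse=True):
        hit = pc & slices[i]        # P_c AND P_i
        if hit:                     # RootCount(P_c AND P_i) >= 1
            pc = hit
            result += 1 << i
    return result, pc

def ids(mask):
    """Tuple Ids (1-based) whose bit is set in the mask."""
    return [r + 1 for r in range(N_ROWS) if (mask >> r) & 1]

P4 = bit_slices(products, 4)
max_value, max_rows = p_max(P4, N_ROWS)
print(max_value, ids(max_rows))
```

The final candidate mask also identifies which tuples hold the maximum, which a plain scan would need a second pass to report.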

Algorithm of Aggregate Function MIN
- MIN function: MIN returns the smallest value in a field.
Algorithm 4.3. Evaluating min () with P-trees.
  min = 0.00;
  c = 0;
  P_c is set to all 1s;
  For i = n to 0 {
    c = RootCount (P_c AND NOT (P_i));
    If (c >= 1) {
      P_c = P_c AND NOT (P_i);
    } Else {
      min = min + 2^i;
    }
  }
  Return min;
Algorithm Min Aggregate.

Algorithm of Aggregate Function MIN
Steps:
  1. P_c = P'4,3; RootCount (P_c) = 4 >= 1, so bit 3 is 0.
  2. RootCount (P_c AND P'4,2) = 1 >= 1, so bit 2 is 0; P_c = P_c AND P'4,2.
  3. RootCount (P_c AND P'4,1) = 0 < 1, so bit 1 is 1; P_c = P_c AND P4,1.
  4. RootCount (P_c AND P'4,0) = 0 < 1, so bit 0 is 1.
  min = 2^3 * 0 + 2^2 * 0 + 2^1 * 1 + 2^0 * 1 = 3
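The MIN procedure mirrors MAX with complemented slices. The sketch below again assumes flat bit masks in Python ints instead of compressed P-trees; p_min and ids are illustrative names.

```python
products = [10, 5, 6, 7, 11, 9, 3]  # "# Product" column of relation Sales
N_ROWS = len(products)
ALL_ROWS = (1 << N_ROWS) - 1

def bit_slices(values, width):
    """Return {bit position i: mask}; bit r of the mask is bit i of values[r]."""
    return {
        i: sum(((v >> i) & 1) << row for row, v in enumerate(values))
        for i in range(width)
    }

def p_min(slices):
    """Walk from the high-order bit down, keeping rows with a 0 bit whenever
    at least one candidate row has a 0 there. Returns (min, candidate mask)."""
    pc = ALL_ROWS                            # P_c starts as all 1s
    result = 0
    for i in sorted(slices, reverse=True):
        miss = pc & ~slices[i] & ALL_ROWS    # P_c AND NOT(P_i)
        if miss:                             # RootCount >= 1
            pc = miss
        else:                                # every candidate has bit i set
            result += 1 << i
    return result, pc

def ids(mask):
    """Tuple Ids (1-based) whose bit is set in the mask."""
    return [r + 1 for r in range(N_ROWS) if (mask >> r) & 1]

P4 = bit_slices(products, 4)
min_value, min_rows = p_min(P4)
print(min_value, ids(min_rows))
```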

Algorithms of Aggregate Function MEDIAN and RANK
- MEDIAN function: MEDIAN returns the median value in a field.
- RANK function: Rank (K) returns the Kth largest value in a field.
Algorithm 4.4. Evaluating median () with P-trees.
  median = 0.00;
  pos = N/2, rounded up;   (for Rank (K), pos = K)
  c = 0;
  P_c is set to all 1s for a single attribute;
  For i = n to 0 {
    c = RootCount (P_c AND P_i);
    If (c >= pos) {
      median = median + 2^i;
      P_c = P_c AND P_i;
    } Else {
      pos = pos - c;
      P_c = P_c AND NOT (P_i);
    }
  }
  Return median;
Algorithm Median Aggregate.

Algorithm of Aggregate Function MEDIAN
Steps (N = 7, pos = 4):
  1. P_c = P4,3; RootCount (P_c) = 3 < 4, so bit 3 is 0; P_c = P'4,3; pos = 4 - 3 = 1.
  2. RootCount (P_c AND P4,2) = 3 >= 1, so bit 2 is 1; P_c = P_c AND P4,2.
  3. RootCount (P_c AND P4,1) = 2 >= 1, so bit 1 is 1; P_c = P_c AND P4,1.
  4. RootCount (P_c AND P4,0) = 1 >= 1, so bit 0 is 1.
  median = 2^3 * 0 + 2^2 * 1 + 2^1 * 1 + 2^0 * 1 = 7
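The MEDIAN and RANK descent can be sketched as one function. Assumptions of this sketch: P-trees are flat bit masks in Python ints, the median position is N/2 rounded up (which matches pos = 4 for N = 7 in the steps above), and p_rank is a name invented here.

```python
products = [10, 5, 6, 7, 11, 9, 3]  # "# Product" column of relation Sales
N_ROWS = len(products)
ALL_ROWS = (1 << N_ROWS) - 1

def bit_slices(values, width):
    """Return {bit position i: mask}; bit r of the mask is bit i of values[r]."""
    return {
        i: sum(((v >> i) & 1) << row for row, v in enumerate(values))
        for i in range(width)
    }

def root_count(mask):
    """Number of 1 bits in a mask (stand-in for RootCount)."""
    return bin(mask).count("1")

def p_rank(slices, k):
    """Return the k-th largest value by descending the bit positions."""
    pc, pos, value = ALL_ROWS, k, 0
    for i in sorted(slices, reverse=True):
        hit = pc & slices[i]                  # P_c AND P_i
        c = root_count(hit)
        if c >= pos:                          # the k-th largest has bit i set
            value += 1 << i
            pc = hit
        else:                                 # skip the c larger candidates
            pos -= c
            pc = pc & ~slices[i] & ALL_ROWS   # P_c AND NOT(P_i)
    return value

P4 = bit_slices(products, 4)
median_value = p_rank(P4, (N_ROWS + 1) // 2)  # pos = 4 for N = 7
ranks = [p_rank(P4, k) for k in range(1, N_ROWS + 1)]
print(median_value, ranks)
```

Rank (1) reproduces MAX and Rank (N) reproduces MIN, so the three functions share one descent.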

Algorithm of Aggregate Function TOP-K
- TOP-K function: To get the largest k values in a field, we first find the rank-k value V_k using the function Rank (K).
- Second, we find all the tuples whose values are greater than or equal to V_k, using the ENRING technology of P-trees.
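One way to obtain the tuples with values greater than or equal to V_k is to collect them during the Rank (K) descent itself. This is a swapped-in technique for illustration and does not reproduce the ENRING method named on the slide; it also assumes flat bit masks in Python ints in place of P-trees, and top_k_rows is a hypothetical name.

```python
products = [10, 5, 6, 7, 11, 9, 3]  # "# Product" column of relation Sales
N_ROWS = len(products)
ALL_ROWS = (1 << N_ROWS) - 1

def bit_slices(values, width):
    """Return {bit position i: mask}; bit r of the mask is bit i of values[r]."""
    return {
        i: sum(((v >> i) & 1) << row for row, v in enumerate(values))
        for i in range(width)
    }

def root_count(mask):
    """Number of 1 bits in a mask (stand-in for RootCount)."""
    return bin(mask).count("1")

def top_k_rows(slices, k):
    """Mask of all rows whose value is >= the k-th largest value."""
    pc, pos, greater = ALL_ROWS, k, 0
    for i in sorted(slices, reverse=True):
        hit = pc & slices[i]
        c = root_count(hit)
        if c >= pos:
            pc = hit
        else:
            pos -= c
            greater |= hit                    # these rows are strictly larger than V_k
            pc = pc & ~slices[i] & ALL_ROWS
    return greater | pc                       # pc now holds the rows equal to V_k

def ids(mask):
    """Tuple Ids (1-based) whose bit is set in the mask."""
    return [r + 1 for r in range(N_ROWS) if (mask >> r) & 1]

P4 = bit_slices(products, 4)
top3 = ids(top_k_rows(P4, 3))
print(top3)
```

With ties at V_k the mask can contain more than k rows, which matches the "greater than or equal to V_k" wording of the slide.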

Iceberg Query Operation Using P-trees
- We demonstrate the computation procedure of iceberg querying with the following example:
  SELECT Loc, Type, Sum (# Product)
  FROM Relation Sales
  GROUP BY Loc, Type
  HAVING Sum (# Product) >= 15

Iceberg Query Operation Using P-trees (Step One)
- Step one: We build value P-trees for the three values, {Loc | New York, Minneapolis, Chicago}, of attribute Loc.
Figure 4. Value P-trees of Attribute Loc (P_MN, P_NY, P_CH).

Iceberg Query Operation Using P-trees (Step One)
- Figure 5 illustrates the calculation of the value P-tree P_NY. Because the binary value of New York is 00001, we get formula 1:
  P_NY = P'1,4 AND P'1,3 AND P'1,2 AND P'1,1 AND P1,0   (1)
Figure 5. Procedure of Calculating P_NY.
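Formula 1 can be sketched as a loop over the Loc bit slices: AND the slice where the code has a 1, and its complement where the code has a 0. Only the code 00001 for New York comes from the slide; the codes for Minneapolis and Chicago below are hypothetical, and P-trees are again modelled as flat bit masks in Python ints.

```python
# Loc codes: New York = 00001 is from the slide; the other two are assumed.
LOC_CODES = {"New York": 0b00001, "Minneapolis": 0b00010, "Chicago": 0b00011}
loc_column = ["New York", "Minneapolis", "New York", "New York",
              "Minneapolis", "Chicago", "Minneapolis"]  # Table 1, Ids 1..7
N_ROWS = len(loc_column)
ALL_ROWS = (1 << N_ROWS) - 1

def bit_slices(values, width):
    """Return {bit position i: mask}; bit r of the mask is bit i of values[r]."""
    return {
        i: sum(((v >> i) & 1) << row for row, v in enumerate(values))
        for i in range(width)
    }

def value_ptree(slices, code):
    """AND together P_i where the code has a 1 and P'_i where it has a 0."""
    mask = ALL_ROWS
    for i, s in slices.items():
        mask &= s if (code >> i) & 1 else (~s & ALL_ROWS)
    return mask

def ids(mask):
    """Tuple Ids (1-based) whose bit is set in the mask."""
    return [r + 1 for r in range(N_ROWS) if (mask >> r) & 1]

P1 = bit_slices([LOC_CODES[loc] for loc in loc_column], 5)  # P1,4 .. P1,0
P_NY = value_ptree(P1, LOC_CODES["New York"])
print(ids(P_NY))
```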

Iceberg Query Operation Using P-trees (Step One)
- After getting the value P-trees for each location, we calculate the total number of products sold in each place. We again use the value New York as our example:
  Sum (# Product | New York)
    = 2^3 * RootCount (P4,3 AND P_NY) + 2^2 * RootCount (P4,2 AND P_NY) + 2^1 * RootCount (P4,1 AND P_NY) + 2^0 * RootCount (P4,0 AND P_NY)
    = 8 * 1 + 4 * 2 + 2 * 3 + 1 * 1 = 23   (2)

Iceberg Query Operation Using P-trees (Step One)
- Table 3 shows the total number of products sold in each of the three locations. Because our threshold T is 15, we eliminate the city Chicago.

  Loc Values    Sum (# Product)  Threshold
  New York      23               Y
  Minneapolis   19               Y
  Chicago       9                N

Table 3. Summary Table of Attribute Loc.
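Step one as a whole can be sketched as a masked version of the SUM algorithm. Two shortcuts are assumed here: P-trees are flat bit masks in Python ints, and the value masks are built directly from column equality instead of ANDing the Loc bit slices. The names value_masks and group_sum are illustrative.

```python
loc_column = ["New York", "Minneapolis", "New York", "New York",
              "Minneapolis", "Chicago", "Minneapolis"]  # Table 1, Ids 1..7
products = [10, 5, 6, 7, 11, 9, 3]
T = 15  # threshold

def bit_slices(values, width):
    """Return {bit position i: mask}; bit r of the mask is bit i of values[r]."""
    return {
        i: sum(((v >> i) & 1) << row for row, v in enumerate(values))
        for i in range(width)
    }

def root_count(mask):
    """Number of 1 bits in a mask (stand-in for RootCount)."""
    return bin(mask).count("1")

def value_masks(column):
    """Shortcut for the value P-trees: one row mask per distinct value."""
    masks = {}
    for row, value in enumerate(column):
        masks[value] = masks.get(value, 0) | (1 << row)
    return masks

def group_sum(slices, group_mask):
    """Sum of the attribute over the rows in group_mask, as in formula 2."""
    return sum((1 << i) * root_count(s & group_mask) for i, s in slices.items())

P4 = bit_slices(products, 4)
sums = {loc: group_sum(P4, m) for loc, m in value_masks(loc_column).items()}
kept = sorted(loc for loc, total in sums.items() if total >= T)
print(sums, kept)
```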

Iceberg Query Operation Using P-trees (Step Two)
- Step two: Similarly, we build value P-trees for every value of attribute Type. Attribute Type has four values, {Type | Notebook, Desktop, Printer, Fax}. Figure 6 shows the value P-trees of these four values.
Figure 6. Value P-trees of Attribute Type (P_Notebook, P_Desktop, P_Printer, P_Fax).

Iceberg Query Operation Using P-trees (Step Two)
- Similarly, we get the summary table for the values of attribute Type. With the threshold T equal to 15, only the value P-tree of Notebook is used in the following step.

  Type Values   Sum (# Product)  Threshold
  Notebook      28               Y
  Desktop       14               N
  Fax           3                N
  Printer       6                N

Table 4. Summary Table of Attribute Type.

Iceberg Query Operation Using P-trees (Step Three)
- Step three: We generate candidate Loc and Type pairs only from the locations and product types that pass the threshold T. By performing the AND operation on P_NY and P_Notebook, we obtain the value P-tree P_NY AND Notebook:
  P_NY AND P_Notebook = P_NY AND Notebook
Figure 7. Procedure of Calculating P_NY AND Notebook.

Iceberg Query Operation Using P-trees (Step Three)
- We calculate the total number of notebooks sold in New York by formula 3:
  Sum (# Product | New York AND Notebook)
    = 2^3 * RootCount (P4,3 AND P_NY AND Notebook) + 2^2 * RootCount (P4,2 AND P_NY AND Notebook) + 2^1 * RootCount (P4,1 AND P_NY AND Notebook) + 2^0 * RootCount (P4,0 AND P_NY AND Notebook)
    = 8 * 1 + 4 * 1 + 2 * 2 + 1 * 1 = 17   (3)

Iceberg Query Operation Using P-trees (Step Three)
- By performing the AND operation on P_MN and P_Notebook, we obtain the value P-tree P_MN AND Notebook:
  P_MN AND P_Notebook = P_MN AND Notebook
Figure 8. Procedure of Calculating P_MN AND Notebook.

Iceberg Query Operation Using P-trees (Step Three)
- We calculate the total number of notebooks sold in Minneapolis by formula 4:
  Sum (# Product | Minneapolis AND Notebook)
    = 2^3 * RootCount (P4,3 AND P_MN AND Notebook) + 2^2 * RootCount (P4,2 AND P_MN AND Notebook) + 2^1 * RootCount (P4,1 AND P_MN AND Notebook) + 2^0 * RootCount (P4,0 AND P_MN AND Notebook)
    = 8 * 1 + 4 * 0 + 2 * 1 + 1 * 1 = 11   (4)

Iceberg Query Operation Using P-trees (Step Three)
- Finally, we obtain summary Table 5. With the threshold T = 15, only the group pair "New York AND Notebook" passes. From the value P-tree P_NY AND Notebook, we can see that tuples 1 and 4 are in the result of our iceberg query example.

  Group Pair                 Sum (# Product)  Threshold
  New York AND Notebook      17               Y
  Minneapolis AND Notebook   11               N

Table 5. Summary Table of Our Example.
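The whole example, from the per-attribute pruning to the pair aggregation, can be sketched end to end. As before, this is an illustration under assumptions: flat bit masks in Python ints stand in for P-trees, value masks are built from column equality, and the helper names are made up for this sketch.

```python
# (Loc, Type, # Product) rows of relation Sales, Ids 1..7.
sales = [
    ("New York", "Notebook", 10), ("Minneapolis", "Desktop", 5),
    ("New York", "Printer", 6), ("New York", "Notebook", 7),
    ("Minneapolis", "Notebook", 11), ("Chicago", "Desktop", 9),
    ("Minneapolis", "Fax", 3),
]
T = 15
N_ROWS = len(sales)

def root_count(mask):
    """Number of 1 bits in a mask (stand-in for RootCount)."""
    return bin(mask).count("1")

def value_masks(col):
    """One row mask per distinct value of column col (value P-tree stand-in)."""
    masks = {}
    for row, rec in enumerate(sales):
        masks[rec[col]] = masks.get(rec[col], 0) | (1 << row)
    return masks

def group_sum(slices, group):
    """Masked SUM: 2^i * RootCount(P_i AND group) over all bit positions."""
    return sum((1 << i) * root_count(s & group) for i, s in slices.items())

def ids(mask):
    """Tuple Ids (1-based) whose bit is set in the mask."""
    return [r + 1 for r in range(N_ROWS) if (mask >> r) & 1]

# Bit slices P4,0 .. P4,3 of # Product.
P4 = {i: sum(((q >> i) & 1) << row for row, (_, _, q) in enumerate(sales))
      for i in range(4)}

# Steps one and two: keep only values whose own total reaches T.
heavy_locs = {v: m for v, m in value_masks(0).items() if group_sum(P4, m) >= T}
heavy_types = {v: m for v, m in value_masks(1).items() if group_sum(P4, m) >= T}

# Step three: AND the surviving masks pairwise and aggregate each pair.
pair_sums = {
    (loc, typ): (group_sum(P4, lm & tm), ids(lm & tm))
    for loc, lm in heavy_locs.items()
    for typ, tm in heavy_types.items()
}
result = {pair: v for pair, v in pair_sums.items() if v[0] >= T}
print(pair_sums)
print(result)
```

Only two of the twelve possible Loc and Type pairs are ever aggregated, which is the saving the pruning steps are meant to deliver.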

Performance Analysis
Figure 15. Performance Time Comparison for Iceberg Queries with Multi-attribute Aggregation.

Performance Analysis
- Our experiments are implemented in C++ on a 1 GHz Pentium PC with 1 GB of main memory running Red Hat Linux.
- In Figure 15, we compare the running time of the P-tree method and the bitmap method on a multi-attribute iceberg query. In this case, the P-tree method is substantially faster.

Conclusion
- We believe our study confirms that the P-tree approach is superior to the bitmap approach for aggregation of all types and for multi-attribute iceberg queries.
- It also shows the advantages of the basic P-tree representation of files. First, there is no need for redundant, auxiliary structures. Second, basic P-trees are good at calculating multi-attribute aggregations and numeric values, and they are fair to all attributes.

Thank you !