Lecturer : Dr. Pavle Mogin

Slides:



Advertisements
Similar presentations
Brought to you by Tutorial Support Services The Math Center.
Advertisements

Unit 6B Measures of Variation.
5.1Database System Concepts - 6 th Edition Chapter 5: Advanced SQL Advanced Aggregation Features OLAP.
Set operators (UNION, UNION ALL, MINUS, INTERSECT) [SQL]
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Pivoting and SQL:1999.
©Silberschatz, Korth and Sudarshan22.1Database System Concepts 4 th Edition 1 Extended Aggregation SQL-92 aggregation quite limited  Many useful aggregates.
Mode Mean Range Median WHAT DO THEY ALL MEAN?.
Computer Science 101 Web Access to Databases SQL – Extended Form.
Mode Mean Range Median WHAT DO THEY ALL MEAN?.
Descriptive Statistics Used to describe the basic features of the data in any quantitative study. Both graphical displays and descriptive summary statistics.
1 CUBE: A Relational Aggregate Operator Generalizing Group By By Ata İsmet Özçelik.
Measures of Central Tendency & Spread
Chapter 3 Averages and Variations
ADVANCE T-SQL: WINDOW FUNCTIONS Rahman Wehelie 7/16/2013 ITC 226.
Tuesday August 27, 2013 Distributions: Measures of Central Tendency & Variability.
Psychology’s Statistics. Statistics Are a means to make data more meaningful Provide a method of organizing information so that it can be understood.
6, 5, 3, 6, 8 01/10/11 Mean, Median, Mode Warm Up
Hotness Activity. Descriptives! Yay! Inferentials Basic info about sample “Simple” statistics.
SQL Aggregation Oracle and ANSI Standard SQL Lecture 9.
Statistics 1: Introduction to Probability and Statistics Section 3-2.
Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department.
Module 4: Grouping and Summarizing Data. Overview Listing the TOP n Values Using Aggregate Functions GROUP BY Fundamentals Generating Aggregate Values.
The number which appears most often in a set of numbers. Example: in {6, 3, 9, 6, 6, 5, 9, 3} the Mode is 6 (it occurs most often). Mode : The middle number.
CHAPTER 2: Basic Summary Statistics
©Silberschatz, Korth and Sudarshan5.1Database System Concepts - 6 th Edition Recursive Queries.
CCGPS Coordinate Algebra Unit 4: Describing Data.
Copyright © 2009 Pearson Education, Inc. 4.1 What Is Average? LEARNING GOAL Understand the difference between a mean, median, and mode and how each is.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation OLAP Queries and SQL:1999.
Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department.
An Introduction to Statistics
Plan for Final Lecture What you may expect to be asked in the Exam?
Data Analysis and OLAP Dr. Ms. Pratibha S. Yalagi Topic Title
One-Variable Statistics
Lecturer : Dr. Pavle Mogin
Aggregating Data Using Group Functions
Queries.
Instructor: Craig Duckett Lecture 09: Tuesday, April 25th, 2017
PL/SQL LANGUAGE MULITPLE CHOICE QUESTION SET-1
Mode Mean Range Median WHAT DO THEY ALL MEAN?.
Chapter 5: Advanced SQL Database System concepts,6th Ed.
Chapter 12 Statistics 2012 Pearson Education, Inc.
SQL: Advanced Options, Updates and Views Lecturer: Dr Pavle Mogin
Aggregating Data Using Group Functions
12.2 – Measures of Central Tendency
Data warehouse Design Using Oracle
Oracle8i Analytical SQL Features
The Relational Algebra
Representing Quantitative Data
STA 291 Spring 2008 Lecture 5 Dustin Lueker.
Aggregations Various Aggregation Functions GROUP BY HAVING.
STA 291 Spring 2008 Lecture 5 Dustin Lueker.
Chapter 4 Summary Query.
Aggregating Data Using Group Functions
Unit 6B Measures of Variation Ms. Young.
4.1 What Is Average? LEARNING GOAL
POPULATION VS. SAMPLE Population: a collection of ALL outcomes, responses, measurements or counts that are of interest. Sample: a subset of a population.
Data Analysis with SQL Window Functions
Creating Noninput Items
Statistics 1: Introduction to Probability and Statistics
Numerical Descriptive Measures
The Relational Algebra
M1G Introduction to Database Development
Access: Queries III Participation Project
Projecting output in MySql
CHAPTER 2: Basic Summary Statistics
Joins and other advanced Queries
Chapter 12 Statistics.
Slides based on those originally by : Parminder Jeet Kaur
Microsoft Office Illustrated Fundamentals
Central Tendency & Variability
Presentation transcript:

Lecturer : Dr. Pavle Mogin Aggregate Functions Lecturer : Dr. Pavle Mogin

Plan for Aggregate Functions Topic Motivation to discuss aggregate functions A classification of aggregates Distributive aggregates Algebraic aggregates Holistic aggregates

Motivation OLAP queries predominantly rely on aggregates Computing aggregates is costly since it requires sorting Reusing finer grained aggregates to compute coarser grained aggregates is a nice idea It particularly applies to computing roll-ups But this idea is not always applicable Understanding mechanisms in computing aggregates can help to extend usefulness of this idea

A Two Dimensional Set of Values Consider a two-dimensional set of values V = {Xij | i = 1,…, n; j = 1,…, mi } The set V contains n subsets and each of these contains mi values Suppose we need to compute: m sub-aggregates by applying an aggregate function on each subset {Xi | i = 1,…, n }, and A global aggregate of the same kind by applying an appropriate aggregate functions (possibly different from one applied to compute sub-aggregates) on sub-aggregates

An Example Set of Values Suppose: i is City, j is ProdType, n = 3, m1 = 3, m2 = 2, m3 = 4 For i = Auckland and j = Jeans, Xij = 10 For i = Wellington and j = Socks, Xij = 3 Sale_By_City_ProdType City ProdType Amnt Auckland Jeans 10 Shoes 11 Pajamas 12 Christchurch 5 6 Wellington 7 8 9 Socks 3 It may be supposed that Sale_By_City_ProdType is already produced by CUBE operator

Distributive Aggregate Functions An aggregate function F ( ) is distributive if there is a function G ( ) such that F ({Xij }) = G ({F ({Xj | j = 1,…, mi })i | i = 1,…, n }) The formula above says that function F ( ) is applied on each of n subsets producing aggregates Yi = F ({Xj | j = 1,…, mi )}), and then is G ( ) applied on the n intermediate results {Yi | i = 1,…, n } to produce the aggregate F ({Xij }) of a coarser granularity SUM(), COUNT( ), MIN( ), MAX( ) are distributive aggregate functions SUM(Xij) = SUM(SUM(Xj)i) COUNT(Xij ) = SUM(COUNT(Xj )i ) MAX(Xij ) = MAX(MAX(Xj )i ) For each j = 1,…, m, function F() is applied on all i values, and then is G() applied on all these partial results.

Examples of Distributive Functions CREATE MATERIALIZED VIEW Sale_By_City AS SELECT City, SUM(Amnt) AS SubTot FROM Sale_By_City_ProdType GROUP BY City; SELECT SUM(SubTot) AS Total_Sale FROM Sale_By_City; CREATE TABLE Max_By_City AS SELECT City, MAX(Amnt) SubMax SELECT MAX(SubMax) AS Max_Sale FROM Max_By_City; Sale_By_City Auck 33 Chr 11 Well 27 Total_Sale 71 SubTot Max_By_City Auck 12 Chr 6 Well 9 Max_Sale 12 SubMax

Algebraic Aggregate Functions An aggregate function F ( ) is algebraic if there is a p-tuple function G ( ) and a function H ( ) such that F({Xij }) = H ({G ({Xj | j = 1,…, mi )i })| i = 1,…, n }) Function G ( ) computes a p-tuple for each of n subsets, and then is function H ( ) applied on all these p-tuples to produce the courser aggregate AVG( ), STDEV( ), center_of_mass( ) are algebraic functions For each operation, p is a constant Each p-tuple is produced by computing a sub-aggregate and all are used to compute the global aggregate STDEV() is standard deviation

Average as an Algebraic Function CREATE MATERIALIZED View Sale_By_City_Avg AS SELECT City, COUNT(*)AS my_count, AVG(Amnt)AS my_avg FROM Sale_By_City_ProdType GROUP BY City; SELECT SUM(my_avg*my_count)/SUM(my_count) AS Total_Avg FROM Sale_By_City_Avg; Note, for the average function, p = 2, When computing AVG sub-aggregates, we have to produce and store a two tuple: Either an intermediate (SUM, COUNT), or An intermediate (AVG, COUNT) Sale_By_City_Avg City my_count my_avg Auck 3 11.00 Chr 2 5.50 Well 4 6.75 Total_Avg 7.89

Holistic Aggregate Functions An aggregate function F () is holistic if there is no constant storage needed to describe a sub-aggregate There is no constant k (k > 1 ), such that a k - tuple characterizes the computation of sub-aggregates F ({Xj | j = 1,…, mi )})i Usually, each sub-aggregate is characterized by mi values, hence all values are needed to compute the super-aggregate Most common holistic aggregate functions are: Median(), MostFrequent() (called also Mode()), and Rank() MaxN(), or TopN() Holistic - emphasizing the organic or functional relation between parts and the whole

Median As a Holistic Aggregate Ffunction Median_By_City City List of values Med Auck 10, 11,12 11.0 Chr 6, 7 6.5 Well 3,7,8,9 7.5 Global_Median 8.0 Consider the set of numbers 80, 90, 90, 100, 85, 90. They could be math grades, for example. The MEAN is the arithmetic average, the average you are probably used to finding for a set of numbers - add up the numbers and divide by how many there are: (80 + 90 + 90 + 100 + 85 + 90) / 6 = 89 1/6. The MEDIAN is the number in the middle. In order to find the median, you have to put the values in order from lowest to highest, then find the number that is exactly in the middle: 80 85 90 90 90 100 ^ since there is an even number of values, the MEDIAN is between these two, or it is 90. Notice that there is exactly the same number of values ABOVE the median as BELOW it! The MODE is the value that occurs most often. In this case, since there are 3 90's, the mode is 90. A set of data can have more than one mode. The RANGE is the difference between the lowest and highest values. In this case 100 - 80 = 20, so the range is 20. The range tells you something about how spread out the data are. Data with large ranges tend to be more spread out. To compute global median, one needs all Amnt values

RANK Function The RANK function returns the position of a row within its partition Rows are ordered in the partition according to a ranking criteria Rank of a row is expressed as an ordinal number of the row within the partition All rows with the same value of the ranking criteria are designated the same rank If n (> 1 ) rows are ranked at position r, then the first next row is ranked at position r + n DENSE_RANK function generates ranks without gaps To compute the overall ranks, we need all source data, hence it is holistic If say 7 items share the third position, the DENSE_RANK function will assign the fourth position to the next ranked item (instead of eleventh position)

An Example Set of Values Sale_By_City_ProdType City ProdType Amnt Auckland Jeans 10 Christchurch Hats 12 Shoes 11 Wellington 7 Pajamas 8 Socks 13 Jackets 6 Pullovers 4 5 Trousers It may be supposed that Sale_By_City_ProdType is already produced by CUBE operator

The third best selling product type A Rank Query Retrieve the third best selling product type, use DENSE_RANK City The third best selling product type Auckland Jeans 10 Christchurch Jeans 5 Wellington Jackets 7, Pullovers 7, Jeans 7, Trousers 7 Overall ProdType Amnt Hats 28 Jeans 22 Shoes 25 Pullovers 15 Pajamas Trousers 7 Socks To calculate the overall ranks, we need all source data

Queries of the Type “Top N” Data analysts often require just top N best ranked elements of a dimension with regard to the measure Example: Show the 10 best sold products for a given location l and time unit t

The three best selling product types A Top N Query Retrieve the three best selling product types City The three best selling product types Auckland Pajamas 12, Socks 12, Hats 12 Christchurch Hats 12, Shoes 6, Jens 5 Wellington Pajamas 13, Socks 13, Shoes 8 Overall ProdType Amnt Hats 28 Jeans 22 Shoes 25 Pullovers 15 Pajamas Trousers 7 Socks To calculate the overall top N, we need all source data. Hence, the aggregate function is holistic.

Computing All for Holistic Agg Funcs To compute overall aggregate we need all source values In the case of the median, we needed to look at all source values to find out the overall median In the case of the rank and topN functions, we needed all source values to compute overall numbers for each product type by adding sales numbers of the same product in different cities Even if we wanted to find the third best selling product in any of cities, we would need all source values, since it is Shoes (in Auckland) So, these are really holistic functions

A Problem with Rank() and TopN() Example: Show the 10 best sold products for a given location l and time unit t The SQL statement SELECT p.ProdId, p.ProdName, s.Sale FROM Product p NATURAL JOIN Sales s WHERE s.LocId = l AND s.TimeId = t ORDER BY s.Sale DESC will return all existing products (possibly thousands of them) and consume a lot of time to compute, although only ten first were needed

A Problem with Rank() and TopN() Some DBMSs allow defining the clause OPTIMIZE FOR N ROWS in the line below the clause ORDER BY of the SQL statement PostgreSQL supports the clauses LIMIT {COUNT | ALL} and OFFSET start (convenient for rank) This way, SQL processor is informed when to start and stop displaying the result But result is already computed How to inform SQL processor to perform computation just for first N items? Think about!

Summary Distributive aggregates can be easily used to compute aggregates of coarser granularity Distributive: SUM(), COUNT(), MIN(), MAX() Algebraic aggregates can be used to compute aggregates of the coarser granularity if the corresponding p-tuple is also produced Algebraic: AVG(), STDEV(), VAR(), Holistic aggregates can be hardly used to compute aggregates of coarser granularity Holistic: MEDIAN(), MODE(), RANK(), TopN()