GroupBy Operations
Objectives of the Lecture :
To consider the algebraic GroupBy operator;
To consider the implementation of the GroupBy operator in SQL.
Purpose of ‘GroupBy’
It is very useful if a DBMS can do some calculations and manipulations of data as it is retrieved from the DB. There are 2 possible kinds of calculations/manipulations :
1. Those that take individual data values and return individual data values (i.e. scalar calculations).
2. Those that take a group of data values and return an individual data value (i.e. aggregate calculations).
The purpose of GroupBy is to allow aggregate calculations (also called summaries) to be carried out.
Because an aggregation can only logically apply to a group (or collection) of data all of the same type, each aggregation is applied to a group of data from one attribute. Thus GroupBy is said to work ‘down attributes’, i.e. ‘vertically’, as opposed to ‘horizontally’, i.e. ‘along tuples’.
DBMSs do not usually provide the full power of a programming language that allows arbitrary processing of any kind of complexity, but they do provide the more commonly useful kinds of processing.
A ‘scalar’ is defined as an individual, atomic, indivisible value. For example 357 and ‘Julie’ are scalar values, even though the number consists of 3 digits and the name of 5 letters or characters, and even though it is logically possible to extract a digit or character from these values. They are considered as scalars because in reality the number and name are treated as single values.
The terms aggregation and summary focus on the fact that one scalar value is produced from a group or collection of scalar values, even though, as will be seen later, not all aggregations/summaries are what might traditionally be called by that name.
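The difference between the two kinds of calculation can be sketched in a few lines of Python. The function names and the sample salaries are illustrative assumptions, not part of the lecture.

```python
# Scalar calculation: one value in, one value out (works "along" a tuple).
def add_bonus(salary):
    return salary + 500

# Aggregate calculation: a collection of values in, one value out
# (works "down" an attribute).
def total(salaries):
    return sum(salaries)

salaries = [18000, 40000]                     # illustrative values of one attribute
with_bonus = [add_bonus(s) for s in salaries]  # still one value per input value
grand_total = total(salaries)                  # a single scalar for the whole group
```

The scalar function is applied once per value and returns as many values as it was given; the aggregate function is applied once per group and returns exactly one value.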
Example of ‘GroupBy’
Get the total salary paid to each marital-status group.
EMPLOYEE GroupBy[ M-S ] With[ Total Bag[ Sal ] Sum ]
[Table : EMPLOYEE ( ENo, EName, M-S, Sal ), 8 tuples E1 to E8; M-S takes the values D, M and S.]
Result :
M-S   Total
S     58,000
D     78,500
M     102,000
In the example above, the algebra operation expresses the fact that :
groups or collections are formed using the M-S attribute, and specifically each group must have only one M-S value;
in each group the Sal attribute values are summed, without getting rid of any duplicate Sal values, to yield a Total for each group.
Parameters of a GroupBy (1)
Two main aspects :
1. How to split the operand up into groups (“each marital-status”) : GroupBy[ M-S ]
2. How to get the aggregate (“the total salary paid”) : With[ Total Bag[ Sal ] Sum ]
EMPLOYEE GroupBy[ M-S ] With[ Total Bag[ Sal ] Sum ]
Parameters of a GroupBy (2)
Individual Parameters
EMPLOYEE GroupBy[ M-S ] With[ Total Bag[ Sal ] Sum ]
1. The grouping attribute (“each marital-status”) : M-S.
2. The attribute to be aggregated (“total salary”) : Sal.
3. Include duplicates (everyone’s salary) : Bag.
4. Type of aggregate (“total salary”) : Sum.
5. New attribute to hold the result : Total.
The Operand of ‘GroupBy’
The operand is considered to consist of 3 disjoint sets of attributes :
1. Grouping attribute(s) : attributes used to split the operand up into groups of tuples. It/they appear(s) in the result.
2. Aggregate attribute(s) : attributes whose values are aggregated or summarised. One new summary result attribute is created per aggregate attribute.
3. Irrelevant attribute(s) : attributes not used by the GroupBy operator. It/they do not appear in the result.
The first two sets are specified by 2 of GroupBy’s parameters.
Disjoint sets are sets that have no members in common. So in this case, any attribute in the operand can only be in one of the three sets, not in two or all three of them.
Irrelevant attribute(s) are, by a process of elimination, any other attributes left in the relation which are not members of either of the other two sets.
It is not uncommon for there to be only one attribute in any of these sets; e.g. there may well be just one grouping attribute or aggregate attribute.
It is possible for any of the three sets to be empty. The case where there are no grouping attribute(s) will be considered later. While there could theoretically be no aggregate attribute(s), there would be no point in this, since then there would be no aggregations or summaries, which is the whole point of using the operator. There could be no irrelevant attribute(s) because the grouping attribute(s) and aggregate attribute(s) could just happen to fill the relation completely.
‘GroupBy’ Procedure
1. Remove irrelevant attribute(s).
2. Split the rest of the relation into groups of tuples, such that the grouping attribute(s) all have the same value in every tuple of a group.
3. Create one result tuple per group, consisting of the grouping attribute(s).
4. FOR each aggregation attribute :
   Append a result attribute to the result relation to hold the aggregation results.
   FOR each group :
      Apply the aggregate function to the group’s bag/set of values, and put the result in the appended new attribute.
Step 1 is not logically necessary, but is just to make clear that irrelevant attributes do not participate in a GroupBy operation. Steps 2 to 4 on their own actually carry out the operation.
The details of the procedure are determined by the parameters to GroupBy. One parameter specifies the grouping attributes. The remaining four parameters control the nature of the aggregations :
Which attribute is to be aggregated ?
What sort of aggregation is to be performed on that attribute; e.g. sum ?
Does the aggregation include any duplicate values ? Put another way, should duplicates be got rid of before the aggregation is performed ?
What is the name of the attribute into which the new aggregated values are to be put ?
GroupBy can execute aggregations on several attributes in the same operation. For example :
RELATION GroupBy[ G1, G2 ] With[ Res1 Bag[ Ag1 ] Sum ; Res2 Project[ Ag2 ] Min ; Res3 Bag[ Ag3 ] Max ]
In this case, there are 3 sets of 4 parameters, one set of 4 per aggregation. The result will consist of attributes ( G1, G2, Res1, Res2, Res3 ). The grouping also uses 2 attributes, G1 and G2. See later slides for details of Project and other aggregate functions.
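The four steps can be sketched as a small Python function. This is only an illustration under stated assumptions: relations are modelled as lists of dicts, the function name group_by and its parameter layout are invented here, and the EMPLOYEE rows are hypothetical (the ENo assignments are made up; the salaries per group are chosen to agree with the totals 102,000, 58,000 and 78,500 in the lecture's example).

```python
from collections import defaultdict

def group_by(relation, grouping, aggregations):
    """relation: list of dicts; grouping: list of attribute names;
    aggregations: list of (result_name, mode, attribute, function),
    where mode is "Bag" (keep duplicates) or "Project" (remove them)."""
    # Step 1: drop irrelevant attributes (not logically necessary).
    wanted = list(grouping) + [attr for _, _, attr, _ in aggregations]
    rows = [{a: t[a] for a in wanted} for t in relation]
    # Step 2: split into groups sharing the same grouping-attribute values.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in grouping)].append(row)
    result = []
    for key, members in groups.items():
        # Step 3: one result tuple per group, holding the grouping attributes.
        out = dict(zip(grouping, key))
        # Step 4: append one new attribute per aggregation.
        for result_name, mode, attr, func in aggregations:
            values = [m[attr] for m in members]
            if mode == "Project":
                values = list(set(values))   # duplicates removed first
            out[result_name] = func(values)
        result.append(out)
    return result

# Hypothetical EMPLOYEE rows (ENo assignment is an assumption).
employee = [
    {"ENo": "E1", "M-S": "M", "Sal": 24000},
    {"ENo": "E2", "M-S": "S", "Sal": 18000},
    {"ENo": "E3", "M-S": "D", "Sal": 32500},
    {"ENo": "E4", "M-S": "M", "Sal": 54000},
    {"ENo": "E5", "M-S": "S", "Sal": 40000},
    {"ENo": "E6", "M-S": "D", "Sal": 28000},
    {"ENo": "E7", "M-S": "M", "Sal": 24000},
    {"ENo": "E8", "M-S": "D", "Sal": 18000},
]

result = group_by(
    employee,
    ["M-S"],
    [("Total", "Bag", "Sal", sum), ("DistinctTotal", "Project", "Sal", sum)],
)
```

With this data the M group has Total 102000 but DistinctTotal 78000, because the duplicate 24000 is removed before summing when Project is used.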
Illustration of ‘GroupBy’ Procedure
EMPLOYEE GroupBy[ M-S ] With[ Total Bag[ Sal ] Sum ]
1. Remove unwanted attributes.
2. Group resulting tuples.
3. Create a tuple/group with grouping attribute(s).
4. Append aggregation(s).
[Illustration : EMPLOYEE ( ENo, EName, M-S, Sal ) is reduced to ( M-S, Sal ), split into the groups M, S and D, and the Sal values of each group are summed.]
Result :
M-S   Total
M     102,000
S     58,000
D     78,500
The aggregate function Sum is chosen as a parameter because we want to add the salary values together. Bag is chosen as a parameter because we want everyone’s salary, regardless of whether it’s the same salary as someone else’s or not, and therefore we need any duplicate values that may occur. The parameter Total is the name of the attribute to hold the sums of the salaries.
The grouping and aggregation parameters determine the 3 sets of attributes :
Grouping attribute : M-S.
Aggregate attribute : Sal.
Therefore ENo and EName are irrelevant attributes.
In following the procedure with this example, note the following :
If duplicate tuples remain when ENo and EName are removed, they are left; it is not a genuine projection. It is just that these attributes are not required by steps 2 to 4. (When the GroupBy operation has finished, there will be no duplicate tuples in the result).
The relation is split into groups using M-S. ‘D’ appears in every tuple of one group, ‘M’ in every tuple of the second group, and ‘S’ in every tuple of the third group.
There are 3 different values of M-S, so 3 new tuples consisting of attribute M-S are put into the result.
The aggregations are put into the new attribute Total that is appended to the result relation.
Another ‘GroupBy’ Example
How many different shipment sizes are there per supplier ?
SHIP GroupBy[ SNo ] With[ Sizes Project[ Qty ] Count ]
[Table : SHIP ( SNo, PNo, Qty ), with suppliers S1, S2, S3, parts P1, P2, P3 and quantities 10, 12, 6, 8. The result has attributes ( SNo, Sizes ), with one tuple for each of S1, S2 and S3.]
In the example above, the algebra operation expresses the fact that :
groups are formed using the SNo attribute;
the number of different values of the Qty attribute in each group is counted up (i.e. duplicates are removed from each group first before counting) and the result put in the new attribute Sizes.
The GroupBy Procedure Again
SHIP GroupBy[ SNo ] With[ Sizes Project[ Qty ] Count ]
1. Remove unwanted attributes.
2. Group resulting tuples.
3. Create a tuple/group with grouping attribute(s).
4. Append aggregation(s).
[Illustration : SHIP ( SNo, PNo, Qty ) is reduced to ( SNo, Qty ), split into the groups S1, S2 and S3, and the different Qty values of each group are counted to give ( SNo, Sizes ).]
The aggregate function Count is chosen as a parameter because we want to count up the quantity sizes. Project as opposed to Bag is chosen as a parameter because, like the Project algebra operator, it gives us a set of values with any duplicates removed. We want to count the different sizes of the shipments; we do not want to count up all the individual shipments, and so duplicates must be removed before counting. The parameter Sizes is the name of the attribute to hold the counts of the different shipment sizes.
The grouping and aggregation parameters determine the 3 sets of attributes :
Grouping attribute : SNo.
Aggregate attribute : Qty.
Therefore PNo is an irrelevant attribute.
In following the procedure with this example, note the following :
PNo is removed or ignored because it is irrelevant.
The relation is split into groups using SNo. ‘S1’ determines one group, ‘S2’ a second group, and ‘S3’ the third.
There are 3 different values of SNo, so 3 new tuples consisting of attribute SNo are put in the result.
The aggregations are put into the new attribute Sizes that is appended to the result relation.
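The effect of choosing Project rather than Bag before counting can be seen in a short Python sketch. The SHIP rows below are hypothetical, since the original table rows are not available; only the attribute names come from the lecture.

```python
# Hypothetical SHIP tuples as (SNo, PNo, Qty); the rows are an assumption.
ship = [
    ("S1", "P1", 10),
    ("S1", "P2", 10),   # same Qty as the previous S1 shipment
    ("S2", "P1", 12),
    ("S2", "P3", 6),
    ("S3", "P2", 8),
]

distinct_qtys = {}   # SNo -> set of Qty values   (Project[ Qty ])
all_qtys = {}        # SNo -> list of Qty values  (Bag[ Qty ])
for sno, _pno, qty in ship:          # PNo is irrelevant and ignored
    distinct_qtys.setdefault(sno, set()).add(qty)
    all_qtys.setdefault(sno, []).append(qty)

sizes = {sno: len(q) for sno, q in distinct_qtys.items()}      # Project ... Count
shipments = {sno: len(q) for sno, q in all_qtys.items()}       # Bag ... Count
```

For S1 the two counts differ (1 different size, 2 shipments), which is exactly the distinction the Project parameter is meant to capture.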
Standard SQL Aggregate Functions
For the attribute values in a group :
Sum adds them together;
Min finds their minimum;
Max finds their maximum;
Avg gets the average;
Count counts up the number of tuples in a group or counts up the number of attribute values in a group.
Some SQL DBMSs feature additional aggregate functions, such as Stdev (= standard deviation) and Variance.
Avg is logically unnecessary, as it can be replaced by Sum / Count, but is of practical convenience. Note however that mathematicians also have two other kinds of averages, the median and the mode, which could be useful. The number of aggregate functions provided with an SQL DBMS is likely to increase as time goes by.
Aggregate functions are sometimes known as Set Functions, because they operate on a set of values to produce a single result. Unfortunately this is a misleading term because, as we have seen, the functions can also act on bags of values.
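A brief Python sketch of two points made here: Avg can be obtained as Sum / Count, and it matters whether a function is given a bag or a set. The values are illustrative (one group containing a duplicate salary).

```python
from statistics import mean

bag = [24000, 54000, 24000]   # one group's values, duplicates kept
distinct = set(bag)           # the same group with duplicates removed

avg_bag = sum(bag) / len(bag)                   # Avg written as Sum / Count
avg_distinct = sum(distinct) / len(distinct)    # Avg over the set of values

# Min and Max give the same answer for the bag and the set;
# Sum, Count and Avg in general do not.
same_min_max = (min(bag), max(bag)) == (min(distinct), max(distinct))
```

Here the average over the bag is 34000 while the average over the set is 39000, so the choice between keeping and removing duplicates changes the result for Avg even though it cannot change Min or Max.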
Points to Note
When considering whether to include or exclude duplicate attribute values, the concern is only with duplicates in the same group.
The result contains one tuple per group. Hence a result tuple can only contain :
grouping attribute(s), because they have the one single value in each group;
aggregated values, because they create one scalar value from a group of values.
The aggregate function can be part of a general scalar expression. Example :
EMPLOYEE GroupBy[ M-S ] With[ Total 1.1 * ( Bag[ Sal ] Sum ) ]
In the example above, a single aggregate value for a group of Sal values is generated, which is then multiplied by 1.1. This still results in a single scalar value, and so is acceptable.
In SQL, if any grouping attribute contains NULL, then NULL will be treated like a value and corresponding group(s) formed in producing the result.
It is a not uncommon error in SQL to try to create an aggregation expression without an aggregate function. The system would then supposedly have to put multiple values into one attribute of one tuple in the result; consequently an error message is normally generated.
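The treatment of NULL in a grouping attribute can be checked with a small sketch using Python's built-in sqlite3 module. The table contents are hypothetical, and the attribute name M-S is written as the delimited identifier "M-S" because an unquoted hyphen would be read as a minus sign.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE EMPLOYEE (ENo TEXT, "M-S" TEXT, Sal INTEGER)')
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [
        ("E1", "M", 24000),
        ("E2", None, 18000),   # marital status unknown (NULL)
        ("E3", None, 40000),   # marital status unknown (NULL)
        ("E4", "S", 28000),
    ],
)

rows = conn.execute(
    'SELECT "M-S", SUM(Sal) AS Total FROM EMPLOYEE GROUP BY "M-S"'
).fetchall()
totals = dict(rows)   # NULL comes back as Python's None
conn.close()
```

The two tuples with a NULL marital status end up in one group of their own, so the result has three tuples rather than four.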
SQL : GroupBy
The SQL equivalent of a GroupBy has the syntax :
Select GroupingAttribute(s), “aggregation” As Result_Name
From RELATION_NAME
Group By GroupingAttribute(s) ;
Repeat “aggregation” As Result_Name for each aggregation.
The syntax of “aggregation” is :
AggFunction ( Distinct / All Attribute_Name )
Distinct removes duplicates, All does not. (All is the default if Attribute_Name alone is given).
SQL always uses two separate words Group By; it never runs the two words together as GroupBy. The Group By phrase must always appear after the Select and From phrases, and its sole purpose is to indicate a GroupBy operation.
Grouping attribute(s) appear in the SQL Group By phrase, and are those that determine which attribute(s) will be used in the SQL GroupBy operation. It is not mandatory to repeat any or all of them in the Select phrase, but they usually are, since otherwise we wouldn’t know from the result which summary value went with which grouping attribute value.
In mapping the aggregation specification from algebra to SQL, 3 of the 4 parameters are the same in both algebra and SQL : Result_Name (= the new attribute holding the result of the aggregation), AggFunction (= the aggregate function), Attribute_Name (= an aggregate attribute). Thus an algebra
With[ Result_Name Project / Bag [ Attribute_Name ] AggFunction ]
corresponds in SQL to
AggFunction ( Distinct / All Attribute_Name ) As Result_Name
As regards the 4th parameter, Project in algebra corresponds to Distinct in SQL; Bag in algebra corresponds to All in SQL.
As corresponds to an assignment operator (although the assignment is backwards, from left to right instead of right to left). It can optionally be omitted in SQL, but inserting As makes the SQL easier to read, and so is recommended to avoid errors. If As and the new attribute’s name are omitted, the aggregate expression is used as the new attribute’s name.
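The mapping from the algebra parameters to the SQL text is mechanical, which a short Python sketch can make concrete. The function name to_sql and its parameter layout are assumptions made for illustration; the output follows the syntax template given above.

```python
def to_sql(relation, grouping, aggregations):
    """Build the SQL text for a GroupBy.
    aggregations: list of (result_name, mode, attribute, agg_function),
    where mode is "Project" or "Bag" as in the algebra."""
    keyword = {"Project": "Distinct", "Bag": "All"}   # the 4th parameter
    select_items = list(grouping)
    for result_name, mode, attribute, agg_function in aggregations:
        select_items.append(
            f"{agg_function}( {keyword[mode]} {attribute} ) As {result_name}"
        )
    group_cols = ", ".join(grouping)
    return (
        f"Select {', '.join(select_items)} "
        f"From {relation} Group By {group_cols} ;"
    )

salary_sql = to_sql("EMPLOYEE", ["M-S"], [("Total", "Bag", "Sal", "Sum")])
sizes_sql = to_sql("SHIP", ["SNo"], [("Sizes", "Project", "Qty", "Count")])
```

The sketch only builds text in the lecture's notation; it does not quote identifiers, so a name such as M-S would still need to be delimited before the statement is sent to a real DBMS.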
SQL : Examples
The 2 previous examples are written in SQL as follows.
“Get the total salary paid to each marital-status group.”
Select M-S, Sum( All Sal ) As Total
From EMPLOYEE
Group By M-S ;
All could be omitted.
“How many different shipment sizes are there per supplier ?”
Select SNo, Count( Distinct Qty ) As Sizes
From SHIP
Group By SNo ;
In contrast to the second example, consider the query “How many different shipments are there per supplier ?”
Select SNo, Count( * ) As Sizes
From SHIP
Group By SNo ;
Cannot have Distinct with “*”.
Compare the following with the first query above :
Select M-S, Sum( Distinct Sal ) As Total
From EMPLOYEE
Group By M-S ;
By contrast, this means “Get the total of all the different salaries paid to each marital-status group”.
Compare the following with the second query above :
Select SNo, Count( All Qty ) As Sizes
From SHIP
Group By SNo ;
By contrast, this means “How many different shipments are there per supplier ?” In fact this means the same as the third query above ! The reason for this is that if all the values in an attribute are counted, then this is the same result as just counting all the tuples in the group. (This holds provided Qty contains no NULLs : Count( Attribute_Name ) ignores NULLs, whereas Count( * ) counts every tuple.)
Count ( * ) means in fact “count all the tuples in the group”, because ‘*’ stands for all the attributes, and hence for the entire tuple, all of which must therefore be counted. For this reason, if SQL were to allow Count ( All * ) and Count ( Distinct * ) (which it doesn’t), then they would still yield the same result as Count ( * ).
Calculations can be incorporated into SQL aggregations. Thus the earlier example would be written :
Select M-S, ( 1.1 * Sum( Sal ) ) As Total
From EMPLOYEE
Group By M-S ;
Note that in a real SQL DBMS the attribute name M-S would have to be written as a delimited identifier, e.g. "M-S", since an unquoted hyphen is read as a minus sign.
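These queries can be tried out with Python's built-in sqlite3 module. The sketch below rests on several assumptions: the ENo values and all SHIP rows are made up, the salaries per marital-status group are chosen to agree with the totals in the lecture's example, "M-S" is written as a delimited identifier, and All is simply omitted since it is the default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE EMPLOYEE (ENo TEXT, "M-S" TEXT, Sal REAL)')
conn.execute("CREATE TABLE SHIP (SNo TEXT, PNo TEXT, Qty INTEGER)")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)", [
    ("E1", "M", 24000), ("E2", "S", 18000), ("E3", "D", 32500),
    ("E4", "M", 54000), ("E5", "S", 40000), ("E6", "D", 28000),
    ("E7", "M", 24000), ("E8", "D", 18000),
])
conn.executemany("INSERT INTO SHIP VALUES (?, ?, ?)", [
    ("S1", "P1", 10), ("S1", "P2", 10), ("S2", "P1", 12),
    ("S2", "P3", 6), ("S3", "P2", 8),
])

def run(sql):
    """Run a two-column query and return it as {group: value}."""
    return dict(conn.execute(sql).fetchall())

total = run('SELECT "M-S", SUM(Sal) AS Total FROM EMPLOYEE GROUP BY "M-S"')
distinct_total = run(
    'SELECT "M-S", SUM(DISTINCT Sal) AS Total FROM EMPLOYEE GROUP BY "M-S"'
)
sizes = run("SELECT SNo, COUNT(DISTINCT Qty) AS Sizes FROM SHIP GROUP BY SNo")
count_star = run("SELECT SNo, COUNT(*) AS Sizes FROM SHIP GROUP BY SNo")
count_qty = run("SELECT SNo, COUNT(Qty) AS Sizes FROM SHIP GROUP BY SNo")
conn.close()
```

With this data Sum and Sum( Distinct ) differ only for the M group, where the salary 24000 occurs twice, and Count( * ) agrees with Count( Qty ) because no Qty value is NULL.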
Executing an SQL ‘Select’ Statement
The phrases are executed in the following order :
From : Joins / Cartesian Products done here.
Where : Restrictions done here.
Group By : Grouping done here.
Order By : Sequencing done here.
Select : Projections done here.
The above shows the execution sequence of all SQL phrases covered up to now. Thus we can see that the GroupBy operation is carried out after all the Restrictions and Joins have been executed, and is in effect carried out on the results of all those operations. If this sequence were to be written out in algebra, it would look like :
R Join[ Att ] S Restrict[ condition ] GroupBy[ Grp ] With[ Res aggregation ]
Hence incidentally the need for us to design our queries in this sequence.
As we have already seen, a GroupBy effectively carries out a Projection on its operand, and therefore the only Projection that can be done after a GroupBy will get rid of some of the attributes produced in the GroupBy. Since there is no point in creating something only to immediately get rid of it, we would not normally bother to create it in the first place; so in practice there is rarely a Projection after a GroupBy when using SQL. In fact it is the combination of the Group By and Select phrases that does a combined Projection and GroupBy operation in SQL.
The reason that the above diagram is not precise in this respect is because the Order By phrase has been added for completeness. The Order By phrase is logically executed before the Select phrase in order that sequencing can be carried out on attributes that are then removed because they are not required in the result.
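The execution sequence described here can be mimicked with a small Python pipeline. Everything in the sketch is an illustrative assumption: the function name run_query, the relations r and s, and their attribute names Id, Grp, Val and Ref.

```python
def run_query(r, s, join_on, where, group_attr, agg_attr):
    # From: join (a Cartesian product filtered by the join condition).
    joined = [{**a, **b} for a in r for b in s if join_on(a, b)]
    # Where: restriction, applied before any grouping takes place.
    restricted = [t for t in joined if where(t)]
    # Group By: split the surviving tuples into groups.
    groups = {}
    for t in restricted:
        groups.setdefault(t[group_attr], []).append(t[agg_attr])
    # Order By: sequencing, here on the grouping attribute.
    ordered = sorted(groups.items())
    # Select: keep the grouping attribute and the aggregate only.
    return [(key, sum(values)) for key, values in ordered]

r = [
    {"Id": 1, "Grp": "a", "Val": 10},
    {"Id": 2, "Grp": "a", "Val": 50},
    {"Id": 3, "Grp": "b", "Val": 20},
    {"Id": 4, "Grp": "b", "Val": 5},
]
s = [{"Ref": 1}, {"Ref": 2}, {"Ref": 3}]

answer = run_query(
    r, s,
    join_on=lambda a, b: a["Id"] == b["Ref"],
    where=lambda t: t["Val"] < 30,
    group_attr="Grp",
    agg_attr="Val",
)
```

Because the restriction runs before the grouping, the tuple with Val 50 never reaches group a, and the tuple with Id 4 is dropped even earlier by the join; the sums are therefore taken over the filtered groups only.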
Combining Algebra Operators
An SQL retrieval that combines all the operations covered so far, assuming no Projection after a GroupBy, is in general written in SQL1 as follows :
Select GroupingAttr(s), “aggregation” As Result_Name
From R, S
Where theta-join condition And restriction-condition
Group By GroupingAttr(s) ;
Example : the SQL query to answer the question “How many car owners earning less than £30,000 are there in each marital status group ?” is :
Select M-S, Count( * ) As Total
From EMPLOYEE, CAR
Where ENo = Owner And Sal < 30000
Group By M-S ;
For completeness, the algebra version of the example query is :
EMPLOYEE Gen[ ENo = Owner ] CAR Restrict[ Sal < 30000 ] GroupBy[ M-S ] With[ Total Bag[ * ] Count ]
From this it can be seen that the SQL2 version of the query is :
Select M-S, Count( * ) As Total
From EMPLOYEE Join CAR On ( ENo = Owner )
Where Sal < 30000
Group By M-S ;
In general, the SQL2 version of queries of this kind is :
Select GroupingAttr(s), “aggregation” As Result_Name
From join expression
Where restriction-condition
Group By GroupingAttr(s) ;
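The two styles of the car-owner query can be compared with a short sqlite3 sketch in Python. The table contents are hypothetical, the CAR attribute RegNo is an invented name (only Owner appears in the lecture), and "M-S" is written as a delimited identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE EMPLOYEE (ENo TEXT, "M-S" TEXT, Sal INTEGER)')
conn.execute("CREATE TABLE CAR (RegNo TEXT, Owner TEXT)")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)", [
    ("E1", "M", 24000), ("E2", "S", 18000), ("E3", "M", 54000),
    ("E4", "D", 28000), ("E5", "S", 40000), ("E6", "M", 24000),
])
conn.executemany("INSERT INTO CAR VALUES (?, ?)", [
    ("C1", "E1"), ("C2", "E2"), ("C3", "E3"), ("C4", "E4"), ("C5", "E6"),
])

# SQL1 style: join condition and restriction both in the Where phrase.
sql1 = conn.execute(
    'SELECT "M-S", COUNT(*) AS Total FROM EMPLOYEE, CAR '
    'WHERE ENo = Owner AND Sal < 30000 GROUP BY "M-S"'
).fetchall()

# SQL2 style: join condition in the From phrase, restriction in Where.
sql2 = conn.execute(
    'SELECT "M-S", COUNT(*) AS Total FROM EMPLOYEE JOIN CAR ON (ENo = Owner) '
    'WHERE Sal < 30000 GROUP BY "M-S"'
).fetchall()
conn.close()

sql1_result = sorted(sql1)
sql2_result = sorted(sql2)
```

Both forms give the same answer; with this data E3 is excluded by the salary restriction and E5 by the join, since E5 owns no car.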