DBMAN 3 (V 1.0)
- Group By, Having
- Cube, Rollup
- OLTP vs OLAP
- Data analysis
SELECT: displayed (written) order of clauses
1. INTO
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. UNION/MINUS
7. INTERSECT
8. ORDER BY
Grouping/Aggregate functions
- SUM: sum
- AVG: average
- MIN: minimum
- MAX: maximum
- COUNT: number of non-null values (records)
- GROUP_CONCAT: concatenated list of elements
- STDDEV: standard deviation
- VARIANCE: variance
Non-grouping usage
select avg(sal) as Average from emp;
select min(sal) from emp;
select min(sal) from emp where sal>2000;
select avg(distinct sal) as Average from emp;
select count(sal) from emp;
select count(comm) from emp where sal>2000;
select comm from emp where sal>2000;
select count(*) from emp where sal>2000;
select avg(comm) from emp;
NULL values are not included!
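The NULL rule above can be checked with a small experiment. The following is a minimal sketch, assuming an in-memory SQLite database and a handful of made-up rows (not the real EMP data); it compares COUNT(*), COUNT(comm), AVG(comm), and an average that treats NULL as 0.

```python
import sqlite3

# Toy rows (hypothetical values, not the real EMP table): (ename, sal, comm)
rows = [("A", 1000, None), ("B", 2000, 300), ("C", 3000, None), ("D", 4000, 500)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal INTEGER, comm INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?)", rows)

# COUNT(*) counts records; COUNT(comm) and AVG(comm) skip NULL values.
count_all, count_comm, avg_comm, avg_comm_zero = con.execute(
    "SELECT COUNT(*), COUNT(comm), AVG(comm), AVG(IFNULL(comm, 0)) FROM emp"
).fetchone()

print(count_all, count_comm, avg_comm, avg_comm_zero)
```

With these toy rows the average of comm is taken over the two non-null values only, so it differs from the average in which NULL is replaced by 0.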
Grouping
select distinct deptno from emp;
select avg(sal) from emp where deptno=10;
select avg(sal) from emp where deptno=20;
select avg(sal) from emp where deptno=30;
select deptno, avg(sal) from emp group by deptno;
Grouping
IN THE SELECTION LIST (FIELD LIST) ONLY THE GROUPED FIELD(S) AND THE GROUPING FUNCTION(S) ARE ALLOWED!
(YES, IN MYSQL AS WELL!!!) (ONLY_FULL_GROUP_BY)
select deptno, avg(sal) as Average, min(sal) as Minimum, count(*) as Num from emp group by deptno;
Grouping and clauses
select mgr, avg(sal) from emp group by mgr;
select ifnull(mgr, "none") as boss, lpad(avg(sal), 15, '#') as "Averagesal" from emp group by mgr;
HAVING vs. WHERE: WHERE filters records before grouping, HAVING filters the groups afterwards
select mgr, avg(sal) from emp where ename like '%E%' group by mgr;
select mgr, avg(sal) from emp where ename like '%E%' group by mgr having avg(sal)>1300;
select mgr, avg(sal) as average from emp where ename like '%E%' group by mgr having avg(sal)>1300 order by average desc;
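The difference between WHERE and HAVING is easiest to see on a tiny table. This is a minimal sketch, assuming SQLite and made-up rows (names, manager ids, and salaries are hypothetical); the queries mirror the slide's WHERE and HAVING examples.

```python
import sqlite3

# Toy rows (hypothetical): (ename, mgr, sal)
rows = [("EVE", 1, 1000), ("BEN", 1, 2000), ("KIM", 1, 9000),
        ("LEE", 2, 1200), ("AMY", 2, 5000)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, mgr INTEGER, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?)", rows)

# WHERE removes records before the groups are formed.
where_only = con.execute(
    "SELECT mgr, AVG(sal) FROM emp WHERE ename LIKE '%E%' "
    "GROUP BY mgr ORDER BY mgr"
).fetchall()

# HAVING removes whole groups after the aggregates are computed.
with_having = con.execute(
    "SELECT mgr, AVG(sal) FROM emp WHERE ename LIKE '%E%' "
    "GROUP BY mgr HAVING AVG(sal) > 1300 ORDER BY mgr"
).fetchall()

print(where_only)
print(with_having)
```

KIM and AMY never reach the grouping step because of the WHERE filter, while the group of manager 2 is dropped only after its average has been computed.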
More complex grouping queries
select min(max(sal)), max(max(sal)), round(avg(max(sal))) from emp group by deptno;
-- In Oracle this works, in MySQL: "Invalid use of group function"
select min(sal+nvl(comm,0)), mod(empno,3) from emp group by mod(empno,3) having min(sal+nvl(comm,0)) > 800;
More complex grouping queries
select distinct job, substr(job, 2, 1) from emp;
select avg(sal) as average, substr(job, 2, 1) from emp group by substr(job, 2, 1);
select ename, sal, round(sal/1000) from emp;
select round(sal/1000) as SalCat, count(sal) as Num from emp group by round(sal/1000);
More complex grouping queries (MySQL)
select ename, round(datediff(curdate(), hiredate)/365.25) as diff from emp;
select count(*), round(datediff(curdate(), hiredate)/365.25) as diff from emp group by round(datediff(curdate(), hiredate)/365.25);
More complex grouping queries (Oracle)
select ename, hiredate, (to_char(sysdate, 'YYYY')-to_char(hiredate, 'YYYY')) as diff from emp;
select count(*), (to_char(sysdate, 'YYYY')-to_char(hiredate, 'YYYY')) as diff from emp group by (to_char(sysdate, 'YYYY')-to_char(hiredate, 'YYYY'));
OR: we could use months_between()
More complex grouping queries
select distinct deptno, job from emp;
select deptno, job, avg(sal), min(sal), max(sal) from emp group by deptno, job order by deptno, job;
Oracle-specific "extras":
- GROUP BY GROUPING SETS
- GROUP BY CUBE
- GROUP BY ROLLUP
GROUP BY
Group by, Having: one-field use is "trivial", e.g. average salary for job or department
Multiple fields: complex grouping, e.g. average salary for job AND department
Still: only the grouped fields and the grouping functions are allowed in the selection list!!!
SELECT job, deptno, avg(sal) FROM emp GROUP BY job, deptno;
Result columns: JOB, DEPTNO, AVG(SAL); one row per (job, deptno) combination.
SELECT mgr, job, deptno, avg(sal) FROM emp GROUP BY job, deptno, mgr;
Result columns: MGR, JOB, DEPTNO, AVG(SAL); one row per (job, deptno, mgr) combination.
DISADVANTAGES OF A SINGLE GROUP BY
- Not flexible enough
- One grouping per query, thus multiple queries are needed even if the groupings are similar
- Slower
Aim: one query, multiple groupings
GROUPING SETS
SELECT job, deptno, avg(sal) FROM emp GROUP BY GROUPING SETS ( (job, deptno) );
NVL: type matching!
The replacement value must match the type of the field: mgr is numeric, so a string replacement needs to_char().
SELECT nvl(mgr, 'Nope'), deptno, avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno) );
SELECT nvl(to_char(mgr), 'Nope'), deptno, avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno) );
SELECT nvl(mgr, 0), deptno, avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno) );
GROUP BY GROUPING SETS
We can define multiple groupings inside one query; sub-results can be cached.
E.g. performing an MGR, DEPTNO and a JOB, DEPTNO grouping in ONE query:
SELECT nvl(mgr, 0), deptno, nvl(job, 'Nope'), avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job) );
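What a two-set GROUPING SETS query returns can be emulated with two plain GROUP BY queries glued together by UNION ALL. The sketch below assumes SQLite (which has no GROUPING SETS clause) and made-up rows; it only illustrates the shape of the result, not how Oracle evaluates the clause internally.

```python
import sqlite3

# Toy rows (hypothetical): (mgr, deptno, job, sal)
rows = [(1, 10, "CLERK", 1000), (1, 10, "CLERK", 2000),
        (1, 20, "ANALYST", 3000), (2, 20, "ANALYST", 5000)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (mgr INTEGER, deptno INTEGER, job TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)", rows)

# Emulates GROUPING SETS ((mgr, deptno), (deptno, job)):
# the field that is not part of a grouping set shows up as NULL.
result = con.execute(
    "SELECT mgr, deptno, NULL AS job, AVG(sal) FROM emp GROUP BY mgr, deptno "
    "UNION ALL "
    "SELECT NULL, deptno, job, AVG(sal) FROM emp GROUP BY deptno, job"
).fetchall()

for row in result:
    print(row)
```

The NULLs in the output are the reason the slide's queries wrap mgr and job in nvl().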
GROUP BY GROUPING SETS
SELECT nvl(mgr, 0), nvl(deptno,0), nvl(job, 'NO'), avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr) );
SELECT nvl(mgr, 0), nvl(deptno,0), nvl(job, 'NO'), avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr), () );
Why do we have 0 for the mgr value???
GROUPING
With the special "grouping function" GROUPING we can determine whether a given field is used for the grouping in a given record.
Grouping function: allowed in the selection list.
Special: it only works with a grouped field!
GROUPING
0 = TRUE?
With a simple single-field or multi-field GROUP BY it always returns 0:
SELECT job, avg(sal), grouping(job) FROM emp GROUP BY job;
SELECT deptno, job, avg(sal), grouping(job) FROM emp GROUP BY job, deptno;
With grouping sets: grouping = 0 means that the field is used in the grouping for that record; 1 means that it is not.
GROUPING
SELECT mgr, deptno, job, avg(sal),
  GROUPING(mgr) as GMGR,
  GROUPING(deptno) as GDEPTNO,
  GROUPING(job) as GJOB
FROM emp
GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr), () );
GROUPING
SELECT
  CASE WHEN GROUPING(mgr)=0 THEN mgr ELSE 0 END as MGR,
  CASE WHEN GROUPING(deptno)=0 THEN deptno ELSE 0 END as DEPTNO,
  CASE WHEN GROUPING(job)=0 THEN job ELSE 'NO' END as JOB,
  avg(sal)
FROM emp
GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr), () );
GROUPING_ID
Unique identifier for each possible grouping column configuration
SELECT mgr, deptno, job, avg(sal),
  GROUPING_ID(mgr, deptno, job) as GID
FROM emp
GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr), () );
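GROUPING_ID can be read as the GROUPING values of its arguments packed into one binary number, with the first argument as the most significant bit. The sketch below recomputes the identifiers for the four grouping sets of the query above; the helper name grouping_id is hypothetical and only mimics the numbering, it does not query a database.

```python
def grouping_id(columns, grouping_set):
    """Pack the GROUPING() bits of `columns` into one integer.

    A bit is 0 when the column is part of the grouping set and 1 when it is
    not; the first column is the most significant bit.
    """
    gid = 0
    for col in columns:
        bit = 0 if col in grouping_set else 1
        gid = (gid << 1) | bit
    return gid


columns = ("mgr", "deptno", "job")
grouping_sets = [("mgr", "deptno"), ("deptno", "job"), ("mgr",), ()]

gids = {gs: grouping_id(columns, gs) for gs in grouping_sets}
for gs, gid in gids.items():
    print(gs, format(gid, "03b"), gid)
```

Filtering on this single number is usually shorter than combining three separate GROUPING() conditions.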
GROUP BY GROUPING SETS DRAWBACKS
- Too complicated, too long
- When do we need a query with three totally different grouping sets?
- What kind of caching can we do here?
Usually, there are hierarchical relations between the grouping fields -> more meaning, more caching -> ROLLUP and CUBE
GROUPING and GROUPING_ID can be used the same way
CUBE
GROUP BY CUBE (a, b, c) = GROUP BY GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( ) )
CUBE(field1, field2): the two fields have the same rank, all combinations are shown
CUBE(job, deptno): in addition to the simple two-field grouping, we get the job averages, the department averages, and the total average
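The CUBE expansion is simply every subset of the listed fields. A minimal sketch in Python, where the helper name cube is hypothetical, enumerates the grouping sets that CUBE (a, b, c) stands for:

```python
from itertools import combinations


def cube(cols):
    """Return every subset of `cols`, largest first, as CUBE would group."""
    sets = []
    for size in range(len(cols), -1, -1):
        sets.extend(combinations(cols, size))
    return sets


cube_sets = cube(("a", "b", "c"))
for grouping_set in cube_sets:
    print(grouping_set)
```

For n fields this yields 2^n grouping sets, which is why CUBE over many fields gets expensive quickly.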
SELECT job, deptno, avg(sal) FROM emp GROUP BY CUBE(job, deptno);
ROLLUP
GROUP BY ROLLUP (a, b, c) = GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a), ( ) )
ROLLUP(field1, field2): the first field is hierarchically more important, so we only take the combinations where it is used (plus the grand total)
ROLLUP(job, deptno): in addition to the simple two-field grouping, we get the job averages and the total average
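The ROLLUP expansion keeps only the leading prefixes of the field list. A minimal sketch, where the helper name rollup is hypothetical:

```python
def rollup(cols):
    """Return the prefixes of `cols`, longest first, ending with the empty set."""
    return [tuple(cols[:size]) for size in range(len(cols), -1, -1)]


rollup_sets = rollup(("a", "b", "c"))
for grouping_set in rollup_sets:
    print(grouping_set)
```

For n fields this gives n + 1 grouping sets, compared with 2^n for CUBE.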
SELECT job, deptno, avg(sal) FROM emp GROUP BY ROLLUP(job, deptno);
Result columns: JOB, DEPTNO, AVG(SAL). After the (job, deptno) rows, the job subtotal rows follow:
ANALYST    3000
CLERK      1037,5
MANAGER    2758,33333
PRESIDENT  5000
SALESMAN   ...
MIXTURE OF GROUPINGS
GROUP BY a, CUBE (b, c) = GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a, c), (a) )
GROUP BY a, ROLLUP (b, c) = GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a) )
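A mixed GROUP BY can be read as a cross product: every grouping set of one part is combined with every grouping set of the other. The sketch below reproduces the two equalities above; cube, rollup, and concat are hypothetical helper names used only for this illustration.

```python
from itertools import combinations, product


def cube(cols):
    """All subsets of `cols`, largest first."""
    return [c for size in range(len(cols), -1, -1)
            for c in combinations(cols, size)]


def rollup(cols):
    """All prefixes of `cols`, longest first."""
    return [tuple(cols[:size]) for size in range(len(cols), -1, -1)]


def concat(*groupings):
    """Cross product of grouping-set lists, joining the chosen sets."""
    return [sum(parts, ()) for parts in product(*groupings)]


a_cube = concat([("a",)], cube(("b", "c")))
a_rollup = concat([("a",)], rollup(("b", "c")))
print(a_cube)
print(a_rollup)
```

Because the plain field a is in every combination, the grand total row never appears in a mixed grouping of this form.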
OLTP? OLAP?
OLTP = Online Transaction Processing
OLAP = Online Analytical Processing
OLTP
- product » price
- invoice » amount
- client » name
OLAP
- Product category × Region » Gross margin
- Product × Warehouse » Inventory
- Supplier × Time × Product » Return rate
- Tables are usually a result of grouping!
OLTP vs OLAP
                OLTP                                  OLAP
Application     Operational: ERP, CRM, legacy apps    Management Information System, Decision Support System
Typical users   Staff                                 Managers, Executives
Horizon         Weeks, Months                         Years
Refresh         Immediate                             Periodic
Data model      Entity-relationship                   Multi-dimensional
Schema          Normalized                            Star
Emphasis        Update                                Retrieval
Star data model?
- The supervisor that gave the most discounts?
- The quantity shipped on a particular date, month, year or quarter?
- In which zip code did product A sell the most?
OLAP rules
Automated data transfer
- Extract data from OLTP system(s)
- Transform/standardize, if necessary
- Import to OLAP database
- Build cubes (GROUP BY!)
- Produce reports
Drilling
- Drill down: region -> city -> district
- Drill up: city -> region -> country
- Drill across: north region -> south region -> west region
OLAP vs Group by
- Every dimension can be a result of a group by query
- Every data cube will be a result of group by queries
- One problem: missing/bad data points
- We need trends and projections!
SELECT: order of clauses (as evaluated)
1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. UNION/MINUS
6. INTERSECT
7. ORDER BY
8. INTO
BASIC PROBLEMS
- Functions: in the selection list
- Order by, group by: always executed after functions, so we might need sub-queries
- ROWNUM s*cks (later...)
Solution: special functions that can work together with the ordering / grouping of records
RANK FUNCTIONS
SELECT ROW_NUMBER() OVER (ORDER BY ENAME ASC) AS RNUM, ENAME FROM EMP;
Simple rank functions:
- RANK(): 1, 2, 2, 4
- DENSE_RANK(): 1, 2, 2, 3
- PERCENT_RANK(): percentage, [0..1]
NO PARAMETERS!
LET'S TRY THOSE…
SELECT ename, sal, RANK() over (ORDER BY sal desc) FROM emp;
+ DENSE_RANK(), PERCENT_RANK()
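The three rank functions can be compared side by side on a table with one tie. This is a minimal sketch, assuming SQLite 3.25 or newer (for window function support) and made-up salaries; the same OVER clauses are expected to work in Oracle and MySQL 8.

```python
import sqlite3

# Toy rows (hypothetical): two employees share the same salary.
rows = [("A", 5000), ("B", 3000), ("C", 3000), ("D", 1000)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?)", rows)

result = con.execute(
    "SELECT ename, sal, "
    "  RANK() OVER (ORDER BY sal DESC), "
    "  DENSE_RANK() OVER (ORDER BY sal DESC), "
    "  PERCENT_RANK() OVER (ORDER BY sal DESC) "
    "FROM emp ORDER BY sal DESC, ename"
).fetchall()

ranks = [r[2] for r in result]
dense_ranks = [r[3] for r in result]
percent_ranks = [r[4] for r in result]
print(ranks, dense_ranks, percent_ranks)
```

RANK leaves a gap after the tie, DENSE_RANK does not, and PERCENT_RANK rescales the rank to the range from 0 to 1 as (rank - 1) / (rows - 1).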
RANK WITHIN A GROUP
SELECT deptno, ename, sal,
  RANK() OVER ( PARTITION BY deptno ORDER BY sal ) as RANG
FROM emp;
RANK WITHIN A GROUP
SELECT deptno, job, ename, sal,
  RANK() OVER ( PARTITION BY deptno, job ORDER BY sal ) as RANG
FROM emp;
+ ORDER BY …
GROUPING FUNCTIONS WITH ANALYTIC CLAUSES
SELECT ename, sal, SUM(SAL) OVER (order by sal) as MySAL FROM emp;
Ordered list!
SELECT ename, sal, AVG(SAL) OVER (order by sal) as MySAL FROM emp;
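An aggregate with OVER (ORDER BY ...) produces a running value, and with the default window, records that tie on the ordering field receive the same result. The sketch below, assuming SQLite 3.25 or newer and made-up rows, contrasts the default window with an explicit ROWS window.

```python
import sqlite3

# Toy rows (hypothetical): B and C tie on salary.
rows = [("A", 1000), ("B", 2000), ("C", 2000), ("D", 4000)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?)", rows)

result = con.execute(
    "SELECT ename, sal, "
    # Default window: everything up to and including the ties of this row.
    "  SUM(sal) OVER (ORDER BY sal), "
    # Explicit ROWS window: strictly record by record.
    "  SUM(sal) OVER (ORDER BY sal, ename "
    "                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) "
    "FROM emp ORDER BY sal, ename"
).fetchall()

default_sums = [r[2] for r in result]
rows_sums = [r[3] for r in result]
print(default_sums)
print(rows_sums)
```

If a strictly cumulative total is wanted even with ties, the ROWS form with a unique ordering is the safer choice.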
GROUPING FUNCTIONS WITH ANALYTIC CLAUSES
SELECT deptno, ename, sal,
  SUM(SAL) OVER ( partition by deptno order by ename ) as MySum
FROM emp
ORDER BY deptno, ename;
GROUPING FUNCTIONS WITH ANALYTIC CLAUSES
alter session set nls_date_format='YYYY-MM-DD';
select ename, hiredate, sal from emp order by hiredate;
select ename, hiredate, sal, sum(sal) over (order by hiredate) as TOTAL from emp order by hiredate;
select ename, hiredate, sal, sum(sal) over (partition by to_char(hiredate, 'YYYY') order by hiredate) as TOTAL from emp order by hiredate;
SUBSET (Sliding window)
SELECT ename, sal,
  avg(SAL) OVER ( order by sal rows between 1 preceding and 2 following ) as MyAvg
FROM emp;
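The sliding window can be traced by hand on a short list. This is a minimal sketch, assuming SQLite 3.25 or newer and five made-up, distinct salaries; each average covers the previous record, the current record, and the next two records, as far as they exist.

```python
import sqlite3

# Toy rows (hypothetical): distinct salaries so the ordering is unambiguous.
rows = [("A", 1000), ("B", 2000), ("C", 3000), ("D", 4000), ("E", 5000)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?)", rows)

result = con.execute(
    "SELECT ename, sal, "
    "  AVG(sal) OVER (ORDER BY sal "
    "                 ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) AS MyAvg "
    "FROM emp ORDER BY sal"
).fetchall()

window_avgs = [r[2] for r in result]
print(window_avgs)
```

At the edges the window is simply truncated: the first record has no preceding record and the last one has no following records.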
SUBSET (Sliding window)
SELECT deptno, ename, sal,
  sum(SAL) OVER ( partition by deptno order by sal rows between 0 preceding and 1 following ) as MySum
FROM emp;
SUBSET (Sliding window)
We can use the RANGE keyword
SELECT deptno, ename, sal,
  sum(SAL) OVER ( order by sal range between current row and unbounded following ) as MySum
FROM emp;
OTHER ANALYTICAL FUNCTIONS
FIRST_VALUE(), LAST_VALUE()
RATIO_TO_REPORT(): ratio compared to the sum value
SELECT ename, sal, RATIO_TO_REPORT(sal) OVER () FROM emp ORDER BY sal desc;
+ PARTITION BY
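RATIO_TO_REPORT is an Oracle function; where it is missing, the same ratio can be written as the value divided by a windowed SUM. The sketch below assumes SQLite 3.25 or newer and made-up salaries, and emulates the function rather than calling it.

```python
import sqlite3

# Toy rows (hypothetical): salaries that sum to 10000.
rows = [("A", 1000), ("B", 3000), ("C", 6000)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?)", rows)

# sal * 1.0 forces floating-point division; OVER () spans the whole table.
result = con.execute(
    "SELECT ename, sal, sal * 1.0 / SUM(sal) OVER () AS ratio "
    "FROM emp ORDER BY sal DESC"
).fetchall()

ratios = [r[2] for r in result]
print(result)
```

Adding PARTITION BY inside OVER () would turn this into a ratio within each group instead of within the whole table.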