User-Defined Aggregates for Advanced Database Applications Haixun Wang Computer Science Dept. University of California, Los Angeles.

User-Defined Aggregates for Advanced Database Applications Haixun Wang hxwang@cs.ucla.edu Computer Science Dept. University of California, Los Angeles

2 State of the Art: the big picture Relational Databases: Single data model of spartan simplicity Logic-based database languages Commercial success and dominance through SQL standards (in the 80’s) A new wave of DB applications (in the 90’s) Knowledge discovery, Data Mining, OLAP, Time Series analysis, Spatial/Temporal DBs, XML, … Underscores limitations of RDBMS Prompted major extensions leading to SQL3 standards (SQL99 is a subset of SQL3)

3 State of the Art: R&D highlights Deductive DBs Rule based syntax Logic based formalization of query language semantics – e.g., nonmonotonic reasoning and stratification Recursive queries OO-DBs Complex data types / inheritance Expressive power by merging PL and query languages OR-DBs Rich data types / Path Expression (SQL) UDFs and Database Extenders (Data Blades)

4 State of the Art: the seamy side A patchwork of major extensions DBMSs have become more powerful but much hard and complex to build and use Still not powerful enough Data Mining: clustering, classification, association New language constructs not helping either Limited expressive power in other applications Bill of Materials (BoM) type of applications Temporal reasoning

5 This thesis This thesis: Many of the problems can be solved by UDAs User Defined Aggregates (UDAs): insufficient support in commercial world and DB standards Our claim: UDAs provide a more general and powerful mechanism for DB extensions AXL – a system to make it easier to define UDAs AXL – where SQL and Data Mining intersect

6 A Brief History of AXL (and of my thesis) Logic formalization of aggregates [DDLP’98, LBAI’00] Early returns, monotonic aggregates: used freely in recursive queries. Extensions of LDL++ (Logic Database Language) SQL-AG: Implementing and extending SQL3 UDAs [DBPL’99] Implemented on DB2 using extended user-defined aggregates with ‘early returns’ SADL: Simple Aggregate Definition Language [ICDE’00] using SQL to define new aggregates easy to use, but with limited performance and power AXL: Aggregate eXtension Language [VLDB’00] powerful, efficient and still SQL-based

7 Defining UDAs in SQL3 AGGREGATE FUNCTION myavg(val NUMBER) RETURN NUMBER STATE state INITIALIZE myavg_init ITERATE myavg_iterate TERMINATE myavg_return INITIALIZE: gives an initial value to the aggregate ITERATE : computes the intermediate aggregate value for each new record TERMINATE : returns the final value computed for the aggregate myavg_init, myavg_iterate, myavg_return are 3 functions that the user must write in a procedural programming language

8 Limitation of SQL3 UDAs UDAs in SQL3, Postgres, Informix, and early versions of LDL ++ share the same limitations: Aggregates can not be used inside recursion No support for early returns and on-line aggregation Also Ease of Use is a major issue (except for LDL++)

9 Ease of Use T HE P ROBLEM : UDFs are very hard to write and debug. In “unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem. A S OLUTION : Use a high-level language for defining UDAs. But there are many potential problems with any new language. T HE I DEAL S OLUTION : use SQL to define new aggregates. Substantial benefits: Users are already familiar with SQL No impedance mismatch of data types and programming paradigms DB advantages: scalability, data independence, optimizability, parallelizability

10 Simple Aggregates AGGREGATE avg(value INT) : REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state (value, 1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; }

11 Avoiding Multiple Scans Show the average salary of senior managers who make 3 times more than the average employees. SQL: SELECT avg(salary) FROM employee WHERE title = ‘senior manager’ AND salary > 3 * (SELECT avg(salary) FROM employee) Two scans of the employee table required With AXL UDAs: SELECT sscan(title, salary) FROM employee

12 AXL: Using a Single Scan AGGREGATE sscan(title CHAR(20), salary INT) : REAL { TABLE state(sum INT, cnt INT) AS VALUES (0,0); TABLE seniors(salary INT); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+salary, cnt=cnt+1; INSERT INTO seniors VALUES(salary) WHERE title = ‘senior manager’; } TERMINATE: { INSERT INTO RETURN SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > 3 * (SELECT sum/cnt FROM state); }

13 Ordered Sequences and Time Series We have a sequence of events, each of which is active during a certain interval (from, end). Find out at which point of time we have the largest number of active events. SQL: Group-by on the start time of each interval, and count! With AXL UDAs: SELECT density(from, end) FROM events

14 AXL: Using a Single Scan AGGREGATE density(start TIME, end TIME) : (time TIME, count INT) {TABLE state(time TIME, count INT) AS (0, 0); TABLE active(endpoint TIME); INITIALIZE: ITERATE: { DELETE FROM active WHERE endpoint < start; INSERT INTO active VALUES(end); UPDATE state SET time=start, count =count + 1 WHERE count < (SELECT count(*) FROM active); } TERMINATE: { INSERT INTO RETURN SELECT time, count FROM state; }

15 Define UDA based on other UDAs Instead of using INITIALIZE/ITERATE/TERMINATE structure, we can also build new UDAs on top of UDAs already defined. ORDER BY, … AGGREGATE sortdensity(from TIME, to TIME): (time TIME, count INT) { ORDER BY from ASC; INSERT INTO RETURN SELECT density(from, to); }

16 Early Returns AVG normally converges early: an early approximation is all is needed in several applications Online aggregation means that early returns are produced during the computation Many applications: e.g., find the local max and min in a sequence of values, various temporal aggregates, rollups, etc. These might depend on the order – same as new OLAP extensions in SQL3

17 Return avg for Every 100 Records AGGREGATE olavg(value INT): REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state VALUES (value,1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; INSERT INTO RETURN SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; }

18 Temporal Coalescing AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME) { TABLE state(cFrom TIME, cTo TIME); INITIALIZE: { INSERT INTO state VALUES (from, to); } ITERATE: { UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to; INSERT INTO RETURN SELECT cFrom, cTo FROM state WHERE cTo < from; UPDATE state SET cFrom = from, cTo = to WHERE cTo < from; } TERMINATE: { INSERT INTO RETURN SELECT cFrom, cTo FROM state; }

19 Recursive Aggregates In AXL, aggregates can call other aggregates. Particularly, an aggregate can call itself recursively. AGGREGATE alldesc(P CHAR(20)): CHAR(20) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(P); INSERT INTO RETURN SELECT alldesc(Child) FROM children WHERE Parent = P; } Find all the descendents of Tom: SELECT alldesc(Child) FROM children WHERE Parent = ‘Tom’;

20 Check Point Simple applications: AXL UDAs provide a solution with better performance and good ease of use. Many advanced database applications Time Series, Temporal Database, Spatial Database… In particular data mining applications

21 Data Mining and Database Systems Current Approach: Cursor-based: loose-coupling, stored procedures UDFs: ease of use problems Cache-Mine: Using DBMSs as containers of data Many attempts to closely integrate data mining functions into DBMS have shown major problems

What we need … SQL-aware Data Mining Systems Surajit Chaudhuri “Data Mining and Database Systems: Where is the Intersection?” IEEE Data Engineering Bulletin, 1998

23 AXL: Where SQL and Data Mining Intersects SQL-based language Mining data created on the fly Strong expressive power Better performance

24 Bayesian Classifiers Bayesian Classifier computes a series of probabilities:

25 Decision Tree Classifiers Training set: tennis Stream of Column/Value Pairs (together with RecId and Category)

26 Convert training set to column/value pairs AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT) : (col INT, val INT, YorN INT) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn); } CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT) SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis) FROM tennis; SELECT classify(recId, col, val, YorN) FROM col-val-pairs;

Categorical Classifier in AXL [ 1] AGGREGATE classify(RecId INT, iNode INT, iCol INT, iValue REAL, iYorN INT) [ 2] {TABLE treenodes(RecId INT,Node INT, Col INT, Val REAL, YorN INT); [ 3] TABLE summary(Col INT, Value INT, Yc INT, Nc INT, KEY {Col, Value}); [ 4] TABLE mincol(Col INT, MinGini REAL); [ 5]TABLE ginitable(Col INT, Gini REAL); [ 6]INITIALIZE: ITERATE: { [ 7] INSERT INTO treenodes VALUES (RecId, iNode, iCol, iValue, iYorN); [ 8] UPDATE summary [ 9] SET Yc=Yc+iYorN, Nc=Nc+1-iYorN WHERE Col = iCol AND Value = iValue; [10] INSERT INTO summary SELECT iCol, iValue, iYorN, 1-iYorN WHERE SQLCDE=0; [11] } [12]TERMINATE: { [13] INSERT INTO ginitable SELECT Col, sum((Yc*Nc)/(Yc+Nc))/sum(Yc+Nc) [14] FROM summary GROUP BY Col; [15] INSERT INTO mincol SELECT minpointvalue(Col, Gini) FROM ginitable; [16] INSERT INTO result SELECT iNode, Col FROM mincol; [17] SELECT classify(t.RecId, t.Node*MAXVALUE+m.Value+1, t.Col, t.Value, t.YorN) [18] FROM treenodes AS t, [19] (SELECT tt.RecId RecId, tt.Value Value FROM treenodes tt, mincol m [20] WHERE tt.Col=m.Col AND m.MinGini>0 ) AS m [21] WHERE t.RecId = m.RecId GROUP BY m.Value; [22] } [23] }

28 Performance SPRINT Algorithm: AXL vs. CCategorical Classifier: AXL vs. C

SPRINT Algorithm in AXL [ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT) [ 2] {TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value)); [ 3] TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT); [ 4] TABLE split(Rec INT, LeftOrRight INT, KEY (RecId)); [ 5] TABLE mincol(Col INT, Val REAL, Gini REAL); [ 6]TABLE node(Node INT) AS VALUES(iNode); [ 7]INITIALIZE: ITERATE: { [ 8] INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN); [ 9] UPDATE summary [10] SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal) [11] WHERE Col=iCol; [12] } [13]TERMINATE: { [14] INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary; [15] INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n; [16] INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m [17] WHERE t.Col = m.Col AND m.Gini > 0; [18] SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN) [19] FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec [20] GROUP BY s.LeftOrRight; [21] } [22] }

30 Comparison with Other Architectures Datasets with 6.6 million records, support level.25%.

31 Implementation of AXL.axl.cc.exe Berkeley DB.axl.cc UDFs DB2.lib SQL DB2 Standalone Mode DB2 Add-on Mode

32 Implementation of AXL Open interface of physical data model. Currently using Berkeley DB as our storage manager In memory tables Limited Optimization Using B + -Tree indexes to support equality/range query Predicate push-down / push-up User Defined Aggregates Hash based Return multiple rows: ‘early return’ Return multiple columns: employee’s name and salary

33 Implementation of AXL (cont’d) Non-blocking aggregation Keeping the state of aggregation between calls to the aggregate routines Local tables defined inside aggregation are passed as parameters to the aggregates Explicit sorting (and implicit hash-based aggregation) AXL V1.2: above 30,000 lines of code

34 Check Point Simple applications: AXL UDAs provide a solution with better performance and good ease of use. Data Mining applications Formal Semantics of Aggregates and Monotonic Aggregation

35 Aggregates in Recursion Stratification: shaves(barber, X) :- villager(X),  shaves(X, X). villager(barber). Aggregates:  p  count(p) =0 Aggregates in many applications are actually monotonic (and should be allowed inside recursion).

36 Beyond Stratification Significant previous efforts… I. S. Mumick, H. Pirahesh and R. Ramakrishnan, “The magic of duplicates and aggregates”, VLDB 1990 A. Van Gelder, “Foundations of aggregates in deductive databases”, DOOD 1993 K. A. Ross and Y. Sagiv, “Monotonic aggregates in deductive databases”, JCSS 1997 S. Greco and C. Zaniolo, “Greedy algorithms in Datalog with choice and negation”, JICSLP 1998

37 Formal Semantics of Aggregates choice((X),(Y)) Enforcing functional dependency. FD: X->Y Multiplicity of stable models, monotonic transformation Ordering a domain Once (X,Y) is generated, choice ensures this is the only arc leaving source node X and entering sink node Y Formal semantics of UDA … return(Y,V) :- ordered(X,Y),  ordered(Y,_), terminate(Y,V). …

38 Early Returns  Monotonic Aggregates AGGREGATE mcount(): INT { TABLE state(cnt INT) AS VALUES (0); INITIALIZE: ITERATE: { UPDATE state SET cnt=cnt+1; INSERT INTO RETURN SELECT cnt FROM state; } Aggregates with only ‘early returns’ and no ‘final returns’ are monotonic w.r.t. set containment:

39 Early Returns  Monotonic Aggregates SELECT mcount(*) FROM employee; v.s. SELECT count(*) FROM employee; mcount{John, Mary, Tom}  {1,2,3} count{John, Mary, Tom}  3 mcount{John, Mary, Tom, Jerry}  {1,2,3,4} count{John, Mary, Tom, Jerry}  4

40 Return sum at the n th value AGGREGATE sumat(value INT, n INT): INT { TABLE state (sum INT, cnt INT) AS VALUES (0,0); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt +1; INSERT INTO RETURN SELECT sum FROM state WHERE cnt = n; }

41 Extended UDAs in DBMSs A logic-based formalization of aggregates Simple characterization of monotonic aggregates Monotonic aggregates can be used freely inside recursion BoM, Company Control, Join-the-Party… Greedy optimization algorithms, such as Dijkstra’s single source shortest path.

42 Monotonic Aggregation Monotonic aggregates can be used without any restriction and without changing the underlying implementation. This solves the problem that had eluded database researchers since the introduction of relational systems: BoM, Company Control, Join-the-Party… Greedy optimization algorithms, such as Dijkstra’s single source shortest path.

43 Join-the-Party Problem Some people will come to the party no matter what, and their names are stored in a sure(PName) relation. But many other people will join only after they know that at least K=3 of their friends will be there. WITH wllcm(Name) AS ((SELECT Pname FROM sure) UNION ALL (SELECT f.Pname FROM friend AS f, wllcm AS w WHERE w.Name =f.Fname GROUP BY f.Pname HAVING mcount()=3)) SELECT Name FROM wllcm; Density-based Clustering [M. Ester et al. KDD 96]

44 BoM: Cost of Parts 001 002 005 003 006 004 1 5 1 100 10 5 23 8 assembly part-cost fan-out 3

45 BoM: Using AXL WITH cst(part, cost) AS ((SELECT part, cost FROM part-cost) UNION ALL (SELECT a.part, sumat(a.qty * c.cost, p.ChC) FROM assembly AS a, cst AS c, fan-out AS p WHERE a.subpart = c.part AND p.part = a.part GROUP BY a.part)) SELECT part, cost FROM cst; Bottom up solution. Computes the cost for each part once and only once. Monotonic sumat(cost, n) returns sum when exactly n items are aggregated. Works in DB2 after AXL rewrites callings of sumat() to callings of automatically generated UDFs.

46 BoM: Using Recursive SQL WITH mpath(subpart, qty) AS ((SELECT subpart, qty FROM assembly WHERE part = ‘001’) UNION ALL (SELECT c.subpart, m.qty * c.qty FROM mpath m, assembly c WHERE m.subpart = c.part)) SELECT sum(m.qty * c.cost) FROM mpath m, part_cost c WHERE m.subpart = c.part ; Top down solution: computes the cost of part ‘001’ Explosion: all edges that descend from part ‘001’ are “stored” in mpath What if we want to compute the cost of each part?

47 Check Point Simple applications Data Mining and Decision Support Formal Semantics & Monotonic aggregates OLAP and other aggregate extensions SUCH THAT CUBE, ROLLUP, GROUPING SET OLAP Functions

48 “Such That” For each division, show the average salary of senior managers who make 3 times more than the average employees, and the average salary of senior engineers who make 2 times more than the average employees (in the same output record). D. Chatziantoniou, Kenneth Ross, VLDB 1996 SELECT division, avg(X.salary), avg(Y.salary) FROM employee GROUP BY division: X, Y SUCH THAT X.title = ‘senior manager’ AND X.salary > 3*avg(salary) AND Y.title = ‘senior engineer’ AND Y.salary > 2*avg(salary)

49 Expressing “Such That” in AXL TABLE seniors(salary INT); AGGREGATE sscan2(title CHAR(20), salary INT, qtitle CHAR(20), ratio INT): REAL { TABLE state(sum INT, cnt INT) AS VALUES (0,0); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+salary, cnt=cnt+1; INSERT INTO seniors VALUES (salary) WHERE title = qtitle; } TERMINATE: { SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > ratio * (SELECT sum/cnt FROM state); }

50 Using UDA sscan2 SELECTdivision, sscan2(title, salary, ‘senior manager’, 3), sscan2(title, salary, ‘senior engineer’, 2) FROM employee GROUP BY division; No joins or sub-queries required. One pass through the employee relation (standard SQL requires at least 2 passes).

51 Other Aggregate Extensions GROUPING SETS, ROLL-UP, CUBE New OLAP extensions Windows containing a partitioning, an ordering of rows and an aggregate group “… every standard must be prepared to tackle new issues that arise as the market evolves. If SQL does not respond positively to this challenge, SQL risks becoming irrelevant …” -- F. Zemke, K. Kulkarni, A. Witkowski, B. Lyle Introduction to OLAP Functions

52 Building New UDAs Based On UDAs Already Defined AGGREGATE gssum(territory CHAR(10), product CHAR(10), sum INT): (territory CHAR(10), product CHAR(10), sum INT) { SELECT territory, ‘-all-’, sum(sales) GROUP BY territory; SELECT ‘-all-’, ‘-all-’, sum(sales); SELECT ‘-all-’, product, sum(sales) GROUP BY product; } Equivalent to the GROUPING SET construct: SELECT territory, product, sum(sales) FROM data GROUP BY GROUPING SET ((territory), (product), ());

53 Ordering Input Stream Some aggregates require that the input stream ordered: AGGREGATE gssum(territory CHAR(10), product CHAR(10), sum INT): (territory CHAR(10), product CHAR(10), sum INT) { SELECT ‘-all-’, product, sum(sales) GROUP BY product; SELECT ‘-all-’, ‘-all-’, sum(sales); SORT BY territory ASC; SELECT territory, ‘-all-’, sum(sales) GROUP BY territory; } We can also use SORT BY GROUPBY to inform the compiler that all the group by columns to the aggregate should be ordered.

54 OLAP Functions An OLAP function is defined using a window, which may specify a partitioning, an ordering of rows within partitions, and an aggregate group. SELECT d.product, d.territory d.sales / sum(d.sales) OVER Whole d.sales / sum(d.sales) OVER By_product, d.sales / sum(d.sales) OVER By_territory FROMdata AS d WHERE year = 1999 WINDOW Whole AS (), By_product AS (PARTITION BY d.product), By_territory AS (PARTITION BY d.territory)

55 Rewrite OLAP Functions in AXL AGGREGATE mwin(territory CHAR (10), product CHAR(10), sales INT) : (territory CHAR(10), product CHAR(10), p0 REAL, p1 REAL, p2 REAL) { TABLE summary (territory CHAR(10), product CHAR(10), sum INT) AS (SELECT gssum(territory, product, sales) FROM data); INITIALIZE: ITERATE: { INSERT INTO RETURN SELECT territory, product, sales/s1.sales, sales/s2.sales, sales/s3.sales FROM summary AS s1, s2, s3 WHERE s1.territory = territory AND s1.produce = ‘-all-’ AND s2.territory = ‘-all-’ AND s2.product = product AND s3.territory = ‘-all-’ AND s3.product = ‘-all-’; }

56 Contributions Adding extended UDAs to O-R systems high level language, minimal additions to SQL monotonic aggregates, recursive aggregates designed for general purpose applications Tightly couple data mining functions with DBMS SPRINT Algorithm, Categorical Classifier, … Performance More efficient than cursor-based languages like PL/SQL, JDBC and UDF-based approaches …

57 Future Directions Parallelization Extenders/Data Blades build on top of UDAs instead of UDFs Decision Support the Apriori algorithm Windows and OLAP functions Spatial/Temporal extensions the TENOR system

58 Future Direction: Parallelization Current parallel aggregation algorithms valid for AXL: Inter-Partition parallelism: all tuples of the same group-by value are in one node Two phase algorithm: user provides a COMBINE routine Unlike SQL3, AXL’s aggregate routines are written in SQL, so we can apply traditional query parallelization techniques to INITIALIZE, ITERATE, and TERMINATE Since aggregate routines are written in SQL, the COMBINE routine can be generated automatically by the system for simple UDAs

59 X SQL + X = Data Mining X X = Procedural Extension, e.g. PL/SQL X X = C/C++ = … = … A minimalist’s point of view: X X = extended User Defined Aggregates

60 Conclusion UDAs are more useful than UDFs in many applications AXL is an SQL-like language for developing UDAs AXL is a powerful database language more efficient than cursor-based languages like PL/SQL, JDBC, and etc. AXL is suitable for Decision Support applications Future work: optimization, parallelizability and O- R extensions

User-Defined Aggregates for Advanced Database Applications Haixun Wang Computer Science Dept. University of California, Los Angeles.

Similar presentations

Presentation on theme: "User-Defined Aggregates for Advanced Database Applications Haixun Wang Computer Science Dept. University of California, Los Angeles."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

User-Defined Aggregates for Advanced Database Applications Haixun Wang Computer Science Dept. University of California, Los Angeles.

Similar presentations

Presentation on theme: "User-Defined Aggregates for Advanced Database Applications Haixun Wang Computer Science Dept. University of California, Los Angeles."— Presentation transcript:

Similar presentations

About project

Feedback