Using SQL to Build User-Defined Aggregates and Extenders for O-R Systems Haixun Wang, Carlo Zaniolo drwang@us.ibm.com, zaniolo@cs.ucla.edu Computer Science Dept. University of California, Los Angeles 1
State of the Art DBMSs struggling to keep up with new applications multimedia, time series, spatial/temporal DB OLAP, data mining New language constructs (based on aggregates) Grouping Sets, Rollup, Cube, OLAP Functions, … UDFs: the main extension mechanism Data Blades for time series, spatial, XML, etc. Very Hard to use No aggregate functions supported
User-Defined Aggregates (UDAs) Not in SQL99—though in earlier SQL3 drafts, and supported in Informix (and others) SQL3 UDAs suffer from serious limitations and ease-of-use problems Claim: we have a great solution for the UDA problem Our UDAs are better than UDFs for extending DBs AXL – a system to make it easy to define UDAs AXL – can be used to do Data Mining in SQL
UDAs in SQL3 AGGREGATE FUNCTION myavg(val NUMBER) RETURN NUMBER STATE state INITIALIZE myavg_init ITERATE myavg_iterate TERMINATE myavg_return INITIALIZE: gives an initial value to the aggregate ITERATE : computes the intermediate value for each new record TERMINATE: returns the final value computed for the aggregate myavg_init, myavg_iterate, myavg_return are 3 functions that the user must write in a procedural programming language On line aggregation: this would require another function for early returns
Limitation of SQL3 UDAs Cannot define on-line aggregation or rollups Aggregation as a function from a set (or multiset) to a single value Aggregates can not be used inside recursion (the nonmonotonicity curse) Ease of use is a major issue
Ease of Use THE PROBLEM: UDFs are very hard to write and debug. In “unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem. A SOLUTION: Use a high-level language for defining UDAs. But who wants a new DB language? THE IDEAL SOLUTION: Use SQL to define new aggregates. Substantial benefits: Users are already familiar with SQL No impedance mismatch of data types and programming paradigms DB advantages: scalability, data independence, optimizability, parallelizability But how far can we take SQL?
AXL by Example: average AGGREGATE avg(value INT) : REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state (value, 1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state;
Second Example Show the average salary of senior managers who make 3 times more than the average employees. SQL: SELECT avg(salary) FROM employee WHERE title = ‘senior manager’ AND salary > 3 * (SELECT avg(salary) FROM employee) Two scans of the employee table required With AXL UDAs: SELECT sscan(title, salary)
AXL: Using a Single Scan AGGREGATE sscan(title CHAR(20), salary INT) : REAL { TABLE state(sum INT, cnt INT) AS VALUES (0,0); TABLE seniors(salary INT); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+salary, cnt=cnt+1; INSERT INTO seniors VALUES(salary) WHERE title = ‘senior manager’; } TERMINATE: { INSERT INTO RETURN SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > 3 * (SELECT sum/cnt FROM state);
Early Returns AVG normally converges early: an early approximation is all is needed in several applications Online aggregation means that early returns are produced during the computation Early returns are useful in many other computations: for instance to find the local max and min in a sequence of values -- and various temporal aggregates
Return avg for Every 100 Records AGGREGATE olavg(value INT): REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state VALUES (value,1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; INSERT INTO RETURN SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } }
Temporal Coalescing AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME) { TABLE state(cFrom TIME, cTo TIME); INITIALIZE: { INSERT INTO state VALUES (from, to); } ITERATE: { UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to; INSERT INTO RETURN SELECT cFrom, cTo FROM state WHERE cTo < from; UPDATE state SET cFrom = from, cTo = to WHERE cTo < from; TERMINATE: { INSERT INTO RETURN SELECT cFrom, cTo FROM state;
Recursive Aggregates In AXL, aggregates can call other aggregates. Particularly, an aggregate can call itself recursively. AGGREGATE alldesc(P CHAR(20)): CHAR(20) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(P); SELECT alldesc(Child) FROM children WHERE Parent = P; } Find all the descendents of Tom: SELECT alldesc(Child) FROM children WHERE Parent = ‘Tom’;
AXL: Where SQL and Data Mining Intersects Loosely-coupled: Cache Mining: current data mining applications have a loose connection with databases Tightly-coupled: UDFs based data mining functions Ideal: AXL powered by recursive aggregates
Decision Tree Classifiers Training set: tennis Stream of Column/Value Pairs (together with RecId and Category)
Convert training set to column/value pairs AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT) : (col INT, val INT, YorN INT) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn);} } CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT) SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis) FROM tennis; SELECT sprint(recId, col, val, YorN) FROM col-val-pairs;
SPRINT Algorithm in AXL [ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value)); [ 3] TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT); [ 4] TABLE split(Rec INT, LeftOrRight INT, KEY (RecId)); [ 5] TABLE mincol(Col INT, Val REAL, Gini REAL); [ 6] TABLE node(Node INT) AS VALUES(iNode); [ 7] INITIALIZE: ITERATE: { [ 8] INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN); [ 9] UPDATE summary [10] SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal) [11] WHERE Col=iCol; [12] } [13] TERMINATE: { [14] INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary; [15] INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n; [16] INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m [17] WHERE t.Col = m.Col AND m.Gini > 0; [18] SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN) [19] FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec [20] GROUP BY s.LeftOrRight; [21] } [22] }
Performance SPRINT Algorithm: AXL vs. C Categorical Classifier: AXL vs. C
Implementation of AXL AXL V1.2: approaching 40,000 lines of code AXL compiler translates AXL programs into C++ code DB2 add-on: emulate UDAs using UDFs Standalone: running under both Win98/NT and UNIX Open interface of physical data model. Currently using Berkeley DB as our storage manager In memory tables Limited Optimization Using B+-Tree indexes to support equality/range query Predicate push-down / push-up