Haixun Wang, Carlo Zaniolo Computer Science Dept.


Using SQL to Build User-Defined Aggregates and Extenders for O-R Systems
Haixun Wang, Carlo Zaniolo
drwang@us.ibm.com, zaniolo@cs.ucla.edu
Computer Science Dept., University of California, Los Angeles

State of the Art
DBMSs are struggling to keep up with new applications
    multimedia, time series, spatial/temporal DBs
    OLAP, data mining
New language constructs (based on aggregates)
    Grouping Sets, Rollup, Cube, OLAP Functions, …
UDFs: the main extension mechanism
    Data Blades for time series, spatial, XML, etc.
    Very hard to use
    No aggregate functions supported

User-Defined Aggregates (UDAs)
Not in SQL99, though present in earlier SQL3 drafts and supported in Informix (and others)
SQL3 UDAs suffer from serious limitations and ease-of-use problems
Claim: we have a great solution for the UDA problem
    Our UDAs are better than UDFs for extending DBs
    AXL: a system that makes it easy to define UDAs
    AXL can be used to do data mining in SQL

UDAs in SQL3

    AGGREGATE FUNCTION myavg(val NUMBER)
        RETURN NUMBER
        STATE state
        INITIALIZE myavg_init
        ITERATE myavg_iterate
        TERMINATE myavg_return

INITIALIZE: gives an initial value to the aggregate
ITERATE: computes the intermediate value for each new record
TERMINATE: returns the final value computed for the aggregate
myavg_init, myavg_iterate, and myavg_return are three functions that the user must write in a procedural programming language
Online aggregation would require another function for early returns
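The INITIALIZE / ITERATE / TERMINATE contract can be tried out in any engine that accepts procedural aggregates. The following is a minimal sketch, assuming Python's standard sqlite3 module as a stand-in engine rather than an SQL3 system: the class constructor plays the role of INITIALIZE, `step` plays ITERATE, and `finalize` plays TERMINATE. The `employee` table, its rows, and the `demo` helper are hypothetical.

```python
import sqlite3


class MyAvg:
    """User-defined average following the initialize/iterate/terminate pattern."""

    def __init__(self):
        # INITIALIZE: set up the aggregate state.
        self.total = 0.0
        self.count = 0

    def step(self, value):
        # ITERATE: fold one new record into the state (NULLs are skipped).
        if value is not None:
            self.total += value
            self.count += 1

    def finalize(self):
        # TERMINATE: return the final value (NULL for an empty input).
        return self.total / self.count if self.count else None


def demo():
    """Register the aggregate and run it over a small hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("myavg", 1, MyAvg)
    conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [("ann", 100), ("bob", 200), ("cid", 300)],
    )
    (result,) = conn.execute("SELECT myavg(salary) FROM employee").fetchone()
    conn.close()
    return result
```

As with the SQL3 design, this interface produces a single value at the end of the scan; it offers no hook for early returns.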

Limitations of SQL3 UDAs
Cannot define online aggregation or rollups
    Aggregation is treated as a function from a set (or multiset) to a single value
Aggregates cannot be used inside recursion (the nonmonotonicity curse)
Ease of use is a major issue

Ease of Use
THE PROBLEM: UDFs are very hard to write and debug. In “unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem.
A SOLUTION: Use a high-level language for defining UDAs. But who wants a new DB language?
THE IDEAL SOLUTION: Use SQL to define new aggregates. Substantial benefits:
    Users are already familiar with SQL
    No impedance mismatch of data types and programming paradigms
    DB advantages: scalability, data independence, optimizability, parallelizability
But how far can we take SQL?

AXL by Example: average

    AGGREGATE avg(value INT) : REAL
    {   TABLE state(sum INT, cnt INT);
        INITIALIZE: {
            INSERT INTO state VALUES (value, 1);
        }
        ITERATE: {
            UPDATE state SET sum=sum+value, cnt=cnt+1;
        }
        TERMINATE: {
            INSERT INTO RETURN SELECT sum/cnt FROM state;
        }
    }

Second Example
Show the average salary of senior managers who make more than 3 times the average employee salary.
SQL:

    SELECT avg(salary)
    FROM employee
    WHERE title = 'senior manager'
      AND salary > 3 * (SELECT avg(salary) FROM employee)

Two scans of the employee table are required.
With AXL UDAs:

    SELECT sscan(title, salary) FROM employee;

AXL: Using a Single Scan

    AGGREGATE sscan(title CHAR(20), salary INT) : REAL
    {   TABLE state(sum INT, cnt INT) AS VALUES (0,0);
        TABLE seniors(salary INT);
        INITIALIZE: ITERATE: {
            UPDATE state SET sum=sum+salary, cnt=cnt+1;
            INSERT INTO seniors VALUES(salary)
                WHERE title = 'senior manager';
        }
        TERMINATE: {
            INSERT INTO RETURN
                SELECT avg(s.salary) FROM seniors AS s
                WHERE s.salary > 3 * (SELECT sum/cnt FROM state);
        }
    }
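The same single-scan idea can be written in a general-purpose language. This is a sketch only, not the AXL runtime: the function name `senior_avg_single_scan` and the `factor` parameter are hypothetical, and the `seniors` list plays the role of the local `seniors` table.

```python
def senior_avg_single_scan(rows, factor=3):
    """Average salary of senior managers earning more than `factor` times
    the overall average, reading `rows` (title, salary) exactly once."""
    total = 0        # mirrors state.sum
    count = 0        # mirrors state.cnt
    seniors = []     # mirrors the local table seniors(salary)

    for title, salary in rows:          # the only pass over the input
        total += salary
        count += 1
        if title == "senior manager":
            seniors.append(salary)

    if count == 0:
        return None                     # empty input: no result

    # TERMINATE step: only the (usually small) seniors list is revisited.
    threshold = factor * total / count
    qualifying = [s for s in seniors if s > threshold]
    return sum(qualifying) / len(qualifying) if qualifying else None
```

Because the input is consumed once, `rows` may be a generator or a cursor rather than a list; the second pass touches only the buffered senior-manager salaries.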

Early Returns
AVG normally converges early: an early approximation is all that is needed in several applications
Online aggregation means that early returns are produced during the computation
Early returns are useful in many other computations, for instance to find the local max and min in a sequence of values, and in various temporal aggregates

Return avg for Every 100 Records

    AGGREGATE olavg(value INT): REAL
    {   TABLE state(sum INT, cnt INT);
        INITIALIZE: {
            INSERT INTO state VALUES (value,1);
        }
        ITERATE: {
            UPDATE state SET sum=sum+value, cnt=cnt+1;
            INSERT INTO RETURN
                SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0;
        }
        TERMINATE: {
            INSERT INTO RETURN SELECT sum/cnt FROM state;
        }
    }
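Early returns map naturally onto a generator: each intermediate result is handed to the consumer while the scan continues. A minimal sketch, assuming a hypothetical function `online_avg` with a hypothetical `every` parameter in place of the hard-coded 100:

```python
def online_avg(values, every=100):
    """Yield the running average after every `every` records,
    then the final average once the input is exhausted."""
    total = 0
    count = 0
    for v in values:
        total += v
        count += 1
        if count % every == 0:
            yield total / count     # early return (ITERATE)
    if count:
        yield total / count         # final return (TERMINATE)
```

A consumer can stop iterating as soon as the running average is accurate enough for its purpose, which is the point of online aggregation.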

Temporal Coalescing

    AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME)
    {   TABLE state(cFrom TIME, cTo TIME);
        INITIALIZE: {
            INSERT INTO state VALUES (from, to);
        }
        ITERATE: {
            UPDATE state SET cTo = to
                WHERE cTo >= from AND cTo < to;
            INSERT INTO RETURN
                SELECT cFrom, cTo FROM state WHERE cTo < from;
            UPDATE state SET cFrom = from, cTo = to
                WHERE cTo < from;
        }
        TERMINATE: {
            INSERT INTO RETURN SELECT cFrom, cTo FROM state;
        }
    }
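The coalescing logic keeps one open period and either extends it or emits it and starts a new one. The sketch below follows the same three cases; it assumes, as the aggregate appears to, that input periods arrive sorted by start time. The function name `coalesce_periods` is hypothetical.

```python
def coalesce_periods(periods):
    """Merge overlapping or adjacent (start, end) periods.
    Assumes `periods` is sorted by start; yields merged periods."""
    it = iter(periods)
    try:
        c_from, c_to = next(it)         # INITIALIZE: first period opens the state
    except StopIteration:
        return                          # empty input: nothing to emit

    for start, end in it:               # ITERATE
        if c_to < start:
            # Gap: emit the finished period and start a new one.
            yield (c_from, c_to)
            c_from, c_to = start, end
        elif c_to < end:
            # Overlap or adjacency: extend the current period.
            c_to = end
        # Otherwise the new period is contained in the current one.

    yield (c_from, c_to)                # TERMINATE: emit the last open period
```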

Recursive Aggregates
In AXL, aggregates can call other aggregates. In particular, an aggregate can call itself recursively.

    AGGREGATE alldesc(P CHAR(20)): CHAR(20)
    {   INITIALIZE: ITERATE: {
            INSERT INTO RETURN VALUES(P);
            SELECT alldesc(Child) FROM children WHERE Parent = P;
        }
    }

Find all the descendants of Tom:

    SELECT alldesc(Child) FROM children WHERE Parent = 'Tom';
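The recursion pattern is: return each input value, then apply the same computation to that value's children. A sketch under stated assumptions: the `children` relation is represented as a hypothetical dict from parent to list of children, and the hierarchy is assumed to be acyclic (a cycle would recurse forever).

```python
def all_descendants(children, parent):
    """Yield every descendant of `parent`.
    `children` maps a parent name to a list of child names."""
    for child in children.get(parent, []):
        yield child                                  # INSERT INTO RETURN VALUES(P)
        yield from all_descendants(children, child)  # recursive call on P's children
```

For example, `list(all_descendants({"Tom": ["Ann"], "Ann": ["Cat"]}, "Tom"))` walks Tom's subtree depth-first.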

AXL: Where SQL and Data Mining Intersect
Loosely coupled (cache mining): current data mining applications have only a loose connection with databases
Tightly coupled: UDF-based data mining functions
Ideal: AXL powered by recursive aggregates

Decision Tree Classifiers
Training set: tennis
Stream of Column/Value Pairs (together with RecId and Category)

Convert training set to column/value pairs

    AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT)
        : (col INT, val INT, YorN INT)
    {   INITIALIZE: ITERATE: {
            INSERT INTO RETURN
                VALUES (1,v1,yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn);
        }
    }

    CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT) AS
        SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis)
        FROM tennis;

    SELECT sprint(recId, col, val, YorN) FROM col-val-pairs;
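The conversion turns each training record into one (column, value) pair per attribute, tagged with the record id and the class label. A minimal sketch, assuming that `mcount()` numbers the records starting from 1 (the slides do not define it) and using a hypothetical function name `col_val_pairs`:

```python
def col_val_pairs(rows):
    """Turn rows of (v1, ..., vk, label) into a stream of
    (rec_id, col, val, label) tuples, one per attribute."""
    for rec_id, row in enumerate(rows, start=1):    # assumed behavior of mcount()
        *values, label = row
        for col, val in enumerate(values, start=1): # columns numbered 1..k
            yield (rec_id, col, val, label)
```

Unlike `dissemble`, which is fixed at four attributes, the sketch accepts any number of attribute columns.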

SPRINT Algorithm in AXL

    AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT)
    {   TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Val));
        TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT);
        TABLE split(Rec INT, LeftOrRight INT, KEY(Rec));
        TABLE mincol(Col INT, Val REAL, Gini REAL);
        TABLE node(Node INT) AS VALUES(iNode);
        INITIALIZE: ITERATE: {
            INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN);
            UPDATE summary
               SET Yc=Yc+iYorN, Nc=Nc+1-iYorN,
                   (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal)
             WHERE Col=iCol;
        }
        TERMINATE: {
            INSERT INTO mincol
                SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary;
            INSERT INTO result
                SELECT n.Node, m.Col, m.Val FROM mincol AS m, node AS n;
            INSERT INTO split
                SELECT t.Rec, (t.Val>m.Val) FROM treenodes AS t, mincol AS m
                WHERE t.Col = m.Col AND m.Gini > 0;
            SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN)
                FROM treenodes AS t, split AS s, node AS n
                WHERE t.Rec = s.Rec
                GROUP BY s.LeftOrRight;
        }
    }
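The slides do not show the body of `giniudf`, so the following is only a sketch of the standard Gini-index split evaluation that SPRINT-style classifiers rely on, not the authors' code. It scans one column's values in sorted order while maintaining running yes/no counts (analogous to `Yc` and `Nc`) and keeps the split value with the lowest weighted Gini index. The names `gini` and `best_split` are hypothetical.

```python
def gini(yes, no):
    """Gini index of a partition with `yes` positive and `no` negative records."""
    n = yes + no
    if n == 0:
        return 0.0
    p = yes / n
    return 1.0 - p * p - (1.0 - p) ** 2


def best_split(pairs):
    """pairs: list of (value, label) for one column, label being 1 or 0.
    Returns (split_gini, split_value) for the best test `value <= split_value`,
    or None if the column has fewer than two distinct values."""
    pairs = sorted(pairs)
    n = len(pairs)
    total_yes = sum(label for _, label in pairs)
    total_no = n - total_yes

    best = None
    yes = no = 0                       # running counts for the left partition
    for i in range(n - 1):
        value, label = pairs[i]
        yes += label
        no += 1 - label
        if value == pairs[i + 1][0]:
            continue                   # a split is only possible between distinct values
        left, right = i + 1, n - (i + 1)
        g = (left / n) * gini(yes, no) \
            + (right / n) * gini(total_yes - yes, total_no - no)
        if best is None or g < best[0]:
            best = (g, value)
    return best
```

Records with `value <= split_value` go to the left child and the rest to the right, matching the `(t.Val > m.Val)` test that fills `LeftOrRight`. A split Gini of 0 means both children are pure, which corresponds to the `m.Gini > 0` condition stopping further recursion.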

Performance
[Chart: SPRINT Algorithm, AXL vs. C]
[Chart: Categorical Classifier, AXL vs. C]

Implementation of AXL
AXL V1.2: approaching 40,000 lines of code
The AXL compiler translates AXL programs into C++ code
    DB2 add-on: emulates UDAs using UDFs
    Standalone: runs under both Win98/NT and UNIX
Open interface to the physical data model
    Currently using Berkeley DB as the storage manager
    In-memory tables
Limited optimization
    B+-tree indexes to support equality/range queries
    Predicate push-down / push-up