Presentation transcript:

Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems. Carlos Ordonez, Javier Garcia-Garcia, Carlos Garcia-Alvarado, Wellington Cabrera, Veera Baladandayuthapani, Shoaib Quraishi. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.

Motivation: Relational databases are a natural repository of data; most real-life data, including enterprise systems data, is stored in a DBMS. But analytical tasks are often done outside the DBMS. Drawbacks: external data mining software, data exporting, privacy issues.

Our proposal: Provide analytical algorithms as a service in the cloud, exploiting the processing power of DBMSs. DBMSs are present both in the cloud and on the client side. No external packages are required: standard SQL queries, UDFs and aggregate UDFs. A set of off-the-shelf algorithms is provided.

Challenges: Large volume of data to be transmitted. Matrix computations while the data is stored in relational format. Processing power requirements of number crunching. Data redundancy: avoid duplicating data. Minimize I/O. All data in relational format: avoid exporting data to external packages.

Advantages: The cloud system can reduce the workload on the local system, accelerate analytical processing, enforce data security, and simplify multiple model management. It is not required to install data mining software, either in the local system or in the cloud. Everything is stored in relational tables. Since the data does not go outside the DBMS, the access permissions still work, which enforces confidentiality and integrity of data.

System attributes: Smart local processing: exploit CPU/RAM of the local DBMS. Integrated: local DBMS and cloud DBMS are tightly integrated. Fast: one pass over the input table for most algorithms; parallel. Simple: the algorithms are called through a stored procedure with default parameters. Relational: relational tables store models and job parameters.

System Components: Cloud DBMS. Cloud management server: handling data mining job requests, monitoring job progress, cost estimation for 3 alternative processing modes, managing jobs. Local DBMS: stored procedures, UDFs. Web application: the user can post jobs using a web interface.

Models: PCA, K-Means, Linear Regression, Variable Selection, Naïve Bayes.

Job processing

Remarks. Hybrid mode: sufficient statistics are calculated in the local DBMS, taking advantage of local processing power and RAM; the cloud DBMS receives a summarization, so transmitting the entire data set is avoided; model computation takes place in the cloud DBMS. Cloud mode: the summarization step occurs in the cloud; for large data sets, sampling. Local mode: preferred for small data sets; summarization/sampling.

Job Scheduler: FIFO job scheduling by default. If the wait time for an individual job goes beyond a threshold ψ, the system switches to SJF. If most jobs take a long time to compute and the waiting time is beyond ψ, the system switches to Round Robin (RR). As the load decreases, the system backtracks to SJF, then FIFO.
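The policy switch described on this slide can be sketched as follows. This is a minimal illustration, not the system's scheduler: the `Job` fields, the `long_job_time` cutoff, and reading "most jobs" as "more than half of the queue" are all assumptions introduced here. Because the policy is re-evaluated from the current queue each time, backtracking to SJF and FIFO happens on its own as the load decreases.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    wait_time: float     # seconds the job has waited in the queue (hypothetical field)
    est_run_time: float  # estimated compute time in seconds (hypothetical field)


def choose_policy(queue, psi, long_job_time):
    """Pick 'FIFO', 'SJF' or 'RR' from the current state of the queue.

    psi is the wait-time threshold from the slide; long_job_time is an
    assumed cutoff for what counts as a job that takes a long time.
    """
    # Default policy: nothing has waited longer than psi.
    if not any(job.wait_time > psi for job in queue):
        return "FIFO"
    # Some job waited beyond psi; check whether most jobs are long.
    long_jobs = sum(1 for job in queue if job.est_run_time > long_job_time)
    if long_jobs > len(queue) / 2:
        return "RR"
    return "SJF"


def next_job(queue, policy):
    """Select the job to run next under the given policy."""
    if policy == "SJF":
        return min(queue, key=lambda job: job.est_run_time)
    # FIFO and RR both take the head of the queue; time slicing and
    # requeueing for RR are left to the caller and not implemented here.
    return queue[0]
```

For example, a queue in which one short job has waited past ψ would yield SJF, and `next_job` would then return the job with the smallest estimated run time.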

Job queue

Algorithm Optimizations: Sufficient statistics are exploited to accelerate data mining algorithms. Previous work [1] shows that Linear Regression, PCA, Naïve Bayes and K-means are efficiently computed by using the sufficient statistics n, L, Q. Sufficient statistics can be computed on samples or on the whole data set.
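To illustrate why one scan is enough, here is a small pure-Python sketch that accumulates n, L and Q in a single pass and then derives the mean and covariance from them without touching the data again. It assumes the usual definitions from the cited work [1], where L is the linear sum of the points and Q is the sum of their outer products; the function names are hypothetical, and the real system does this inside the DBMS with aggregate UDFs rather than in Python.

```python
def summarize(rows):
    """One pass over rows (each a list of d numbers): return n, L, Q.

    n is the number of points, L[a] the sum of dimension a,
    Q[a][b] the sum of x[a] * x[b] over all points.
    """
    n, L, Q = 0, [], []
    for x in rows:
        if n == 0:
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean vector and population covariance from n, L, Q only."""
    d = len(L)
    mean = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mean[a] * mean[b] for b in range(d)]
           for a in range(d)]
    return mean, cov


n, L, Q = summarize([[1.0, 2.0], [3.0, 4.0]])
mean, cov = mean_and_covariance(n, L, Q)
```

The same accumulation works on a sample or on the whole data set, since the statistics are plain sums.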

Sufficient Statistics: nLQ/Γ. Consider a data set X with n points. The sufficient statistics are generalized as: n = |X|, Z = [1, X, Y].
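The formulas on this slide did not survive extraction. As a reconstruction, assuming the definitions used in the Gamma operator paper listed in the references (x_i is the i-th point, y_i its dependent value, and Z stacks a row of ones, X and Y so that point i contributes the column z_i), the generalized statistics can be written as:

```latex
\[
z_i = \begin{bmatrix} 1 \\ x_i \\ y_i \end{bmatrix}, \qquad
\Gamma = Z Z^{T} = \sum_{i=1}^{n} z_i z_i^{T}
= \begin{bmatrix}
n & L^{T} & \sum_i y_i \\
L & Q & \sum_i x_i y_i \\
\sum_i y_i & \sum_i y_i x_i^{T} & \sum_i y_i^{2}
\end{bmatrix}
\]
\[
n = |X|, \qquad L = \sum_{i=1}^{n} x_i, \qquad Q = \sum_{i=1}^{n} x_i x_i^{T}
\]
```

Under these definitions Γ contains n, L and Q as sub-blocks, which is why a single matrix can replace the three separate statistics.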

Sufficient Statistics: nLQ/Γ. One set of sufficient statistics for each class/cluster is necessary for Naïve Bayes and K-means. One matrix Γ is enough for PCA, Linear Regression and Variable Selection.
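A per-class variant can be sketched in the same spirit: one set of statistics is kept for each class or cluster label, still in a single pass. This sketch assumes that only the diagonal of Q is needed, which is enough to obtain per-dimension means and variances; the function and field names are hypothetical.

```python
def summarize_by_class(rows, labels):
    """One pass: keep n, L and the diagonal of Q separately per label."""
    stats = {}
    for x, g in zip(rows, labels):
        if g not in stats:
            stats[g] = {"n": 0, "L": [0.0] * len(x), "Q": [0.0] * len(x)}
        s = stats[g]
        s["n"] += 1
        for a, v in enumerate(x):
            s["L"][a] += v      # linear sum per dimension
            s["Q"][a] += v * v  # sum of squares per dimension (diagonal of Q)
    return stats


def class_mean_variance(s):
    """Per-dimension mean and population variance for one class."""
    n = s["n"]
    mean = [l / n for l in s["L"]]
    var = [q / n - m * m for q, m in zip(s["Q"], mean)]
    return mean, var


stats = summarize_by_class(
    [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]],
    ["a", "a", "b"],
)
mean_a, var_a = class_mean_variance(stats["a"])
```

The dictionary holds one small summary per class, so its size depends on the number of classes and dimensions, not on the number of rows.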

Data transfer comparison. Data set: Physical Activity (n = 2.88M, d = 42). Data set size: 880.00 MB. nLQ/Γ size: 0.02 MB. About 50,000 times smaller!
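The order of magnitude can be checked with a short calculation. It assumes, beyond what the slide states, that Γ is a (d+2) × (d+2) matrix stored as 8-byte double-precision values and that MB means 10^6 bytes; the 880 MB figure is taken from the slide as reported.

```python
def gamma_size_bytes(d, bytes_per_value=8):
    """Size of a (d+2) x (d+2) matrix; 8-byte values are an assumption."""
    return (d + 2) ** 2 * bytes_per_value


d = 42                        # dimensions, from the slide
dataset_mb = 880.0            # data set size reported on the slide
gamma_bytes = gamma_size_bytes(d)
gamma_mb = gamma_bytes / 1e6  # decimal megabytes (assumption)
ratio = dataset_mb / gamma_mb

print(f"Gamma: {gamma_bytes} bytes = {gamma_mb:.4f} MB, ratio = {ratio:.0f}")
```

Under these assumptions Γ takes 15,488 bytes, which rounds to the 0.02 MB shown on the slide, and the ratio is between 50,000 and 60,000. The size of Γ depends only on d, so the saving grows with n.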

Optimizations. Sufficient statistics: calculated in one parallel scan; aggregate UDFs; multithreaded, in RAM. Matrix computations in RAM: LAPACK integration; fast, accurate, stable.

Summary: Sufficient statistics are transmitted to the cloud. Hybrid processing is best. Job policy: FIFO -> SJF -> RR. Parallel summarization, parallel scan. Model computation in RAM in the cloud. Complicated number crunching in the cloud. Job and model history in the cloud. All data is in relational tables: they can be queried and stored securely.

References
[1] C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010.
[2] C. Ordonez, Y. Zhang, W. Cabrera. The Gamma Operator for Big Data Summarization on an Array DBMS (BigMine 2014). JMLR W&CP 36:88-103, 2014.
[3] C. Ordonez, C. Garcia-Alvarado, V. Baladandayuthapani. Bayesian Variable Selection in Linear Regression in One Pass for Large Data Sets. ACM Transactions on Knowledge Discovery from Data (TKDD), 2015.

Questions: What do you use to communicate between the two databases? How is nLQ moved to the cloud? What utility/protocol do you use to move data from the local DBMS to the cloud DBMS when the system is working in "cloud" mode? Why do you not exploit multicores to process several jobs at the same time in the cloud DBMS?