Carlos Ordonez, Javier Garcia-Garcia,

Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems
Carlos Ordonez, Javier Garcia-Garcia, Carlos Garcia-Alvarado, Wellington Cabrera, Veera Baladandayuthapani, Shoaib Quraishi
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Motivation
- Relational databases are a natural repository of data: most real-life data is stored in a DBMS (enterprise systems)
- But analytical tasks are often done outside the DBMS
- Drawbacks: external data mining software, data exporting, privacy issues

Our proposal
- Provide analytical algorithms as a service in the cloud, exploiting the processing power of DBMSs
- DBMSs are present both in the cloud and on the client side
- No external packages required
- Standard SQL queries, UDFs and aggregate UDFs
- A set of off-the-shelf algorithms is provided

Challenges
- Large volume of data to be transmitted
- Matrix computations, while the data is stored in relational format
- Processing power requirements of number crunching
- Data redundancy: avoid duplicating data
- Minimize I/O
- All data in relational format: avoid exporting data to external packages

Advantages
The cloud system can:
- Reduce the workload on the local system
- Accelerate analytical processing
- Enforce data security: since the data does not leave the DBMS, access permissions still apply, which enforces confidentiality and integrity of data
- Simplify multiple model management
No data mining software has to be installed, either on the local system or in the cloud
Everything is stored in relational tables

System attributes
- Smart local processing: exploits CPU/RAM of the local DBMS
- Integrated: local DBMS and cloud DBMS are tightly integrated
- Fast: one pass over the input table for most algorithms; parallel
- Simple: algorithms are called through a stored procedure with default parameters
- Relational: relational tables store models and job parameters

System Components
- Cloud DBMS: stored procedures, UDFs
- Cloud management server: handling data mining job requests, monitoring job progress, cost estimation for 3 alternative processing modes, managing jobs
- Local DBMS
- Web application: users can post jobs using a web interface

Models
- PCA
- K-Means
- Linear Regression
- Variable Selection
- Naïve Bayes

Job processing

Remarks
Hybrid Mode:
- Sufficient statistics are calculated in the local DBMS
- Takes advantage of local processing power and RAM
- The cloud DBMS receives a summarization
- Transmitting the entire data set is avoided
- Model computation in the cloud DBMS
Cloud Mode:
- The summarization step occurs in the cloud
- Large data sets: sampling
Local Mode:
- Preferred for small data sets
- Summarization/sampling

Job Scheduler
- FIFO job scheduling by default
- If the wait time for an individual job goes beyond a threshold ψ, the system switches to SJF
- If most jobs take a long time to compute and the waiting time is beyond ψ, the system switches to Round Robin (RR)
- As the load decreases, the system backtracks to SJF, then FIFO
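The policy switching described on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Job` fields, the `long_job_time` cutoff and the "more than half of the jobs" reading of "most jobs" are hypothetical and do not come from the slides; only the threshold ψ (`psi`) and the three policies do.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    wait_time: float      # seconds the job has waited in the queue (assumed unit)
    est_run_time: float   # estimated seconds of computation (assumed to be known)


def choose_policy(queue, psi, long_job_time):
    """Pick FIFO, SJF or RR for the current state of the queue.

    psi is the wait-time threshold from the slide; long_job_time is a
    hypothetical cutoff used here to decide that a job is 'long'.
    """
    # Default: nobody has waited longer than psi, so stay with FIFO.
    if not queue or all(job.wait_time <= psi for job in queue):
        return "FIFO"
    # Some job waited too long; if most jobs are long, time-slice them.
    long_jobs = sum(1 for job in queue if job.est_run_time > long_job_time)
    if long_jobs > len(queue) / 2:
        return "RR"
    return "SJF"


def next_job(queue, policy):
    """Return the job to run next; queue is kept in arrival order."""
    if policy == "SJF":
        return min(queue, key=lambda job: job.est_run_time)
    # FIFO and RR both take the head of the queue; RR would also preempt
    # the job after a time slice, which is not modeled in this sketch.
    return queue[0]
```

Calling `choose_policy` again each time the queue changes gives the backtracking behavior for free: once waits drop below ψ the function returns FIFO again.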

Job queue

Algorithm Optimizations
- Sufficient statistics are exploited to accelerate data mining algorithms
- Previous work [1] shows that Linear Regression, PCA, Naïve Bayes and K-means are efficiently computed by using the sufficient statistics n, L, Q
- Sufficient statistics can be computed on samples or on the whole data set
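To make the role of n, L, Q concrete, the sketch below computes them in a single pass and then derives the mean vector and covariance matrix (the input PCA needs) without touching the data again. It is plain Python for illustration, not the SQL/UDF implementation the talk refers to; the function names are made up, and the population (divide by n) form of the covariance is an assumption.

```python
def summarize(rows):
    """One pass over rows (each a list of d numbers): return n, L, Q.

    n is the row count, L the per-dimension sum, Q the d x d matrix of
    sums of cross-products.
    """
    n, L, Q = 0, None, None
    for x in rows:
        if L is None:
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean and population covariance from the summaries only."""
    d = len(L)
    mean = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mean[a] * mean[b] for b in range(d)]
           for a in range(d)]
    return mean, cov


n, L, Q = summarize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mean, cov = mean_and_covariance(n, L, Q)
```

Because the second function never reads the rows, the same code works whether the summaries came from the whole data set or from a sample.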

Sufficient Statistics: nLQ/Γ
- Consider a data set X with n points
- The sufficient statistics are generalized as: n = |X|, Z = [1, X, Y]
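The formulas on this slide are only partly preserved. Following the definitions in the cited work on UDF-based model computation and the Gamma operator, and assuming X is a d×n matrix with one point x_i per column, Y is the 1×n row of the predicted variable, and the rows of Z are a row of n ones, X and Y, the statistics can be written as:

```latex
\begin{aligned}
n &= |X|, \qquad
L = \sum_{i=1}^{n} x_i, \qquad
Q = X X^{T} = \sum_{i=1}^{n} x_i x_i^{T}, \\[4pt]
Z &= [\mathbf{1}, X, Y], \qquad
\Gamma = Z Z^{T} =
\begin{bmatrix}
n & L^{T} & \mathbf{1}^{T} Y^{T} \\
L & Q & X Y^{T} \\
Y \mathbf{1} & Y X^{T} & Y Y^{T}
\end{bmatrix}
\end{aligned}
```

Under this layout Γ is a (d+2)×(d+2) matrix that contains n, L and Q as sub-blocks, plus the cross-products with Y that linear regression needs.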

Sufficient Statistics: nLQ/Γ
- One set of sufficient statistics for each class/cluster is necessary for: Naïve Bayes, K-means
- One matrix Γ is enough for: PCA, Linear Regression, Variable Selection
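As an illustration of "one set of sufficient statistics per class", the sketch below keeps n, L and the diagonal of Q separately for each class label and derives the parameters of a Gaussian Naïve Bayes model from them. The Gaussian, diagonal-Q variant and all names are assumptions made for the example; the slides do not say which variant the system uses.

```python
from collections import defaultdict


def summarize_by_class(rows, labels):
    """One pass: per-class count, sums (L) and sums of squares (diagonal of Q)."""
    d = len(rows[0])
    stats = defaultdict(lambda: {"n": 0, "L": [0.0] * d, "Q": [0.0] * d})
    for x, g in zip(rows, labels):
        s = stats[g]
        s["n"] += 1
        for a in range(d):
            s["L"][a] += x[a]
            s["Q"][a] += x[a] * x[a]
    return dict(stats)


def gaussian_parameters(stats):
    """Per-class prior, mean and variance, derived from the summaries only."""
    total = sum(s["n"] for s in stats.values())
    params = {}
    for g, s in stats.items():
        n = s["n"]
        mean = [l / n for l in s["L"]]
        var = [q / n - m * m for q, m in zip(s["Q"], mean)]
        params[g] = {"prior": n / total, "mean": mean, "var": var}
    return params


stats = summarize_by_class([[1.0], [3.0], [10.0]], ["a", "a", "b"])
params = gaussian_parameters(stats)
```

The same grouping idea applies to K-means, with the cluster assignment of each point playing the role of the class label.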

Data transfer comparison
- Data set: Physical Activity (n=2.88M, d=42)
- Dataset: MB
- nLQ/Γ: MB
- 50,000 times smaller!
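The order of magnitude of this reduction can be checked with a back-of-the-envelope count. The sketch below counts stored values rather than bytes, and assumes that Γ is a (d+2)×(d+2) matrix and that d=42 excludes the intercept and Y columns; the actual on-disk sizes depend on storage format and are not reproduced here.

```python
n, d = 2_880_000, 42             # data set size from the slide

dataset_values = n * d           # one value per point and dimension
gamma_values = (d + 2) ** 2      # assumed (d+2) x (d+2) layout of Gamma

ratio = dataset_values / gamma_values
print(f"{dataset_values:,} values vs {gamma_values:,} values: "
      f"about {ratio:,.0f}x fewer")
```

Under these assumptions the summary holds roughly 62,000 times fewer values than the raw table, which is in line with the reduction reported on the slide.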

Optimizations
Sufficient statistics:
- Calculated in one parallel scan
- Aggregate UDFs
- Multithreaded, in RAM
Matrix computations in RAM:
- LAPACK integration
- Fast, accurate, stable
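The reason one parallel scan is enough is that n, L and Q are plain sums, so partial results from different threads or partitions can simply be added. The sketch below mimics the usual initialize / accumulate / merge / terminate phases of an aggregate UDF in Python; the class and method names are hypothetical and the partitions are processed in a loop rather than in real threads.

```python
class NLQAggregate:
    """Mimics the phases of an aggregate UDF that computes n, L, Q."""

    def __init__(self, d):
        self.n = 0
        self.L = [0.0] * d
        self.Q = [[0.0] * d for _ in range(d)]

    def accumulate(self, x):
        """Fold one row into the running sums."""
        self.n += 1
        for a, xa in enumerate(x):
            self.L[a] += xa
            for b, xb in enumerate(x):
                self.Q[a][b] += xa * xb

    def merge(self, other):
        """Add the partial result produced by another thread or partition."""
        self.n += other.n
        for a in range(len(self.L)):
            self.L[a] += other.L[a]
            for b in range(len(self.L)):
                self.Q[a][b] += other.Q[a][b]
        return self

    def terminate(self):
        """Return the final summaries."""
        return self.n, self.L, self.Q


def parallel_summarize(partitions, d):
    """Summarize each partition independently, then merge the partials."""
    total = NLQAggregate(d)
    for part in partitions:          # each iteration could run in its own thread
        partial = NLQAggregate(d)
        for x in part:
            partial.accumulate(x)
        total.merge(partial)
    return total.terminate()


result = parallel_summarize([[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]], d=2)
```

The merged result is identical to what a single sequential scan over all rows would produce, which is what makes the summarization step easy to parallelize.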

Summary
- Sufficient statistics are transmitted to the cloud
- Hybrid processing is best
- Job policy: FIFO -> SJF -> RR
- Parallel summarization, parallel scan
- Model computation in RAM in the cloud
- Complicated number crunching in the cloud
- Job and model history in the cloud
- All data is in relational tables: they can be queried and stored securely

References
[1] C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010.
[2] C. Ordonez, Y. Zhang, W. Cabrera. The Gamma Operator for Big Data Summarization on an Array DBMS. BigMine 2014, JMLR W&CP 36:88-103, 2014.
[3] C. Ordonez, C. Garcia-Alvarado, V. Baladandayuthapani. Bayesian Variable Selection in Linear Regression in One Pass for Large Data Sets. ACM Transactions on Knowledge Discovery from Data (TKDD), 2015.

Questions:
- What do you use to communicate between the two databases? How is nLQ moved to the cloud?
- What utility/protocol do you use to move data from the local DBMS to the cloud DBMS when the system is working in "cloud" mode?
- Why do you not exploit multiple cores to process several jobs at the same time in the cloud DBMS?
