Optimized Algorithms for Data Analysis in Parallel Database Systems


1 Optimized Algorithms for Data Analysis in Parallel Database Systems
Wellington M. Cabrera Advisor: Dr. Carlos Ordonez

2 Outline
Motivation
Background
  Parallel DBMSs under shared-nothing architecture
  Data sets
Review of pre-proposal work
  Linear Models with Parallel Matrix Multiplication: Variable Selection, Linear Regression, PCA
Presentation of recent work
  Graph Analytics with Parallel Matrix-Matrix Multiplication: Transitive Closure, All Pairs Shortest Path, Triangle Counting
  Graph Analytics with Parallel Matrix-Vector Multiplication: PageRank, Connected Components, Reachability, SSSP
Conclusions
(Speaker note: emphasize that this work goes deep into the parallelism aspect.)

3 Motivation & Contributions

4 Motivation
Large datasets are found in every domain.
Data grows continuously: number of records, number of attributes/features.
DBMSs are mature systems backed by extensive research: query optimizer, optimized I/O, parallelism.
DBMSs offer increased security compared with ad-hoc file management.

5 Issues
Most data analysis, model computation and graph analytics is done outside the database, by exporting CSV files.
It is difficult to express complex models and graph algorithms in a DBMS: no support for matrix operations, and queries may become hard to program.
Algorithms programmed without a deep understanding of DBMS technology may run with poor performance.
What is wrong with exporting the data set to external systems? It threatens data privacy, wastes time, and delays the analysis.

6 Contributions History/Timeline
First part of PhD: Linear Models with Parallel Matrix Multiplication [1, 2]
  Variable Selection, Linear Regression, PCA
Second part of PhD: Graph Analytics with Parallel Matrix-Matrix Multiplication [3]
  Transitive Closure, All Pairs Shortest Path, Triangle Counting
Graph Analytics with Parallel Matrix-Vector Multiplication [4]
  PageRank, Connected Components, Reachability, SSSP

[1] The Gamma Matrix to Summarize Dense and Sparse Data Sets for Big Data Analytics. IEEE TKDE 28(7), 2016.
[2] Accelerating a Gibbs sampler for variable selection on genomics data with summarization and variable pre-selection combining an array DBMS and R. Machine Learning 102(3), 2016.
[3] Comparing columnar, row and array DBMSs to process recursive queries on graphs. Inf. Syst. 63, 2017.
[4] Unified Algorithm to Solve Several Graph Problems with Relational Queries. Alberto Mendelzon International Workshop on Foundations of Data Management, 2016.

7 BACKGROUND
(Speaker note: stress that I build on my knowledge of parallelism.)

8 Definitions: data set for Linear Models
Let X = {x1, ..., xn} be the input data set with n data points, where each point has d dimensions. X is a d × n matrix, where the data point xi is represented by a column vector (thus equivalent to a d × 1 matrix). Y is a 1 × n vector representing the dependent variable. Generally n > d; therefore X is a rectangular matrix. Big data: n >> d.

9 Definitions

10 Definition: Graph data set
Let G = (V, E), with m = |E| and n = |V|. We denote the adjacency matrix of G by E; E is an n × n matrix, generally sparse.
S: a vector of vertices used in graph computations, with |S| = |V| = n. Each entry Si represents a vertex attribute: distance from a specific source, membership, probability.
We omit values in S that carry no information (such as ∞ for distances or 0 for probabilities).
Notice E is n × n, but X is d × n.

11 DBMS Storage classes
Row Store: legacy, transactions
Column Store: modern, analytics
Array Store: emerging, scientific

12 Linear Models: data set storage in columnar/row DBMS
Case n >> d: low- and high-dimensional data sets; n in the millions/billions, d up to a few hundred. Covers most data sets: marketing, public health, sensor networks.
Data point xi is stored as a row with d columns, plus an extra column to store the outcome Y.
Thus the data set is stored as a table T with n rows and d+1 columns. Parallel databases may partition T either by a hash function or by a mod function, as sketched below.
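A minimal sketch of this layout in SQL; the table and column names (T, i, x1..x3, y) are hypothetical, d = 3 is assumed for brevity, and the partitioning clause is product-specific:

CREATE TABLE T (
  i  INTEGER,  -- point id, 1..n
  x1 FLOAT,    -- dimensions x1..xd
  x2 FLOAT,
  x3 FLOAT,
  y  FLOAT     -- the extra (d+1)-th column holding the outcome Y
);
-- Rows are then spread across workers by hashing i; the syntax varies,
-- e.g. SEGMENTED BY HASH(i) ALL NODES in Vertica, DISTRIBUTED BY (i) in Greenplum.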

13 Linear Models: data set storage in columnar/row DBMS
Case d > n: very high d, low n; d in the thousands. Examples: gene expression (microarray) data, word frequency in documents.
We cannot keep the n > d layout: the number of columns goes beyond most row DBMS limits.
Data point xi is stored as a column, with an extra row to store the outcome Y.
Thus the data set is stored in a table T with n columns and d+1 rows.
(Speaker note: clarify.)

14 Linear Models: data set representation in an array DBMS
Array databases store data as multidimensional arrays instead of relational tables. Arrays are partitioned into chunks (bi-dimensional data blocks); all chunks in a given array have the same shape and size.
Data points xi are stored as rows, with an extra column for the outcome yi. Thus the data set is represented as a bi-dimensional array with n rows and d+1 columns.

15 Graph data set
Row and columnar DBMS: E(i, j, v)
Array DBMS: E as an n × n sparse array
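A minimal sketch of the relational layout (column names follow the E(i, j, v) convention above):

CREATE TABLE E (
  i INTEGER,  -- source vertex
  j INTEGER,  -- destination vertex
  v FLOAT     -- edge value: weight, or 1 for unweighted graphs
);
-- One row per edge, so the sparse n x n matrix takes O(m) storage.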

16 LINEAR MODELS COMPUTATION WITH MATRIX MULTIPLICATION

17 Gamma Matrix: Γ = Z · Z^T
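For reference, the block structure of Γ as introduced in [1], where Z is the (d+2) × n matrix that augments X with a row of ones and the row Y, and 1 denotes the n × 1 vector of ones:

$$ Z = \begin{bmatrix} \mathbf{1}^T \\ X \\ Y \end{bmatrix}, \qquad
\Gamma = Z Z^T = \begin{bmatrix} n & L^T & Y\mathbf{1} \\ L & Q & X Y^T \\ Y\mathbf{1} & Y X^T & Y Y^T \end{bmatrix}, \qquad
L = X\mathbf{1}, \quad Q = X X^T $$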

18 Models Computation
2-Step algorithm for PCA, LR, VS: one pass over the data set.
  1. Compute the summarization matrix (Gamma) in one pass.
  2. Compute the models (PCA, LR, VS) using Gamma.
Preselection & 2-Step algorithm for very high-dimensional VS (two passes): a preprocessing step is incorporated.
  1. Compute partial Gamma and perform preselection.
  2. Compute VS using Gamma.

19 Models Computation: 2-step algorithm
1. Compute the summarization matrix Gamma in the DBMS (cluster; multiple nodes/cores).
2. Compute the model locally, exploiting Gamma and parallel matrix operations (LAPACK), using any programming language (e.g., R, C++, C#).
This approach was published in our work [1].

20 First step: One pass data set summarization
We introduced the Gamma Matrix in [1]. The Gamma Matrix (Γ) is a square matrix with d+2 rows and columns that contains a set of sufficient statistics, useful to compute several statistical indicators and models: PCA, VS, LR, covariance/correlation matrices.
It is computed in parallel with multiple cores or multiple nodes.
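As an illustration only, the same computation expressed as a relational self-join, assuming Z is stored vertically as a hypothetical table Z(i, j, v) with i the point index and j the coordinate index; the UDF-based implementation from [1], described next, is the efficient route:

SELECT a.j AS r, b.j AS c, SUM(a.v * b.v) AS v
FROM Z a JOIN Z b ON a.i = b.i  -- pair up coordinates of the same point
GROUP BY a.j, b.j;              -- one cell of Gamma per (row, column)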

21 Matrix Multiplication Z · Z^T
Parallel computation with a multicore CPU (single node), in one pass.
Aggregate UDFs (AGG UDF) are processed in parallel in four phases (initialize, accumulate, merge, terminate), enabling multicore processing:
  Initialize: variables are set up.
  Accumulate: partial Gammas are calculated via vector products.
  Merge: the final Gamma is computed by adding the partial Gammas.
  Terminate: control returns to the main process.
Alternatives: computation with LAPACK (main memory); computation with OpenMPI.

22 Matrix Multiplication Z · Z^T
Parallel computation with multiple nodes, in a parallel array database. Each worker can process with one or multiple cores:
  Each core computes its own partial Gamma using its own local data.
  The master node receives the partial Gammas from the workers.
  The master node computes the final Gamma by matrix addition.
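This scheme is correct because Γ is a sum of per-point outer products, so it distributes over any partition P1, ..., PN of the data across workers:

$$ \Gamma = \sum_{i=1}^{n} z_i z_i^T = \sum_{k=1}^{N} \Gamma_k, \qquad \Gamma_k = \sum_{i \in P_k} z_i z_i^T $$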

23 Gamma: Z · Z^T

24 Models Computation: contribution summary
Enables the analysis of very high-dimensional data sets in the DBMS.
Overcomes the problem of data sets larger than RAM (d < n).
10s to 100s of times faster than the standard approach.
Gamma fits in memory when d < n; Gamma does not fit in memory when d >> n.

25 PCA
Compute Γ, which contains n, L and Q.
Compute ρ, solve the SVD of ρ, and select the k principal components.
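In equations, a standard computation from the sufficient statistics (the exact formulation in [1] may differ in presentation): the correlation matrix ρ comes entry-wise from n, L and Q, and its decomposition yields the components,

$$ \rho_{ab} = \frac{n\,Q_{ab} - L_a L_b}{\sqrt{n\,Q_{aa} - L_a^2}\;\sqrt{n\,Q_{bb} - L_b^2}}, \qquad \rho = U \Lambda U^T, $$

taking as principal components the k columns of U with the largest eigenvalues in Λ.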

26 LR Computation
(Speaker note: ERROR here, a small fix needed; use a mathematical-notation typeface.)
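A sketch of the intended content: with the d × n layout used here, the least-squares coefficients follow directly from the blocks of Γ (augmenting X with a row of ones yields the intercept),

$$ \hat{\beta} = Q^{-1} \,(X Y^T), $$

where Q = X X^T and X Y^T are both read off Γ, so no further pass over the data is needed.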

27 Variable Selection: 1 + 2 Step Algorithm
Pre-selection, based on marginal correlation ranking:
  Calculate the correlation between each variable and the outcome.
  Sort in descending order and take the best d variables; the top d variables are considered for further analysis.
Compute Γ, which contains Qγ and XγY^T.
Iterate the Gibbs sampler a sufficiently large number of iterations to explore the model space.
(Speaker note: quite a bit of justification still needs to be added here; add a statistics textbook as a reference.)

28 Optimizing the Gibbs Sampler
Non-conjugate Gaussian priors require the full Markov chain; conjugate priors simplify the computation, with β and σ integrated out.
Marin-Robert formulation: Zellner g-prior for β and Jeffreys prior for σ.
(Speaker note: capital Gamma is not explained here; explain what a conjugate prior is.)
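For the record, a sketch of the collapsed posterior these priors yield, following the Marin-Robert formulation (stated up to proportionality, with conventions adapted to the d × n layout used here; qγ is the number of variables selected by γ):

$$ \pi(\gamma \mid Y) \propto (g+1)^{-(q_\gamma+1)/2} \left[ Y Y^T - \frac{g}{g+1}\,(X_\gamma Y^T)^T \, Q_\gamma^{-1} \,(X_\gamma Y^T) \right]^{-n/2} $$

Every quantity here (Y Y^T, Qγ = Xγ Xγ^T, Xγ Y^T) is a block of Γ, so each Gibbs iteration needs only Γ, never the raw data.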

29 PCA
DBMS: SciDB. System: local, 1 node, 2 instances. Dataset: KDDnet.

  d    n     R     Γ operator + R
  10   100K  0.5   0.6
  10   1M    4.8   1.3
  10   10M   45.2  7
  10   100M  fail  64.7
  100  10K   0.7   0.8
  100  100K  5.7   2.5
  100  1M    61.2  16.8
  100  10M   --    194.9

(Speaker note: include the largest tables we have; mention that variable selection is not available in Spark.)

30 LR
DBMS: SciDB. System: local, 1 node, 2 instances. Dataset: KDDnet.

  d    n     R     Γ operator + R
  10   100K  0.5   0.6
  10   1M    5.6   1.3
  10   10M   50.1  7.1
  10   100M  fail  69.8
  100  10K   0.7   0.9
  100  100K  6.3   2.6
  100  1M    60.5  16.9
  100  10M   --    194.9

31 VS
DBMS: SciDB. System: local, 1 node, 2 instances. Dataset: Brain Cancer - miRNA.

  d    n     VS R  Γ operator + R
  10   100K  3.4   3.6
  10   1M    7.6   4.7
  10   10M   58.8  9.6
  10   100M  fail  93.4
  100  10K   13.1  13
  100  100K  34.4  14.8
  100  1M    113   29.1
  100  10M   --    207.2

  p      d    n    VS R     Γ operator + R
  12500  100  248  1245     15
  12500  200  248  2586     17
  12500  400  248  stopped  39
  12500  800  248  --       84

32 MATRIX-VECTOR COMPUTATION IN PARALLEL DBMS
(Speaker note: I was fascinated by parallelism and matrix multiplication.)

33 Algorithms

34 Matrix-vector multiplication with relational queries
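A sketch of one matrix-vector step as a join-aggregate query, with E(i, j, v) as in the Background section and S stored as a hypothetical table S(i, v); the queries in [4] generalize this shape, as shown later:

SELECT E.i, SUM(E.v * S.v) AS v
FROM E JOIN S ON E.j = S.i  -- entry j of S meets column j of E
GROUP BY E.i;               -- one output entry per matrix row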

35 Optimizing the Parallel Join: data partitioning
Join locality: E and S are partitioned by hashing on the join columns.
Sorted tables: a merge join is possible, with complexity O(n).

36 Optimizing the Parallel Join: data partitioning
S is split in N chunks; E is split in N × N square chunks.
Data partitioning optimizes E JOIN S ON E.j = S.i.
[Figure: chunked layout of E (entries e11, e12, ..., e1n) aligned with chunks s1, s2 of S across workers 1-4.]

37 Handling Skewed Data
Chunk density for a social network data set on an 8-instance cluster. Skewness results in uneven data distribution (right); chunk density after repartitioning (left). Edges per worker, before (right) and after (left) repartitioning.

38 Unified Algorithm
The Unified Algorithm solves: reachability from a source vertex, SSSP, WCC, PageRank.
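A sketch of why one template suffices: each problem is the same join-aggregate iteration on S and E, with only the aggregation function and scalar operator swapped (the semiring view made explicit by the matrix-matrix queries later in the talk). Table layouts are the E(i, j, v) and hypothetical S(i, v) used above:

-- SSSP / reachability flavor: (min, +) aggregation over edge weights
SELECT E.j AS i, MIN(S.v + E.v) AS v
FROM S JOIN E ON S.i = E.i
GROUP BY E.j;

-- PageRank flavor: (sum, *) aggregation over a transition matrix E
SELECT E.j AS i, SUM(S.v * E.v) AS v
FROM S JOIN E ON S.i = E.i
GROUP BY E.j;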

39 Data Partitioning in Array DBMS
Data is partitioned by chunks (ranges). Vector S is evenly partitioned across the cluster, but this is sensitive to skewness; redistribution uses a mod function.
(Speaker note: it makes more sense to integrate R with the array DBMS.)

40 Experimental Validation
Time complexity close to linear. Compared with a classical optimization: replication of the smallest table.

41 Experimental Validation
Optimized Queries in array DBMS vs ScaLAPACK

42 Comparing columnar vs array vs Spark

43 Experimental Validation
Speed up with real data sets

44 Matrix Powers

45 Matrix Powers with recursive queries

46 Recursive Queries

47 Recursive Queries: Powers of a Matrix
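As an illustration, a minimal sketch of matrix powers with a standard SQL recursive CTE over E(i, j, v), under the (+, ×) semiring and bounded at four steps; the recursive mechanisms actually compared in [3] differ by DBMS, since not all of them support this syntax:

WITH RECURSIVE R (i, j, v, k) AS (
  SELECT i, j, v, 1 FROM E              -- base case: E^1
  UNION ALL
  SELECT R.i, E.j, R.v * E.v, R.k + 1   -- extend each path by one edge
  FROM R JOIN E ON R.j = E.i
  WHERE R.k < 4                         -- bound the recursion at E^4
)
SELECT k, i, j, SUM(v) AS v             -- entries of E^k, one power per k
FROM R
GROUP BY k, i, j;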

48 Matrix Multiplication with SQL Queries
Matrix-Matrix Multiplication, (+, ×) semiring:
SELECT R.i, E.j, SUM(R.v * E.v)
FROM R JOIN E ON R.j = E.i
GROUP BY R.i, E.j

Matrix-Matrix Multiplication, (min, +) semiring:
SELECT R.i, E.j, MIN(R.v + E.v)
FROM R JOIN E ON R.j = E.i
GROUP BY R.i, E.j

49 Data partitioning for parallel computation in Columnar DBMS

50 Data partitioning for parallel computation in Array DBMS
Distributed storage of R and E in the array DBMS.
[Figure: chunked layout of R and E (entries e11, e12, ..., e1n) across workers 1-4.]

51 Experimental Validation
Matrix Multiplication. Comparing to ScaLAPACK

52 Experimental Validation
Parallel Speed-up: column and array DBMS

53 Conclusions TBD

54 Publications

