Optimized Algorithms for Data Analysis in Parallel Database Systems
Wellington M. Cabrera
Advisor: Dr. Carlos Ordonez
Outline
  Motivation
  Background
    Parallel DBMSs under the shared-nothing architecture
    Data sets
  Review of pre-proposal work
    Linear Models with Parallel Matrix Multiplication
      Variable Selection, Linear Regression, PCA
  Presentation of recent work
    Graph Analytics with Parallel Matrix-Matrix Multiplication
      Transitive Closure, All Pairs Shortest Path, Triangle Counting
    Graph Analytics with Parallel Matrix-Vector Multiplication
      PageRank, Connected Components, Reachability, SSSP
  Conclusions
Motivation & Contributions
Motivation
Large data sets are found in every domain, and data grows continuously:
  number of records
  number of attributes/features
DBMSs are mature systems backed by decades of research:
  query optimizer
  optimized I/O
  parallelism
DBMSs offer better security than ad-hoc file management.
Issues
Most data analysis, model computation and graph analytics is done outside the database, by exporting CSV files.
It is difficult to express complex models and graph algorithms in a DBMS:
  no support for matrix operations
  queries may become hard to program
  algorithms programmed without a deep understanding of DBMS technology may run with poor performance.
What is wrong with exporting the data set to external systems?
  data privacy threat
  wasted time
  analysis is delayed
Contributions: History/Timeline
First part of PhD:
  Linear Models with Parallel Matrix Multiplication [1, 2]
    Variable Selection, Linear Regression, PCA
Second part of PhD:
  Graph Analytics with Parallel Matrix-Matrix Multiplication [3]
    Transitive Closure, All Pairs Shortest Path, Triangle Counting
  Graph Analytics with Parallel Matrix-Vector Multiplication [4]
    PageRank, Connected Components, Reachability, SSSP

[1] The Gamma Matrix to Summarize Dense and Sparse Data Sets for Big Data Analytics. IEEE TKDE 28(7): 1905-1918 (2016)
[2] Accelerating a Gibbs Sampler for Variable Selection on Genomics Data with Summarization and Variable Pre-selection Combining an Array DBMS and R. Machine Learning 102(3): 483-504 (2016)
[3] Comparing Columnar, Row and Array DBMSs to Process Recursive Queries on Graphs. Inf. Syst. 63: 66-79 (2017)
[4] Unified Algorithm to Solve Several Graph Problems with Relational Queries. Alberto Mendelzon International Workshop on Foundations of Data Management (2016)
BACKGROUND
Definitions: data set for Linear Models
Let X = {x1, ..., xn} be the input data set with n data points, where each point has d dimensions. X is a d × n matrix, where the data point xi is represented by a column vector (thus equivalent to a d × 1 matrix). Y is a 1 × n vector representing the dependent variable.
Generally n > d; therefore X is a rectangular matrix. Big data: n >> d.
Definitions
Definition: graph data set
Let G = (V, E), with m = |E| and n = |V|. We denote the adjacency matrix of G by E; E is an n × n matrix, generally sparse.
S: a vector of vertices used in graph computations, with |S| = |V| = n. Each entry Si represents a vertex attribute: distance from a specific source, membership, probability. We omit entries of S that carry no information (such as ∞ for distances or 0 for probabilities).
Notice E is n × n, but X is d × n.
DBMS storage classes
  Row store: legacy, transactions
  Column store: modern, analytics
  Array store: emerging, scientific
Linear Models: data set storage in a columnar/row DBMS, case n >> d
Low- and high-dimensional data sets: n in the millions/billions, d up to a few hundred. Covers most data sets: marketing, public health, sensor networks.
Data point xi is stored as a row with d columns, plus an extra column to store the outcome Y. Thus, the data set is stored as a table T with n rows and d + 1 columns. Parallel databases may partition T either by a hash function or by a mod function, as sketched below.
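As a hedged illustration of this layout (table and column names are ours, and the partitioning clause varies by parallel DBMS), a d = 3 data set would be stored as:

  CREATE TABLE T (
    i  INTEGER,   -- data point id
    x1 FLOAT,     -- dimension 1
    x2 FLOAT,     -- dimension 2
    x3 FLOAT,     -- dimension 3
    y  FLOAT      -- outcome Y, the extra (d+1)-th column
  );
  -- Hash partitioning of T across workers on the point id; the clause
  -- is system-specific, e.g. Vertica-style: SEGMENTED BY HASH(i) ALL NODES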
Linear Models: data set storage in a columnar/row DBMS, case d > n
Very high d, low n; d in the thousands. Examples: gene expression (microarray) data, word frequency in documents.
The n > d layout cannot be kept: the number of columns goes beyond the limits of most row DBMSs. Data point xi is stored as a column, with an extra row to store the outcome Y. Thus, the data set is stored in a table T with n columns and d + 1 rows.
Linear Models: data set representation in an array DBMS
Array databases store data as multidimensional arrays instead of relational tables. Arrays are partitioned into chunks (bi-dimensional data blocks); all chunks of a given array have the same shape and size. Data point xi is stored as a row, with an extra column for the outcome yi. Thus, the data set is represented as a bi-dimensional array with n rows and d + 1 columns.
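As an illustration, a SciDB-style array definition for n = 1,000,000 points with d = 10 might look as follows (a sketch only; the chunk sizes and the dimension syntax [dim=low:high,chunk_length,overlap] are assumptions based on older SciDB releases):

  CREATE ARRAY X <v:double> [i=1:1000000,10000,0, j=1:11,11,0];
  -- i indexes the n data points (chunk length 10,000)
  -- j indexes the d+1 = 11 columns: 10 dimensions plus the outcome y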
Graph data set storage
  Row and columnar DBMS: edge table E(i, j, v)
  Array DBMS: E as an n × n sparse array
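A minimal sketch of the edge table (column names follow the E(i, j, v) notation above; the partitioning remark is an assumption, since the clause differs per system):

  CREATE TABLE E (
    i INTEGER,   -- source vertex
    j INTEGER,   -- destination vertex
    v FLOAT      -- edge value: weight, distance or probability
  );
  -- Partitioning E by hash on the join column (j) and S by hash on i
  -- co-locates matching rows, enabling the local joins discussed later.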
LINEAR MODELS COMPUTATION WITH MATRIX MULTIPLICATION
Gamma Matrix: Γ = Z · Zᵀ
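Following the definition in [1], Z augments X with a row of 1s and the row vector Y, so a single matrix product captures all the sufficient statistics at once (a summary of the published definition, with n, L, Q as used in the rest of these slides):

\[
Z = \begin{bmatrix} \mathbf{1}^T \\ X \\ Y \end{bmatrix}, \qquad
\Gamma = Z Z^T =
\begin{bmatrix}
n & L^T & \sum_i y_i \\
L & Q & X Y^T \\
\sum_i y_i & Y X^T & \sum_i y_i^2
\end{bmatrix},
\]

where L = \(\sum_i x_i\) (a d × 1 vector) and Q = \(X X^T\) (a d × d matrix), so Γ is (d + 2) × (d + 2).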
Models Computation
2-step algorithm for PCA, LR, VS (one pass over the data set):
  1. Compute the summarization matrix (Gamma) in one pass.
  2. Compute the models (PCA, LR, VS) using Gamma.
Pre-selection + 2-step algorithm for very high-dimensional VS (two passes); a preprocessing step is incorporated:
  1. Compute a partial Gamma and perform pre-selection.
  2. Compute VS using Gamma.
Models Computation: 2-step algorithm
  1. Compute the summarization matrix Gamma in the DBMS (cluster, multiple nodes/cores).
  2. Compute the model locally, exploiting Gamma and parallel matrix operations (LAPACK), using any programming language (e.g., R, C++, C#).
This approach was published in our work [1].
First step: one-pass data set summarization
We introduced the Gamma matrix in [1]. The Gamma matrix (Γ) is a square matrix with d + 2 rows and columns that contains a set of sufficient statistics, useful to compute several statistical indicators and models: PCA, VS, LR, covariance/correlation matrices. It is computed in parallel with multiple cores or multiple nodes.
Matrix multiplication Z · Zᵀ: parallel computation with a multicore CPU (single node) in one pass
Aggregate UDFs (AGG UDFs) are processed in parallel in four phases (initialize, accumulate, merge, terminate) and enable multicore processing:
  Initialize: variables are set up.
  Accumulate: partial Gammas are calculated via vector outer products.
  Merge: the final Gamma is computed by adding the partial Gammas.
  Terminate: control returns to the main process.
Alternative computation with LAPACK (main memory) or OpenMPI.
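Once registered, such an aggregate UDF turns the whole summarization into a single aggregation query, which the DBMS parallelizes like any built-in aggregate; a sketch with a hypothetical UDF named Gamma (its name and signature are ours, for illustration):

  SELECT Gamma(x1, x2, x3, y)   -- hypothetical AGG UDF returning Γ in packed form
  FROM T;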
Matrix multiplication Z · Zᵀ: parallel computation with multiple nodes
Computation in a parallel array database; each worker can process with one or multiple cores:
  Each core computes its own partial Gamma using its own local data.
  The master node receives the partial Gammas from the workers.
  The master node computes the final Gamma with matrix addition.
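Since Γ is a sum of per-point outer products, it decomposes over any horizontal partition of the data; writing Z_I for the columns of Z stored at worker I out of N workers:

\[
\Gamma = \sum_{i=1}^{n} z_i z_i^T = \sum_{I=1}^{N} \Gamma^{(I)}, \qquad
\Gamma^{(I)} = Z_I Z_I^T ,
\]

so every worker computes its Γ^{(I)} locally in one pass, and the master only adds N small (d + 2) × (d + 2) matrices.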
Gamma: Γ = Z · Zᵀ
Models Computation: contribution summary
  Enables the analysis of very high-dimensional data sets in the DBMS.
  Overcomes the problem of data sets larger than RAM (d < n).
  10s to 100s of times faster than the standard approach.
  Gamma fits in memory when d < n; Gamma does not fit in memory when d >> n.
PCA
  1. Compute Γ, which contains n, L and Q.
  2. Compute the correlation matrix ρ, solve the SVD of ρ, and select the k principal components.
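Each entry of ρ depends only on n, L and Q, so ρ comes directly out of Γ without a second pass over the data (standard sample-correlation formulas, written in this document's notation):

\[
\rho_{ab} = \frac{n\, q_{ab} - L_a L_b}
{\sqrt{n\, q_{aa} - L_a^2}\;\sqrt{n\, q_{bb} - L_b^2}},
\qquad \rho = U \Lambda U^T ,
\]

where the q_{ab} are entries of Q = X Xᵀ; the k columns of U with the largest eigenvalues in Λ are the principal components.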
LR Computation
Compute Γ, which contains Q and X Yᵀ. Solving for the regression coefficients β is then a small effort, done in RAM.
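With X augmented by a row of 1s to absorb the intercept, the ordinary least squares solution depends on the data only through blocks of Γ (standard OLS, written in this document's column-vector convention):

\[
\hat{\beta} = (X X^T)^{-1} X Y^T = Q^{-1} (X Y^T),
\]

so LR reduces to solving one small linear system in RAM once Γ is available.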
Variable Selection: 1 + 2 step algorithm
Pre-selection, based on marginal correlation ranking:
  Calculate the correlation between each variable and the outcome.
  Sort in descending order and take the best d variables; the top d variables are considered for further analysis.
Then:
  Compute Γ, which contains Qγ and Xγ Yᵀ.
  Iterate the Gibbs sampler a sufficiently large number of iterations to explore the model space.
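The ranking itself needs only entries of Γ: the marginal (Pearson) correlation of variable a with the outcome is

\[
r_a = \frac{n\,(X Y^T)_a - L_a \sum_i y_i}
{\sqrt{n\, q_{aa} - L_a^2}\;\sqrt{n \sum_i y_i^2 - \left(\sum_i y_i\right)^2}} ,
\]

so the pre-selection pass only has to accumulate the same sufficient statistics.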
Optimizing the Gibbs sampler
Non-conjugate Gaussian priors require sampling the full Markov chain. Conjugate priors simplify the computation (a conjugate prior yields a posterior in the same family, so closed forms are available): β and σ are integrated out.
Marin-Robert formulation: Zellner g-prior for β and Jeffreys prior for σ.
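For reference, these priors take the following standard forms (as in Marin and Robert's formulation; γ indexes a candidate subset of variables, and g is the Zellner prior's scale):

\[
\beta_\gamma \mid \sigma^2 \sim \mathcal{N}\!\big(0,\; g\,\sigma^2 (X_\gamma X_\gamma^T)^{-1}\big),
\qquad \pi(\sigma^2) \propto \sigma^{-2},
\]

under which β and σ² integrate out in closed form, leaving a marginal posterior over γ that depends on the data only through Qγ and Xγ Yᵀ, i.e., through Γ.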
PCA
DBMS: SciDB. System: local, 1 node, 2 instances. Dataset: KDDnet.

  d    n     R      Γ operator + R
  10   100K  0.5    0.6
  10   1M    4.8    1.3
  10   10M   45.2   7
  10   100M  fail   64.7
  100  10K   0.7    0.8
  100  100K  5.7    2.5
  100  1M    61.2   16.8
  100  10M   —      194.9
LR
DBMS: SciDB. System: local, 1 node, 2 instances. Dataset: KDDnet.

  d    n     R      Γ operator + R
  10   100K  0.5    0.6
  10   1M    5.6    1.3
  10   10M   50.1   7.1
  10   100M  fail   69.8
  100  10K   0.7    0.9
  100  100K  6.3    2.6
  100  1M    60.5   16.9
  100  10M   —      194.9
VS
DBMS: SciDB. System: local, 1 node, 2 instances.

  d    n     VS in R   Γ operator + R
  10   100K  3.4       3.6
  10   1M    7.6       4.7
  10   10M   58.8      9.6
  10   100M  fail      93.4
  100  10K   13.1      13
  100  100K  34.4      14.8
  100  1M    113       29.1
  100  10M   —         207.2

Dataset: Brain Cancer - miRNA (p = 12500, n = 248).

  p      d    n    VS in R   Γ operator + R
  12500  100  248  1245      15
  12500  200  248  2586      17
  12500  400  248  stopped   39
  12500  800  248  —         84

Note: Variable Selection is not available in Spark.
MATRIX-VECTOR COMPUTATION IN PARALLEL DBMS
Algorithms
Matrix-vector multiplication with relational queries
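With the schemas E(i, j, v) and S(i, v) above, one iteration of the matrix-vector product E · S is a single join-aggregation query; a sketch under the (sum, ×) semiring (the published queries may orient the join differently depending on the adjacency convention):

  SELECT E.i, sum(E.v * S.v) AS v
  FROM E JOIN S ON E.j = S.i
  GROUP BY E.i;
  -- one step of S_k = E · S_{k-1}; the query is iterated until S converges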
Optimizing the parallel join: data partitioning
  Join locality: E and S are partitioned by hashing on the joining columns.
  Sorted tables: a merge join becomes possible, with complexity O(n).
Optimizing the parallel join: data partitioning
S is split in N chunks; E is split in N × N square chunks. This partitioning optimizes E JOIN S ON E.j = S.i.
[Figure: chunks of E aligned with the chunks of S across workers 1-4]
Handling skewed data
Skewness results in an uneven distribution of data across workers; repartitioning restores balance.
[Figures: chunk density for a social network data set on an 8-instance cluster, before (right) and after (left) repartitioning; edges per worker, before (right) and after (left) repartitioning]
Unified Algorithm
The Unified Algorithm solves: reachability from a source vertex, SSSP, WCC, PageRank.
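Each problem instantiates the same matrix-vector iteration with a different semiring; two hedged sketches (initialization of S and details such as the PageRank damping factor are omitted, and the join orientation again depends on the adjacency convention):

  -- Reachability / SSSP: (min, +) semiring
  SELECT E.i, min(S.v + E.v) AS v
  FROM E JOIN S ON E.j = S.i
  GROUP BY E.i;

  -- PageRank-style step: (sum, ×) semiring
  SELECT E.i, sum(S.v * E.v) AS v
  FROM E JOIN S ON E.j = S.i
  GROUP BY E.i;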
Data partitioning in an array DBMS
Data is partitioned by chunks (ranges). The vector S is evenly partitioned across the cluster. Range partitioning is sensitive to skewness; redistribution uses a mod function.
Experimental Validation
Time complexity is close to linear. We compare against a classical optimization: replication of the smallest table.
Experimental Validation Optimized Queries in array DBMS vs ScaLAPACK
Comparing columnar vs array vs Spark
Experimental Validation: speed-up with real data sets
Matrix Powers
Matrix Powers with recursive queries
Recursive Queries
Recursive Queries: Powers of a Matrix
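A minimal sketch of matrix powers with a recursive CTE (standard SQL form, e.g. PostgreSQL; the step bound k = 4 is an arbitrary illustration, and since many DBMSs restrict aggregation inside recursion, the semiring aggregation is applied outside the recursive part):

  WITH RECURSIVE R (s, i, j, v) AS (
    SELECT 1, i, j, v FROM E             -- paths of length 1: E itself
    UNION ALL
    SELECT R.s + 1, R.i, E.j, R.v * E.v  -- extend every path by one edge
    FROM R JOIN E ON R.j = E.i
    WHERE R.s < 4                        -- stop at k = 4
  )
  SELECT i, j, sum(v) AS v               -- (+, ×) semiring gives E^4; for shortest
  FROM R                                 -- paths use R.v + E.v above and min(v) here
  WHERE s = 4
  GROUP BY i, j;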
Matrix Multiplication with SQL Queries
Matrix-matrix multiplication, (+, ×) semiring:
  SELECT R.i, E.j, sum(R.v * E.v)
  FROM R JOIN E ON R.j = E.i
  GROUP BY R.i, E.j
Matrix-matrix multiplication, (min, +) semiring:
  SELECT R.i, E.j, min(R.v + E.v)
  FROM R JOIN E ON R.j = E.i
  GROUP BY R.i, E.j
Data partitioning for parallel computation in Columnar DBMS
Data partitioning for parallel computation in an Array DBMS
Distributed storage of R and E in the array DBMS.
[Figure: chunks of R and E distributed across workers 1-4]
Experimental Validation Matrix Multiplication. Comparing to ScaLAPACK
Experimental Validation Parallel Speed-up: column and array DBMS
Conclusions
Data summarization with the Gamma matrix enables computing linear models (PCA, LR, VS) in one or two passes over the data set, 10s to 100s of times faster than the standard approach.
Graph algorithms (reachability, SSSP, WCC, PageRank, transitive closure, APSP, triangle counting) can be expressed as matrix-vector and matrix-matrix multiplications under different semirings and solved with relational queries.
Careful data partitioning (join locality, skew handling) is the key to parallel performance in columnar and array DBMSs.
Publications
1. The Gamma Matrix to Summarize Dense and Sparse Data Sets for Big Data Analytics. IEEE TKDE 28(7): 1905-1918 (2016)
2. Accelerating a Gibbs Sampler for Variable Selection on Genomics Data with Summarization and Variable Pre-selection Combining an Array DBMS and R. Machine Learning 102(3): 483-504 (2016)
3. Comparing Columnar, Row and Array DBMSs to Process Recursive Queries on Graphs. Inf. Syst. 63: 66-79 (2017)
4. Unified Algorithm to Solve Several Graph Problems with Relational Queries. Alberto Mendelzon International Workshop on Foundations of Data Management (2016)