Optimized Algorithms for Data Analysis in Parallel Database Systems


Optimized Algorithms for Data Analysis in Parallel Database Systems Wellington M. Cabrera Advisor: Dr. Carlos Ordonez

Outline Motivation Background Parallel DBMSs under the shared-nothing architecture Data sets Review of work before the proposal Linear Models with Parallel Matrix Multiplication Variable Selection, Linear Regression, PCA Presentation of recent work Graph Analytics with Parallel Matrix-Matrix Multiplication Transitive Closure, All-Pairs Shortest Path, Triangle Counting Graph Analytics with Parallel Matrix-Vector Multiplication PageRank, Connected Components, Reachability, SSSP Conclusions (Speaker note: emphasize that this work goes deeper into the parallelism aspect.)

Motivation & Contributions

Motivation Large data sets are found in every domain. Continuous growth of data: number of records, number of attributes/features. DBMSs are systems with a large body of research behind them: query optimizer, optimized I/O, parallelism. DBMSs offer increased security compared with ad-hoc file management.

Issues Most data analysis, model computation and graph analytics is done outside the database, by exporting CSV files. It is difficult to express complex models and graph algorithms in a DBMS: no support for matrix operations; queries may become hard to program. Algorithms programmed without a deep understanding of DBMS technology may run with poor performance. What is wrong with exporting the data set to external systems? Data privacy is threatened, time is wasted, and the analysis is delayed.

Contributions History/Timeline First part of PhD: Linear Models with Parallel Matrix Multiplication [1, 2] Variable Selection, Linear Regression, PCA Second part of PhD: Graph Analytics with Parallel Matrix-Matrix Multiplication [3] Transitive Closure, All-Pairs Shortest Path, Triangle Counting Graph Analytics with Parallel Matrix-Vector Multiplication [4] PageRank, Connected Components, Reachability, SSSP 1. The Gamma Matrix to Summarize Dense and Sparse Data Sets for Big Data Analytics. IEEE TKDE 28(7): 1905-1918 (2016) 2. Accelerating a Gibbs sampler for variable selection on genomics data with summarization and variable pre-selection combining an array DBMS and R. Machine Learning 102(3): 483-504 (2016) 3. Comparing columnar, row and array DBMSs to process recursive queries on graphs. Inf. Syst. 63: 66-79 (2017) 4. Unified Algorithm to Solve Several Graph Problems with Relational Queries. Alberto Mendelzon International Workshop on Foundations of Data Management (2016)

(Speaker note: stress that I take advantage of my knowledge of parallelism.) BACKGROUND

Definitions Data set for Linear Models Let X = {x1, ..., xn} be the input data set with n data points, where each point has d dimensions. X is a d × n matrix, where the data point xi is represented by a column vector (thus, equivalent to a d × 1 matrix). Y is a 1 × n vector representing the dependent variable. Generally n > d; therefore X is a rectangular matrix. Big data: n >> d.

Definitions

Definition: Graph data set Let G = (V, E), with m = |E| and n = |V|. We denote the adjacency matrix of G by E; E is an n × n matrix, generally sparse. S: a vector of vertices used in graph computations, with |S| = |V| = n. Each entry Si represents a vertex attribute: distance from a specific source, membership, probability. We omit values in S that carry no information (like ∞ for distances, 0 for probabilities). Notice E is n × n, but X is d × n.

DBMS Storage classes Row Store: Legacy, transactions Column Store: Modern, analytics Array Store: Emerging, Scientific

Linear Models: data set storage in columnar/row DBMS Case n >> d Low- and high-dimensional data sets: n in the millions/billions; d up to a few hundred. Most data sets: marketing, public health, sensor networks. Data point xi is stored as a row with d columns, plus an extra column to store the outcome Y. Thus, the data set is stored as a table T, where T has n rows and d+1 columns. Parallel databases may partition T either by a hash function or by a mod function.

Linear Models: data set storage in columnar/row DBMS Case d > n Very high d, low n; d in the thousands. Examples: gene expression (microarray) data, word frequency in documents. The n > d layout cannot be kept: the number of columns is beyond most row DBMS limits. Data point xi is stored as a column, with an extra row to store the outcome Y. Thus, the data set is stored in a table T, which has n columns and d+1 rows. (Speaker note: clarify.)

Linear Models: data set representation in an array DBMS Array databases store data as multidimensional arrays instead of relational tables. Arrays are partitioned into chunks (two-dimensional data blocks); all chunks in a given array have the same shape and size. Data points xi are stored as rows, with an extra column for the outcome yi. Thus, the data set is represented as a two-dimensional array with n rows and d+1 columns.

Graph data set Row and columnar DBMS: E(i,j,v) Array DBMS: E as a n x n sparse array

Linear Models COMPUTATION with MATRIX MULTIPLICATION

Gamma Matrix Γ = Z·Zᵀ

Models Computation Two-step algorithm for PCA, LR, VS: one pass over the data set. 1) Compute the summarization matrix (Gamma) in one pass. 2) Compute the models (PCA, LR, VS) using Gamma. Pre-selection + two-step algorithm for very high-dimensional VS (two passes): a preprocessing step is incorporated. 1) Compute partial Gamma and perform pre-selection. 2) Compute VS using Gamma.

Models Computation Two-step algorithm: 1) Compute the summarization matrix Gamma in the DBMS (cluster: multiple nodes/cores). 2) Compute the model locally, exploiting Gamma and parallel matrix operations (LAPACK), using any programming language (e.g., R, C++, C#). This approach was published in our work [1].

First step: one-pass data set summarization We introduced the Gamma Matrix in [1]. The Gamma Matrix (or Γ) is a square matrix with d+2 rows and columns that contains a set of sufficient statistics, useful to compute several statistical indicators and models: PCA, VS, LR, covariance/correlation matrices. It is computed in parallel with multiple cores or multiple nodes.
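As a concrete illustration, here is a minimal NumPy sketch of the summarization (not the dissertation's actual UDF code): Z augments X with a row of ones and the outcome Y, and Γ = Z·Zᵀ then holds n, L = Σ xi and Q = X Xᵀ in its blocks.

```python
import numpy as np

def gamma(X, Y):
    """One-pass summarization: Gamma = Z Z^T with Z = [1; X; Y].

    X is d x n (one data point per column), Y is 1 x n. The
    (d+2) x (d+2) result contains n, L (linear sums), Q = X X^T,
    and the cross-products with Y."""
    d, n = X.shape
    Z = np.vstack([np.ones((1, n)), X, Y])   # (d+2) x n
    return Z @ Z.T

# tiny example: d = 2 dimensions, n = 4 points
X = np.array([[1., 2., 3., 4.],
              [0., 1., 0., 1.]])
Y = np.array([[1., 3., 3., 5.]])
G = gamma(X, Y)
n = G[0, 0]      # count of points
L = G[1:3, 0]    # per-dimension sums
Q = G[1:3, 1:3]  # X X^T
```

Reading n, L and Q off the single matrix G is what makes the second step independent of the data size.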

Matrix Multiplication Z·Zᵀ Parallel computation with a multicore CPU (single node) in one pass. Aggregate UDFs are processed in parallel, in four phases (initialize, accumulate, merge, terminate), which enables multicore processing. Initialize: variables are set up. Accumulate: partial Gammas are calculated via vector products. Merge: the final Gamma is computed by adding the partial Gammas. Terminate: control returns to the main process. Computation with LAPACK (main memory). Computation with OpenMPI.

Matrix Multiplication Z·Zᵀ Parallel computation with multiple nodes: computation in a parallel array database. Each worker can process with one or multiple cores. Each core computes its own partial Gamma using its own local data. The master node receives the partial Gammas from the workers and computes the final Gamma with matrix addition.
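The multi-node scheme can be sketched in a few lines (a simulation of the workers, not actual SciDB code): each worker computes a partial Gamma on its local partition, and because Gamma is a sum of per-point outer products, the master recovers the exact global Gamma by matrix addition.

```python
import numpy as np

def partial_gamma(Xp, Yp):
    # partial Gamma over one worker's local partition of the data
    Z = np.vstack([np.ones((1, Xp.shape[1])), Xp, Yp])
    return Z @ Z.T

# hypothetical cluster with 2 workers; columns split round-robin
rng = np.random.default_rng(42)
X = rng.random((3, 10))
Y = rng.random((1, 10))
parts = [(X[:, ::2], Y[:, ::2]), (X[:, 1::2], Y[:, 1::2])]

# master adds the partial Gammas: matrix addition is exact because
# Gamma is a sum over data points, which commutes with partitioning
G = sum(partial_gamma(Xp, Yp) for Xp, Yp in parts)
```

Only the small (d+2) × (d+2) partial matrices travel to the master, never the data itself.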

Gamma: Γ = Z·Zᵀ

Models Computation Contribution summary: Enables the analysis of very high-dimensional data sets in the DBMS. Overcomes the problem of data sets larger than RAM (d < n). 10s to 100s of times faster than the standard approach. Gamma fits in memory when d < n; Gamma does not fit in memory when d >> n.

PCA Compute Γ, which contains n, L and Q. Compute ρ, solve the SVD of ρ, and select the k principal components.
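A hedged NumPy sketch of this two-step PCA (an illustration, not the published implementation): ρ is the correlation matrix derived from n, L and Q, so the components come out of Gamma with no second pass over the data.

```python
import numpy as np

def pca_from_gamma(G, d, k):
    """PCA from the sufficient statistics inside Gamma.

    n, L and Q are read off G; rho is the correlation matrix
    derived from them, and the SVD of rho gives the top-k
    principal components."""
    n = G[0, 0]
    L = G[1:d + 1, 0]
    Q = G[1:d + 1, 1:d + 1]
    s = np.sqrt(n * np.diag(Q) - L * L)            # per-dimension scale
    rho = (n * Q - np.outer(L, L)) / np.outer(s, s)
    U, S, _ = np.linalg.svd(rho)
    return U[:, :k], S[:k]                          # components, variances

# toy data: d = 3 dimensions, n = 50 points, plus outcome Y
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))
Y = rng.standard_normal((1, 50))
Z = np.vstack([np.ones((1, 50)), X, Y])
U, S = pca_from_gamma(Z @ Z.T, d=3, k=2)
```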

LR Computation β = Q⁻¹ (X Yᵀ), computed from sub-matrices of Γ. (Speaker note: small fix needed; use a mathematical-notation typeface.)
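A minimal sketch of LR from Gamma in NumPy, under the assumption stated above that the coefficients solve the normal equations built from Gamma's blocks (the upper-left (d+1) × (d+1) block acts as Q with the intercept row, and the last column holds X Yᵀ):

```python
import numpy as np

def linreg_from_gamma(G, d):
    """Linear regression coefficients from Gamma: solve A beta = b,
    where A is the upper-left (d+1) x (d+1) block of Gamma and b is
    the matching slice of Gamma's last column (the X Y^T entries)."""
    A = G[:d + 1, :d + 1]
    b = G[:d + 1, d + 1]
    return np.linalg.solve(A, b)

# exact linear data: y = 1 + 2x, so beta must recover (1, 2)
X = np.array([[1., 2., 3., 4.]])
Y = 2.0 * X + 1.0
Z = np.vstack([np.ones((1, 4)), X, Y])
beta = linreg_from_gamma(Z @ Z.T, d=1)
```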

Variable Selection 1 + 2 Step Algorithm Pre-selection, based on marginal correlation ranking: calculate the correlation between each variable and the outcome; sort in descending order; take the top d variables, which are considered for further analysis. Compute Γ, which contains Qγ and Xγ Yᵀ. Iterate the Gibbs sampler for a sufficiently large number of iterations to explore the model space. (Speaker note: quite a bit of justification still needs to be added here; add a statistics textbook as a reference.)
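The pre-selection step can be sketched as follows (an illustration assuming plain Pearson marginal correlation; the published pipeline may differ in details such as ties and standardization):

```python
import numpy as np

def preselect(X, y, d):
    """Marginal-correlation pre-selection: rank each of the p
    variables by |corr(variable, outcome)| and keep the top d.
    X is p x n (p candidate variables, n samples); y has n entries."""
    Xc = X - X.mean(axis=1, keepdims=True)           # center each variable
    yc = y - y.mean()                                 # center the outcome
    corr = (Xc @ yc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:d]              # indices, best first

# variable 0 equals the outcome exactly, so it must rank first
y = np.array([1., 2., 3., 4., 5.])
X = np.array([y,
              [1., 3., 2., 5., 4.],
              [2., 1., 4., 3., 5.]])
top = preselect(X, y, d=2)
```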

Optimizing the Gibbs Sampler Non-conjugate Gaussian priors require the full Markov chain; conjugate priors simplify the computation (β and σ are integrated out). Marin-Robert formulation: Zellner g-prior for β and Jeffreys prior for σ. (Speaker note: capital Gamma is not explained here; what is a conjugate prior?)

PCA DBMS: SciDB. System: local, 1 node, 2 instances. Dataset: KDDnet. Times in seconds.
d    n     R     Γ operator + R
10   100K  0.5   0.6
10   1M    4.8   1.3
10   10M   45.2  7
10   100M  fail  64.7
100  10K   0.7   0.8
100  100K  5.7   2.5
100  1M    61.2  16.8
100  10M   —     194.9
(Speaker note: show the largest tables we have; mention that variable selection is not available in Spark.)

LR DBMS: SciDB. System: local, 1 node, 2 instances. Dataset: KDDnet. Times in seconds.
d    n     R     Γ operator + R
10   100K  0.5   0.6
10   1M    5.6   1.3
10   10M   50.1  7.1
10   100M  fail  69.8
100  10K   0.7   0.9
100  100K  6.3   2.6
100  1M    60.5  16.9
100  10M   —     194.9

VS DBMS: SciDB. System: local, 1 node, 2 instances. Dataset: Brain Cancer - miRNA. Times in seconds.
d    n     VS R  Γ operator + R
10   100K  3.4   3.6
10   1M    7.6   4.7
10   10M   58.8  9.6
10   100M  fail  93.4
100  10K   13.1  13
100  100K  34.4  14.8
100  1M    113   29.1
100  10M   —     207.2

p      d    n    VS R     Γ operator + R
12500  100  248  1245     15
12500  200  248  2586     17
12500  400  248  stopped  39
12500  800  248  —        84

(Speaker note: parallelism and matrix multiplication fascinated me.) MATRIX-VECTOR COMPUTATION IN PARALLEL DBMS

Algorithms

Matrix-vector multiplication with relational queries
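In Python, the relational evaluation of the matrix-vector product can be sketched over an edge table of (i, j, v) triples and a sparse vector S (an illustration of the query's join + group-by shape, not DBMS code):

```python
from collections import defaultdict

def matvec(E, S):
    """Relational-style matrix-vector product, mirroring
    SELECT E.i, sum(E.v * S.v) FROM E JOIN S ON E.j = S.i GROUP BY E.i.
    E is a list of (i, j, v) triples; S maps vertex -> value, and
    missing entries stand for zero (the sparse-vector convention)."""
    out = defaultdict(float)
    for i, j, v in E:
        if j in S:               # join condition E.j = S.i
            out[i] += v * S[j]   # SUM aggregate, grouped by E.i
    return dict(out)

E = [(1, 1, 2.0), (1, 2, 1.0), (2, 1, 1.0)]
S = {1: 1.0, 2: 3.0}
result = matvec(E, S)            # {1: 5.0, 2: 1.0}
```

The join/aggregate pair is exactly what the parallel DBMS distributes, which is why partitioning E and S on the join columns matters in the next slides.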

Optimizing the Parallel Join Data Partitioning Join locality: E and S are partitioned by hashing on the join columns. Sorted tables: a merge join is possible, with complexity O(n).

Optimizing the Parallel Join Data Partitioning S is split into N chunks; E is split into N × N square chunks. This data partitioning optimizes E JOIN S ON E.j = S.i. [Figure: chunked layout of E and S.]

Handling Skewed Data Chunk density for a social network data set on an 8-instance cluster. Skewness results in an uneven data distribution (right); chunk density after repartitioning (left). Edges per worker, before (right) and after (left) repartitioning.

Unified Algorithm The Unified Algorithm solves: reachability from a source vertex, SSSP, WCC, PageRank.
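A sketch of the unified iteration, instantiated for SSSP under the (min, +) semiring (an illustration of the algorithmic pattern; the function name sssp and the explicit fixpoint test are mine, not the dissertation's code). Swapping the semiring and aggregate yields reachability, WCC or PageRank in the same loop shape.

```python
def sssp(E, source, max_iter=100):
    """SSSP as repeated matrix-vector multiplication under the
    (min, +) semiring. E holds (i, j, weight) triples for directed
    edges i -> j; S holds tentative distances and omits unreached
    vertices (the 'no information' convention for infinity)."""
    S = {source: 0.0}
    for _ in range(max_iter):
        new = dict(S)
        for i, j, w in E:
            if i in S and S[i] + w < new.get(j, float("inf")):
                new[j] = S[i] + w    # (min) aggregate over (+) products
        if new == S:                 # fixpoint: no distance improved
            break
        S = new
    return S

E = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0)]
dist = sssp(E, source=0)             # {0: 0.0, 1: 1.0, 2: 3.0}
```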

Data Partitioning in Array DBMS Data is partitioned into chunks, by ranges. Vector S is evenly partitioned across the cluster, but this is sensitive to skewness; redistribution uses a mod function. (Speaker note: it makes more sense to integrate R with an array DBMS.)

Experimental Validation Time complexity close to linear. Compared with a classical optimization: replication of the smallest table.

Experimental Validation Optimized Queries in array DBMS vs ScaLAPACK

Comparing columnar vs array vs Spark

Experimental Validation Speed up with real data sets

Matrix Powers

Matrix Powers with recursive queries

Recursive Queries

Recursive Queries: Powers of a Matrix
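The powers-of-a-matrix evaluation can be sketched in NumPy with boolean semantics (an illustration; here k bounds the path length, whereas the recursive query iterates the join of R with E until a fixpoint):

```python
import numpy as np

def transitive_closure(E, k):
    """Transitive closure as a sum of matrix powers,
    R = E + E^2 + ... + E^k, with boolean semantics:
    R[i, j] = 1 iff j is reachable from i in at most k steps."""
    A = (E != 0).astype(int)
    R = np.zeros_like(A)
    P = np.eye(A.shape[0], dtype=int)
    for _ in range(k):
        P = (P @ A > 0).astype(int)    # next power, entries clipped to 0/1
        R = ((R + P) > 0).astype(int)  # accumulate reachable pairs
    return R

# path graph 0 -> 1 -> 2: two powers reach every pair
E = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
R = transitive_closure(E, k=2)
```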

Matrix Multiplication with SQL Queries Matrix-matrix multiplication, (+, ×) semiring: SELECT R.i, E.j, sum(R.v * E.v) FROM R JOIN E ON R.j = E.i GROUP BY R.i, E.j Matrix-matrix multiplication, (min, +) semiring: SELECT R.i, E.j, min(R.v + E.v) FROM R JOIN E ON R.j = E.i GROUP BY R.i, E.j
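Both queries can be mirrored in Python over tables of (i, j, v) triples, which makes the shared join/group-by shape and the swapped semiring operators explicit (an illustration using a naive nested-loop join, not an efficient implementation):

```python
def matmul_plus_times(R, E):
    """Mirrors: SELECT R.i, E.j, sum(R.v * E.v)
                FROM R JOIN E ON R.j = E.i GROUP BY R.i, E.j"""
    out = {}
    for ri, rj, rv in R:
        for ei, ej, ev in E:
            if rj == ei:                            # join R.j = E.i
                key = (ri, ej)
                out[key] = out.get(key, 0.0) + rv * ev   # sum of products
    return out

def matmul_min_plus(R, E):
    """Same join, but min(R.v + E.v) per (i, j) group: the
    (min, +) semiring used for shortest-path style computations."""
    out = {}
    for ri, rj, rv in R:
        for ei, ej, ev in E:
            if rj == ei:
                key = (ri, ej)
                cand = rv + ev
                if cand < out.get(key, float("inf")):
                    out[key] = cand
    return out

product = matmul_plus_times([(1, 2, 3.0)], [(2, 1, 4.0)])  # {(1, 1): 12.0}
```

Only the aggregate and the scalar operator change between the two queries; the join structure stays identical, which is what lets one query template cover several graph problems.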

Data partitioning for parallel computation in Columnar DBMS

Data partitioning for parallel computation in Array DBMS Distributed storage of R and E in the array DBMS. [Figure: chunked layout of R and E.]

Experimental Validation Matrix Multiplication. Comparing to ScaLAPACK

Experimental Validation Parallel Speed-up: column and array DBMS

Conclusions TBD

Publications 1.