The Gamma Operator for Big Data Summarization


The Gamma Operator for Big Data Summarization on an Array DBMS Carlos Ordonez

Acknowledgments
Michael Stonebraker, MIT
My PhD students: Yiqun Zhang, Wellington Cabrera
SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov

Why SciDB?
- Large matrices beyond RAM size
- Storage by row or column is not good enough
- Matrices are natural in statistics, engineering and science
- Multidimensional arrays map to matrices, but they are not the same thing
- Parallel shared-nothing architecture is best for big data analytics
- Closer to DBMS technology, but with some similarity to Hadoop
- Feasible to create array operators that take matrices as input and return a matrix
- Combine processing with the R package and LAPACK

Old: separate sufficient statistics

New: generalizing and unifying sufficient statistics: Z = [1, X, Y]
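As a concrete illustration (a minimal plain-Python sketch with illustrative names, not the actual SciDB operator code): with z_i = (1, x_i, y_i), Γ = Z^T Z can be accumulated in one pass as the sum of outer products z_i z_i^T, so n, L = Σ x_i, Q = Σ x_i x_i^T and the y cross-terms all appear as blocks of a single (d+2)×(d+2) matrix.

```python
def gamma(X, Y):
    """One-pass Gamma = Z^T Z for Z = [1, X, Y], as a sum of outer products."""
    d = len(X[0])
    m = d + 2                          # z_i = (1, x_i1, ..., x_id, y_i)
    G = [[0.0] * m for _ in range(m)]
    for x, y in zip(X, Y):
        z = [1.0] + list(x) + [y]      # augmented point
        for a in range(m):             # accumulate the outer product z z^T
            for b in range(m):
                G[a][b] += z[a] * z[b]
    return G

X = [[1.0, 2.0], [3.0, 4.0]]
Y = [1.0, 0.0]
G = gamma(X, Y)
# G[0][0] = n, row 0 holds L = sum of x_i, the middle block holds Q
```

The classical separate statistics (n, L, Q) then become projections of Γ, which is what the equations on the following slides exploit.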

Equivalent equations with projections from Γ

Properties of Γ

Further properties: non-commutative and distributive

Storage in array chunks

In SciDB we store the points in X as a 2D array, scanned in parallel by the workers.

Array storage and processing in SciDB
- Assuming d << n, it is natural to hash-partition X by i = 1..n
- Gamma computation is fully parallel, maintaining local Gamma versions in RAM
- X can be read with a fully parallel scan
- No need to write Gamma from RAM to disk during the scan, unless fault tolerance is required
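Since Γ is distributive over any partition of the points, the fully parallel scan above can be sketched as follows (plain Python standing in for the SciDB workers and coordinator; function names are illustrative):

```python
def partial_gamma(points, d):
    """Local Gamma over one worker's partition; points are (x, y) pairs."""
    m = d + 2
    G = [[0.0] * m for _ in range(m)]
    for x, y in points:
        z = [1.0] + list(x) + [y]
        for a in range(m):
            for b in range(m):
                G[a][b] += z[a] * z[b]
    return G

def merge_gammas(gammas):
    """Coordinator step: Gamma is distributive, so partial results just add."""
    m = len(gammas[0])
    return [[sum(g[a][b] for g in gammas) for b in range(m)] for a in range(m)]

# hash-partition the points by id i, as on the SciDB workers
points = [([1.0, 2.0], 1.0), ([3.0, 4.0], 0.0), ([5.0, 6.0], 1.0)]
parts = [points[0::2], points[1::2]]            # two "workers"
G = merge_gammas([partial_gamma(p, 2) for p in parts])
```

No global synchronization is needed during the scan: each worker only sends its small (d+2)×(d+2) matrix at the end.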

A point must fit in one chunk; otherwise a join is needed (slow).

Parallel computation: each worker computes a local Γ over its chunks and sends it to the coordinator, which adds the partial results.

Dense matrix operator: O(d² n)

Sparse matrix operator: O(d n) for a hyper-sparse matrix
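A sketch of the sparse case (plain Python, illustrative names): storing each point as {column: value}, only the nonzero entries of the augmented z_i contribute to z_i z_i^T, so a point with k nonzeros costs O(k²) rather than O(d²), which gives roughly O(d n) total work on hyper-sparse inputs.

```python
def gamma_sparse(points, d):
    """Gamma over sparse points given as ({col: value}, y) pairs; only
    nonzero entries of z_i contribute to the outer product z_i z_i^T."""
    m = d + 2
    G = [[0.0] * m for _ in range(m)]
    for nz, y in points:
        # augmented sparse vector: position 0 holds 1, position d+1 holds y
        z = {0: 1.0, d + 1: y}
        for j, v in nz.items():
            z[j + 1] = v
        for a, va in z.items():        # loop only over nonzero entries
            for b, vb in z.items():
                G[a][b] += va * vb
    return G

points = [({0: 1.0}, 1.0), ({1: 2.0}, 0.0)]    # two points, d = 2
G = gamma_sparse(points, 2)
```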

Pros: algorithm evaluation with physical array operators
- Since x_i fits in one chunk, joins are avoided (a hash or merge join would cost at least 2X the I/O)
- Since x_i * x_i^T can be computed in RAM, we avoid an aggregation that would require sorting points by i
- No need to store X twice (X and X^T): half the I/O, half the RAM space
- No need to transpose X, a costly reorganization even in RAM, especially if X spans several RAM segments
- The operator runs as compiled C++ code: fast; each vector is accessed once; direct assignment bypasses C++ function calls

System issues and limitations
- Gamma is not efficiently computable in AQL or AFL, hence the operator is required
- Arrays of tuples in SciDB are more general, but cumbersome for matrix manipulation: we use arrays of a single attribute (double)
- Points must be stored completely inside a chunk: wide rectangular chunks may not be I/O optimal
- Loading is slow: arrays must be pre-processed to SciDB load format, loaded into a 1D array, and re-dimensioned => optimize the load
- Multiple SciDB instances per node improve I/O speed by interleaving CPU
- Larger chunks are better (8 MB), especially for dense matrices; avoid shuffling; avoid joins
- Dense (alpha) and sparse (beta) versions

Benchmark: scale-up emphasis
- Small: cluster with 2 Intel Quad-core servers, 4 GB RAM, 3 TB disk
- Large: Amazon cloud

Why is Gamma faster than SciDB+LAPACK?

Gamma operator:
d     Gamma op   scan   mem alloc    CPU   merge
100      3.5      0.7      0.1        2.2    0.0
200     10.9      1.0                 8.6
400     38.8                         33.9
800    145.0      4.6               134.7    0.4
1600   599.8     11.4               575.5

SciDB and LAPACK (crossprod() call in SciDB):
TOTAL   transpose  subarray 1  repart 1  subarray 2  repart 2  build   gemm ScaLAPACK   MKL
  77.3     0.3        41.7       25.9       8.0         0.8     0.2
 163.0               84.9        55.7      17.2         1.8     0.6
 373.1   172.6        0.5       120.6      39.4         5.4     2.1
1497.3   553.6      537.6       169.8      21.2         8.1              *             33.4

Combination: SciDB + R

Can the Gamma operator beat LAPACK?
- Gamma versus Open BLAS LAPACK (90% of MKL performance)
- Gamma: scan, sparse/dense, 2 threads; disk+RAM+CPU
- LAPACK: Open BLAS ~= MKL; 2 threads; RAM+CPU

(Benchmark table: times for the Gamma dense/sparse operator vs. Open BLAS, n ∈ {100K, 1M}, density 0.1% to 100%, d ∈ {100, 200, 400, 800}; Open BLAS fails on one n = 1M case.)

SciDB in the Cloud: massive parallelism

Comparing systems to compute Γ on local server

Vertica vs. SciDB for sparse matrices

Running on the cloud

Conclusions
- One-pass summarization matrix operator: parallel, scalable
- Optimizes the outer matrix multiplication as a sum (aggregation) of vector outer products
- Dense and sparse matrix versions are required
- The operator is compatible with any parallel shared-nothing system, but works best on arrays
- The Gamma matrix must fit in RAM, but n is unlimited
- The summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models
- Simplifies many methods to two phases:
  1. Summarization
  2. Computing model parameters
- Requires arrays, but can work with SQL or MapReduce
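As an example of the two-phase pattern, ordinary least squares needs only projections of Γ: take A as the [1, X] block and b as the [1, X]·Y column, then solve A β = b. A minimal sketch for d = 1 (plain Python; the closed-form 2×2 solve is illustrative, not the operator's code):

```python
def gamma(X, Y):
    """Phase 1: one-pass Gamma for Z = [1, X, Y] (here d = 1)."""
    m = 3
    G = [[0.0] * m for _ in range(m)]
    for x, y in zip(X, Y):
        z = [1.0, x, y]
        for a in range(m):
            for b in range(m):
                G[a][b] += z[a] * z[b]
    return G

X = [0.0, 1.0, 2.0, 3.0]
Y = [1.0, 3.0, 5.0, 7.0]            # y = 1 + 2x
G = gamma(X, Y)

# Phase 2: solve A beta = b, where A = Gamma's [1, X] block, b = its Y column
a00, a01, a10, a11 = G[0][0], G[0][1], G[1][0], G[1][1]
b0, b1 = G[0][2], G[1][2]
det = a00 * a11 - a01 * a10         # Cramer's rule for the 2x2 system
beta0 = (b0 * a11 - a01 * b1) / det
beta1 = (a00 * b1 - b0 * a10) / det
```

Phase 2 touches only the small (d+2)×(d+2) matrix, never the data set again, which is why the split scales.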

Future work: Theory
- Use Gamma in other models such as logistic regression, clustering, factor analysis, HMMs
- Connection to frequent itemsets
- Sampling
- Higher expected moments, covariates
- Unlikely: numeric stability with unnormalized sorted data

Future work: Systems
- DONE: sparse matrices: layout, compression
- DONE: beat LAPACK on high d
- Online model learning (a cursor interface is needed, incompatible with a DBMS)
- Unlimited d (currently d>8000); join required for high d? Parallel processing of high d is more complicated, chunked
- Interface with BLAS and MKL: not worth it?
- DONE: faster than a column DBMS for sparse?