The Gamma Operator for Big Data Summarization


The Gamma Operator for Big Data Summarization on an Array DBMS
Carlos Ordonez

Acknowledgments
Michael Stonebraker, MIT
My PhD students: Yiqun Zhang, Wellington Cabrera
SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov

Why SciDB?
- Large matrices beyond RAM size
- Storage by row or by column is not good enough
- Matrices are natural in statistics, engineering, and science
- Multidimensional arrays map to matrices, but they are not the same thing
- Parallel shared-nothing architecture is best for big data analytics
- Closer to DBMS technology, but with some similarity to Hadoop
- Feasible to create array operators that take matrices as input and return a matrix
- Combines processing with the R package and LAPACK

Old: separate sufficient statistics

New: Generalizing and unifying Sufficient Statistics: Z=[1,X,Y]

Equivalent equations with projections from Γ
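As an illustration of such projections, assuming the standard layout Γ = Z^T Z with Z = [1, X, Y] as defined above (this rendering follows the deck's definitions, not the slide's original figure):

```latex
\Gamma = Z^{T} Z =
\begin{bmatrix}
n & L^{T} & \mathbf{1}^{T} Y \\
L & Q & X^{T} Y \\
Y^{T}\mathbf{1} & Y^{T} X & Y^{T} Y
\end{bmatrix},
\qquad
L = \sum_{i=1}^{n} x_i, \quad
Q = X^{T} X = \sum_{i=1}^{n} x_i x_i^{T}
```

For example, linear regression with an intercept needs only blocks (projections) of Γ:

```latex
\hat{\beta} =
\begin{bmatrix} n & L^{T} \\ L & Q \end{bmatrix}^{-1}
\begin{bmatrix} \mathbf{1}^{T} Y \\ X^{T} Y \end{bmatrix}
```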

Properties of Γ

Further properties: matrix multiplication is non-commutative, but the computation of Γ is distributive (partial Γs over partitions of X can be added)

Storage in array chunks

In SciDB we store the points of X as a 2D array. [Figure: a worker SCANs the array chunk by chunk across columns 1..d.]

Array storage and processing in SciDB
- Assuming d << n, it is natural to hash-partition X by i = 1..n
- Gamma computation is fully parallel, maintaining local Γ versions in RAM
- X can be read with a fully parallel scan
- No need to write Γ from RAM to disk during the scan, unless fault tolerance is required
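As a minimal sketch of the per-worker step (illustrative C++ under assumed types; GammaLocal and accumulate_point are hypothetical names, not the actual SciDB operator code):

```cpp
#include <vector>
#include <cstddef>

// Local Gamma accumulator: one (d+2) x (d+2) dense matrix per worker.
// z_i = (1, x_i, y_i); Gamma += z_i * z_i^T for every point scanned.
struct GammaLocal {
    std::size_t dim;            // dim = d + 2
    std::vector<double> G;      // row-major (d+2) x (d+2), zero-initialized

    explicit GammaLocal(std::size_t d) : dim(d + 2), G(dim * dim, 0.0) {}

    // Accumulate one point: x has d entries, y is the target value.
    void accumulate_point(const std::vector<double>& x, double y) {
        std::vector<double> z(dim);
        z[0] = 1.0;                                   // intercept entry
        for (std::size_t j = 0; j < x.size(); ++j) z[j + 1] = x[j];
        z[dim - 1] = y;
        for (std::size_t a = 0; a < dim; ++a)         // outer product z z^T
            for (std::size_t b = 0; b < dim; ++b)
                G[a * dim + b] += z[a] * z[b];
    }
};
```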

Each point must fit in one chunk; otherwise a join is needed (slow). [Figure: chunk layouts over columns 1..d labeled OK vs. NO!, on the coordinator and Worker 1.]

Parallel computation: each worker scans its partition and maintains its own Γ; workers then send their partial Γ to the coordinator, which adds them. [Figure: coordinator and Workers 1-2, each holding chunks 1..d, sending their local Γ to the coordinator.]
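The coordinator-side merge is then just entry-wise addition of the partial Γ buffers; a minimal sketch under the same illustrative row-major representation as above:

```cpp
#include <vector>
#include <cstddef>

// Coordinator-side merge: the global Gamma is the entry-wise sum of the
// workers' partial Gammas (valid because Gamma is distributive over
// partitions of X). Both buffers are row-major (d+2) x (d+2) matrices.
void merge_gamma(std::vector<double>& global,
                 const std::vector<double>& partial) {
    for (std::size_t k = 0; k < global.size(); ++k)
        global[k] += partial[k];
}
```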

Dense matrix operator: O(d^2 n)

Sparse matrix operator: O(d n) for hyper-sparse matrices
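A sketch of the sparse update, assuming each point arrives as (index, value) pairs for its non-zero coordinates: with k non-zeros per point the update costs O(k^2) instead of O(d^2), which yields roughly O(d n) overall for hyper-sparse data (hypothetical code, not the operator's actual implementation):

```cpp
#include <vector>
#include <cstddef>
#include <utility>

// Sparse Gamma update: x lists the non-zero coordinates of one point as
// (index, value) pairs with indices in 1..d (slot 0 is reserved for the
// intercept, slot d+1 holds y). G is a row-major (d+2) x (d+2) buffer
// and dim = d + 2.
void accumulate_sparse(std::vector<double>& G, std::size_t dim,
                       const std::vector<std::pair<std::size_t, double>>& x,
                       double y) {
    // Build the sparse z vector: intercept, non-zeros of x, then y.
    std::vector<std::pair<std::size_t, double>> z;
    z.emplace_back(0, 1.0);
    for (const auto& e : x) z.emplace_back(e.first, e.second);
    if (y != 0.0) z.emplace_back(dim - 1, y);
    // Outer product over non-zero entries only: O(k^2) per point.
    for (const auto& a : z)
        for (const auto& b : z)
            G[a.first * dim + b.first] += a.second * b.second;
}
```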

Pros: algorithm evaluation with physical array operators
- Since x_i fits in one chunk, joins are avoided (a hash or merge join would take at least 2X the I/O)
- Since x_i x_i^T can be computed in RAM, we avoid an aggregation that would require sorting points by i
- No need to store X twice (X and X^T): half the I/O, half the RAM space
- No need to transpose X, a costly reorganization even in RAM, especially if X spans several RAM segments
- The operator runs as compiled C++ code: fast; each vector is accessed once; direct assignment (bypassing C++ function calls)

System issues and limitations
- Γ is not efficiently computable in AQL or AFL, hence a custom operator is required
- Arrays of tuples in SciDB are more general, but cumbersome for matrix manipulation: we use arrays of a single attribute (double)
- Points must be stored completely inside a chunk, forcing wide rectangular chunks, which may not be I/O optimal
- Loading is slow: arrays must be pre-processed into SciDB load format, loaded into a 1D array, and re-dimensioned => optimize the load
- Multiple SciDB instances per node improve I/O speed by interleaving CPU
- Larger chunks are better (8 MB), especially for dense matrices; avoid shuffling; avoid joins
- Dense (alpha) and sparse (beta) versions

Benchmark: scale-up emphasis
- Small: cluster with 2 Intel Quad-core servers, 4 GB RAM, 3 TB disk
- Large: Amazon cloud

Why is Gamma faster than SciDB+LAPACK? (times in seconds; cells lost in the transcript are left blank)

Gamma operator:

d     Gamma op  Scan  mem alloc  CPU    merge
100   3.5       0.7   0.1        2.2    0.0
200   10.9      1.0              8.6
400   38.8                       33.9
800   145.0     4.6              134.7  0.4
1600  599.8     11.4             575.5

SciDB and LAPACK (crossprod() call in SciDB):

d     TOTAL   transpose  subarray 1  repart 1  subarray 2  repart 2  build 0s  gemm ScaLAPACK  MKL
100   77.3    0.3        41.7        25.9      8.0         0.8       0.2
200   163.0   84.9       55.7        17.2      1.8         0.6
400   373.1   172.6      0.5         120.6     39.4        5.4       2.1
800   1497.3  553.6      537.6       169.8     21.2        8.1
1600  *                                                                                        33.4

Combination: SciDB + R

Can the Gamma operator beat LAPACK?
- Gamma versus OpenBLAS LAPACK (OpenBLAS reaches ~90% of MKL performance)
- Gamma: scan, sparse/dense, 2 threads; disk+RAM+CPU
- LAPACK: OpenBLAS ~= MKL; 2 threads; RAM+CPU

Times in seconds for d = 100/200/400/800; cells lost in the transcript are left blank:

n     density   Gamma dense                     Gamma sparse                    OpenBLAS
100k  0.1%      3.3 / 11.3 / 38.9 / 145.0       0.1 /      / 0.2  / 0.6         0.4 / 1.0 / 3.1 / 10.7
100k  1.0%                                      0.5 /      / 2.2  /
100k  10.0%                                     0.9 /      / 6.2  /
100k  100.0%                                    4.5 / 15.4 / 55.9 / 201.0
1M    0.1%      31.1 / 103.5 / 316.5 / 1475.7   1.1 /      /      /             3.8 / 10.0 / 423.2 / fail
1M    1.0%                                      4.0 / 7.0  / 16.3 / 46.4
1M    100.0%                                    44.0 / 148.8 / 542.3 / 2159.6

SciDB in the Cloud: massive parallelism

Conclusions
- One-pass summarization matrix operator: parallel and scalable
- Optimization of the outer matrix multiplication as a sum (aggregation) of vector outer products
- Dense and sparse matrix versions are required
- The operator is compatible with any parallel shared-nothing system, but works best on arrays
- The Γ matrix must fit in RAM, but n is unlimited
- The summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models
- Simplifies many methods to two phases: (1) summarization, (2) computing model parameters
- Requires arrays, but can work with SQL or MapReduce

Future work: theory
- Use Γ in other models such as logistic regression, clustering, factor analysis, HMMs
- Connection to frequent itemsets
- Sampling
- Higher expected moments, covariates
- Unlikely: numeric stability with unnormalized sorted data

Future work: systems
- DONE: sparse matrices: layout, compression
- DONE: beat LAPACK at high d
- Online model learning (a cursor interface is needed, which is incompatible with a DBMS)
- Unlimited d (currently limited to d ≈ 8000); is a join required for high d? Parallel processing at high d is more complicated, chunked
- Interface with BLAS and MKL: not worth it?
- Faster than a column DBMS for sparse data?