
Efficient Data-Parallel Cumulative Aggregates for Large-Scale Machine Learning
Matthias Boehm (1), Alexandre V. Evfimievski (2), Berthold Reinwald (2)
(1) Graz University of Technology, Graz, Austria
(2) IBM Research – Almaden, San Jose, CA, USA

Motivation: Large-Scale ML
- Large-scale machine learning (data, model, usage, and feedback loop)
  - Variety of ML applications (supervised, semi-/unsupervised)
  - Large data collections (labels: feedback, weak supervision)
- State-of-the-art ML systems
  - Batch algorithms → data-/task-parallel operations
  - Mini-batch algorithms → parameter server
- Data-parallel distributed operations
  - Linear algebra (matrix multiplication, element-wise operations, structural and grouping aggregations, statistical functions)
  - Meta learning (e.g., cross validation, ensembles, hyper-parameters)
  - In practice: also reorganizations and cumulative aggregates

Motivation: Cumulative Aggregates
- Example: prefix sums, Z = cumsum(X) with Z_ij = sum_{k=1..i} X_kj = X_ij + Z_(i-1)j
  e.g., X = [[1,2],[1,1],[3,4],[2,1]] yields cumsum(X) = [[1,2],[2,3],[5,7],[7,8]]
- Applications
  - #1 Iterative survival analysis: Cox regression / Kaplan-Meier
  - #2 Spatial data processing via linear algebra, cumulative histograms
  - #3 Data preprocessing: subsampling of rows / removing empty rows
- Parallelization
  - The recursive formulation looks inherently sequential
  - Classic example for parallelization via aggregation trees (message passing, e.g., partial sums per MPI rank, or shared-memory HPC systems)
- Question: efficient, data-parallel cumulative aggregates over blocked matrices represented as unordered collections in Spark or Flink?

Outline
- SystemML Overview and Related Work
- Data-Parallel Cumulative Aggregates
- System Integration
- Experimental Results

SystemML Overview and Related Work

High-Level SystemML Architecture
- DML scripts: DML (Declarative Machine Learning Language), 20+ scalable algorithms
- APIs: command line, JMLC, Spark MLContext, Spark ML
- Language and compiler [SIGMOD'15,'17,'19] [PVLDB'14,'16a,'16b,'18] [ICDE'11,'12,'15] [CIDR'17] [VLDBJ'18] [DEBull'14] [PPoPP'15]
- Runtime backends: in-memory single node (scale-up, since 2012), Hadoop cluster (scale-out, since 2010/11), Spark cluster (since 2015), GPU (in progress, since 2014/16)
- Project milestones: 08/2015 open source release, 11/2015 Apache Incubator project, 05/2017 Apache Top-Level project

Basic HOP and LOP DAG Compilation
- Example: LinregDS (direct solve)

  X = read($1);
  y = read($2);
  intercept = $3;
  lambda = 0.001;
  ...
  if( intercept == 1 ) {
    ones = matrix(1, nrow(X), 1);
    X = append(X, ones);
  }
  I = matrix(1, ncol(X), 1);
  A = t(X) %*% X + diag(I)*lambda;  # normal equations with ridge term
  b = t(X) %*% y;
  beta = solve(A, b);               # direct solve
  ...
  write(beta, $4);

- Scenario: X: 10^8 x 10^3 (10^11 cells), y: 10^8 x 1 (10^8 cells); cluster config: 20 GB driver memory, 60 GB executor memory
- [Figure: HOP DAG and LOP DAG after rewrites, with operators dg(rand), r(diag), r(t), ba(+*), b(+), b(solve), write, annotated with CP/SP execution types and size/memory estimates (8 KB up to 1.6 TB); t(X) %*% X compiles to tsmm(SP), t(X) %*% y to mapmm(SP), with X persisted in MEM_DISK]
- Hybrid runtime plans: size propagation / memory estimates, integrated CP / Spark runtime
- Distributed matrices: fixed-size (squared) matrix blocks X_{1,1}, X_{2,1}, ..., X_{m,1}; data-parallel operations
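For readers less familiar with DML, the following NumPy sketch (an illustration with made-up data sizes, not SystemML code) shows the closed-form computation that LinregDS expresses, beta = (X^T X + lambda*I)^(-1) X^T y:

  import numpy as np

  # Illustration only: the closed-form solve expressed by the LinregDS script.
  n, m, lam = 1000, 10, 0.001
  X = np.random.rand(n, m)
  y = np.random.rand(n, 1)

  A = X.T @ X + lam * np.eye(m)   # t(X) %*% X + diag(I)*lambda
  b = X.T @ y                     # t(X) %*% y
  beta = np.linalg.solve(A, b)    # solve(A, b)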

Cumulative Aggregates in ML Systems (Straw-man Scripts and Built-in Support)
- Straw-man DML scripts (copy-on-write semantics):

  # O(n^2): row-wise accumulation of running column sums
  cumsumN2 = function(Matrix[Double] A)
    return(Matrix[Double] B)
  {
    B = A; csums = matrix(0,1,ncol(A));
    for( i in 1:nrow(A) ) {
      csums = csums + A[i,];
      B[i,] = csums;
    }
  }

  # O(n log n): log-scan via doubling of shifted partial sums
  cumsumNlogN = function(Matrix[Double] A)
    return(Matrix[Double] B)
  {
    B = A; m = nrow(A); k = 1;
    while( k < m ) {
      B[(k+1):m,] = B[(k+1):m,] + B[1:(m-k),];
      k = 2 * k;
    }
  }

  Both qualify for update in-place, but are still too slow.
- ML systems
  - Update in-place: R (reference counting), SystemML (rewrites), Julia
  - Builtins cumsum(), cummin(), cummax(), cumprod() in R, MATLAB, Julia, NumPy, SystemML (since 2014)
- SQL (window functions, introduced in the OLAP extensions of SQL:2003):
  SELECT Rid, V, sum(V) OVER(ORDER BY Rid) AS cumsum FROM X
  - Sequential and parallelized execution (e.g., [Leis et al., PVLDB'15]); segment trees avoid redundant work for semi-additive aggregation functions like min and max over variable-sized frames

Data-Parallel Cumulative Aggregates

DistCumAgg Framework
- Basic idea: self-similar operator chain (forward, local, backward); see the sketch below
  - Forward: block-local aggregates per matrix block
  - Local: cumulative aggregate over the block aggregates (aggregates of aggregates)
  - Backward: apply the resulting per-block offsets to the block-local cumulative aggregates
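To make this concrete, here is a minimal NumPy sketch of the three-phase idea for a blocked, data-parallel cumsum (an illustration with assumed names and blocksize, not the SystemML runtime, which operates on distributed block collections):

  import numpy as np

  def blocked_cumsum(X, blocksize=1000):
      # Split the rows into blocks (stand-in for a distributed block collection).
      blocks = [X[i:i + blocksize] for i in range(0, X.shape[0], blocksize)]

      # Forward: per-block column aggregates (colSums), data-parallel.
      aggs = np.vstack([b.sum(axis=0) for b in blocks])

      # Local: cumsum over the block aggregates, shifted by one block,
      # yields the offset each block has to add (first block gets zeros).
      offsets = np.vstack([np.zeros(X.shape[1]), np.cumsum(aggs, axis=0)[:-1]])

      # Backward: block-local cumsum plus broadcast offset, data-parallel.
      return np.vstack([np.cumsum(b, axis=0) + offsets[i]
                        for i, b in enumerate(blocks)])

  X = np.random.rand(10000, 8)
  assert np.allclose(blocked_cumsum(X, 1024), np.cumsum(X, axis=0))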

Basic Cumulative Aggregates
- Instantiating basic cumulative aggregates (B: a matrix block, a: the incoming offset row, B1:: the first row of B); in the runtime, these operations are fused to avoid a copy

  Operation   | Init | fagg        | foff                | fcumagg
  ------------+------+-------------+---------------------+-----------
  cumsum(X)   |  0   | colSums(B)  | B1: := B1: + a      | cumsum(B)
  cummin(X)   |  ∞   | colMins(B)  | B1: := min(B1:, a)  | cummin(B)
  cummax(X)   | -∞   | colMaxs(B)  | B1: := max(B1:, a)  | cummax(B)
  cumprod(X)  |  1   | colProds(B) | B1: := B1: * a      | cumprod(B)
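The same skeleton generalizes over the (Init, fagg, foff, fcumagg) columns of the table. The sketch below (our NumPy illustration with assumed names; it applies foff to the whole block-local result instead of only its first row, which is equivalent for these aggregation functions) instantiates cummax:

  import numpy as np

  def blocked_cumagg(X, init, fagg, foff, fcumagg, blocksize=1000):
      blocks = [X[i:i + blocksize] for i in range(0, X.shape[0], blocksize)]
      aggs = np.vstack([fagg(b) for b in blocks])                         # forward
      offs = np.vstack([np.full(X.shape[1], init), fcumagg(aggs)[:-1]])   # local
      return np.vstack([foff(fcumagg(b), offs[i])                         # backward
                        for i, b in enumerate(blocks)])

  # cummax: Init = -inf, fagg = colMaxs, foff = max, fcumagg = cummax
  X = np.random.rand(5000, 4)
  Z = blocked_cumagg(X, -np.inf,
                     lambda B: B.max(axis=0),
                     np.maximum,
                     lambda B: np.maximum.accumulate(B, axis=0))
  assert np.allclose(Z, np.maximum.accumulate(X, axis=0))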

Complex Cumulative Aggregates
- Instantiating complex recurrence equations
- Example: Z = cumsumprod(X) = cumsumprod(Y, W) with Z_i = Y_i + W_i * Z_(i-1), Z_0 = 0, where X = cbind(Y, W), e.g., X = [[1, .2], [1, .1], [3, .0], [2, .1]]; application: exponential smoothing

  Operation      | Init | fagg                                      | foff                     | fcumagg
  ---------------+------+-------------------------------------------+--------------------------+---------------
  cumsumprod(X)  |  0   | cbind(cumsumprod(B)_[n,1], prod(B_[,2]))  | B_11 := B_11 + B_12 * a  | cumsumprod(B)
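A sequential reference for the recurrence (an illustration only; per the slide, cumsumprod takes the two-column input cbind(Y, W)), with exponential smoothing S_i = alpha*x_i + (1-alpha)*S_(i-1) obtained by setting Y = alpha*x and W = 1-alpha:

  import numpy as np

  def cumsumprod(Y, W):
      # Z_i = Y_i + W_i * Z_(i-1) with Z_0 = 0, computed sequentially.
      Z, z = np.empty_like(Y), 0.0
      for i in range(len(Y)):
          z = Y[i] + W[i] * z
          Z[i] = z
      return Z

  x = np.random.rand(1000)
  alpha = 0.2
  smoothed = cumsumprod(alpha * x, np.full_like(x, 1 - alpha))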

System Integration

Simplification Rewrites
- Example #1: suffix sums
  - Problem: a distributed reverse causes data shuffling
  - Compute via column aggregates and prefix sums instead:
    rev(cumsum(rev(X))) → X + colSums(X) - cumsum(X)
    (the colSums result is broadcast; the rewritten form is partitioning-preserving)
- Example #2: extract lower triangular
  - Problem: indexing is cumbersome/slow, and cumsum is densifying (the column-wise cumsum of the identity is a dense lower-triangular 0/1 mask)
  - Use a dedicated operator instead:
    X * cumsum(diag(matrix(1,nrow(X),1))) → lower.tri(X)
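Both rewrites are easy to sanity-check numerically; a small NumPy check (our illustration) on random data:

  import numpy as np

  X = np.random.rand(6, 4)

  # #1 Suffix sums: rev(cumsum(rev(X))) == X + colSums(X) - cumsum(X)
  lhs = np.cumsum(X[::-1], axis=0)[::-1]
  rhs = X + X.sum(axis=0) - np.cumsum(X, axis=0)
  assert np.allclose(lhs, rhs)

  # #2 Lower triangular (incl. diagonal): X * cumsum(diag(ones)) == lower.tri(X)
  A = np.random.rand(6, 6)
  mask = np.cumsum(np.eye(6), axis=0)   # column-wise cumsum of the identity
  assert np.allclose(A * mask, np.tril(A))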

Execution Plan Generation
- Compilation chain of cumulative aggregates
  - Execution type selection based on memory estimates
  - Physical operator configuration (broadcast, aggregation, in-place, #threads)
- Example: the high-level operator (HOP) u(cumsum) over X compiles into a chain of low-level operators (LOPs): a Spark cumagg (k+) over the blocks, a CP cumsum over the aggregates (24 threads, in-place), and a Spark cumoff (k+, broadcast) that applies the offsets; the chain is applied recursively while nrow(X) ≥ blocksize b
- Resulting runtime plan (excerpt):

  ...
  SP ucumack+  _mVar1 _mVar2
  CP ucumk+    _mVar2 _mVar3 24 T        # 24 threads, in-place
  CP rmvar     _mVar2
  SP bcumoffk+ _mVar1 _mVar3 _mVar4 0 T  # broadcast offsets
  CP rmvar     _mVar1 _mVar3
  ...

Experimental Results

Experimental Setting
- Cluster setup
  - 2+10 node cluster, 2x Intel Xeon E5-2620 per node, 24 vcores, 128 GB RAM
  - 1 Gb Ethernet, CentOS 7.2, OpenJDK 1.8, Hadoop 2.7.3, Spark 2.3.1
  - YARN client mode, 40 GB driver, 10 executors (19 cores, 60 GB memory each)
  - Aggregate memory: 10 * 60 GB * [0.5, 0.6] = [300 GB, 360 GB]
- Baselines and data
  - Local: SystemML 1.2++, Julia 0.7 (08/2018), R 3.5 (04/2018)
  - Distributed: SystemML 1.2++, C-based MPI implementation (OpenMPI 3.1.3)
  - Double-precision (FP64) synthetic data

Local Baseline Comparisons
- Straw-man scripts cumsumN2 and cumsumNlogN (with in-place updates) vs. the built-in cumsum, on data up to 2 GB
- [Figure: local runtime comparison] The built-in cumsum shows competitive single-node performance

Broadcasting and Blocksizes
- Setup: mean runtime of rep=100 print(min(cumsum(X))), including I/O and Spark context creation (~15 s) once
- Results on 160 GB: up to 19.6x improvement (17.3x at the default configuration); a blocksize of 1K is a good compromise (8 MB blocks vs. per-block overheads)

Scalability (from 4 MB to 4 TB)
- Setup: mean runtime of rep=10 print(min(cumsum(X)))

  #Cells | SystemML |   MPI
  -------+----------+-------
  165M   |   0.97s  | 0.14s
  500M   |   4.2s   | 0.26s
  1.65G  |   5.3s   | 0.61s
  5G     |   7.4s   | 1.96s
  15.5G  |  13.9s   | 6.20s
  50G    |  44.8s   | 19.8s
  165G   | 1,531s   | N/A
  500G   | 8,291s   |

- In the paper: characterization of applicable operations (other operations, e.g., cumsum in removeEmpty); more baseline comparisons; weak and strong scaling results

Conclusions
- Summary
  - DistCumAgg: efficient, data-parallel cumulative aggregates (self-similar)
  - End-to-end compiler and runtime integration in SystemML
  - Physical operators for hybrid (local/distributed) plans
- Conclusions
  - Practical ML systems need support for a broad spectrum of operations
  - Efficient parallelization of presumably sequential operations over blocked matrix representations on top of frameworks like Spark or Flink
- Future work
  - Integration with automatic sum-product rewrites
  - Operators for HW accelerators (dense and sparse)
  - Application to parallel time series analysis / forecasting