Resource Elasticity for Large-Scale Machine Learning

Presentation transcript:

Resource Elasticity for Large-Scale Machine Learning
Botong Huang (2), Matthias Boehm (1), Yuanyuan Tian (1), Berthold Reinwald (1), Shirish Tatikonda (1), Frederick R. Reiss (1)
(1) IBM Research – Almaden, (2) Duke University
Acknowledgements: Alexandre V. Evfimievski, Prithviraj Sen, Umar Farooq Minhas, and reviewers

Motivation – Declarative Machine Learning (1)
Example: Linear Regression
- (1) Model training (learning): given an input data set of regressors X and response y, solve an ordinary least squares problem y ≈ X β.
- (2) Prediction (scoring): apply the learned model β to new data ΔX to obtain predictions ŷ.
[Figure: example m x n feature matrix X (Sex, Age, Salary) with response y (Insurance Cost), learned model β, and predictions ŷ]
Declarative Machine Learning
Goal: Write ML algorithms independent of input data / cluster characteristics
- Full flexibility → ML DSL
- Data independence → Coarse-grained operations
- Efficiency and scalability → Hybrid runtime plans
- Specified algorithm semantics → Optimize for performance only
Speaker notes: Applications for large-scale machine learning are pervasive and range from customer classification and regression analysis to clustering and recommendations. Let's use linear regression as an example.
- R1 Full flexibility: specify new / customize existing ML algorithms → ML DSL
- R2 Data independence: hide the physical data representation (sparse/dense, row/column-major, blocking configs, partitioning, caching, compression) → coarse-grained semantic operations; simplifies compilation / optimization (preserves semantics) and the runtime (physical operator implementations)
- R3 Efficiency and scalability: very small to very large use cases → automatic optimization and hybrid runtime plans; pure in-memory, single-node computation provides high efficiency on "small" use cases; independence of the underlying distributed runtime framework (e.g., MR vs Spark)
- R4 Specified algorithm semantics: understand, debug, and control algorithm behavior → optimization for performance, not for performance and accuracy (consistent with language flexibility)
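For reference, a compact statement of the regularized least-squares problem that the Linreg DS script on the next slide solves in closed form (a sketch; the ridge term uses the same constant lambda as the script):

\[ \beta \;=\; \arg\min_{\beta}\; \lVert X\beta - y\rVert_2^2 + \lambda\,\lVert\beta\rVert_2^2 \;\;\Longrightarrow\;\; \beta \;=\; \left(X^{\top}X + \lambda I\right)^{-1} X^{\top} y \]

This corresponds directly to the A = t(X) %*% X + diag(I)*lambda, b = t(X) %*% y, beta = solve(A, b) sequence in the script.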

Background: Hybrid Runtime Plans in SystemML
Example: Linear Regression (in SystemML's R-like syntax)
Cluster config: client mem 1 GB, map/reduce mem 2 GB
Scenario: X: 10^6 x 10^3, 10^9 cells; y: 10^6 x 1, 10^6 cells

Linreg DS (Direct Solve):
  X = read($1);
  y = read($2);
  intercept = $3;
  lambda = 0.001;
  ...
  if( intercept == 1 ) {
    ones = matrix(1, nrow(X), 1);
    X = append(X, ones);
  }
  I = matrix(1, ncol(X), 1);
  A = t(X) %*% X + diag(I)*lambda;
  b = t(X) %*% y;
  beta = solve(A, b);
  ...
  write(beta, $4);

[Figure: HOP DAG (after rewrites) with operators dg(rand), r(diag), ba(+*), r(t), b(+), b(solve), write over X (10^6 x 10^3, 10^9) and y (10^6 x 1, 10^6), annotated with memory estimates from 8 KB to 16 GB and CP/MR execution types; corresponding LOP DAG with r'(CP), mapmm(MR), tsmm(MR), ak+(MR) and map/reduce boundaries]

Memory-sensitive compilation steps:
- Operator selection (CP/MR, physical ops)
- Hop-lop rewrites
- Piggybacking bin constraints

Speaker note: the LOP DAG is simplified (no group after mapmm) – for a single output block, a group is not necessary (for tsmm we never have a group because a single output block is guaranteed by the blocksize constraint).
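To make the memory sensitivity concrete, a rough back-of-the-envelope estimate for this scenario (a sketch, assuming dense double-precision data, consistent with the sizes annotated in the figure):

\[ \operatorname{size}(X) \approx 10^6 \cdot 10^3 \cdot 8\,\mathrm{B} = 8\,\mathrm{GB} \gg 1\,\mathrm{GB}\ \text{(CP budget)}, \qquad \operatorname{size}(X^{\top}X) \approx 10^3 \cdot 10^3 \cdot 8\,\mathrm{B} = 8\,\mathrm{MB} \ll 1\,\mathrm{GB}. \]

Hence t(X) %*% X and t(X) %*% y are compiled to distributed MR operators (tsmm, mapmm), while the small solve(A, b) runs as a single-node CP operator – and changing the 1 GB / 2 GB memory configuration changes exactly these decisions.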

Motivation – Resource Elasticity
Example: Linear Regression (in SystemML's R-like syntax); estimated runtime costs w/ different memory configurations
Scenario: X: 10^6 x 10^3, 10^9 cells; y: 10^6 x 1, 10^6 cells

Linreg DS (Direct Solve), O(m n^2) compute:
  X = read($1);
  y = read($2);
  intercept = $3;
  lambda = 0.001;
  ...
  if( intercept == 1 ) {
    ones = matrix(1, nrow(X), 1);
    X = append(X, ones);
  }
  I = matrix(1, ncol(X), 1);
  A = t(X) %*% X + diag(I)*lambda;
  b = t(X) %*% y;
  beta = solve(A, b);
  ...
  write(beta, $4);

Linreg CG (Conjugate Gradient), O(m n) per iteration:
  X = read($1);
  y = read($2);
  intercept = $3;
  lambda = 0.001;
  ...
  r = -(t(X) %*% y);
  while(i < maxi & norm_r2 > norm_r2_trgt) {
    q = t(X) %*% (X %*% p) + lambda*p;
    alpha = norm_r2 / (t(p) %*% q);
    w = w + alpha * p;
    old_norm_r2 = norm_r2;
    r = r + alpha * q;
    norm_r2 = sum(r * r);
    beta = norm_r2 / old_norm_r2;
    p = -r + beta * p;
    i = i + 1;
  }
  ...
  write(w, $4);

Problem of memory-sensitive plans:
- Issue #1: Users need to reason about plans and cluster configurations for best performance
- Issue #2: Finding a good static cluster configuration is a hard problem (variety of ML algorithms and workloads / multi-tenancy)
One size does not fit all: variety of ML algorithms, data and cluster characteristics, multi-tenancy.
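Instantiating the two complexity bounds for this scenario (m = 10^6 rows, n = 10^3 features) shows why neither script dominates – a rough sketch of the trade-off:

\[ C_{\mathrm{DS}} \approx m\,n^2 = 10^6 \cdot (10^3)^2 = 10^{12}\ \text{ops (dominated by } X^{\top}X\text{)}, \qquad C_{\mathrm{CG}} \approx k \cdot m\,n = k \cdot 10^9\ \text{ops for } k \text{ iterations.} \]

CG does far less compute per pass but touches X in every iteration, so the better choice depends on data size, memory budget, and I/O – which is exactly why the resource configuration matters.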

Motivation – Resource Elasticity (2)
Problem of memory-sensitive plans:
- Issue #1: Memory-sensitive compilation steps (the user needs to reason about plans and cluster configurations)
- Issue #2: One size does not fit all (finding a good static cluster configuration is a hard problem)
Resource negotiation frameworks:
- YARN: resource container requests (memory, CPU, etc.)
- Mesos: resource container offers (memory, CPU, storage, etc.)
→ Resource Elasticity for Large-Scale ML: automatic resource provisioning for ML programs, according to the workload (ML script, input data, cluster).
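For illustration, a minimal sketch of what a memory/CPU container request looks like with the Hadoop 2.x YARN AMRMClient API (the 2048 MB / 1 vcore values are placeholders, not the configuration SystemML's application master actually requests):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    // Client used by an application master to negotiate resources with the YARN RM.
    // (A real AM would also call registerApplicationMaster before requesting containers.)
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();

    // Ask for one container with 2 GB memory and 1 virtual core (placeholder values).
    Resource capability = Resource.newInstance(2048, 1);
    ContainerRequest request = new ContainerRequest(
        capability, null /*nodes*/, null /*racks*/, Priority.newInstance(0));
    rmClient.addContainerRequest(request);

    // Allocated containers are later returned by rmClient.allocate(progress).
    rmClient.stop();
  }
}

The key point for resource elasticity is that memory is negotiated per container at request time, so an optimizer can pick these numbers per ML program instead of relying on one static cluster-wide setting.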

Resource Optimizer Overview
Basic ideas:
- Optimize ML program resource configurations via online what-if analysis
- Generating and costing runtime plans
- Program-aware grid enumeration, pruning, and re-optimization techniques
1) Goal: minimize costs w/o unnecessary over-provisioning
2) Costing of runtime plans in order to take all compilation steps into account

Optimization Framework
ML program resource allocation problem
- Goal: minimize costs w/o unnecessary over-provisioning
- Objectives: (1) performance, (2) throughput / monetary costs
Cost model
- White-box cost model based on generated runtime plans: C(P, R^P, cc) is the estimated execution time of plan P
- Single pass w/ tracking of sizes/states of live variables
- Analytical model of IO (bandwidth), compute (FLOPs), and latency (jobs / tasks)
[Matthias Boehm: Costing Generated Runtime Execution Plans for Large-Scale Machine Learning Programs, CoRR 2015]
Search space characterization
- Def.: resource dependency (cost influences of resources); resource configuration R^P = (rc^P, r1^P, ..., rn^P)
- P1: Monotonic dependency elimination → pruning
- P2: Semi-independent 2-dim problems → parallelization
Speaker note: P2 also leads to linear optimization time in the number of program blocks.
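A compact way to state the allocation problem in the slide's notation (a sketch; rc^P is the client/CP resource and r1^P, ..., rn^P are the per-block MR resources):

\[ R^{P*} \;=\; \arg\min_{R^P = (rc^P, r_1^P, \ldots, r_n^P)} \; C\!\left(P, R^P, cc\right) \quad \text{s.t. } R^P \text{ feasible under the cluster configuration } cc, \]

where ties among cost-minimal configurations are broken toward the smallest resources (no unnecessary over-provisioning).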

Grid Enumeration Algorithm

ALGORITHM 1: OPTIMIZERESOURCECONFIG
  Src ← ENUMGRIDPOINTS(P, cc, t1, ASC)
  Srm ← ENUMGRIDPOINTS(P, cc, t1, ASC)
  for all rc ∈ Src do
    B ← COMPILEPROGRAM(P, R=(rc, MIN))
    B' ← PRUNEPROGRAMBLOCKS(B)
    for all Bi ∈ B' do
      for all ri ∈ Srm do
        Bi ← RECOMPILE(Bi, R=(rc, ri))
        MEMO ← EVAL(C(Bi), ri, MEMO)
    P ← RECOMPILE(P, R=(rc, MEMO))
    R*P ← EVAL(C(P), MEMO)
  return R*P

Overall algorithm:
- Grid enumeration of CP/MR resources
- Program-aware grid point generation and pruning techniques
Grid point generators:
- Systematic: equi-/exp-spaced grid
- Directed: memory-based grid
- Composite grids (cross product / union); our default: (Exp ∪ Mem)
Pruning techniques:
- Pruning blocks of small ops
- Pruning blocks of unknowns
Speaker note (pruning effectiveness, w/o pruning blocks of unknowns): MLogreg 14/54 → 26%, GLM 64/377 → 17%; high impact of pruning strategies.
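To give a feel for the (Exp ∪ Mem) default grid and the outer what-if loop, a simplified sketch (the toy cost function, the operator memory estimates, and the single-dimension enumeration are placeholders; the real optimizer enumerates CP and MR resources and recompiles/costs the actual program blocks):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class GridEnumSketch {

  // Exp-spaced grid: min, 2*min, 4*min, ... up to the cluster maximum (MB).
  static List<Long> expGrid(long minMB, long maxMB) {
    List<Long> points = new ArrayList<>();
    for (long m = minMB; m <= maxMB; m *= 2)
      points.add(m);
    return points;
  }

  // Memory-based grid: distinct operator memory estimates of the program,
  // clipped to the feasible range (hypothetical estimates below).
  static List<Long> memGrid(List<Long> opMemEstimatesMB, long minMB, long maxMB) {
    List<Long> points = new ArrayList<>();
    for (long m : opMemEstimatesMB)
      if (m >= minMB && m <= maxMB)
        points.add(m);
    return points;
  }

  // Placeholder for "compile the program with this resource and cost the plan".
  static double estimateCost(long resourceMB) {
    return 1.0e6 / resourceMB + 0.01 * resourceMB; // toy cost curve
  }

  public static void main(String[] args) {
    long min = 512, max = 55L * 1024;                                    // 512 MB .. 55 GB
    List<Long> opEstimates = Arrays.asList(800L, 8L * 1024, 16L * 1024); // hypothetical

    TreeSet<Long> grid = new TreeSet<>(expGrid(min, max));   // Exp grid ...
    grid.addAll(memGrid(opEstimates, min, max));             // ... union Mem grid

    long bestResource = -1;
    double bestCost = Double.MAX_VALUE;
    for (long r : grid) {              // ascending enumeration: among equal-cost
      double cost = estimateCost(r);   // points the smallest wins, which avoids
      if (cost < bestCost) {           // unnecessary over-provisioning
        bestCost = cost;
        bestResource = r;
      }
    }
    System.out.println("best resource (MB): " + bestResource + ", cost: " + bestCost);
  }
}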

Runtime Resource Adaptation
Problem of unknown/changing sizes:
- Unknown dimensions / sparsity of intermediates
- Examples: conditional size and sparsity changes, UDFs with unknown output sizes, data-dependent operations
Example MLogreg: Y = table(seq(1,nrow(X)), y)
[Figure: small example of the resulting indicator matrix Y]
  while( !outer_converge ) {
    while( !inner_converge ) {
      Q = P * X %*% V;
    }
    grad = t(X) %*% (P - Y);
  }
Overview of resource adaptation:
- Integrated w/ default dynamic recompilation
- At the granularity of last-level program blocks
- Steps: (1) determine re-opt scope, (2) resource re-optimization, (3) adaptation decision, (4) runtime migration
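As a small illustration of why Y = table(seq(1,nrow(X)), y) has a data-dependent size (a sketch with a hypothetical label vector): table builds a contingency table over the pairs (i, y_i), i.e., a row-wise indicator matrix whose number of columns equals the largest label value, which is only known at runtime:

\[ y = \begin{pmatrix} 1 \\ 2 \\ 1 \\ 3 \end{pmatrix} \;\Rightarrow\; Y = \operatorname{table}(\operatorname{seq}(1,4),\, y) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \]

Such unknowns are what can invalidate the initially chosen resources and trigger the re-optimization steps above.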

Experimental Setting
HW cluster: 1+6 node Hadoop cluster
- 1 head node: 2x4 Intel E5530, HT enabled, 64 GB RAM
- 6 worker nodes: 2x6 Intel E5-2440, HT enabled, 96 GB RAM, 12x2 TB disks
- 10 Gb Ethernet, Red Hat Enterprise Linux Server 6.5
Hadoop cluster
- IBM Hadoop 2.2.0, IBM JDK 1.6.0 64bit SR12
- YARN max allocation per node: 80 GB (1.5x -Xmx), HDFS blocksize 128 MB
- Split size 128 MB, sort buffer 256 MB
SystemML (as of 10/2014)
- SystemML / SystemML AM memory budget of 15% / 70% of the max heap size
- Speaker note: meanwhile, the max overhead is still 1.5x Xmx but capped at 2 GB
ML programs and data
- 5 full-fledged ML scripts; dense / sparse (1.0 / 0.01); 1000 / 100 features
- Scenarios XS-XL (10^7-10^11 cells) → in dense: 80 MB – 800 GB
- Program characteristics: Linreg DS: 22 blocks, unknowns N, maxiter N/A; Linreg CG: 31 blocks, maxiter 5; L2SVM: 20 blocks, maxiter 5 / ∞; MLogReg: 54 blocks, unknowns Y, maxiter 5 / 5; GLM: 377 blocks
Baselines: B-SS (512 MB), B-LS (54 GB / 512 MB), B-SL (512 MB / 54 GB), B-LL (54 GB / 4.4 GB)

End-to-End Runtime Improvements
[Charts: Linreg DS (XS-L) and Linreg CG (XS-L), dense, 1000 features, plus Linreg DS (XL); reported speedups include 12.6x, 6.8x, 3.3x, and 4.6x]
Key takeaways: (1) fit small/medium scenarios in memory, (2) distribute compute-intensive scenarios, (3) pick the right plan for large data.

End-to-End Throughput Improvements
Secondary optimization objective: no unnecessary over-provisioning; quantification via cluster application throughput.
Linreg DS, Scenario S, dense, 1000 features:
- B-LL: 53.3 GB / 4.4 GB → max parallelism: 6
- Opt: 8 GB / 2 GB → max parallelism: 36
Throughput saturates at max parallelism → improved max parallelism (and throughput) via moderate resource configurations.
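These max-parallelism numbers are consistent with the cluster setup from the experimental setting (6 worker nodes, 80 GB max YARN allocation per node, containers sized at roughly 1.5x the configured heap) if the limiting factor is how many of the large application-master containers fit per node – a rough sanity check, not the paper's exact accounting:

\[ \text{B-LL: } 6 \cdot \left\lfloor \frac{80\ \mathrm{GB}}{1.5 \cdot 53.3\ \mathrm{GB}} \right\rfloor = 6 \cdot 1 = 6, \qquad \text{Opt: } 6 \cdot \left\lfloor \frac{80\ \mathrm{GB}}{1.5 \cdot 8\ \mathrm{GB}} \right\rfloor = 6 \cdot 6 = 36. \]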

Optimization Overhead – Optimization Time
Worst overhead over all experiments – relative: GLM, XS: 4.56s / 35.1%; absolute: GLM, M: 11.17s / 2.4%.

Linreg DS (dense, 1000 features):
  Scenario      # Comp.   # Cost.   Opt. Time   %
  XS (80MB)     352       8         0.35s       3.5
  S (800MB)     417       30        0.41s       1.0
  M (8GB)       678       168       0.71s       0.8
  L (80GB)      448       104       0.56s       0.2
  XL (800GB)    536       149       0.73s       <0.1

Linreg CG (dense, 1000 features):
  Scenario      # Comp.   # Cost.   Opt. Time   %
  XS (80MB)     496       8         0.43s       7.2
  S (800MB)     558       9         0.47s       3.3
  M (8GB)       924       192       0.88s       2.2
  L (80GB)      640       152       0.66s       0.1

Conclusions
Summary:
- Automatic resource optimization for ML programs via online what-if analysis
- Program-aware enumeration / pruning / re-optimization techniques
- 03/2015: IBM BigInsights 4.0 GA, BigR/SystemML (opt level)
What else is in the paper? Search space, runtime migration strategies, parallel resource optimizer, many more experiments (algorithms / data characteristics / overhead / adaptation / Spark comparison).
Conclusions:
- Another step towards declarative large-scale machine learning (memory-sensitive plans, one configuration does not fit all)
- Higher performance, higher throughput, lower monetary costs (cloud)
Future work:
- Stateful frameworks like Spark (lazy evaluation)
- Task-parallel ML programs / multi-threaded operations (CPU resources)
- Extended runtime adaptation (cluster utilization, cloud settings)
[Image: IBM Open Platform with Apache Hadoop (BigInsights)]


Motivation – Resource Elasticity
Example: Linear regression (in SystemML's R-like syntax); estimated runtime costs w/ different memory configurations
Scenario: X: 10^6 x 10^3, dense (8 GB); y: 10^6 x 1, dense (8 MB)
Linreg DS (Direct Solve), O(m n^2), and Linreg CG (Conjugate Gradient), O(m n) per iteration – scripts as shown on the earlier motivation slide.
Problem of memory-sensitive plans:
- Issue #1: Users need to reason about plans and cluster configurations for best performance
- Issue #2: Finding a good static cluster configuration is a hard problem (variety of ML algorithms and workloads / multi-tenancy)
Compilation of hybrid runtime plans for efficiency/scalability → key observation: the problem of memory-sensitive plans.

Background: SystemML's Compilation Chain
(1) Influences on memory estimates: input data, rewrites, size propagation
(2) Memory-sensitive compilation steps, examples:
- CP vs MR operator selection
- Physical operator selection (mapmm, mapmmchain, pmm, map*, mappend, map/red wsloss)
- Hop-lop rewrites (t-mm, partitioning)
- Piggybacking bin constraints
- (Parfor optimizer)
[Matthias Boehm et al: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 2014]
Resource negotiation frameworks like YARN, Mesos, or Omega allow us to tackle the problem of memory-sensitive plans.

Backup: Spark Comparison Experiments
Resource optimization is also important for Spark (high performance w/o over-provisioning).
Scenario: simplified L2SVM, dense, 1000 features; Spark 1.2 (Dec 2014), 1+6 executors w/ 24 cores and 55 GB; performance w/ driver mem 20 GB, throughput (# users) w/ driver mem 512 MB.

Performance (* = CP-only):
  Scenario      SystemML on MR w/ Opt   SystemML on Spark Plan 1 (Hyb)   Plan 2 (Full)   Preview SystemML Spark BE (05/2015)
  XS (80MB)     6s*                     25s                              59s             3s* (23s)
  S (800MB)     12s*                    31s                              126s            6s* (27s)
  M (8GB)       40s*                    43s                              184s            50s* (70s)
  L (80GB)      836s                    167s                             347s            82s (102s)
  XL (800GB)    12,376s                 10,119s                          13,661s         5,156s (5,176s)

Throughput (Scenario S):
  # of Users    SystemML on MR w/ Opt      SystemML on Spark Plan 2 (Full)
  1             5.1 app/min*               0.48 app/min
  8             35.6 app/min* (7.0x)       0.84 app/min (1.8x)
  32            69.8 app/min* (13.7x)      0.83 app/min (1.7x)

Backup: Yet Another Resource Negotiator (YARN)
[Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, Eric Baldeschwieler: Apache Hadoop YARN: yet another resource negotiator. SoCC 2013]