On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML
Matthias Boehm1, Berthold Reinwald1, Dylan Hutchison2, Prithviraj Sen1, Alexandre V. Evfimievski1, Niketan Pansare1
1 IBM Research – Almaden; San Jose, CA, USA
2 University of Washington; Seattle, WA, USA
Acknowledgements: Apache SystemML Team, Tarek Elgamal

Motivation: Fusion Opportunities
State-of-the-art ML systems: DAGs of linear algebra (LA) operations and statistical functions
Materialized intermediates → ubiquitous fusion opportunities:
a) Intermediates: sum(X*Y*Z)
b) Single-Pass: t(X)%*%(X%*%v) vs. t(t(X%*%v)%*%X)
c) Multi-Aggregates
d) Sparsity Exploitation
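To illustrate opportunity a), the following sketch contrasts sum(X*Y*Z) with materialized intermediates against a fused single-pass variant. This is plain dense Java for illustration only, not SystemML code; all names are hypothetical.

```java
// Sketch: sum(X*Y*Z) over dense vectors, unfused vs. fused.
public class FusionSketch {
    // a) materialized: two intermediate arrays, three passes over the data
    static double sumMaterialized(double[] x, double[] y, double[] z) {
        double[] t1 = new double[x.length];
        for (int i = 0; i < x.length; i++) t1[i] = x[i] * y[i];
        double[] t2 = new double[x.length];
        for (int i = 0; i < x.length; i++) t2[i] = t1[i] * z[i];
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += t2[i];
        return sum;
    }

    // b) fused: a single pass, no intermediate allocations
    static double sumFused(double[] x, double[] y, double[] z) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += x[i] * y[i] * z[i];
        return sum;
    }
}
```

The fused variant avoids both the memory traffic and the allocation of the two intermediates, which is exactly what makes such patterns attractive fusion candidates.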

A Case for Optimizing Fusion Plans
Problem: Fusion heuristics → poor plans for complex DAGs (cost/structure), sparsity exploitation, and local/distributed operations
Goal: Principled approach for optimizing fusion plans
#1 Materialization Points (e.g., for multiple consumers)
#2 Sparsity Exploitation (and ordering of sparse inputs), e.g., Y + X * (U %*% t(V)), sparse-safe over X
#3 Decisions on Fusion Patterns (e.g., template types)
#4 Constraints (e.g., memory budget and block sizes)
→ Search space that requires optimization

System Architecture (Compiler Integration & Codegen Architecture)
Practical, exact, cost-based optimizer
Prior work [CIDR'17] (w/ fuse-all heuristic): lacked maintainability; poor plans for complex DAGs and local/distributed operations
Templates: Cell, Row, MAgg, Outer w/ different data bindings

Codegen Example L2SVM (Cell/MAgg) I
L2SVM Inner Loop
# of vector intermediates: Base (w/o fused ops): 10; Fused (w/ fused ops): 4

1:  while(continueOuter & iter < maxi) {
2:    #...
3:    while(continueInner) {
4:      out = 1-Y* (Xw+step_sz*Xd);
5:      sv = (out > 0);
6:      out = out * sv;
7:      g = wd + step_sz*dd - sum(out * Y * Xd);
8:      h = dd + sum(Xd * sv * Xd);
9:      step_sz = step_sz - g/h;
10: }} ...

Codegen Example L2SVM (Cell/MAgg) II
Template Skeleton: data access, blocking; multi-threading; final aggregation
# of vector intermediates: Gen (codegen ops): 0

public final class TMP25 extends SpoofMAgg {
  public TMP25() {
    super(false, AggOp.SUM, AggOp.SUM);
  }
  protected void genexec(double a, SideInput[] b, double[] scalars, double[] c, ...) {
    double TMP11 = getValue(b[0], rowIndex);
    double TMP12 = getValue(b[1], rowIndex);
    double TMP13 = a * scalars[0];
    double TMP14 = TMP12 + TMP13;
    double TMP15 = TMP11 * TMP14;
    double TMP16 = 1 - TMP15;
    double TMP17 = (TMP16 > 0) ? 1 : 0;
    double TMP18 = a * TMP17;
    double TMP19 = TMP18 * a;
    double TMP20 = TMP16 * TMP17;
    double TMP21 = TMP20 * TMP11;
    double TMP22 = TMP21 * a;
    c[0] += TMP19;
    c[1] += TMP22;
  }
}

Codegen Example MLogreg (Row)
MLogreg Inner Loop (main expression on feature matrix X)
1: Q = P[, 1:k] * (X %*% v)
2: H = t(X) %*% (Q - P[, 1:k] * rowSums(Q))

public final class TMP25 extends SpoofRow {
  public TMP25() {
    super(RowType.COL_AGG_B1_T, true, 5);
  }
  protected void genexecDense(double[] a, int ai, SideInput[] b, double[] c, ..., int len) {
    double[] TMP11 = getVector(b[1].vals(rix), ...);
    double[] TMP12 = vectMatMult(a, b[0].vals(rix), ...);
    double[] TMP13 = vectMult(TMP11, TMP12, 0, 0, ...);
    double TMP14 = vectSum(TMP13, 0, TMP13.length);
    double[] TMP15 = vectMult(TMP11, TMP14, 0, ...);
    double[] TMP16 = vectMinus(TMP13, TMP15, 0, 0, ...);
    vectOuterMultAdd(a, TMP16, c, ai, 0, 0, ...);
  }
  protected void genexecSparse(double[] avals, int[] aix, int ai, SideInput[] b, ..., int len) {...}
}

Candidate Exploration (by Example MLogreg Row Template)
Memo table for partial fusion plans (candidates)
OFMC Template Fusion API: open, fuse, merge, close
OFMC Algorithm: bottom-up exploration (single-pass, template-agnostic); linear space and time
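The memo-table idea above can be sketched as follows: each DAG node maps to the list of partial-plan candidates (template type, open/closed) discovered in one bottom-up pass. This is an illustrative simplification, not the SystemML implementation; class and method names are hypothetical, and only the open/fuse calls of the API are shown.

```java
import java.util.*;

// Sketch of a memo table for partial fusion plan candidates.
public class MemoTable {
    public enum TemplateType { CELL, ROW, MAGG, OUTER }

    public static final class MemoEntry {
        final TemplateType type;
        final boolean closed;
        MemoEntry(TemplateType type, boolean closed) { this.type = type; this.closed = closed; }
    }

    private final Map<Long, List<MemoEntry>> table = new HashMap<>();

    // "open": start a new candidate template at this node
    public void open(long nodeId, TemplateType type) {
        table.computeIfAbsent(nodeId, k -> new ArrayList<>())
             .add(new MemoEntry(type, false));
    }

    // "fuse": extend an input's open candidate of matching type to this node
    public void fuse(long nodeId, long inputId, TemplateType type) {
        for (MemoEntry e : table.getOrDefault(inputId, List.of()))
            if (e.type == type && !e.closed) { open(nodeId, type); return; }
    }

    public int numCandidates(long nodeId) {
        return table.getOrDefault(nodeId, List.of()).size();
    }
}
```

Because each node stores only its local candidates and references its inputs' entries, the table stays linear in the DAG size, matching the "linear space and time" claim above.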

Candidate Selection I (Plan Partitions and Interesting Points)
#1 Determine Plan Partitions
- Materialization points M
- Connected components of fusion references
- Root and input nodes
- Optimize partitions independently
#2 Determine Interesting Points
- Materialization point consumers: each data dependency on materialization points considered separately
- Template / sparse switches: data dependencies where the producer has templates that are non-existing for consumers
→ Optimizer considers all 2^|M'i| plans (with |M'i| ≥ |Mi|) per partition

Candidate Selection II (Costs and Constraints)
Overview cost model: cost partitions with an analytical cost model based on peak memory and compute bandwidth; plan comparisons / fusion errors don't propagate / dynamic recompilation
#3 Evaluate Costs
- #1: Memoization of already processed sub-DAGs
- #2: Account for shared reads and CSEs within operators
- #3: Account for redundant computation (overlap)
→ DAG traversal and cost vectors per fused operator (with memoization of pairs of operators and cost vectors)
#4 Handle Constraints
- Prefiltering of violated constraints (e.g., row template in distributed ops)
- Assign infinite costs to violated constraints during costing
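A minimal sketch of such an analytical cost model: operator time is bounded below by either memory bandwidth or compute throughput, whichever dominates. The peak constants below are taken from the cluster description in the experimental setting (2x32 GB/s, 2x115 GFLOP/s per worker); the formula itself is an assumed simplification, not the exact SystemML model.

```java
// Sketch: bandwidth/compute roofline-style time estimate for one fused operator.
public class CostSketch {
    static final double PEAK_MEM_BPS  = 2 * 32e9;   // bytes/s (from slide's worker spec)
    static final double PEAK_FLOPS    = 2 * 115e9;  // FLOP/s  (from slide's worker spec)

    // bytesTouched: input + output bytes the fused op reads/writes once
    // flops: floating-point operations the fused op performs
    static double estimateTimeSec(double bytesTouched, double flops) {
        double memTime = bytesTouched / PEAK_MEM_BPS;
        double computeTime = flops / PEAK_FLOPS;
        return Math.max(memTime, computeTime); // the dominant resource bounds the time
    }
}
```

Under this model, fusion helps precisely because it reduces bytesTouched (fewer materialized intermediates) while leaving flops roughly unchanged, so memory-bound expressions get faster.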

Candidate Selection III (Enumeration Algorithm MPSkipEnum and Pruning)
#5 Basic Enumeration: linearized search space

for( j in 1:pow(2,|M'i|) )
  q = createAssignment(j)
  C = getPlanCost(Pi, q)
  maintainBest(q, C)

#6 Cost-Based Pruning
- Upper bound: cost CU of best plan q* (monotonically decreasing)
- Opening heuristic: evaluate FA and FNR heuristics first
- Lower bound: static CLS (read input, write output, min compute) + dynamic CLD (materialize intermediates q)
→ skip subspace if CU ≤ CLS + CLD
#7 Structural Pruning
- Observation: assignments can create independent sub-problems
- Build reachability graph to determine cut sets
- During enumeration: probe cut sets, recursive enumeration, combine, and skip
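The basic loop with cost-based pruning can be sketched as below: assignments of the |M'| materialization points are the bits of j, and an assignment is skipped when the static lower bound plus a dynamic lower bound already cannot beat the best plan so far. Note this simplified sketch prunes individual assignments, whereas MPSkipEnum skips whole subspaces; the cost function and the per-point materialization cost are placeholders.

```java
import java.util.function.ToDoubleFunction;

// Sketch: linearized enumeration of materialization-point assignments
// with an upper-bound / lower-bound pruning test (CU <= CLS + CLD).
public class EnumSketch {
    static boolean[] enumerate(int n, double staticLB, double matCostPerPoint,
                               ToDoubleFunction<boolean[]> planCost) {
        double bestCost = Double.POSITIVE_INFINITY; // CU, monotonically decreasing
        boolean[] best = null;
        for (long j = 0; j < (1L << n); j++) {
            boolean[] q = new boolean[n];
            int numMat = 0;
            for (int i = 0; i < n; i++) {
                q[i] = ((j >> i) & 1) == 1;       // createAssignment(j)
                if (q[i]) numMat++;
            }
            double dynLB = numMat * matCostPerPoint; // CLD analogue: cost of materializing q
            if (bestCost <= staticLB + dynLB) continue; // prune if CU <= CLS + CLD
            double c = planCost.applyAsDouble(q);    // getPlanCost(Pi, q)
            if (c < bestCost) { bestCost = c; best = q; } // maintainBest(q, C)
        }
        return best;
    }
}
```

Seeding bestCost with the FA and FNR heuristic plans first (the "opening heuristic" above) tightens the upper bound early and lets the pruning test fire much sooner.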

Experimental Setting
Setup:
- 1+6 node cluster (head: 2x4 Intel Xeon E5530, 64 GB RAM; 6 workers: 2x6 Intel Xeon E5-2440, 96 GB RAM, peak 2x32 GB/s, 2x115 GFLOP/s; 10Gb Ethernet)
- Modern scale-up server (2x20 Intel Xeon Gold 6138, 768 GB RAM, peak 2x119 GB/s, 2x1.25 TFLOP/s)
- Java 1.8.0, Hadoop 2.7.3, Spark 2.2.0 (client w/ 35 GB driver, 6 executors w/ 65 GB and 24 cores, aggregate cluster memory: 234 GB)
Baselines:
- SystemML 1.0++ (Feb 2018): Base, Fused (hand-coded, default), Gen (optimizer), and heuristics FA (all) and FNR (no redundancy)
- Julia 0.6.2 (Dec 13, 2017): LLVM code generation; Julia (without fusion) and JuliaGen (fusion via dot syntax)
- TensorFlow 1.5 (Jan 26, 2018): TF (without fusion) and TFGen (fusion via TensorFlow XLA); limited support for sparse

Operations Performance
Cell template: sum(X*Y*Z), dense and sparse (0.1)
Row template: t(X)%*%(w*(X%*%v))
Outer template: sum(X*log(U%*%t(V)+1e-15)), dense 20K x 20K, rank 100
TF w/ manual rewrite to t(t(w*(X%*%v))%*%X): 9.2 s down to 1.6 s (compared to Gen: 283 ms)
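To see why the Row template wins on t(X)%*%(w*(X%*%v)), the expression can be computed in a single pass over the rows of X, never materializing X%*%v or the weighted vector. The sketch below is plain dense Java for illustration, not the generated SpoofRow code; names are hypothetical.

```java
// Sketch: single-pass, row-wise evaluation of t(X) %*% (w * (X %*% v)).
public class RowFusionSketch {
    // X: n x m (row-major), v: length m, w: length n; returns a length-m vector
    static double[] fusedRowAgg(double[][] X, double[] v, double[] w) {
        int m = v.length;
        double[] c = new double[m];
        for (int i = 0; i < X.length; i++) {
            double dot = 0;                       // (X %*% v)[i]
            for (int j = 0; j < m; j++) dot += X[i][j] * v[j];
            double s = w[i] * dot;                // (w * (X %*% v))[i]
            for (int j = 0; j < m; j++)           // accumulate row i of t(X) scaled by s
                c[j] += X[i][j] * s;
        }
        return c;
    }
}
```

Each row of X is read exactly once, so for a large X the operation becomes a single streaming pass instead of two matrix-vector multiplies with an intermediate.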

L2SVM End-to-End Performance (20 outer/∞ inner iterations, reg., tight eps)
#1 Heuristics struggle w/ hybrid plans
Local and distributed [seconds]:

Data               Base   Fused*   Gen    FA     FNR
10^8 x 10, D        446    276      37     44     92
Airline78, D        151    105      24     26     45
Mnist8m, S          203    156     113    115    116
2*10^8 x 100, D    1218    895     347   1433    539
2*10^8 x 10^3, S   1481   1066     373   2205    575
Mnist80m, S        1593   1114     552   1312    896

Julia comparison: dataset 10^8 x 10 (8 GB); hand-tuned fusion script for JuliaGen

ALS-CG End-to-End Performance (20 outer/20 inner iterations, rank 20)
ALS-CG: representative for many matrix factorization algorithms; requires sparsity exploitation in loss computation and update rules
Local single node [seconds]:

Data                     Base     Fused*    Gen      FA       FNR
10^4 x 10^4, S (0.01)     426       20       25      215      226
10^5 x 10^5, S (0.01)  23,585       96       80   13,511   12,353
10^6 x 10^6, S (0.01)     N/A      860      722       –        –
Netflix                    –     1,026      789       –        –
Amazon Books               –    17,335    7,420       –        –

(Amazon Books: 8,026,324 x 2,330,066; sparsity = 0.0000012)
#2 Heuristics struggle w/ sparsity exploitation
#3 Heuristics struggle w/ complex DAGs (see paper)

Conclusions
Summary:
- Exact, cost-based optimizer for operator fusion plans over DAGs of LA ops
- End-to-end compiler and runtime integration into SystemML
Additional contributions (in the paper):
- Fused operators for sparsity exploitation and compressed matrices
- Graceful approximation, plan and operator caching, and rewrites
- Experiments for operations, compilation overhead, and algorithms
Conclusions:
- Significant end-to-end improvements due to generality and cost awareness
- Low compilation overhead due to fast compiler, pruning, and op/plan caching
- Optimizing fusion plans is a cornerstone of future declarative ML systems
Available open source in SystemML: https://github.com/apache/systemml

Backup: Compilation Overhead
Java class compilation and loading; partitioning and pruning
w/ operator cache: total codegen overhead 55 ms
[Figure: per-phase compilation overhead breakdown; detailed bar values omitted]