SystemML: Declarative Machine Learning on Spark

1 SystemML: Declarative Machine Learning on Spark
Matthias Boehm (1), Michael W. Dusenberry (2), Deron Eriksson (2), Alexandre V. Evfimievski (1), Faraz Makari Manshadi (1), Niketan Pansare (1), Berthold Reinwald (1), Frederick R. Reiss (1,2), Prithviraj Sen (1), Arvind C. Surve (2), Shirish Tatikonda (1)
(1) IBM Research – Almaden, (2) IBM Spark Technology Center

Acknowledgements: Nakul Jindal, Christian R. Kadner, Jihyoung Kim, Narine Kokhlikyan, Deepak Kumar, Min Li, Luciano Resende, Alok Singh, Glenn Weidner, and Wen Pei Yu

Speaker notes: Thanks for the kind introduction. It’s my pleasure to present SystemML on Spark, to which many IBM-internal and external people contributed.

2 Motivation and History of SystemML
Need for large-scale ML
- Ever-growing data collections
- Gain value via custom advanced analytics / machine learning

Declarative ML in SystemML
- Simplify development and use of ML algorithms
- High-level language, data independence, and automatic plan generation

History of Apache SystemML
- Started w/ MR + CP backends (shared clusters)
- Extended by a Spark backend (distributed caching, seamless data preparation)
- Open-sourced as SparkTC/systemml 08/2015; Apache SystemML (incubating) since 11/2015

Speaker notes:
Motivation for Spark: YARN enabled multiple frameworks; distributed caching and seamless data preparation.
Benefits of declarative ML: simple, analysis-centric specification; physical data independence; automatic execution-plan generation (optimization, platform independence, data-size independence); ease of deployment (platform independence, adaptivity of "packaged" applications); and separation of concerns (skill sets of users/devs).
Timeline: proofs of concept and initial projects with interns 2007/2008; project kickoff Jan 2010 (IMJP Dec 2009); ICDE paper 2011; MR backend; CP backend Oct 2012; Beta 08/2014 and 12/2014; GA 03/2015 and 08/2015; Spark backend and Apache community since 2015.

3 Running Example: Collaborative Filtering

Matrix completion via low-rank factorization: X ≈ U V^T
ALS-CG (alternating least squares via conjugate gradient)
- L2-regularized squared loss: ||W * (U V - X)||_F^2 + lambda * (||U||_F^2 + ||V||_F^2), with weight matrix W = (X != 0)
- Repeatedly fixes one factor and optimizes the other
- Conjugate gradient to solve the least-squares subproblems

[Figure: sparse users-by-products rating matrix X (entries 1-5, "?" for unobserved) factorized into U and V^T]

ALS-CG in DML:

  X = read($inFile);
  r = $rank; lambda = $lambda; mi = $maxiter;
  U = rand(rows=nrow(X), cols=r, min=-1.0, max=1.0);
  V = rand(rows=r, cols=ncol(X), min=-1.0, max=1.0);
  W = (X != 0);                        # weights: observed cells only
  mii = r; i = 0; is_U = TRUE;
  while( i < mi ) {                    # outer loop: alternate U and V
    i = i + 1; ii = 1;
    if( is_U )
      G = (W * (U %*% V - X)) %*% t(V) + lambda * U;   # gradient w.r.t. U
    else ...
    norm_G2 = sum(G^2); norm_R2 = norm_G2; ...
    while( norm_R2 > 10E-9 * norm_G2 & ii <= mii ) {   # inner CG loop
      if( is_U ) {
        HS = (W * (S %*% V)) %*% t(V) + lambda * S;
        alpha = norm_R2 / sum(S * HS);
        U = U + alpha * S;             # step along search direction S
      } else {...}
      ...
    }
    is_U = !is_U;                      # switch the factor being optimized
  }
  write(U, $outUFile, format="text");
  write(V, $outVFile, format="text");

4 SystemML Architecture and APIs
APIs: Command Line, JMLC, Spark MLContext, Spark ML

Compiler: Parser/Language, High-Level Operators (HOPs), Low-Level Operators (LOPs)
Runtime: control program, runtime program, buffer pool, ParFor optimizer/runtime, recompiler, CP/Spark/MR instructions, DFS and Mem/FS IO, generic MR jobs, MatrixBlock library (single/multi-threaded)
Spark-specific optimizer and runtime techniques are covered on the following slides.

Spark command line:

  ./spark-submit --master yarn-client SystemML.jar \
    -f ALS-CG.dml -nvargs X=./in/X ...

Spark MLContext Scala API:

  ./spark-shell --jars SystemML.jar

  import org.apache.sysml.api.mlcontext._
  import org.apache.sysml.api.mlcontext.ScriptFactory._
  val ml = new MLContext(sc)
  val X = ...                      // RDD, DataFrame, etc.
  val script = dmlFromFile("ALS-CG.dml").in("X", X)
    .in(...).out("U", "V")
  val (u, v) = ml.execute(script)
    .getTuple[Matrix,Matrix]("U", "V")
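For quick experimentation in the shell, a script can also be built from an inline DML string rather than a file. A minimal sketch using the same MLContext API, assuming a DataFrame df already exists in the session:

  import org.apache.sysml.api.mlcontext._
  import org.apache.sysml.api.mlcontext.ScriptFactory._

  // inline DML via dml(...) instead of dmlFromFile(...);
  // df is an assumed, pre-existing DataFrame bound to input X
  val ml = new MLContext(sc)
  val script = dml("s = sum(X); print(s)").in("X", df)
  ml.execute(script)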

5 Spark-Specific Optimizations
Spark-specific rewrites
- Automatic caching/checkpoint injection (MEM_DISK (+CSR) / MEM_DISK_SER)
- Automatic repartition injection
- Extended ParFor optimizer: deferred checkpoint/repartition injection, eager checkpointing/repartitioning, fair scheduling for concurrent jobs

Runtime optimizations
- Lazy Spark context creation
- Short-circuit read/collect

Operator selection
- Spark execution type selection: chosen if M(op) > M_CP, i.e., the operation exceeds the control-program memory budget
- Transitive Spark execution type, e.g., sum(X^2) runs on Spark if X^2 does
- Physical operator selection (next slide)

Example: checkpoint injection in LinregCG. X is read once but used in every iteration, so "chkpt X MEM_DISK" is injected after the read; a hand-written Spark analogue is sketched below.

  X = read($1);
  y = read($2);
  ...
  r = -(t(X) %*% y);
  while( i < maxi & norm_r2 > norm_r2_trgt ) {
    q = t(X) %*% (X %*% p) + lambda * p;   # X reused every iteration
    alpha = norm_r2 / sum(p * q);
    w = w + alpha * p;
    old_norm_r2 = norm_r2;
    r = r + alpha * q;
    norm_r2 = sum(r * r);
    beta = norm_r2 / old_norm_r2;
    p = -r + beta * p;
    i = i + 1;
  }
  ...
  write(w, $4);

[Figure: Spark executor memory layout (24 cores): 25% user memory, 75% data & execution (50% min storage, 75% max), Spark ≥ 1.6]
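In hand-written Spark code, the injected checkpoint corresponds to persisting the loop-invariant input before the iterative part. A minimal, self-contained sketch of that effect (synthetic data and all names are illustrative, not SystemML internals):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.storage.StorageLevel

  object CheckpointSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder.appName("chkpt-sketch").getOrCreate()
      val sc = spark.sparkContext

      // X is loop-invariant; persisting it is the hand-written analogue
      // of the injected "chkpt X MEM_DISK"
      val X = sc.parallelize(0 until 100000)
        .map(_ => Array.fill(100)(scala.util.Random.nextDouble()))
        .persist(StorageLevel.MEMORY_AND_DISK)

      var i = 0
      while (i < 10) {
        // stand-in for one iteration's pass over X, e.g., t(X) %*% (X %*% p);
        // reads cached partitions instead of re-scanning HDFS every time
        println(X.map(row => row.map(v => v * v).sum).sum())
        i += 1
      }
      spark.stop()
    }
  }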

6 Physical Operator Selection
Operator categories:
- Basic physical Spark matrix-multiply (MM) operators
- Fused physical Spark MM operators, including sparsity-exploiting ones
- Map-side variants apply iff the broadcast inputs fit in memory, e.g., M(U)+M(V) < MB for the fused operators

Basic MM operators (M(.) = memory estimate, MB = broadcast budget, Bc = block size):

  MM Op       Pattern           Constraints
  MapMM       X Y               M(X) < MB or M(Y) < MB
  MapMMChain  X^T (w * (X v))   M(w)+M(v) < MB and ncol(X) ≤ Bc
  TSMM        X^T X             ncol(X) ≤ Bc
  ZIPMM       X^T Y             ncol(X) ≤ Bc and ncol(Y) ≤ Bc
  CPMM        X Y               -
  RMM         X Y               -
  PMM         rmr(diag(v)) X    M(v) < MB

Fused MM operators:

  MM Op             Patterns
  Map/Red WSLoss    sum(W * (U V^T - X)^2), sum((X - W * (U V^T))^2)
  Map/Red WSigmoid  X * sigmoid(U V^T), X * log(sigmoid(-(U V^T)))
  Map/Red WDivMM    (W / (U V^T)) V, (U^T (W / (U V^T)))^T
  Map/Red WCeMM     sum(X * log(U V^T))
  Map/Red WuMM      X * exp(U V^T), X / (U V^T)^2

A broadcast-based sketch of the MapMM idea follows below.
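To illustrate MapMM: when one input fits in memory, it is broadcast and the multiply happens map-side, avoiding a shuffle of the large input. A minimal sketch under that assumption (all names and sizes illustrative, not SystemML's implementation):

  import org.apache.spark.sql.SparkSession

  object MapMMSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder.appName("mapmm-sketch").getOrCreate()
      val sc = spark.sparkContext

      val n = 1000; val k = 50; val m = 20
      val X = sc.parallelize(0 until n).map(_ => Array.fill(k)(1.0)) // X: n x k, distributed by rows
      val V = Array.fill(k, m)(0.5)                                  // V: k x m, small enough to broadcast

      val Vb = sc.broadcast(V)          // ship the small side to every task
      val XV = X.map { row =>           // map-side multiply: no shuffle of X
        val v = Vb.value
        Array.tabulate(m)(j => (0 until k).map(i => row(i) * v(i)(j)).sum)
      }
      println(XV.count())
      spark.stop()
    }
  }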

7 Example Fused Operator: WSLoss
Weighted squared loss: wsl = sum(W * (X - U %*% t(V))^2)
- Common pattern in factorization algorithms (e.g., ALS)
- W and X are usually very sparse (sparsity < 0.001)

Problem: the "outer" product U %*% t(V) creates three dense intermediates of the size of X.

Solution: fused wsloss operator. Key observations: the sparse multiply by W allows selective computation over nonzero cells only, and the full aggregate significantly reduces memory requirements; see the sketch below.

[Figure: dataflow of the expression with dense intermediates U %*% t(V), the difference with X, the squared result weighted by W, and the final sum]
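A sparsity-exploiting sketch of the fused computation: iterate only over the nonzero cells of W, compute the dot product u_i . v_j on the fly, and aggregate, so the dense product U %*% t(V) is never materialized. Data layout and names are illustrative, not SystemML's operator implementation:

  import org.apache.spark.sql.SparkSession

  object WSLossSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder.appName("wsloss-sketch").getOrCreate()
      val sc = spark.sparkContext

      val rank = 10
      val U = Array.fill(100, rank)(0.1)   // factors broadcast: M(U)+M(V) < MB
      val V = Array.fill(200, rank)(0.1)
      val Ub = sc.broadcast(U); val Vb = sc.broadcast(V)

      // W and X as sparse (i, j, w_ij, x_ij) cells: only observed entries exist
      val cells = sc.parallelize(Seq((0, 0, 1.0, 4.0), (1, 2, 1.0, 5.0)))

      val wsl = cells.map { case (i, j, w, x) =>
        // dot product u_i . v_j computed on the fly: no dense U %*% t(V)
        val pred = (0 until rank).map(f => Ub.value(i)(f) * Vb.value(j)(f)).sum
        w * (x - pred) * (x - pred)
      }.sum()
      println(wsl)
      spark.stop()
    }
  }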

8 Runtime Buffer Pool Integration
Motivation
- Integration with lazy RDD evaluation requires buffer-pool integration
- Exchange of intermediates between local and remote (CPU, GPU, HDFS, RDDs)
- Eviction of in-memory objects

Primitives
- Pinning and unpinning of in-memory matrices
- Export to remote storage (HDFS, GPU)
- Construction of RDDs or broadcast objects

Spark specifics (a sketch of the guarded collect follows below)
- Lineage tracking of RDDs/broadcasts
- Guarded RDD collect/parallelize
- Partitioned broadcasts

[Figure: lineage tracking of matrix objects, RDDs, and broadcasts]
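To illustrate the guarded collect primitive: an intermediate is brought into the driver only if its estimated size fits the local buffer budget, otherwise it stays remote. A minimal sketch with an illustrative size estimate and spill path (not SystemML's actual logic):

  import org.apache.spark.rdd.RDD

  // Guarded collect sketch: materialize locally only if the estimate fits.
  def guardedCollect(rows: RDD[Array[Double]],
                     budgetBytes: Long,
                     spillPath: String): Option[Array[Array[Double]]] = {
    // rough dense-size estimate: #rows * #cols * 8 bytes (assumes non-empty RDD)
    val estBytes = rows.count() * rows.first().length * 8L
    if (estBytes <= budgetBytes) {
      Some(rows.collect())      // small enough: bring into driver memory
    } else {
      rows.map(_.mkString(",")).saveAsTextFile(spillPath) // keep remote instead
      None
    }
  }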

9 Partitioning-Preserving Operations
Shuffle is the major bottleneck for SystemML on Spark.

Partitioning-preserving ops (a small sketch follows below)
- An op is partitioning-preserving if keys are guaranteed unchanged
- Implicit: use restrictive APIs (mapValues() vs mapToPair())
- Explicit: partition computation w/ declaration of partitioning-preserving

Partitioning-exploiting ops
- Implicit: operations based on join, cogroup, etc.
- Explicit: custom operators (e.g., zipmm)

Example: Multiclass SVM
- Vectors fit neither into the driver nor into a broadcast; ncol(X) ≤ Bc enables zipmm
- Injected: "repart, chkpt X MEM_DISK" for X, plus checkpoints for Y_local, Xd, and Xw

  parfor( iter_class in 1:num_classes ) {
    Y_local = 2 * (Y == iter_class) - 1;     # chkpt Y_local MEM_DISK
    g_old = t(X) %*% Y_local;                # zipmm
    ...
    while( continue ) {
      Xd = X %*% s;                          # chkpt Xd, Xw MEM_DISK
      ... # inner while loop (compute step_sz)
      Xw = Xw + step_sz * Xd;
      out = 1 - Y_local * Xw;
      out = (out > 0) * out;
      g_new = t(X) %*% (out * Y_local);      # zipmm
      ...
    }
  }
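To illustrate implicit preservation in the Scala API (mapValues/mapToPair are the Java API names): mapValues cannot change keys, so Spark keeps the existing partitioner, while a general map drops it and forces a later join to reshuffle. A small spark-shell sketch with illustrative data:

  import org.apache.spark.HashPartitioner

  val blocks = sc.parallelize(0 until 1000)
    .map(i => (i, Array.fill(10)(1.0)))      // (block key, block data)
    .partitionBy(new HashPartitioner(24))

  // mapValues cannot touch keys, so the partitioner is preserved:
  val scaled = blocks.mapValues(v => v.map(_ * 2.0))
  println(scaled.partitioner.isDefined)      // true -> later joins avoid a shuffle

  // a general map might change keys, so the partitioner is dropped:
  val rekeyed = blocks.map { case (k, v) => (k, v.map(_ * 2.0)) }
  println(rekeyed.partitioner.isDefined)     // false -> a later join reshuffles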

10 Experimental Setting

Cluster setup
- 1 head node (2x4 Intel E5530, 64GB RAM), 6 worker nodes (2x6 Intel E5-2440, 96GB RAM, 12x2TB disks)
- Spark with 6 executors (24 cores, 55GB), 20GB driver memory
- MR with 1.6GB map/reduce tasks, 20GB driver memory
- OpenJDK 64-bit, Hadoop (384MB sort buffer), Spark (yarn-client)

ML programs

  Algorithm    Maxi       ε      λ      Icpt  #C    ParFor
  L2SVM        20/∞       1e-6   1e-2   N     2
  GLM
  LinregCG     20                             N/A
  LinregDS
  Mlogreg                                     5
  MSVM         5 x 20/∞                             Y
  Naïve Bayes                           N           (Y)
  Kmeans       10 x 20    1e-4                10
  ALS          6/50              1            50

11 End-To-End Experiments
Total execution time (incl. I/O from HDFS and Spark context creation)
Comparison: cp+mr (Hadoop), cp+spark, and spark (spark-submit)

[Charts: LinregCG (10K-100M rows, 1K features); MSVM (5 classes, 10K-100M rows, 1K features)]

12 SystemML on Spark: Lessons Learned
Spark over custom framework
- Well-engineered framework with a strong contributor base
- Unified programming model: seamless data preparation and feature engineering

Stateful distributed caching
- Standing executors with distributed caching and fast task scheduling
- Challenges: task parallelism, memory constraints, fair resource management

Memory efficiency (see the CSR sketch below)
- Compact data structures to avoid cache spilling (serialization, CSR)
- Custom serialization and compression

Lazy RDD evaluation
- Automatic grouping of operations into distributed jobs, incl. partitioning
- Challenges: multiple actions/repeated execution, runtime plan compilation

Declarative ML
- Introduction of the Spark backend did not require algorithm changes!
- Automatically exploit distributed caching and partitioning via rewrites
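A minimal sketch of the CSR (compressed sparse row) idea referenced above: three flat arrays for the whole block instead of per-row sparse structures, which shrinks the cache footprint and reduces spilling. Simplified relative to SystemML's actual MatrixBlock:

  // CSR sketch: one contiguous array triple for the whole block,
  // vs. per-row sparse rows (more objects, larger footprint).
  class CsrBlock(val rowPtr: Array[Int],    // length nrows+1: row boundaries in colIdx/values
                 val colIdx: Array[Int],    // length nnz: column index per nonzero
                 val values: Array[Double]) // length nnz: value per nonzero
  {
    // read a single cell: scan the row's segment, 0.0 if not present
    def get(i: Int, j: Int): Double = {
      var p = rowPtr(i)
      while (p < rowPtr(i + 1)) {
        if (colIdx(p) == j) return values(p)
        p += 1
      }
      0.0
    }
  }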

13 Conclusions

Summary
- History and architecture of SystemML on Spark
- Spark-specific optimizer extensions (caching, partitioning, operator selection)
- Spark-specific runtime extensions (buffer pool, partitioning-preserving ops)
- SystemML on Spark: lessons learned

Conclusions
- Integrated compiled execution plans w/ Spark's lazy evaluation
- 5-10x (in-memory) and 2x (out-of-core) improvements over MR
- SystemML is open source: extend it for your research project!

Ongoing work
- Simplification/extension of APIs (e.g., batch, context, DSL, scoring)
- Additional data types and related operations (e.g., frames)
- Continued compiler/runtime improvements
- Low-level performance features (e.g., compression, codegen, GPUs/accelerators)

14 SystemML is Open Source:
Apache Incubator project since 11/2015
Website:
Sources:

Upcoming
- Wed Sep 7, 11.15am, D3b: CLA
- Poster: Fri Sep 9, 9am-5.30pm

15 Backup: High-Level SystemML Architecture
[Figure: DML scripts enter the language layer (DML, the Declarative Machine Learning Language), then the compiler, then the runtime; this talk focuses on the compiler and runtime. Execution backends: in-memory single node (scale-up, since 2012), Hadoop cluster (scale-out, since 2010/11), Spark cluster (scale-out, since 2015)]

16 Backup: End-To-End Experiments (1)
[Charts: end-to-end runtimes for GLM (binomial probit), L2SVM, LinregCG, LinregDS, MLogreg, MSVM, Naïve Bayes, and KMeans]

17 Backup: End-To-End Experiments (2)
ALS-CG, rank 50:

  Scenario                           cp+mr     cp+spark   spark
  S (100K x 100K, sp=0.01, 1.2GB)    131s      136s       135s
  M (1M x 100K, sp=0.01, 12GB)       1,088s    342s       432s
  L (10M x 100K, sp=0.01, 120GB)     >24h      10,537s    15,487s

18 Backup: Partitioning and Memory Efficiency
ParFor-specific optimizations (MSVM, k=5 classes, 250M x 40 (80GB), M_CP = 0.7 * 5GB):

  Scenario                  Runtime
  Without repartitioning    9,102s
  Without fair scheduling   2,088s
  Without eager caching     1,885s
  All optimizations         1,470s   (6.2x overall)

Memory efficiency (LinregCG, 5M x 100K/1M (120GB)):

  Scenario     5M x 100K, sp=0.02   5M x 1M, sp=0.002
  MCSR         165s                 2,152s
  MCSR + Ser   159s                 764s
  CSR          126s                 349s   (6.1x vs MCSR)

