Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics

Yi Wang*, Gagan Agrawal*, Tekin Bicer#, Wei Jiang^
*The Ohio State University  #Argonne National Laboratory  ^Quantcast Corp.
Outline
- Introduction
- Challenges
- System Design
- Experimental Results
- Conclusion
The Big Picture of In Situ
- In-situ algorithms (algorithm/application level)
  - No disk I/O
  - Indexing, compression, visualization, statistical analysis, etc.
- In-situ resource scheduling systems (platform/system level)
  - Enhance resource utilization
  - Simplify the management of analytics code
  - GoldRush, GLEAN, DataSpaces, FlexIO, etc.
- These two levels have been extensively studied, but are they seamlessly connected? Is there anything to do between them?
Rethinking These Two Levels
- In-situ algorithms
  - Implemented with low-level APIs like OpenMP/MPI
  - Manually handle all the parallelization details
- In-situ resource scheduling systems
  - Play the role of coordinator
  - Focus on scheduling issues like cycle stealing and asynchronous I/O
  - Provide no high-level parallel programming API
- Motivation
  - Can applications be mapped more easily to platforms for in-situ analytics?
  - Can the offline and in-situ analytics code be (almost) identical?
Opportunity
- Explore the programming model level in the in-situ environment
  - Sits between the application level and the system level
  - Hides all the parallelization complexity behind a simplified API
- A prominent example: MapReduce + in situ
Challenges
- MapReduce is hard to adapt to the in-situ environment: it was not designed for in-situ analytics
- Four mismatches
  - Data loading mismatch
  - Programming view mismatch
  - Memory constraint mismatch
  - Programming language mismatch
Data Loading Mismatch
- In situ requires taking input from memory
- Ways existing MapReduce implementations load data
  - From distributed file systems: Hadoop and many variants (on HDFS), Google MapReduce (on GFS), and Disco (on DDFS)
  - From shared/local file systems: MARIANE and CGL-MapReduce
  - MPI-based: MapReduce-MPI and MRO-MPI
  - From memory: Phoenix (shared memory)
  - From data streams: HOP, M3, and iMR
Data Loading Mismatch (Cont'd)
- Few MapReduce options fit
  - Most implementations load data from file systems
  - Loading data from memory is mostly restricted to shared-memory environments
- Wrap simulation output as a data stream?
  - The stream rate spikes periodically, at each simulation time step
  - Only a one-time scan of the data is allowed
- An exception: Spark
  - Can load data from file systems, memory, or data streams
  - Spark works as long as the input can be converted into an RDD
Programming View Mismatch
- Scientific simulation: parallel programming view
  - Explicit parallelism: partitioning, message passing, and synchronization
- MapReduce: sequential programming view
  - Partitions are transparent
- Need a hybrid programming view (see the sketch below)
  - Exposes partitions during data loading
  - Hides parallelism after data loading
- Traditional MapReduce implementations can neither take partitioned simulation output explicitly as input nor launch analytics from within an SPMD region.
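A minimal sketch of what such a hybrid view could look like, assuming an MPI-based SPMD simulation; `smart::Scheduler`, `MyAnalytics`, and `simulate_time_step` are illustrative names (hypothetical, not Smart's documented API), shown only to make the loading/analytics split concrete:

```cpp
#include <mpi.h>
#include <vector>

// Stand-in for one simulation time step producing this rank's partition.
static std::vector<double> simulate_time_step(int rank) {
  return std::vector<double>(1024, static_cast<double>(rank));
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);                 // SPMD region begins
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<double> local = simulate_time_step(rank);

  // Hybrid view: the framework takes each rank's partition as input
  // (partitions exposed during loading), then hides message passing and
  // synchronization inside the call (sequential view afterwards):
  //   smart::Scheduler<MyAnalytics> sched(MPI_COMM_WORLD);
  //   sched.run(local.data(), local.size());
  (void)local;  // silence unused-variable warning in this sketch

  MPI_Finalize();
  return 0;
}
```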
Memory Constraint Mismatch
- MapReduce is often memory/disk intensive
  - The map phase creates intermediate data
  - Sorting, shuffling, and grouping do not reduce intermediate data at all
  - A local combiner cannot reduce the peak memory consumption in the map phase
- Need an alternate MapReduce API (contrasted in the sketch below)
  - Avoids key-value pair emission in the map phase
  - Eliminates intermediate data in the shuffling phase
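As a rough illustration of the difference (hypothetical code, not Smart's actual API): a conventional map phase would emit one key-value pair per element, while the alternate API folds each element directly into a fixed-size reduction object, so no intermediate data is ever materialized.

```cpp
#include <cstddef>
#include <vector>

// The reduction object has a fixed size, independent of input volume.
struct Histogram {
  std::vector<long> counts;
  explicit Histogram(std::size_t bins) : counts(bins, 0) {}
};

// A conventional map phase would do roughly `emit(bin_of(x), 1)` for
// every element x, creating O(n) intermediate pairs that must then be
// sorted, shuffled, and grouped. The alternate API instead accumulates
// each element into the reduction object on the spot, so peak memory
// stays at the size of the object itself.
void accumulate(Histogram& h, double x, double lo, double hi) {
  std::size_t bin = static_cast<std::size_t>(
      (x - lo) / (hi - lo) * static_cast<double>(h.counts.size()));
  if (bin >= h.counts.size()) bin = h.counts.size() - 1;  // clamp x == hi
  ++h.counts[bin];
}
```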
Programming Language Mismatch
- Simulation code is in Fortran or C/C++
  - Impractical to rewrite in other languages
- Mainstream MapReduce implementations are in Java/Scala
  - Hadoop in Java; Spark in Scala/Java/Python
- Other MapReduce implementations in C/C++ are not widely adopted
Bridging the Gap
- Smart addresses all four mismatches
  - Loads data from (distributed) memory, with no extra memcpy in time sharing mode
  - Presents a hybrid programming view
  - Achieves high memory efficiency through an alternate API
  - Implemented in C++11 with OpenMP + MPI
System Overview
- Three variants: in-situ system, distributed system, shared-memory system
- Distributed system = partitioning + shared-memory system + combination
- In-situ system = shared-memory system + combination = distributed system - partitioning
- No partitioning is needed in situ because the input data is already partitioned by the simulation code.
Two In-Situ Modes
- Space sharing mode: enhances resource utilization when the simulation reaches its scalability bottleneck
- Time sharing mode: minimizes memory consumption (see the sketch below)
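A minimal sketch of the time sharing loop, assuming the analytics is invoked between simulation time steps; `compute_time_step` and the commented-out `scheduler.run` call are illustrative stand-ins:

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the simulation kernel writing one time step into `grid`.
static void compute_time_step(std::vector<double>& grid) {
  for (std::size_t i = 0; i < grid.size(); ++i) grid[i] += 1.0;
}

int main() {
  std::vector<double> grid(1 << 20);  // the simulation's own buffer
  const int num_steps = 100;
  for (int step = 0; step < num_steps; ++step) {
    compute_time_step(grid);
    // Time sharing: the analytics reads `grid` in place, so no extra
    // memcpy is needed; the simulation resumes on the same cores once
    // this step's analytics completes.
    //   scheduler.run(grid.data(), grid.size());  // hypothetical call
  }
  return 0;
}
```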
Ease of Use
- Launching Smart
  - No extra libraries or configuration
  - Minimal changes to the simulation code
  - Analytics code remains the same across the two modes
- Application development (see the sketch below)
  - Define a reduction object
  - Derive a Smart scheduler class and implement three functions:
    - gen_key(s): generates key(s) for a data chunk
    - accumulate: accumulates data on a reduction object
    - merge: merges two reduction objects
- Both the reduction map and the combination map are transparent to the user, and all the parallelization complexity is hidden behind a sequential view.
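A minimal sketch of what such an application could look like, here for a global average. The three hooks (gen_key(s), accumulate, merge) come from the slide above, but the exact signatures, the `Chunk` type, and the commented-out base class are assumptions made for illustration:

```cpp
#include <cstddef>
#include <vector>

struct AvgObject {            // the user-defined reduction object
  double sum = 0.0;
  long count = 0;
};

struct Chunk {                // one unit of simulation output
  const double* data;
  std::size_t size;
};

class AverageScheduler /* : public smart::SchedulerBase<AvgObject> */ {
 public:
  // gen_key(s): every chunk maps to a single key for a global average;
  // a histogram would instead return one key per bin touched.
  std::vector<int> gen_keys(const Chunk&) const { return {0}; }

  // accumulate: fold a chunk into the reduction object for its key.
  void accumulate(const Chunk& c, AvgObject& obj) const {
    for (std::size_t i = 0; i < c.size; ++i) obj.sum += c.data[i];
    obj.count += static_cast<long>(c.size);
  }

  // merge: combine two reduction objects, e.g., across threads or nodes.
  void merge(const AvgObject& src, AvgObject& dst) const {
    dst.sum += src.sum;
    dst.count += src.count;
  }
};
```

Note how the same sequential-looking code can serve both offline and in-situ runs: the framework decides where chunks come from and when merge is invoked.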
Smart vs. Spark
- To make a fair comparison
  - Bypass the programming view mismatch: run on a single 8-core node, multi-threaded but not distributed
  - Bypass the memory constraint mismatch: use a simulation emulator that consumes little memory
  - Bypass the programming language mismatch: rewrite the simulation in Java and compare computation time only
- Setup: 40 GB input, 0.5 GB per time step
- Result: Smart outperforms Spark by 62x on k-means and 92x on histogram
Smart vs. Spark (Cont'd)
- Faster execution, for three reasons
  - Spark emits massive amounts of intermediate data, whereas Smart performs data reduction in place and avoids intermediate data entirely
  - Spark makes a new (immutable) RDD for every transformation, whereas Smart reuses its reduction/combination maps even for iterative processing
  - Spark serializes RDDs and sends them through the network even in local mode, whereas Smart exploits the shared-memory environment of each node and avoids replicating any reduction object
- Greater thread scalability
  - Spark launches extra threads for other tasks, e.g., communication and the driver's UI; Smart launches no extra threads
- Higher memory efficiency
  - Spark: over 90% of 12 GB of memory; Smart: around 16 MB beyond the 0.5 GB time step
Smart vs. Low-Level Implementations
- Setup
  - Smart in time sharing mode; low-level versions in OpenMP + MPI
  - Apps: k-means and logistic regression; 1 TB input on 8-64 nodes
- Programmability
  - 55% and 69% of the parallel code is either eliminated or converted into sequential code
- Performance
  - Up to 9% extra overhead for k-means: the low-level implementation stores reduction objects in contiguous arrays, whereas Smart stores them in map structures and requires extra serialization at each synchronization
  - Nearly unnoticeable overhead for logistic regression: only a single key-value pair is created
Conclusion
- Positioning: programming model research on in-situ analytics
- Functionality
  - Addresses all four mismatches
  - Supports time sharing and space sharing modes
- Performance
  - Outperforms Spark by at least an order of magnitude
  - Comparable to low-level implementations
- Open source: https://github.com/SciPioneer/Smart
Q&A
My Google team is hiring!
Backup Slides
Launching Smart in Time Sharing Mode
Launching Smart in Space Sharing Mode
Data Processing Mechanism
Node Scalability
- Setup: 1 TB of data output by Heat3D; time sharing mode; 8 cores per node; 4-32 nodes
Thread Scalability
- Setup: 1 TB of data output by Lulesh; time sharing mode; 64 nodes; 1-8 threads per node
Memory Efficiency of Time Sharing Mode
- Setup: logistic regression on Heat3D using 4 nodes (left); mutual information on Lulesh using 64 nodes (right)
Efficiency of Space Sharing Mode
- Setup: 1 TB of data output by Lulesh; 8 Xeon Phi nodes with 60 threads per node
- Apps: k-means (left) and moving median (right)
- Space sharing outperforms time sharing by 10% for k-means and by 48% for moving median
- For histogram, by contrast, time sharing performs better: its computation is very lightweight, so the overall cost is dominated by synchronization
Optimization: Early Emission of Reduction Objects
- Motivation: window-based analytics, e.g., a moving average
  - A large number of reduction objects must be maintained, leading to high memory consumption
- Key insight: most reduction objects can be finalized in the reduction phase
- Approach: set a customizable trigger that outputs these reduction objects (locally) as early as possible (see the sketch below)
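An illustrative sketch of the trigger idea for a windowed average (names and structure are assumptions, not Smart's implementation): once the stream has advanced past a window's right edge, that window's reduction object can be finalized and output locally, freeing its memory early.

```cpp
#include <cstdio>
#include <map>

struct WindowAvg { double sum = 0.0; long count = 0; };

// Customizable trigger: a window whose right edge lies behind the
// current stream position can no longer receive data.
static bool window_complete(long window_end, long current_index) {
  return current_index > window_end;
}

// Emit and discard every completed window instead of holding all
// reduction objects until the end of the run.
void maybe_emit_early(std::map<long, WindowAvg>& windows,
                      long current_index, long window_size) {
  for (auto it = windows.begin(); it != windows.end();) {
    const long window_end = it->first + window_size - 1;
    if (window_complete(window_end, current_index)) {
      if (it->second.count > 0)
        std::printf("window %ld: avg = %f\n", it->first,
                    it->second.sum / static_cast<double>(it->second.count));
      it = windows.erase(it);  // frees this object's memory immediately
    } else {
      ++it;
    }
  }
}
```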