Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics

Yi Wang*, Gagan Agrawal*, Tekin Bicer#, Wei Jiang^
*The Ohio State University  #Argonne National Laboratory  ^Quantcast Corp.
Outline
- Introduction
- Challenges
- System Design
- Experimental Results
- Conclusion
The Big Picture of In Situ
- In-situ algorithms (algorithm/application level)
  - No disk I/O
  - Indexing, compression, visualization, statistical analysis, etc.
- In-situ resource scheduling systems (platform/system level)
  - Enhance resource utilization
  - Simplify the management of analytics code
  - GoldRush, GLEAN, DataSpaces, FlexIO, etc.
- These two levels have been extensively studied, but are they seamlessly connected? Is there anything to do between them?
Rethinking These Two Levels
- In-situ algorithms
  - Implemented with low-level APIs like OpenMP/MPI
  - Manually handle all the parallelization details
- In-situ resource scheduling systems
  - Play the role of coordinator
  - Focus on scheduling issues like cycle stealing and asynchronous I/O
  - Provide no high-level parallel programming API
- Motivation
  - Can applications be mapped more easily to platforms for in-situ analytics?
  - Can the offline and in-situ analytics code be (almost) identical?
Opportunity
- Explore the programming model level in the in-situ environment
  - Sits between the application level and the system level
  - Hides all the parallelization complexity behind a simplified API
- A prominent example: MapReduce + in situ
Challenges
- MapReduce is hard to adapt to the in-situ environment: it was not designed for in-situ analytics
- Four mismatches
  - Data loading mismatch
  - Programming view mismatch
  - Memory constraint mismatch
  - Programming language mismatch
Data Loading Mismatch
- In situ requires taking input from memory
- Ways existing MapReduce implementations load data
  - From distributed file systems: Hadoop and many variants (on HDFS), Google MapReduce (on GFS), and Disco (on DDFS)
  - From shared/local file systems: MARIANE and CGL-MapReduce
  - MPI-based: MapReduce-MPI and MRO-MPI
  - From memory: Phoenix (shared memory)
  - From data streams: HOP, M3, and iMR
Data Loading Mismatch (Cont'd)
- Few MapReduce options fit
  - Most implementations load data from file systems
  - Loading data from memory is mostly restricted to shared-memory environments
- Wrap simulation output as a data stream?
  - The stream rate spikes periodically, at each simulation time step
  - Only a one-time scan of the data is allowed
- An exception: Spark
  - Can load data from file systems, memory, or data streams
  - Spark works as long as the input can be converted into an RDD
Programming View Mismatch
- Scientific simulation: parallel programming view
  - Explicit parallelism: partitioning, message passing, and synchronization
- MapReduce: sequential programming view
  - Partitions are transparent
- Need a hybrid programming view (see the sketch below)
  - Exposes partitions during data loading
  - Hides parallelism after data loading
- Traditional MapReduce implementations can neither take partitioned simulation output explicitly as input nor launch analytics from within an SPMD region.
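A minimal sketch of what such a hybrid view could look like, assuming an MPI-based SPMD simulation; `smart::Scheduler`, `MyAnalytics`, and `simulate_time_step` are illustrative names (hypothetical, not Smart's documented API), shown only to make the loading/analytics split concrete:

```cpp
#include <mpi.h>
#include <vector>

// Stand-in for one simulation time step producing this rank's partition.
static std::vector<double> simulate_time_step(int rank) {
  return std::vector<double>(1024, static_cast<double>(rank));
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);                 // SPMD region begins
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<double> local = simulate_time_step(rank);

  // Hybrid view: the framework takes each rank's partition as input
  // (partitions exposed during loading), then hides message passing and
  // synchronization inside the call (sequential view afterwards):
  //   smart::Scheduler<MyAnalytics> sched(MPI_COMM_WORLD);
  //   sched.run(local.data(), local.size());
  (void)local;  // silence unused-variable warning in this sketch

  MPI_Finalize();
  return 0;
}
```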
Memory Constraint Mismatch
- MapReduce is often memory/disk intensive
  - The map phase creates intermediate data
  - Sorting, shuffling, and grouping do not reduce intermediate data at all
  - A local combiner cannot reduce the peak memory consumption in the map phase
- Need an alternate MapReduce API (contrasted in the sketch below)
  - Avoids key-value pair emission in the map phase
  - Eliminates intermediate data in the shuffling phase
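As a rough illustration of the difference (hypothetical code, not Smart's actual API): a conventional map phase would emit one key-value pair per element, while the alternate API folds each element directly into a fixed-size reduction object, so no intermediate data is ever materialized.

```cpp
#include <cstddef>
#include <vector>

// The reduction object has a fixed size, independent of input volume.
struct Histogram {
  std::vector<long> counts;
  explicit Histogram(std::size_t bins) : counts(bins, 0) {}
};

// A conventional map phase would do roughly `emit(bin_of(x), 1)` for
// every element x, creating O(n) intermediate pairs that must then be
// sorted, shuffled, and grouped. The alternate API instead accumulates
// each element into the reduction object on the spot, so peak memory
// stays at the size of the object itself.
void accumulate(Histogram& h, double x, double lo, double hi) {
  std::size_t bin = static_cast<std::size_t>(
      (x - lo) / (hi - lo) * static_cast<double>(h.counts.size()));
  if (bin >= h.counts.size()) bin = h.counts.size() - 1;  // clamp x == hi
  ++h.counts[bin];
}
```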
Programming Language Mismatch
- Simulation code is in Fortran or C/C++
  - Impractical to rewrite in other languages
- Mainstream MapReduce implementations are in Java/Scala
  - Hadoop in Java; Spark in Scala/Java/Python
- Other MapReduce implementations in C/C++ are not widely adopted
Bridging the Gap
- Smart addresses all four mismatches
  - Loads data from (distributed) memory, with no extra memcpy in time sharing mode
  - Presents a hybrid programming view
  - Achieves high memory efficiency through an alternate API
  - Implemented in C++11 with OpenMP + MPI
System Overview
- Three variants: in-situ system, distributed system, shared-memory system
- Distributed system = partitioning + shared-memory system + combination
- In-situ system = shared-memory system + combination = distributed system - partitioning
- No partitioning is needed in situ because the input data is already partitioned by the simulation code.
Two In-Situ Modes
- Space sharing mode: enhances resource utilization when the simulation reaches its scalability bottleneck
- Time sharing mode: minimizes memory consumption (see the sketch below)
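A minimal sketch of the time sharing loop, assuming the analytics is invoked between simulation time steps; `compute_time_step` and the commented-out `scheduler.run` call are illustrative stand-ins:

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the simulation kernel writing one time step into `grid`.
static void compute_time_step(std::vector<double>& grid) {
  for (std::size_t i = 0; i < grid.size(); ++i) grid[i] += 1.0;
}

int main() {
  std::vector<double> grid(1 << 20);  // the simulation's own buffer
  const int num_steps = 100;
  for (int step = 0; step < num_steps; ++step) {
    compute_time_step(grid);
    // Time sharing: the analytics reads `grid` in place, so no extra
    // memcpy is needed; the simulation resumes on the same cores once
    // this step's analytics completes.
    //   scheduler.run(grid.data(), grid.size());  // hypothetical call
  }
  return 0;
}
```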
Ease of Use
- Launching Smart
  - No extra libraries or configuration
  - Minimal changes to the simulation code
  - Analytics code remains the same across the two modes
- Application development (see the sketch below)
  - Define a reduction object
  - Derive a Smart scheduler class and implement three functions:
    - gen_key(s): generates key(s) for a data chunk
    - accumulate: accumulates data on a reduction object
    - merge: merges two reduction objects
- Both the reduction map and the combination map are transparent to the user, and all the parallelization complexity is hidden behind a sequential view.
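A minimal sketch of what such an application could look like, here for a global average. The three hooks (gen_key(s), accumulate, merge) come from the slide above, but the exact signatures, the `Chunk` type, and the commented-out base class are assumptions made for illustration:

```cpp
#include <cstddef>
#include <vector>

struct AvgObject {            // the user-defined reduction object
  double sum = 0.0;
  long count = 0;
};

struct Chunk {                // one unit of simulation output
  const double* data;
  std::size_t size;
};

class AverageScheduler /* : public smart::SchedulerBase<AvgObject> */ {
 public:
  // gen_key(s): every chunk maps to a single key for a global average;
  // a histogram would instead return one key per bin touched.
  std::vector<int> gen_keys(const Chunk&) const { return {0}; }

  // accumulate: fold a chunk into the reduction object for its key.
  void accumulate(const Chunk& c, AvgObject& obj) const {
    for (std::size_t i = 0; i < c.size; ++i) obj.sum += c.data[i];
    obj.count += static_cast<long>(c.size);
  }

  // merge: combine two reduction objects, e.g., across threads or nodes.
  void merge(const AvgObject& src, AvgObject& dst) const {
    dst.sum += src.sum;
    dst.count += src.count;
  }
};
```

Note how the same sequential-looking code can serve both offline and in-situ runs: the framework decides where chunks come from and when merge is invoked.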
Smart vs. Spark
- To make a fair comparison
  - Bypass the programming view mismatch: run on a single 8-core node, multi-threaded but not distributed
  - Bypass the memory constraint mismatch: use a simulation emulator that consumes little memory
  - Bypass the programming language mismatch: rewrite the simulation in Java and compare computation time only
- Setup: 40 GB input, 0.5 GB per time step
- Result: Smart outperforms Spark by 62x on k-means and 92x on histogram
Smart vs. Spark (Cont'd)
- Faster execution, for three reasons
  - Spark emits massive amounts of intermediate data, whereas Smart performs data reduction in place and avoids intermediate data entirely
  - Spark makes a new (immutable) RDD for every transformation, whereas Smart reuses its reduction/combination maps even for iterative processing
  - Spark serializes RDDs and sends them through the network even in local mode, whereas Smart exploits the shared-memory environment of each node and avoids replicating any reduction object
- Greater thread scalability
  - Spark launches extra threads for other tasks, e.g., communication and the driver's UI; Smart launches no extra threads
- Higher memory efficiency
  - Spark: over 90% of 12 GB of memory; Smart: around 16 MB beyond the 0.5 GB time step
Smart vs. Low-Level Implementations
- Setup
  - Smart in time sharing mode; low-level versions in OpenMP + MPI
  - Apps: k-means and logistic regression; 1 TB input on 8-64 nodes
- Programmability
  - 55% and 69% of the parallel code is either eliminated or converted into sequential code
- Performance
  - Up to 9% extra overhead for k-means: the low-level implementation stores reduction objects in contiguous arrays, whereas Smart stores them in map structures and requires extra serialization at each synchronization
  - Nearly unnoticeable overhead for logistic regression: only a single key-value pair is created
Conclusion
- Positioning: programming model research on in-situ analytics
- Functionality
  - Addresses all four mismatches
  - Supports time sharing and space sharing modes
- Performance
  - Outperforms Spark by at least an order of magnitude
  - Comparable to low-level implementations
- Open source: https://github.com/SciPioneer/Smart
Q&A
My Google team is hiring!
Backup Slides
Launching Smart in Time Sharing Mode
Launching Smart in Space Sharing Mode
Data Processing Mechanism
Node Scalability
- Setup: 1 TB of data output by Heat3D; time sharing mode; 8 cores per node; 4-32 nodes
Thread Scalability
- Setup: 1 TB of data output by Lulesh; time sharing mode; 64 nodes; 1-8 threads per node
Memory Efficiency of Time Sharing Mode
- Setup: logistic regression on Heat3D using 4 nodes (left); mutual information on Lulesh using 64 nodes (right)
Efficiency of Space Sharing Mode
- Setup: 1 TB of data output by Lulesh; 8 Xeon Phi nodes with 60 threads per node
- Apps: k-means (left) and moving median (right)
- Space sharing outperforms time sharing by 10% for k-means and by 48% for moving median
- For histogram, by contrast, time sharing performs better: its computation is very lightweight, so the overall cost is dominated by synchronization
Optimization: Early Emission of Reduction Objects
- Motivation: window-based analytics, e.g., a moving average
  - A large number of reduction objects must be maintained, leading to high memory consumption
- Key insight: most reduction objects can be finalized in the reduction phase
- Approach: set a customizable trigger that outputs these reduction objects (locally) as early as possible (see the sketch below)
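An illustrative sketch of the trigger idea for a windowed average (names and structure are assumptions, not Smart's implementation): once the stream has advanced past a window's right edge, that window's reduction object can be finalized and output locally, freeing its memory early.

```cpp
#include <cstdio>
#include <map>

struct WindowAvg { double sum = 0.0; long count = 0; };

// Customizable trigger: a window whose right edge lies behind the
// current stream position can no longer receive data.
static bool window_complete(long window_end, long current_index) {
  return current_index > window_end;
}

// Emit and discard every completed window instead of holding all
// reduction objects until the end of the run.
void maybe_emit_early(std::map<long, WindowAvg>& windows,
                      long current_index, long window_size) {
  for (auto it = windows.begin(); it != windows.end();) {
    const long window_end = it->first + window_size - 1;
    if (window_complete(window_end, current_index)) {
      if (it->second.count > 0)
        std::printf("window %ld: avg = %f\n", it->first,
                    it->second.sum / static_cast<double>(it->second.count));
      it = windows.erase(it);  // frees this object's memory immediately
    } else {
      ++it;
    }
  }
}
```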