Research in In-Situ Data Analytics
Gagan Agrawal, The Ohio State University
(Joint work with Yi Wang, Yu Su, and others)

In-Situ Scientific Analytics
What is "In Situ"?
– Co-locating simulation and analytics programs
– Moving computation instead of data
Constraints of "In Situ"
– Minimize the impact on the simulation: memory constraint, time constraint
(Figure: the traditional pipeline moves data from the simulation to analytics through persistent storage; in situ, simulation and analytics run side by side.)

In-Situ Analysis – What and Why
The process of transforming data at run time:
– Analysis
– Classification
– Reduction
– Visualization
In situ has the promise of:
– Saving more information-dense data
– Saving I/O or network transfer time
– Saving disk space
– Saving time in analysis

Key Questions
How do we decide what data to save?
– This analysis cannot take too much time/memory
– Simulations already consume most available memory
– Scientists cannot accept much slowdown for analytics
How can insights be obtained in situ?
– Must be memory and time efficient
What representation should be used for data stored on disk?
– Effective analysis/visualization
– Disk/network efficient

A Vertical View
In-situ algorithms (algorithm/application level)
– No disk I/O
– Indexing, compression, visualization, statistical analysis, etc.
In-situ resource scheduling systems (platform/system level)
– Enhance resource utilization
– Simplify the management of analytics code
– GoldRush, Glean, DataSpaces, FlexIO, etc.
Are the two levels seamlessly connected?

Rethink These Two Levels
In-situ algorithms
– Implemented with low-level APIs like OpenMP/MPI
– Manually handle all the parallelization details
In-situ resource scheduling systems
– Play the role of coordinator
– Focus on scheduling issues like cycle stealing and asynchronous I/O
– No high-level parallel programming API
Motivation
– Can applications be mapped more easily to the platforms for in-situ analytics?
– Can the offline and in-situ analytics code be (almost) identical?

Outline
Background
Bitmap-based summarization and processing
– Key ideas
– Algorithms
– Evaluation
Smart middleware system
– Motivation
– Design
– Evaluation
Conclusions

Key Questions (revisited)
How do we decide what data to save?
– This analysis cannot take too much time/memory
– Simulations already consume most available memory
– Scientists cannot accept much slowdown for analytics
How can insights be obtained in situ?
– Must be memory and time efficient
What representation should be used for data stored on disk?
– Effective analysis/visualization
– Disk/network efficient

Quick Answers
How do we decide what data to save?
– Use bitmaps!
How can insights be obtained in situ?
– Use bitmaps!!
What representation should be used for data stored on disk?
– Bitmaps!!!

Specific Issues
Bitmaps as data summarization
– Utilize extra compute power for data reduction
– Save memory usage, disk I/O, and network transfer time
In-situ data reduction
– Generate bitmaps in situ
  Bitmap generation is time-consuming
  Bitmaps before compression have a large memory cost
In-situ data analysis
– Time-step selection
  Can bitmaps support time-step selection?
  Efficiency of time-step selection using bitmaps
Offline analysis
– Keep only bitmaps instead of data
– Types of analysis supported by bitmaps

Background: Bitmaps
– Widely used in scientific data management
– Suitable for floating-point values by binning small value ranges
– Run-length compression (WAH, BBC)
– A bitmap index can be treated as a small profile of the data
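
As a concrete illustration of the binning idea, here is a minimal sketch of building one bitvector per bin for a block of floating-point values. The function name and equal-width binning are assumptions, and a real index would run-length compress each bitvector (e.g., with WAH) afterwards.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal sketch: one (uncompressed) bitvector per bin for a block of
// floating-point values, using equal-width binning over [lo, hi).
std::vector<std::vector<uint64_t>> build_bitmap_index(
    const std::vector<double>& data, double lo, double hi, int num_bins) {
  const std::size_t words = (data.size() + 63) / 64;
  std::vector<std::vector<uint64_t>> bitvectors(
      num_bins, std::vector<uint64_t>(words, 0));
  const double width = (hi - lo) / num_bins;   // assumes num_bins > 0
  for (std::size_t i = 0; i < data.size(); ++i) {
    int bin = static_cast<int>((data[i] - lo) / width);
    if (bin < 0) bin = 0;
    if (bin >= num_bins) bin = num_bins - 1;        // clamp boundary values
    bitvectors[bin][i / 64] |= (1ULL << (i % 64));  // set the bit for element i
  }
  return bitvectors;
}
```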

In-Situ Bitmap Generation

Parallel index generation
– Saves the data loading cost
– Multi-core based index generation
Core allocation strategies
– Shared cores: allocate all cores to both simulation and bitmap generation; the two execute in sequence
– Separate cores: allocate different core sets to simulation and bitmap generation; a data queue is shared between them, and the two execute in parallel (see the sketch below)
In-place bitvector compression
– Scan data by segments
– Merge each segment into the compressed bitvectors
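
A minimal sketch of the shared data queue used by the separate-cores strategy, assuming a simple producer/consumer design; names such as TimeStep and StepQueue are illustrative, not the actual implementation.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Illustrative time-step payload handed from simulation to indexing.
struct TimeStep { std::vector<double> values; };

// Thread-safe queue shared between the simulation cores (producers) and the
// bitmap-generation cores (consumer).
class StepQueue {
 public:
  void push(TimeStep ts) {
    { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(ts)); }
    cv_.notify_one();
  }
  bool pop(TimeStep& out) {              // blocks until data arrives or shutdown
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return !q_.empty() || done_; });
    if (q_.empty()) return false;
    out = std::move(q_.front());
    q_.pop_front();
    return true;
  }
  void shutdown() {
    { std::lock_guard<std::mutex> lk(m_); done_ = true; }
    cv_.notify_all();
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<TimeStep> q_;
  bool done_ = false;
};
```

In the separate-cores scheme, the simulation calls push() after each time step while a dedicated indexing thread loops on pop() and builds bitmaps; in the shared-cores scheme the same cores simply alternate between the two phases.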

Time-Step Selection

Correlation Metrics
Earth Mover's Distance
– Indicates the distance between two probability distributions over a region
– The cost of changing the value distribution of the data
Shannon entropy
– A metric for the variability of the dataset
– High entropy => more randomly distributed data
Mutual information
– A metric for the dependence between two variables
– Low mutual information => the two variables are relatively independent
Conditional entropy
– Self-contained information: the information remaining in one variable once another variable is known
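
For reference, the standard definitions of these metrics; in the bitmap setting the probabilities are estimated from normalized per-bin 1-bit counts (the slide does not spell out the exact estimators).

```latex
% Shannon entropy of a (binned) variable X
H(X) = -\sum_{x} p(x)\,\log p(x)

% Mutual information between X and Y
I(X;Y) = \sum_{x,y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)}

% Conditional entropy: what remains unknown about X once Y is given
H(X \mid Y) = H(X) - I(X;Y)

% Earth Mover's Distance between distributions P and Q,
% minimized over flows f_{ij} with ground distances d_{ij}
\mathrm{EMD}(P,Q) = \min_{f}\ \frac{\sum_{i,j} f_{ij}\, d_{ij}}{\sum_{i,j} f_{ij}}
```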

Calculating Earth Mover's Distance Using Bitmaps
– Divide T_i and T_j into bins over value subsets
– Generate a CFP based on the value differences between the bins of T_i and T_j
– Accumulate the results
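
The slide does not spell out the computation, but for one-dimensional histograms over identical bins EMD has a simple closed form; the following is a minimal sketch using only the per-bin 1-bit counts obtained from the bitmaps of two time steps (function and variable names are illustrative).

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// 1-D EMD between two time steps T_i and T_j from their per-bin popcounts.
// For one-dimensional histograms with identical bins, EMD reduces to the
// accumulated absolute difference of the normalized cumulative distributions
// (in units of the bin width). Assumes both vectors have the same length.
double emd_from_bin_counts(const std::vector<uint64_t>& counts_i,
                           const std::vector<uint64_t>& counts_j) {
  double total_i = 0, total_j = 0;
  for (uint64_t c : counts_i) total_i += c;
  for (uint64_t c : counts_j) total_j += c;
  double cum_diff = 0, emd = 0;
  for (std::size_t b = 0; b < counts_i.size(); ++b) {
    cum_diff += counts_i[b] / total_i - counts_j[b] / total_j;
    emd += std::fabs(cum_diff);    // "earth" carried past bin boundary b
  }
  return emd;
}
```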

Correlation Mining Using Bitmaps
Correlation mining
– Automatically suggests data subsets with high correlations
– Correlation analysis alone requires the user to keep submitting queries
– Traditional method: exhaustive calculation over data subsets (spatial and value), with huge time and memory cost
Correlation mining using bitmaps
– Mutual information calculated from the probability distribution (value subsets)
– A top-down method for value subsets: multi-level bitmap indexing; descend to the low-level index only if the high-level index shows high mutual information
– A bottom-up method for spatial subsets: divide bitvectors (with high correlations) into basic strides and perform 1-bit count operations over the strides (see the sketch below)
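
A minimal sketch of the core primitive used above: estimating mutual information directly from bin bitvectors with bitwise AND and 1-bit counts. Names are illustrative, __builtin_popcountll is a GCC/Clang builtin, and this is not the paper's actual implementation.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

using Bitvector = std::vector<uint64_t>;

// Total number of 1-bits in a bitvector.
static double popcount(const Bitvector& v) {
  double c = 0;
  for (uint64_t w : v) c += __builtin_popcountll(w);
  return c;
}

// Mutual information between variables X and Y, where x_bins[a] and y_bins[b]
// are the bin bitvectors and n is the number of data elements. The joint count
// for bin pair (a, b) is the popcount of (X_a AND Y_b).
double mutual_information(const std::vector<Bitvector>& x_bins,
                          const std::vector<Bitvector>& y_bins, double n) {
  double mi = 0.0;
  for (const Bitvector& xa : x_bins) {
    const double px = popcount(xa) / n;
    for (const Bitvector& yb : y_bins) {
      const double py = popcount(yb) / n;
      double joint = 0;
      for (std::size_t w = 0; w < xa.size(); ++w)     // bitwise AND, then count
        joint += __builtin_popcountll(xa[w] & yb[w]);
      const double pxy = joint / n;
      if (pxy > 0 && px > 0 && py > 0)
        mi += pxy * std::log2(pxy / (px * py));
    }
  }
  return mi;
}
```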

Correlation Mining

Experimental Results
Goals
– Efficiency and storage improvement using bitmaps
– Scalability in a parallel in-situ environment
– Efficiency improvement for correlation mining
– Efficiency and accuracy comparison with sampling
Simulations: Heat3D, Lulesh
Dataset: Parallel Ocean Program (POP)
Environment
– 32 Intel Xeon X5650 CPUs and 1 TB memory
– MIC: 60-core Intel Xeon Phi coprocessor with 8 GB memory
– OSC Oakley cluster: 32 nodes, each with 12 Intel Xeon X5650 CPU cores and 48 GB memory

Efficiency Comparison for In-Situ Analysis - CPU
Setup: Heat3D simulation on CPU; select 25 out of 100 time steps; 6.4 GB per time step (800 x 1000 x 1000); metric: conditional entropy
Full data (original)
– Simulation: poor scalability
– Time-step selection: large cost
– Data writing: large cost and poor scalability
Bitmaps
– Simulation utilizes extra computing power for bitmap generation
– Extra bitmap generation time, but good scalability
– Time-step selection using bitmaps: 1.38x to 1.5x speedup
– Bitmap writing: 6.78x speedup
– Overall: 0.79x to 2.38x; the more cores, the better the speedup

Efficiency Comparison for In-Situ Analysis - MIC
Setup: Heat3D simulation on MIC; select 25 out of 100 time steps; 1.6 GB per time step (200 x 1000 x 1000); metric: conditional entropy
MIC: more cores, lower bandwidth
Full data (original)
– Huge data writing time
Bitmaps
– Good scalability of both bitmap generation and time-step selection using bitmaps
– Much smaller data writing time
– Overall: 0.81x to 3.28x speedup

Memory Cost of In-Situ Analysis
Setup: Heat3D and Lulesh simulations on CPU and MIC; keep 10 time steps in memory
Heat3D, no indexing: 12 time steps (previous, temporary, current)
Heat3D, bitmap indexing: 2 time steps (previous, temporary), 1 set of previously selected indices, 10 current indices
Lulesh, no indexing: 11 time steps (previous, current), plus large extra memory for edges
Lulesh, bitmap indexing: 1 time step (previous), 1 set of previously selected indices, 10 current indices, plus large extra memory for edges
Result: 2.0x to 3.59x lower memory usage; the advantage grows as larger data is simulated and more time steps are held

Scalability in a Parallel Environment
Setup: Heat3D simulation; select 25 time steps out of 100; TEMP variable, 6.4 GB per time step; 1 to 32 nodes, 8 cores each
Full data, local: each node writes its data sub-block to its own disk
Bitmaps, local: each node writes its bitmap sub-block to its own disk; fast time-step selection and local writing; 1.24x to 1.29x speedup
Full data, remote: each node sends its data sub-block to a master node
Bitmaps, remote: greatly alleviates the data transfer burden on the master node; 1.24x to 3.79x speedup

Speedup for Correlation Mining
Setup: POP simulation; variables TEMP and SALT; data size per variable 1.4 GB to 11.2 GB; 1 core
Full data
– Large data loading cost
– Exhaustive calculations over data subsets; each calculation is time-consuming
Bitmaps
– Smaller data loading cost
– Multi-level bitmaps improve the mining process
– Bitwise AND and 1-bit count operations improve calculation efficiency
– 3.81x to 4.92x speedup

In-Situ Sampling vs. Bitmaps
Setup: Heat3D, 100 time steps (6.4 GB per time step), 32 cores
– Bitmap generation (binning, compression) has a higher time cost than down-sampling
– Sampling can effectively reduce the time-step selection cost
– Bitmap generation can still achieve better efficiency if the index size is smaller than the sample size
– Bitmaps: using the same binning scale, there is no information loss
– Sampling: information loss is unavoidable regardless of the sampling rate
(Chart: information loss at 30%, 15%, and 5% sampling rates; exact percentages not recoverable from the transcript.)

Outline
Background
Bitmap-based summarization and processing
– Key ideas
– Algorithms
– Evaluation
Smart middleware system
– Motivation
– Design
– Evaluation
Conclusions

The Big Picture
In-situ algorithms (algorithm/application level)
– No disk I/O
– Indexing, compression, visualization, statistical analysis, etc.
In-situ resource scheduling systems (platform/system level)
– Enhance resource utilization
– Simplify the management of analytics code
– GoldRush, Glean, DataSpaces, FlexIO, etc.
Are the two levels seamlessly connected?

Opportunity
Explore the programming model level in the in-situ environment
– Sits between the application level and the system level
– Hides all parallelization complexities behind a simplified API
– A prominent example: MapReduce, brought in situ

Challenges
Hard to adapt MapReduce to the in-situ environment
– MapReduce is not designed for in-situ analytics
Four mismatches
– Data loading mismatch
– Programming view mismatch
– Memory constraint mismatch
– Programming language mismatch

Data Loading Mismatch
In situ requires taking input from memory
Ways to load data into MapReduce implementations
– From distributed file systems: Hadoop and many variants (on HDFS), Google MR (on GFS), and Disco (on DDFS)
– From shared/local file systems: MARIANE and CGL-MapReduce; MPI-based: MapReduce-MPI and MRO-MPI
– From memory: Phoenix (shared memory)
– From data streams: HOP, M3, and iMR

Data Loading Mismatch (Cont'd)
Few MR options fit
– Most MR implementations load data from file systems
– Loading data from memory is mostly restricted to shared-memory environments
– Wrap simulation output as a data stream? Periodic stream spiking, and only a one-time scan is allowed
An exception: Spark
– Can load data from file systems, memory, or data streams

Programming View Mismatch
Scientific simulation
– Parallel programming view
– Explicit parallelism: partitioning, message passing, and synchronization
MapReduce
– Sequential programming view
– Partitions are transparent
Need a hybrid programming view (sketched below)
– Exposes partitions during data loading
– Hides parallelism after data loading
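
A minimal sketch of what the hybrid view implies for application code, with hypothetical names (AnalyticsRuntime, run); the MPI partition is visible only where the local block is handed to the runtime, and the analytics kernel itself stays sequential.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Stand-in for the analytics runtime: internally it splits the chunk across
// threads, reduces locally, and merges partial results across ranks, so none
// of that appears in user code.
struct AnalyticsRuntime {
  void run(const double* chunk, std::size_t len) { (void)chunk; (void)len; }
};

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  std::vector<double> local_block(1 << 20, 0.0);  // this rank's partition
  AnalyticsRuntime analytics;

  // Explicit partition at loading time ("here is my local data") ...
  analytics.run(local_block.data(), local_block.size());
  // ... but no ranks, threads, or messages appear in the analytics kernel.

  MPI_Finalize();
  return 0;
}
```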

Memory Constraint Mismatch
MapReduce is often memory/disk intensive
– The map phase creates intermediate data
– Sorting, shuffling, and grouping do not reduce the intermediate data at all
– A local combiner cannot reduce the peak memory consumption (in the map phase)
Need an alternate MR API
– Avoids key-value pair emission in the map phase
– Eliminates intermediate data in the shuffling phase

Programming Language Mismatch
Simulation code is in Fortran or C/C++
– Impractical to rewrite in other languages
Mainstream MR implementations are in Java/Scala
– Hadoop in Java
– Spark in Scala/Java/Python
– MR implementations in C/C++ are not widely adopted

Bridging the Gap
Smart addresses all the mismatches
– Loads data from (distributed) memory, even without an extra memcpy in time sharing mode
– Presents a hybrid programming view
– High memory efficiency through the alternate API
– Implemented in C++11, with OpenMP + MPI

System Overview
In-Situ System = Shared-Memory System + Combination
In-Situ System = Distributed System - Partitioning

Two In-Situ Modes
Time sharing mode: minimizes memory consumption
Space sharing mode: enhances resource utilization when the simulation reaches its scalability bottleneck

Launching Smart in Time Sharing Mode
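
The launch sequence is shown only as a figure in the deck; the following is a minimal sketch of what time sharing looks like from the simulation's perspective, with hypothetical names (SmartScheduler, run). Simulation and analytics alternate on the same cores, and the runtime consumes each time step directly from the simulation's buffer, so no extra copy is needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical analytics handle; the real Smart API may differ.
struct SmartScheduler {
  void run(const double* time_step, std::size_t len) { (void)time_step; (void)len; }
};

// Stand-in for one step of the existing simulation (unchanged user code).
void simulate_one_step(std::vector<double>& grid) {
  for (double& v : grid) v += 0.01;
}

int main() {
  std::vector<double> grid(1 << 20, 0.0);
  SmartScheduler smart;
  for (int step = 0; step < 100; ++step) {
    simulate_one_step(grid);               // simulation uses all the cores ...
    smart.run(grid.data(), grid.size());   // ... then analytics reads the buffer in place
  }
  return 0;
}
```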

Launching Smart in Space Sharing Mode

Ease of Use
Launching Smart
– No extra libraries or configuration
– Minimal changes to the simulation code
– Analytics code remains the same across modes
Application development (see the sketch below)
– Define a reduction object
– Derive a Smart scheduler class and implement:
  gen_key(s): generates the key(s) for a data chunk
  accumulate: accumulates data onto a reduction object
  merge: merges two reduction objects
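
As a concrete illustration of these three functions, here is a hypothetical histogram kernel. The reduction object and the gen_key/accumulate/merge split follow the slide's description, but the actual Smart base class and signatures may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <map>

// Reduction object: one histogram bin.
struct BinCount {
  long count = 0;
};

// Sketch of a derived scheduler for a histogram.
struct HistogramScheduler {
  double lo = 0.0, width = 1.0;
  std::map<int, BinCount> bins;        // key -> reduction object

  // gen_key: which bin (key) a data element belongs to.
  int gen_key(double v) const {
    return static_cast<int>(std::floor((v - lo) / width));
  }
  // accumulate: fold one element into its reduction object.
  void accumulate(double v, BinCount& obj) { (void)v; obj.count += 1; }
  // merge: combine two reduction objects (e.g., from different threads/nodes).
  void merge(BinCount& dst, const BinCount& src) { dst.count += src.count; }

  // Driver for one local chunk; in Smart this looping is done by the runtime.
  void process_chunk(const double* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i)
      accumulate(data[i], bins[gen_key(data[i])]);
  }
};
```

Note that no key-value pairs are emitted: each element is folded directly into its reduction object, which is what eliminates the intermediate data of the classic map/shuffle phases.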

Optimization: Early Emission of Reduction Objects
Motivation
– Mainly targets window-based analytics, e.g., moving average
– A large number of reduction objects to maintain leads to high memory consumption
Key insight
– Most reduction objects can be finalized in the reduction phase
– Set a customizable trigger that outputs these reduction objects (locally) as early as possible (sketched below)
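
A minimal sketch of the early-emission idea for a moving average, assuming elements arrive in window order; the trigger condition and data structures are illustrative, not the Smart implementation.

```cpp
#include <cstdio>
#include <map>

// Reduction object for one window of the moving average.
struct WindowSum { double sum = 0; long n = 0; };

struct MovingAverage {
  std::map<long, WindowSum> open;   // only still-active windows stay resident

  void accumulate(long window_id, double v) {
    WindowSum& w = open[window_id];
    w.sum += v;
    w.n += 1;
    // Trigger: windows strictly older than the current one can no longer
    // receive elements (elements arrive in window order), so emit and drop
    // them now instead of holding every window until the end of the run.
    for (auto it = open.begin(); it != open.end() && it->first < window_id; ) {
      std::printf("window %ld avg %f\n", it->first, it->second.sum / it->second.n);
      it = open.erase(it);
    }
  }
};
```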

Smart vs. Spark
To make a fair comparison
– Bypass the programming view mismatch: run on a single 8-core node, multi-threaded but not distributed
– Bypass the memory constraint mismatch: use a simulation emulator that consumes little memory
– Bypass the programming language mismatch: rewrite the simulation in Java and compare only computation time
– 40 GB input, 0.5 GB per time step
(Charts: Smart outperforms Spark by 62x and 92x on K-Means and Histogram, respectively.)

Smart vs. Spark (Cont'd)
Faster execution
– Spark 1) emits intermediate data, 2) creates immutable RDDs, and 3) serializes RDDs and sends them through the network even in local mode
– Smart 1) avoids intermediate data, 2) performs data reduction in place, and 3) takes advantage of the shared-memory environment (of each node)
Better (thread) scalability
– Spark launches extra threads for other tasks, e.g., communication and the driver's UI
– Smart launches no extra threads
Higher memory efficiency
– Spark: over 90% of 12 GB memory
– Smart: around 16 MB besides the 0.5 GB time step

Smart vs. Low-Level Implementations
Setup
– Smart in time sharing mode; low-level: OpenMP + MPI
– Apps: K-means and logistic regression
– 1 TB input on 8 to 64 nodes
Programmability
– 55% and 69% of the parallel code is either eliminated or converted into sequential code
Performance
– Up to 9% extra overhead for K-means
– Nearly unnoticeable overhead for logistic regression

Node Scalability
Setup
– 1 TB of data output by Heat3D; time sharing mode; 8 cores per node
– 4 to 32 nodes

Thread Scalability
Setup
– 1 TB of data output by Lulesh; time sharing mode; 64 nodes
– 1 to 8 threads per node

Memory Efficiency of Time Sharing
Setup
– Logistic regression on Heat3D using 4 nodes (left)
– Mutual information on Lulesh using 64 nodes (right)

Efficiency of Space Sharing Mode
Setup
– 1 TB of data output by Lulesh
– 8 Xeon Phi nodes, 60 threads per node
– Apps: K-Means (left) and Moving Median (right)
Result: space sharing outperforms time sharing by 48% and 10%, respectively

Conclusions
In-situ analytics needs to be carefully architected
– Memory constraints
– Programmability issues
– Many-core processors are changing the game
Bitmaps can be generated sufficiently fast
– An effective summarization structure
– Memory efficient
– No loss of accuracy in most cases
Smart middleware beats conventional wisdom
– Commercial "Big Data" ideas can be applied
– Requires careful design of the middleware