Research in In-Situ Data Analytics
Gagan Agrawal, The Ohio State University
(Joint work with Yi Wang, Yu Su, and others)
In-Situ Scientific Analytics
What is "In Situ"?
– Co-locating simulation and analytics programs
– Moving computation instead of data
Constraints of "In Situ"
– Minimize the impact on simulation
  Memory constraint
  Time constraint
(Diagram: simulation and analytics co-located in memory, versus simulation writing to persistent storage that analytics later reads.)
In-Situ Analysis – What and Why
Process of transforming data at run time:
– Analysis
– Classification
– Reduction
– Visualization
In situ has the promise of:
– Saving more information-dense data
– Saving I/O and network transfer time
– Saving disk space
– Saving time in analysis
Key Questions
How do we decide what data to save?
– This analysis cannot take too much time/memory
– Simulations already consume most available memory
– Scientists cannot accept much slowdown for analytics
How can insights be obtained in situ?
– Must be memory and time efficient
What representation should be used for data stored on disk?
– Effective analysis/visualization
– Disk/network efficient
A Vertical View
Algorithm/application level – in-situ algorithms:
– No disk I/O
– Indexing, compression, visualization, statistical analysis, etc.
Platform/system level – in-situ resource scheduling systems:
– Enhance resource utilization
– Simplify the management of analytics code
– GoldRush, Glean, DataSpaces, FlexIO, etc.
Are the two levels seamlessly connected?
Rethinking These Two Levels
In-situ algorithms:
– Implemented with low-level APIs like OpenMP/MPI
– Must manually handle all the parallelization details
In-situ resource scheduling systems:
– Play the role of coordinator
– Focus on scheduling issues like cycle stealing and asynchronous I/O
– Provide no high-level parallel programming API
Motivation:
– Can applications be mapped more easily to the platforms for in-situ analytics?
– Can the offline and in-situ analytics code be (almost) identical?
Outline
Background
Bitmap-based summarization and processing
– Key Ideas
– Algorithms
– Evaluation
Smart Middleware System
– Motivation
– Design
– Evaluation
Conclusions
Key Questions
How do we decide what data to save?
– This analysis cannot take too much time/memory
– Simulations already consume most available memory
– Scientists cannot accept much slowdown for analytics
How can insights be obtained in situ?
– Must be memory and time efficient
What representation should be used for data stored on disk?
– Effective analysis/visualization
– Disk/network efficient
Quick Answers
How do we decide what data to save?
– Use bitmaps!
How can insights be obtained in situ?
– Use bitmaps!!
What representation should be used for data stored on disk?
– Bitmaps!!!
Specific Issues
Bitmaps as data summarization:
– Utilize extra compute power for data reduction
– Save memory usage, disk I/O, and network transfer time
In-situ data reduction:
– Generate bitmaps in situ
  Bitmap generation is time-consuming
  Uncompressed bitmaps have a large memory cost
In-situ data analysis:
– Time step selection
  Can bitmaps support time step selection?
  How efficient is time step selection using bitmaps?
Offline analysis:
– Keep only the bitmaps instead of the data
– What types of analysis can bitmaps support?
Background: Bitmaps
Widely used in scientific data management
Suitable for floating-point values by binning small value ranges
Run-length compression (WAH, BBC)
Bitmaps can be treated as a small profile of the data
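As a concrete illustration of binning-based bitmap indexing, here is a minimal C++ sketch (function and variable names are illustrative, not from the actual system): each equal-width value bin gets one bitvector, and element i sets bit i of the bitvector for the bin containing its value. Values are assumed to lie in [lo, hi].

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::vector<std::vector<uint64_t>>
    build_bitmaps(const std::vector<double>& data,
                  double lo, double hi, int num_bins) {
        // One bitvector per value bin; bit i marks that element i falls in the bin.
        size_t words = (data.size() + 63) / 64;
        std::vector<std::vector<uint64_t>> bitmaps(
            num_bins, std::vector<uint64_t>(words, 0));
        double width = (hi - lo) / num_bins;   // equal-width bins (an assumption)
        for (size_t i = 0; i < data.size(); ++i) {
            int b = std::min(num_bins - 1, (int)((data[i] - lo) / width));
            bitmaps[b][i / 64] |= 1ULL << (i % 64);   // set bit i in bin b
        }
        return bitmaps;
    }

Compression is what makes this practical: most bins are sparse, so run-length encoding shrinks the bitvectors dramatically, as sketched after the next slide.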
In-Situ Bitmap Generation
Parallel index generation:
– Saves the data loading cost
– Multi-core based index generation
Core allocation strategies:
– Shared cores
  All cores are allocated to both simulation and bitmap generation
  The two are executed in sequence
– Separate cores
  Different core sets are allocated to simulation and bitmap generation
  A data queue is shared between simulation and bitmap generation
  The two are executed in parallel
In-place bitvector compression (see the sketch below):
– Scan data by segments
– Merge each segment into compressed bitvectors
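A minimal sketch of the run-length merging step, under a simplified word layout rather than the exact WAH/BBC encoding: all-zero and all-one words produced from each data segment are collapsed into fills as they arrive, so the uncompressed bitvector is never materialized in memory.

    #include <cstdint>
    #include <vector>

    // payload holds the fill bit (0 or 1) for fills, or the raw word for literals.
    struct Word { bool is_fill; uint64_t payload; uint64_t run; };

    struct CompressedBitvector {
        std::vector<Word> words;
        void append(uint64_t w) {              // append one 64-bit word at a time
            bool fill = (w == 0 || w == ~0ULL);
            if (fill && !words.empty() && words.back().is_fill
                     && words.back().payload == (w & 1)) {
                ++words.back().run;            // extend the current fill run
            } else if (fill) {
                words.push_back({true, w & 1, 1});   // start a new fill run
            } else {
                words.push_back({false, w, 1});      // keep a literal word
            }
        }
    };

Because each segment's bits are appended and merged immediately, peak memory stays proportional to the compressed size, which is the point of the "scan by segments" design.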
Time-Step Selection
Correlation Metrics
Earth Mover's Distance:
– Indicates the distance between two probability distributions over a region
– The cost of transforming one value distribution of the data into another
Shannon's Entropy:
– A metric for the variability of a dataset
– High entropy => more randomly distributed data
Mutual Information:
– A metric for the dependence between two variables
– Low MI => the two variables are relatively independent
Conditional Entropy:
– The information that remains self-contained in one variable
– Information with respect to other variables
(Standard definitions are given below.)
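For reference, the standard definitions behind these four metrics; the one-dimensional Earth Mover's Distance form assumes equal-mass histograms P and Q, as produced by binning the same number of points:

    \begin{aligned}
    H(X) &= -\sum_i p(x_i)\log p(x_i) && \text{(Shannon entropy)} \\
    I(X;Y) &= \sum_{i,j} p(x_i,y_j)\log\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)} && \text{(mutual information)} \\
    H(X \mid Y) &= H(X) - I(X;Y) && \text{(conditional entropy)} \\
    \mathrm{EMD}(P,Q) &= \sum_k \Bigl|\sum_{i\le k}\bigl(P(i)-Q(i)\bigr)\Bigr| && \text{(1-D EMD via CDF differences)}
    \end{aligned}

The last two lines make the slide's intuition precise: conditional entropy is exactly the entropy left over once the mutual information is removed.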
Calculating Earth Mover's Distance Using Bitmaps
Divide T_i and T_j into bins over value subsets
Generate a CFP based on the value differences between the bins of T_i and T_j
Accumulate the results (see the sketch below)
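A sketch of the per-bin computation (illustrative code, not the actual implementation; it assumes uncompressed bitvectors like those from the earlier build_bitmaps sketch and GCC/Clang's __builtin_popcountll): the bin counts of T_i and T_j come from 1-bit counts alone, so the raw data is never revisited, and the 1-D EMD accumulates absolute differences of the two cumulative distributions.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Total number of 1-bits in a bitvector: the bin's element count.
    static double popcount_bits(const std::vector<uint64_t>& bv) {
        double c = 0;
        for (uint64_t w : bv) c += __builtin_popcountll(w);
        return c;
    }

    // n is the total number of elements per time step.
    double emd(const std::vector<std::vector<uint64_t>>& ti,
               const std::vector<std::vector<uint64_t>>& tj, double n) {
        double dist = 0, cum = 0;
        for (size_t b = 0; b < ti.size(); ++b) {
            // Running difference of the two cumulative distributions.
            cum += popcount_bits(ti[b]) / n - popcount_bits(tj[b]) / n;
            dist += std::fabs(cum);
        }
        return dist;
    }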
Correlation Mining Using Bitmaps
Correlation mining:
– Automatically suggests data subsets with high correlations
– Correlation analysis: keep submitting queries
– Traditional method:
  Exhaustive calculation over data subsets (spatial and value)
  Huge time and memory cost
Correlation mining using bitmaps:
– Mutual information
  Calculated from probability distributions (value subsets)
– A top-down method for value subsets
  Multi-level bitmap indexing
  Descend to a lower-level index only if the higher level shows high mutual information
– A bottom-up method for spatial subsets
  Divide bitvectors (with high correlations) into basic strides
  Perform 1-bit count operations over the strides
(A mutual-information sketch follows.)
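The value-subset mutual information can likewise be computed entirely from bitmaps. A minimal sketch, reusing popcount_bits from the EMD sketch above (illustrative; a real implementation would operate on compressed bitvectors): the joint probability of a bin pair is the 1-bit count of the two ANDed bitvectors, which is exactly the bitwise machinery the slide mentions.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // X and Y are the bitmap indexes (one bitvector per value bin) of two
    // variables over the same n elements.
    double mutual_info(const std::vector<std::vector<uint64_t>>& X,
                       const std::vector<std::vector<uint64_t>>& Y, double n) {
        double mi = 0;
        for (const auto& bx : X) {
            double px = popcount_bits(bx) / n;          // marginal p(x_i)
            for (const auto& by : Y) {
                double py = popcount_bits(by) / n;      // marginal p(y_j)
                std::vector<uint64_t> joint(bx.size());
                for (size_t w = 0; w < bx.size(); ++w)
                    joint[w] = bx[w] & by[w];           // joint occurrences
                double pxy = popcount_bits(joint) / n;  // joint p(x_i, y_j)
                if (pxy > 0) mi += pxy * std::log(pxy / (px * py));
            }
        }
        return mi;
    }

The top-down optimization would run this first over the coarse bins of a multi-level index and descend only where the mutual information is high, pruning most of the exhaustive work.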
Correlation Mining
Experiment Results
Goals:
– Efficiency and storage improvement using bitmaps
– Scalability in a parallel in-situ environment
– Efficiency improvement for correlation mining
– Efficiency and accuracy comparison with sampling
Simulations: Heat3D, Lulesh
Datasets: Parallel Ocean Program (POP)
Environment:
– CPU node: 32 Intel Xeon X5650 cores and 1 TB memory
– MIC: Intel Xeon Phi coprocessor with 60 cores and 8 GB memory
– OSC Oakley cluster: 32 nodes, each with 12 Intel Xeon X5650 cores and 48 GB memory
Efficiency Comparison for In-Situ Analysis – CPU
Setup: Heat3D on CPU; select 25 out of 100 time steps; 6.4 GB per time step (800*1000*1000); metric: conditional entropy
Full data (original):
– Simulation: poor scalability
– Time step selection: large cost
– Data writing: large cost and poor scalability
Bitmaps:
– Simulation utilizes extra computing power for bitmap generation
– Extra bitmap generation time, but good scalability
– Time step selection using bitmaps: 1.38x to 1.5x speedup
– Bitmap writing: 6.78x speedup
– Overall: 0.79x to 2.38x; the more cores, the better the speedup
Efficiency Comparison for In-Situ Analysis – MIC
Setup: Heat3D on MIC; select 25 out of 100 time steps; 1.6 GB per time step (200*1000*1000); metric: conditional entropy
MIC: more cores, lower bandwidth
Full data (original):
– Huge data writing time
Bitmaps:
– Good scalability for both bitmap generation and time step selection using bitmaps
– Much smaller data writing time
– Overall: 0.81x to 3.28x speedup
Memory Cost of In-Situ Analysis
Setup: Heat3D and Lulesh on CPU and MIC; keep 10 time steps in memory
Heat3D, no indexing: 12 time steps (previous, temporary, current buffers)
Heat3D, bitmap indexing: 2 time steps (previous, temporary), 1 set of previously selected indices, 10 current indices
Lulesh, no indexing: 11 time steps (previous, current), plus huge extra memory for edges
Lulesh, bitmap indexing: 1 time step (previous), 1 set of previously selected indices, 10 current indices, plus huge extra memory for edges
Overall: 2.0x to 3.59x smaller memory footprint; the advantage grows with larger simulated data and more time steps held
Scalability in a Parallel Environment
Setup: Heat3D; select 25 out of 100 time steps; TEMP variable, 6.4 GB per time step; 1 to 32 nodes with 8 cores each
Full data, local: each node writes its data subblock to its own disk
Bitmaps, local: each node writes its bitmap subblock to its own disk
– Fast time step selection and local writing: 1.24x to 1.29x speedup
Full data, remote: each node sends its data subblock to a master node
Bitmaps, remote: greatly alleviates the data transfer burden on the master node
– 1.24x to 3.79x speedup
Speedup for Correlation Mining
Setup: POP simulation; variables TEMP and SALT; 1.4 GB to 11.2 GB per variable; 1 core
Full data:
– Large data loading cost
– Exhaustive calculations over data subsets, each calculation time-consuming
Bitmaps:
– Smaller data loading cost
– Multi-level bitmaps improve the mining process
– Bitwise AND and 1-bit count operations improve calculation efficiency
– 3.81x to 4.92x speedup
In-Situ Sampling vs. Bitmaps
Setup: Heat3D, 100 time steps (6.4 GB each), 32 cores
Bitmap generation (binning, compression) costs more time than down-sampling
Sampling can effectively reduce the time step selection cost
Bitmap generation can still achieve better efficiency when the index size is smaller than the sample size
Bitmaps: with the same binning scale, no information loss
Sampling: information loss is unavoidable regardless of the sampling rate
(Chart: information loss at 30%, 15%, and 5% sampling rates.)
Outline
Background
Bitmap-based summarization and processing
– Key Ideas
– Algorithms
– Evaluation
Smart Middleware System
– Motivation
– Design
– Evaluation
Conclusions
The Big Picture
Algorithm/application level – in-situ algorithms:
– No disk I/O
– Indexing, compression, visualization, statistical analysis, etc.
Platform/system level – in-situ resource scheduling systems:
– Enhance resource utilization
– Simplify the management of analytics code
– GoldRush, Glean, DataSpaces, FlexIO, etc.
Are the two levels seamlessly connected?
Opportunity
Explore the programming model level in the in-situ environment:
– Between the application level and the system level
– Hides all the parallelization complexities behind a simplified API
– A prominent example: MapReduce (+ in situ)
Challenges
It is hard to adapt MapReduce to the in-situ environment:
– MR is not designed for in-situ analytics
Four mismatches:
– Data loading mismatch
– Programming view mismatch
– Memory constraint mismatch
– Programming language mismatch
Data Loading Mismatch
In situ requires taking input from memory
Ways to load data into MRs:
– From distributed file systems: Hadoop and many variants (on HDFS), Google MR (on GFS), and Disco (on DDFS)
– From shared/local file systems: MARIANE and CGL-MapReduce; MPI-based: MapReduce-MPI and MRO-MPI
– From memory: Phoenix (shared memory)
– From data streams: HOP, M3, and iMR
Data Loading Mismatch (Cont'd)
Few MR options:
– Most MRs load data from file systems
– Loading data from memory is mostly restricted to shared-memory environments
– Wrap the simulation output as a data stream?
  Periodic stream spiking
  Only a one-time scan is allowed
An exception – Spark:
– Can load data from file systems, memory, or data streams
Programming View Mismatch
Scientific simulation:
– Parallel programming view
– Explicit parallelism: partitioning, message passing, and synchronization
MapReduce:
– Sequential programming view
– Partitions are transparent
Need a hybrid programming view that:
– Exposes partitions during data loading
– Hides parallelism after data loading
Memory Constraint Mismatch
MR is often memory/disk intensive:
– The map phase creates intermediate data
– Sorting, shuffling, and grouping do not reduce intermediate data at all
– A local combiner cannot reduce the peak memory consumption (in the map phase)
Need an alternate MR API that (see the sketch below):
– Avoids key-value pair emission in the map phase
– Eliminates intermediate data in the shuffling phase
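To make the contrast concrete, here are two hypothetical signatures (not from any real MR implementation): the classic map phase materializes key-value pairs that must then be shuffled, while a reduction-object API folds each element directly into a pre-allocated object, so nothing is emitted and there is nothing to shuffle.

    #include <utility>
    #include <vector>

    // Classic MR map: emits key-value pairs, which are then sorted,
    // shuffled, and grouped before reduction (hypothetical signature).
    template <typename In, typename K, typename V>
    void classic_map(const In& elem, std::vector<std::pair<K, V>>& emitted);

    // Alternate API: each element updates a reduction object in place;
    // no intermediate pairs, so peak memory stays near the object's size.
    template <typename In, typename RObj>
    void accumulate(const In& elem, RObj& robj);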
Programming Language Mismatch
Simulation code is in Fortran or C/C++:
– Impractical to rewrite in other languages
Mainstream MRs are in Java/Scala:
– Hadoop is in Java
– Spark is in Scala/Java/Python
– Other MRs in C/C++ are not widely adopted
Bridging the Gap
Smart addresses all the mismatches:
– Loads data from (distributed) memory, even without an extra memcpy in time sharing mode
– Presents a hybrid programming view
– Achieves high memory efficiency with an alternate API
– Implemented in C++11, with OpenMP + MPI
System Overview
In-Situ System = Shared-Memory System + Combination
In-Situ System = Distributed System – Partitioning
Two In-Situ Modes
Time sharing mode: minimizes memory consumption
Space sharing mode: enhances resource utilization when the simulation reaches its scalability bottleneck
Launching Smart in Time Sharing Mode
Launching Smart in Space Sharing Mode
Ease of Use
Launching Smart:
– No extra libraries or configuration
– Minimal changes to the simulation code
– Analytics code remains the same in different modes
Application development (see the sketch below):
– Define a reduction object
– Derive a Smart scheduler class
  gen_key(s): generates key(s) for a data chunk
  accumulate: accumulates data into a reduction object
  merge: merges two reduction objects
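A hypothetical sketch of this pattern for a simple histogram application; the three method names follow the slide, but the base class and exact signatures are assumptions, not taken from the Smart source:

    #include <cstddef>

    // The reduction object: one per key (here, one per histogram bin).
    struct HistObject {
        size_t count = 0;
    };

    struct HistScheduler /* : public Scheduler<double, int, HistObject> */ {
        // Map a data chunk to its key (bin width of 10 is illustrative).
        int gen_key(const double* chunk, size_t len) const {
            return static_cast<int>(chunk[0] / 10.0);
        }
        // Fold a chunk into the reduction object in place: nothing is emitted.
        void accumulate(const double* chunk, size_t len, HistObject& obj) {
            obj.count += len;
        }
        // Combine partial results from different threads or nodes.
        void merge(HistObject& dst, const HistObject& src) {
            dst.count += src.count;
        }
    };

Because partitioning, data movement, and mode selection live in the middleware, these three methods run unchanged offline or in situ, in time sharing or space sharing mode.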
Optimization: Early Emission of the Reduction Object
Motivation:
– Mainly targets window-based analytics, e.g., moving average
– A large number of reduction objects to maintain -> high memory consumption
Key insight:
– Most reduction objects can be finalized early in the reduction phase
– Set a customizable trigger that outputs these reduction objects (locally) as early as possible (see the sketch below)
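A tiny sketch of why early emission helps for a moving average (illustrative only; this hand-codes the trigger rather than using any Smart API): once the window slides past an element, that window's partial result can never change, so it is emitted and freed instead of being held until the end of the run, keeping only w elements live at a time.

    #include <cstddef>
    #include <cstdio>
    #include <deque>

    void moving_average(const double* data, size_t n, size_t w) {
        std::deque<double> window;   // at most w live elements at any moment
        double sum = 0;
        for (size_t i = 0; i < n; ++i) {
            window.push_back(data[i]); sum += data[i];
            if (window.size() > w) { sum -= window.front(); window.pop_front(); }
            if (window.size() == w)  // trigger: this window is final; emit now
                std::printf("avg[%zu] = %f\n", i + 1 - w, sum / w);
        }
    }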
Smart vs. Spark
To make a fair comparison:
– Bypass the programming view mismatch: run on an 8-core node, multi-threaded but not distributed
– Bypass the memory constraint mismatch: use a simulation emulator that consumes little memory
– Bypass the programming language mismatch: rewrite the simulation in Java and compare only computation time
Setup: 40 GB input, 0.5 GB per time step
(Charts: speedups of 62x and 92x over Spark on the two benchmarks, K-Means and Histogram.)
Smart vs. Spark (Cont'd)
Faster execution:
– Spark 1) emits intermediate data, 2) makes immutable RDDs, and 3) serializes RDDs and sends them through the network even in local mode
– Smart 1) avoids intermediate data, 2) performs data reduction in place, and 3) takes advantage of the shared-memory environment (of each node)
Better (thread) scalability:
– Spark launches extra threads for other tasks, e.g., communication and the driver's UI
– Smart launches no extra threads
Higher memory efficiency:
– Spark: over 90% of 12 GB memory
– Smart: around 16 MB beyond the 0.5 GB time step
Smart vs. Low-Level Implementations
Setup:
– Smart: time sharing mode; low-level: OpenMP + MPI
– Apps: K-Means and logistic regression
– 1 TB input on 8–64 nodes
Programmability:
– 55% and 69% of the parallel code is either eliminated or converted into sequential code
Performance:
– Up to 9% extra overhead for K-Means
– Nearly unnoticeable overhead for logistic regression
(Charts: K-Means and logistic regression results.)
Node Scalability
Setup:
– 1 TB of data output by Heat3D; time sharing mode; 8 cores per node
– 4–32 nodes
Thread Scalability
Setup:
– 1 TB of data output by Lulesh; time sharing mode; 64 nodes
– 1–8 threads per node
Memory Efficiency of Time Sharing
Setup:
– Logistic regression on Heat3D using 4 nodes (left)
– Mutual information on Lulesh using 64 nodes (right)
Efficiency of Space Sharing Mode
Setup:
– 1 TB of data output by Lulesh
– 8 Xeon Phi nodes, 60 threads per node
– Apps: K-Means (left) and Moving Median (right)
(Charts: space sharing outperforms time sharing by 48% and 10%, respectively.)
Conclusions
In-situ analytics needs to be carefully architected:
– Memory constraints
– Programmability issues
– Many-cores are changing the game
Bitmaps can be generated sufficiently fast:
– An effective summarization structure
– Memory efficient
– No loss of accuracy in most cases
Smart middleware beats conventional wisdom:
– Commercial `Big Data' ideas can be applied
– Requires careful design of the middleware