
Data Management and Data Processing Support on Array-Based Scientific Data Yi Wang Department of Computer Science and Engineering The Ohio State University Advisor: Gagan Agrawal

Big Data Is Often Big Arrays. Array data is everywhere: molecular simulation (molecular data), life science (DNA sequencing and microarray data), earth science (ocean and climate data), and space science (astronomy data). Array data is especially prevalent in the scientific domain: alongside data structures such as graphs, sets, and trees, array storage serves as the heart of large-scale scientific data analytics.

Inherent Limitations of Current Tools and Paradigms. Most scientific data management and data processing tools are too heavyweight: they are hard to adapt to different data formats and physical structures (variety), and data transformation and data movement are often prohibitively expensive (volume). Prominent examples: RDBMSs and data mining tools are not suited for array data; array DBMSs require expensive data ingestion; MapReduce requires a specialized file system.

Example Array Data Format - HDF5 (Hierarchical Data Format). Each array element maintains two kinds of information: val-based attributes and dim-based attributes.

Thesis Statement Native Data Can Be Queried and/or Processed Efficiently Using Popular Abstractions Process data stored in the native format (e.g., NetCDF and HDF5), entirely based on summary structure (e.g., bitmap indices), or in situ Support SQL-like operators, e.g., selection and aggregation Support array operations, e.g., structural aggregations Support data mining, e.g., subgroup discovery Support MapReduce-like processing API

Thesis Work (overview). The work avoids data translation and data ingestion ("no protocol" and NoDB) and reduces data transfer through server-side aggregation and approximation. Data management: selections and aggregations over native arrays; aggregations and data mining entirely over bitmap indices. Data processing: offline processing over multiple array data formats (converter -> adapter) and in-situ processing during scientific simulation.

Thesis Work (Cont'd). Data Management Support: Supporting a Light-Weight Data Management Layer Over HDF5 [CCGrid'13]; SAGA: Array Storage as a DB with Support for Structural Aggregations [SSDBM'14]; A Novel Approach for Approximate Aggregations Over Arrays [SSDBM'15]; SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices [ICDM'15 submission]. Data Processing Support: SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats [CCGrid'12]; Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics [SC'15 submission].

Outline After Candidacy SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics Before Candidacy Lightweight Data Management Layer SQL-like operators over native arrays Structural aggregations over native arrays Approximate aggregations over bitmap indices MapReduce-Like Processing API over Multiple Scientific Data Formats

What is Subgroup Discovery (SD)? Goal: identify all the subgroups (i.e., subsets) that are significantly different from the entire dataset/general population with respect to a target variable, and extract rules of the form 'Subgroup -> Target', where the subgroup description can involve a number of explaining variables. Real-life example: basketball players are significantly taller than the general population. Subgroup description: occupation = 'basketball player'; target variable: height; how it is significantly different: much greater than the average.

Four Elements in SD. Subgroup description: a conjunction of attribute-value pairs. Quality function: measures the "interestingness", i.e., the extent to which the subgroup differs from the population; there is no consensus on its definition (interest, novelty, significance, specificity, Weighted Relative Accuracy (WRAcc), etc.). Target variable (and explaining variables): binary, categorical, or numeric? Search strategy: exhaustive or heuristic algorithm design?

SD Is Also Applicable to Scientific Discovery (over Array Data). Motivating example: exploring an ocean dataset with a 3D array layout. Dim-based attributes: latitude, longitude, and depth; for each array element, val-based attributes: salinity, temperature, pressure, etc. What are the underlying relationships between high salinity and the other attributes? High/low depth -> high salinity; high/low temperature -> high salinity. Motivation: low depth -> evaporation effect; high depth -> high temperature -> high solubility. Note that the results only indicate an association, not necessarily a causal relationship.

Challenges: Existing SD Techniques Are Not Suited for Array-Based Scientific Data. They are primarily restricted to relational data (how to handle dimension-based attributes?), mainly target binary or categorical attributes (what if all attributes are numeric?), and mostly work only for small inputs (e.g., less than 1 GB; what about large datasets?). Our work focuses on array data and boosts performance using bitmap indices.

Our SD over Array Data. Subgroup description: a conjunction of attribute-value pairs becomes a conjunction of attribute-range pairs. Quality function: considers the mean with respect to the target variable, using Continuous Weighted Relative Accuracy (CWRAcc); the quality function can also be other statistics. Target variable (and explaining variables): all numeric. Search strategy: exhaustive search with tight optimistic estimates, plus efficient pruning and combination (see the sketch below).
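The slide only names CWRAcc without spelling out the formula, so the sketch below assumes the usual continuous form of WRAcc (coverage times the shift of the subgroup mean from the global mean); the function and variable names are illustrative, not the thesis code.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hedged sketch of a CWRAcc-style quality function: coverage of the subgroup
// times the difference between the subgroup mean and the global mean of the
// target variable. Positive when the subgroup's target values are above average.
double quality(const std::vector<double>& subgroup_target,
               double global_mean, std::size_t population_size) {
  if (subgroup_target.empty() || population_size == 0) return 0.0;
  double subgroup_mean =
      std::accumulate(subgroup_target.begin(), subgroup_target.end(), 0.0) /
      subgroup_target.size();
  double coverage =
      static_cast<double>(subgroup_target.size()) / population_size;
  return coverage * (subgroup_mean - global_mean);
}
```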

Running Example: subgroups can be extracted 1) by a dim range, 2) by a val range, or 3) by a dim range combined with a val range.

Search Strategy. Discretize each attribute range into a number of bins and search a set-enumeration tree: all attributes are enumerated in a fixed order, dim-based and val-based attributes are treated equally, and every node represents a subgroup candidate. The search is exhaustive but efficient: level-wise (top-down), with pruning and combination (discussed later), so every node is either visited exactly once or not visited at all (if pruned). Top-down search is much more efficient than bottom-up search, similar to how a greedy algorithm is much more efficient than dynamic programming.

An Example of a Search Tree. Two explaining variables, A and B; A has 2 ranges (A1 and A2) and B has 3 ranges (B1, B2, and B3). The root Ø is the general population (zero attribute-range pairs). The 1st-level subgroups (one attribute-range pair): A∈A1, A∈A2, B∈B1, B∈B2, B∈B3. The 2nd-level subgroups (two attribute-range pairs): A∈A1∧B∈B1, A∈A1∧B∈B2, A∈A1∧B∈B3, A∈A2∧B∈B1, A∈A2∧B∈B2, A∈A2∧B∈B3. A sketch of this level-wise expansion follows.
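A minimal sketch of level-wise expansion over such a set-enumeration tree, assuming the attribute order on the slide; a node is only extended with attributes that come later in the fixed order, so each candidate is generated exactly once. Pruning and combination hooks are omitted, and all names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// A candidate subgroup is a list of (attribute, range) pairs.
struct Pair { std::size_t attr; std::size_t range; };
using Candidate = std::vector<Pair>;

// Generate all children of a node in the set-enumeration tree.
std::vector<Candidate> expand(const Candidate& node,
                              const std::vector<std::size_t>& ranges_per_attr) {
  std::vector<Candidate> children;
  std::size_t first_attr = node.empty() ? 0 : node.back().attr + 1;
  for (std::size_t a = first_attr; a < ranges_per_attr.size(); ++a) {
    for (std::size_t r = 0; r < ranges_per_attr[a]; ++r) {
      Candidate child = node;
      child.push_back({a, r});   // e.g., appends "B ∈ B2" to "A ∈ A1"
      children.push_back(child);
    }
  }
  return children;
}

// expand({}, {2, 3}) yields the 5 first-level nodes A1, A2, B1, B2, B3;
// expand({{0, 0}}, {2, 3}) yields A1∧B1, A1∧B2, A1∧B3, as on the slide.
```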

Is Top-Down Search Safe? The quality function is non-monotonic: the quality of a subgroup can be higher than that of its ancestor. Solution: tight optimistic estimates, which check an upper bound on the quality; "tight" means this upper bound can be reached by a real subset (though not necessarily one captured by a subgroup description).

Tight Optimistic Estimates. Main idea: the mean value of the entire dataset is a constant, so if the quality is positive/negative, its upper bound is achieved by selecting all the elements that are greater/less than the global mean (see the sketch below).
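A hedged sketch of that bound, assuming the coverage-times-mean-shift quality form from the earlier sketch: the best quality any subset of a subgroup can reach is obtained by keeping only the elements above (or below) the global mean. Names are illustrative only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tight optimistic estimate: gain_above / N equals (n_above / N) *
// (mean_above - global_mean), i.e. the quality of the "all elements above the
// global mean" subset; analogously for the negative direction. The larger of
// the two bounds every subset of this subgroup.
double optimistic_estimate(const std::vector<double>& subgroup_target,
                           double global_mean, std::size_t population_size) {
  double gain_above = 0.0, gain_below = 0.0;
  for (double v : subgroup_target) {
    if (v > global_mean) gain_above += v - global_mean;
    else                 gain_below += global_mean - v;
  }
  return std::max(gain_above, gain_below) / population_size;
}
```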

Pruning Measures Minimum Support Sufficient size Minimum Absolute Quality Sufficient quality Minimum Relative Quality Sufficient quality compared with its parent

Combination Rules. Adjacency: only sibling nodes can be combined. Difference homogeneity: the qualities of the two subgroups must both be positive or both be negative; this ensures the combined subgroup will not be pruned, since the combined quality is greater than that of at least one of the pre-combination subgroups. For subgroup combination, we only consider subgroups that have already passed the pruning test.

Combination Rules (Cont'd). Distribution purity: effectively controls the granularity of the output subgroups so that they do not grow too large, by setting an upper bound on the Continuous Weighted Entropy (CWE). The "weight" here is the normalized distance from the i-th interval to the mean of the general population; therefore, for two subgroups of equal size, a subgroup whose elements mostly fall in intervals remote from the population mean tends to have a greater CWE than one whose elements mostly fall in intervals close to the population mean. A similar design exists in association mining: classical quantitative association rule mining sets a MaxSupport and combines two adjacent child nodes as long as the merged support stays below that MaxSupport.

Bitmap Indexing. Our algorithm operates entirely on bitmap indices. Key insights: each attribute-range pair can be represented by a disjunction of bitvectors; each subgroup can be represented by a conjunction of bitvectors; and all statistics such as the mean, CWE, and quality can be calculated entirely from the bitmap. Operating on bitmap indices instead of the raw data saves I/O and accelerates computation. Any multi-dimensional array can be viewed as a 1D array and hence mapped to a bitvector. A sketch of this bitvector-based evaluation follows.
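A hedged sketch of the bitmap-based evaluation described above: an attribute-range pair is an OR of bin bitvectors, a subgroup is an AND of such bitvectors, and the subgroup mean is estimated from per-bin counts and pre-aggregated bin means rather than the raw data. Bin layout, types, and names are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bitvector = std::vector<std::uint64_t>;

Bitvector bv_or(const Bitvector& a, const Bitvector& b) {      // attribute-range pair
  Bitvector r(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] | b[i];
  return r;
}

Bitvector bv_and(const Bitvector& a, const Bitvector& b) {     // subgroup conjunction
  Bitvector r(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] & b[i];
  return r;
}

std::size_t bv_count(const Bitvector& a) {                     // population count
  std::size_t c = 0;
  for (std::uint64_t w : a)
    for (; w; w &= w - 1) ++c;   // Kernighan popcount
  return c;
}

// Estimate the mean of the target variable over a subgroup: intersect the
// subgroup bitvector with each target bin and weight the bin's representative
// value (from the pre-aggregation statistics) by the size of the overlap.
double subgroup_mean(const Bitvector& subgroup,
                     const std::vector<Bitvector>& target_bins,
                     const std::vector<double>& bin_means) {
  double sum = 0.0;
  std::size_t n = 0;
  for (std::size_t b = 0; b < target_bins.size(); ++b) {
    std::size_t overlap = bv_count(bv_and(subgroup, target_bins[b]));
    sum += overlap * bin_means[b];
    n += overlap;
  }
  return n ? sum / n : 0.0;
}
```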

Indexing Step. Indices for val-based attributes: conventional bitmap indexing. Indices for dim-based attributes: very small additional cost because of their contiguity (e.g., 1% extra space after compression). Sign bitvectors: used for the optimistic estimates; the positive/negative bitvector marks all the elements greater/less than the global mean.

Indexing Step (Cont’d)

Comparison with SD-Map* (the best 8 subgroups discovered by each): SD-Map* produces redundant subgroups, whereas SciSD finds subgroups with higher quality and better generality and can also discover subgroups with negative quality.

Comparison Results. Functionality: avoids casting array data into relational data by adding additional columns (for dim-based attributes); able to process large datasets. Effectiveness: higher quality and better generality; can discover negative-quality subgroups; no redundant subgroups. Performance: outperforms SD-Map* by up to 121x. Notes: SD-Map* requires casting scientific data into CSV format and can only work on very small datasets.

Conclusion. Functionality: a subgroup can be described by dimensional and/or value ranges, and numeric attributes can be handled. High performance: efficient pruning and combination, operating on compact bitmap indices instead of the raw data.

Outline After Candidacy SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics Before Candidacy Lightweight Data Management Layer SQL-like operators over native arrays Structural aggregations over native arrays Approximate aggregations over bitmap indices MapReduce-Like Processing API over Multiple Scientific Data Formats

In-Situ Scientific Analytics. What is "in situ"? Co-locating the simulation and analytics programs: moving computation instead of data, rather than staging everything through persistent storage. Constraints of "in situ": minimize the impact on the simulation, under both a memory constraint and a time constraint. This work mainly focuses on addressing the memory constraint, while the time constraint can be addressed in a number of ways: lossy processing, in-transit processing (moving computation to staging nodes), and time step selection.

The Big Picture: Are the Two Levels Seamlessly Connected? At the algorithm/application level, in-situ algorithms avoid disk I/O: indexing, compression, visualization, statistical analysis, etc. At the platform/system level, in-situ resource scheduling systems enhance resource utilization and simplify the management of analytics code: GoldRush, Glean, DataSpaces, FlexIO, etc. These two layers have been extensively studied; is there anything to do between the two levels?

Rethinking These Two Levels. In-situ algorithms are implemented with low-level APIs like OpenMP/MPI and must manually handle all the parallelization details. In-situ resource scheduling systems play the role of coordinator and focus on scheduling issues like cycle stealing and asynchronous I/O, but provide no high-level parallel programming API. Motivation: can applications be mapped more easily to the platforms for in-situ analytics, and can the offline and in-situ analytics code be (almost) identical?

Opportunity: explore the programming model level in the in-situ environment, between the application level and the system level, hiding all the parallelization complexity behind a simplified API. A prominent example: MapReduce + in situ.

Challenges: it is hard to adapt MR to the in-situ environment, because MR is not designed for in-situ analytics. There are 4 mismatches: the data loading mismatch, the programming view mismatch, the memory constraint mismatch, and the programming language mismatch.

Data Loading Mismatch. In situ requires taking input from memory, but MapReduce implementations load data in other ways: from distributed file systems (Hadoop and its many variants on HDFS, Google MR on GFS, and Disco on DDFS); from shared/local file systems (MARIANE and CGL-MapReduce; MPI-based: MapReduce-MPI and MRO-MPI); from memory (Phoenix, shared-memory only); or from data streams (HOP, M3, and iMR).

Data Loading Mismatch (Cont'd). Few MR options fit: most MRs load data from file systems, and loading data from memory is mostly restricted to shared-memory environments. Wrapping the simulation output as a data stream would require periodic stream spiking, and only a one-time scan is allowed. An exception is Spark, which can load data from file systems, memory, or a data stream: Spark works as long as the input can be converted into an RDD.

Programming View Mismatch. Scientific simulations use a parallel programming view with explicit parallelism: partitioning, message passing, and synchronization. MapReduce uses a sequential programming view in which partitions are transparent. We need a hybrid programming view that exposes partitions during data loading and hides parallelism after data loading. Traditional MapReduce implementations cannot explicitly take partitioned simulation output as input, nor launch analytics from within an SPMD region.

Memory Constraint Mismatch. MR is often memory/disk intensive: the map phase creates intermediate data; sorting, shuffling, and grouping do not reduce the intermediate data at all; and a local combiner cannot reduce the peak memory consumption (in the map phase). We need an alternate MR API that avoids key-value pair emission in the map phase and eliminates intermediate data in the shuffling phase.

Programming Language Mismatch. Simulation code is written in Fortran or C/C++ and is impractical to rewrite in other languages. Mainstream MRs are in Java/Scala (Hadoop in Java; Spark in Scala/Java/Python), and the MRs written in C/C++ are not widely adopted.

Bridging the Gap: Smart addresses all the mismatches. It loads data from (distributed) memory, even without an extra memcpy in time sharing mode; presents a hybrid programming view; achieves high memory efficiency with an alternate API; and is implemented in C++11 with OpenMP + MPI.

System Overview. Distributed System = Partitioning + Shared-Memory System + Combination. In-Situ System = Shared-Memory System + Combination = Distributed System - Partitioning. This is because the input data is already partitioned by the simulation code.

Two In-Situ Modes Space Sharing Mode: Enhances resource utilization when simulation reaches its scalability bottleneck Time Sharing Mode: Minimizes memory consumption

Ease of Use. Launching Smart: no extra libraries or configuration, minimal changes to the simulation code, and the analytics code remains the same in different modes. Application development: define a reduction object and derive a Smart scheduler class with three methods - gen_key(s), which generates the key(s) for a data chunk; accumulate, which accumulates data on a reduction object; and merge, which merges two reduction objects. Both the reduction map and the combination map are transparent to the user, and all the parallelization complexity is hidden behind a sequential view. A sketch of such a derived scheduler follows.
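An illustrative sketch only: the slide lists the reduction object and the three user-supplied methods (gen_key(s), accumulate, merge), but the class names, base class, and exact signatures below are assumptions, not the real Smart API.

```cpp
#include <cstddef>

// User-defined reduction object: here, a running sum and count for an average.
struct AvgObject {
  double sum = 0.0;
  std::size_t count = 0;
};

template <typename ChunkT>
class AvgScheduler /* : public SmartScheduler<ChunkT, AvgObject> (assumed base) */ {
 public:
  // gen_key(s): map a data chunk to the key(s) it contributes to.
  // Here every chunk reduces to a single global key.
  int gen_key(const ChunkT& /*chunk*/) const { return 0; }

  // accumulate: fold one data chunk into the reduction object for its key.
  void accumulate(const ChunkT& chunk, AvgObject& obj) const {
    for (double v : chunk) { obj.sum += v; ++obj.count; }
  }

  // merge: combine two reduction objects (e.g., across threads or nodes).
  void merge(AvgObject& dst, const AvgObject& src) const {
    dst.sum += src.sum;
    dst.count += src.count;
  }
};
```

With this style of API, no intermediate key-value pairs are emitted: each chunk is folded directly into a reduction object, and objects are merged across threads and nodes, which matches the memory-efficiency goal described earlier.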

Optimization: Early Emission of Reduction Objects. Motivation: window-based analytics (e.g., moving averages) maintain a large number of reduction objects, leading to high memory consumption. Key insight: most reduction objects can be finalized during the reduction phase, so a customizable trigger outputs these reduction objects (locally) as early as possible.

Smart vs. Spark. To make a fair comparison, we bypass the programming view mismatch (run on an 8-core node: multi-threaded but not distributed), the memory constraint mismatch (use a simulation emulator that consumes little memory), and the programming language mismatch (rewrite the simulation in Java and compare only computation time), with 40 GB of input and 0.5 GB per time step. The K-Means and histogram benchmarks show speedups of 62x and 92x over Spark.

Smart vs. Spark (Cont'd). Faster execution: Spark 1) emits massive amounts of intermediate data, 2) makes a new immutable RDD for every transformation, and 3) serializes RDDs and sends them over the network even in local mode; Smart 1) avoids intermediate data by performing data reduction in place, 2) reuses the reduction/combination maps even for iterative processing, and 3) takes advantage of the shared-memory environment of each node and avoids replicating reduction objects. Better (thread) scalability: Spark launches extra threads for other tasks, e.g., communication and the driver's UI, whereas Smart launches no extra threads. Higher memory efficiency: Spark uses over 90% of 12 GB of memory, whereas Smart uses around 16 MB beyond the 0.5 GB time step.

Smart vs. Low-Level Implementations. Setup: Smart in time sharing mode vs. low-level OpenMP + MPI; applications: K-Means and logistic regression; 1 TB of input on 8-64 nodes. Programmability: 55% and 69% of the parallel code is either eliminated or converted into sequential code. Performance: up to 9% extra overhead for K-Means, because the low-level implementation can store reduction objects in contiguous arrays whereas Smart stores data in map structures and requires extra serialization for each synchronization; nearly unnoticeable overhead for logistic regression, where only a single key-value pair is created.

Outline After Candidacy SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics Before Candidacy Lightweight Data Management Layer SQL-like operators over native arrays Structural aggregations over native arrays Approximate aggregations over bitmap indices MapReduce-Like Processing API over Multiple Scientific Data Formats

SQL-Like Operators. Selection based on dimension index values: also supported by the scientific data format APIs. Selection based on dimension scales: uses the coordinate system instead of the physical layout (array subscripts). Selection based on data values: simple and compound datatypes. Aggregation: SUM, COUNT, AVG, MIN, and MAX, with server-side aggregation to minimize data transfer.

Structural Aggregation Types. Non-overlapping aggregation: grid aggregation, i.e., a multi-dimensional histogram. Overlapping aggregation: sliding aggregation, which applies a kernel function to a sliding window (moving average, denoising, time series, etc.); hierarchical aggregation, which observes the gradual influence of radiation from a source (pollution source/explosion location); and circular aggregation, which uses concentric but disjoint circles instead of regularly shaped grids.

Approximate Aggregations Over Array Data. Challenges: flexible aggregation over any subset (dimension-based, value-based, or combined predicates); aggregation accuracy (capturing both the spatial distribution and the value distribution); and aggregation without data reorganization (reorganization is prohibitively expensive). Existing data synopsis techniques are all problematic for array data: sampling cannot capture both distributions, and KD-tree-based stratified sampling requires reorganization; histograms capture no spatial distribution (1D histograms lose it entirely, and for multi-dimensional histograms either the space cost or the partitioning granularity increases exponentially, leading to substantial estimation overhead or high inaccuracy); wavelets capture no value distribution (if the value-based attribute is added as an extra dimension to the data cube, sorting or reorganization is required). We choose bitmap indices.

Outline After Candidacy SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics Before Candidacy Lightweight Data Management Layer SQL-like operators over native arrays Structural aggregations over native arrays Approximate aggregations over bitmap indices MapReduce-Like Processing API over Multiple Scientific Data Formats

Scientific Data Processing Module. The data adaptation layer is customizable: a third-party adapter can be inserted, so the layer is open for extension but closed for modification.

Conclusions. Scientific data management and data processing tools are often too heavyweight: RDBMSs, array DBMSs, MR, etc. Lightweight data management support: SQL-like operators and structural aggregations over native arrays; aggregations, subgroup discovery, contrast set mining, and correlation mining based on bitmap indices. Lightweight data processing support: MapReduce for both offline processing over multiple data formats and in-situ analytics.

Future Work Data Management Support Bitmap-based data mining: exception rule mining Bitmap-based middleware: abstracts bitmap and bitmap-based operations on Spark for data analysis Data Processing Support BioSmart: applies Smart to our earlier work [HiCOMB’14] on sequencing data analysis Smart for approximate algorithms: user-defined estimation and early termination for online analytics [HiPC’15 submission]

Backup Slides

SciSD

SD vs. Classification ("Subgroup Discovery in Defect Prediction", Rachel Harrison et al.). Classification methods, such as decision trees or decision rules, are unlikely to find all meaningful contrasts: a classifier finds a single model that maximizes the separation between groups, not all interesting models as contrast discovery seeks. The output of classification is typically an entire decision tree/classification system, which can repeat the same subsetting predicates at different levels. A classifier separates the classes from each other, rather than separating subsets from the general population.

Comparison with Varying Max. CWE

Number of Subgroups on POP Dataset

Execution Times on POP Dataset

Average Quality (CWRAcc) on POP Dataset

Effectiveness of Search Strategy

Smart

Launching Smart in Time Sharing Mode

Launching Smart in Space Sharing Mode

Data Processing Mechanism

Node Scalability Setup 1 TB data output by Heat3D; time sharing; 8 cores per node 4-32 nodes

Thread Scalability Setup 1 TB data output by Lulesh; time sharing; 64 nodes 1-8 threads per node

Memory Efficiency of Time Sharing Setup Logistic regression on Heat3D using 4 nodes (left) Mutual information on Lulesh using 64 nodes (right)

Efficiency of Space Sharing Mode. Setup: 1 TB of data output by Lulesh; 8 Xeon Phi nodes with 60 threads per node; applications: K-Means (left) and Moving Median (right), for which space sharing outperforms time sharing by 10% and 48%, respectively. For histogram, the time sharing mode performs better, because its computation is very lightweight and the overall cost is dominated by synchronization.

Supporting a Light-Weight Data Management Layer Over HDF5

Overall Idea: An SQL Implementation Over HDF5. Ease of use: a declarative language instead of a low-level programming language plus the HDF5 API; abstraction: provides a virtual relational view. High efficiency: load data on demand (lazy loading), parallel query processing, and server-side aggregation.

Execution Overview: 1D AND-logic condition list; 2D OR-logic condition list; 1D OR-logic condition list; same content-based condition. Further optimizations use the metadata information.

Hyperslab Selector: if true, nullify the condition list; if false, nullify the elementary condition. Example: a 4-dimensional salinity dataset with dim1 time [0, 1023], dim2 cols [0, 166], dim3 rows [0, 62], and dim4 layers [0, 33]; fill in all the index boundary values.

Type2 and Type3 Query Examples

Aggregation Query Examples AG1: Simple global aggregation AG2: GROUP BY clause + HAVING clause AG3: GROUP BY clause

Experimental Setup Experimental Datasets 4 GB (sequential experiments) and 16 GB (parallel experiments) 4D: time, cols, rows, and layers Compared with Baseline Performance and OPeNDAP Baseline performance: no query parsing OPeNDAP: translates HDF5 into a specialized data format

Sequential Comparison (Type2 and Type3 Queries). With OPeNDAP, the user has to download the entire dataset from the server first and then write their own filter (we implemented a client-side filter for OPeNDAP); its performance scales poorly due to the additional data translation overhead. Comparing the baseline performance with our sequential performance shows that the total sequential processing time for a type 1 query is indistinguishable from the baseline, i.e., the intrinsic HDF5 query function.

Parallel Performance (Type2 and Type3 Queries) Scaled the system up to 16 nodes, and the selectivity was varied from <20% to >80%. Good scalability.

Sequential and Parallel Performance of Aggregation Queries

SAGA: Array Storage as a DB with Support for Structural Aggregations

Mismatch Between Scientific Data and DBMSs. Scientific (array) datasets: very large but processed infrequently; read/append only; no resources for reloading data; popular formats are NetCDF and HDF5. Database technologies: designed for read-write data with ACID guarantees; assume data reloading/reformatting is feasible.

The Upfront Cost of Using SciDB. The high-level data flow requires data ingestion: raw files (e.g., HDF5) are first converted to CSV, and the CSV files are then loaded into SciDB. The data ingestion experience is very painful, and the ingestion cost is about 100x that of a simple query ("EarthDB: scalable analysis of MODIS data using SciDB", G. Planthaber et al.).

Array Storage as a DB: a paradigm similar to NoDB that still provides DB functionality but requires no data ingestion. DB and array-storage-as-a-DB: friends or foes? Use a DB when you load once and query frequently; use array storage directly when you query infrequently and want to avoid loading. Our system focuses on a set of special array operations: structural aggregations.

Structural Aggregations: aggregate elements based on their positional relationships. E.g., a moving average calculates the average of each 2 × 2 square from left to right, aggregating the elements in the same square at a time: for the input array [[1, 2, 3, 4], [5, 6, 7, 8]], the aggregation result is [3.5, 4.5, 5.5] (see the sketch below).
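A minimal sketch of the slide's moving-average example, assuming a two-row input as in the example; it slides a 2 × 2 window from left to right and averages each window, yielding {3.5, 4.5, 5.5} for the array above.

```cpp
#include <cstddef>
#include <vector>

// Sliding 2 x 2 moving average over a 2-row array (illustrative only).
std::vector<double> sliding_2x2_average(const std::vector<std::vector<double>>& a) {
  std::vector<double> result;
  for (std::size_t c = 0; c + 1 < a[0].size(); ++c) {
    double sum = a[0][c] + a[0][c + 1] + a[1][c] + a[1][c + 1];
    result.push_back(sum / 4.0);   // average of one 2 x 2 square
  }
  return result;
}
```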

Grid Aggregation. Parallelization: easy after partitioning. Considerations: data contiguity (which affects I/O performance), communication cost, and load balancing for skewed data. Partitioning strategies: coarse-grained, fine-grained, hybrid, and auto-grained.

Coarse-Grained Partitioning Pros Low I/O cost Low communication cost Cons Workload imbalance for skewed data

Fine-Grained Partitioning Pros Excellent workload balance for skewed data Cons Relatively high I/O cost High communication cost

Hybrid Partitioning. Pros: low communication cost; good workload balance for skewed data. Cons: high I/O cost.

Auto-Grained Partitioning, in 2 steps. 1) Estimate the grid density (after filtering) by sampling, and thus estimate the computation cost (based on the time complexity); for each grid, total processing cost = constant loading cost + varying computation cost. 2) Partition the resulting cost array using balanced contiguous multi-way partitioning: dynamic programming (small number of grids) or greedy (large number of grids); a greedy sketch follows.
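A hedged greedy sketch of balanced contiguous multi-way partitioning: split a 1-D array of per-grid costs into p contiguous ranges of roughly equal total cost. The thesis also describes a dynamic-programming variant; this greedy cut and its names are only an illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the (exclusive) end index of each of up to p contiguous partitions.
std::vector<std::size_t> partition_boundaries(const std::vector<double>& cost,
                                              std::size_t p) {
  std::vector<std::size_t> cuts;
  double total = std::accumulate(cost.begin(), cost.end(), 0.0);
  double target = total / p, running = 0.0;
  for (std::size_t i = 0; i < cost.size(); ++i) {
    running += cost[i];
    // Close a partition once its accumulated cost reaches the target,
    // while leaving enough grids for the remaining partitions.
    if (running >= target && cuts.size() + 1 < p &&
        cost.size() - (i + 1) >= p - (cuts.size() + 1)) {
      cuts.push_back(i + 1);
      running = 0.0;
    }
  }
  cuts.push_back(cost.size());   // last partition takes the rest
  return cuts;
}
```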

Auto-Grained Partitioning (Cont'd). Pros: low I/O cost, low communication cost, and great workload balance for skewed data. Cons: the overhead of sampling and runtime partitioning.

Partitioning Strategy Summary: coarse-grained has excellent I/O performance but poor workload balance and no additional cost; fine-grained has relatively high I/O cost but excellent workload balance; hybrid has high I/O cost and good workload balance; auto-grained has low I/O cost and great workload balance, at a nontrivial additional cost (sampling and runtime partitioning). Our partitioning strategy decider can help choose the best strategy.

Partitioning Strategy Decider. Cost model: analyze the loading cost and the computation cost separately; loading cost = loading factor × data amount. Exception: auto-grained takes the loading cost and the computation cost as a whole. Communication cost is trivial, except for fine-grained partitioning with small grid sizes.

Overlapping Aggregation. I/O cost: reuse the data already in memory and reduce disk I/O to enhance I/O performance. Memory accesses: reuse the data already in the cache and reduce cache misses to accelerate computation. Aggregation approaches: the naive approach, the data-reuse approach, and the all-reuse approach.

Example: Hierarchical Aggregation. Aggregate 3 grids in a 6 × 6 array: the innermost 2 × 2 grid, the middle 4 × 4 grid, and the outermost 6 × 6 grid. (Parallel) sliding aggregation is much more complicated.

Naïve Approach. For N grids: N loads + N aggregations. Load the innermost grid and aggregate it; load the middle grid and aggregate it; load the outermost grid and aggregate it.

Data-Reuse Approach. For N grids: 1 load + N aggregations. Load the outermost grid once, then aggregate the outermost, the middle, and the innermost grids. Aggregation execution is not so straightforward.

All-Reuse Approach. For N grids: 1 load + 1 aggregation, with purely sequential I/O. Load the outermost grid once; as each element is accessed, accumulatively update every aggregation result it contributes to: elements in the outer ring update only the outermost result, elements in the middle ring update both the outermost and the middle results, and elements in the center update all 3 results. A sketch follows.
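A hedged sketch of the all-reuse approach for the slide's hierarchical example (concentric 2 × 2, 4 × 4, and 6 × 6 grids in a 6 × 6 array): one sequential pass over the outermost grid, in which each element accumulatively updates every aggregate (here, a running sum) it belongs to. The centered layout and names are assumptions.

```cpp
#include <cstddef>
#include <vector>

// One pass over an n x n array; grid g of side grid_sizes[g] is assumed to be
// centered, i.e., it starts at offset (n - grid_sizes[g]) / 2 in each dimension.
std::vector<double> hierarchical_sums(const std::vector<std::vector<double>>& a,
                                      const std::vector<std::size_t>& grid_sizes) {
  std::size_t n = a.size();                       // e.g., 6
  std::vector<double> sums(grid_sizes.size(), 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      for (std::size_t g = 0; g < grid_sizes.size(); ++g) {
        std::size_t off = (n - grid_sizes[g]) / 2;
        if (i >= off && i < off + grid_sizes[g] &&
            j >= off && j < off + grid_sizes[g])
          sums[g] += a[i][j];                     // element updates this aggregate
      }
    }
  }
  return sums;                                    // one load, one aggregation pass
}
```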

All-Reuse Approach (Cont'd). Key insight: the number of aggregates is no greater than the number of queried elements, so it is more computationally efficient to iterate over the elements and update the associated aggregates. More benefits: load balance (for hierarchical/circular aggregations) and greater speedups for compound array elements, since the data type of an aggregate is usually primitive, which is not always true for array elements. This is similar to a simple join over two tables, where it is more efficient to cache the smaller table and scan the larger table as few times as possible.

Parallel Performance vs. SciDB. No preprocessing cost is included for SciDB. The array slab/data size (8 GB) ratio varies from 12.5% to 100%; coarse-grained partitioning is used for the grid aggregation and the all-reuse approach for the sliding aggregation. SciDB stores "chunked" arrays and can even support overlapping chunking to accelerate the sliding aggregation; it randomly accesses different chunks to better handle data/computation skew and for fault tolerance, and duplicate chunks also involve redundant computation. Run on an 8-core machine (EC2); we were unable to install distributed SciDB on a cluster.

Parallel Grid Aggregation Performance. Used 4 processors on a real-life dataset of 8 GB with a user-defined aggregation (K-Means); the number of iterations is varied to vary the computation amount.

Parallel Sliding Aggregation Performance # of nodes: from 1 to 16 8 GB data Sliding grid size: from 3 × 3 to 7 × 7

A Novel Approach for Approximate Aggregations Over Arrays

Bitmap Indexing and Pre-Aggregation Bitmap Indices Pre-Aggregation Statistics Any multi-dimensional array can be mapped to a 1D array

Key Insight. Bitvectors preserve the spatial information of a bin: any subarea can be represented as a bitvector, and fast bitwise operations (AND, OR, COUNT, etc.) are supported on the compressed format. Pre-aggregation statistics preserve the value information of each bin: they are equivalent to a histogram and cheap to obtain, since no extra data scan is needed (they are generated during bitmap indexing).

Approximate Aggregation Workflow

Running Example (using the bitmap indices and pre-aggregation statistics): SELECT SUM(Array) WHERE Value > 3 AND ID < 4. Predicate bitvector: 11110000; i1': 01000000; i2': 10010000; Count1 = 1; Count2 = 2. Estimated sum: 7 × 1/2 + 16 × 2/3 = 14.167; precise sum: 14. A sketch of this computation follows.
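A sketch of the approximate SUM in this running example: the predicate bitvector is ANDed with each bin's bitvector, and each bin contributes its pre-aggregated total scaled by the fraction of its elements that fall inside the predicate. The bin totals (7, 16) and bin sizes (2, 3) are inferred from the slide's arithmetic 7 × 1/2 + 16 × 2/3 = 14.167; everything else is illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Bin {
  std::uint8_t bitvector;   // one bit per array element (8-element toy array)
  double total_sum;         // pre-aggregation statistic: sum of the bin's values
  std::size_t total_count;  // pre-aggregation statistic: number of elements in bin
};

double approximate_sum(std::uint8_t predicate, const std::vector<Bin>& bins) {
  double est = 0.0;
  for (const Bin& b : bins) {
    std::uint8_t overlap = static_cast<std::uint8_t>(predicate & b.bitvector);
    std::size_t c = 0;
    for (std::uint8_t w = overlap; w; w &= w - 1) ++c;   // popcount of the overlap
    est += b.total_sum * static_cast<double>(c) / b.total_count;
  }
  return est;
}

// Usage mirroring the example: predicate 11110000 (0xF0), bins i1' = 01000000
// (0x40, sum 7, count 2) and i2' = 10010000 (0x90, sum 16, count 3) give
// approximate_sum(0xF0, {{0x40, 7, 2}, {0x90, 16, 3}}) = 14.167 (precise: 14).
```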

Accuracy Concern. Conventional bitmap indices are mainly designed for accelerating selection, not aggregation: 100% accuracy relies on extra validation against the raw data, but here the raw data is unavailable. Hence the need for accurate approximation; similar problems can be found in the histogram literature.

A Novel Binning Strategy. Conventional binning strategies (equi-width/equi-depth binning) do not necessarily yield a good approximation. Our v-optimized binning strategy, inspired by the v-optimal histogram, approximately minimizes the Sum Squared Error (SSE): unbiased v-optimized binning assumes the data is randomly queried, while weighted v-optimized binning assumes certain subareas are frequently queried.
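The slides do not spell out the objective; a standard v-optimal formulation, which we assume the binning approximately minimizes, is

SSE = \sum_{b=1}^{B} \sum_{v \in \mathrm{bin}_b} \left( v - \bar{v}_b \right)^2, \qquad WSSE = \sum_{b=1}^{B} \sum_{v \in \mathrm{bin}_b} w(v)\, \left( v - \bar{v}_b \right)^2,

where \bar{v}_b is the representative (mean) value of bin b and, in the weighted variant, w(v) is the querying probability associated with the element's location.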

Unbiased V-Optimized Binning, in 3 steps: initial binning (equi-depth binning), iterative refinement (adjusting bin boundaries), and bitvector generation (marking spatial positions). Adding a bin boundary improves the approximation quality (decreases SSE); removing one degrades it (increases SSE).

Weighted V-Optimized Binning. Difference: minimizes WSSE instead of SSE, using a similar binning algorithm; the major modification is that the representative value of each bin is weighted by the querying probabilities.

Experimental Setup. Data skew: the dense range covers less than 5% of the space but over 90% of the data; the sparse range covers the remaining (over 95% of the) space but less than 10% of the data. 5 types of queries: DB (dimension-based predicates), VBD (value-based predicates over the dense range), VBS (value-based predicates over the sparse range), CD (combined predicates over the dense range), and CS (combined predicates over the sparse range). Ratio of querying probabilities: 10 : 1; 50% of the synthetic data and 25% of the real-world data are frequently queried.

SUM Aggregation Accuracy of Different Methods (real-world dataset). Methods compared: Sampling_2%, Sampling_20%, equi-depth multi-dimensional histograms Hist1 (200 × 18k × 200) and Hist2 (200 × 72k × 800), unbiased v-optimized binning, and weighted v-optimized binning. Bitmap vs. sampling (2% and 20% sampling rates): 1) a significantly higher sampling rate does not necessarily lead to significantly higher accuracy; 2) accuracy is highest when only value-based predicates exist, and the remaining small error is caused by the edge bin(s) that overlap with the queried value range (conservative aggregation is slightly better than aggressive aggregation in this case). Bitmap vs. (equi-depth) multi-dimensional histograms (400 bins/buckets partitioning the value domain, with an equi-depth property similar to equi-depth binning): the histograms are even less accurate than equi-depth bitmaps, because they are inaccurate when processing dimension-based predicates due to the uniform-distribution assumption for every dimension.

Indexing Creation Times

Space Requirements of Indexing

SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats

Scientific Data Analysis Today: "store first, analyze after". Data is reloaded into another file system (e.g., from PVFS to HDFS) or into another data format (e.g., NetCDF/HDF5 data into a specialized format). Problems: long data migration/transformation times, and stress on the network and disks.

System Overview. Key feature: the scientific data processing module.

Parallel Data Processing Times on 16 GB Datasets (KNN and K-Means).