1/15/2019 Big Data Management Framework based on Virtualization and Bitmap Data Summarization Yu Su Department of Computer Science and Engineering The.

Presentation transcript:

Big Data Management Framework based on Virtualization and Bitmap Data Summarization Yu Su Department of Computer Science and Engineering The Ohio State University Advisor: Dr. Gagan Agrawal 1/15/2019

Motivation – Scientific Data Analysis Science has become extremely data driven Data

Motivation – “Big Data” Strong requirement for efficient analysis Huge performance gap CPU: extremely fast speed to generate data Memory: not big enough to hold data Disk, Network: very slow to store or transfer data

Motivation - Challenges Lack of Data Virtualization Different data formats Different data access libraries Data Analysis Efficiency Individual data analysis (e.g., selection, aggregation) Correlation data analysis (e.g., selection, mining) No flexible data analysis for data transfer protocols Big Original Data Size Time-consuming and resource costly Hard to find a smaller subset to represent whole data Sampling efficiency and accuracy Parallelism Data dependency Multi-Server, Multi-Core

Thesis Work Efficiency Individual Analysis Data Virtualization Memory Cost Correlation Analysis Data Transfer In-Situ Analysis Bitmap-based Data Summary Parallelism Data Sampling

Thesis Work Supporting User-defined Subsetting and Aggregation over Parallel NetCDF Datasets (CCGrid 2012); Indexing and Parallel Query Processing Support for Visualizing Climate Datasets (ICPP 2012); SDQueryDSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol (SC 2013); Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices (HPDC 2013); Supporting Correlation Analysis on Scientific Datasets in Parallel and Distributed Settings (HPDC 2014); In-Situ Bitmap Index Generation and Efficient Data Analysis based on Index (HPDC 2015 submission) Individual Analysis Data Virtualization Correlation Analysis In-Situ Analysis Bitmap-based Data Summary The Landscape of Parallel Computing Research: A View from Berkeley Data Sampling

Outline Work After Candidacy Support Flexible Correlation Analysis Using Bitmap Index on Scientific Datasets In-Situ Bitmap Index Generation and Efficient Data Analysis based on Bitmap Work Before Candidacy A Data Virtualization Method for Scientific Datasets Support data selection and aggregation over scientific data Support bitmap indexing for more efficient query Integrate flexible data management with wide-area data transfer protocol Support Data Sampling Using Bitmap Index Conclusion

Bitmap Indexing Widely used in scientific data management Suitable for floating-point values by binning small value ranges Run Length Compression (WAH, BBC) Bitmap indices can be treated as a data summarization with much smaller size
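The binning and run-length ideas on this slide can be illustrated with a minimal sketch. It assumes bitvectors are stored as plain Python integers and uses a naive (bit, run length) encoding as a simplified stand-in for WAH or BBC; the function names and the sample values are hypothetical and are not taken from the thesis.

```python
from bisect import bisect_right

def build_bitmap_index(values, bin_edges):
    """Return one bitvector (a Python int) per bin; bit i is set when values[i] is in that bin."""
    n_bins = len(bin_edges) - 1
    bitvectors = [0] * n_bins
    for i, v in enumerate(values):
        # Locate the bin whose half-open range [edge_k, edge_k+1) contains v,
        # clamping out-of-range values into the first or last bin.
        b = min(max(bisect_right(bin_edges, v) - 1, 0), n_bins - 1)
        bitvectors[b] |= 1 << i
    return bitvectors

def run_length_encode(bitvector, length):
    """Encode bits 0..length-1 as (bit, run_length) pairs (simplified, not real WAH/BBC)."""
    runs = []
    for i in range(length):
        bit = (bitvector >> i) & 1
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs

# Hypothetical floating-point values binned into three value ranges.
temps = [0.2, 0.4, 1.7, 1.9, 2.5, 0.1]
index = build_bitmap_index(temps, [0.0, 1.0, 2.0, 3.0])
runs_bin0 = run_length_encode(index[0], len(temps))
```

Each element sets exactly one bit across all bins, so the bitvectors together summarize the value distribution without storing the values themselves.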

Motivation Scientific Analysis Type: Individual Variable Analysis Data Subsetting, Aggregation, Mining, Visualization Correlation Analysis Study relationship among multiple variables Make interesting scientific discoveries “Big Data” becomes more severe Huge data loading cost (multiple variables) Additional filtering cost for subset-based correlation analysis Huge correlation calculation cost Useful but time consuming and resource costly No support of correlation analysis over flexible data subsets

Correlation Metrics 2-D Histogram: Indicate value distribution relationship Value distribution of one variable with respect to changes in another Shannon’s Entropy: A metric to show the variability of the dataset Low entropy => more constant, predictable data High entropy => more randomly distributed data Mutual Information: A metric for computing the dependence between two variables Low M => two variables are relatively independent High M => one variable provides information about another
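As a rough illustration of these metrics, the sketch below computes Shannon's entropy and mutual information from a 2-D histogram of counts. It uses the standard identity I(A;B) = H(A) + H(B) - H(A,B) and base-2 logarithms; the log base and the function names are assumptions, since the slide does not specify them.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as bin counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def mutual_information(joint):
    """Mutual information (in bits) from a 2-D histogram joint[i][j] of counts."""
    row_counts = [sum(row) for row in joint]          # marginal histogram of A
    col_counts = [sum(col) for col in zip(*joint)]    # marginal histogram of B
    flat = [c for row in joint for c in row]          # joint histogram, flattened
    # I(A;B) = H(A) + H(B) - H(A,B)
    return entropy(row_counts) + entropy(col_counts) - entropy(flat)

independent = mutual_information([[1, 1], [1, 1]])   # A tells nothing about B
dependent = mutual_information([[2, 0], [0, 2]])     # A fully determines B
```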

Contributions Support efficient correlation analysis using bitmap indexing Better efficiency and smaller memory cost using bitmap indexing Dynamic Indexing Static Indexing Parallel correlation analysis with multiple machines Dim-based Index Partition Value-based Index Partition Correlation analysis based on samples A framework to support correlation analysis over flexible data subsets

User Cases Please enter variable names which you want to perform correlation queries: TEMP SALT UVEL Please enter your SQL query: SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50; Entropy: TEMP: 2.29, SALT: 2.66, UVEL: 3.05; Mutual Information: TEMPSALT: 0.15, TEMP->UVEL: 0.036; SELECT SALT FROM POP WHERE SALT<0.0346; Entropy: TEMP: 2.28, SALT: 2.53, UVEL: 3.06; Mutual Information: TEMPUVEL 0.039, SALT->UVEL: 0.33; UNDO Entropy: TEMP: 2.22, SALT: 1.58, UVEL: 2.64; Mutual Information: TEMPUVEL 0.31, SALT->UVEL: 0.21; ……

User Cases Histogram of SALT based on TEMP Cold Water(TEMP<5): High SALT Hot Water(TEMP>=15): High SALT Entropy TEMP: similar entropy SALT: Diversity of SALT becomes bigger as TEMP increases Mutual Information Correlation between TEMP and SALT is high when TEMP is cold or hot

Dynamic Indexing No Indexing Support: Load all data of variables A and B; filter A and B and generate the subset (for value-based subsetting); generate joint bins: divide A and B into bins, then generate (A1, B1)->count1, … (Am, Bm)->countm by scanning each data element; calculate correlation metrics based on joint bins. Dynamic Indexing (build an index for each variable): Query bitvectors for variables A and B (much smaller index loading cost, very small filtering cost); generate joint bins: generate (A1, B1)->count1, … (Am, Bm)->countm based on fast bitwise operations between A and B (the number of bitvectors is much smaller than the number of elements)
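The joint-bin step of dynamic indexing can be sketched as a bitwise AND followed by a count of 1-bits. This is a minimal illustration that assumes uncompressed bitvectors held as Python integers; the helper names and the example bitvectors are hypothetical, and a real implementation would operate on compressed (WAH/BBC) words.

```python
def popcount(bitvector):
    """Number of 1-bits in a bitvector stored as a Python int."""
    return bin(bitvector).count("1")

def joint_bins(bitvectors_a, bitvectors_b):
    """joint[i][j] = number of elements falling in bin i of A and bin j of B."""
    return [[popcount(a & b) for b in bitvectors_b] for a in bitvectors_a]

def apply_subset(bitvectors, mask):
    """Restrict every bitvector to the elements selected by a query mask."""
    return [bv & mask for bv in bitvectors]

# Hypothetical indices over 6 elements: A has 3 bins, B has 2 bins.
index_a = [0b100011, 0b001100, 0b010000]
index_b = [0b000111, 0b111000]

full = joint_bins(index_a, index_b)
# A subsetting query that selects only the first four elements (bits 0..3).
subset = joint_bins(apply_subset(index_a, 0b001111), index_b)
```

The loop count depends on the number of bin pairs, not on the number of data elements, which is where the slide's efficiency argument comes from.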

Dynamic Indexing Memory

Static Indexing Dynamic Indexing: Build one index over each variable; still need to perform bitwise operations to generate joint bins. Static Indexing: Build one index over multiple variables; only need to perform bitvector loading and calculation

Dim-based Partition Pros: efficient parallel index generation Cons: slave node cannot directly calculate the results. Big reduction overhead

Value-based Partition Pros: slave node can directly calculate partial results. Very small reduction overhead Cons: partition for parallel index generation is more time-consuming
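One way to see why value-based partitioning keeps the reduction small: if each node owns a disjoint set of joint bins, it can send just two numbers, and the master can still recover the exact joint entropy through H = log2(N) - (1/N) Σ c·log2(c). The sketch below illustrates that identity only; whether the thesis implementation reduces results this way is an assumption, and all names and counts are hypothetical.

```python
from math import log2

def node_summary(owned_counts):
    """Computed locally on one node: (sum of c, sum of c*log2(c)) over its joint bins."""
    return (sum(owned_counts),
            sum(c * log2(c) for c in owned_counts if c > 0))

def reduce_joint_entropy(summaries):
    """Master-side reduction: H = log2(N) - (1/N) * sum(c * log2(c))."""
    n = sum(s[0] for s in summaries)
    weighted = sum(s[1] for s in summaries)
    return log2(n) - weighted / n

def direct_entropy(counts):
    """Reference computation over all joint bins gathered in one place."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

node_a = [2, 1]        # joint-bin counts owned by node A (hypothetical)
node_b = [1, 1, 1]     # joint-bin counts owned by node B (hypothetical)
h = reduce_joint_entropy([node_summary(node_a), node_summary(node_b)])
```

With a dim-based partition, by contrast, every node holds partial counts for all joint bins, so whole histograms have to be transferred and summed before any metric can be computed.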

Correlation Analysis in Distributed Environment Computing Node Without Indexing Support Using Bitmap Indexing Read Data Subset Read Index Subset

Correlation Analysis over Samples Logic operations between sample of A and bitvectors of B Select bitvectors of variable A Perform Index-based sampling on Variable A Select bitvectors of variable B

Put It Together Parse the metadata file Continue Interactive Query Parse the SQL expression Generate query request Give up current correlation result or not? Decide query types Read Joint Bitvectors Read bitvectors and generate joint bins Perform index-based data query and sampling Calculate Correlation Metrics based on joint bitvectors Read the data value after finding satisfying result

Experiment Evaluations Goals: Speedup of correlation analysis using bitmap indexing Scalability of parallel correlation analysis Efficiency improvement in distributed environment Efficiency and accuracy comparison with sampling Datasets: Parallel Ocean Program – Multi-dimensional Arrays 26 Variables: TEMP (depth, lat, lon), SALT, UVEL …… Environment: OSC Glenn Cluster: each node has 8 cores, 2.6 GHz AMD Opteron, 64 GB memory, 1.9 TB disk

Efficiency Comparison No Indexing (original): Data Loading + Filtering Joint Bins Generation (scan each data element) Correlation Calculation Dynamic Indexing: Index Subset Loading Joint Bins Generation (bitwise operations) 1.78x to 3.61x speedup Speedup becomes bigger as data subset size decreases Static Indexing: Joint Index Subset Loading 11.4x to 15.35x speedup Variables: TEMP SALT, 5.6 GB each Metrics: Entropy, Histogram, Mutual Info Input: 1000 queries divided into 5 categories based on subsetting percentage

Parallel Correlation Analysis Scalability Dim-based Partition: The speedup is limited 1.73x to 5.96x speedup Every node can only generate joint bins Joint bins from different nodes need to be transferred for a global reduction (big cost) More nodes used means bigger network transfer and calculation cost Value-based Partition: Much better speedup 1.87x to 11.79x speedup Every node can directly calculate partial correlation metrics Very small reduction cost Variables: TEMP SALT, 28 GB each Metrics: Entropy, Histogram, Mutual Info Nodes#: 1 – 32, one core per node Calculate correlations based on entire data Speedup as more number of nodes used

Speedup in Distributed Environment Local Data Server (1 Gb/s) Remote Data Server (200 Mb/s) Data Size: 7 GB – 28 GB Indexing Method: Smaller data transfer time (index size is only 12.1% to 26.8% of the dataset) Faster correlation analysis time (smaller data loading, faster joint bin calculation) Speedup using local data server (1 Gb/s): 1.87x – 1.91x Speedup using remote data server (200 Mb/s): 2.78x – 2.96x

Efficiency and Accuracy Comparison with Sampling Select 10 Variables (1.4 GB each) and calculate mutual information between each pair (45 pairs) Calculate correlation based on samples: Joint bins generation time is greatly reduced Extra cost: sampling time Speedup: 1.34x – 6.84x Use CFP to present relative mutual information differences (45 pairs) More accuracy is lost as smaller samples are used; average accuracy loss by sample size: 50% - 1.53%, 25% - 3.42%, 10% - 7.91%, 5% - 12.57%, 1% - 18.32%

Outline Work After Candidacy Support Flexible Correlation Analysis Using Bitmap Index on Scientific Datasets In-Situ Bitmap Index Generation and Efficient Data Analysis based on Bitmap Work Before Candidacy A Data Virtualization Method for Scientific Datasets Support data selection and aggregation over scientific data Support bitmap indexing for more efficient query Integrate flexible data management with wide-area data transfer protocol Support Data Sampling Using Bitmap Index Conclusion

Motivation of In-Situ Analysis Process of transforming data at run time Analysis Triage Reduction Visualization In-Situ has the promise of Saving more information dense data Saving disk space Saving time in analysis (online and offline) Producing higher fidelity results Goal: Generate a smaller data summarization for analysis Utilize extra computing resource Decrease the IO transfer volume

Contributions In-Situ bitmap index generation Highly parallelized using multi-node, multi-core In-place bitmap index compression Store only index Data analysis based on bitmap index Support various data analysis Previous work Online analysis: time step selection Offline analysis: correlation mining, histogram spectra Support more efficient data analysis Compare efficiency and accuracy between in-situ sampling and in-situ bitmap indexing

Data-based vs. Index-based Analysis Data-based In-Situ Analysis: Data Simulation Online Analysis: Time Step Selection (Slow) Store the Data (Slow) Offline Analysis (Slow) Index-based In-Situ Analysis: Bitmap Index Generation (Fast) Time Step Selection Using Bitmap Index (Faster) Store only Bitmap Index (Faster) Offline Analysis Using Bitmap Index (Faster)

System Overview

In-Situ Bitmap Index Generation Directly generate index in memory after each time step is simulated Parallel index generation Logical data partition Multi-Core index generation Core allocation strategies Shared Cores Allocate all cores to simulation and index generation Executed in sequence Separate Cores Allocate different core sets to simulation and index generation A data queue is shared between simulation and index generation Executed in parallel In-place bitmap index generation Scan data by segments Merge segments into compressed bitvectors
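The "Separate Cores" strategy, where simulation and index generation share a data queue and run in parallel, can be sketched as a producer and a consumer. This is a toy illustration using Python threads and `queue.Queue`; the stand-in simulation, the bin layout, and all names are hypothetical, and a real in-situ system would pin the two stages to distinct core sets rather than rely on threads in one interpreter.

```python
import queue
import threading

def simulate(n_steps, out_queue):
    """Stand-in simulation: emits one small list of bin ids per time step."""
    for step in range(n_steps):
        data = [(step + i) % 4 for i in range(8)]
        out_queue.put((step, data))   # blocks while the queue is full
    out_queue.put(None)               # sentinel: simulation finished

def index_worker(in_queue, n_bins, results):
    """Consume time steps and build one bitvector (Python int) per bin."""
    while True:
        item = in_queue.get()
        if item is None:
            break
        step, data = item
        bitvectors = [0] * n_bins
        for i, v in enumerate(data):
            bitvectors[v] |= 1 << i   # here each value is already its bin id
        results[step] = bitvectors

# A bounded queue limits how many raw time steps are held in memory at once.
data_queue = queue.Queue(maxsize=2)
results = {}
producer = threading.Thread(target=simulate, args=(5, data_queue))
consumer = threading.Thread(target=index_worker, args=(data_queue, 4, results))
producer.start()
consumer.start()
producer.join()
consumer.join()
```

The bounded queue is a design choice in this sketch: if index generation falls behind, the simulation blocks instead of accumulating unindexed time steps.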

Time Step Selection Data is simulated over time steps Only keep time steps with “important” information Define “important”: Self-contained information Information with respect to others Earth Mover’s Distance Conditional Entropy Traditional method issues: Scan entire time steps for calculations Cannot hold enough time steps in memory Time step selection only using bitmap Index

Conditional Entropy Conditional Entropy: self-contained info minus info to others Shannon’s Entropy (H(A)): A metric to show the variability of the data Mutual Information (I(A;B)): A metric for computing the dependence between two variables
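Written out with the standard information-theoretic definitions, the relationship on this slide reads as follows. Here p(a) is assumed to be the fraction of elements falling into bin a and p(a,b) the fraction falling into joint bin (a,b); the slide does not state which time step plays the role of A and which of B, so that assignment is left open.

```latex
\begin{align}
H(A) &= -\sum_{a} p(a)\,\log p(a) \\
I(A;B) &= \sum_{a,b} p(a,b)\,\log\frac{p(a,b)}{p(a)\,p(b)} \\
H(A \mid B) &= H(A) - I(A;B)
\end{align}
```

The last line is the slide's "self-contained info minus info to others": the entropy of A reduced by what B already reveals about A.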

Earth Mover’s Distance Measure of distance between two probability distributions Divide data into bins and calculate element differences
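For two histograms over the same ordered bins, the Earth Mover's Distance has a simple closed form: the sum of absolute differences between the running (cumulative) distributions. The sketch below uses that standard 1-D formulation, measured in units of bin widths; it may differ in detail from the formulation used in the thesis. With a bitmap index, each bin count would come from the number of 1-bits in the corresponding bitvector.

```python
def earth_movers_distance(hist_p, hist_q):
    """1-D EMD between two histograms over the same bins, in units of bin widths."""
    total_p, total_q = sum(hist_p), sum(hist_q)
    distance = 0.0
    carried = 0.0  # probability mass that still has to be moved to later bins
    for p, q in zip(hist_p, hist_q):
        carried += p / total_p - q / total_q
        distance += abs(carried)
    return distance

far = earth_movers_distance([1, 0, 0], [0, 0, 1])   # all mass moves two bins
same = earth_movers_distance([1, 1], [1, 1])        # identical distributions
half = earth_movers_distance([2, 0], [1, 1])        # half the mass moves one bin
```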

Parallel Index Generation and Data Analysis

Correlation Mining Using Bitmap Automatically find data subsets with high correlations Traditional Method Exhaustive calculation over data subsets Huge time and memory cost Correlation mining using bitmap A top-down method for value subsets Multi-level bitmap indexing Subsets pruning from higher level to lower level A bottom-up method for spatial subsets Index generation Use Z-order curve to scan the data Multi-level granularity Correlation Mining Divide each bitvector into strides Find the highly correlated spatial areas Classify and Merge algorithm to show mining result
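The Z-order scan mentioned for the bottom-up method can be sketched with Morton encoding, which interleaves the bits of the coordinates so that cells close in space tend to be close in the scan order. This is a generic 2-D illustration; the function names, the 16-bit coordinate width, and the 2-D restriction are assumptions, not details from the slides.

```python
def morton_encode_2d(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key

def z_order_scan(width, height):
    """Return all (x, y) grid cells sorted by their Z-order key."""
    cells = [(x, y) for y in range(height) for x in range(width)]
    return sorted(cells, key=lambda c: morton_encode_2d(c[0], c[1]))

scan = z_order_scan(4, 4)
```

Because each 2x2 block of cells occupies four consecutive positions in the scan, a fixed-size stride of a bitvector corresponds to a compact spatial area, which is what makes stride-level correlation mining meaningful.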

Correlation Mining Using Bitmap Step 1: generate joint bitvectors (time complexity: i * j) Step 2: bit-1 count operations within joint bitvectors Subsets pruning to further improve the efficiency

Histogram Spectra Functionality: Histogram differences between highest resolution data and each low resolution data Predict error of each sample level Level-of-Detail selection Used in multivariate scenario Improvement using bitmap index: Generate more accurate sample Calculate histogram spectra more efficiently in multivariate scenario Used for both online and offline analysis Online: data sampling Offline: bitvector sampling

Experiment Evaluations Goals: Compare in-situ data based analysis with in-situ bitmap index based analysis Scalability of our method in parallel environment Speedup of using bitmap index for correlation mining Compare in-situ sampling based analysis with in-situ bitmap index based analysis Simulations: Lulesh, Heat3D, POP Environments: Standalone: 32 Intel Xeon x5650 CPUs and 1 TB memory (CPU) 60 Intel Xeon Phi coprocessors and 8 GB memory (MIC) Cluster: 32 machines with 12 Intel Xeon x5650 CPUs and 48 GB memory

No Index vs. Bitmap Index (Lulesh, CPU) Intel CPU: 32 cores Lulesh: Motion of materials relative to each other Time consuming No Indexing: Time Step Selection (scan 12.28 GB data) Data Writing (6.14 GB each) Bitmap Indexing: Time Step Selection (hundreds of bitwise operations) Data Writing ( < 1GB each) 0.84x to 1.47x speedup Speedup becomes bigger as data size increases Select 25 time steps out of 100 4 Variables: 6.14 GB per time step Number of bins: 89 to 314 Metrics: Earth Mover’s Distance

No Index vs. Bitmap Index (Lulesh, MIC) Intel MIC: 60 cores More computing resource Limited IO speed Lulesh simulation No Indexing: Time Step Selection (scan 1.5 GB data) Data Writing (768 MB each) Data writing becomes the major bottleneck Bitmap Indexing: Index generation and time step selection time is very small using a big number of cores 0.92x to 2.62x speedup Speedup becomes bigger using MIC Select 25 time steps out of 100 4 Variables: 768 MB per time step Number of bins: 89 to 314 Metrics: Earth Mover’s Distance

No Index vs. Bitmap Index (Heat3D, MIC) Intel MIC: 60 cores Heat3D: Simulate heat flow Fast (one data scan) No Indexing: Time Step Selection (scan 3.2 GB data) Data Writing (1.6 GB each) Bitmap Indexing: Time Step Selection (thousands of bitwise AND operations) Data Writing (<200 MB each) 0.81x to 3.28x speedup Speedup becomes bigger as data size increases Select 25 time steps out of 100 TEMP Variable: 1.6 GB per time step Number of bins: 64 to 206 Metrics: Conditional Entropy

No Index vs. Bitmap Index (Memory Cost) Heat3D - No Indexing: 12 time steps (pre, temp, cur) Heat3D - Bitmap Indexing: 2 time steps (pre, temp) 1 previous selected indices 10 current indices Lulesh – No Indexing: 11 time steps (pre, cur) Huge extra memory for edges Lulesh – Bitmap Indexing: 1 time step (pre) 1.99x to 3.59x smaller memory Better as bigger data simulated and more time steps to hold Select 1 time step out of each 10 steps Simulations: Heat3D, Lulesh Machines: CPU, MIC

Scalability in Parallel Environment Simulation: Heat3D No Index – Local: Each node writes its data sub-block into its own disk Bitmap Index – Local: Each node writes its index sub-block into its own disk Fast time step selection and local writing 1.24x – 1.29x speedup No Index – Remote: Different nodes send data sub-blocks to a master node Bitmap Index – Remote: Greatly alleviate data transfer burden of master node 1.24x – 3.79x speedup Select 25 time steps out of 100 TEMP Variable: 6.4 GB per time step Number of nodes: 1 to 32 Number of cores: 8

Speedup for Correlation Mining Simulation: POP No Indexing: Big data loading cost Huge memory cost Exhaustive calculation over data subsets Calculation is time consuming Bitmap Indexing: Smaller data loading Smaller memory cost Multi-level index to improve the mining process Bitwise operations and 1-count operations to improve the calculation efficiency 3.8x – 4.9x speedup Variables: TEMP, SALT Data size per variable: 1.4 GB to 11.2 GB Number of cores: 1

Sampling vs. Bitmap Indexing Index generation (binning, compression) has more time cost than down-sampling Sampling can effectively improve the time step selection cost Index generation can still achieve better efficiency if the index size is smaller than sample size Bitmap indexing: using the same binning scale, does not have any information loss Sampling: information loss is unavoidable no matter what sample%: 50% - 3.14% loss, 30% - 7.56% loss, 15% - 10.15% loss, 5% - 17.03% loss

Outline Work After Candidacy Support Flexible Correlation Analysis Using Bitmap Index on Scientific Datasets In-Situ Bitmap Index Generation and Efficient Data Analysis based on Bitmap Work Before Candidacy A Data Virtualization Method for Scientific Datasets Support data selection and aggregation over scientific data Support bitmap indexing for more efficient query Integrate flexible data management with wide-area data transfer protocol Support Data Sampling Using Bitmap Index Conclusion

Contribution Data virtualization Support standard SQL queries Keep data in its native format (e.g., NetCDF, HDF5) SciDB, OPeNDAP: huge data loading or transform cost Metadata design Server-side subsetting and aggregation Subsetting: dimensions, coordinates, values Bitmap Indexing: two-phase optimizations Aggregation: SUM, AVG, COUNT, MAX, MIN Parallel data processing Data Partition Strategy Multiple Parallel Levels – Files, Attributes, Blocks Integrate with data transfer protocol SDQuery_DSI module in Globus GridFTP Flexible data subsetting + Efficient data transfer

System Overview GridFTP Server Request Parser File DSI SDQuery DSI GridFTP Client GridFTP Client GridFTP Client GridFTP Server Request Parser data store request schema request data retrieve request Query Analysis Parse SQL query Receive Data File File Receiver File Receiver File Reader Indexing and find all data IDs Build Multi-level Bitmap Indexing Index Operations Index Generation File DSI Read Data based on data IDs Data Reader Generate Metadata View Physical Location Logical Layout Value Histogram Schema Management Query Metadata View Send File File Sender SDQuery DSI Indices and schema HDF5, NetCDF Dataset

Outline Work After Candidacy Support Flexible Correlation Analysis Using Bitmap Index on Scientific Datasets In-Situ Bitmap Index Generation and Efficient Data Analysis based on Bitmap Work Before Candidacy A Data Virtualization Method for Scientific Datasets Support data selection and aggregation over scientific data Support bitmap indexing for more efficient query Integrate flexible data management with wide-area data transfer protocol Support Data Sampling Using Bitmap Index Conclusion

Contributions Statistical Sampling Techniques A subset of individuals to represent whole population Information Loss and Error Metrics: Mean, Variance, Histogram, Q-Q Plot Challenges Sampling Accuracy Considering Data Features Error Calculation with High Overhead Support Data Sampling over Bitmap Indices Data samples have better accuracy Support error prediction before sampling the data Support data sampling over flexible data subset No data reorganization is needed

Data Sampling over Bitmap Indices Features of Bitmap Indexing Each bin (bitvector) corresponds to one value range Different bins reflect the entire value distribution Each bin keeps the data spatial locality Contain all space IDs (0-bits and 1-bits) Row Major, Column Major Z-Order Curve, Hilbert Curve Method Perform stratified random sampling over each bin Multi-level indices generate multi-level samples

Stratified Random Sampling over Bins S1: Index Generation S2: Divide Bitvector into Equal Strides S3: Randomly select a certain % of 1’s out of each stride
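Steps S2 and S3 can be sketched as follows: split a bitvector into equal strides and keep a random fraction of the 1-bits in each stride. This is a minimal illustration on an uncompressed bitvector stored as a Python integer; the function name, the rounding rule, and the choice to keep at least one bit per non-empty stride are assumptions, not details from the slides.

```python
import random

def sample_bitvector(bitvector, length, stride, fraction, rng):
    """Keep roughly `fraction` of the 1-bits in each stride of a bitvector (Python int)."""
    sampled = 0
    for start in range(0, length, stride):
        # Positions of the 1-bits inside this stride.
        ones = [i for i in range(start, min(start + stride, length))
                if (bitvector >> i) & 1]
        if not ones:
            continue
        # Assumption: keep at least one bit per non-empty stride.
        k = max(1, round(len(ones) * fraction))
        for i in rng.sample(ones, k):
            sampled |= 1 << i
    return sampled

# Hypothetical bin: 8 ones in the first stride, 4 ones in the second.
bin_bits = 0x0FFF
sample = sample_bitvector(bin_bits, length=16, stride=8, fraction=0.5,
                          rng=random.Random(0))
```

Sampling per bin preserves the value distribution, and sampling per stride preserves spatial locality, which matches the two features of bitmap indexing the preceding slide lists.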

Error Prediction vs. Error Calculation Data Sampling Multi-Times Error Prediction Sampling Request Predict Request Sample Error Metrics Feedback Sampling Request Sample Not Good? Decide Sampling Error Calculation

Future Work In-Situ Importance Driven Subset Selection Index-based data subset selection A comparison of this method with histogram-based data subset selection Support in-situ correlation mining using index Support efficient index generation using GPU Feature Mining Using Bitmap Index Scientific data features: Cosmology: halos Oceanographic: eddy Quickly identify these features using bitmap index In-situ analysis

Conclusion A data virtualization framework for scientific dataset Support user-defined subsetting and aggregation over parallel netcdf datasets (CCGrid2012) Indexing and parallel query processing support for visualizing climate datasets (ICPP2012) SDQuery DSI: integrating data management support with a wide area data transfer protocol (SC2013) Support data sampling using bitmap index Taming massive distributed datasets: data sampling using bitmap indices (HPDC2013) Support correlation analysis using bitmap index Support correlation analysis on scientific datasets in parallel and distributed settings (HPDC2014) In-situ bitmap index based analysis In-Situ bitmap index generation and efficient data analysis using bitmap (HPDC2015 submission)

Thanks for your attention! Q & A 1/15/2019