1/15/2019 Big Data Management Framework based on Virtualization and Bitmap Data Summarization. Yu Su, Department of Computer Science and Engineering, The Ohio State University. Advisor: Dr. Gagan Agrawal
Motivation – Scientific Data Analysis: Science has become extremely data driven.
Motivation – “Big Data”: Strong requirement for efficient analysis. Huge performance gap: the CPU generates data extremely fast; memory is not big enough to hold the data; disk and network are very slow to store or transfer the data.
Motivation - Challenges. Lack of data virtualization: different data formats; different data access libraries. Data analysis efficiency: individual data analysis (e.g., selection, aggregation); correlation data analysis (e.g., selection, mining); no flexible data analysis for data transfer protocols. Big original data size: time-consuming and resource costly; hard to find a smaller subset to represent the whole data; sampling efficiency and accuracy. Parallelism: data dependency; multi-server, multi-core.
Thesis Work (diagram labels): Efficiency; Individual Analysis; Data Virtualization; Memory Cost; Correlation Analysis; Data Transfer; In-Situ Analysis; Bitmap-based Data Summary; Parallelism; Data Sampling.
Thesis Work (publications): Supporting User-defined Subsetting and Aggregation over Parallel NetCDF Datasets (CCGrid 2012); Indexing and Parallel Query Processing Support for Visualizing Climate Datasets (ICPP 2012); SDQueryDSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol (SC 2013); Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices (HPDC 2013); Supporting Correlation Analysis on Scientific Datasets in Parallel and Distributed Settings (HPDC 2014); In-Situ Bitmap Index Generation and Efficient Data Analysis based on Index (HPDC 2015 submission). Diagram labels: Individual Analysis; Data Virtualization; Correlation Analysis; In-Situ Analysis; Bitmap-based Data Summary; Data Sampling.
Outline. Work After Candidacy: Support Flexible Correlation Analysis Using Bitmap Index on Scientific Datasets; In-Situ Bitmap Index Generation and Efficient Data Analysis based on Bitmap. Work Before Candidacy: A Data Virtualization Method for Scientific Datasets (support data selection and aggregation over scientific data; support bitmap indexing for more efficient queries; integrate flexible data management with a wide-area data transfer protocol); Support Data Sampling Using Bitmap Index. Conclusion.
Bitmap Indexing: Widely used in scientific data management. Suitable for floating-point values by binning small value ranges. Run-length compression (WAH, BBC). Bitmap indices can be treated as a data summarization of much smaller size.
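The binning idea can be illustrated with a minimal sketch. It assumes each bitvector is stored as an uncompressed Python integer (no WAH or BBC compression), and the function name `build_bitmap_index` and the bin edges are hypothetical, not taken from the slides.

```python
from bisect import bisect_right

def build_bitmap_index(values, bin_edges):
    """Return one bitvector (a Python int) per bin.

    Bit i of bitvector b is set when values[i] falls into bin b,
    i.e. bin_edges[b] <= values[i] < bin_edges[b + 1].
    """
    n_bins = len(bin_edges) - 1
    bitvectors = [0] * n_bins
    for i, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1
        b = min(max(b, 0), n_bins - 1)  # fold out-of-range values into the edge bins
        bitvectors[b] |= 1 << i
    return bitvectors

# Six floating-point values binned into [0,1), [1,2), [2,3)
temps = [0.2, 1.7, 0.9, 2.4, 1.1, 0.4]
index = build_bitmap_index(temps, [0.0, 1.0, 2.0, 3.0])
counts = [bin(bv).count("1") for bv in index]  # per-bin histogram from the index alone
```

The per-bin 1-bit counts already give the value histogram, which is why the index can act as a summary of the data.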
Motivation. Scientific analysis types: individual variable analysis (data subsetting, aggregation, mining, visualization); correlation analysis (study relationships among multiple variables; make interesting scientific discoveries). The “Big Data” problem becomes more severe: huge data loading cost (multiple variables); additional filtering cost for subset-based correlation analysis; huge correlation calculation cost. Correlation analysis is useful but time-consuming and resource costly. No support for correlation analysis over flexible data subsets.
Correlation Metrics. 2-D Histogram: indicates the value distribution relationship, i.e., the value distribution of one variable with respect to changes in another. Shannon’s Entropy: a metric for the variability of the dataset; low entropy => more constant, predictable data; high entropy => more randomly distributed data. Mutual Information: a metric for the dependence between two variables; low M => the two variables are relatively independent; high M => one variable provides information about the other.
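As a hedged illustration, both metrics can be computed from bin counts alone using the standard textbook definitions (base-2 logarithms are an assumption; the slides do not state the base). The names `entropy` and `mutual_information` are illustrative.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a histogram given as per-bin counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def mutual_information(joint):
    """Mutual information (in bits) from a 2-D histogram.

    joint[i][j] is the number of elements in bin i of A and bin j of B.
    """
    total = sum(sum(row) for row in joint)
    pa = [sum(row) / total for row in joint]         # marginal of A
    pb = [sum(col) / total for col in zip(*joint)]   # marginal of B
    mi = 0.0
    for i, row in enumerate(joint):
        for j, c in enumerate(row):
            if c > 0:
                p = c / total
                mi += p * log2(p / (pa[i] * pb[j]))
    return mi

uniform = entropy([1, 1, 1, 1])                    # four equally likely bins
dependent = mutual_information([[2, 0], [0, 2]])   # B is fully determined by A
independent = mutual_information([[1, 1], [1, 1]])
```

A uniform histogram gives the highest entropy for its bin count, and a joint histogram whose cells factor into the marginals gives zero mutual information.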
Contributions. Support efficient correlation analysis using bitmap indexing: better efficiency and smaller memory cost (Dynamic Indexing; Static Indexing). Parallel correlation analysis with multiple machines (Dim-based Index Partition; Value-based Index Partition). Correlation analysis based on samples. A framework to support correlation analysis over flexible data subsets.
User Cases. Please enter variable names which you want to perform correlation queries: TEMP SALT UVEL. Please enter your SQL query: SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50; Entropy: TEMP: 2.29, SALT: 2.66, UVEL: 3.05; Mutual Information: TEMP->SALT: 0.15, TEMP->UVEL: 0.036; SELECT SALT FROM POP WHERE SALT<0.0346; Entropy: TEMP: 2.28, SALT: 2.53, UVEL: 3.06; Mutual Information: TEMP->UVEL: 0.039, SALT->UVEL: 0.33; UNDO Entropy: TEMP: 2.22, SALT: 1.58, UVEL: 2.64; Mutual Information: TEMP->UVEL: 0.31, SALT->UVEL: 0.21; ……
User Cases: Histogram of SALT based on TEMP. Cold water (TEMP<5): high SALT. Hot water (TEMP>=15): high SALT. Entropy: TEMP has similar entropy; the diversity of SALT grows as TEMP increases. Mutual Information: the correlation between TEMP and SALT is high when the water is cold or hot.
Dynamic Indexing. No indexing support: (1) load all data of variables A and B; (2) filter A and B and generate the subset (for value-based subsetting); (3) generate joint bins: divide A and B into bins and generate (A1, B1)->count1, … (Am, Bm)->countm by scanning each data element; (4) calculate correlation metrics based on the joint bins. Dynamic indexing (build one index for each variable): (1) query the bitvectors for variables A and B (much smaller index loading cost, very small filtering cost); (2) generate joint bins (A1, B1)->count1, … (Am, Bm)->countm with fast bitwise operations between the bitvectors of A and B (the number of bitvectors is much smaller than the number of elements).
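The joint-bin step of dynamic indexing can be sketched as follows, assuming each bitvector is an uncompressed Python integer in which bit i stands for data element i. The function names and the example bitvectors are hypothetical; a production index would operate on compressed (e.g., WAH) bitvectors instead.

```python
def joint_bins(index_a, index_b):
    """joint[i][j] = popcount(A_i AND B_j).

    Counts the elements that fall in bin i of A and bin j of B
    without touching the raw data.
    """
    return [[bin(a & b).count("1") for b in index_b] for a in index_a]

def subset_joint_bins(index_a, index_b, mask):
    """Same as joint_bins, restricted to the elements whose bit is set in mask."""
    return [[bin(a & b & mask).count("1") for b in index_b] for a in index_a]

# Six elements: A has three bins, B has two bins
index_a = [0b100101, 0b010010, 0b001000]
index_b = [0b000111, 0b111000]

joint = joint_bins(index_a, index_b)
# Restrict to the first four elements, e.g. the result of a subsetting query
joint_subset = subset_joint_bins(index_a, index_b, 0b001111)
```

The cost is one AND plus one bit count per pair of bins, so it scales with the number of bitvectors rather than the number of elements.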
Dynamic Indexing Memory
Static Indexing. Dynamic indexing: build one index over each variable; bitwise operations are still needed to generate joint bins. Static indexing: build one index over multiple variables; only bitvector loading and calculation are needed.
Dim-based Partition. Pros: efficient parallel index generation. Cons: a slave node cannot directly calculate the results; big reduction overhead.
Value-based Partition. Pros: a slave node can directly calculate partial results; very small reduction overhead. Cons: partitioning for parallel index generation is more time-consuming.
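One way a node could produce a partial result under a value-based partition is sketched below for mutual information. It is an assumption, not the slides' stated method: each node is taken to own complete rows of the joint bins (whole value ranges of A) and to know the per-bin counts of B and the total element count, so the master only has to sum one number per node. The name `partial_mi` is hypothetical.

```python
from math import log2

def partial_mi(rows, b_counts, total):
    """Partial mutual information (in bits) from the joint-bin rows owned by one node.

    rows:     joint-bin rows for the A value ranges assigned to this node
    b_counts: global per-bin counts of B (assumed to be known to every node)
    total:    global number of elements
    """
    s = 0.0
    for row in rows:
        a_count = sum(row)  # a full row gives the global count of this A bin
        for c, b_count in zip(row, b_counts):
            if c > 0:
                # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with counts
                s += (c / total) * log2(c * total / (a_count * b_count))
    return s

# Two nodes, each owning one A bin of a 2x2 joint histogram
node0 = partial_mi([[2, 0]], b_counts=[2, 2], total=4)
node1 = partial_mi([[0, 2]], b_counts=[2, 2], total=4)
global_mi = node0 + node1  # the reduction is a sum of scalars
```

Under a dim-based partition the same cell would be split across nodes, so the joint bins themselves would have to be merged before any logarithm could be taken.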
Correlation Analysis in Distributed Environment (diagram). Computing node without indexing support: read data subset. Computing node using bitmap indexing: read index subset.
Correlation Analysis over Samples (workflow): select bitvectors of variable A; perform index-based sampling on variable A; select bitvectors of variable B; logic operations between the sample of A and the bitvectors of B.
Put It Together (flowchart): Parse the metadata file. Continue interactive query. Parse the SQL expression. Generate query request. Give up current correlation result or not? Decide query types. Read joint bitvectors. Read bitvectors and generate joint bins. Perform index-based data query and sampling. Calculate correlation metrics based on joint bitvectors. Read the data values after finding a satisfying result.
Experiment Evaluations. Goals: speedup of correlation analysis using bitmap indexing; scalability of parallel correlation analysis; efficiency improvement in a distributed environment; efficiency and accuracy comparison with sampling. Datasets: Parallel Ocean Program (POP), multi-dimensional arrays, 26 variables: TEMP (depth, lat, lon), SALT, UVEL, …… Environment: OSC Glenn Cluster; each node has 8 cores, 2.6 GHz AMD Opteron, 64 GB memory, 1.9 TB disk.
Efficiency Comparison. Setup: variables TEMP and SALT, 5.6 GB each; metrics: entropy, histogram, mutual information; input: 1000 queries divided into 5 categories based on subsetting percentage. No indexing (original): data loading + filtering; joint bins generation (scan each data element); correlation calculation. Dynamic indexing: index subset loading; joint bins generation (bitwise operations); 1.78x to 3.61x speedup; the speedup grows as the data subset size decreases. Static indexing: joint index subset loading; 11.4x to 15.35x speedup.
Parallel Correlation Analysis Scalability. Setup: variables TEMP and SALT, 28 GB each; metrics: entropy, histogram, mutual information; number of nodes: 1 – 32, one core per node; correlations are calculated over the entire data. Dim-based partition: the speedup is limited (1.73x to 5.96x); every node can only generate joint bins; joint bins from different nodes need to be transferred for a global reduction (big cost); more nodes mean bigger network transfer and calculation cost. Value-based partition: much better speedup (1.87x to 11.79x); every node can directly calculate partial correlation metrics; very small reduction cost. Chart: speedup as more nodes are used.
Speedup in Distributed Environment. Setup: local data server (1 Gb/s); remote data server (200 Mb/s); data size: 7 GB – 28 GB. Indexing method: smaller data transfer time (the index size is only 12.1% to 26.8% of the dataset); faster correlation analysis (smaller data loading, faster joint bin calculation). Speedup using the local data server (1 Gb/s): 1.87x – 1.91x. Speedup using the remote data server (200 Mb/s): 2.78x – 2.96x.
Efficiency and Accuracy Comparison with Sampling. Setup: select 10 variables (1.4 GB each) and calculate mutual information between each pair (45 pairs). Calculating correlation based on samples: joint bins generation time is greatly reduced; extra cost: sampling time; speedup: 1.34x – 6.84x. Accuracy: use CFP to present relative mutual information differences (45 pairs); more accuracy is lost as smaller samples are used. Average accuracy loss by sample size: 50% sample: 1.53%; 25%: 3.42%; 10%: 7.91%; 5%: 12.57%; 1%: 18.32%.
Motivation of In-Situ Analysis. In-situ analysis is the process of transforming data at run time: analysis, triage, reduction, visualization. In-situ has the promise of: saving more information-dense data; saving disk space; saving time in analysis (online and offline); producing higher-fidelity results. Goal: generate a smaller data summarization for analysis; utilize extra computing resources; decrease the I/O transfer volume.
Contributions. In-situ bitmap index generation: highly parallelized using multi-node, multi-core; in-place bitmap index compression; store only the index. Data analysis based on bitmap index: support various data analyses (previous work; online analysis: time step selection; offline analysis: correlation mining, histogram spectra); support more efficient data analysis. Compare efficiency and accuracy between in-situ sampling and in-situ bitmap indexing.
Data-based vs. Index-based Analysis. Data-based in-situ analysis: data simulation; online analysis: time step selection (slow); store the data (slow); offline analysis (slow). Index-based in-situ analysis: bitmap index generation (fast); time step selection using bitmap index (faster); store only the bitmap index (faster); offline analysis using bitmap index (faster).
System Overview
In-Situ Bitmap Index Generation. Directly generate the index in memory after each time step is simulated. Parallel index generation: logical data partition; multi-core index generation. Core allocation strategies: shared cores (allocate all cores to simulation and index generation; executed in sequence); separate cores (allocate different core sets to simulation and index generation; a data queue is shared between simulation and index generation; executed in parallel). In-place bitmap index generation: scan data by segments; merge segments into compressed bitvectors.
Time Step Selection. Data is simulated over time steps; only keep time steps with “important” information. Defining “important”: self-contained information; information with respect to others. Metrics: Earth Mover’s Distance; Conditional Entropy. Issues with the traditional method: scans entire time steps for calculations; cannot hold enough time steps in memory. Time step selection using only the bitmap index.
Conditional Entropy: self-contained info minus info to others. Shannon’s Entropy (H(A)): a metric for the variability of the data. Mutual Information (I(A;B)): a metric for the dependence between two variables.
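Written out with the standard information-theoretic definitions, "self-contained info minus info to others" corresponds to the identity below. Here p(a) and p(a,b) are assumed to be estimated as the fraction of elements in bin a, or in joint bin (a,b); the base-2 logarithm is likewise an assumption.

```latex
\[
H(A) = -\sum_{a} p(a)\,\log_2 p(a), \qquad
I(A;B) = \sum_{a,b} p(a,b)\,\log_2 \frac{p(a,b)}{p(a)\,p(b)}, \qquad
H(A \mid B) = H(A) - I(A;B)
\]
```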
Earth Mover’s Distance: a measure of the distance between two probability distributions. Divide the data into bins and calculate element differences.
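For two histograms over the same ordered bins, a common closed form of the Earth Mover's Distance is the sum of absolute differences between the cumulative distributions. The sketch below uses that form and assumes unit distance between adjacent bins; the slides do not say which variant was used, and the name `emd_1d` is hypothetical. The bin counts could come directly from bitvector 1-bit counts.

```python
def emd_1d(counts_p, counts_q):
    """Earth Mover's Distance between two histograms over the same ordered bins.

    Both histograms are normalized to probability distributions first;
    adjacent bins are assumed to be one unit apart.
    """
    total_p, total_q = sum(counts_p), sum(counts_q)
    carried = 0.0   # mass that still has to be moved to later bins
    distance = 0.0
    for cp, cq in zip(counts_p, counts_q):
        carried += cp / total_p - cq / total_q
        distance += abs(carried)
    return distance

same = emd_1d([1, 1], [1, 1])
shifted_one_bin = emd_1d([2, 2, 0], [0, 2, 2])
far_apart = emd_1d([4, 0, 0], [0, 0, 4])
```

Identical distributions give zero, and moving all mass across two bins costs two units, so time steps whose distributions barely change score low.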
Parallel Index Generation and Data Analysis
Correlation Mining Using Bitmap. Goal: automatically find data subsets with high correlations. Traditional method: exhaustive calculation over data subsets; huge time and memory cost. Correlation mining using bitmap: a top-down method for value subsets (multi-level bitmap indexing; subset pruning from higher level to lower level); a bottom-up method for spatial subsets (index generation: use a Z-order curve to scan the data, multi-level granularity; correlation mining: divide each bitvector into strides, find the highly correlated spatial areas). Classify-and-merge algorithm to show the mining result.
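The Z-order scan used for spatial subsets can be illustrated with a small 2-D sketch (the actual datasets are three-dimensional; two dimensions are assumed here for brevity). The function name `morton2d` is hypothetical.

```python
def morton2d(x, y, bits=16):
    """Z-order (Morton) code: interleave the bits of x and y.

    x occupies the even bit positions and y the odd ones, so cells that are
    close in space tend to be close in the resulting 1-D order.
    """
    code = 0
    for k in range(bits):
        code |= ((x >> k) & 1) << (2 * k)
        code |= ((y >> k) & 1) << (2 * k + 1)
    return code

# Order the cells of a 4x4 grid along the Z-order curve
cells = [(x, y) for y in range(4) for x in range(4)]
scan_order = sorted(cells, key=lambda c: morton2d(*c))
```

Because consecutive positions along the curve cover compact 2x2, 4x4, ... blocks, an equal-length stride of a bitvector built in this order corresponds to a compact spatial area.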
Correlation Mining Using Bitmap. Step 1: generate joint bitvectors (time complexity: i * j). Step 2: bit-1 count operations within the joint bitvectors. Subset pruning further improves the efficiency.
Histogram Spectra. Functionality: histogram differences between the highest-resolution data and each lower-resolution data; predict the error of each sample level; level-of-detail selection; used in multivariate scenarios. Improvement using bitmap index: generate more accurate samples; calculate histogram spectra more efficiently in multivariate scenarios; used for both online and offline analysis (online: data sampling; offline: bitvector sampling).
Experiment Evaluations. Goals: compare in-situ data-based analysis with in-situ bitmap-index-based analysis; scalability of our method in a parallel environment; speedup of using the bitmap index for correlation mining; compare in-situ sampling-based analysis with in-situ bitmap-index-based analysis. Simulations: Lulesh, Heat3D, POP. Environments: standalone: 32 Intel Xeon x5650 CPUs and 1 TB memory (CPU), 60 Intel Xeon Phi coprocessors and 8 GB memory (MIC); cluster: 32 machines with 12 Intel Xeon x5650 CPUs and 48 GB memory.
No Index vs. Bitmap Index (Lulesh, CPU). Setup: Intel CPU, 32 cores; select 25 time steps out of 100; 4 variables, 6.14 GB per time step; number of bins: 89 to 314; metric: Earth Mover’s Distance. Lulesh: motion of materials relative to each other; time-consuming. No indexing: time step selection (scan 12.28 GB data); data writing (6.14 GB each). Bitmap indexing: time step selection (hundreds of bitwise operations); data writing (< 1 GB each). Result: 0.84x to 1.47x speedup; the speedup grows as the data size increases.
No Index vs. Bitmap Index (Lulesh, MIC). Setup: Intel MIC, 60 cores (more computing resources, limited I/O speed); Lulesh simulation; select 25 time steps out of 100; 4 variables, 768 MB per time step; number of bins: 89 to 314; metric: Earth Mover’s Distance. No indexing: time step selection (scan 1.5 GB data); data writing (768 MB each); data writing becomes the major bottleneck. Bitmap indexing: index generation and time step selection take very little time using a large number of cores. Result: 0.92x to 2.62x speedup; the speedup is bigger on MIC.
No Index vs. Bitmap Index (Heat3D, MIC). Setup: Intel MIC, 60 cores; select 25 time steps out of 100; TEMP variable, 1.6 GB per time step; number of bins: 64 to 206; metric: Conditional Entropy. Heat3D: simulates heat flow; fast (one data scan). No indexing: time step selection (scan 3.2 GB data); data writing (1.6 GB each). Bitmap indexing: time step selection (thousands of bitwise AND operations); data writing (< 200 MB each). Result: 0.81x to 3.28x speedup; the speedup grows as the data size increases.
No Index vs. Bitmap Index (Memory Cost). Setup: select 1 time step out of every 10; simulations: Heat3D, Lulesh; machines: CPU, MIC. Heat3D, no indexing: 12 time steps (pre, temp, cur). Heat3D, bitmap indexing: 2 time steps (pre, temp); 1 previously selected index; 10 current indices. Lulesh, no indexing: 11 time steps (pre, cur); huge extra memory for edges. Lulesh, bitmap indexing: 1 time step (pre). Result: 1.99x to 3.59x smaller memory; the advantage grows with bigger simulated data and more time steps to hold.
Scalability in Parallel Environment. Setup: simulation: Heat3D; select 25 time steps out of 100; TEMP variable, 6.4 GB per time step; number of nodes: 1 to 32; number of cores: 8. No index, local: each node writes its data sub-block to its own disk. Bitmap index, local: each node writes its index sub-block to its own disk; fast time step selection and local writing; 1.24x – 1.29x speedup. No index, remote: different nodes send data sub-blocks to a master node. Bitmap index, remote: greatly alleviates the data transfer burden of the master node; 1.24x – 3.79x speedup.
Speedup for Correlation Mining. Setup: simulation: POP; variables: TEMP, SALT; data size per variable: 1.4 GB to 11.2 GB; number of cores: 1. No indexing: big data loading cost; huge memory cost; exhaustive calculation over data subsets; time-consuming calculation. Bitmap indexing: smaller data loading; smaller memory cost; multi-level index to improve the mining process; bitwise operations and 1-count operations to improve the calculation efficiency. Result: 3.8x – 4.9x speedup.
Sampling vs. Bitmap Indexing. Efficiency: index generation (binning, compression) has more time cost than down-sampling; sampling can effectively reduce the time step selection cost; index generation can still achieve better efficiency if the index size is smaller than the sample size. Accuracy: bitmap indexing, using the same binning scale, does not have any information loss; with sampling, information loss is unavoidable no matter what the sample percentage: 50% sample: 3.14% loss; 30%: 7.56% loss; 15%: 10.15% loss; 5%: 17.03% loss.
Contribution. Data virtualization: support standard SQL queries; keep data in its native format (e.g., NetCDF, HDF5), whereas SciDB and OPeNDAP incur huge data loading or transformation costs; metadata design. Server-side subsetting and aggregation: subsetting by dimensions, coordinates, values; bitmap indexing: two-phase optimizations; aggregation: SUM, AVG, COUNT, MAX, MIN. Parallel data processing: data partition strategy; multiple parallel levels (files, attributes, blocks). Integration with a data transfer protocol: SDQuery_DSI module in Globus GridFTP; flexible data subsetting + efficient data transfer.
System Overview (architecture diagram labels): GridFTP Client; GridFTP Server; Request Parser; File DSI; SDQuery DSI; data store request; schema request; data retrieve request; Query Analysis: Parse SQL query; File Receiver: Receive Data File; File Reader; Index Operations: Indexing and find all data IDs; Index Generation: Build Multi-level Bitmap Indexing; Data Reader: Read Data based on data IDs; Schema Management: Generate Metadata View (Physical Location, Logical Layout, Value Histogram), Query Metadata View; File Sender: Send File; Indices and schema; HDF5, NetCDF Dataset.
Contributions. Statistical sampling techniques: a subset of individuals represents the whole population; information loss and error metrics: mean, variance, histogram, Q-Q plot. Challenges: sampling accuracy considering data features; error calculation with high overhead. Support data sampling over bitmap indices: data samples have better accuracy; support error prediction before sampling the data; support data sampling over flexible data subsets; no data reorganization is needed.
Data Sampling over Bitmap Indices. Features of bitmap indexing: each bin (bitvector) corresponds to one value range; different bins together reflect the entire value distribution; each bin keeps the data spatial locality, since it contains all space IDs (0-bits and 1-bits) in a scan order such as row major, column major, Z-order curve, or Hilbert curve. Method: perform stratified random sampling over each bin; multi-level indices generate multi-level samples.
Stratified Random Sampling over Bins. S1: index generation. S2: divide each bitvector into equal strides. S3: randomly select a certain % of the 1s out of each stride.
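Steps S2 and S3 can be sketched as follows, assuming an uncompressed bitvector stored as a Python integer. The function name `sample_bitvector`, its parameters, and the rounding rule for the per-stride sample size are all assumptions made for illustration.

```python
import random

def sample_bitvector(bitvector, n_bits, stride, fraction, rng):
    """Keep roughly `fraction` of the 1-bits in every stride of `bitvector`.

    Sampling stride by stride preserves the spatial distribution of the bin;
    doing it for every bin preserves the value distribution.
    """
    sample = 0
    for start in range(0, n_bits, stride):
        end = min(start + stride, n_bits)
        ones = [i for i in range(start, end) if (bitvector >> i) & 1]
        k = round(len(ones) * fraction)  # assumed rounding rule
        for i in rng.sample(ones, k):
            sample |= 1 << i
    return sample

# One bin over 16 elements: 4 ones in the first stride, 6 in the second
bitvector = 0b0011111100001111
rng = random.Random(0)  # seeded only to make the sketch repeatable
sample = sample_bitvector(bitvector, n_bits=16, stride=8, fraction=0.5, rng=rng)
```

The result is itself a bitvector, so the sampled bin can be used in the same bitwise operations as the original index.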
Error Prediction vs. Error Calculation Data Sampling Multi-Times Error Prediction Sampling Request Predict Request Sample Error Metrics Feedback Sampling Request Sample Not Good? Decide Sampling Error Calculation
Future Work. In-situ importance-driven subset selection: index-based data subset selection; a comparison of this method with histogram-based data subset selection; support in-situ correlation mining using index; support efficient index generation using GPU. Feature mining using bitmap index: scientific data features (cosmology: halos; oceanography: eddies); quickly identify these features using bitmap index; in-situ analysis.
Conclusion. A data virtualization framework for scientific datasets: Supporting user-defined subsetting and aggregation over parallel NetCDF datasets (CCGrid 2012); Indexing and parallel query processing support for visualizing climate datasets (ICPP 2012); SDQuery DSI: integrating data management support with a wide area data transfer protocol (SC 2013). Support data sampling using bitmap index: Taming massive distributed datasets: data sampling using bitmap indices (HPDC 2013). Support correlation analysis using bitmap index: Supporting correlation analysis on scientific datasets in parallel and distributed settings (HPDC 2014). In-situ bitmap index based analysis: In-situ bitmap index generation and efficient data analysis using bitmap (HPDC 2015 submission).
Thanks for your attention! Q & A