HPDC 2013
Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices
Yu Su*, Gagan Agrawal*, Jonathan Woodring#, Kary Myers#, Joanne Wendelberger#, James Ahrens#
* The Ohio State University
# Los Alamos National Laboratory
Motivation
Science is becoming increasingly data driven, which creates a strong requirement for efficient data analysis.
Challenges:
–Fast data generation speed
–Slow disk I/O and network speed
–Some numbers from the road-runner EC 3 simulation:
  particles, 36 bytes per particle => 2.3 TB/s
  Network bandwidth: 10 GB/s
  A 230-fold gap, which will grow in the future
Downloading and analyzing the entire dataset is extremely hard.
Server-side Subsetting Methods
[Figure: client-side subsetting vs. server-side subsetting; labels: Simple Request, Advanced Request]
Challenges:
–What if there is no subsetting request?
–What if the data subset is still big?
Data Sampling and Challenges
Statistical Sampling Techniques:
–A subset of individuals represents the whole population
–Examples: Simple Random, Stratified Random
Information Loss and Error Metrics:
–Mean, Variance, Histogram, Q-Q Plot
Challenges:
–Sampling accuracy considering data features: value distribution, spatial locality
–Error calculation without high overhead
–Combining data sampling with data subsetting
–Data sampling without data reorganization
Our Solution and Contribution
A server-side subsetting and sampling framework:
–Standard SQL interface
–Bitmap indexing
  Server-side subsetting: dimensions, values
  Server-side sampling
Support for data sampling over bitmap indices:
–Data samples have better accuracy
–Errors can be predicted before sampling the data
–Sampling works over a flexible data subset
–No data reorganization is needed
Background: Bitmap Indexing
Widely used in scientific data management
Suitable for floating-point values by binning small ranges
Run-length compression (WAH, BBC):
–Compresses a bitvector based on runs of consecutive 0s or 1s
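A minimal Python sketch of the idea on this slide, not the authors' implementation: values are binned into small ranges, each bin gets one bitvector, and runs of identical bits are collapsed. The function names, bin edges, and sample values are illustrative assumptions. The plain (bit, run length) encoding stands in for WAH and BBC, which use word-aligned and byte-aligned schemes respectively.

```python
import numpy as np

def build_bitmap_index(values, bin_edges):
    """Return one boolean bitvector per bin (hypothetical helper).

    Bit i of bitvector b is set when values[i] falls in
    [bin_edges[b], bin_edges[b + 1]).
    """
    values = np.asarray(values, dtype=float)
    # np.digitize gives 1-based bin positions relative to bin_edges,
    # so subtract 1 to obtain 0-based bin ids.
    bin_ids = np.digitize(values, bin_edges) - 1
    n_bins = len(bin_edges) - 1
    return [bin_ids == b for b in range(n_bins)]

def run_length_encode(bitvector):
    """Plain run-length encoding as (bit, run_length) pairs.

    This is a simplified stand-in for WAH/BBC, not either scheme.
    """
    runs = []
    for bit in bitvector:
        bit = int(bit)
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs

# Assumed toy data: eight values, three bins [1,2), [2,3), [3,4).
values = [1.2, 1.7, 2.1, 2.4, 2.9, 3.5, 1.1, 1.3]
edges = [1.0, 2.0, 3.0, 4.0]
index = build_bitmap_index(values, edges)
rle_bin0 = run_length_encode(index[0])
print(rle_bin0)
```

Long runs of 0s dominate when bins are narrow, which is why run-length schemes compress such bitvectors well.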
System Architecture
[Figure: system architecture; component roles as labeled:]
–Parse the SQL expression
–Parse the metadata file
–Generate the query request
–Find all bitvectors that satisfy the current query
–Calculate errors based on bitvectors
–Perform sampling over bitvectors
–Access the actual dataset
Data Sampling over Bitmap Indices
Features of Bitmap Indexing:
–Each bin (bitvector) corresponds to one value range
–Together, the bins reflect the entire value distribution
–Each bin preserves the spatial locality of the data
  Contains all space IDs (0-bits and 1-bits)
  Row major, column major, Hilbert curve, Z-order curve
Method:
–Perform stratified random sampling over each bin
–Multi-level indices generate multi-level samples
Stratified Random Sampling over Bins
S1: Index generation
S2: Divide each bitvector into equal strides
S3: Randomly select a certain % of the 1s out of each stride
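Steps S2 and S3 can be sketched for a single bitvector as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name, the stride length, the rounding rule for the per-stride sample count, and the toy bitvector are all hypothetical.

```python
import numpy as np

def sample_bitvector(bitvector, stride, fraction, rng):
    """Pick about `fraction` of the 1-bits from every stride.

    Returns the sorted positions (space IDs) of the sampled bits.
    The rounding of fraction * count is an assumed policy.
    """
    bitvector = np.asarray(bitvector, dtype=bool)
    sampled_ids = []
    for start in range(0, len(bitvector), stride):
        # Positions of the 1-bits inside this stride, in global coordinates.
        ones = start + np.flatnonzero(bitvector[start:start + stride])
        k = int(round(fraction * len(ones)))
        if k > 0:
            picked = rng.choice(ones, size=k, replace=False)
            sampled_ids.extend(sorted(picked.tolist()))
    return sampled_ids

# Assumed toy bitvector: 4 one-bits in the first stride, 6 in the second.
bits = [1, 1, 1, 1, 0, 0, 0, 0,
        1, 1, 1, 0, 1, 0, 1, 1]
rng = np.random.default_rng(0)
sample_ids = sample_bitvector(bits, stride=8, fraction=0.5, rng=rng)
print(sample_ids)
```

Because every stride contributes in proportion to its own 1-bit count, the sample follows the spatial density of the bin; repeating this for each bin also preserves the value distribution.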
Error Prediction vs. Error Calculation
[Figure: flowcharts of the two workflows; labels: Sampling Request, Predict Request, Error Prediction, Error Calculation, Data Sampling, Sample Not Good?, Multi-Times, Error Metrics, Feedback, Decide Sampling, Sample]
Error Prediction
Pre-estimate the error metrics before sampling:
–Bitmap indices classify the data into bins
  Each bin corresponds to one value or value range
  Find representative values for each bin: V_i
–Enforce an equal sampling percentage for each bin
  Extra metadata: number of 1-bits in each bin: C_i
  Compute the number of samples for each bin: S_i
–Pre-calculate error metrics based on V_i and S_i
Representative Values:
–Small bin: mean value
–Big bin: lower bound, upper bound, mean value
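One plausible reading of the mean prediction, sketched in Python. The deck's own formulas are not reproduced here, so everything below is an assumption: the function name, the rounding used to derive S_i from C_i, and the choice to predict the error as the gap between the C_i-weighted and S_i-weighted means of the representative values V_i.

```python
def predict_mean_error(rep_values, counts, fraction):
    """Estimate the sample-mean error from index metadata only.

    rep_values: V_i, one representative value per bin
    counts:     C_i, number of 1-bits per bin
    fraction:   sampling percentage applied equally to every bin
    No data values are read; only per-bin metadata is used.
    """
    # S_i: number of samples each bin would contribute (assumed rounding).
    samples = [int(round(fraction * c)) for c in counts]
    if sum(samples) == 0:
        raise ValueError("sampling fraction too small: no samples selected")
    full_mean = sum(v * c for v, c in zip(rep_values, counts)) / sum(counts)
    sample_mean = sum(v * s for v, s in zip(rep_values, samples)) / sum(samples)
    return samples, full_mean, sample_mean, abs(full_mean - sample_mean)

# Assumed toy metadata for three bins.
V = [1.5, 2.5, 3.5]
C = [44, 30, 12]
samples, full_mean, sample_mean, error = predict_mean_error(V, C, fraction=0.1)
print(samples, full_mean, sample_mean, error)
```

Since the inputs are a few numbers per bin, the prediction can be repeated for many candidate sampling percentages at little cost before any data is sampled.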
Error Prediction Formula
[Formulas shown for: Mean and Variance; Histogram; Q-Q Plot]
Data Subsetting + Data Sampling
S1: Find the value subset, e.g. Value = [2, 3)
S2: Find the spatial ID subset, e.g. RID = (9, 25)
S3: Perform stratified sampling on the subset
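The three steps can be sketched end to end in Python. This is a sketch under stated assumptions rather than the authors' implementation: the function name and toy data are hypothetical, the value test stands in for OR-ing the bitvectors of the bins that cover the value range, and RID = (9, 25) is assumed to exclude both endpoints.

```python
import numpy as np

def subset_then_sample(values, value_range, rid_range, stride, fraction, rng):
    """Value subset, then spatial subset, then stratified sampling."""
    values = np.asarray(values, dtype=float)
    lo, hi = value_range
    rid_lo, rid_hi = rid_range
    # S1: value subset. Stands in for combining the bitvectors of all
    # bins whose ranges fall inside [lo, hi).
    bits = (values >= lo) & (values < hi)
    # S2: spatial subset. Clear every bit outside the record-ID range
    # (both bounds treated as exclusive, an assumption).
    rids = np.arange(len(values))
    bits &= (rids > rid_lo) & (rids < rid_hi)
    # S3: stratified sampling over the remaining 1-bits, stride by stride.
    picked = []
    for start in range(0, len(bits), stride):
        ones = start + np.flatnonzero(bits[start:start + stride])
        k = int(round(fraction * len(ones)))
        if k > 0:
            chosen = rng.choice(ones, size=k, replace=False)
            picked.extend(sorted(chosen.tolist()))
    return picked

# Assumed toy data: 32 points cycling through four values.
values = np.tile([1.5, 2.2, 2.8, 3.1], 8)
rng = np.random.default_rng(1)
picked = subset_then_sample(values, (2.0, 3.0), (9, 25),
                            stride=8, fraction=0.5, rng=rng)
print(picked)
```

Both subsetting steps operate on bits only, so the actual dataset is touched just once, when the sampled positions are read.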
Experiment Results
Goals:
–Data analysis efficiency with the help of sampling
–Accuracy of different sampling methods
–Predicted error compared with actual error
–Efficiency of different sampling methods
–Speedup from combining data sampling with subsetting
Datasets:
–Ocean data: multi-dimensional arrays
–Cosmos data: separate points with 7 attributes
Environment:
–Darwin cluster: 120 nodes, 48 cores, 64 GB memory
Improve Efficiency of Distributed Data Analysis with Sampling
Data sampling on the server side
Data transfer between client and server
Data visualization on the client side
Dataset: 11.2 GB ocean data
No sampling (100%): zero sampling cost, but huge data transfer and visualization cost
Sampling: much smaller data transfer and visualization cost
100 MB/s network: data sampling achieves a 2.61 – speedup
10 MB/s network: data sampling achieves a 4.82 – total speedup
Sample Accuracy Comparison
Sampling Methods:
–Simple Random Method
–Stratified Random Method
–KDTree Stratified Random Method
–Big Bin Index Random Method
–Small Bin Index Random Method
Error Metrics:
–Means over 200 separate sectors
–Histogram using 200 value intervals
–Q-Q Plot with 200 quantiles
Sampling Percentage: 0.1%
Sample Accuracy Comparison
Traditional sampling methods cannot achieve good accuracy.
The Small Bin method achieves the best accuracy in most cases.
The Big Bin method achieves accuracy comparable to the KDTree sampling method.
[Figure panels: Mean, Histogram, Q-Q Plot]
Predicted Error vs. Actual Error
[Figure panels: Means, Histogram, Q-Q Plot for the Small Bin method; Means, Histogram, Q-Q Plot for the Big Bin method]
Efficiency Comparison
Index-based sample generation time is proportional to the number of bins (1.10 to 3.98 times slower).
Error calculation time based on bins is much smaller than that based on data (>28 times faster).
[Figure panels: Sample Generation Time, Error Calculation Time]
Total Time based on Resampling Times
Total sampling time:
–Index-based sampling: multi-time error calculations, one-time sampling
–Other sampling methods: multi-time samplings, multi-time error calculations
X axis: resampling times
Speedup of Small Bin: 0.91 – 20.12
Speedup of Sampling over Subset
X axis: data subsetting percentage (100%, 50%, 30%, 10%, 1%)
Y axis: index loading time + sample generation time
Sampling percentage: 25%
Speedup: 1.47 – 4.98 for Spatial Subsetting for value Subsetting
[Figure panels: Subset over Spatial IDs, Subset over Values]
Conclusion
The ‘Big Data’ issue brings challenges for scientific data management.
Data sampling is useful and necessary for data analysis.
–Perform server-side sampling over bitmap indices
–Pre-calculate errors before actually sampling the data
–Combine data sampling with data subsetting
–Achieve good accuracy and efficiency
Thanks