HPDC 2013 Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices Yu Su, Gagan Agrawal, Jonathan Woodring # Kary Myers #, Joanne Wendelberger.

Presentation transcript:

HPDC 2013
Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices
Yu Su*, Gagan Agrawal*, Jonathan Woodring#, Kary Myers#, Joanne Wendelberger#, James Ahrens#
*The Ohio State University, #Los Alamos National Laboratory

Motivation
Science is becoming increasingly data-driven, which creates a strong requirement for efficient data analysis.
Challenges:
– Fast data generation speed
– Slow disk I/O and network speed
– Some numbers from the Roadrunner EC3 simulation: particles, 36 bytes per particle => 2.3 TB/s
– Network bandwidth: 10 GB/s, a 230-fold gap that will grow in the future
Downloading and analyzing the entire dataset is extremely hard.

Server-side Subsetting Methods
[Figure: client-side subsetting vs. server-side subsetting; simple request, advanced request]
Challenges:
– No subsetting request?
– Data subset is still big?

Data Sampling and Challenges
Statistical sampling techniques:
– A subset of individuals represents the whole population
– Examples: simple random, stratified random
Information loss and error metrics:
– Mean, variance, histogram, Q-Q plot
Challenges:
– Sampling accuracy that accounts for data features: value distribution, spatial locality
– Error calculation without high overhead
– Combining data sampling with data subsetting
– Data sampling without data reorganization

Our Solution and Contribution
A server-side subsetting and sampling framework:
– Standard SQL interface
– Bitmap indexing
– Server-side subsetting: dimensions, values
– Server-side sampling
Support for data sampling over bitmap indices:
– Data samples have better accuracy
– Supports error prediction before sampling the data
– Supports data sampling over a flexible data subset
– No data reorganization is needed

Background: Bitmap Indexing
Widely used in scientific data management
Suitable for floating-point values by binning small ranges
Run-length compression (WAH, BBC)
– Compresses a bitvector based on consecutive runs of 0s or 1s
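
The binning and run-length ideas on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_bitmap_index` and `run_length_encode` are hypothetical, the bitvectors are plain Python lists, and the encoder is a simple (bit, run length) scheme rather than the word-aligned WAH or byte-aligned BBC formats named on the slide.

```python
def build_bitmap_index(values, bin_edges):
    """Return one bitvector (list of 0/1) per bin [bin_edges[i], bin_edges[i+1])."""
    n_bins = len(bin_edges) - 1
    bitvectors = [[0] * len(values) for _ in range(n_bins)]
    for pos, v in enumerate(values):
        for b in range(n_bins):
            if bin_edges[b] <= v < bin_edges[b + 1]:
                bitvectors[b][pos] = 1  # position pos holds a value in bin b
                break
    return bitvectors


def run_length_encode(bits):
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs


# Illustrative data (made up for this sketch): three bins of width 1.0.
values = [1.2, 1.7, 2.5, 2.9, 2.1, 3.4, 1.1, 1.3]
index = build_bitmap_index(values, [1.0, 2.0, 3.0, 4.0])
print(index[0])                     # [1, 1, 0, 0, 0, 0, 1, 1]
print(run_length_encode(index[0]))  # [(1, 2), (0, 4), (1, 2)]
```

Each position is set in exactly one bitvector, so the bins together describe the value distribution while the bit positions keep the original data layout.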

System Architecture
– Parse the SQL expression
– Parse the metadata file
– Generate the query request
– Find all bitvectors that satisfy the current query
– Calculate errors based on the bitvectors
– Perform sampling over the bitvectors
– Access the actual dataset

Data Sampling over Bitmap Indices
Features of bitmap indexing:
– Each bin (bitvector) corresponds to one value range
– Together, the bins reflect the entire value distribution
– Each bin preserves the spatial locality of the data
  Contains all space IDs (0-bits and 1-bits)
  Row major, column major, Hilbert curve, Z-order curve
Method:
– Perform stratified random sampling over each bin
– Multi-level indices generate multi-level samples

Stratified Random Sampling over Bins
S1: Index generation
S2: Divide each bitvector into equal strides
S3: Randomly select a certain percentage of the 1s out of each stride
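
Steps S2 and S3 might look like the sketch below for a single uncompressed bitvector. The function name `sample_bitvector` and its parameters are hypothetical, and two choices are assumptions not stated on the slide: the per-stride sample count is rounded with Python's `round`, and every non-empty stride contributes at least one sample.

```python
import random


def sample_bitvector(bits, stride, fraction, rng):
    """Pick roughly `fraction` of the 1-bits in each stride; return their positions."""
    sampled = []
    for start in range(0, len(bits), stride):
        end = min(start + stride, len(bits))
        ones = [i for i in range(start, end) if bits[i]]
        if not ones:
            continue  # nothing to sample in this stride
        # Assumption: keep at least one sample per non-empty stride.
        k = max(1, round(fraction * len(ones)))
        sampled.extend(sorted(rng.sample(ones, k)))
    return sampled


# Illustrative bitvector with three strides of four positions each.
bits = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]
picked = sample_bitvector(bits, stride=4, fraction=0.5, rng=random.Random(0))
print(picked)  # four positions: two from the first stride, two from the last
```

Sampling per stride keeps the samples spread across the spatial layout, and repeating this for every bin spreads them across the value distribution.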

Error Prediction vs. Error Calculation
[Figure: two workflows. Error calculation: sampling request, data sampling, error calculation, repeated multiple times if the sample is not good. Error prediction: predict request, error prediction, error metrics feedback, decide sampling, sampling request, sample.]

Error Prediction
Pre-estimate the error metrics before sampling:
– Bitmap indices classify the data into bins
  Each bin corresponds to one value or value range
  Find representative values for each bin: V_i
– Enforce an equal sampling percentage for each bin
  Extra metadata: number of 1-bits in each bin, C_i
  Compute the number of samples for each bin: S_i
– Pre-calculate the error metrics based on V_i and S_i
Representative values:
– Small bin: mean value
– Big bin: lower bound, upper bound, mean value

Error Prediction Formula
Mean, Variance; Histogram; Q-Q Plot [formulas not preserved in the transcript]
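
The formulas themselves are missing from this transcript, so the sketch below is only one plausible reading of the preceding slide, not the authors' formulas. It assumes the predicted mean and variance of a sample are the S_i-weighted statistics of the representative values V_i, and that the error is their absolute difference from the same statistics weighted by the full bin counts C_i. The names `weighted_mean_var` and `predict_errors` are hypothetical.

```python
def weighted_mean_var(values, weights):
    """Weighted mean and (population) variance of representative bin values."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def predict_errors(rep_values, counts, fraction):
    """Estimate mean/variance error from bin metadata alone, without reading data.

    rep_values: V_i, one representative value per bin
    counts:     C_i, number of 1-bits per bin
    fraction:   sampling percentage applied equally to every bin
    """
    samples = [round(fraction * c) for c in counts]  # S_i (assumed rounding)
    full_mean, full_var = weighted_mean_var(rep_values, counts)
    samp_mean, samp_var = weighted_mean_var(rep_values, samples)
    return abs(samp_mean - full_mean), abs(samp_var - full_var), samples


# Illustrative metadata (made up): three bins, 10% sampling.
mean_err, var_err, s = predict_errors([1.5, 2.5, 3.5], [15, 200, 100], 0.1)
print(s)  # [2, 20, 10]
```

Under this reading the prediction touches only per-bin metadata, which is consistent with the slide's claim that errors can be estimated before any data is sampled. The sketch does not guard against every S_i rounding to zero.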

Data Subsetting + Data Sampling
S1: Find the value subset, e.g. Value = [2, 3)
S2: Find the spatial ID subset, e.g. RID = (9, 25)
S3: Perform stratified sampling on the subset
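
The three steps can be combined in one pass over the index, as in the rough sketch below. It assumes uncompressed list bitvectors, treats the RID bounds as exclusive (the slide writes the range with parentheses), and reuses the same per-stride rule assumed for plain stratified sampling. The function name `subset_then_sample` and its parameters are hypothetical.

```python
import random


def subset_then_sample(bitvectors, bin_ids, rid_lo, rid_hi, stride, fraction, rng):
    """Sample positions whose value falls in the chosen bins and whose
    row ID lies strictly between rid_lo and rid_hi."""
    n = len(bitvectors[0])
    # S1: value subset = bitwise OR of the bitvectors of the selected bins.
    mask = [0] * n
    for b in bin_ids:
        mask = [m | bit for m, bit in zip(mask, bitvectors[b])]
    # S2: spatial subset = clear every bit outside the RID range
    # (assumption: both bounds are exclusive).
    mask = [bit if rid_lo < rid < rid_hi else 0 for rid, bit in enumerate(mask)]
    # S3: stratified sampling over the surviving 1-bits, stride by stride.
    picked = []
    for start in range(0, n, stride):
        ones = [i for i in range(start, min(start + stride, n)) if mask[i]]
        if ones:
            k = max(1, round(fraction * len(ones)))
            picked.extend(sorted(rng.sample(ones, k)))
    return picked


# Illustrative index (made up): bin 0 marks even positions, bin 1 odd positions.
bitvectors = [[1 - i % 2 for i in range(32)], [i % 2 for i in range(32)]]
picked = subset_then_sample(bitvectors, [1], 9, 25, 8, 0.5, random.Random(1))
print(picked)  # odd positions between 9 and 25 only
```

Because both subsetting steps are bit operations on the index, the raw data is read only for the positions that survive sampling, which matches the claim that no data reorganization is needed.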

Experiment Results
Goals:
– Data analysis efficiency with the help of sampling
– Accuracy of different sampling methods
– Predicted error compared with actual error
– Efficiency of different sampling methods
– Speedup from combining data sampling with subsetting
Datasets:
– Ocean data: multi-dimensional arrays
– Cosmos data: separate points with 7 attributes
Environment:
– Darwin cluster: 120 nodes, 48 cores, 64 GB memory

Improve Efficiency of Distributed Data Analysis with Sampling
Pipeline: data sampling on the server side, data transfer between client and server, data visualization on the client side
Dataset: 11.2 GB ocean data
No sampling (100%): zero sampling cost, but huge data transfer and visualization cost
Sampling: much smaller data transfer and visualization cost
100 MB/s network: data sampling achieves a 2.61 – speedup
10 MB/s network: data sampling achieves a 4.82 – total speedup

Sample Accuracy Comparison
Sampling methods:
– Simple Random method
– Stratified Random method
– KDTree Stratified Random method
– Big Bin Index Random method
– Small Bin Index Random method
Error metrics:
– Means over 200 separate sectors
– Histogram using 200 value intervals
– Q-Q plot with 200 quantiles
Sampling percentage: 0.1%

Sample Accuracy Comparison
– Traditional sampling methods cannot achieve good accuracy
– The Small Bin method achieves the best accuracy in most cases
– The Big Bin method achieves accuracy comparable to the KDTree sampling method
[Figure panels: Mean, Histogram, Q-Q Plot]

Predicted Error vs. Actual Error
[Figure panels: Means, Histogram, Q-Q Plot for the Small Bin method; Means, Histogram, Q-Q Plot for the Big Bin method]

Efficiency Comparison
– Index-based sample generation time is proportional to the number of bins (1.10 to 3.98 times slower)
– Error calculation time based on bins is much smaller than that based on data (>28 times faster)
[Figure panels: Sample Generation Time, Error Calculation Time]

Total Time based on Resampling Times
Total sampling time:
– Index-based sampling: multi-time error calculations, one-time sampling
– Other sampling methods: multi-time samplings, multi-time error calculations
X axis: resampling times
Speedup of Small Bin: 0.91 – 20.12

Speedup of Sampling over Subset
X axis: data subsetting percentage (100%, 50%, 30%, 10%, 1%)
Y axis: index loading time + sample generation time
Sampling percentage: 25%
Speedup: 1.47 – 4.98 for spatial subsetting, for value subsetting
[Figure panels: Subset over Spatial IDs, Subset over Values]

Conclusion
– The "Big Data" issue brings challenges for scientific data management
– Data sampling is useful and necessary for data analysis
– Perform server-side sampling over bitmap indices
– Pre-calculate errors before actually sampling the data
– Combine data sampling with data subsetting
– Achieve good accuracy and efficiency

Thanks