Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

Slides:

Advertisements

Similar presentations

LIBRA: Lightweight Data Skew Mitigation in MapReduce

Advertisements

Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.

Low-Cost Data Deduplication for Virtual Machine Backup in Cloud Storage Wei Zhang, Tao Yang, Gautham Narayanasamy University of California at Santa Barbara.

Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions.

1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman

MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering.

Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Background: MapReduce and FREERIDE Co-clustering on FREERIDE Experimental.

Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Summary of Contributions Background: MapReduce and FREERIDE Wavelet.

1 SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices Gangyi Zhu, Yi Wang, Gagan Agrawal The Ohio State University.

Cluster-based SNP Calling on Large Scale Genome Sequencing Data Mucahid KutluGagan Agrawal Department of Computer Science and Engineering The Ohio State.

Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio.

HPDC 2014 Supporting Correlation Analysis on Scientific Datasets in Parallel and Distributed Settings Yu Su*, Gagan Agrawal*, Jonathan Woodring # Ayan.

CCGrid 2014 Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression Tekin Bicer, Jian Yin and Gagan Agrawal Ohio State.

Oral Exam 2013 An Virtualization based Data Management Framework for Big Data Applications Yu Su Advisor: Dr. Gagan Agrawal, The Ohio State University.

Science Problem: Cognitive capacity (human/scientist understanding), storage and I/O have not kept up with our capacity to generate massive amounts physics-based.

Light-Weight Data Management Solutions for Scientific Datasets Gagan Agrawal, Yu Su Ohio State Jonathan Woodring, LANL.

ICPP 2012 Indexing and Parallel Query Processing Support for Visualizing Climate Datasets Yu Su*, Gagan Agrawal*, Jonathan Woodring † *The Ohio State University.

A Novel Approach for Approximate Aggregations Over Arrays SSDBM 2015 June 29 th, San Diego, California 1 Yi Wang, Yu Su, Gagan Agrawal The Ohio State University.

VIRTUAL MEMORY By Thi Nguyen. Motivation  In early time, the main memory was not large enough to store and execute complex program as higher level languages.

HPDC 2013 Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices Yu Su*, Gagan Agrawal*, Jonathan Woodring # Kary Myers #, Joanne Wendelberger.

CCGrid 2014 Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression Tekin Bicer, Jian Yin and Gagan Agrawal Ohio State.

SC 2013 SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol Yu Su*, Yi Wang*, Gagan Agrawal*, Rajkumar Kettimuthu.

Click to edit Master subtitle style 2/23/10 Time and Space Optimization of Document Content Classifiers Dawei Yin, Henry S. Baird, and Chang An Computer.

SAGA: Array Storage as a DB with Support for Structural Aggregations SSDBM 2014 June 30 th, Aalborg, Denmark 1 Yi Wang, Arnab Nandi, Gagan Agrawal The.

SUPPORTING SQL QUERIES FOR SUBSETTING LARGE- SCALE DATASETS IN PARAVIEW SC’11 UltraVis Workshop, November 13, 2011 Yu Su*, Gagan Agrawal*, Jon Woodring†

CCGrid, 2012 Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets Yu Su and Gagan Agrawal Department of Computer Science and.

Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.

GEM: A Framework for Developing Shared- Memory Parallel GEnomic Applications on Memory Constrained Architectures Mucahid Kutlu Gagan Agrawal Department.

PDAC-10 Middleware Solutions for Data- Intensive (Scientific) Computing on Clouds Gagan Agrawal Ohio State University (Joint Work with Tekin Bicer, David.

Research in In-Situ Data Analytics Gagan Agrawal The Ohio State University (Joint work with Yi Wang, Yu Su, and others)

Ohio State University Department of Computer Science and Engineering Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng,

Chapter 7 Memory Management Eighth Edition William Stallings Operating Systems: Internals and Design Principles.

MERmaid: Distributed de novo Assembler Richard Xia, Albert Kim, Jarrod Chapman, Dan Rokhsar.

LIOProf: Exposing Lustre File System Behavior for I/O Middleware

Computer Science and Engineering Parallelizing Feature Mining Using FREERIDE Leonid Glimcher P. 1 ipdps’04 Scaling and Parallelizing a Scientific Feature.

Dense-Region Based Compact Data Cube

University of Maryland Baltimore County

Curator: Self-Managing Storage for Enterprise Clusters

Pagerank and Betweenness centrality on Big Taxi Trajectory Graph

Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics

So far we have covered … Basic visualization algorithms

Genomic Data Clustering on FPGAs for Compression

Swapping Segmented paging allows us to have non-contiguous allocations

Fast Approximate Query Answering over Sensor Data with Deterministic Error Guarantees Chunbin Lin Joint with Etienne Boursier, Jacque Brito, Yannis Katsis,

Data Compression.

CSCE 990: Advanced Distributed Systems

Sameh Shohdy, Yu Su, and Gagan Agrawal

Year 2 Updates.

Chapter 15 QUERY EXECUTION.

湖南大学-信息科学与工程学院-计算机与科学系

Linchuan Chen, Peng Jiang and Gagan Agrawal

Communication and Memory Efficient Parallel Decision Tree Construction

KISS-Tree: Smart Latch-Free In-Memory Indexing on Modern Architectures

Optimizing MapReduce for GPUs with Effective Shared Memory Usage

Scalable Parallel Interoperable Data Analytics Library

Efficient Distribution-based Feature Search in Multi-field Datasets Ohio State University (Shen) Problem: How to efficiently search for distribution-based.

Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform

Support for ”interactive batch”

Gagan Agrawal The Ohio State University

Data-Intensive Computing: From Clouds to GPU Clusters

Bin Ren, Gagan Agrawal, Brad Chamberlain, Steve Deitz

Peng Jiang, Linchuan Chen, and Gagan Agrawal

1/15/2019 Big Data Management Framework based on Virtualization and Bitmap Data Summarization Yu Su Department of Computer Science and Engineering The.

Feifei Li, Ching Chang, George Kollios, Azer Bestavros

Fast and Exact K-Means Clustering

CENG 351 Data Management and File Structures

Emulator of Cosmological Simulation for Initial Parameters Study

Extreme-Scale Distribution-Based Data Analysis

Supporting Online Analytics with User-Defined Estimation and Early Termination in a MapReduce-Like Framework Yi Wang, Linchuan Chen, Gagan Agrawal The.

L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher

Presentation transcript:

Yu Su, Yi Wang, Gagan Agrawal The Ohio State University In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

Motivation – HPC Trends 11/14/2018 Motivation – HPC Trends Huge performance gap CPU: extremely fast for generating data Disk, Network: very slow to store or transfer data Memory: not large enough to hold data 11/14/2018

In-Situ Analysis – What and Why 11/14/2018 In-Situ Analysis – What and Why Process of transforming data at run time Analysis Classification Reduction Visualization In-Situ has the promise of Saving more information dense data Saving I/O or network transfer time Saving disk space Saving time in analysis Before I introduce our method, let’s first look at what is in-situ analysis, and why do we need in-situ analysis. By definition, in-situ analysis is the process of transforming data at run-time. Each time after the data is collected or simulated in memory, we do several types of transforms before the data is written to disk or transffered through the network.

Key Questions How insights can be obtained in-situ? 11/14/2018 Key Questions How do we decide what data to save? This analysis cannot take too much time/memory Simulations already consume most available memory Scientists cannot accept much slowdown for analytics How insights can be obtained in-situ? Must be memory and time efficient What representation to use for data stored in disks? Effective analysis/visualization Disk/Network Efficient Before I introduce our method, let’s first look at what is in-situ analysis, and why do we need in-situ analysis. By definition, in-situ analysis is the process of transforming data at run-time. Each time after the data is collected or simulated in memory, we do several types of transforms before the data is written to disk or transffered through the network.

Quick Answers How insights can be obtained in-situ? 11/14/2018 Quick Answers How do we decide what data to save? Use Bitmaps! How insights can be obtained in-situ? Use Bitmaps!! What representation to use for data stored on disks? Bitmaps!!! Before I introduce our method, let’s first look at what is in-situ analysis, and why do we need in-situ analysis. By definition, in-situ analysis is the process of transforming data at run-time. Each time after the data is collected or simulated in memory, we do several types of transforms before the data is written to disk or transffered through the network.

Specific Issues Bitmaps as data summarization In-Situ Data Reduction 11/14/2018 Specific Issues Bitmaps as data summarization Utilize extra computer power for data reduction Save memory usage, disk I/O and network transfer time In-Situ Data Reduction In-Situ generate bitmaps Bitmaps generation is time-consuming Bitmaps before compression has big memory cost In-Situ Data Analysis Time steps selection Can bitmaps support time step selection? Efficiency of time step selection using bitmaps Offline Analysis: Only keep bitmaps instead of data Types of analysis supported by bitmaps Before I introduce our method, let’s first look at what is in-situ analysis, and why do we need in-situ analysis. By definition, in-situ analysis is the process of transforming data at run-time. Each time after the data is collected or simulated in memory, we do several types of transforms before the data is written to disk or transffered through the network.

Background: Bitmaps Widely used in scientific data management Suitable for floating value by binning small ranges Run Length Compression (WAH, BBC) Bitmaps can be treated as a small profile of the data

In-Situ Bitmaps Generation

In-Situ Bitmaps Generation Parallel index generation Save the data loading cost Multi-Core based index generation Core allocation strategies Shared Cores Allocate all cores to simulation and bitmaps generation Executed in sequence Separate Cores Allocate different core sets to simulation and bitmaps generation A data queue is shared between simulation and bitmaps generation Executed in parallel In-place bitvector compression Scan data by segments Merge segment into compressed bitvectors

Time-Steps Selection

Correlation Metrics Earth Mover’s Distance: Shannon’s Entropy: Indicate distance between two probability distributions over a region Cost of changing value distributions of data Shannon’s Entropy: A metric to show the variability of the dataset High entropy => more random distributed data Mutual Information: A metric for computing the dependence between two variables Low M => two variables are relatively independent Conditional Entropy: Self-contained information Information with respect to others

Calculate Earth Mover’s Distance Using Bitmaps Divide Ti and Tj into bins over value subsets Generate a CFP based on value differences between bins of Ti and Tj Accumulate results

Correlation Mining Using Bitmaps Automatically suggest data subsets with high correlations Correlation Analysis: keep submitting queries Traditional Method Exhaustive calculation over data subsets (spatial and value) Huge time and memory cost Correlation mining using bitmap Mutual Information Calculated by probability distribution (value subsets) A top-down method for value subsets Multi-level bitmap indexing Go to low-level index only if high-level has high mutual info A bottom-up method for spatial subsets Divide bitvectors (with high correlations) into basic strides Perform 1-bits count operation over strides

Correlation Mining

Experiment Results Goals: Simulations: Heat3D, Lulesh Efficiency and storage improvement using bitmaps Scalability in parallel in-situ environment Efficiency improvement for correlation mining Efficiency and accuracy comparison with sampling Simulations: Heat3D, Lulesh Datasets: Parallel Ocean Program Environment: 32 Intel Xeon x5650 CPUs and 1TB memory MIC: 60 Intel Xeon Phi coprocessors and 8GB memory OSC Oakley Cluster: 32 nodes with 12 Intel Xeon x5650 CPUs and 48 GB memory

Efficiency Comparison for In-Situ Analysis - CPU Full Data (original): Simulation: bad scalability Time Step Selection: big Data Writing: big and bad scalability Bitmaps: Simulation: utilize extra computing power for bitmaps generation Extra bitmaps generation time but good scalability Time Step Selection Using Bitmaps: 1.38x to 1.5x Bitmaps Writing: 6.78x Overall: 0.79x to 2.38x More number of cores, better speedup we can achieve Simulation: Heat3D; Processor: CPU Time steps: select 25 over 100 time steps 6.4 GB per time step (800*1000*1000) Metrics: Conditional Entropy

Efficiency Comparison for In-Situ Analysis - MIC More cores Lower bandwidth Full Data (original): Huge data writing time Bitmaps: Good scalability of both bitmaps generation and time step selection using bitmaps Much smaller data writing time Overall: 0.81x to 3.28x Simulation: Heat3D; Processor: MIC Time steps: select 25 over 100 time steps 1.6 GB per time step (200*1000*1000) Metrics: Conditional Entropy

Memory Cost of In-Situ Analysis Heat3D - No Indexing: 12 time steps (pre, temp, cur) Heat3D - Bitmap Indexing: 2 time steps (pre, temp) 1 previous selected indices 10 current indices Lulesh – No Indexing: 11 time steps (pre, cur) Huge extra memory for edges Lulesh – Bitmap Indexing: 1 time step (pre) 2.0x to 3.59x smaller memory Better as bigger data simulated and more time steps to hold Simulation: Heat3D, Lulesh Processor: CPU, MIC Keep 10 time steps in memory

Scalability in Parallel Environment Simulation: Heat3D Full Data– Local: Each node write its data subblock into its own disk Bitmaps– Local: Each node writes its bitmaps subblock into its own disk Fast time step selection and local writing 1.24x – 1.29x speedup Full Data– Remote: Different nodes send data sub-blocks to a master node Bitmaps – Remote: Greatly alleviate data transfer burden of master node 1.24x – 3.79x speedup Select 25 time steps out of 100 TEMP Variable: 6.4 GB per time step Number of nodes: 1 to 32 Number of cores: 8

Speedup for Correlation Mining Simulation: POP Full Data: Big data loading cost Exhaustive calculations over data subsets Each calculation is time consuming Bitmaps: Smaller data loading Multi-level bitmaps to improve the mining process Bitwise AND and 1-bits count operations to improve the calculation efficiency 3.81x – 4.92x speedup Variables: TEMP, SALT Data size per variable: 1.4 GB to 11.2 GB Number of cores: 1

In-Situ Sampling vs. Bitmaps Heat3D ,100 time steps (6.4 GB), 32 cores Bitmaps generation (binning, compression) has more time cost then down-sampling Sampling can effectively improve the time step selection cost Bitmaps generation can still achieve better efficiency if the index size is smaller than sample size Bitmaps: using the same binning scale, does not have any information loss Sampling: information loss is unavoidable no matter what sample% 30% - 21.03% loss 15% - 37.56% loss 5% - 58.37% loss

Conclusion ‘Big Data’ issue brings challenges for scientific data management Efficient in-situ bitmaps generation Efficient online data analysis (time step selection) using only bitmaps Efficient offline data analysis (correlation mining) using only bitmaps Compare in-situ data sampling with in-situ bitmaps