Tools and Techniques for Processing and Management of Data
Gagan Agrawal
Data-Intensive and High Performance Computing Research Group
Department of Computer Science and Engineering, The Ohio State University
(Joint work with Wei Jiang, Yu Su, Linchuan Chen, Yi Wang et al.)
Overall Context
- Research group active in "data-intensive" computing since 2000
- Main directions:
  - Data Processing Solutions (a MapReduce-like system built in 2000)
  - Data Management ("database") solutions for scientific computing: the Automatic Data Virtualization approach presented in 2004, recently rediscovered in the DB community as the NoDB approach!
  - Parallel Data Mining algorithms
  - Use of accelerators
Outline
- Data Processing Middleware Solutions: MATE, Ex-MATE, MATE-CG, SciMATE, MapReduce on GPUs
- Data Management Solutions: Automatic Data Virtualization; Indexing as a Service and Services Based on Indexing
The Era of "Big Data"
- When data size becomes a problem, we need easy-to-use tools!
- What other aspects matter? Performance? Analysis and management? Security and privacy?
Motivation
- Growing need for Data-Intensive SuperComputing
  - Performance is the highest priority in HPC!
  - Efficient data processing and high programming productivity
- Emergence of various parallel architectures
  - Traditional CPU clusters (multi-cores)
  - GPU clusters (many-cores)
  - CPU-GPU clusters (heterogeneous systems)
- Given big data, high-end applications, and parallel architectures, we need programming models and middleware support!
Limitations of Current MapReduce
- Performance
- API for specifying various parallel algorithms
- Processing data directly in scientific data formats
- Use of accelerators and emerging architectures
- Support for distributed data stores (including cloud)
Our Middleware Series
- Bridges the gap between parallel architectures and applications
  - Higher programming productivity than MPI
  - Better performance efficiency than MapReduce
- The series: MATE, Ex-MATE, MATE-CG (for CPU-GPU clusters), and FT-MATE (tall oaks grow from little acorns!)
- We could use task re-execution as in MapReduce, but there is a more effective fault-tolerance approach.
The Programming Model
- The generalized reduction model, based on user-declared reduction objects
- Motivated by a set of data mining applications
  - For example, K-Means may have a very large set of data points to process, but only needs to update a small set of centroids (the reduction object!)
- The reduction object forms a compact summary of computational state
  - Helps achieve more efficient fault tolerance and recovery than the replication and job re-execution used in MapReduce
- Avoids large intermediate data
  - Applies updates directly to the reduction object instead of going through Map -> Intermediate Processing -> Reduce (a sketch follows below)
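To make this concrete, here is a minimal C++ sketch (not the actual MATE API; all names are illustrative) of the generalized reduction model for K-Means: every point is folded directly into a small reduction object holding the k centroids, and per-thread or per-node objects are merged by a global combine.

#include <cstddef>
#include <cmath>

const size_t K = 100, DIM = 3;

// The reduction object: a compact summary of computational state,
// small even when the input dataset is huge.
struct ReductionObject {
    double centers[K][DIM];   // current centroids
    double sums[K][DIM];      // accumulated coordinates per cluster
    size_t counts[K];         // points assigned per cluster
};

// Local reduction: fold one point directly into the reduction object;
// no intermediate (key, value) pair is ever materialized.
void reduce(ReductionObject& ro, const double pt[]) {
    size_t best = 0;
    double bestDist = INFINITY;
    for (size_t i = 0; i < K; ++i) {
        double d = 0;
        for (size_t j = 0; j < DIM; ++j) {
            double diff = pt[j] - ro.centers[i][j];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = i; }
    }
    for (size_t j = 0; j < DIM; ++j) ro.sums[best][j] += pt[j];
    ro.counts[best]++;
}

// Global combination: merge per-thread or per-node reduction objects;
// valid because the update is commutative and associative.
void combine(ReductionObject& a, const ReductionObject& b) {
    for (size_t i = 0; i < K; ++i) {
        for (size_t j = 0; j < DIM; ++j) a.sums[i][j] += b.sums[i][j];
        a.counts[i] += b.counts[i];
    }
}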
Comparing Processing Structures
- The reduction object represents the intermediate state of the execution
- The reduce function is commutative and associative
- Sorting, grouping, and shuffling overheads are eliminated by the reduction function/object, but we need a global combination step
- Insight: we could even provide a better implementation of the same MapReduce API (e.g., Turbo MapReduce from Quantcast)!
Experiments Design
- For comparison against Phoenix, we used three data mining applications: K-Means Clustering, Principal Component Analysis (PCA), and Apriori Association Mining
- Also evaluated the single-node performance of Hadoop on K-Means and Apriori; a combine function is used in Hadoop, with careful tuning
- Experiments on two multi-core platforms: 8 cores on one 8-core node (Intel CPU), and 16 cores on one 16-core node (AMD CPU)
Results: Data Mining (I)
K-Means on 8-core and 16-core machines: 400 MB dataset, 3-dimensional points, k = 100
[Charts: average time per iteration (sec) vs. number of threads]
Results: Data Mining (II)
PCA on 8-core and 16-core machines: 8000 x 1024 matrix
[Charts: total time (sec) vs. number of threads]
Extending MATE
- Main issue with the original MATE: it assumes that the reduction object MUST fit in memory
- We extended MATE to address this limitation
- Focus on graph mining, an emerging class of applications
  - Requires large reduction objects as well as large-scale datasets
  - E.g., PageRank could have a 16 GB reduction object!
- Supports managing arbitrary-sized reduction objects: large reduction objects are disk-resident (see the sketch below)
- Evaluated Ex-MATE using PEGASUS, a Hadoop-based graph mining system
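A hypothetical sketch, in the spirit of Ex-MATE but not its actual implementation, of how a disk-resident reduction object can be managed: the object is partitioned into fixed-size splits, and only the split currently being updated is kept in memory. It assumes the backing file has been pre-allocated to the full object size.

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

class DiskReductionObject {
    size_t splitBytes_;        // size of one in-memory split
    std::vector<char> buffer_; // the currently resident split
    long resident_ = -1;       // index of the resident split, -1 if none
    std::string path_;         // backing file holding all splits
public:
    DiskReductionObject(std::string path, size_t splitBytes)
        : splitBytes_(splitBytes), buffer_(splitBytes), path_(std::move(path)) {}

    // Ensure the split containing `offset` is in memory, writing the
    // previous split back to disk first; returns a pointer into the
    // resident split so updates happen in memory at full speed.
    char* access(size_t offset) {
        long split = (long)(offset / splitBytes_);
        if (split != resident_) {
            std::FILE* f = std::fopen(path_.c_str(), "r+b");
            if (resident_ >= 0) {  // write back the old split
                std::fseek(f, (long)(resident_ * splitBytes_), SEEK_SET);
                std::fwrite(buffer_.data(), 1, splitBytes_, f);
            }
            std::fseek(f, (long)(split * splitBytes_), SEEK_SET);
            std::fread(buffer_.data(), 1, splitBytes_, f);  // load new split
            std::fclose(f);
            resident_ = split;
        }
        return buffer_.data() + (offset % splitBytes_);
    }
};

Updates with good locality (e.g., processing a graph partition whose portion of the PageRank vector fits in one split) then touch disk only once per split per iteration.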
Ex-MATE Runtime Overview
[Figure: execution overview of the extended MATE, showing the basic one-stage execution]
Results: Graph Mining (I)
16 GB datasets: Ex-MATE achieves a ~10x speedup
[Charts: average time per iteration (min) vs. number of nodes, for PageRank, HADI, and HCC]
Scalability: Graph Mining (II)
HCC: better scalability with larger datasets
[Charts: average time per iteration (min) vs. number of nodes, for 8 GB, 32 GB, and 64 GB datasets]
MATE for CPU-GPU Clusters
- Still adopts generalized reduction; built on top of MATE and Ex-MATE
- Accelerates data-intensive computations on heterogeneous systems; focus on CPU-GPU clusters
- A multi-level data partitioning
- Proposed a novel auto-tuning framework
  - Exploits the iterative nature of many data-intensive applications
  - Automatically decides the workload distribution between CPUs and GPUs (a sketch of the idea follows below)
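A minimal sketch of how such iterative auto-tuning could work; the actual MATE-CG tuning model may differ. After each iteration, observe how long each device took for its share of the data and re-balance so that both finish together in the next iteration.

// Returns the fraction of each iteration's data to give the CPU next
// time, given the fraction it had and the measured times (seconds).
double retune(double cpuFraction, double cpuTime, double gpuTime) {
    // Per-unit-of-work cost implied by the last iteration.
    double cpuRate = cpuTime / cpuFraction;          // sec per unit on CPU
    double gpuRate = gpuTime / (1.0 - cpuFraction);  // sec per unit on GPU
    // Pick p that equalizes finish times:
    // p * cpuRate == (1 - p) * gpuRate  =>  p = gpuRate / (cpuRate + gpuRate)
    return gpuRate / (cpuRate + gpuRate);
}

Because the applications are iterative, a few rounds of this feedback loop converge to a stable CPU/GPU split without offline profiling.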
MATE-CG Overview
[Figure: execution workflow]
Experiments Design
- Platform: a heterogeneous CPU-GPU cluster; each node has one multi-core CPU and one GPU
  - Intel 8-core CPU
  - NVIDIA Tesla (Fermi) GPU (14 x 32 = 448 cores)
  - Used up to 128 CPU cores and 7168 GPU cores on 16 nodes
- Three representative applications: gridding kernel, EM, and PageRank
- Each application runs in four modes on the cluster: CPU-1, CPU-8, GPU-only, and CPU-8-n-GPU
Results: Scalability with an Increasing Number of GPUs
GPU-only is better than CPU-8
[Charts: average time per iteration (sec) vs. number of nodes, for PageRank, EM, and the gridding kernel, with 16% and 25% improvements annotated]
Scientific Data Analysis Today
- Increasingly data-intensive: volume approximately doubles each year
- Stored in specialized formats: NetCDF, HDF5, ...
- Popularity of MapReduce and its variants: free accessibility, easy programmability, good scalability, built-in fault tolerance
SciMATE Framework
- Extends MATE for scientific data analysis: no data reloading, and no need to know library specifics
- Customizable data format adaptation API: can be adapted to support processing on any (or even a new) scientific data format
- Optimized by access strategies and access patterns
System Overview
Key feature: the scientific data processing module
Scientific Data Processing Module
Integrating a New Data Format
- The data adaptation layer is customizable: insert a third-party adapter
- Open for extension but closed for modification
- An adapter has to implement the generic block loader interface (sketched below):
  - Partitioning and auxiliary functions, e.g., partition, get_dimensionality
  - Full and partial read functions, e.g., full_read, partial_read, partial_read_by_block
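A hedged C++ rendering of what the generic block loader interface might look like; the method names follow the functions listed above, but the exact signatures are illustrative rather than SciMATE's actual API.

#include <cstddef>
#include <vector>

struct Block { size_t offset; size_t length; };

class BlockLoader {
public:
    virtual ~BlockLoader() = default;
    // Partitioning and auxiliary functions.
    virtual std::vector<Block> partition(size_t numPartitions) = 0;
    virtual int get_dimensionality() const = 0;
    // Full and partial read functions.
    virtual void full_read(void* dst) = 0;
    virtual void partial_read(const size_t* lo, const size_t* hi, void* dst) = 0;
    virtual void partial_read_by_block(const Block& b, void* dst) = 0;
};

// A third-party adapter for a new format (say, NetCDF or HDF5) only
// subclasses BlockLoader and maps these calls onto the format's own
// library: open for extension, closed for modification.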
Evaluating Thread Scalability
Data processing times for kNN (on an 8-core node)
GPU MapReduce
- Can our reduction idea benefit an implementation of the original MapReduce? It turns out yes!
- Reduce the (key, value) pair into the reduction object immediately after it is generated by the map function
  - Very suitable for reduction-intensive applications
- A general and efficient MapReduce framework:
  - Dynamic memory allocation within a reduction object
  - Maintaining a memory hierarchy
  - Multi-group mechanism
  - Overflow handling
Main Idea (1)
Traditional MapReduce:

map(input) {
    (key, value) = process(input);
    emit(key, value);
}

// The runtime system groups the key-value pairs by key.

reduce(key, iterator) {
    for each value in iterator
        result = operation(result, value);
    emit(key, result);
}
Main Idea (2)
Reduction-based approach:

map(input) {
    (key, value) = process(input);
    reductionobject->insert(key, value);
}

reduce(value1, value2) {
    value1 = operation(value1, value2);
}

- Reduces the memory overhead of storing key-value pairs
- Makes it possible to effectively utilize shared memory on a GPU
- Eliminates the need for grouping
- Especially suitable for reduction-intensive applications (a concrete sketch follows below)
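To make the contrast concrete, here is a minimal single-node C++ sketch of the reduction-based approach applied to word count; all names (ReductionObject, insert, map) are illustrative, not the framework's actual API. The GPU version additionally keeps hot reduction objects in shared memory, but the control flow is the same.

#include <mutex>
#include <string>
#include <unordered_map>

struct ReductionObject {
    std::unordered_map<std::string, long> table;  // key -> reduced value
    std::mutex lock;                              // guards concurrent inserts

    // insert() reduces the pair immediately: no (key, value) pair is
    // ever stored, so there is nothing to sort, group, or shuffle.
    void insert(const std::string& key, long value) {
        std::lock_guard<std::mutex> g(lock);
        long& slot = table[key];   // created as 0 on first insertion
        slot = slot + value;       // reduce(value1, value2)
    }
};

void map(const std::string& word, ReductionObject& ro) {
    ro.insert(word, 1);  // emit() is replaced by an in-place reduction
}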
Challenges
- Result collection and overflow handling: maintain a memory hierarchy
- Trading off space requirements against locking overhead: a multi-group scheme
- Keeping the framework general and efficient: a well-defined data structure for the reduction object
Memory Hierarchy
[Figure: each GPU thread block maintains reduction objects in its own shared memory; these are merged into a device memory reduction object and a result array in GPU device memory, which are finally copied to CPU host memory]
Reduction Object
[Figure: layout of the reduction object; a memory allocator manages a flat buffer in which KeyIdx[i] and ValIdx[i] entries point to (key size, key data) and (val size, val data) records; a code reconstruction follows below]
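A hedged C++ reconstruction of the layout sketched above; the field names follow the figure, but the details are illustrative. Storing offsets into one flat buffer keeps the object compact and cheap to copy between shared memory, device memory, and the host.

#include <cstdint>

struct ReductionObjectLayout {
    // Index arrays: bucket i's key and value records live at these
    // byte offsets inside `data`.
    uint32_t* KeyIdx;    // KeyIdx[i] -> offset of (key size, key data)
    uint32_t* ValIdx;    // ValIdx[i] -> offset of (val size, val data)
    uint32_t  numBuckets;

    // Flat storage managed by a simple bump-pointer memory allocator,
    // holding [key size][key data][val size][val data] records.
    char*     data;
    uint32_t  used;      // current allocation watermark
    uint32_t  capacity;

    // Dynamic memory allocation within the reduction object; a failed
    // allocation signals overflow, triggering a flush to the next
    // level of the memory hierarchy.
    int alloc(uint32_t bytes) {
        if (used + bytes > capacity) return -1;  // overflow
        uint32_t off = used;
        used += bytes;
        return (int)off;
    }
};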
Comparison with MapCG With reduction-intensive applications
Comparison with MapCG With other applications
Comparison with Ji et al.'s work
Outline
- Data Processing Middleware Solutions: MATE, Ex-MATE, MATE-CG, SciMATE, MapReduce on GPUs
- Data Management Solutions: Automatic Data Virtualization; Indexing as a Service and Services Based on Indexing
Data Management Today

Database Systems:
- High-level query languages
- Indexing support
- Large, complex systems
- Need to load all data into the system
- Cannot handle format changes, etc.

Ad-hoc Solutions:
- Use procedural or scripting languages
- Lack indexing support
- Keep data in its original format
- Light-weight solutions
- Adapt to format changes, etc.
Our Approach
- Automatic Data Virtualization: support a high-level view of array-based data, allow queries assuming such a view, and extract values from the dataset to serve these queries
- Indexing techniques applied to low-level data, integrated with a high-level query system
- Sampling is a critical functionality: integrate it with the data virtualization system, and use an indexing method to sample
Automatic Data Virtualization
- Users develop queries using an SQL-like language (graphical or textual SQL syntax; potential to use graphical interfaces)
- Automatic logical-to-physical mapping from the SQL query to the underlying data format
- Server-side data staging and aggregation of queries: performs spatial subsetting, spatial sampling, and/or aggregation
- Benefits: a simple, well-known interface; datasets are added via simple layout descriptors; I/O optimizations can be implemented automatically (no need for intimate end-user knowledge of I/O protocols)
System Overview (NetCDF)
- Parse the SQL expression
- Parse the metadata file
- Generate the query request
- Index generation if not yet generated; index retrieval after that
Efficiency Comparison with Filtering in ParaView
- Data size: 5.6 GB; input: 400 queries
- Results depend on the subset percentage: the general index method is better than filtering when the data subset is < 60%
- The two-phase optimization achieved a 0.71x to 11.17x speedup compared with the filtering method
- Index m1: bitmap indexing, no optimization
- Index m2: bitwise operations instead of post-filtering (illustrated in the sketch below)
- Index m3: both bitwise operations and index partitioning
- Filter: load all data + filter
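As a toy, self-contained illustration of the m2 idea (bitwise operations instead of post-filtering), the C++ sketch below builds one bitmap per value bin and answers a range predicate by OR-ing the candidate bins' bitmaps, then touching only the set positions; the data values and bin width are invented for the example.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // 8 data values, binned into 4 bins of width 2.5 over [0, 10).
    std::vector<double> temp = {1.2, 9.3, 7.5, 3.4, 8.1, 0.5, 6.9, 7.2};
    const int bins = 4;
    std::vector<uint8_t> bitmap(bins, 0);  // one tiny bitmap per bin
    for (size_t i = 0; i < temp.size(); ++i) {
        int b = (int)(temp[i] / 2.5);
        bitmap[b] |= (uint8_t)(1u << i);   // row i has a value in bin b
    }
    // Predicate TEMP > 7 fully covers bin 3 and partially covers bin 2:
    // OR the candidate bitmaps, then re-check only those rows, instead
    // of scanning and post-filtering the whole column.
    uint8_t candidates = bitmap[2] | bitmap[3];
    for (size_t i = 0; i < temp.size(); ++i)
        if (((candidates >> i) & 1) && temp[i] > 7.0)
            std::printf("row %zu: %.1f\n", i, temp[i]);
    return 0;
}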
Efficiency Comparison with FastQuery
- Data size: 8.4 GB; number of processes: 48; input: 100 queries for each query type
- Achieved a 1.41x to 2.12x speedup compared with FastQuery
Comparison with NCO

Our SQL-based system:
SELECT TEMP FROM "POP.nc" WHERE TEMP>7;

NCO:
./ncap2 -S 'query.nco' POP.nc POP2.nc

query.nco:
TEMP=TEMP;
TEMP.set_miss(-1.0e+34f);
where (TEMP>7.0) elsewhere TEMP=TEMP@_FillValue;

NCO does not support subsetting based on variable values when the variable has multiple dimensions. In the query above, TEMP is a variable with 5 dimensions (latitude, longitude, depth, etc.). To obtain the records with TEMP > 7 using NCO, a script (query.nco) must be provided to the ncap2 command; the script replaces every value that does not satisfy the condition with a fill value, and these records are then ignored when computing further statistics with NCO. Our SQL-based system supports variable-based subsetting directly, as shown above.
NCO Performance Comparison

Query 1: same as the query on the previous slide.

Query 2: SELECT pressure FROM "pressure_19010110_000000.nc" WHERE pressure>=90000 OR pressure<=4000;
NCO:
pressure=pressure;
pressure.set_miss(-999.0);
where(pressure>=90000.0 || pressure <= 4000.0) elsewhere pressure=pressure.get_miss();

Query 3: SELECT geopotential FROM "dataset/GCRM/geopotential_19010110_000000.nc";
NCO:
time ./ncks -O -v geopotential ../../dataset/GCRM/geopotential_19010110_000000.nc temp.nc

Query 4: SELECT geopotential FROM "dataset/GCRM/geopotential_19010110_000000.nc" WHERE (cells>30000 OR cells<=1000) AND interfaces>200 AND time<797600;
NCO:
time ./ncks -O -d cells,30000, -d cells,,1000 -d interfaces,200.0, -d time,797600.0, ../../dataset/GCRM/geopotential_19010110_000000.nc temp.nc

Query 5: SELECT VVEL FROM "dataset/POP/POP.nc" WHERE u_lon<10 OR u_lon>80;
NCO:
time ./ncks -O -d u_lon,,10.0 -d u_lon,80.0, -v VVEL ../../dataset/POP/POP.nc POP2.nc

[Chart: performance by query number and percentage of data retrieved]
Selection Performance Comparison with SciDB
- Queries on a 3D dataset of 5.2 GB, divided into 4 groups based on data coverage (<5%, 5%-10%, 10%-15%, and 15%-20%)
- The ~17,000-second data loading cost for SciDB is not included here
Conclusions
- Many innovative solutions built
- Many ongoing research activities as well
- Much potential for meeting DOE data-intensive science requirements
- Very open to collaborations