Tools and Techniques for Processing (and Management) of Data

Presentation transcript:

Tools and Techniques for Processing (and Management) of Data
Gagan Agrawal
Data-Intensive and High Performance Computing Research Group
Department of Computer Science and Engineering, The Ohio State University
(Joint work with Wei Jiang, Tekin Bicer, Yu Su, Linchuan Chen, Yi Wang, et al.)

Overall Context
Research group active in "data-intensive" computing since 2000
Main directions:
Data Processing Solutions (a MapReduce-like system built in 2000)
Data Management ("database") solutions for scientific computing
Automatic Data Virtualization approach presented in 2004; recently rediscovered in the DB community as the NoDB approach!
Parallel Data Mining algorithms
Use of accelerators

Outline
Data Processing Middleware solutions:
MATE, Ex-MATE, MATE-CG
SciMATE
MATE-EC2, MATE-HC
MapReduce on GPUs
Smark (ongoing)
Data Management solutions:
Automatic Data Virtualization
Indexing as a Service and services based on indexing

The Era of "Big Data"
When data size becomes a problem, we need easy-to-use tools!
What other aspects matter?
Performance?
Analysis and management?
Security and privacy?

Motivation
Growing need for Data-Intensive SuperComputing
Performance is the highest priority in HPC!
Efficient data processing and high programming productivity
Emergence of various parallel architectures:
Traditional CPU clusters (multi-cores)
CPU-GPU clusters (heterogeneous systems)
In-situ analytics
Given big data, high-end applications, and parallel architectures, we need programming models and middleware support!

Limitations of Current MapReduce Implementations
Performance
API for specification of various parallel algorithms
Processing of data directly in scientific data formats
Use of accelerators / emerging architectures
Support for distributed data stores (including cloud)
In-situ analytics

Our Initial Middleware Series
Bridge the gap between parallel architectures and applications:
Higher programming productivity than MPI
Better performance efficiency than MapReduce
[Diagram: middleware series MATE, Ex-MATE, MATE-CG (with GPUs), and FT-MATE]
Tall oaks grow from little acorns! We could use task re-execution as in MapReduce, but there is a more effective fault tolerance approach.

Programming Model
The generalized reduction model:
Based on user-declared reduction objects
Motivated by a set of data mining applications; for example, K-Means may have a very large set of data points to process but only needs to update a small set of centroids (the reduction object!)
The reduction object forms a compact summary of the computational state
Helps achieve more efficient fault tolerance and recovery than replication or job re-execution in MapReduce
Avoids large intermediate data: updates are applied directly to the reduction object instead of going through the Map, Intermediate Processing, and Reduce stages (see the sketch below)
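To make this concrete, here is a minimal sequential C++ sketch of K-Means written as a generalized reduction. The type and function names are illustrative only (this is not the actual MATE API): each point is folded directly into a small reduction object of per-centroid sums and counts, so no per-point intermediate data is ever materialized.

    // Illustrative sketch (not the actual MATE API): K-Means as a generalized
    // reduction over a small, user-declared reduction object.
    #include <cstddef>
    #include <vector>

    struct Point { double x, y, z; };            // 3-dim points, as in the experiments

    struct KMeansRedObj {                        // the compact "reduction object"
        std::vector<Point>       sums;           // per-centroid coordinate sums
        std::vector<std::size_t> counts;         // per-centroid point counts
        explicit KMeansRedObj(std::size_t k)
            : sums(k, Point{0.0, 0.0, 0.0}), counts(k, 0) {}
    };

    // Local reduction: each input point updates the reduction object directly,
    // instead of emitting an intermediate (key, value) pair.
    void accumulate(const Point& p, const std::vector<Point>& centroids,
                    KMeansRedObj& obj) {
        std::size_t best = 0;
        double bestDist = 1e300;
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            double dx = p.x - centroids[c].x, dy = p.y - centroids[c].y,
                   dz = p.z - centroids[c].z;
            double d = dx * dx + dy * dy + dz * dz;
            if (d < bestDist) { bestDist = d; best = c; }
        }
        obj.sums[best].x += p.x; obj.sums[best].y += p.y; obj.sums[best].z += p.z;
        obj.counts[best]++;
    }

    // Global combination: merge per-thread (or per-node) reduction objects.
    void merge(const KMeansRedObj& src, KMeansRedObj& dst) {
        for (std::size_t c = 0; c < dst.sums.size(); ++c) {
            dst.sums[c].x += src.sums[c].x;
            dst.sums[c].y += src.sums[c].y;
            dst.sums[c].z += src.sums[c].z;
            dst.counts[c] += src.counts[c];
        }
    }

In the terminology above, accumulate plays the role of the local reduction and merge the role of the global combination; new centroids are obtained by dividing each per-centroid sum by its count at the end of an iteration.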

Comparing Processing Structures
The reduction object represents the intermediate state of the execution
The reduce function is commutative and associative
Sorting, grouping, and shuffling overheads are eliminated with the reduction function/object, but a global combination step is still needed
Insight: we could even provide a better implementation of the same MapReduce API (e.g., Turbo MapReduce from Quantcast)!

Results: Data Mining (I)
K-Means on 8-core and 16-core machines: 400 MB dataset, 3-dim points, k = 100
[Chart: average time per iteration (sec) vs. number of threads]

Results: Data Mining (II)
PCA on 8-core and 16-core machines: 8000 x 1024 matrix
[Chart: total time (sec) vs. number of threads]

Hybrid "Cloud" Motivation
Properties of cloud technologies:
Elasticity
Pay-as-you-go model
Types of resources:
Computational resources
Storage resources
Hybrid cloud: local resources meet base needs; cloud resources provide additional capacity
Typically, cloud technologies are used to meet the computational demands of data-intensive applications. However, these applications can also easily exhaust the available storage. In this situation, local resources can be used to meet base needs, and additional resource demands can be satisfied from cloud services.

MATE-HC: Map-reduce with an AlternaTE API over a Hybrid Cloud
Transparent data access and analysis:
Metadata generation
Programmability of large-scale applications: a variant of MapReduce
MATE-HC features:
Selective job assignment, considering data locality and different data objects
Multithreaded remote data retrieval
Asynchronous informed prefetching and caching
[Diagram: reduction object with LocalReduction (Map) and GlobalReduction (Reduce)]

Middleware for Hybrid Cloud
[Diagram: middleware architecture with job assignment, global reduction, and remote data analysis]

Scientific Data Analysis Today
Increasingly data-intensive: volume approximately doubles each year
Stored in specialized formats such as NetCDF, HDF5, ...
Popularity of MapReduce and its variants:
Free accessibility
Easy programmability
Good scalability
Built-in fault tolerance

SciMATE Framework
Extends MATE for scientific data analysis
No data reloading and no need to know library specifics
Customizable data format adaptation API: can be adapted to support processing on any (or even a new) scientific data format (see the sketch below)
Optimized by access strategies and access patterns
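The adaptation API itself is not shown on these slides, so the following is only a hypothetical C++ sketch of what a customizable format-adaptation layer could look like; all names (FormatAdapter, NetCDFAdapter, HDF5Adapter, Chunk) are invented for this illustration and are not the SciMATE interface. The point is that each format implements one chunk-reading interface, so the analysis code never has to know library specifics.

    // Hypothetical sketch of a pluggable format-adaptation layer; the class and
    // method names are invented for illustration and are not the SciMATE API.
    #include <cstddef>
    #include <string>
    #include <vector>

    struct Chunk {                          // a contiguous block handed to the reduction
        std::vector<double> values;
        std::size_t         offset = 0;     // position of the chunk within the dataset
    };

    class FormatAdapter {                   // one implementation per scientific format
    public:
        virtual ~FormatAdapter() = default;
        virtual bool open(const std::string& path, const std::string& variable) = 0;
        virtual bool nextChunk(Chunk& out, std::size_t maxElems) = 0;
    };

    class NetCDFAdapter : public FormatAdapter {   // would wrap the NetCDF library
    public:
        bool open(const std::string&, const std::string&) override { return true; }
        bool nextChunk(Chunk&, std::size_t) override { return false; }
    };

    class HDF5Adapter : public FormatAdapter {     // would wrap the HDF5 library
    public:
        bool open(const std::string&, const std::string&) override { return true; }
        bool nextChunk(Chunk&, std::size_t) override { return false; }
    };

    // The analysis loop is format-agnostic: supporting a new format only means
    // adding a new adapter, with no change to the reduction code.
    template <typename RedObj, typename AccumFn>
    void process(FormatAdapter& in, RedObj& obj, AccumFn accumulate) {
        Chunk chunk;
        while (in.nextChunk(chunk, /*maxElems=*/1 << 20))
            for (double v : chunk.values) accumulate(v, obj);
    }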

System Overview
Key feature: the scientific data processing module

Scientific Data Processing Module
[Diagram: structure of the scientific data processing module]

GPU MapReduce
Can our reduction idea benefit implementations of the original MapReduce? It turns out yes!
Reduce each (key, value) pair into the reduction object immediately after it is generated by the map function
Very suitable for reduction-intensive applications
A general and efficient MapReduce framework:
Dynamic memory allocation within a reduction object
Maintaining a memory hierarchy
Multi-group mechanism
Overflow handling
Before stepping in deeper, let me first give some background on MapReduce and the GPU architecture.

Main Idea (1)
Traditional MapReduce:

    map(input) {
        (key, value) = process(input);
        emit(key, value);
    }

    // grouping of the key-value pairs (by the runtime system)

    reduce(key, iterator) {
        for each value in iterator
            result = operation(result, value);
        emit(key, result);
    }

Main Idea (2)
Reduction-based approach:

    map(input) {
        (key, value) = process(input);
        reduction_object->insert(key, value);
    }

    reduce(value1, value2) {
        value1 = operation(value1, value2);
    }

Reduces the memory overhead of storing key-value pairs
Makes it possible to effectively utilize shared memory on a GPU
Eliminates the need for grouping
Especially suitable for reduction-intensive applications (see the sketch below)
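As a concrete single-threaded illustration of the reduction-based approach (this is not the GPU implementation), the sketch below uses a plain hash map as a stand-in for the reduction object in a word-count kernel; the framework's shared-memory placement, multi-group mechanism, and overflow handling are all omitted.

    // Minimal CPU-side sketch of the reduction-based idea: each (key, value)
    // pair is folded into the reduction object as soon as it is produced, so
    // pairs are never stored or grouped.
    #include <sstream>
    #include <string>
    #include <unordered_map>

    using ReductionObject = std::unordered_map<std::string, long>;

    // reduce(value1, value2): for word count the reduction is just addition.
    long reduce(long v1, long v2) { return v1 + v2; }

    // map(input): emit() is replaced by an immediate insert into the object.
    void map(const std::string& line, ReductionObject& obj) {
        std::istringstream in(line);
        std::string word;
        while (in >> word) {
            auto [it, inserted] = obj.try_emplace(word, 1L);
            if (!inserted) it->second = reduce(it->second, 1L);  // in-place reduction
        }
    }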

Comparison with MapCG
With reduction-intensive applications

Comparison with MapCG
With other applications

Smark: In-Situ Design
Motivation: move computation instead of data
Relating the in-situ, distributed, and shared-memory systems:
Distributed System = Partitioning + Shared-Memory System + Combination
In-Situ System = Shared-Memory System + Combination = Distributed System - Partitioning
This is because the input data is automatically partitioned based on the simulation code.

Ease of Use
Co-locate the simulation code and the Smark code
Define a reduction object
Implement a Smark scheduler class with three methods:

    string gen_key(const Chunk& chunk)
    void accumulate(const Chunk& chunk, RedObj& red_obj)
    void merge(const RedObj& red_obj, RedObj& com_obj)

Only 3 extra lines at the end of the simulation code:

    /* After the simulation code generates a float-type array "data" */
    SchedArgs args(num_threads, chunk_size, extra_data, num_iterations);
    unique_ptr<Scheduler<float>> smark(new DerivedClass<float>(args));
    smark->run(data, length);

The reduction object is transparent to the user, and all parallelization complexity is hidden behind a sequential view.
Future work: implement a "gen_keys" function (similar to Spark's flatMap) to allow a data chunk to generate multiple keys; this is essential for window-based applications.
An end-to-end sketch of a derived scheduler follows below.
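For concreteness, here is an end-to-end sketch in the spirit of this interface. The Chunk, RedObj, SchedArgs, and Scheduler skeletons below are only stand-ins for the real Smark headers, and GlobalSum is an invented example kernel that reduces a float array to a single sum.

    // Illustrative sketch only: the skeletons below stand in for the real Smark
    // headers; GlobalSum is an invented example analytics kernel.
    #include <cstddef>
    #include <string>

    struct Chunk { const float* data; std::size_t length; };   // stand-in chunk type
    struct RedObj { virtual ~RedObj() = default; };             // stand-in reduction object base
    struct SumRedObj : RedObj { double sum = 0.0; };

    struct SchedArgs {                                          // stand-in scheduler arguments
        int num_threads; std::size_t chunk_size; void* extra_data; int num_iterations;
    };

    template <typename T>
    class Scheduler {                                           // stand-in scheduler base
    public:
        explicit Scheduler(const SchedArgs& args) : args_(args) {}
        virtual ~Scheduler() = default;
        void run(const T* data, std::size_t length) {
            // Would partition "data" into chunks and invoke the three user
            // methods from multiple threads; omitted in this sketch.
        }
    protected:
        virtual std::string gen_key(const Chunk& chunk) = 0;
        virtual void accumulate(const Chunk& chunk, RedObj& red_obj) = 0;
        virtual void merge(const RedObj& red_obj, RedObj& com_obj) = 0;
        SchedArgs args_;
    };

    // The part the user actually writes: three short methods.
    template <typename T>
    class GlobalSum : public Scheduler<T> {
    public:
        using Scheduler<T>::Scheduler;
    protected:
        std::string gen_key(const Chunk&) override { return "all"; }    // single key
        void accumulate(const Chunk& c, RedObj& r) override {
            auto& s = static_cast<SumRedObj&>(r);
            for (std::size_t i = 0; i < c.length; ++i) s.sum += c.data[i];
        }
        void merge(const RedObj& from, RedObj& into) override {
            static_cast<SumRedObj&>(into).sum += static_cast<const SumRedObj&>(from).sum;
        }
    };

Driver code would then mirror the three lines shown above: construct a SchedArgs, create a GlobalSum<float> through a unique_ptr<Scheduler<float>>, and call run(data, length).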

Smark vs. Spark
Experimental setup: K-Means on an 8-core node, 10 GB of data
Why not a distributed environment? There is a mismatch between simulation code and Spark:
Scientific simulation code is written in C/C++ with MPI
Spark supports Scala/Java/Python in a sequential view
Two options to bridge the gap for Spark:
MPI_Gather() plus a shared-memory buffer between the C/C++ and Java programs (inefficient)
Re-implement existing simulation programs in Java/Scala (impractical)
Another important limitation of Spark: an RDD does not naturally expose positional information (array subscripts) to the user, so it cannot be used for many scientific applications, e.g., structural aggregations and even many scientific simulations.

Smark vs. Spark (Cont’d) # of worker threads Simulation Computation Total Data Loading 1 25.90 203.20 229.05 37.83 744.82 11646.94 12460.99 2 25.70 106.10 131.80 45.92 326.58 7299.78 7691.45 4 52.45 78.15 40.97 210.88 5871.96 6147.92 8 25.65 26.35 52.00 36.70 179.72 4906.88 5139.48 RDD transformation can be quite expensive before any data reduction. E.g., map/flatmap and groupby actually may not reduce the size of the input RDD at all, but result in another restructured dataset. In contrast, all Smark operations are carried out in place of reduction objects, Conclusions Smark scales much better than Spark at least in shared-memory environment Spark launches extra threads for other tasks, e.g., communication and driver’s UI Spark still requires a data loading phase to convert simulation data into RDD Smark outperforms Spark by 54x – 99x in total processing time Each Spark transformation makes a new RDD due to its immutability Spark serializes RDD and send them through network even in local mode

RAMSES Project
Possibilities:
Can provide representative workflows of many types
GPU-based data processing?
Wide-area processing across clusters
In-situ analytics
Precursor work:
Modeled the memory hierarchy for reduction-style computations (SIGMETRICS 2002)
Modeled wide-area data processing (FREERIDE-G: precursor to MATE-HC)