Tools and Techniques for Processing (and Management) of Data

1 Tools and Techniques for Processing (and Management) of Data
Gagan Agrawal
Data-Intensive and High Performance Computing Research Group
Department of Computer Science and Engineering, The Ohio State University
(Joint work with Wei Jiang, Tekin Bicer, Yu Su, Linchuan Chen, Yi Wang, et al.)

2 Overall Context
Research group active in "data-intensive" computing since 2000. Main directions:
Data processing solutions (a MapReduce-like system built in 2000)
Data management ("database") solutions for scientific computing
Automatic data virtualization: an approach presented in 2004, recently re-discovered in the DB community as the NoDB approach!
Parallel data mining algorithms
Use of accelerators

3 Outline
Data processing middleware solutions:
MATE, Ex-MATE, MATE-CG
SciMATE
MATE-EC, MATE-HC
MapReduce on GPU
Smark (ongoing)
Data management solutions:
Automatic data virtualization
Indexing as a service, and services based on indexing

4 The Era of "Big Data"
When data size becomes a problem, we need easy-to-use tools! What other aspects matter? Performance? Analysis and management? Security and privacy?

5 Motivation
Growing need for data-intensive supercomputing: performance is the highest priority in HPC! We want both efficient data processing and high programming productivity.
Emergence of various parallel architectures: traditional CPU clusters (multi-cores), CPU-GPU clusters (heterogeneous systems), and in-situ analytics.
Given big data, high-end applications, and parallel architectures, we need programming models and middleware support!

6 Limitations of Current MapReduce Implementations
Performance
An API for specifying various parallel algorithms
Processing of data directly in scientific data formats
Use of accelerators and emerging architectures
Support for distributed data stores (including the cloud)
In-situ analytics

7 Our Initial Middleware Series
Bridge the gap between the parallel architectures and the applications: higher programming productivity than MPI, and better performance efficiency than MapReduce.
(Diagram: the evolution of the middleware series, from MATE to Ex-MATE, MATE-CG, and FT-MATE, each with GPU support. Tall oaks grow from little acorns!)
We could use task re-execution as in MapReduce, but there is a more effective fault-tolerance approach.

8 Programming Model
The generalized reduction model is based on user-declared reduction objects, and was motivated by a set of data mining applications. For example, K-Means may have a very large set of data points to process, but only needs to update a small set of centroids (the reduction object!).
The reduction object forms a compact summary of computational states, which helps achieve more efficient fault tolerance and recovery than the replication or job re-execution used in MapReduce.
It also avoids large intermediate data: updates are applied directly to the reduction object instead of going through Map, intermediate processing, and Reduce (see the sketch below).
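As a minimal sketch of this model applied to K-Means (the names ReductionObject, local_reduce, and global_combine are illustrative assumptions, not the actual MATE API):

// A minimal sketch of the generalized reduction model for K-Means;
// all names here are illustrative assumptions, not the MATE API.
#include <cfloat>
#include <cstddef>
#include <vector>

// The reduction object: per-cluster running sums and counts.
// It stays tiny no matter how large the input is.
struct ReductionObject {
    std::vector<std::vector<double>> sums;  // k x dim
    std::vector<std::size_t> counts;        // k
};

// Local reduction: each point updates the reduction object in place,
// so no (key, value) pairs are materialized and no shuffle is needed.
void local_reduce(const std::vector<double>& pt,
                  const std::vector<std::vector<double>>& centroids,
                  ReductionObject& obj) {
    std::size_t best = 0;
    double best_d = DBL_MAX;
    for (std::size_t i = 0; i < centroids.size(); ++i) {
        double d = 0.0;
        for (std::size_t j = 0; j < pt.size(); ++j) {
            double diff = pt[j] - centroids[i][j];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = i; }
    }
    for (std::size_t j = 0; j < pt.size(); ++j) obj.sums[best][j] += pt[j];
    obj.counts[best] += 1;
}

// Global combination: merging two reduction objects is cheap because
// the update (element-wise addition) is commutative and associative.
void global_combine(const ReductionObject& src, ReductionObject& dst) {
    for (std::size_t i = 0; i < src.counts.size(); ++i) {
        for (std::size_t j = 0; j < src.sums[i].size(); ++j)
            dst.sums[i][j] += src.sums[i][j];
        dst.counts[i] += src.counts[i];
    }
}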

9 Comparing Processing Structures
The reduction object represents the intermediate state of the execution, and the reduction function is commutative and associative. Sorting, grouping, and shuffling overheads are eliminated by the reduction function and object, but we need a global combination step.
Insight: we could even provide a better implementation of the same MapReduce API this way, e.g., Turbo MapReduce from Quantcast!

10 Results: Data Mining (I)
K-Means on 8-core and 16-core machines: 400 MB dataset, 3-dimensional points, k = 100.
(Chart: average time per iteration in seconds versus the number of threads.)

11 Results: Data Mining (II)
PCA on 8-core and 16-core machines: 8000 x 1024 matrix.
(Chart: total time in seconds versus the number of threads.)

12 Hybrid "Cloud" Motivation
Properties of cloud technologies: elasticity and a pay-as-you-go model. Types of resources: computational resources and storage resources.
Hybrid cloud: local resources cover the base load, cloud resources cover the additional load.
Typically, cloud technologies are used to meet the computational demands of data-intensive applications. However, these applications can also easily exhaust the available storage. In this situation, local resources can be used to meet base needs, and additional resource demands can be satisfied from cloud services.

13 MATE-HC: MapReduce with AlternaTE API over Hybrid Cloud
Transparent data access and analysis: metadata generation, and programmability of large-scale applications through a variant of MapReduce.
MATE-HC: selective job assignment, with consideration of data locality and the different data objects; multithreaded remote data retrieval; asynchronous informed prefetching and caching (sketched below).
Reduction object: LocalReduction (Map) and GlobalReduction (Reduce).
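As a minimal sketch of the asynchronous prefetching idea, a retrieval thread fills a bounded queue while compute threads consume from it; every name below (PrefetchQueue, fetch_remote_chunk, ...) is hypothetical, not MATE-HC's real interface:

// A minimal sketch of asynchronous prefetching for remote data;
// all names are hypothetical, not MATE-HC's actual interface.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

using Chunk = std::vector<char>;

// A bounded, thread-safe queue: the retrieval thread pushes chunks,
// compute threads pop them; the bound caps memory for in-flight data.
class PrefetchQueue {
public:
    void push(Chunk c) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < kMax; });
        q_.push_back(std::move(c));
        not_empty_.notify_one();
    }
    Chunk pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        Chunk c = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return c;
    }
private:
    static constexpr std::size_t kMax = 8;  // cap on buffered chunks
    std::deque<Chunk> q_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};

// The retrieval thread downloads the chunks assigned to this node while
// compute threads run local reduction on already-buffered chunks, so
// remote I/O latency is overlapped with computation. Launched as, e.g.,
// std::thread t(retrieval_thread, std::ref(queue), std::cref(chunk_ids));
void retrieval_thread(PrefetchQueue& queue, const std::vector<int>& chunk_ids) {
    for (int id : chunk_ids) {
        Chunk c; /* c = fetch_remote_chunk(id); -- hypothetical fetch */
        queue.push(std::move(c));
    }
}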

14 Middleware for Hybrid Cloud
(Architecture diagram: job assignment and global reduction across local and cloud resources, with remote data analysis.)

15 Scientific Data Analysis Today
Increasingly data-intensive: volume approximately doubles each year, and data is stored in specialized formats such as NetCDF and HDF5.
Popularity of MapReduce and its variants: free accessibility, easy programmability, good scalability, and built-in fault tolerance.

16 SciMATE Framework
SciMATE extends MATE for scientific data analysis with a customizable data format adaptation API: it can be adapted to support processing on any (or even a new) scientific data format, with no data reloading and no need to know library specifics. It is optimized through access strategies and access patterns (see the sketch below).
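A sketch of what such a pluggable format-adaptation layer can look like; the interface and all names below are assumptions in the spirit of SciMATE, not its real API:

// A sketch of a pluggable format-adaptation layer in the spirit of
// SciMATE; the interface and all names are assumptions, not its API.
#include <cstddef>
#include <string>
#include <vector>

// One adapter per scientific format (NetCDF, HDF5, ...). Analysis code
// only ever sees flat buffers of doubles, never format specifics.
class FormatAdapter {
public:
    virtual ~FormatAdapter() = default;
    virtual void open(const std::string& path) = 0;
    // Read `count` elements of variable `var` starting at `offset` into
    // `buf`; mapping this onto the on-disk layout is the adapter's job,
    // and the place where access strategies and patterns are applied.
    virtual std::size_t read_chunk(const std::string& var, std::size_t offset,
                                   std::size_t count, std::vector<double>& buf) = 0;
};

// Supporting a new format means writing one subclass; analysis kernels
// are untouched and data is never reloaded into an intermediate format.
class NetCDFAdapter : public FormatAdapter {
public:
    void open(const std::string& path) override {
        /* nc_open(path.c_str(), NC_NOWRITE, &ncid_) would go here */
    }
    std::size_t read_chunk(const std::string& var, std::size_t offset,
                           std::size_t count, std::vector<double>& buf) override {
        buf.resize(count);
        /* nc_inq_varid + nc_get_vara_double would fill buf here */
        return count;
    }
private:
    int ncid_ = -1;
};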

17 System Overview
Key feature: the scientific data processing module.

18 Scientific Data Processing Module

19 GPU MapReduce
Can our reduction idea benefit an implementation of the original MapReduce? It turns out yes! We reduce each (key, value) pair into the reduction object immediately after it is generated by the map function, which is very suitable for reduction-intensive applications.
This yields a general and efficient MapReduce framework: dynamic memory allocation within a reduction object, maintenance of a memory hierarchy, a multi-group mechanism, and overflow handling.
Before stepping deeper, let me first cover the background on MapReduce and the GPU architecture.

20 Main Idea (1)
Traditional MapReduce:

map(input) {
    (key, value) = process(input);
    emit(key, value);
}

// The runtime system groups the key-value pairs by key.

reduce(key, iterator) {
    for each value in iterator
        result = operation(result, value);
    emit(key, result);
}

21 Main Idea (2)
Reduction-based approach:

map(input) {
    (key, value) = process(input);
    reductionobject->insert(key, value);
}

reduce(value1, value2) {
    value1 = operation(value1, value2);
}

This reduces the memory overhead of storing key-value pairs, makes it possible to effectively utilize shared memory on a GPU, and eliminates the need for grouping. It is especially suitable for reduction-intensive applications (see the word-count sketch below).
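As a tiny word-count illustration of this approach; the class below is a host-side stand-in with illustrative names, not the framework's GPU implementation:

// Word count under the reduction-based API; this is a host-side
// stand-in with illustrative names, not the GPU implementation.
#include <string>
#include <unordered_map>

// Values with the same key are combined at insertion time, so the
// framework never stores or groups raw (key, value) pairs.
class ReductionObject {
public:
    void insert(const std::string& key, int value) {
        auto it = counts_.find(key);
        if (it == counts_.end()) counts_.emplace(key, value);
        else it->second = reduce(it->second, value);  // user's reduce()
    }
private:
    // The user-supplied reduce: must be commutative and associative.
    static int reduce(int v1, int v2) { return v1 + v2; }
    std::unordered_map<std::string, int> counts_;
};

// map() has the same shape as the traditional version, except the
// emit goes straight into the reduction object.
void map_word(const std::string& word, ReductionObject& obj) {
    obj.insert(word, 1);
}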

22 Comparison with MapCG: reduction-intensive applications

23 Comparison with MapCG: other applications

24 Smark: In-Situ Design
Motivation: move computation instead of data.
Distributed System = Partitioning + Shared-Memory System + Combination
In-Situ System = Shared-Memory System + Combination = Distributed System - Partitioning
The partitioning step disappears because the input data is already partitioned by the simulation code.

25 Ease of Use
Co-locate simulation code and Smark code. The user defines a reduction object and implements a Smark scheduler class:

string gen_key(const Chunk& chunk)
void accumulate(const Chunk& chunk, RedObj& red_obj)
void merge(const RedObj& red_obj, RedObj& com_obj)

Only 3 extra lines are needed at the end of the simulation code:

/* After simulation code generates a float-type array "data" */
SchedArgs args(num_threads, chunk_size, extra_data, num_iterations);
unique_ptr<Scheduler<float>> smark(new DerivedClass<float>(args));
smark->run(data, length);

The reduction object is transparent to the user, and all parallelization complexity is hidden behind a sequential view (a sketch of a derived scheduler follows below).
Future work: implement a "gen_keys" function (similar to Spark's flatMap) to allow a data chunk to generate multiple keys; this is essential for window-based applications.
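To make the three hooks concrete, here is a sketch of a derived scheduler that computes a global sum over the simulation array. The Chunk, RedObj, and Scheduler stand-ins are minimal assumptions so the sketch is self-contained; Smark's real headers (including the SchedArgs-based constructor above) differ:

// A sketch of a derived Smark scheduler that computes a global sum.
// The stand-ins below are assumptions for self-containment; Smark's
// real Chunk/RedObj/Scheduler definitions differ.
#include <cstddef>
#include <string>

struct Chunk { const void* data; std::size_t size; };
struct RedObj { virtual ~RedObj() = default; };

template <class T>
class Scheduler {
public:
    virtual ~Scheduler() = default;
    virtual std::string gen_key(const Chunk& chunk) = 0;
    virtual void accumulate(const Chunk& chunk, RedObj& red_obj) = 0;
    virtual void merge(const RedObj& red_obj, RedObj& com_obj) = 0;
};

// The reduction object: a single running sum.
struct SumObj : RedObj { double sum = 0.0; };

template <class T>
class GlobalSum : public Scheduler<T> {
public:
    // Every chunk maps to one key, i.e., one global reduction object.
    std::string gen_key(const Chunk& chunk) override { return "sum"; }

    // Fold one chunk of the simulation array into the reduction object.
    void accumulate(const Chunk& chunk, RedObj& red_obj) override {
        auto& obj = static_cast<SumObj&>(red_obj);
        const T* data = static_cast<const T*>(chunk.data);
        for (std::size_t i = 0; i < chunk.size; ++i) obj.sum += data[i];
    }

    // Combine a per-thread reduction object into the combined one.
    void merge(const RedObj& red_obj, RedObj& com_obj) override {
        static_cast<SumObj&>(com_obj).sum +=
            static_cast<const SumObj&>(red_obj).sum;
    }
};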

26 Smark vs. Spark: Experimental Setup
K-Means on an 8-core node, with 10 GB of data.
Why not in a distributed environment? There is a mismatch between simulation code and Spark: scientific simulation code is written in C/C++ with MPI, while Spark supports Scala/Java/Python with a sequential view. There are two options to bridge the gap for Spark: MPI_Gather() plus a shared-memory buffer between the C/C++ and Java programs (inefficient), or re-implementing existing simulation programs in Java/Scala (impractical).
Don't forget another important limitation of Spark: an RDD does not naturally expose positional information (array subscripts) to the user. Thus, it cannot be used for many scientific applications, e.g., structural aggregations and even many scientific simulations.

27 Smark vs. Spark (Cont'd)

# of worker threads   Smark: Simulation   Smark: Computation   Smark: Total   Spark: Data Loading   Spark: Total
1                     25.90               203.20               229.05         37.83                 744.82
2                     25.70               106.10               131.80         45.92                 326.58
4                                         52.45                78.15          40.97                 210.88
8                     25.65               26.35                52.00          36.70                 179.72

RDD transformations can be quite expensive before any data reduction. E.g., map/flatMap and groupBy may not reduce the size of the input RDD at all, but instead produce another restructured dataset. In contrast, all Smark operations are carried out in place on reduction objects.
Conclusions:
Smark scales much better than Spark, at least in a shared-memory environment. Spark launches extra threads for other tasks, e.g., communication and the driver's UI, and it still requires a data loading phase to convert simulation data into an RDD.
Smark outperforms Spark by 54x-99x in total processing time. Each Spark transformation makes a new RDD due to immutability, and Spark serializes RDDs and sends them through the network even in local mode.

28 RAMSES Project Possibilities
Can provide representative workflows of many types: GPU-based data processing, wide-area processing across clusters, and in-situ analytics.
Precursor work: modeled the memory hierarchy for reduction-style computations (SIGMETRICS 2002); modeled wide-area data processing (FREERIDE-G, the precursor to MATE-HC).

