Tools and Techniques for Processing and Management of Data
Gagan Agrawal
Data-Intensive and High Performance Computing Research Group
Department of Computer Science and Engineering, The Ohio State University
(Joint work with Wei Jiang, Yu Su, Linchuan Chen, Yi Wang et al.)
Overall Context
- Research group active in "data-intensive" computing since 2000
- Main directions:
  - Data Processing Solutions (a MapReduce-like system built in 2000)
  - Data Management ("database") solutions for scientific computing: the Automatic Data Virtualization approach presented in 2004, recently rediscovered in the DB community as the NoDB approach!
  - Parallel Data Mining algorithms
  - Use of accelerators
Outline
- Data Processing Middleware Solutions: MATE, Ex-MATE, MATE-CG, SciMATE, MapReduce on GPUs
- Data Management Solutions: Automatic Data Virtualization; Indexing as a Service and Services Based on Indexing
The Era of "Big Data"
- When data size becomes a problem, we need easy-to-use tools!
- What other aspects matter? Performance? Analysis and management? Security and privacy?
Motivation
- Growing need for Data-Intensive SuperComputing
  - Performance is the highest priority in HPC!
  - Efficient data processing and high programming productivity
- Emergence of various parallel architectures
  - Traditional CPU clusters (multi-cores)
  - GPU clusters (many-cores)
  - CPU-GPU clusters (heterogeneous systems)
- Given big data, high-end applications, and parallel architectures, we need programming models and middleware support!
Limitations of Current MapReduce
- Performance
- API for specifying various parallel algorithms
- Processing data directly in scientific data formats
- Use of accelerators and emerging architectures
- Support for distributed data stores (including cloud)
Our Middleware Series
- Bridges the gap between parallel architectures and applications
  - Higher programming productivity than MPI
  - Better performance efficiency than MapReduce
- The series: MATE, Ex-MATE, MATE-CG (for CPU-GPU clusters), and FT-MATE (tall oaks grow from little acorns!)
- We could use task re-execution as in MapReduce, but there is a more effective fault-tolerance approach.
The Programming Model
- The generalized reduction model, based on user-declared reduction objects
- Motivated by a set of data mining applications
  - For example, K-Means may have a very large set of data points to process, but only needs to update a small set of centroids (the reduction object!)
- The reduction object forms a compact summary of computational state
  - Helps achieve more efficient fault tolerance and recovery than the replication and job re-execution used in MapReduce
- Avoids large intermediate data
  - Applies updates directly to the reduction object instead of going through Map -> Intermediate Processing -> Reduce (a sketch follows below)
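To make this concrete, here is a minimal C++ sketch (not the actual MATE API; all names are illustrative) of the generalized reduction model for K-Means: every point is folded directly into a small reduction object holding the k centroids, and per-thread or per-node objects are merged by a global combine.

#include <cstddef>
#include <cmath>

const size_t K = 100, DIM = 3;

// The reduction object: a compact summary of computational state,
// small even when the input dataset is huge.
struct ReductionObject {
    double centers[K][DIM];   // current centroids
    double sums[K][DIM];      // accumulated coordinates per cluster
    size_t counts[K];         // points assigned per cluster
};

// Local reduction: fold one point directly into the reduction object;
// no intermediate (key, value) pair is ever materialized.
void reduce(ReductionObject& ro, const double pt[]) {
    size_t best = 0;
    double bestDist = INFINITY;
    for (size_t i = 0; i < K; ++i) {
        double d = 0;
        for (size_t j = 0; j < DIM; ++j) {
            double diff = pt[j] - ro.centers[i][j];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = i; }
    }
    for (size_t j = 0; j < DIM; ++j) ro.sums[best][j] += pt[j];
    ro.counts[best]++;
}

// Global combination: merge per-thread or per-node reduction objects;
// valid because the update is commutative and associative.
void combine(ReductionObject& a, const ReductionObject& b) {
    for (size_t i = 0; i < K; ++i) {
        for (size_t j = 0; j < DIM; ++j) a.sums[i][j] += b.sums[i][j];
        a.counts[i] += b.counts[i];
    }
}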
Comparing Processing Structures
- The reduction object represents the intermediate state of the execution
- The reduce function is commutative and associative
- Sorting, grouping, and shuffling overheads are eliminated by the reduction function/object, but we need a global combination step
- Insight: we could even provide a better implementation of the same MapReduce API (e.g., Turbo MapReduce from Quantcast)!
Experiments Design
- For comparison against Phoenix, we used three data mining applications: K-Means Clustering, Principal Component Analysis (PCA), and Apriori Association Mining
- Also evaluated the single-node performance of Hadoop on K-Means and Apriori; a combine function is used in Hadoop, with careful tuning
- Experiments on two multi-core platforms: 8 cores on one 8-core node (Intel CPU), and 16 cores on one 16-core node (AMD CPU)
Results: Data Mining (I)
K-Means on 8-core and 16-core machines: 400 MB dataset, 3-dimensional points, k = 100
[Charts: average time per iteration (sec) vs. number of threads]
Results: Data Mining (II)
PCA on 8-core and 16-core machines: 8000 x 1024 matrix
[Charts: total time (sec) vs. number of threads]
Extending MATE
- Main issue with the original MATE: it assumes that the reduction object MUST fit in memory
- We extended MATE to address this limitation
- Focus on graph mining, an emerging class of applications
  - Requires large reduction objects as well as large-scale datasets
  - E.g., PageRank could have a 16 GB reduction object!
- Supports managing arbitrary-sized reduction objects: large reduction objects are disk-resident (see the sketch below)
- Evaluated Ex-MATE using PEGASUS, a Hadoop-based graph mining system
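A hypothetical sketch, in the spirit of Ex-MATE but not its actual implementation, of how a disk-resident reduction object can be managed: the object is partitioned into fixed-size splits, and only the split currently being updated is kept in memory. It assumes the backing file has been pre-allocated to the full object size.

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

class DiskReductionObject {
    size_t splitBytes_;        // size of one in-memory split
    std::vector<char> buffer_; // the currently resident split
    long resident_ = -1;       // index of the resident split, -1 if none
    std::string path_;         // backing file holding all splits
public:
    DiskReductionObject(std::string path, size_t splitBytes)
        : splitBytes_(splitBytes), buffer_(splitBytes), path_(std::move(path)) {}

    // Ensure the split containing `offset` is in memory, writing the
    // previous split back to disk first; returns a pointer into the
    // resident split so updates happen in memory at full speed.
    char* access(size_t offset) {
        long split = (long)(offset / splitBytes_);
        if (split != resident_) {
            std::FILE* f = std::fopen(path_.c_str(), "r+b");
            if (resident_ >= 0) {  // write back the old split
                std::fseek(f, (long)(resident_ * splitBytes_), SEEK_SET);
                std::fwrite(buffer_.data(), 1, splitBytes_, f);
            }
            std::fseek(f, (long)(split * splitBytes_), SEEK_SET);
            std::fread(buffer_.data(), 1, splitBytes_, f);  // load new split
            std::fclose(f);
            resident_ = split;
        }
        return buffer_.data() + (offset % splitBytes_);
    }
};

Updates with good locality (e.g., processing a graph partition whose portion of the PageRank vector fits in one split) then touch disk only once per split per iteration.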
Ex-MATE Runtime Overview
[Figure: execution overview of the extended MATE, showing the basic one-stage execution]
Results: Graph Mining (I)
16 GB datasets: Ex-MATE achieves a ~10x speedup
[Charts: average time per iteration (min) vs. number of nodes, for PageRank, HADI, and HCC]
Scalability: Graph Mining (II)
HCC: better scalability with larger datasets
[Charts: average time per iteration (min) vs. number of nodes, for 8 GB, 32 GB, and 64 GB datasets]
MATE for CPU-GPU Clusters
- Still adopts generalized reduction; built on top of MATE and Ex-MATE
- Accelerates data-intensive computations on heterogeneous systems; focus on CPU-GPU clusters
- A multi-level data partitioning
- Proposed a novel auto-tuning framework
  - Exploits the iterative nature of many data-intensive applications
  - Automatically decides the workload distribution between CPUs and GPUs (a sketch of the idea follows below)
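A minimal sketch of how such iterative auto-tuning could work; the actual MATE-CG tuning model may differ. After each iteration, observe how long each device took for its share of the data and re-balance so that both finish together in the next iteration.

// Returns the fraction of each iteration's data to give the CPU next
// time, given the fraction it had and the measured times (seconds).
double retune(double cpuFraction, double cpuTime, double gpuTime) {
    // Per-unit-of-work cost implied by the last iteration.
    double cpuRate = cpuTime / cpuFraction;          // sec per unit on CPU
    double gpuRate = gpuTime / (1.0 - cpuFraction);  // sec per unit on GPU
    // Pick p that equalizes finish times:
    // p * cpuRate == (1 - p) * gpuRate  =>  p = gpuRate / (cpuRate + gpuRate)
    return gpuRate / (cpuRate + gpuRate);
}

Because the applications are iterative, a few rounds of this feedback loop converge to a stable CPU/GPU split without offline profiling.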
MATE-CG Overview
[Figure: execution workflow]
Experiments Design
- Platform: a heterogeneous CPU-GPU cluster; each node has one multi-core CPU and one GPU
  - Intel 8-core CPU
  - NVIDIA Tesla (Fermi) GPU (14 x 32 = 448 cores)
  - Used up to 128 CPU cores and 7168 GPU cores on 16 nodes
- Three representative applications: gridding kernel, EM, and PageRank
- Each application runs in four modes on the cluster: CPU-1, CPU-8, GPU-only, and CPU-8-n-GPU
Results: Scalability with an Increasing Number of GPUs
GPU-only is better than CPU-8
[Charts: average time per iteration (sec) vs. number of nodes, for PageRank, EM, and the gridding kernel, with 16% and 25% improvements annotated]
Scientific Data Analysis Today
- Increasingly data-intensive: volume approximately doubles each year
- Stored in specialized formats: NetCDF, HDF5, ...
- Popularity of MapReduce and its variants: free accessibility, easy programmability, good scalability, built-in fault tolerance
SciMATE Framework
- Extends MATE for scientific data analysis: no data reloading, and no need to know library specifics
- Customizable data format adaptation API: can be adapted to support processing on any (or even a new) scientific data format
- Optimized by access strategies and access patterns
System Overview
Key feature: the scientific data processing module
Scientific Data Processing Module
Integrating a New Data Format
- The data adaptation layer is customizable: insert a third-party adapter
- Open for extension but closed for modification
- An adapter has to implement the generic block loader interface (sketched below):
  - Partitioning and auxiliary functions, e.g., partition, get_dimensionality
  - Full and partial read functions, e.g., full_read, partial_read, partial_read_by_block
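A hedged C++ rendering of what the generic block loader interface might look like; the method names follow the functions listed above, but the exact signatures are illustrative rather than SciMATE's actual API.

#include <cstddef>
#include <vector>

struct Block { size_t offset; size_t length; };

class BlockLoader {
public:
    virtual ~BlockLoader() = default;
    // Partitioning and auxiliary functions.
    virtual std::vector<Block> partition(size_t numPartitions) = 0;
    virtual int get_dimensionality() const = 0;
    // Full and partial read functions.
    virtual void full_read(void* dst) = 0;
    virtual void partial_read(const size_t* lo, const size_t* hi, void* dst) = 0;
    virtual void partial_read_by_block(const Block& b, void* dst) = 0;
};

// A third-party adapter for a new format (say, NetCDF or HDF5) only
// subclasses BlockLoader and maps these calls onto the format's own
// library: open for extension, closed for modification.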
Evaluating Thread Scalability
Data processing times for kNN (on an 8-core node)
GPU MapReduce
- Can our reduction idea benefit an implementation of the original MapReduce? It turns out yes!
- Reduce the (key, value) pair into the reduction object immediately after it is generated by the map function
  - Very suitable for reduction-intensive applications
- A general and efficient MapReduce framework:
  - Dynamic memory allocation within a reduction object
  - Maintaining a memory hierarchy
  - Multi-group mechanism
  - Overflow handling
Main Idea (1)
Traditional MapReduce:

map(input) {
    (key, value) = process(input);
    emit(key, value);
}

// The runtime system groups the key-value pairs by key.

reduce(key, iterator) {
    for each value in iterator
        result = operation(result, value);
    emit(key, result);
}
Main Idea (2)
Reduction-based approach:

map(input) {
    (key, value) = process(input);
    reductionobject->insert(key, value);
}

reduce(value1, value2) {
    value1 = operation(value1, value2);
}

- Reduces the memory overhead of storing key-value pairs
- Makes it possible to effectively utilize shared memory on a GPU
- Eliminates the need for grouping
- Especially suitable for reduction-intensive applications (a concrete sketch follows below)
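To make the contrast concrete, here is a minimal single-node C++ sketch of the reduction-based approach applied to word count; all names (ReductionObject, insert, map) are illustrative, not the framework's actual API. The GPU version additionally keeps hot reduction objects in shared memory, but the control flow is the same.

#include <mutex>
#include <string>
#include <unordered_map>

struct ReductionObject {
    std::unordered_map<std::string, long> table;  // key -> reduced value
    std::mutex lock;                              // guards concurrent inserts

    // insert() reduces the pair immediately: no (key, value) pair is
    // ever stored, so there is nothing to sort, group, or shuffle.
    void insert(const std::string& key, long value) {
        std::lock_guard<std::mutex> g(lock);
        long& slot = table[key];   // created as 0 on first insertion
        slot = slot + value;       // reduce(value1, value2)
    }
};

void map(const std::string& word, ReductionObject& ro) {
    ro.insert(word, 1);  // emit() is replaced by an in-place reduction
}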
Challenges
- Result collection and overflow handling: maintain a memory hierarchy
- Trading off space requirements against locking overhead: a multi-group scheme
- Keeping the framework general and efficient: a well-defined data structure for the reduction object
Memory Hierarchy
[Figure: each GPU thread block maintains reduction objects in its own shared memory; these are merged into a device memory reduction object and a result array in GPU device memory, which are finally copied to CPU host memory]
Reduction Object
[Figure: layout of the reduction object; a memory allocator manages a flat buffer in which KeyIdx[i] and ValIdx[i] entries point to (key size, key data) and (val size, val data) records; a code reconstruction follows below]
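A hedged C++ reconstruction of the layout sketched above; the field names follow the figure, but the details are illustrative. Storing offsets into one flat buffer keeps the object compact and cheap to copy between shared memory, device memory, and the host.

#include <cstdint>

struct ReductionObjectLayout {
    // Index arrays: bucket i's key and value records live at these
    // byte offsets inside `data`.
    uint32_t* KeyIdx;    // KeyIdx[i] -> offset of (key size, key data)
    uint32_t* ValIdx;    // ValIdx[i] -> offset of (val size, val data)
    uint32_t  numBuckets;

    // Flat storage managed by a simple bump-pointer memory allocator,
    // holding [key size][key data][val size][val data] records.
    char*     data;
    uint32_t  used;      // current allocation watermark
    uint32_t  capacity;

    // Dynamic memory allocation within the reduction object; a failed
    // allocation signals overflow, triggering a flush to the next
    // level of the memory hierarchy.
    int alloc(uint32_t bytes) {
        if (used + bytes > capacity) return -1;  // overflow
        uint32_t off = used;
        used += bytes;
        return (int)off;
    }
};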
Comparison with MapCG With reduction-intensive applications
Comparison with MapCG With other applications
Comparison with Ji et al.'s work
Outline
- Data Processing Middleware Solutions: MATE, Ex-MATE, MATE-CG, SciMATE, MapReduce on GPUs
- Data Management Solutions: Automatic Data Virtualization; Indexing as a Service and Services Based on Indexing
Data Management Today

Database Systems:
- High-level query languages
- Indexing support
- Large, complex systems
- Need to load all data into the system
- Cannot handle format changes, etc.

Ad-hoc Solutions:
- Use procedural or scripting languages
- Lack indexing support
- Keep data in its original format
- Light-weight solutions
- Adapt to format changes, etc.
Our Approach
- Automatic Data Virtualization: support a high-level view of array-based data, allow queries assuming such a view, and extract values from the dataset to serve these queries
- Indexing techniques applied to low-level data, integrated with a high-level query system
- Sampling is a critical functionality: integrate it with the data virtualization system, and use an indexing method to sample
Automatic Data Virtualization
- Users develop queries using an SQL-like language (graphical or textual SQL syntax; potential to use graphical interfaces)
- Automatic logical-to-physical mapping from the SQL query to the underlying data format
- Server-side data staging and aggregation of queries: performs spatial subsetting, spatial sampling, and/or aggregation
- Benefits: a simple, well-known interface; datasets are added via simple layout descriptors; I/O optimizations can be implemented automatically (no need for intimate end-user knowledge of I/O protocols)
System Overview (NetCDF)
- Parse the SQL expression
- Parse the metadata file
- Generate the query request
- Index generation if not yet generated; index retrieval after that
Efficiency Comparison with Filtering in ParaView
- Data size: 5.6 GB; input: 400 queries
- Results depend on the subset percentage: the general index method is better than filtering when the data subset is < 60%
- The two-phase optimization achieved a 0.71x to 11.17x speedup compared with the filtering method
- Index m1: bitmap indexing, no optimization
- Index m2: bitwise operations instead of post-filtering (illustrated in the sketch below)
- Index m3: both bitwise operations and index partitioning
- Filter: load all data + filter
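As a toy, self-contained illustration of the m2 idea (bitwise operations instead of post-filtering), the C++ sketch below builds one bitmap per value bin and answers a range predicate by OR-ing the candidate bins' bitmaps, then touching only the set positions; the data values and bin width are invented for the example.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // 8 data values, binned into 4 bins of width 2.5 over [0, 10).
    std::vector<double> temp = {1.2, 9.3, 7.5, 3.4, 8.1, 0.5, 6.9, 7.2};
    const int bins = 4;
    std::vector<uint8_t> bitmap(bins, 0);  // one tiny bitmap per bin
    for (size_t i = 0; i < temp.size(); ++i) {
        int b = (int)(temp[i] / 2.5);
        bitmap[b] |= (uint8_t)(1u << i);   // row i has a value in bin b
    }
    // Predicate TEMP > 7 fully covers bin 3 and partially covers bin 2:
    // OR the candidate bitmaps, then re-check only those rows, instead
    // of scanning and post-filtering the whole column.
    uint8_t candidates = bitmap[2] | bitmap[3];
    for (size_t i = 0; i < temp.size(); ++i)
        if (((candidates >> i) & 1) && temp[i] > 7.0)
            std::printf("row %zu: %.1f\n", i, temp[i]);
    return 0;
}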
Efficiency Comparison with FastQuery
- Data size: 8.4 GB; number of processes: 48; input: 100 queries for each query type
- Achieved a 1.41x to 2.12x speedup compared with FastQuery
Comparison with NCO

Our SQL-based system:
SELECT TEMP FROM "POP.nc" WHERE TEMP>7;

NCO:
./ncap2 -S 'query.nco' POP.nc POP2.nc

query.nco:
TEMP=TEMP;
TEMP.set_miss(-1.0e+34f);
where (TEMP>7.0) elsewhere TEMP=TEMP@_FillValue;

NCO does not support subsetting based on variable values when the variable has multiple dimensions. In the query above, TEMP is a variable with 5 dimensions (latitude, longitude, depth, etc.). To obtain the records with TEMP > 7 using NCO, a script (query.nco) must be provided to the ncap2 command; the script replaces every value that does not satisfy the condition with a fill value, and these records are then ignored when computing further statistics with NCO. Our SQL-based system supports variable-based subsetting directly, as shown above.
NCO Performance Comparison

Query 1: same as the query on the previous slide.

Query 2: SELECT pressure FROM "pressure_19010110_000000.nc" WHERE pressure>=90000 OR pressure<=4000;
NCO:
pressure=pressure;
pressure.set_miss(-999.0);
where(pressure>=90000.0 || pressure <= 4000.0) elsewhere pressure=pressure.get_miss();

Query 3: SELECT geopotential FROM "dataset/GCRM/geopotential_19010110_000000.nc";
NCO:
time ./ncks -O -v geopotential ../../dataset/GCRM/geopotential_19010110_000000.nc temp.nc

Query 4: SELECT geopotential FROM "dataset/GCRM/geopotential_19010110_000000.nc" WHERE (cells>30000 OR cells<=1000) AND interfaces>200 AND time<797600;
NCO:
time ./ncks -O -d cells,30000, -d cells,,1000 -d interfaces,200.0, -d time,797600.0, ../../dataset/GCRM/geopotential_19010110_000000.nc temp.nc

Query 5: SELECT VVEL FROM "dataset/POP/POP.nc" WHERE u_lon<10 OR u_lon>80;
NCO:
time ./ncks -O -d u_lon,,10.0 -d u_lon,80.0, -v VVEL ../../dataset/POP/POP.nc POP2.nc

[Chart: performance by query number and percentage of data retrieved]
Selection Performance Comparison with SciDB
- Queries on a 3D dataset of 5.2 GB, divided into 4 groups based on data coverage (<5%, 5%-10%, 10%-15%, and 15%-20%)
- The ~17,000-second data loading cost for SciDB is not included here
Conclusions
- Many innovative solutions built
- Many ongoing research activities as well
- Much potential for meeting DOE data-intensive science requirements
- Very open to collaborations