Slide 1: Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments
Tekin Bicer
Department of Computer Science and Engineering, The Ohio State University
Advisor: Prof. Gagan Agrawal
OSU – CSE 2014
Slide 2: Introduction
Scientific simulations and instruments
–X-ray Photon Correlation Spectroscopy: CCD detector produces 120 MB/s today, projected to reach 10 GB/s by 2015
–Global Cloud Resolving Model: 1 PB for a 4 km grid cell
Analysis is performed on local clusters
–Not sufficient
Problems: data analysis, storage, I/O performance
Cloud technologies can help
Slide 3: Hybrid Cloud Motivation
Properties of cloud technologies
–Elasticity
–Pay-as-you-go model
Types of resources
–Computational resources
–Storage resources
Hybrid cloud
–Local resources: base
–Cloud resources: additional
Slide 4: Usage of Hybrid Cloud
[Diagram: a data source feeds local storage and local nodes; cloud storage and cloud compute nodes provide additional capacity]
Slide 5: Challenges
Data-intensive processing
–Transparent data access and analysis
–Meeting user constraints
Minimizing storage and I/O cost
–Domain-specific compression
–Parallel I/O with compression
Corresponding contributions:
–MATE-HC: Map-reduce with AlternaTE API
–Dynamic Resource Allocation Framework with Cloud Bursting
–Compression Methodology and System for Large-Scale Applications
Slide 6: MATE-HC: Map-reduce with AlternaTE API over Hybrid Cloud
Transparent data access and analysis
–Metadata generation
Programmability of large-scale applications
–Variant of MapReduce
MATE-HC
–Selective job assignment, considering data locality and different data objects
–Multithreaded remote data retrieval
–Asynchronous informed prefetching and caching
Slide 7: MATE vs. MapReduce Processing Structure
The reduction object represents the intermediate state of the execution
The reduce function is commutative and associative
Sorting and grouping overheads are eliminated by the reduction function/object
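The reduction-object idea can be illustrated with a toy aggregation. The sketch below is not MATE's actual API; the names and the word-count-style example are purely illustrative of why a commutative, associative reduction removes the sort/group/shuffle phase of classic MapReduce.

```python
# Illustrative sketch of reduction-object style processing (not MATE's actual API).
# Each element updates a reduction object in place; because the reduction is
# commutative and associative, per-node/per-thread objects can be merged in any
# order, so no sorting or grouping of intermediate (key, value) pairs is needed.

from collections import defaultdict

def local_reduction(elements):
    robj = defaultdict(int)          # reduction object: intermediate state
    for key, value in elements:
        robj[key] += value           # accumulate directly, no intermediate pair lists
    return robj

def global_reduction(robjs):
    final = defaultdict(int)
    for robj in robjs:               # merge per-partition reduction objects
        for key, value in robj.items():
            final[key] += value
    return final

# Example: two data partitions processed independently, then combined.
parts = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4)]]
print(dict(global_reduction([local_reduction(p) for p in parts])))  # {'a': 4, 'b': 6}
```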
Slide 8: Middleware for Hybrid Cloud
[Diagram: remote data analysis, job assignment, and global reduction across the local and cloud clusters]
Slide 9: Dynamic Resource Allocation for Cloud Bursting
Competition for local resources complicates applications with deadlines
–Cloud resources can be utilized
Utilization of cloud resources incurs cost
Dynamic Resource Allocation Framework
–A model for capturing "time" and "cost" constraints with cloud bursting
Cloud bursting
–In-house resources handle the base workload
–Cloud resources adapt to performance requirements
Slide 10: Resource Allocation Framework
Estimate the time required for local cluster processing
Estimate the time required for cloud cluster processing
All variables can be profiled during execution, except the estimated number of stolen jobs
Estimate the number of jobs remaining after the local jobs are consumed
Ratio of local computational throughput in the system
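The slide names the quantities in the model but not the equations. The sketch below is a hypothetical back-of-the-envelope version of such a feedback estimate, not the formulation from the thesis; the function, its parameters, and the arithmetic are assumptions, with all inputs taken to be profiled at job request time.

```python
# Hypothetical feedback-style estimate of how many cloud instances are needed to
# finish the remaining jobs within the time constraint. Illustrative only; the
# actual model in the talk differs.

import math

def estimate_cloud_instances(remaining_jobs, t_job_local, t_job_cloud,
                             n_local, time_left):
    """Estimate cloud instances needed to meet the deadline (all inputs profiled)."""
    local_rate = n_local / t_job_local            # jobs/sec the local cluster sustains
    required_rate = remaining_jobs / time_left    # jobs/sec needed to meet the deadline
    deficit = max(0.0, required_rate - local_rate)
    return math.ceil(deficit * t_job_cloud)       # instances needed to cover the deficit

# Example: 960 jobs left, 10 s/job locally on 128 cores, 14 s/job per cloud
# instance, 60 s left before the deadline.
print(estimate_cloud_instances(960, 10.0, 14.0, 128, 60.0))
```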
Slide 11: Execution of the Resource Allocation Framework
Slave nodes
–Request and consume jobs
Master node (one per cluster)
–Collects profile information during job request time
–(De)allocates resources
Head node
–Evaluates the profiled information
–Estimates the number of cloud instances before each job assignment
–Informs the master nodes
Slide 12: Experimental Setup
Two applications
–KMeans (520 GB): local = 104 GB, cloud = 416 GB
–PageRank (520 GB): local = 104 GB, cloud = 416 GB
Local cluster: max. 16 nodes x 8 cores = 128 cores
Cloud resources: max. 16 nodes x 8 cores = 128 cores
Evaluation of the model
–Local nodes are dropped during execution
–Observed how the system adapts
Slide 13: KMeans – Time Constraint
Number of local instances: 16 (fixed); number of cloud instances: max. 16 (varies)
Local: 104 GB, cloud: 416 GB
The system fails to meet the time constraint only when the maximum number of cloud instances is reached
All other configurations meet the time constraint with an error rate below 1.5%
Slide 14: KMeans – Cloud Bursting
Number of local instances: 16 (fixed); number of cloud instances: max. 16 (varies)
Local: 104 GB, cloud: 416 GB
4 local nodes are dropped during execution
–After 25% and 50% of the time constraint has elapsed, the error rate is below 1.9%
–After 75% of the time constraint has elapsed, the error rate is below 3.6%
–The higher error rate is due to the shorter time left to profile the new environment
Slide 15: Outline
Data-intensive processing
–Transparent data access and analysis
–Meeting user constraints
Minimizing storage and I/O cost
–Domain-specific compression
–Parallel I/O with compression
Corresponding contributions:
–MATE-HC: Map-reduce with AlternaTE API
–Dynamic Resource Allocation Framework with Cloud Bursting
–Compression Methodology and System for Large-Scale Applications
Slide 16: Data Management Using Compression
Generic compression algorithms
–Good for low-entropy byte sequences
–Scientific datasets are hard to compress: floating-point numbers consist of an exponent and a mantissa, and the mantissa can be highly entropic
Using compression is challenging
–Choosing suitable compression algorithms
–Utilizing the available resources
–Integrating the compression algorithms
Slide 17: Compression Methodology
Common properties of scientific datasets
–Multidimensional arrays
–Consist of floating-point numbers
–Relationship between neighboring values
Domain-specific solutions can help
Approach: prediction-based differential compression (see the sketch below)
–Predict the values of neighboring cells
–Store the difference
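As a rough illustration of prediction-based differential compression, the sketch below uses the simplest possible predictor (the previous value in a 1-D stream) and stores only the bytes of the XOR difference. The real methodology is domain specific (multidimensional neighbor prediction with 5-bit headers, per the next slide), so this is an assumption-laden toy version, not the thesis algorithm.

```python
# Toy prediction-based differential compression for a 1-D float stream.
# Predictor: the previous value. The difference is the XOR of the IEEE-754 bit
# patterns; smooth data gives many leading zero bits, so few bytes are stored.

import struct

def compress(values):
    out, prev = [], 0
    for v in values:
        bits = struct.unpack('<Q', struct.pack('<d', v))[0]
        diff = bits ^ prev                      # XOR with the predicted (previous) value
        nbytes = (diff.bit_length() + 7) // 8   # bytes actually needed for the residual
        out.append(bytes([nbytes]) + diff.to_bytes(nbytes, 'little'))
        prev = bits
    return b''.join(out)

def decompress(blob):
    values, prev, i = [], 0, 0
    while i < len(blob):
        nbytes = blob[i]
        diff = int.from_bytes(blob[i + 1:i + 1 + nbytes], 'little')
        prev ^= diff                            # undo the prediction
        values.append(struct.unpack('<d', struct.pack('<Q', prev))[0])
        i += 1 + nbytes
    return values

data = [288.15, 288.17, 288.16, 288.20]         # smooth, temperature-like series
assert decompress(compress(data)) == data        # lossless round trip
```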
Slide 18: Example – GCRM Temperature Variable Compression
Example: a temperature record, where the values of neighboring cells are highly related
X' table (after prediction); X'' holds the compressed values
–5 bits for the prediction plus the difference
Supports lossless and lossy compression
Fast, with good compression ratios
Slide 19: Compression System
Improves end-to-end application performance
Minimizes application I/O time
–Pipelines I/O and (de)compression operations (sketched below)
Hides computational overhead
–Overlaps application computation with the compression framework
Easy implementation of compression algorithms
Easy integration with applications
–API similar to POSIX I/O
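A minimal sketch of the pipelining idea follows, assuming a hypothetical chunked store where each chunk is a separately compressed file and using zlib as a stand-in codec; the thread/queue layout and function names are illustrative, not the framework's actual implementation.

```python
# Rough sketch of pipelined I/O and decompression: one thread streams compressed
# chunks from storage while another decompresses, so both stages overlap with the
# application's own computation on already-decompressed chunks.

import queue, threading, zlib

def read_stage(chunk_paths, raw_q):
    for path in chunk_paths:
        with open(path, 'rb') as f:
            raw_q.put(f.read())       # I/O runs concurrently with decompression
    raw_q.put(None)                   # end-of-stream marker

def decompress_stage(raw_q, ready_q):
    while (chunk := raw_q.get()) is not None:
        ready_q.put(zlib.decompress(chunk))   # placeholder codec
    ready_q.put(None)

def read_pipeline(chunk_paths):
    raw_q, ready_q = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
    threading.Thread(target=read_stage, args=(chunk_paths, raw_q), daemon=True).start()
    threading.Thread(target=decompress_stage, args=(raw_q, ready_q), daemon=True).start()
    while (chunk := ready_q.get()) is not None:
        yield chunk                   # application consumes chunks as they become ready
```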
Slide 20: A Compression System for Data-Intensive Applications
Chunk Resource Allocation (CRA) layer
–Initializes the system
–Converts original offset and data size requests to their compressed counterparts (see the sketch below)
–Generates chunk requests and enqueues them for processing
Parallel Compression Engine (PCE)
–Applies encode()/decode() functions to chunks
–Manages an in-memory cache with informed prefetching
–Creates I/O requests
Parallel I/O Layer (PIOL)
–Creates parallel chunk requests to the storage medium
–Each chunk request is handled by a group of threads
–Provides an abstraction for different data transfer protocols
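The kind of offset bookkeeping the CRA layer performs can be pictured as below. The fixed 64 MB logical chunk size, the chunk_index layout, and the function name are assumptions made for the example; the slides do not give the actual metadata format.

```python
# Illustrative translation of an application's (offset, size) read against the
# original, uncompressed layout into per-chunk requests against the compressed
# store, using a chunk index built at initialization time.

CHUNK = 64 * 1024 * 1024   # assumed fixed 64 MB logical chunk size

# chunk_index[i] = (compressed_offset, compressed_length) for logical chunk i
def to_chunk_requests(offset, size, chunk_index):
    first, last = offset // CHUNK, (offset + size - 1) // CHUNK
    return [chunk_index[i] for i in range(first, last + 1)]

# Example: a 100 MB read starting at 32 MB spans logical chunks 0 through 2.
index = {0: (0, 20_000_000), 1: (20_000_000, 22_500_000), 2: (42_500_000, 19_000_000)}
print(to_chunk_requests(32 * 1024 * 1024, 100 * 1024 * 1024, index))
```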
Slide 21: Integration with a Data-Intensive Computing System
Remote data processing
–Sensitive to I/O bandwidth
Processes data in
–the local cluster
–the cloud
–or both (hybrid cloud)
Slide 22: Experimental Setup
Two datasets
–GCRM: 375 GB (L: 270 + R: 105)
–NPB: 237 GB (L: 166 + R: 71)
Local cluster: 16 nodes x 8 cores
Storage of datasets
–Lustre FS (14 storage nodes)
–Amazon S3 (Northern Virginia)
Compression algorithms: CC, FPC, LZO, bzip, gzip, lzma
Applications: AT, MMAT, KMeans
Slide 23: Performance of MMAT
[Results figure]
Slide 24: Summary
Management and analysis of scientific datasets is challenging
–Generic compression algorithms are inefficient for scientific datasets
Proposed a compression framework and methodology
–Domain-specific compression algorithms are fast and space efficient: 51.68% space saving, 53.27% improvement in execution time
–Easy plug-and-play of compression algorithms
–Integrated the proposed framework and methodology with a data analysis middleware
Slide 25: Enabling Parallel I/O with Compression
Community focus
–Storing, managing, and moving scientific datasets
–Mostly reading data
Compression can further help
–Simulation applications and instruments generate large volumes of output data, with high write overhead and extended execution times
–Parallel I/O with compression can increase output performance
But...
–Can it really benefit output performance? There is a tradeoff between CPU utilization and I/O idle time
–What about integration with scientific applications? Effort is required from scientists to adapt their applications
Slide 26: Scientific Data Management Libraries
Widely used by the community
–PnetCDF (NetCDF), HDF5, ...
NetCDF format
–Portable, self-describing, space-efficient
High-performance parallel I/O
–MPI-IO optimizations: collective and independent calls, hints about the file system
No support for compression
Slide 27: Parallel and Transparent Compression for PnetCDF
Parallel write operations rely on
–Known sizes of data types and variables
–Known data item locations
Parallel write operations with compression face
–Variable-size chunks
–No a priori knowledge of data locations
–Many processes writing at once
Slide 28: Parallel and Transparent Compression for PnetCDF
Desired features while enabling compression:
Parallel compression and write
–Sparse and dense storage
Transparency
–Minimum effort from the application developer
–System integration with PnetCDF
Performance
–Different variables may require different compression algorithms
–Domain-specific compression algorithms
Slide 29: Compression: Sparse Storage
Chunks/splits are created
The compression layer applies user-provided algorithms
Compressed splits are written at their original offsets (see the sketch below)
I/O still benefits
–Only compressed data is transferred
No benefit for storage space
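A sketch of the sparse-storage write path, with mpi4py and zlib as stand-ins: fh is assumed to be an MPI file handle opened with MPI.File.Open, and the codec would in practice be the user-provided, domain-specific algorithm.

```python
# Sparse storage sketch: each rank writes its compressed split at the split's
# *original* offset, so no offset metadata is exchanged; the file keeps its
# uncompressed size, but less data crosses the I/O path.

import zlib
from mpi4py import MPI  # fh below is assumed to come from MPI.File.Open(...)

def sparse_write(fh, split_bytes, original_offset):
    compressed = zlib.compress(split_bytes)              # placeholder algorithm
    fh.Write_at(original_offset, bytearray(compressed))  # independent write at the old offset
    return len(compressed)
```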
Slide 30: Compression: Dense Storage
Compressed splits are appended locally
New offset addresses are calculated
–Requires a metadata exchange (see the sketch below)
All compressed data blocks are written using a collective call
The generated file is smaller
–Advantages: I/O + storage space
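The dense variant needs each rank to learn how much data lower-ranked processes will write before it. A prefix sum over the compressed sizes is one way to realize that metadata exchange; the sketch below, again with mpi4py and zlib as placeholders, illustrates the idea rather than the system's actual protocol.

```python
# Dense storage sketch: exchange compressed sizes via an exclusive prefix sum so
# each rank knows its offset in the packed file, then write collectively.

import zlib
from mpi4py import MPI  # fh below is assumed to come from MPI.File.Open(...)

def dense_write(fh, split_bytes):
    comm = MPI.COMM_WORLD
    compressed = zlib.compress(split_bytes)              # placeholder algorithm
    my_size = len(compressed)
    my_offset = comm.exscan(my_size) or 0                # sum of sizes on lower ranks (0 on rank 0)
    fh.Write_at_all(my_offset, bytearray(compressed))    # collective write into the dense file
    return my_offset, my_size
```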
Slide 31: Compression: Hybrid Method
The developer provides
–Compression ratio
–Error ratio
Does not require a metadata exchange
Error padding can be used for overflowed data
The generated file is smaller
Relies on user inputs
Off' = Off x (1 / (comp_ratio - err_ratio))
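Read literally, the formula places a split at its original offset scaled down by the expected compression ratio minus a safety margin, so slightly more space is reserved than the expected compressed size; a small worked example under that interpretation:

```python
# Worked example of the hybrid-method offset formula, assuming comp_ratio is the
# expected original/compressed ratio and err_ratio is a safety margin that leaves
# padding for splits that compress worse than expected.

def hybrid_offset(original_offset, comp_ratio, err_ratio):
    return original_offset * (1.0 / (comp_ratio - err_ratio))

# A split that started at byte 1_000_000, with an expected 2.0x compression and a
# 0.2 margin, is placed at ~555_556 instead of 500_000, reserving overflow room.
print(round(hybrid_offset(1_000_000, 2.0, 0.2)))   # 555556
```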
Slide 32: PnetCDF Data Flow
1. Generated data is passed to the PnetCDF library
2. Variable information is gathered from the NetCDF header
3. Splits are compressed
4. Metadata information is exchanged
5. Parallel write operations
6. Synchronization and global view
–Update the NetCDF header
Slide 33: Experimental Setup
Local cluster
–Each node has 8 cores (Intel Xeon E5630, 2.53 GHz) and 12 GB of memory
–InfiniBand network
–Lustre file system: 8 OSTs, 4 storage nodes, 1 metadata server
Microbenchmarks: 34 GB
Two data analysis applications: 136 GB dataset
–AT, MATT
Scientific simulation application: 49 GB dataset
–MiniMD (Mantevo Project)
Slide 34: Experiments: (Write) Microbenchmarks
[Results figure]
Slide 35: Experiments: Simulation (MiniMD)
[Figures: application execution times and application write times]
Slide 36: Summary
Scientific simulation applications and instruments generate massive amounts of data
Management of "Big Data"
–I/O throughput affects performance
–Minimum integration effort is desired
Proposed two compression methods and implemented a compression layer in PnetCDF
–Ported our proposed methods to it
Evaluated our system
–MiniMD: 22% performance improvement, 25.5% storage space saving
–AT, MATT: 45.3% performance improvement, 47.8% storage space saving
Slide 37: Conclusions
Limited resources complicate the execution of applications
–Cloud technologies provide on-demand resources
Challenges
–Transparent data access and processing; meeting user constraints; minimizing I/O and storage cost
MATE-HC
–Transparent and efficient data processing on a hybrid cloud
–Time- and cost-sensitive data processing
A compression methodology and system
–Minimize storage cost and the I/O bottleneck
–Parallel I/O with compression
Future direction: in-situ and in-transit data analysis
–Resource utilization and load balancing
Slide 38: Thanks!
Slide 39: In-Situ and In-Transit Analysis
Slide 40: In-Situ and In-Transit Analysis
Compression can ease data management, but may not always be sufficient
In-situ data analysis
–Co-locate the data source and the analysis code
–Analyze data as it is generated
In-transit data analysis
–Remote resources are utilized
–Generated data is forwarded to "staging nodes"
Slide 41: In-Situ and In-Transit Data Analysis
Significant reduction in generated dataset size
–Noise elimination, data filtering, stream mining, ...
–Timely insights
Parallel data analysis
–MATE-Stream
Dynamic resource allocation and load balancing
–Hybrid data analysis: both in-situ and in-transit
Slide 42: Parallel In-Situ Data Analysis
[Diagram: data generation (scientific instruments, simulations; (un)bounded data) feeds a dispatcher and local reduction processes (filtering, stream mining, data reduction; continuous local reduction); their reduction objects are merged in a local combination step producing intermediate results and timely insights (continuous global reduction)]
Slide 43: Elastic In-Situ Data Analysis
[Diagram: when resource utilization is insufficient, resources are dynamically extended by spawning new local reduction processes]
Slide 44: Elastic In-Situ and In-Transit Data Analysis
[Diagram: a staging node is set up and data is forwarded to it; the reduction process performs (1) local combination and (2) global combination across nodes N0 and N1]
Slide 45: Future Directions
Scientific applications are difficult to modify
–Integration with existing data sources: GridFTP, (P)NetCDF, HDF5, etc.
Data transfer is expensive (especially for in-transit analysis)
–Utilization of advanced network technologies, e.g., Software-Defined Networking (SDN)
Long-running nature of large-scale applications
–Failures are inevitable
–Exploit features of the processing structure
Slide 46: Experiments: Hybrid Cloud
Slide 47: Experiments
2 geographically distributed clusters
–Cloud: EC2 instances running in Virginia, 32 nodes x 8 cores
–Local: campus cluster (Columbus, OH), 32 nodes x 8 cores (Intel Xeon 2.53 GHz)
3 applications with 120 GB of data
–KMeans: k=1000; KNN: k=1000
–PageRank: 50x10 links with 9.2x10 edges
Goals
–Evaluate the system overhead with different job distributions
–Evaluate the scalability of the system
Slide 48: System Overhead: K-Means
Env (local/EC2) | Global Reduction | Idle Time | Total Slowdown | Stolen Jobs (of 960)
50/50           | 0.0670           | 93.871    | 20.430 (0.5%)  | 0
33/67           | 0.0660           | 31.232    | 142.403 (5.9%) | 128
17/83           | 0.0660           | 25.101    | 243.31 (10.4%) | 240
Slide 49: Scalability: K-Means
[Results figure]
Slide 50: Summary
MATE-HC is a data-intensive middleware developed for the hybrid cloud
Our results show that
–Inter-cluster communication overhead is low
–Job distribution is important
–Overall slowdown is modest
–The proposed system is scalable
Slide 51: MATE Processing Structure and MATE-EC2 Design
Slide 52: Simple Example
[Diagram: a large dataset (3 5 8 4 1 | 3 5 2 6 7 | 9 4 2 4 8) is partitioned across compute nodes; each node performs a local reduction (+) into its reduction object Robj, and a global reduction (+) combines them into the final result, 71]
Slide 53: MATE-EC2 Design and Experiments
Slide 54: MATE-EC2 Design
Data organization
–Three levels: buckets, chunks, and units
–Metadata information
Chunk retrieval
–Threaded data retrieval (see the sketch below)
–Selective job assignment
Load balancing and handling heterogeneity
–Pooling mechanism
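The pooling mechanism can be pictured as a shared work queue that retrieval threads drain, so faster threads naturally take more chunks. The sketch below is an illustration with made-up names (fetch would be, for example, an S3 GET for one chunk), not MATE-EC2's actual code.

```python
# Threaded chunk retrieval with a shared pool for load balancing: each worker
# repeatedly claims the next unfetched chunk until the pool is empty.

import queue, threading

def retrieve_chunks(chunk_keys, fetch, num_threads=16):
    pool, results = queue.Queue(), queue.Queue()
    for key in chunk_keys:
        pool.put(key)

    def worker():
        while True:
            try:
                key = pool.get_nowait()       # claim the next unfetched chunk
            except queue.Empty:
                return
            results.put((key, fetch(key)))    # e.g., an S3 GET for that chunk

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads: t.start()
    for t in threads: t.join()

    out = {}
    while not results.empty():
        key, data = results.get()
        out[key] = data
    return out
```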
Slide 55: MATE-EC2 vs. EMR
[Results figures]
KMeans: speedups vs. "combine" of 3.54–4.58
PageRank: speedups vs. "combine" of 4.08–7.54
Slide 56: Different Chunk Sizes
KMeans, 1 retrieval thread
Performance increase
–128 KB vs. >8 MB chunks
–Speedups of 2.07 to 2.49
Slide 57: K-Means (Data Retrieval)
Dataset: 8.2 GB
Fig. 1: 16 retrieval threads
–8 MB vs. other chunk sizes: speedups of 1.13–1.30
Fig. 2: 128 MB chunk size
–1 thread vs. more threads: speedups of 1.37–1.90
Slide 58: Job Assignment
KMeans
–Speedups of 1.01 (8 MB) and 1.10–1.14 (other chunk sizes)
PCA (2 iterations)
–Speedups of 1.19–1.68
Slide 59: Heterogeneous Configuration
Overheads
–KMeans: 1%
–PCA: 1.1%, 7.4%, 11.7%
Slide 60: Dynamic Resource Allocation Framework and Experiments
Slide 61: KMeans – Cost Constraint
The system meets the cost constraints with an error rate below 1.1%
When the maximum number of cloud instances is allocated, the error rate is again below 1.1%
The system tries to minimize execution time under the provided cost constraint
Slide 62: Summary
Developed a resource allocation model
–Based on a feedback mechanism
–Handles time and cost constraints
Evaluated with two data-intensive applications (KMeans, PageRank)
–Error rate for time constraints < 3.6%
–Error rate for cost constraints < 1.2%
Slide 63: Compression Framework Components and Experiments
Slide 64: Prefetching and In-Memory Cache
Overlaps application-layer computation with I/O
Reusability of already accessed data is small
Prefetches and caches the prospective chunks (see the sketch below)
–Default policy is LRU
–The user can analyze the access history and provide a prospective chunk list (informed prefetching via prefetch(...))
The cache uses a row-based locking scheme for efficient consecutive chunk requests
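A toy version of an LRU cache with an informed-prefetch hook is sketched below; the class, its API, and the single-lock simplification (instead of row-based locking) are assumptions for illustration, not the framework's actual cache.

```python
# LRU cache with an informed-prefetching hook: by default it evicts the least
# recently used chunk; if the application supplies a list of chunks it expects to
# need next, a background thread warms the cache ahead of the actual requests.

import threading
from collections import OrderedDict

class PrefetchCache:
    def __init__(self, load_fn, capacity=8):
        self.load_fn, self.capacity = load_fn, capacity   # load_fn: fetch + decompress a chunk
        self.cache = OrderedDict()
        self.lock = threading.Lock()

    def get(self, chunk_id):
        with self.lock:
            if chunk_id in self.cache:
                self.cache.move_to_end(chunk_id)          # refresh LRU position
                return self.cache[chunk_id]
        data = self.load_fn(chunk_id)                     # miss: fetch (and decompress) now
        self._put(chunk_id, data)
        return data

    def prefetch(self, prospective_ids):
        # Hint from the application: warm the cache in the background.
        def run():
            for cid in prospective_ids:
                self._put(cid, self.load_fn(cid))
        threading.Thread(target=run, daemon=True).start()

    def _put(self, chunk_id, data):
        with self.lock:
            self.cache[chunk_id] = data
            self.cache.move_to_end(chunk_id)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)            # evict the least recently used chunk
```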
Slide 65: Performance of MMAT
[Figures: performance and breakdown of performance]
Overhead (local): 15.41%
Read speedup: 1.96
Slide 66: Lossy Compression (MMAT)
[Results figure]
Lossy #e: number of dropped bits
Error bound: 5x10^-5
Slide 67: Performance of KMeans
NPB dataset; compression ratio: 24.01% (180 GB)
More computation
–More opportunity to fetch and decompress data
Slide 68: Parallel I/O with Compression Experiments
Slide 69: Experiments: (Read) Microbenchmarks
[Results figure]
Slide 70: Experiments: Scientific Analysis (AT)
[Results figure]