Download presentation
Presentation is loading. Please wait.
Published byJoan Barber Modified over 9 years ago
1
Tanzima Z. Islam, Saurabh Bagchi, Rudolf Eigenmann – Purdue University Kathryn Mohror, Adam Moody, Bronis R. de Supinski – Lawrence Livermore National Lab.
2
1 Checkpoint-restart widely used MPI applications Take globally coordinated checkpoints Application-level checkpoint High-level I/O format HDF5, Adios, netCDF etc. Checkpoint writing Application I/O Library Data-Format API Struct ToyGrp{ 1. float Temperature[1024]; 2. short Pressure[20][30]; }; NetCDF HDF5 1.HDF5 checkpoint{ 2.Group “/”{ 3. Group “ToyGrp”{ DATASET “Temperature”{ DATATYPE H5T_IEEE_F32LE DATASPACE SIMPLE {(1024) / (1024)} } DATASET “Pressure” { DATATYPE H5T_STD_U8LE DATASPACE SIMPLE {(20,30) / (20,30)} }}}} Parallel File System (PFS) N1N1 Parallel File System (PFS) NNNN Parallel File System (PFS) NMNM Not scalable Best compromise but complex Easiest but Contention on PFS Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
3
2 IOR 78MB of data per process N N checkpoint transfer Observations: (-) Large average write time less frequent checkpointing (-) Large average read time poor application performance Average Read Time (s) Average Write Time (s) # of Processes (N) Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
4
3 Today’s checkpoint-restart systems will not scale Increasing number of concurrent transfers Increasing volume of checkpoint data Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
5
4 Data-aware aggregation Reduces the number of concurrent transfers Improves compressibility of checkpoints Data-aware compression Reduces data almost 2x more than simply concatenating them and compressing Design and develop mcrEngine N M checkpointing system Decouples checkpoint transfer logic from applications Improves application performance Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
6
5 Background Problem Data aggregation & compression Evaluation Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
7
6 Agnostic scheme – concatenate checkpoints Agnostic-block scheme – interleave fixed-size blocks Observations: (+) Easy (−) Low compression ratio C 1 [1-B] C 2 [1-B] C 1 [B+1-2B] C 2 [B+1-2B] C 1 [1-B] C 1 [B+1-2B] C 2 [1-B] C 2 [B+1-2B] C1C1 C2C2 C1C1 C2C2 Gzip PFS PFS First Phase Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
8
7 C 1.TC 1.PC 2.TC 2.P Group ToyGrp{ float Temperature[1024]; int Pressure[20][30]; }; P0P0 Group ToyGrp{ float Temperature[100]; int Pressure[10][50]; }; P1 Meta-data: 1.Name 2.Data-type 3.Class: -- Array, Atomic Concatenating similar variables C 2.TC 1.TC 2.PC 1.P Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
9
8 Group ToyGrp{ float Temperature[1024]; int Pressure[20][30]; }; P0P0 Group ToyGrp{ float Temperature[100]; int Pressure[10][50]; }; P1 Meta-data: 1.Name 2.Data-type 3.Class: -- Array, Atomic First ‘B’ bytes of Temperature Next ‘B’ bytes of Temperature Interleave Pressure C 1. TC 1. P C 2. PC 2. T Interleaving similar variables Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
10
9 Aware scheme – concatenate similar variables Aware-block scheme – interleave similar variables Output buffer Data-type aware compression FPC Lempel-Ziv TTPPHHDD Gzip PFS C 2.TC 1.TC 2.PC 1.P First Phase Second Phase Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
11
10 CNCCNC CNCCNC CNCCNC ANCANCCNCCNC CNCCNC CNCCNC CNCCNC CNCCNC ANCANC ANCANC Meta-data Identifies “similar” variables Request T, P Applies data-aware aggregation and compression PFS CNC : Compute node component ANC: Aggregator node component Rank-order groups, Group size = 4, N M checkpointing Group CNCCNC CNCCNC CNCCNC CNCCNC Request H, D DD TP TP TP HD HD HD Gzip HHPPTT HHTTPPDD HHTTPPDD Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
12
11 Background Problem Data aggregation & compression Evaluation Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
13
12 Applications ALE3D – 4.8GB per checkpoint set Cactus – 2.41GB per checkpoint set Cosmology – 1.1GB per checkpoint set Implosion – 13MB per checkpoint set Experimental test-bed LLNL’s Sierra: 261.3 TFLOP/s, Linux cluster 15,408 cores, 1.3 Petabyte Lustre file system Compression algorithm FPC [1] for double-precision float Fpzip [2] for single-precision float Lempel-Ziv for all other data-types Gzip for general-purpose compression Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
14
13 Effectiveness of data-aware compression What is the benefit of multiple compression phases? How does group size affect compression ratio? How does compression ratio change as a simulation progresses? Performance of mcrEngine Overhead of the checkpointing phase Overhead of the restart phase Compression ratio = Uncompressed size Compressed size Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
15
14 Data-type aware compression improves compressibility First phase changes underlying data format Data-agnostic double compression is not beneficial Because, data-format is non-uniform and uncompressible Compression Ratio Data-Aware Data-Agnostic Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
16
15 Different merging schemes better for different applications Larger group size beneficial for certain applications ALE3D: Improvement of 8% from group size 2 to 32 ALE3D Cactus CosmologyImplosion Compression Ratio Group size Aware-Block Aware Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
17
16 ALE3D Cactus CosmologyImplosion Compression Ratio Group size Aware-Block Aware Agnostic-Block Agnostic 98-115% Data-aware technique always yields better compression ratio than Data-Agnostic technique Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
18
17 Data-aware technique always yields better compression Cactus Compression Ratio Aware-Block Aware Agnostic-Block Agnostic Simulation Time-steps CosmologyImplosion Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
19
18 Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
20
19 Used IOR N N: Each process transfers 78MB N M: Group size 32, 1.21GB per aggregator Average Write Time (sec) Average Read Time (sec) N->N Write N->M Write N->N Read N->M Read Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
21
20 IOR with N M transfer, groups of 32 processes Data-aware: 1.2GB, data-agnostic: 2.4GB Data-aware compression improves I/O performance at large scale Improvement during write 43% - 70% Improvement during read 48% - 70% Agnostic Aware Agnostic-Read Agnostic-Write Aware-Read Aware-Write Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
22
21 15,408 processes Group size of 32 for N M schemes Each process takes a checkpoint Converts network bound operation into CPU bound one CPU Overhead Transfer Overhead to PFS Total Checkpointing Overhead (sec) Reduction in checkpointing overhead 87%51% Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
23
22 Reduced overall restart overhead Reduced network load and transfer time CPU Overhead Transfer Overhead to PFS Total Recovery Overhead (sec) 43% 71% Reduction in I/O overhead 62% 64% Reduction in recovery overhead Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
24
23 Developed data-aware checkpoint compression technique Relative improvement in compression ratio up to 115% Investigated different merging techniques Evaluated effectiveness using real-world applications Designed and developed a scalable framework Implements N M checkpointing Improves application performance Transforms checkpointing into CPU bound operation Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
25
24 Tanzima Islam (tislam@purdue.edu)tislam@purdue.edu Website: web.ics.purdue.edu/~tislam Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression Purdue: Saurabh Bagchi (sbagchi@purdue.edu)sbagchi@purdue.edu Rudolf Eigenmann (eigenman@purdue.edu)eigenman@purdue.edu Lawrence Livermore National Laboratory Kathryn Mohror (kathryn@llnl.gov)kathryn@llnl.gov Adam Moody (moody20@llnl.gov)moody20@llnl.gov Bronis R. de Supinski (bronis@llnl.gov)bronis@llnl.gov
26
25 Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
27
26 Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
28
27 “A Large-scale Study of Failures in High-performance Computing Systems, by Bianca Schroeder, Garth Gibson Breakdown of root causes of failuresBreakdown of downtime into root causes Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
29
28 Analytical solution to group size selection? Better way than rank-order grouping? Variable streaming? Engineering challenge Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
30
29 1.M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed Compressor for Double-Precision Floating-Point Data”. 2.P. Lindstrom and M. Isenburg, “Fast and Efficient Compression of Floating-Point Data”. 3.L. Reinhold, “QuickLZ”. Tanzima Islam (tislam@purdue.edu) mcrEngine: Data-aware Aggregation & Compression
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.