pFPC: A Parallel Compressor for Floating-Point Data

Martin Burtscher (The University of Texas at Austin)
Paruj Ratanaworabhan (Cornell University)
Introduction

- Scientific programs often produce and transfer large amounts of floating-point data (e.g., program output, checkpoints, messages)
- Large amounts of data are expensive and slow to transfer and store
- FPC: a compression algorithm for IEEE 754 double-precision data
  - Compresses linear streams of FP values fast and well
  - Single-pass operation and lossless compression
Introduction (cont.)

- Large-scale high-performance computers consist of many networked compute nodes
  - Compute nodes have multiple CPUs but only one link
- To speed up data transfer, we need real-time compression that matches the link throughput
- pFPC: a parallel version of the FPC algorithm
  - Exceeds 10 Gb/s on four Xeon processors
Sequential FPC Algorithm [DCC’07]

For each double in the stream (see the C sketch below):
1. Make two predictions
2. Select the closer value
3. XOR it with the true value
4. Count the leading zero bytes
5. Encode the value
6. Update the predictors
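A minimal single-threaded sketch of this loop in C. The hash functions follow the general shape of FPC's fcm/dfcm predictors, but the table size (TSIZE) is illustrative, and the codes are emitted at byte granularity for clarity, whereas the real FPC packs two 4-bit codes per byte.

```c
#include <stdint.h>

#define TSIZE 4096  /* illustrative; the real FPC uses much larger tables */

static uint64_t fcm_tab[TSIZE], dfcm_tab[TSIZE];
static uint64_t fhash, dhash, lastval;

/* Compress one double (as a 64-bit pattern) into out; returns bytes written. */
static int fpc_encode_one(uint64_t val, unsigned char *out)
{
    /* 1. make two predictions */
    uint64_t pfcm  = fcm_tab[fhash];
    uint64_t pdfcm = dfcm_tab[dhash] + lastval;

    /* 2-3. select the closer prediction and XOR it with the true value
       (a smaller XOR residual means more shared leading bytes) */
    uint64_t rfcm = val ^ pfcm, rdfcm = val ^ pdfcm;
    int use_dfcm   = rdfcm < rfcm;
    uint64_t resid = use_dfcm ? rdfcm : rfcm;

    /* 4. count leading zero bytes of the residual (0..7) */
    int lzb = 0;
    while (lzb < 7 && ((resid >> (56 - 8 * lzb)) & 0xff) == 0)
        lzb++;

    /* 5. encode: a header (predictor-select bit + zero-byte count)
       followed by the remaining residual bytes */
    int n = 0;
    out[n++] = (unsigned char)((use_dfcm << 3) | lzb);
    for (int i = 7 - lzb; i >= 0; i--)
        out[n++] = (unsigned char)(resid >> (8 * i));

    /* 6. update the predictor tables and hash values */
    fcm_tab[fhash]  = val;
    fhash = ((fhash << 6) ^ (val >> 48)) & (TSIZE - 1);
    dfcm_tab[dhash] = val - lastval;
    dhash = ((dhash << 2) ^ ((val - lastval) >> 40)) & (TSIZE - 1);
    lastval = val;
    return n;
}
```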
pFPC: Parallel FPC Algorithm

- Operation (sketched in C below)
  - Divide the data stream into chunks
  - Logically assign the chunks round-robin to threads
  - Each thread compresses its chunks with FPC
- Key parameters: chunk size and number of threads
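A sketch of the chunked round-robin work split, assuming POSIX threads; NTHREADS, CHUNK, and the fpc_chunk() stub are illustrative names, not the paper's code. Each thread keeps private predictor state, so only the values in its own chunks feed its history.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NTHREADS 4
#define CHUNK    1024   /* doubles per chunk: a key tuning parameter */

/* stand-in for the sequential FPC compressor with per-thread state */
static void fpc_chunk(const uint64_t *buf, size_t n, int tid)
{
    (void)buf; (void)n; (void)tid;   /* FPC encoding would go here */
}

typedef struct { const uint64_t *data; size_t n; int tid; } warg_t;

static void *worker(void *p)
{
    warg_t *a = p;
    /* round-robin: thread tid compresses chunks tid, tid+NTHREADS, ... */
    for (size_t c = (size_t)a->tid * CHUNK; c < a->n;
         c += (size_t)NTHREADS * CHUNK) {
        size_t len = a->n - c < CHUNK ? a->n - c : CHUNK;
        fpc_chunk(a->data + c, len, a->tid);
    }
    return NULL;
}

void pfpc_compress(const uint64_t *data, size_t n)
{
    pthread_t tid[NTHREADS];
    warg_t arg[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        arg[t] = (warg_t){ data, n, t };
        pthread_create(&tid[t], NULL, worker, &arg[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```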
Evaluation Method

- Systems
  - 3.0 GHz Xeon with 4 processors (others in the paper)
- Datasets: linear streams of real-world data (18–277 MB)
  - 3 observations: error, info, spitzer
  - 3 simulations: brain, comet, plasma
  - 3 messages: bt, sp, sweep3d
Compression Ratio vs. Thread Count

- Configuration: small predictor, chunk size = 1
- Compression ratio is low (typical for FP data), but other algorithms do worse
- Fluctuations are due to multi-dimensional data
Compression Ratio vs. Chunk Size

- Configuration: small predictor, 1 to 4 threads
- Compression ratio is flat for 1 thread; steep initial drop with more threads
- Larger chunks are better for history-based predictors
Throughput on Xeon System

[charts: compression and decompression throughput]

- Throughput increases with chunk size (loop overhead, false sharing, TLB performance)
- Throughput scales with thread count but is limited by load balance and memory bandwidth
Summary

- pFPC chunks up the data and logically assigns the chunks in round-robin fashion to threads
- Reaches 10.9 and 13.6 Gb/s throughput with a compression ratio of 1.18 on a 4-core 3 GHz Xeon
- Portable C source code is available on-line
Conclusions

- For the best compression ratio, the thread count should equal, or be a small multiple of, the data's dimension, and the chunk size should be one
- For the highest throughput, the chunk size should at least match the system's page size and be page aligned (see the sketch below); larger chunks also yield higher compression ratios with history-based predictors
- Parallel scaling is limited by memory bandwidth; future work should focus on improving the compression ratio without increasing the memory bandwidth demands
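A minimal sketch, assuming POSIX, of choosing a page-sized and page-aligned chunk; sysconf and posix_memalign are standard libc calls, but the buffer size is illustrative. Aligning chunk boundaries to page boundaries keeps threads from sharing cache lines and pages.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);              /* e.g., 4096 bytes */
    size_t chunk = (size_t)page / sizeof(uint64_t); /* doubles per chunk */

    /* page-aligned input buffer so that chunk boundaries coincide with
       page boundaries, avoiding false sharing between threads */
    void *buf = NULL;
    if (posix_memalign(&buf, (size_t)page, 1024 * (size_t)page))
        return 1;
    (void)chunk;
    free(buf);
    return 0;
}
```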