Floating-Point Data Compression at 75 Gb/s on a GPU
Molly A. O’Neil and Martin Burtscher
Department of Computer Science


Introduction
- Scientific simulations on HPC clusters
  - Run on interconnected compute nodes
  - Produce and transfer large amounts of floating-point data
- Data storage and transfer are expensive and slow
  - Compute nodes have multiple cores but only one link
- Interconnects are getting faster
  - Lonestar (Texas Advanced Computing Center): 40 Gb/s InfiniBand
  - Speeds of up to 100 Gb/s soon
Floating-Point Data Compression at 75 Gb/s on a GPU, March 2011

Introduction (cont.)
- Compression
  - Reduced storage, faster transfer
  - Only useful when done in real time
    - Saturate network with compressed data
    - Requires compressor tailored to hardware capabilities
- GFC algorithm for IEEE 754 double-precision data
  - Designed specifically for GPU hardware (CUDA)
  - Provides a reasonable compression ratio and operates above the throughput of emerging networks

Lossless Data Compression
- Dictionary-based (Lempel-Ziv family) [gzip, lzop]
- Variable-length entropy coders (Huffman, arithmetic coding)
- Run-length encoding [fax]
- Transforms (Burrows-Wheeler) [bzip2]
- Special-purpose FP compressors [FPC, FSD, PLMI]
  - Prediction and leading-zero suppression
- None of these offers real-time speeds for state-of-the-art networks

GFC Algorithm
- GPUs require 1000s of parallel activities, but compression is a generally serial operation
- Divide data into n chunks, processed in parallel
  - Best performance: choose n to match the maximum number of resident warps
- Each chunk composed of 32-word subchunks
  - One double per warp thread
- Use previous subchunk to provide prediction values
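The prediction step above can be illustrated with a serial Python sketch. This is not the GFC CUDA code: it processes one chunk on the CPU, predicts each double from the value in the same lane of the previous subchunk, and stores the sign, the number of leading zero bytes, and the remaining bytes of the difference. The function names and the unpacked record layout are assumptions made for illustration only.

```python
import struct

WARP = 32  # one double per warp thread, so a subchunk holds 32 values


def to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def leading_zero_bytes(v: int) -> int:
    """Count the leading zero bytes of a 64-bit value (8 when v == 0)."""
    return 8 - (v.bit_length() + 7) // 8


def encode_chunk(values):
    """Return one (sign, zero-byte count, residual bytes) record per value.

    Hypothetical record layout; the real format packs these fields tightly.
    """
    bits = [to_bits(v) for v in values]
    records = []
    for i, b in enumerate(bits):
        # Prediction: same lane in the previous subchunk (0 for the first one).
        pred = bits[i - WARP] if i >= WARP else 0
        diff = b - pred
        sign = 1 if diff < 0 else 0
        mag = abs(diff)
        lzb = leading_zero_bytes(mag)
        records.append((sign, lzb, mag.to_bytes(8, "big")[lzb:]))
    return records


def decode_chunk(records):
    """Invert encode_chunk by replaying the same predictions."""
    bits = []
    for i, (sign, _lzb, payload) in enumerate(records):
        pred = bits[i - WARP] if i >= WARP else 0
        mag = int.from_bytes(payload, "big")
        bits.append(pred - mag if sign else pred + mag)
    return [struct.unpack("<d", struct.pack("<Q", b))[0] for b in bits]
```

On smoothly varying data the differences have many leading zero bytes, so the residual payloads are shorter than the 8 bytes of the original doubles. Chunks never refer to each other, which is what would let each one be handled by a separate warp.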

Dimensionality
- Many scientific data sets display dimensionality
  - Interleaved coordinates from multiple dimensions
- Optional dimensionality parameter to GFC
  - Determines index of previous subchunk to use as the prediction
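A small Python sketch can show why the dimensionality parameter matters. The slides do not give the exact index rule, so the one used here is an assumption: each value is predicted from the closest value in the previous subchunk whose index falls in the same dimension. `prediction_index` and `residual_bytes` are hypothetical helper names.

```python
import struct

WARP = 32  # subchunk size: one double per warp thread


def to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def prediction_index(i: int, dim: int):
    """Index of the value used to predict element i, or None in subchunk 0.

    Assumed rule: the largest index below the current subchunk's start
    that is congruent to i modulo dim.
    """
    start = (i // WARP) * WARP
    if start == 0:
        return None
    return start - 1 - ((start - 1 - i) % dim)


def residual_bytes(values, dim: int) -> int:
    """Total non-zero-prefix bytes left after subtracting the predictions."""
    bits = [to_bits(v) for v in values]
    total = 0
    for i, b in enumerate(bits):
        j = prediction_index(i, dim)
        pred = bits[j] if j is not None else 0
        total += (abs(b - pred).bit_length() + 7) // 8
    return total
```

For interleaved x/y coordinates with very different magnitudes, calling `residual_bytes(data, 2)` should give a smaller total than `residual_bytes(data, 1)`, because with `dim=2` an x value is never predicted from a y value.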

GFC Algorithm (cont.)

GPU Optimizations
- Low thread divergence (few if statements)
  - Some short enough to be predicated
- Coalesce memory accesses by packing/unpacking data in shared memory (for compute capability < 2.0)
- Very little inter-thread communication and synchronization
  - Prefix sum only
- Warp-based implementation
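The prefix sum is needed because each thread produces a variable number of residual bytes and must know where to write them. A serial Python sketch of the idea follows, assuming an exclusive prefix sum over the per-lane byte counts. `exclusive_scan` and `pack` are hypothetical names, and on a GPU the scan would be computed cooperatively within the warp rather than in a loop.

```python
def exclusive_scan(counts):
    """Exclusive prefix sum: element k is the sum of counts[0:k]."""
    offsets = []
    running = 0
    for c in counts:
        offsets.append(running)
        running += c
    return offsets


def pack(payloads):
    """Concatenate variable-length payloads using precomputed offsets."""
    lengths = [len(p) for p in payloads]
    offsets = exclusive_scan(lengths)
    out = bytearray(sum(lengths))
    # Every write depends only on its own offset, so on a GPU each lane
    # could perform its write independently once the scan is done.
    for off, p in zip(offsets, payloads):
        out[off:off + len(p)] = p
    return bytes(out), offsets
```

Once the offsets are known, no further communication between lanes is required for the output step.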

Evaluation Method
- Systems
  - Two quad-core 2.53 GHz Xeons
  - NVIDIA FX 5800 GPU (CC 1.3)
- 13 datasets: real-world data (19 – 277 MB)
  - Observational data, simulation results, MPI messages
- Comparisons
  - Compression ratio vs. 5 compressors in common use
  - Throughput vs. pFPC (fastest known CPU compressor)

Compression Ratio
- 1.188 (range: 1.01 – 3.53)
  - Low (FP data), but in line with other algorithms
  - Largely independent of number of chunks
- When done in real time, compression at this ratio can greatly speed up MPI apps
  - 3% – 98% speed-up [Ke et al., SC’04]

Throughput
- Compression: 75 – 87 Gb/s
  - Mean: 77.9 Gb/s
- Decompression: 90 – 121 Gb/s
  - Mean: 96.6 Gb/s
- 4x faster than pFPC on 8 cores (2 CPUs)
- Improvement over pFPC’s compression ratio vs. performance trend
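To relate these figures to a network link, a rough calculation helps. The sketch below assumes that throughput is measured on the uncompressed data, which the slides do not state. Under that assumption, keeping a link full of compressed data requires the compressor to consume input at the link speed multiplied by the compression ratio. The function names are hypothetical.

```python
def required_input_rate(link_gbps: float, ratio: float) -> float:
    """Uncompressed Gb/s a compressor must consume to keep the link full."""
    return link_gbps * ratio


def keeps_up(compressor_gbps: float, link_gbps: float, ratio: float) -> bool:
    """True if the compressor can saturate the link with compressed data."""
    return compressor_gbps >= required_input_rate(link_gbps, ratio)


# Lonestar's 40 Gb/s InfiniBand link at the reported ratio of 1.188:
lonestar_need = required_input_rate(40.0, 1.188)
lonestar_ok = keeps_up(75.0, 40.0, 1.188)
```

With these numbers the required input rate for the 40 Gb/s link is roughly 47.5 Gb/s, which is below the minimum compression throughput of 75 Gb/s.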

NEW: Fermi Throughput
- Fermi improvements:
  - Faster, simpler memory accesses
  - Hardware support for count-leading-zeros op
- Compression ratio: 1.187
- Compression: 119 – 219 Gb/s (harmonic mean: 167.5 Gb/s)
- Decompression: 169 – 219 Gb/s (harmonic mean: 180.3 Gb/s)
- Compresses over 9.5x faster than pFPC on 8 x86 cores

Summary
- GFC algorithm
  - Chunks up data, each warp processes a chunk iteratively by 32-word subchunks
  - No communication required between warps
- Minimum throughput of 75 Gb/s (compression) and 90 Gb/s (decompression) on GTX-285, and of 119 Gb/s and 169 Gb/s on Fermi, with a compression ratio of 1.19
- CUDA source code is freely available at http://www.cs.txstate.edu/~burtscher/research/GFC/

Conclusions
- GPU can compress much faster than PCIe bus can transfer the data
- But…
  - PCIe bus will become faster
  - CPU-GPU increasingly on single die
  - GPU-to-GPU, GPU-to-NIC transfers coming?
- GFC is the first compressor with the potential to deliver real-time FP data compression for current and emerging network speeds
