Synthesizing Effective Data Compression Algorithms for GPUs
Annie Yang and Martin Burtscher*
Department of Computer Science



Highlights
- MPC compression algorithm
  - Brand-new lossless compression algorithm for single- and double-precision floating-point data
  - Systematically derived to work well on GPUs
- MPC features
  - Compression ratio is similar to the best CPU algorithms
  - Throughput is much higher
  - Requires little internal state (no tables or dictionaries)

Introduction
- High-performance computing systems
  - Depend increasingly on accelerators
  - Process large amounts of floating-point (FP) data
  - Moving this data is often the performance bottleneck
- Data compression
  - Can increase transfer throughput
  - Can reduce storage requirements
  - But only if it is effective, fast (real-time), and lossless

Problem Statement
- Existing FP compression algorithms for GPUs
  - Fast but compress poorly
- Existing FP compression algorithms for CPUs
  - Compress much better but are slow
  - Parallel codes run serial algorithms on multiple chunks
  - Too much state per thread for a GPU implementation
  - The best serial algorithms may not be scalably parallelizable
- Do effective FP compression algorithms for GPUs exist?
  - And if so, how can we create such an algorithm?

Our Approach
- Need a brand-new massively parallel algorithm
- Study existing FP compression algorithms
  - Break them down into their constituent parts
  - Keep only the GPU-friendly parts
  - Generalize them as much as possible
- Resulted in a set of algorithmic components
  - CUDA implementation: each component takes a sequence of values as input and outputs a transformed sequence
  - Components operate on the integer representation of the data

Our Approach (cont.)
- Automatically synthesize compression algorithms by chaining components
  - Use exhaustive search to find the best four-component chains
- Synthesize the decompressor
  - Employ the inverse components
  - They perform the opposite transformation on the data

Mutator Components
- Mutators computationally transform each value
  - They use no information about any other value
- NUL outputs the input block unchanged (identity)
- INV flips all the bits
- │, called "cut", is a singleton pseudo-component that converts a block of words into a block of bytes
  - Merely a type cast, i.e., no computation or data copying
  - Byte granularity can be better for compression
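For illustration, the mutators can be sketched serially in Python (the actual implementation is a CUDA kernel that processes words in parallel; the function names below are illustrative, not from the paper):

```python
def nul(block):
    """NUL: identity -- outputs the input block unchanged."""
    return list(block)

def inv(block, bits=32):
    """INV: flips all bits of each word (32-bit words here)."""
    mask = (1 << bits) - 1
    return [v ^ mask for v in block]

def cut(block, bits=32):
    """Cut: view a block of words as a block of bytes. In CUDA this is
    merely a type cast; the bytes are materialized here (little-endian)
    only for illustration."""
    return [(v >> (8 * i)) & 0xFF for v in block for i in range(bits // 8)]
```

Note that INV is its own inverse, which is what makes synthesizing the matching decompressor straightforward for this component.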

Shuffler Components
- Shufflers reorder whole values or the bits of values
  - They do not perform any computation
  - Each thread block operates on a chunk of values
- BIT emits the most significant bits of all values, followed by the second most significant bits, and so on
- DIMn groups values by dimension n
  - Tested n = 2, 3, 4, 5, 8, 16, and 32
  - For example, DIM2 transforms the sequence A, B, C, D, E, F into A, C, E, B, D, F
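A serial Python sketch of the two shuffler families (illustrative only; the real components run one thread block per chunk, and chunking is omitted here). `bit` is a bit transpose that emits all most significant bits first; `dim` regroups the sequence by dimension:

```python
def dim(block, n):
    """DIMn: group values by dimension n -- every n-th value starting
    at offset 0, then every n-th value starting at offset 1, etc."""
    return [block[i] for off in range(n) for i in range(off, len(block), n)]

def bit(block, bits=32):
    """BIT: bit transpose -- output word j collects bit (bits-1-j) of
    every input value, so all MSBs come first, then the next bits."""
    out = []
    for j in range(bits):
        w = 0
        for v in block:
            w = (w << 1) | ((v >> (bits - 1 - j)) & 1)
        out.append(w)
    return out
```

For example, `dim(['A', 'B', 'C', 'D', 'E', 'F'], 2)` reproduces the DIM2 effect described above: A, C, E, B, D, F.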

Predictor Components
- Predictors guess values based on previous values and compute residuals (true value minus guessed value)
  - Residuals tend to cluster around zero, making them easier to compress than the original sequence
  - Each thread block operates on a chunk of values
- LNVns subtracts the n-th prior value from the current value
  - Tested n = 1, 2, 3, 5, 6, and 8
- LNVnx XORs the current value with the n-th prior value
  - Tested n = 1, 2, 3, 5, 6, and 8
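The two predictor families admit a compact serial sketch (hypothetical Python; values before the start of the chunk are treated as zero, an assumption not spelled out on the slide):

```python
def lnv_s(block, n, bits=32):
    """LNVns: residual = current value minus the n-th prior value,
    computed modulo 2^bits on the integer representation."""
    mask = (1 << bits) - 1
    return [(v - (block[i - n] if i >= n else 0)) & mask
            for i, v in enumerate(block)]

def lnv_x(block, n):
    """LNVnx: residual = current value XOR the n-th prior value."""
    return [v ^ (block[i - n] if i >= n else 0) for i, v in enumerate(block)]
```

On a smooth sequence such as 5, 7, 9, 11, `lnv_s(..., 1)` yields the small residuals 5, 2, 2, 2, which cluster near zero as described above.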

Reducer Components
- Reducers eliminate redundancies in the value sequence
  - All other components leave the sequence length unchanged, i.e., only reducers can actually compress
  - Each thread block operates on a chunk of values
- ZE emits a bitmap of the zeros followed by the non-zero values
  - Effective if the input sequence contains many zeros
- RLE performs run-length encoding, i.e., it replaces repeating values by a count and a single value
  - Effective if the input contains many repeating values
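Serially, the two reducers can be sketched as follows (illustrative Python; the paper's GPU versions are built on parallel prefix scans):

```python
def ze(block):
    """ZE: a bitmap marking the non-zero positions, followed by the
    non-zero values themselves."""
    return [int(v != 0) for v in block], [v for v in block if v != 0]

def rle(block):
    """RLE: replace each run of repeating values by (count, value)."""
    runs = []
    for v in block:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1      # extend the current run
        else:
            runs.append([1, v])   # start a new run
    return [(c, v) for c, v in runs]
```

For instance, `ze([5, 0, 0, 7])` produces the bitmap `[1, 0, 0, 1]` plus the values `[5, 7]`, shrinking four words to two words plus four bits.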

Algorithm Synthesis
- Determine the best four-stage algorithms with a cut
  - Exhaustive search of all 138,240 possible combinations
- 13 double-precision data sets (19 to 277 MB)
  - Observational data, simulation results, MPI messages
  - Single-precision data derived from the double-precision data
- Create a general GPU-friendly compression algorithm
  - Analyze the best algorithm for each data set and precision
  - Find commonalities and generalize them into one algorithm
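The brute-force idea behind the search can be sketched as follows (hypothetical Python; the 138,240 figure in the paper comes from enumerating all legal component and cut placements, which this simplified helper does not reproduce):

```python
from itertools import product

def best_chain(components, data, size, stages=4):
    """Try every `stages`-long chain of components on the data and keep
    the chain with the highest compression ratio, defined as
    original size / compressed size."""
    best, best_ratio = None, 0.0
    for chain in product(components, repeat=stages):
        out = data
        for comp in chain:
            out = comp(out)   # apply the stages left to right
        ratio = size(data) / max(size(out), 1)
        if ratio > best_ratio:
            best, best_ratio = chain, ratio
    return best, best_ratio
```

With two toy components (identity and a zero-dropping reducer) on the input [0, 0, 1], the search finds chains that shrink three values to one, i.e., a ratio of 3.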

Best of 138,240 Algorithms
[table of the best synthesized algorithm per data set; not captured in the transcript]

Analysis of Reducers
- Double-precision results only (single-precision results are similar)
- ZE or RLE is required at the end (not counting the cut); this final stage is the encoder
- ZE dominates
  - Many zeros, but not in a row
- The first three stages contain almost no reducers
  - Transformations are key to making the reducer effective
  - Chaining whole compression algorithms may therefore be futile

Analysis of Mutators
- NUL and INV are never used
  - No need to invert bits
  - Fewer effective stages perform worse
- The cut is often at the end (i.e., not used)
  - Word granularity suffices
  - Easier/faster to implement
- DIM8 appears right after the cut (DIM4 with single precision)
  - Used to separate the byte positions of each word
  - The synthesis yielded this unforeseen use of the DIM component

Analysis of Shufflers
- Shufflers are important
  - Almost always included
- BIT is used very frequently
  - FP bit positions correlate more strongly than whole values
- DIM has two uses
  - Separating bytes (see before) when placed right after the cut
  - Separating the values of multi-dimensional data sets (the intended use) in early stages

Analysis of Predictors
- Predictors are very important (they form the data model)
  - Used in every case
  - Often two predictors are used
- LNVns dominates LNVnx
  - The arithmetic (subtraction) residual is superior to the bit-wise (XOR) residual
- Dimension n
  - Separates the values of multi-dimensional data sets (in the first stage)

Analysis of Overall Best Algorithm
- The same algorithm emerges for SP and DP
  - Few components mismatch, but the dimension of LNV6s is off
- Most frequent pattern: LNV*s BIT LNV1s ZE
  - The star denotes the dimensionality
- Why 6 in the starred position?
  - Not used in the individual algorithms
  - 6 is the least common multiple of 1, 2, and 3
  - n > 8 was not tested

MPC: Generalization of Overall Best
- MPC algorithm
  - Massively Parallel Compression
  - Uses the generalized pattern "LNVds BIT LNV1s ZE", where d is the data set dimensionality
- Matches the best algorithm on several DP and SP data sets
- Performs even better when the true dimensionality is used

Evaluation Methodology
- System
  - Dual 10-core Xeon E v2 CPUs
  - K40 GPU with 15 SMs (2880 cores)
- 13 DP and 13 SP real-world data sets
  - Same as before
- Compression algorithms
  - CPU: bzip2, gzip, lzop, and pFPC
  - GPU: GFC and MPC (our algorithm)

Compression Ratio (Double Precision)
- MPC delivers record compression on 5 data sets
  - In spite of the "GPU-friendly components" constraint
- MPC is outperformed by bzip2 and pFPC on average
  - Due to msg_sppm and num_plasma
- MPC is superior to GFC (the only other GPU compressor)

Compression Ratio (Single Precision)
- MPC delivers record compression on 8 data sets
  - In spite of the "GPU-friendly components" constraint
- MPC is outperformed by bzip2 on average
  - Due to num_plasma
- MPC is "superior" to GFC and pFPC
  - They do not support single-precision data; MPC does

Throughput (Gigabytes per Second)
- MPC outperforms all CPU compressors
  - Including pFPC running on two 10-core CPUs, by 7.5x
- MPC is slower than GFC but mostly faster than PCIe
  - MPC currently uses a slow O(n log n) prefix-scan implementation

Summary
- Goal of the research
  - Create an effective FP data compression algorithm that is suitable for massively parallel GPUs
- Approach
  - Extracted 24 GPU-friendly components and evaluated 138,240 combinations to find the best 4-stage algorithms
  - Generalized the findings to derive the MPC algorithm
- Result
  - A brand-new compression algorithm for SP and DP data
  - Compresses about as well as CPU algorithms but is much faster

Future Work and Acknowledgments
- Future work
  - Faster implementation, more components, longer chains, and other inputs, data types, and constraints
- Acknowledgments
  - National Science Foundation
  - NVIDIA Corporation
  - Texas Advanced Computing Center
- Contact information

Number of Stages
- 3 stages reach about 95% of the compression ratio

Single- vs. Double-Precision Algorithms

MPC Operation
- What does "LNVds BIT LNV1s ZE" do?
  - LNVds predicts each value using a similar value, yielding a residual sequence with many small values
    - Similar value = the most recent prior value from the same dimension
  - BIT groups the residuals by bit position
    - All LSBs, then all second LSBs, etc.
  - LNV1s turns identical consecutive words into zeros
  - ZE eliminates these zero words
- GPU friendly
  - All four components are massively parallel
  - Each can be implemented with prefix scans or simpler operations
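Putting the stages together, the whole MPC chain admits a serial sketch (illustrative Python only; in the real implementation each stage is a parallel GPU kernel, and chunking across thread blocks is omitted here):

```python
def mpc_compress(values, d, bits=32):
    """Serial sketch of 'LNVds BIT LNV1s ZE' on integer-typed values."""
    mask = (1 << bits) - 1
    # LNVds: residual against the value d positions earlier (same dimension)
    r = [(v - (values[i - d] if i >= d else 0)) & mask
         for i, v in enumerate(values)]
    # BIT: group the residuals by bit position (a bit transpose)
    t = []
    for j in range(bits):
        w = 0
        for v in r:
            w = (w << 1) | ((v >> (bits - 1 - j)) & 1)
        t.append(w)
    # LNV1s: identical consecutive words become zeros
    z = [(v - (t[i - 1] if i >= 1 else 0)) & mask for i, v in enumerate(t)]
    # ZE: bitmap of non-zero positions, then the non-zero words
    return [int(v != 0) for v in z], [v for v in z if v != 0]
```

On a smoothly varying chunk such as 0, 1, ..., 31 with d = 1, every residual but the first is 1, the transpose concentrates those set bits into a single word, and ZE drops the remaining zero words, so only the bitmap plus one word survive.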