pFPC: A Parallel Compressor for Floating-Point Data
Martin Burtscher (1) and Paruj Ratanaworabhan (2)
(1) The University of Texas at Austin
(2) Cornell University
March 2009

Introduction
- Scientific programs
  - Often produce and transfer lots of floating-point data (e.g., program output, checkpoints, messages)
- Large amounts of data
  - Are expensive and slow to transfer and store
- FPC algorithm for IEEE 754 double-precision data
  - Compresses linear streams of FP values fast and well
  - Single-pass operation and lossless compression

Introduction (cont.)
- Large-scale high-performance computers
  - Consist of many networked compute nodes
  - Compute nodes have multiple CPUs but only one link
- Want to speed up data transfer
  - Need real-time compression to match link throughput
- pFPC: a parallel version of the FPC algorithm
  - Exceeds 10 Gb/s on four Xeon processors

Sequential FPC Algorithm [DCC'07]
- Make two predictions
- Select closer value
- XOR with true value
- Count leading zero bytes
- Encode value
- Update predictors
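The per-value compression step outlined above can be pictured with the following C sketch. It is only illustrative: the real FPC uses hash-indexed FCM and DFCM predictor tables with carefully tuned sizes and hash functions, and it packs two 4-bit headers per byte, whereas the table size, hash updates, and one-byte header below are simplifying assumptions.

```c
/*
 * Minimal sketch of one FPC compression step, assuming a simplified
 * one-byte header per value.  The real FPC packs two 4-bit headers per
 * byte; the table size and hash functions below are illustrative only.
 */
#include <stdint.h>
#include <string.h>

#define TSIZE 4096u                    /* illustrative predictor table size */
static uint64_t fcm[TSIZE];            /* "finite context" predictor table  */
static uint64_t dfcm[TSIZE];           /* "differential" predictor table    */
static uint64_t fcm_hash, dfcm_hash, last_value;

static int leading_zero_bytes(uint64_t x)
{
    int n = 0;
    while ((n < 8) && ((x >> 56) == 0)) { x <<= 8; n++; }
    return n;
}

/* Compress one double: predict twice, keep the closer prediction, XOR it
 * with the true value, drop the leading zero bytes of the residual, and
 * update both predictors.  Returns the number of bytes written to out. */
int fpc_encode(double value, unsigned char *out)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);

    uint64_t pred1 = fcm[fcm_hash];                  /* FCM prediction  */
    uint64_t pred2 = dfcm[dfcm_hash] + last_value;   /* DFCM prediction */
    uint64_t res1 = bits ^ pred1;
    uint64_t res2 = bits ^ pred2;
    int use_dfcm = (res2 < res1);                    /* closer prediction wins */
    uint64_t residual = use_dfcm ? res2 : res1;

    int lzb = leading_zero_bytes(residual);          /* 0 .. 8 */
    out[0] = (unsigned char)((use_dfcm << 4) | lzb); /* simplified header */
    int nbytes = 8 - lzb;
    for (int i = 0; i < nbytes; i++)                 /* nonzero residual bytes */
        out[1 + i] = (unsigned char)(residual >> (8 * (nbytes - 1 - i)));

    /* update predictor tables and hashes (illustrative hash functions) */
    fcm[fcm_hash] = bits;
    fcm_hash = ((fcm_hash << 6) ^ (bits >> 48)) & (TSIZE - 1);
    dfcm[dfcm_hash] = bits - last_value;
    dfcm_hash = ((dfcm_hash << 2) ^ ((bits - last_value) >> 40)) & (TSIZE - 1);
    last_value = bits;

    return 1 + nbytes;
}
```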

pFPC: Parallel FPC Algorithm
- pFPC operation
  - Divide data stream into chunks
  - Logically assign chunks round-robin to threads
  - Each thread compresses its data with FPC
- Key parameters
  - Chunk size & number of threads
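A minimal sketch of this round-robin chunk decomposition is shown below, written with OpenMP for brevity (the released pFPC code uses its own plain-C threading). CHUNK, the per-thread output buffers, and the reuse of the fpc_encode sketch above are assumptions; in pFPC each thread also keeps private predictor state, which the globals in the previous sketch would have to become.

```c
/*
 * Sketch of pFPC's chunked, round-robin work assignment (OpenMP used here
 * for brevity).  CHUNK and the per-thread output buffers are illustrative.
 * Note: the predictor state from the previous sketch must be made
 * per-thread for this to be correct.
 */
#include <omp.h>
#include <stddef.h>

#define CHUNK 1024   /* doubles per chunk -- one of the two key parameters */

int fpc_encode(double value, unsigned char *out);   /* per-value step (see above) */

void pfpc_compress(const double *data, size_t n,
                   unsigned char **outbuf, size_t *outlen, int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        int nt  = omp_get_num_threads();
        size_t pos = 0;

        /* thread tid handles chunks tid, tid + nt, tid + 2*nt, ... */
        for (size_t start = (size_t)tid * CHUNK; start < n;
             start += (size_t)nt * CHUNK) {
            size_t end = (start + CHUNK < n) ? start + CHUNK : n;
            for (size_t i = start; i < end; i++)
                pos += (size_t)fpc_encode(data[i], outbuf[tid] + pos);
        }
        outlen[tid] = pos;   /* each thread emits an independent substream */
    }
}
```

The two knobs highlighted on the slide, chunk size and thread count, appear here as CHUNK and nthreads; the following slides evaluate how they affect compression ratio and throughput.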

Evaluation Method
- Systems
  - 3.0 GHz Xeon with 4 processors
  - Others in paper
- Datasets
  - Linear streams of real-world data (18 – 277 MB)
  - 3 observations: error, info, spitzer
  - 3 simulations: brain, comet, plasma
  - 3 messages: bt, sp, sweep3d

Compression Ratio vs. Thread Count
- Configuration
  - Small predictor
  - Chunk size = 1
- Compression ratio
  - Low (FP data)
  - Other algorithms fare worse
- Fluctuations
  - Due to multi-dimensional data

Compression Ratio vs. Chunk Size
- Configuration
  - Small predictor
  - 1 to 4 threads
- Compression ratio
  - Flat for 1 thread
  - Steep initial drop
- Chunk size
  - Larger is better for history-based predictors

Throughput on Xeon System
- Throughput increases with chunk size
  - Loop overhead, false sharing, TLB performance
- Throughput scales with thread count
  - Limited by load balance and memory bandwidth
[Charts: compression and decompression throughput]

Summary
- pFPC algorithm
  - Chunks up the data and logically assigns the chunks in round-robin fashion to threads
  - Reaches 10.9 and 13.6 Gb/s throughput with a compression ratio of 1.18 on a 4-core 3 GHz Xeon
  - Portable C source code is available on-line

Conclusions
- For the best compression ratio, the thread count should equal, or be a small multiple of, the data's dimension
  - The chunk size should be one
- For the highest throughput, the chunk size should at least match the system's page size (and be page aligned)
  - Larger chunks also yield higher compression ratios with history-based predictors
- Parallel scaling is limited by memory bandwidth
  - Future work should focus on improving the compression ratio without increasing the memory bandwidth
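As an illustration of the page-size guideline (not part of the original slides), a chunk size could be derived at run time as sketched below; sysconf(_SC_PAGESIZE) is the POSIX call, and sizing the chunk in doubles with a 4096-byte fallback is an assumption.

```c
/*
 * Illustrative only: pick a chunk size (in doubles) that is at least one
 * memory page, following the page-size guideline above.
 */
#include <unistd.h>
#include <stddef.h>

size_t chunk_size_in_doubles(void)
{
    long page = sysconf(_SC_PAGESIZE);        /* page size in bytes */
    if (page <= 0) page = 4096;               /* conservative fallback */
    return (size_t)page / sizeof(double);     /* e.g., 512 doubles per 4 KB page */
}
```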