CULZSS: LZSS Lossless Data Compression on CUDA
Adnan Ozsoy & Martin Swany
DAMSL - Distributed and MetaSystems Lab
Department of Computer and Information Sciences, University of Delaware
September 2011

OUTLINE
Introduction
Background
Implementation Details
Performance Analysis
Discussion and Limitations
Related Work
Future Work
Conclusion
Q & A

INTRODUCTION
Utilization of expensive resources: memory bandwidth and CPU usage
Data compression reduces that utilization, with a trade-off of increased running time
GPUs can absorb that cost
CULZSS: general-purpose lossless data compression on the GPU
Reduces the added time compared to a CPU implementation while keeping the compression ratio

BACKGROUND: LZSS Algorithm
A variant of LZ77, a dictionary encoding scheme
Sliding-window search buffer
Uncoded lookahead buffer
(Diagram on slide: flow of the algorithm)
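A minimal sketch of the serial LZSS match step, for orientation (illustrative only; buffer sizes, the match threshold, and the encoding are assumptions, not the CULZSS code):

    /* Find the longest match for the lookahead buffer inside the
     * sliding-window search buffer. */
    typedef struct { int offset; int length; } match_t;

    static match_t find_longest_match(const unsigned char *window, int win_len,
                                      const unsigned char *lookahead, int la_len) {
        match_t best = { 0, 0 };
        for (int i = 0; i < win_len; i++) {          /* try every window offset */
            int len = 0;
            while (len < la_len && i + len < win_len &&
                   window[i + len] == lookahead[len])
                len++;
            if (len > best.length) { best.offset = i; best.length = len; }
        }
        return best;  /* emit (offset, length) if long enough, else a literal byte */
    }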

BACKGROUND: LZSS Algorithm
(Worked example on slide)

BACKGROUND: GPU Architecture
Fermi architecture: up to 512 CUDA cores, 16 SMs of 32 cores each
Hierarchy of memory types:
Global, constant, and texture memory are accessible by all threads
Shared memory is accessible only by threads in the same block
Registers and local memory are private to the thread that owns them
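These memory spaces map directly onto CUDA declarations; a small illustrative kernel (names and sizes are placeholders, launched with 128 threads per block):

    __constant__ int lut[256];                    // constant memory: read-only, all threads

    __global__ void memory_spaces(const unsigned char *in, unsigned char *out) {
        __shared__ unsigned char tile[128];       // shared memory: this block only
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // registers: per thread
        tile[threadIdx.x] = in[idx];              // global memory: all threads
        __syncthreads();
        out[idx] = tile[threadIdx.x] + lut[tile[threadIdx.x]];
    }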

IMPLEMENTATION DETAILS
LZSS CPU implementation
The CUDA implementation has two versions
API:
gpu_compress(*buffer, buf_length, **compressed_buffer, &comp_length, compression_parameters);
gpu_decompress(*buffer, buf_length, **decompressed_buffer, &decomp_length, compression_parameters);
Covered next: Version 1, Version 2, Decompression, Optimizations
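A hedged usage sketch of the API above; only the two call signatures come from the slide, while params_t, its fields, and read_file() are hypothetical stand-ins:

    unsigned int in_len, comp_len, decomp_len;
    unsigned char *in_buf = read_file("input.dat", &in_len);   /* hypothetical helper */
    unsigned char *comp_buf = NULL, *decomp_buf = NULL;

    /* tuning knobs mentioned later in the slides (assumed struct layout) */
    params_t compression_parameters = { .buf_size = 128, .threads_per_block = 128 };

    gpu_compress(in_buf, in_len, &comp_buf, &comp_len, compression_parameters);
    gpu_decompress(comp_buf, comp_len, &decomp_buf, &decomp_len, compression_parameters);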

IMPLEMENTATION DETAILS: LZSS CPU Implementation
Straightforward implementation of the algorithm, mainly adapted from Dipperstein's work
To be fair to the CPU, a threaded version was implemented with POSIX threads: each thread is given a chunk of the file, the chunks are compressed concurrently, and the output is reassembled (see the sketch below)
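A minimal sketch of that Pthread chunking scheme, assuming a serial lzss_compress() routine and at most 64 threads (both assumptions):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct {
        const unsigned char *in;  size_t in_len;   /* this thread's chunk */
        unsigned char *out;       size_t out_len;  /* compressed result */
    } chunk_t;

    static void *compress_chunk(void *arg) {
        chunk_t *c = (chunk_t *)arg;
        c->out_len = lzss_compress(c->in, c->in_len, c->out);  /* serial LZSS (assumed) */
        return NULL;
    }

    static void compress_file(chunk_t chunks[], int nthreads) {
        pthread_t tid[64];
        for (int i = 0; i < nthreads; i++)          /* compress chunks concurrently */
            pthread_create(&tid[i], NULL, compress_chunk, &chunks[i]);
        for (int i = 0; i < nthreads; i++)          /* then reassemble in order */
            pthread_join(tid[i], NULL);
    }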

IMPLEMENTATION DETAILS: CUDA Implementation, Version 1
Similar to the Pthread implementation
Chunks of the file are distributed among blocks; within each block, smaller chunks are assigned to individual threads
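A sketch of that Version 1 layout, one sub-chunk per thread; compress_serial() stands in for a device-side serial LZSS, and the 2x output allowance is an assumption:

    __global__ void lzss_v1(const unsigned char *in, unsigned char *out,
                            int chunk_size) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
        const unsigned char *my_in  = in  + (size_t)tid * chunk_size;
        unsigned char       *my_out = out + (size_t)tid * 2 * chunk_size; // worst-case room
        compress_serial(my_in, chunk_size, my_out);        // each thread works alone
    }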

IMPLEMENTATION DETAILS: CUDA Implementation, Version 2
Exploits the algorithm's SIMD nature
The work distributed among the threads in a block is the matching phase of the compression for a single chunk
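A sketch of that cooperative matching step: the 128 threads of a block work on one chunk, each testing the match that starts at its own window offset (buffer sizes assumed; the atomicMax reduction only finds the best length, and a real implementation would also track the offset):

    __global__ void lzss_v2_match(const unsigned char *in, int *best_len) {
        __shared__ unsigned char window[128];    // search buffer for this chunk
        __shared__ unsigned char lookahead[128]; // uncoded bytes to match
        int base = blockIdx.x * 256;
        window[threadIdx.x]    = in[base + threadIdx.x];        // coalesced loads,
        lookahead[threadIdx.x] = in[base + 128 + threadIdx.x];  // one byte per thread
        __syncthreads();

        int len = 0;                              // match length at my offset
        while (threadIdx.x + len < 128 && window[threadIdx.x + len] == lookahead[len])
            len++;
        atomicMax(&best_len[blockIdx.x], len);    // keep the longest match (zero-initialized)
    }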

IMPLEMENTATION DETAILS: Version 2
(Diagram on slide: matching phase and the CPU steps)

IMPLEMENTATION DETAILS: Decompression
Identical in both versions
Each character is read, decoded, and written to the output
Blocks behave independently of one another
A list of compressed block sizes must be kept; its length depends on the number of blocks
Negligible impact on the compression ratio
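Because blocks are independent, each block's position in the compressed stream can be derived from the size list with a prefix sum; a minimal host-side sketch (the exact layout is an assumption):

    #include <stddef.h>

    /* A prefix sum over the stored per-block sizes gives each block's
     * starting offset in the compressed stream. */
    void compute_block_offsets(const size_t *block_size, size_t *block_offset,
                               int num_blocks) {
        size_t off = 0;
        for (int b = 0; b < num_blocks; b++) {
            block_offset[b] = off;     /* where block b's compressed data begins */
            off += block_size[b];      /* sizes come from the kept size list */
        }
    }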

IMPLEMENTATION DETAILS: Optimizations
Coalesced access to global memory
Accesses that fit into one aligned segment are served by a single memory transaction
Access to global memory is needed before and after the matching stage
In Version 1, each thread reads/writes a full buffer of data
In Version 2, each thread reads/writes 1 byte of memory
For that reason, test results show the best performance with 128-byte buffers and 128 threads per block, which yields single memory transactions on Fermi architectures (see the sketch below)
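An illustrative version of the Version 2 pattern: 128 threads each read one consecutive byte, so the block's 128-byte buffer is served by a single transaction on Fermi (kernel names and sizes are placeholders):

    __global__ void load_buffer(const unsigned char *in, unsigned char *out) {
        __shared__ unsigned char buf[128];
        int base = blockIdx.x * 128;
        buf[threadIdx.x] = in[base + threadIdx.x];  // consecutive addresses: coalesced
        __syncthreads();
        /* ... the matching stage would run on buf here ... */
        out[base + threadIdx.x] = buf[threadIdx.x]; // write back, also coalesced
    }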

IMPLEMENTATION DETAILS: Optimizations (cont.)
Shared memory usage
Shared memory is divided into banks; each bank can serve only one request at a time, but requests that all target different banks are satisfied in parallel
Using shared memory for the sliding window in Version 1 gave a 30% speedup
The access pattern in Version 2 is suitable for bank-conflict-free access, with each thread accessing a different bank
Speedups are shown in the analysis
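A sketch of bank-conflict-free shared memory access of the kind described above: thread k touches word k, so the threads of a warp hit 32 distinct banks (Fermi's bank count) and proceed in parallel (illustrative kernel, not the CULZSS code):

    __global__ void bank_friendly(const int *in, int *out) {
        __shared__ int s[128];
        int idx = blockIdx.x * 128 + threadIdx.x;
        s[threadIdx.x] = in[idx];       // thread k -> bank (k % 32): no conflict
        __syncthreads();
        out[idx] = s[threadIdx.x] * 2;
        // By contrast, an access like s[threadIdx.x * 32] would send every
        // thread of a warp to bank 0 and serialize into 32 transactions.
    }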

IMPLEMENTATION DETAILS: Optimizations (cont.)
Configuration parameters:
Threads per block: Version 1 is limited by shared memory
Buffer size: Version 2 consumes twice the buffer size of a single buffer, which limits encoding offsets to a 16-bit space
Chosen configuration: 128-byte buffers, 128 threads per block

PERFORMANCE ANALYSIS: Testbed Configuration
GeForce GTX 480, CUDA 3.2
Intel Core i7 920 CPU at 2.67 GHz

PERFORMANCE ANALYSIS: Datasets
Five sets of data:
Collection of C files: text-based input
Delaware State Digital Raster Graphics and Digital Line Graphs server: basemaps for georeferencing and visual analysis
English dictionary: alphabetically ordered text
Linux kernel tarball
Highly compressible custom data set: repeating characters in substrings of 20

PERFORMANCE ANALYSIS
(Results charts on slides)

DISCUSSION AND LIMITATIONS
Limitations with shared memory: Version 1 needs more shared memory when run with more than 128 threads
LZSS is not inherently parallel; some parts are left to the CPU
Opportunity for overlap, with up to double the performance
The main goal is achieved: better performance than the CPU-based implementation
BZIP2 is used for comparison as a well-known application

DISCUSSION AND LIMITATIONS
Version 1 gives better performance on highly compressible data, since it can skip matched data
Version 2 is better on the other data sets, even though it cannot skip matched data
Better utilization of the GPU (coalescing accesses and avoiding bank conflicts) leads to better performance
Two versions, with the option to choose between them in the API, let users pick the best-matching implementation

RELATED WORK
CPU based: parallelizing with threads
GPU based:
Lossy data compression has received broad attention from the GPU community: image/video processing, texture compression, etc.
Lossless data compression:
O'Neil et al., GFC: specialized for double-precision floating-point data compression
Balevic proposes a data-parallel algorithm for variable-length encoding

FUTURE WORK
More detailed tuning of configuration via the API
Combined CPU and GPU heterogeneous implementation, to benefit from upcoming architectures that put both on the same die (AMD Fusion, Intel Nehalem, ...)
Multi-GPU usage
Overlapping computation between CPU and GPU in a pipelined fashion

CONCLUSION
CULZSS demonstrates promising use of GPUs for lossless data compression
To the best of our knowledge, the first successful attempt to improve lossless compression of general-purpose data on GPUs
Outperformed serial LZSS by up to 18x and Pthread LZSS by up to 3x in compression time
Compression ratios remain very similar to the CPU version

Q & A