Seunghwa Kang David A. Bader Optimizing Discrete Wavelet Transform on the Cell Broadband Engine.

Slides:

Advertisements

Similar presentations

Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories Muthu Baskaran 1 Uday Bondhugula.

Advertisements

3D Graphics Content Over OCP Martti Venell Sr. Verification Engineer Bitboys.

Lecture 4 Introduction to Digital Signal Processors (DSPs) Dr. Konstantinos Tatas.

Systems and Technology Group © 2006 IBM Corporation Cell Programming Tutorial - JHD24 May 2006 Cell Programming Tutorial Jeff Derby, Senior Technical Staff.

School of EECS, Peking University “Advanced Compiler Techniques” (Fall 2011) Parallelism & Locality Optimization.

An OpenCL Framework for Heterogeneous Multicores with Local Memory PACT 2010 Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jungho Park, Honggyu.

Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.

CML Efficient & Effective Code Management for Software Managed Multicores CODES+ISSS 2013, Montreal, Canada Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce.

Implementation of 2-D FFT on the Cell Broadband Engine Architecture William Lundgren Gedae), Kerry Barnes (Gedae), James Steed (Gedae)

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

 Understanding the Sources of Inefficiency in General-Purpose Chips.

Single-Chip Multiprocessor Nirmal Andrews. Case for single chip multiprocessors Advances in the field of integrated chip processing. - Gate density (More.

1 Outline  Introduction to JEPG2000  Why another image compression technique  Features  Discrete Wavelet Transform  Wavelet transform  Wavelet implementation.

Using Cell Processors for Intrusion Detection through Regular Expression Matching with Speculation Author: C˘at˘alin Radu, C˘at˘alin Leordeanu, Valentin.

Programming Multiprocessors with Explicitly Managed Memory Hierarchies ELEC 6200 Xin Jin 4/30/2010.

Development of a Ray Casting Application for the Cell Broadband Engine Architecture Shuo Wang University of Minnesota Twin Cities Matthew Broten Institute.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

Page 1 CS Department Parallel Design of JPEG2000 Image Compression Xiuzhen Huang CS Department UC Santa Barbara April 30th, 2003.

CISC 879 : Software Support for Multicore Architectures John Cavazos Dept of Computer & Information Sciences University of Delaware

1 Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper- Threading Architectures Steven Ge, Xinmin Tian, and Yen-Kuang Chen IEEE Pacific-Rim.

Synergistic Processing In Cell’s Multicore Architecture Michael Gschwind, et al. Presented by: Jia Zou CS258 3/5/08.

CS 7810 Lecture 24 The Cell Processor H. Peter Hofstee Proceedings of HPCA-11 February 2005.

Embedded Computer Architecture 5KK73 MPSoC Platforms Part2: Cell Bart Mesman and Henk Corporaal.

Still Image Conpression JPEG & JPEG2000 Yu-Wei Chang /18.

J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy IBM Systems and Technology Group IBM Journal of Research and Development.

Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.

Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado U C M.

Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick Lawrence Berkeley National Laboratory ACM International Conference.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Performance Tuning on Multicore Systems for Feature Matching within Image Collections Xiaoxin Tang*, Steven Mills, David Eyers, Zhiyi Huang, Kai-Cheung.

Software Performance Analysis Using CodeAnalyst for Windows Sherry Hurwitz SW Applications Manager SRD Advanced Micro Devices Lei.

National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Cell processor implementation of a MILC lattice QCD application.

1 Advance Computer Architecture CSE 8383 Ranya Alawadhi.

Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.

Programming Examples that Expose Efficiency Issues for the Cell Broadband Engine Architecture William Lundgren Gedae), Rick Pancoast.

High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.

1 Fast and Efficient Partial Code Reordering Xianglong Huang (UT Austin, Adverplex) Stephen M. Blackburn (Intel) David Grove (IBM) Kathryn McKinley (UT.

© 2007 SET Associates Corporation SAR Processing Performance on Cell Processor and Xeon Mark Backues, SET Corporation Uttam Majumder, AFRL/RYAS.

Kevin Eady Ben Plunkett Prateeksha Satyamoorthy.

Assembly Code Optimization Techniques for the AMD64 Athlon and Opteron Architectures David Phillips Robert Duckles Cse 520 Spring 2007 Term Project Presentation.

CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.

Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements Gautam Chakrabarti and Fred Chow PathScale, LLC.

LLMGuard: Compiler and Runtime Support for Memory Management on Limited Local Memory (LLM) Multi-Core Architectures Ke Bai and Aviral Shrivastava Compiler.

High Performance Computing Group Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE TM Architecture Feasibility Study of MPI.

By Edward A. Lee, J.Reineke, I.Liu, H.D.Patel, S.Kim

Authors – Jeahyuk huh, Doug Burger, and Stephen W.Keckler Presenter – Sushma Myneni Exploring the Design Space of Future CMPs.

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.

LYU0703 Parallel Distributed Programming on PS3 1 Huang Hiu Fung Wong Chung Hoi Supervised by Prof. Michael R. Lyu Department of Computer.

Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke

Flexible Filters for High Performance Embedded Computing Rebecca Collins and Luca Carloni Department of Computer Science Columbia University.

Optimizing Ray Tracing on the Cell Microprocessor David Oguns.

Comparison of Cell and POWER5 Architectures for a Flocking Algorithm A Performance and Usability Study CS267 Final Project Jonathan Ellithorpe Mark Howison.

Computer Organization CS224 Fall 2012 Lesson 52. Introduction  Goal: connecting multiple computers to get higher performance l Multiprocessors l Scalability,

Aurora/PetaQCD/QPACE Metting Regensburg University, April 14-15, 2010.

Presented by Jeremy S. Meredith Sadaf R. Alam Jeffrey S. Vetter Future Technologies Group Computer Science and Mathematics Division Research supported.

Aarul Jain CSE520, Advanced Computer Architecture Fall 2007.

FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine David A. Bader, Virat Agarwal.

SIMD Implementation of Discrete Wavelet Transform Jake Adriaens Diana Palsetia.

A Survey of the Current State of the Art in SIMD: Or, How much wood could a woodchuck chuck if a woodchuck could chuck n pieces of wood in parallel? Wojtek.

1/21 Cell Processor Systems Seminar Diana Palsetia (11/21/2006)

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

High Performance Computing on an IBM Cell Processor --- Bioinformatics

Christopher Han-Yu Chou Supervisor: Dr. Guy Lemieux

Vector Processing => Multimedia

Ke Bai and Aviral Shrivastava Presented by Bryce Holton

Implementation of DWT using SSE Instruction Set

Coe818 Advanced Computer Architecture

Presentation transcript:

Seunghwa Kang David A. Bader Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Key Contributions We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity We introduce multiple Cell/B.E. and DWT specific optimization issues and solutions Our implementation achieves 34 and 56 times speedup over one PPE performance, and 4.7 and 3.7 times speedup over the cutting edge multicore processor (AMD Barcelona), for lossless and lossy DWT, respectively.

Presentation Outline Discrete Wavelet Transform Cell Broadband Engine architecture -Comparison with the traditional multicore processor -Impact in performance and programmability Optimization Strategies -Previous work -Data decomposition scheme -Real number representation -Loop interleaving -Fine-grain data transfer control Performance Evaluation -Comparison with the AMD Barcelona Conclusions

Presentation Outline Discrete Wavelet Transform Cell Broadband Engine architecture -Comparison with the traditional multicore processor -Impact in performance and programmability Optimization Strategies -Previous work -Data decomposition scheme -Real number representation -Loop interleaving -Fine-grain data transfer control Performance Evaluation -Comparison with the AMD Barcelona Conclusions

Discrete Wavelet Transform (in JPEG2000) Decompose an image in both vertical and horizontal direction to the sub-bands representing the coarse and detail part while preserving space information LL LH HL HH

Discrete Wavelet Transform (in JPEG2000) Vertical filtering followed by horizontal filtering Highly parallel but bandwidth intensive Distinct memory access pattern becomes a problem Adopt Jasper [Adams2005] as a baseline code

Presentation Outline Discrete Wavelet Transform Cell Broadband Engine architecture -Comparison with the traditional multicore processor -Impact in performance and programmability Optimization Strategies -Previous work -Data decomposition scheme -Real number representation -Loop interleaving -Fine-grain data transfer control Performance Evaluation -Comparison with the AMD Barcelona Conclusions

Cell/B.E. vs Traditional Multi-core Processor In-order No dynamic branch prediction SIMD only => Small and simple core SPE Traditional Multi-core Processor Out-of-order Dynamic branch prediction Scalar + SIMD => Large and complex core

Cell/B.E. vs Traditional Multi-core Processor Exe. Pipeline LS Main Memory L2 I1D1 L3 Isolated constant latency LS access Software controlled DMA data transfer between LS and main memory Every memory access is cache coherent Hardware controlled data transfer Exe. Pipeline

Cell/B.E. Architecture - Performance More cores within power and transistor budget Invest the larger fraction of the die area for actual computation Highly scalable memory architecture Enable fine-grain data transfer control Efficient vectorization is even more important (No scalar unit)

Cell/B.E. Architecture - Programmability Software (mostly programmer up to date) controlled data transfer Limited LS size Manual vectorization Manual branch hint, loop unrolling, etc. Efficient DMA data transfer requires cache line alignment and transfer size needs to be a multiple of cache line size. Vectorization (SIMD) requires 16 byte alignment and vector size needs to be 16 byte. => Challenging to deal with misaligned data !!!

Cell/B.E. Architecture - Programmability for( i = 0 ; i < n ; i++ ) { a[i] = b[i] + c[i] } v_a = ( vector int* )a; v_b = ( vector int* )b; v_c = ( vector int* )c; for( i = 0 ; i < n_c / 4 ; i++ ) { v_a[i] = v_add( v_b[i], v_c[i] ) } //n_c: a constant multiple of 4 n_head = ( 16 – ( ( unsigned int )a % 16 ) / 4; n_head = n_head % 4; n_body = ( n – n_head ) / 4; n_tail = ( n – n_head ) % 4; for( i = 0 ; i < n_head ; i++ ) { a[i] = b[i] + c[i]; } v_a = ( vector int* )( a + n_head ); v_b = ( vector int* )( b + n_head ); v_c = ( vector int* )( c + n_head ); for( i = 0 ; i < n_body ; i++ ) { v_a[i] = v_add( v_b[i], v_c[i] ) } a = ( int* )( v_a + n_body ); b = ( int* )( v_b + n_body ); c = ( int* )( v_c + n_body ); for( i = 0 ; i < n_tail ; i++ ) { a[i] = b[i] + c[i]; } =>Even more complex if a, b, and c are misaligned!!! Satisfies alignment and size requirements No guarantee in alignment and size Head Body Tail

Presentation Outline Discrete Wavelet Transform Cell Broadband Engine architecture -Comparison with the traditional multicore processor -Impact in performance and programmability Optimization Strategies -Previous work -Data Decomposition Scheme -Real Number Representation -Loop Interleaving -Fine-grain Data Transfer Control Performance Evaluation -Comparison with the AMD Barcelona Conclusions

Previous work Column grouping [Chaver2002] to enhance cache behavior in vertical filtering Muta et al. [Muta2007] optimized convolution based (require up to 2 times more operations than lifting based approach) DWT for Cell/B.E. -High single SPE performance -Does not scale above 1 SPE

Data Decomposition Scheme 2-D array width Row padding A multiple of the cache line size Remainder Distributed to the SPEs Processed by the PPE 2-D array height A multiple of the cache line size A unit of data transfer and computation A unit of data distribution to the processing elements Cache line aligned

Data Decomposition Scheme Satisfies the alignment and size requirements for efficient DMA data transfer and vectorization. Fixed LS space requirements regardless of an input image size Constant loop count A unit of data transfer and computation constant width

Vectorization – Real number representation Jasper adopts fixed point representation -Replace floating point arithmetic with fixed point arithmetic -Not a good choice for Cell/B.E. Inst.Latency (SPE) mpyh7 cycles mpyu7 cycles a2 cycles fm6 cycles mpyh $5, $3, $4 mpyh $2, $4, $3 mpyu $4, $3, $4 a $3, $5, $2 a $3, $3, $4 fm $3, $3, $4

Loop Interleaving In a naïve approach, a single vertical filtering involves 3 or 6 times data transfer Bandwidth becomes a bottleneck Interleave splitting, lifting, and optional scaling steps Does not fit into the LS

Loop Interleaving low0 high0 low1 high1 low2 high2 low3 high3 low0 low1 low2 low3 high0 high1 high2 high3 low0* low1* low2* low3* high0* high1* high2* high3* Interleaved Lifting Splitting First interleave multiple lifting steps Then, merge splitting step with the interleaved lifting step Overwritten before read Use temporary main memory buffer for the upper half

Fine–grain Data Transfer Control low0 high0 low1 high1 low2 high2 low3 high3 low0 low1 low2 low3 high0 high1 high2 high3 low0* low1* low2* low3* high0* high1* high2* high3* Interleaved Lifting Splitting Initially, we copy data from the buffer after the interleaved loop is finished Yet, we can start it just after low2 and high2 are read Cell/B.E.’s software controlled DMA data transfer enables this

Presentation Outline Discrete Wavelet Transform Cell Broadband Engine architecture -Comparison with the traditional multicore processor -Impact in performance and programmability Optimization Strategies -Previous work -Data decomposition scheme -Real number representation -Loop interleaving -Fine-grain data transfer control Performance Evaluation -Comparison with the AMD Barcelona Conclusions

Performance Evaluation * 3800 X 2600 color image, 5 resolution levels * Execution time and scalability up to 2 Cell/B.E. chips (IBM QS20)

Performance Evaluation – Comparison with x86 Architecture ParallelizationOpenMP based parallelization VectorizationAuto-vectorization with compiler directives Real Number Representation Identical to the Cell/B.E. case Loop Interleaving Identical to the Cell/B.E. case Run-time profile feedback Compile with runtime profile feedback -One 3.2 GHz Cell/B.E. chip (IBM QS20) -One 2.0 GHz AMD Barcelona chip (AMD Quad-core Opteron 8350) * Optimization for the Barcelona

Presentation Outline Discrete Wavelet Transform Cell Broadband Engine architecture -Comparison with the traditional multicore processor -Impact in performance and programmability Optimization Strategies -Previous work -Data decomposition scheme -Real number representation -Loop interleaving -Fine-grain data transfer control Performance Evaluation -Comparison with the AMD Barcelona Conclusions

Cell/B.E. has a great potential to speed-up parallel workloads but requires judicious implementation We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity Our implementation demonstrates 34 and 56 times speedup over one PPE, and 4.7 and 3.7 times speedup over the AMD Barcelona processor with one Cell/B.E. chip Cell/B.E. can also be used as an accelerator in combination with the traditional microprocessor

Acknowledgment of Support David A. Bader 26

References [1] M.D. Adams. The JPEG-2000 Still Image Compression Standard, Dec [2] D. Chaver, M. Prieto, L. Pinuel, and F. Tirado. Parallel wavelet transform for large scale image processing, Int’l Parallel and Distributed Processing Symp., Apr [3] H. Muta, M. Doi, H. Nakano, and Y. Mori. Multilevel parallelization on the Cell/B.E. for a Motion JPEG 2000 encoding server, ACM Multimedia Conf., Sep