Exploiting SIMD parallelism with the CGiS compiler framework Nicolas Fritz, Philipp Lucas, Reinhard Wilhelm Saarland University.


Exploiting SIMD parallelism with the CGiS compiler framework Nicolas Fritz, Philipp Lucas, Reinhard Wilhelm Saarland University

2 Outline
CGiS
- Language, compiler and GPU back-end
SIMD back-end
- Hardware
- Challenges
- Transformations and optimizations
Experimental results
Future Work
Conclusion

3 CGiS
C-like data-parallel programming language
Goals:
- Exploitation of the parallel processing units in common PCs (GPUs, SIMD units)
- Easy access for inexperienced programmers
- High abstraction level
32-bit scalar and small vector data types
Two forms of explicit parallelism
- SPMD (iteration) and SIMD (vector types)

4 CGiS Example: YUV to RGB
PROGRAM yuv_to_rgb;

INTERFACE
extern in  float3 YUV;
extern out float3 RGB;

CODE
procedure yuv2rgb (in float3 yuv, out float3 rgb)
{
  rgb = yuv.x + [0, 0.344, 1.77] * yuv.y
              + [1.403, 0.714, 0] * yuv.z;
}

CONTROL
forall (yuv in YUV, rgb in RGB) {
  yuv2rgb (yuv, rgb);
}

5 CGiS Compiler Overview
[Diagram with components: CGiS Source, CGiS Compiler, PPU Code, Interface, CGiS Runtime, Application]

6 CGiS for GPUs
nVidia G80:
- 128 floating-point units
- Scalar and vector data processable
2-on-2 mapping of CGiS' parallelism
Code generation for various GPU generations
- NV30, NV40, G80, CUDA
- Limited access to hardware features through the driver

7 SIMD Hardware
Every common PC features SIMD units
- Intel's SSE and Freescale's AltiVec
SIMD parallelism not easily accessible to standard compilers
- Well-known vectorization problems
Data access
- Hardware requires 16-byte-aligned loads
- Slow, but cached
Only 4-way SIMD vector parallelism usable
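
The alignment constraint is the key restriction on data access. A minimal C/SSE sketch, not from the slides, assuming the arrays were allocated 16-byte aligned (e.g. with _mm_malloc):

#include <xmmintrin.h>  /* SSE intrinsics */

/* Adds two float arrays four elements at a time.
   _mm_load_ps requires 16-byte-aligned addresses; misaligned
   data would need the slower _mm_loadu_ps instead. */
void add4(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);   /* aligned 4-float load */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&out[i], _mm_add_ps(va, vb));
    }
}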

8 The SIMD Back-end
Goal: map CGiS parallelism to SIMD hardware
- "2-on-1" mapping
SIMD vectorization problems
- Avoided by design: no data-dependence analyses needed
- Control flow: divergence in consecutive elements
- Misalignment and data layout: reordering might be needed
- Gathering operations are bottlenecks in load-heavy algorithms on multidimensional streams

9 Transformations and Optimizations
Control flow conversion
- If/loop conversion
Loop sectioning for 2D streams
- Increases cache performance for gather accesses
Kernel flattening
- IR transformation that replaces compound variables and operations by scalar ones
- Enables the "2-on-1" mapping

10 Control Flow Conversion
Full inlining
If/loop conversion with a slightly modified Allen-Kennedy algorithm
- No guarded assignments
- Masks for select operations are the results of vector compares
- Variables live or written after a control flow join are copied at the branching
- Select operations are inserted at the join (see the sketch below)
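
To illustrate the select pattern, here is a hedged sketch in C with SSE intrinsics (not the compiler's actual output): the scalar branch "if (x > t) y = a; else y = b;" becomes a vector compare producing a per-lane mask, followed by a branch-free select at the join.

#include <xmmintrin.h>

/* If-conversion sketch: the vector compare yields an all-ones /
   all-zeros mask per lane; the select merges both branch values. */
__m128 select_gt(__m128 x, __m128 t, __m128 a, __m128 b)
{
    __m128 mask = _mm_cmpgt_ps(x, t);            /* lane-wise x > t  */
    return _mm_or_ps(_mm_and_ps(mask, a),        /* lanes where true  */
                     _mm_andnot_ps(mask, b));    /* lanes where false */
}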

11 Loop Sectioning
Adaptation of the iteration sequence to better exploit cached data
- Only interesting for 2D streams
- Iterations are subdivided into stripes (see the sketch below)
- Stripe width depends on the access pattern, cache size and local variables
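
A plain-C sketch of the idea; the stripe width and the kernel function are hypothetical, not values from the talk:

#define STRIPE 64  /* hypothetical width; tuned to access pattern and cache size */

extern void kernel(float *img, int width, int x, int y);  /* gathers neighbouring rows */

/* Instead of walking each row in full, the 2D iteration space is cut
   into vertical stripes, so the rows touched by gather accesses are
   still cached when the next row of the stripe is processed. */
void process(float *img, int width, int height)
{
    for (int xs = 0; xs < width; xs += STRIPE) {
        int xe = (xs + STRIPE < width) ? xs + STRIPE : width;
        for (int y = 0; y < height; y++)
            for (int x = xs; x < xe; x++)
                kernel(img, width, x, y);
    }
}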

12 Kernel Flattening
SIMD vectorization is not directly applicable to yuv2rgb
Thus, "flatten" the procedure (kernel):
- Code transformation on the IR
- All variables and all statements are split into scalar ones
- The scalar versions can then be subjected to SIMD vectorization

procedure yuv2rgb (in float3 yuv, out float3 rgb)
{
  rgb = yuv.x + [0, 0.344, 1.77] * yuv.y
              + [1.403, 0.714, 0] * yuv.z;
}

13 Kernel Flattening Example
procedure yuv2rgb_f (in  float yuv_x, in  float yuv_y, in  float yuv_z,
                     out float rgb_x, out float rgb_y, out float rgb_z)
{
  float cy = 0.344, cz = 1.77, dx = 1.403, dy = 0.714;
  rgb_x = yuv_x + dx * yuv_z;
  rgb_y = yuv_x + cy * yuv_y + dy * yuv_z;
  rgb_z = yuv_x + cz * yuv_y;
}

Procedure yuv2rgb_f now features only data types suitable for SIMD parallelization
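
After flattening, each component stream can be packed four elements to a register. A hedged C/SSE sketch of the code this enables (names are illustrative, not the compiler's output):

#include <xmmintrin.h>

/* Each __m128 holds one component of four consecutive pixels, so every
   scalar statement of yuv2rgb_f maps onto one 4-wide vector instruction. */
void yuv2rgb_simd(__m128 yuv_x, __m128 yuv_y, __m128 yuv_z,
                  __m128 *rgb_x, __m128 *rgb_y, __m128 *rgb_z)
{
    const __m128 cy = _mm_set1_ps(0.344f), cz = _mm_set1_ps(1.77f);
    const __m128 dx = _mm_set1_ps(1.403f), dy = _mm_set1_ps(0.714f);

    *rgb_x = _mm_add_ps(yuv_x, _mm_mul_ps(dx, yuv_z));
    *rgb_y = _mm_add_ps(yuv_x, _mm_add_ps(_mm_mul_ps(cy, yuv_y),
                                          _mm_mul_ps(dy, yuv_z)));
    *rgb_z = _mm_add_ps(yuv_x, _mm_mul_ps(cz, yuv_y));
}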

14 Kernel Flattening
But: the data layout does not fit
- No stride-one access for single components
- Reordering of data required
- Locally via permutes or shuffles
- Globally via memory copy (sketch below)
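
Global reordering amounts to a de-interleaving copy in memory. A minimal sketch with a hypothetical helper, assuming a packed xyzxyz... float3 stream:

/* Copies an interleaved float3 stream (array of structures) into three
   separate component arrays (structure of arrays), giving each component
   the stride-one layout SIMD loads need. Simple, but it costs a full
   pass over memory. */
void reorder_global(const float *yuv, float *x, float *y, float *z, int n)
{
    for (int i = 0; i < n; i++) {
        x[i] = yuv[3*i + 0];
        y[i] = yuv[3*i + 1];
        z[i] = yuv[3*i + 2];
    }
}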

15 Kernel Flattening: Data Reordering (figure)

16 Global vs. Local Reordering
Global reordering
- Reusable for further iterations
- Simple, but expensive in-memory copy
- Destroys locality for gather accesses
Local reordering
- Original stream data untouched
- Insertion of possibly many, but relatively cheap, in-register permutation operations
- Locality for gathering preserved
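
Local reordering keeps the stream in memory untouched and transposes in registers instead. A C/SSE sketch, using a 4-component record for simplicity (a float3 stream needs a slightly messier shuffle sequence):

#include <xmmintrin.h>

/* Loads four interleaved xyzw records and transposes them in registers;
   _MM_TRANSPOSE4_PS expands to four unpack/shuffle instructions. */
void load_transposed(const float *aos,
                     __m128 *x, __m128 *y, __m128 *z, __m128 *w)
{
    __m128 r0 = _mm_loadu_ps(aos +  0);  /* x0 y0 z0 w0 */
    __m128 r1 = _mm_loadu_ps(aos +  4);  /* x1 y1 z1 w1 */
    __m128 r2 = _mm_loadu_ps(aos +  8);  /* x2 y2 z2 w2 */
    __m128 r3 = _mm_loadu_ps(aos + 12);  /* x3 y3 z3 w3 */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   /* r0 = x0 x1 x2 x3, ... */
    *x = r0; *y = r1; *z = r2; *w = r3;
}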

17 Experimental Results
Tested on an Intel Core 2 Duo (1.83 GHz) and a PowerPC G5 (1.8 GHz)
- Compiled with intrinsics using gcc
Examples
- Image processing: Gaussian blur (loop sectioning)
- Computation of the Mandelbrot set (control flow conversion)
- Block cipher encryption: RC5 (kernel flattening)

18 Experimental Results (chart)

19 Future Work
Replace intrinsics by inline assembly
- Improvement of conditionals
- Better control over register allocation
Improvement of register re-utilization for AltiVec
- Improves with inline assembly
Cell back-end
- SIMD instruction set close to AltiVec
- Work-list algorithm to distribute stream parts to the single PEs
More applications

20 Conclusion
CGiS abstracts GPUs as well as SIMD units
The SIMD back-end of the CGiS compiler produces efficient code
- It needs transformations and optimizations different from those of the GPU back-end
- Full control flow conversion is needed
- Gather accesses gain speed from loop sectioning
- Kernel flattening enables better exploitation of the SIMD units