A Survey of the Current State of the Art in SIMD: Or, How much wood could a woodchuck chuck if a woodchuck could chuck n pieces of wood in parallel? Wojtek Rajski, Nels Oscar, David Burri, Alex Diede

Introduction We have seen how to improve performance by exploiting: Instruction-level parallelism Thread-level parallelism One form of parallelism we have not yet discussed is data-level parallelism.

Introduction Flynn's Taxonomy An organization of computer architectures based on their instruction and data streams Divides all architectures into 4 categories: 1. SISD (Single Instruction, Single Data) 2. SIMD (Single Instruction, Multiple Data) 3. MISD (Multiple Instruction, Single Data) 4. MIMD (Multiple Instruction, Multiple Data)

Introduction Implementations of SIMD Prevalent in GPUs SIMD extensions in CPUs Embedded systems and mobile platforms

Introduction Software for SIMD Many libraries utilize and encapsulate SIMD Adopted in these areas o Graphics o Signal Processing o Video Encoding/Decoding o Some scientific applications

Introduction SIMD implementations fall into three high-level categories: 1. Vector Processors 2. Multimedia Extensions 3. Graphics Processors

Introduction Going forward: Streaming SIMD Extensions (MMX/SSE/AVX) o Similar technology in GPUs Compiler techniques for DLP Problems in the world of SIMD Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for x86 computers. This figure assumes that two cores per chip for MIMD will be added every two years and the number of operations for SIMD will double every four years. Copyright © 2011, Elsevier Inc.

SIMD in Hardware Register Size/Hardware changes Intel Core i7 example The ‘Roofline’ model Limitations of streaming extensions in a CPU

SIMD in Hardware Streaming SIMD requires some basic components: o Wide registers  Rather than 32 bits, registers are 64, 128, or 256 bits wide. o Additional control lines o Additional ALUs to operate simultaneously on operands up to 16 bytes wide
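
A minimal sketch (not from the original slides) of what one wide-register operation looks like from C, assuming an x86 CPU with SSE and the standard intrinsics header:

    #include <immintrin.h>  /* SSE/AVX intrinsics */
    #include <stdio.h>

    int main(void) {
        /* Four 32-bit floats packed into a single 128-bit register each. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

        /* One instruction adds all four lanes at once. */
        __m128 c = _mm_add_ps(a, b);

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }

AVX widens the same idea to 256-bit registers (e.g. _mm256_add_ps), which matches the register width mentioned above.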

Hardware Figure 4.4 Using multiple functional units to improve the performance of a single vector add instruction, C = A + B. The vector processor (a) on the left has a single add pipeline and can complete one addition per cycle. The vector processor (b) on the right has four add pipelines and can complete four additions per cycle. The elements within a single vector add instruction are interleaved across the four pipelines. The set of elements that move through the pipelines together is termed an element group.

Intel i7 The Intel Core i7 o Superscalar processor o Contains several SIMD extensions  16 x 256-bit wide registers, plus additional physical registers in the pipeline  Support for two- and three-operand instructions

The Roofline Model of Performance The Roofline model relates peak floating-point performance, memory bandwidth, and operational intensity (floating-point operations per byte of memory traffic) in a single plot.
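
The slide does not spell the model out; it reduces to one bound, with attainable performance capped by whichever ceiling is lower at a given operational intensity. A small sketch of that formula (not from the original slides):

    /* Roofline bound: attainable GFLOP/s for a kernel with a given
       operational intensity (FLOPs per byte of memory traffic). */
    double roofline(double peak_gflops, double peak_bw_gbs, double op_intensity) {
        double memory_bound = peak_bw_gbs * op_intensity;
        return memory_bound < peak_gflops ? memory_bound : peak_gflops;
    }

Kernels whose operational intensity falls to the left of the ridge point are memory-bound; kernels to the right are compute-bound.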

The Roofline Model of Performance Roofline plots for the AMD Opteron X2 (figures not reproduced in the transcript)

Limitations Memory Latency Memory Bandwidth The actual amount of vectorizable code

SIMD at the software level SIMD is not a new field, but the GPGPU movement has brought renewed focus to it.

SIMD at the software level CUDA (Compute Unified Device Architecture) Developed by Nvidia Limited to Nvidia GPUs: graphics cards G8x and newer Provides both high- and low-level APIs

SIMD at the software level OpenCL Developed by Apple Open to any vendor that decides to support it Designed to execute across GPUs and CPUs Supported on graphics cards G8x and newer Provides both high- and low-level APIs

SIMD at the software level DirectCompute Developed by Microsoft Open to any vendor that supports DirectX 11 Windows only Graphics cards: Nvidia GTX 400 and AMD HD 5000 series Intel's Ivy Bridge will also be supported

Compiler Optimization Not everyone programs in SIMD-based languages, and languages such as C and Java were never designed with SIMD in mind. Compiler technology had to improve to detect vectorizable code.

Compiler Optimization Before optimization can begin, data dependencies have to be understood, but only dependencies within the vector window matter. Vector window size: the amount of data executed in parallel by one SIMD instruction.

Compiler Optimization Before optimization can begin Example:

    // Original loop: one dependence at distance 1, one at distance 16.
    for (int i = 0; i < 16; i++) {
        C[i] = C[i+1];
        C[i] = C[i+16];
    }

    // Unrolled by the vector window size (4):
    for (int i = 0; i < 16; i += 4) {
        C[i]   = C[i+1];
        C[i+1] = C[i+2];    (Wrong)
        C[i+2] = C[i+3];    (Wrong)
        C[i+3] = C[i+4];    (Wrong)
        C[i]   = C[i+16];
        C[i+1] = C[i+17];
        C[i+2] = C[i+18];
        C[i+3] = C[i+19];
    }

The distance-1 accesses conflict inside the four-element window (marked Wrong), while the distance-16 accesses lie outside the window and are safe to vectorize.

Compiler Optimization Framework for vectorization o Prelude o Loop o Postlude o Cleanup

Compiler Optimization Framework for vectorization Prelude: loop-independent variables are prepared for use, and run-time checks verify that vectorization is possible. Loop: vectorizable instructions are performed in the same order as in the original code. The loop may be split into multiple loops when vectorizable sections are separated by more complex code in the original loop.

Compiler Optimization Framework for vectorization o Postlude  All loop-independent variables are returned. o Cleanup  Non-vectorizable iterations of the loop are run, including the remaining iterations that do not fit evenly into the vector size. (A sketch of the overall structure follows below.)
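
A minimal sketch of this prelude/loop/cleanup structure, not taken from the slides, assuming SSE intrinsics and a hypothetical element-wise multiply routine:

    #include <immintrin.h>
    #include <stddef.h>

    /* c[i] = a[i] * b[i]: prelude check, vector loop, scalar cleanup. */
    void mul_arrays(float *c, const float *a, const float *b, size_t n) {
        size_t i = 0;

        /* Prelude: a run-time check that vectorization is safe; here we
           only verify that the output does not overlap either input. */
        int safe = (c + n <= a || a + n <= c) && (c + n <= b || b + n <= c);

        if (safe) {
            /* Loop: full groups of four elements use vector instructions. */
            for (; i + 4 <= n; i += 4) {
                __m128 va = _mm_loadu_ps(a + i);
                __m128 vb = _mm_loadu_ps(b + i);
                _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));
            }
        }

        /* Postlude: nothing to restore in this simple case. */

        /* Cleanup: leftover iterations (or all of them, if the check
           failed) run in scalar form. */
        for (; i < n; i++)
            c[i] = a[i] * b[i];
    }

The prelude here is only an overlap check; real compilers also emit alignment checks and loop versioning in this phase.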

Compiler Optimization Compiler techniques Loop Level Automatic Vectorization Basic Block Level Automatic Vectorization In the presence of control flow

Compiler Optimization Loop Level Automatic Vectorization 1. Find innermost loop that can be vectorized. 2. Transform loop and create vector instructions.

    Original Code:
    for (i = 0; i < 1024; i += 1)
        C[i] = A[i] * B[i];

    Vectorized Code:
    for (i = 0; i < 1024; i += 4) {
        vA = vec_ld(A[i]);
        vB = vec_ld(B[i]);
        vC = vec_mul(vA, vB);
        vec_st(vC, C[i]);
    }

Compiler Optimization Basic Block Level Automatic Vectorization 1. The innermost loop is unrolled by the size of the vector window. 2. Isomorphic scalar instructions are packed into vector instructions (a packed-form sketch follows below).

    Original Code:
    for (i = 0; i < 1024; i += 1)
        C[i] = A[i] * B[i];

    Unrolled Code (after step 1):
    for (i = 0; i < 1024; i += 4) {
        C[i]   = A[i]   * B[i];
        C[i+1] = A[i+1] * B[i+1];
        C[i+2] = A[i+2] * B[i+2];
        C[i+3] = A[i+3] * B[i+3];
    }
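
The slide stops at the unrolled form. As an illustration only (not from the original), step 2 packs isomorphic scalar statements in a basic block into one wide operation; the hypothetical scale4 functions below show the idea with SSE intrinsics:

    #include <immintrin.h>

    /* Before packing: four isomorphic scalar multiplies in one basic block. */
    void scale4_scalar(float *dst, const float *src, float k) {
        dst[0] = src[0] * k;
        dst[1] = src[1] * k;
        dst[2] = src[2] * k;
        dst[3] = src[3] * k;
    }

    /* After packing: one 4-wide multiply replaces all four statements. */
    void scale4_packed(float *dst, const float *src, float k) {
        __m128 vk   = _mm_set1_ps(k);
        __m128 vsrc = _mm_loadu_ps(src);
        _mm_storeu_ps(dst, _mm_mul_ps(vsrc, vk));
    }

Unlike loop-level vectorization, this works on any straight-line code, which is why it is applied at the basic-block level.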

Compiler Optimization In the presence of control flow 1. Apply predication 2. Apply the method from above 3. Remove vector predication 4. Remove scalar predication

    Original Code:
    for (i = 0; i < 1024; i += 1) {
        if (A[i] > 0)
            C[i] = B[i];
        else
            D[i] = D[i-1];
    }

    After Predication:
    for (i = 0; i < 1024; i += 1) {
        P  = A[i] > 0;
        NP = !P;
        C[i] = B[i];     (P)
        D[i] = D[i-1];   (NP)
    }

Compiler Optimization In the presence of control flow

    After Vectorization:
    for (i = 0; i < 1024; i += 4) {
        vP  = A[i:i+3] > (0,0,0,0);
        vNP = vec_not(vP);
        C[i:i+3] = B[i:i+3];          (vP)
        (NP1,NP2,NP3,NP4) = vNP;
        D[i+3] = D[i+2];              (NP4)
        D[i+2] = D[i+1];              (NP3)
        D[i+1] = D[i];                (NP2)
        D[i]   = D[i-1];              (NP1)
    }

    After Removing Predicates:
    for (i = 0; i < 1024; i += 4) {
        vP  = A[i:i+3] > (0,0,0,0);
        vNP = vec_not(vP);
        C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
        (NP1,NP2,NP3,NP4) = vNP;
        if (NP4) D[i+3] = D[i+2];
        if (NP3) D[i+2] = D[i+1];
        if (NP2) D[i+1] = D[i];
        if (NP1) D[i]   = D[i-1];
    }
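
As a concrete illustration (not from the original slides) of the vec_sel step above, the if-branch can be written with real SSE4.1 intrinsics, where _mm_blendv_ps plays the role of vec_sel; the serialized D updates are left scalar:

    #include <smmintrin.h>  /* SSE4.1: _mm_blendv_ps */

    /* Vectorized form of "if (A[i] > 0) C[i] = B[i];" using a lane select. */
    void select_branch(float *C, const float *B, const float *A) {
        __m128 zero = _mm_setzero_ps();
        for (int i = 0; i < 1024; i += 4) {
            __m128 vA = _mm_loadu_ps(&A[i]);
            __m128 vB = _mm_loadu_ps(&B[i]);
            __m128 vC = _mm_loadu_ps(&C[i]);
            __m128 vP = _mm_cmpgt_ps(vA, zero);   /* lanes where A[i] > 0 */
            /* Keep old C where the predicate is false, take B where true. */
            _mm_storeu_ps(&C[i], _mm_blendv_ps(vC, vB, vP));
        }
    }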

CPU vs GPU The GPU as we know it today was founded by Nvidia in 1999, and its popularity has increased in recent years. VisionTek GeForce 256 [Wikipedia]; Nvidia GeForce GTX 590 [Nvidia]

CPU vs GPU Theoretical GFLOP/s & Bandwidth [Nvidia, NVIDIA CUDA C Programming Guide]

CPU vs GPU Intel Core i7 Nehalem Die Shot [NVIDIA’s Fermi: The First Complete GPU Computing Architecture]

CPU vs GPU Game, Little Big Planet

CPU vs GPU OpenGL Graphics Pipeline [Wojtek Palubicki]

CPU vs GPU CPU SIMD vs. GPU SIMD Intel's Sandy Bridge architecture: 256-bit AVX operates on 8 single-precision values in parallel A CUDA GPU (Fermi): up to 512 raw mathematical operations in parallel

CPU vs GPU Nvidia's Fermi

CPU vs GPU Nvidia's Fermi [Nvidia; NVIDIA's Next Generation CUDA Compute Architecture: Fermi]

Standardization Problems and Industry Challenges

Standardization Problems and Industry Challenges 1998: o AMD - 3DNow! o Intel - introduced the SSE instruction set shortly afterward, without supporting 3DNow! o Intel won this battle since SSE was better

Standardization Problems and Industry Challenges 2001: o Intel - Itanium processor (64-bit, parallel computing instruction set) o AMD - its own 64-bit instruction set, x86-64 (backward compatible) o AMD won this time because of its backward compatibility Later: o AMD - proposed SSE5 o Intel - AVX

Standardization Problems and Industry Challenges Example: fused multiply-add (FMA) o d = a + b * c AMD: o Has supported FMA4 since 2011 o FMA4 - four-operand form (the destination is distinct from the three sources) Intel: o Will support FMA3 in 2013 with Haswell o FMA3 - three-operand form (the destination overwrites one of the sources)
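
As an illustration not on the original slide, the FMA3 form is exposed in C through the _mm_fmadd_ps intrinsic; a hedged sketch, assuming a compiler with FMA support enabled (e.g. -mfma):

    #include <immintrin.h>

    /* FMA3 sketch: computes d = a + b * c in one fused operation per lane.
       _mm_fmadd_ps(x, y, z) returns x*y + z; with only three register
       operands, the destination must reuse one of the sources (the FMA3
       constraint).  FMA4 used a separate four-operand encoding that kept
       all three inputs intact. */
    void fma_example(float *d, const float *a, const float *b, const float *c) {
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_loadu_ps(c);
        _mm_storeu_ps(d, _mm_fmadd_ps(vb, vc, va));  /* b*c + a */
    }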

Standardization Problems and Industry Challenges This causes: More work for the programmer Code that is nearly impossible to maintain Standardization is required!

Conclusion SIMD processors exploit data-level parallelism to increase performance. The hardware requirements are easily met as transistor sizes decrease. HPC languages have been created to give programmers access to both high- and low-level SIMD operations.

Conclusion Compiler technology has improved to recognize some potential SIMD operations in serial code. The utility of SIMD instructions in modern microprocessors is diminishing except in special-purpose applications, due to standardization problems and industry in-fighting. The increasing adoption of GPGPU computing has the potential to supplant SIMD-type instructions in the CPU. On-chip GPUs appear to be on the horizon, so wider really is better.