Copyright © 2011-2014 Curt Hill SIMD Single Instruction Multiple Data.


SIMD
SIMD is only successful when the data is highly parallel.
A very large amount of computing time is spent on array processing.
The processing of array elements is largely independent
–Such as adding corresponding elements of two arrays
There are plenty of applications, but they are specialized, usually scientific.

Data level parallelism
Suppose we have two arrays of 32 floating point operands and we want to add them.
A single processor will go down the line, summing one pair at a time
–If it is superscalar with two FPUs it can do this in slightly more than 16 units of time; otherwise 32
Not bad, but generally outperformed by array and vector processors.

Array Processor
A single control unit drives multiple ALUs
–The ALUs usually have individual memories
In the previous case it would take 16 to 32 units of time.
Here, if there are 32 floating point units and the vector register contains 32 slots, it will take one unit.
When adding two scalar variables the two would be the same speed, but when adding two array variables (length ≤ 32) the array processor would be up to 32 times faster.

Why?
In most applications such parallelism would be wasted, but in many scientific applications an array of size 32 is quite small, so substantial use can be made of this parallelism.
An array processor is a large number of identical processors that perform the same instruction on different pieces of data
–A single control unit for the many processors
–Parallel memories for the parallel processors

Examples
The ILLIAC IV was the first, in the late 1960s
–Largely used by NASA for fluid dynamics calculations
–A very large amount of parallelism in this application
Thinking Machines Connection Machine 1 and 2
Goodyear Massively Parallel Processor
MasPar MP-1 and MP-2

Disadvantages
Hardware heavy, hence expensive
–Never mass produced, since they fit a niche market
–The register/memory configuration is unusual
Difficult to program
–Most languages have no support
–High Performance Fortran is the usual choice
Only exceptional on truly parallel computations

Vector Processor
Essentially a normal processor, usually superscalar and heavily pipelined.
What it has that is different are vector registers
–A normal register contains a single value, either integer or floating point of some size
–A vector register contains an array of these items, which can be combined with array arithmetic

Crays
Most of the Cray supercomputers were vector processors.
They were programmed more like a regular processor
–There was usually a vector load/store instruction
The number of vector registers was modest: the Cray-1 had eight, each holding 64 values
–This kept the cost more reasonable
–The performance was not so lopsided on vector operations

Commercially
The market for these sorts of array and vector processors is very limited.
Few organizations can keep them consistently utilized.
In general it is a niche market.
However, there are some common descendants as well.

Intel MMX instructions
The Pentium should not be considered a vector processor, yet it has vector operations in the MMX subset
–The SSE sets extend these
These allow one 64 bit MMX register to be treated as eight 8-bit values or four 16-bit values.
This allows array processing of 8 bit pixels or 16 bit sound samples.

GPU
The graphics processing unit is the most common vector processor.
The pixel manipulation performed by a GPU is an ideal SIMD environment.
Shading, for example, can easily be done in parallel.
Let's consider one GPU: the ATI Radeon HD 4870.

Radeon HD 4870
There are 10 cores
–Each is a SIMD unit
Each core has 256 registers.
Each of these registers is actually a vector register with 64 slots.
Each slot holds a 4-component vector of 4-byte floats (16 bytes).
Multiply this out and it is 2.5 MB of registers.

Exploiting the GPU
There is substantial power sitting in the GPU.
Unless a 3D moving display (such as a game) or a video is playing, most of this power sits idle.
A number of options are now available to use this for scientific computing
–GPGPU: General Purpose computing on Graphics Processing Units

Super Computers
A number of groups have organized clusters of GPUs into supercomputers.
Example: the Chinese Mole-8.5 (2011)
–2200 NVIDIA Tesla GPUs
–Used to simulate the H1N1 influenza virus

Finally
The big scientific computers are a niche market.
Supercomputers have been fabricated using clusters of GPUs
–This is likely the future of SIMD