Vector computers: overlapping arithmetic operations on the elements of a vector (instruction-level data parallelism)


INTRODUCTION A vector computer contains a set of special arithmetic units called pipelines. These pipelines overlap the execution of the different parts of an arithmetic operation on the elements of a vector, producing instruction-level (data) parallelism. Do you see any similarity with hyperthreading? In the literature, this kind of computation is often referred to as: vector computing; SIMD (Single Instruction Multiple Data); instruction-level parallelism (ILP); dual issue; double FPU. Most modern processors have the ability to perform vector computations. Vector computing does not mean or require several processors (or cores). Some special architectures are equipped with several floating-point units (FPUs). Example: the IBM® Blue Gene®/L supercomputer, whose processors are enhanced with a specially designed dual floating-point unit. Vector computation requires specific registers called vector registers. Vector operations are performed on vector registers, so their length is an important hardware characteristic, as is their number. Memory bandwidth is also important when evaluating vector computing potential.

VECTOR PROCESSING A scalar implementation of adding two arrays of length n will require 6n steps. A vector implementation of adding two arrays of length n will require 6 + (n-1) steps. Depending on the architecture, vector processing applies to different kinds of operations (arithmetic, logical). Consider the 6 steps (stages) involved in a floating-point addition on a sequential machine with IEEE arithmetic hardware: A. the exponents are compared to find the smaller magnitude; B. the exponents are equalized by shifting the significand of the smaller operand; C. the significands are added; D. the result of the addition is normalized; E. checks are made for floating-point exceptions such as overflow; F. rounding is performed. The pipeline process occurs within a vector register (thus n = 2, 4, 8, ...). Some vector architectures provide wider vectorization by chaining the pipelines. Roughly speaking, a p-length vector computation on a given n-array needs n/p steps. Could you identify and explain other types of pipelines?

SOME PROCESSORS WITH VECTOR PROCESSING UNITS Intel Sandy Bridge processor family (2011), with 256-bit (32-byte) vector registers. PowerPC (Motorola/Apple/IBM), with 128-bit vector registers. SPE of the IBM CELL BE, with 128-bit vector registers. GPUs. FPGA (Field Programmable Gate Array), a flexible integrated circuit which can be configured as desired to implement a specific processing task.

APPLICATIONS OF VECTOR COMPUTING A SIMD-enabled processor can execute a single operation on multiple data. SIMD works well on image, audio, video and digital signal processing, and is well suited to stream processing (typically uniform streaming). The expected characteristics are: compute intensive (arithmetic operations are dominant compared to I/O); data parallel (the same function is applied to all records independently); data locality (the data to be accessed are contiguous in memory); no branching (no control flow; straight-line code). SIMD is widely used in video game programming, genomics, linear algebra, ... Could you explain the meaning of "independently" in data parallelism? Could you explain why branching is a hindrance for SIMD?

VECTOR COMPUTING IMPLEMENTATIONS AltiVec or VMX (PowerPC); IBM CELL-BE SPE intrinsics. Intrinsics are there to facilitate vector programming. We can expect or force the compiler to vectorize our code (but do not rely on it!).

DATA ALIGNMENT Data alignment is crucial in vector computing and important for performance. Lack of knowledge about data alignment can raise the following issues: your software will run slower; your application will lock up; your operating system will crash; your software will silently fail, yielding incorrect results. Memory is accessed in chunks of constant size (cache lines). A memory address is said to be p-aligned iff it is a multiple of p (typically 128). It is important to also align the size of data types (padding if necessary). Sometimes the compiler will automatically pad your data structures (check with sizeof()). There are specialized libraries for memory-aligned allocations. Write a C routine that implements aligned memory allocations. Illustrate a misaligned memory address and explain some consequences. typedef struct{char a; long b; char c;}mystruct; Is mystruct aligned? How can it be fixed?

SIMD CODES (SSE)

SIMD CODES (AVX)

/* Excerpt: xxx256_data3a, xxx256_y_coeff2/3, xxx256_sum and f are
   defined elsewhere in the original program. */
__m256 xxx256_x_coeff1 = _mm256_load_ps( &interp_coef_x[0] );
__m256 xxx256_data0a = _mm256_load_ps( &pf[index_signal_start] );
__m256 xxx256_data1a = _mm256_load_ps( &pf[index_signal_start+nx] );
__m256 xxx256_data2a = _mm256_load_ps( &pf[index_signal_start+nx2] );

xxx256_data0a = _mm256_mul_ps( xxx256_data0a, xxx256_x_coeff1 );
xxx256_data1a = _mm256_mul_ps( xxx256_data1a, xxx256_x_coeff1 );
xxx256_data2a = _mm256_mul_ps( xxx256_data2a, xxx256_x_coeff1 );

__m256 xxx256_sum1 = _mm256_add_ps(
    _mm256_mul_ps(xxx256_data2a, xxx256_y_coeff2),
    _mm256_mul_ps(xxx256_data3a, xxx256_y_coeff3) );

/* Horizontal reduction of the 8 lanes down to f[0]. */
xxx256_sum = _mm256_add_ps(xxx256_sum, _mm256_movehdup_ps(xxx256_sum));
xxx256_sum1 = _mm256_unpackhi_ps(xxx256_sum, xxx256_sum);
xxx256_sum1 = _mm256_add_ps(xxx256_sum, xxx256_sum1);
xxx256_sum = _mm256_permute2f128_ps(xxx256_sum1, xxx256_sum1, 0x01);
xxx256_sum = _mm256_add_ps(xxx256_sum, xxx256_sum1);
_mm256_store_ps( f, xxx256_sum );
signal_value = f[0];

SIMD CODES (VMX)

int vmult(float *array1, float *array2, float *out, int arraySize)
{
    /* This code assumes that the arrays are quadword-aligned. */
    /* This code assumes that arraySize is divisible by 4. */
    int i, arraySizebyfour;
    arraySizebyfour = arraySize >> 2;  /* arraySize/4 vectors */
    vector float *varray1 = (vector float *) (array1);
    vector float *varray2 = (vector float *) (array2);
    vector float *vout    = (vector float *) (out);
    for (i = 0; i < arraySizebyfour; i++) {
        /* vec_mul is an intrinsic that multiplies vectors */
        vout[i] = vec_mul(varray1[i], varray2[i]);
    }
    return 0;
}