Computer Architecture Vector Architectures Ola Flygt Växjö University
Outline Introduction Basic principles Examples Cray and Convex vector systems
Scalar processing 4n clock cycles required to process n elements! Timing: the 4-stage functional unit handles one element at a time, so a0 starts at time 0, a1 at time 4, a2 at time 8, ..., each element a_i at time 4i.
Pipelining approx. 4+n clock cycles required to process n elements (speedup 4n/(4+n))! Timing: element a_i enters pipeline stage op 0 at tick i while a(i-1), a(i-2), a(i-3) occupy stages op 1, op 2, op 3; once the pipeline is full, one result completes every tick.
Pipeline Basic Principle Stream of objects: number of objects = stream length n. The operation can be subdivided into a sequence of steps: number of steps = pipeline length p. Advantage: speedup = pn/(p+n); for stream length >> pipeline length, speedup approx. p. Speedup is limited by the pipeline length!
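A minimal sketch (not from the slides) that evaluates the speedup formula pn/(p+n) for a 4-stage pipeline; the program and its variable names are my own illustration.

  PROGRAM pipeline_speedup
    IMPLICIT NONE
    INTEGER :: n
    REAL    :: p, s
    p = 4.0                        ! pipeline length (number of stages)
    DO n = 128, 1024, 128
       s = (p * n) / (p + n)       ! speedup over the non-pipelined unit
       PRINT *, 'n =', n, '  speedup =', s
    ENDDO
  END PROGRAM pipeline_speedup

For n = 1024 the speedup is about 3.98, i.e. already close to the pipeline length p = 4.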
Vector Operations Operations on vectors of data (floating point numbers) Vector-vector V1 <- V2 + V3 (component-wise sum) V1 <- V2 Vector-scalar V1 <- c * V2 Vector-memory V <- A (vector load) A <- V (vector store) Vector reduction c <- min(V) c <- sum(V) c <- V1 * V2 (dot product)
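As a hedged illustration (not part of the slides), the same operations written as Fortran 90 array statements; the names V1, V2, V3, A and c follow the notation above, and the vector length 128 is arbitrary.

  PROGRAM vector_ops
    IMPLICIT NONE
    REAL :: V1(128), V2(128), V3(128), A(128), c
    CALL RANDOM_NUMBER(V2)
    CALL RANDOM_NUMBER(V3)
    CALL RANDOM_NUMBER(A)
    c  = 2.5
    V1 = V2 + V3             ! vector-vector: component-wise sum
    V1 = c * V2              ! vector-scalar
    V1 = A                   ! vector load  (memory -> vector register)
    A  = V1                  ! vector store (vector register -> memory)
    c  = MINVAL(V1)          ! vector reduction: minimum
    c  = SUM(V1)             ! vector reduction: sum
    c  = DOT_PRODUCT(V1, V2) ! vector reduction: dot product
  END PROGRAM vector_ops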
Vector Operations, cont. Gather/scatter V1, V2 <- GATHER(A) load all non-zero elements of A into V1 and their indices into V2 A <- SCATTER(V1, V2) store elements of V1 into A at the indices denoted by V2 and fill the rest with zeros Mask V1 <- MASK(V2, V3) store elements of V2 into V1 for which the corresponding position in V3 is non-zero
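A minimal sketch, assuming the Fortran 90 intrinsics PACK and WHERE as stand-ins for the GATHER/SCATTER/MASK operations above; the data values and the array size 8 are made up for illustration.

  PROGRAM gather_scatter_mask
    IMPLICIT NONE
    REAL    :: A(8), V1(8), V2(8), V3(8)
    INTEGER :: idx(8), i, n
    A = (/ 0., 3., 0., 5., 0., 0., 7., 0. /)
    ! GATHER: collect the non-zero elements of A and their indices
    n        = COUNT(A /= 0.0)
    V1(1:n)  = PACK(A, A /= 0.0)
    idx(1:n) = PACK( (/ (i, i = 1, 8) /), A /= 0.0 )
    ! SCATTER: store V1 back into A at those indices, rest filled with zeros
    A = 0.0
    A(idx(1:n)) = V1(1:n)
    ! MASK: copy elements of V2 into V1 where the mask vector V3 is non-zero
    CALL RANDOM_NUMBER(V2)
    V3 = A
    V1 = 0.0
    WHERE (V3 /= 0.0) V1 = V2
  END PROGRAM gather_scatter_mask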
Example, Scalar Loop approx. 6n clock cycles to execute the loop. Fortran loop:
  DO I=1,N
    A(I) = A(I)+B(I)
  ENDDO
Scalar assembly code:
     R0 <- N
     R1 <- 1
     JMP J
  L: R2 <- A(R1)
     R3 <- B(R1)
     R2 <- R2+R3
     A(R1) <- R2
     R1 <- R1+1
  J: JLE R1, R0, L
Example, Vector Loop approx. 4n clock cycles, because there is no loop iteration overhead (ignoring the speedup from pipelining). Fortran loop:
  DO I=1,N
    A(I) = A(I)+B(I)
  ENDDO
Vectorized assembly code:
  V1 <- A
  V2 <- B
  V3 <- V1+V2
  A <- V3
Chaining Overlapping of vector instructions (see Hwang, Figure 8.18). Hence: c+n ticks (for a small start-up cost c). Speedup approaches 6 for long vectors (e.g. c=16, n=128: s = (6*128)/(16+128) = 5.33). The longer the vector chain, the better the speedup! A <- B*C+D has chaining degree 5. Vectorization speedups between 5 and 25.
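A small sketch (my own) that tabulates the speedup 6n/(c+n) implied by the figures above for growing vector length n; c = 16 start-up ticks is taken from the example.

  PROGRAM chaining_speedup
    IMPLICIT NONE
    INTEGER :: n
    REAL    :: c, s
    c = 16.0                       ! start-up (chain fill) ticks from the example
    DO n = 128, 1024, 128
       s = (6.0 * n) / (c + n)     ! scalar loop ~6n ticks vs. chained c+n ticks
       PRINT *, 'n =', n, '  speedup =', s
    ENDDO
  END PROGRAM chaining_speedup

For n = 128 this reproduces the 5.33 above; as n grows, the speedup approaches 6.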
Vector Programming How to generate vectorized code? 1. Assembly programming. 2. Vectorized Libraries. 3. High-level vector statements. 4. Vectorizing compiler.
Vectorized Libraries Predefined vector operations (partially implemented in assembly language): VECLIB, LINPACK, EISPACK, MINPACK. Example: C = SSUM(100, A(1,2), 1, B(3,1), N)
  100 ..... vector length
  A(1,2) .. start address of vector A
  1 ....... stride of A
  B(3,1) .. start address of vector B
  N ....... stride of B
Addition of a matrix column to a matrix row.
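A hedged sketch of what a library call of this shape computes; the routine below and its name strided_add are hypothetical, not the actual VECLIB interface, and merely illustrate how a (length, start address, stride) triple selects a column (stride 1) or a row (stride = leading dimension) of a column-major Fortran matrix.

  SUBROUTINE strided_add(n, x, incx, y, incy, z)
    ! Hypothetical routine: element-wise sum of two strided vectors.
    IMPLICIT NONE
    INTEGER, INTENT(IN)  :: n, incx, incy
    REAL,    INTENT(IN)  :: x(*), y(*)
    REAL,    INTENT(OUT) :: z(n)
    INTEGER :: i
    DO i = 1, n
       z(i) = x(1 + (i-1)*incx) + y(1 + (i-1)*incy)
    ENDDO
  END SUBROUTINE strided_add

Assuming B is declared with leading dimension N, CALL strided_add(100, A(1,2), 1, B(3,1), N, C) would add 100 elements of column 2 of A (stride 1) to 100 elements of row 3 of B (stride N), in the spirit of the SSUM example above.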
High-Level Vector Statements e.g. Fortran 90:
  INTEGER A(100), B(100), C(100), S
  A(1:100) = S*B(1:100)+C(1:100)
* Vector-vector operations.
* Vector-scalar operations.
* Vector reduction.
* ...
Easy transformation into vector code.
Vectorizing Compiler
1. Fortran 77 DO loop:
   DO I=1, N
     D(I) = A(I)*B+C(I)
   ENDDO
2. Vectorization:
   D(1:N) = A(1:N)*B+C(1:N)
3. Strip mining (vector register length 128):
   DO I=1, (N/128)*128, 128
     D(I:I+127) = A(I:I+127)*B + C(I:I+127)
   ENDDO
   IF (MOD(N,128) .NE. 0) THEN
     D((N/128)*128+1:N) = ...
   ENDIF
4. Code generation:
   V0 <- V0*B ...
Related techniques apply to parallelizing compilers!
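As a hedged, self-contained sketch of step 3 (strip mining), assuming a vector register length of 128; the array names and the problem size N = 1000 are illustrative.

  PROGRAM strip_mining
    IMPLICIT NONE
    INTEGER, PARAMETER :: VL = 128, N = 1000
    REAL :: A(N), C(N), D(N), B
    INTEGER :: I
    CALL RANDOM_NUMBER(A)
    CALL RANDOM_NUMBER(C)
    B = 3.0
    ! full strips of VL elements, each mapping onto one vector instruction
    DO I = 1, (N/VL)*VL, VL
       D(I:I+VL-1) = A(I:I+VL-1)*B + C(I:I+VL-1)
    ENDDO
    ! remainder strip of MOD(N,VL) elements
    IF (MOD(N, VL) /= 0) THEN
       D((N/VL)*VL+1:N) = A((N/VL)*VL+1:N)*B + C((N/VL)*VL+1:N)
    ENDIF
  END PROGRAM strip_mining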
Vectorization In which cases can a loop be vectorized?
  DO I = 1, N-1
    A(I) = A(I+1)*B(I)
  ENDDO
is vectorized (in strips of 128) as
  A(1:128) = A(2:129)*B(1:128)
  A(129:256) = A(130:257)*B(129:256)
  ...
Vectorization preserves the semantics: each A(I+1) is still read before it is overwritten.
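A hedged sketch (my own) that runs this loop both ways and checks that the results agree; the array size N = 8 is arbitrary.

  PROGRAM antidep_check
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 8
    REAL :: A(N), A2(N), B(N)
    INTEGER :: I
    CALL RANDOM_NUMBER(B)
    CALL RANDOM_NUMBER(A)
    A2 = A
    ! scalar loop: A(I+1) is read before iteration I+1 overwrites it
    DO I = 1, N-1
       A(I) = A(I+1)*B(I)
    ENDDO
    ! vectorized form: all of A2(2:N) is loaded before A2(1:N-1) is stored
    A2(1:N-1) = A2(2:N)*B(1:N-1)
    PRINT *, 'max difference =', MAXVAL(ABS(A - A2))   ! 0.0: semantics preserved
  END PROGRAM antidep_check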
Loop Vectorization Is the semantics always preserved?
  DO I = 2, N
    A(I) = A(I-1)*B(I)
  ENDDO
naively vectorized as
  A(2:129) = A(1:128)*B(2:129)
  A(130:257) = A(129:256)*B(130:257)
  ...
Vectorization has changed the semantics! The scalar loop is a recurrence that uses each newly computed A(I-1), while the vector version reads the old values of A before any element is stored.
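A hedged sketch (my own) that runs the recurrence loop above both ways, making the semantic difference concrete; the array size and values are illustrative.

  PROGRAM recurrence_check
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 6
    REAL :: A(N), A2(N), B(N)
    INTEGER :: I
    B  = 2.0
    A  = 1.0
    A2 = 1.0
    ! scalar loop: each iteration uses the A(I-1) just computed (a recurrence)
    DO I = 2, N
       A(I) = A(I-1)*B(I)
    ENDDO
    ! naive "vectorized" form: all of A2(1:N-1) is read before anything is stored
    A2(2:N) = A2(1:N-1)*B(2:N)
    PRINT *, 'scalar    :', A      ! 1, 2, 4, 8, 16, 32
    PRINT *, 'vectorized:', A2     ! 1, 2, 2, 2, 2, 2
  END PROGRAM recurrence_check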
Vectorization Inhibitors Vectorization must be conservative; when in doubt, the loop must not be vectorized. Vectorization is inhibited by: function calls, input/output operations, GOTOs into or out of the loop, and recurrences (references to vector elements modified in previous iterations), as illustrated in the sketch below.
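Hedged illustrative code (not from the slides): a loop combining two of the inhibitors above, a call to an external function and a GOTO out of the loop; the routine names inhibited and CHECK are hypothetical.

  SUBROUTINE inhibited(A, B, N)
    IMPLICIT NONE
    INTEGER, INTENT(IN)    :: N
    REAL,    INTENT(INOUT) :: A(N)
    REAL,    INTENT(IN)    :: B(N)
    REAL, EXTERNAL :: CHECK
    INTEGER :: I
    DO I = 1, N
       IF (B(I) .LT. 0.0) GOTO 100   ! GOTO out of the loop inhibits vectorization
       A(I) = CHECK(B(I))            ! function call: side effects unknown to the compiler
    ENDDO
100 CONTINUE
  END SUBROUTINE inhibited

  REAL FUNCTION CHECK(x)
    IMPLICIT NONE
    REAL, INTENT(IN) :: x
    CHECK = SQRT(ABS(x))             ! stand-in body; in general the compiler cannot see it
  END FUNCTION CHECK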
Components of a vectorizing supercomputer
The DS for floating-point precision
The DS for integer precision
How vectorization works Un-vectorized computation
How vectorization works vectorized computation
How vectorization speeds up computation
Speed improvements Non-pipelined computation
Speed improvements pipelined computation
Increasing the granularity of a pipeline Repetition rate governed by the slowest component
Increasing the granularity of a pipeline Granularity increased to improve the repetition rate
Parallel computation of floating point and integer results
Mixed functional and data parallelism
The DS for parallel computational functionality
Performance of four generations of Cray systems
Communication between CPUs and memory
The increasing complexity in Cray systems
Integration density
Convex C4/XA system
The configuration of the crossbar switch
The processor configuration