Multivector and SIMD Computers


Multivector and SIMD Computers (EENG-630)
Topics: Vector Processing Principles; Multivector Multiprocessors; Compound Vector Processing; SIMD Computer Organizations; The Connection Machine CM-5.

Vector Processing Principles
A vector is a set of scalar data items, all of the same type, stored in memory. Usually, the vector elements are ordered so that successive elements are separated by a fixed addressing increment called the stride.
A vector processor is an ensemble of hardware resources, including vector registers, functional pipelines, processing elements, and register counters, for performing vector operations.
Vector processing occurs when arithmetic or logical operations are applied to vectors. The conversion from scalar code to vector code is called vectorization. Vector processing can yield a speedup of 10 to 20 compared with scalar processing.
A compiler capable of vectorization is called a vectorizing compiler or vectorizer.

Vector instructions
1. Vector-vector instructions: one or two vector operands are fetched from their respective vector registers, pass through a functional pipeline unit, and produce a result in another vector register.
2. Vector-scalar instructions.
3. Vector-memory instructions: load and store between vector registers and memory.
4. Vector reduction instructions: maximum, minimum, sum, mean value.
5. Gather and scatter instructions: two vector registers are used to gather or scatter vector elements randomly throughout memory (operations on sparse vectors).
6. Masking instructions: the mask vector is used to compress or expand a vector to a shorter or longer index vector (one mask bit per index).
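The gather, scatter, masking, and reduction instruction types listed above can be illustrated with a minimal Python sketch. The function names (`gather`, `scatter`, `compress`, `vector_sum`) and the sample data are hypothetical, chosen only for illustration; Python lists stand in for vector registers and memory, so this shows the semantics only, not how any particular machine implements them.

```python
def gather(memory, base, index):
    """Load memory[base + index[i]] into consecutive vector elements."""
    return [memory[base + i] for i in index]


def scatter(memory, base, index, values):
    """Store values[i] to memory[base + index[i]] (the inverse of gather)."""
    for i, v in zip(index, values):
        memory[base + i] = v


def compress(vector, mask):
    """Masking: keep only the elements whose mask bit is 1."""
    return [v for v, bit in zip(vector, mask) if bit]


def vector_sum(vector):
    """Reduction: collapse a vector to a single scalar."""
    total = 0
    for v in vector:
        total += v
    return total


# A sparse vector stored in memory: only positions 1, 4 and 6 are nonzero.
memory = [0, 10, 0, 0, 40, 0, 60, 0]
index = [1, 4, 6]

packed = gather(memory, 0, index)        # dense copy of the nonzero elements
doubled = [2 * x for x in packed]        # a vector-scalar operation
scatter(memory, 0, index, doubled)       # write results back to sparse positions
```

Gathering first lets the arithmetic run on a short dense vector instead of the full sparse one, which is the usual motivation for these instructions.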

Vector-access memory schemes
Vector operands may have arbitrary length, and vector elements are not necessarily stored in contiguous memory locations. To access a vector in memory, one must specify its base, stride, and length. Since each vector register has a fixed length, only a segment of the vector can be loaded into a vector register at a time.
Vector operands should be stored in memory in a way that allows pipelined and parallel access. The access itself should be pipelined.
C-access memory organization: the m-way low-order interleaved memory structure allows m words to be accessed concurrently in an overlapped manner.
S-access memory organization: all modules are accessed simultaneously, storing consecutive words in data buffers. The low-order address bits are used to multiplex the m words out of the buffers.
C/S-access memory organization: a combination of C-access and S-access.
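As a rough illustration of low-order interleaving and of the (base, stride, length) description of a vector, the sketch below maps word addresses to memory modules. The function names are hypothetical, and the mapping (module = address mod m, offset = address div m) is the standard low-order interleaving rule, assumed here because the slide does not spell it out.

```python
from math import gcd


def module_and_offset(address, m):
    """Low-order interleaving: low address bits pick the module."""
    return address % m, address // m


def modules_touched(base, stride, length, m):
    """Sequence of modules visited by a vector with the given base, stride, length."""
    return [module_and_offset(base + i * stride, m)[0] for i in range(length)]


def distinct_modules(stride, m):
    """Number of different modules a long strided vector can reach."""
    return m // gcd(stride, m)


unit_stride = modules_touched(base=0, stride=1, length=8, m=8)
even_stride = modules_touched(base=0, stride=2, length=8, m=8)
odd_stride = modules_touched(base=0, stride=3, length=8, m=8)
```

With m = 8, a stride of 1 or 3 cycles through all eight modules, while a stride of 2 keeps returning to the same four, which halves the achievable overlap.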

C-access
Example: eight-way interleaved memory (m = 8 and w = 8); m is called the degree of interleaving.
The major cycle is the total time required to complete the access of a single word from a memory module. The minor cycle is the actual time needed to produce one word, assuming overlapped access in which successive memory modules are started one minor cycle apart.
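A small timing sketch may help relate the two cycles. It assumes, beyond what the slide states, that the minor cycle equals the major cycle divided by m and that the words are consecutive (stride 1), so no module is revisited before it is free. The function names are hypothetical.

```python
def c_access_time(n, major_cycle, m):
    """Time to fetch n consecutive words with m-way overlapped access.

    Assumption: minor cycle = major cycle / m, stride 1, no conflicts.
    The first word takes a full major cycle; each later word follows
    one minor cycle after the previous one.
    """
    minor_cycle = major_cycle / m
    return major_cycle + (n - 1) * minor_cycle


def non_interleaved_time(n, major_cycle):
    """Baseline: every word waits for a full major cycle."""
    return n * major_cycle


overlapped = c_access_time(n=8, major_cycle=8.0, m=8)
baseline = non_interleaved_time(n=8, major_cycle=8.0)
```

Under these assumptions the overlapped fetch of eight words costs 15 time units against 64 for the non-interleaved baseline.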

Relative Vector/Scalar Performance
Let r be the vector/scalar speed ratio and f the vectorization ratio. By Amdahl's law, the relative performance P is defined as

$$P = \frac{1}{(1 - f) + f/r} = \frac{r}{(1 - f)\,r + f}$$

The limiting case is P -> 1 as f -> 0. Example: r = 3 to 4 for IBM machines and r = 10 to 25 for Cray machines.
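To get a feel for the numbers, the relative performance can be evaluated directly. This sketch assumes the usual Amdahl form P = 1 / ((1 - f) + f / r); the function name and the sample values of f are illustrative only.

```python
def relative_performance(r, f):
    """Amdahl-style relative performance.

    r: vector/scalar speed ratio
    f: vectorization ratio (fraction of the work done in vector mode)
    """
    return 1.0 / ((1.0 - f) + f / r)


# Sweep the vectorization ratio for a machine with r = 10.
for f in (0.0, 0.5, 0.9, 1.0):
    print(f"f = {f:.1f}  P = {relative_performance(10, f):.2f}")
```

Even with r = 10, a program that is 90% vectorized reaches only a little over half of the peak ratio, which is why a high vectorization ratio matters so much.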

Multivector Multiprocessors
Architecture design goals:
Maintaining a good vector/scalar performance balance. The vector balance point is defined as the percentage of vector code in a program required to achieve equal utilization of vector and scalar hardware (usually 90 to 97%).
Supporting scalability with an increasing number of processors. The dominant problem is supporting shared memory as the number of processors and memory ports grows.
Increasing memory system capacity (up to terabytes today; a hierarchy is necessary) and performance.
Providing high-performance I/O (>50 Gbytes/s) and an easy-access network.
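One way to read the vector balance point is as the vector fraction at which the machine spends equal time in vector and scalar mode. Under that interpretation, which is an assumption here, setting f / Rv = (1 - f) / Rs gives f = Rv / (Rv + Rs), where Rv and Rs are the vector-mode and scalar-mode execution rates. The function name and the rates in the example are hypothetical.

```python
def vector_balance_point(vector_rate, scalar_rate):
    """Vector fraction f at which time in vector mode equals time in scalar mode.

    Assumes "equal utilization" means equal time:
        f / vector_rate == (1 - f) / scalar_rate
    """
    return vector_rate / (vector_rate + scalar_rate)


# Hypothetical machine: 9 units of work per second in vector mode, 1 in scalar mode.
f = vector_balance_point(9.0, 1.0)
time_in_vector_mode = f / 9.0
time_in_scalar_mode = (1.0 - f) / 1.0
```

For these rates the balance point is 0.9, that is, 90% vector code, which sits at the low end of the range quoted above.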

Compound Vector Processing
A compound vector function (CVF) is defined as a composite function of vector operations converted from a looping structure of linked scalar operations.

       Do 10 I=1,N
          Load R1, X(I)
          Load R2, Y(I)
          Multiply R1, S
          Add R2, R1
          Store Y(I), R2
    10 Continue

CVF: Y(1:N) = S × X(1:N) + Y(1:N), or equivalently Y(I) = S × X(I) + Y(I).
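The same loop and its compound vector form can be written side by side in Python. This is only a sketch of the semantics: the list comprehension stands in for a single vector operation, and the function names are hypothetical.

```python
def saxpy_scalar_loop(s, x, y):
    """Element-by-element loop mirroring the linked scalar operations."""
    y = list(y)              # work on a copy, leave the caller's list alone
    for i in range(len(x)):
        r1 = x[i]            # Load R1, X(I)
        r2 = y[i]            # Load R2, Y(I)
        r1 = r1 * s          # Multiply R1, S
        r2 = r2 + r1         # Add R2, R1
        y[i] = r2            # Store Y(I), R2
    return y


def saxpy_vector(s, x, y):
    """Compound vector function: Y(1:N) = S * X(1:N) + Y(1:N)."""
    return [s * xi + yi for xi, yi in zip(x, y)]


x = [1, 2, 3]
y = [10, 20, 30]
loop_result = saxpy_scalar_loop(2, x, y)
vector_result = saxpy_vector(2, x, y)
```

Both functions compute the same values; the difference on a vector machine is that the second form can be issued as a few vector instructions instead of N iterations of five scalar ones.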

CVF Chaining and Strip-mining
Typical operations in CVFs for one-dimensional arrays are load, store, multiply, divide, logical, and shifting operations. The number of available vector registers and functional pipelines imposes restrictions on how many CVFs can be executed simultaneously.
Chaining: chaining is an extension of the internal data forwarding technique practiced in scalar processors. It is limited by the small number of functional pipelines available in a vector processor.
Strip-mining: when a vector is longer than the vector registers, the long vector must be segmented into fixed-length segments. One vector segment is processed at a time (in Cray computers a segment is 64 elements).
Recurrence: the special case of vector loops in which the output of a functional pipeline may feed back into one of its own source vector registers.
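Strip-mining can be sketched as an outer loop over fixed-length segments. The example below reuses the Y = S × X + Y computation and a segment length of 64, matching the Cray figure quoted above; the function names are hypothetical, and list slices stand in for vector registers.

```python
def strips(n, segment_length=64):
    """Split the index range 0..n-1 into (start, stop) segments."""
    return [(start, min(start + segment_length, n))
            for start in range(0, n, segment_length)]


def saxpy_strip_mined(s, x, y, segment_length=64):
    """Compute s * x + y one register-sized segment at a time."""
    result = list(y)
    for start, stop in strips(len(x), segment_length):
        xs = x[start:stop]                 # load one segment of X
        ys = result[start:stop]            # load one segment of Y
        result[start:stop] = [s * a + b for a, b in zip(xs, ys)]
    return result


n = 150
x = list(range(n))
y = [1] * n
segments = strips(n)
out = saxpy_strip_mined(3, x, y)
```

A vector of 150 elements is handled as two full segments of 64 plus a final short segment of 22.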

SIMD Computer Organizations
SIMD models are differentiated by their memory distribution and the addressing scheme used. Most SIMD computers use a single control unit and distributed memories, except for a few that use associative memories.
Distributed memory model: spatial parallelism among the PEs. A distributed-memory SIMD computer consists of an array of PEs, each with its own local memory, controlled by the array control unit. Program and data are loaded into the control memory through the host computer and distributed from there to the PEs' local memories.
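The distributed-memory organization can be mimicked with a toy simulation: a control unit holds the data, hands each PE a slice for its local memory, and then broadcasts one operation that every PE applies to its own slice. The class names, the block distribution, and the local-sum operation are all assumptions made for illustration; the Python loop only simulates what the PEs would do in lockstep.

```python
class ProcessingElement:
    """A PE with its own local memory and one accumulator register."""

    def __init__(self, local_data):
        self.memory = list(local_data)
        self.accumulator = 0


class ArrayControlUnit:
    """Distributes data to the PEs and broadcasts a single instruction stream."""

    def __init__(self, data, num_pes):
        # Block distribution (an assumption): PE k gets the k-th contiguous chunk.
        chunk = (len(data) + num_pes - 1) // num_pes
        self.pes = [ProcessingElement(data[k * chunk:(k + 1) * chunk])
                    for k in range(num_pes)]

    def broadcast(self, operation):
        """Every PE executes the same operation on its local memory."""
        for pe in self.pes:
            operation(pe)


def local_sum(pe):
    """The broadcast instruction: sum the PE's local memory."""
    pe.accumulator = sum(pe.memory)


control_unit = ArrayControlUnit(list(range(8)), num_pes=4)
control_unit.broadcast(local_sum)
partial_sums = [pe.accumulator for pe in control_unit.pes]
```

Each PE ends up with the sum of its own two elements; combining the partial sums would require communication between PEs, which this sketch leaves out.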

Figure: SIMD using distributed local memory.

Figure: SIMD using shared memory.