Vector computers.


Supercomputer
Definition of a supercomputer:
- The fastest machine in the world at a given task
- Any machine costing $30 million or more
- A device to turn a compute-bound problem into an I/O-bound problem
- Any machine designed by Seymour Cray
In the 70s and 80s, "supercomputer" was effectively synonymous with "vector machine".

First Vector Computers / Processors
CDC STAR-100, TI ASC (1972)
- Memory-memory vector processors
- High start-up overhead
- Relatively slow scalar units (underestimation of Amdahl's Law)
Cray-1 (1976)
- Vector-register vector processor (lower start-up overhead, reduced bandwidth requirements)
- Fastest scalar processor in the world at that time
- Vector chaining support
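The Amdahl's Law remark can be made concrete with a short calculation. The sketch below is illustrative only: the function name `amdahl_speedup` and the sample fractions are assumptions chosen for the example, not measurements of the STAR-100 or any other machine.

```python
def amdahl_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Overall speedup when only `vector_fraction` of the run time is accelerated.

    The remaining (scalar) part runs at its original speed, so it bounds the
    overall gain no matter how fast the vector unit is.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if vector_speedup <= 0:
        raise ValueError("vector_speedup must be positive")
    scalar_fraction = 1.0 - vector_fraction
    return 1.0 / (scalar_fraction + vector_fraction / vector_speedup)


# Hypothetical workloads: even a 10x vector unit helps little if much code stays scalar.
for fraction in (0.5, 0.9, 0.99):
    print(f"{fraction:.2f} vectorized -> {amdahl_speedup(fraction, 10.0):.2f}x overall")
```

With half of the run time left scalar, the overall gain stays below 2x, which is why a slow scalar unit hurt the early memory-memory machines.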

Vector Computers: memory-memory vector computers
CDC CYBER 205 (1981)
- Memory-memory architecture
- Four lanes with multiple functional units
- Wide load-store pipeline
- Support for non-unit-stride memory accesses and sparse vectors
ETA-10 (CDC, late 80s)
- 10 processors, each supporting the memory-memory architecture
- Last significant memory-memory design

Vector Computers: vector-register vector processors
Cray X-MP (1983)
- Better chaining support
- Multiple memory pipelines
Cray-2 (mid-80s)
- Up to 4 processors
- DRAM memory modules (256 Mwords of 64 bits)
- Lacked chaining
- High memory latency, one memory pipeline per processor
Convex C-1, C-4 (early 80s)
- Affordable mini-supercomputers ($0.5 to $1 million)
- Software compatibility with Cray
- Effective compiler
- High-quality UNIX OS implementation
IBM System/370 vector architecture (1986)
Japanese supercomputers (mid-80s)
- Fujitsu VP100, VP200
- Hitachi S810
- NEC SX/2
Cray Y-MP (1988)
- 8 processors
- Fastest supercomputer at that time
Cray Computer Corporation
- Cray-3 (1993): only a prototype was delivered, to the National Center for Atmospheric Research (NCAR)
- Cray-4 (unfinished)
Cray Research
- Cray C90 (1991)
- Cray J90 (low end), T90 (high end) (1995)
Cray Research acquired by Silicon Graphics
- SV1 (1995)
K. Asanovic, "Vector processors", Appendix G in "Computer Architecture: A Quantitative Approach".

Vector Computers: vector-register vector processors
Figure G.2 Characteristics of several vector-register architectures. If the machine is a multiprocessor, the entries correspond to the characteristics of one processor. Several of the machines have different clock rates in the vector and scalar units; the clock rates shown are for the vector units. The Fujitsu machines' vector registers are configurable: the size and count of the 8K 64-bit entries may be varied inversely to one another (e.g., on the VP200, from eight registers each 1K elements long to 256 registers each 32 elements long). The NEC machines have eight foreground vector registers connected to the arithmetic units plus 32 to 64 background vector registers connected between the memory system and the foreground vector registers. The reciprocal unit on the Cray processors is used to do division (and square root on the Cray-2). Add pipelines perform add and subtract. The multiply/divide-add unit on the Hitachi S810/820 performs an FP multiply or divide followed by an add or subtract (while the multiply-add unit performs a multiply followed by an add or subtract). Note that most processors use the vector FP multiply and divide units for vector integer multiply and divide, and several of the processors use the same units for FP scalar and FP vector operations. Each vector load-store unit represents the ability to do an independent, overlapped transfer to or from the vector registers. The number of lanes is the number of parallel pipelines in each of the functional units as described in Section G.4. For example, the NEC SX/5 can complete 16 multiplies per cycle in the multiply functional unit. The Convex C-1 can split its single 64-bit lane into two 32-bit lanes to increase performance for applications that require only reduced precision. The Cray SV1 can group four CPUs with two lanes each to act in unison as a single larger CPU with eight lanes, which Cray calls a Multi-Streaming Processor (MSP).
K. Asanovic, "Vector processors", Appendix G in "Computer Architecture: A Quantitative Approach".

Vector Computers: memory-memory vs vector-register
Memory-memory vector computers
- Operands are fetched directly from main memory
- Results are written directly to memory
Vector-register vector computers
- Vector elements are read from memory into a vector register by a LOAD VECTOR operation
- All arithmetic and logic operations are register-register operations
- Results of vector operations are placed in vector registers and may be stored back to memory by a STORE VECTOR operation

Vector Computers: memory-memory vs vector-register
Memory-memory architecture
- Requires greater memory bandwidth
- Prevents easy reuse of intermediate results
- Makes it difficult to overlap multiple vector operations
- Start-up time is significantly increased by the cost of memory accesses
- Becomes more efficient only for very long vectors
Vector-register architecture
- Free of the disadvantages of memory-memory machines
- Experience has shown that shorter vectors are more commonly used
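The bandwidth and reuse points can be illustrated by counting memory words moved for y = a*x + y under a deliberately simple model. The sketch assumes the memory-memory machine issues a separate multiply and add, each streaming its operands from and to memory, while the vector-register machine keeps the product a*x in a register. Both function names and the model itself are assumptions for illustration.

```python
def words_moved_memory_memory(n: int) -> int:
    """Memory words transferred for y = a*x + y on a memory-memory machine."""
    multiply = n + n      # read x, write temporary t = a*x back to memory
    add = 2 * n + n       # read t and y, write y
    return multiply + add


def words_moved_vector_register(n: int) -> int:
    """Memory words transferred when the intermediate a*x stays in a register."""
    loads = 2 * n         # LOAD VECTOR x, LOAD VECTOR y
    stores = n            # STORE VECTOR y
    return loads + stores


n = 1000
print(words_moved_memory_memory(n), words_moved_vector_register(n))
```

Under this model the intermediate result costs two extra passes over memory on the memory-memory machine, which is the reuse penalty the slide refers to.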

Vector computers: memory bandwidth and latency
- Memory access latency adds to the start-up cost of fetching a vector from memory
- Sustaining sufficient bandwidth requires organizing the memory into multiple banks
- Additional problems arise when memory is accessed in an irregular pattern (very typical of matrix-based computations)

Vector Computers: simplified general structure of a vector-register vector computer
[Figure: block diagram. Components: external memory, main memory, vector transfer control and address generator, vector registers (local memory), vector operation control, pipelined functional units (together forming the vector processor), scalar processor, instruction processor. Labelled paths: data (vectors), data (scalars), address parameters, functions, status, instructions, scalar instructions, vector instructions.]

Vector Computers: Cray-1
Main features of a classical vector-register vector computer:
- Load/store architecture
- Vector registers
- Vector instructions
- Hardwired control
- Highly pipelined functional units
- Interleaved memory system (16 banks, 4-cycle busy time, 12-cycle latency)
- No data caches
- No virtual memory
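The interaction between interleaved banks and access stride can be explored with a small simulation. This is a sketch under stated assumptions: word address `i * stride` maps to bank `address % n_banks`, at most one request issues per cycle, and a request stalls while its bank is still busy. The defaults reuse the 16 banks and 4-cycle busy time listed above; the 12-cycle latency is not modelled, and the function name `cycles_to_issue` is hypothetical.

```python
def cycles_to_issue(n_elements: int, stride: int,
                    n_banks: int = 16, busy_time: int = 4) -> int:
    """Cycles needed to issue all requests of a strided vector access."""
    bank_free_at = [0] * n_banks           # first cycle at which each bank is idle
    cycle = 0
    for i in range(n_elements):
        bank = (i * stride) % n_banks
        cycle = max(cycle, bank_free_at[bank])   # stall until the bank is free
        bank_free_at[bank] = cycle + busy_time   # bank is occupied by this request
        cycle += 1                               # next request issues a cycle later
    return cycle


for stride in (1, 4, 8, 16):
    print(f"stride {stride:2d}: {cycles_to_issue(64, stride)} cycles for 64 elements")
```

In this model a stride equal to the number of banks sends every request to the same bank, so throughput drops to one element per busy period.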

Basic Cray-1 architecture

Vector computers: vector instructions
- a_i = f1(b_i): sine, cosine, square root, ...
- scalar = f2(A): sum, maximum, ...
- a_i = f3(b_i, c_i): add, subtract, ...
- a_i = f4(scalar, c_i): multiply vector by scalar, ...
It is possible to combine the above operations.
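The four instruction forms can be mimicked with plain Python lists to show how they combine. This is only a functional sketch: the helper names are hypothetical, and real vector hardware performs these operations in pipelined functional units rather than interpreted loops.

```python
import math
import operator
from functools import reduce


def vector_unary(f, b):
    """a_i = f1(b_i): apply f to every element."""
    return [f(x) for x in b]


def vector_reduce(f, a):
    """scalar = f2(A): fold the whole vector into one value."""
    return reduce(f, a)


def vector_binary(f, b, c):
    """a_i = f3(b_i, c_i): combine two vectors element by element."""
    if len(b) != len(c):
        raise ValueError("vectors must have the same length")
    return [f(x, y) for x, y in zip(b, c)]


def scalar_vector(f, s, c):
    """a_i = f4(scalar, c_i): combine a scalar with every element."""
    return [f(s, y) for y in c]


x = [1.0, 4.0, 9.0]
y = [10.0, 20.0, 30.0]
roots = vector_unary(math.sqrt, x)                                    # form f1
total = vector_reduce(operator.add, y)                                # form f2
axpy = vector_binary(operator.add, scalar_vector(operator.mul, 2.0, x), y)  # f4 then f3
print(roots, total, axpy)
```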

Vector computers: vector instruction set advantages
Compact
- One short instruction encodes N operations (may be equivalent to an entire loop)
Expressive
- Each instruction tells the hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in the same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
Scalable
- The same object code can run on more parallel pipelines or lanes

Vector computers: stripmining
[Figure: theoretical throughput as a function of vector length.]
What happens when the vector length exceeds the size of the vector registers?
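One common answer is to process a long vector in strips no longer than the maximum vector length of the hardware. The sketch below illustrates that idea for an element-wise add. The maximum vector length of 64 and the function names are assumptions, and this version handles full strips first with the short remainder last (the opposite order is equally valid).

```python
def strip_lengths(n: int, mvl: int) -> list[int]:
    """Vector lengths of successive strips covering n elements."""
    if mvl <= 0:
        raise ValueError("maximum vector length must be positive")
    lengths = []
    remaining = n
    while remaining > 0:
        vl = min(mvl, remaining)   # value written to the vector length register
        lengths.append(vl)
        remaining -= vl
    return lengths


def strip_mined_add(b, c, mvl: int = 64):
    """Compute b + c one strip at a time, as a strip-mined loop would."""
    if len(b) != len(c):
        raise ValueError("vectors must have the same length")
    result = []
    start = 0
    for vl in strip_lengths(len(b), mvl):
        # One "vector instruction" operating on at most mvl elements.
        result.extend(x + y for x, y in zip(b[start:start + vl], c[start:start + vl]))
        start += vl
    return result


print(strip_lengths(150, 64))
```

Each strip pays the start-up cost again, which is one reason measured throughput does not keep rising smoothly with vector length.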

Vector computers: stripmining
[Figure: performance of the Spert-II system on dot product with unit-stride operands (32 vector registers). K. Asanovic, "Vector microprocessors".]

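Vector computers: vector chaining
Example: y_i = a*x_i + y_i
[Figure: chained multiply and add pipelines. Elements x12, x13, ... are entering the multiplier while the products a*x5 through a*x11 are in flight; the adder consumes the products together with y4, y5, ... and has already produced a*x1+y1, a*x2+y2, a*x3+y3.]
Performance of the Cray-1 was almost doubled by vector chaining, from 80 Mflops to 153 Mflops.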
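A first-order timing model shows where the gain from chaining comes from. The sketch assumes one element enters each pipeline per cycle and that, without chaining, the add cannot start until the multiply has produced its last element. The pipeline depths of 7 and 6 cycles are sample values, and the model is not intended to reproduce the Mflops figures quoted above.

```python
def unchained_cycles(n: int, mul_latency: int, add_latency: int) -> int:
    """Multiply finishes completely before the dependent add starts."""
    return (mul_latency + n) + (add_latency + n)


def chained_cycles(n: int, mul_latency: int, add_latency: int) -> int:
    """Each product is forwarded to the adder as soon as it leaves the multiplier."""
    return mul_latency + add_latency + n


n = 64
slow = unchained_cycles(n, mul_latency=7, add_latency=6)
fast = chained_cycles(n, mul_latency=7, add_latency=6)
print(f"unchained: {slow} cycles, chained: {fast} cycles, ratio {slow / fast:.2f}")
```

For vectors much longer than the pipeline depths, the ratio in this model approaches 2, consistent with performance being almost doubled.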

Vector computers: scatter and gather
Sometimes only certain elements of a vector are needed in a computation.
If the elements to be used are regularly spaced, the spacing between the elements to be gathered is called the stride.
Example: with a stride of 4, the elements
  x1, x5, x9, x13, ..., x[4*floor((n-1)/4)+1]
are extracted from the vector
  x1, x2, x3, x4, x5, x6, x7, x8, ..., xn
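The stride example can be checked with a few lines of Python. The function name `strided_load` is hypothetical, and list slicing stands in for a strided vector load. Elements are labelled x1..xn to match the 1-based numbering of the slide.

```python
def strided_load(x, stride: int, start: int = 0):
    """Return every `stride`-th element of x, beginning at 0-based position `start`."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    return x[start::stride]


n = 14
x = [f"x{i}" for i in range(1, n + 1)]       # x1, x2, ..., x14
picked = strided_load(x, stride=4)
last_index = 4 * ((n - 1) // 4) + 1          # formula from the slide, 1-based
print(picked, last_index)
```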

Vector computers: scatter and gather
Scatter and gather operations may also be used with irregularly spaced data.
Example: gather operation
  index vector:   1  3  4  7
  source vector:  a1 a2 a3 a4 a5 a6 a7 a8
  result:         a1 a3 a4 a7
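Gather and its inverse, scatter, can be sketched as follows. Indices are 1-based to match the slide's example. The function names and the `None` placeholders in the scatter destination are assumptions for illustration.

```python
def gather(source, indices):
    """Collect source elements at the given 1-based positions into a dense vector."""
    return [source[i - 1] for i in indices]


def scatter(values, indices, destination):
    """Write dense values back to the given 1-based positions of a copy of destination."""
    if len(values) != len(indices):
        raise ValueError("values and indices must have the same length")
    result = list(destination)
    for value, i in zip(values, indices):
        result[i - 1] = value
    return result


a = [f"a{i}" for i in range(1, 9)]           # a1 ... a8
index_vector = [1, 3, 4, 7]
dense = gather(a, index_vector)
restored = scatter(dense, index_vector, [None] * 8)
print(dense, restored)
```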

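Vector computers: compress and expand
Scatter and gather operations may also be used with irregularly spaced data. Compress performs the same kind of selection with a bit mask in place of an index vector.
Example: compress operation
  mask:           1  0  1  1  0  0  1  0
  source vector:  a1 a2 a3 a4 a5 a6 a7 a8
  result:         a1 a3 a4 a7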
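Compress and expand can be sketched with a bit mask as below. The mask value is an assumption chosen so that the result matches the slide's a1, a3, a4, a7. The function names and the `fill` parameter are hypothetical.

```python
def compress(source, mask):
    """Pack the elements whose mask bit is set into a dense vector."""
    if len(source) != len(mask):
        raise ValueError("source and mask must have the same length")
    return [x for x, bit in zip(source, mask) if bit]


def expand(packed, mask, fill=None):
    """Spread a dense vector back to the positions whose mask bit is set."""
    if len(packed) != sum(1 for bit in mask if bit):
        raise ValueError("packed length must equal the number of set mask bits")
    values = iter(packed)
    return [next(values) if bit else fill for bit in mask]


a = [f"a{i}" for i in range(1, 9)]           # a1 ... a8
mask = [1, 0, 1, 1, 0, 0, 1, 0]
packed = compress(a, mask)
unpacked = expand(packed, mask)
print(packed, unpacked)
```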

Vector computers: vector conditional execution
Vectorization of a loop containing conditional code:
  for (i = 0; i < N; i++)
    if (A[i] > 0)
      A[i] = B[i];
    else
      A[i] = C[i];
Use of a vector mask register (1 bit per element):
  lv   vA, rA        # Load A vector
  mgtz m0, vA        # Set bits in mask register m0 where A>0
  lv.m vA, rB, m0    # Load B vector into A under mask
  fnot m1, m0        # Invert mask register
  lv.m vA, rC, m1    # Load C vector into A under mask
  sv   vA, rA        # Store A back to memory (no mask)

Vector computers: vector conditional execution
Execution of the masked sequence on eight elements (non-positive source values shown as <=0):
  element    1    2    3    4    5    6    7    8
  source A   5    <=0  1    <=0  <=0  2    3    4
  m0         1    0    1    0    0    1    1    1
  B          B1   B2   B3   B4   B5   B6   B7   B8
  m1         0    1    0    1    1    0    0    0
  C          C1   C2   C3   C4   C5   C6   C7   C8
  result A   B1   C2   B3   C4   C5   B6   B7   B8
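The masked instruction sequence can be traced step by step in Python. This is a functional sketch only: each list comprehension stands in for one vector instruction named in its comment, the function name is hypothetical, and the negative values in the sample A vector are assumptions chosen to produce the result pattern shown on the slide.

```python
def conditional_assign(A, B, C):
    """Vectorized form of: A[i] = B[i] if A[i] > 0 else C[i]."""
    if not (len(A) == len(B) == len(C)):
        raise ValueError("vectors must have the same length")
    vA = list(A)                                         # lv   vA, rA
    m0 = [a > 0 for a in vA]                             # mgtz m0, vA
    vA = [b if m else a for a, b, m in zip(vA, B, m0)]   # lv.m vA, rB, m0
    m1 = [not m for m in m0]                             # fnot m1, m0
    vA = [c if m else a for a, c, m in zip(vA, C, m1)]   # lv.m vA, rC, m1
    return vA                                            # sv   vA, rA


A = [5, -1, 1, -2, -3, 2, 3, 4]              # assumed sample data
B = [f"B{i}" for i in range(1, 9)]
C = [f"C{i}" for i in range(1, 9)]
result = conditional_assign(A, B, C)
print(result)
```

Both branches are evaluated over the full vector and the masks decide which results are kept, so the loop needs no per-element branching.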

Vector computers: programming vector computers
- Assembly language programming
- Libraries
- Data-parallel languages
  - Support for data-parallel operations as an inherent part of the language (intrinsic operators and functions)
  - Fortran 90, High Performance Fortran
- Vectorizing compilers
  - Extensive loop dependency analysis

Vector computers: vector processing applications
Problems that can be efficiently formulated in terms of vectors:
- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space flight simulations
- Artificial intelligence and expert systems
- Mapping the human genome
- Image processing