1
Vector computers
2
Supercomputer: Definition of a supercomputer
Fastest machine in the world at a given task
Any machine costing $30 million or more
A device to turn a compute-bound problem into an I/O-bound problem
Any machine designed by Seymour Cray
In the 70s and 80s, supercomputer meant vector machine
3
First Vector Computers / Processors
CDC STAR-100, TI ASC (1972)
  Memory-memory vector processors
  High start-up overhead
  Relatively slow scalar units (underestimation of Amdahl’s Law)
Cray-1 (1976)
  Vector-register vector processor (lower start-up overhead, reduced bandwidth requirements)
  Fastest scalar processor in the world at that time
  Vector chaining support
4
Vector Computers: Memory-memory vector computers
CDC CYBER 205 (1981)
  Memory-memory architecture
  Four lanes with multiple functional units
  Wide load-store pipeline
  Support for non-unit-stride memory accesses and sparse vectors
ETA-10 (CDC, late 80s)
  10 processors, each supporting the memory-memory architecture
  Last significant memory-memory design
5
Vector Computers: Vector-register vector processors
Cray X-MP (1983)
  Better chaining support
  Multiple memory pipelines
Cray-2 (mid-80s)
  Up to 4 processors
  Use of DRAM memory modules (256 MW of 64-bit words)
  Lacked chaining
  High memory latency, one memory pipeline per processor
Convex C-1, C-4 (early 80s)
  Affordable mini-supercomputers ($0.5 to $1 million)
  Software compatibility with Cray
  Effective compiler
  High-quality UNIX OS implementation
IBM System/370 vector architecture (1986)
Japanese supercomputers (mid-80s)
  Fujitsu VP100, VP200
  Hitachi S810
  NEC SX/2
Cray Y-MP (1988)
  8 processors
  Fastest supercomputer at that time
Cray Computer Corporation
  Cray-3 (1993): only a prototype was delivered, to the National Center for Atmospheric Research (NCAR)
  Cray-4 (unfinished)
Cray Research
  Cray C90 (1991)
  Cray J90 (low end), T90 (high end) (1995)
  Cray Research acquired by Silicon Graphics; SV1 (1995)
K. Asanovic, "Vector processors", Appendix G in "Computer Architecture: A Quantitative Approach".
6
Vector Computers: Vector-register vector processors
Figure G.2 Characteristics of several vector-register architectures. If the machine is a multiprocessor, the entries correspond to the characteristics of one processor. Several of the machines have different clock rates in the vector and scalar units; the clock rates shown are for the vector units. The Fujitsu machines’ vector registers are configurable: the size and count of the 8K 64-bit entries may be varied inversely to one another (e.g., on the VP200, from eight registers each 1K elements long to 256 registers each 32 elements long). The NEC machines have eight foreground vector registers connected to the arithmetic units plus 32–64 background vector registers connected between the memory system and the foreground vector registers. The reciprocal unit on the Cray processors is used to do division (and square root on the Cray-2). Add pipelines perform add and subtract. The multiply/divide-add unit on the Hitachi S810/820 performs an FP multiply or divide followed by an add or subtract (while the multiply-add unit performs a multiply followed by an add or subtract). Note that most processors use the vector FP multiply and divide units for vector integer multiply and divide, and several of the processors use the same units for FP scalar and FP vector operations. Each vector load-store unit represents the ability to do an independent, overlapped transfer to or from the vector registers. The number of lanes is the number of parallel pipelines in each of the functional units as described in Section G.4. For example, the NEC SX/5 can complete 16 multiplies per cycle in the multiply functional unit. The Convex C-1 can split its single 64-bit lane into two 32-bit lanes to increase performance for applications that require only reduced precision. The Cray SV1 can group four CPUs with two lanes each to act in unison as a single larger CPU with eight lanes, which Cray calls a Multi-Streaming Processor (MSP).
K. Asanovic, "Vector processors", Appendix G in "Computer Architecture: A Quantitative Approach".
7
Vector Computers: Memory-memory vs vector-register
Memory-memory vector computers
  Operands are fetched directly from main memory
  Results are written directly to memory
Vector-register vector computers
  Vector elements are read from memory into a vector register by a LOAD VECTOR operation
  All arithmetic and logic operations are register-register operations
  Results of vector operations are put into vector registers and may be stored back in memory by a STORE VECTOR operation
8
Vector Computers: Memory-memory vs vector-register
Memory-memory architecture
  Requires greater memory bandwidth
  Prevents easy reuse of intermediate results
  Makes it difficult to overlap multiple vector operations
  Start-up time is significantly increased by the cost of memory accesses
  Becomes more efficient for very long vectors
Vector-register architecture
  Free of the disadvantages of memory-memory machines
  Experience has shown that shorter vectors are more commonly used
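As a rough illustration of the bandwidth and reuse points, the sketch below counts the words moved to and from main memory when evaluating d = (a + b) * c. The expression, the function names, and the assumption that all operands fit in the vector registers are illustrative only and do not describe any particular machine.

```python
def memory_memory_words(ops, n):
    """Words moved when every operation reads its sources from memory
    and writes its result back to memory."""
    return sum((len(sources) + 1) * n for _dest, sources in ops)


def vector_register_words(ops, outputs, n):
    """Words moved when intermediate results stay in vector registers.

    Assumes n does not exceed the register length and that enough
    registers exist, so each input is loaded once and only the final
    outputs are stored.
    """
    produced = {dest for dest, _sources in ops}
    inputs = {s for _dest, sources in ops for s in sources} - produced
    return (len(inputs) + len(outputs)) * n


# d = (a + b) * c as two vector operations; t is the intermediate result.
OPS = [("t", ("a", "b")), ("d", ("t", "c"))]

if __name__ == "__main__":
    n = 64
    print(memory_memory_words(OPS, n))           # 6 * n words
    print(vector_register_words(OPS, {"d"}, n))  # 4 * n words
```

In this model the intermediate t costs the memory-memory machine one extra write and one extra read of n words, which is the reuse penalty listed above.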
9
Vector computers: Memory bandwidth & latency
Memory access latency adds to the start-up cost of fetching a vector from memory
Sustaining sufficient bandwidth requires a special memory organization with multiple memory banks
Additional problems arise when memory is accessed in an irregular pattern (very typical for matrix-based computations)
10
Vector Computers: Simplified general structure of a vector-register vector computer
[Figure: block diagram. Blocks: external memory, main memory, vector transfer control and address generator, vector registers (local memory), vector operation control, pipelined functional units, vector processor, scalar processor, instruction processor. Signal labels: data (vectors), data (scalars), address parameters, functions, status, instructions, scalar instructions, vector instructions.]
11
Vector Computers: Cray-1
Main features of a classical vector-register vector computer
  Load/store architecture
  Vector registers
  Vector instructions
  Hardwired control
  Highly pipelined functional units
  Interleaved memory system (16 banks, 4-cycle busy time, 12-cycle latency)
  No data caches
  No virtual memory
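To see why the bank count and busy time matter, the following is a minimal sketch of an interleaved memory using the 16 banks and 4-cycle busy time quoted above. The model itself is an assumption made for illustration: one access is issued per cycle, element i maps to bank (i * stride) mod 16, the 12-cycle latency is ignored, and the function name is hypothetical.

```python
def issue_cycles(n, stride, banks=16, busy=4):
    """Cycles needed to issue n accesses with the given word stride.

    An access stalls until its bank is free again; a bank stays busy
    for `busy` cycles after it accepts an access.
    """
    free_at = [0] * banks          # cycle at which each bank is free again
    cycle = 0
    for i in range(n):
        bank = (i * stride) % banks
        cycle = max(cycle, free_at[bank])  # stall on a busy bank
        free_at[bank] = cycle + busy
        cycle += 1                 # next access can issue one cycle later
    return cycle


if __name__ == "__main__":
    for stride in (1, 4, 8, 16):
        print(stride, issue_cycles(64, stride))
```

In this model strides 1 and 4 keep the memory streaming one word per cycle, while strides 8 and 16 revisit a bank before its busy time has elapsed and stall.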
12
Basic Cray-1 architecture
13
Vector computers: Vector instructions
a_i = f1(b_i): sine, cosine, square root, …
scalar = f2(A): sum, maximum, …
a_i = f3(b_i, c_i): add, subtract, …
a_i = f4(scalar, c_i): multiply vector by scalar, …
It is possible to combine the above operations
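The four instruction forms can be mimicked with ordinary Python lists. This is only a sketch of the semantics; the helper names are hypothetical and nothing here reflects how the hardware executes the operations.

```python
import math
import operator
from functools import reduce


def vector_unary(f, b):
    """a_i = f1(b_i)"""
    return [f(x) for x in b]


def vector_reduce(f, a):
    """scalar = f2(A)"""
    return reduce(f, a)


def vector_binary(f, b, c):
    """a_i = f3(b_i, c_i)"""
    return [f(x, y) for x, y in zip(b, c)]


def scalar_vector(f, s, c):
    """a_i = f4(scalar, c_i)"""
    return [f(s, y) for y in c]


if __name__ == "__main__":
    b = [1.0, 4.0, 9.0]
    c = [2.0, 3.0, 4.0]
    print(vector_unary(math.sqrt, b))
    print(vector_reduce(max, b))
    print(vector_binary(operator.add, b, c))
    # Combining forms: a_i = 2.0 * b_i + c_i
    print(vector_binary(operator.add, scalar_vector(operator.mul, 2.0, b), c))
```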
14
Vector computers: Vector instruction set advantages
Compact
  One short instruction encodes N operations (it may be equivalent to an entire loop)
Expressive
  Each instruction tells the hardware that these N operations:
    are independent
    use the same functional unit
    access disjoint registers
    access registers in the same pattern as previous instructions
    access a contiguous block of memory (unit-stride load/store)
    access memory in a known pattern (strided load/store)
Scalable
  The same object code can be run on more parallel pipelines or lanes
15
Vector computers: Stripmining
Theoretical throughput as a function of vector length.
What happens when the vector length exceeds the size of the vector registers?
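Stripmining answers that question by splitting a long vector into strips no longer than the register length. A minimal sketch follows, assuming a maximum vector length of 64 elements; the constant and the function names are illustrative, and `vector_add` merely stands in for a single vector instruction.

```python
MVL = 64  # assumed maximum vector length (elements per vector register)


def strip_lengths(n, mvl=MVL):
    """Lengths of the strips needed to cover n elements."""
    return [min(mvl, n - start) for start in range(0, n, mvl)]


def vector_add(x, y, mvl=MVL):
    """Stand-in for one vector instruction; operands must fit a register."""
    assert len(x) == len(y) <= mvl
    return [a + b for a, b in zip(x, y)]


def stripmined_add(x, y, mvl=MVL):
    """Add two vectors of any length, one strip at a time."""
    result = []
    for start in range(0, len(x), mvl):
        vl = min(mvl, len(x) - start)  # value of the vector length register
        result.extend(vector_add(x[start:start + vl], y[start:start + vl], mvl))
    return result


if __name__ == "__main__":
    print(strip_lengths(150))  # two full strips and one short strip
    print(stripmined_add(list(range(150)), [1] * 150)[:5])
```

Each strip pays the start-up cost again, which is one reason measured throughput typically dips just after the vector length crosses a multiple of the register length.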
16
Vector computers: Stripmining
Performance of the Spert-II system on dot product with unit-stride operands (32 vector registers). K. Asanovic, "Vector microprocessors".
17
Vector computers: Vector chaining
Example: y_i = a*x_i + y_i
[Figure: pipeline diagram. Elements x_i stream into the multiplier; the products a*x_i are forwarded directly into the adder together with the elements y_i, producing a*x_i + y_i.]
Performance of the Cray-1 was almost doubled with the use of vector chaining, from 80 Mflops to 153 Mflops.
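A simple timing model suggests where a gain of this size can come from. The sketch assumes each functional unit delivers one result per cycle after a fixed start-up latency, and that without chaining the add cannot start until the multiply has finished the whole vector. The latencies of 7 and 6 cycles and the function names are assumptions chosen for illustration, and memory traffic is ignored.

```python
def unchained_cycles(n, latencies):
    """Each vector operation waits for the previous one to finish."""
    return sum(latency + n for latency in latencies)


def chained_cycles(n, latencies):
    """Each result is forwarded to the next unit as soon as it is produced."""
    return sum(latencies) + n


MULTIPLY_ADD = (7, 6)  # assumed start-up latencies: multiply, then add

if __name__ == "__main__":
    n = 64
    slow = unchained_cycles(n, MULTIPLY_ADD)
    fast = chained_cycles(n, MULTIPLY_ADD)
    print(slow, fast, round(slow / fast, 2))
```

With two dependent operations the chained time approaches half the unchained time as the vector gets longer, which is consistent with a speed-up of a little under 2.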
18
Vector computers: Scatter and gather
Sometimes only certain elements of a vector are needed in a computation
If the elements to be used are regularly spaced, the spacing between the elements to be gathered is called the stride
Example: with a stride equal to 4, the elements x1, x5, x9, x13, … , x[4*floor((n-1)/4)+1] are extracted from the vector x1, x2, x3, x4, x5, x6, x7, x8, … , xn
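The stride-4 example can be checked with a short sketch. The function names are hypothetical, and the element labels are strings so the output reads like the slide.

```python
def strided_gather(x, stride, start=0):
    """Pick every `stride`-th element, beginning at position `start` (0-based)."""
    return [x[i] for i in range(start, len(x), stride)]


def last_gathered_index(n, stride):
    """1-based index of the last element gathered from x1..xn."""
    return stride * ((n - 1) // stride) + 1


if __name__ == "__main__":
    x = [f"x{i}" for i in range(1, 14)]  # x1 .. x13
    print(strided_gather(x, 4))
    print(last_gathered_index(len(x), 4))
```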
19
Vector computers: Scatter and gather
Scatter and gather operations may also be used with irregularly spaced data
Example: operation gather
  Index vector:  1   3   4   7
  Source:        a1  a2  a3  a4  a5  a6  a7  a8
  Result:        a1  a3  a4  a7
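A minimal sketch of gather and its inverse, scatter, driven by an index vector. Indices are 1-based to match the example; the function names and the use of None for untouched destination slots are assumptions made for illustration.

```python
def gather(source, indices):
    """Collect source elements at the given 1-based indices."""
    return [source[i - 1] for i in indices]


def scatter(values, indices, dest):
    """Write values to the given 1-based indices of a copy of dest."""
    out = list(dest)
    for value, i in zip(values, indices):
        out[i - 1] = value
    return out


if __name__ == "__main__":
    a = [f"a{i}" for i in range(1, 9)]  # a1 .. a8
    idx = [1, 3, 4, 7]
    packed = gather(a, idx)
    print(packed)
    print(scatter(packed, idx, [None] * 8))
```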
20
Vector computers: Compress and expand
Compress and expand also handle irregularly spaced data, with the elements selected by a bit mask instead of an index vector
Example: operation compress
  Mask:    1   0   1   1   0   0   1   0
  Source:  a1  a2  a3  a4  a5  a6  a7  a8
  Result:  a1  a3  a4  a7
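Compress and expand can be sketched in the same way. The mask below has bits set at positions 1, 3, 4 and 7, which is assumed here so that the result matches the example; the function names and the fill value are hypothetical.

```python
def compress(source, mask):
    """Pack the elements whose mask bit is 1 into a dense vector."""
    return [x for x, m in zip(source, mask) if m]


def expand(packed, mask, fill=None):
    """Spread a dense vector back to the positions whose mask bit is 1."""
    values = iter(packed)
    return [next(values) if m else fill for m in mask]


if __name__ == "__main__":
    a = [f"a{i}" for i in range(1, 9)]  # a1 .. a8
    mask = [1, 0, 1, 1, 0, 0, 1, 0]
    packed = compress(a, mask)
    print(packed)
    print(expand(packed, mask))
```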
21
Vector computers: Vector conditional execution
Vectorization of a loop with conditional code
    for (i = 0; i < N; i++)
        if (A[i] > 0)
            A[i] = B[i];
        else
            A[i] = C[i];
Use of a vector mask register (1 bit per element)
    lv    vA, rA        # Load A vector
    mgtz  m0, vA        # Set bits in mask register m0 where A>0
    lv.m  vA, rB, m0    # Load B vector into A under mask
    fnot  m1, m0        # Invert mask register
    lv.m  vA, rC, m1    # Load C vector into A under mask
    sv    vA, rA        # Store A back to memory (no mask)
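The masked sequence can be traced with a small Python model in which each helper stands for one of the instructions above. The helper names are hypothetical, and the negative values in the sample A vector are assumed for illustration.

```python
def mask_gt_zero(va):
    """mgtz: set a mask bit wherever the element is greater than 0."""
    return [x > 0 for x in va]


def mask_not(mask):
    """fnot: invert every mask bit."""
    return [not m for m in mask]


def load_under_mask(dest, src, mask):
    """lv.m: overwrite dest elements only where the mask bit is set."""
    return [s if m else d for d, s, m in zip(dest, src, mask)]


def conditional_assign(a, b, c):
    va = list(a)                      # lv    vA, rA
    m0 = mask_gt_zero(va)             # mgtz  m0, vA
    va = load_under_mask(va, b, m0)   # lv.m  vA, rB, m0
    m1 = mask_not(m0)                 # fnot  m1, m0
    va = load_under_mask(va, c, m1)   # lv.m  vA, rC, m1
    return va                         # sv    vA, rA


if __name__ == "__main__":
    a = [5, -1, 1, -2, -3, 2, 3, 4]   # assumed sample values
    b = [f"B{i}" for i in range(1, 9)]
    c = [f"C{i}" for i in range(1, 9)]
    print(conditional_assign(a, b, c))
```

The mask is computed once from the original A, so the second masked load cannot be disturbed by the values the first one wrote.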
22
Vector computers: Vector conditional execution
    lv    vA, rA
    mgtz  m0, vA
    lv.m  vA, rB, m0
    fnot  m1, m0
    lv.m  vA, rC, m1
    sv    vA, rA
Source A:  5   -   1   -   -   2   3   4     (- marks an element that is not greater than 0)
m0:        1   0   1   0   0   1   1   1
B:         B1  B2  B3  B4  B5  B6  B7  B8
m1:        0   1   0   1   1   0   0   0
C:         C1  C2  C3  C4  C5  C6  C7  C8
Result A:  B1  C2  B3  C4  C5  B6  B7  B8
23
Vector computers: Programming vector computers
Assembly language programming
Libraries
Data-parallel languages
  Support for data-parallel operations as an inherent part of the language (intrinsic operators and functions)
  Fortran 90, High Performance Fortran
Vectorizing compilers
  Extensive loop dependence analysis
24
Vector computers: Vector processing applications
Problems that can be efficiently formulated in terms of vectors
  Long-range weather forecasting
  Petroleum exploration
  Seismic data analysis
  Medical diagnosis
  Aerodynamics and space flight simulations
  Artificial intelligence and expert systems
  Mapping the human genome
  Image processing