Array computers
Single Instruction Stream Multiple Data Streams computer There two types of general structures of array processors SIMD Distributerd Memory SIMD Shared memory
Array computers Shared Memory SIMD CP – Control Processor CP MM – Control Processor Memory Module PE – Processing Element MM – Memory Module, IS – Instruction Stream DS – Data Stream
Array computers Distributed Memory SIMD CP – Control Processor CP MM – Control Processor Memory Module PE – Processing Element MM – Memory Module, IS – Instruction Stream DS – Data Stream
Array computers Interconnection networks Static interconnection topologies: 2-D mesh Star Tree Ring 2-D torus Illiac mesh (in a minute...)
Array computers Interconnection networks Static interconnection topologies (continued): Hypercube Network size - 16 nodes Node degree – 4 links Network diameter – max 4 links need to be traversed between any 2 nodes
Array computers Interconnection networks Dynamic interconnection topologies (various solutions), three main categories Buses Crossbar Multistage switches Example: 8x8 Omega network with 2x2 switches possible switch connections
Array computers Illiac IV In 1966 the Department of Defense Advanced Research Projects contracted the University of Illinois to build large parallel processing computer, the Illiac IV It did not become operational until 1976 at NASA’s Ames research Center
Array computers Illiac IV One of the most infamous supercomputers ever It used early ideas on SIMD The project started in 1965, it used 64 processors and a 13MHz clock. In 1976 it ran its first sucessfull application. It had 1MB memory (64x16KB) Its actual performance was 15 MFLOPS, it was estimated in initial predictions to be 1000 MFLOPS Only a quarter of the fully planned machine was ever built, costs escalated from the $8 million estimated in 1966 to $31 million by 1972, and the computer took three more years of enginering before it was operational
Array computers Illiac IV – functional structure Illiac IV system Illiac IV arrayIlliac IV I/O Array processor Control unit 64 PE64 PEM General purpose computer B-6500 Peripherals Disk file system I/O subsystem Control descriptor Buffer In/Out switch Functional structure of the Illiac IV system
Array computers Illiac IV – general structure In the original design four arrays of processing elements were planned Only one of the four arrays was actually built
Illiac IV – control unit Instruction buffer (128) Local data buffer (64 64-bit words) 4 64-bit accumulator registers 64-bit arithmetic logic unit 24-bit address adder Queue of addresses and data sent to processing elements
Array computers Illiac IV – routing network
Illiac IV – processing element Accumulator (RGA 64-bit) B register (RGB 64-bit) Holds the second operand in a binary operation (such as ADD, SUBTRACT, MULTIPLY, or DIVIDE) Routing register (RGR 64- bit) Used to transmit information from one PE to another S register (RGS 64-bit) Temporary storage register Index register (RGX 16-bit) Index register to modify the address field of an instruction Mode register (RGD 8-bit) Controls the active or nonactive status of each PE independently
Array computers Illiac IV – processing element memory Each PE has its own 2048-word 64-bits per word random-access memory Each memory is called a PEM, and they are numbered 0 through 63 also PE and PEM taken together are called a processing unit or PU. PE i may only access PEM i so that one PU cannot modify the memory of another PU Information can, however, be passed from one PU to another via the routing network, which is one of the 4 paths by which data flow through the Illiac IV array
Array computers Illiac IV – data paths Data paths in the array of Illiac IV
Array computers Illiac IV – data paths Control Unit Bus (CUB) Instructions or data from the PEMs in blocks of eight words can be sent to the CU via the CU bus Operating system takes care of fetching and executing instructions, data can also be fetched in blocks of eight words under program control using the CU bus Common Data Bus (CDB) Information stored in the CU can be "broadcast" to the entire 64 PE array simultaneously via the CDB A value such as a constant to be used as a multiplier need not be stored 64 times in each PEM; instead this value can be stored within a CU register and then broadcast to each enabled PE in the array
Array computers Illiac IV – data paths Routing network Information in one PE register can be sent to another PE register by special routing instructions (information can be transferred from PE register to PEM by standard LOAD or STORE instructions) High-speed routing lines run between every RGR of every PE and its nearest left and right neighbor (distances of -1 and + 1, respectively) and its neighbor 8 positions to the left and 8 positions to the right (-8 and +8, respectively)
Array computers Illiac IV – data paths Mode-bit line Mode-bit line consists of one line coming from the RGD of each PE in the array Mode-bit line can transmit one of the eight mode bits of each RGD in the array up to an ACAR in the CU If this bit is the bit which indicates whether or not a PE is on or off, we can transmit a "mode pattern" to an ACAR. This mode pattern reflects the status or on-offness of each PE in the array There are instructions which are executed completely within the CU that can test this mode pattern and branch on a zero or nonzero condition In this way branching in the instruction stream can occur based on the mode pattern of the entire 64-PE array.
Array computers Programming Illiac IV Hardware-first design Relatively poor software support Rudimentary operating system (for example, no support for disk management) GLYPNIR (ALGOL derivative) Provided statements which identified vector arithmetic suitable for parallel implementation CFD (Computational Fluid Dynamics) Fortran based language written for work at NASA None of these languages hid the architecture of the machine very well, but they allowed the development of machine specific programs in a High Level Language.
Examples of algorithms executed on array computers '*' in the array subscripts identifies that all the elements of a vector are to be operated upon simultaneously Programmer needs to be aware that only 64 processors are available, and so needs to structure the computation to make use of this Compiler will split a computation up into a number of 64 element iterations of the instruction where required
Examples of algorithms executed on array computers J=1 DO 1 I=1,3 A(*) = A(*) + A(* + J) J = J * 2 1 CONTINUE Time 0 Time 1 Time 2 Time 3
Decoupling sequential code DO 10 I = 2, A(I) = B(I) + A(I-1) 63 statements: A(2) = B(2) + A(1) A(3) = B(3) + A(2)... A 2 = B 2 + A 1 A 3 = B 3 + A 2 = B 3 + B 2 + A 1 A 4 = B 4 + A 3 = B 4 + B 3 + B 2 + A 1... A N = B N + B N-1...B 2 + A 1
Decoupling sequential code S = A(1) DO 10 N = 2,8 S = S + B(N) 10 A(N) = S A(1) B(2) B(3) B(4) B(5) B(6) B(7) B(8) EP 0 EP 1 EP 2 EP 3 EP 4 EP 5 EP 6 EP Step 1Step 2Step 3
Decoupling sequential code 1.Enable all PEs (turn ON all PEs) 2.All PEs LOAD RGA from location 3.i 0 4.All PEs LOAD RGR from their RGA (this instruction is performed by all PEs, whether they are ON (enabled) or OFF (disabled)) 5.All PEs ROUTE their RGR contents a distance of 2 i to the right (this instruction is also performed by all PEs, regardless of whether they are ON or OFF) 6.j 2 i 1 7.Disable PEs number 0 through j (turn them OFF) 8.All enabled PEs ADD to RGA, the contents of RGR 9.i i+1 10.If i<3 go back to step 4, otherwise to step Enable all PEs 12.All PEs STORE the contents of RGA to location + 1
Array computers Subsequent designs BSP – Burroughs Scientific Processor (1977) MPP – Massively Parallel Processor (1983) CM-1, CM-2 Thinking Machines