Laxmi Narayan Bhuyan http://www.cs.ucr.edu/~bhuyan SIMD Architectures Laxmi Narayan Bhuyan http://www.cs.ucr.edu/~bhuyan
Data Parallel Model Operations can be performed in parallel on each element of a large regular data structure, such as an array 1 Control Processsor broadcast to many PEs (see Ch. 1, Fig. 1-25, page 45 of [CSG99]) When computers were large, could amortize the control portion of many replicated PEs Condition flag per PE so that can skip Data distributed in each memory Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE Data parallel programming languages lay out data to processor
Data Parallel Model Vector processors have similar ISAs, but no data placement restriction SIMD led to Data Parallel Programming languages Advancing VLSI led to single chip FPUs and whole fast µProcs (SIMD less attractive) SIMD programming model led to Single Program Multiple Data (SPMD) model All processors execute identical program Data parallel programming languages still useful, do communication all at once: “Bulk Synchronous” phases in which all communicate after a global barrier
SIMD Programming – High-Performance Fortran (HPF) Single Program Multiple Data (SPMD) FORALL Construct similar to Fork: FORALL (I=1:N), A(I) = B(I) + C(I), END FORALL Data Mapping in HPF 1. To reduce interprocessor communication 2. Load balancing among processors http://www.npac.syr.edu/hpfa/ http://www.crpc.rice.edu/HPFF/
How does an SIMD computer work? A Host computer is necessary to do the I/O operations The user program is loaded into the control memory The data is distributed to all the memory modules The control unit decodes the instn and executes it if it is a scalar instn. If it is a vector instn, it broadcasts the control signals to the PEs to do the executions Before broadcasting the control signals, the CU broadcasts an enable vector which will enable the PEs
Masking and Data Routing Mechanisms A,B,C – working registers Si = status (1 active, 0 inactive) Ri – Data routing register Di – holds address Ii – Index register
Example
Matrix Multiplication
N * N Mesh
The Illiac IV Architecture Distributed memory architecture 64 PEs connected as an 8X8 2-D mesh with end around connection LDB: Local Data Buffer 64, 64-bit each PEM: 2K X 64 bits memory
The Illiac IV Network
Maspar MP-1 Architecture Configuration with 1K-16K PEs are available Each PE has a 4-bit ALU, 1-bit logic unit, a 64-bit mantissa unit, a 16-bit exponent unit, communication input and output ports Each PE has 40 32-bit registers available to the programmer Each processor board has 1024 PEs arranges as 64 PE clusters (PECs) with 16 PEs per cluster Each PEC is a chip connected to 8 neighbors via an octagonal mesh Another network, called Multistage Crossbar Network, with three router stages gives a function of 1024X1024 crossbar for routing from any PEC to another PEC