STARAN Parallel processor system hardware By KENNETH E. BATCHER Presented by Manoj k. Yarlagadda
Presentation Topics
–Parallel Processors
–Why Parallelism?
–Why Parallelism Now?
–Evolution of STARAN
–STARAN Configuration Diagram
–Multi-Dimensional Access (MDA)
–STARAN Block Diagram
Parallel Processors
–Interconnection Networks
–SIMD Computers
–MIMD Computers
–Other Architectures: Dataflow and Neural Network
SIMD MIMD There are N data streams, one per processor so different data can be used in each processor. Each processor operates under the control of an instruction stream issued by its own control unit
Why Parallelism?
Even though the CPU-memory connection is a bottleneck, we are still greatly interested in processor speedup. Parallelism can be used in the following areas:
–Simulations of complex physical systems (e.g., weather forecasting, molecular modeling)
–Image processing
–Massive data processing (e.g., seismic data)
–Large databases
Why Parallelism Now?
Parallel processors have been available for decades, but only recent technological changes have made them feasible:
–Evolution of ICs to current VLSI (or VVLSI)
–Dramatic reduction in power requirements
–Decreased cost of production
–Increased speed of processors
–Increased reliability of processors
Current SIMD machines have up to 65,536 PEs!
EVOLUTION OF STARAN
High cost of semiconductor memory and logic elements.
The versions of the associative processor (AP):
1) Built for the USAF by Goodyear Aerospace Corporation, June 1969, at Akron, Ohio.
2) The same machine, updated to include a large instruction memory, was loaned by the USAF in
3) The lessons learned in programming and testing the USAF AP model resulted in a new design called STARAN S, which was committed to production in 1971.
…Contd
4) Demonstrations in May 1972 at the TRANSPO exhibit in Washington, D.C., and in June 1972 at Boston.
The initial uses of APs would be weighted toward real-time applications involving interfaces with a wide variety of sensors, conventional computers, signal processors, interactive displays, and mass storage devices.
To accommodate all such interfaces, the STARAN was divided into:
STARAN Configuration Diagram
Standardized main frame unit
Custom interface unit:
a) A variety of I/O options, including:
–Direct memory access (DMA)
–Buffered I/O channels
–External function channels
–A unique interface called Parallel I/O
MDA MEMORIES
The memory for such an associative processor could be a simple random-access memory with the data rotated 90 degrees, so that it is accessed by bit-slices instead of by words.
The MDA memory is treated as a square array of bits: 256 words with 256 bits in each word.
To accommodate both bit-slice accesses for associative processing and word-slice accesses for STARAN input/output, the data are stored in multi-dimensional access (MDA) memories.
…Contd
It has read/write buses for parallel access to a large number (256) of memory bits.
A write mask bus allows selective writing of bits.
Memory accesses (read and write) are controlled by address and access-mode control inputs.
Bit-Slice & Word-Slice Access Modes
Bit-slice: used to access one bit of all words in parallel.
Word-slice: used for I/O operations; accesses all bits of one word in parallel.
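The two access modes can be illustrated with a small software model. This is only a sketch of the idea, not the STARAN hardware: the list-of-lists layout and the names `make_memory`, `read_word_slice`, `read_bit_slice`, and `write_bit_slice` are assumptions made for illustration.

```python
# Toy model of an MDA memory: a 256 x 256 square array of bits.
# Row index = word number, column index = bit position (bit 0 = LSB).
# The layout and all function names here are illustrative assumptions.
N = 256

def make_memory(words):
    """Build the bit array from N integers, each treated as an N-bit word."""
    return [[(w >> b) & 1 for b in range(N)] for w in words]

def read_word_slice(mem, word):
    """Word-slice access: all bits of one word in parallel."""
    return list(mem[word])

def read_bit_slice(mem, bit):
    """Bit-slice access: one bit position of every word in parallel."""
    return [mem[w][bit] for w in range(N)]

def write_bit_slice(mem, bit, data, mask):
    """Masked bit-slice write: only words whose mask bit is 1 are updated."""
    for w in range(N):
        if mask[w]:
            mem[w][bit] = data[w]

mem = make_memory(list(range(N)))   # word w holds the value w
bit0 = read_bit_slice(mem, 0)       # least significant bit of every word
word5 = read_word_slice(mem, 5)     # all 256 bits of word 5
```

In this model a bit-slice read returns one value per word, which is the shape of data an associative search or arithmetic step consumes, while a word-slice read returns one complete word, which suits input/output.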
…Contd
The MDA memory structure is not limited to a square array of 256 by 256.
–One can access 32 consecutive bytes of a record in parallel.
–One can access the corresponding bytes of all records.
–One can access a bit from each byte in parallel.
STARAN ARRAY MODULES
…Contd
1) Array module components communicate through a network called the flip network.
2) A selector chooses a 256-bit source item from the MDA read bus.
3) The flip network may shift and permute the bits in various ways.
a) It allows inter-PE communication: a PE can read data from another PE, either directly from its registers or indirectly through MDA memory.
b) It can permute the 256-bit data item as a whole or divide it into groups of 2, 4, 8, 16, 32, 64, or 128 bits.
4) Mirroring reduces the number of passes.
…Contd
5) Three 256-bit registers (M, X, and Y) are loaded through the flip network. Note: X and Y are the logic registers.
6) The general logic associated with the X-register can perform any of the 16 Boolean functions of two variables. If $x_i$ is the state of the $i$th X-register bit and $f_i$ is the state of the $i$th flip network output, then
$x_i \leftarrow \phi(x_i, f_i) \quad (i = 0, 1, \ldots, 255)$
where $\phi$ is the selected Boolean function. Likewise for the Y-register:
$y_i \leftarrow \phi(y_i, f_i) \quad (i = 0, 1, \ldots, 255)$
7) If X and Y are operated on together, the same Boolean function $\phi$ is applied to both registers:
$x_i \leftarrow \phi(x_i, f_i)$
$y_i \leftarrow \phi(y_i, f_i)$
8) The programmer can also choose to operate on X selectively, using Y as a mask:
$x_i \leftarrow \phi(x_i, f_i)$ (where $y_i = 1$)
$x_i \leftarrow x_i$ (where $y_i = 0$)
9) Another choice is to operate on X selectively while also operating on Y:
$x_i \leftarrow \phi(x_i, f_i)$ (where $y_i = 1$)
$x_i \leftarrow x_i$ (where $y_i = 0$)
$y_i \leftarrow \phi(y_i, f_i)$
In this case, the old state of Y (before modification by $f$) is used as the mask for the X operation.
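The register operations listed above can be sketched in software as follows. This is a minimal model under stated assumptions: the 4-bit truth-table encoding in `make_phi`, the `mode` strings, and the function name `update` are all hypothetical and are not STARAN instruction names. The example uses 4-element registers instead of 256 for readability.

```python
def make_phi(code):
    """Return the two-variable Boolean function with 4-bit truth table `code`.

    Bit (2*x + f) of `code` is phi(x, f), so codes 0..15 cover all
    16 Boolean functions of two variables. (Encoding is an assumption.)
    """
    def phi(x, f):
        return (code >> (2 * x + f)) & 1
    return phi

XOR = make_phi(0b0110)
AND = make_phi(0b1000)
OR = make_phi(0b1110)

def update(X, Y, F, phi, mode):
    """Apply phi element by element; F is the flip network output.

    mode "X"              : x <- phi(x, f)
    mode "Y"              : y <- phi(y, f)
    mode "XY"             : both registers, same phi
    mode "X_masked"       : x <- phi(x, f) only where y = 1
    mode "X_masked_and_Y" : masked X update plus y <- phi(y, f);
                            the OLD state of Y is the mask
    """
    old_Y = list(Y)
    if mode in ("X", "XY"):
        new_X = [phi(x, f) for x, f in zip(X, F)]
    elif mode in ("X_masked", "X_masked_and_Y"):
        new_X = [phi(x, f) if y else x for x, y, f in zip(X, old_Y, F)]
    else:
        new_X = list(X)
    if mode in ("Y", "XY", "X_masked_and_Y"):
        new_Y = [phi(y, f) for y, f in zip(old_Y, F)]
    else:
        new_Y = old_Y
    return new_X, new_Y

X = [0, 0, 1, 1]
Y = [1, 0, 1, 0]
F = [1, 1, 1, 1]
new_X, new_Y = update(X, Y, F, XOR, "X_masked_and_Y")
```

Here only positions 0 and 2 of X change, because those are the positions where the old Y holds a 1, while every position of Y is exclusive-ored with F.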
Programming example
This operation adds the contents of Field A of all memory words to the contents of Field B of the same words and stores the sum in Field S.
At the beginning of each loop execution, the carry ($c$) from the previous bits is stored in Y, and X contains zeroes:
$x_i = 0$
$y_i = c_i$
Note: the loop runs from the LSB to the MSB.
Four steps:
Step 1: Read bit-slice $a$ and exclusive-or ($\oplus$) it to X selectively and also to Y:
$x_i \leftarrow x_i \oplus y_i a_i$
$y_i \leftarrow y_i \oplus a_i$
The states of X and Y are now:
$x_i = a_i c_i$
$y_i = a_i \oplus c_i$
Step 2: Read bit-slice $b$ and exclusive-or it to X selectively and also to Y:
$x_i \leftarrow x_i \oplus y_i b_i$
$y_i \leftarrow y_i \oplus b_i$
Registers X and Y now contain the carry and sum bits:
$x_i = a_i c_i \oplus a_i b_i \oplus b_i c_i = c'_i$
$y_i = a_i \oplus b_i \oplus c_i = s_i$
…Contd
Step 3: Write the sum bit from Y into bit-slice $s$ and also complement X selectively:
$s_i \leftarrow y_i$
$x_i \leftarrow x_i \oplus y_i$
The states of X and Y are now:
$x_i = c'_i \oplus s_i$
$y_i = s_i$
Step 4: Read the X-register and exclusive-or it into both X and Y:
$x_i \leftarrow x_i \oplus x_i$
$y_i \leftarrow y_i \oplus x_i$
This clears X and stores the carry bit into Y for the next execution of the loop:
$x_i = 0$
$y_i = c'_i$
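The four-step loop can be checked with a short simulation. This is a sketch, not STARAN code: fields are modeled as Python integers (one per memory word), the function name `add_fields` is hypothetical, and the carry into the least significant bit is assumed to be zero.

```python
def add_fields(A, B, nbits):
    """Bit-serial, word-parallel add of Field A and Field B.

    A, B : lists of non-negative integers, one entry per memory word.
    Returns (S, carry): the nbits-wide sums and the final carry bits.
    """
    n = len(A)
    X = [0] * n   # x_i = 0 at the start of each loop execution
    Y = [0] * n   # y_i = c_i; the initial carry is assumed to be 0
    S = [0] * n
    for k in range(nbits):                 # LSB first
        a = [(v >> k) & 1 for v in A]      # bit-slice k of Field A
        b = [(v >> k) & 1 for v in B]      # bit-slice k of Field B
        # Step 1: XOR a into X where old y = 1, and into Y.
        X, Y = ([x ^ (y & ai) for x, y, ai in zip(X, Y, a)],
                [y ^ ai for y, ai in zip(Y, a)])
        # Step 2: XOR b into X where old y = 1, and into Y.
        # Afterwards X holds the new carry c' and Y holds the sum bit s.
        X, Y = ([x ^ (y & bi) for x, y, bi in zip(X, Y, b)],
                [y ^ bi for y, bi in zip(Y, b)])
        # Step 3: write the sum bit to bit-slice k of Field S,
        # then complement X where y = 1 (x becomes c' XOR s).
        for i in range(n):
            S[i] |= Y[i] << k
        X = [x ^ y for x, y in zip(X, Y)]
        # Step 4: XOR X into both registers; X clears, Y becomes c'.
        X, Y = [x ^ x for x in X], [y ^ x for x, y in zip(X, Y)]
    return S, Y

S, carry = add_fields([3, 5, 255, 0], [1, 7, 1, 0], 8)
```

With 8-bit fields, the third word (255 + 1) wraps to a sum of 0 with its final carry bit set, while the other words produce ordinary sums with no carry out. The loop body touches every word at once, so the running time depends on the field width rather than on the number of words.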
STARAN BLOCK DIAGRAM
Assignment switch: connects its control inputs and its data inputs and outputs to the AP.
AP (associative processor): contains registers and logic. It receives instructions from the control memory and transfers data to and from the control memory.
Registers in the AP:
1) Instruction register: holds the 32-bit instruction being executed.
2) Program status word: holds the CM address of the next instruction to be executed and the program priority level.
3) Common register: holds a 32-bit search command.
4) Array select register: selects a subset of the assigned array modules.
5) Four field pointers: hold MDA addresses.
6) Three counters: keep track of the number of loop executions.
7) Data pointer: allows stepping through a set of operands in CM.
8) Two access mode registers: hold the MDA access modes.
Parallel input/output module (PIO):
1) PIO flip network
a) Ports 0 to 3 connect to the 4 array modules.
b) Port 7 connects to the 32-bit data bus in PIO control through a fan-in and fan-out switch.
c) Ports 6, 5, and 4 are spare (high-bandwidth peripherals, e.g., radar).
2) PIO control unit (controls the array modules and the flip network)
3) Control memory (it has 5 banks of bipolar memory)
4) DEC PDP-11 (handles the peripherals and controls the system from console commands)
5) External function (controls the AP, sequential, and PIO units)
STARAN Applications
–Fast Fourier Transform (used in real-time processing of radar and sonar signals)
–Sonar post-processing (signal processing and post-processing)
–String search (searching for a string is 100 times faster than on a conventional computer)
–File processing
–Air traffic control
Architectures for Applications
Fast Fourier Transform: speed increases over sequential computers. STARAN lends itself to efficient manipulation of data in the FFT.
Ex: the Air Force supplied radar data to GAC.
–512-point, 16-bit FFT: 2.7 ms (2 MDA)
–1024-point transform: 3.0 ms (4 MDA)
Sonar post-processing: sorting and editing of the signal processor output.