Anshul Kumar, CSE IITD
CS718: Data Parallel Processors
27th April, 2006

Data Parallel Architectures
SIMD Processors
– Multiple processing elements driven by a single instruction stream
Associative Processors
– SIMD-like processors with associative memory
Vector Processors
– Uni-processors with vector instructions
Systolic Arrays
– Application-specific VLSI structures

SIMD
[Figure: SIMD organization; control unit (C), processors (P), memory (M), instruction stream (IS), data stream (DS)]
One of the earliest models of a parallel computer

ILLIAC IV SIMD Model
[Figure: control unit (CU) and processing elements PE1, PE2, ..., PEn, each a processor (P) with its own memory (M), joined by an interconnection network; I/O bus]
Planned for 64 x 4 PEs; only 64 were built

Burroughs Scientific Processor (BSP) Model
[Figure: control unit (CU), processors P1, P2, ..., Pn and memories M1, M2, ..., Mk joined by an interconnection network; I/O bus]

SIMD algorithms: sum of vector elements
Input: a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7
step 1: S_i = a_i + a_{i+1}, i = 0, 2, 4, 6 (gives a_0+a_1, a_2+a_3, a_4+a_5, a_6+a_7)
step 2: S_i = S_i + S_{i+2}, i = 0, 4 (gives a_0+a_1+a_2+a_3, a_4+a_5+a_6+a_7)
step 3: S_i = S_i + S_{i+4}, i = 0 (gives a_0+a_1+a_2+a_3+a_4+a_5+a_6+a_7)
OR
step 1: S_i = a_i + a_{i+4}, i = 0, 1, 2, 3
step 2: S_i = S_i + S_{i+2}, i = 0, 1
step 3: S_i = S_i + S_{i+1}, i = 0
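The two summation schemes on this slide can be sketched in Python as below. This is only a sequential simulation: each pass of the inner `for` loop stands in for one SIMD step in which all the additions happen at once. The function names are illustrative and do not come from the slides, and the second scheme assumes the vector length is a power of two.

```python
def tree_sum_adjacent(a):
    """First scheme: combine neighbours, stride 1, 2, 4, ..."""
    s = list(a)
    n = len(s)
    steps = 0
    stride = 1
    while stride < n:
        # All additions in this inner loop are independent of each other,
        # so a SIMD machine could perform them in a single step.
        for i in range(0, n - stride, 2 * stride):
            s[i] = s[i] + s[i + stride]
        stride *= 2
        steps += 1
    return s[0], steps


def tree_sum_halving(a):
    """Second scheme: fold the upper half onto the lower half (n a power of two)."""
    s = list(a)
    half = len(s) // 2
    steps = 0
    while half >= 1:
        # Again, one pass of this loop models one parallel step.
        for i in range(half):
            s[i] = s[i] + s[i + half]
        half //= 2
        steps += 1
    return s[0], steps


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(tree_sum_adjacent(data))  # (total, number of parallel steps)
    print(tree_sum_halving(data))
```

Both variants reach the total in log2(n) parallel steps; they differ in which elements are paired, which matters for how the data must move across the interconnection network.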

No. of processors vs time
Adding vector elements:
– n processors: log n steps
– n/log n processors: log n steps
Matrix multiplication:
– n processors: n^2 steps
– n^2 processors: n steps
– n^3 processors: log n steps
– n^3/log n processors: log n steps
Important factors: data distribution, network
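One way to see why n/log n processors can still add n elements in about log n steps is sketched below. Each processor first adds its own block of about log n elements sequentially, and the partial sums are then combined in a tree. This is a sequential simulation under stated assumptions: the function name is hypothetical, n is taken to be a power of two, and "log n steps" is read as an order-of-magnitude figure (the sketch needs roughly 2 log n).

```python
import math


def sum_with_fewer_processors(a):
    """Return (total, processors used, parallel steps) for the two-phase sum."""
    n = len(a)
    chunk = max(1, int(math.log2(n)))  # elements handled by each processor
    p = math.ceil(n / chunk)           # about n / log n processors

    # Phase 1: every processor adds its own chunk sequentially.
    # All processors work at the same time, so this costs chunk - 1 steps.
    partial = [sum(a[k * chunk:(k + 1) * chunk]) for k in range(p)]
    steps = chunk - 1

    # Phase 2: tree reduction over the p partial sums, one step per level.
    stride = 1
    while stride < p:
        for i in range(0, p - stride, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        steps += 1
    return partial[0], p, steps


if __name__ == "__main__":
    total, procs, steps = sum_with_fewer_processors(list(range(16)))
    print(total, procs, steps)
```

For 16 elements the sketch uses 4 processors and 5 steps (3 local additions plus 2 tree levels), compared with 16 processors and 4 steps for the plain tree: the step count stays of order log n while the processor count drops by a factor of log n.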

Rise and fall of SIMDs
Introduced in the 1960s (e.g. Illiac, BSP)
Problems:
– not cost effective
– serial fraction and Amdahl's law
– I/O bottleneck
Overshadowed by Vector Processors
Resurrected in the 1980s (MPP from Goodyear, Connection Machine from Thinking Machines Corp., MP-1 from MasPar)
Did not survive because of high cost

Related ideas
Coarse-grain SIMD with off-the-shelf processors (synchronized MIMD), e.g. CM-5 of Thinking Machines
This gave rise to SPMD (single program multiple data)
MMX and SIMD instructions in Pentium

Vector Processors
[Figure: vector processor block diagram; I-cache, D-cache, memory control, I-unit and control, V-reg, GPRs, address unit, VFU, FU, buses, memory]

Four Generations of CRAY systems (vector processors)
[Table: columns System, CPUs, Clock (MHz), Flops/clock/CPU, Words moved/clk/CPU, Mflops, Gates/chip; rows CRAY, X-MP, Y-MP, C]

Cray History

CRAY C90
8 GB central memory shared by 16 CPUs
128 CPU-memory paths
word = 64 bits + 16 ECC
Dual vector pipes
128-element segments
Memory:
– 8 sections
– 8 x 8 subsections
– 8 x 8 x 2 bank groups
– 8 x 8 x 2 x 8 banks

Convex C4/XA system
CPU: 7.5 ns clock, 1620 MFLOPS
Mem: 32 MB x 32 banks, 64-bit word, 50 ns access time
3 FP pipes, 2 results each
Vector regs - FPU crossbar
1.1 GB/s per I/O port
5 x 5 crossbar (CPUs, memories, I/O, utilities)

Other examples
NEC SX-X
– 4 CPUs
– 4 x 2 pipes each
Fujitsu VP
– CPUs
– 2 LS pipes
– 3 Func pipes
– 2 mask pipes
Fujitsu VP
– CPUs

Systolic Arrays (H.T. Kung 1978)
Simplicity, Regularity, Concurrency, Communication
Example: Band matrix multiplication
[Figure: animation of band matrix multiplication on a systolic array, snapshots T=0 to T=6. Elements of A (A11, A12, A21, ...) and B (B11, B12, B21, ...) enter the array step by step; paired operands such as A11 B11 and A12 B21 appear inside the array from T=3; C11 appears at T=5 and C12 at T=6]
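The data movement in a systolic array can be sketched with a small simulation. Note the substitution: the code below is not the band-matrix array animated above, but a simpler and commonly shown variant, an n x n mesh in which A values move right, B values move down, and each cell accumulates one element of C. The function name and the square-matrix restriction are assumptions made for illustration.

```python
def systolic_matmul(A, B):
    """Simulate an n x n mesh of multiply-accumulate cells computing C = A * B."""
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]  # A value currently held in each cell
    b_reg = [[0] * n for _ in range(n)]  # B value currently held in each cell
    C = [[0] * n for _ in range(n)]

    # The last operand pair reaches cell (n-1, n-1) at time 3n - 3.
    for t in range(3 * n - 2):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    # Row i of A enters from the left, delayed by i steps.
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # pass A to the right
                if i == 0:
                    # Column j of B enters from the top, delayed by j steps.
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # pass B downward
        a_reg, b_reg = new_a, new_b

        # Every cell multiplies the pair it holds and adds it to its own C element.
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each cell talks only to its immediate neighbours, and the skewed input timing ensures that A[i][k] and B[k][j] meet in cell (i, j) at time i + j + k. This illustrates the regularity and local communication listed on the slide.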

WARP: Programmable Systolic Processor [Kung, CMU 1987]
Complete contrast to the original idea
– not application specific
– not a single VLSI
– complex cell (pipelined FP adder, mult, FIFOs, RAM, crossbar)
– linear
– asynchronous

References
D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley.
K. Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", McGraw Hill, 1993.