Published by Jocelin Nichols. Modified over 9 years ago.
1
Anshul Kumar, CSE IITD
CS718: Data Parallel Processors
27th April, 2006
2
Data Parallel Architectures
SIMD Processors
–Multiple processing elements driven by a single instruction stream
Associative Processors
–SIMD-like processors with associative memory
Vector Processors
–Uniprocessors with vector instructions
Systolic Arrays
–Application-specific VLSI structures
3
SIMD
[Figure: block diagram with blocks labelled C, P, P and M, and with IS and DS marking the instruction stream and data stream]
One of the earliest models of parallel computer
4
ILLIAC IV SIMD Model
[Figure: processing elements PE1, PE2, ..., PEn, each a processor (P) with its own memory (M), joined by an interconnection network; a control unit (CU) and an I/O bus]
Planned for 64 x 4 PEs; only 64 were built
5
Burroughs Scientific Processor (BSP) Model
[Figure: processors P1, P2, ..., Pn and memories M1, M2, ..., Mk connected through an interconnection network; a control unit (CU) and an I/O bus]
6
SIMD algorithms: sum of vector elements
step 1: $S_i = a_i + a_{i+1}$, i = 0, 2, 4, 6
step 2: $S_i = S_i + S_{i+2}$, i = 0, 4
step 3: $S_i = S_i + S_{i+4}$, i = 0
[Figure: reduction tree]
a0 a1 a2 a3 a4 a5 a6 a7
a0+a1, a2+a3, a4+a5, a6+a7
a0+a1+a2+a3, a4+a5+a6+a7
a0+a1+a2+a3+a4+a5+a6+a7
OR
step 1: $S_i = a_i + a_{i+4}$, i = 0, 1, 2, 3
step 2: $S_i = S_i + S_{i+2}$, i = 0, 1
step 3: $S_i = S_i + S_{i+1}$, i = 0
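The two reduction schedules above can be checked with a short simulation. This is a minimal Python sketch, not code from the lecture: the function names `tree_sum_adjacent` and `tree_sum_halving` are hypothetical, the vector is assumed to be non-empty with a power-of-two length, and each pass of the outer loop stands for one SIMD step in which all listed indices i would be updated at once.

```python
def tree_sum_adjacent(a):
    """First schedule: the stride doubles each step (1, 2, 4, ...)."""
    s = list(a)
    n = len(s)
    stride = 1
    steps = 0
    while stride < n:
        # One SIMD step: every active i computes S[i] = S[i] + S[i + stride].
        # Active indices are multiples of 2 * stride, so no element that is
        # read in this step is also written in this step.
        for i in range(0, n, 2 * stride):
            s[i] = s[i] + s[i + stride]
        stride *= 2
        steps += 1
    return s[0], steps


def tree_sum_halving(a):
    """Second schedule: the stride halves each step (n/2, ..., 2, 1)."""
    s = list(a)
    stride = len(s) // 2
    steps = 0
    while stride >= 1:
        # One SIMD step: indices 0 .. stride-1 add in the upper half.
        for i in range(stride):
            s[i] = s[i] + s[i + stride]
        stride //= 2
        steps += 1
    return s[0], steps
```

For an 8-element vector both functions take 3 steps, matching log2(8) on the slide; they differ only in which elements each processing element must fetch from its neighbours, which is where the interconnection network matters.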
7
No. of processors vs time
Adding vector elements:
–n processors – log n steps
–n/log n processors – log n steps
Matrix multiplication:
–n processors – n^2 steps
–n^2 processors – n steps
–n^3 processors – log n steps
–n^3/log n processors – log n steps
Important factors: data distribution, network
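The n/log n entry for vector addition can be understood as a two-phase scheme: each processor first adds a block of about log n elements sequentially, and the partial sums are then combined in a tree. The sketch below is an illustration under that assumption, not the lecture's own formulation; `blocked_sum` is a hypothetical name, the input is assumed non-empty, and the step count is for the simulated parallel machine. The total is roughly 2 log n steps, which is still of order log n.

```python
import math


def blocked_sum(a):
    """Sum n elements with about n / log2(n) simulated processors.

    Returns (total, processors, parallel_steps).
    """
    n = len(a)
    # Block size is about log2(n); a single element needs no blocking.
    block = math.ceil(math.log2(n)) if n > 1 else 1
    procs = math.ceil(n / block)

    # Phase 1: every processor adds its own block sequentially.
    # All processors work at the same time, so this costs block - 1 steps.
    partial = [sum(a[p * block:(p + 1) * block]) for p in range(procs)]
    steps = block - 1

    # Phase 2: tree reduction over the partial sums, one step per level.
    stride = 1
    while stride < procs:
        for i in range(0, procs - stride, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        steps += 1
    return partial[0], procs, steps
```

With 16 elements this uses 4 processors and 5 parallel steps, against 4 steps for the full tree on many more processors.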
8
Rise and fall of SIMDs
Introduced in the 1960s (e.g. Illiac, BSP)
Problems:
–not cost effective
–serial fraction and Amdahl's law
–I/O bottleneck
Overshadowed by Vector Processors
Resurrected in the 1980s (MPP from Goodyear, Connection Machine from Thinking Machines Corporation, MP-1 from MasPar)
Did not survive because of high cost
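The serial-fraction problem listed above follows from Amdahl's law: if a fraction f of the work cannot be parallelised, the speedup on p processors is 1 / (f + (1 - f) / p). A small Python sketch of that formula follows; the function name `amdahl_speedup` is hypothetical and the sample numbers are illustrative, not figures from the lecture.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

With a 10% serial fraction, 64 processing elements give a speedup of under 9, and no number of processing elements can reach 10, which is one reason large SIMD arrays were hard to justify on cost.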
9
Related ideas
Coarse-grain SIMD with off-the-shelf processors (synchronized MIMD), e.g. CM-5 of Thinking Machines
This gave rise to SPMD (single program multiple data)
MMX and SIMD instructions in Pentium
10
Vector Processors
[Figure: block diagram with I-cache, D-cache, memory control, I-unit and control, vector registers (V-reg), GPRs, address unit, vector functional units (VFU), functional units (FU), buses and memory]
11
Four Generations of CRAY systems (vector processors)
System   CPUs   Clock (MHz)   Flops/clock/CPU   Words moved/clk/CPU   Mflops   Gates/chip
CRAY-1      1            80                 2                     1       80            2
X-MP        4           105                 2                     3      840           16
Y-MP        8           166                 2                     3     2667         2500
C90        16           240                 4                     6    15360        10000
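The Mflops column in the table above appears to be the product of the CPU count, the clock rate in MHz and the flops per clock per CPU. The sketch below states that relation in Python; it is an inference from the table's own numbers rather than a formula given in the lecture, and `peak_mflops` is a hypothetical name.

```python
def peak_mflops(cpus, clock_mhz, flops_per_clock):
    """Peak Mflops as CPUs x clock (MHz) x flops per clock per CPU."""
    return cpus * clock_mhz * flops_per_clock


# Values taken from the table rows (CPUs, clock in MHz, flops/clock/CPU).
xmp = peak_mflops(4, 105, 2)
ymp = peak_mflops(8, 166, 2)
c90 = peak_mflops(16, 240, 4)
cray1 = peak_mflops(1, 80, 2)
```

The X-MP and C90 products reproduce the table exactly (840 and 15360). The Y-MP product is 2656 against 2667 in the table, which suggests the clock column is rounded down. The CRAY-1 product is 160, whereas the row as transcribed lists 80.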
12
Cray History
http://www.cray.com/company/history.html
13
CRAY C90
8 GB central memory shared by 16 CPUs
128 CPU-memory paths
word = 64 bits + 16 ECC bits
Dual vector pipes
128-element segments
Memory:
–8 sections
–8x8 subsections
–8x8x2 bank groups
–8x8x2x8 banks
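The memory hierarchy above multiplies out to 8 x 8 x 2 x 8 = 1024 banks. The following Python sketch illustrates why a large bank count helps strided vector accesses. It assumes simple low-order interleaving (bank = word address mod number of banks), which is an assumption made for illustration and not a description of the actual C90 address mapping; the names `bank_of` and `banks_touched` are hypothetical.

```python
from math import gcd

# Bank count from the slide: sections x subsections x bank groups x banks.
TOTAL_BANKS = 8 * 8 * 2 * 8  # 1024


def bank_of(word_address, total_banks=TOTAL_BANKS):
    """Bank holding a word under assumed low-order interleaving."""
    return word_address % total_banks


def banks_touched(stride, total_banks=TOTAL_BANKS):
    """Distinct banks visited by a long access with the given word stride."""
    return total_banks // gcd(stride, total_banks)
```

Under this assumption a unit-stride or odd-stride vector access spreads over all 1024 banks, while a stride of 1024 words hits the same bank on every reference and is limited by the bank cycle time.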
14
Convex C4/XA system
CPU: 7.5 ns clock, 1620 MFLOPS
Mem: 32 MB x 32 banks, 64-bit word, 50 ns access time
3 FP pipes, 2 results each
Vector regs - FPU crossbar
1.1 GB/s per I/O port
[Figure: 5 x 5 crossbar connecting CPUs, memories, I/O and utilities]
15
Other examples
NEC SX-X: 4 CPUs, 4 x 2 pipes each
Fujitsu VP5000: 7 - 222 CPUs, 2 LS (load/store) pipes, 3 functional pipes, 2 mask pipes
Fujitsu VP2000: 1 - 2 CPUs
16
Systolic Arrays (H.T. Kung 1978)
Simplicity, Regularity, Concurrency, Communication
Example: Band matrix multiplication
17
[Figure sequence, slides 17 to 23: snapshots at T=0 through T=6 of the band matrix multiplication on the systolic array. Elements of A (A11 to A53) and B (B11 to B43) stream into the array, operand pairs such as A11 B11 appear from T=3, C11 first appears at T=5 and C12 at T=6.]
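The data movement in the snapshots can be illustrated with a small simulation. Note that this sketch does not reproduce the band-matrix array shown on the slides: it swaps in a simpler square mesh in which cell (i, j) accumulates C[i][j] in place, rows of A enter from the left and columns of B enter from the top, each skewed by one time step per row or column. The function name `systolic_matmul` is hypothetical, and the matrices are assumed to be square lists of lists of equal size.

```python
def systolic_matmul(A, B):
    """Multiply two n x n matrices on a simulated n x n systolic mesh."""
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]  # operand moving left to right
    b_reg = [[0] * n for _ in range(n)]  # operand moving top to bottom
    C = [[0] * n for _ in range(n)]      # accumulator held in each cell

    for t in range(3 * n - 2):
        # Shift A one cell to the right; row i is delayed by i steps.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        # Shift B one cell down; column j is delayed by j steps.
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every cell multiplies the operands that just arrived and accumulates.
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C
```

At step t, cell (i, j) sees A[i][k] and B[k][j] with k = t - i - j, so the whole product completes in 3n - 2 steps using only nearest-neighbour communication, which is the regularity and locality that the systolic idea relies on.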
24
WARP: Programmable Systolic Processor [Kung, CMU 1987]
Complete contrast to the original idea:
–not application specific
–not a single VLSI
–complex cell (pipelined FP adder, multiplier, FIFOs, RAM, crossbar)
–linear
–asynchronous
25
References
D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.
K. Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", McGraw Hill, 1993.