Slide 1: Burroughs B5500 multiprocessor. These machines were designed to support high-level languages (HLLs), such as Algol. They used a stack architecture, but part of the stack was also addressable as registers.
Slide 2: COMP 740: Computer Architecture and Implementation
Montek Singh
Thu, April 2, 2009
Topic: Multiprocessors I
Slide 3: Uniprocessor Performance (SPECint)
VAX: 25%/year, 1978 to 1986
RISC + x86: 52%/year, 1986 to 2002
RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.
(The slide's performance chart carries a "3X" annotation.)
Slide 4: Déjà vu all over again?
"… today's processors … are nearing an impasse as technologies approach the speed of light." David Mitchell, The Transputer: The Time Is Now (1989)
The Transputer had bad timing: uniprocessor performance was still climbing, so procrastination was rewarded with 2X sequential performance every 1.5 years.
"We are dedicating all of our future product development to multicore designs. … This is a sea change in computing." Paul Otellini, President, Intel (2005)
Now all microprocessor companies have switched to multiprocessors (2X CPUs / 2 yrs), and procrastination is penalized: only 2X sequential performance every 5 years.

Manufacturer/Year    AMD/'05   Intel/'06   IBM/'04   Sun/'05
Processors/chip          2          2          2         8
Threads/Processor        1          2          2         4
Threads/chip             2          4          4        32
Slide 5: Other Factors Favoring Multiprocessors
Growth in data-intensive applications: databases, file servers, …
Growing interest in servers and server performance
Increasing desktop performance is less important (outside of graphics)
Improved understanding of how to use multiprocessors effectively, especially in servers, where there is significant natural TLP
Advantage of leveraging design investment by replication, rather than a unique design
Slide 6: Flynn's Taxonomy
Flynn classified architectures by their instruction and data streams in 1966:
Single Instruction, Single Data (SISD): uniprocessor
Single Instruction, Multiple Data (SIMD): single PC; vector machines, CM-2
Multiple Instruction, Single Data (MISD): no clear commercial examples
Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers
SIMD exploits data-level parallelism; MIMD exploits thread-level parallelism.
MIMD is popular because it is:
Flexible: N programs, or 1 multithreaded program
Cost-effective: the same MPU is used in desktops and in MIMD machines
M. J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966.
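To make the SIMD/MIMD distinction concrete, here is a minimal, hypothetical Python sketch (not part of the slides): a vectorized NumPy add stands in for SIMD-style data-level parallelism, while two threads running different functions stand in for MIMD-style thread-level parallelism.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# SIMD-style data-level parallelism: one "instruction" (add) over a whole array.
a = np.arange(8)
b = np.arange(8)
simd_result = a + b            # element-wise add across all lanes at once

# MIMD-style thread-level parallelism: different instruction streams, different data.
def count_evens(xs):
    return sum(1 for x in xs if x % 2 == 0)

def total(xs):
    return sum(xs)

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(count_evens, range(100))   # one independent stream
    f2 = pool.submit(total, range(100))         # a different independent stream
    mimd_results = (f1.result(), f2.result())

print(simd_result, mimd_results)
```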
Slide 7: Back to Basics
Parallel Architecture = Computer Architecture + Communication Architecture
Two classes of multiprocessors with respect to memory:
1. Centralized-memory multiprocessor: fewer than a few dozen processor chips (and fewer than about 100 cores) in 2006; small enough to share a single, centralized memory
2. Physically distributed-memory multiprocessor: a larger number of chips and cores than the above; bandwidth demands force memory to be distributed among the processors
Slide 8: Centralized vs. Distributed Memory
(Diagram contrasting the two organizations, labeled "Centralized Memory" and "Distributed Memory".)
Slide 9: Centralized Memory Multiprocessor
Also called symmetric multiprocessors (SMPs), because the single main memory has a symmetric relationship to all processors
Large caches and a single memory can satisfy the memory demands of a small number of processors
Can scale to a few dozen processors by using a switch instead of a bus, and many memory banks
Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases
Slide 10: Distributed Memory Multiprocessor
Pros:
A cost-effective way to scale memory bandwidth, if most accesses are to local memory
Reduces the latency of local memory accesses
Cons:
Communicating data between processors is more complex
Software must change to take advantage of the increased memory bandwidth
Slide 11: Two Models for Communication and Memory Architecture
1. Communication occurs explicitly, by passing messages among the processors: message-passing multiprocessors
2. Communication occurs implicitly, through a shared address space (via loads and stores): shared-memory multiprocessors. These are either:
UMA (Uniform Memory Access time): shared address space, centralized memory
NUMA (Non-Uniform Memory Access time): shared address space, distributed memory
Note: in the past there was confusion over whether "sharing" means sharing physical memory (symmetric MP) or sharing the address space.
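As a rough illustration of the two models, here is a minimal sketch (not from the slides) that uses Python's multiprocessing module as a stand-in for real multiprocessor hardware: the first pair of processes communicates explicitly through a queue (message passing), the second implicitly through loads and stores to a shared variable (shared memory).

```python
from multiprocessing import Process, Queue, Value

def producer_msg(q):
    q.put(42)                      # explicit communication: send a message

def producer_shm(v):
    with v.get_lock():
        v.value = 42               # implicit communication: an ordinary store

if __name__ == "__main__":
    # Message passing: data moves only through explicit send/receive.
    q = Queue()
    p = Process(target=producer_msg, args=(q,))
    p.start(); print("received:", q.get()); p.join()

    # Shared memory: both processes load/store the same location.
    v = Value("i", 0)
    p = Process(target=producer_shm, args=(v,))
    p.start(); p.join(); print("loaded:", v.value)
```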
Slide 12: Challenges of Parallel Processing
First challenge is Amdahl's Law: what fraction of the program is inherently sequential?
Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential?
a. 10%   b. 5%   c. 1%   d. <1%
Slide 13: Amdahl's Law Answers (worked solution below)
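The extracted slide does not include its derivation; a worked version follows, assuming the standard Amdahl's Law speedup formula. The answer is (d): less than 1% of the program can be sequential.

```latex
% Worked solution, assuming the standard Amdahl's Law speedup formula.
% f_par is the fraction of execution parallelizable over N = 100 processors.
\[
\text{Speedup} = \frac{1}{(1 - f_{\mathrm{par}}) + \dfrac{f_{\mathrm{par}}}{N}}
\qquad\Rightarrow\qquad
80 = \frac{1}{(1 - f_{\mathrm{par}}) + \dfrac{f_{\mathrm{par}}}{100}}
\]
\[
(1 - f_{\mathrm{par}}) + \frac{f_{\mathrm{par}}}{100} = \frac{1}{80} = 0.0125
\quad\Rightarrow\quad
f_{\mathrm{par}} = \frac{0.9875}{0.99} \approx 0.9975
\]
% Sequential fraction = 1 - f_par ~ 0.0025, i.e. about 0.25%: answer (d), less than 1%.
```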
Slide 14: Challenges of Parallel Processing
Second challenge: long latency to remote memory
Suppose a 32-CPU MP running at 2 GHz, with a 200 ns remote memory access time, all local accesses hitting in the memory hierarchy, and a base CPI of 0.5. (A remote access costs 200 ns / 0.5 ns per cycle = 400 clock cycles.)
What is the performance impact if 0.2% of instructions involve a remote access?
a. 1.5X   b. 2.0X   c. 2.5X
Slide 15: CPI Equation
CPI = Base CPI + Remote request rate × Remote request cost
CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
With no communication, the machine is 1.3/0.5 = 2.6 times faster than when 0.2% of instructions involve a remote access.
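A small, hypothetical Python helper (not from the slides) that parameterizes this calculation, so the impact of different remote-access rates and latencies can be explored:

```python
def effective_cpi(base_cpi, remote_rate, remote_ns, clock_ghz):
    """Effective CPI = base CPI + remote request rate * remote cost in cycles."""
    cycle_ns = 1.0 / clock_ghz                 # 2 GHz -> 0.5 ns per cycle
    remote_cycles = remote_ns / cycle_ns       # 200 ns -> 400 cycles
    return base_cpi + remote_rate * remote_cycles

cpi = effective_cpi(base_cpi=0.5, remote_rate=0.002, remote_ns=200, clock_ghz=2.0)
print(cpi, cpi / 0.5)   # 1.3 and 2.6: the all-local machine is ~2.6x faster
```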
Slide 16: Challenges of Parallel Processing
1. Application parallelism: addressed primarily via new algorithms that have better parallel performance
2. Long remote latency impact: reduce the frequency of remote accesses, either by
caching shared data (hardware), or
restructuring the data layout to make more accesses local (software)
We will look at reducing latency via caches.
Slide 17: T1 ("Niagara")
Target: commercial server applications
High thread-level parallelism (TLP): large numbers of parallel client requests
Low instruction-level parallelism (ILP): high cache miss rates, many unpredictable branches, frequent load-load dependencies
Power, cooling, and space are major concerns for data centers
Metric: Performance/Watt/Sq. Ft.
Approach: multicore, fine-grained multithreading, simple pipeline, small L1 caches, shared L2
Slide 18: T1 Architecture
(Block diagram of the T1 chip.) Also ships with 6 or 4 processors.
Slide 19: T1 Pipeline
Single-issue, in-order, 6-deep pipeline: F, S, D, E, M, W
3 clock cycles of delay for loads and branches
Shared units: L1, L2, TLB
Slide 20: T1 Fine-Grained Multithreading
Each core:
– supports four threads
– has its own level-one caches (16 KB instruction and 8 KB data)
– switches to a new thread on each clock cycle
Idle threads (those waiting due to a pipeline delay or cache miss) are bypassed in the scheduling; see the scheduling sketch below
The processor is idle only when all 4 threads are idle or stalled
Both loads and branches incur a 3-cycle delay that can only be hidden by other threads
A single set of floating-point functional units is shared by all 8 cores; floating-point performance was not a focus for T1
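A minimal, hypothetical sketch of the thread-selection policy described above: rotate through the four hardware threads each cycle and skip any that are stalled. This illustrates the idea only; it is not Sun's actual implementation.

```python
def next_thread(ready, last):
    """Pick the next ready thread after `last`; return None if all are stalled."""
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None                      # core idles only when all threads are stalled

# Example: threads 1 and 3 are stalled; selection rotates over threads 0 and 2.
ready = [True, False, True, False]
last = 0
for cycle in range(4):
    chosen = next_thread(ready, last)
    print(f"cycle {cycle}: issue from thread {chosen}")
    if chosen is not None:
        last = chosen
```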
Slide 21: Conclusion
Parallelism challenges: fraction parallelizable, long latency to remote memory
Centralized vs. distributed memory: small MP vs. lower latency and larger bandwidth for larger MP
Message passing vs. shared address space
Uniform vs. non-uniform memory access time
Caching is critical
Next: review of caching (App. C); methods to ensure cache consistency in SMPs