1 Lecture 25: Multi-core Processors

Today's topics:
 • Writing parallel programs
 • SMT
 • Multi-core examples

Reminder:
 • Assignment 9 due Tuesday

2 Shared-Memory vs. Message-Passing

Shared-memory:
 • Well-understood programming model
 • Communication is implicit and the hardware handles protection
 • Hardware-controlled caching

Message-passing:
 • No cache coherence → simpler hardware
 • Explicit communication → easier for the programmer to restructure code
 • Software-controlled caching
 • The sender can initiate the data transfer
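
To make the contrast concrete, here is a minimal sketch (not from the slides) of the shared-memory style, in C with POSIX threads: the producer communicates simply by storing to a location the consumer can already see, so only the synchronization is explicit. In a message-passing version the same value would instead travel through an explicit send/receive pair (see the MPI sketch after the Message Passing Model slide).

    #include <pthread.h>
    #include <stdio.h>

    static int shared_value;            /* visible to every thread */
    static int ready = 0;
    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        pthread_mutex_lock(&m);
        shared_value = 42;              /* communication is just a store */
        ready = 1;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);

        pthread_mutex_lock(&m);
        while (!ready)                  /* wait for the producer's store */
            pthread_cond_wait(&cv, &m);
        printf("consumer saw %d\n", shared_value);
        pthread_mutex_unlock(&m);

        pthread_join(t, NULL);
        return 0;
    }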

3 Ocean Kernel

procedure Solve(A)
begin
  diff = done = 0;
  while (!done) do
    diff = 0;
    for i ← 1 to n do
      for j ← 1 to n do
        temp = A[i,j];
        A[i,j] ← 0.2 * (A[i,j] + neighbors);
        diff += abs(A[i,j] - temp);
      end for
    end for
    if (diff < TOL) then done = 1;
  end while
end procedure

(Figure: the grid is partitioned among processes in contiguous blocks of rows –
Row 1, Row k, Row 2k, Row 3k, …)
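
For reference, the kernel above translates almost line-for-line into C. A minimal sequential sketch (assumptions: "neighbors" means the four adjacent grid points, as in the textbook version, and the grid size and initial condition here are invented for illustration):

    #include <math.h>
    #include <stdio.h>

    #define N   64            /* interior grid size (invented) */
    #define TOL 1e-3f

    static float A[N + 2][N + 2];   /* +2 for fixed boundary rows/columns */

    static void solve(void)
    {
        int done = 0;
        while (!done) {
            float diff = 0.0f;
            for (int i = 1; i <= N; i++) {
                for (int j = 1; j <= N; j++) {
                    float temp = A[i][j];
                    /* average the point with its four neighbors */
                    A[i][j] = 0.2f * (A[i][j] + A[i-1][j] + A[i+1][j]
                                              + A[i][j-1] + A[i][j+1]);
                    diff += fabsf(A[i][j] - temp);
                }
            }
            if (diff < TOL) done = 1;
        }
    }

    int main(void)
    {
        A[1][1] = 1.0f;       /* some nonzero starting condition */
        solve();
        printf("converged; A[1][1] = %f\n", A[1][1]);
        return 0;
    }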

4 Shared Address Space Model

int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
  read(n); read(nprocs);
  A ← G_MALLOC();
  initialize(A);
  CREATE(nprocs, Solve, A);
  WAIT_FOR_END(nprocs);
end main

procedure Solve(A)
  int i, j, pid, done = 0;
  float temp, mydiff = 0;
  int mymin = 1 + (pid * n/nprocs);
  int mymax = mymin + n/nprocs - 1;
  while (!done) do
    mydiff = diff = 0;
    BARRIER(bar1, nprocs);
    for i ← mymin to mymax do
      for j ← 1 to n do
        …
      endfor
    endfor
    LOCK(diff_lock);
    diff += mydiff;
    UNLOCK(diff_lock);
    BARRIER(bar1, nprocs);
    if (diff < TOL) then done = 1;
    BARRIER(bar1, nprocs);
  endwhile
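
The LOCKDEC/BARDEC/CREATE/WAIT_FOR_END macros above come from the textbook's portable macro package; on a modern system they map directly onto pthreads primitives. A minimal sketch of just the synchronization skeleton (the per-thread grid sweep is replaced by a stand-in value, purely for illustration):

    #include <pthread.h>
    #include <stdio.h>

    #define NPROCS 4

    /* LOCKDEC -> pthread_mutex_t, BARDEC -> pthread_barrier_t,
       CREATE  -> pthread_create,  WAIT_FOR_END -> pthread_join. */
    static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t bar1;
    static float diff;

    static void *solve(void *arg)
    {
        int pid = (int)(long)arg;
        float mydiff = 0.1f * (pid + 1);   /* stand-in for the real sweep */

        pthread_mutex_lock(&diff_lock);    /* LOCK(diff_lock)   */
        diff += mydiff;
        pthread_mutex_unlock(&diff_lock);  /* UNLOCK(diff_lock) */

        pthread_barrier_wait(&bar1);       /* BARRIER(bar1, nprocs): every
                                              thread sees the final diff
                                              after this point */
        if (pid == 0)
            printf("global diff = %f\n", diff);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NPROCS];
        pthread_barrier_init(&bar1, NULL, NPROCS);
        for (long i = 0; i < NPROCS; i++)      /* CREATE(nprocs, Solve, A) */
            pthread_create(&t[i], NULL, solve, (void *)i);
        for (int i = 0; i < NPROCS; i++)       /* WAIT_FOR_END(nprocs) */
            pthread_join(t[i], NULL);
        return 0;
    }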

5 Message Passing Model

main()
  read(n); read(nprocs);
  CREATE(nprocs-1, Solve);
  Solve();
  WAIT_FOR_END(nprocs-1);

procedure Solve()
  int i, j, pid, nn = n/nprocs, done = 0;
  float temp, tempdiff, mydiff = 0;
  myA ← malloc(…);
  initialize(myA);
  while (!done) do
    mydiff = 0;
    if (pid != 0)        SEND(&myA[1,0], n, pid-1, ROW);
    if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
    if (pid != 0)        RECEIVE(&myA[0,0], n, pid-1, ROW);
    if (pid != nprocs-1) RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
    for i ← 1 to nn do
      for j ← 1 to n do
        …
      endfor
    endfor
    if (pid != 0)
      SEND(mydiff, 1, 0, DIFF);
      RECEIVE(done, 1, 0, DONE);
    else
      for i ← 1 to nprocs-1 do
        RECEIVE(tempdiff, 1, *, DIFF);
        mydiff += tempdiff;
      endfor
      if (mydiff < TOL) then done = 1;
      for i ← 1 to nprocs-1 do
        SEND(done, 1, i, DONE);
      endfor
    endif
  endwhile
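
A hedged MPI sketch of the ghost-row exchange above (the grid width, tags, and initialization are invented for illustration). MPI_Sendrecv pairs each SEND with its matching RECEIVE so that blocking calls cannot deadlock, and the slide's explicit accumulate-on-pid-0 / broadcast-done step is shown here as a collective reduction, a deliberate simplification of the slide's loop:

    #include <mpi.h>
    #include <stdio.h>

    #define NCOLS 8   /* columns per row; invented for this sketch */

    int main(int argc, char **argv)
    {
        int pid, nprocs;
        float top[NCOLS], bottom[NCOLS];       /* this process's edge rows */
        float ghost_above[NCOLS], ghost_below[NCOLS];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        for (int j = 0; j < NCOLS; j++)
            top[j] = bottom[j] = (float)pid;

        /* Exchange boundary rows with the neighboring processes. */
        if (pid > 0)
            MPI_Sendrecv(top, NCOLS, MPI_FLOAT, pid - 1, 0,
                         ghost_above, NCOLS, MPI_FLOAT, pid - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (pid < nprocs - 1)
            MPI_Sendrecv(bottom, NCOLS, MPI_FLOAT, pid + 1, 0,
                         ghost_below, NCOLS, MPI_FLOAT, pid + 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Global sum of the local diffs, delivered to every process. */
        float mydiff = 0.0f, diff;
        MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (pid == 0)
            printf("global diff = %f\n", diff);
        MPI_Finalize();
        return 0;
    }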

6 Multithreading Within a Processor

Until now, we have executed multiple threads of an application on different
processors – can multiple threads execute concurrently on the same processor?

Why is this desirable?
 • It is inexpensive – one CPU, no external interconnects
 • No remote or coherence misses (but more capacity misses)

Why does this make sense?
 • Most processors can't find enough work – peak IPC is 6, average IPC is 1.5!
 • Threads can share resources → we can increase the number of threads without
   a corresponding linear increase in area

7 How are Resources Shared?

Each box in the figure represents an issue slot for a functional unit; peak
throughput is 4 IPC.

 • A superscalar processor has high under-utilization – it cannot find enough
   work every cycle, especially when there is a cache miss.
 • Fine-grained multithreading can only issue instructions from a single
   thread in a given cycle – it cannot fill every issue slot, but cache misses
   can be tolerated.
 • Simultaneous multithreading can issue instructions from any thread every
   cycle – it has the highest probability of finding work for every issue slot.

(Figure: issue slots across cycles for Superscalar, Fine-Grained
Multithreading, and Simultaneous Multithreading, with each slot colored by
Thread 1–4 or left Idle.)

8 Performance Implications of SMT

 • Single-thread performance is likely to go down (caches, branch predictors,
   registers, etc. are shared) – this effect can be mitigated by trying to
   prioritize one thread.
 • With eight threads in a processor with many resources, SMT yields
   throughput improvements of roughly 2-4x.

9 Pentium 4: Hyper-Threading

 • Two threads – the Linux operating system operates as if it is executing on
   a two-processor system.
 • When there is only one available thread, it behaves like a regular
   single-threaded superscalar processor.
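
One way to see this from software is to ask the OS how many processors it believes it has. A minimal sketch (note: _SC_NPROCESSORS_ONLN is a widely supported glibc/BSD extension, not strict POSIX):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* On a Hyper-Threaded Pentium 4 this reports 2 even though the chip
           has one physical core: the OS schedules onto logical processors. */
        long logical = sysconf(_SC_NPROCESSORS_ONLN);
        printf("logical processors online: %ld\n", logical);
        return 0;
    }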

10 Multi-Programmed Speedup

11 Why Multi-Cores?

 • New constraints: power, temperature, complexity.
 • Because of the above, we can't introduce complex techniques to improve
   single-thread performance.
 • Most of the low-hanging fruit for single-thread performance has been picked.
 • Hence, additional transistors have the biggest impact on throughput if they
   are used to execute multiple threads … this assumes that most users will
   run multi-threaded applications.

12 Efficient Use of Transistors

Transistors can be used for:
 • Cache hierarchies
 • Number of cores
 • Multi-threading within a core (SMT)
 • Should we simplify cores so we have more available transistors?

(Figure: a chip floorplan tiling cores and cache banks.)

13 Design Space Exploration

Axes of the design space: p – scalar pipelines, t – threads, s – superscalar
pipelines.

From Davis et al., PACT 2005.

14 Case Study I: Sun's Niagara

Commercial servers require high thread-level throughput and suffer from cache
misses. Sun's Niagara therefore focuses on:
 • simple cores (low power, low design complexity, can accommodate more cores)
 • fine-grained multi-threading (to tolerate long memory latencies)

15 Niagara Overview

16 SPARC Pipe

 • No branch predictor
 • Low clock speed (1.2 GHz)
 • One FP unit shared by all cores

17 Case Study II: Intel Core Architecture

 • Single-thread execution is still considered important →
    • out-of-order execution and speculation are very much alive
    • initial processors will have few heavy-weight cores
 • To reduce power consumption, the Core architecture (14 pipeline stages) is
   closer to the Pentium M (12 stages) than to the P4 (30 stages).
 • Many transistors are invested in a large branch predictor to reduce wasted
   work (power).
 • Similarly, SMT is not guaranteed for all incarnations of the Core
   architecture (SMT makes a hotspot hotter).

18 Cache Organizations for Multi-cores

 • L1 caches are always private to a core.
 • L2 caches can be private or shared – which is better?

(Figure: two four-core organizations, P1–P4, each core with a private L1; in
one the L2 is private per core, in the other a single L2 is shared by all
cores.)

19 Cache Organizations for Multi-cores

 • L1 caches are always private to a core.
 • L2 caches can be private or shared.

Advantages of a shared L2 cache:
 • efficient dynamic allocation of space to each core
 • data shared by multiple cores is not replicated
 • every block has a fixed "home" – hence, it is easy to find the latest copy

Advantages of a private L2 cache:
 • quick access to the private L2 – good for small working sets
 • a private bus to the private L2 → less contention
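
The fixed-"home" property can be made concrete: in a banked shared L2, the home bank is typically derived from the block address. A minimal sketch (assumptions: 64-byte blocks, a power-of-two bank count, and simple modulo interleaving; real designs may hash address bits instead):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BITS 6                 /* assumed 64-byte cache blocks    */
    #define NUM_BANKS  4                 /* assumed power-of-two bank count */

    /* The block address, taken modulo the number of banks, names the one
       bank that can hold this block – its fixed "home". */
    static unsigned home_bank(uint64_t paddr)
    {
        return (unsigned)((paddr >> BLOCK_BITS) & (NUM_BANKS - 1));
    }

    int main(void)
    {
        uint64_t addr = 0x12345680u;     /* arbitrary physical address */
        printf("address 0x%llx -> bank %u\n",
               (unsigned long long)addr, home_bank(addr));
        return 0;
    }

Because every core computes the same function, a request for a block always goes to the same bank, which is what makes the latest copy easy to find without a search.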
