ECE/CS 552: Multithreading and Multicore © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John.

Slides:



Advertisements
Similar presentations
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Advertisements

Lecture 6: Multicore Systems
Structure of Computer Systems
PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The.
Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.
Multithreading Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.
Single-Chip Multiprocessor Nirmal Andrews. Case for single chip multiprocessors Advances in the field of integrated chip processing. - Gate density (More.
ECE/CS 757: Advanced Computer Architecture II Instructor:Mikko H Lipasti Spring 2015 University of Wisconsin-Madison Lecture notes based on slides created.
CS 7810 Lecture 23 Maximizing CMP Throughput with Mediocre Cores J. Davis, J. Laudon, K. Olukotun Proceedings of PACT-14 September 2005.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Dec 5, 2005 Topic: Intro to Multiprocessors and Thread-Level Parallelism.
SYNAR Systems Networking and Architecture Group CMPT 886: Architecture of Niagara I Processor Dr. Alexandra Fedorova School of Computing Science SFU.
Multithreading and Dataflow Architectures CPSC 321 Andreas Klappenecker.
1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )
Chapter 17 Parallel Processing.
1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)
1 Lecture 26: Case Studies Topics: processor case studies, Flash memory Final exam stats:  Highest 83, median 67  70+: 16 students, 60-69: 20 students.
CS 7810 Lecture 24 The Cell Processor H. Peter Hofstee Proceedings of HPCA-11 February 2005.
1 Lecture 16: Cache Innovations / Case Studies Topics: prefetching, blocking, processor case studies (Section 5.2)
Multi-core Processing The Past and The Future Amir Moghimi, ASIC Course, UT ECE.
Joram Benham April 2,  Introduction  Motivation  Multicore Processors  Overview, CELL  Advantages of CMPs  Throughput, Latency  Challenges.
Chapter 18 Multicore Computers
Hyper-Threading, Chip multiprocessors and both Zoran Jovanovic.
Executing Multiple Threads Prof. Mikko H. Lipasti University of Wisconsin-Madison.
8 – Simultaneous Multithreading. 2 Review from Last Time Limits to ILP (power efficiency, compilers, dependencies …) seem to limit to 3 to 6 issue for.
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar.
Multi-core architectures. Single-core computer Single-core CPU chip.
ECE/CS 552: Parallel Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison.
Multi-Core Architectures
1 Multi-core processors 12/1/09. 2 Multiprocessors inside a single chip It is now possible to implement multiple processors (cores) inside a single chip.
Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? Multithreaded and multicore processors Marco D. Santambrogio:
ECE 757 Review: Parallel Processors © Prof. Mikko Lipasti Lecture notes based in part on slides created by John Shen, Mark Hill, David Wood, Guri Sohi,
Hardware Multithreading. Increasing CPU Performance By increasing clock frequency By increasing Instructions per Clock Minimizing memory access impact.
Niagara: a 32-Way Multithreaded SPARC Processor
© Krste Asanovic, 2015CS252, Spring 2015, Lecture 13 CS252 Graduate Computer Architecture Fall 2015 Lecture 13: Multithreading Krste Asanovic
CASH: REVISITING HARDWARE SHARING IN SINGLE-CHIP PARALLEL PROCESSOR
Executing Multiple Threads Prof. Mikko H. Lipasti University of Wisconsin-Madison.
SIMULTANEOUS MULTITHREADING Ting Liu Liu Ren Hua Zhong.
Multi-core processors. 2 Processor development till 2004 Out-of-order Instruction scheduling Out-of-order Instruction scheduling.
Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? –a thread is defined as a separate process with its own instructions.
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University.
CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara) Dimitris Kaseridis and Lizy K. John The University of Texas at Austin Laboratory for Computer.
Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.
CSC 7080 Graduate Computer Architecture Lec 8 – Multiprocessors & Thread- Level Parallelism (3) – Sun T1 Dr. Khalaf Notes adapted from: David Patterson.
ECE/CS 757: Advanced Computer Architecture II Instructor:Mikko H Lipasti Spring 2009 University of Wisconsin-Madison Lecture notes based on slides created.
Advanced Computer Architecture pg 1 Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8) Henk Corporaal
Computer Structure 2015 – Intel ® Core TM μArch 1 Computer Structure Multi-Threading Lihu Rappoport and Adi Yoaz.
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University.
Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.
Multi-Core CPUs Matt Kuehn. Roadmap ► Intel vs AMD ► Early multi-core processors ► Threads vs Physical Cores ► Multithreading and Multi-core processing.
Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.
COMP 740: Computer Architecture and Implementation
Simultaneous Multithreading
Lynn Choi School of Electrical Engineering
Computer Structure Multi-Threading
ECE/CS 552: Virtual Memory
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
Executing Multiple Threads ECE/CS 752 Fall 2017
Hyperthreading Technology
Computer Architecture: Multithreading (I)
ECE 757 Review: Parallel Processors
Levels of Parallelism within a Single Processor
Hardware Multithreading
ECE/CS 757: Advanced Computer Architecture II
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
Adaptive Single-Chip Multiprocessing
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
Levels of Parallelism within a Single Processor
Hardware Multithreading
Presentation transcript:

ECE/CS 552: Multithreading and Multicore © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith

Overview Multithreading – Fine-grained – Coarse-grained or switch-on-event (SOE) – Simultaneous or SMT Multicore processors Niagara case study 2

Multithreading Basic idea: CPU resources are expensive and should not be left idle 1960’s: Virtual memory and multiprogramming – VM/MP invented to tolerate latency to secondary storage (disk/tape/etc.) – Processor:secondary storage cycle-time ratio: microseconds to tens of milliseconds (1:10000 or more) – OS context switch used to bring in other useful work while waiting for page fault or explicit read/write – Cost of context switch must be much less than I/O latency (easy) 1990’s: Memory wall and multithreading – Processor: non-cache storage cycle-time ratio: nanosecond to fractions of a microsecond (1:500 or worse) – H/W task switch used to bring in other useful work while waiting for cache miss – Cost of context switch must be much less than cache miss latency Very attractive for applications with abundant thread-level parallelism – Commercial multi-user workloads 3

Approaches to Multithreading Fine-grained multithreading – Switch contexts at fixed fine-grain interval (e.g. every cycle) – Need enough thread contexts to cover stalls – Example: Tera MTA, 128 contexts, no data caches Benefits: – Conceptually simple, high throughput, deterministic behavior Drawback: – Very poor single-thread performance 4

Approaches to Multithreading Coarse-grained multithreading – Switch contexts on long-latency events (e.g. cache misses) – Need a handful of contexts (2-4) for most benefit Example: IBM Northstar, 2 contexts Benefits: – Simple, improved throughput (~30%), low cost – Thread priorities mostly avoid single-thread slowdown Drawback: – Nondeterministic, conflicts in shared caches – Not suitable for out-of-order processors 5

Approaches to Multithreading Simultaneous multithreading – Multiple concurrent active threads (no notion of thread switching) – Need a handful of contexts for most benefit (2-8) Example: Intel P4, Core i7, IBM Power 5/6/7, Benefits: – Natural fit for OOO superscalar – Improved throughput – Low incremental cost Drawbacks: – Additional complexity over OOO superscalar – Cache conflicts 6

SMT Microarchitecture [Emer, PACT ’01] 7

Multithreading with Multicore Chip Multiprocessors (CMP) Share nothing in the core: – Implement multiple cores on die – Perhaps share L2, system interconnect (memory and I/O bus) Example: IBM Power4, 2 cores per die, shared L2 Benefits: – Simple replication – Packaging density – Low interprocessor latency – ~2x throughput Drawbacks: – L2 shared – conflicts – Mem bandwidth shared – could become bottleneck 8

Approaches to Multithreading Chip Multiprocessors (CMP) Ubiquitous ProcessorCores/ chip Multi- threaded? Resources shared IBM Power 42NoL2/L3, system interface IBM Power 52Yes (2T)Core, L2/L3, system interface Sun Ultrasparc2NoSystem interface Sun Niagara8Yes (4T)Everything Intel Pentium D2Yes (2T)Core, nothing else Intel Core i74, 6YesL3, DRAM, system interface AMD Opteron2, 4, 6, 12 NoL3, DRAM, system interface 9

Approaches to Multithreading Chip Multithreading (CMT) – Similar to CMP Share something in the core: – Expensive resource, e.g. floating-point unit (FPU) – Also share L2, system interconnect (memory and I/O bus) Example: – Sun Niagara, 8 cores, one FPU – AMD Bulldozer, FPU shared by two adjacent cores Benefit: amortize cost of expensive resource Drawbacks: – Shared resource may become bottleneck – Next generation Niagara does not share FPU 10

Multithreaded Processors Many approaches for executing multiple threads on a single die – Mix-and-match: IBM Power7 8-core CMP x 4-way SMT MT ApproachResources shared between threadsContext Switch Mechanism NoneEverythingExplicit operating system context switch Fine-grainedEverything but register file and control logic/stateSwitch every cycle Coarse-grainedEverything but I-fetch buffers, register file and con trol logic/state Switch on pipeline stall SMTEverything but instruction fetch buffers, return address stack, architected register file, control logic/state, reorder buffer, store queue, etc. All contexts concurrently active; no switching CMTVarious core components (e.g. FPU), secondary cache, system interconnect All contexts concurrently active; no switching CMPSecondary cache, system interconnectAll contexts concurrently active; no switching 11

IBM Power4: Example CMP 12

Niagara Case Study Targeted application: web servers – Memory intensive (many cache misses) – ILP limited by memory behavior – TLP: Lots of available threads (one per client) Design goal: maximize throughput (/watt) Results: – Pack many cores on die (8) – Keep cores simple to fit 8 on a die, share FPU – Use multithreading to cover pipeline stalls – Modest frequency target (1.2 GHz) 13

Niagara Block Diagram [Source: J. Laudon] 8 in-order cores, 4 threads each 4 L2 banks, 4 DDR2 memory controllers 14

Ultrasparc T1 Die Photo [Source: J. Laudon] 15

Niagara Pipeline [Source: J. Laudon] Shallow 6-stage pipeline Fine-grained multithreading 16

T2000 System Power 271W running SpecJBB2000 Processor is only 25% of total DRAM & I/O next, then conversion losses 17

Niagara Summary Example of application-specific system optimization – Exploit application behavior (e.g. TLP, cache misses, low ILP) – Build very efficient solution Downsides – Loss of general-purpose suitability – E.g. poorly suited for software development (parallel make, gcc) – Very poor FP performance (fixed in Niagara 2) 18

Summary Multithreading – Fine-grained, coarse-grained, simultaneous Multicore processors Niagara case study 19