1 Lecture 10: ILP Innovations
Today: ILP innovations and SMT (Section 3.5)

2 Improving Performance
Techniques to increase performance:
- pipelining
  - improves clock speed
  - increases number of in-flight instructions
- hazard/stall elimination
  - branch prediction
  - register renaming
  - efficient caching
  - out-of-order execution with large windows
  - memory disambiguation
  - bypassing
- increased pipeline bandwidth

3 Deep Pipelining
- Increases the number of in-flight instructions
- Decreases the gap (in time) between successive independent instructions
- Increases the gap (in cycles) between dependent instructions
- Depending on the ILP in a program, there is an optimal pipeline depth (see the sketch below)
- Some structures are tough to pipeline; deeper pipelines also increase the cost of bypassing
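
The depth trade-off can be made concrete with a toy model (my sketch, not from the lecture; the logic delay, latch overhead, and stall fraction are illustrative assumptions): cutting the logic into more stages shrinks the clock period, but dependence-induced stalls cost more cycles, so the time per instruction bottoms out at an intermediate depth.

```python
# Toy model of the pipeline-depth trade-off (illustrative parameters only).

def exec_time_per_instr(depth, logic_delay=40.0, latch_overhead=1.0,
                        stall_frac=0.2):
    """Estimated time per instruction at a given pipeline depth.

    logic_delay    -- total un-pipelined logic delay (arbitrary time units)
    latch_overhead -- per-stage latch/clock-skew overhead (same units)
    stall_frac     -- fraction of instructions assumed to stall for a number
                      of cycles proportional to the depth (dependences)
    """
    cycle_time = logic_delay / depth + latch_overhead   # faster clock as depth grows
    cycles_per_instr = 1.0 + stall_frac * depth         # dependence stalls cost more cycles
    return cycle_time * cycles_per_instr

if __name__ == "__main__":
    for d in (5, 10, 20, 40):
        print(f"depth {d:2d}: {exec_time_per_instr(d):.2f} time units per instruction")
```

With these particular numbers the minimum falls between depths of about 10 and 20; raising the stall fraction (i.e., less ILP in the program) pushes the optimum toward shallower pipelines, which is the point of the slide.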

4 Increasing Width
- Difficult to find more than four independent instructions
- Difficult to fetch more than six instructions (else, must predict multiple branches per cycle)
- Increases the number of ports per structure

5 Reducing Stalls in Fetch
- Better branch prediction
- Trace cache

6 Reducing Stalls in Rename/Regfile
- Larger ROB/register file/issue queue
- Virtual physical registers
- Runahead
- Two-level register files

7 Stalls in Issue Queue
- Two-level issue queues
- Value prediction
- Memory dependence prediction
- Load hit prediction

8 Functional Units
- Pipelining
- Clustering

9 Thread-Level Parallelism
Motivation:
- a single thread leaves a processor under-utilized for most of the time
- by doubling processor area, single-thread performance barely improves
Strategies for thread-level parallelism:
- multiple threads share the same large processor (Simultaneous Multi-Threading, SMT): reduces under-utilization, efficient resource allocation
- each thread executes on its own mini processor (Chip Multi-Processing, CMP): simple design, low interference between threads

10 Multithreading Within a Processor
Until now, we have executed multiple threads of an application on different processors – can multiple threads execute concurrently on the same processor?
Why is this desirable?
- inexpensive: one CPU, no external interconnects
- no remote or coherence misses (though more capacity misses)
Why does this make sense?
- most processors can't find enough work: peak IPC is 6, but average IPC is only 1.5!
- threads can share resources
- we can increase the number of threads without a corresponding linear increase in area

11 How are Resources Shared?
[Figure: issue-slot diagrams over a sequence of cycles for a superscalar core, fine-grained multithreading, and simultaneous multithreading; each box is an issue slot for a functional unit, marked Thread 1-4 or Idle. Peak throughput is 4 IPC.]
- A superscalar processor has high under-utilization: there is not enough work every cycle, especially when there is a cache miss.
- Fine-grained multithreading can only issue instructions from a single thread in a given cycle: it cannot find maximum work every cycle, but cache misses can be tolerated by switching threads.
- Simultaneous multithreading can issue instructions from any thread every cycle: it has the highest probability of finding work for every issue slot. (A toy utilization model follows below.)
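
The contrast can be illustrated with a small Python model (my sketch; the 4-wide issue width and the per-cycle counts of ready instructions are invented) that tallies how many of the available issue slots each scheme fills over a few cycles.

```python
# Toy issue-slot model contrasting the three sharing schemes on this slide.

SLOTS = 4                                   # peak throughput: 4 IPC
# ready[t][c] = instructions thread t could issue in cycle c
# (a 0 models a stall, e.g. waiting on a cache miss)
ready = [[2, 0, 0, 3, 0, 1],
         [1, 2, 0, 1, 2, 0],
         [0, 1, 3, 0, 1, 2],
         [2, 1, 1, 2, 0, 1]]
cycles = len(ready[0])

def utilization(issued_per_cycle):
    return 100.0 * sum(issued_per_cycle) / (SLOTS * cycles)

# Superscalar: only one thread exists, so only thread 0's work fills slots.
superscalar = [min(ready[0][c], SLOTS) for c in range(cycles)]

# Fine-grained MT: one thread per cycle, switching past stalled threads.
def fine_grained():
    issued, t = [], 0
    for c in range(cycles):
        for _ in ready:                     # find a thread with ready work
            if ready[t][c] > 0:
                break
            t = (t + 1) % len(ready)
        issued.append(min(ready[t][c], SLOTS))
        t = (t + 1) % len(ready)            # round-robin to the next thread
    return issued

# SMT: any thread may fill any slot in any cycle.
smt = [min(sum(r[c] for r in ready), SLOTS) for c in range(cycles)]

for name, issued in (("superscalar", superscalar),
                     ("fine-grained MT", fine_grained()),
                     ("SMT", smt)):
    print(f"{name:15s} {utilization(issued):5.1f}% of issue slots used")
```

With these made-up numbers the printout shows roughly 25%, 54%, and 96% slot utilization, matching the qualitative picture: multithreading hides stall cycles, and SMT additionally fills the slots that a single thread leaves empty within each cycle.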

12 What Resources are Shared?
- Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
- For correctness, each thread needs its own PC and its own logical registers (and its own mapping from logical to physical registers)
- For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing means better utilization of resources
- Each additional thread costs only a PC, a rename table, and a ROB – cheap! (a sketch of this split follows below)
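
One way to picture the correctness/performance split in the bullets above is as a data layout (a hypothetical sketch; the names, field choices, and sizes are mine, not a description of any real processor): the small per-thread state is replicated, while the expensive structures are shared.

```python
# Hypothetical bookkeeping of per-thread vs shared structures in an SMT core.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:                 # replicated per hardware thread (cheap)
    pc: int = 0
    rename_table: dict = field(default_factory=dict)   # logical -> physical regs
    rob: list = field(default_factory=list)            # per-thread ROB so one thread's
                                                        # stall does not block others' commit

@dataclass
class SharedCore:                    # shared by all threads (expensive, so shared)
    physical_regs: list
    issue_queue: list
    caches: dict
    branch_predictor: object
    threads: list                    # list of ThreadContext

core = SharedCore(physical_regs=[0] * 160, issue_queue=[], caches={},
                  branch_predictor=None,
                  threads=[ThreadContext() for _ in range(4)])
```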

13 Resource Sharing
[Figure: Thread-1 and Thread-2 flow through shared Instr Fetch, Instr Rename, Issue Queue, Register File, and FU stages; after renaming, both threads' instructions occupy one shared physical register space.]
Thread-1 (logical code)     after renaming
  R1 ← R1 + R2                P73 ← P1 + P2
  R3 ← R1 + R4                P74 ← P73 + P4
  R5 ← R1 + R3                P75 ← P73 + P74
Thread-2 (logical code)     after renaming
  R2 ← R1 + R2                P76 ← P33 + P34
  R5 ← R1 + R2                P77 ← P33 + P76
  R3 ← R5 + R3                P78 ← P77 + P35
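
A minimal sketch of the renaming step in this example, assuming per-thread rename tables, a shared free list of physical registers starting at P73, and initial logical-to-physical mappings chosen to match the slide (e.g. Thread-2's R1 maps to P33):

```python
# SMT register renaming sketch: per-thread rename tables, shared physical regs.

free_list = iter(range(73, 200))      # next free physical registers: P73, P74, ...

rename_tables = {                      # one rename table per thread (assumed initial state)
    1: {"R1": "P1",  "R2": "P2",  "R4": "P4"},
    2: {"R1": "P33", "R2": "P34", "R3": "P35"},
}

def rename(thread, dst, src1, src2):
    """Rename one instruction 'dst <- src1 + src2' for the given thread."""
    table = rename_tables[thread]
    ps1, ps2 = table[src1], table[src2]   # read current mappings for the sources
    pd = f"P{next(free_list)}"            # allocate a new physical reg for the dest
    table[dst] = pd                       # later readers of dst see the new mapping
    return f"{pd} <- {ps1} + {ps2}"

# Thread-1 and Thread-2 code from the slide, renamed in program order:
prog = [(1, "R1", "R1", "R2"), (1, "R3", "R1", "R4"), (1, "R5", "R1", "R3"),
        (2, "R2", "R1", "R2"), (2, "R5", "R1", "R2"), (2, "R3", "R5", "R3")]
for t, d, s1, s2 in prog:
    print(f"T{t}: {rename(t, d, s1, s2)}")
```

Running it reproduces the renamed code on the slide: Thread-1's instructions land in P73-P75 and Thread-2's in P76-P78, all drawn from the same shared physical register file, with each thread's own table keeping the two sets of logical registers from interfering.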

14 Performance Implications of SMT
- Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by trying to prioritize one thread
- While fetching instructions, thread priority can dramatically influence total throughput; a widely accepted heuristic (ICOUNT) is to fetch such that each thread has an equal share of processor resources (a sketch follows below)
- With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4
- The Alpha 21464 and the Intel Pentium 4 are examples of SMT processors
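
A minimal sketch of the ICOUNT idea (my illustration, not the code from the SMT papers): each cycle, fetch from the thread that currently has the fewest instructions in the front end and issue queue, which keeps any one thread from hogging the shared resources.

```python
# ICOUNT fetch heuristic sketch: favor the thread with the fewest
# in-flight instructions in the decode/rename/issue-queue stages.

def icount_pick(in_flight_counts):
    """Return the id of the thread with the fewest in-flight instructions."""
    return min(in_flight_counts, key=in_flight_counts.get)

# Example (made-up counts): thread 2 is stalled and clogging the queue,
# so a thread with fewer queued instructions gets fetch priority.
counts = {0: 12, 1: 5, 2: 27, 3: 9}
print("fetch from thread", icount_pick(counts))   # -> fetch from thread 1
```

In the example, thread 2 has filled the front end while waiting (perhaps on a cache miss), so the fetch stage steers bandwidth to the threads that are making progress, which is what equalizes each thread's share of resources.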
