Korey Sewell*, Trevor Mudge*, Steven K. Reinhardt*† (*Advanced Computer Architecture Laboratory (ACAL), University of Michigan, Ann Arbor; †Advanced Micro Devices (AMD))

Presentation transcript:

eXtreme Virtual Pipelining (XVP): Moving Towards Scalable Multithreaded Processors
Korey Sewell*, Trevor Mudge*, Steven K. Reinhardt*†
*Advanced Computer Architecture Laboratory (ACAL), University of Michigan, Ann Arbor
†Advanced Micro Devices (AMD)
ASPLOS – WACI ’09

2 The Comp. Arch. Research Train
– Uniprocessor-Place (1P, 1T)
– Multithreading-Ville (1P, ~2-4T)
– Multicore-Estates (2-4P, ~2-4T)
– Many-Core Mansion (~32-64P, ~2-4T)
P = Processor(s), T = Thread(s)
Did we miss a stop on the way? What about “Many”-Threading?!

3 Why “Many-Threading”?
It CHANGES the way we think about architecture…
– Moving from 2-4 threads per core to 16, 32 or even 64 threads per core
– Threads aren’t just parallel… they’re adjacent!
What would you create if you had “threads to throw away”?
– Hmmmmmmm…

4 WACI, “Many”-Threading Possibilities
(1) “Coherence-Free” Synchronization & Communication
– Why suffer from non-deterministic memory latency when so many threads are adjacent (on the same core)?
[Figure: a memory system and two CPUs, each with threads T0, T1, T2, … TN]

5 WACI, “Many”-Threading Possibilities
(2) Extremely Speculative Multithreading
– Use extra threads during speculative events (e.g. branch misprediction, cache miss)
– Fast-forward execution by traversing the speculation tree and then switching threads
[Figure: speculation tree starting at a branch misprediction, with edges labeled T and F]

6 WACI, “Many”-Threading Possibilities
(3) Super Virtual Machines
– Security: every application given its own VM?
(4) Many-Many Systems!
– Many threads, many cores
– 1000-thread system ≈ 64 cores × 16 threads per core (1,024 threads)
(5) Redundant Multithreading
(6) This list keeps going… and going… and going!

7 How do we get to Many-Threading?
A design that avoids non-scalable, conventional multithreading pitfalls such as…
– Replication of per-thread resources
– Extensive size increases of shared resources
– Complex resource distribution methods amongst threads

8 WACI Solution: eXtreme Virtual Pipelining (XVP)
– Traditionally, simultaneously executing threads have a shared pipeline view
– XVP: provide each thread the illusion that it has all the processor resources to itself
[Figure: one shared pipeline (DRF LSQ IQ ROB EXE RF) serving T0 - TN, contrasted with a separate pipeline view for each of T0, T1, … TN]

9 WACI Solution: eXtreme Virtual Pipelining (XVP)
Pipeline Virtualization: resource entries are mapped into each thread’s address space
[Figure: CPU and memory; entries 0…7 of Resource “X” map to T0 Base + 0…7, T1 Base + 0…7, … TN Base + 0…7]
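
The mapping on this slide can be read as base-plus-offset addressing. The following is a minimal sketch of that idea, not the authors' implementation: it assumes 8 entries per resource per thread (the 0…7 in the figure), and the entry size, the region start address, and the names `VirtualResource` and `make_thread_regions` are hypothetical, since the slides specify none of them.

```python
from dataclasses import dataclass

ENTRIES_PER_RESOURCE = 8   # the figure shows entries 0..7
ENTRY_BYTES = 8            # assumed entry size; not given in the slides


@dataclass(frozen=True)
class VirtualResource:
    """One pipeline resource ("Resource X") backed by memory for one thread."""
    thread_id: int
    base: int  # hypothetical per-thread base address (T0 Base, T1 Base, ...)

    def entry_address(self, index: int) -> int:
        """Return the memory address backing entry `index` of this resource."""
        if not 0 <= index < ENTRIES_PER_RESOURCE:
            raise IndexError(f"entry {index} out of range")
        return self.base + index * ENTRY_BYTES


def make_thread_regions(num_threads: int, region_start: int = 0x1000):
    """Lay out one non-overlapping region per thread, back to back."""
    region_bytes = ENTRIES_PER_RESOURCE * ENTRY_BYTES
    return [
        VirtualResource(t, region_start + t * region_bytes)
        for t in range(num_threads)
    ]
```

For example, `make_thread_regions(4)[1].entry_address(7)` gives the address of the last entry in thread 1's region. Because each thread's entries live at its own base, adding a thread adds a memory region rather than another copy of the hardware structure.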

10 WACI Solution: eXtreme Virtual Pipelining (XVP)
Virtualize all stalling processor resources to memory
– Fetch Buffer, Instruction Queue, Load/Store Queue, Register File, Reorder Buffer
XVP extends the notion of a hardware context to include pipeline resources
– Add a C-Cache (Context Cache) to avoid D-Cache thrashing and potentially reduce memory footprint in workloads
Each stallable resource is matched with its own “on-demand” Fill-Spill-Unit (FSU)
– Ex: spill the IQ on a dependent load miss; fill when the miss resolves
– The FSU allows resources to dynamically partition themselves for arbitrary workloads
[Figure: pipeline (DRF LSQ IQ ROB EXE RF) with an FSU and a C-Cache]
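
The spill-on-miss, fill-on-resolve example can be illustrated with a toy model. This is a sketch under stated assumptions, not the XVP design: the class `FillSpillIQ` and its methods are hypothetical names, the backing store is a plain per-thread queue standing in for the C-Cache or memory, and a spill moves every entry of the stalled thread rather than only the instructions that depend on the missing load.

```python
from collections import defaultdict, deque


class FillSpillIQ:
    """Toy instruction queue (IQ) with an on-demand fill/spill unit."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = []                  # (thread_id, instr) pairs held in the IQ
        self.spilled = defaultdict(deque)  # thread_id -> instrs parked in backing store

    def free_slots(self) -> int:
        return self.capacity - len(self.entries)

    def insert(self, thread_id, instr) -> bool:
        """Place an instruction in the IQ; return False if the IQ is full."""
        if self.free_slots() == 0:
            return False
        self.entries.append((thread_id, instr))
        return True

    def spill(self, thread_id) -> int:
        """On a load miss: move the stalled thread's entries out of the IQ."""
        kept, moved = [], 0
        for tid, instr in self.entries:
            if tid == thread_id:
                self.spilled[tid].append(instr)
                moved += 1
            else:
                kept.append((tid, instr))
        self.entries = kept
        return moved

    def fill(self, thread_id) -> int:
        """When the miss resolves: restore entries in order while slots are free."""
        restored = 0
        parked = self.spilled[thread_id]
        while parked and self.free_slots() > 0:
            self.entries.append((thread_id, parked.popleft()))
            restored += 1
        return restored
```

In this model the slots freed by a spill become usable by other threads immediately, which is one way to picture the "dynamic partitioning" the slide describes: the share of the queue a thread holds follows what it can actually use at the moment.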

11 WACI Conclusion: eXtreme Virtual Pipelining (XVP)
A high number of threads per core opens up interesting multithreading research angles
XVP’s pipeline virtualization moves toward scalable many-threads per core
– Each thread has the illusion that it has its own pipeline
XVP can also benefit single-thread processors…
– Because XVP’s virtualization provides more resources than traditionally available

12 Thanks for Listening!