Mechanisms: Run-time and Compile-time

Outline
Agents of Evolution
Run-time
–Branch prediction
–Trace Cache
–SMT, SSMT
–The memory problem (L2 misses)
Compile-time
–Block structured ISA
–Predication
–Fast track, slow track
–Wish branches
–Braids
–Multiple levels of cache

Evolution of the Process
Pipelining
Branch Prediction
Speculative execution (and recovery)
Special execution units (FP, Graphics, MMX)
Out-of-order execution (and in-order retirement)
Pipelining with hiccups (R/S, ROB)
Wide issue
Trace cache
SMT, SSMT
Soft errors
Handling L2 misses

Run-time Mechanisms
Branch prediction
Trace Cache
SMT, SSMT
The memory problem (L2 misses)

The Conditional Branch Problem
Why? Because there are pipelines. (Note: the HEP sidestepped the problem by interleaving instructions from different threads.)
Mechanisms to solve it
–Delayed branch
–Take both paths
–Eliminate branches (predication, compound predicates)
Branch Prediction
–Static: always taken, BTFN (backward taken, forward not taken)
–Early dynamic: last time (LT), 2-bit counter
–Two-level → gshare, agree, hybrid
–Indirect jumps
–Perceptron, TAGE
–Expose the branch prediction hardware to the software
–Wish branches
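To make the dynamic schemes above concrete, here is a minimal sketch in C of a gshare-style predictor built from 2-bit saturating counters; the table size, history length, and function names are illustrative assumptions, not taken from the slides.

#include <stdint.h>
#include <stdbool.h>

#define PHT_BITS 12                 /* 4K-entry pattern history table (illustrative size) */
#define PHT_SIZE (1u << PHT_BITS)

static uint8_t  pht[PHT_SIZE];      /* 2-bit saturating counters, 0..3; >= 2 means "predict taken" */
static uint32_t ghr;                /* global branch history register (gshare) */

/* Predict: XOR the branch PC with the global history to index the counter table. */
bool predict(uint32_t pc)
{
    uint32_t idx = (pc ^ ghr) & (PHT_SIZE - 1);
    return pht[idx] >= 2;
}

/* Update: saturate the counter toward the actual outcome, then shift the outcome into the history. */
void update(uint32_t pc, bool taken)
{
    uint32_t idx = (pc ^ ghr) & (PHT_SIZE - 1);
    if (taken  && pht[idx] < 3) pht[idx]++;
    if (!taken && pht[idx] > 0) pht[idx]--;
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
}

Dropping the XOR with the history register reduces this to the early per-branch 2-bit counter scheme; the agree, hybrid, perceptron, and TAGE predictors on the list differ mainly in how the index is formed and how multiple tables are combined.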

Trace Cache Issues
Storage partitioning (and set associativity)
Path Associativity
Partial Matching
Inactive Issue
Branch Promotion
Trace Packing
Dual Path Segments
Block collection (retire vs. issue)
Fill unit latency
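To ground the list above, here is a minimal sketch in C of a trace cache line and its hit test; the field names and the three-branch limit are illustrative assumptions, not from the slides.

#include <stdint.h>
#include <stdbool.h>

#define MAX_BRANCHES 3              /* illustrative limit: up to 3 conditional branches per trace */

/* One trace cache line: a snapshot of several basic blocks along one predicted path. */
typedef struct {
    bool     valid;
    uint32_t start_pc;              /* fetch address that begins the trace */
    int      num_branches;          /* conditional branches embedded in the trace */
    uint8_t  branch_dirs;           /* bit i = direction of branch i when the trace was built */
    uint32_t insts[16];             /* the packed instructions */
    int      num_insts;
} trace_line;

/* Simplest hit test: the fetch PC and every predicted direction must match the stored path. */
bool trace_hit(const trace_line *line, uint32_t fetch_pc, uint8_t predicted_dirs)
{
    if (!line->valid || line->start_pc != fetch_pc)
        return false;
    uint8_t mask = (uint8_t)((1u << line->num_branches) - 1);
    return (line->branch_dirs & mask) == (predicted_dirs & mask);
}

Most of the issues on the slide are questions about this structure: how many lines to keep per start_pc (path associativity), whether to use the leading blocks of a line whose later directions disagree (partial matching), and when the fill unit builds lines (block collection at retire vs. issue).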

SMT and SSMT
SMT
–Invented by Mario Nemirovsky, called DMS (HICSS-94)
–Multiple threads sharing a common back end
SSMT (aka "helper threads")
–Proposed in ISCA 99 (Rob Chappell et al.)
–Simultaneously invented by M. Dubois (USC tech report)
–For when the application has only one thread
–Subordinate threads work on chip infrastructure
–Subordinate threads run in microcode or at the ISA level
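SSMT's subordinate threads are spawned by the microarchitecture and work on its behalf (warming predictors, prefetching), but the helper-thread idea is easiest to see in purely software form. The sketch below in C is such a software analogue, not the SSMT mechanism itself: an ordinary POSIX thread that runs ahead of the main computation prefetching a linked list; the list layout and prefetch hints are illustrative assumptions.

#include <pthread.h>
#include <stddef.h>

typedef struct node { struct node *next; long payload; } node;

/* Subordinate ("helper") thread: walk the same list the primary thread is about to
 * walk and issue prefetches, so the primary thread finds the nodes in the cache. */
static void *helper(void *arg)
{
    for (node *p = (node *)arg; p != NULL; p = p->next)
        __builtin_prefetch(p->next, 0, 1);   /* read prefetch, low temporal locality */
    return NULL;
}

long sum_list(node *head)
{
    pthread_t t;
    pthread_create(&t, NULL, helper, head);  /* launch the subordinate thread */
    long sum = 0;
    for (node *p = head; p != NULL; p = p->next)
        sum += p->payload;                   /* the "real" work of the primary thread
                                              * (in practice it would do more work per node) */
    pthread_join(t, NULL);
    return sum;
}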

The memory problem
Simply: the imbalance between on-chip and off-chip access times (assuming bandwidth is not a problem)
What to do about it
–MLP-aware replacement policy
–DIP (dynamic insertion policy) replacement
–V-Way cache
–Runahead execution
–Address, value prediction
–3D stacking
–Two-level register file
–Denser encoding of instructions and data
–Better organization of code and data
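A back-of-the-envelope calculation shows why an L2 miss dominates everything else on this slide. A minimal sketch in C; the base CPI, miss rate, and 200-cycle penalty are illustrative assumptions, not numbers from the talk.

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: a wide core with base CPI 0.5,
     * 10 L2 misses per 1000 instructions, 200-cycle memory latency. */
    double base_cpi     = 0.5;
    double l2_mpki      = 10.0;     /* L2 misses per kilo-instruction */
    double miss_penalty = 200.0;    /* cycles to reach off-chip memory */

    double miss_cpi   = (l2_mpki / 1000.0) * miss_penalty;   /* stall cycles per instruction */
    double actual_cpi = base_cpi + miss_cpi;

    printf("memory stall CPI = %.2f, effective CPI = %.2f (%.1fx slowdown)\n",
           miss_cpi, actual_cpi, actual_cpi / base_cpi);
    /* Prints: memory stall CPI = 2.00, effective CPI = 2.50 (5.0x slowdown) */
    return 0;
}

The model also assumes the misses are fully serialized; exposing memory-level parallelism so that misses overlap, as the MLP-aware and runahead entries above do, is precisely what shrinks the stall term.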

Compile-time Mechanisms
Predication (and wish branches)
Block structured ISA
Fast track, slow track
Braids
Multiple levels of cache

Wish Branches (predicated execution taken to the next level)
Compile-time and run-time cooperation
At compile time:
–Does predication even make sense?
–If no, generate a regular branch and rely on branch prediction
–If yes, mark the branch as a wish branch and defer the decision to run time
At run time:
–Prediction accuracy low: predicate
–Prediction accuracy high: use the branch predictor
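The two decision points can be written out explicitly. A minimal sketch in C; the heuristic, threshold, and names are illustrative assumptions layered on the compile-time/run-time split the slide describes.

#include <stdbool.h>

typedef enum { GEN_NORMAL_BRANCH, GEN_WISH_BRANCH } codegen_choice;
typedef enum { EXEC_PREDICATED, EXEC_PREDICTED }    exec_choice;

/* Compile time: decide whether predication is even sensible for this hammock.
 * Illustrative heuristic: both paths must be short enough to if-convert. */
codegen_choice compile_time_choice(int then_insts, int else_insts, int max_predicable)
{
    if (then_insts > max_predicable || else_insts > max_predicable)
        return GEN_NORMAL_BRANCH;      /* no: regular branch, rely on the predictor */
    return GEN_WISH_BRANCH;            /* yes: mark it and defer the decision to run time */
}

/* Run time: the hardware consults a confidence estimate for this branch.
 * Low confidence -> execute both paths predicated; high confidence -> just predict. */
exec_choice run_time_choice(double predicted_accuracy, double threshold)
{
    return (predicted_accuracy < threshold) ? EXEC_PREDICATED : EXEC_PREDICTED;
}

The point of deferring to run time is that the same static branch can be predicated in phases where it is hard to predict and handled by the branch predictor in phases where it is easy.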

Block-structured ISA
References:
–Stephen Melvin dissertation (Berkeley, 1991)
–Eric Hao dissertation (Michigan, 1997)
–ISC paper (Melvin and Patt, 1989)
–ISCA paper (Melvin and Patt, 1991)
–MICRO-29 paper (Hao, Chang, Evers, Patt, 1996)
The Atomic Unit of Processing
Enlarged Blocks
Savings on Register Pressure
Faulting Branches, Trapping Branches
Serial Execution on Exceptions
Comparison to Superblocks, Hyperblocks

Compile-time Mechanisms
Predication (and wish branches)
Block structured ISA
Fast track, slow track
Braids (Francis Tseng, ISCA 2008)
Multiple levels of cache