Diverge-Merge Processor (DMP)
Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt
HPS Research Group, University of Texas at Austin
*Microsoft Research

Presentation transcript:

2 Outline
- Predicated Execution
- Diverge-Merge Processor (DMP)
- Implementation of DMP
- Experimental Evaluation
- Conclusion

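3 Predicated Execution
Convert control-flow dependence to data dependence.

Source code:
    if (cond) { b = 0; } else { b = 1; }

Normal branch code:
    p1 = (cond)
    branch p1, TARGET
    mov b, 1
    jmp JOIN
    TARGET: mov b, 0

Predicated code:
    p1 = (cond)
    (!p1) mov b, 1
    (p1) mov b, 0

[Figure: control-flow graphs for both versions. The branch version has blocks A, B, C, D with taken (T) and not-taken (N) edges leaving A; the predicated version is the straight-line sequence A, B, C.]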
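The conversion on this slide can be mimicked in software. The following is a minimal sketch, not code from the presentation: the function names are made up for illustration, and Python conditional expressions stand in for predicated moves. It shows that the guarded sequence produces the same value of b as the branching version without any control-flow decision between the two moves.

```python
def branch_version(cond: bool) -> int:
    """Normal branch code: only one of the two moves executes."""
    if cond:
        b = 0  # TARGET: mov b, 0
    else:
        b = 1  # mov b, 1
    return b


def predicated_version(cond: bool) -> int:
    """Predicated code: both moves are issued, each guarded by a predicate."""
    p1 = cond            # p1 = (cond)
    b = None             # value of b before the hammock (unknown here)
    b = 1 if not p1 else b   # (!p1) mov b, 1
    b = 0 if p1 else b       # (p1)  mov b, 0
    return b


if __name__ == "__main__":
    for cond in (True, False):
        print(cond, branch_version(cond), predicated_version(cond))
```

Each guarded move depends on p1 as a data input, which is the sense in which the control dependence has become a data dependence.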

4 Benefit of Predicated Execution
- Predicated execution can be high-performance and energy-efficient.
[Figure: pipeline diagrams with stages Fetch, Decode, Rename, Schedule, Register Read, Execute, showing blocks A through F flowing through the pipeline under predicated execution and under branch prediction. Labels in the figure: "Pipeline flush!!" and "nop".]

5 Limitations/Problems of Predication
- ISA: Predicate registers and predicated instructions are required.
    Dynamic-hammock predication [Klauser '98] can solve this problem, but it is only applicable to simple hammocks.
- Adaptivity: Static predication is not adaptive to run-time branch behavior.
    Branch behavior changes based on input set, phase, and control-flow path.
    Wish branches [Kim '05]
- Complex CFG: A large subset of control-flow graphs is not converted to predicated code.
    Function calls, loops, many instructions inside a region, and complex CFGs
    Hyperblock [Mahlke '92] cannot adapt to frequently executed paths dynamically.

7 Diverge-Merge Processor (DMP)
- DMP can dynamically predicate complex branches (in addition to simple hammocks).
- The compiler identifies:
    - Diverge branches
    - Control-flow merge (CFM) points
- The microarchitecture decides when and what to predicate dynamically.

8 Dynamic Predication
Klauser et al. [PACT'98]: Dynamic-hammock predication

Code (CFG blocks A, B, C, H with taken (T) and not-taken (N) edges):
    p1 = (cond)
    branch p1, TARGET
    mov R1, 1
    jmp JOIN
    TARGET: mov R1, 0
    JOIN: add R5, R1, 1

When the branch is low-confidence, both sides are renamed and a select-µop (the counterpart of a φ-node in SSA) merges the two values:
    (mov R1, 1)   PR10 = 1
    (mov R1, 0)   PR11 = 0
    select-µop:   PR12 = (cond) ? PR11 : PR10
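The effect of the select-µop on this slide can be traced with a small sketch. This is an illustrative model only: the function name and the dictionary that stands in for the physical register file are assumptions, while the register names PR10, PR11, and PR12 follow the slide.

```python
def dynamic_hammock(cond: bool) -> dict:
    """Execute both sides of the hammock, then merge with a select-µop."""
    regs = {}
    regs["PR10"] = 1  # (mov R1, 1) from the fall-through side
    regs["PR11"] = 0  # (mov R1, 0) from the taken side (TARGET)

    # select-µop: PR12 = (cond) ? PR11 : PR10
    regs["PR12"] = regs["PR11"] if cond else regs["PR10"]

    # JOIN: add R5, R1, 1 now reads R1 through PR12
    regs["R5"] = regs["PR12"] + 1
    return regs


if __name__ == "__main__":
    print(dynamic_hammock(True)["R5"])
    print(dynamic_hammock(False)["R5"])
```

Both moves write distinct physical registers, so neither result has to be squashed; the instruction at JOIN simply waits for the select-µop to pick the right one once the condition is known.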

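9 Diverge-Merge Processor
[Figure: control-flow graph with blocks A through H. Legend: frequently executed path, not frequently executed path. A is labeled as the diverge branch and H as the CFM point; the figure is annotated "Insert select-µops".]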

10 Diverge-Merge Processor
[Figure: the same control-flow graph (blocks A through H). Legend: diverge branch, executed block, CFM point, frequently executed path, not frequently executed path.]

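11 Control-Flow Graphs
[Table: CFG types (simple hammock, nested hammock, frequently-hammock, loop, non-merging) versus techniques (DMP, dynamic-hammock predication, software predication, wish branches, dual-path execution). Cell entries are not recoverable.]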

12 Dual-path Execution vs. DMP
[Figure: a low-confidence branch in a CFG with blocks A through F. Dual-path: path 1 = C D E F, path 2 = B D E F. DMP: path 1 = C, path 2 = B, followed by D E F after the CFM point.]

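13 Control-Flow Graphs
[Table: CFG types (simple hammock, nested hammock, frequently-hammock, loop, non-merging) versus techniques (DMP, dynamic-hammock predication, software predication, wish branches, dual-path execution). One entry is marked "sometimes"; the remaining cell entries are not recoverable.]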

14 Distribution of Mispredicted Branches
- 66% of mispredicted branches can be dynamically predicated in DMP.

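17 Fetch Mechanism
[Figure: control-flow graph with blocks A through H and the predicted path highlighted. A is labeled as the diverge branch and H as the CFM point. Annotations: "Low Confidence" on the diverge branch and "Round-robin fetch".]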

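18 Dynamic Predication
Forks RAT, RAS, and GHR.

Initial register alias table (columns: Arch., Phy., M):
    R1 -> PR11
    R2 -> PR12
    R3 -> PR13

Instructions (CFG blocks A, B, C, E, H) before and after renaming:
    branch r0, C          ->  branch pr10, C          p1 = pr10
    add r1 <- r3, #1      ->  add pr21 <- pr13, #1    (p1)
    add r1 <- r2, #-1     ->  add pr31 <- pr12, #-1   (!p1)
    select-µop                pr41 = p1 ? pr21 : pr31
    add r4 <- r1, r3      ->  add pr24 <- pr41, pr13

[Figure: two copies of the register alias table, RAT1 and RAT2, one per path. R1 maps to PR21 on one path and to PR31 on the other; after the select-µop, R1 maps to PR41.]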
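The renaming steps on this slide can be sketched as follows. This is a simplified model under stated assumptions, not the hardware described in the paper: the function name, the tuple encoding of instructions, and the allocator are hypothetical; immediates are omitted; every architectural register is assumed to have an initial mapping; and physical register numbers come from a plain counter, so they will not match the numbering on the slide.

```python
from itertools import count


def dynamic_predicate(rat, taken_path, fallthrough_path, alloc):
    """Rename both paths on forked RAT copies, then emit select-µops.

    rat:   dict mapping architectural register -> physical register
    paths: lists of (dest_reg, [source_regs]) in program order
    alloc: callable returning a fresh physical register name
    """
    def rename(path, rat_copy, guard):
        uops = []
        for dest, srcs in path:
            phys_srcs = [rat_copy[s] for s in srcs]  # read current mappings
            new_phys = alloc()
            rat_copy[dest] = new_phys                # update this path's RAT
            uops.append((new_phys, phys_srcs, guard))
        return uops

    rat1 = dict(rat)  # forked copy for the taken path (guard p1)
    rat2 = dict(rat)  # forked copy for the fall-through path (guard !p1)
    uops = rename(taken_path, rat1, "p1") + rename(fallthrough_path, rat2, "!p1")

    # At the CFM point, merge every register whose mapping differs.
    merged = dict(rat)
    selects = []
    for reg in sorted(rat):
        if rat1[reg] != rat2[reg]:
            new_phys = alloc()
            selects.append((new_phys, "p1", rat1[reg], rat2[reg]))
            merged[reg] = new_phys
    return uops, selects, merged


if __name__ == "__main__":
    ids = count(20)
    uops, selects, merged = dynamic_predicate(
        {"r1": "pr11", "r2": "pr12", "r3": "pr13"},
        taken_path=[("r1", ["r3"])],        # add r1 <- r3, #1
        fallthrough_path=[("r1", ["r2"])],  # add r1 <- r2, #-1
        alloc=lambda: f"pr{next(ids)}",
    )
    print(uops, selects, merged)
```

Only r1 is written on both paths, so a single select-µop is generated, and instructions after the CFM point read r1 through the merged mapping.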

19 DMP Support
- ISA Support
    Mark diverge branches/CFM points.
- Compiler Support [CGO'07]
    The compiler identifies diverge branches and the corresponding CFM points.
- Hardware Support
    Confidence estimator
    Fetch mechanisms
    Load/store processing
    Instruction retirement
    Dynamic predication

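20 Hardware Complexity Analysis
[Table: hardware components (ST-LD forwarding, select-µop generation, rename support, front-end check, flush/no flush, predicate registers, confidence estimator) versus techniques (software predication, dual-path, wish branches, multipath, dynamic-hammock predication, DMP). Cell entries are not recoverable.]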

22 Simulation Methodology
- 12 SPEC 2000 INT, 5 SPEC 95 INT benchmarks
    Different input sets for profiling and evaluation
- Alpha ISA execution-driven simulator
- Baseline processor configuration
    64KB perceptron predictor/O-GEHL (paper)
    Minimum 30-cycle branch misprediction penalty
    8-wide, 512-entry instruction window
    2KB, 12-bit history enhanced JRS confidence estimator
- Less aggressive processor (paper)
- Power model using Wattch
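The confidence estimator listed above decides which branches count as low-confidence. The sketch below illustrates the general resetting-counter idea behind JRS-style estimators; it is not the enhanced JRS design used in the evaluation. The class name, the 4-bit counters (which would give 4096 entries in 2KB), the PC-XOR-history index, and the threshold are all assumptions made for illustration.

```python
class ResettingCounterEstimator:
    """Table of saturating counters that reset on a misprediction."""

    def __init__(self, entries=4096, history_bits=12, counter_max=15, threshold=15):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.counter_max = counter_max
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, pc: int, history: int) -> int:
        # Hash the branch address with recent global branch history.
        return (pc ^ (history & self.history_mask)) % self.entries

    def is_high_confidence(self, pc: int, history: int) -> bool:
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc: int, history: int, prediction_correct: bool) -> None:
        i = self._index(pc, history)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.counter_max)
        else:
            self.table[i] = 0  # one misprediction wipes out the confidence


if __name__ == "__main__":
    est = ResettingCounterEstimator(threshold=2)
    pc, hist = 0x4000, 0b101
    print(est.is_high_confidence(pc, hist))
    est.update(pc, hist, True)
    est.update(pc, hist, True)
    print(est.is_high_confidence(pc, hist))
```

In a DMP-like design, a diverge branch whose entry is below the threshold would be a candidate for dynamic predication, while a high-confidence branch would simply follow the branch predictor.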

23 Different CFG types

24 Performance Improvement

25 Energy Consumption

27 Conclusion
- DMP introduces the concept of frequently-hammocks, and it dynamically predicates complex CFGs.
- DMP can overcome the three major limitations of software predication: ISA support, adaptivity, and complex CFGs.
- DMP reduces branch mispredictions energy-efficiently:
    19% performance improvement, 9% less energy
- DMP divides the work between the compiler and the microarchitecture:
    The compiler analyzes the control-flow graphs.
    The microarchitecture decides when and what to predicate dynamically.

Thank You!!

Questions?