Re-examining Instruction Reuse in Pre-execution Approaches
Sonya R. Wolff
Prof. Ronald D. Barnes
June 5, 2011
Processor Stalls
Instruction Dependences
– Compiler Optimization
– Dynamic Scheduling
Branch Instructions
– Branch Predictors
Memory Accesses
– Caches
– Non-Blocking Caches
– Cache Prefetchers
– Pre-Execution
Code Example
What is Pre-Execution?
In-Order Pre-Execution
Pre-Execution Techniques
Out-of-Order Execution
Run Ahead Execution
– Run Ahead [Dundas1997, Mutlu2003]
– Continuous Flow Pipelines [Srinivasan2004, Hilton2009]
– Two-Pass Pipelining [Barnes2006]
– Dual-Core Execution [Zhou2005]
– Multi-Pass Pipelining [Barnes2006]
– Rock's Execute Ahead [Chaudhry2002]
Run Ahead Execution
In-Order and Out-of-Order Pipeline Designs
Two Modes of Operation
– Normal Mode: pipeline functions in the traditional manner
– Run-Ahead Mode: instructions are retired without altering the machine state
Run-Ahead Entry on Very Long Latency Memory Operations
Upon Run-Ahead Exit, Program Counter Set to the Instruction After the Run-Ahead Entry Point
Instruction Results from Run-Ahead Mode Not Reused During Normal Mode Operation
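As a concrete illustration of the two-mode operation above, here is a minimal, self-contained toy model in Python. It is a sketch under stated assumptions, not the simulator used in this work: Inst, MISS_LATENCY, and the placeholder result values (1 plus the sum of sources, 42 for the returned load) are all illustrative inventions, and a real run-ahead core gains by warming caches and predictors during run-ahead, which this toy omits.

    from dataclasses import dataclass

    INVALID = object()          # poison marker for results that depend on the miss
    MISS_LATENCY = 5            # toy memory latency, counted in executed instructions

    @dataclass
    class Inst:
        op: str                 # "load" or "alu"
        dest: str
        srcs: tuple = ()
        misses: bool = False    # True if this load misses all the way to memory

    def value_of(regs, src):
        return regs.get(src, 0)

    def run(program):
        regs, pc, mode = {}, 0, "normal"
        checkpoint = entry_pc = None
        timer = 0
        while pc < len(program):
            inst = program[pc]
            if mode == "normal" and inst.op == "load" and inst.misses:
                checkpoint, entry_pc = dict(regs), pc     # save architectural state
                regs[inst.dest] = INVALID                 # poison the load's result
                mode, timer = "runahead", MISS_LATENCY
                pc += 1
                continue
            # Execute: architecturally in normal mode, speculatively in run-ahead.
            if any(value_of(regs, s) is INVALID for s in inst.srcs):
                regs[inst.dest] = INVALID                 # propagate the poison
            else:
                regs[inst.dest] = 1 + sum(value_of(regs, s) for s in inst.srcs)
            pc += 1
            if mode == "runahead":
                timer -= 1
                if timer == 0:                            # miss data has returned
                    regs = checkpoint                     # discard run-ahead results
                    regs[program[entry_pc].dest] = 42     # the load's value arrives
                    pc = entry_pc + 1                     # resume after the entry point
                    mode = "normal"
        return regs

For example, run([Inst("load", "r1", misses=True), Inst("alu", "r2", ("r1",))]) enters run-ahead at the missing load, poisons r2 during run-ahead, then restarts after the load and recomputes r2 from the real value.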
What is Reuse?
[Figure: Run Ahead vs. Run Ahead with Reuse]
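In run-ahead with reuse, results computed from valid (non-poisoned) inputs during run-ahead mode are saved and consumed when normal mode re-executes the same instructions, instead of being thrown away. Below is a hedged sketch of one way to do this, extending the toy model above; the buffer layout and the value-matching policy are illustrative assumptions, not details taken from the slides.

    reuse_buffer = {}   # pc -> (source values at record time, computed result)

    def record_result(pc, inst, regs, result):
        # Called in run-ahead mode, only when no source was INVALID.
        key = tuple(value_of(regs, s) for s in inst.srcs)
        reuse_buffer[pc] = (key, result)

    def try_reuse(pc, inst, regs):
        # Called in normal mode before executing; returns a result or None.
        entry = reuse_buffer.get(pc)
        if entry is None:
            return None
        key, result = entry
        if key == tuple(value_of(regs, s) for s in inst.srcs):
            return result    # same inputs as during run-ahead: safe to reuse
        return None          # inputs changed, so the saved result is stale

This framing also hints at the talk's central question: an out-of-order core can plausibly overlap much of the re-execution that reuse would skip, while an in-order core cannot, so reuse could matter more for in-order pipelines.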
Instruction Reuse Questions
Previously Shown to be Ineffective for Out-of-Order Processors [Mutlu2005]
– Why is reuse ineffective for out-of-order pipelines?
– Is reuse also ineffective for in-order pipelines?
– If reuse is effective in-order, what causes the behavioral difference?
– How do pipeline variations affect reuse in run-ahead pipelines?
Simulation Setup
Two Processor Models: In-Order and Out-of-Order
– 8-wide Instruction Fetch and Issue
– L1 Cache: 64 KB, 2-cycle latency, 4 loads per cycle
– L2 Cache: 1 MB, 10-cycle latency, 1 load per cycle
– Memory: 500-cycle latency, 4:1 bus frequency ratio
Simulations
– SPEC CPU2006 benchmarks (reference inputs)
– Compiled for x86, 64-bit
– 250 million instructions simulated per benchmark
– 25 million instruction warm-up period
– Regions chosen for statistical relevance [Sherwood2002]
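The same parameters, collected into a single configuration record for reference (a sketch; the field names are ours and do not correspond to any particular simulator's options):

    # Machine parameters from this slide, gathered in one place (illustrative names).
    SIM_CONFIG = {
        "models": ("in-order", "out-of-order"),
        "fetch_width": 8,
        "issue_width": 8,
        "l1": {"size_kb": 64, "latency_cycles": 2, "loads_per_cycle": 4},
        "l2": {"size_mb": 1, "latency_cycles": 10, "loads_per_cycle": 1},
        "memory": {"latency_cycles": 500, "bus_freq_ratio": "4:1"},
        "benchmarks": "SPEC CPU2006 (reference inputs), x86-64",
        "instructions_simulated": 250_000_000,
        "warmup_instructions": 25_000_000,
    }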
Normalized Cycles (Normalized to In-Order)
Run Ahead Entries (Values in Thousands)
In-Order L2 Cache Misses (Values in Thousands)
Normalized Cycles (Normalized to In-Order)
Percentage of Clean Memory Accesses
Summary of Reuse Results
Out-of-Order: Reuse vs. Run Ahead Only
– Average: 1.03x
– Maximum (mcf): 1.12x
– Reduced Set: 1.05x
In-Order: Reuse vs. Run Ahead Only
– Average: 1.09x
– Maximum (lbm): 1.47x
– Reduced Set: 1.14x
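Read these figures as cycle-count ratios (assuming the usual definition of speedup); for example, the 1.09x in-order average means reuse eliminates roughly 8% of execution cycles:

    \[
    \text{speedup} = \frac{C_{\text{run-ahead only}}}{C_{\text{run-ahead + reuse}}},
    \qquad 1 - \frac{1}{1.09} \approx 0.083
    \]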
In-Order
[Figure: Run Ahead vs. Run Ahead with Reuse]
Out-of-Order
[Figure: Run Ahead vs. Run Ahead with Reuse]
Variations and Expectations
Main Memory Latency (1000, 500, and 100 Cycles)
– Reduction in Run-Ahead Benefit at Lower Latencies
– Convergence of In-Order and Out-of-Order
– Increase in Reuse Benefit at Higher Latencies
Fetch and Issue Width (8, 4, 3, and 2)
– Increased Benefit from Reuse at Smaller Issue Widths
– Convergence of In-Order and Out-of-Order
Variations on Memory Latency (Normalized to In-Order for Each Latency)
Variations on Issue Width (Normalized to In-Order for Each Issue Width)
Conclusion
Pre-Execution is a Highly Effective Way to Tolerate Long-Latency Memory Accesses
Reuse of Run-Ahead Results Provides Little to No Speedup for Out-of-Order Pipelines
– Average Speedup: 1.03x
– Maximum Speedup: 1.12x (mcf)
Reuse of Run-Ahead Results Provides Meaningful Speedup for In-Order Pipelines
– Average Speedup: 1.09x
– Maximum Speedup: 1.47x (lbm)
Additional Slides
Run Ahead Entries (Values in Thousands)
Percentage of Clean Memory Accesses
Variations on Memory Latency (Normalized to In-Order at 1000-Cycle Latency)
Variations on Stage Widths (Normalized to Width-2 In-Order)