Michigan Technological University, Houghton MI

Presentation transcript:

Cost Effective Memory Dependence Prediction Using Speculation Levels and Color Sets
Soner Önder
Michigan Technological University, Houghton MI
www.cs.mtu.edu/~soner

Outline
Background
- Memory dependence prediction.
- Pairing based approach.
- Store sets.
Color sets
- Notion of color sets.
- Color set implementation.
- Color set predictor.
- Instruction window modifications.
Experimental evaluation
- Basic policy.
- Aggressive policy.

Memory Dependence Prediction
Assume ST-2, ST-p and LD-s all access the same memory location. If we issue LD-s at the point in time shown in the figure, we get a memory order violation. If we know that load LD-s depends on store ST-p, we can issue the load at the right time.
Seq.         1     2     3     p     p+1     p+2     p+3
Instruction  ST-1  ST-2  ST-3  ST-p  ST-p+1  ST-p+2  LD-s
Ready        No / Yes; LD-s: ST-p

Dynamic Memory Disambiguation
Problem: When the instruction window contains unresolved stores, which loads must be held?
Ideal solution: Wait only for the producer store.
Simple solutions:
- Wait for all unresolved stores (no speculation).
- Issue blindly (blind speculation).
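The three options on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not taken from the slides: the `Store` record, the `may_issue` function, and the policy names are hypothetical, and a store is treated as resolved once its `done` flag is set.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Store:
    seq: int    # program-order sequence number (hypothetical field)
    done: bool  # True once the store has executed (assumed meaning of "resolved")


def may_issue(load_seq: int, window: List[Store], policy: str,
              producer_seq: Optional[int] = None) -> bool:
    """Decide whether a load may issue now under one of the three policies."""
    # Older stores that have not executed yet.
    unresolved = [s for s in window if s.seq < load_seq and not s.done]
    if policy == "wait_for_all":   # no speculation
        return not unresolved
    if policy == "blind":          # blind speculation
        return True
    if policy == "ideal":          # wait only for the producer store
        return all(s.seq != producer_seq for s in unresolved)
    raise ValueError(f"unknown policy: {policy}")
```

With older stores 2 and 3 still unresolved, "wait_for_all" holds the load, "blind" issues it, and "ideal" issues it only if its producer is not one of those two stores.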

Memory dependence prediction (Moshovos et al. 1997-1998)
Earlier work concentrated mainly on predicting precise dependencies among pairs of load/store instructions, with two goals:
- To enable early issuing of loads through memory dependence prediction.
- To streamline communication so that values pass directly from producers to consumers instead of through memory.
The emphasis was on identifying the precise store instruction a load may depend on.

Store-set Memory Dependence Predictor (Chrysos & Emer - 1998)
A store set is the set of all stores a load has been observed to depend on.
- Initially, employ blind speculation for loads.
- Upon a memory order violation, create a store set for the offending load and store.
- The next time the same load is encountered, make the load wait until the store issues.
- A store set may contain multiple stores: chain the stores and make the load dependent on the last store.
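The store-set steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published design: it uses unbounded Python dictionaries where hardware would use fixed-size tables, the merge rule on a violation is a simplification, and the class name, method names, and the `tag` instruction identifier are all hypothetical.

```python
class StoreSetPredictor:
    """Simplified store-set sketch; names and merge rule are assumptions."""

    def __init__(self):
        self.ssit = {}      # PC -> store set ID (SSID)
        self.lfst = {}      # SSID -> tag of the last fetched, not yet issued store
        self.next_ssid = 0

    def on_violation(self, load_pc, store_pc):
        # Put the offending load and store into the same store set.
        ssid = self.ssit.get(load_pc)
        if ssid is None:
            ssid = self.ssit.get(store_pc)
        if ssid is None:
            ssid = self.next_ssid
            self.next_ssid += 1
        self.ssit[load_pc] = ssid
        self.ssit[store_pc] = ssid

    def fetch_store(self, pc, tag):
        """Return the tag of the store this store is chained behind, or None."""
        ssid = self.ssit.get(pc)
        if ssid is None:
            return None
        previous = self.lfst.get(ssid)
        self.lfst[ssid] = tag   # this store is now the last one in its set
        return previous

    def fetch_load(self, pc):
        """Return the tag of the store the load must wait for, or None."""
        ssid = self.ssit.get(pc)
        if ssid is None:
            return None         # no store set yet: speculate blindly
        return self.lfst.get(ssid)

    def issue_store(self, pc, tag):
        # Once the last store of a set issues, loads no longer wait on it.
        ssid = self.ssit.get(pc)
        if ssid is not None and self.lfst.get(ssid) == tag:
            del self.lfst[ssid]
```

Chaining falls out of `fetch_store`: each store in a set waits for the previously fetched store of that set, so a load only needs to wait for the last one.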

Store-set Implementation
Figure labels: PC, SSID (store set ID), LFST (last fetched store table).
- Dependence information is digested to create sets of colliding instructions.
- Each set tells exactly which stores a load should wait for.
- Sufficiently large tables yield the performance of an oracle.

Color Set predictor
Instead of predicting precise dependencies among pairs of loads and stores, or constructing sets of store and load instructions that collided in the past, we assign speculation levels (colors) to the processor and to the load and store instructions. We then predict the speculation level (i.e., the color) at which a load or store can be issued without a collision.
Figure label: Predictor size.

Color Set predictor
Since we only try to predict the speculation level, we expect:
- smaller storage for the predictor,
- better performance at smaller hardware budgets,
- faster implementations,
- power savings, and
- more collisions.

So, it is something like this
Processor colors: 00 01 10 11
Load colors: 00 01 10 11
The rules governing the color changes are called policies. We investigate two policies: a basic policy and an aggressive policy.

Load instruction selection
Figure: the eligible load instructions on the color scale 00 01 10 11, relative to the current processor color.
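A minimal sketch of the selection step, assuming a load is eligible when its color does not exceed the current processor color. The slides show the eligible set only graphically, so the direction of the comparison is an assumption here, and `eligible_loads` and the load names are hypothetical.

```python
def eligible_loads(load_colors, processor_color):
    """Return the loads whose color is at or below the processor color.

    Assumption: eligibility is `load color <= processor color`; the slides
    do not state the comparison in text.
    """
    return [name for name, color in load_colors.items()
            if color <= processor_color]


loads = {"LD-a": 0b00, "LD-b": 0b01, "LD-c": 0b11}
print(eligible_loads(loads, 0b01))
```

Under this assumption, raising the processor color widens the eligible set and lowering it narrows the set.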

Instruction window extensions
Figure: window details. Labels: inhibit color, global color, "<=", "Issue?", instructions entering window.

Collisions
Figure: a load and a store that collide, shown with their colors on the color scale 00 01 10 11, relative to the current processor color.

Color Set Predictor: Basic Policy
1. The basic policy gradually becomes aggressive when port utilization is low.
2. Upon a collision, the load instruction is given a higher color and the store instruction is given a lower color.
3. The processor runs at the smaller of the current processor color and the color of the store instructions.
4. Rules 2 and 3 together run the processor at a lower speculation level than the level at which the prior collision occurred.

Color Set Predictor: Aggressive Policy
1. The aggressive policy switches to the maximum speculation level when port utilization is low.
2. Upon a collision, the load instruction is given a higher color and the store instruction is specifically marked.
3. The processor decrements the current processor color when a colliding store is detected.
4. As a result, the processor runs at the highest speculation level that won’t result in a collision, and at a different color than the one it had during the collision.
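The processor-color rules of the two policies can be compared side by side in a short sketch. It is illustrative only: the function names are hypothetical, colors are modeled as integers 0 to 3, and applying the port-utilization rule before the store rule within one step is an assumption the slides do not settle.

```python
MAX_COLOR = 0b11  # highest speculation level (Level 3)


def basic_next_color(current, store_colors_in_window, low_port_utilization):
    """Basic policy: step up one color on low port utilization, then run at
    the smaller of that color and the colors of the stores in the window."""
    color = min(current + 1, MAX_COLOR) if low_port_utilization else current
    return min([color, *store_colors_in_window])


def aggressive_next_color(current, colliding_store_entered, low_port_utilization):
    """Aggressive policy: jump to the maximum level on low port utilization,
    and decrement when a store marked as colliding is detected."""
    color = MAX_COLOR if low_port_utilization else current
    if colliding_store_entered:
        color = max(color - 1, 0)
    return color
```

The difference shows up under low port utilization: the basic policy climbs one color at a time, while the aggressive policy goes straight to the top and backs off one color per colliding store.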

Color Set Predictor
- Accessed early in the pipeline using the load/store (L/S) PC.
- Updated upon collision or successful speculation.
Figure: the L/S PC indexes the predictor, which returns the L/S color (10 in the example).
Color   Basic policy     Aggressive policy
00      No speculation   No speculation
01      Level 1          Level 1
10      Level 2          Level 2
11      Level 3          Level 3 / Colliding store

Processor’s colorful perspective: Basic policy
- When port utilization is low, the processor moves on to the next color.
- The processor assumes the lowest-ranking store’s color.
Figure: transitions over the colors 00 01 10 11, labeled "Low port utilization" and "Colliding stores".

Processor’s colorful perspective: Aggressive policy
- When a colliding store enters the window, the processor decrements its color.
- When port utilization is low, the processor switches to red.
Figure: transitions over the colors 00 01 10 11, labeled "Low port utilization" and "Colliding stores".

Load instruction color states (both policies)
Figure: state diagram over the colors 00 01 10 11, with transitions labeled "Collision" and "Successful speculation".
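One way to read this state diagram is as a 2-bit saturating counter per predictor entry. The sketch below assumes that a collision raises a load's color (as the policy rules state) and that a successful speculation lowers it; the second direction, the initial color of 00, the PC hashing, and all names are assumptions rather than details from the slides.

```python
class ColorPredictor:
    """PC-indexed table of 2-bit load colors (hypothetical sketch)."""

    def __init__(self, entries=128):
        self.entries = entries
        self.table = [0] * entries   # every entry starts at color 00 (assumed)

    def _index(self, pc):
        # Assumed hashing: drop the low 2 bits, then wrap to the table size.
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        return self.table[self._index(pc)]

    def on_collision(self, pc):
        i = self._index(pc)
        self.table[i] = min(self.table[i] + 1, 0b11)   # saturate at 11

    def on_successful_speculation(self, pc):
        i = self._index(pc)
        self.table[i] = max(self.table[i] - 1, 0b00)   # saturate at 00
```

Because each entry stores only a color, the table needs two bits per entry regardless of how many stores a load has collided with.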

Simulation Framework
Aggressive out-of-order superscalar processor:
- 8 instructions/cycle fetch/dispatch
- 16 instructions/cycle retire width
- 64-entry centralized reservation station
- 8 symmetric functional units
- Multi-block gshare fetch unit
- 2 memory ports (r/w)
- Perfect D-cache
Simulated using cycle-accurate simulators generated automatically from ADL descriptions using the FAST system.

Performance Spec Fp Arithmetic Mean

Performance Spec Fp Harmonic Mean

Performance Spec Int Arithmetic Mean

Performance Spec Int Harmonic Mean

Individual benchmarks 128-Fp

Individual benchmarks 4096-Fp

Individual benchmarks 128-Int

Individual benchmarks 4096-Int

So ...
- Cost effective dependence prediction.
- Why does it work?
- Design space:
  - Number of colors/number of entries.
  - Confidence mechanisms.
  - Other policies.
- Power consumption:
  - Disable chunks of the predictor and use the basic policy;
  - Enable and become aggressive.

Have a colorful evening
Soner Önder
Michigan Technological University
Antalya, Turkey