Dynamically Sizing the TAGE Branch Predictor

Presentation transcript:

Dynamically Sizing the TAGE Branch Predictor Stephen Pruett, Siavash Zangeneh, Ali Fakhrzadehgan, Ben Lin, and Yale N. Patt 6/18/16 HPS Research Group, The University of Texas at Austin

Problem
- Storage efficiency is key (fundamental tradeoff with conditional probability)
- Size of TAGE tables must be decided at design time
- Number of tables (and histories) must be decided at design time
- Most benchmarks require large amounts of storage in low tables
- Makes it impossible for designers to justify long (expensive) histories or adding storage to longer history tables
- Designers must consider what is best for all benchmarks, not what is best for each benchmark

Our Contributions
- A reconfigurable architecture that can reallocate storage at run time
- Algorithms that determine the storage that the running application needs
- Victim Cache

Outline
- Architecture
- Tables, Tiles, and Configuration Vectors
- Scoring Unit
- Reconfigurable Interconnect
- Victim Cache
- Limitations
- Results
- Questions

Architecture
- M geometrically increasing history registers, as in TAGE
- Cascading MUXes, as in TAGE
- N tiles instead of tables
- 2 Reconfigurable Interconnects
- Scoring Unit
  - Collects run time information
  - Determines the size of each table
- Configuration Vector
  - Specified by the scoring unit
  - Input into the reconfigurable interconnect
  - E.g.: 2,0,0,0,0,0,0,0,4,0,2,0,4,4,8,8
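As a rough illustration of what a configuration vector encodes, the sketch below treats each entry as the number of tiles given to one table. The function name and both validity rules (each entry is 0 or a power of two, and the total does not exceed N) are assumptions inferred from the example vectors on the slides, not rules stated by the authors.

```python
def is_valid_config_vector(cv, num_tiles):
    """Check a configuration vector under this sketch's assumed rules."""
    for tiles in cv:
        # Assumed rule: a table is either disabled (0 tiles) or a power of two.
        if tiles < 0 or (tiles != 0 and (tiles & (tiles - 1)) != 0):
            return False
    # Assumed rule: the vector may not hand out more tiles than exist.
    return sum(cv) <= num_tiles


# The example vector from the Architecture slide: 16 tables sharing 32 tiles.
example = [2, 0, 0, 0, 0, 0, 0, 0, 4, 0, 2, 0, 4, 4, 8, 8]
print(sum(example), is_valid_config_vector(example, 32))
```

The slide's example vector adds up to 32 tiles, which matches the tile count listed for the 8KB configuration.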

One Possible Configuration
# of Histories: 6
# of Tiles: 16
Geometric Series: 2^n
Configuration Vector: 4,4,2,2,2,2
[Figure: history registers h[1:2^1] through h[1:2^6], each passing through an H block into Tables 1 through 6, which are built from the 16 tiles.]

Another Possible Configuration
# of Histories: 6
# of Tiles: 16
Geometric Series: 2^n
Configuration Vector: 1,1,2,4,4,4
[Figure: history registers h[1:2^1] through h[1:2^6], each passing through an H block into Tables 1 through 6, which are built from the 16 tiles.]

And Another…
# of Histories: 6
# of Tiles: 16
Geometric Series: 2^n
Configuration Vector: 4,8,2,2,0,0
[Figure: history registers h[1:2^1] through h[1:2^6], each passing through an H block into Tables 1 through 6, which are built from the 16 tiles.]
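The three example configurations can be reproduced with a small sketch. It computes the 2^n history lengths and hands each table a run of tile IDs according to the configuration vector. The contiguous tile layout and the function names are illustrative assumptions; the slides do not say which physical tiles a table receives.

```python
def history_lengths(num_histories):
    # Geometric series 2^n from the slides: table n sees history h[1:2^n].
    return [2 ** n for n in range(1, num_histories + 1)]


def assign_tiles(cv):
    """Give each table a contiguous run of tile IDs (an assumed layout)."""
    assignment, next_tile = [], 0
    for tiles in cv:
        assignment.append(list(range(next_tile, next_tile + tiles)))
        next_tile += tiles
    return assignment


print(history_lengths(6))
for cv in ([4, 4, 2, 2, 2, 2], [1, 1, 2, 4, 4, 4], [4, 8, 2, 2, 0, 0]):
    tables = assign_tiles(cv)
    # Each vector spends all 16 tiles; a 0 entry leaves that table empty.
    print(cv, sum(len(t) for t in tables))
```

In the third vector, the two longest-history tables receive no tiles at all, so their storage goes to the shorter-history tables instead.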

Quanta and Runtime Phases
[Figure: timeline of the currently running program from beginning to end, divided into quanta, with a learning phase and an adaptive phase marked along the time axis.]

Scoring Unit
- Learning Phase
  - Collects run time information
  - Produces new configuration vectors
- Adaptive Phase
  - Selects the best configuration
  - Dynamically switches between the configurations produced in the learning phase

Scoring Unit: Learning Phase
Runtime Statistics (from the predictor), one set of counters for each of Tables 1 through 6:
- Mispredictions
- Conflicts
- Attempts

Scoring Unit: Learning Phase (cont.)
- Misprediction Counter
  - Incremented when the table mispredicts
- Conflict Counter
  - Incremented when an attempted allocation conflicts (cannot be allocated) in the table
- Attempt (Attempted Allocation) Counter
  - Incremented when there is an attempted allocation
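The three per-table counters can be modeled in a few lines. This is only a software sketch of the bookkeeping described above; the class and method names are hypothetical, and the real scoring unit is hardware.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    mispredictions: int = 0
    conflicts: int = 0
    attempts: int = 0


class LearningStats:
    """Per-table counters collected during the learning phase (names are hypothetical)."""

    def __init__(self, num_tables):
        self.tables = [TableStats() for _ in range(num_tables)]

    def record_misprediction(self, table):
        # The table supplied a prediction and it was wrong.
        self.tables[table].mispredictions += 1

    def record_allocation_attempt(self, table, succeeded):
        # Every attempted allocation counts; a failed one is also a conflict.
        self.tables[table].attempts += 1
        if not succeeded:
            self.tables[table].conflicts += 1


stats = LearningStats(num_tables=6)
stats.record_misprediction(2)
stats.record_allocation_attempt(3, succeeded=True)
stats.record_allocation_attempt(3, succeeded=False)
print(stats.tables[3])
```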

Scoring Unit: Learning Phase (cont.)
[Figure: the Runtime Statistics counters (Mispredictions, Conflicts, Attempts) for Tables 1 through 6 being updated with example values from the predictor.]

When does a table need more storage?
- Highly congested tables usually have many conflicts
  - Symptom: high number of conflicts
- Some tables are so highly congested that new allocations overwrite entries before they are ever used
  - Symptom: high number of attempted allocations
  - Difficult to tell the difference between this and a useless table

Algorithms
- Step 1: Reclaim storage
  - Reduce table to smallest power of 2 that can still hold all the entries
  - Very aggressive strategy
- Step 2: Distribute storage (3 algorithms)
  - Conflict
    - Add storage to tables that have an above average number of conflicts
  - Attempt
    - Add storage to tables that have an above average number of attempted allocations
    - Additionally, give higher priority to tables that are after the max table
  - Hybrid
    - Use the conflict policy if the MPKI is high, otherwise use the attempt policy
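A minimal sketch of the two steps, under several assumptions that the slides do not confirm: table sizes are counted in tiles, a table with no live entries shrinks to 0 tiles, growth doubles a table in a single pass over the tables, and the hybrid policy uses a made-up MPKI threshold. The extra priority rule of the attempt policy is left out. All function and parameter names are hypothetical.

```python
import math


def reclaim(used_entries, tile_size):
    """Step 1: shrink each table to the smallest power-of-two tile count
    that still holds its live entries (0 tiles if it holds none, an assumption)."""
    sizes = []
    for used in used_entries:
        tiles = math.ceil(used / tile_size)
        sizes.append(0 if tiles == 0 else 1 << (tiles - 1).bit_length())
    return sizes


def distribute(sizes, metric, num_tiles):
    """Step 2: grow tables whose metric is above average by doubling them
    (assumed growth rule) while enough free tiles remain."""
    sizes = list(sizes)
    free = num_tiles - sum(sizes)
    average = sum(metric) / len(metric)
    for table, value in enumerate(metric):
        extra = max(sizes[table], 1)  # doubling; a disabled table gets 1 tile
        if value > average and extra <= free:
            sizes[table] += extra
            free -= extra
    return sizes


def hybrid(sizes, conflicts, attempts, mpki, num_tiles, mpki_threshold=5.0):
    # mpki_threshold is invented; the slides only say "if the MPKI is high".
    metric = conflicts if mpki > mpki_threshold else attempts
    return distribute(sizes, metric, num_tiles)


sizes = reclaim([0, 1, 512, 513, 1500], tile_size=512)
print(sizes)
print(distribute(sizes, [0, 10, 1, 1, 8], num_tiles=16))
```

Passing the conflict counters as `metric` gives the conflict policy, and passing the attempt counters gives a simplified attempt policy.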

Scoring Unit: Adaptive Phase
Misprediction Vector (from the predictor): one Total Mispredicts count for each of Config 0 through Config 6.

Scoring Unit: Adaptive Phase
Adopted from T. Juan, S. Sanjeevan, and J. Navarro, “Dynamic History Length Fitting: A Third Level of Adaptivity for Branch Prediction”
[Figure: the misprediction vector across the learning phase and the adaptive phase. Each of Config 0 through Config 6 has an Active flag and a Total Mispredicts count supplied by the predictor, and the entry with the minimum count is marked. Counts appearing in the animation: 527, 415, 675, 897, 922, 342, 417, 672.]
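The selection step of the adaptive phase amounts to picking the entry with the minimum count in the misprediction vector. The sketch below assumes that after each quantum the active configuration's count is overwritten with its latest value and the minimum is then re-selected; that update rule, the class name, and the counts used are all illustrative, not taken from the slides.

```python
def best_config(total_mispredicts):
    """Return the index of the configuration with the fewest mispredictions."""
    return min(range(len(total_mispredicts)), key=total_mispredicts.__getitem__)


class AdaptivePhase:
    def __init__(self, learned_totals):
        # One Total Mispredicts count per configuration from the learning phase.
        self.mispredicts = list(learned_totals)
        self.active = best_config(self.mispredicts)

    def end_quantum(self, mispredicts_this_quantum):
        # Assumed update rule: refresh the active entry, then switch to the minimum.
        self.mispredicts[self.active] = mispredicts_this_quantum
        self.active = best_config(self.mispredicts)
        return self.active


phase = AdaptivePhase([50, 30, 40])  # hypothetical counts
first_choice = phase.active
next_choice = phase.end_quantum(60)  # the active config got worse
print(first_choice, next_choice)
```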

Reconfigurable Interconnect
- 2 Reconfigurable Interconnects
  - Connects histories to tiles
  - Connects tiles to MUXes
- Each X is a switch, enabled by the CV
- Tiles organized in a direct mapped fashion
  - Fully-associative would limit the max size of a table
- Hashing function always produces enough bits to index largest possible table
  - Nice because low bits of hash (used to index tile) do not change after remapping
  - High bits are compared to TileID
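One way to picture the indexing property is to split the hash into a low part that indexes within a tile and a high part that picks the tile. In the sketch below the high bits select among the tiles currently mapped to a table, which stands in for the TileID comparison on the slide; the function name, the tile ID list, and the power-of-two, non-empty table assumption are all illustrative.

```python
def locate_entry(hash_value, tile_size, table_tile_ids):
    """Split a hash into (tile ID, index within the tile) for one table.

    table_tile_ids lists the tile IDs currently mapped to the table; its
    length is assumed to be a non-zero power of two, as is tile_size.
    """
    index_bits = tile_size.bit_length() - 1           # log2(tile_size)
    index = hash_value & (tile_size - 1)              # low bits: same after resizing
    select = (hash_value >> index_bits) & (len(table_tile_ids) - 1)  # high bits
    return table_tile_ids[select], index


h = (5 << 9) | 3  # a hypothetical hash value
for tiles in ([8], [8, 9], [8, 9, 10, 11]):
    # The index within the tile stays fixed as the table grows.
    print(len(tiles), locate_entry(h, 512, tiles))
```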

Victim Cache
- Boosts the most heavily loaded table
- Goal: Increase the reuse distance for entries that were never used
- I.e., the minimum # of conflicts with an above average # of attempts
- If there were too many conflicts, the entry could not be restored
- Take advantage of unused bit combinations in each entry
- Organized as a bloom filter
- Trade off capacity for correctness

Limitations
- Number of tiles must be a power of 2
  - Otherwise it is possible to create invalid combinations
  - Simplifies logic
  - Problem gets worse as overall predictor size increases
- Tag size
  - Assume worst case, use 15 bit tags
  - Attempted treating as two entries in lower tables

Configuration

Parameter                  8KB    64KB
# of Tiles (N)             32     64
Size of Tile               512
Tag Size                   15     10
Quantum                    100K   50K
Quanta in Learning Phase   7

Results: 8KB
Average MPKI: 5.370

Trace   Diff   Improvement   Configuration Vector
SS53    2.32   6.59%         1,8,8,4,4,4,1,1,1,0,0,0,0,0,0,0
SS56    2.49   6.17%
SS57    2.61   7.57%         1,1,1,1,1,1,4,8,4,4,4,1,1,0,0,0
SM2     2.88   65.58%
SM43    5.07   435.97%       0,0,0,0,1,1,2,1,1,8,4,4,4,4,1,1

Results: 64KB
Average MPKI: 4.265

Trace   Diff   Improvement   Configuration Vector
SS57    1.10   3.85%         4,4,4,8,8,8,4,4,4,4,2,2,2,2,2,2
SS53    1.22   4.26%         4,4,8,8,8,8,4,4,2,2,2,2,2,2,2,2
SM42    1.71   151.15%       2,1,1,4,4,4,8,8,8,8,4,4,2,2,2,2
SM41    1.99   15.35%        1,1,1,1,4,4,4,4,8,8,8,4,4,4,4,4
SM58    2.45   25.52%        2,2,1,1,1,1,4,8,8,8,8,4,4,4,4,4

Questions? Thank you!