Dynamic Branch Prediction During Context Switches
Jonathan Creekmore and Nicolas Spiegelberg



Overview
- Branch Prediction Techniques
- Context Switching
- Compression of Branch Tables
- Simulation
- Hardware Model
- Results
- Analysis

Case for Branch Prediction
- Multiple instructions in flight at one time
  - Between 15 and 20
- Branches occur every 5 instructions
  - if, while, for, function calls, etc.
- Stalling the pipeline is unacceptable
  - Loses all the advantage of multiple instruction issue

Context Switch Time
- Causes program execution to be paused
  - The state of the program is saved
  - A new program is executed
- Eventually, the original program begins executing again
- Not all of the CPU state is saved
  - Such as the branch predictor tables

Context Switch Time (cont.)
- There is one set of branch predictor state
- A context switch causes a new application to use the previous application's branch predictor state
  - Degrades performance for all applications
- Solution: save the state of the branch predictor at context switch time

Saving the Branch State Table
- Even simple branch predictors have a large number of bits
- Storing and restoring the branch predictor should not take too long
  - The gain of storing/restoring is lost if it takes longer than the "warm-up" time of the branch predictor

Compression
- Compression is the key
  - Requires less storage
- Needs to be done carefully
  - Some lossless compression schemes can inflate the number of bits
  - Luckily, lossy compression is acceptable

Semi-Lossy Compression
- Applies to 2-bit predictors
- The key is to store just the taken/not-taken state
  - Ignores the strong/weak distinction
[Diagram: the four 2-bit counter states (strong/weak taken, strong/weak not-taken) map to a single taken (T) or not-taken (NT) bit]

Semi-Lossy Decompression
[Diagram: each stored T or NT bit expands back into one of the 2-bit counter states (SNT, WNT, WT, ST)]
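A minimal sketch of this scheme in C (an illustration, not the authors' implementation). It assumes the 2-bit counters are stored one per byte with the encoding 0 = strong not-taken, 1 = weak not-taken, 2 = weak taken, 3 = strong taken, and that each entry is restored to the weak state; the slides only say the strong/weak distinction is discarded, so the restore state is an assumption.

    #include <stdint.h>
    #include <stddef.h>

    /* Semi-lossy compression: keep only the taken/not-taken bit (the MSB
     * of each 2-bit counter), halving the storage. */
    void compress_semi_lossy(const uint8_t *counters, uint8_t *bits, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            bits[i] = (counters[i] >> 1) & 1;   /* 1 = taken, 0 = not taken */
    }

    /* Decompression: expand each stored bit back into a 2-bit counter.
     * Restoring to the weak state is an assumption. */
    void decompress_semi_lossy(const uint8_t *bits, uint8_t *counters, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            counters[i] = bits[i] ? 2 /* weak taken */ : 1 /* weak not-taken */;
    }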

Lossy Compression
- Branch prediction is just an educated guess
- A higher compression ratio can be achieved if some information is lost
- Majority rules
  - Used by the correlating branch predictor

Lossy Compression (cont.)
[Figure: a row of taken/not-taken entries compressed 4x; each group of entries is replaced by its majority value]

Lossy Decompression
- Reinitialize all elements for an address to the stored value
- Best case: all elements are correct
- Worst case: 50% of elements are correct
- Remember: branch predictors are just educated guesses
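Building on the sketch above, one possible reading of the majority-rules idea in C, assuming groups of four counters collapse into one bit (the 4x grouping follows the figure; the tie-breaking rule is an assumption):

    #include <stdint.h>
    #include <stddef.h>

    /* Lossy compression: collapse each group of four 2-bit counters into
     * one bit by majority vote on their taken/not-taken bits.
     * Ties are counted as taken here (an assumption). */
    void compress_lossy_4x(const uint8_t *counters, uint8_t *bits, size_t groups)
    {
        for (size_t g = 0; g < groups; g++) {
            int taken = 0;
            for (int j = 0; j < 4; j++)
                taken += (counters[4 * g + j] >> 1) & 1;
            bits[g] = (taken >= 2);
        }
    }

    /* Lossy decompression: reinitialize every counter in a group to the
     * stored majority value (weak states assumed, as before). */
    void decompress_lossy_4x(const uint8_t *bits, uint8_t *counters, size_t groups)
    {
        for (size_t g = 0; g < groups; g++)
            for (int j = 0; j < 4; j++)
                counters[4 * g + j] = bits[g] ? 2 : 1;
    }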

Simulation
- Modified SimpleScalar's sim-bpred to support context switching
  - Not necessary to actually switch between programs
  - On a context switch, corrupt the branch predictor table according to a "dirty" percentage to simulate another program having run

Simulation (cont.)
- Testing compression/decompression becomes simple
  - Instead of corrupting the branch predictor table, replace entries with their value after compression/decompression
  - Tested with:
    - 2-bit semi-lossy compression
    - 4-bit lossy compression
    - 8-bit lossy compression
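The "dirty percentage" corruption can be pictured with a small hypothetical helper; the real change lives inside SimpleScalar's sim-bpred, and the function name and table layout here are inventions for illustration only.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* On a simulated context switch, overwrite a 'dirty' fraction of the
     * predictor's 2-bit counters with random states, standing in for the
     * pollution another program would cause. Hypothetical sketch. */
    void corrupt_predictor(uint8_t *counters, size_t n, double dirty_fraction)
    {
        for (size_t i = 0; i < n; i++)
            if ((double)rand() / RAND_MAX < dirty_fraction)
                counters[i] = (uint8_t)(rand() & 3);  /* random 2-bit state */
    }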

Hardware Model
- Compression and decompression blocks are fully pipelined
- Compression and decompression blocks can handle n bits of compressed data at a time
- Compression and decompression occur simultaneously

Hardware Model (cont.)
- Utilize data independence
  - Compress 128 bits into 64 bits at one time
  - Pipeline overhead should be minimal compared to the clock cycle savings

Programs Simulated
- Several SPEC2000 CINT2000 programs simulated
  - 164.gzip (compression)
  - 175.vpr (FPGA place and route)
  - 181.mcf (combinatorial optimization)
  - 197.parser (word processing)
  - 256.bzip2 (compression)

Predictor Types
- 2048-entry bimodal predictor (4096 bits)
- 4096-entry bimodal predictor (8192 bits)
- 1024-entry two-level predictor with 4-bit history size (16384 bits)
- 4096-entry two-level predictor with 8-bit history size (… bits)
- 8192-entry two-level predictor with 8-bit history size (… bits)
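As a reference point for the first configuration, a toy model of a 2048-entry bimodal predictor (2-bit saturating counters indexed by low PC bits). The word-aligned indexing is an assumption, and this is not SimpleScalar's code.

    #include <stdint.h>

    #define BIMODAL_ENTRIES 2048  /* 2048 entries x 2 bits = 4096 bits of state */

    static uint8_t bimodal[BIMODAL_ENTRIES];  /* 2-bit saturating counters */

    /* Predict taken if the counter's upper bit is set. */
    int bimodal_predict(uint32_t pc)
    {
        return (bimodal[(pc >> 2) % BIMODAL_ENTRIES] >> 1) & 1;
    }

    /* Update: saturating increment on taken, decrement on not-taken. */
    void bimodal_update(uint32_t pc, int taken)
    {
        uint8_t *c = &bimodal[(pc >> 2) % BIMODAL_ENTRIES];
        if (taken && *c < 3)
            (*c)++;
        else if (!taken && *c > 0)
            (*c)--;
    }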

2048 Entry Bimodal Predictor

4096 Entry Bimodal Predictor

1024 entry two-level predictor with 4-bit history size

4096 entry two-level predictor with 8-bit history size

8192 entry two-level predictor with 8-bit history size

Timing Comparison
- Miss penalty: 10 clock cycles
- Bandwidth: 64 bits per clock cycle

Timing Equations
- General timing equation [not captured in the transcript]
- Special case for a compression ratio of 0 [not captured in the transcript]
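The equations themselves did not survive the transcript, so the following is only a plausible reconstruction from the parameters on the surrounding slides (table size, compression ratio, bus bandwidth, miss penalty); treat every term as an assumption rather than the authors' formula.

    /* Estimated cycle cost of saving and restoring a predictor table of
     * 'bits' bits compressed by 'ratio' (e.g. 2.0 means 2:1) over a bus of
     * 'bandwidth' bits per cycle, plus the penalty of the extra
     * mispredictions that lossy compression introduces. A ratio of 0 is
     * read as "do not save at all", so the only cost is re-warming the
     * predictor through mispredictions. Reconstruction, not the slides'
     * equation. */
    double save_restore_cycles(double bits, double ratio, double bandwidth,
                               double extra_mispredicts, double miss_penalty)
    {
        if (ratio == 0.0)
            return extra_mispredicts * miss_penalty;   /* special case */
        return 2.0 * (bits / ratio) / bandwidth        /* save + restore */
               + extra_mispredicts * miss_penalty;
    }

    /* Example: a 4096-bit bimodal table, 2:1 compression, 64-bit bus,
     * 10-cycle miss penalty, 20 extra mispredictions:
     *   save_restore_cycles(4096, 2.0, 64, 20, 10) = 264 cycles. */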

Timing Comparison (cont.)
- Miss penalty: 15 clock cycles
- Bandwidth: 64 bits per clock cycle

Timing Comparison (cont.)
- Miss penalty: 10 clock cycles
- Bandwidth: 128 bits per clock cycle

Summary
- Dynamic branch prediction is necessary for modern high-performance processors
- Context switches reduce the effectiveness of dynamic branch prediction
- Naïvely saving the branch predictor state is costly

Summary (cont.)
- Compression can be used to reduce the cost of saving branch predictor state
- Higher compression ratios reduce the fixed save/restore time at the cost of increasing the number of mispredictions
  - For low-frequency context switches, this yields an improvement in performance

Questions