Pipelining – Branch Prediction
CS/COE 1541 (term 2174), Jarrett Billingsley
Class Announcements
- Quizzes/exams reweighted again (last time FOR REAL): quizzes 5% each (15% total), exams 15% each (45% total).
- Short lecture, then quiz. We'll talk about branch prediction today, which will be on the quiz :)
- First exam next week, Wednesday, February 1st. There will be a study guide – probably Wednesday?
- No homework for this week since the exam is next week!
1/23/2017 CS/COE 1541 term 2174
But first... Improving Branch Penalties
The problem
Branches are based on comparisons. Which are done by... the ALU, in the EX phase. Which means it takes how many cycles to determine if a branch is taken? Three. So how many possibly-wrongly-fetched instructions are in the pipeline behind the branch, which may need to be flushed? Two. But what if our pipeline were longer? What if there were 4 decode phases and 3 execute phases? Ugh. Therefore, we want to determine the branch direction (whether or not it's taken) as early as possible. What are some ways we could improve this situation?
The solution: MORE SINKS!! (hardware)
If we carefully design our instruction set (like the MIPS designers did), we can determine the branch direction early, during decoding! In what phase do we read the registers to be compared? The ID phase. So what hardware can we put in the ID phase to let us determine the branch direction? A comparator! But... what's the downside to adding more hardware? Starting to notice a pattern, huh. So if we mispredict a branch in ID, how many instructions need to be flushed? Just one – the one in IF!
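To make the payoff concrete, here's a tiny back-of-the-envelope model (mine, not from the slides) of the flush penalty: with the stages numbered IF=1, ID=2, EX=3, a branch resolved in stage s has s−1 younger instructions fetched behind it that must be squashed on a mispredict.

```python
# Back-of-the-envelope flush-penalty model (an illustration, not hardware).
# Stage numbers are 1-based: IF=1, ID=2, EX=3, and so on for deeper pipes.

def flush_count(resolve_stage):
    """Instructions fetched behind the branch that must be squashed
    when the branch direction is determined in `resolve_stage`."""
    return resolve_stage - 1

print(flush_count(3))  # resolve in EX: 2 instructions flushed
print(flush_count(2))  # resolve in ID with a comparator: only 1
print(flush_count(7))  # a deep pipeline (4 decode + 3 execute stages): 6!
```

This is why the "add a comparator to ID" trick matters more and more as pipelines get deeper.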
Not all sunshine and rainbows
Uh oh. Remember data hazards?

    Time            1    2    3    4    5    6    7
    sub t0,t1,t2    IF   ID   EX   MEM  WB
    beq t0,$0,end        IF   ID   ID   EX   MEM  WB
                              (WAIT!)

Now we've added a forwarding path from EX to ID. And as you can imagine, we need one from MEM to ID too. Doing things in more stages means more forwarding paths. Something to keep in mind as you add more pipeline stages!
Static Branch Prediction
The compiler can help
In the following loop, what can you say about the blt instruction?

    for (s0 = 0 .. 100) print(s0);
    printf("done");

        li   s0, 0
    top:
        move a0, s0
        jal  print
        addi s0, s0, 1
        blt  s0, 100, top
        la   a0, done_msg
        jal  printf

(It's taken on almost every execution – it only falls through once, when the loop ends.) In fact, the original version of MIPS had a special kind of branch for this: branch likely (high probability).
The old ways
When MIPS was first designed, the idea was that the compiler could do the instruction scheduling/branch prediction in advance. The regular branch instructions assumed the branch was NOT taken, so the CPU would keep fetching instructions after them. The branch likely instructions assumed the branch WAS taken, so the CPU would start fetching instructions at the branch target. And this can be pretty effective for some control structures! Unfortunately, not effective enough... branch likely instructions are no longer part of the MIPS ISA. Many other "compiler-centric" features of MIPS have lost relevance over the years as well, such as inserting NOP instructions instead of forwarding/stalling dynamically.
The CPU knows best
Ultimately, for most programs, the compiler cannot statically predict their behavior to an acceptable degree. Solving the halting problem yadda yadda... The CPU can dynamically analyze program behavior at runtime, and adapt gracefully. Program behavior can change with user input, after all! Implementing this analysis in hardware means the CPU architecture can change drastically without changing compilers and old code. It can also allow unoptimized programs to run quickly. We'll be learning about several adaptive execution schemes, starting with...
Dynamic Branch Prediction
The problem
Some branches are taken 99% of the time, some 1% of the time, some always, some never, some 50% of the time, some randomly... What we need is hardware that can keep track of: where branch instructions are in the program, the probability that each branch is taken, and the branch target address of each branch.

    Branch PC     Probability   Branch Target
    0x007FA004    32%           0x007FA03C
    0x007FA058    94%           0x007FA040
    0x007FC380    88%           0x007FC398
    0x007FC60C    12%           0x007FC704

Well, let's try to turn this into hardware...
Compromises
How many branch instructions might there be in your program? How about in all programs running, plus the operating system? So how many entries should our prediction table have? One of those "try it and see" things – processor simulation is very useful in these cases. Law of diminishing returns. If we have n entries in our table, how can we quickly look up addresses in the table? (We're using this on every instruction!) Lots of comparators... lots of hardware. Hashing... but we could get false positives. If we predict incorrectly, what happens? Program runs a little slower, but nothing catastrophic.
The Branch Target Buffer (BTB)

    Hash #   Branch PC     Pred.   Branch Target
    0        0x007FA004    NT      0x007FA03C
    1        0x007FC60C            0x007FC704
    2        0x007FA058    T       0x007FA040
    3        ...
    4        0x007FC380            0x007FC398
    5
    6
    7

    entry = Hash(PC)
    if (entry.PC == PC && entry.pred == T)
        NextPC = entry.target
    else
        NextPC = PC + 4

The comparison of the incoming PC (e.g. 0x007FA004) against the entry's stored Branch PC is to avoid false positives on non-branch (or wrong-branch) instructions!
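The lookup/update logic above can be sketched in software. This is a minimal illustrative model – the table size, the hash function, and the class/field names are my assumptions, not a real hardware design:

```python
# A software sketch of a direct-mapped Branch Target Buffer.
# Names and sizes are illustrative assumptions.

class BTB:
    def __init__(self, size=8):
        # Each entry is (branch_pc, predicted_taken, target) or None.
        self.entries = [None] * size
        self.size = size

    def _index(self, pc):
        # The "hash" here is just low-order word-address bits,
        # one common hardware choice.
        return (pc >> 2) % self.size

    def predict(self, pc):
        """Used during IF: return the predicted next PC."""
        e = self.entries[self._index(pc)]
        if e is not None and e[0] == pc and e[1]:
            return e[2]      # tag matches AND predicted taken: fetch target
        return pc + 4        # otherwise, just fall through

    def update(self, pc, taken, target):
        """After ID, record the branch's actual direction and target."""
        self.entries[self._index(pc)] = (pc, taken, target)

btb = BTB()
btb.update(0x007FA058, True, 0x007FA040)
print(hex(btb.predict(0x007FA058)))  # -> 0x7fa040 (predicted taken)
print(hex(btb.predict(0x007FA004)))  # -> 0x7fa008 (no entry: PC+4)
```

Note how `predict` mirrors the slide's pseudocode exactly: a tag check to avoid false positives, then either the cached target or PC+4.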
When to read? When to write?
Ideally, we'd like to start fetching instructions from the "correct" place during the branch instruction's ID phase. When should we check the BTB, then? During IF! Wait, but how does it even know it's a branch— Remember that the BTB checks that the instruction PC matches the BTB entry, so it MUST be a branch instruction.* When do we write to the BTB? Well, when do we have all the information needed to fill in a BTB entry? After ID, when the branch target and direction have both been computed. (Nice optimization!) This also handles adding new entries – the BTB is only written on branches.
*unless we have an incoherent instruction cache and dynamic code modification ;)
Nobody's perfect
What happens if, at the end of ID, we find our prediction is wrong? Flush and start fetching from the correct PC. But now the BTB is updated with new info as well. (It's updated even if we predicted correctly, too.) Let's make our predictions more accurate. The scheme we showed here has only a single bit to predict taken/not taken. It's... not much information to go on. But adding more bits means more hardware. Let's strike a balance.
2-bit branch predictor
We can use 2 bits with a state machine to make better predictions:

    Strongly Taken <-> Weakly Taken <-> Weakly Not Taken <-> Strongly Not Taken
    (green arrows = taken outcome, red arrows = not-taken outcome)

The hysteresis (you have to make two mistakes before switching decisions) tolerates intermittent changes in branch behavior.
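Here's that state machine as a quick software sketch. This is a hedged illustration – the class name and the "start weakly taken" reset state are my choices, not something the hardware mandates:

```python
# A 2-bit saturating counter, modeling the four-state diagram above.
# States 0..3: 0 = strongly not taken, 1 = weakly not taken,
#              2 = weakly taken,       3 = strongly taken.

class TwoBitCounter:
    def __init__(self, state=2):   # start weakly taken (arbitrary choice)
        self.state = state

    def predict(self):
        return self.state >= 2     # taken if in either "taken" state

    def update(self, taken):
        # Saturate at 0 and 3. One wrong outcome only moves one step,
        # so a single anomaly doesn't flip the decision (hysteresis).
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = TwoBitCounter(state=3)   # strongly taken
c.update(False)              # one not-taken: drops to weakly taken
print(c.predict())           # -> True: still predicts taken
c.update(False)              # a second not-taken: weakly not taken
print(c.predict())           # -> False: now the decision flips
```

The two `update(False)` calls demonstrate the "two mistakes before switching" behavior directly.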
3 bits? 4? 10?
Is it worth adding more bits to the prediction probability? Empirically... not really. 2 bits + a large number of BTB entries gets you ~93% accuracy! More bits don't help because branch behavior can be complex. What does help with prediction accuracy is more complex branch prediction methods: two-level adaptive predictors... tournament predictors... hybrid predictors... loop detection... return stack buffers... Oh, and then there are indirect jumps (jr), which can be a whole different kind of pain to deal with.
2-level adaptive predictors
A common technique today: each entry in the BTB has multiple 2-bit counters, selected among by using a branch history.

    Branch PC     History   Branch Target
    0x007FA004    010       0x007FA03C

Every entry in the BTB has its own set of 8 two-bit counters, and the 3-bit history selects which one to use. Every time the branch is taken/not taken, a 1/0 is shifted into the history on the right side. This way, the history keeps track of the last three times we encountered this branch. This kind of predictor can reach 97% accuracy!
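To see why this helps, here's an illustrative sketch (names and sizes are mine) of a single BTB entry's two-level predictor: a 3-bit local history indexes into 8 of the 2-bit counters from the previous slide. A branch with the repeating pattern taken, taken, not-taken fools a lone 2-bit counter every third time, but becomes almost perfectly predictable once each history value trains its own counter:

```python
# One BTB entry's two-level predictor: 3 bits of local history
# selecting among 8 two-bit saturating counters.

class TwoLevelEntry:
    def __init__(self, history_bits=3):
        self.history = 0
        self.mask = (1 << history_bits) - 1
        self.counters = [2] * (1 << history_bits)  # all start weakly taken

    def predict(self):
        return self.counters[self.history] >= 2

    def update(self, taken):
        i = self.history
        # Train the counter selected by the current history...
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # ...then shift the outcome into the history from the right.
        self.history = ((self.history << 1) | int(taken)) & self.mask

e = TwoLevelEntry()
hits = 0
for taken in [True, True, False] * 20:   # T, T, N repeating
    hits += (e.predict() == taken)
    e.update(taken)
print(hits)   # 59 of 60 correct: only one miss, during warm-up
```

Each of the three positions in the T-T-N pattern produces a distinct history value, so each gets its own counter, and after warm-up every prediction is right.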
Return stack buffers
A function return is a special kind of indirect branch: jr $ra on MIPS or ret on x86 both get the address from somewhere else. Since functions return to where they were called virtually every time, it makes sense to cache the return address on function calls.

    4AB33C:  jal someFunc      ; push the return address, 4AB340
    4AB340:  beq v0, $0, blah
             ...
    someFunc:
             ...
             jr $ra            ; pop the predicted return address: 4AB340

    Return stack after the push (top first): 4AB340, 4AB108, 46280C, 40CC00

When we encounter the jal, push the return address. When we encounter the jr $ra, pop it. Easy! Stack overflows aren't an issue – this is just a prediction, after all.
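The mechanism above can be sketched in a few lines. This is an illustrative model (the class name, depth, and drop-the-oldest overflow policy are my assumptions); the key point is that overflow never causes incorrect execution, only an occasional misprediction:

```python
# A small return address stack: push on calls, pop on returns.

class ReturnStack:
    def __init__(self, depth=4):
        self.stack = []
        self.depth = depth

    def on_call(self, call_pc):
        # A jal at call_pc returns to the instruction right after it.
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overflow: drop the oldest entry
        self.stack.append(call_pc + 4)

    def on_return(self):
        # Predicted target of jr $ra; None means "no prediction".
        return self.stack.pop() if self.stack else None

rs = ReturnStack()
rs.on_call(0x4AB33C)                   # jal someFunc
print(hex(rs.on_return()))             # -> 0x4ab340, the instr after the jal
```

If the stack overflowed and we pop a stale (or no) address, the normal mispredict machinery catches it – flush and refetch, nothing catastrophic.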