Pipelining - Details CS/COE 1541 (term 2174) Jarrett Billingsley
Class Announcements Today's the last day of add/drop! Though by now you'd probably have a hard time switching classes. Homework due today! If you have a physical copy, please hand it in now. If you're turning it in digitally, it's due by midnight tonight. Please type it up – don't scan a paper assignment, as that's kinda the worst of both worlds... If you're still working on it: Don't worry too much about whether to stall the ID or EX phases during data hazards.
Forwarding considerations
The hardware For every forwarding path, you've gotta add more wires/muxes. These muxes are switched when there are data hazards. The ID phase might have fetched wrong register values – but that's OK! But there's an important issue to consider... Them's some big wires. [Diagram: the EX→EX and MEM→EX forwarding paths, each a full 32-bit bus, looping back into the ALU inputs.]
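To make that concrete, here's a minimal C sketch of the kind of decision one of those forwarding muxes makes for a single ALU input, assuming a classic 5-stage pipeline; the struct and field names (rs, rt, rd, reg_write) are made up for illustration, not taken from any real design.

/* Sketch of forwarding-mux control for ALU input A in a classic 5-stage pipeline.
   All names here are illustrative. */
#include <stdio.h>

enum fwd_src { FROM_ID = 0, FROM_EX_MEM = 1, FROM_MEM_WB = 2 };

struct ex_stage  { int rs, rt; };             /* source registers of the instruction in EX */
struct mem_stage { int rd; int reg_write; };  /* destination of the instruction in MEM */
struct wb_stage  { int rd; int reg_write; };  /* destination of the instruction in WB */

/* Pick where ALU input A comes from: the register value read back in ID,
   the EX->EX path (result sitting in the EX/MEM register), or the MEM->EX path. */
enum fwd_src forward_a(struct ex_stage ex, struct mem_stage mem, struct wb_stage wb) {
    if (mem.reg_write && mem.rd != 0 && mem.rd == ex.rs)
        return FROM_EX_MEM;   /* newest value wins: EX->EX forwarding */
    if (wb.reg_write && wb.rd != 0 && wb.rd == ex.rs)
        return FROM_MEM_WB;   /* MEM->EX forwarding */
    return FROM_ID;           /* the value fetched in ID was fine after all */
}

int main(void) {
    struct ex_stage  ex  = { .rs = 8, .rt = 9 };
    struct mem_stage mem = { .rd = 8, .reg_write = 1 };
    struct wb_stage  wb  = { .rd = 9, .reg_write = 1 };
    printf("ALU input A comes from source %d\n", forward_a(ex, mem, wb)); /* 1: EX->EX */
    return 0;
}

Note the ordering: the EX→EX check comes first, so the newest in-flight value of the register is the one that gets forwarded.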
Oh dear What if we had more pipeline stages? Maybe 5 EX stages and 6 MEM stages? Should we connect every stage to every other stage? If you want "full" forwarding, the interconnect becomes the limiting factor. And what happens when you have more circuitry per stage? The stage takes longer. Which means... Your clock has to run slower. Which means... You might end up reducing performance! More circuitry also uses more power. Engineering is all about tradeoffs. This applies to any design. Eventually you reach a point of diminishing returns, when more complexity doesn't get you any more performance.
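Just to get a feel for how quickly "full" forwarding blows up, here's a back-of-envelope count in C. The assumption that every downstream stage could be holding a result that either ALU input of every EX stage might need is mine, not the slides', so treat the result as an order-of-magnitude estimate, not a real design number.

/* Rough count of forwarding-mux inputs for the hypothetical deeper pipeline.
   Simplifying assumption (not from the slides): any downstream stage could hold
   a result that any ALU input of any EX stage needs. */
#include <stdio.h>

int main(void) {
    int ex_stages  = 5;   /* the slide's hypothetical numbers */
    int mem_stages = 6;
    int operands   = 2;   /* two ALU inputs per EX stage */

    int producer_stages = ex_stages + mem_stages;             /* stages holding in-flight results */
    int mux_inputs      = ex_stages * operands * producer_stages;
    printf("~%d forwarding mux inputs, each a 32-bit bus\n", mux_inputs);  /* ~110 */
    return 0;
}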
How stalls work
Watching a stall Suppose we have an add that depends on an lw. [Pipeline animation: the lw moves on through EX and MEM while the add waits in ID (and the sub behind it waits in IF) until the loaded value is available.]
How does a stall happen? If the control detects a stall condition, it does the following: It stops fetching instructions (doesn't update the PC). It stops clocking the pipeline registers for the stalled stages. The stages after the stalled instructions are filled with nops. Just change the control signals in the pipeline registers! In this way, the stalled instructions will sit still. What happens as we make the pipeline deeper? What if we had 6 memory stages? How many cycles would a memory stall cost us? Oh dear.
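Here's a small C sketch of the classic load-use check that kicks off a stall, assuming a 5-stage pipeline where the lw is in EX while the dependent instruction is in ID; the signal names (mem_read, rs, rt) are illustrative.

/* Load-use hazard detection sketch for a classic 5-stage pipeline. */
#include <stdbool.h>
#include <stdio.h>

struct id_stage { int rs, rt; };            /* registers the instruction in ID wants to read */
struct ex_stage { int rt; bool mem_read; }; /* a load in EX writes its rt register */

/* Stall if the instruction in EX is a load whose destination is a source
   of the instruction currently sitting in ID. */
bool must_stall(struct id_stage id, struct ex_stage ex) {
    return ex.mem_read && (ex.rt == id.rs || ex.rt == id.rt);
}

int main(void) {
    struct ex_stage lw  = { .rt = 8, .mem_read = true };  /* lw with destination $8, in EX */
    struct id_stage add = { .rs = 8, .rt = 9 };           /* add that reads $8, in ID */
    if (must_stall(add, lw))
        printf("stall: freeze the PC and IF/ID, send a nop into EX\n");
    return 0;
}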
How flushes work
What's a flush? We saw an example of a flush last time. [Pipeline diagram: blt s0,10,top is followed in program order by move a0,s0, la a0,done_msg, and jal printf. By the time the branch resolves (s0 < 10... OOPS!), those three instructions are already in the pipeline and get squashed, while the blt itself finishes MEM and WB as normal.]
Watching a flush Let's watch the previous example. [Pipeline animation: the blt goes down the pipeline ahead of move, la, and jal; once the branch resolves, the younger instructions are replaced with nops.]
How do flushes work? If the control detects a flushing situation: Any "newer" instructions (those fetched after the branch that are already in the pipeline) are transformed into nops. Any "older" instructions (those that came BEFORE the branch) are left alone to finish executing as normal. And just like stalls... As the pipeline gets longer, flushes get costlier. If you have to flush 13 instructions after a wrong branch, well crap. (actually this is exactly what happens in modern CPUs) Again, it's a balancing act. Do you make the pipeline deeper for speed, but make wrong branches unreasonably costly?
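As a rough C sketch of the effect (not how the control is actually wired): everything younger than the branch becomes a nop, everything older is left untouched. The three-entry array standing in for the IF, ID, and EX pipeline registers is just for illustration.

/* Flush sketch: squash the instructions younger than a taken/mispredicted branch. */
#include <stdio.h>
#include <string.h>

#define YOUNGER_STAGES 3

struct pipe_reg { char instr[16]; };

/* Turn every instruction younger than the branch into a nop;
   older instructions further down the pipe are not touched. */
void flush_younger(struct pipe_reg stages[], int n) {
    for (int i = 0; i < n; i++)
        strcpy(stages[i].instr, "nop");
}

int main(void) {
    struct pipe_reg younger[YOUNGER_STAGES] = {
        { "jal printf" }, { "la a0,done_msg" }, { "move a0,s0" }
    };
    flush_younger(younger, YOUNGER_STAGES);   /* the blt turned out taken: POW, BOOM */
    for (int i = 0; i < YOUNGER_STAGES; i++)
        printf("%s\n", younger[i].instr);     /* prints nop three times */
    return 0;
}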
Two memories!
Von Neumann vs. Harvard Historically there were two types of memory arrangements: [Diagram: in the Von Neumann arrangement, the CPU talks to one memory that holds both code and data; in the Harvard arrangement, the CPU has a separate program memory and data memory.]
Striking a balance Each memory arrangement has pros and cons. Number of physical memories? Von Neumann wins. Read from two places at once? Harvard wins. Change program code? Von Neumann wins. Caching? Harvard wins. Obviously we only have a single system memory today, but... Internally, the CPU pretends like it has two!
Caching Having one memory for program code and one for data solves the structural hazard of wanting to read/write to two places at once. But there's another big problem: access time for modern RAM is about 12 ns, while the cycle length of a 3.2 GHz CPU is about 0.31 ns. This means it would take roughly 40 cycles to access RAM! Both problems (wanting two memories when we physically have one, and RAM being far too slow) are solved by using caches: smaller, but much faster memories integrated into the CPU itself. Caching is extremely important for pipelining – without it, stalls would be the norm, not the exception.
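The arithmetic behind that "roughly 40 cycles" claim, as a quick C calculation using the slide's round numbers:

/* How many CPU cycles one RAM access costs, given the slide's figures. */
#include <stdio.h>

int main(void) {
    double ram_access_ns = 12.0;             /* RAM access time */
    double clock_ghz     = 3.2;              /* CPU clock */
    double cycle_ns      = 1.0 / clock_ghz;  /* ~0.31 ns per cycle */

    printf("cycle length: %.3f ns\n", cycle_ns);
    printf("cycles per RAM access: %.1f\n", ram_access_ns / cycle_ns);  /* ~38.4, call it 40 */
    return 0;
}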
Branch prediction
Predicting the future, poorly We said previously that we can use some kind of statistical analysis to decide whether or not a conditional branch will be taken. So we looked at a bunch of programs, ran them, watched what happened, and... it turns out that on average, conditional branches are taken about 2/3 of the time. Based on this info, should we predict that branches will be taken, or not taken?
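It helps to see what each static policy's misprediction rate would be under that 2/3 figure; a quick C calculation:

/* Misprediction rates for the two static policies, given branches taken 2/3 of the time. */
#include <stdio.h>

int main(void) {
    double taken_rate = 2.0 / 3.0;
    printf("predict taken:     %.0f%% of branches mispredicted\n", (1.0 - taken_rate) * 100);
    printf("predict not taken: %.0f%% of branches mispredicted\n", taken_rate * 100);
    return 0;
}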
Kinds of branches Loops. Conditionals. Calculated jumps (switch statements). Virtual method calls. Depending on the program, each of these kinds will be more or less prevalent and behave differently. Depending on the inputs, they can behave very differently! We need something more robust.
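For reference, here's what each kind of branch looks like in C; a function pointer stands in for a virtual method call, since plain C has no virtual dispatch (in C++ it would be obj->method()).

/* One C-flavored example of each kind of branch from the slide. */
#include <stdio.h>

static void greet(void) { puts("hi"); }

int main(int argc, char **argv) {
    (void)argv;

    /* loop branch: taken every iteration except the last */
    for (int i = 0; i < 10; i++) { }

    /* plain conditional: taken or not depending on the data */
    if (argc > 1) puts("got an argument");

    /* calculated jump: a dense switch often compiles to a jump table */
    switch (argc & 3) {
        case 0:  puts("zero");  break;
        case 1:  puts("one");   break;
        case 2:  puts("two");   break;
        default: puts("three"); break;
    }

    /* indirect call through a pointer: the C stand-in for a virtual call */
    void (*fn)(void) = greet;
    fn();
    return 0;
}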