1 Lecture 9: Pipeline and Cache
Peng Liu (liupeng@zju.edu.cn)
2 Pipelining Review
What makes it easy:
– All instructions are the same length
– Just a few instruction formats
– Memory operands appear only in loads and stores
Hazards limit performance:
– Structural: need more HW resources
– Data: need forwarding, compiler scheduling
Data hazards must be handled carefully.
The MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load).
3 MIPS Pipeline Data / Control Paths (fast)
[Datapath figure: pipelined MIPS datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, register file, ALU, data memory, and the control signals RegWrite, MemWrite, MemRead, MemtoReg, RegDst, ALUOp, ALUSrc, Branch, and PCSrc.]
4 MIPS Pipeline Data / Control Paths (debug)
[Datapath figure: the same pipeline with the instruction and its control signals carried along in each pipeline register.]
5 MIPS Pipeline Control (pipelined debug)
[Datapath figure: the same pipeline showing the EX, MEM, and WB control fields staged through the ID/EX, EX/MEM, and MEM/WB registers.]
6 Control Hazards
When the flow of instruction addresses is not what the pipeline expects; incurred by change-of-flow instructions:
– Conditional branches (beq, bne)
– Unconditional branches (j)
Possible solutions:
– Stall
– Move the decision point earlier in the pipeline
– Predict
– Delay the decision (requires compiler support)
Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards.
7 The Last Type of Pipeline Hazard: Control Hazards
Just like a data hazard on the program counter (PC):
– Cannot fetch the next instruction if we don't know its address (PC)
Only worse: we use the PC at the very beginning of the pipeline
– Makes forwarding very difficult
Calculating the next PC for MIPS:
– Most instructions: PC + 4
  Can be calculated in the IF stage and forwarded to the next instruction
– Branches: must first compare the registers and calculate PC + 4 + offset
  Must fetch, decode, and execute the instruction
– Jumps: must calculate the new PC based on the offset
  Must fetch and decode the instruction
The next-PC calculations are sketched below.
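As a concrete sketch of these calculations (bit ranges follow the MIPS I encodings; imm16 and target26 denote the branch-immediate and jump-target instruction fields):
    Sequential:  nextPC = PC + 4
    Branch:      target = (PC + 4) + (SignExt(imm16) << 2)    used only if the register compare succeeds
    Jump:        target = (PC + 4)[31:28] ++ (target26 << 2)  where ++ is bit concatenation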
8 Datapath Branch and Jump Hardware
[Datapath figure: the pipelined datapath with a forwarding unit, before dedicated branch/jump hardware is added.]
9 Datapath Branch and Jump Hardware
[Datapath figure: the same datapath extended with a branch adder and shift-left-2 units, Branch/PCSrc selection, and jump-address formation from PC+4[31-28] and the shifted jump field.]
10 Jumps Incur One Stall
Jumps are not decoded until ID, so one stall is needed.
Fortunately, jumps are very infrequent – only 2% of the SPECint instruction mix.
[Pipeline diagram: a j instruction followed by a one-cycle bubble before the following lw and and instructions.]
11 Review: Branches Incur Three Stalls
Can fix the branch hazard by waiting – stall – but it affects throughput.
[Pipeline diagram: a beq resolved in MEM, forcing three bubbles before the following lw and and instructions.]
12 Moving Branch Decisions Earlier in the Pipe
Move the branch decision hardware back to the EX stage:
– Reduces the number of stall cycles to two
– Adds an and gate and a 2x1 mux to the EX timing path
Add hardware to compute the branch target address and evaluate the branch decision in the ID stage:
– Reduces the number of stall cycles to one (as with jumps)
– Computing the branch target address can be done in parallel with the RegFile read (done for all instructions – only used when needed)
– Comparing the registers can't be done until after the RegFile read, so comparing and updating the PC adds a comparator, an and gate, and a 3x1 mux to the ID timing path
– Need forwarding hardware in the ID stage
For longer pipelines, the decision point is later in the pipeline, incurring more stalls, so we need a better solution.
13 Early Branch Forwarding Issues
Bypass source operands from EX/MEM into the ID-stage compare:
    if (IDcontrol.Branch
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRs))
            ForwardC = 1
    if (IDcontrol.Branch
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRt))
            ForwardD = 1
This forwards the result of the second-previous instruction to either input of the compare.
MEM/WB "forwarding" is taken care of by the normal RegFile write-before-read operation.
If the instruction immediately before the branch produces one of the branch compare source operands, a stall is still required, since the EX-stage ALU operation occurs at the same time as the ID-stage branch compare.
14 Supporting ID Stage Branches
[Datapath figure: the pipeline with the branch adder and register comparator moved into ID, an IF.Flush control line, a hazard unit, and forwarding units feeding the ID-stage compare.]
15 Branch Prediction
Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome.
1. Predict not taken – always predict that branches will not be taken; continue to fetch from the sequential instruction stream; only when the branch is taken does the pipeline stall.
– If taken, flush the instructions in the pipeline after the branch:
  in IF, ID, and EX if the branch logic is in MEM – three stalls
  in IF if the branch logic is in ID – one stall
– Ensure that the flushed instructions haven't changed machine state – automatic in the MIPS pipeline, since state-changing operations (MemWrite, RegWrite) are at the tail end of the pipeline
– Restart the pipeline at the branch destination
16 Flushing with Misprediction (Not Taken)
[Pipeline diagram: 4 beq $1,$2,2 followed by the misfetched 8 sub $4,$1,$5.]
To flush the IF-stage instruction, add an IF.Flush control line that zeros the instruction field of the IF/ID pipeline register (transforming it into a noop).
17 Flushing with Misprediction (Not Taken)
[Pipeline diagram: 4 beq $1,$2,2 with 8 sub $4,$1,$5 flushed in IF, then execution resuming at the branch target with 16 and $6,$1,$7 and 20 or $8,$1,$9.]
18 Branch Prediction, cont'd
Resolve branch hazards by statically assuming a given outcome and proceeding.
2. Predict taken – always predict that branches will be taken.
– Predict taken always incurs one stall (with the branch destination hardware moved to the ID stage)
As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance.
With more hardware, it is possible to predict branch behavior dynamically during program execution.
3. Dynamic branch prediction – predict branches at run time using run-time information.
19 Dynamic Branch Prediction
A branch prediction buffer (a.k.a. branch history table, BHT) in the IF stage, addressed by the lower bits of the PC, contains a bit that records whether the branch was taken the last time it executed.
– The bit may predict incorrectly (it may come from a different branch with the same low-order PC bits, or it may be a wrong prediction for this branch), but that doesn't affect correctness, just performance
– If the prediction is wrong, flush the incorrect instructions in the pipeline, restart the pipeline with the right instructions, and invert the prediction bit
The BHT predicts whether a branch is taken, but does not tell where it is taken to!
– A branch target buffer (BTB) in the IF stage can cache the branch target address (or even the branch target instruction) so that the stall can be avoided
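As a concrete sketch of the indexing (assuming a BHT with 2^k entries and 4-byte MIPS instructions, so the two lowest PC bits are always zero):
    BHT index = PC[k+1:2]   i.e., (PC >> 2) modulo 2^k
Two branches whose addresses agree in these k bits share an entry – this is the aliasing mentioned above.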
20 1-bit Prediction Accuracy
A 1-bit predictor on a loop branch is incorrect twice per loop execution. For 10 times through the loop we have an 80% prediction accuracy for a branch that is taken 90% of the time.
– Assume predict_bit = 0 to start (indicating branch not taken) and loop control at the bottom of the loop code:
    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall-out instr
1. The first time through the loop, the predictor mispredicts, since the branch is taken back to the top of the loop; invert the prediction bit (predict_bit = 1)
2. As long as the branch is taken (looping), the prediction is correct
3. On exiting the loop, the predictor again mispredicts, since this time the branch is not taken, falling out of the loop; invert the prediction bit (predict_bit = 0)
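Working the arithmetic for this example: the branch is taken 9 of 10 times (90%), while the 1-bit predictor is wrong on the first and last iterations, so its accuracy is (10 - 2) / 10 = 80% – lower than the branch's own taken rate.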
21 2-bit Predictors
A 2-bit scheme can give 90% accuracy, since a prediction must be wrong twice before the prediction is changed.
[State diagram: four states – two Predict Taken and two Predict Not Taken – with Taken / Not taken transitions; two consecutive mispredictions are needed to flip the prediction.]
    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall-out instr
22 2-bit Predictors
For the loop above, the 2-bit predictor is right on the 1st iteration, right 9 times in all, and wrong only on loop fall-out: 90% accuracy.
[Same state diagram, annotated with the loop branch's transitions.]
23 Delayed Decision
First, move the branch decision hardware and target address calculation to the ID pipeline stage.
A delayed branch always executes the next sequential instruction – the branch takes effect after that next instruction.
– MIPS software moves an instruction that is not affected by the branch (a safe instruction) to immediately after the branch, thereby hiding the branch delay
As processors move to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed.
– Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
– Growth in available transistors has made dynamic approaches relatively cheaper
24 Scheduling Branch Delay Slots
A. From before the branch:
       add $1,$2,$3
       if $2=0 then
           <delay slot>
   becomes:
       if $2=0 then
           add $1,$2,$3
B. From the branch target:
       sub $4,$5,$6
       ...
       add $1,$2,$3
       if $1=0 then
           <delay slot>
   becomes:
       add $1,$2,$3
       if $1=0 then
           sub $4,$5,$6
C. From the fall through:
       add $1,$2,$3
       if $1=0 then
           <delay slot>
       sub $4,$5,$6
   becomes:
       add $1,$2,$3
       if $1=0 then
           sub $4,$5,$6
A is the best choice: it fills the delay slot and reduces instruction count (IC).
In B, the sub instruction may need to be copied, increasing IC.
In B and C, it must be okay to execute sub when the branch fails.
25 Compiler Optimizations & Data Hazards
Compilers rearrange code to try to avoid load-use stalls.
– Also known as: filling the load delay slot
Idea: find an independent instruction to space apart the load and its use.
– When successful, no stall is introduced dynamically
The independent instruction may come from before or after the load-use pair:
– No data dependence on the load-use instructions
– No control dependence on the load-use instructions
– No exception issues
When the slot can't be filled:
– Leave it empty; the hardware will introduce the stall
– In the original MIPS (no HW stalls), a nop was needed in the slot
26 Code Scheduling to Avoid Stalls
Reorder code to avoid using a load result in the next instruction.
C code: A = B + E; C = B + F;
A sketch of the scheduled MIPS code is shown below.
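A minimal sketch of how this example is typically compiled and scheduled – the register and offset assignments ($t0 as base, B/E/F at offsets 0/4/8, A/C at 12/16) are assumptions for illustration:

    # Naive order: each add uses a value loaded by the immediately preceding lw
    lw  $t1, 0($t0)      # $t1 = B
    lw  $t2, 4($t0)      # $t2 = E
    add $t3, $t1, $t2    # A = B + E  <- load-use stall after the lw of E
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)      # $t4 = F
    add $t5, $t1, $t4    # C = B + F  <- load-use stall after the lw of F
    sw  $t5, 16($t0)

    # Scheduled: hoist the lw of F so that no add immediately follows its load
    lw  $t1, 0($t0)      # $t1 = B
    lw  $t2, 4($t0)      # $t2 = E
    lw  $t4, 8($t0)      # $t4 = F
    add $t3, $t1, $t2    # A = B + E  (no stall)
    sw  $t3, 12($t0)
    add $t5, $t1, $t4    # C = B + F  (no stall)
    sw  $t5, 16($t0)

Both load-use stalls disappear, saving two cycles on the 5-stage pipeline.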
27 Control Hazard Solutions
Stall on branches:
– Wait until you know the answer (branch CPI becomes 3)
– Control inserted in the ID stage of the pipeline
Predict not taken:
– Assume the branch is not taken and continue with PC + 4
– If the branch is actually not taken, all is good
– Otherwise, nullify the misfetched instructions and restart from PC + 4 + offset
Predict taken:
– Assume the branch is taken
– More complex, because you also need to predict PC + 4 + offset
– Still need to nullify instructions if the assumption is wrong
Most machines do some type of prediction (taken, not taken).
28 Exceptions and Interrupts
Unexpected events requiring changes in the flow of control.
– Different ISAs use the terms differently
Exception:
– Arises within the CPU
– E.g., undefined opcode, overflow, syscall, ...
Interrupt:
– E.g., from an external I/O controller (network card)
Dealing with them without sacrificing performance is hard.
29 Handling Exceptions
Something bad happens to an instruction:
– Need to stop the machine
– Fix up the problem (with privileged code such as the operating system)
– Start the machine up again
MIPS: exceptions are managed by a System Control Coprocessor (CP0)
– Save the PC of the offending (or interrupted) instruction
  In MIPS: the Exception Program Counter (EPC)
– Save an indication of the problem
  In MIPS: the Cause register
  We'll assume 1 bit for the examples: 0 for undefined opcode, 1 for overflow
– Jump to handler code at a hardwired address (e.g., 8000_0180)
30 An Alternate Mechanism
Vectored interrupts:
– The handler address is determined by the cause
– Avoids the software overhead of selecting the right code to run based on the cause register
Example:
– Undefined opcode: C000_0000
– Overflow: C000_0020
– ...: C000_0040
Instructions at each vector either:
– Deal with the interrupt, or
– Jump to the real handler
31 Handler Actions
Read the cause, and transfer to the relevant handler:
– A software switch statement
– May be necessary even if vectored interrupts are available
Determine the action required.
If restartable:
– Take corrective action
– Use the EPC to return to the program
Otherwise:
– Terminate the program (e.g., segfault, ...)
– Report the error using the EPC, cause, ...
A handler-entry sketch in MIPS assembly follows.
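A minimal sketch of the handler entry, assuming the MIPS convention that CP0 register 13 is Cause and CP0 register 14 is EPC (the return idiom differs between MIPS I and MIPS32, as noted):

    # at the hardwired handler address (e.g., 8000_0180)
    mfc0  $k0, $13          # read the Cause register
    andi  $k0, $k0, 0x7c    # extract the ExcCode field (bits 6..2)
    # ... software switch on $k0: branch to the code for this cause ...
    # if restartable, take corrective action, then return:
    mfc0  $k1, $14          # read the EPC
    jr    $k1               # return to the interrupted program
    rfe                     # MIPS I: restore-from-exception, in the delay slot
                            # (MIPS32 would instead use a single eret)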
32 Precise Exceptions
Definition: an exception is precise when
– All previous instructions have completed
– The faulting instruction was not started
– None of the following instructions were started
No changes to the architectural state (registers, memory).
Why are precise exceptions desirable for OS developers?
With a single-cycle machine, precise exceptions are easy – why?
33 Exceptions in a Pipeline
Another form of control hazard.
Consider overflow on an add in the EX stage: add $1, $2, $1
– Prevent $1 from being clobbered by the add
– Complete the previous instructions
– Nullify the subsequent instructions
– Set the Cause and EPC register values
– Transfer control to the handler
Similar to a mispredicted branch:
– Uses much of the same hardware
34 Nullifying Instructions
Nullify = turn an instruction into a nop (or pipeline bubble):
– Reset its RegWrite and MemWrite signals
  Does not affect the state of the system
– Reset its branch and jump signals
  Does not cause unexpected flow control
– Mark that it should not raise any exceptions of its own
  Does not cause unexpected exceptions
– Let it flow down the pipeline
35 Dealing with Multiple Exceptions
Pipelining overlaps multiple instructions:
– Could have multiple exceptions at once
Simple approach: deal with the exception from the earliest instruction.
– Flush subsequent instructions
– Necessary for "precise" exceptions
In more complex pipelines:
– Multiple instructions issued per cycle
– Out-of-order completion
– Maintaining precise exceptions is difficult!
36 Imprecise Exceptions
Actually available in several processors.
Just stop the pipeline and save the pipeline state:
– Including the exception causes
Let the handler work out:
– Which instruction(s) had exceptions
– Which to complete or flush
  May require "manual" completion
Simplifies the hardware, but requires more complex handler software.
Not feasible for complex multiple-issue, out-of-order pipelines.
37 Beyond the 5-Stage Pipeline
Convert transistors into performance. Use transistors to:
– Exploit parallelism
– Or create it (speculate)
Processor generations:
– Simple machine: reuses hardware
– Pipelined: separate hardware for each stage
– Superscalar: multi-ported memories, more function units
– Out-of-order: mega-ports, complex scheduling
– Speculation
– Multi-core processors
Each design has more logic to accomplish the same task (but faster).
38 Memory System Outline
Memory system basics:
– Memory technologies: SRAM, DRAM, disk
Principle of locality:
– Spatial and temporal locality
Memory hierarchies:
– Exploiting locality
39 Random Access Memory (RAM)
Dynamic Random Access Memory (DRAM):
– High density, low power, cheap, but slow
– Dynamic: data must be "refreshed" regularly ("leaky buckets")
– Contents are lost when power is lost
Static Random Access Memory (SRAM):
– Lower density (about 1/10 the density of DRAM), higher cost
– Static: data is held without refresh as long as power is on
– Fast access time, often 2 to 10 times faster than DRAM
Flash memory:
– Holds contents without power
– Data written in blocks, generally slow
– Very cheap
40 Programs Have Locality
Principle of locality:
– Programs access a relatively small portion of the address space at any given time
– We can tell what memory locations a program will reference in the future by looking at what it referenced recently in the past
Two types of locality:
– Temporal locality – if an item has been referenced recently, it will tend to be referenced again soon
– Spatial locality – if an item has been referenced recently, nearby items will tend to be referenced soon
  "Nearby" refers to memory addresses
41 Locality Examples
Spatial locality:
– Likely to reference data near recent references
– Example: for (i=0; i<N; i++) a[i] = ...;
Temporal locality:
– Likely to reference the same data that was referenced recently
– Example: for (i=0; i<N; i++) a[i] = f(a[i-1]);
42 Taking Advantage of Locality
Memory hierarchy:
– Store everything on disk (or Flash in some systems)
– Copy recently accessed (and nearby) data to a smaller DRAM memory
  DRAM is called main memory
– Copy more recently accessed (and nearby) data to a smaller SRAM memory
  Cache memory, attached or close to the CPU
43 Typical Memory Hierarchy: Everything Is a Cache for Something Else
    Access time         Capacity   Managed by
    1 cycle             ~500B      software/compiler   (registers)
    1-3 cycles          ~64KB      hardware            (L1 cache)
    5-10 cycles         1-10MB     hardware            (L2 cache)
    ~100 cycles         ~10GB      software/OS         (main memory)
    10^6-10^7 cycles    ~100GB     software/OS         (disk)
44 Caches
A cache is an interim storage component:
– Functions as a buffer for larger, slower storage components
Exploits the principles of locality:
– Provide as much inexpensive storage space as possible
– Offer access speed equivalent to the fastest memory, for data in the cache
The key is to have the right data cached.
Computer systems often use multiple caches.
Cache ideas are not limited to hardware designers:
– Example: Web caches are widely used on the Internet
45 Memory Hierarchy Levels
Block (aka line): the unit of copying
– May be multiple words
If the accessed data is present in the upper level:
– Hit: access satisfied by the upper level
  Hit ratio: hits/accesses
If the accessed data is absent:
– Miss: block copied from the lower level
  Time taken: miss penalty
  Miss ratio: misses/accesses = 1 - hit ratio
– Then the data is supplied to the CPU from the upper level
46 Average Memory Access Times
Need to define an average access time, since some accesses will be fast and some slow:
    Access time = hit time + miss rate x miss penalty
The hope is that both the hit time and the miss rate are low, since the miss penalty is so much larger than the hit time.
Average Memory Access Time (AMAT):
– The formula can be applied to any level of the hierarchy (the access time for that level)
– It can be generalized for the entire hierarchy (the average access time that the processor sees for a reference)
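A worked example with assumed numbers: for a hit time of 1 cycle, a miss rate of 5%, and a miss penalty of 100 cycles,
    AMAT = 1 + 0.05 x 100 = 6 cycles
so even a 5% miss rate makes the average access six times slower than a hit.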
47 How the Processor Handles a Miss
Assume that a cache access takes 1 cycle:
– A hit is great; the basic pipeline is fine
– CPI penalty = miss rate x miss penalty
A miss stalls the pipeline (for an instruction or data miss):
– Stall the pipeline (it doesn't have the data it needs)
– Send the address that missed to the memory
– Instruct main memory to perform a read and wait
– When the access completes, return the data to the processor
– Restart the instruction
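Plugging assumed numbers into the CPI penalty above: with a base CPI of 1, a 2% instruction-miss rate, loads/stores making up 36% of instructions with a 4% data-miss rate, and a 100-cycle miss penalty,
    CPI = 1 + 0.02 x 100 + 0.36 x 0.04 x 100 = 4.44
so memory stalls would dominate execution time.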
48 How to Build a Cache?
Big question: locating data in the cache.
– We need to map a large address space into a small memory. How?
– Fully associative lookup can be built in hardware, but it is complex
– Need a simple but effective solution
– Two common techniques: direct mapped and set associative
Further questions:
– Block size (crucial for spatial locality)
– Replacement policy (crucial for temporal locality)
– Write policy (writes are always more complex)
49 Terminology
Block – the minimum unit of information transferred between levels of the hierarchy
– Block addressing varies by technology at each level
– Blocks are moved one level at a time
Hit – the data appears in a block in the lower-numbered level
Hit rate – the percentage of accesses found
Hit time – the time to access the lower-numbered level
– Hit time = cache access time + time to determine hit/miss
Miss – the data was not in the lower-numbered level and had to be fetched from a higher-numbered level
Miss rate – the percentage of misses (1 - hit rate)
Miss penalty – the overhead of getting data from a higher-numbered level
– Miss penalty = higher-level access time + time to deliver to the lower level + cache replacement / forward-to-processor time
– The miss penalty is usually much larger than the hit time
50 Direct Mapped Cache
The location in the cache is determined by the (main) memory address.
Direct mapped: only one choice:
    (Block address) modulo (#Blocks in cache)
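For example, with the 8-block cache used on the next slides, block address 22 (10110 in binary) maps to 22 modulo 8 = 6 (110 in binary). Because the number of blocks is a power of two, the modulo is simply the low-order log2(#Blocks) bits of the block address.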
51 Tags and Valid Bits
How do we know which particular block is stored in a cache location?
– Store the block address as well as the data
– Actually, only the high-order bits are needed
– Called the tag
What if there is no data in a location?
– Valid bit: 1 = present, 0 = not present
– Initially 0
52 Cache Example
8 blocks, 1 word/block, direct mapped. Initial state:
    Index  V  Tag  Data
    000    N
    001    N
    010    N
    011    N
    100    N
    101    N
    110    N
    111    N
53 Cache Example
    Word addr  Binary addr  Hit/Miss  Cache block
    22         10 110       Miss      110
Compulsory/cold miss.
    Index  V  Tag  Data
    000    N
    001    N
    010    N
    011    N
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N
54 Cache Example
    Word addr  Binary addr  Hit/Miss  Cache block
    26         11 010       Miss      010
Compulsory/cold miss.
    Index  V  Tag  Data
    000    N
    001    N
    010    Y  11   Mem[11010]
    011    N
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N
55 Cache Example
    Word addr  Binary addr  Hit/Miss  Cache block
    22         10 110       Hit       110
    26         11 010       Hit       010
The cache state is unchanged:
    Index  V  Tag  Data
    000    N
    001    N
    010    Y  11   Mem[11010]
    011    N
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N
56 Cache Example
    Word addr  Binary addr  Hit/Miss  Cache block
    16         10 000       Miss      000
    3          00 011       Miss      011
    16         10 000       Hit       000
    Index  V  Tag  Data
    000    Y  10   Mem[10000]
    001    N
    010    Y  11   Mem[11010]
    011    Y  00   Mem[00011]
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N
57 Cache Example
    Word addr  Binary addr  Hit/Miss  Cache block
    18         10 010       Miss      010
Replacement: index 010 previously held Mem[11010] (tag 11); it is replaced by Mem[10010] (tag 10).
    Index  V  Tag  Data
    000    Y  10   Mem[10000]
    001    N
    010    Y  10   Mem[10010]
    011    Y  00   Mem[00011]
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N
58 Cache Organization & Access
Assumptions:
– 32-bit address
– 4 Kbyte cache
– 1024 blocks, 1 word/block
Steps:
– Use the index to read V and the tag from the cache
– Compare the stored tag with the tag from the address
– If they match, return the data and a hit signal
– Otherwise, return a miss
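Working out the address split for these assumptions: 1 word/block gives 2 byte-offset bits, 1024 blocks give 10 index bits, and the tag is the remaining 32 - 10 - 2 = 20 bits. The tag/valid overhead is therefore 1024 x (20 + 1) bits on top of the 4 KB of data.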
59 Larger Block Size
Motivation: exploit spatial locality and amortize overheads.
This example: 64 blocks, 16 bytes/block.
– To what block number does address 1200 map?
– Block address = 1200 / 16 = 75
– Block number = 75 modulo 64 = 11
What is the impact of larger blocks on tag/index size?
What is the impact of larger blocks on the cache overhead?
– Overhead = tags and valid bits
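In bit terms (assuming 32-bit addresses): 16-byte blocks use 4 offset bits, 64 blocks use 6 index bits, and the tag is the remaining 22 bits. Address 1200 = 100 1011 0000 in binary: offset 0000, index 001011 (= 11), matching the modulo arithmetic above, with the remaining high bits (here just 1) forming the tag.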
60 Cache Block Example
Assume a 2^n byte direct mapped cache with 2^m byte blocks:
– Byte select – the lower m bits of the memory address
– Cache index – the next (n - m) bits of the memory address
– Cache tag – the upper (32 - n) bits of the memory address
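Equivalently, as arithmetic on the address (integer division; these are just the three bit fields read off the address):
    byte select = addr modulo 2^m
    cache index = (addr / 2^m) modulo 2^(n-m)
    cache tag   = addr / 2^n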
61 Block Sizes
Larger block sizes take advantage of spatial locality, but:
– They incur a larger miss penalty, since it takes longer to transfer the block into the cache
– Blocks that are too large can also raise the miss rate (fewer blocks fit in the same capacity), and with it the average access time
There is a tradeoff in selecting the block size:
    Average Access Time = Hit Time + Miss Rate x Miss Penalty
62 Direct Mapped Problems: Conflict Misses
Two blocks that are used concurrently may map to the same index:
– Only one can fit in the cache, regardless of cache size
– No flexibility to place the 2nd block elsewhere
Thrashing:
– If the accesses alternate, each block will replace the other before it is reused
– No benefit from caching
Conflicts and thrashing can happen quite often.
63 Fully Associative Cache
The opposite extreme: there is no cache index to hash.
– Use any available entry to store memory blocks
– No conflict misses, only capacity misses
– Must compare the cache tags of all entries to find the desired one
64 Acknowledgements
These slides contain material from the following courses:
– UCB CS152
– Stanford EE108B
65 Break / Homework
Homework 5: problems 5.3, 5.4, 5.6
Reading: Chapters 4.1-4.11 and 5.1-5.3