UNIT III - PIPELINE
Outline
  Pipelining exercise
  What are data hazards?
  Data hazard solutions
  Assignment 1 solution
Pipeline Exercise
Consider a nonpipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60 ns, 60 ns, 50 ns, and 50 ns.
Find the instruction latency on this machine.
How much time does it take to execute 100 instructions?
Solution:
  Instruction latency = 50 + 50 + 60 + 60 + 50 + 50 = 320 ns
  Time to execute 100 instructions = 100 × 320 = 32000 ns
Pipeline Exercise
Suppose we introduce pipelining on this machine. Assume that when introducing pipelining, the clock skew adds 5 ns of overhead to each execution stage.
What is the instruction latency on the pipelined machine?
How much time does it take to execute 100 instructions?
Solution:
  Note: all pipeline stages must have the same length.
  Length of a pipelined stage = MAX(lengths of unpipelined stages) + overhead = 60 + 5 = 65 ns
  Instruction latency = 6 stages × 65 ns = 390 ns (once the pipeline is full, one instruction completes every 65 ns)
  Time to execute 100 instructions = 65 × 6 + 65 × 99 = 390 + 6435 = 6825 ns
What is the speedup obtained from pipelining?
(Here we do not consider any stalls introduced by the different types of hazards, which we will look at in the next section.)
Solution:
  Speedup = Old Execution Time / New Execution Time = 32000 / 6825 ≈ 4.69
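The arithmetic of this exercise can be checked with a short Python sketch. The function and constant names are illustrative only, and the model assumes an ideal pipeline with no stalls: the first instruction needs one cycle per stage, and each later instruction completes one cycle after the previous one.

```python
STAGES_NS = [50, 50, 60, 60, 50, 50]  # stage lengths from the exercise
SKEW_NS = 5                           # clock-skew overhead per pipelined stage
N_INSTRUCTIONS = 100


def nonpipelined_time_ns(stage_ns, n):
    """Each instruction runs through every stage before the next one starts."""
    return sum(stage_ns) * n


def pipelined_cycle_ns(stage_ns, overhead_ns):
    """All pipeline stages are stretched to the slowest stage plus overhead."""
    return max(stage_ns) + overhead_ns


def pipelined_time_ns(stage_ns, overhead_ns, n):
    """Fill the pipeline once, then retire one instruction per cycle."""
    cycle = pipelined_cycle_ns(stage_ns, overhead_ns)
    return cycle * (len(stage_ns) + n - 1)


old_time = nonpipelined_time_ns(STAGES_NS, N_INSTRUCTIONS)
new_time = pipelined_time_ns(STAGES_NS, SKEW_NS, N_INSTRUCTIONS)
speedup = old_time / new_time
print(old_time, new_time, round(speedup, 2))  # prints: 32000 6825 4.69
```

The speedup stays below the stage count of 6 because the stages are unbalanced and the skew adds overhead to every cycle.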
Pipelining Hazards
Hazards prevent the next instruction from executing during its designated clock cycle.
  Structural hazards: caused by hardware resource conflicts
  Data hazards: arise when an instruction depends on the results of a previous instruction
  Control hazards: caused by a change of control (e.g. jump)
Data Hazards
Data hazards occur when data is used before it is ready.
The use of the result of the SUB instruction in the next three instructions causes a data hazard, since the register $2 is not written until after those instructions read it.
Data Hazards: Read After Write (RAW)
InstrJ tries to read an operand before InstrI writes it.
Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Execution order is: InstrI, InstrJ
  I: add r1,r2,r3
  J: sub r4,r1,r3
Data Hazards: Write After Read (WAR)
InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand.
Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
Can’t happen in the MIPS 5-stage pipeline because:
  All instructions take 5 stages, and
  Reads are always in stage 2, and
  Writes are always in stage 5
Execution order is: InstrI, InstrJ
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Data Hazards: Write After Write (WAW)
InstrJ tries to write an operand before InstrI writes it, which leaves the wrong result in the register (InstrI's, not InstrJ's).
Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
Can’t happen in the MIPS 5-stage pipeline because:
  All instructions take 5 stages, and
  Writes are always in stage 5
We will see WAR and WAW later in more complicated pipes.
Execution order is: InstrI, InstrJ
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
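The three dependence types can be told apart mechanically by comparing destination and source registers. The sketch below is a minimal illustration, not part of the original slides: the tuple layout `(dest, sources)` and the function name `classify` are assumptions. It reports register dependences between an earlier instruction I and a later instruction J; whether a dependence becomes an actual hazard depends on the pipeline, as noted above for the MIPS 5-stage case.

```python
def classify(instr_i, instr_j):
    """Return the dependences from earlier instr_i to later instr_j.

    Each instruction is a (dest, sources) pair, e.g. ("r1", ["r2", "r3"])
    for `add r1,r2,r3`. This layout is an assumption made for the sketch.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    found = []
    if dest_i in srcs_j:   # J reads what I writes
        found.append("RAW")
    if dest_j in srcs_i:   # J writes what I reads
        found.append("WAR")
    if dest_i == dest_j:   # both write the same register
        found.append("WAW")
    return found


# The instruction pairs from the three slides above
raw_pair = (("r1", ["r2", "r3"]), ("r4", ["r1", "r3"]))  # add, then sub
war_pair = (("r4", ["r1", "r3"]), ("r1", ["r2", "r3"]))  # sub, then add
waw_pair = (("r1", ["r4", "r3"]), ("r1", ["r2", "r3"]))  # sub, then add

for pair in (raw_pair, war_pair, waw_pair):
    print(classify(*pair))
```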
Data Hazards Solutions
Solutions for data hazards:
  Stalling: add bubbles
  Forwarding: connect the new value directly to the next stage
  Reordering
Data Hazard - Stalling
Data Hazards - Forwarding
Key idea: connect the new value directly to the next stage.
Still read s0, but ignore it in favor of the new result.
Problem: what about load instructions?
Data Hazards - Forwarding
A stall is still required for a load: the data is available only after MEM.
The MIPS architecture calls this a delayed load; initial implementations required the compiler to deal with it.
Forwarding
Key idea: connect data internally before it's stored.
Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB
How would you design the forwarding?
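One common textbook-style answer to the design question is to compare the source register of the instruction entering EX against the destination registers held in the EX/MEM and MEM/WB pipeline registers, preferring the more recent result. The Python sketch below models only that selection logic. The dictionary fields `reg_write` and `rd` and the function name `forward_select` are hypothetical names chosen for illustration, and the check against register 0 assumes the MIPS convention that register 0 is hard-wired to zero.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Pick where an ALU operand should come from.

    src_reg: register number read by the instruction now in EX.
    ex_mem, mem_wb: dicts with 'reg_write' (bool) and 'rd' (int) describing
    the instructions one and two stages ahead (hypothetical field names).
    """
    # Newest result first: the instruction immediately ahead, now in MEM.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    # Otherwise the instruction two ahead, now in WB.
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    # No pending writer: use the value read from the register file in ID.
    return "REGFILE"


older = {"reg_write": True, "rd": 2}
newer = {"reg_write": True, "rd": 2}
print(forward_select(2, newer, older))  # prints: EX/MEM
print(forward_select(3, newer, older))  # prints: REGFILE
```

Checking EX/MEM before MEM/WB matters when both earlier instructions write the same register, since the later write holds the value the program expects.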
Data Hazards - Reordering
Assuming we have data forwarding, what are the hazards in this code?
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
Reorder instructions to remove the hazard:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t0, 4($t1)
  sw $t2, 0($t1)
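The effect of the reordering can be checked with a small Python sketch that counts load-use pairs. It assumes a simplified model in which forwarding removes every stall except the one where an instruction uses the result of the load immediately before it. The tuple layout `(op, rt, offset, base)` and the function name `load_use_stalls` are illustrative assumptions.

```python
def load_use_stalls(program):
    """Count instructions that use the result of the immediately preceding lw.

    Each instruction is (op, rt, offset, base) with op in {"lw", "sw"}.
    lw writes rt and reads base; sw reads both rt and base.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev[0] != "lw":
            continue
        loaded_reg = prev[1]
        op, rt, _offset, base = curr
        sources = {base} if op == "lw" else {rt, base}
        if loaded_reg in sources:
            stalls += 1
    return stalls


original = [
    ("lw", "$t0", 0, "$t1"),
    ("lw", "$t2", 4, "$t1"),
    ("sw", "$t2", 0, "$t1"),  # uses $t2 right after it is loaded
    ("sw", "$t0", 4, "$t1"),
]
reordered = [
    ("lw", "$t0", 0, "$t1"),
    ("lw", "$t2", 4, "$t1"),
    ("sw", "$t0", 4, "$t1"),
    ("sw", "$t2", 0, "$t1"),
]
print(load_use_stalls(original), load_use_stalls(reordered))  # prints: 1 0
```

Swapping the two stores is safe here because they write different addresses, so their order does not change the final memory contents.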
Assignment 1 Solution
Question 1
You are trying to figure out whether to construct a new fabrication facility for your IBM Power5 chips. It costs $1.5 billion to build a new fabrication facility. The benefit of the new fabrication is that you predict that you will be able to sell 3 times as many chips at 2 times the price of the old chips. Assume the wafer has a diameter of 300 mm. In both the old and new fabrications, it costs $1000 to fabricate a wafer, and the packaging and testing cost is $20 per chip (after final testing). You were previously selling the chips for 40% more than their cost. Assume a = 4 and Wafer Yield = 1. The fabrication process parameters are shown in the following table.

Chip            Die size (mm^2)  Estimated defect rate (per cm^2)  Manufacturing feature size (nm)  Transistors (millions)
Old IBM Power5  389              .30                               130                              276
New IBM Power5  186              .70                               -
Question 1 Cost Formula Summary
Question 1
a) [2 Points] With the old fabrication, how many dies can we get from a wafer before we test individual dies? (Round the result to an integer number.)
b) [2 Points] With the old fabrication, what is the die yield?
c) [2 Points] What is the cost of the old Power5 chip?
Question 1
d) [2 Points] With the new fabrication, how many dies can we get from a wafer before we test individual dies? (Round the result to an integer number.)
e) [2 Points] With the new fabrication, what is the die yield?
f) [2 Points] What is the cost of the new Power5 chip?
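Parts a) to f) can be worked through with the Python sketch below. It assumes the widely used textbook cost model (dies per wafer from wafer and die area with an edge-loss term, and die yield = wafer yield × (1 + defect rate × die area / a)^(-a)); the formulas on the summary slide are not reproduced in this text, so treating them as this model is an assumption. Two further assumptions: the die count is rounded down to whole dies, and the chip cost is the cost per good die plus the $20 packaging and testing cost. All function names are illustrative.

```python
import math


def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Wafer area over die area, minus dies lost along the wafer edge."""
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return gross - edge_loss


def die_yield(defects_per_cm2, die_area_mm2, a=4, wafer_yield=1.0):
    """Fraction of good dies; the die area is converted from mm^2 to cm^2."""
    die_area_cm2 = die_area_mm2 / 100
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / a) ** (-a)


def chip_cost(wafer_cost, wafer_diameter_mm, die_area_mm2,
              defects_per_cm2, package_test_cost):
    """Cost per good die plus packaging and testing (assumed cost model)."""
    whole_dies = math.floor(dies_per_wafer(wafer_diameter_mm, die_area_mm2))
    good_dies = whole_dies * die_yield(defects_per_cm2, die_area_mm2)
    return wafer_cost / good_dies + package_test_cost


# Values from the question: 300 mm wafer, $1000 per wafer, $20 per chip
old_cost = chip_cost(1000, 300, 389, 0.30, 20)
new_cost = chip_cost(1000, 300, 186, 0.70, 20)
print(round(old_cost, 2), round(new_cost, 2))
```

Under these assumptions the smaller die more than offsets the higher defect rate, so the new chip comes out cheaper than the old one.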
Question 1
g) [2 Points] What was the selling price of each old Power5 chip?
h) [2 Points] What is the selling price of each new Power5 chip? What is the difference between the selling price and the cost of the new chip?
i) [4 Points] Suppose 50% of the difference between the selling price and the chip cost is your profit. If you sold 500,000 old Power5 chips per month, how long would it take for the accumulated profit to recoup the costs of the new fabrication facility?
Question 2
An application running on a 1 GHz pipelined processor has 55% load-store, 30% arithmetic, and 15% branch instructions. The individual CPIs of these instructions are 5, 4 and 4, respectively.
a) [6 Points] Determine the overall CPI of this program execution on the given processor.
Question 2
b) A new embedded version of the processor is being modified to operate at 600 MHz. In this new version, the individual CPIs of load-store and arithmetic instructions remain unchanged. However, the individual CPI of branch instructions is stretched to 6 clock cycles. A new compiler is also developed for the new processor, which eliminates 25% of load-store and 5% of arithmetic instructions for the given application (e.g., if there were 1 million load-store instructions before, the program compiled by the new compiler would have only 750,000).
b1) [6 Points] Determine the overall CPI of this program execution on the new processor together with the new compiler technology.
Question 2
b2) [8 Points] Determine the factor by which the application will run faster or slower on the new processor with the new compiler technology.
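Question 2 can be worked through with the Python sketch below. It normalizes the original program to one instruction, so each class count is simply its fraction of the original mix; the variable and function names are illustrative. Execution time is computed as instruction count × CPI / clock rate.

```python
# (fraction of original instructions, CPI) per instruction class
old_mix = {"load_store": (0.55, 5), "arithmetic": (0.30, 4), "branch": (0.15, 4)}

# New compiler removes 25% of load-store and 5% of arithmetic instructions;
# the branch CPI grows to 6 on the new processor.
new_mix = {
    "load_store": (0.55 * 0.75, 5),
    "arithmetic": (0.30 * 0.95, 4),
    "branch": (0.15, 6),
}


def total_cycles(mix):
    """Cycles per original instruction: sum of count x CPI over all classes."""
    return sum(count * cpi for count, cpi in mix.values())


def overall_cpi(mix):
    """Average CPI over the instructions that are actually executed."""
    executed = sum(count for count, _ in mix.values())
    return total_cycles(mix) / executed


cpi_old = overall_cpi(old_mix)             # part a)
cpi_new = overall_cpi(new_mix)             # part b1)
time_old = total_cycles(old_mix) / 1e9     # 1 GHz clock
time_new = total_cycles(new_mix) / 600e6   # 600 MHz clock
slowdown = time_new / time_old             # part b2)
print(round(cpi_old, 2), round(cpi_new, 2), round(slowdown, 2))
# prints: 4.55 4.84 1.5
```

Although the new compiler reduces the instruction count, the lower clock rate dominates, so under these figures the application runs about 1.5 times slower.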
Question 3
Suppose there is a program which takes 200 seconds to execute. Of this time, 30% is used for multiplication, 60% for memory access instructions and 10% for other tasks. Suppose your goal is to enhance the performance of a processor by 2 times and there are two ways of doing so: either make multiply instructions run faster than before, or make memory access instructions run faster than before, but not both.
a) [6 Points] How much shall we improve the memory access instructions in order to achieve the performance enhancement? Is it possible?
The new execution time must be 200 / 2 = 100 s. The other 40% of the time (80 s) is unchanged, so memory access must drop from 120 s to 20 s, a factor of 120 / 20 = 6.
Thus, it is possible if we improve the memory access instructions by 6 times.
Question 3
b) [6 Points] How much shall we improve the multiplication in order to achieve the performance enhancement? Is it possible?
The new execution time must be 200 / 2 = 100 s, but the 70% of the time not spent on multiplication already takes 140 s. Even if multiplication took no time at all, the speedup would be at most 200 / 140 ≈ 1.43.
Thus, it is impossible.
Question 3
c) [8 Points] Suppose we now improve the memory access by 5 times (Time(old execution) / Time(new execution) = 5). Unfortunately, the design makes multiplication slow down by 20% (i.e., Time(old execution) / Time(new execution) = 5/6). What will the overall speedup be?
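All three parts of Question 3 are applications of Amdahl's law, which the Python sketch below encodes. The function names are illustrative. `required_speedup` returns `None` when the target cannot be reached no matter how much the chosen part is improved.

```python
def overall_speedup(parts):
    """Amdahl's law for several parts.

    parts: list of (fraction_of_old_time, speedup_of_that_part);
    the fractions are expected to sum to 1.
    """
    return 1 / sum(fraction / speedup for fraction, speedup in parts)


def required_speedup(fraction, target):
    """Speedup needed on one part to reach an overall target, or None."""
    untouched = 1 - fraction
    time_budget = 1 / target - untouched  # share of old time left for the part
    if time_budget <= 0:
        return None  # the untouched part alone already exceeds the budget
    return fraction / time_budget


memory_needed = required_speedup(0.60, 2)    # part a)
multiply_needed = required_speedup(0.30, 2)  # part b)
# part c): multiplication slows down (speedup 5/6), memory speeds up 5x
combined = overall_speedup([(0.30, 5 / 6), (0.60, 5), (0.10, 1)])
print(round(memory_needed, 2), multiply_needed, round(combined, 2))
# prints: 6.0 None 1.72
```

For part c) the new time is 0.36 + 0.12 + 0.10 = 0.58 of the old time, i.e. 116 s instead of 200 s.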
Question 4
The following is a set of individual benchmark scores for each of the programs in the integer portion of the SPEC2000 benchmark.
a) [6 Points] Calculate the speedup after the improvement using the arithmetic mean.

Benchmark      Score before improvement   Score after improvement
164.gzip       10                         12
175.vpr        14                         16
176.gcc        23                         28
181.mcf        36                         40
186.crafty     9
197.parser     12                         120
252.eon        25
253.perlbmk    18                         21
254.gap        30
255.vortex     17
256.bzip2      7
300.twolf      38                         42

Speedup = 31.5/19.92 = 1.58
Question 4
b) [6 Points] Calculate the speedup after the improvement using the geometric mean.
Speedup = 24.42/17.39 = 1.40
Question 4
c) [8 Points] Note that there is a difference between the above two improvement ratios. What is the main reason?
The arithmetic mean is much more sensitive to large changes in one of the values in the set than the geometric mean. Most of the individual benchmarks see relatively small changes as we add the improvement to the architecture, but 197.parser improves by a factor of 10. This causes the arithmetic mean to increase by almost 60 percent, while the geometric mean increases by only 40 percent. This reduced sensitivity to individual values is why benchmarking experts prefer the geometric mean for averaging the results of multiple benchmarks: one very good or very bad result in the set of benchmarks has less of an impact on the overall score.
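The difference in sensitivity can be seen in a short Python sketch. The "before" scores below are the twelve baseline values, where the baseline for 197.parser is taken to be 12 (an inference from its stated tenfold improvement to 120 and from the baseline mean of 19.92). Because several "after" scores are not available in this text, the second list is a hypothetical scenario in which only 197.parser changes; it is not the data set from the question.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Baseline scores; the value 12 for 197.parser is inferred, see lead-in.
before = [10, 14, 23, 36, 9, 12, 25, 18, 30, 17, 7, 38]

# Hypothetical scenario: only 197.parser (index 5) improves, by 10x.
only_parser = list(before)
only_parser[5] = 120

am_ratio = arithmetic_mean(only_parser) / arithmetic_mean(before)
gm_ratio = geometric_mean(only_parser) / geometric_mean(before)
print(round(arithmetic_mean(before), 2), round(geometric_mean(before), 2))
# prints: 19.92 17.39
print(round(am_ratio, 2), round(gm_ratio, 2))
# prints: 1.45 1.21
```

In this scenario a single outlier raises the arithmetic mean by about 45 percent but the geometric mean by only about 21 percent, since a tenfold change in one of twelve values multiplies the geometric mean by 10^(1/12).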
Question 5
A given processor has 32 registers, uses 16-bit immediates, and has 142 instructions (corresponding to 142 operation codes) in its ISA.
  20 percent of the instructions take one source register and have one destination register
  30 percent of the instructions have two source registers and have one destination register
  25 percent of the instructions have one source register and have one destination register, and take an immediate source operand
  25 percent have one immediate source operand and one destination register
a) [10 Points] For each of the four types of instructions, how many bits are required? Assume that the ISA requires that all instructions be a multiple of 8 bits in length, and the operation codes (opcodes) are of fixed length (i.e., the ISA does not use shorter opcodes for some instructions and longer opcodes for others).
8 bits are required to encode 142 instructions (128 < 142 < 256). 5 bits are required to encode 32 registers. 16 bits are required to encode each immediate.
  One source register, one destination register: 8 + 5 + 5 = 18 bits, which rounds up to 24 bits.
  Two source registers, one destination register: 8 + 5 + 5 + 5 = 23 bits, which rounds up to 24 bits.
  One source register, one destination register, and an immediate: 8 + 5 + 5 + 16 = 34 bits, which rounds up to 40 bits.
  One immediate, one destination register: 8 + 16 + 5 = 29 bits, which rounds up to 32 bits.
Question 5
b) [10 Points] How much less memory does the program take up if a variable-length instruction set encoding is used as opposed to a fixed-length encoding?
Since the longest instruction type requires 40 bits to encode, the fixed-length encoding will have 40 bits per instruction.
For the variable-length encoding, the average number of bits per instruction is: 24 × 20% + 24 × 30% + 40 × 25% + 32 × 25% = 30 bits.
(40 − 30) / 40 = 25%. Therefore, the variable-length encoding requires 25 percent less space than the fixed-length encoding for this program.
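Both parts of Question 5 can be reproduced with the Python sketch below. The helper names and the tuple layout `(share, register_fields, immediate_fields)` are illustrative assumptions; the sketch takes the given percentages as the mix of instruction types in the program.

```python
def bits_to_encode(n_values):
    """Smallest number of bits that can distinguish n_values codes."""
    return (n_values - 1).bit_length()


def round_up_to_byte(bits):
    """Round a bit count up to the next multiple of 8."""
    return -(-bits // 8) * 8


OPCODE_BITS = bits_to_encode(142)  # 8
REG_BITS = bits_to_encode(32)      # 5
IMM_BITS = 16

# (share of instructions, register fields, immediate fields)
instruction_types = [
    (0.20, 2, 0),  # one source, one destination
    (0.30, 3, 0),  # two sources, one destination
    (0.25, 2, 1),  # one source, one destination, one immediate
    (0.25, 1, 1),  # one immediate, one destination
]

sizes = [
    round_up_to_byte(OPCODE_BITS + regs * REG_BITS + imms * IMM_BITS)
    for _, regs, imms in instruction_types
]
fixed_length = max(sizes)
average_variable = sum(share * size
                       for (share, _, _), size in zip(instruction_types, sizes))
saving = (fixed_length - average_variable) / fixed_length
print(sizes, fixed_length, round(average_variable, 2), round(saving, 2))
# prints: [24, 24, 40, 32] 40 30.0 0.25
```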