UNIT III - PIPELINE


Outline
Pipelining exercise
What are data hazards?
Data hazard solutions
Assignment 1 solution

Pipeline Exercise
Consider a nonpipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60 ns, 60 ns, 50 ns, and 50 ns. Find the instruction latency on this machine. How much time does it take to execute 100 instructions?
Solution:
Instruction latency = 50 + 50 + 60 + 60 + 50 + 50 = 320 ns
Time to execute 100 instructions = 100 × 320 = 32000 ns

Suppose we introduce pipelining on this machine. Assume that when introducing pipelining, the clock skew adds 5 ns of overhead to each execution stage. What is the instruction latency on the pipelined machine? How much time does it take to execute 100 instructions?
Solution:
Note: all pipe stages must have the same length.
Length of a pipelined stage = MAX(lengths of unpipelined stages) + overhead = 60 + 5 = 65 ns
Instruction latency = 6 × 65 = 390 ns (once the pipeline is full, one instruction completes every 65 ns)
Time to execute 100 instructions = 65 × 6 + 65 × 99 = 390 + 6435 = 6825 ns

What is the speedup obtained from pipelining? (Here we do not consider any stalls introduced by the different types of hazards, which we will look at in the next section.)
Solution:
Speedup = Old Execution Time / New Execution Time = 32000 / 6825 = 4.69
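The arithmetic of this exercise can be checked with a short script. This is a minimal sketch: the function names are made up for illustration, and it assumes an ideal pipeline with no stalls, as the exercise does.

```python
# Stage lengths (ns) and clock-skew overhead (ns) are taken from the exercise.
STAGE_NS = [50, 50, 60, 60, 50, 50]
OVERHEAD_NS = 5


def unpipelined_time(stages, n):
    """Each instruction runs through every stage before the next one starts."""
    return n * sum(stages)


def pipelined_time(stages, overhead, n):
    """The first instruction fills the pipe; each later one finishes one cycle apart."""
    cycle = max(stages) + overhead
    return (len(stages) + n - 1) * cycle


old = unpipelined_time(STAGE_NS, 100)
new = pipelined_time(STAGE_NS, OVERHEAD_NS, 100)
print(old, new, round(old / new, 2))  # 32000 6825 4.69
```

As the instruction count grows, the speedup approaches 320 / 65, about 4.92, rather than 6, because the stages are unbalanced and each one carries the 5 ns overhead.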

Pipelining Hazards
Hazards prevent the next instruction from executing during its designated clock cycle.
Structural hazards: caused by hardware resource conflicts.
Data hazards: arise when an instruction depends on the result of a previous instruction.
Control hazards: caused by a change of control flow (e.g. a jump).

Data Hazards
Data hazards occur when data is used before it is ready.
The use of the result of the SUB instruction in the next three instructions causes a data hazard, since register $2 is not written until after those instructions read it.

Data Hazards: Read After Write (RAW)
InstrJ tries to read an operand before InstrI writes it.
Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Execution order is: InstrI, InstrJ
I: add r1,r2,r3
J: sub r4,r1,r3

Data Hazards: Write After Read (WAR)
InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand.
Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
Can’t happen in the MIPS 5-stage pipeline because all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5.
Execution order is: InstrI, InstrJ
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Data Hazards: Write After Write (WAW)
InstrJ tries to write an operand before InstrI writes it, which leaves the wrong result in the register (InstrI's value instead of InstrJ's).
Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
Can’t happen in the MIPS 5-stage pipeline because all instructions take 5 stages and writes are always in stage 5.
We will see WAR and WAW later in more complicated pipes.
Execution order is: InstrI, InstrJ
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
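The three cases can be told apart mechanically by comparing the destination and source registers of two instructions. The sketch below is illustrative only: `parse` and `hazards` are hypothetical helpers, and they assume the three-operand `op rd,rs,rt` form used on these slides. They report dependences; whether a dependence becomes an actual hazard depends on the pipeline.

```python
def parse(instr):
    """Split 'add r1,r2,r3' into (destination, [sources])."""
    _op, operands = instr.split(maxsplit=1)
    dest, *sources = [r.strip() for r in operands.split(",")]
    return dest, sources


def hazards(first, second):
    """Name the dependences that `second` has on the earlier `first`."""
    d1, s1 = parse(first)
    d2, s2 = parse(second)
    found = []
    if d1 in s2:
        found.append("RAW")  # second reads what first writes
    if d2 in s1:
        found.append("WAR")  # second writes what first reads
    if d1 == d2:
        found.append("WAW")  # both write the same register
    return found


print(hazards("add r1,r2,r3", "sub r4,r1,r3"))  # ['RAW']
print(hazards("sub r4,r1,r3", "add r1,r2,r3"))  # ['WAR']
print(hazards("sub r1,r4,r3", "add r1,r2,r3"))  # ['WAW']
```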

Data Hazards Solutions
Stalling: add bubbles.
Forwarding: connect the new value directly to the next stage.
Reordering.

Data Hazard - Stalling

Data Hazards - Forwarding
Key idea: connect the new value directly to the next stage.
Still read s0, but ignore it in favor of the new result.
Problem: what about load instructions?

Data Hazards - Forwarding
A stall is still required for a load, because the data is available only after the MEM stage.
The MIPS architecture calls this a delayed load; initial implementations required the compiler to deal with it.

Forwarding
Key idea: connect data internally before it's stored.
Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB
How would you design the forwarding?
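One common answer to the design question is a forwarding unit that compares the source register of the instruction in EX against the destination registers held in the EX/MEM and MEM/WB pipeline registers. The sketch below models that selection logic under stated assumptions: `forward_select` and the dictionary fields `reg_write` and `rd` are hypothetical names, and this is not necessarily the design drawn on the original slide.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where an ALU operand comes from: 'EX/MEM', 'MEM/WB', or 'REGFILE'.

    ex_mem and mem_wb are dicts such as {"reg_write": True, "rd": 2}
    describing the instruction held in that pipeline register.
    """
    # The most recent producer (EX/MEM) is checked before the older one (MEM/WB).
    for name, latch in (("EX/MEM", ex_mem), ("MEM/WB", mem_wb)):
        # Register 0 is hard-wired to zero in MIPS, so it is never forwarded.
        if latch["reg_write"] and latch["rd"] != 0 and latch["rd"] == src_reg:
            return name
    return "REGFILE"


idle = {"reg_write": False, "rd": 0}
print(forward_select(2, {"reg_write": True, "rd": 2}, idle))  # EX/MEM
print(forward_select(2, idle, {"reg_write": True, "rd": 2}))  # MEM/WB
print(forward_select(2, idle, idle))                          # REGFILE
```

Checking EX/MEM first matters when both pipeline registers hold a write to the same register: the younger instruction's value is the one the program order requires.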

Data Hazards - Reordering
Assuming we have data forwarding, what are the hazards in this code?
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Reorder instructions to remove the hazard:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
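The effect of the reordering can be checked by counting load-use pairs, that is, a `lw` whose destination register is read by the very next instruction. This is a rough sketch: `registers`, `reads`, and `load_use_stalls` are hypothetical helpers, and it assumes, as the slide does, that a loaded value cannot be forwarded in time to the instruction immediately after the load.

```python
import re


def registers(instr):
    """Return every $register named in a MIPS instruction, in order."""
    return re.findall(r"\$\w+", instr)


def reads(instr):
    """Registers an instruction reads."""
    regs = registers(instr)
    # sw reads both its data register and its base register; the other
    # instructions here write their first operand and read the rest.
    return regs if instr.split()[0] == "sw" else regs[1:]


def load_use_stalls(program):
    """Count lw instructions whose result is read by the very next instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev.split()[0] == "lw" and registers(prev)[0] in reads(curr):
            stalls += 1
    return stalls


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(load_use_stalls(original), load_use_stalls(reordered))  # 1 0
```

Swapping the two stores is safe here because they write to different addresses, 0($t1) and 4($t1).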

Assignment 1 Solution

Question 1
You are trying to figure out whether to construct a new fabrication facility for your IBM Power5 chips. It costs $1.5 billion to build a new fabrication facility. The benefit of the new fabrication facility is that you predict you will be able to sell 3 times as many chips at 2 times the price of the old chips. Assume the wafer has a diameter of 300 mm. In both the old and new fabrication processes, it costs $1000 to fabricate a wafer, and the packaging and testing cost is $20 per chip (after final testing). You were previously selling the chips for 40% more than their cost. Assume α = 4 and wafer yield = 1. The fabrication process parameters are shown in the following table.

Chip             Die size (mm^2)   Estimated defect rate (per cm^2)   Manufacturing feature size (nm)   Transistors (millions)
Old IBM Power5   389               0.30                               130                               276
New IBM Power5   186               0.70                               -

Question 1 Cost Formula Summary
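The formulas on this slide did not survive in the transcript. The sketch below assumes they are the standard integrated-circuit cost model from Hennessy and Patterson (dies per wafer, die yield with parameter α, and cost per good die), which matches the parameters given in Question 1. The function names are illustrative, and the chip cost assumes a final test yield of 1.

```python
import math


def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Wafer area over die area, minus the dies lost along the round edge."""
    radius = wafer_diameter_mm / 2
    return (math.pi * radius ** 2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))


def die_yield(defects_per_cm2, die_area_mm2, alpha=4, wafer_yield=1.0):
    """Fraction of dies that work; die area is converted from mm^2 to cm^2."""
    die_area_cm2 = die_area_mm2 / 100
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** -alpha


def chip_cost(wafer_cost, dies, yield_fraction, package_test_cost):
    """Cost of one good die plus its packaging and testing cost."""
    return wafer_cost / (dies * yield_fraction) + package_test_cost


# Old Power5: 300 mm wafer, 389 mm^2 die, 0.30 defects per cm^2.
old_dies = round(dies_per_wafer(300, 389))
old_yield = die_yield(0.30, 389)
print(old_dies, round(old_yield, 2), round(chip_cost(1000, old_dies, old_yield, 20), 2))
```

Calling the same functions with a die area of 186 and a defect rate of 0.70 gives the corresponding figures for the new chip.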

Question 1
a) [2 Points] With the old fabrication, how many dies can we get from a wafer before we test individual dies? (Round the result to an integer number.)
b) [2 Points] With the old fabrication, what is the die yield?
c) [2 Points] What is the cost of the old Power5 chip?
d) [2 Points] With the new fabrication, how many dies can we get from a wafer before we test individual dies? (Round the result to an integer number.)
e) [2 Points] With the new fabrication, what is the die yield?
f) [2 Points] What is the cost of the new Power5 chip?
g) [2 Points] What was the selling price of each old Power5 chip?
h) [2 Points] What is the selling price of each new Power5 chip? What is the difference between the selling price and the cost of the new chip?
i) [4 Points] Suppose 50% of the difference between the selling price and the chip cost is your profit. If you sold 500,000 old Power5 chips per month, how long would it take for the accumulated profit to recoup the costs of the new fabrication facility?

Question 2 An application running on a 1GHz pipelined processor has 55% load-store, 30% arithmetic, and 15% branch instructions. The individual CPIs of these instructions are 5, 4 and 4, respectively. a) [6 Points] Determine the overall CPI of this program execution on the given processor.

Question 2 b) A new embedded version of the processor is being modified to operate at 600 MHz. In this new version, the individual CPIs of load-store and arithmetic instructions remain unchanged. However, the individual CPI of branch instructions is stretched to 6 clock cycles. A new compiler is also developed for the new processor, which eliminates 25% of load-store and 5% of arithmetic instructions for the given application (e.g., if there were 1 million load-store instructions before, the program compiled by the new compiler would have only 750000). b1) [6 Points] Determine the overall CPI of this program execution on the new processor together with the new compiler technology.

Question 2 b2) [8 Points] Determine the factor by which the application will run faster or slower on the new processor with the new compiler technology.
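The transcript gives no solution for Question 2, so the following is only a sketch of one way to set up the computation. It normalizes everything to one instruction of the original program; the dictionary and variable names are illustrative.

```python
# Instruction mix (fractions of the original program) and per-class CPIs from Question 2.
MIX = {"load_store": 0.55, "arithmetic": 0.30, "branch": 0.15}
OLD_CPI = {"load_store": 5, "arithmetic": 4, "branch": 4}
NEW_CPI = {"load_store": 5, "arithmetic": 4, "branch": 6}
# Fraction of each instruction class that the new compiler keeps.
KEPT = {"load_store": 0.75, "arithmetic": 0.95, "branch": 1.0}


def cycles_and_count(mix, cpi):
    """Total cycles and instruction count, both per original instruction."""
    cycles = sum(mix[k] * cpi[k] for k in mix)
    return cycles, sum(mix.values())


old_cycles, old_count = cycles_and_count(MIX, OLD_CPI)
new_mix = {k: MIX[k] * KEPT[k] for k in MIX}
new_cycles, new_count = cycles_and_count(new_mix, NEW_CPI)

old_cpi = old_cycles / old_count  # part a
new_cpi = new_cycles / new_count  # part b1
# Execution time = cycles / clock rate: 1 GHz before, 600 MHz after.
slowdown = (new_cycles / 600e6) / (old_cycles / 1e9)  # part b2
print(round(old_cpi, 2), round(new_cpi, 2), round(slowdown, 2))
```

Note that the new CPI divides by the reduced instruction count, while the execution-time ratio uses total cycles, so the smaller program and the slower clock are both accounted for.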

Question 3
Suppose there is a program which takes 200 seconds to execute. Of this time, 30% is used for multiplication, 60% for memory access instructions, and 10% for other tasks. Suppose your goal is to make the processor 2 times faster and there are two ways of doing so: either make multiply instructions run faster than before, or make memory access instructions run faster than before, but not both.
a) [6 Points] By how much must we improve the memory access instructions to achieve this performance enhancement? Is it possible?
To halve the execution time, the new time must be 50% of the old time. The other 40% of the time is unchanged, so memory access must shrink from 60% to 10% of the old time. Thus, it is possible if we make the memory access instructions 6 times faster.

Question 3 b) [6 Points] By how much must we improve multiplication to achieve the same performance enhancement? Is it possible?
Even if multiplication took no time at all, the remaining 70% of the old time would still exceed the 50% target. Thus, it is impossible.

Question 3 c) [8 Points] Suppose we now improve the memory access by 5 times (Time(old execution) / Time(new execution) = 5). Unfortunately, the design makes multiplication slow down by 20% (i.e., Time(old execution) / Time(new execution) = 5/6). What will the overall speedup be?
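All three parts of Question 3 are applications of Amdahl's Law. The sketch below normalizes the old execution time to 1; the function names are illustrative and not taken from the assignment.

```python
def overall_speedup(fractions, speedups):
    """Amdahl's Law: old time is 1, new time is the sum of each fraction / its speedup."""
    return 1 / sum(f / s for f, s in zip(fractions, speedups))


def required_speedup(fraction, target):
    """Speedup needed on one part to reach `target` overall, or None if unreachable."""
    remaining = 1 / target - (1 - fraction)  # time budget left for the improved part
    return fraction / remaining if remaining > 0 else None


MULTIPLY, MEMORY, OTHER = 0.30, 0.60, 0.10
print(required_speedup(MEMORY, 2))    # part a
print(required_speedup(MULTIPLY, 2))  # part b
print(overall_speedup([MULTIPLY, MEMORY, OTHER], [5 / 6, 5, 1]))  # part c
```

For part c the new time is 0.30 × 6/5 + 0.60 / 5 + 0.10 = 0.58 of the old time, so the overall speedup is 1 / 0.58, roughly 1.72.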

Question 4
The following is a set of individual benchmark scores for each of the programs in the integer portion of the SPEC2000 benchmark.
a) [6 Points] Calculate the speedup after the improvement using the arithmetic mean.

Benchmark     Score before improvement   Score after improvement
164.gzip      10                         12
175.vpr       14                         16
176.gcc       23                         28
181.mcf       36                         40
186.crafty    9
197.parser    12                         120
252.eon       25
253.perlbmk   18                         21
254.gap       30
255.vortex    17
256.bzip2     7
300.twolf     38                         42

Speedup = 31.5/19.92 = 1.58

Question 4 b) [6 Points] Calculate the speedup after the improvement using the geometric mean, for the same benchmark scores. Speedup = 24.42/17.39 = 1.40

Question 4 c) [8 Points] Note that there is a difference between the above two improvement ratios. What is the main reason?
The arithmetic mean is much more sensitive to large changes in one of the values in the set than the geometric mean. Most of the individual benchmarks see relatively small changes as we add the improvement to the architecture, but 197.parser improves by a factor of 10. This causes the arithmetic mean to increase by almost 60 percent, while the geometric mean increases by only 40 percent. This reduced sensitivity to individual values is why benchmarking experts prefer the geometric mean for averaging the results of multiple benchmarks, since one very good or very bad result in the set of benchmarks has less of an impact on the overall score.
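The difference in sensitivity can be seen numerically. The sketch below uses only the seven benchmarks whose before and after scores are both recoverable from the transcript, so its results will not match the 1.58 and 1.40 computed over all twelve. The before score of 12 for 197.parser is an inference from the stated factor-of-10 improvement, not a value printed in the transcript.

```python
import math

# (before, after) scores for the benchmarks with both values available.
SCORES = {
    "164.gzip": (10, 12),
    "175.vpr": (14, 16),
    "176.gcc": (23, 28),
    "181.mcf": (36, 40),
    "197.parser": (12, 120),  # before score inferred from the 10x improvement
    "253.perlbmk": (18, 21),
    "300.twolf": (38, 42),
}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Average the logarithms, then exponentiate, to avoid a huge product.
    return math.exp(sum(math.log(v) for v in values) / len(values))


before = [b for b, _ in SCORES.values()]
after = [a for _, a in SCORES.values()]
am_speedup = arithmetic_mean(after) / arithmetic_mean(before)
gm_speedup = geometric_mean(after) / geometric_mean(before)
print(round(am_speedup, 2), round(gm_speedup, 2))
```

On this subset the arithmetic-mean ratio again comes out larger than the geometric-mean ratio, because the single 10x outlier dominates the sum.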

Question 5
A given processor has 32 registers, uses 16-bit immediates, and has 142 instructions (corresponding to 142 operation codes) in its ISA.
20 percent of the instructions take one source register and have one destination register.
30 percent of the instructions have two source registers and one destination register.
25 percent of the instructions have one source register and one destination register, and take an immediate source operand.
25 percent have one immediate source operand and one destination register.
a) [10 Points] For each of the four types of instructions, how many bits are required? Assume that the ISA requires that all instructions be a multiple of 8 bits in length, and that the operation codes (opcodes) are of fixed length (i.e., the ISA does not use shorter opcodes for some instructions and longer opcodes for others).
8 bits are required to encode 142 instructions (128 < 142 < 256). 5 bits are required to encode 32 registers. 16 bits are required to encode each immediate.
One source register, one destination register: 8 + 5 + 5 = 18 bits, which rounds up to 24 bits.
Two source registers, one destination register: 8 + 5 + 5 + 5 = 23 bits, which rounds up to 24 bits.
One source register, one destination register, and an immediate: 8 + 5 + 5 + 16 = 34 bits, which rounds up to 40 bits.
One immediate, one destination register: 8 + 16 + 5 = 29 bits, which rounds up to 32 bits.

Question 5 b) [10 Points] How much less memory does the program take up if a variable-length instruction set encoding is used as opposed to a fixed-length encoding?
Since the longest instruction type requires 40 bits to encode, the fixed-length encoding will have 40 bits per instruction.
For the variable-length encoding, the average number of bits per instruction is: 24 × 20% + 24 × 30% + 40 × 25% + 32 × 25% = 30 bits.
(40 − 30) / 40 = 25%. Therefore, the variable-length encoding requires 25 percent less space than the fixed-length encoding for this program.
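Both parts of Question 5 can be reproduced with a few lines of code. This is a minimal sketch; the constant and function names are illustrative, and each instruction type is described by its share of the program and its operand counts as stated in the question.

```python
import math

OPCODES, REGISTERS, IMMEDIATE_BITS = 142, 32, 16
OPCODE_BITS = math.ceil(math.log2(OPCODES))      # 8
REGISTER_BITS = math.ceil(math.log2(REGISTERS))  # 5

# (share of instructions, register operands, immediate operands) per type.
TYPES = [(0.20, 2, 0), (0.30, 3, 0), (0.25, 2, 1), (0.25, 1, 1)]


def length_bits(regs, imms):
    """Raw field widths rounded up to the next multiple of 8 bits."""
    raw = OPCODE_BITS + regs * REGISTER_BITS + imms * IMMEDIATE_BITS
    return math.ceil(raw / 8) * 8


lengths = [length_bits(r, i) for _, r, i in TYPES]
fixed = max(lengths)  # a fixed-length encoding must fit the longest type
variable = sum(share * n for (share, _, _), n in zip(TYPES, lengths))
print(lengths, fixed, round(variable, 2), round((fixed - variable) / fixed, 2))
```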