Presentation is loading. Please wait.

Presentation is loading. Please wait.

Cs 61C L25 pipeline.1 Patterson Spring 99 ©UCB CS61C Introduction to Pipelining Lecture 25 April 28, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)

Similar presentations


Presentation on theme: "Cs 61C L25 pipeline.1 Patterson Spring 99 ©UCB CS61C Introduction to Pipelining Lecture 25 April 28, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)"— Presentation transcript:

1 cs 61C L25 pipeline.1 Patterson Spring 99 ©UCB CS61C Introduction to Pipelining Lecture 25 April 28, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson) www-inst.eecs.berkeley.edu/~cs61c/schedule.html

2 cs 61C L25 pipeline.2 Patterson Spring 99 ©UCB Outline °Review Parameter Passing on Stacks °Pipelining Analogy °Pipelining Instruction Execution °Administrivia, “What’s this Stuff Bad for?” °Hazards to Pipelining °Solutions to Hazards °Advanced Pipelining Concepts by Analogy °Conclusion

3 cs 61C L25 pipeline.3 Patterson Spring 99 ©UCB Review 1/1 °Every machine has a convention for how arguments are passed. °In MIPS, where do the arguments go if you are passing more than 4 words? Stack! °It is sometimes useful to have a variable number of arguments. The C convention is to use “...” *fmt is used to determine the number of variables and their types.

4 cs 61C L25 pipeline.4 Patterson Spring 99 ©UCB Pipelining is Natural! Laundry Example °Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away °Washer takes 30 minutes °Dryer takes 30 minutes °“Folder” takes 30 minutes °“Stasher” takes 30 minutes to put clothes into drawers ABCD

5 cs 61C L25 pipeline.5 Patterson Spring 99 ©UCB Sequential Laundry °Sequential laundry takes 8 hours for 4 loads 30 TaskOrderTaskOrder B C D A Time 30 6 PM 7 8 9 10 11 12 1 2 AM

6 cs 61C L25 pipeline.6 Patterson Spring 99 ©UCB Pipelined Laundry: Start work ASAP °Pipelined laundry takes 3.5 hours for 4 loads! TaskOrderTaskOrder 12 2 AM 6 PM 7 8 9 10 11 1 Time B C D A 30

7 cs 61C L25 pipeline.7 Patterson Spring 99 ©UCB Pipelining Lessons °Pipelining doesn’t help latency of single task, it helps throughput of entire workload °Multiple tasks operating simultaneously using different resources °Potential speedup = Number pipe stages °Time to “fill” pipeline and time to “drain” it reduces speedup: 2.3X v. 4X in this example 6 PM 789 Time B C D A 30 TaskOrderTaskOrder

8 cs 61C L25 pipeline.8 Patterson Spring 99 ©UCB Pipelining Lessons °Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline? °Pipeline rate limited by slowest pipeline stage °Unbalanced lengths of pipe stages also reduces speedup 6 PM 789 Time B C D A 30 TaskOrderTaskOrder

9 cs 61C L25 pipeline.9 Patterson Spring 99 ©UCB Review: Steps in Executing MIPS (Lec. 20) 1) Ifetch: Fetch Instruction, Increment PC 2) Decode Instruction, Read Registers 3) Execute: Mem-ref:Calculate Address Arith-log: Perform Operation Branch:Compare if operands == 4) Memory: Load:Read Data from Memory Store:Write Data to Memory 5) Write Back: Write Data to Register Branch:if operands ==, Change PC

10 cs 61C L25 pipeline.10 Patterson Spring 99 ©UCB Pipelined Execution Representation °To simplify pipeline, every instruction takes same number of steps °Steps also called pipeline “stages” IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB Program Flow Time

11 cs 61C L25 pipeline.11 Patterson Spring 99 ©UCB Review: A Datapath for MIPS (Lec. 20) Data Cache PCRegisters ALU Instruction Cache Stage 1Stage 2Stage 3(Stage 4) Stage 5 IFtchDcdExecMemWB °Use data path figure to represent pipeline ALU I$ Reg D$Reg

12 cs 61C L25 pipeline.12 Patterson Spring 99 ©UCB ALU I$ Reg D$Reg I$ Graphical Pipeline Representation I n s t r. O r d e r Time (clock cycles) Load Add Store Sub Or ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg ALU I$ Reg D$Reg (right half highlight means read, left half write)

13 cs 61C L25 pipeline.13 Patterson Spring 99 ©UCB Example °Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write °Nonpipelined Execution: lw : IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns °Pipelined Execution: Max(IF,Read Reg,ALU, Memory,Write Reg) = 2 ns

14 cs 61C L25 pipeline.14 Patterson Spring 99 ©UCB Administrivia °Project 6 (last): Due Today °Next Readings: 7.5 °11th homework (last): Due Friday 4/30 7PM Exercises 2.6, 2.13, 6.1, 6.3, 6.4

15 cs 61C L25 pipeline.15 Patterson Spring 99 ©UCB Administrivia: Rest of 61C F4/30Review: Caches/TLB/VM; Section 7.5 M5/3Deadline to correct your grade record W5/5Review: Interrupts / Polling; A.7 F5/761C Summary / Your Cal heritage / HKN Course Evalution (Due: Final 61C Survey in lab; Return) Sun 5/9 Final Review starting 2PM (1 Pimintel) W5/12 Final (5PM 1 Pimintel) Need Alternative Final? Contact mds@cory

16 cs 61C L25 pipeline.16 Patterson Spring 99 ©UCB “What’s This Stuff (Potentially) Bad For?” Linking Entertainment to Violence 100s of studies in recent decades have revealed a direct correlation between exposure to media violence--including video games--and increased aggression. "We are reaching that stage of desensitization at which the inflicting of pain and suffering has become a source of entertainment; vicarious pleasure rather than revulsion. We are learning to kill, and we are learning to like it." Like the tobacco industry, “the evidence is there." The 14-year-old boy who opened fire on a prayer group in a Ky. school foyer in 1997 was a video-game expert. He had never fired a pistol before, but in the ensuing melee, he fired 8 shots, hit 8 people, and killed 3. The average law enforcement officer in the United States, at a distance of 7 yards, hits fewer than 1 in 5 shots. Because of freedom of speech is a value that we don't want to compromise, “it really comes down to the people creating these games. That's where the responsibility lies." N.Y. Times, 4/26/99

17 cs 61C L25 pipeline.17 Patterson Spring 99 ©UCB Pipeline Hazard: Matching socks in later load A depends on D; stall since folder tied up TaskOrderTaskOrder 12 2 AM 6 PM 7 8 9 10 11 1 Time B C D A E F bubble 30

18 cs 61C L25 pipeline.18 Patterson Spring 99 ©UCB Problems for Computers °Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard “bubbles” in the pipeline Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)

19 cs 61C L25 pipeline.19 Patterson Spring 99 ©UCB $ Single Cache is a Structural Hazard Load Instr 1 Instr 2 Instr 3 Instr 4 ALU $ Reg $ ALU $ Reg $ ALU $ Reg $ ALU Reg $ ALU $ Reg $ Read same memory twice in same clock cycle I n s t r. O r d e r Time (clock cycles)

20 cs 61C L25 pipeline.20 Patterson Spring 99 ©UCB Structural Hazards limit performance °Example: if 1.3 memory accesses per instruction (30% of instructions executed loads and stores) and only one memory access per cycle then Average CPI ≥ 1.3 Otherwise resource is more than 100% utilized

21 cs 61C L25 pipeline.21 Patterson Spring 99 ©UCB °Stall: wait until decision is clear Move up decision to 2nd stage by adding hardware to check registers as being read °Impact: 2 clock cycles per branch instruction  slow Control Hazard Solutions Add Beq Load ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg I$ I n s t r. O r d e r Time (clock cycles) bub ble

22 cs 61C L25 pipeline.22 Patterson Spring 99 ©UCB °Predict: guess one direction, then back up if wrong For example, Predict not taken °Impact: 1 clock per branch instruction if right, 2 if wrong (right ~ 50% of time) °More dynamic scheme: history of 1 branch (~ 90%) Control Hazard Solutions Add Beq Load ALU I$ Reg D$Reg ALU I$ Reg D$Reg I$ ALU Reg D$Reg I n s t r. O r d e r Time (clock cycles)

23 cs 61C L25 pipeline.23 Patterson Spring 99 ©UCB °Redefine branch behavior (takes place after next instruction) “delayed branch” °Impact: 1 clock cycles per branch instruction if can find instruction to put in “slot” (≈ 50% of time) Control Hazard Solutions Add Beq Misc ALU I$ Reg D$Reg ALU I$ Reg D$Reg I$ ALU Reg D$Reg Load I$ ALU Reg D$Reg I n s t r. O r d e r Time (clock cycles)

24 cs 61C L25 pipeline.24 Patterson Spring 99 ©UCB Example Nondelayed vs. Delayed Branch add $1,$2,$3 sub $4, $5,$6 beq $1, $4, Exit or $8, $9,$10 xor $10, $1,$11 Nondelayed Branch Exit: add $1,$2,$3 sub $4, $5,$6 beq $1, $4, Exit or $8, $9,$10 xor $10, $1,$11 Delayed Branch Exit:

25 cs 61C L25 pipeline.25 Patterson Spring 99 ©UCB Data Hazard on Register $1 add $1,$2,$3 sub $4, $1,$3 and $6, $1,$7 or $8, $1,$9 xor $10, $1,$11

26 cs 61C L25 pipeline.26 Patterson Spring 99 ©UCB Dependencies backwards in time are hazards Data Hazard on $1: add $1,$2,$3 sub $4,$1,$3 and $6,$1,$7 or $8,$1,$9 xor $10,r1,$11 IFID/RFEXMEMWB ALU I$ Reg D$ Reg ALU I$ Reg D$Reg ALU I$ Reg D$Reg I$ ALU Reg D$Reg ALU I$ Reg D$Reg I n s t r. O r d e r Time (clock cycles)

27 cs 61C L25 pipeline.27 Patterson Spring 99 ©UCB “Forward” result from one stage to another “or” OK if define read/write properly Data Hazard Solution: add $1,$2,$3 sub $4,$1,$3 and $6,$1,$7 or $8,$1,$9 xor $10,r1,$11 IFID/RFEXMEMWB ALU I$ Reg D$ Reg ALU I$ Reg D$Reg ALU I$ Reg D$Reg I$ ALU Reg D$Reg ALU I$ Reg D$Reg I n s t r. O r d e r Time (clock cycles)

28 cs 61C L25 pipeline.28 Patterson Spring 99 ©UCB Dependencies backwards in time are hazards Can’t solve with forwarding Must stall instruction dependent on loads Forwarding (or Bypassing): What about Loads lw $1,0($2) sub $4,$1,$3 IFID/RFEXMEMWB ALU I$ Reg D$ Reg ALU I$ Reg D$Reg

29 cs 61C L25 pipeline.29 Patterson Spring 99 ©UCB Must stall (insert bubble in) pipeline Data Hazard Even with Forwarding lw $1, 0($2) sub $4,$1,$6 and $6,$1,$7 or $8,$1,$9 IFID/RFEXMEMWB ALU I$ Reg D$ Reg ALU I$ Reg D$Reg ALU I$ Reg D$Reg I$ ALU Reg D$ Time (clock cycles) bub ble

30 cs 61C L25 pipeline.30 Patterson Spring 99 ©UCB Try producing fast code for a = b + c; d = e – f; a, b, c, d,e, and f in memory Slow code: lw $2,b lw $3,c add $1,$2,$3 sw $1,a lw $5,e lw $6,f sub $4,$5,$6 sw$4,d Software Scheduling to Avoid Load Hazards Fast code: lw $2,b lw $3,c lw $5,e add $1,$2,$3 lw $6,f sw $1,a sub $4,$5,$6 sw$4,d

31 cs 61C L25 pipeline.31 Patterson Spring 99 ©UCB Advanced Pipelining Concepts (if time) °“Out-of-order” Execution °“Superscalar” execution °State-of-the-Art Microprocessor

32 cs 61C L25 pipeline.32 Patterson Spring 99 ©UCB Review Pipeline Hazard: Stall is dependency A depends on D; stall since folder tied up TaskOrderTaskOrder 12 2 AM 6 PM 7 8 9 10 11 1 Time B C D A E F bubble 30

33 cs 61C L25 pipeline.33 Patterson Spring 99 ©UCB Out-of-Order Laundry: Don’t Wait A depends on D; rest continue; need more resources to allow out-of-order TaskOrderTaskOrder 12 2 AM 6 PM 7 8 9 10 11 1 Time B C D A 30 E F bubble

34 cs 61C L25 pipeline.34 Patterson Spring 99 ©UCB Superscalar Laundry: Parallel per stage More resources, HW to match mix of parallel tasks? TaskOrderTaskOrder 12 2 AM 6 PM 7 8 9 10 11 1 Time B C D A E F (light clothing) (dark clothing) (very dirty clothing) (light clothing) (dark clothing) (very dirty clothing) 30

35 cs 61C L25 pipeline.35 Patterson Spring 99 ©UCB Superscalar Laundry: Mismatch Mix Task mix underutilizes extra resources TaskOrderTaskOrder 12 2 AM 6 PM 7 8 9 10 11 1 Time 30 (light clothing) (dark clothing) (light clothing) A B D C

36 cs 61C L25 pipeline.36 Patterson Spring 99 ©UCB State of the Art: Alpha 21264 °15 Million transistors °2 64KB caches on chip; 16MB L2 cache off chip °Clock cycle time 600 MHz (Fastest Cray Supercomputer: T90 2.2 nsec) 90 watts per chip! °Superscalar: fetch up to 6 instructions/clock cycle, retires up to 4 instruction/clock cycle °Execution out-of-order

37 cs 61C L25 pipeline.37 Patterson Spring 99 ©UCB Summary 1/2: Pipelining Introduction °Pipelining is a fundamental concept Multiple steps using distinct resources Exploiting parallelism in instructions °What makes it easy? (MIPS vs. 80x86) All instructions are the same length  simple instruction fetch Just a few instruction formats  read registers before decode instruction Memory operands only in loads and stores  fewer pipeline stages Data aligned  1 memory access / load, store

38 cs 61C L25 pipeline.38 Patterson Spring 99 ©UCB Summary 2/2: Pipelining Introduction °What makes it hard? °Structural hazards: suppose we had only one cache?  Need more HW resources °Control hazards: need to worry about branch instructions?  Branch prediction, delayed branch °Data hazards: an instruction depends on a previous instruction?  need forwarding, compiler scheduling


Download ppt "Cs 61C L25 pipeline.1 Patterson Spring 99 ©UCB CS61C Introduction to Pipelining Lecture 25 April 28, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)"

Similar presentations


Ads by Google