The clock: Hertz = 1/sec. A Pentium computer running at 200 MHz performs 2*10^8 clock cycles per second. Each clock cycle takes 5*10^-9 sec = 5 nanoseconds. How long does an instruction take nowadays?


Presentation transcript:

1 The clock: Hertz = 1/sec. A Pentium computer running at 200 MHz performs 2*10^8 clock cycles per second. Each clock cycle takes 5*10^-9 sec = 5 nanoseconds. How long does an instruction take nowadays?
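The arithmetic on this slide can be checked with a short Python sketch. The function name clock_period_ns is only an illustrative choice, not something from the slides.

```python
def clock_period_ns(frequency_hz: float) -> float:
    """Return the length of one clock cycle in nanoseconds."""
    # period = 1 / frequency, and 1 second = 1e9 nanoseconds
    return 1e9 / frequency_hz


pentium_hz = 200e6  # 200 MHz = 2 * 10^8 cycles per second
print(clock_period_ns(pentium_hz))  # 5.0
```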

2 The idea behind the PIPELINE: never waste time!

3 Division into stages

4 Adding the registers

5 [Figure: pipelined datapath (PC, instruction memory, registers, sign extend, ALU) with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Stage shown: instruction fetch of lw, ending at IF/ID]

6 ID/EX

7 EX/MEM

8 [Figure: the same pipelined datapath, including the data memory. Stage shown: memory access of lw (address taken from the ALU result), ending at MEM/WB]

9

10 A correction: keep the right Rd all the way!

11 So here is the updated CPU:

12 Graphic representation

13 Control

14 The control lines

15 Datapath with Control

16 Example: a demonstration of a sequence of instructions:
    lw  $10, 20($1)
    sub $11, $2, $3
    and $12, $4, $5
    or  $13, $6, $7
    add $14, $8, $9

17

18 ID: and $12, $4, $5

19

20

21

22 An example for data hazards:
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

23 An example for data hazards (the same sequence as on slide 22): register $2 is updated only in the WB phase of sub, i.e., in the 5th clock cycle (actually at the end of the 5th clock cycle). However, we already try to use it in the 3rd clock cycle, when the and instruction reads $2 in its decode phase.
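The timing argument on this slide can be sketched in Python. This is a minimal model under stated assumptions, not the slides' own notation: instruction number i (counting from 0) is fetched in clock cycle i+1, reads its registers in cycle i+2 and writes back at the end of cycle i+5, there is no forwarding, and the tuple encoding of the program is hypothetical.

```python
# (name, destination register or None, source registers)
program = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),
]


def raw_hazards(program):
    """List (writer, reader, register) triples where the reader decodes too early."""
    hazards = []
    for j, (reader, _, sources) in enumerate(program):
        id_cycle = j + 2  # cycle in which the reader is in its decode phase
        for i in range(j):
            writer, dest, _ = program[i]
            wb_cycle = i + 5  # register is updated at the END of this cycle
            if dest is not None and dest in sources and id_cycle <= wb_cycle:
                hazards.append((writer, reader, dest))
    return hazards


for hazard in raw_hazards(program):
    print(hazard)
```

Under these assumptions the and, or and add instructions all read $2 too early, while sw decodes in cycle 6 and sees the new value.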

24 Graphic representation of data hazards:

25 Solving data hazards by adding nops:
    sub $2, $1, $3
    nop
    nop
    nop
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

26 Solving data hazards by adding nops [Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction), program execution order (in instructions): sub $2, $1, $3; the nops; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2)]

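27 The internal structure of the Register File [Figure: register file with inputs Rd reg 1 (= Rs), Rd reg 2 (= Rt), Wr reg (= Rd), write data (32 bits) and RegWrite, and outputs Read data 1 and Read data 2 (32 bits)]. The values of two different registers are read simultaneously from the two outputs. One of the other registers is written (at the next rising clock edge).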

28 We could earn 1 ck cycle if the GPR is "transparent", i.e., if we could see the data being written to the GPR at the GPR outputs (when the write address equals the read address) already during ck #5. [Figure: pipeline diagram, program execution order (in instructions): sub $2, $1, $3; the nops; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2)]

29 The internal structure of the modified Register File. We "bypass" the input data (the write data) to the read data 1 output whenever read reg 1 (= Rs) equals the write register (the Rd or Rt of the instruction being written back) and that register is not zero. We "bypass" the input data (the write data) to the read data 2 output whenever read reg 2 (= Rt) equals the write register and that register is not zero. [Figure: the register file of slide 27 with the write data and Wr reg (5 bits) inputs also routed to the two read outputs]. As before, the values of two different registers are read simultaneously from the two outputs, and one of the other registers is written (at the next rising clock edge).
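A minimal Python sketch of the bypass described on this slide, assuming a 32-entry register file with register 0 fixed at zero. The class and method names (RegisterFile, read, clock_edge) are hypothetical and chosen only for illustration.

```python
class RegisterFile:
    """32 registers; writes land on the clock edge, reads are combinational."""

    def __init__(self, transparent: bool):
        self.regs = [0] * 32
        self.transparent = transparent

    def read(self, rs, rt, reg_write=False, wr_reg=0, wr_data=0):
        """Return (read data 1, read data 2) during the current cycle."""

        def port(reg):
            # Bypass: show the pending write data when the addresses match.
            if self.transparent and reg_write and wr_reg != 0 and reg == wr_reg:
                return wr_data
            return self.regs[reg]

        return port(rs), port(rt)

    def clock_edge(self, reg_write, wr_reg, wr_data):
        """Commit the pending write at the rising clock edge."""
        if reg_write and wr_reg != 0:
            self.regs[wr_reg] = wr_data


plain = RegisterFile(transparent=False)
bypassed = RegisterFile(transparent=True)
# sub is writing 99 to $2 while a later instruction reads $2 and $5.
print(plain.read(2, 5, reg_write=True, wr_reg=2, wr_data=99))     # (0, 0)
print(bypassed.read(2, 5, reg_write=True, wr_reg=2, wr_data=99))  # (99, 0)
```

The plain register file returns the old value until after the clock edge, which is the one clock cycle the "transparent" version earns.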

30 After that change we only need 2 nops:
    sub $2, $1, $3
    nop
    nop
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
After the change, the WB of an earlier instruction can happen in the same clock cycle as the register read (decode) phase of a later instruction (the 3rd one after it, i.e., with two other instructions in between). So when we have a data hazard, we need to add only two nop instructions. Unfortunately, data hazards happen too often. We need a better solution!

31 Graphic representation of data hazards:

32 Forwarding: "stealing" the values

33 Forwarding (done at the execute phase). If ID/EX.Rs = EX/MEM.Rd, i.e., the Rd of the previous instruction equals the Rs of the current instruction (which is now entering the execute phase), then we use the "ALUout" of the previous instruction instead of the output of the GPR. If ID/EX.Rs = MEM/WB.Rd, i.e., the Rd of the instruction two back equals the Rs of the current instruction, then we use the GPR "write data" of that instruction instead of the output of the GPR. Similarly, compare also ID/EX.Rt to EX/MEM.Rd and to MEM/WB.Rd.

34 Data hazard from the previous instruction:
ALU Src A: if (ID/EX.Rs == EX/MEM.Rd), use the "ALUOut" instead of Rs, i.e., if Rs of the currently executing instruction == Rd of the previous instruction. The actual equation is:
    if ((EX/MEM.RegWrite == '1') && (EX/MEM.Rd <> 0) && (ID/EX.Rs == EX/MEM.Rd)) => ForwardA = "1,0"
ALU Src B: if (ID/EX.Rt == EX/MEM.Rd), use the "ALUOut" instead of Rt, i.e., if Rt of the currently executing instruction == Rd of the previous instruction. The actual equation is:
    if ((EX/MEM.RegWrite == '1') && (EX/MEM.Rd <> 0) && (ID/EX.Rt == EX/MEM.Rd)) => ForwardB = "1,0"

35 Data hazard from 2 instructions back:
ALU Src A: if (ID/EX.Rs == MEM/WB.Rd), use the GPR "write data" instead of Rs, i.e., if Rs of the currently executing instruction == Rd of 2 instructions ago. The actual equation is:
    if ((MEM/WB.RegWrite == '1') && (MEM/WB.Rd <> 0) && (ID/EX.Rs == MEM/WB.Rd)) => ForwardA = "0,1"
ALU Src B: if (ID/EX.Rt == MEM/WB.Rd), use the GPR "write data" instead of Rt, i.e., if Rt of the currently executing instruction == Rd of 2 instructions ago. The actual equation is:
    if ((MEM/WB.RegWrite == '1') && (MEM/WB.Rd <> 0) && (ID/EX.Rt == MEM/WB.Rd)) => ForwardB = "0,1"
Double hazard: what if there is a hazard both from the previous instruction and from the instruction before that? We should choose the data from the previous instruction, since it is the up-to-date ("newer") value!
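The equations of slides 34 and 35, including the double-hazard priority, can be sketched as one Python function. The function name forward_select and its argument names are hypothetical; the returned strings stand for the two-bit ForwardA/ForwardB values "1,0", "0,1" and (as an assumed default) "0,0" for no forwarding.

```python
def forward_select(id_ex_reg, ex_mem_reg_write, ex_mem_rd,
                   mem_wb_reg_write, mem_wb_rd):
    """Select the ALU operand source for one source register (Rs or Rt).

    Returns "10" to take ALUOut from EX/MEM (previous instruction),
    "01" to take the GPR write data from MEM/WB (two instructions back),
    and "00" to use the value read from the register file.
    """
    # The previous instruction is checked first, so it wins a double hazard.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "10"
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "01"
    return "00"


# ForwardA uses ID/EX.Rs, ForwardB uses ID/EX.Rt.
# Example: "and $4, $2, $5" executing right after "sub $2, $1, $3".
forward_a = forward_select(2, True, 2, False, 0)
forward_b = forward_select(5, True, 2, False, 0)
print(forward_a, forward_b)  # 10 00
```

Checking EX/MEM before MEM/WB is one simple way to express "choose the newer value"; hardware could equally add an explicit "and no EX/MEM match" term to the MEM/WB equation.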

36 An example for forwarding:
    sub $2, $1, $3
    and $4, $2, $5    needs forwarding from the previous instruction
    or  $4, $4, $2    needs forwarding from two instructions back
    add $9, $4, $2    needs forwarding from 3 instructions back (through the "transparent" GPR)
Here we discuss the $2 register only. (The first two cases are handled in the execute phase, the last one in the decode phase.)

37 An example for forwarding:
    sub $2, $1, $3
    and $4, $2, $5
    or  $4, $4, $2    needs forwarding from the previous instruction
    add $9, $4, $2    needs forwarding from the previous instruction
Here we discuss the $4 register, and there are two cases (the 2nd one in purple).

38 Since Rs=2 and Rd of previous inst. was 2, we use ALUout instead of Rs

39 In red we see forwarding from two instructions back (WB->Exec.); in purple, from the previous instruction (Mem->Exec.); in blue, from 3 instructions back (WB->Decode).

40 The solution does not always work: it fails for lw. (In lw we do not have the data in the pipe; it comes from the data memory!) If the previous instruction was a lw to a register and we try to use that register in the current instruction, we have a problem, since we cannot go back in time! One solution is to avoid such cases by having the Assembler add a nop whenever the Rt of the lw is equal to the Rs or Rt of the following instruction.

41 Another, h/w solution is to add bubbles, i.e., to add the nop by hardware. We need to hold IF/ID for one ck cycle and insert a "nop" into ID/EX. This is equivalent to adding a nop instruction by the Assembler.

42 Hazard detection unit. We need to hold the IF/ID and the PC for one ck cycle and insert a "nop" into ID/EX. This is equivalent to adding a nop instruction by the Assembler. If (ID/EX.MemRd) && ((ID/EX.Rt == IF/ID.Rs) || (ID/EX.Rt == IF/ID.Rt)) we must "stall" the pipeline! This means that the previous instruction was a lw and that it loads into the current Rs or Rt. (Of course, if one of them is not used, don't stall.) Holding means "freezing" the IF/ID and the PC for 1 clock cycle: hold the IF/ID by not giving the IF/IDWrite signal, and do not increment the PC (which already points at the next instruction) by not giving the PCWrite signal. Inserting a nop is done by clearing all control signals. [Figure labels: Rt from the previous instruction; Rs, Rt of the current instruction; MemRd identifies lw]

43 An example for lw hazard detection:
    lw  $2, 20($1)
    and $4, $2, $5
    or  $4, $4, $2
    add $9, $4, $2
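The stall condition of the hazard detection unit can be sketched in Python and applied to this example. The function name must_stall and its argument names are hypothetical, and the sketch does not model the slide's remark that an unused Rs or Rt should not cause a stall.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID/EX is a lw whose target register (Rt)
    is read by the instruction currently held in IF/ID."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


# lw $2, 20($1) is in ID/EX (Rt = 2); and $4, $2, $5 is in IF/ID (Rs = 2, Rt = 5).
print(must_stall(True, 2, 2, 5))   # True: freeze PC and IF/ID, insert a nop

# If the previous instruction were sub $2, $1, $3 (no memory read),
# forwarding is enough and no stall is needed.
print(must_stall(False, 2, 2, 5))  # False
```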

44

45 The lw instruction is in the WB phase; $2 is "being written". We can use $2 in the execute phase of the and instruction, with the help of forwarding.

46

47 Just to remind us how branch is handled we show again the Datapath with Control

48 Branch Hazards [Figure callouts:] Here we calculate Rs-Rt. Here we decide to branch (switching the address into the PC and issuing PCWrite Cond). These 3 instructions should be "killed" before they do harm, i.e., before they change any register. In cc5 we already use the new PC calculated by the branch (PC=72).

49 The situation would be better if we somehow "moved" the branch address calculation one ck earlier. This is easy to do, since sign extension and shift are only wires. We just need to move the branch address ALU one pipeline register to the left. Everything happens 1 ck earlier, and so we'll have to "kill" only two instructions. Next, we'll add a fast comparator which compares Rs and Rt in the same ck cycle as the "decode" phase. (Instead of using the ALU to calculate Rs-Rt, we'll build a simple and fast xor circuit.) This means extra h/w, but now we have earned one more ck cycle, so we have to kill only a single instruction. Killing an instruction, also called "flushing" the pipeline, is easily done by clearing the IF/ID register holding the instruction that follows the branch (if the branch is successful).
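The counting in slides 48 and 49 can be summarized in a small Python sketch, assuming the five stage names below and one new instruction fetched per clock cycle. The function name instructions_to_flush is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def instructions_to_flush(decision_stage: str) -> int:
    """Number of younger instructions already fetched when a taken branch
    is resolved in the given stage (one instruction enters per cycle)."""
    return STAGES.index(decision_stage)


for stage in ("MEM", "EX", "ID"):
    print(stage, instructions_to_flush(stage))
```

Resolving the branch in MEM costs three killed instructions, moving everything one stage earlier costs two, and comparing Rs and Rt already in the decode phase leaves only one.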

50 Flushing

51 An example for flushing:
    sub $10, $4, $8
    beq $1, $3, 7
    and $12, $2, $5
    lw  $4, 50($7)

52

53 Summary of hazards
Data hazards:
* Forward from the previous instruction
* Forward from two instructions ago
* (Forward through the "transparent" GPR = from 3 instructions ago)
* If we cannot forward (after lw), we stall the pipe by inserting a nop and freezing IF/ID and PC for 1 ck cycle
Control hazards:
* If the branch is successful, we flush the instruction following the branch (which is in the IF/ID register; we just clear the register)
Notes: In the real MIPS CPU, no flush was employed. This gives the compiler the opportunity to put a useful instruction after the branch, and it explains why the simulator always performs the instruction following the branch. This is called a delayed branch. Also, in the real MIPS CPU no lw stall was used. Again, this gives the compiler some freedom to choose whether to put a nop or some useful instruction after the lw. This is called a delayed load.

54 Other slides

55 Handling exceptions