
1 1 The single cycle CPU

2 2 Performance of Single-Cycle Machines

Unit latencies: Memory unit 2 ns; ALU and adders 2 ns; Register file (read or write) 1 ns.

Class      Fetch  Decode  ALU  Memory  Write Back  Total
R-format   2      1       2    0       1           6 ns
LW         2      1       2    2       1           8 ns
SW         2      1       2    2       -           7 ns
Branch     2      1       2    -       -           5 ns
Jump       2      -       -    -       -           2 ns
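The totals in the table are sums of the unit latencies that each instruction class uses. A minimal sketch in Python, assuming that fetch and memory access each cost one memory-unit delay, that decode and write back each cost one register-file delay, and that the ALU stage costs one ALU delay. The names `STAGES` and `total_latency` are illustrative and do not come from the slides.

```python
# Unit latencies in ns, taken from the slide.
MEM, ALU, REG = 2, 2, 1

# Assumption: the units each instruction class passes through, in order
# (fetch = memory, decode = register read, write back = register write).
STAGES = {
    "R-format": [MEM, REG, ALU, REG],
    "LW":       [MEM, REG, ALU, MEM, REG],
    "SW":       [MEM, REG, ALU, MEM],
    "Branch":   [MEM, REG, ALU],
    "Jump":     [MEM],
}

def total_latency(instr_class):
    """Return the single-cycle latency of one instruction class in ns."""
    return sum(STAGES[instr_class])

TOTALS = {name: total_latency(name) for name in STAGES}
print(TOTALS)
```

The longest entry (LW) is the one that dictates the clock period of a single cycle machine.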

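3 3 What would happen if the clock cycle had a variable length? Let us compare for a program with the following instruction mix: R-type: 44%, LW: 24%, SW: 12%, BRANCH: 18%, JUMP: 2%
I = number of instructions in the program
T = clock cycle length
CPI = cycles per instruction = 1
Execution time = I*T*CPI. With a variable clock, the average cycle length is T = 8*24% + 7*12% + 6*44% + 5*18% + 2*2% = 6.34 ns, or about 6.3 ns.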

4 4 The result
EXE(single cycle) = T(single clock) * I, with T(single clock) = 8 ns
EXE(variable) = T(variable clock) * I, with T(variable clock) = 6.3 ns
The ratio is 8 / 6.3 = 1.27. The ratio gets worse when we implement complex instructions such as floating point operations.
A variable-length clock is not the solution: it is complicated to build.
The solution: an instruction takes a variable number of cycles.
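The comparison can be checked numerically. The sketch below, in Python, weights each class latency by its share of the instruction mix; the names `LATENCY_NS`, `MIX`, `fixed_clock_ns` and `variable_clock_ns` are illustrative. Without intermediate rounding the average comes out at 6.34 ns and the ratio at about 1.26; the 1.27 quoted on the slide follows from rounding the average to 6.3 ns first.

```python
# Per-class latency in ns (single cycle table) and the instruction mix.
LATENCY_NS = {"R-type": 6, "LW": 8, "SW": 7, "Branch": 5, "Jump": 2}
MIX = {"R-type": 0.44, "LW": 0.24, "SW": 0.12, "Branch": 0.18, "Jump": 0.02}

def fixed_clock_ns(latency):
    """A fixed clock must fit the slowest instruction class."""
    return max(latency.values())

def variable_clock_ns(latency, mix):
    """Average cycle length if every instruction took exactly its own latency."""
    return sum(latency[name] * mix[name] for name in mix)

t_single = fixed_clock_ns(LATENCY_NS)
t_variable = variable_clock_ns(LATENCY_NS, MIX)
ratio = t_single / t_variable  # the instruction count I cancels out
print(f"single: {t_single} ns, variable: {t_variable:.2f} ns, ratio: {ratio:.2f}")
```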

5 5 Multicycle Approach
The idea behind the multicycle approach:
- Saving time: each instruction takes only as many clock cycles as it needs.
- Saving components: the same component is used in different stages of an instruction.

6 6 How the multicycle architecture is built
Divide the instruction into stages, one stage per cycle:
- Balance the amount of work required in each stage.
- Reduce the amount of work required in each stage: each stage performs only one functional operation.
At the end of each clock cycle:
- Store the values needed by the following stages.
- Add internal registers for this purpose.

7 7 [Figure: timing of a lw instruction in a single cycle CPU, where fetch, decode, execute, memory, and write back all fall inside one long clock cycle (signals shown: PC, I.Mem data, Rs/Rt, ALU inputs, ALU output (address), D.Mem adrs, D.Mem data), compared with the timing of a lw instruction in a multi-cycle CPU, where PC, IR, A and B, ALUout, and MDR are loaded on successive clock edges 0 to 5.] We want to replace a long single CK cycle with 5 short ones.

8 8 Therefore we should add registers to the single cycle CPU shown below: [Figure: single cycle datapath with PC, Adder (+4), Instruction Memory, Reg File (Rs = [25:21], Rt = [20:16], Rd), Sext 16->32 of [15:0], ALU, and Data Memory (Address, D.In, D.Out).]

9 9 Adding registers to “split” the instruction into 5 stages: [Figure: the single cycle datapath with the registers IR, A, B, ALUout, and MDR added and a PCWrite signal on the PC; clock edges numbered 0 to 5.]

10 10 Here is the book’s version of the multi-cycle CPU. Only PC and IR have write enable signals. All other registers hold data for a single cycle.

11 11 Here is our version of a multi-cycle CPU capable of R-type & lw/sw & branch instructions: [Figure: datapath with PC, a single Instruction & data Memory, IR, Reg File (Rs = IR[25:21], Rt = IR[20:16], Rd), registers A and B, Sext 16->32 of IR[15:0], << 2 shifter, ALU (with a constant 4 input), and ALUout; registers clocked by ck.]

12 12 Let us explain the multi-cycle CPU. First we’ll look at a CPU capable of performing only R-type instructions. Then we’ll add the lw instruction and the sw instruction, then the beq instruction, and finally the j instruction.

13 13 Let us remind ourselves how a single cycle CPU capable of performing R-type instructions works. Here you see the data-path and the timing of an R-type instruction. [Figure: single cycle datapath with PC, Adder (+4), Instruction Memory, Reg File (Rs = [25:21], Rt = [20:16], Rd = [15:11]), and ALU; opcode = [31:26], funct = [5:0].]

14 14 A single cycle CPU demo: R-type instruction [Figure: single cycle datapath with PC (+4), Instruction Memory, Reg File (Rs = [25:21], Rt = [20:16], Rd = [15:11]), and ALU.]

15 15 A multi cycle CPU capable of performing R-type instructions [Figure: datapath with PC, Instruction & data Memory, IR, Reg File (Rs = IR[25:21], Rt = IR[20:16], Rd), registers A and B, ALU, and ALUout; registers clocked by ck.]

16 16 A multi cycle CPU capable of R-type instructions: fetch [Figure: the R-type multi-cycle datapath (PC, Instruction & data Memory, IR, Reg File, A, B, ALU, ALUout) during the fetch step, between clock edges 0 and 1.]

17 17 A multi cycle CPU capable of R-type instructions: decode [Figure: the same datapath during the decode step, between clock edges 1 and 2.]

18 18 A multi cycle CPU capable of R-type instructions: execute [Figure: the same datapath during the execute step, between clock edges 2 and 3.]

19 19 A multi cycle CPU capable of R-type instructions: write back [Figure: the same datapath during the write back step, between clock edges 3 and 4; Rd is written on ck.]

20 20 [Figure: timing of an R-type instruction in a single cycle CPU (fetch, decode, execute, write back inside one clock cycle; signals: PC, Inst. Mem data = the instruction, Rs/Rt, ALU inputs, ALU output = result of the calculation, GPR input) and timing of an R-type instruction in a multi-cycle CPU (PC, Mem data, IR, A and B, ALUout loaded on successive clock edges 0 to 4, with the previous and the current instruction marked).]

21 21 [Figure: timing of an R-type instruction in a multi-cycle CPU (fetch, decode, execute, write back; signals: PC, Mem data, IR, A and B = GPR outputs, ALUout = ALU output, IRWrite; previous, current, and next instruction marked).] At the rising edge of CK: IR = M(PC); then A = Rs, B = Rt; then ALUout = A op B; then Rd = ALUout. An R-type instruction takes 4 CKs. The state diagram: IR=M(PC) -> A=Rs, B=Rt -> ALUout = A op B -> Rd=ALUout.

22 22 A multi-cycle CPU capable of R-type instructions (PC calc.) [Figure: the R-type multi-cycle datapath (PC, Instruction & data Memory, IR, Reg File, A, B, ALU, ALUout) with a constant 4 input added for the PC calculation.]

23 23 [Figure: timing of an R-type instruction in a multi-cycle CPU including the PC update (signals: PC, Mem data, IR, A and B = GPR outputs, ALUout = ALU output, PCWrite; previous, current, and next instruction marked).] At the rising edge of CK: PC = PC+4 (next PC = current PC+4); ALUout = A op B; Rd = ALUout.

24 24 A multi cycle CPU capable of R-type instructions: fetch [Figure: the R-type multi-cycle datapath (PC, Instruction Memory, IR, Reg File, A, B, ALU, ALUout) during the fetch step, with the ALU taking the constant 4.]

25 25 The state diagram of a CPU capable of R-type instructions only:
Fetch (0): IR=M(PC), PC = PC+4
Decode (1): A=Rs, B=Rt
ALU (6): ALUout=A op B
WBR (7): Rd = ALUout
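To make the four states concrete, here is a small Python sketch that walks one R-type instruction through Fetch, Decode, ALU, and WBR using the register transfers listed above. It is an illustration under assumptions: memory is modeled as a dictionary keyed by byte address, only the standard MIPS `add` (funct 0x20) and `sub` (funct 0x22) are handled, and the function name `run_r_type` is hypothetical.

```python
def run_r_type(pc, memory, regs):
    """Walk one R-type instruction through the four states; return (new PC, trace)."""
    trace = []

    # State 0 (Fetch): IR = M(PC); PC = PC + 4
    ir = memory[pc]
    pc = pc + 4
    trace.append("Fetch")

    # State 1 (Decode): A = Rs, B = Rt
    rs = (ir >> 21) & 0x1F   # IR[25:21]
    rt = (ir >> 16) & 0x1F   # IR[20:16]
    rd = (ir >> 11) & 0x1F   # IR[15:11]
    funct = ir & 0x3F        # IR[5:0]
    a, b = regs[rs], regs[rt]
    trace.append("Decode")

    # State 6 (ALU): ALUout = A op B (assumption: only add and sub are modeled)
    if funct == 0x20:
        aluout = (a + b) & 0xFFFFFFFF
    elif funct == 0x22:
        aluout = (a - b) & 0xFFFFFFFF
    else:
        raise ValueError("funct not modeled in this sketch")
    trace.append("ALU")

    # State 7 (WBR): Rd = ALUout
    regs[rd] = aluout
    trace.append("WBR")
    return pc, trace

# Usage: add $3, $1, $2 placed at address 0x400000.
memory = {0x400000: (1 << 21) | (2 << 16) | (3 << 11) | 0x20}
regs = [0] * 32
regs[1], regs[2] = 5, 7
new_pc, trace = run_r_type(0x400000, memory, regs)
```

Each entry in `trace` corresponds to one clock cycle, so the instruction takes 4 CKs.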

26 26 The state diagram of a CPU capable of R-type and lw instructions:
Fetch (0) -> Decode (1)
R-type: Decode (1) -> ALU (6) -> WBR (7)
lw: Decode (1) -> AdrCmp (2): ALUout= A+sext(imm) -> Load (3): MDR = M(ALUout) -> WB (4): Rt = MDR

27 27 We added registers to “split” the instruction into 5 stages. Let’s discuss the lw instruction. [Figure: the single cycle datapath (PC, Adder +4, Instruction Memory, Reg File, Sext 16->32, ALU, Data Memory) with the registers IR, A, B, ALUout, and MDR added and a PCWrite signal on the PC; clock edges numbered 0 to 5.]

28 28 First we draw a multi-cycle CPU capable of R-type & lw instructions: [Figure: datapath with PC, Instruction Memory, IR, Reg File (Rs = IR[25:21], Rt = IR[20:16], Rd), registers A and B, Sext 16->32 of IR[15:0], ALU (with a constant 4 input), ALUout, Data Memory, and MDR.] We just moved the data memory. All parts related to lw only are blue.

29 29 A multi-cycle CPU capable of R-type & lw instructions: fetch [Figure: the R-type & lw datapath (PC, Instruction Memory, IR, Reg File, A, B, Sext 16->32, ALU, ALUout, Data Memory, MDR) during the fetch step.]

30 30 A multi-cycle CPU capable of R-type & lw instructions: decode [Figure: the same datapath during the decode step.]

31 31 A multi-cycle CPU capable of R-type & lw instructions: AdrCmp [Figure: the same datapath during the address computation step.]

32 32 A multi-cycle CPU capable of R-type & lw instructions: memory [Figure: the same datapath during the memory step.]

33 33 A multi-cycle CPU capable of R-type & lw instructions: WB [Figure: the same datapath during the write back step; Rt is written on ck.]

34 34 Can we unite the Instruction & Data memories? (They are not used simultaneously as in the single cycle CPU.) [Figure: the R-type & lw datapath with separate Instruction Memory and Data Memory.]

35 35 So here is a multi-cycle CPU capable of R-type & lw instructions using a single memory for instructions & data: [Figure: datapath with PC, a single Instruction & data Memory, IR, MDR, Reg File (Rs = IR[25:21], Rt = IR[20:16], Rd), registers A and B, Sext 16->32 of IR[15:0], ALU (with a constant 4 input), and ALUout.]

36 36 [Figure: timing of a lw instruction in a single cycle CPU (fetch, decode, execute, memory, write back inside one clock cycle; signals: PC, I.Mem data, Rs/Rt, ALU inputs, ALU output (address), D.Mem adrs, D.Mem data) and timing of a lw instruction in a multi-cycle CPU (PC then PC+4, Mem data, IR, A and B, ALUout = data address, Mem data, MDR = data to Rt; previous and current instruction marked).]

37 37 [Figure: timing of a lw instruction in a multi-cycle CPU (fetch, decode, execute, memory, write back; signals: PC, Mem data, IR, A and B = GPR outputs, ALUout = ALU output = data address, Mem data, MDR = data to Rt, PCWrite, IRWrite).] At the rising edge of CK: IR = M(PC), PC = PC+4; then A = Rs, B = Rt; then ALUout = A+sext(imm); then MDR = M(ALUout); then Rt = MDR.

38 38 The state diagram of a CPU capable of R-type and lw instructions:
Fetch (0): IR=M(PC), PC = PC+4
Decode (1): A=Rs, B=Rt
R-type: ALU (6): ALUout=A op B -> WBR (7): Rd = ALUout
lw: AdrCmp (2): ALUout= A+sext(imm) -> Load (3): MDR = M(ALUout) -> WB (4): Rt = MDR

39 39 A multi-cycle CPU capable of R-type & lw & sw instructions [Figure: datapath with PC, Instruction & data Memory, IR, MDR, Reg File (Rs = IR[25:21], Rt = IR[20:16], Rd), registers A and B, Sext 16->32 of IR[15:0], << 2 shifter, ALU (with a constant 4 input), and ALUout; the lw and sw data paths are marked.]

40 40 The state diagram of a CPU capable of R-type and lw and sw instructions:
Fetch (0): IR=M(PC), PC = PC+4
Decode (1): A=Rs, B=Rt
R-type: ALU (6): ALUout=A op B -> WBR (7): Rd = ALUout
lw+sw: AdrCmp (2): ALUout= A+sext(imm)
lw: Load (3): MDR = M(ALUout) -> WB (4): Rt = MDR
sw: Store (5): M(ALUout)=B
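The lw and sw paths share Fetch, Decode, and AdrCmp and only then diverge. The Python sketch below follows the register transfers of the state diagram with a single memory for instructions and data. It is a sketch under assumptions: memory is a dictionary keyed by byte address, and the names `sext16` and `run_mem_instr` are hypothetical.

```python
LW, SW = 0b100011, 0b101011  # opcodes, IR[31:26]

def sext16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm - 0x10000 if imm & 0x8000 else imm

def run_mem_instr(pc, memory, regs):
    """Walk one lw or sw through its states; return (new PC, visited state numbers)."""
    # State 0 (Fetch): IR = M(PC); PC = PC + 4
    ir = memory[pc]
    pc += 4
    opcode = ir >> 26
    rs = (ir >> 21) & 0x1F
    rt = (ir >> 16) & 0x1F
    imm = ir & 0xFFFF

    # State 1 (Decode): A = Rs, B = Rt
    a, b = regs[rs], regs[rt]

    # State 2 (AdrCmp): ALUout = A + sext(imm)
    aluout = (a + sext16(imm)) & 0xFFFFFFFF

    if opcode == LW:
        mdr = memory[aluout]   # State 3 (Load): MDR = M(ALUout)
        regs[rt] = mdr         # State 4 (WB):   Rt = MDR
        return pc, [0, 1, 2, 3, 4]
    if opcode == SW:
        memory[aluout] = b     # State 5 (Store): M(ALUout) = B
        return pc, [0, 1, 2, 5]
    raise ValueError("only lw and sw are modeled in this sketch")

# Usage: lw $2, 8($1) followed by sw $2, -4($1).
memory = {
    0x400000: (LW << 26) | (1 << 21) | (2 << 16) | 8,
    0x400004: (SW << 26) | (1 << 21) | (2 << 16) | 0xFFFC,
    0x1008: 99,
}
regs = [0] * 32
regs[1] = 0x1000
pc, lw_states = run_mem_instr(0x400000, memory, regs)
pc, sw_states = run_mem_instr(pc, memory, regs)
```

In this model lw visits five states and sw four, which matches the cycle counts implied by the diagram.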

41 41 A multi-cycle CPU capable of R-type & lw/sw & branch instructions [Figure: datapath with PC, Instruction & data Memory, IR, Reg File (Rs = IR[25:21], Rt = IR[20:16], Rd), registers A and B, Sext 16->32 of IR[15:0], << 2 shifter, ALU (with a constant 4 input), and ALUout; registers clocked by ck.]

42 42 Adding the instruction beq to the state diagram:
Fetch (0) -> Decode (1)
R-type: ALU (6) -> WBR (7)
lw+sw: AdrCmp (2); lw: Load (3) -> WB (4); sw: Store (5)
beq: Branch (8): Calc Rs - Rt (just to produce the zero signal)
zero: Calc PC=PC+sext(imm)<<2
not zero: back to Fetch

43 43 Adding the instruction beq to the state diagram, a more efficient way: Let’s use the decode state, in which the ALU is doing nothing, to compute the branch address. We’ll have to store it for 1 more CK cycle, until we know whether to branch or not! (We store it in the ALUout reg.)
Fetch (0) -> Decode (1): Calc ALUout=PC+sext(imm)<<2
R-type: ALU (6) -> WBR (7)
lw+sw: AdrCmp (2); lw: Load (3) -> WB (4); sw: Store (5)
beq: Branch (8): Calc Rs - Rt. If zero, load the PC with ALUout data, else do not load the PC
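The two-step branch can be sketched in Python as follows: the target is computed during Decode from the already incremented PC and held in ALUout, and the Branch state only decides whether to load it. The function names `sext16` and `beq_next_pc` are hypothetical, and the 32-bit masking is an assumption about register width.

```python
def sext16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm - 0x10000 if imm & 0x8000 else imm

def beq_next_pc(pc_plus_4, imm16, a, b):
    """Return the PC value after a beq, given PC+4, the immediate, and A = Rs, B = Rt."""
    # Decode state: ALUout = PC + (sext(imm) << 2), using the PC updated in Fetch.
    aluout = (pc_plus_4 + (sext16(imm16) << 2)) & 0xFFFFFFFF
    # Branch state: the ALU computes Rs - Rt just to produce the zero signal.
    zero = ((a - b) & 0xFFFFFFFF) == 0
    # PC is loaded with ALUout only if zero is set; otherwise it keeps PC+4.
    return aluout if zero else pc_plus_4

taken = beq_next_pc(0x400004, 3, 5, 5)           # equal registers: branch
not_taken = beq_next_pc(0x400004, 3, 5, 6)       # different registers: fall through
backward = beq_next_pc(0x400004, 0xFFFF, 0, 0)   # immediate -1: one word back
```

Because ALUout is overwritten on every ALU operation, this works only because the Branch state immediately follows Decode.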

44 44 A multi-cycle CPU capable of R-type & lw/sw & branch instructions [Figure: datapath with PC, Instruction & data Memory, IR, Reg File (Rs = IR[25:21], Rt = IR[20:16], Rd), registers A and B, Sext 16->32 of IR[15:0], <<2 shifter, ALU (with a constant 4 input), and ALUout; the PC input selects between PC+4 and the Branch Address.]

45 45 Adding the instruction j to the state diagram:
Fetch (0) -> Decode (1)
R-type: ALU (6) -> WBR (7)
lw+sw: AdrCmp (2); lw: Load (3) -> WB (4); sw: Store (5)
beq: Branch (8)
j: Jump (9): PC = PC[31:28] || IR[25:0]<<2
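The jump address is a bit concatenation rather than an addition. A minimal Python sketch of PC = PC[31:28] || IR[25:0]<<2 follows; the function name `jump_target` is illustrative.

```python
def jump_target(pc, ir):
    """Concatenate PC[31:28] with IR[25:0] shifted left by 2."""
    upper = pc & 0xF0000000          # PC[31:28], kept in place
    lower = (ir & 0x03FFFFFF) << 2   # IR[25:0] << 2, a 28-bit byte address
    return upper | lower

J = 0b000010  # opcode of j, IR[31:26]
ir = (J << 26) | 0x100000            # j to word index 0x100000
target = jump_target(0x400004, ir)
```

Here `pc` is the value already incremented in Fetch, so the upper four bits come from PC+4.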

46 46 A multi-cycle CPU capable of R-type & lw/sw & branch & jump instructions [Figure: datapath with PC, Instruction & data Memory, IR, Reg File (Rs = IR[25:21], Rt = IR[20:16], Rd), registers A and B, Sext 16->32 of IR[15:0], <<2 shifter, ALU (with a constant 4 input), and ALUout; the PC input selects between PC+4 (the next address), the Branch Address, and the Jump address = PC[31:28] || IR[25:0]<<2.]

47 47 Summary of the stages of the different instructions [Figure: the state diagram with states 0 to 9.]

48 48 MultiCycle implementation with Control

49 Final State Machine

50 50 The final state diagram:
Fetch (0) -> Decode (1)
R-type: ALU (6) -> WBR (7)
lw+sw: AdrCmp (2); lw: Load (3) -> WB (4); sw: Store (5)
beq: Branch (8)
j: Jump (9)
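The final diagram can be written down as a next-state function. The Python sketch below uses the state numbers from the diagram and assumes that every terminal state returns to Fetch; the names `next_state` and `state_sequence` are hypothetical, and instructions are identified by mnemonic for readability rather than by opcode bits.

```python
# State numbers as used in the diagram.
FETCH, DECODE, ADRCMP, LOAD, WB, STORE, ALU, WBR, BRANCH, JUMP = range(10)

def next_state(state, instr):
    """Return the state that follows `state` for the given instruction mnemonic."""
    if state == FETCH:
        return DECODE
    if state == DECODE:
        return {"R-type": ALU, "lw": ADRCMP, "sw": ADRCMP,
                "beq": BRANCH, "j": JUMP}[instr]
    if state == ADRCMP:
        return LOAD if instr == "lw" else STORE
    if state == LOAD:
        return WB
    if state == ALU:
        return WBR
    # Assumption: WB, STORE, WBR, BRANCH, and JUMP all return to Fetch.
    return FETCH

def state_sequence(instr):
    """List the states one instruction visits, starting at Fetch."""
    seq = [FETCH]
    while True:
        nxt = next_state(seq[-1], instr)
        if nxt == FETCH:
            return seq
        seq.append(nxt)

CYCLES = {name: len(state_sequence(name))
          for name in ("R-type", "lw", "sw", "beq", "j")}
```

The length of each sequence is the number of clock cycles the instruction takes, which is the variable cycle count the multi-cycle approach set out to achieve.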


52 52 MultiCycle implementation with Control

53 53 Implementation: Finite State Machine for Control (The book’s version)

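54 54 The Control Finite State Machine: [Figure: block diagram. The opcode (IR[31:26]) and the datapath flags (zero, neg, etc.), together with the current state, feed the next state calculation; the next state is loaded into the State reg on ck; the current state feeds the Outputs decoder, which produces the control signals.] For 10 states coded 0-9, we need 4 bits, i.e., [S3,S2,S1,S0].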

55 55 The control signals decoder. We just implement the table of slide 54. Let’s look at ALUSrcA: it is “0” in states 0 and 1 and “1” in states 2, 6 and 8; in all other states we don’t care. Let’s look at PCWrite: it is “1” in states 0 and 9; in all other states it must be “0”. In the same way we fill in the table below and build the decoder.
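The two signals described here depend only on the current state, so the decoder is a pure function of the state number. A minimal Python sketch covering just ALUSrcA and PCWrite follows; returning `None` for a don't-care entry is a modeling choice of this sketch, and the function names are illustrative.

```python
def alu_src_a(state):
    """ALUSrcA per state: 0 in states 0 and 1, 1 in states 2, 6 and 8, else don't care."""
    if state in (0, 1):
        return 0
    if state in (2, 6, 8):
        return 1
    return None  # don't care: the hardware may output either value

def pc_write(state):
    """PCWrite per state: 1 in states 0 and 9, and it must be 0 everywhere else."""
    return 1 if state in (0, 9) else 0

# One row per state, as in a decoder truth table.
TABLE = [(s, alu_src_a(s), pc_write(s)) for s in range(10)]
```

The difference between the two signals matters: a don't care can be used to simplify the logic, while PCWrite has to be forced to 0 so that the PC is not overwritten by accident.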

56 56 The state machine “next state calc.” logic
Opcodes: R-type=000000, lw=100011, sw=101011, beq=000100, bne=000101, lui=001111, j=000010, jal=000011, addi=001000
States: Fetch 0, Decode 1, AdrCmp 2, Load 3, WB 4, Store 5, ALU 6, WBR 7, Branch 8, Jump 9

Instruction  Opcode (IR31..IR26)  Current state (S3S2S1S0)  Next state (S3S2S1S0)
(any)        XXXXXX               0000                      0001
R-type       000000               0001                      0110
lw+sw        1XXXXX               0001                      0010
lw           XX0XXX               0010                      0011
sw           XX1XXX               0010                      0101
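A table with don't-care bits can be evaluated by matching each row as a pattern, which is also how a PLA product term behaves. The Python sketch below does this for five rows (Fetch, R-type, lw+sw, lw, sw). The don't-care patterns are an assumption of this sketch: they were chosen so that they separate the opcodes listed on the slide (only lw and sw have IR31 = 1, and they differ in IR29). The names `ROWS`, `matches`, and `next_state` are hypothetical.

```python
# (opcode pattern IR31..IR26, current state, next state); X = don't care.
# Assumption: these patterns are one possible encoding, not necessarily the slide's.
ROWS = [
    ("XXXXXX", 0b0000, 0b0001),  # Fetch -> Decode, any opcode
    ("000000", 0b0001, 0b0110),  # R-type: Decode -> ALU
    ("1XXXXX", 0b0001, 0b0010),  # lw+sw:  Decode -> AdrCmp
    ("XX0XXX", 0b0010, 0b0011),  # lw:     AdrCmp -> Load
    ("XX1XXX", 0b0010, 0b0101),  # sw:     AdrCmp -> Store
]

def matches(pattern, opcode):
    """True if the 6-bit opcode agrees with the pattern on every non-X bit."""
    bits = format(opcode, "06b")
    return all(p == "X" or p == b for p, b in zip(pattern, bits))

def next_state(opcode, state):
    """Return the next state from the first matching row, or None if no row applies."""
    for pattern, current, nxt in ROWS:
        if current == state and matches(pattern, opcode):
            return nxt
    return None

LW, SW, RTYPE, BEQ = 0b100011, 0b101011, 0b000000, 0b000100
```

Rows for beq, j, and the remaining states would be added in the same way; `next_state(BEQ, 1)` returns `None` here only because those rows are not part of this excerpt.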

57 57 The Control Finite State Machine: [Figure: the block diagram of the control FSM (Opcode = IR[31:26], next state calculation, State reg clocked by ck, Outputs decoder, control signals), annotated with PCWrite, PCWriteCond, zero, and the signal to PC, and with the labels Mealy machine and Moore machine.]

58 58 Finite State Machine for Control

59 59 ROM Implementation
ROM = "Read Only Memory"
– values of memory locations are fixed ahead of time
A ROM can be used to implement a truth table
– if the address is m bits, we can address 2^m entries in the ROM.
– our outputs are the bits of data that the address points to.
m is the "height", and n is the "width".

Address (m bits)  Data (n bits)
0 0 0             0 0 1 1
0 0 1             1 1 0 0
0 1 0             1 1 0 0
0 1 1             1 0 0 0
1 0 0             0 0 0 0
1 0 1             0 0 0 1
1 1 0             0 1 1 0
1 1 1             0 1 1 1

60 60 ROM Implementation
How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses).
How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs.
ROM is 2^10 x 20 = 20K bits (and a rather unusual size).
Rather wasteful, since for lots of the entries the outputs are the same, i.e., the opcode is often ignored.
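The ROM size follows directly from the number of address lines and outputs. A short Python check, with the helper name `rom_bits` being illustrative:

```python
def rom_bits(address_lines, outputs):
    """A ROM with m address lines and n outputs stores 2^m words of n bits."""
    return (2 ** address_lines) * outputs

address_lines = 6 + 4   # opcode bits + state bits
outputs = 16 + 4        # datapath-control outputs + next-state bits
size = rom_bits(address_lines, outputs)
print(size, "bits =", size // 1024, "K bits")
```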

61 61 ROM vs PLA
Break up the table into two parts
– 4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM
– 10 bits tell you the 4 next state bits: 2^10 x 4 bits of ROM
– Total: 4.3K bits of ROM
PLA is much smaller
– can share product terms
– only need entries that produce an active output
– can take into account don't cares
Size is (#inputs x #product-terms) + (#outputs x #product-terms). For this example = (10x17)+(20x17) = 510 PLA cells.
A PLA cell is usually about the size of a ROM cell (slightly bigger).
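The size comparison can be reproduced with a few lines of Python. The helper names `rom_bits` and `pla_cells` are illustrative. Evaluating the PLA formula with 10 inputs, 20 outputs, and 17 product terms gives 170 + 340 = 510 cells.

```python
def rom_bits(address_lines, outputs):
    """A ROM with m address lines and n outputs stores 2^m words of n bits."""
    return (2 ** address_lines) * outputs

def pla_cells(inputs, outputs, product_terms):
    """PLA size: (#inputs x #product-terms) + (#outputs x #product-terms)."""
    return inputs * product_terms + outputs * product_terms

single_rom = rom_bits(10, 20)        # one ROM for everything
outputs_rom = rom_bits(4, 16)        # state -> 16 control outputs
next_state_rom = rom_bits(10, 4)     # opcode + state -> 4 next-state bits
split_rom = outputs_rom + next_state_rom
pla = pla_cells(10, 20, 17)

print(single_rom, split_rom, pla)
```

Splitting the table cuts the ROM from 20480 to 4352 bits, and the PLA needs roughly an order of magnitude fewer cells than the split ROM.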

62 62 End of multi-cycle implementation

