1 EE457 Discussion Fall 2006 Final Review Brandon Franzke, Maryam Soltan, USC2006 and Wei-Jen Hsu, USC 2005.

2 Review Questions
Question 2, Fall 2004 (Multi-cycle CPU)
Question 3, Summer 2004 (Pipeline CPU)
Question 1, Fall 2004 (Based on lab 7 pipeline)
Carry Look Ahead Adder
–Question 5, Summer 2004 (CLA)

3 Example - Multicycle CPU
Modifications to the 2nd Edition CU (state diagram) and DPU.
–Mr. Trojan has already modified the DPU.
Notice: standalone registers (MDR, ALUout) are fast even though the RegFile is not.
–Standalone register write = instantaneous
–Register File write = ½ clock
So we want to skip states 4 (lw) and 7 (R-type).
–Implement a “posted-write” (next page).

4 Posted-Write
We needed states 4 and 7 because writing the Register File takes ½ clock.
–But the data is already held in MDR and ALUout during these states.
–Can we delay the write until the beginning of the next instruction (state 0)?
–What about the control signals?
This is a “Posted-Write”:
–a write operation “posted” (scheduled) to occur later.

5 Posted-Write Implementation
We simply hold the control signals for 1 extra clock with flip-flops:
–RegDst, RegWrite, MemToReg
Now the signals remain available for 1 extra clock.
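The clock savings from skipping states 4 and 7 can be sketched as follows. This is a minimal illustration, not part of the original slides: the per-instruction state sequences are assumed to follow the standard 2nd-edition multicycle state diagram (0 = fetch, 1 = decode, 2 = address computation, 3 = lw memory read, 4 = lw write-back, 5 = sw memory write, 6 = R-type execute, 7 = R-type write-back, 8 = beq), and the instruction mix is purely hypothetical.

```python
# Assumed state sequences of the 2nd-edition multicycle CU (one clock per state).
ORIGINAL_STATES = {
    "lw": [0, 1, 2, 3, 4],
    "sw": [0, 1, 2, 5],
    "r-type": [0, 1, 6, 7],
    "beq": [0, 1, 8],
}

# States whose only job is the register-file write; the posted-write
# moves that write into state 0 of the NEXT instruction.
WRITE_BACK_STATES = {4, 7}


def with_posted_write(paths):
    """Return the state sequences after dropping the write-back states."""
    return {
        op: [s for s in states if s not in WRITE_BACK_STATES]
        for op, states in paths.items()
    }


def average_cpi(paths, mix):
    """Average clocks per instruction for a given instruction mix."""
    return sum(len(paths[op]) * fraction for op, fraction in mix.items())


# Hypothetical instruction mix, chosen only for illustration.
HYPOTHETICAL_MIX = {"lw": 0.25, "sw": 0.10, "r-type": 0.50, "beq": 0.15}

posted = with_posted_write(ORIGINAL_STATES)
print("CPI before:", average_cpi(ORIGINAL_STATES, HYPOTHETICAL_MIX))
print("CPI after: ", average_cpi(posted, HYPOTHETICAL_MIX))
```

Under these assumptions lw drops from 5 to 4 clocks and R-type from 4 to 3, while sw and beq are unchanged because they never write the Register File.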

6 Questions
The DPU modifications are complete; modify the CU to implement the register posted-write.
–DPU and CU are on the next pages.
What justification did Mr. Trojan give his boss for using positive-edge-triggered flip-flops?
–The design team says that positive-edge-triggered FFs cost extra. Can Mr. Trojan use negative-edge-triggered FFs instead?

9 When to load FF Ms. Bruin suggested a RegWrite_FF_Write as shown below. Comment on the design and its necessity.

10 Posted Write for sw
Ms. Bruin was given another chance by the lead engineer. She tried to copy Mr. Trojan and suggested saving a clock in the sw instruction by skipping state 5 and adding the following 2 FFs. Advice?

11 Example – Pipeline CPU
A new 4-stage pipeline
–MEM before EX
–No spurious stalls
New R-type instructions (e.g., addm, …)
–Use a memory operand as a source operand
Writing to the RegFile takes very little time => no separate WB stage
Memory: one read port
beq in the EX stage
EAC not possible => revised lw and sw

12 New 4-stage Pipeline …. addm
Investigate data dependencies and implement the HDU and FU.
Avoid any spurious stalls (stall only on real dependencies).
No internal forwarding in memory
–Cannot write to and read from memory simultaneously.

14 New 4-stage Pipeline …. (sw, lw)
BEQ is executing after the _____ stage, in the ____ stage.
Where should we execute sw?
Where should we execute lw?
beq rs, rt, Target
sw rt, (rs); MEM[(rs)] <= (rt)

15 New 4-stage Pipeline …. (Hazard and stalling)
                                               | Regular pipeline | 4-stage pipeline
Dependencies/RAW hazards for register operand  |                  |
Instruction to activate MemRead                |                  |
Stalling instruction in ______ stage           |                  |
Condition of stalling                          |                  |

16 New 4-stage Pipeline …. (Hazard and stalling…)

17
sw $1, ($2);
lw $4, ($2);
addm $8, ($2), $4;
subm $16, ($8), $4;
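As a starting point for the HDU/FU analysis, the register RAW dependencies in the four-instruction sequence above can be listed mechanically. The sketch below is an illustration only. It assumes that `addm rd, (rs), rt` reads registers rs (as a memory address) and rt and writes rd, that `subm` behaves the same way, and that `sw rt, (rs)` reads both registers and writes none. The tuple encoding and the function name `raw_dependencies` are hypothetical. Memory dependencies (such as sw followed by lw to the same address) are not tracked.

```python
# Each entry: (mnemonic, destination register or None, source registers).
# Operand roles are ASSUMED from the slide syntax, e.g. addm $8, ($2), $4
# is taken to mean $8 <= MEM[($2)] + ($4).
PROGRAM = [
    ("sw", None, ["$1", "$2"]),
    ("lw", "$4", ["$2"]),
    ("addm", "$8", ["$2", "$4"]),
    ("subm", "$16", ["$8", "$4"]),
]


def raw_dependencies(program):
    """List (producer, consumer, register, distance) for register RAW hazards.

    For every source register of every instruction, find the nearest
    earlier instruction that writes that register.
    """
    deps = []
    for j, (consumer, _dest, sources) in enumerate(program):
        for reg in sources:
            for i in range(j - 1, -1, -1):
                producer, dest, _ = program[i]
                if dest == reg:
                    deps.append((producer, consumer, reg, j - i))
                    break
    return deps


for producer, consumer, reg, distance in raw_dependencies(PROGRAM):
    print(f"{consumer} depends on {producer} through {reg} (distance {distance})")
```

Under these assumptions the sequence has three register dependencies: addm on lw through $4, subm on addm through $8, and subm on lw through $4. Whether each one needs a stall or only forwarding depends on the stage in which the operand is consumed, which is what the exercise asks you to work out.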

21 Lab7, modified
Now implement the SUB3 and SUB6 instructions (SUB3 in EX1 and EX2).
–We still have NOP.
Optimize performance by performing SUB3 in EX1 or EX2 (i.e., minimize stalling).
The new stalling policy:
–Never stall SUB3, and stall SUB6 iff it is dependent on the preceding instruction.

23 Logic Blocks
Postponing logic
–Assertions to perform SUB3 in EX1 or EX2
–Prefer EX1 so the data is available to forward.
HDU
–Stall only dependent SUB6 instructions.
FU1 and FU2
–Forwarding from EX2→EX1 and WB→EX2

24 Stall vs. Flushing
When do you flush and when do you stall?
–How many instructions do you flush at a time?
–How many instructions in the pipe do you stall?
–Do flushing and stalling have anything in common?
–Which of them results in bubbles?
–Is the penalty due to flushing / stalling more severe in deeper pipelines (say, 7-10 stages)?
–How do delay slots affect the penalty?

25 1-bit CLA adder
[Figure: 1-bit adder cell (+) with inputs A, B, Cin and outputs S, p, g]
p: propagator => p = A+B (if either A or B is 1, Cin = 1 causes Cout = 1)
g: generator => g = AB (if both A and B are 1, Cout = 1 for sure)
p and g are generated 1 gate delay after A and B are available. Note that Cin is not needed to produce p and g.
S is generated 2 gate delays after Cin is available (2-level SOP).
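The definitions of p, g, and S for a single cell can be checked with a short sketch. The function name `cla_cell` is made up for this illustration. Here + is logical OR and juxtaposition is AND, as on the slide, and the sum is computed as the usual full-adder XOR of A, B, and Cin.

```python
def cla_cell(a, b, cin):
    """One bit of a carry look-ahead adder; a, b, cin are 0 or 1.

    Returns (s, p, g, cout). Note that p and g depend only on a and b.
    """
    p = a | b              # propagate: a carry-in of 1 is passed on
    g = a & b              # generate: a carry-out of 1 regardless of cin
    s = a ^ b ^ cin        # sum bit
    cout = g | (p & cin)   # carry-out expressed with p and g
    return s, p, g, cout


# Exhaustive check against ordinary addition of three bits.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, p, g, cout = cla_cell(a, b, cin)
            assert 2 * cout + s == a + b + cin
```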

26 4-bit CLA
[Figure: four 1-bit adder cells (+) with inputs A0B0 … A3B3 and C0, feeding p0 g0 … p3 g3 into the CLL (carry look-ahead logic), which returns C1, C2, C3; outputs S0 … S3]
The CLL takes the p, g from all 4 bits and C0 as inputs to generate all C’s in 2 gate delays.
C1 = g0 + p0C0
C2 = g1 + p1g0 + p1p0C0
C3 = g2 + p2g1 + p2p1g0 + p2p1p0C0
C4 = g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0C0
(Note: C4 looks complicated; however, it is still a 2-level SOP representation.)
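The four carry equations can be verified exhaustively with a small sketch. The function name `cla4` and the integer bit-packing are choices made for this illustration only. Each carry is written as a flat OR of AND terms, mirroring the 2-level SOP form, rather than being rippled from the previous carry.

```python
def cla4(a, b, c0):
    """4-bit carry look-ahead add of a and b (0..15) with carry-in c0.

    Returns (sum_bits_as_int, c4).
    """
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]  # p_i = A_i + B_i
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]  # g_i = A_i B_i

    # Carry look-ahead logic: every carry is a 2-level SOP of p, g, c0.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    s = 0
    for i in range(4):
        s |= ((((a >> i) ^ (b >> i)) & 1) ^ carries[i]) << i
    return s, c4


# Exhaustive check against integer addition.
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            s, c4 = cla4(a, b, c0)
            assert 16 * c4 + s == a + b + c0
```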

27 4-bit CLA
[Figure: the same 4-bit CLA diagram as on the previous slide, annotated with delays]
Given the A’s and B’s, all p, g’s are generated in 1 gate delay in parallel.
Given all p, g’s, all C’s are generated in 2 gate delays in parallel.
Given all C’s, all S’s are generated in 2 gate delays in parallel.
Key virtue of CLA: the sequential operation of the RCA is broken into parallel operations!

28 16-bit CLA
Same as before, the p, g’s are generated in parallel in 1 gate delay.
Now, without an input carry, the first-tier CLLs cannot yet generate C’s. Instead they generate P, G’s (group propagator and group generator) in 2 gate delays.
–P => this group will propagate the input carry through the group: P = p0p1p2p3
–G => this group will generate an output carry: G = g3 + p3g2 + p3p2g1 + p3p2p1g0
The second-tier CLL takes the P, G’s from the first-tier CLLs and C0 to generate “seed C’s” for the first-tier CLLs in 2 gate delays. (Note that the logic for generating “seed C’s” from P, G’s is exactly the same as the logic for generating C’s from p, g’s!)
With the seed C’s as input, the first-tier CLLs use Cin and the p, g’s to generate C’s in 2 gate delays.
With all C’s in place, the S’s are calculated in 2 gate delays.
Therefore, a total of 1+2+2+2+2 = 9 gate delays to finish the whole thing!
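The two-tier flow can be sketched in Python to confirm that the same CLL logic serves both tiers. The names `cll` and `cla16`, and the choice to return the group P, G from the same helper that produces the carries, are assumptions made for this illustration. The comments mark which of the five delay steps each part corresponds to.

```python
def cll(p, g, c_in):
    """4-input carry look-ahead logic.

    p, g: lists of four propagate/generate bits (index 0 = least significant).
    Returns ([c_in, c1, c2, c3], P, G), where P and G are the group
    propagate and group generate, which do not depend on c_in.
    """
    c1 = g[0] | (p[0] & c_in)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c_in)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c_in)
    P = p[0] & p[1] & p[2] & p[3]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return [c_in, c1, c2, c3], P, G


def cla16(a, b, c0):
    """16-bit two-tier CLA add; returns (sum_bits_as_int, c16)."""
    # Step 1 (1 gate delay): bit-level p, g.
    p = [((a >> i) | (b >> i)) & 1 for i in range(16)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(16)]

    # Step 2 (2 gate delays): first-tier CLLs produce group P, G only
    # (the carry-in passed here is a dummy; P and G ignore it).
    groups = [cll(p[4 * k:4 * k + 4], g[4 * k:4 * k + 4], 0) for k in range(4)]
    P = [grp[1] for grp in groups]
    G = [grp[2] for grp in groups]

    # Step 3 (2 gate delays): second-tier CLL produces seed carries C0, C4, C8, C12.
    seeds, P_all, G_all = cll(P, G, c0)
    c16 = G_all | (P_all & c0)

    # Step 4 (2 gate delays): first-tier CLLs produce all remaining carries.
    carries = []
    for k in range(4):
        group_carries, _, _ = cll(p[4 * k:4 * k + 4], g[4 * k:4 * k + 4], seeds[k])
        carries.extend(group_carries)

    # Step 5 (2 gate delays): sum bits.
    s = 0
    for i in range(16):
        s |= ((((a >> i) ^ (b >> i)) & 1) ^ carries[i]) << i
    return s, c16


GATE_DELAYS = [1, 2, 2, 2, 2]  # steps 1..5 above
assert sum(GATE_DELAYS) == 9
assert cla16(0xFFFF, 0x0001, 0) == (0x0000, 1)
```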

29 Example - 64-bit CLA
S39 takes longer to become valid.
List the primary and intermediate signals in producing S39:
(Backtracking: S39 = A39 ⊕ B39 ⊕ C39, S39 <- C39 <- C36 …)
–Do we need P39_36* and G39_36*?
–Primary inputs:
–Gate delay to generate p38_0, g38_0:
–Gate delay for second level P*, G*:
–Gate delay for second level P**, G**:
–Gate delay for C32:
–p38, p37, p36, and g38, g37, g36
–C32 -> C36 -> C39 Delay:

30 Other Topics
Usually there is a question on non-linear pipelines.
Please make sure that you are comfortable with cache and virtual memory organization.
