Computer Architecture

Presentation on theme: "Computer Architecture"— Presentation transcript:

1 Computer Architecture
Goal: Build the best possible “processor”
  - Here’s a piece of silicon
  - Here are some of its properties
  - Tell me what to build
1. Understand your building blocks
2. Understand what “best” means
3. Take into account design/production time
A. Moshovos © ECE Fall ‘07 ECE Toronto

2 Track Record

3 Evolution?

4 Modern Designs

5 Understanding your Building Blocks

6 Moore’s Law

7 Moore’s Law in Practice

8 The other Moore’s Law

9 Technology Scaling

10 Ideal Shrink vs. New Design

11 Understanding what is Best

12 Why Study Computer Architecture

13 Why Study Computer Architecture

14 Challenges in Computer Architecture

15 Review of Modern Processor Architectures

16 Sequential Execution Semantics
Contract: How the machine appears to behave

17 Dissecting Instructions
- Data Movement
- Data Manipulation
- Control Flow

18 An Instruction in a Processor’s Lifetime

19 Pipelining

20 Sequential Semantics are Preserved

21 Superscalar - In-order
- Two or more consecutive instructions in the original program order can execute in parallel
  - This is the dynamic execution order
- N-way superscalar
  - Can issue up to N instructions per cycle
  - 2-way, 3-way, …
[Figure: timing diagram of ld, add, sub and bne passing through fetch and decode]

22 Data Dependences

23 Superscalar vs. Pipelining
sum += a[i--]
  loop: ld  r2, 10(r1)
        add r3, r3, r2
        sub r1, r1, 1
        bne r1, r0, loop
[Figure: two timing diagrams over time, one for pipelining and one for superscalar, each showing ld, add, sub and bne passing through fetch and decode]

24 Superscalar Performance
Performance spectrum?
- What if all instructions were dependent?
  - No speedup (speedup factor of 1): superscalar buys us nothing
- What if all instructions were independent?
  - Speedup = N, where N = superscalarity
- Again, the key is typical program behavior
  - Some parallelism exists
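
The two ends of this spectrum can be checked with a small model. The following is a minimal sketch, not taken from the slides: it assumes in-order issue, a one-cycle latency for every instruction, and no resource limits other than the issue width. The names `issue_cycles` and `speedup` are hypothetical.

```python
def issue_cycles(deps, width):
    """Cycles needed to issue every instruction in order, at most `width` per cycle.

    deps[i] lists the indices of the earlier instructions whose results
    instruction i reads (RAW). Assumption: each result is available one
    cycle after its producer issues.
    """
    issued_at = []
    cycle, slots = 0, 0
    for producers in deps:
        # Earliest cycle allowed by the data dependences.
        ready = max((issued_at[p] + 1 for p in producers), default=0)
        if ready > cycle:
            cycle, slots = ready, 0      # stall until the operands exist
        if slots == width:
            cycle, slots = cycle + 1, 0  # issue group is full
        issued_at.append(cycle)
        slots += 1
    return cycle + 1 if deps else 0


def speedup(deps, width):
    """Speedup of a `width`-way machine over a 1-way machine in this model."""
    return issue_cycles(deps, 1) / issue_cycles(deps, width)


# Eight instructions: a fully dependent chain, and a fully independent set.
chain = [[]] + [[i] for i in range(7)]
independent = [[] for _ in range(8)]
print(speedup(chain, 4), speedup(independent, 4))
```

Real programs fall between the two cases, which is why the measured speedup depends on how much parallelism the typical instruction stream exposes.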

25 “Real-Life” Performance
OLTP = Online Transaction Processing
Source: Partha Ranganathan, Kourosh Gharachorloo, Sarita Adve, Luiz André Barroso, “Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors,” ASPLOS ’98

26 “Real Life” Performance
SPEC CPU 2000, SimpleScalar simulation: 32K I$ and D$, 8K bpred

27 Issue Mechanism – A Group of Instructions at Decode
[Figure: a group of instructions in program order, each with tgt, src1 and src2 fields; simplifications may be possible, resource checking not shown]
- Assume 2 sources & 1 target max per instr.
- Comparators for 2-way: 3 for tgt and 2 for src (tgt: WAW + WAR, src: RAW)
- Comparators for 4-way:
  - 2nd instr: 3 tgt and 2 src
  - 3rd instr: 6 tgt and 4 src
  - 4th instr: 9 tgt and 6 src
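
The comparator counts follow from comparing each instruction against every earlier instruction in its group. A small sketch of that arithmetic, assuming (as on the slide) at most 2 sources and 1 target per instruction; the function names are hypothetical:

```python
def comparators_for(position):
    """Comparators needed by the instruction at `position` (1-based) in a group.

    Against each earlier instruction:
      - the target is compared with the earlier target (WAW) and the two
        earlier sources (WAR): 3 comparators
      - each of the two sources is compared with the earlier target (RAW):
        2 comparators
    """
    earlier = position - 1
    return 3 * earlier, 2 * earlier


def comparators_for_group(width):
    """Total (tgt, src) comparators for an issue group of `width` instructions."""
    per_instr = [comparators_for(p) for p in range(1, width + 1)]
    return sum(t for t, _ in per_instr), sum(s for _, s in per_instr)


for width in (2, 4, 8):
    print(width, comparators_for_group(width))
```

The totals grow quadratically with the issue width, which is one reason wide in-order issue logic gets expensive.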

28 Preserving Sequential Semantics
- In principle not much different than pipelining
- Program order is preserved in the pipeline
  - Some instructions proceed in parallel
  - But order is clearly defined
- Defer interrupts to commit stage (i.e., writeback)
  - Flush all subsequent instructions; may include instructions committing simultaneously
  - Allow all preceding instructions to commit
- Recall comparisons are done in program order
- Must have sufficient time in clock cycle to handle these

29 Interrupts Example
[Figure: two timing diagrams of ld, add, div, bne, bne passing through fetch and decode, marking where an exception is raised and where it is taken]

30 Case Study: Alpha 21164

31 21164: Int. Pipe

32 21164: Memory Pipeline

33 21164: Floating-Point Pipe

34 Performance Comparison
Source:

35 CPI Comparison: Ideal 0.25

36 Compiler Impact
[Figure labels: Optimized, Base, Performance]

37 Stall Cycles - 21164
[Figure labels: Data Dependences/Data Stalls, No instructions]

38 Issue Cycle Distribution - 21064

39 Issue Cycle Distribution - 21164

40 Stall Cycles Distribution
- Model: when no instruction is committing
- Does not capture overlapping factors:
  - Stall due to dependence while committing
  - Stall due to cache miss while committing

41 Replay Traps
- Tried to do something and couldn’t
  - Store and write-buffer is full
    - Can’t complete instruction
  - Load and miss-address-file full
- Assumed cache hit and was miss
  - Dependent instructions executed
  - Must re-execute dependent instructions
- Re-execute the instruction and everything that follows

42 Replay Traps Explained
Cache hit:
  ld  r1      F D E M W
  add _, r1   F D D E M W
Cache miss:
  ld  r1      F D E M M W
  add _, r1   F D D D E M W
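
The two timelines can be regenerated from one rule: the dependent add stays in D until the cycle after the load's last M stage. This is a minimal sketch under the assumptions that the add is fetched one cycle after the load and that a miss simply adds extra M cycles; `load_use_timeline` is a hypothetical name.

```python
def load_use_timeline(miss_cycles=0):
    """Stage sequences for `ld r1` and a dependent `add _, r1`.

    Assumptions: five stages (F D E M W), the add is fetched one cycle
    after the load, and a cache miss adds `miss_cycles` extra M cycles.
    """
    ld = ["F", "D", "E"] + ["M"] * (1 + miss_cycles) + ["W"]

    last_m_cycle = 3 + miss_cycles     # ld occupies cycles 0, 1, 2, 3, ...
    add_exec_cycle = last_m_cycle + 1  # add executes once the load data is back
    # The add is in F at cycle 1 and waits in D from cycle 2 until it executes.
    add = ["F"] + ["D"] * (add_exec_cycle - 2) + ["E", "M", "W"]
    return " ".join(ld), " ".join(add)


for label, extra in (("hit", 0), ("miss", 1)):
    ld_row, add_row = load_use_timeline(extra)
    print(f"{label:4}  ld: {ld_row:13}  add: {add_row}")
```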

43 Optimistic Scheduling
Cache hit:
  ld  r1      F D E M W
  add _, r1   F D D E M W
[Figure annotations on the timeline, at stages M, E and D:]
- Hit/miss known here
- add should start execution here
- Must decide that add should execute
- Start making scheduling decisions

44 Optimistic Scheduling #2
Cache hit:
  ld  r1      F D E M W
  add _, r1   F D D E M W
[Figure annotations on the timeline, at stages M, E and D:]
- Hit/miss known here
- Guess hit/miss
- add should start execution here
- Must decide that add should execute
- Start making scheduling decisions

45 Stall Distribution

46 21164

47 Instruction Decode/Issue
- Up to four insts/cycle
- Naturally aligned groups
  - Must start at 16 byte boundary (INT16)
  - Simplifies fetch path
- Where do instructions come from?
[Figure: what the I-cache delivers vs. what the CPU needs]

48 Fetching Four Instructions
- Where do instructions come from?
[Figure: what the I-cache delivers vs. what the CPU needs]
- Software must guarantee alignment at 16 byte boundaries
  - Lots of NOPs
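
To see where the NOPs come from, here is a small sketch of the padding arithmetic. It assumes 4-byte instructions (Alpha instructions are 32 bits wide) and that software pads with NOPs until the next 16-byte boundary; `nops_to_align` is a hypothetical helper, not a tool from the slides.

```python
INSTR_BYTES = 4    # Alpha instructions are 32 bits wide
GROUP_BYTES = 16   # an issue group must start on a 16-byte boundary


def nops_to_align(addr):
    """NOPs to insert so that the next instruction starts a new aligned group."""
    return (-addr % GROUP_BYTES) // INSTR_BYTES


for addr in (0x1000, 0x1004, 0x1008, 0x100C):
    print(hex(addr), nops_to_align(addr))
```

In the worst case three of the four slots before an aligned target hold NOPs, which is the cost the slide refers to.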

49 Instruction Decode/Issue
- Up to four insts/cycle
- Naturally aligned groups
  - Must start at 16 byte boundary (INT16)
  - Simplifies fetch path (in a second)
- All of group must issue before next group gets in
  - Simplifies scheduling
  - No need for reshuffling

50 Pipeline Processing Front-End

51 Integer Add

52 Floating-Point Add

53 Load Hit

54 Load Miss

55 Store Hit

56 Sequential Semantics - Review
Instructions appear as if they executed:
- In the order they appear in the program
- One after the other
[Figure: program order compared across pipelining, superscalar and out-of-order]

57 Out-of-Order Execution
  loop: add r4, r4, 1
        ld  r2, 10(r4)
        add r3, r3, r2
        sub r1, r1, 1
        bne r1, r0, loop

  do { sum += a[++m]; i--; } while (i != 0);
[Figure: timing diagrams for superscalar and out-of-order execution of add, ld, add, sub and bne passing through fetch and decode]

58 Sequential Semantics?
- Execution does NOT adhere to sequential semantics
  - To be precise: eventually it may
- Simplest solution: define the problem away
  - Not acceptable today: e.g., virtual memory
- Three-phase instruction execution
  - In-Progress, Completed and Committed
[Figure: timeline of sub, bne, add and ld through fetch and decode, marking where the state is inconsistent and where it is consistent]

59 Back to Sequential Semantics
- Instr. exec. in 3 phases: In-progress, Completed, Committed
  - OOO for In-progress and Completed
  - In-order commits
- Completed, out-of-order: “Visible only inside”
  - Results visible to subsequent instructions
  - Results not visible to outsiders
  - On interrupts completed results are discarded
- Committed, in-order: “Visible to all”
  - Results visible to outsiders
  - On interrupt committed results are preserved

60 How Completes Help w/ Performance
  DIV R3, _, _
  ADD R1, _, _
  ADD _, R1, _
[Figure: timelines over time comparing in-order completes with out-of-order completes, both with in-order commits; a second diagram shows sub, bne, add and ld through fetch and decode with their complete and commit points]

61 Implementing Completes/Commits
- Key idea: maintain sufficient state around to be able to roll back when necessary
- Roll-back: discard (aka squash) all not committed
- One solution (conceptual):
  - Upon complete, instruction records previous value of target register
  - Upon discard, instruction restores target value
  - Upon commit, nothing to do
- We will return to this shortly
- Focus on scheduling mechanisms
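
The conceptual solution can be sketched as an undo log kept next to the register file. This is only an illustration under simplifying assumptions: instructions complete in program order here, so the oldest log entry is also the oldest instruction, and commit merely retires that entry without touching the registers. The class and method names are hypothetical.

```python
from collections import deque


class RollbackRegisters:
    """Register file with an undo log of (target register, previous value)."""

    def __init__(self, num_regs):
        self.regs = [0] * num_regs
        self.undo_log = deque()  # oldest not-yet-committed entry first

    def complete(self, tgt, value):
        # Upon complete: record the previous value, then write the result.
        self.undo_log.append((tgt, self.regs[tgt]))
        self.regs[tgt] = value

    def commit(self):
        # Upon commit: nothing to do to the registers; drop the oldest entry.
        self.undo_log.popleft()

    def squash(self):
        # Upon discard: restore previous values, newest completed first.
        while self.undo_log:
            tgt, old = self.undo_log.pop()
            self.regs[tgt] = old


rf = RollbackRegisters(4)
rf.complete(1, 7)   # r1 = 7
rf.commit()         # r1 = 7 is now permanent
rf.complete(1, 9)   # r1 = 9, not committed
rf.complete(2, 3)   # r2 = 3, not committed
rf.squash()         # interrupt: discard everything not committed
print(rf.regs)
```

Restoring in reverse order matters when two uncommitted instructions write the same register: the older previous value must be the one that survives.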

62 Out-of-Order Execution Overview
[Figure: out-of-order execution overview]
  Program form: static program, dynamic inst. stream (trace), execution window, completed instructions
  Processing phase: dispatch/dependences, inst. issue, inst. execution, inst. reorder & commit
  States: In-Progress, Completed, Committed

63 Out-of-Order Execution: Stages
- Fetch: get instruction from memory
- Decode/Dispatch: what is it? What are the dependences?
- Issue: go, all dependences satisfied
- Execute: perform operation
- Complete: result available to other insts.
- Commit: result available to outsiders

64 Out-of-Order Execution: Stages
- Fetch: get instruction from memory
- Decode/Dispatch: what is it? What are the dependences?
- Issue: go, all dependences satisfied
- Execute: perform operation
- Complete: result available to other insts.
- Commit: result available to outsiders
We’ll start w/ Decode/Dispatch
Then we’ll consider Issue

65 OOO Scheduling
Instruction @ Decode:
- Do I have dependences yet to be satisfied?
  - Yes, stall until they are
  - No, clear to issue
- When do I wake up?
  - Producer completes and notifies me
  - If all dependences are satisfied I can proceed
- Dependence: (later instruction, earlier instruction) & type
- We’ll first consider RAW and then move on to WAW and WAR

66 Stalling @ Decode for RAW
- Are there unsatisfied dependences?
- RAW: have to wait for register value
  - We don’t really care who is producing the value
  - Only whether it is available
- Can use the Register Availability Vector as in pipelining/superscalar
  - Also known as scoreboard
  - At decode: reset the bit corresponding to your target
  - At writeback: set it
  - Check all bits for source regs: if any is 0, stall
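
A minimal sketch of the Register Availability Vector described above, with one bit per register. The class and method names are hypothetical, and the source bits are checked before the target bit is cleared so that an instruction such as add r4, r4, 4 does not stall on itself.

```python
class Scoreboard:
    """One availability bit per register (True = value is available)."""

    def __init__(self, num_regs):
        self.available = [True] * num_regs

    def try_decode(self, srcs, tgt=None):
        """Return False (stall) if any source bit is 0; otherwise claim the target."""
        if not all(self.available[r] for r in srcs):
            return False
        if tgt is not None:
            self.available[tgt] = False  # reset bit: a new value is on its way
        return True

    def writeback(self, tgt):
        self.available[tgt] = True       # set bit: the value is now available


sb = Scoreboard(8)
first = sb.try_decode(srcs=[4], tgt=4)    # add r4, r4, 4 proceeds
stalled = sb.try_decode(srcs=[4], tgt=2)  # ld r2, 10(r4) must wait for r4
sb.writeback(4)                           # the add writes back r4
retry = sb.try_decode(srcs=[4], tgt=2)    # the ld can now proceed
print(first, stalled, retry)
```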

67 Issuing Instructions: Scheduling
- Determine when an instruction can issue
  - Ignore resources for the time being
- Stalled because of RAW w/ preceding instruction
- Concept: producer (write) notifies consumers (read)
- Requirements:
  - Consumers need to be able to identify the producer
  - The register name is one possible link
- Mechanism:
  - Consumer is placed in a reservation station
  - Producer on complete broadcasts its identity
  - Waiting instructions observe
  - Update operand availability
  - Issue if all operands are now available

68 Reservation Station
State pertaining to an instruction:
- What registers it reads
  - Whether they are available
- What is the destination register
- What state is the instruction in
  - Waiting
  - Executing
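
The state listed above maps naturally onto a small record with a wakeup method. This is a sketch under assumptions: the register name is used as the link between producer and consumer, and the status strings are illustrative labels rather than a fixed encoding.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    op: str
    src_ready: Dict[str, bool]   # source register -> is its value available?
    tgt: Optional[str]           # destination register, if any
    status: str = "Wait"         # "Wait" -> "Rdy" -> "Exec"

    def __post_init__(self):
        self._refresh()

    def _refresh(self):
        # A waiting instruction becomes ready once every operand is available.
        if self.status == "Wait" and all(self.src_ready.values()):
            self.status = "Rdy"

    def observe_broadcast(self, reg):
        # A producer completed and broadcast the name of the register it wrote.
        if reg in self.src_ready:
            self.src_ready[reg] = True
        self._refresh()


# add r3, r3, r2 where r2 is still being produced by a load
rs = ReservationStation("add", {"r3": True, "r2": False}, "r3")
before = rs.status
rs.observe_broadcast("r1")   # unrelated producer: nothing changes
after_r1 = rs.status
rs.observe_broadcast("r2")   # the load completes
after_r2 = rs.status
print(before, after_r1, after_r2)
```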

69 Out-Of-Order Exec. Example
loop: add r4, r4, 4 ld r2, 10(r4) 4 cycles lat add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 1 Cycle 0 A. Moshovos © ECE Fall ‘07 ECE Toronto

70 Out-Of-Order Exec. Example: Cycle 0
loop: add r4, r4, 4 ld r2, 10(r4) 5 cycles lat add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Ready to be executed RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4/0 Rdy 1 Cycle 0 A. Moshovos © ECE Fall ‘07 ECE Toronto

71 Cycle 1 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1
bne r1, r0, loop Notify those waiting for R4 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Exec 1 ld r4/1 NA/1 r2 Rdy R4 gets produced now A. Moshovos © ECE Fall ‘07 ECE Toronto

72 Cycle 2 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1
bne r1, r0, loop Result cycle 6 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/0 r3 Wait A. Moshovos © ECE Fall ‘07 ECE Toronto

73 Cycle 3 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1
bne r1, r0, loop Result cycle 6 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/0 r3 Wait No dependences sub r1/1 NA/1 r1 Rdy A. Moshovos © ECE Fall ‘07 ECE Toronto

74 Cycle 4 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1
bne r1, r0, loop Result cycle 6 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/0 r3 Wait r1 produced now Notify consumers sub r1/1 NA/1 r1 Exec bne r1/1 r0/1 NA Rdy r1 will be available next cycle A. Moshovos © ECE Fall ‘07 ECE Toronto

75 Cycle 5 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1
bne r1, r0, loop Result cycle 6 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/0 r3 Wait Completed sub r1/1 NA/1 r1 Compl bne r1/1 r0/1 NA Exec executing A. Moshovos © ECE Fall ‘07 ECE Toronto

76 Cycle 6 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1
bne r1, r0, loop Result cycle 6 Notify consumers RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/1 r3 Rdy Completed sub r1/1 NA/1 r1 Compl bne r1/1 r0/1 NA Exec executing A. Moshovos © ECE Fall ‘07 ECE Toronto

77 Cycle 7 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1
bne r1, r0, loop Notify consumers RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Cmtd Executing add r3/1 r2/1 r3 Exec sub r1/1 NA/1 r1 Compl bne r1/1 r0/1 NA Compl Completed A. Moshovos © ECE Fall ‘07 ECE Toronto

78 Cycle 8 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1
bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Cmtd add r3/1 r2/1 r3 Cmtd sub r1/1 NA/1 r1 Cmtd bne r1/1 r0/1 NA Cmtd A. Moshovos © ECE Fall ‘07 ECE Toronto

79 Window vs. Scheduler Window
Window
- Distance between the oldest and youngest instruction that can co-exist inside the CPU
- Larger window -> potential for more ILP
- Instructions enter at Fetch, exit at Commit
Scheduler
- Number of instructions that are waiting to be issued
- Instructions enter at Decode, leave at writeback/complete
Window >= Scheduler
- Can be the same structure
- In window but not in scheduler -> completed instructions

80 Beyond Simple OoO
  A: LF   F6, 34(R2)
  B: LF   F2, 45(R3)
  C: MULF F0, F2, F4
  D: SUBF F8, F2, F6
  E: ADDF F2, F7, F4
E will wait for B, C and D:
- WAR w/ C and D
- WAW w/ B
Can we do better?
[Figure: dependence graph over A, B, C, D and E]
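
The dependences of E can be derived mechanically by comparing register names. The sketch below is a simplified pairwise check (it ignores intervening writes that could hide a dependence), and the tuple encoding of the instructions is an assumption made for the illustration.

```python
# (name, target register, source registers) for the five instructions above
program = [
    ("A", "F6", ["R2"]),
    ("B", "F2", ["R3"]),
    ("C", "F0", ["F2", "F4"]),
    ("D", "F8", ["F2", "F6"]),
    ("E", "F2", ["F7", "F4"]),
]


def dependences(program, name):
    """Map each earlier instruction to the dependence types `name` has on it."""
    index = [n for n, _, _ in program].index(name)
    _, tgt, srcs = program[index]
    found = {}
    for prev_name, prev_tgt, prev_srcs in program[:index]:
        kinds = set()
        if prev_tgt in srcs:
            kinds.add("RAW")   # reads what the earlier instruction writes
        if tgt in prev_srcs:
            kinds.add("WAR")   # overwrites what the earlier instruction reads
        if tgt == prev_tgt:
            kinds.add("WAW")   # overwrites what the earlier instruction writes
        if kinds:
            found[prev_name] = kinds
    return found


print(dependences(program, "E"))
print(dependences(program, "D"))
```

E has no RAW dependence at all: everything it waits for is a name conflict on F2, which is exactly what renaming removes.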

81 What if we had infinite registers
  A: LF   F6, 34(R2)
  B: LF   F2, 45(R3)
  C: MULF F0, F2, F4
  D: SUBF F8, F2, F6
  E: ADDF F2, F7, F4   becomes   E: ADDF F9, F7, F4
- No false dependences anymore
- Since we do not reuse a name we can’t have WAW and WAR

82 State Recovery Example: Register Alias Table
Original code:
  A: add  r1, r2, 100
  B: breq r1, E
  C: sub  r1, r2, r2
Renamed code:
  A: add  p4, p2, 100
  B: breq p4, E
  C: sub  p5, p2, p2
[Figure: Register Alias Table (RAT) indexed by architectural register, each entry holding a physical register; r1 maps to p4 after A and to p5 after C. Labels: Lg(# arch. regs), # arch. regs, Architectural Register, Physical Register]

83 Register Renaming
- Register version
  - Every write creates a new version
  - Uses read the last version
  - Need to keep a version until all uses have read it
- Register renaming: architectural vs. physical registers
  - More phys. than arch.
  - Maintain a map of arch. to phys. regs.
  - Use in-order decoding to properly identify dependences
  - Instructions wait only for input op. availability
  - Only the last version is written to the reg. file
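
A minimal sketch of the map of architectural to physical registers, applied in decode order to the add/breq/sub sequence from the Register Alias Table slide. The initial mapping (r1 -> p1, r2 -> p2) and the free list (p4, p5) are assumptions chosen for the illustration; the `Renamer` class is hypothetical.

```python
class Renamer:
    def __init__(self, initial_map, free_regs):
        self.rat = dict(initial_map)   # architectural -> physical
        self.free = list(free_regs)    # unused physical registers

    def rename(self, op, tgt, srcs):
        # Sources read the last version; immediates and labels pass through.
        phys_srcs = [self.rat.get(s, s) for s in srcs]
        phys_tgt = None
        if tgt is not None:
            # Every write creates a new version in a fresh physical register.
            phys_tgt = self.free.pop(0)
            self.rat[tgt] = phys_tgt
        return op, phys_tgt, phys_srcs


renamer = Renamer({"r1": "p1", "r2": "p2"}, ["p4", "p5"])
renamed = [
    renamer.rename("add", "r1", ["r2", "100"]),   # A
    renamer.rename("breq", None, ["r1", "E"]),    # B
    renamer.rename("sub", "r1", ["r2", "r2"]),    # C
]
for instr in renamed:
    print(instr)
```

Sources must be looked up before the target is remapped, which is why decoding in program order is enough to identify the true dependences.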

84 ROB: Slow, Fine-Grain Recovery
Each entry contains:
- Architectural destination register
- Its previous RAT map
Recovery:
1. Misprediction discovered
2. Locate newest instruction
3. Undo RAT updates in reverse order
Too slow: recovery latency is proportional to the number of instructions to squash
[Figure: reorder buffer in program order containing branches (B), next to the now INVALID RAT]
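
The walk-back can be sketched with a list standing in for the reorder buffer. The instruction names, register numbers and helper functions below are hypothetical; the point is only that one undo step is needed per squashed instruction.

```python
rat = {"r1": "p1", "r2": "p2"}   # architectural -> physical
rob = []                         # entries in program order: (name, arch dest, previous map)


def dispatch(name, arch_dest=None, new_phys=None):
    """Append a ROB entry and apply the instruction's RAT update, if any."""
    prev = rat.get(arch_dest) if arch_dest is not None else None
    rob.append((name, arch_dest, prev))
    if arch_dest is not None:
        rat[arch_dest] = new_phys


def recover(branch_name):
    """Undo RAT updates from the newest entry back to the mispredicted branch."""
    undone = 0
    while rob and rob[-1][0] != branch_name:
        _, arch_dest, prev = rob.pop()
        if arch_dest is not None:
            rat[arch_dest] = prev
        undone += 1
    return undone


dispatch("A", "r1", "p4")
dispatch("B")                 # branch, no destination register
dispatch("C", "r1", "p5")
dispatch("D", "r2", "p6")
steps = recover("B")          # B was mispredicted: squash C and D
print(steps, rat)
```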

85 Global Checkpoints: Fast, Coarse-Grain Recovery
[Figure: reorder buffer in program order with a checkpoint taken at each branch (B); when a misprediction is discovered, the INVALID RAT is restored from the branch’s checkpoint]
Branch w/ GC: recovery is “instantaneous”
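
A sketch of the checkpoint alternative: the whole RAT is copied when a branch is renamed, and recovery restores that copy in a single step regardless of how many instructions follow the branch. The names and register numbers are hypothetical, and the sketch ignores the cost of storing the copies, which is what makes this approach coarse-grain in practice.

```python
rat = {"r1": "p1", "r2": "p2"}   # architectural -> physical
checkpoints = {}                 # branch name -> saved copy of the RAT


def rename_dest(arch_dest, new_phys):
    rat[arch_dest] = new_phys


def take_checkpoint(branch_name):
    checkpoints[branch_name] = dict(rat)


def recover(branch_name):
    """Restore the RAT as it was when the branch was renamed."""
    rat.clear()
    rat.update(checkpoints[branch_name])


rename_dest("r1", "p4")   # A
take_checkpoint("B")      # branch B
rename_dest("r1", "p5")   # C
rename_dest("r2", "p6")   # D
recover("B")              # B was mispredicted
print(rat)
```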

86 MIPS R10000 Pipeline Overview

87 MIPS R10000 Pipelines

88 MIPS R10000 Architecture

89 Register Renaming

90 Scheduler

