Processor Architecture: ISA-Y86/MIPS 15-213: Introduction to Computer Systems
Course Outline Background Sequential Implementation Pipelining Instruction sets Logic design Sequential Implementation A simple, but not very fast processor design Pipelining Get more things running simultaneously Pipelined Implementation Make it work Advanced Topics Performance analysis High performance processor design
Coverage Our Approach Work through designs for particular instruction set Y86-64 − a simplified version of the Intel x86-64 If you know one, you more-or-less know them all Work at “microarchitectural” level Assemble basic hardware blocks into overall processor structure Memories, functional units, etc. Surround by control logic to make sure each instruction flows through properly Use simple hardware description language to describe control logic Can extend and modify Test via simulation Route to design using Verilog Hardware Description Language See Web aside ARCH:VLOG
Instruction Set Architecture Assembly Language View Processor state Registers, memory, … Instructions addq, pushq, ret, … How instructions are encoded as bytes Layer of Abstraction Above: how to program machine Processor executes instructions in a sequence Below: what needs to be built Use variety of tricks to make it run fast E.g., execute multiple instructions simultaneously ISA Compiler OS CPU Design Circuit Chip Layout Application Program
Y86-64 Processor State Program Registers RF: Program registers CC: Condition codes Stat: Program status %r8 %r9 %r10 %r11 %r12 %r13 %r14 %rax %rcx %rdx %rbx %rsp %rbp %rsi %rdi ZF SF OF DMEM: Memory PC Program Registers 15 registers (omit %r15). Each 64 bits Condition Codes Single-bit flags set by arithmetic or logical instructions ZF: Zero SF:Negative OF: Overflow Program Counter Indicates address of next instruction Program Status Indicates either normal operation or some error condition Memory Byte-addressable storage array Words stored in little-endian byte order
Y86-64 Instruction Set #1 Byte 1 2 3 4 5 6 7 8 9 halt nop 1 1 2 3 4 5 6 7 8 9 halt nop 1 cmovXX rA, rB 2 fn rA rB irmovq V, rB 3 F rB V rmmovq rA, D(rB) 4 rA rB D mrmovq D(rB), rA 5 rA rB D OPq rA, rB 6 fn rA rB jXX Dest 7 fn Dest call Dest 8 Dest ret 9 pushq rA A rA F popq rA B rA F
Y86-64 Instructions Format 1–10 bytes of information read from memory Can determine instruction length from first byte Not as many instruction types, and simpler encoding than with x86-64 Each accesses and modifies some part(s) of the program state
Y86-64 Instruction Set #2 Byte rrmovq 7 1 2 3 4 5 6 7 8 9 cmovle 7 1 Byte 1 2 3 4 5 6 7 8 9 cmovle 7 1 halt cmovl 7 2 nop 1 cmove 7 3 cmovXX rA, rB 2 fn rA rB cmovne 7 4 irmovq V, rB 3 F rB V cmovge 7 5 rmmovq rA, D(rB) 4 rA rB D cmovg 7 6 mrmovq D(rB), rA 5 rA rB D OPq rA, rB 6 fn rA rB jXX Dest 7 fn Dest call Dest 8 Dest ret 9 pushq rA A rA F popq rA B rA F
Y86-64 Instruction Set #3 Byte 1 2 3 4 5 6 7 8 9 halt nop 1 1 2 3 4 5 6 7 8 9 halt nop 1 cmovXX rA, rB 2 fn rA rB irmovq V, rB 3 F rB V rmmovq rA, D(rB) 4 rA rB D addq 6 subq 1 andq 2 xorq 3 mrmovq D(rB), rA 5 rA rB D OPq rA, rB 6 fn rA rB jXX Dest 7 fn Dest call Dest 8 Dest ret 9 pushq rA A rA F popq rA B rA F
Y86-64 Instruction Set #4 Byte jmp 7 jle 1 jl 2 je 3 jne 4 jge 5 jg 6 jle 1 jl 2 je 3 jne 4 jge 5 jg 6 Byte 1 2 3 4 5 6 7 8 9 halt nop 1 cmovXX rA, rB 2 fn rA rB irmovq V, rB 3 F rB V rmmovq rA, D(rB) 4 rA rB D mrmovq D(rB), rA 5 rA rB D OPq rA, rB 6 fn rA rB jXX Dest 7 fn Dest call Dest 8 Dest ret 9 pushq rA A rA F popq rA B rA F
Encoding Registers Each register has 4-bit ID Same encoding as in x86-64 Register ID 15 (0xF) indicates “no register” Will use this in our hardware design in multiple places %rax %rcx %rdx %rbx 1 2 3 %rsp %rbp %rsi %rdi 4 5 6 7 %r8 %r9 %r10 %r11 8 9 A B %r12 %r13 %r14 No Register C D E F
Instruction Example Addition Instruction Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data Set condition codes based on result e.g., addq %rax,%rsi Encoding: 60 06 Two-byte encoding First indicates instruction type Second gives source and destination registers Generic Form Encoded Representation addq rA, rB 6 rA rB
Arithmetic and Logical Operations Instruction Code Function Code Refer to generically as “OPq” Encodings differ only by “function code” Low-order 4 bytes in first instruction word Set condition codes as side effect Add addq rA, rB 6 rA rB Subtract (rA from rB) subq rA, rB 6 1 rA rB And andq rA, rB 6 2 rA rB Exclusive-Or xorq rA, rB 6 3 rA rB
Move Operations Like the x86-64 movq instruction Register Register rrmovq rA, rB 2 Immediate Register irmovq V, rB 3 F rB V Register Memory rmmovq rA, D(rB) 4 rA rB D Memory Register mrmovq D(rB), rA 5 rA rB D Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct
Move Instruction Examples Y86-64 movq $0xabcd, %rdx irmovq $0xabcd, %rdx Encoding: 30 82 cd ab 00 00 00 00 00 00 movq %rsp, %rbx rrmovq %rsp, %rbx Encoding: 20 43 movq -12(%rbp),%rcx mrmovq -12(%rbp),%rcx Encoding: 50 15 f4 ff ff ff ff ff ff ff movq %rsi,0x41c(%rsp) rmmovq %rsi,0x41c(%rsp) Encoding: 40 64 1c 04 00 00 00 00 00 00
Conditional Move Instructions Move Unconditionally Refer to generically as “cmovXX” Encodings differ only by “function code” Based on values of condition codes Variants of rrmovq instruction (Conditionally) copy value from source to destination register rrmovq rA, rB 2 rA rB Move When Less or Equal cmovle rA, rB 2 1 rA rB Move When Less cmovl rA, rB 2 rA rB Move When Equal cmove rA, rB 2 3 rA rB Move When Not Equal cmovne rA, rB 2 4 rA rB Move When Greater or Equal cmovge rA, rB 2 5 rA rB Move When Greater cmovg rA, rB 2 6 rA rB
Jump Instructions Refer to generically as “jXX” Jump (Conditionally) jXX Dest 7 fn Dest Refer to generically as “jXX” Encodings differ only by “function code” fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address Unlike PC-relative addressing seen in x86-64
Jump Instructions Jump Unconditionally jmp Dest 7 Dest Jump When Less or Equal jle Dest 7 1 Dest Jump When Less jl Dest 7 2 Dest Jump When Equal je Dest 7 3 Dest Jump When Not Equal jne Dest 7 4 Dest Jump When Greater or Equal jge Dest 7 5 Dest Jump When Greater jg Dest 7 6 Dest
Y86-64 Program Stack Region of memory holding program data Stack “Bottom” Region of memory holding program data Used in Y86-64 (and x86-64) for supporting procedure calls Stack top indicated by %rsp Address of top stack element Stack grows toward lower addresses Top element is at highest address in the stack When pushing, must first decrement stack pointer After popping, increment stack pointer • Increasing Addresses %rsp Stack “Top”
Stack Operations Decrement %rsp by 8 pushq rA A rA F Decrement %rsp by 8 Store word from rA to memory at %rsp Like x86-64 Read word from memory at %rsp Save in rA Increment %rsp by 8 popq rA B rA F
Subroutine Call and Return call Dest 8 Dest Push address of next instruction onto stack Start executing instructions at Dest Like x86-64 Pop value from stack Use as address for next instruction ret 9
Miscellaneous Instructions nop 1 Don’t do anything Stop executing instructions x86-64 has comparable instruction, but can’t execute it in user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to zero will halt halt
Status Conditions Desired Behavior Normal operation Mnemonic Code AOK 1 Normal operation Halt instruction encountered Bad address (either instruction or data) encountered Invalid instruction encountered Desired Behavior If AOK, keep going Otherwise, stop program execution Mnemonic Code HLT 2 Mnemonic Code ADR 3 Mnemonic Code INS 4
Writing Y86-64 Code Try to Use C Compiler as Much as Possible Write code in C Compile for x86-64 with gcc –Og –S Transliterate into Y86-64 Modern compilers make this more difficult Coding Example Find number of elements in null-terminated list int len1(int a[]); 5043 6125 7395 a 3
Y86-64 Code Generation Example First Try Write typical array code Compile with gcc -Og -S Problem Hard to do array indexing on Y86-64 Since don’t have scaled addressing modes /* Find number of elements in null-terminated list */ long len(long a[]) { long len; for (len = 0; a[len]; len++) ; return len; } L3: addq $1,%rax cmpq $0, (%rdi,%rax,8) jne L3
Y86-64 Code Generation Example #2 Second Try Write C code that mimics expected Y86-64 code Result Compiler generates exact same code as before! Compiler converts both versions into same intermediate form long len2(long *a) { long ip = (long) a; long val = *(long *) ip; long len = 0; while (val) { ip += sizeof(long); len++; val = *(long *) ip; } return len;
Y86-64 Code Generation Example #3 len: irmovq $1, %r8 # Constant 1 irmovq $8, %r9 # Constant 8 irmovq $0, %rax # len = 0 mrmovq (%rdi), %rdx # val = *a andq %rdx, %rdx # Test val je Done # If zero, goto Done Loop: addq %r8, %rax # len++ addq %r9, %rdi # a++ jne Loop # If !0, goto Loop Done: ret Register Use %rdi a %rax len %rdx val %r8 1 %r9 8
Y86-64 Sample Program Structure #1 init: # Initialization . . . call Main halt .align 8 # Program data array: Main: # Main function call len . . . len: # Length function .pos 0x100 # Placement of stack Stack: Program starts at address 0 Must set up stack Where located Pointer values Make sure don’t overwrite code! Must initialize data
Y86-64 Program Structure #2 Program starts at address 0 init: # Set up stack pointer irmovq Stack, %rsp # Execute main program call Main # Terminate halt # Array of 4 elements + terminating 0 .align 8 Array: .quad 0x000d000d000d000d .quad 0x00c000c000c000c0 .quad 0x0b000b000b000b00 .quad 0xa000a000a000a000 .quad 0 Program starts at address 0 Must set up stack Must initialize data Can use symbolic names
Y86-64 Program Structure #3 Set up call to len Main: irmovq array,%rdi # call len(array) call len ret Set up call to len Follow x86-64 procedure conventions Push array address as argument
Assembling Y86-64 Program Generates “object code” file len.yo unix> yas len.ys Generates “object code” file len.yo Actually looks like disassembler output 0x054: | len: 0x054: 30f80100000000000000 | irmovq $1, %r8 # Constant 1 0x05e: 30f90800000000000000 | irmovq $8, %r9 # Constant 8 0x068: 30f00000000000000000 | irmovq $0, %rax # len = 0 0x072: 50270000000000000000 | mrmovq (%rdi), %rdx # val = *a 0x07c: 6222 | andq %rdx, %rdx # Test val 0x07e: 73a000000000000000 | je Done # If zero, goto Done 0x087: | Loop: 0x087: 6080 | addq %r8, %rax # len++ 0x089: 6097 | addq %r9, %rdi # a++ 0x08b: 50270000000000000000 | mrmovq (%rdi), %rdx # val = *a 0x095: 6222 | andq %rdx, %rdx # Test val 0x097: 748700000000000000 | jne Loop # If !0, goto Loop 0x0a0: | Done: 0x0a0: 90 | ret
Simulating Y86-64 Program Instruction set simulator unix> yis len.yo Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original Stopped in 33 steps at PC = 0x13. Status 'HLT', CC Z=1 S=0 O=0 Changes to registers: %rax: 0x0000000000000000 0x0000000000000004 %rsp: 0x0000000000000000 0x0000000000000100 %rdi: 0x0000000000000000 0x0000000000000038 %r8: 0x0000000000000000 0x0000000000000001 %r9: 0x0000000000000000 0x0000000000000008 Changes to memory: 0x00f0: 0x0000000000000000 0x0000000000000053 0x00f8: 0x0000000000000000 0x0000000000000013
CISC Instruction Sets Stack-oriented instruction set Complex Instruction Set Computer IA32 is example Stack-oriented instruction set Use stack to pass arguments, save program counter Explicit push and pop instructions Arithmetic instructions can access memory addq %rax, 12(%rbx,%rcx,8) requires memory read and write Complex address calculation Condition codes Set as side effect of arithmetic and logical instructions Philosophy Add instructions to perform “typical” programming tasks
RISC Instruction Sets Fewer, simpler instructions Reduced Instruction Set Computer Internal project at IBM, later popularized by Hennessy (Stanford) and Patterson (Berkeley) Fewer, simpler instructions Might take more to get given task done Can execute them with small and fast hardware Register-oriented instruction set Many more (typically 32) registers Use for arguments, return pointer, temporaries Only load and store instructions can access memory Similar to Y86-64 mrmovq and rmmovq No Condition codes Test instructions return 0/1 in register
MIPS Registers
MIPS Instruction Examples Op Ra Rb Rd Fn 00000 R-R addu $3,$2,$1 # Register add: $3 = $2+$1 Op Ra Rb Immediate R-I addu $3,$2, 3145 # Immediate add: $3 = $2+3145 sll $3,$2,2 # Shift left: $3 = $2 << 2 Branch Op Ra Rb Offset beq $3,$2,dest # Branch when $3 = $2 Load/Store Op Ra Rb Offset lw $3,16($2) # Load Word: $3 = M[$2+16] sw $3,16($2) # Store Word: M[$2+16] = $3
CISC vs. RISC Original Debate Current Status Strong opinions! CISC proponents---easy for compiler, fewer code bytes RISC proponents---better for optimizing compilers, can make run fast with simple chip design Current Status For desktop processors, choice of ISA not a technical issue With enough hardware, can make anything run fast Code compatibility more important x86-64 adopted many RISC features More registers; use them for argument passing For embedded processors, RISC makes sense Smaller, cheaper, less power Most cell phones use ARM processor
Summary Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC How Important is ISA Design? Less now than before With enough hardware, can make almost anything go fast
Processor Architecture: Logic Design 15-213: Introduction to Computer Systems
Overview of Logic Design Fundamental Hardware Requirements Communication How to get values from one place to another Computation Storage Bits are Our Friends Everything expressed in terms of values 0 and 1 Low or high voltage on wire Compute Boolean functions Store bits of information
Digital Signals Voltage Time 1 Use voltage thresholds to extract discrete values from continuous signal Simplest version: 1-bit signal Either high range (1) or low range (0) With guard range between them Not strongly affected by noise or low quality circuit elements Can make circuits simple, small, and fast
Computing with Logic Gates Outputs are Boolean functions of inputs Respond continuously to changes in inputs With some, small delay Rising Delay Falling Delay a && b b Voltage a Time
Combinational Circuits Acyclic Network Primary Inputs Outputs Acyclic Network of Logic Gates Continously responds to changes on primary inputs Primary outputs become (after some delay) Boolean functions of primary inputs
bool eq = (a&&b)||(!a&&!b) Bit Equality Bit equal a b eq HCL Expression bool eq = (a&&b)||(!a&&!b) Generate 1 if a and b are equal Hardware Control Language (HCL) Very simple hardware description language Boolean operations have syntax similar to C logical operations We’ll use it to describe control logic for processors
Word-Level Representation Word Equality Word-Level Representation b63 Bit equal a63 eq63 b62 a62 eq62 b1 a1 eq1 b0 a0 eq0 Eq = B A Eq HCL Representation bool Eq = (A == B) 64-bit word size HCL representation Equality operation Generates Boolean value
Bit-Level Multiplexor s Bit MUX HCL Expression bool out = (s&&a)||(!s&&b) b out a Control signal s Data signals a and b Output a when s=1, b when s=0
Word-Level Representation Word Multiplexor Word-Level Representation b63 s a63 out63 b62 a62 out62 b0 a0 out0 s B A Out MUX HCL Representation int Out = [ s : A; 1 : B; ]; Select input word A or B depending on control signal s HCL representation Case expression Series of test : value pairs Output value for first successful test
HCL Word-Level Examples Minimum of 3 Words Find minimum of three input words HCL case expression Final case guarantees match int Min3 = [ A < B && A < C : A; B < A && B < C : B; 1 : C; ]; A Min3 MIN3 B C 4-Way Multiplexor D0 D3 Out4 s0 s1 MUX4 D2 D1 Select one of 4 inputs based on two control bits HCL case expression Simplify tests by assuming sequential matching int Out4 = [ !s1&&!s0: D0; !s1 : D1; !s0 : D2; 1 : D3; ];
Arithmetic Logic Unit Combinational logic Y X X + Y A L U Y X X - Y 1 A L U Y X X & Y 2 A L U Y X X ^ Y 3 A B A B A B A B OF ZF CF OF ZF CF OF ZF CF OF ZF CF Combinational logic Continuously responding to inputs Control signal selects function computed Corresponding to 4 arithmetic/logical operations in Y86-64 Also computes values for condition codes
Storing 1 Bit Bistable Element Q+ Q– q !q q = 0 or 1 Vin V1 V2
Storing 1 Bit (cont.) Bistable Element Q+ Q– Stable 1 Vin V1 V2 q = 0 or 1 Stable 1 Vin V1 V2 Vin = V2 Vin V1 V2 Metastable Stable 0
Physical Analogy . . . Stable 1 Metastable Stable 0 Metastable Stable left Stable right . .
Storing and Accessing 1 Bit Bistable Element Q+ Q– q !q q = 0 or 1 Q+ Q– R S R-S Latch Resetting 1 Setting 1 Storing !q q
1-Bit Latch D Latch Q+ Q– R S D C Latching Storing 1 d !d d !d q !q Data Clock Latching 1 d !d Storing d !d q !q
Transparent 1-Bit Latch Latching 1 d !d C D Q+ Time Changing D When in latching mode, combinational propogation from D to Q+ and Q– Value latched depends on value of D as C falls
Edge-Triggered Latch Only in latching mode for brief period Data Q+ Q– C S T Clock Trigger Only in latching mode for brief period Rising clock edge Value latched depends on data as clock rises Output remains stable at all other times C D Q+ Time T
Registers Stores word of data Structure D C Q+ i7 i6 i5 i4 i3 i2 i1 i0 o7 o6 o5 o4 o3 o2 o1 o0 Clock I O Clock Stores word of data Different from program registers seen in assembly code Collection of edge-triggered latches Loads input on rising edge of clock
Register Operation y x Stores data bits State = x State = y Output = y y Rising clock x Input = y Output = x Stores data bits For most of time acts as barrier between input and output As clock rises, loads input
State Machine Example Accumulator circuit Comb. Logic A L U Out MUX 1 Clock In Load Accumulator circuit Load or accumulate on each cycle x0 x1 x2 x3 x4 x5 x0+x1 x0+x1+x2 x3+x4 x3+x4+x5 Clock Load In Out
Random-Access Memory Stores multiple words of memory Register file A B W dstW srcA valA srcB valB valW Read ports Write port Clock Stores multiple words of memory Address input specifies which word to read or write Register file Holds values of program registers %rax, %rsp, etc. Register identifier serves as address ID 15 (0xF) implies no read or write performed Multiple Ports Can read and/or write multiple words in one cycle Each has separate address and data input/output
Register File Timing Reading Writing Like combinational logic Output data generated based on input address After some delay Writing Like register Update only as clock rises Register file A B srcA valA srcB valB 2 x x 2 y 2 Register file W dstW valW Clock x Register file W dstW valW Clock y 2 Rising clock
Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation Parts we want to explore and modify Data Types bool: Boolean a, b, c, … int: words A, B, C, … Does not specify word size---bytes, 64-bit words, … Statements bool a = bool-expr ; int A = int-expr ;
HCL Operations Boolean Expressions Word Expressions Classify by type of value returned Boolean Expressions Logic Operations a && b, a || b, !a Word Comparisons A == B, A != B, A < B, A <= B, A >= B, A > B Set Membership A in { B, C, D } Same as A == B || A == C || A == D Word Expressions Case expressions [ a : A; b : B; c : C ] Evaluate test expressions a, b, c, … in sequence Return word expression A, B, C, … for first successful test
Summary Computation Storage Performed by combinational logic Computes Boolean functions Continuously reacts to input changes Storage Registers Hold single words Loaded as clock rises Random-access memories Hold multiple words Possible multiple read or write ports Read word when address input changes Write word as clock rises
Processor Architecture: Sequential Implementation 15-213: Introduction to Computer Systems
Y86-64 Instruction Set #1 Byte 1 2 3 4 5 6 7 8 9 halt nop 1 1 2 3 4 5 6 7 8 9 halt nop 1 cmovXX rA, rB 2 fn rA rB irmovq V, rB 3 F rB V rmmovq rA, D(rB) 4 rA rB D mrmovq D(rB), rA 5 rA rB D OPq rA, rB 6 fn rA rB jXX Dest 7 fn Dest call Dest 8 Dest ret 9 pushq rA A rA F popq rA B rA F
Y86-64 Instruction Set #2 Byte rrmovq 7 1 2 3 4 5 6 7 8 9 cmovle 7 1 Byte 1 2 3 4 5 6 7 8 9 cmovle 7 1 halt cmovl 7 2 nop 1 cmove 7 3 cmovXX rA, rB 2 fn rA rB cmovne 7 4 irmovq V, rB 3 F rB V cmovge 7 5 rmmovq rA, D(rB) 4 rA rB D cmovg 7 6 mrmovq D(rB), rA 5 rA rB D OPq rA, rB 6 fn rA rB jXX Dest 7 fn Dest call Dest 8 Dest ret 9 pushq rA A rA F popq rA B rA F
Y86-64 Instruction Set #3 Byte 1 2 3 4 5 6 7 8 9 halt nop 1 1 2 3 4 5 6 7 8 9 halt nop 1 cmovXX rA, rB 2 fn rA rB irmovq V, rB 3 F rB V rmmovq rA, D(rB) 4 rA rB D addq 6 subq 1 andq 2 xorq 3 mrmovq D(rB), rA 5 rA rB D OPq rA, rB 6 fn rA rB jXX Dest 7 fn Dest call Dest 8 Dest ret 9 pushq rA A rA F popq rA B rA F
Y86-64 Instruction Set #4 Byte jmp 7 jle 1 jl 2 je 3 jne 4 jge 5 jg 6 jle 1 jl 2 je 3 jne 4 jge 5 jg 6 Byte 1 2 3 4 5 6 7 8 9 halt nop 1 cmovXX rA, rB 2 fn rA rB irmovq V, rB 3 F rB V rmmovq rA, D(rB) 4 rA rB D mrmovq D(rB), rA 5 rA rB D OPq rA, rB 6 fn rA rB jXX Dest 7 fn Dest call Dest 8 Dest ret 9 pushq rA A rA F popq rA B rA F
Building Blocks Combinational Logic Storage Elements = fun B = Combinational Logic Compute Boolean functions of inputs Continuously respond to input changes Operate on data and implement control Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises MUX 1 Register file A B W dstW srcA valA srcB valB valW Clock Clock
Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation Parts we want to explore and modify Data Types bool: Boolean a, b, c, … int: words A, B, C, … Does not specify word size---bytes, 32-bit words, … Statements bool a = bool-expr ; int A = int-expr ;
HCL Operations Boolean Expressions Word Expressions Classify by type of value returned Boolean Expressions Logic Operations a && b, a || b, !a Word Comparisons A == B, A != B, A < B, A <= B, A >= B, A > B Set Membership A in { B, C, D } Same as A == B || A == C || A == D Word Expressions Case expressions [ a : A; b : B; c : C ] Evaluate test expressions a, b, c, … in sequence Return word expression A, B, C, … for first successful test
SEQ Hardware Structure newPC SEQ Hardware Structure PC valE , valM Write back valM State Program counter register (PC) Condition code register (CC) Register File Memories Access same memory space Data: for reading/writing program data Instruction: for reading instructions Instruction Flow Read instruction at address specified by PC Process through stages Update program counter Data Data Memory memory memory Addr , Data valE CC CC Execute Cnd ALU ALU aluA , aluB valA , valB Decode srcA , srcB dstA , dstB Register Register A A B B Register Register M M file file file file E E icode , ifun valP rA , rB valC Instruction Instruction PC PC Fetch memory memory increment increment PC
SEQ Stages Fetch Decode Execute Memory Write Back PC newPC SEQ Stages PC valE , valM Write back valM Fetch Read instruction from instruction memory Decode Read program registers Execute Compute value or address Memory Read or write data Write Back Write program registers PC Update program counter Data Data Memory memory memory Addr , Data valE CC CC Execute Cnd ALU ALU aluA , aluB valA , valB Decode srcA , srcB dstA , dstB Register Register A A B B Register Register M M file file file file E E icode , ifun valP rA , rB valC Instruction Instruction PC PC Fetch memory memory increment increment PC
Instruction Decoding Instruction Format Instruction byte icode:ifun Optional Optional 5 rA rB D icode ifun rA rB valC Instruction Format Instruction byte icode:ifun Optional register byte rA:rB Optional constant word valC
Executing Arith./Logical Operation OPq rA, rB 6 fn rA rB Fetch Read 2 bytes Decode Read operand registers Execute Perform operation Set condition codes Memory Do nothing Write back Update register PC Update Increment PC by 2
Stage Computation: Arith/Log. Ops OPq rA, rB icode:ifun M1[PC] rA:rB M1[PC+1] valP PC+2 Fetch Read instruction byte Read register byte Compute next PC valA R[rA] valB R[rB] Decode Read operand A Read operand B valE valB OP valA Set CC Execute Perform ALU operation Set condition code register Memory R[rB] valE Write back Write back result PC valP PC update Update PC Formulate instruction execution as sequence of simple steps Use same general form for all instructions
Executing rmmovq Fetch Decode Execute Memory Write back PC Update 4 rB rmmovq rA, D(rB) 4 rA rB D Fetch Read 10 bytes Decode Read operand registers Execute Compute effective address Memory Write to memory Write back Do nothing PC Update Increment PC by 10
Stage Computation: rmmovq rmmovq rA, D(rB) Fetch icode:ifun M1[PC] Read instruction byte rA:rB M1[PC+1] Read register byte valC M8[PC+2] Read displacement D valP PC+10 Compute next PC valA R[rA] valB R[rB] Decode Read operand A Read operand B valE valB + valC Execute Compute effective address M8[valE] valA Memory Write value to memory Write back PC valP PC update Update PC Use ALU for address computation
Executing popq Fetch Decode Execute Memory Write back PC Update popq rA b rA 8 Fetch Read 2 bytes Decode Read stack pointer Execute Increment stack pointer by 8 Memory Read from old stack pointer Write back Update stack pointer Write result to register PC Update Increment PC by 2
Stage Computation: popq popq rA icode:ifun M1[PC] rA:rB M1[PC+1] valP PC+2 Fetch Read instruction byte Read register byte Compute next PC valA R[%rsp] valB R[%rsp] Decode Read stack pointer valE valB + 8 Execute Increment stack pointer valM M8[valA] Memory Read from stack R[%rsp] valE R[rA] valM Write back Update stack pointer Write back result PC valP PC update Update PC Use ALU to increment stack pointer Must update two registers Popped value New stack pointer
Executing Conditional Moves cmovXX rA, rB 2 fn rA rB Fetch Read 2 bytes Decode Read operand registers Execute If !cnd, then set destination register to 0xF Memory Do nothing Write back Update register (or not) PC Update Increment PC by 2
Stage Computation: Cond. Move cmovXX rA, rB icode:ifun M1[PC] rA:rB M1[PC+1] valP PC+2 Fetch Read instruction byte Read register byte Compute next PC valA R[rA] valB 0 Decode Read operand A valE valB + valA If ! Cond(CC,ifun) rB 0xF Execute Pass valA through ALU (Disable register update) Memory R[rB] valE Write back Write back result PC valP PC update Update PC Read register rA and pass through ALU Cancel move by setting destination register to 0xF If condition codes & move condition indicate no move
Executing Jumps Fetch Decode Execute Memory Write back PC Update jXX Dest 7 fn Dest Not taken fall thru: XX target: XX Taken Fetch Read 9 bytes Increment PC by 9 Decode Do nothing Execute Determine whether to take branch based on jump condition and condition codes Memory Do nothing Write back PC Update Set PC to Dest if branch taken or to incremented PC if not branch
Stage Computation: Jumps jXX Dest icode:ifun M1[PC] valC M8[PC+1] valP PC+9 Fetch Read instruction byte Read destination address Fall through address Decode Cnd Cond(CC,ifun) Execute Take branch? Memory Write back PC Cnd ? valC : valP PC update Update PC Compute both addresses Choose based on setting of condition codes and branch condition
Executing call Fetch Decode Execute Memory Write back PC Update call Dest 8 Dest XX return: target: Fetch Read 9 bytes Increment PC by 9 Decode Read stack pointer Execute Decrement stack pointer by 8 Memory Write incremented PC to new value of stack pointer Write back Update stack pointer PC Update Set PC to Dest
Stage Computation: call call Dest icode:ifun M1[PC] valC M8[PC+1] valP PC+9 Fetch Read instruction byte Read destination address Compute return point valB R[%rsp] Decode Read stack pointer valE valB + –8 Execute Decrement stack pointer M8[valE] valP Memory Write return value on stack R[%rsp] valE Write back Update stack pointer PC valC PC update Set PC to destination Use ALU to decrement stack pointer Store incremented PC
Executing ret Fetch Decode Execute Memory Write back PC Update 9 return: XX Fetch Read 1 byte Decode Read stack pointer Execute Increment stack pointer by 8 Memory Read return address from old stack pointer Write back Update stack pointer PC Update Set PC to return address
Stage Computation: ret icode:ifun M1[PC] Fetch Read instruction byte valA R[%rsp] valB R[%rsp] Decode Read operand stack pointer valE valB + 8 Execute Increment stack pointer valM M8[valA] Memory Read return address R[%rsp] valE Write back Update stack pointer PC valM PC update Set PC to return address Use ALU to increment stack pointer Read return address from memory
Computation Steps All instructions follow same general pattern OPq rA, rB Fetch icode,ifun icode:ifun M1[PC] Read instruction byte rA,rB rA:rB M1[PC+1] Read register byte valC [Read constant word] valP valP PC+2 Compute next PC Decode valA, srcA valA R[rA] Read operand A valB, srcB valB R[rB] Read operand B Execute valE valE valB OP valA Perform ALU operation Cond code Set CC Set/use cond. code reg Memory valM [Memory read/write] Write back dstE R[rB] valE Write back ALU result dstM [Write back memory result] PC update PC PC valP Update PC All instructions follow same general pattern Differ in what gets computed on each step
Computation Steps All instructions follow same general pattern call Dest Fetch icode,ifun icode:ifun M1[PC] Read instruction byte rA,rB [Read register byte] valC valC M8[PC+1] Read constant word valP valP PC+9 Compute next PC Decode valA, srcA [Read operand A] valB, srcB valB R[%rsp] Read operand B Execute valE valE valB + –8 Perform ALU operation Cond code [Set /use cond. code reg] Memory valM M8[valE] valP Memory read/write Write back dstE R[%rsp] valE Write back ALU result dstM [Write back memory result] PC update PC PC valC Update PC All instructions follow same general pattern Differ in what gets computed on each step
Computed Values Fetch Decode Execute Memory icode Instruction code ifun Instruction function rA Instr. Register A rB Instr. Register B valC Instruction constant valP Incremented PC Decode srcA Register ID A srcB Register ID B dstE Destination Register E dstM Destination Register M valA Register value A valB Register value B Execute valE ALU result Cnd Branch/move flag Memory valM Value from memory
SEQ Hardware Key Blue boxes: predesigned hardware blocks E.g., memories, ALU Gray boxes: control logic Describe in HCL White ovals: labels for signals Thick lines: 64-bit word values Thin lines: 4-8 bit values Dotted lines: 1-bit values
Fetch Logic Predefined Blocks PC: Register containing PC Instruction memory PC increment rB icode ifun rA valC valP Need regids Instr valid Align Split Bytes 1-9 Byte 0 imem_error Fetch Logic Predefined Blocks PC: Register containing PC Instruction memory: Read 10 bytes (PC to PC+9) Signal invalid address Split: Divide instruction byte into icode and ifun Align: Get fields for rA, rB, and valC
Fetch Logic Control Logic Instr. Valid: Is this instruction valid? memory PC increment rB icode ifun rA valC valP Need regids Instr valid Align Split Bytes 1-9 Byte 0 imem_error Fetch Logic Control Logic Instr. Valid: Is this instruction valid? icode, ifun: Generate no-op if invalid address Need regids: Does this instruction have a register byte? Need valC: Does this instruction have a constant word?
Fetch Control Logic in HCL Instruction memory PC Split Byte 0 imem_error icode ifun # Determine instruction code int icode = [ imem_error: INOP; 1: imem_icode; ]; # Determine instruction function int ifun = [ imem_error: FNONE; 1: imem_ifun;
Fetch Control Logic in HCL popq rA A rA F jXX Dest 7 fn Dest B call Dest 8 cmovXX rA, rB 2 rB irmovq V, rB 3 V rmmovq rA, D(rB) 4 D mrmovq D(rB), rA 5 OPq rA, rB 6 ret 9 nop 1 halt bool need_regids = icode in { IRRMOVQ, IOPQ, IPUSHQ, IPOPQ, IIRMOVQ, IRMMOVQ, IMRMOVQ }; bool instr_valid = icode in { INOP, IHALT, IRRMOVQ, IIRMOVQ, IRMMOVQ, IMRMOVQ, IOPQ, IJXX, ICALL, IRET, IPUSHQ, IPOPQ };
Decode Logic Control Logic Signals Register File Read ports A, B Write ports E, M Addresses are register IDs or 15 (0xF) (no access) rB dstE dstM srcA srcB Register file A B M E icode rA valB valA valE valM Cnd Control Logic srcA, srcB: read port addresses dstE, dstM: write port addresses Signals Cnd: Indicate whether or not to perform conditional move Computed in Execute stage
A Source cmovXX rA, rB valA R[rA] Decode Read operand A rmmovq rA, D(rB) popq rA valA R[%rsp] Read stack pointer jXX Dest No operand call Dest ret OPq rA, rB A Source int srcA = [ icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ } : rA; icode in { IPOPQ, IRET } : RRSP; 1 : RNONE; # Don't need register ];
E Desti- nation None R[%rsp] valE Update stack pointer R[rB] valE cmovXX rA, rB Write-back rmmovq rA, D(rB) popq rA jXX Dest call Dest ret Conditionally write back result OPq rA, rB Write back result E Desti- nation int dstE = [ icode in { IRRMOVQ } && Cnd : rB; icode in { IIRMOVQ, IOPQ} : rB; icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP; 1 : RNONE; # Don't write any register ];
Execute Logic Units Control Logic ALU CC cond Implements 4 required functions Generates condition code values CC Register with 3 condition code bits cond Computes conditional jump/move flag Control Logic Set CC: Should condition code register be loaded? ALU A: Input A to ALU ALU B: Input B to ALU ALU fun: What function should ALU compute? CC ALU A B fun. Cnd icode ifun valC valB valA valE Set cond
ALU A Input int aluA = [ icode in { IRRMOVQ, IOPQ } : valA; valE valB + –8 Decrement stack pointer No operation valE valB + 8 Increment stack pointer valE valB + valC Compute effective address valE 0 + valA Pass valA through ALU cmovXX rA, rB Execute rmmovq rA, D(rB) popq rA jXX Dest call Dest ret valE valB OP valA Perform ALU operation OPq rA, rB int aluA = [ icode in { IRRMOVQ, IOPQ } : valA; icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ } : valC; icode in { ICALL, IPUSHQ } : -8; icode in { IRET, IPOPQ } : 8; # Other instructions don't need ALU ];
ALU Oper- ation valE valB + –8 Decrement stack pointer No operation Increment stack pointer valE valB + valC Compute effective address valE 0 + valA Pass valA through ALU cmovXX rA, rB Execute rmmovl rA, D(rB) popq rA jXX Dest call Dest ret valE valB OP valA Perform ALU operation OPl rA, rB ALU Oper- ation int alufun = [ icode == IOPQ : ifun; 1 : ALUADD; ];
Memory Logic Memory Control Logic Reads or writes memory word Data memory Mem. read addr write data out data valE valM valA valP data in icode Stat dmem_error instr_valid imem_error stat Memory Reads or writes memory word Control Logic stat: What is instruction status? Mem. read: should word be read? Mem. write: should word be written? Mem. addr.: Select address Mem. data.: Select data
Instruction Status Control Logic stat: What is instruction status? Data memory Mem. read addr write data out data valE valM valA valP data in icode Stat dmem_error instr_valid imem_error stat Control Logic stat: What is instruction status? ## Determine instruction status int Stat = [ imem_error || dmem_error : SADR; !instr_valid: SINS; icode == IHALT : SHLT; 1 : SAOK; ];
Memory Address OPq rA, rB Memory rmmovq rA, D(rB) popq rA jXX Dest call Dest ret No operation M8[valE] valA Write value to memory valM M8[valA] Read from stack M8[valE] valP Write return value on stack Read return address int mem_addr = [ icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : valE; icode in { IPOPQ, IRET } : valA; # Other instructions don't need address ];
Memory Read OPq rA, rB Memory rmmovq rA, D(rB) popq rA jXX Dest call Dest ret No operation M8[valE] valA Write value to memory valM M8[valA] Read from stack M8[valE] valP Write return value on stack Read return address bool mem_read = icode in { IMRMOVQ, IPOPQ, IRET };
PC Update Logic New PC Select next value of PC New PC Cnd icode valC valP valM New PC Select next value of PC
PC Update OPq rA, rB rmmovq rA, D(rB) popq rA jXX Dest call Dest ret PC valP PC update Update PC PC Cnd ? valC : valP PC valC Set PC to destination PC valM Set PC to return address PC Update int new_pc = [ icode == ICALL : valC; icode == IJXX && Cnd : valC; icode == IRET : valM; 1 : valP; ];
SEQ Operation State Combinational Logic PC register Cond. Code register Data memory Register file All updated as clock rises Combinational Logic ALU Control logic Memory reads Instruction memory Combinational logic Data memory Register file %rbx = 0x100 PC 0x014 CC 100 Read ports Write
SEQ Operation #2 state set according to second irmovq instruction 0x014: addq %rdx,%rbx # %rbx <-- 0x300 CC <-- 000 0x016: je dest # Not taken 0x01f: rmmovq %rbx,0(%rdx) # M[0x200] <-- 0x300 Cycle 3: Cycle 4: Cycle 5: 0x00a: irmovq $0x200,%rdx # %rdx <-- 0x200 Cycle 2: 0x000: irmovq $0x100,%rbx # %rbx <-- 0x100 Cycle 1: Clock Cycle 1 j l m k Cycle 2 Cycle 3 Cycle 4 SEQ Operation #2 Combinational logic Data memory Register file %rbx = 0x100 PC 0x014 CC 100 Read ports Write state set according to second irmovq instruction combinational logic starting to react to state changes
SEQ Operation #3 state set according to second irmovq instruction 0x014: addq %rdx,%rbx # %rbx <-- 0x300 CC <-- 000 0x016: je dest # Not taken 0x01f: rmmovq %rbx,0(%rdx) # M[0x200] <-- 0x300 Cycle 3: Cycle 4: Cycle 5: 0x00a: irmovq $0x200,%rdx # %rdx <-- 0x200 Cycle 2: 0x000: irmovq $0x100,%rbx # %rbx <-- 0x100 Cycle 1: Clock Cycle 1 j l m k Cycle 2 Cycle 3 Cycle 4 SEQ Operation #3 Combinational logic Data memory Register file %rbx = 0x100 PC 0x014 CC 100 Read ports Write 0x016 000 %rbx <-- 0x300 state set according to second irmovq instruction combinational logic generates results for addq instruction Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write
SEQ Operation #4 state set according to addq instruction 0x014: addq %rdx,%rbx # %rbx <-- 0x300 CC <-- 000 0x016: je dest # Not taken 0x01f: rmmovq %rbx,0(%rdx) # M[0x200] <-- 0x300 Cycle 3: Cycle 4: Cycle 5: 0x00a: irmovq $0x200,%rdx # %rdx <-- 0x200 Cycle 2: 0x000: irmovq $0x100,%rbx # %rbx <-- 0x100 Cycle 1: Clock Cycle 1 j l m k Cycle 2 Cycle 3 Cycle 4 SEQ Operation #4 Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write state set according to addq instruction combinational logic starting to react to state changes Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write
SEQ Operation #5 state set according to addq instruction 0x014: addq %rdx,%rbx # %rbx <-- 0x300 CC <-- 000 0x016: je dest # Not taken 0x01f: rmmovq %rbx,0(%rdx) # M[0x200] <-- 0x300 Cycle 3: Cycle 4: Cycle 5: 0x00a: irmovq $0x200,%rdx # %rdx <-- 0x200 Cycle 2: 0x000: irmovq $0x100,%rbx # %rbx <-- 0x100 Cycle 1: Clock Cycle 1 j l m k Cycle 2 Cycle 3 Cycle 4 SEQ Operation #5 Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write 0x01f state set according to addq instruction combinational logic generates results for je instruction Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write 0x01f Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write 0x01f Combinational logic Data memory Register file %rbx = 0x300 PC 0x016 CC 000 Read ports Write 0x01f
SEQ Summary Implementation Limitations Express every instruction as series of simple steps Follow same general flow for each instruction type Assemble registers, memories, predesigned combinational blocks Connect with control logic Limitations Too slow to be practical In one cycle, must propagate through instruction memory, register file, ALU, and data memory Would need to run clock very slowly Hardware units only active for fraction of clock cycle
Processor Architecture: Pipelined Implementation 15-213: Introduction to Computer Systems
Overview General Principles of Pipelining Goal Difficulties Creating a Pipelined Y86-64 Processor Rearranging SEQ Inserting pipeline registers Problems with data and control hazards
Real-World Pipelines: Car Washes Sequential Parallel Pipelined Idea Divide process into independent stages Move objects through stages in sequence At any given times, multiple objects being processed
Computational Example Combinational logic R e g 300 ps 20 ps Clock Delay = 320 ps Throughput = 3.12 GIPS System Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Must have clock cycle of at least 320 ps
3-Way Pipelined Version g Clock Comb. logic A B C 100 ps 20 ps Delay = 360 ps Throughput = 8.33 GIPS System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps Overall latency increases 360 ps from start to finish
Pipeline Diagrams Unpipelined 3-Way Pipelined Cannot start new operation until previous one completes 3-Way Pipelined Up to 3 operations in process simultaneously Time OP1 OP2 OP3 Time A B C OP1 OP2 OP3
Operating a Pipeline Clock Comb. logic A B C 100 ps 20 ps 359 100 ps 300 R e g Clock Comb. logic A B C 100 ps 20 ps 239 R e g Clock Comb. logic A B C 100 ps 20 ps 241 Time OP1 OP2 OP3 A B C 120 240 360 480 640 Clock
Limitations: Nonuniform Delays g Clock Comb. logic B C 50 ps 20 ps 150 ps 100 ps Delay = 510 ps Throughput = 5.88 GIPS A Time OP1 OP2 OP3 A B C Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages
Limitations: Register Overhead Delay = 420 ps, Throughput = 14.29 GIPS Clock R e g Comb. logic 50 ps 20 ps As try to deepen pipeline, overhead of loading registers becomes more significant Percentage of clock cycle spent loading register: 1-stage pipeline: 6.25% 3-stage pipeline: 16.67% 6-stage pipeline: 28.57% High speeds of modern processor designs obtained through very deep pipelining
Data Dependencies System Clock Combinational logic R e g Time OP1 OP2 OP3 System Each operation depends on result from preceding one
Data Hazards R e g Clock Comb. logic A B C Time OP1 OP2 OP3 A B C OP4 Result does not feed back around in time for next operation Pipelining has changed behavior of system
Data Dependencies in Processors 1 irmovq $50, %rax 2 addq %rax , %rbx 3 mrmovq 100( %rbx ), %rdx Result from one instruction used as operand for another Read-after-write (RAW) dependency Very common in actual programs Must make sure our pipeline handles these properly Get correct results Minimize performance impact
SEQ Hardware Stages occur in sequence One operation in process at a time
SEQ+ Hardware PC Stage Processor State Still sequential implementation Reorder PC stage to put at beginning PC Stage Task is to select PC for current instruction Based on results computed by previous instruction Processor State PC is no longer stored in register But, can determine PC based on other stored information
Adding Pipeline Registers Instruction memory PC increment CC ALU Data Fetch Decode Execute Memory Write back icode , ifun rA rB valC Register file A B M E valP srcA srcB dstA dstB valA valB aluA aluB Cnd valE Addr , Data valM newPC
Pipeline Stages Fetch Decode Execute Memory Write Back Select current PC Read instruction Compute incremented PC Decode Read program registers Execute Operate ALU Memory Read or write data memory Write Back Update register file
PIPE- Hardware Forward (Upward) Paths Pipeline registers hold intermediate values from instruction execution Forward (Upward) Paths Values passed from one stage to next Cannot jump past stages e.g., valC passes through decode
Signal Naming Conventions S_Field Value of Field held in stage S pipeline register s_Field Value of Field computed in stage S
Feedback Paths Predicted PC Branch information Return point Guess value of next PC Branch information Jump taken/not-taken Fall-through or target address Return point Read from memory Register updates To register file write ports
Predicting the PC Start fetch of new instruction after current one has completed fetch stage Not enough time to reliably determine next instruction Guess which instruction will follow Recover if prediction was incorrect
Our Prediction Strategy Instructions that Don’t Transfer Control Predict next PC to be valP Always reliable Call and Unconditional Jumps Predict next PC to be valC (destination) Conditional Jumps Only correct if branch is taken Typically right 60% of time Return Instruction Don’t try to predict
Recovering from PC Misprediction Mispredicted Jump Will see branch condition flag once instruction reaches memory stage Can get fall-through PC from valA (value M_valA) Return Instruction Will get return PC when ret reaches write-back stage (W_valM)
Pipeline Demonstration irmovq $1,%rax #I1 1 2 3 4 5 6 7 8 9 F D E M W irmovq $2,%rcx #I2 irmovq $3,%rdx #I3 irmovq $4,%rbx #I4 halt #I5 Cycle 5 I1 I2 I3 I4 I5 File: demo-basic.ys
Data Dependencies: 3 Nop’s 0x000: irmovq $10,% rdx 1 2 3 4 5 6 7 8 9 F D E M W 0x00a: $3,% rax 0x014: nop 0x015: 0x016: 0x017: addq % ,% 10 R[ ] f valA = valB # demo-h3.ys Cycle 6 11 0x019: halt Cycle 7
Data Dependencies: 2 Nop’s 0x000: irmovq $10,% rdx 1 2 3 4 5 6 7 8 9 F D E M W 0x00a: $3,% rax 0x014: nop 0x015: 0x016: addq % ,% 0x018: halt 10 # demo-h2.ys R[ ] f valA = valB • Cycle 6 Error
Data Dependencies: 1 Nop 0x000: irmovq $10,% rdx 1 2 3 4 5 6 7 8 9 F D E M W 0x00a: $3,% rax 0x014: nop 0x015: addq % ,% 0x017: halt # demo-h1.ys R[ ] f 10 valA = valB • Cycle 5 Error M_ valE = 3 dstE
Data Dependencies: No Nop 0x000: irmovq $10,% rdx 1 2 3 4 5 6 7 8 F D E M W 0x00a: $3,% rax 0x014: addq % ,% 0x016: halt # demo-h0.ys valA f R[ ] = valB Cycle 4 Error M_ valE = 10 dstE e_ 0 + 3 = 3 E_
Branch Misprediction Example demo-j.ys 0x000: xorq %rax,%rax 0x002: jne t # Not taken 0x00b: irmovq $1, %rax # Fall through 0x015: nop 0x016: nop 0x017: nop 0x018: halt 0x019: t: irmovq $3, %rdx # Target (Should not execute) 0x023: irmovq $4, %rcx # Should not execute 0x02d: irmovq $5, %rdx # Should not execute Should only execute first 8 instructions
Branch Misprediction Trace 0x000: xorq % rax ,% 1 2 3 4 5 6 7 8 9 F D E M W 0x002: jne t # Not taken 0x019: t: irmovq $3, % rdx # Target 0x023: $4, % rcx # Target+1 0x00b: $1, % # Fall Through # demo - j Cycle 5 valE f dstE = M_Cnd = M_ valA = 0x007 valC ecx rB Incorrectly execute two instructions at branch target
Return Example Require lots of nops to avoid data hazards demo-ret.ys 0x000: irmovq Stack,%rsp # Intialize stack pointer 0x00a: nop # Avoid hazard on %rsp 0x00b: nop 0x00c: nop 0x00d: call p # Procedure call 0x016: irmovq $5,%rsi # Return point 0x020: halt 0x020: .pos 0x20 0x020: p: nop # procedure 0x021: nop 0x022: nop 0x023: ret 0x024: irmovq $1,%rax # Should not be executed 0x02e: irmovq $2,%rcx # Should not be executed 0x038: irmovq $3,%rdx # Should not be executed 0x042: irmovq $4,%rbx # Should not be executed 0x100: .pos 0x100 0x100: Stack: # Initial stack pointer Require lots of nops to avoid data hazards
Incorrect Return Example Incorrectly execute 3 instructions following ret
Pipeline Summary Concept Limitations Fixing the Pipeline Break instruction execution into 5 stages Run instructions through in pipelined mode Limitations Can’t handle dependencies between instructions when instructions follow too closely Data dependencies One instruction writes register, later one reads it Control dependency Instruction sets PC in way that pipeline did not predict correctly Mispredicted branch and return Fixing the Pipeline We’ll do that next time
Processor Architecture: Pipelined Implementation: 2 15-213: Introduction to Computer Systems
Make the pipelined processor work! Overview Make the pipelined processor work! Data Hazards Instruction having register R as source follows shortly after instruction having register R as destination Common condition, don’t want to slow down pipeline Control Hazards Mispredict conditional branch Our design predicts all branches as being taken Naïve pipeline executes two extra instructions Getting return address for ret instruction Naïve pipeline executes three extra instructions Making Sure It Really Works What if multiple special cases happen simultaneously?
Pipeline Stages Fetch Decode Execute Memory Write Back Select current PC Read instruction Compute incremented PC Decode Read program registers Execute Operate ALU Memory Read or write data memory Write Back Update register file
PIPE- Hardware Forward (Upward) Paths Pipeline registers hold intermediate values from instruction execution Forward (Upward) Paths Values passed from one stage to next Cannot jump past stages e.g., valC passes through decode
Data Dependencies: 2 Nop’s 0x000: irmovq $10,% rdx 1 2 3 4 5 6 7 8 9 F D E M W 0x00a: $3,% rax 0x014: nop 0x015: 0x016: addq % ,% 0x018: halt 10 # demo-h2.ys R[ ] f valA = valB • Cycle 6 Error
Data Dependencies: No Nop 0x000: irmovq $10,% rdx 1 2 3 4 5 6 7 8 F D E M W 0x00a: $3,% rax 0x014: addq % ,% 0x016: halt # demo-h0.ys valA f R[ ] = valB Cycle 4 Error M_ valE = 10 dstE e_ 0 + 3 = 3 E_
Stalling for Data Dependencies 1 2 3 4 5 6 7 8 9 10 11 # demo-h2.ys 0x000: irmovq $10,%rdx F D E M W 0x00a: irmovq $3,%rax F D E M W 0x014: nop F D E M W 0x015: nop F D E M W bubble F E M W 0x016: addq %rdx,%rax D D E M W 0x018: halt F F D E M W If instruction follows too closely after one that writes register, slow it down Hold instruction in decode Dynamically inject nop into execute stage
Stall Condition Source Registers Destination Registers Special Case srcA and srcB of current instruction in decode stage Destination Registers dstE and dstM fields Instructions in execute, memory, and write-back stages Special Case Don’t stall for register ID 15 (0xF) Indicates absence of register operand Or failed cond. move
Detecting Stall Condition 1 2 3 4 5 6 7 8 9 10 11 # demo-h2.ys 0x000: irmovq $10,%rdx F D E M W 0x00a: irmovq $3,%rax F D E M W 0x014: nop F D E M W 0x015: nop F D E M W bubble F E M W 0x016: addq %rdx,%rax D D E M W 0x018: halt F F D E M W Cycle 6 W D • W_dstE = %rax W_valE = 3 srcA = %rdx srcB = %rax
Stalling X3 F D E M W F D E M W E M W E M W E M W F D D D D E M W F F 1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovq $10,%rdx F D E M W 0x00a: irmovq $3,%rax F D E M W bubble E M W bubble E M W bubble E M W 0x014: addq %rdx,%rax F D D D D E M W 0x016: halt F F F F D E M W Cycle 6 W W_dstE = %rax Cycle 5 M M_dstE = %rax Cycle 4 • E e_dstE = %rax • D srcA = %rdx srcB = %rax D srcA = %rdx srcB = %rax D srcA = %rdx srcB = %rax
What Happens When Stalling? 0x000: irmovq $10,%rdx 0x00a: irmovq $3,%rax 0x014: addq %rdx,%rax # demo-h0.ys 0x016: halt bubble 0x014: addq %rdx,%rax Cycle 7 0x016: halt bubble Cycle 8 0x014: addq %rdx,%rax 0x016: halt 0x00a: irmovq $3,%rax bubble 0x014: addq %rdx,%rax Cycle 6 0x016: halt 0x000: irmovq $10,%rdx 0x00a: irmovq $3,%rax bubble 0x014: addq %rdx,%rax Cycle 5 0x016: halt 0x000: irmovq $10,%rdx 0x00a: irmovq $3,%rax 0x014: addq %rdx,%rax Cycle 4 0x016: halt Write Back Memory Execute Decode Fetch Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage Like dynamically generated nop’s Move through later stages
Implementing Stalling Pipeline Control Combinational logic detects stall condition Sets mode signals for how pipeline registers should update Pipeline control logic Pipeline control logic Pipeline control logic Pipeline control logic
Pipeline Register Modes Rising clock _ Output = y y Output = x Input = y stall = 0 bubble x Normal Rising clock _ Output = x x Output = x Input = y stall = 1 bubble = 0 x Stall n o p Rising clock _ Output = nop Output = x Input = y stall = 0 bubble = 1 Bubble x x
Data Forwarding Naïve Pipeline Observation Trick Register isn’t written until completion of write-back stage Source operands read from register file in decode stage Needs to be in register file at start of stage Observation Value generated in execute or memory stage Trick Pass value directly from generating instruction to decode stage Needs to be available at end of decode stage
Data Forwarding Example irmovq $10,% rdx 1 2 3 4 5 6 7 8 9 F D E M W 0x00a: $3,% rax 0x014: nop 0x015: 0x016: addq % ,% 0x018: halt 10 # demo-h2.ys Cycle 6 R[ ] f valA = valB W_ valE • dstE = 3 srcA srcB irmovq in write-back stage Destination value in W pipeline register Forward as valB for decode stage
Bypass Paths Decode Stage Forwarding Sources Forwarding logic selects valA and valB Normally from register file Forwarding: get valA or valB from later pipeline stage Forwarding Sources Execute: valE Memory: valE, valM Write back: valE, valM
Data Forwarding Example #2 0x000: irmovq $10,%rdx 1 2 3 4 5 6 7 8 F D E M W 0x00a: irmovq $3,%rax 0x014: addq %rdx,%rax 0x016: halt # demo-h0.ys Cycle 4 valA f M_valE = 10 valB f e_valE = 3 M_dstE = %rdx M_valE = 10 srcA = %rdx srcB = %rax E_dstE = %rax e_valE f 0 + 3 = 3 Register %rdx Generated by ALU during previous cycle Forward from memory as valA Register %rax Value just generated by ALU Forward from execute as valB
Forwarding Priority Multiple Forwarding Choices 0x000: irmovq $1, %rax 1 2 3 4 5 6 7 8 9 F D E M W 0x00a: irmovq $2, %rax 0x014: irmovq $3, %rax 0x01e: rrmovq %rax, %rdx 0x020: halt 10 # demo-priority.ys W R[ % rax ] f 3 1 D valA rdx = 10 valB ? Cycle 5 M 2 E Multiple Forwarding Choices Which one should have priority Match serial semantics Use matching value from earliest pipeline stage
Implementing Forwarding Add additional feedback paths from E, M, and W pipeline registers into decode stage Create logic blocks to select from multiple sources for valA and valB in decode stage
Implementing Forwarding ## What should be the A value? int d_valA = [ # Use incremented PC D_icode in { ICALL, IJXX } : D_valP; # Forward valE from execute d_srcA == e_dstE : e_valE; # Forward valM from memory d_srcA == M_dstM : m_valM; # Forward valE from memory d_srcA == M_dstE : M_valE; # Forward valM from write back d_srcA == W_dstM : W_valM; # Forward valE from write back d_srcA == W_dstE : W_valE; # Use value read from register file 1 : d_rvalA; ];
Limitation of Forwarding Load-use dependency Value needed by end of decode stage in cycle 7 Value read from memory in memory stage of cycle 8
Avoiding Load/Use Hazard Stall using instruction for one cycle Can then pick up loaded value by forwarding from memory stage
Detecting Load/Use Hazard Condition Trigger Load/Use Hazard E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB }
Control for Load/Use Hazard 0x000: irmovq $128,% rdx 1 2 3 4 5 6 7 8 9 F D E M W 0x00a: $3,% rcx 0x014: rmmovq % , 0(% ) 0x01e: $10,% ebx 0x028: mrmovq 0(% ), rax # Load % # demo - luh . ys 0x032: addq , # Use % 0x034: halt 10 11 bubble 12 Stall instructions in fetch and decode stages Inject bubble into execute stage Condition F D E M W Load/Use Hazard stall bubble normal
Branch Misprediction Example demo-j.ys 0x000: xorq %rax,%rax 0x002: jne t # Not taken 0x00b: irmovq $1, %rax # Fall through 0x015: nop 0x016: nop 0x017: nop 0x018: halt 0x019: t: irmovq $3, %rdx # Target 0x023: irmovq $4, %rcx # Should not execute 0x02d: irmovq $5, %rdx # Should not execute Should only execute first 8 instructions
Handling Misprediction Predict branch as taken Fetch 2 instructions at target Cancel when mispredicted Detect branch not-taken in execute stage On following cycle, replace instructions in execute and decode by bubbles No side effects have occurred yet
Detecting Mispredicted Branch Condition Trigger Mispredicted Branch E_icode = IJXX & !e_Cnd
Control for Misprediction Condition F D E M W Mispredicted Branch normal bubble
Return Example Previously executed three additional instructions demo-retb.ys Return Example 0x000: irmovq Stack,%rsp # Intialize stack pointer 0x00a: call p # Procedure call 0x013: irmovq $5,%rsi # Return point 0x01d: halt 0x020: .pos 0x20 0x020: p: irmovq $-1,%rdi # procedure 0x02a: ret 0x02b: irmovq $1,%rax # Should not be executed 0x035: irmovq $2,%rcx # Should not be executed 0x03f: irmovq $3,%rdx # Should not be executed 0x049: irmovq $4,%rbx # Should not be executed 0x100: .pos 0x100 0x100: Stack: # Stack: Stack pointer Previously executed three additional instructions
Correct Return Example # demo - retb 0x026: ret F D E M W bubble F D E M W bubble F D E M W bubble F D E M W 0x013: irmovq $5,% rsi # Return F F D D E E M M W W As ret passes through pipeline, stall at fetch stage While in decode, execute, and memory stage Inject bubble into decode stage Release stall when reach write-back stage W valM = 0x0b 0x013 • F F valC valC f f 5 5 rB rB f f % % esi rsi
Detecting Return Condition Trigger Processing ret IRET in { D_icode, E_icode, M_icode }
Control for Return Condition F D E M W Processing ret stall bubble # demo - retb 0x026: ret F D E M W bubble F D E M W bubble F D E M W bubble F D E M W 0x014: irmovq $5,% rsi # Return F F D D E E M M W W Condition F D E M W Processing ret stall bubble normal
Special Control Cases Detection Action (on next cycle) Condition Trigger Processing ret IRET in { D_icode, E_icode, M_icode } Load/Use Hazard E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB } Mispredicted Branch E_icode = IJXX & !e_Cnd Condition F D E M W Processing ret stall bubble normal Load/Use Hazard Mispredicted Branch
Implementing Pipeline Control Combinational logic generates pipeline control signals Action occurs at start of following cycle
Initial Version of Pipeline Control bool F_stall = # Conditions for a load/use hazard E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB } || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode }; bool D_stall = E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB }; bool D_bubble = # Mispredicted branch (E_icode == IJXX && !e_Cnd) || bool E_bubble = # Load/use hazard
Control Combinations Combination A Combination B Special cases that can arise on same clock cycle Combination A Not-taken branch ret instruction at branch target Combination B Instruction that reads from memory to %rsp Followed by ret instruction
Control Combination A Should handle as mispredicted branch JXX E D M Mispredict ret 1 Combination A Condition F D E M W Processing ret stall bubble normal Mispredicted Branch Combination Should handle as mispredicted branch Stalls F pipeline register But PC selection logic will be using M_valM anyhow
Control Combination B Load/use ret ret ret 1 1 1 M M M M E Load E E E D Use D D D ret ret ret Combination B Condition F D E M W Processing ret stall bubble normal Load/Use Hazard Combination bubble + stall Would attempt to bubble and stall pipeline register D Signaled by processor as pipeline error
Handling Control Combination B Load/use ret ret ret 1 1 1 M M M M E Load E E E D Use D D D ret ret ret Combination B Condition F D E M W Processing ret stall bubble normal Load/Use Hazard Combination Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle
Corrected Pipeline Control Logic bool D_bubble = # Mispredicted branch (E_icode == IJXX && !e_Cnd) || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode } # but not condition for a load/use hazard && !(E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB }); Condition F D E M W Processing ret stall bubble normal Load/Use Hazard Combination Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle
Pipeline Summary Data Hazards Control Hazards Control Combinations Most handled by forwarding No performance penalty Load/use hazard requires one cycle stall Control Hazards Cancel instructions when detect mispredicted branch Two clock cycles wasted Stall fetch stage while ret passes through pipeline Three clock cycles wasted Control Combinations Must analyze carefully First version had subtle bug Only arises with unusual instruction combination
Processor Architecture: Wrap-up 15-213: Introduction to Computer Systems
Overview Wrap-Up of PIPE Design Modern High-Performance Processors Exceptional conditions Performance analysis Fetch stage design Modern High-Performance Processors Out-of-order execution
Exceptions Causes Typical Desired Action Our Implementation Conditions under which processor cannot continue normal operation Causes Halt instruction (Current) Bad address for instruction or data (Previous) Invalid instruction (Previous) Typical Desired Action Complete some instructions Either current or previous (depends on exception type) Discard others Call exception handler Like an unexpected procedure call Our Implementation Halt when instruction causes exception
Exception Examples Detect in Memory Stage Detect in Fetch Stage jmp $-1 # Invalid jump target .byte 0xFF # Invalid instruction code halt # Halt instruction Detect in Memory Stage irmovq $100,%rax rmmovq %rax,0x10000(%rax) # invalid address
Exceptions in Pipeline Processor #1 # demo-exc1.ys irmovq $100,%rax rmmovq %rax,0x10000(%rax) # Invalid address nop .byte 0xFF # Invalid instruction code 1 2 3 4 W 5 M E D Exception detected 0x000: irmovq $100,%rax F D E M 0x00a: rmmovq %rax,0x1000(%rax) F D E 0x014: nop F D 0x015: .byte 0xFF F Exception detected Desired Behavior rmmovq should cause exception Following instructions should have no effect on processor state
Exceptions in Pipeline Processor #2 # demo-exc2.ys 0x000: xorq %rax,%rax # Set condition codes 0x002: jne t # Not taken 0x00b: irmovq $1,%rax 0x015: irmovq $2,%rdx 0x01f: halt 0x020: t: .byte 0xFF # Target 1 2 3 4 M E F D W 5 M D F E E D M 6 M E W 7 W M 8 W 9 0x000: xorq %rax,%rax F D E 0x002: jne t F D 0x020: t: .byte 0xFF F Exception detected 0x???: (I’m lost!) 0x00b: irmovq $1,%rax Desired Behavior No exception should occur
Maintaining Exception Ordering W icode valE valM dstE dstM stat M Cnd icode valE valA dstE dstM stat E icode ifun valC valA valB dstE dstM srcA srcB stat D rB valC valP icode ifun rA stat F predPC Add status field to pipeline registers Fetch stage sets to either “AOK,” “ADR” (when bad fetch address), “HLT” (halt instruction) or “INS” (illegal instruction) Decode & execute pass values through Memory either passes through or sets to “ADR” Exception triggered only when instruction hits write back
Exception Handling Logic Fetch Stage Memory Stage Writeback Stage dmem_error # Determine status code for fetched instruction int f_stat = [ imem_error: SADR; !instr_valid : SINS; f_icode == IHALT : SHLT; 1 : SAOK; ]; # Update the status int m_stat = [ dmem_error : SADR; 1 : M_stat; ]; int Stat = [ # SBUB in earlier stages indicates bubble W_stat == SBUB : SAOK; 1 : W_stat; ];
Side Effects in Pipeline Processor # demo-exc3.ys irmovq $100,%rax rmmovq %rax,0x10000(%rax) # invalid address addq %rax,%rax # Sets condition codes 1 2 3 4 W 5 M E Exception detected 0x000: irmovq $100,%rax F D E M 0x00a: rmmovq %rax,0x1000(%rax) F D E 0x014: addq %rax,%rax F D Condition code set Desired Behavior rmmovq should cause exception No following instruction should have any effect
Avoiding Side Effects Presence of Exception Should Disable State Update Invalid instructions are converted to pipeline bubbles Except have stat indicating exception status Data memory will not write to invalid address Prevent invalid update of condition codes Detect exception in memory stage Disable condition code setting in execute Must happen in same clock cycle Handling exception in final stages When detect exception in memory stage Start injecting bubbles into memory stage on next cycle When detect exception in write-back stage Stall excepting instruction Included in HCL code
Control Logic for State Changes Setting Condition Codes Stage Control Also controls updating of memory # Should the condition codes be updated? bool set_cc = E_icode == IOPQ && # State changes only during normal operation !m_stat in { SADR, SINS, SHLT } && !W_stat in { SADR, SINS, SHLT }; # Start injecting bubbles as soon as exception passes through memory stage bool M_bubble = m_stat in { SADR, SINS, SHLT } || W_stat in { SADR, SINS, SHLT }; # Stall pipeline register W when exception encountered bool W_stall = W_stat in { SADR, SINS, SHLT };
Rest of Real-Life Exception Handling Call Exception Handler Push PC onto stack Either PC of faulting instruction or of next instruction Usually pass through pipeline along with exception status Jump to handler address Usually fixed address Defined as part of ISA Implementation Haven’t tried it yet!
Performance Metrics Clock rate Rate at which instructions executed Measured in Gigahertz Function of stage partitioning and circuit design Keep amount of work per stage small Rate at which instructions executed CPI: cycles per instruction On average, how many clock cycles does each instruction require? Function of pipeline design and benchmark programs E.g., how frequently are branches mispredicted?
CPI = C/I = (I+B)/I = 1.0 + B/I CPI for PIPE CPI 1.0 Fetch instruction each clock cycle Effectively process new instruction almost every cycle Although each individual instruction has latency of 5 cycles CPI > 1.0 Sometimes must stall or cancel branches Computing CPI C clock cycles I instructions executed to completion B bubbles injected (C = I + B) CPI = C/I = (I+B)/I = 1.0 + B/I Factor B/I represents average penalty due to bubbles
CPI for PIPE (Cont.) B/I = LP + MP + RP LP: Penalty due to load/use hazard stalling Fraction of instructions that are loads 0.25 Fraction of load instructions requiring stall 0.20 Number of bubbles injected each time 1 LP = 0.25 * 0.20 * 1 = 0.05 MP: Penalty due to mispredicted branches Fraction of instructions that are cond. jumps 0.20 Fraction of cond. jumps mispredicted 0.40 Number of bubbles injected each time 2 MP = 0.20 * 0.40 * 2 = 0.16 RP: Penalty due to ret instructions Fraction of instructions that are returns 0.02 Number of bubbles injected each time 3 RP = 0.02 * 3 = 0.06 Net effect of penalties 0.05 + 0.16 + 0.06 = 0.27 CPI = 1.27 (Not bad!) Typical Values
Fetch Logic Revisited During Fetch Cycle Timing Select PC Read bytes from instruction memory Examine icode to determine instruction length Increment PC Timing Steps 2 & 4 require significant amount of time
need_regids, need_valC Standard Fetch Timing need_regids, need_valC Select PC Mem. Read Increment 1 clock cycle Must Perform Everything in Sequence Can’t compute incremented PC until know how much to increment it by
A Fast PC Increment Circuit incrPC High-order 60 bits Low-order 4 bits carry MUX 1 Slow Fast 60-bit incre- menter 4-bit adder need_regids High-order 60 bits need_ValC Low-order 4 bits PC
need_regids, need_valC Modified Fetch Timing need_regids, need_valC 4-bit add Select PC Mem. Read MUX Incrementer Standard cycle 1 clock cycle 60-Bit Incrementer Acts as soon as PC selected Output not needed until final MUX Works in parallel with memory read
More Realistic Fetch Logic Bytes 1-9 Fetch Box Integrated into instruction cache Fetches entire cache block (16 or 32 bytes) Selects current instruction from current block Works ahead to fetch next block As reaches end of current block At branch target
Modern CPU Design
Instruction Control Grabs Instruction Bytes From Memory Based on Current PC + Predicted Targets for Predicted Branches Hardware dynamically guesses whether branches taken/not taken and (possibly) branch target Translates Instructions Into Operations Primitive steps required to perform instruction Typical instruction requires 1–3 operations Converts Register References Into Tags Abstract identifier linking destination of one operation with sources of later operations
Execution Unit Multiple functional units Integer/ Branch FP Add Mult /Div Load Store Data Cache Prediction OK? Addr . General Integer Operation Results Register Updates Operations Execution Unit Multiple functional units Each can operate in independently Operations performed as soon as operands available Not necessarily in program order Within limits of functional units Control logic Ensures behavior equivalent to sequential program execution
CPU Capabilities of Intel Haswell Multiple Instructions Can Execute in Parallel 2 load 1 store 4 integer 2 FP multiply 1 FP add / divide Some Instructions Take > 1 Cycle, but Can be Pipelined Instruction Latency Cycles/Issue Load / Store 4 1 Integer Multiply 3 1 Integer Divide 3—30 3—30 Double/Single FP Multiply 5 1 Double/Single FP Add 3 1 Double/Single FP Divide 10—15 6—11
Haswell Operation Translates instructions dynamically into “Uops” ~118 bits wide Holds operation, two sources, and destination Executes Uops with “Out of Order” engine Uop executed when Operands available Functional unit available Execution controlled by “Reservation Stations” Keeps track of data dependencies between uops Allocates resources
High-Performance Branch Prediction Critical to Performance Typically 11–15 cycle penalty for misprediction Branch Target Buffer 512 entries 4 bits of history Adaptive algorithm Can recognize repeated patterns, e.g., alternating taken–not taken Handling BTB misses Detect in ~cycle 6 Predict taken for negative offset, not taken for positive Loops vs. conditionals
Example Branch Prediction Branch History Encode information about prior history of branch instructions Predict whether or not branch will be taken State Machine Each time branch taken, transition to right When not taken, transition to left Predict branch taken when in state Yes! or Yes? T Yes! Yes? No? No! NT
Processor Summary Design Technique Operation Enhancing Performance Create uniform framework for all instructions Want to share hardware among instructions Connect standard logic blocks with bits of control logic Operation State held in memories and clocked registers Computation done by combinational logic Clocking of registers/memories sufficient to control overall behavior Enhancing Performance Pipelining increases throughput and improves resource utilization Must make sure to maintain ISA behavior