Two-issue Super Scalar CPU
CPU structure
What we had to deal with:
- doubled clock generation
- two-port instruction cache
- two-port instruction fetch (bubble handling)
- decode stage (instruction handling, scoreboard implemented)
- execute stage (doubled execution unit, forwarding, branch resolving, write-back ports)
- load-store stage (memory access handling, doubled write-back signal)
Top level model
- Global 50 MHz clock connected to a DLL component, which performs clock frequency doubling
- The doubled clock is needed to implement the 4-port Block RAM
[Diagram labels: performance counter, CPU, chipset, DLL, CLK, IO interface, CLK0, CLK2x]
Instruction cache
- Block RAM extended to a two-port implementation
- Cache hit and miss tests for both ports
- One memory port: the FSM responsible for memory access is switched between the two requests from instruction fetch
[Diagram labels: first port, second port, Block RAM, FSM, Memory Access]
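The switching of the single memory port between the two fetch requests can be pictured as a small state machine. The following is only a behavioural sketch in Python, not the original hardware description: the state names, the `next_state` function, and the choice to serve the first port before the second are all assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    SERVE_PORT0 = 1  # memory port is refilling the line requested by the first port
    SERVE_PORT1 = 2  # memory port is refilling the line requested by the second port


def next_state(state, miss0, miss1, mem_done):
    """One FSM step (hypothetical model).

    miss0 / miss1: the port currently has an unserved cache miss.
    mem_done:      the memory access in progress has finished.
    Assumption: when both ports miss, the first port is served first.
    """
    if state is State.IDLE:
        if miss0:
            return State.SERVE_PORT0
        if miss1:
            return State.SERVE_PORT1
        return State.IDLE
    if not mem_done:
        return state  # keep the memory port until the access completes
    if state is State.SERVE_PORT0 and miss1:
        return State.SERVE_PORT1  # hand the memory port over to the second request
    return State.IDLE
```

With both ports missing, this model walks through IDLE, SERVE_PORT0, SERVE_PORT1 and back to IDLE, so the two requests never use the memory port at the same time.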
Instruction fetch
- Fetches two instructions from the cache
- Bubble insertion for each instruction stream
- Instructions are passed to the output in order
[Diagram labels: two instruction cache ports, Instruction Fetch, two decode stage ports, branch request, bubble1, bubble2]
Decode stage
- Decodes two instructions
- Quad-port Block RAM inferred
- Takes advantage of the doubled clock: double write-back handling
- Scoreboard implemented: a set of conditions for checking data dependencies
- Bubble generation
- Instruction stream prepared for the load-store stage
[Diagram labels: two instruction fetch ports, two execute stage ports, Scoreboard, Block RAM, Write-back, Instruction decoding, Write-back, Previous Instr.]
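One way to picture the double write-back handling is that each CPU cycle contains two phases of the doubled clock, and each phase can take one write-back into the register file. The Python model below is a rough sketch under that assumption. The class `RegisterFile`, its method `cpu_cycle`, the register count, and the rule that reads see values written in the same cycle are all hypothetical and not taken from the original design.

```python
class RegisterFile:
    """Behavioural model of a register file with four read ports and
    two write-backs per CPU cycle (one per doubled-clock phase)."""

    def __init__(self, size=32):
        self.regs = [0] * size

    def cpu_cycle(self, reads, writebacks):
        """reads:      four register indices (two operands per instruction).
        writebacks: two entries, each None or an (index, value) pair.
        Assumption: both write-backs land before the operands are read.
        """
        for phase in range(2):  # two phases of the 2x clock per CPU cycle
            wb = writebacks[phase]
            if wb is not None:
                index, value = wb
                self.regs[index] = value
        return [self.regs[i] for i in reads]
```

For example, `RegisterFile(8).cpu_cycle([1, 2, 3, 4], [(1, 10), (2, 20)])` writes two registers and returns all four operands within one modelled CPU cycle.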
Scoreboard
- Simplification of a full scoreboard unit
- Introduced as a set of conditions implemented in the decode stage
- Used for bubble insertion of both types (concurrent and consecutive instructions) and for separating memory access instructions
- Represented by an abstract instruction table consisting of two lines, with the columns: Nr, Instruction, Idx_d, Idx_a, Idx_b, Executability
- In practice, the table corresponds to the outputs of instruction fetch
[Table rows shown on the slide: 1 MUL, 2 ST]
A few examples
First, normal operation without any bubble insertion: the two instructions are fully independent.
[Diagram labels: Write-back, two instruction fetch ports, two execute stage ports, Block RAM, Instruction decoding, Scoreboard, Previous Instr.]
Bubble insertion caused by data dependencies between concurrent instructions.
[Diagram labels: two instruction fetch ports, two execute stage ports, Block RAM, Instruction decoding, Write-back, Scoreboard, Previous Instr.]
Bubble insertion caused by data dependencies between a load instruction and arbitrary consecutive instructions.
[Diagram labels: two execute stage ports, Block RAM, Instruction decoding, Write-back, Instr, Instr $1,$0, LD $0, Instr, Scoreboard, Previous Instr.]
Bubble insertion introduced to split two memory-access instructions.
[Diagram labels: two execute stage ports, Block RAM, Instruction decoding, Write-back, LD, ST, Instr, Scoreboard, Previous Instr.]
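The three bubble cases above can be summarised as a handful of conditions on the two fetched instructions. The Python sketch below is an illustration only: the `Instr` record, the `issue` function, the treatment of LD and ST as the only memory-access operations, and the decision to hold back both instructions when the first one waits on a load are assumptions, not the original decode logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[int] = None   # Idx_d
    src_a: Optional[int] = None  # Idx_a
    src_b: Optional[int] = None  # Idx_b

    @property
    def is_mem(self) -> bool:
        return self.op in ("LD", "ST")  # assumed set of memory-access ops

    def reads(self, reg: Optional[int]) -> bool:
        return reg is not None and reg in (self.src_a, self.src_b)


def issue(first: Instr, second: Instr, prev_load_dest: Optional[int] = None):
    """Return (issue_first, issue_second); False means a bubble in that slot.

    prev_load_dest: destination register of a load issued in the previous
    cycle, or None if the previous instructions contained no load.
    """
    if first.reads(prev_load_dest):
        # Consecutive dependency on a load; assumption: in-order issue,
        # so the second instruction is held back as well.
        return (False, False)
    blocked = (
        second.reads(first.dest)         # dependency between concurrent instructions
        or second.reads(prev_load_dest)  # consecutive dependency on a load
        or (first.is_mem and second.is_mem)  # split two memory accesses
    )
    return (True, not blocked)
```

A call such as `issue(Instr("LD", 1, 2), Instr("ST", None, 3, 4))` returns `(True, False)`, meaning the store is delayed by one bubble so that only one memory access proceeds at a time.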
Execute stage
- Doubled ALU
- Resolving of branch priority
- Forwarding from both instruction streams
- Write-back generation
[Diagram labels: two decode stage ports, two load store stage ports, Data forwarding, ALU, Register, branch request]
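Forwarding from both instruction streams and branch priority can be sketched as two small selection functions. This is a hedged Python illustration rather than the actual execute stage: the function names, the `(dest_reg, value)` tuple format, and both priority rules (the second stream wins when both forward the same register, the first stream wins when both request a branch) are assumptions based on program order.

```python
def read_operand(reg, regfile, fwd_first, fwd_second):
    """Select an operand value, preferring forwarded results.

    fwd_first / fwd_second: None or a (dest_reg, value) pair produced by
    the first / second instruction stream of the previous cycle.
    Assumption: the second stream is later in program order, so its
    result takes priority if both streams wrote the same register.
    """
    for fwd in (fwd_second, fwd_first):
        if fwd is not None and fwd[0] == reg:
            return fwd[1]
    return regfile[reg]


def resolve_branch(taken_first, target_first, taken_second, target_second):
    """Return the branch target to request, or None if no branch is taken.

    Assumption: the first stream is earlier in program order, so its
    taken branch overrides a branch in the second stream.
    """
    if taken_first:
        return target_first
    if taken_second:
        return target_second
    return None
```

With `regfile = [0, 11, 22, 33]`, the call `read_operand(1, regfile, (1, 100), (1, 200))` returns the value forwarded from the second stream instead of the stale register contents.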
Load-store stage
- It is ensured that only one memory access instruction is passed to the load-store unit
- The memory access process is switched to the right instruction
- Write-back signals are generated
[Diagram labels: write back signals, write back from execute, memory access, write back multiplexing, memory ports]
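Because at most one of the two instructions reaching this stage accesses memory, selecting the right instruction reduces to a simple multiplexer. The Python sketch below assumes hypothetical names (`MEM_OPS`, `select_memory_stream`) and treats LD and ST as the only memory-access operations; it is meant as an illustration, not as the original implementation.

```python
MEM_OPS = ("LD", "ST")  # assumed set of memory-access operations


def select_memory_stream(op_first, op_second):
    """Return 0 or 1 for the stream that owns the memory port, or None.

    Raises ValueError if both instructions access memory, since the
    earlier stages are expected to have split such a pair already.
    """
    first_mem = op_first in MEM_OPS
    second_mem = op_second in MEM_OPS
    if first_mem and second_mem:
        raise ValueError("two memory accesses reached the load-store stage")
    if first_mem:
        return 0
    if second_mem:
        return 1
    return None
```

The other stream simply passes its write-back from the execute stage through, which is why the stage ends in a write-back multiplexer.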
In action
Performance (1) - blinking leds
Additional parameters:
Number of simulated cycles:
Execution frequency of memory access instructions compared with the number of all instructions:
- Super Sc: 0.29
- SIMD: 0.24
ALU instructions:
- Super Sc: 0.14
- SIMD: 0.13
Instructions/cycle table (labels: SIMD, Super scalar, SIMD; values: 0.5, 0.42)
Performance (2) - apfel
Additional parameters:
Execution frequency of memory access instructions:
- both: 0.2
ALU instructions:
- both: 0.4
The measured instruction execution frequencies are surprising, probably because many memory access instructions are executed at the beginning of the program (the longer the simulation time, the better the results we should get).
Instructions/cycle table (labels: SIMD, Super scalar, SIMD; values: 0.56, 0.45)
Synthesis
- The last version seen working on the XCV300 was the 2-way SIMD (MUCH faster than the HaPra CPU!)
- The 4-way SIMD and Super Scalar versions are too big for the XCV and, for unknown reasons, don't work on the XCV800
- Probably severe timing issues: running at 25 MHz instead of 50 MHz doesn't help (but the 4-way SIMD should work anyway!)
- All we've got is a fully working simulation