Multithreaded SPARC v8 Functional Model for RAMP Gold
Zhangxi Tan, UC Berkeley
RAMP Retreat, Jan 17, 2008

Motivation
Traditional RISC optimizations are far less appealing on soft-core processors on FPGAs
 – Mapped to expensive wide bus muxes; becomes area/frequency bottleneck on fabric
     Bypassing network
     Delayed branch
 – Less efficient when dealing with memory access latency
     Small cache size & shared memory controller make things even worse
 – Poor core count on single FPGA! (e.g. V5 LX110T)
     <16 32-bit SPARC V8 Leon integer pipelines

Approach
Need a new functional model, which is able to
 – Support a large number of emulated cores (~1k) per BEE3 board
 – Accelerate aggregate emulation performance (MIPS/chip)
     Including optimizations to tolerate memory & I/O latency
 – Run full OS and support OS development
     TLB/exception support
     Memory-mapped I/O + IRQ support
 – Interface with timing model
Virtualizing SPARC V8 RTL with fine-grain multithreading
 – High-density design (256/512 emulated CPUs per chip)
     8 cores in 2 clusters per FPGA (V5 LX110T); each core has 32 or 64 threads (configurable)
     4 cores in one cluster share one BEE3 mem controller
 – Start from 32-bit ISA, eventually support 64-bit ISA (v9)

Design philosophy 1
Keep everything simple!
 – Build processor w/o bypassing network
     Greatly simplifies pipeline design
     Preliminary result shows ~28% LUT reduction + ~18% frequency improvement on Leon3 processor
 – Direct-mapped cache/TLB
 – Simple fine-grain multithreading to fill pipeline bubbles
     Static RR issue: T1->T2->T3->T4->T1->T2...
     Never stall the pipeline
       Long-latency operations? Tell the pipeline to REPLAY the instruction in the next rotation
 – "Microcode" for complex instructions/trap handling
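The static round-robin issue with replay described on this slide can be illustrated with a small software model. This is a minimal sketch, not the RTL: the function name `run`, the `ready_at` table, and the one-instruction-per-cycle issue model are assumptions made for illustration. Each cycle the next thread in fixed order issues; if its instruction is a long-latency operation that has not completed yet, the thread keeps its PC and retries on its next turn, so the pipeline itself never stalls.

```python
def run(num_threads, program_len, ready_at, cycles):
    """Simulate static round-robin issue with replay.

    ready_at maps (thread_id, pc) -> first cycle at which that instruction
    can complete (hypothetical stand-in for a cache miss or long-latency op).
    Returns the final per-thread PCs and a per-cycle trace.
    """
    pcs = [0] * num_threads
    trace = []
    for cycle in range(cycles):
        tid = cycle % num_threads            # fixed rotation T0->T1->...->T0
        pc = pcs[tid]
        if pc >= program_len:
            trace.append((cycle, tid, pc, "idle"))
        elif cycle < ready_at.get((tid, pc), 0):
            # Result not ready: do not stall, replay in the next rotation.
            trace.append((cycle, tid, pc, "replay"))
        else:
            pcs[tid] = pc + 1
            trace.append((cycle, tid, pc, "retire"))
    return pcs, trace


if __name__ == "__main__":
    # Thread 1's first instruction is not ready until cycle 6.
    final_pcs, trace = run(num_threads=4, program_len=2,
                           ready_at={(1, 0): 6}, cycles=12)
    for entry in trace:
        print(entry)
    print(final_pcs)
```

In this model the other threads keep retiring instructions while thread 1 replays, which is the latency-hiding effect the slide relies on.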

Design philosophy 2
Design for fabric (targeting Virtex 5)
 – High working frequency (expect ~150 MHz)
     Deep pipeline: 10~11 physical stages
 – Manually controlled FPGA resource mapping
     BRAMs, LUTRAM
     Use V5 DSPs as ALU
     Pipeline all BRAMs and DSPs (maximize Fmax)
 – Error detection/correction for all BRAMs
     Cache tags and register file use a parity bit to detect soft errors
     TLB entries and cache data are protected by built-in V5 ECC BRAM

Challenges
Thread state storage & per-thread L1 cache
 – Will BRAM/LUTRAM fit?
 – How large?
 – Where to map? LUTRAM or BRAM
Bandwidth and RW ports requirement
 – Multithreading amplifies the requirement!
How to make use of FPGA primitives to control total LUT usage
 – 6-input LUTs: LUT5_2, RAM64B
 – DSPs

State storage
Main thread state (integer pipeline)
 – 3 register windows per thread (2 minimum by specification, 3 for performance)
     8 global + 16*3 window registers
     Stored in BRAM in chunks of 64 registers
 – PC/nPC: LUTRAM
 – PSR (processor state register): LUTRAM
 – WIM (window invalid mask): LUTRAM
 – TBR (trap base register): BRAM, packed w. the 3 register windows
 – Y (high 32 bits for mul/div): LUTRAM

Regfile layout

Thread  BRAM address  BRAM content
0       0-7           Global registers g0-g7
0       8             TBR
0       9-15          Scratch registers for microcode mode
0       16-63         3 register windows
1       64-71         Global registers g0-g7
1       72            TBR
1       73-79         Scratch registers for microcode mode
1       80-127        3 register windows
2       ...           ...

 – 64 threads per pipeline, 8 pipelines per chip (V5 LX110T)
 – Eight 18kb blocks
 – Double-clocked BRAM (virtually 4 ports)
 – Indexed with {thread_id, reg_addr}
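The {thread_id, reg_addr} indexing can be sketched as a concatenation of a thread number with a 6-bit per-thread offset. The address ranges (globals at 0-7, TBR at 8, windows at 16-63) come from the layout on this slide; the function names and, in particular, the way architectural registers r8-r31 are folded onto the 48 window slots are assumptions for illustration, using the usual SPARC overlap where the ins of window w are the outs of window w+1.

```python
NWINDOWS = 3            # register windows per thread (from the slides)
REG_ADDR_BITS = 6       # 64 BRAM entries per thread
TBR_OFFSET = 8          # TBR slot within a thread's chunk
WINDOW_BASE = 16        # first windowed-register slot


def window_offset(reg, cwp):
    """Map architectural register r0-r31 and CWP to a per-thread offset.

    Assumed mapping: globals at 0-7; windowed registers rotate through
    48 slots, 16 new slots per window, so ins overlap the next window's outs.
    """
    if not 0 <= reg < 32:
        raise ValueError("SPARC v8 has 32 visible integer registers")
    if not 0 <= cwp < NWINDOWS:
        raise ValueError("CWP out of range")
    if reg < 8:
        return reg
    return WINDOW_BASE + (cwp * 16 + (reg - 8)) % (NWINDOWS * 16)


def bram_address(thread_id, offset):
    """Concatenate {thread_id, reg_addr} into one BRAM address."""
    return (thread_id << REG_ADDR_BITS) | offset


if __name__ == "__main__":
    print(bram_address(1, TBR_OFFSET))            # TBR of thread 1
    print(bram_address(1, window_offset(8, 0)))   # %o0 of thread 1, window 0
```

With this mapping, `window_offset(24, 2)` and `window_offset(8, 0)` name the same slot, which models the ins of window 2 sharing storage with the outs of window 0.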

Cache & TLB
Per-thread cache
 – Split I/D direct-mapped write-allocate write-back cache
     Block size: 32 bytes (BEE3 DDR2 controller heartbeat)
     512B total in 64-thread configuration: 256B I$, 256B D$
     Size doubled (1KB) for 32-thread configuration
     Non-blocking to a different thread, but blocking to the same thread
     CPU and memory controller access the cache at the same time through different ports
 – Physical tag
Per-thread TLB
 – Split I/D direct-mapped TLB
     16 entries in total: 8 for ITLB and 8 for DTLB
Total BRAM usage per thread (regfile + cache/TLB + tag + misc): 30~32 blocks (18kb)
BRAM is still the critical resource
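The cache sizes above imply a very small direct-mapped structure: 256 bytes with 32-byte blocks is 8 lines, so a physical address splits into a 5-bit block offset, a 3-bit index, and a tag. The sketch below works through that arithmetic. The function names are hypothetical, and the assumption that the 1KB configuration is split evenly into 512B I$ and 512B D$ is an inference from "size doubled", not something the slides state.

```python
BLOCK_BYTES = 32  # block size from the slides


def cache_geometry(cache_bytes, block_bytes=BLOCK_BYTES):
    """Return (lines, offset_bits, index_bits) for a direct-mapped cache."""
    lines = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = lines.bit_length() - 1
    return lines, offset_bits, index_bits


def split_address(paddr, cache_bytes):
    """Split a physical address into (tag, index, offset)."""
    lines, offset_bits, index_bits = cache_geometry(cache_bytes)
    offset = paddr & (BLOCK_BYTES - 1)
    index = (paddr >> offset_bits) & (lines - 1)
    tag = paddr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    print(cache_geometry(256))          # 64-thread configuration, one of I$/D$
    print(cache_geometry(512))          # assumed 32-thread configuration
    print(split_address(0x1234, 256))
```

Because the tag is physical, translation through the per-thread TLB has to supply the upper address bits before the tag compare; the index and offset here fit inside any page offset, so they are unaffected by translation.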

DSP48E are perfect for ALU
DSP48E is a MAC: two 48-bit inputs, one 48-bit output
 – Add/subtract/logic/bypass/address calculation
 – Pattern detector (generates Z flag)
<10 LUTs for C, O; nothing for N
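To show why the condition codes are cheap once the 32-bit sum is available, here is a software model of the flags an ADDcc-style instruction produces. It is a sketch of the flag definitions only, not of the DSP48E configuration: the function name `addcc` is hypothetical, and the overflow flag is labelled O to match the slides (the SPARC manual calls it V). N is just bit 31 of the result, Z is an all-zero compare (the job the pattern detector does in hardware), and C and O need a small amount of extra logic.

```python
MASK32 = 0xFFFFFFFF


def addcc(a, b):
    """Return (result, flags) for a 32-bit add that sets condition codes."""
    a &= MASK32
    b &= MASK32
    wide = a + b                       # keep the carry out of bit 31
    result = wide & MASK32
    n = result >> 31                   # sign bit, no extra logic
    z = int(result == 0)               # all-zero pattern match
    c = wide >> 32                     # carry out of bit 31
    # Signed overflow: operands share a sign that the result does not.
    o = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return result, {"N": n, "Z": z, "O": o, "C": c}


if __name__ == "__main__":
    print(addcc(0x7FFFFFFF, 1))
    print(addcc(0xFFFFFFFF, 1))
```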

Mapping SPARC instructions to DSP48E
Most SPARC v8 instructions can be covered by DSP48E
 – 1-cycle ALU (1 DSP)
     LD/ST (address calculation)
     Bit-wise logic (and, or, ...)
     SETHI
     JMPL, RETT, CALL
     Write special register (WRPSR)
     SAVE/RESTORE
 – Long-latency ALU
     Pipelined shift/mul (4 DSPs)
     Divide (1 DSP)
 – Misc
     RDPSR, RDWIM (XOR ops.)
Only one 32-bit adder is not in DSP (nPC+4)
DSP48E is not a silver bullet
 – Barrel shifter/shifter support is weak
     Altera does better on shifters
 – 48-bit is odd!
     Expecting 64-bit-input DSPs w. 32x32 multipliers (DSP64E?)

Pipeline Arch
7-stage pipeline
 – MMU support soon

Status
Coded in SystemVerilog
 – ~4000 lines of code implemented
Push to synthesis tools in Feb 08
 – Synthesize with Precision or Synplify
 – Full V8 instruction (integer) support (no MMU)
 – Aiming for ~150 MHz, estimated <4000 LUTs per core
Verification
 – Goal: pass microSPARC verification suite / sparc.org certification test

Backup Slides

SPARC vs MIPS
Similar ISA
 – Similar ALU/Jump and Link/Jump instructions
 – Similar LD/ST inst. (LDB, LDH, LDW)
 – Delayed branch
Except
 – Branch on 4 condition codes (N, C, O, Z)
     E.g. Addcc r1, r2, r3
          Bicc address
 – Trap on condition code for SW traps (e.g. system call)
 – Register windows (2-32 windows)
     Only one window (32 registers) is active at a time, selected by the CWP field in the Processor State Register (PSR)
     SAVE/RESTORE, RETT, and traps change the window
     SAVE/RESTORE are commonly used in function calls
 – No FPU <-> integer register file transfer instructions
 – Difference in atomic instructions: MIPS: LL/SC; SPARC: LDSTUB, SWAP
