1
Multithreaded SPARC v8 Functional Model for RAMP Gold
Zhangxi Tan, UC Berkeley
RAMP Retreat, Jan 17, 2008
2
Motivation
Traditional RISC optimizations are far less appealing on soft-core processors on FPGAs
– Mapped to expensive wide bus muxes; becomes area/frequency bottleneck on fabric
    Bypassing network
    Delayed branch
– Less efficient when dealing with memory access latency
    Small cache size & shared memory controller make things even worse
– Poor core count on single FPGA! (e.g. V5 LX110T)
    <16 32-bit SPARC V8 Leon integer pipelines
3
Approach
Need a new functional model, which is able to
– Support a large number of emulated cores (~1k) per BEE3 board
– Accelerate aggregate emulation performance (MIPS/chip)
    Including optimizations to tolerate memory & I/O latency
– Run full OS and support OS development
    TLB/exception support
    Memory mapped I/O + IRQ support
– Interface with timing model
Virtualizing SPARC V8 RTL with fine-grain multithreading
– High density design (256/512 emulated CPUs per chip)
    8 cores in 2 clusters per FPGA (V5 LX110T); each core has 32 or 64 threads (configurable)
    4 cores in one cluster share one BEE3 mem controller
– Start from 32-bit ISA, eventually support 64-bit ISA (v9)
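The density figures on this slide follow from simple multiplication. The sketch below reproduces them; `FPGAS_PER_BOARD = 4` is an assumption not stated on the slide (it is what makes the "~1k per BEE3 board" target line up with 256 CPUs per chip), and all names are hypothetical.

```python
# Capacity arithmetic for the configuration described on the slide.
CORES_PER_FPGA = 8          # from the slide: 8 cores in 2 clusters per FPGA
THREAD_OPTIONS = (32, 64)   # from the slide: 32 or 64 threads per core
FPGAS_PER_BOARD = 4         # ASSUMPTION: not stated on the slide


def emulated_cpus_per_chip(threads_per_core, cores=CORES_PER_FPGA):
    """Each hardware thread emulates one target CPU."""
    return cores * threads_per_core


per_chip = {t: emulated_cpus_per_chip(t) for t in THREAD_OPTIONS}
per_board = {t: n * FPGAS_PER_BOARD for t, n in per_chip.items()}

for t in THREAD_OPTIONS:
    print(f"{t} threads/core: {per_chip[t]} CPUs/chip, {per_board[t]} CPUs/board")
```

Under that assumption, the 32-thread configuration already reaches the ~1k emulated cores per board that the slide asks for.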
4
Design philosophy 1
Keep everything simple!
– Build processor w/o bypassing network
    Greatly simplify pipeline design
    Preliminary result shows ~28% LUT reduction + ~18% frequency improvement on Leon3 processor
– Direct-mapped cache/TLB
– Simple fine-grain multithreading to fill pipeline bubbles
    Static RR issue: T1->T2->T3->T4->T1->T2...
    Never stall the pipeline
      – Long latency operations?
      – Tell the pipeline to REPLAY the instruction in the next rotation
– “Microcode” for complex instructions/trap handling
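The static round-robin issue with replay can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the RTL: the function name `simulate`, the `(name, is_long_latency)` instruction encoding, and the fixed `latency` parameter are all hypothetical. The point it shows is that the issue slot always moves to the next thread; a thread whose long-latency operation has not finished simply re-issues the same instruction on its next turn.

```python
def simulate(programs, latency, max_cycles=1000):
    """Static round-robin issue with replay (behavioral sketch).

    programs: one instruction list per thread; each instruction is a
              (name, is_long_latency) tuple (hypothetical encoding).
    latency:  cycles until a long-latency operation completes (assumed fixed).
    Returns a trace of (cycle, thread, instruction, action) tuples.
    """
    n = len(programs)
    pc = [0] * n             # next instruction index per thread
    ready_at = [None] * n    # cycle at which a pending long op completes
    trace = []
    cycle = 0
    while any(pc[t] < len(programs[t]) for t in range(n)) and cycle < max_cycles:
        t = cycle % n        # static rotation: T0 -> T1 -> ... -> T0
        if pc[t] >= len(programs[t]):
            trace.append((cycle, t, None, "idle"))
        else:
            name, is_long = programs[t][pc[t]]
            if is_long and ready_at[t] is None:
                # First attempt: start the operation, replay next rotation.
                ready_at[t] = cycle + latency
                trace.append((cycle, t, name, "replay"))
            elif is_long and cycle < ready_at[t]:
                # Still outstanding: replay again, the pipeline never stalls.
                trace.append((cycle, t, name, "replay"))
            else:
                ready_at[t] = None
                pc[t] += 1
                trace.append((cycle, t, name, "retire"))
        cycle += 1
    return trace


demo = simulate(
    [[("ld", True), ("add", False)], [("add", False), ("sub", False)]],
    latency=3,
)
for entry in demo:
    print(entry)
```

In the demo, thread 0's load is replayed twice while thread 1 keeps retiring instructions in its own slots, which is the latency-hiding effect the slide describes.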
5
Design philosophy 2
Design for fabric (Targeting Virtex 5)
– High working frequency (expect ~150 MHz)
    Deep pipeline: 10~11 physical stages
– Manually controlled FPGA resources mapping
    BRAMs, LUTRAM
    Use V5 DSPs as ALU
    Pipelining all BRAMs and DSPs (maximize Fmax)
– Error detection/correction for all BRAMs
    Cache tags and register file use parity bit to detect soft errors
    TLB entry and cache data are protected by built-in V5 ECC-BRAM
6
Challenges
Thread state storage & per-thread L1 cache
– Will BRAM/LUTRAM fit?
– How large?
– Where to map? LUTRAM or BRAM
Bandwidth and RW ports requirement
– Multithreading amplifies the requirement!
How to make use of FPGA primitives to control total LUT usage
– 6-input LUTs: LUT5_2, RAM64B
– DSPs
7
State storage
Main thread state (integer pipeline)
– 3 register windows per thread (2 minimum by specification, 3 for performance)
    8 global + 16*3 window registers
    Stored in BRAM in chunks of 64 registers
– PC/nPC: LUTRAM
– PSR (processor state register): LUTRAM
– WIM (window invalid mask): LUTRAM
– TBR (trap base register): BRAM, packed w. 3 reg window
– Y (high 32 bits for mul/div): LUTRAM
8
Regfile layout

Thread  BRAM Address  BRAM Content
0       0-7           Global registers g0-g7
0       8             TBR
0       9-15          Scratch registers for microcode mode
0       16-63         3-register window
1       64-71         Global registers g0-g7
1       72            TBR
1       73-79         Scratch registers for microcode mode
1       80-127        3-register window
2       ...           ...

64 threads per pipeline, 8 pipelines per chip (V5 LX110T)
Eight 18kb blocks
Double clocked BRAM (virtually 4 ports)
Indexed with {thread_id, reg_addr}
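The `{thread_id, reg_addr}` indexing can be sketched as a bit concatenation with a 6-bit per-thread offset, which reproduces the addresses in the table. The base offsets below come from the table; the mapping of architectural registers r8-r31 onto the 48 window slots in `window_reg_offset` is an assumed layout for illustration only (the slide does not give it). It relies on the standard SPARC V8 rule that a window's outs are the ins of the window at CWP-1.

```python
REGS_PER_THREAD = 64   # from the slide: chunks of 64 registers
NWINDOWS = 3           # from the slide: 3 register windows per thread
GLOBAL_BASE = 0        # offsets 0-7:   g0-g7
TBR_OFFSET = 8         # offset 8:      TBR
SCRATCH_BASE = 9       # offsets 9-15:  microcode scratch
WINDOW_BASE = 16       # offsets 16-63: 3 windows x 16 registers


def bram_address(thread_id, reg_addr):
    """Concatenate {thread_id, reg_addr}; reg_addr is a 6-bit offset."""
    assert 0 <= reg_addr < REGS_PER_THREAD
    return (thread_id << 6) | reg_addr


def window_reg_offset(cwp, r):
    """Map architectural register r (0-31) under window cwp to an offset.

    ASSUMED layout (hypothetical): each window stores 8 locals followed by
    8 ins; the outs of window w alias the ins of window (w - 1) mod NWINDOWS.
    """
    assert 0 <= cwp < NWINDOWS and 0 <= r < 32
    if r < 8:                                   # globals g0-g7
        return GLOBAL_BASE + r
    if r < 16:                                  # outs o0-o7
        return WINDOW_BASE + 16 * ((cwp - 1) % NWINDOWS) + 8 + (r - 8)
    if r < 24:                                  # locals l0-l7
        return WINDOW_BASE + 16 * cwp + (r - 16)
    return WINDOW_BASE + 16 * cwp + 8 + (r - 24)  # ins i0-i7


# Thread 1's TBR and the first slot of its window area, as in the table.
print(bram_address(1, TBR_OFFSET), bram_address(1, WINDOW_BASE))
```

With 64 threads per pipeline this gives 64 x 64 = 4096 entries per register file.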
9
Cache & TLB
Per thread Cache
– Split I/D direct-mapped write-allocate write-back cache
    Block size: 32 bytes (BEE3 DDR2 controller heart beat)
    512B total in 64-thread configuration: 256B I$, 256B D$
      – Size doubled (1KB) for 32-thread configuration
    Non-blocking to a different thread, but blocking to the same thread
    CPU and memory controller access cache at the same time through different ports
– Physical tag
Per thread TLB
– Split I/D direct-mapped TLB
    16 entries in total: 8 for ITLB and 8 for DTLB
Total BRAM usage per thread (regfile + cache/TLB + tag + misc): 30~32 blocks (18kb)
BRAM is still the critical resource
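For a direct-mapped cache with 32-byte blocks, the sizes above fix how a physical address splits into tag, line index, and block offset: a 256B cache holds 8 lines, so 5 offset bits and 3 index bits. The sketch below shows that split; `split_address` is a hypothetical helper, and the 512B figure for the 32-thread configuration assumes the doubled 1KB is divided evenly between I$ and D$.

```python
BLOCK_BYTES = 32  # from the slide: 32-byte blocks


def split_address(paddr, cache_bytes=256):
    """Split a physical address into (tag, index, offset) for one
    direct-mapped cache of `cache_bytes` with 32-byte blocks."""
    lines = cache_bytes // BLOCK_BYTES            # 8 lines for 256B
    offset = paddr % BLOCK_BYTES                  # byte within the block
    index = (paddr // BLOCK_BYTES) % lines        # which line to probe
    tag = paddr // (BLOCK_BYTES * lines)          # compared with stored tag
    return tag, index, offset


# 64-thread configuration: 256B per I$ or D$.
print(split_address(0x1234, cache_bytes=256))
# 32-thread configuration, ASSUMING an even 512B/512B split of the 1KB.
print(split_address(0x1234, cache_bytes=512))
```

Because the cache is direct-mapped, a lookup only compares the one stored tag at `index` against `tag`; there is no associativity search.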
10
DSP48E are perfect for ALU
DSP48E is a MAC. Two 48-bit inputs, one 48-bit output
– Add/subtract/logic/bypass/address calculation
– Pattern detector (generate Z flag)
    <10 LUTs for C, O, nothing for N
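The flag costs listed above make sense once the four integer condition codes are written out for a 32-bit add. The following is a behavioral sketch only (the function `addcc_flags` is hypothetical and says nothing about how the DSP48E is actually configured): Z is an all-zeros comparison, which is the job the pattern detector does; N is just bit 31 of the result, so it needs no extra logic; C and O (overflow) need a little logic around the operand and result sign bits.

```python
MASK32 = 0xFFFFFFFF


def addcc_flags(a, b):
    """Return (result, flags) for a 32-bit add that sets condition codes."""
    a &= MASK32
    b &= MASK32
    full = a + b                      # up to 33 bits
    result = full & MASK32
    flags = {
        "N": result >> 31,            # sign bit of the result
        "Z": int(result == 0),        # all-zero pattern match
        "C": full >> 32,              # carry out of bit 31
        # Overflow: operands share a sign that differs from the result's.
        "O": ((~(a ^ b) & (a ^ result)) >> 31) & 1,
    }
    return result, flags


print(addcc_flags(0x7FFFFFFF, 1))    # signed overflow case
print(addcc_flags(0xFFFFFFFF, 1))    # wraps to zero with carry out
```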
11
Mapping SPARC instructions to DSP48E
Most of SPARC v8 instructions can be covered by DSP48E
– 1 cycle ALU (1 DSP)
    LD/ST (address calculation)
    Bit-wise logic (and, or, ...)
    SETHI
    JMPL, RETT, Call
    Write special register (WRPSR)
    SAVE/RESTORE
– Long latency ALU
    Pipelined shift/Mul (4 DSPs)
    Divide (1 DSP)
– Misc
    RDPSR, RDWIM (XOR ops.)
Only one 32-bit adder is not in DSP (nPC+4)
DSP48E is not a silver bullet
– Barrel shifter/shifter support is weak
    Altera does better on shifters
– 48-bit is odd!
    Expecting 64-bit inputs DSPs w. 32x32 multipliers (DSP64E?)
12
Pipeline Arch
7-stage pipeline
– MMU support soon
13
Status
Coded in SystemVerilog
– ~4000 lines of code implemented
Push to synthesis tools in Feb 08
– Synthesize with Precision or Synplify
– Full V8 instruction (integer) support (no MMU)
– Aiming ~150 MHz, estimate <4000 LUTs per core
Verification Goal
– Pass microSPARC verification suite / sparc.org certification test
14
Backup Slides
15
SPARC vs MIPS
Similar ISA
– Similar ALU/Jump and Link/Jump instructions
– Similar LD/ST inst. (LDB, LDH, LDW)
– Delayed branch
Except
– Branch on 4 condition codes (N, C, O, Z)
    E.g. Addcc r1, r2, r3
         Bicc address
– Trap on condition code for SW traps (e.g. system call)
– Register window (2-32 windows)
    Only 1 window (32 registers) is active, controlled by CWP field in Processor State Register (PSR)
    SAVE/RESTORE, RETT, trap will affect the window
    SAVE/RESTORE are commonly used in function calls
– No FPU/integer register file transfer instructions
– Difference in atomic instructions: MIPS: LL/SC, SPARC: LDSTUB, SWAP