Published by Drake Bradstreet. Modified over 10 years ago.
1
RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors
Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic
Parallel Computing Lab, EECS, UC Berkeley
March 2010
2
Outline
  Overview
  RAMP Gold HW Architecture and Implementation
  RAMP Gold Software Infrastructure
  Usage Case and Live Demo
  Future work
3
Overview
Purpose of RAMP Gold
  An FPGA-based simulator of shared-memory multicore targets for the Par Lab
  Use cases: architecture, OS, and application studies
Highlights of RAMP Gold
  Runs on a $750 Xilinx XUP V5 board
  Written in SystemVerilog; no special CAD tools required; works with standard FPGA CAD flows (Synplify/ISE/ModelSim)
  Two orders of magnitude faster than Simics+GEMS
  Runtime-configurable parameters without resynthesis
  Full RTL verification environment and software infrastructure
  BSD and GNU licenses
4
Simulation Jargon
Target vs. host
  Target: the system/architecture being simulated, e.g. a SPARC v8 CMP
  Host: the platform on which the simulator runs, e.g. FPGAs
Functional model vs. timing model
  Functional: computes the result of each instruction
  Timing: determines how long each instruction takes
5
RAMP Gold Overall Setup
  Both the functional and timing models run on the FPGA
  App server: controls the simulation and services system calls and I/O
6
Target Machine Template
  64-core SPARC v8 shared-memory machine
  Configurable two-level cache + multichannel DRAM
7
RAMP Gold Performance vs. Simics
  PARSEC parallel benchmarks running on a research OS
  >250x faster than a full-system simulator for a 64-core multiprocessor target
9
RAMP Gold Model Key Concepts
Decoupled functional/timing model, both in hardware
  Enables many FPGA-fabric-friendly optimizations
  Increases modeling efficiency and module reuse
Host multithreading of both functional and timing models
  Hides emulation latencies and improves resource utilization
  Time-multiplexing effects are patched by the timing model
[Figure: functional model pipeline with architectural state; timing model pipeline with timing state]
10
Host multithreading
Example: simulating four independent CPUs
[Figure: target model (CPU 0 to CPU 3) mapped onto one functional CPU model on the FPGA; diagram labels: Thread Select, PC, I$, IR, GPR, ALU, D$]
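The host-multithreading idea on this slide can be pictured in software. The following is a minimal sketch, not RAMP Gold's RTL: it assumes a strict round-robin thread select and a one-instruction toy ISA, and the names `TargetCpu`, `step`, and `run_host_pipeline` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TargetCpu:
    """Architectural state of one simulated (target) CPU in a toy ISA."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)


def step(cpu, program):
    """Execute one toy instruction of the form ("addi", rd, rs, imm)."""
    op, rd, rs, imm = program[cpu.pc % len(program)]
    if op == "addi":
        cpu.regs[rd] = cpu.regs[rs] + imm
    cpu.pc += 1


def run_host_pipeline(cpus, program, host_cycles):
    """Share one host pipeline among all target CPUs.

    Assumption: thread select is plain round-robin, one target
    instruction per host cycle.
    """
    trace = []
    for cycle in range(host_cycles):
        tid = cycle % len(cpus)      # thread select
        step(cpus[tid], program)     # only this thread's state changes
        trace.append(tid)
    return trace


program = [("addi", 1, 1, 1)]            # r1 = r1 + 1, repeated
cpus = [TargetCpu() for _ in range(4)]   # four independent target CPUs
trace = run_host_pipeline(cpus, program, host_cycles=8)
```

In this sketch, eight host cycles advance each of the four target CPUs by two instructions. Each thread only sees its own PC and registers, so the interleaving is invisible to the target software.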
11
Functional Model
  Full SPARC v8 support (FP, MMU, I/O)
  Passes the SPARC v8 certification tests
  Runs Linux and a research OS
12
Timing Model
Simple CPU timing but a detailed memory timing model (i.e. every instruction takes 1 cycle except LD/ST)
Cache models: only tags are stored in BRAMs
  Runtime-configurable parameters: associativity, size, line size, number of banks, latency, etc.
  Models the 3 Cs (compulsory, capacity, and conflict misses) but not the 4th C, coherence (coherence support coming soon)
DRAM model: bandwidth-delay pipe with optional QoS
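A tag-only cache model needs no data array, because the functional model already produces the correct values and the timing model only has to decide hit or miss. The sketch below illustrates that idea under several assumptions that are not from the slides: LRU replacement, stores timed like loads, a single cache level, and hypothetical names (`TagOnlyCache`, `target_cycles`).

```python
class TagOnlyCache:
    """Set-associative cache timing model that stores tags only (no data)."""

    def __init__(self, size_bytes, line_bytes, assoc, hit_latency, miss_latency):
        self.line_bytes = line_bytes
        self.assoc = assoc
        self.num_sets = size_bytes // (line_bytes * assoc)
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency
        # One list of tags per set, least recently used first (assumed LRU).
        self.sets = [[] for _ in range(self.num_sets)]

    def access(self, paddr):
        """Return the latency in target cycles of one access to paddr."""
        line = paddr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.remove(tag)
            ways.append(tag)          # mark as most recently used
            return self.hit_latency
        if len(ways) == self.assoc:
            ways.pop(0)               # evict the least recently used tag
        ways.append(tag)
        return self.miss_latency


def target_cycles(trace, cache):
    """Sum target cycles: 1 per instruction, except LD/ST which ask the cache."""
    total = 0
    for kind, paddr in trace:
        total += cache.access(paddr) if kind in ("LD", "ST") else 1
    return total


cache = TagOnlyCache(size_bytes=1024, line_bytes=32, assoc=2,
                     hit_latency=1, miss_latency=20)
trace = [("ALU", None), ("LD", 0x100), ("LD", 0x104), ("ST", 0x100)]
cycles = target_cycles(trace, cache)   # 1 + 20 (miss) + 1 (hit) + 1 (hit)
```

Because the parameters are ordinary constructor arguments here, changing associativity or latency means building a new object. That is the software analogue of reconfiguring the hardware model at run time without resynthesis.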
13
Debugging and Simulation Configuration
Frontend app server
  Reliable Gigabit Ethernet connection to the FPGA
  Periodically polls the simulator to serve I/O requests
  Transparent to the target (no side effects on simulated timing)
64-bit hardware performance counters collect runtime statistics
  657 counters in the timing model + 10 host counters
  Can be read by either target applications or the app server
  Ring interconnect for counters (easy to add and remove)
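The appeal of a ring for counters is that adding or removing one only touches its neighbours. The sketch below shows that property in software. The actual ring protocol is not described in the slides, so the read-request-hops-around-the-ring behaviour and the names `CounterNode` and `CounterRing` are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1  # counters are 64 bits wide and wrap on overflow


class CounterNode:
    """One 64-bit performance counter sitting on the ring."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def increment(self, amount=1):
        self.value = (self.value + amount) & MASK64


class CounterRing:
    """Counters chained in a ring; a read request visits nodes in order."""

    def __init__(self):
        self.nodes = []

    def add(self, name):
        node = CounterNode(name)
        self.nodes.append(node)      # no other node needs to change
        return node

    def remove(self, name):
        self.nodes = [n for n in self.nodes if n.name != name]

    def read(self, name):
        """Return (value, hops), where hops is how many nodes the request passed."""
        for hops, node in enumerate(self.nodes, start=1):
            if node.name == name:
                return node.value, hops
        raise KeyError(name)


ring = CounterRing()
l1_misses = ring.add("l1_misses")    # hypothetical counter names
l2_misses = ring.add("l2_misses")
l1_misses.increment(3)
l2_misses.increment()
value, hops = ring.read("l2_misses")
```

The trade-off in this picture is read latency, which grows with the position on the ring, against wiring that stays local no matter how many counters exist.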
14
Host Performance
  Timing synchronization is the largest overhead
  The tiny host caches and TLBs are not on the performance-critical path
  Host DRAM bandwidth is not a problem (<15% utilization)
15
Implementation
  Single FPGA: 64 cores @ 90 MHz, 2 GB DDR2 SODIMM
  ~2 hours CAD turnaround time on a mid-range workstation
  BRAM-bound, but has spare logic resources to fit more pipelines
17
Software Tools
SPARC cross compiler with binutils/gcc/glibc
  Supports most POSIX programs
  Static and dynamic linking support
  Built from GNU GCC (4.3.2)
Full software and HW debugging suite
  Low-cost XUP boards sometimes do not work out of the box
  FPGA CAD tools are very bad
18
Target Software
Proxy Kernel: single-protection-domain application host
  Runs programs statically linked against glibc
  Forwards I/O system calls to the x86/Linux host PC
  Presents a simple hard-threads API for multithreaded programs
  Very easy to modify
ROS: UCB's manycore research OS
  Provides multiprogramming support
  Sufficiently POSIX-compliant to run many programs
  Much easier to modify than Linux
  Runs on more than 64 cores
19
Infrastructure
20
Case studies
  Parallel application studies for software programmers
  Parallel OS for system researchers
  Adding hardware performance counters for advanced debugging
  Microarchitecture studies: adding features and modifying existing timing models
  Adding new instructions: changing the functional model
21
Appserver 101
Appserver command-line options:
    Usage: sparc_app [-f ] [-p ] [-s] [binary] [args]
Platform memory tests:
  App server memory test:
    sparc_app -p64 hw memtest none
  Proxy kernel memory test (stress test):
    sparc_app -p64 hw path/kernel.ramp path/memtest
22
For Application Programmers
Main usage scenario: use the runtime-configurable timing model without any FPGA hardware change
  Use hard-threads to write a parallel hello-world program that runs on the proxy kernel
  Compile the program using the cross toolchain:
    sparc-ros-gcc -o hello hello.cpp -lhart
  Measure performance using the performance counters:
    sparc_app -s1 -p64 hw kernel.ramp hello
  Change the target machine configuration on the fly and rerun the experiment:
    edit the file appserver.conf
23
For OS Developers
  The usage model is similar to that for application programmers
  The proxy kernel is a good starting point for learning the bootstrapping process
  ROS is a fully functional kernel
  Demo: boot the ROS kernel using the appserver
    sparc_app -p64 -fappserver_ros.conf hw your_kernel none
24
Adding Hardware Performance Counters
Two types of counter interface
  Global counter:
  Local (per-core) counter:
Modify the Verilog file to add more counters on the ring:
    perfctr_io #(.NLOCAL(num_of_local), .NGLOBAL(num_of_global))
      gen_tm_counter(
        .gclk,
        .rst,
        .bus_out(io_out), .bus_in(io_in), .bus_sel(),  // IO bus interface
        .global_inc(global_counter_inc),
        .local_inc(local_counter_inc),
        .local_tid(local_counter_tid)
      );
Modify the app server to support more counters:
  Add your counter definition in TestAppServer/perfcnt.h
25
Adding Features to Timing Models
Timing models are much simpler than functional models
  ~1,000 LoC vs. 35,000 LoC
Example 1: Changing the cache replacement policy
Example 2: Adding memory QoS
  Lee et al., "Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks," ISCA 2008
  ~100 lines of code added to the timing model
  A new DRAM model
  Several memory-mapped registers added on the functional I/O bus for configuration
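To give a feel for why a DRAM timing model can stay small, here is a minimal sketch of a bandwidth-delay pipe, the kind of DRAM model named on the Timing Model slide. It is an illustration under assumptions, not the RAMP Gold code: it models a single channel with a fixed latency and a fixed per-request occupancy, it leaves out QoS entirely, and `BandwidthDelayPipe` is a hypothetical name.

```python
class BandwidthDelayPipe:
    """DRAM channel modeled as a fixed delay plus a bandwidth limit.

    latency: target cycles from the start of service to completion (delay).
    cycles_per_request: target cycles the channel is busy per request
        (the inverse of bandwidth).
    """

    def __init__(self, latency, cycles_per_request):
        self.latency = latency
        self.cycles_per_request = cycles_per_request
        self.next_free = 0   # earliest cycle the channel accepts a new request

    def issue(self, now):
        """Return the target cycle at which a request issued at `now` completes."""
        start = max(now, self.next_free)          # wait if the channel is busy
        self.next_free = start + self.cycles_per_request
        return start + self.latency


dram = BandwidthDelayPipe(latency=100, cycles_per_request=4)
done = [dram.issue(0), dram.issue(0), dram.issue(1), dram.issue(200)]
```

Back-to-back requests queue behind each other by `cycles_per_request`, while an isolated request sees only the fixed latency. A QoS scheme would change how `start` is chosen per requester, which is one way to read the slide's point that such a feature fits in roughly 100 added lines.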
26
Adding New Instructions
Adding instructions to a feed-through pipeline is straightforward
  FPU instructions were added as new instructions within a week
  Including: new register file, decode, exception/commit, and microcode
Example: adding new atomic instructions through microcode
  4 global scratchpad registers (not visible to the programmer) in the main integer register file for temporary storage
  Two write ports, so that scratchpad registers can be updated along with the architectural register change
27
Steps of Adding Instructions
Add the proper decoding logic in function decode_dsp_add_logic of regacc_dma.sv
Update the writeback/exception stage in file exception_dma.sv to trap to microcode
  Edit function decode_microcode_mode to trap to microcode
  Edit function rd_gen to write the address to scratch register 0 and load the data to scratch register 1
Edit the microcode ROM in Microcode.sv
    //----------SWAP*-------
    9: begin
      uco.uend    = '0;
      uco.cwp_rs1 = '0;
      uco.cwp_rd  = '0;
      uco.inst    = {LDST, 5'b0, ST, REGADDR_SCRATCH_0 | UCI_MASK, 1'b1, 13'b0};
    end
    10: begin
      uco.uend    = '1;
      uco.cwp_rs1 = '0;
      uco.cwp_rd  = '0;
      uco.inst    = {FMT3, 5'b0, IADD, REGADDR_SCRATCH_1 | UCI_MASK, 1'b1, 13'b0};
    end
28
Future work
  Cache coherence models (soon)
  Realistic interconnect model (soon)
  Better CPU core model (next major version)
  Support for other ISAs (next major version)
29
Further References
Research papers
  Use case: "A Case for FAME: FPGA Architecture Model Execution," ISCA 2010
  RAMP Gold design: "RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors," DAC 2010
Beta release
  http://sites.google.com/site/rampgold
30
Backup Slides
31
Functional/Timing Model Interface
    // FM -> TM
    typedef struct {
      bit        valid;    // timing token between FM and TM
      bit [5:0]  tid;      // thread ID
      bit        run;      // CPU state
      bit        replay;   // this instruction needs to be replayed by the FM
      bit        retired;  // retiring an instruction
      bit [31:0] inst;     // the instruction that was retired
      bit [31:0] paddr;    // load/store physical address
      bit [31:0] npc;      // PC of the next fetched instruction
    } tm_cpu_ctrl_token_type;

    // TM -> FM
    typedef struct {
      bit        valid;    // timing token between FM and TM
      bit [5:0]  tid;      // thread ID
      bit        run;      // run bit
    } tm2cpu_token_type;