1 RAMP Infrastructure
Krste Asanovic, UC Berkeley
RAMP Tutorial, ISCA/FCRC, San Diego, June 10, 2007
2 RAMP: An infrastructure to build simulators using FPGAs
3 Run Target Model on Host Platform
[Figure labels: Host Platform; CPU, Interconnect Network, DRAM; Target Model; Hard Work]
4 Reduce, Reuse, Recycle
- Reduce effort to build target models
  - Users just build components; infrastructure handles connections (the RDL Compiler)
- Reuse components by having good abstractions
  - Across different target models
  - Across different host platforms: XUP, Calinx, BEE2, BEE3, also Altera (see Greg)
- Recycle existing IP for use as simulation models
  - Commercial processor RTL is its own model
5 RAMP Target Models
- Units
  - Relatively large chunks of functionality, e.g., processor + L1 cache
  - User-written in some HDL or software
- Channels
  - Point-to-point, unidirectional, two kinds:
    - FIFO channel: flow-controlled interface
    - Pipeline channel: simple shift register, bits drop off the end
  - Generated by RAMP infrastructure
[Figure: Units A, B, and C connected by a FIFO channel and a pipeline channel]
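The two channel kinds can be illustrated with a small software model. This is only a sketch, not RDL-generated code: the class names `PipelineChannel` and `FifoChannel` and their methods are hypothetical, and the FIFO here is untimed (it models only the flow-controlled interface, not latency).

```python
from collections import deque


class PipelineChannel:
    """Fixed-latency shift register (hypothetical model).

    A value shifted in on one tick comes out `latency` ticks later.
    Nothing stalls the channel, so unread values simply drop off the end.
    """

    def __init__(self, latency):
        if latency < 1:
            raise ValueError("this sketch assumes latency >= 1")
        self.stages = deque([None] * latency, maxlen=latency)

    def tick(self, value=None):
        """Advance one target cycle: shift in `value`, return the oldest entry."""
        out = self.stages[0]
        self.stages.append(value)  # maxlen discards the oldest entry
        return out


class FifoChannel:
    """Flow-controlled FIFO interface (hypothetical, untimed model)."""

    def __init__(self, depth):
        self.depth = depth
        self.buf = deque()

    def enq_ready(self):
        return len(self.buf) < self.depth

    def deq_ready(self):
        return len(self.buf) > 0

    def enq(self, value):
        if not self.enq_ready():
            raise RuntimeError("enqueue attempted while channel is full")
        self.buf.append(value)

    def deq(self):
        if not self.deq_ready():
            raise RuntimeError("dequeue attempted while channel is empty")
        return self.buf.popleft()
```

The difference that matters for target designs is visible in the interfaces: the FIFO refuses an enqueue when full, while the pipeline channel accepts a value every tick and silently discards whatever the receiver ignores.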
6 Target FIFO Channel Parameters
- Need buffering of at least (Forward + Reverse) latency to get full bandwidth over the link
- RAMP infrastructure instantiates the channel with the desired parameters
[Figure: FIFO channel annotated with Forward Latency, Reverse Latency, Buffering, and Datawidth; interface signals RDY/ENQ and RDY/DEQ]
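The buffering rule can be checked with a small cycle-level model. The sketch below assumes a credit-based sender (one credit per buffer slot), a receiver that drains every word as soon as it arrives, and latencies of at least one cycle; the function name `link_throughput` and its parameters are illustrative, not part of RAMP.

```python
from collections import deque


def link_throughput(buffering, fwd_latency, rev_latency, cycles=1000):
    """Steady-state words per cycle delivered over a credit-controlled link."""
    if fwd_latency < 1 or rev_latency < 1:
        raise ValueError("this sketch assumes latencies >= 1")
    credits = buffering                       # one credit per buffer slot
    data_wire = deque([0] * fwd_latency)      # words in flight, sender -> receiver
    credit_wire = deque([0] * rev_latency)    # credits in flight, receiver -> sender
    warmup = 4 * (fwd_latency + rev_latency)  # skip the start-up transient
    delivered = 0
    for t in range(warmup + cycles):
        arrived = data_wire.popleft()         # word (0 or 1) reaching the receiver
        credits += credit_wire.popleft()      # credit (0 or 1) reaching the sender
        if t >= warmup:
            delivered += arrived
        credit_wire.append(arrived)           # receiver frees the slot immediately
        send = 1 if credits > 0 else 0        # sender transmits only with a credit
        credits -= send
        data_wire.append(send)
    return delivered / cycles
```

Under these assumptions, a link with forward latency 3 and reverse latency 2 sustains one word per cycle once buffering reaches 5, and only a fraction of that (buffering divided by the round-trip latency) when buffering is smaller.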
7 Target Pipeline Channel Parameters
- Only recommended for expert use in target models (should use FIFO channels and latency-insensitive protocols in target design)
[Figure: pipeline channel annotated with Forward Latency and Datawidth]
8 RAMP Description Language (RDL)
- User describes target model topology, channel parameters, and (manual) mapping to host platform FPGAs using RDL
- RDL Compiler (RDLC) generates configurations
[Figure: Target: Units A, B, and C. Host: the units placed on FPGA1 and FPGA2 inside RDLC-generated unit wrappers; generated links carry channels]
[ Greg Gibeling, UCB ]
9 Virtual Target Clock
10 Virtualized RTL Improves FPGA Resource Usage
- RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
- Example 1: Multiported register file
  - For example, Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
  - If RTL is mapped directly, requires 48K flip-flops
    - Slow cycle time, large area
  - If mapped into block RAMs (one read + one write per cycle), takes 3 host cycles and 3x2KB block RAMs
    - Faster cycle time (~3X) and far fewer resources
- Example 2: Large L2/L3 caches
  - Current FPGAs only have ~1MB of on-chip SRAM
  - Use on-chip SRAM to build a cache of the active piece of the L2/L3 cache; stall the target cycle if an access misses and fetch the data from off-chip DRAM
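The numbers in Example 1 follow from simple arithmetic, sketched here. The function name `regfile_mapping` is hypothetical; the 2KB block RAM size and the one-read-plus-one-write-per-host-cycle port model are taken from the slide.

```python
import math


def regfile_mapping(storage_bytes, read_ports, write_ports, bram_bytes=2 * 1024):
    """Compare a direct flip-flop mapping with a time-multiplexed block RAM mapping.

    Returns (flip_flops, host_cycles_per_target_cycle, block_rams).
    """
    # Direct mapping: one flip-flop per stored bit.
    flip_flops = storage_bytes * 8
    # Block RAM mapping: each host cycle serves one read and one write,
    # so the larger port count sets the host cycles per target cycle.
    host_cycles = max(read_ports, write_ports)
    # Enough block RAMs to hold the whole register file.
    block_rams = math.ceil(storage_bytes / bram_bytes)
    return flip_flops, host_cycles, block_rams


# Niagara figures from the slide: 6KB of storage, 3 read ports, 2 write ports.
flip_flops, host_cycles, block_rams = regfile_mapping(6 * 1024, 3, 2)
```

With the slide's figures this gives 49,152 flip-flops (48K) for the direct mapping, versus 3 host cycles and 3 block RAMs for the virtualized one.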
11 Start/Done Timing Interface
- Wrapper generated by RDL asserts “Start” on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle
- Unit asserts “Done” when it finishes the target cycle and its outputs are ready
- Unit can take a variable amount of time
- Unvirtualized RTL unit can connect “Done” to “Start” (but must not clock until “Start”)
[Figure: Unit inside a Wrapper, with inputs In1 and In2, output Out, and Start/Done signals]
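A rough software analogy of the Start/Done handshake follows. It is a sketch under stated assumptions, not the generated wrapper: `AdderUnit` and `run_wrapper` are hypothetical names, the unit is assumed to need a fixed number (at least one) of host cycles per target cycle, and the loop simply stops when either input runs out, where real hardware would wait.

```python
from collections import deque


class AdderUnit:
    """Hypothetical unit that takes several host cycles per target cycle."""

    def __init__(self, host_cycles_per_target_cycle):
        self.cost = host_cycles_per_target_cycle
        self.remaining = 0
        self.result = None

    def start(self, a, b):
        """Wrapper asserts Start: latch inputs for the next target cycle."""
        self.remaining = self.cost
        self.result = a + b

    def host_tick(self):
        """Advance one host cycle; return True (Done) when the target cycle ends."""
        self.remaining -= 1
        return self.remaining == 0


def run_wrapper(unit, in1, in2):
    """Fire the unit once per target cycle, whenever both inputs hold a value."""
    in1, in2 = deque(in1), deque(in2)
    out, host_cycles, busy = [], 0, False
    while busy or (in1 and in2):
        if not busy:
            unit.start(in1.popleft(), in2.popleft())  # Start
            busy = True
        host_cycles += 1
        if unit.host_tick():                          # Done
            out.append(unit.result)
            busy = False
    return out, host_cycles
```

The point of the interface is that the number of host cycles between Start and Done is invisible to the rest of the target model: the output sequence depends only on target cycles.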
12 Distributed Timing Models
13 Distributed Timing Example
- Pipeline target channel implemented as distributed FIFO with at least L buffers
[Figure: Target: Unit A connected to Unit B by a pipeline channel of latency L. Host: Units A and B, each with Start/Done signals, connected through a FIFO with RDY, ENQ, and DEQ signals]
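One way to see why a host FIFO can stand in for a fixed-latency target channel is the sketch below. It assumes the FIFO starts out holding L empty tokens (a common technique, but an assumption here, not something the slide states), that each unit advances one target cycle only when its side of the FIFO allows it, and that L is at least 1. The function `simulate_channel` and the host-period parameters are hypothetical.

```python
from collections import deque


def simulate_channel(latency, a_period, b_period, target_cycles):
    """Return what Unit B receives in each of its target cycles.

    Unit A tries to advance every `a_period` host cycles and Unit B every
    `b_period` host cycles, so the two units drift apart in host time.
    """
    if latency < 1:
        raise ValueError("this sketch assumes latency >= 1")
    capacity = latency                    # "at least L buffers"; exactly L here
    channel = deque([None] * latency)     # assumption: L initial empty tokens
    a_cycle = b_cycle = host = 0
    received = []
    while b_cycle < target_cycles:
        # Unit B completes a target cycle only if a token is available.
        if host % b_period == 0 and channel:
            received.append(channel.popleft())
            b_cycle += 1
        # Unit A completes a target cycle only if there is buffer space.
        if host % a_period == 0 and len(channel) < capacity and a_cycle < target_cycles:
            channel.append(a_cycle)       # A sends its own target cycle number
            a_cycle += 1
        host += 1
    return received
```

Whatever the host periods, Unit B sees in target cycle t the value Unit A sent in target cycle t - L, which is the behavior of a pipeline channel of latency L; only the host time needed to get there changes.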
14 Timing Target FIFO Channel
[Figure: Target: data travels forward and credits travel backward over pipeline channels of latency L, with credit control logic and RDY/ENQ and RDY/DEQ interfaces]
- Can build a timed credit-based flow control (CBFC) FIFO inside the target model, using pipeline channels to communicate data forwards and credits backwards
- But this puts two CBFCs in series (one in the target unit, one hidden in the host implementation of pipeline channels)
- RDL can generate a unified FIFO that merges both of these behind the FIFO interface
15 Other Automatically Generated Networks
- Control network has workstation as master and every unit as slave device
  - Memory-mapped interface with block transfers
  - Used for initialization, stats gathering, debugging, and monitoring
- Units can connect to DRAM resources outside of timed target channels
  - Used to support emulation and virtualization state
- Units can communicate with each other outside of timed target channels
  - Supports arbitrary communication, e.g., for distributed stats gathering
16 Wide Variety of RAMP Simulators
17 Simulator Design Choices
- Structural Analog versus Highly Virtualized
- Functional-only versus Functional+Timing
- Timing via (virtual) RTL design versus separate functional and timing models
- Hybrid software/hardware simulators
- We’re trying to build layers of abstractions that are useful to all types of simulator
- Also trying to make modules in different styles inter-operate
18 Effective Abstractions Hide Details
19 …But Provide Inter-Operability
20 Work in Progress: Stay Tuned