
RAMP Models and Platforms. Krste Asanovic, UC Berkeley. RAMP Retreat, Berkeley, CA, January 15, 2009.


1 RAMP Models and Platforms
Krste Asanovic, UC Berkeley
RAMP Retreat, Berkeley, CA, January 15, 2009

2 Much confusion about RAMP
Frequently asked questions:
- When will RAMP be finished/usable?
- What ISA does RAMP use?
- Can RAMP model my new feature “X”?
- How accurate is RAMP?
- Why so many different RAMP projects?
- Why is there not more sharing among projects?

3 Not much confusion about software simulators
Rarely asked questions:
- When will software simulation be finished/usable?
- What ISA do software simulators use?
- Can a software simulator model my new feature “X”?
- How accurate is software simulation?
- Why so many software simulators?
- Why is there not more sharing among software simulators?

4 RAMP is a consortium, not a project
- Many projects with different goals
  - sometimes multiple per site
- So far, much sharing of ideas and techniques
  - Very healthy and active community
- Some sharing of low-level infrastructure
  - Boards + platform-level interfaces to DRAM, Ethernet, etc.
- Not a single complete infrastructure that everyone uses
  - and that’s been OK, and might continue to be OK

5 Run Model of Target on Host Platform
[Figure labels: Host Platform, CPU, Interconnect Network, DRAM, Target Machine, Hard Work]

6 RAMP Projects’ Goals
Model some target machine, trading off:
- Fidelity
- Model design effort
- Emulation speed (and capacity)

7 Space of Target Machines
- Which ISA?
  - x86, SPARC, PowerPC, Alpha, ARM, MIPS?
- In-order or out-of-order cores?
- How many cores?
  - 1, 16, 256, 1M?
- Processor+memory of general-purpose machine, or whole SoC including I/O devices?
- Accelerators, GPUs?
- Which operating system? Hypervisor?

8 ISA Wars
- Original pick to standardize around was SPARC
  - Open standard
  - Available verification suite
  - Simplest ISA with extensive general-purpose software support (i.e., desktop/server development environment available); SGI/MIPS sorely missed…
  - Leon implementation for FPGA
  - Simics
- But the intent was always to support multiple ISAs

9 ISA usage in RAMP models
- UCB RAMP Blue: MicroBlaze++
  - Xilinx soft core modified to add 64-bit FPU
- Stanford RAMP Red: PowerPC
  - Used Virtex-II Pro hard cores
- UT FAST: x86
  - Functional simulation in software on front-end machine (or on PowerPC hard core)
- UT RAMP White: PowerPC -> SPARC
  - Initial version used hard PowerPC cores, moving to Leon soft cores
- MIT/Intel HASIM: Alpha -> x86?
  - Initially Alpha ISA, eventually to form basis of x86/uOP machine
- CMU ProtoFLEX: SPARC
  - “SPARC three ways” (own core + emulation on hard PowerPC core + emulation on front-end machine)
- UCB RAMP Gold & Internet-in-a-Box: SPARC
  - Own core design
- UCB/LBNL Green Flash: Tensilica
  - RTL generated from Tensilica tools

10 Supporting new ISAs
- x86 still very desirable, but difficult
  - FAST software functional model is probably current best approach if want to play with different timings
  - Microcoded functional model would be good way to go if had resources (HASIM?)
  - Even with working functional model, timing model is difficult? Adding new features difficult?
- ARM also desirable for mobile device modeling
  - Renewed interest in engaging here
- MIT/IBM PowerPC work in progress, could form functional model
- But nobody does this for fun - only to advance their own research goals…

11 Commercial/Existing RTL Cores
- Originally seen as big benefit of RAMP
- But didn’t turn out that way in practice (except for prototyping usage model - see later)
- Cores don’t provide features we need, too big, too difficult to modify
- For simple ISAs (i.e., non-x86), biggest help is ISA verification suites, and/or *really* simple synthesizable ISA pipeline to form basis of functional model

12 Operating System Support
- Currently only ProtoFLEX, FAST, RAMP-White support OS
  - Others can run one application with proxy mechanism for I/O
- Reflects interests of groups. OS is not primary subject of research for groups building models so far.
  - RAMP Gold to add support for ParLab OS work (Tessellation)
  - Green Flash to add support for HPC-style microkernel

13 Target systems
- From a few, to millions of cores
  - Scaling simulation to 100s of cores was a shared goal
  - But smaller core counts (16-128) very interesting also
  - Huge core counts (>1E6) also of interest
- Single node versus clusters
  - RAMP Blue & Internet-in-a-box are message-passing clusters
  - Rest are shared-memory systems
- Memory hierarchy and cache coherence protocols
  - Wide variety of possibilities
- Desktop/Laptop/Server versus Handheld or SoC
  - What is important to model for given research topic?
- Accelerators/GPUs
  - Even wider variety than CPU ISAs/microarchitectures

14 Wide variety, how to reuse?
Proposal:
- ISA functional models
  - also FPU across ISAs
  - Perhaps even common uOP engine across all ISAs?
- CPU microarchitecture timing model
  - E.g., in-order superscalar, out-of-order with unified physical register file
- Memory functional model
  - Host-level caches + memory interleaving
- Memory hierarchy timing models
  - On-chip network types as subset
- I/O bus shims
  - To allow random RTL to be attached for I/O devices and non-GPU accelerators
This won’t be easy, as have to agree on interfaces between these components, might need further specialization
Definitely need more experience doing all of the above

15 Simulator Types
- Functional model only (no timing)
- RTL models (functional includes timing)
  - Also used for chip prototyping
- Split functional and timing models
- + Hybrids of above

16 Simulator Mapping Styles
- Gate-level emulator (Quickturn, Palladium): ~1MHz
- Direct RTL emulator: 5-20MHz
- FPGA-tuned RTL emulator: 20-50MHz
- Virtualized RTL emulator: 50-100MHz
- Host-multithreaded models: >100MHz


18 RAMP Blue Release 2/25/2008
- Design available from RAMP website: ramp.eecs.berkeley.edu

19 Climate System Design Concept: Strawman Design Study
- System: 10PF sustained, ~120 m², <3MWatts, <$75M
- 100 racks @ ~25KW (power + comms)
- 32 boards per rack
- 32 chip + memory clusters per board (2.7 TFLOPS @ 700W)
- 32 processors per 65nm chip: 83 GFLOPS @ 7W
- 8 DRAM per processor chip: ~50 GB/s
- VLIW CPU:
  - 128b load-store + 2 DP MUL/ADD + integer op/DMA per cycle
  - Synthesizable at 650MHz in commodity 65nm
  - 1mm² core, 1.8-2.8mm² with inst cache, data cache data RAM, DMA interface, 0.25mW/MHz
  - Double precision SIMD FP: 4 ops/cycle (2.7GFLOPs)
  - Vectorizing compiler, cycle-accurate simulator, debugger GUI (existing part of Tensilica Tool Set)
  - 8 channel DMA for streaming from on/off chip DRAM
  - Nearest neighbor 2D communications grid
[Figure labels: Proc Array; RAM; per-processor blocks of CPU, 64-128K D (2x128b), 32K I, 8 chan DMA; Opt. 8MB embedded DRAM; External DRAM interface; Master Processor; Comm Link Control]
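
As a rough cross-check, the figures on this slide can be rolled up from core to system. The sketch below uses only numbers quoted on the slide; the variable names are illustrative, and the result is a peak-rate estimate that assumes every core issues 4 double-precision ops per cycle at 650MHz.

```python
# Figures copied from the slide; names are illustrative, not from the deck.
FLOPS_PER_CYCLE = 4      # double-precision SIMD ops per cycle per core
CLOCK_HZ = 650e6         # synthesizable clock in commodity 65nm
CORES_PER_CHIP = 32
CHIPS_PER_BOARD = 32
BOARDS_PER_RACK = 32
RACKS = 100
RACK_POWER_W = 25e3      # per rack, power + comms

# Peak-rate roll-up from one core to the whole machine.
core_gflops = FLOPS_PER_CYCLE * CLOCK_HZ / 1e9
chip_gflops = core_gflops * CORES_PER_CHIP
board_tflops = chip_gflops * CHIPS_PER_BOARD / 1e3
system_pflops = board_tflops * BOARDS_PER_RACK * RACKS / 1e3
system_mw = RACKS * RACK_POWER_W / 1e6

print(f"core:   {core_gflops:.2f} GFLOPS")
print(f"chip:   {chip_gflops:.1f} GFLOPS")
print(f"board:  {board_tflops:.2f} TFLOPS")
print(f"system: {system_pflops:.2f} PFLOPS at {system_mw:.1f} MW")
```

Under these assumptions the per-chip and per-board results land close to the slide's 83 GFLOPS and 2.7 TFLOPS, and the rack power total stays under 3MW. The per-core and whole-system results come out somewhat below the quoted 2.7 GFLOPS and 10PF; the transcript does not break those two figures down further.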

20 Virtualized RTL Improves FPGA Resource Usage
RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
- Example 1: Multiported register file
  - Example: Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
  - If RTL mapped directly, requires 48K flip-flops
    - Slow cycle time, large area
  - If mapping into block RAMs (one read + one write per cycle), takes 3 host cycles and 3x2KB block RAMs
    - Faster cycle time (~3X) and far fewer resources
- Example 2: Large L2/L3 caches
  - Current FPGAs only have ~1MB of on-chip SRAM
  - Use on-chip SRAM to build cache of active piece of L2/L3 cache, stall target cycle if access misses and fetch data from off-chip DRAM
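
The register-file example can be sketched in a few lines. This is a minimal software model under stated assumptions, not the RAMP implementation: the class and function names are hypothetical, the memory is modeled as a single one-read-one-write store, and all reads are applied before writes so the target sees read-before-write behavior (real hardware would need explicit ordering or bypassing, which is not modeled).

```python
def flip_flops_needed(storage_bytes):
    """Direct RTL mapping: one flip-flop per stored bit."""
    return storage_bytes * 8


def host_cycles_per_target_cycle(reads, writes, bram_reads=1, bram_writes=1):
    """Host cycles needed when the memory offers a fixed number of
    read and write ports per host cycle (ceiling division on each)."""
    read_cycles = -(-reads // bram_reads)
    write_cycles = -(-writes // bram_writes)
    return max(read_cycles, write_cycles)


class VirtualizedRegFile:
    """Hypothetical model: a multiported target register file served by a
    1-read/1-write memory over several host cycles per target cycle."""

    def __init__(self, num_regs):
        self.mem = [0] * num_regs
        self.host_cycles = 0

    def target_cycle(self, read_addrs, writes):
        # Assumption: reads observe the state before this cycle's writes.
        values = [self.mem[a] for a in read_addrs]
        for addr, data in writes:
            self.mem[addr] = data
        self.host_cycles += host_cycles_per_target_cycle(
            len(read_addrs), len(writes))
        return values


# Slide figures: 6KB of storage, 3 read ports, 2 write ports.
print(flip_flops_needed(6 * 1024))          # bits if mapped to flip-flops
print(host_cycles_per_target_cycle(3, 2))   # host cycles per target cycle

rf = VirtualizedRegFile(num_regs=8)
rf.target_cycle([], [(1, 10), (2, 20)])
print(rf.target_cycle([1, 2, 3], [(3, 30)]), rf.host_cycles)
```

The flip-flop count works out to 49,152 bits, which is the slide's "48K", and the three read ports are what set the three host cycles per target cycle.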

21 Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
- Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs)
- Single hardware pipeline with multiple copies of CPU state
[Figure: Target Model (CPU 1, CPU 2, CPU 3, CPU 4) mapped onto a Multithreaded Host Emulation Engine (on FPGA); pipeline labels include per-CPU PCs, I$, IR, GPRs, D$]
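
The idea of one pipeline with several copies of CPU state can be illustrated with a toy software model. This is a sketch under assumptions: the `TargetCPU` class, the single `addi` instruction, and the strict round-robin schedule are all hypothetical choices for illustration, not details taken from the RAMP designs.

```python
class TargetCPU:
    """Per-target-CPU architectural state: the part that is replicated."""

    def __init__(self, cpu_id, program):
        self.cpu_id = cpu_id
        self.pc = 0
        self.regs = [0] * 4
        self.program = program  # list of (op, rd, imm) tuples


def run_host_multithreaded(cpus, host_cycles):
    """One shared 'pipeline' (this loop) serves a different target CPU on
    each host cycle, round-robin. Returns the order of CPUs served."""
    served = []
    for cycle in range(host_cycles):
        cpu = cpus[cycle % len(cpus)]          # select the thread's state
        if cpu.pc < len(cpu.program):
            op, rd, imm = cpu.program[cpu.pc]  # fetch
            if op == "addi":                   # execute
                cpu.regs[rd] += imm
            cpu.pc += 1
        served.append(cpu.cpu_id)
    return served


# Four target CPUs share one host engine.
cpus = [TargetCPU(i, [("addi", 0, 1), ("addi", 0, 1)]) for i in range(4)]
print(run_host_multithreaded(cpus, host_cycles=8))
print([c.regs[0] for c in cpus])
```

Because consecutive host cycles belong to different target CPUs, a long-latency operation for one CPU (for example, a cross-FPGA message) can overlap with useful work for the others.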

22 Split Functional/Timing Models (HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))
- Functional model executes CPU ISA correctly, no timing information
  - Only need to develop functional model once for each ISA
- Timing model captures pipeline timing details, does not need to execute code
  - Much easier to change timing model for architectural experimentation
  - Without RTL design, cannot be 100% certain that timing is accurate
- Many possible splits between timing and functional model
[Figure labels: Functional Model, Timing Model]
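
One possible split can be sketched as follows. This is an illustrative toy, assuming a hypothetical three-instruction ISA and a timing model that simply sums per-opcode latencies; HASIM and FAST partition the work differently and in far more detail.

```python
def functional_model(program):
    """Executes instructions correctly and records what ran.
    Knows nothing about cycles."""
    regs = {}
    trace = []
    for op, dst, a, b in program:
        if op == "li":        # load immediate: dst = a
            regs[dst] = a
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        trace.append(op)
    return regs, trace


def timing_model(trace, latencies):
    """Assigns cycles to an already-executed trace.
    Never computes any data values."""
    return sum(latencies[op] for op in trace)


program = [
    ("li", "r1", 2, None),
    ("li", "r2", 3, None),
    ("mul", "r3", "r1", "r2"),
    ("add", "r4", "r3", "r1"),
]
regs, trace = functional_model(program)

# Two candidate microarchitectures reuse the same functional run.
slow_mul = {"li": 1, "add": 1, "mul": 3}
fast_mul = {"li": 1, "add": 1, "mul": 1}
print(regs["r4"], timing_model(trace, slow_mul), timing_model(trace, fast_mul))
```

Changing the latency table changes the cycle count without touching the functional model, which is the property the slide highlights for architectural experimentation.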

23 RAMP White: Hari Angepat, Derek Chiou (UT Austin)
- Scalable Coherent Shared Memory Multiprocessor
- Support standard shared memory programming models
[Figure labels: Leon 3 (Mst, Slv, Dbg, Int); Leon3 shim; MP IntCntrl; DSU; Eth; DDR2; AHB bus; AHB shim; Intersection Unit; NIU; Router]

24 Multithreaded Func. & Timing Models (RAMP Gold: UCB)
- MT-Unit multiplexes multiple target units on a single host engine
- MT-Channel multiplexes multiple target channels over a single host link
[Figure labels: Functional Model Pipeline, Arch State, Timing Model Pipeline, Timing State, MT-Unit, MT-Channels]
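
Channel multiplexing of this kind can be sketched by tagging each message with its target channel before it crosses the shared link. The function names, the round-robin arbitration, and the tuple format below are assumptions made for illustration; the slide does not describe how RAMP Gold's MT-Channel is implemented.

```python
from collections import deque


def mux_channels(channels):
    """Interleave messages from several target channels onto one host link.
    channels maps a channel id to its ordered list of messages.
    Returns the link traffic as (channel_id, message) pairs, round-robin."""
    queues = {cid: deque(msgs) for cid, msgs in channels.items()}
    link = []
    while any(queues.values()):      # an empty deque is falsy
        for cid, queue in queues.items():
            if queue:
                link.append((cid, queue.popleft()))
    return link


def demux_link(link):
    """Rebuild per-channel message streams on the receiving side."""
    channels = {}
    for cid, msg in link:
        channels.setdefault(cid, []).append(msg)
    return channels


target_channels = {0: ["a0", "a1"], 1: ["b0"]}
link = mux_channels(target_channels)
print(link)
print(demux_link(link) == target_channels)
```

Per-channel ordering is preserved end to end, so each target channel behaves as if it had its own link even though the host provides only one.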

25 CMU Simics/RAMP Simulator
- 16-CPU Shared-memory UltraSPARC III Server (SunFire 3800)
- BEE2 Platform

26 What Hardware Platforms?
- RTL mapping approaches
  - Need large amounts of logic
  - Selected BEE2, and then designed BEE3 for this emulation style
  - Observed that don’t need much interconnect bandwidth (memory + inter-board links) because RTL cores are slow and latency sensitive
- Host-multithreading allows large systems to be mapped to small (one?) FPGA (e.g., 64-128 cores on ML505)
  - Logic gate count not as critical, need to focus on on-chip capacity, off-chip memory bandwidth and total memory capacity per FPGA (conventional processor memory hierarchy issues multiplied by multithreading factor)
  - One big FPGA with lots of fast memory channels would be ideal
- Software functional emulation (FAST) or transplant (ProtoFLEX)
  - Focus on fast coherent connection to front-end x86 CPU
  - HyperTransport, FSB, QPI interfaces better than PCI I/O connections

27 Summary
- Many reasons for great divergence in RAMP projects
  - Different ISAs, different target machines, different research topics, different emulation styles
- Sharing possible, but hard work and more experience needed
Questions?

