RAMP Retreat Summer 2006
Breakout Session Leaders & Questions
Greg Gibeling, Derek Chiou, James Hoe, John Wawrzynek & Christos Kozyrakis
6/21/2006

Breakout Topics
- RDL & Design Infrastructure
- RAMP White
- Caches, Network & IO (Uncore)
- RAMP2 Hardware (BEE3)
- OS, VM and Compiler Software Stack
RDL & Design Infrastructure
- Leader/Reporter: Greg Gibeling
- Topics
  - Features & Schedule Proposals
  - Multi-platform migration
  - Languages
    - Which languages, priorities
    - Assignments for support
  - Debugging – Models & Requirements
  - Retargeting to ASICs (Platform Optimization)
RDL & DI Notes (1)
- Languages
  - Hardware: Verilog, BlueSpec (IBM uses VHDL)
  - Software?
- Multi-Platform
  - Integration of hardware simulations
  - Control of multiplexing
    - Needed for efficiency!
    - Possible through channel & link parameters
- Features
  - Meta-types
  - Component (and unit) libraries
RDL & DI Notes (2)
- Debugging
  - Split target model
    - RDL target design exposed to a second level of RDL
    - Allows statistics aggregation
    - Modeling of noisy channels
  - Integration with unit internals
    - Event & state extraction
  - Connection to processor debugging tools
  - People clearly want this ASAP
RDL & DI Notes (3)
- Debugging (Integrated)
  - Message tracing
    - Causality diagrams
  - Framework to debug through units
  - Checkpoints
  - Injection
  - Single stepping
    - May not be widely used, but cheap to implement
  - Watch/breakpoints
RDL & DI Notes (4)
- Why Java?
  - Runs on various platforms
    - Recompilation is generally pretty painful
  - Decent type system in Java 1.5
  - Perfect for plugin infrastructure (e.g. OSGi)
- When to use RDL
  - Detailed timing model
  - Great at abstracting inter-chip comm
  - Perfect platform for partitioning designs
  - Concise, logical specification
  - Support for the debugging framework
  - With standard interfaces, good for sharing
RDL & DI Notes (5)
- Basic Infrastructure
  - First system bringup
  - Interfaces with workstations
  - Initial board support
  - Standard interfaces (RDL and otherwise)
  - Processor replacements
- Board Support
  - Currently a heroic effort
  - Solutions
    - Standardized components?
    - Generators?
RDL & DI Notes (6)
- Timelines
  - Greg's goals
    - 10/2006 should see RCF/RDLC3
    - 11/2006 should see documentation
  - Debugging (Integrated) should be ASAP
- Manpower
  - Board support
    - First board bring-up
  - RDL & RDLC users
  - Standard interfaces
  - Features & documentation
RAMP White
- Leader/Reporter: Derek Chiou
- Topics
  - Two-day breakout; first day should be pro/con
  - Overall preliminary plan
  - Evaluation: who is doing exactly what?
  - ISA for RAMP White
    - OpenSPARC, 32-bit Leon, PowerPC 405
    - Processor agnosticism
  - Implementation
    - Reimplementation will be required
    - Test suites from companies are very useful
RAMP White Notes (1)
- Use embedded PowerPC core first
  - Available, debugged, and can run a full OS today
  - FPGA chip space is already committed
- PowerPC and SPARC are both candidates
  - PowerPC pros: the embedded processor is PowerPC
  - SPARC pros: 64b available today
- Wait and see on soft cores (further notes for RAMP White from Derek go here)
RAMP White Notes (2)
- >= 256 processors
  - Can buy 64 processors today
- Reasonable speed: 10's of MHz
- With 280K LUTs in Virtex 5, assume 50% for processors but 80% utilization for ease of place-and-route
  - ~100K LUTs for processors
  - Need 4 per FPGA (16 per board, 16 boards)
  - 25K LUTs per processor
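The LUT budget above can be checked with a few lines of arithmetic (a sketch using only this slide's numbers; the 50% and 80% factors are the slide's own assumptions):

```python
# LUT budget sketch for RAMP White on Virtex-5, using the slide's numbers.
total_luts = 280_000          # LUTs per Virtex-5 FPGA (slide assumption)
processor_share = 0.50        # fraction of the chip devoted to processors
par_utilization = 0.80        # derate for ease of place-and-route

luts_for_processors = total_luts * processor_share * par_utilization
print(luts_for_processors)    # 112000.0, rounded down to ~100K on the slide

cores_per_fpga = 4
luts_per_core = 100_000 // cores_per_fpga
print(luts_per_core)          # 25000 LUTs per processor

fpgas_per_board = 4           # 16 cores/board at 4 cores/FPGA
boards = 16
total_cores = cores_per_fpga * fpgas_per_board * boards
print(total_cores)            # 256, meeting the >= 256 processor goal
```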
RAMP White Notes (3)
- Embedded PowerPC core (it's there and offers better performance than any soft core)
- Soft L1 data cache (no L2)
- Hard L1 instruction cache
  - Emulation????
- Ring coherence (a la IBM)
- Linux on top of embedded PowerPC core
  - NFS mount for disk access
- Mark's port of Peh's and Dally's router
- To do
  - Ring coherence + L1 data cache + memory interface
  - RDL for modules
  - Software port
  - Timing models for memory, ring, cache, processor?
  - Integration
RAMP White Notes (4)
- RAMP-White Greek
  - Beta: more general fabric using the same router; still use ring coherence
  - Gamma: James Hoe's coherence engine
  - Delta: soft-core integration
Caches, Networks & IO (Uncore)
- Leader/Reporter: James Hoe
- Topics
  - CPU, cache and memories
  - Hybrid FPGA cosimulation
  - Network
  - Storage
- Especially with respect to interfaces
  - Components, not sub-frameworks
  - Phased uncore abilities
Uncore Notes (1)
- A full system has more than just CPUs and memory
  - I/O is very important
- Getting RAMP to "work"
  - Just like the real thing (from SW's and the OS's perspective)
  - Software porting/development
  - Performance studies
- Someone has to build the "uncore"
- Co-simulation
- Direct HW support for paravirtualization / VMs
Uncore Notes (2)
- Why make RAMP White generic?
  - What is a more interesting target system?
  - What is a more relevant target system?
  - Building a system without an application in mind?
  - Would anyone care about RAMP-"vanilla"?
Uncore Notes (3)
- Why insist on directory-based cache coherence for 1000 nodes?
  - Today's large SMPs (at 100+ ways) are actually snoopy-based
  - Plug in 8-core CMPs and that is a 1000-node snoopy system (which industry may be more interested in)
Uncore Notes (4)
- Let's pin down a reference system architecture (including the uncore)
  - Minimum modules required?
  - Optional modules supported?
  - Fix standard interfaces between modules
  - RDL script for RAMP White??
- Need more than a block diagram for RAMP White
Uncore Notes (5)
- Requests and ideas for RDL
  - Compensate for skewed raw performance of components (for timing measurements)
    - Large I/O bandwidth relative to CPU throughput
    - Need knobs to dial in different rates for experiments
  - Some form of HW/SW co-simulation
  - Built-in performance monitoring
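One way to picture the "knobs" request: the emulated I/O link is much faster, relative to the CPU, than it would be in the target machine, so the timing model adds model-time delay per message to restore the intended ratio. A minimal sketch (the class and parameter names here are hypothetical, not part of RDL):

```python
# Hypothetical sketch of a dialable rate knob that derates a too-fast
# component for timing measurements.
class ChannelModel:
    def __init__(self, bytes_per_target_cycle):
        self.rate = bytes_per_target_cycle  # the dialable knob

    def delivery_delay(self, message_bytes):
        """Model-time cycles a message of this size should appear to take."""
        # Round up: even a tiny message occupies at least one cycle.
        return -(-message_bytes // self.rate)

# The same hardware link, dialed to two different experimental rates.
fast_link = ChannelModel(bytes_per_target_cycle=64)
slow_link = ChannelModel(bytes_per_target_cycle=4)
print(fast_link.delivery_delay(256))  # 4 model cycles
print(slow_link.delivery_delay(256))  # 64 model cycles
```

Because the knob lives in the timing model rather than the hardware, an experiment can sweep I/O rates without resynthesizing the design.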
Uncore Notes (6)
- Sanity check
  - 1000 processing nodes: no problem
  - I/O: we can fake it somehow
  - DRAM for 1000 processing nodes: not easy to cheat on this one
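The DRAM point can be quantified with the 1 GByte/processor target and the 8GB DDR2 DIMMs mentioned in the BEE3 notes (a back-of-the-envelope sketch, nothing more):

```python
# Back-of-the-envelope DRAM sizing for 1000 processing nodes.
nodes = 1000
gb_per_node = 1                 # 1 GByte/processor target from the BEE3 notes
total_gb = nodes * gb_per_node
print(total_gb)                 # 1000 GB, i.e. about 1 TB of DRAM

dimm_gb = 8                     # 8GB DDR2 DIMMs "on the horizon"
dimms_needed = -(-total_gb // dimm_gb)   # round up
print(dimms_needed)             # 125 DIMMs across the system
```

That is why DRAM is the one resource that cannot be faked: a terabyte of memory must physically exist somewhere on the boards.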
RAMP2 Hardware (BEE3)
- Leader/Reporter: Dan Burke & John Wawrzynek
- Topics
  - Follow-up to XUP
    - Should RAMP embrace XUP at the low end? Inexpensive small systems
  - Size & scaling of new platform
    - More than 40 FPGAs?
  - Technical questions
    - Reconsider use of SRAM
    - DRAM capacity
    - Presence of on-board hard CPUs
    - On-board interfaces (PCI Express)
  - Project questions
    - Timelines (definitely need one)
    - Packaging
    - Pricing (especially FPGAs)
      - Design for largest FPGA, change part at solder time?
    - Evaluation of Chen Chang's design
RAMP2 HW Notes (1)
- Follow-up to XUP
  - XUP has been useful to the project, particularly for early development efforts
  - Xilinx will continue to design and support new XUP boards
    - No V4 version planned; the V5 version will be out Q2 next year
- For BEE3, can't really count on V5 FX in 2Q next year
  - Perhaps use a separate (AMCC) PowerPC processor chip
RAMP2 HW Notes (2)
- Size and scaling of new platform
  - Given the potential processor core density issue, will need to plan on a system that can scale past 40 FPGAs
- Better compatibility with the new XUP is important
  - Ex: DRAM standard (better sharing of memory controllers)
  - USB: use the Cypress CY7300 for compatibility with the Xilinx core
- Our design and production of BEE3 is timed to the production of V5 parts
  - We need to better understand the RAMP team schedule for RAMP White
  - Hope to be able to choose the package and have flexibility in part sizes and ideally part feature set
- How about a daughterboard for the FPGA (DRC approach)?
RAMP2 HW Notes (3)
- Technical questions
  - Reconsider use of SRAM: the group thought SRAM is a bad idea
    - It is faster and simpler to interface to, but smaller; newer parts will make interfacing simpler
    - Faster is not a big concern for RAMP, but the smaller capacity is
  - 8GB DDR2 DIMM modules are on the horizon; a target will be 1 GByte/processor
  - Presence of on-board hard CPUs
    - Are hard cores in FPGAs useful (e.g. the PPC405 in V2Pro)?
    - Would commodity chips on the PCB be useful (e.g. for management)?
RAMP2 HW Notes (4)
- Enclosures
  - Using a standard form factor will help with module packaging
  - Need to look carefully at the IBM blade center (adopted by IBM and Intel)
  - ATCA is gaining momentum; power may be a problem
- Can we accommodate custom ASIC integration (perhaps through a slight generalization of the DRAM interface)?
- What does Google do for packaging in their data centers? Is it racks of 1U modules?
RAMP2 HW Notes (5)
- Interesting idea from Chuck Thacker: "Design new board based on need of RAMP White!"
  - Previously suggested by others
  - Can we estimate the logic capacity, memory BW, network BW, etc.?
OS, VM & Compiler
- Leader/Reporter: Christos Kozyrakis
- Topics
  - Debugging HW and SW (RDL)
  - Phased approach: proxy, full kernel, VMMs, hypervisor
  - HW/SW schedule and dependencies
  - High-level applications

Software Notes (1)
- RAMP milestones
  - Pick ISA
  - Deploy basic VMM
  - Deploy OS
Software Notes (2)
- VMM approach: use a split VMM system (a la VMware/Xen)
  - Run a full VMM on an x86 host that allows access to devices
  - Run a simple VMM on RAMP that communicates with the host for device accesses through some network
    - A timing model may be used if I/O performance is important
  - Should talk with Sun & IBM about their VMM systems for SPARC and PowerPC
  - May be able to port a very basic Xen system on our own
- Questions
  - Accurate I/O timing with para-virtualization (you also need repeatability)
  - SW/system-level/IO issues for a large-scale machine may be more important than coherence
  - Related issue: do we want global cache coherence in White? Benefit vs. complexity (schedule etc.)
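The split-VMM flow above amounts to a pair of cooperating processes: the RAMP-side VMM traps device accesses and forwards them over the network to the host-side VMM, which performs the real I/O. A minimal sketch of that round trip (all names and the message format are hypothetical illustrations, not an actual Xen or VMware interface):

```python
# Hypothetical sketch of split-VMM device forwarding (a la VMware/Xen).
import queue

to_host = queue.Queue()    # stands in for the RAMP <-> host network
to_guest = queue.Queue()

def host_vmm(request):
    """Host-side VMM: performs the real device access on behalf of RAMP."""
    backing_store = {0x10: b"boot block"}       # fake disk contents
    if request["op"] == "read":
        return backing_store.get(request["block"], b"")

def guest_device_read(block):
    """RAMP-side VMM: trap a device read and forward it to the host."""
    to_host.put({"op": "read", "block": block})
    to_guest.put(host_vmm(to_host.get()))       # host services the request
    return to_guest.get()

print(guest_device_read(0x10))   # b'boot block'
```

A timing model would sit at the `to_guest` boundary, deciding when the reply becomes visible to the guest rather than how it is produced.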
Software Notes (3)
- Separate infrastructure from RAMP
  - Example: RDL should not be tied to RAMP White
    - Note: this is in progress with some current RDL applications
  - Same with the BEE3 design work
  - Most of our tools are applicable to others
Software Notes (4)
- Debugging support: RDL-scope
  - Arbitrary conditions on RDL-level events to trigger debugging
  - Get traces of messages
    - Track lineage of messages
    - Traceability, accountability, relating events to program constructs
  - Infinite checkpoints for instructions & data
    - Checkpoint support
    - Swappable & observable designs
  - Single step at the instruction, RDL, or cycle level
    - Note: not always a commonly used feature
- Such features may attract people to RDL more than retiming
  - Note: this is already the case with current RDL applications
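The "arbitrary conditions on RDL-level events" idea amounts to registering predicates over an event stream; when one fires, the debugger can dump the trace or take a checkpoint. A minimal sketch (these class and field names are illustrative, not RDL's actual API):

```python
# Hypothetical sketch of predicate-triggered debugging over RDL-level events.
class EventWatcher:
    def __init__(self):
        self.triggers = []    # (predicate, action) pairs
        self.trace = []       # full message trace, for lineage queries

    def watch(self, predicate, action):
        self.triggers.append((predicate, action))

    def observe(self, event):
        self.trace.append(event)            # always record for tracing
        for predicate, action in self.triggers:
            if predicate(event):
                action(event)

fired = []
w = EventWatcher()
# Break when any message on the "ring" channel exceeds 64 bytes.
w.watch(lambda e: e["channel"] == "ring" and e["bytes"] > 64,
        lambda e: fired.append(e))
w.observe({"channel": "ring", "bytes": 32})
w.observe({"channel": "ring", "bytes": 128})
print(len(fired))    # 1 -- only the oversized message tripped the watchpoint
print(len(w.trace))  # 2 -- but both events landed in the trace
```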
Software Notes (5)
- What is our schedule?
  - What can we have up and running within 1 year?
  - Does it have to be RAMP White?
- Do we need to migrate RDL maintenance from Greg?
  - Note: the work should be spread out at least
- Do we have enough manpower for this SW work?
  - Compilers, VMMs, applications, etc.
Software Notes (6)
- Application domains
  - Enterprise/desktop
    - Full-featured OS on all nodes
    - Running a JVM is a big plus here
    - Should be able to run webservers, middleware and DBs
  - Embedded
    - While eventually an app may directly control a number of nodes, it is easier to start with all nodes running the OS
    - The base design should allow all nodes to run the OS; this is the easiest starting point for SW
    - Various researchers may decide to run the OS on a subset of nodes, managing the rest directly
    - A simple runtime with app-specific policies is common in embedded systems
Software Notes (7)
- A simple kernel for embedded systems should support
  - Fast remapping of computation
  - Protection across processes
- Emulation of attached disk
  - iSCSI + a timing model for disks
- RAMP VMM uses
  - (a) Attract VMM researchers (might require x86)
  - (b) Our own convenience: get an OS running, access to devices, etc.
  - We may achieve (b) without (a)
- Some researchers will want to turn cache coherence off anyway!
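The "iSCSI + a timing model" idea separates function from timing: the real data comes over the network, while a simple model decides how long each access should appear to take to the target machine. A minimal sketch of such a model (the latency and bandwidth parameters are illustrative assumptions, not measured values):

```python
# Hypothetical disk timing model: function (iSCSI) supplies the data,
# this model supplies the latency the target machine should observe.
def disk_latency_us(bytes_requested,
                    seek_us=8000,          # assumed avg seek + rotation
                    transfer_mb_per_s=60): # assumed sustained media rate
    transfer_us = bytes_requested / (transfer_mb_per_s * 1e6) * 1e6
    return seek_us + transfer_us

# A 4 KB read is dominated by the seek; a 4 MB read by the transfer.
print(round(disk_latency_us(4 * 1024)))         # 8068 us
print(round(disk_latency_us(4 * 1024 * 1024)))  # 77905 us
```

Because the model is decoupled from the iSCSI transport, experiments can swap in different disk parameters without touching the functional path.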