ASPLOS ’08 Ramp Tutorial
BEE3 Update
Chuck Thacker, John Davis
Microsoft Research
2 March 2008
Outline
  BEE3 Overview
  BEE3 Status
  BEE3 Gateware
  Moving forward
BEE3 System
BEE3 Package
BEE3 Tidbits
  Design uses essentially every pin on the chip.
  Design was done to be “PC-like” to leverage PC economies:
    – PWB is about half the area of BEE2.
    – PWB is 18 layers rather than 22 for BEE2.
    – Uses PC power and peripherals.
  System is divided into a main board plus a separate (and separately designed) Control Board.
    – Allowed designs to proceed in parallel at Celestica and BWRC, and reduced the risk of having to respin the (expensive) main board.
    – Control board has JTAG, and Flash for bitstreams and boot flash for each FPGA. The system can operate without it.
  The use of professionals for PCB and mechanical design was an enormous win.
    – Celestica’s design was 100% correct, and five systems worked with only one problem (which was easily corrected).
    – Took (probably) half the time to produce something much more manufacturable and robust (and therefore cheaper).
BEE3 Subsystems
BEE3 Control Board
Project Participants and Roles
  Microsoft Research (Silicon Valley)
    – Funds, manages system engineering, does some gateware
  Celestica (Ottawa and Shanghai)
    – Did main board engineering, prototype fabrication
    – Microsoft has a very deep relationship with Celestica
  BEECube
    – Builds and delivers functioning systems
  Function Engineering (Palo Alto)
    – Did thermal and mechanical engineering
  Xilinx (San Jose)
    – Provides FPGAs for academic machines
    – Provides FPGA application expertise
  Ramp Group (BWRC)
    – Control board, basic software
  Ramp Community
    – Uses the systems for research
    – Expanding to industrial users (e.g., us)
BEE3 Status
  All subsystems work!
  A board spin is required to correct MGT placement.
    – With the current placement, the 10 Gbit channels require long routing. This was due to a lack of information from Xilinx; it was not Celestica’s error.
    – Respin is in progress. ETA for final board is 1 May.
BEE3 Gateware
  Today, it consists primarily of test and characterization routines.
    – Much of this was ported from BEE2, although some is new:
        DDR2 Controller
        Control RISC
  MS designs use a minimal subset of the Xilinx tool suite:
    – Just ISE, ChipScope, and (soon) Data2MEM.
    – May need EDK, but not yet.
DDR2 Controller
  Largest piece of new gateware.
    – 5 modules, ~2000 lines of Verilog.
  Supports two 4 GB DIMMs per channel, two channels per FPGA.
  Transfers are DDR 400 (5 ns clock) with -2 speed-grade parts.
  Supports only x4 registered DIMMs.
    – Unbuffered DIMMs can’t work because of address/control loading.
  Handles all initialization, refresh, and calibration (semi-)automatically.
    – Keeps track of up to 16 open banks per controller.
  Calibration is fast (768 clocks).
    – So it can be done at frequent intervals or in response to single errors.
  Primary user commands are Read and Write:
    – Both deal with 36-byte blocks.
  Simple FIFO interfaces.
  Each channel is about 3% of the LX110T LUTs (no BRAMs).
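The timing and capacity figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative and do not come from the gateware.

```python
# Figures quoted on the DDR2 Controller slide.
CLOCK_NS = 5               # DDR 400 runs from a 5 ns clock
CALIBRATION_CLOCKS = 768   # length of one calibration pass
DIMMS_PER_CHANNEL = 2
DIMM_GB = 4
CHANNELS_PER_FPGA = 2

# A 5 ns period is a 200 MHz clock; double data rate moves data on both
# clock edges, which gives the 400 million transfers per second in "DDR 400".
clock_mhz = 1000 / CLOCK_NS
transfers_per_us = 2 * clock_mhz

# Calibration cost in wall-clock time: 768 clocks * 5 ns = 3840 ns.
calibration_ns = CALIBRATION_CLOCKS * CLOCK_NS

# DRAM reachable from one FPGA: 2 channels * 2 DIMMs * 4 GB.
capacity_per_fpga_gb = CHANNELS_PER_FPGA * DIMMS_PER_CHANNEL * DIMM_GB

print(f"clock: {clock_mhz:.0f} MHz, {transfers_per_us:.0f} transfers/us")
print(f"calibration: {calibration_ns / 1000:.2f} us")
print(f"DRAM per FPGA: {capacity_per_fpga_gb} GB")
```

A calibration pass of under 4 microseconds is what makes it practical to recalibrate at frequent intervals or after a single error.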
DDR Controller Organization
  Centralized main controller
    – Main control FSM
    – Address FIFO (64 entries of 30-bit command/address)
    – Open bank CAMs
    – Clock generation, timing limit enforcement
  Six replicated I/O pin bank logic blocks:
    – Read and Write FIFOs for 24 data bits (three 4-bit lanes, with one RAM chip per lane on each DIMM)
    – Calibration state machine, so that all six banks can calibrate in parallel
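One way to reconcile the numbers on this slide with the 36-byte block size is sketched below. The 72-pin total follows directly from six banks of three 4-bit lanes. The reading that each 24-bit FIFO word carries both clock edges of a bank's 12 pins is an assumption made here, not something the slides state, and the variable names are illustrative.

```python
# Figures quoted on the slides.
IO_BANKS = 6
LANES_PER_BANK = 3
BITS_PER_LANE = 4          # x4 DRAM chips, one chip per lane
FIFO_BITS_PER_BANK = 24
BLOCK_BYTES = 36           # Read and Write block size

# Data pins driven by one bank and by the whole controller.
pins_per_bank = LANES_PER_BANK * BITS_PER_LANE       # 12
dimm_data_pins = IO_BANKS * pins_per_bank            # 72

# Assumption: a 24-bit FIFO word holds the rising-edge and falling-edge
# data for the bank's 12 pins, so all six FIFOs together are 144 bits wide.
edges_per_fifo_word = FIFO_BITS_PER_BANK // pins_per_bank   # 2
fifo_word_bits = IO_BANKS * FIFO_BITS_PER_BANK              # 144

# A 36-byte block is 288 bits.
block_bits = BLOCK_BYTES * 8
transfers_per_block = block_bits // dimm_data_pins    # 4 transfers on the DIMM pins
fifo_words_per_block = block_bits // fifo_word_bits   # 2 words through the FIFOs

print(dimm_data_pins, fifo_word_bits, transfers_per_block, fifo_words_per_block)
```

Under this reading, one block is four transfers on a 72-bit DIMM interface. Seventy-two bits is the width of a standard ECC DIMM (64 data bits plus 8 check bits), which would make a 36-byte block 32 data bytes plus 4 check bytes, although the slides do not say so explicitly.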
DDR Controller (simplified)
Control RISC (TC4)
  36 bits (memories are 36n bits wide)
  Harvard architecture
    – 1K instruction memory (1 BRAM)
    – 1K data memory (1 BRAM)
    – 256-register, 3-port register file (2 BRAMs)
  Very small (~100 slices) “Tiny Computer”
    – All instructions execute in three 5 ns phases. No pipelining.
  Assembler, no C compiler. Sigh…
  So far: DRAM initialization, DRAM calibration, and a control shell with UART interface.
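The TC4 figures imply a fixed instruction time and memory sizes that line up with the FPGA's block RAMs. The sketch below works through that arithmetic. The 36 Kbit block RAM size is a property of the Virtex-5 family (the LX110T is a Virtex-5 part) rather than something stated on the slide, and the variable names are illustrative.

```python
# Figures quoted on the Control RISC slide.
WORD_BITS = 36
PHASE_NS = 5
PHASES_PER_INSTRUCTION = 3
IMEM_WORDS = 1024
DMEM_WORDS = 1024
REGISTERS = 256

# Not from the slide: Virtex-5 block RAMs hold 36 Kbit each.
BRAM_BITS = 36 * 1024

# Every instruction takes three phases and nothing overlaps (no pipelining),
# so the instruction time is fixed at 15 ns.
instruction_ns = PHASES_PER_INSTRUCTION * PHASE_NS
instructions_per_us = 1000 / instruction_ns   # roughly 66.7

# A 1K x 36-bit memory fills exactly one 36 Kbit block RAM, which matches
# the "1 BRAM" noted for the instruction and data memories.
imem_bits = IMEM_WORDS * WORD_BITS
dmem_bits = DMEM_WORDS * WORD_BITS

# The register file needs far fewer bits than one block RAM; the second
# BRAM is presumably there to provide the extra read port, not capacity.
regfile_bits = REGISTERS * WORD_BITS

print(f"{instruction_ns} ns per instruction, about {instructions_per_us:.1f} per us")
print(f"instruction memory: {imem_bits} bits, block RAM: {BRAM_BITS} bits")
print(f"register file: {regfile_bits} bits")
```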
TC4
Next Steps
  Use Data2MEM to speed up the TC4 edit, assemble, load cycle:
    – Currently takes 30 minutes, since we regenerate cores and rebuild the entire design.
    – Should be a couple of minutes.
  Add a DDR2 test system (LFSRs) to do full-speed testing with random addresses and data. Should be rock solid.
  Use Xilinx PlanAhead to lock the design so that it can be used as a component in larger designs.
  Develop an on-chip interconnect to allow multiple DDR2 requesters without needing huge cross-chip busses.
  Use BEE3 in our own research programs.
    – A couple have already started.
    – This is the fun part. Building it was just work.
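The LFSR-based test mentioned above can be illustrated in software. The sketch below is a behavioral model only, not the planned gateware: it assumes a 16-bit Fibonacci LFSR with the widely used taps 16, 14, 13, 11, uses a Python dict as a stand-in for DRAM, and all function names and seeds are hypothetical.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_stream(seed: int):
    """Yield successive LFSR states; the sequence repeats after 65535 steps."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("an all-zero seed would lock the LFSR at zero")
    while True:
        state = lfsr16_step(state)
        yield state


def write_pass(memory, n_ops, addr_seed, data_seed):
    """Write pseudo-random data words to pseudo-random addresses."""
    addrs, data = lfsr16_stream(addr_seed), lfsr16_stream(data_seed)
    for _ in range(n_ops):
        memory[next(addrs)] = next(data)


def verify_pass(memory, n_ops, addr_seed, data_seed) -> int:
    """Regenerate the same sequences and count mismatching locations."""
    addrs, data = lfsr16_stream(addr_seed), lfsr16_stream(data_seed)
    return sum(memory.get(next(addrs)) != next(data) for _ in range(n_ops))


if __name__ == "__main__":
    dram = {}  # stand-in for the memory under test
    write_pass(dram, 1000, addr_seed=0xACE1, data_seed=0x1234)
    print("errors:", verify_pass(dram, 1000, addr_seed=0xACE1, data_seed=0x1234))
```

Because the verify pass regenerates addresses and data from the same seeds, no expected-data buffer is needed, which is what makes this style of test cheap to run at full speed in hardware. Keeping `n_ops` below the 65535-step period ensures no address is written twice within a pass.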
Questions?