RAMP Blue: A Message-Passing Many-Core System in FPGAs
ASPLOS Tutorial/Workshop, March 2nd, 2008
Alex Krasnov, Andrew Schultz, John Wawrzynek, Greg Gibeling, and Pierre-Yves Droz
© 2006 Regents of the University of California. All Rights Reserved
RAMP Tutorial and Workshop, ASPLOS 2008

Purpose of RAMP Blue
– Generation of interest in the community
– Demonstration of feasibility and competence
– Verification of the hardware platform
– Implementation of first research experiments
– Approximately 2 person-years to meet these goals
  Original TCL release: Andrew Schultz, Pierre-Yves Droz, Alex Krasnov
  New RDL release: Jue Sun, Greg Gibeling, Alex Krasnov
– Approximately 2 days to build the latest instance
Version Highlights
V1: 256 cores total
– 8 BEE2 modules, 4 user FPGAs per module, 8 cores per FPGA
– 100 MHz Xilinx MicroBlaze soft cores running uClinux
– Dec 06: 256 cores running a subset of the UPC NAS Parallel Benchmarks
V2: 768 cores total
– 16 BEE2 modules, 12 cores per FPGA
– Cores running at 90 MHz
V3: 1008 cores total
– 21 BEE2 modules, 12 cores per FPGA
– Summer 2007
V4: 256 cores total
– Written in RDL
– Growing parameterization support
– Public release
Future versions
– Use the newer BEE3 FPGA platform
– Support for other processor cores
History of RAMP Blue
Photos by Alex Krasnov, Dan Burke, Derek Chiou
MicroBlaze v4
– 3-stage RISC pipeline designed for FPGAs
  Accounts for FPGA features and shortcomings: fast carry chains, lack of CAMs in caches
  Short pipeline minimizes multiplexers in the bypass logic
– Max clock rate of 100 MHz (~0.5 MIPS/MHz) on Virtex-II Pro
– Split I and D caches, direct mapped, with configurable size (we use a 2 KB I$ and an 8 KB D$)
– Optional single-precision floating-point unit
– Up to 8 independent Fast Simplex Links (FSLs) with ISA support
– Configurable hardware debugging support (watchpoints/breakpoints) via the MDM (Microprocessor Debug Module)
– GCC tool-chain support and the ability to run uClinux
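The direct-mapped caches above can be illustrated with a short sketch. The cache sizes follow the slide (2 KB I$, 8 KB D$); the 16-byte line size is an assumption for illustration, not something the slide specifies:

```python
# Direct-mapped cache index/tag split, as in the MicroBlaze caches above.
# The 16-byte line size is an illustrative assumption.
def cache_slot(addr, cache_bytes, line_bytes=16):
    """Return (tag, index, offset) for a direct-mapped cache."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % (cache_bytes // line_bytes)
    tag = addr // cache_bytes
    return tag, index, offset

# Two addresses exactly one cache size apart map to the same line,
# so in a direct-mapped cache they conflict:
t0, i0, _ = cache_slot(0x1000, 2 * 1024)
t1, i1, _ = cache_slot(0x1000 + 2 * 1024, 2 * 1024)
assert i0 == i1 and t0 != t1
```

Direct mapping keeps the lookup to a single comparison, which is one reason it suits a small soft core better than a CAM-based associative cache.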
Node Architecture
Board Architecture
Memory System
– DIMMs are shared; memory is not
– More MicroBlaze cores than DIMMs
– No coherence: DIMMs are partitioned
– Bank management isolates cores
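The bank-management idea above can be sketched as a static partition of a shared DIMM among its cores. The DIMM size, cores-per-DIMM count, and function names here are illustrative assumptions; the slide does not give the actual layout:

```python
# Static partitioning of one shared DIMM: each core owns an equal,
# contiguous slice, and bank management rejects out-of-window accesses.
DIMM_BYTES = 1 << 30     # assume a 1 GiB DIMM (illustrative)
CORES_PER_DIMM = 3       # e.g. 12 cores sharing 4 DIMMs (illustrative)

def core_window(core_id):
    """Physical address window [base, limit) owned by one core."""
    slice_bytes = DIMM_BYTES // CORES_PER_DIMM
    base = core_id * slice_bytes
    return base, base + slice_bytes

def check_access(core_id, addr):
    """Bank management: allow only accesses inside the core's window."""
    base, limit = core_window(core_id)
    return base <= addr < limit

assert check_access(0, 0)
assert not check_access(0, core_window(1)[0])  # core 0 cannot touch core 1's slice
```

With no sharing of addresses, no coherence protocol is needed, which matches the slide's "no coherence, DIMMs are partitioned" point.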
Double-Precision FPU
– Shared FPU, necessary due to size constraints
– Sharing scheme similar to "reservation stations"
– Implemented with Xilinx CoreGen library FP components
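A minimal sketch of FPU sharing in the spirit of the "reservation stations" above, assuming a simple first-come-first-served arbitration policy; the actual policy, class names, and operation set are not specified on the slide:

```python
from collections import deque

class SharedFPU:
    """One FPU shared by many cores via FIFO reservation stations (sketch)."""
    def __init__(self, n_cores):
        self.stations = deque()   # pending (core_id, op, a, b) entries
        self.n_cores = n_cores

    def issue(self, core_id, op, a, b):
        """A core parks its operands in a station and waits for the result."""
        self.stations.append((core_id, op, a, b))

    def step(self):
        """Retire one operation per cycle, in arrival order."""
        if not self.stations:
            return None
        core_id, op, a, b = self.stations.popleft()
        result = a + b if op == "add" else a * b
        return core_id, result

fpu = SharedFPU(n_cores=12)
fpu.issue(3, "add", 1.5, 2.5)
fpu.issue(7, "mul", 2.0, 4.0)
assert fpu.step() == (3, 4.0)
assert fpu.step() == (7, 8.0)
```

The point of the structure is that a large double-precision unit is amortized across all cores on the FPGA instead of being replicated per core.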
Network Characteristics
Interface
– Simple FSL channels
– Currently programmed I/O, but could/should be DMA
– Encapsulated Ethernet packets with source routes prepended
Source routing between nodes
– Dimension-order-style routing; non-adaptive and intolerant of permanent link failure
– FPGA-to-FPGA and board-to-board link failures are rare (~1 error per 1M packets)
– CRC checks detect corruption; software (GASNet/UDP) uses acks/time-outs and retransmits for reliable transport
Flow control
– Cut-through flow control with virtual channels for deadlock avoidance
Topology of the interconnect
– Full crossbar on chip
– Ring on module
– Intermodule: all-to-all (v1-v2), 3-D mesh (v3 on)
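The source routes prepended to each encapsulated Ethernet packet can be sketched as follows. The one-byte-per-hop encoding is an assumption for illustration; the slides do not give the actual header format:

```python
def add_source_route(frame: bytes, hops: list) -> bytes:
    """Prepend a hop count plus one output-port byte per hop."""
    header = bytes([len(hops)]) + bytes(hops)
    return header + frame

def switch_forward(packet: bytes):
    """Each switch consumes the next hop byte and forwards the remainder."""
    n = packet[0]
    assert n > 0, "route exhausted: packet has arrived"
    out_port = packet[1]
    remainder = bytes([n - 1]) + packet[2:]
    return out_port, remainder

# The sender computes the whole path; switches make no routing decisions.
pkt = add_source_route(b"ethernet-frame", hops=[2, 5, 1])
port, pkt = switch_forward(pkt)
assert port == 2
port, pkt = switch_forward(pkt)
assert port == 5
```

Because the sender fixes the whole path up front, the switches stay simple, but as the slide notes, the scheme cannot route around a permanently failed link.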
Network Implementation
– The switch (8 bits wide) provides connections between all on-chip cores, 2 FPGA-to-FPGA links, and 4 module-to-module links
Physical Network Topology
Intermodule
– 10 Gb/s serial (MGT) links permit a wide variety of topologies
– High latency (tens of cycles)
– All-to-all topology, used in RAMP Blue v1-v2
  Each FPGA is at most 4 on-module links + 1 serial link away from any other
  + Minimizes dependence on the serial links
  - Scales only to 17 modules total
– 3-D mesh, used in RAMP Blue v3 on
InterFPGA
– High-speed parallel I/O, organized as a ring on each BEE2 module
– Low latency (2-3 cycles)
Intercore
– All-to-all within an FPGA (only 12 cores)
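The 17-module limit of the all-to-all topology follows from link counting. A quick check, assuming 4 user FPGAs per BEE2 module and 4 module-to-module serial links per FPGA (consistent with the 4 module-to-module links per switch described earlier):

```python
# All-to-all scaling: in a fully connected machine, each of a module's
# serial links goes to one distinct other module, so the machine can hold
# at most (links per module) + 1 modules.
FPGAS_PER_MODULE = 4
SERIAL_LINKS_PER_FPGA = 4   # assumption matching the switch description

links_per_module = FPGAS_PER_MODULE * SERIAL_LINKS_PER_FPGA
max_modules = links_per_module + 1
assert max_modules == 17    # matches the slide's 17-module limit
```

This is exactly why v3 moved to a 3-D mesh: at 21 modules the machine exceeds what all-to-all wiring can reach.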
16-Module All-to-All
(diagram: modules 0-15, each with user FPGAs 1-4)
Early "Ideal" Floorplan
– 16 Xilinx MicroBlazes per XC2VP70 FPGA
– Each group of 4 clustered around a memory interface
– FP integrated with each core
Scalability and Performance
1-1008 MicroBlaze cores on 1-84 Virtex-II Pro chips
– 1-12 MicroBlaze cores per chip
– 1-4 Virtex-II Pro chips per board
– 1-21 BEE2 boards per machine tested; the upper limit depends on the network topology
2-3 MIPS per core, 2-3 GIPS per machine with 672-1008 cores (Whetstone)
– A consequence of the software FP interface; can be much better
– The PIO network interface further reduces the performance of actual applications; can be much better
Area and Power
Control FPGA, 8 cores:  33,027 (49%) LUTs, 18,690 (28%) FFs, 21,057 (63%) slices
User FPGA, 8 cores:     44,753 (67%) LUTs, 28,048 (42%) FFs, 28,580 (86%) slices
Control FPGA, 12 cores: 43,566 (65%) LUTs, 23,480 (35%) FFs, 25,650 (77%) slices
User FPGA, 12 cores:    61,891 (93%) LUTs, 37,198 (56%) FFs, 32,991 (99%) slices
Power: 25-38 A at 5 V ↔ 125-188 W per board, 2.6-3.9 kW per machine with 21 boards
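The machine-level power figures follow directly from the per-board numbers above; a quick arithmetic check:

```python
# Machine power from the slide's per-board range: 125-188 W per board,
# 21 boards per machine.
BOARD_W = (125, 188)
BOARDS = 21

machine_kw = tuple(round(w * BOARDS / 1000, 1) for w in BOARD_W)
assert machine_kw == (2.6, 3.9)   # matches the 2.6-3.9 kW on the slide
```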
Floor Planning Strategies
Want just enough control over the placement of blocks
– Too little planning results in placement instability
– Too much planning results in suboptimal performance and area
Hybrid approach with various constraints
– LOC constraints for I/O and timing-critical blocks: DIMMs, XAUIs
– RLOC constraints for structural blocks: MicroBlaze cores
– AREA_GROUP constraints for top-level blocks: FPU, network switch
Routing constraints would be nice
– Would eliminate timing instability in critical blocks; DDR2 still has some uncertainty
– 2 exact routes currently used in the DDR2 infrastructure
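For illustration, the three constraint types above look roughly like this in a Xilinx UCF file. The instance names and site ranges here are hypothetical, not taken from the RAMP Blue design:

```
# LOC: pin a timing-critical I/O block to a fixed site (hypothetical names)
INST "ddr2_infrastructure/dqs_iob" LOC = "D17";
# RLOC: fix the relative placement of a structural element within a core
INST "microblaze_0/alu_slice" RLOC = "X0Y0";
# AREA_GROUP: confine a top-level block to a region of the chip
INST "network_switch" AREA_GROUP = "AG_switch";
AREA_GROUP "AG_switch" RANGE = SLICE_X0Y0:SLICE_X40Y60;
```

The hybrid strategy amounts to choosing the weakest constraint that still makes placement reproducible: absolute sites only where timing demands it, relative or regional constraints everywhere else.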
Floor Planning Results
(floorplan images: control FPGA and user FPGA, 8-core and 12-core variants)
Build Statistics
From our EDK XST cache (core uniqueness based on instance name; does not include ISE-based RDL builds):

                                      New   Cached    Total
Cores synthesized                   13233   107037   120270
MicroBlaze cores synthesized          871     4064     4935
Single MicroBlaze cores synthesized   141      559      700

At least 700 synthesis cycles performed!
Software
Development:
– Early: Xilinx FPGA tools (EDK, ISE)
– Final: RDL (RAMP Description Language)
  Allows parameterization
  First step in making RAMP Blue into an emulator
System:
– Each node boots its own copy of uClinux
– Each node mounts an NFS file system
– Unified Parallel C (UPC): a shared-memory abstraction over a message-passing framework
– FPU code generated by a custom GCC SoftFPU backend
Applications
Runs UPC (Unified Parallel C) versions of a subset of the NAS Parallel Benchmarks (all class S to date):
– CG (Conjugate Gradient), IS (Integer Sort): 512 cores
– EP (Embarrassingly Parallel), MG (Multi-Grid): 512 and 1008 cores
– FT (FFT): <64 cores
Software Debugging
Instrumentation framework
– iptables rules sample and duplicate packets
– IP-in-IP tunnels forward packets up the control network
– Works in conjunction with a modified EtherApe network analyzer
On the x86 server:
  iptables -t mangle -A PREROUTING -i tunl0 -j ROUTE --oif dummy0 --gw 10.0.0.1
  iptables -t mangle -A INPUT -p ! ipencap -i eth1 -m random --average 5 -j ROUTE --oif dummy0 --gw 10.0.0.1 --tee
  iptables -t mangle -A OUTPUT -p ! ipencap -o eth1 -m random --average 5 -j ROUTE --oif dummy0 --gw 10.0.0.1 --tee
Software Screenshot
RDL & Emulation
The RAMP Description Language (RDL)
– Hierarchical structural netlisting language
– Describes message-passing distributed event simulations
– System level: contains no behavioral specification
Tradeoffs
– Costs: use of the RAMP target model; area, time, and power to implement this model
– Benefits: abstraction of the locality and timing of communications; system debugging and power tools; determinism, sharing, and research
– Goal: trade costs for benefits as needed
Implementation Issues
A large hardware/software system with many bugs:
– A reliable low-level physical SDRAM controller has been a major challenge
– A few MicroBlaze bugs in both the gateware and the GCC tool-chain (race conditions, OS bugs, GCC backend bugs)
– FPU: compilation problems and bad results
– RAMP Blue pushed the use of BEE2 modules to new levels; previously the most aggressive users were in radio astronomy
  Memory errors exposed memory-controller calibration-loop errors (tracked down to PAR problems)
  DIMM socket mechanical-integrity problems
– Long "recompile" times hindered debugging: FPGA place and route takes 3-30 hours
Debugging Summary
Hardware (electrical/mechanical): multimeter, oscilloscope; test suite; multi-board trials to determine statistical significance
Gateware (general): ChipScope, hardware events/harnesses; implement error tolerance: CRC, ECC, drop, replicate
Gateware (processor): ChipScope, software events/harnesses; trace buffers triggering on simple events like stalls
Software (kernel): printk if possible, binary disassembly; trap and use trace buffers; limited tool support
Software (user): printf, simple test programs; network analysis tools (tcpdump, EtherApe, low-level tools); instrumentation framework
Future Work / Opportunities
The processor/network interface is currently very inefficient
– DMA support should replace the programmed I/O approach
Many of the features for a RAMP (emulator) are currently missing
– Time dilation (e.g., change the relative speed of network, processor, and memory)
– Extensive hardware-supported monitoring
– Virtual memory, other CPU/ISA models
– Other network topologies
– The RDL implementation is a start
Collaboration
– Good starting point for processor + HW-accelerator architectures
– At least one other group at Berkeley is already using it
– Released version available now at: http://ramp.eecs.berkeley.edu
Release Contents
Full build and demonstration framework:
– TCL-based control FPGA design
– RDL-based user FPGA design
– PowerPC GCC compiler
– MicroBlaze GCC compiler
– x86 kernel
– PowerPC kernel and userland
– MicroBlaze kernel and userland
– UPC runtime
– EtherApe network analyzer
Many dependencies everywhere, so it is released as a complete VM image
Hardware Requirements
– 1 BEE2 board
– Management server:
  1 x86 CPU
  1-2 GB RAM
  10-20 GB HDD
  2 NICs
  1 RS232 port (serial cable)
  1 LPT port (Parallel Cable IV)
  1 USB port (CF card reader)
Software Requirements
ISE 9.2.04 and EDK 9.2.02
– Backward compatibility not provided; forward compatibility possible
We support only Debian GNU/Linux
– Do not ask about Windows or Solaris
– Do not ask about other distributions: we do not know
– Xilinx supports only Red Hat, but the kernel and ABI are almost the same, and it works everywhere in our experience
Features and Support
Many parameters are currently fixed
– Number of MicroBlaze cores
– Number of DIMMs and XAUIs
– Not a limitation of RDL; specific to the current setup
– Likely to be improved in the future
– Ask Greg Gibeling about RDL specifics
Basic build and usage documentation
– Detailed support by e-mail
Documentation provided for a single-board system
– But it is really a multi-board system with configurable topology
– Further support on an individual basis
Conclusions
A first step towards:
– Developing a robust RAMP infrastructure for more complicated parallel systems
  Required debugging/insight capabilities
  Driver and source for general RAMP infrastructure
– Reusable gateware: much of the RAMP Blue gateware is directly applicable to future systems
– Fixing bugs and reliability issues
  Exposed and corrected bugs in the BEE2 platform and gateware
  Helps in the design of future RAMP hardware and gateware
RAMP Blue represents the largest soft-core, FPGA-based computing system ever built!
Questions…