© 2006 Regents University of California. All Rights Reserved RAMP Blue: A Message-Passing Many-Core System in FPGAs ASPLOS Tutorial/Workshop March 2nd,

Slides:



Advertisements
Similar presentations
Network II.5 simulator ..
Advertisements

RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic Parallel Computing Lab,
Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.
Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA A Parameterizable.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
1 RAMP Implementation J. Wawrzynek. 2 RDL supports multiple platforms:  XUP, pure software, BEE2 BEE2 will be the standard RAMP platform for the next.
© ABB Group Jun-15 Evaluation of Real-Time Operating Systems for Xilinx MicroBlaze CPU Anders Rönnholm.
MPI in uClinux on Microblaze Neelima Balakrishnan Khang Tran 05/01/2006.
1 Performed By: Khaskin Luba Einhorn Raziel Einhorn Raziel Instructor: Rivkin Ina Spring 2004 Spring 2004 Virtex II-Pro Dynamical Test Application Part.
© 2006 Regents University of California. All Rights Reserved RAMP Blue: A Message- Passing Many-Core System in FPGAs ISCA Tutorial/Workshop June 10th,
BEEKeeper Remote Management and Debugging of Large FPGA Clusters Terry Filiba Navtej Sadhal.
UC Berkeley 1 Time dilation in RAMP Zhangxi Tan and David Patterson Computer Science Division UC Berkeley.
Configurable System-on-Chip: Xilinx EDK
1 Breakout thoughts (compiled with N. Carter): Where will RAMP be in 3-5 Years (What is RAMP, where is it going?) Is it still RAMP if it is mapping onto.
1 Dr. Frederica Darema Senior Science and Technology Advisor NSF Future Parallel Computing Systems – what to remember from the past RAMP Workshop FCRC.
Modern trends in computer architecture and semiconductor scaling are leading towards the design of chips with more and more processor cores. Highly concurrent.
© 2006 Regents University of California. All Rights Reserved RAMP Blue: A Message Passing Multi-Processor System on the BEE2 Andrew Schultz and Alex Krasnov.
© 2006 Regents University of California. All Rights Reserved RAMP Blue Status Andrew Schultz, John Wawrzynek June 21, 2006 RAMP MIT Summer Workshop.
PARALLEL PROCESSING The NAS Parallel Benchmarks Daniel Gross Chen Haiout.
ASPLOS ’08 Ramp Tutorial BEE3 Update Chuck Thacker John Davis Microsoft Research 2 March 2008.
1 Fast Communication for Multi – Core SOPC Technion – Israel Institute of Technology Department of Electrical Engineering High Speed Digital Systems Lab.
Murali Vijayaraghavan MIT Computer Science and Artificial Intelligence Laboratory RAMP Retreat, UC Berkeley, January 11, 2007 A Shared.
Orion: A Power-Performance Simulator for Interconnection Networks Presented by: Ilya Tabakh RC Reading Group4/19/2006.
1 RAMP Infrastructure Krste Asanovic UC Berkeley RAMP Tutorial, ISCA/FCRC, San Diego June 10, 2007.
Switch EECS 252 – Spring 2006 RAMP Blue Project Jue Sun and Gary Voronel Electrical Engineering and Computer Sciences University of California, Berkeley.
Implementation of DSP Algorithm on SoC. Mid-Semester Presentation Student : Einat Tevel Supervisor : Isaschar Walter Accompaning engineer : Emilia Burlak.
הטכניון - מכון טכנולוגי לישראל הפקולטה להנדסת חשמל Technion - Israel institute of technology department of Electrical Engineering Virtex II-PRO Dynamical.
CSE430/830 Course Project Tutorial Instructor: Dr. Hong Jiang TA: Dongyuan Zhan Project Duration: 01/26/11 – 04/29/11.
® ChipScope ILA TM Xilinx and Agilent Technologies.
Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.
RSC Williams MAPLD 2005/BOF-S1 A Linux-based Software Environment for the Reconfigurable Scalable Computing Project John A. Williams 1
1 Chapter 2. The System-on-a-Chip Design Process Canonical SoC Design System design flow The Specification Problem System design.
Ross Brennan On the Introduction of Reconfigurable Hardware into Computer Architecture Education Ross Brennan
ISE. Tatjana Petrovic 249/982/22 ISE software tools ISE is Xilinx software design tools that concentrate on delivering you the most productivity available.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
Cluster Computers. Introduction Cluster computing –Standard PCs or workstations connected by a fast network –Good price/performance ratio –Exploit existing.
© 2003 Xilinx, Inc. All Rights Reserved For Academic Use Only Xilinx Design Flow FPGA Design Flow Workshop.
Research on Reconfigurable Computing Using Impulse C Carmen Li Shen Mentor: Dr. Russell Duren February 1, 2008.
High-Level Interconnect Architectures for FPGAs Nick Barrow-Williams.
J. Christiansen, CERN - EP/MIC
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
TEMPLATE DESIGN © Hardware Design, Synthesis, and Verification of a Multicore Communication API Ben Meakin, Ganesh Gopalakrishnan.
Buffer-On-Board Memory System 1 Name: Aurangozeb ISCA 2012.
© 2004 Mercury Computer Systems, Inc. FPGAs & Software Components Graham Bardouleau & Jim Kulp Mercury Computer Systems, Inc. High Performance Embedded.
NIOS II Ethernet Communication Final Presentation
Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.
The Macro Design Process The Issues 1. Overview of IP Design 2. Key Features 3. Planning and Specification 4. Macro Design and Verification 5. Soft Macro.
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
Lecture 10: Logic Emulation October 8, 2013 ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation.
1 EDK 7.1 Tutorial -- SystemACE and EthernetMAC on Avnet Virtex II pro Development Boards Chia-Tien Dan Lo Department of Computer Science University of.
An Architecture and Prototype Implementation for TCP/IP Hardware Support Mirko Benz Dresden University of Technology, Germany TERENA 2001.
MODUS Project FP7- SME – , Eclipse Conference Toulouse, May 6 th 2013 Page 1 MODUS Project FP Methodology and Supporting Toolset Advancing.
Introductory project. Development systems Design Entry –Foundation ISE –Third party tools Mentor Graphics: FPGA Advantage Celoxica: DK Design Suite Design.
Lecture 12: Reconfigurable Systems II October 20, 2004 ECE 697F Reconfigurable Computing Lecture 12 Reconfigurable Systems II: Exploring Programmable Systems.
A record and replay mechanism using programmable network interface cards Laurent Lefèvre INRIA / LIP (UMR CNRS, INRIA, ENS, UCB)
This material exempt per Department of Commerce license exception TSU Xilinx On-Chip Debug.
An Investigation of Xen and PTLsim for Exploring Latency Constraints of Co-Processing Units Grant Jenks UCLA.
Simics: A Full System Simulation Platform Synopsis by Jen Miller 19 March 2004.
Somervill RSC 1 125/MAPLD'05 Reconfigurable Processing Module (RPM) Kevin Somervill 1 Dr. Robert Hodson 1
Content Project Goals. Workflow Background. System configuration. Working environment. System simulation. System synthesis. Benchmark. Multicore.
DDRIII BASED GENERAL PURPOSE FIFO ON VIRTEX-6 FPGA ML605 BOARD PART B PRESENTATION STUDENTS: OLEG KORENEV EUGENE REZNIK SUPERVISOR: ROLF HILGENDORF 1 Semester:
Embedded Systems Design with Qsys and Altera Monitor Program
Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ. 2 Module 1 Module 2 Module 3 Module 4 FPGAs are big! Design big systems High on-chip communication.
Cluster Computers. Introduction Cluster computing –Standard PCs or workstations connected by a fast network –Good price/performance ratio –Exploit existing.
3/12/07CS Visit Days1 A Sea Change in Processor Design Uniprocessor SpecInt Performance: From Hennessy and Patterson, Computer Architecture: A Quantitative.
Andrew Putnam University of Washington RAMP Retreat January 17, 2008
Derek Chiou The University of Texas at Austin
Combining Simulators and FPGAs “An Out-of-Body Experience”
NetFPGA - an open network development platform
Cluster Computers.
Presentation transcript:

© 2006 Regents University of California. All Rights Reserved RAMP Blue: A Message-Passing Many-Core System in FPGAs ASPLOS Tutorial/Workshop March 2nd, 2008 Alex Krasnov, Andrew Schultz, John Wawrzynek, Greg Gibeling, and Pierre-Yves Droz

RAMP Blue 3 RAMP Tutorial and Workshop, ASPLOS 2008 Purpose of RAMP Blue Generation of interest in the community Demonstration of feasibility and competence Verification of hardware platform Implementation of first research experiments Approximately 2 man-years to meet these goals – Original TCL release: Andrew Schultz, Pierre-Yves Droz Alex Krasnov – New RDL release: Jue Sun, Greg Gibeling Alex Krasnov Approximately 2 days to build latest instance

RAMP Blue 4 RAMP Tutorial and Workshop, ASPLOS 2008 Version Highlights V1: 256 cores total – 8 BEE2 modules – 4 user FPGAs * 8 cores per FPGA – 100MHz Xilinx MicroBlaze soft cores running uClinux. – Dec 06: 256 cores running benchmark suite of UPC NAS Parallel Benchmarks V2: 768 cores total – 16 BEE2 modules, 12 cores per FPGA – Cores running at 90 MHz V3: 1008 cores total – 21 BEE2 modules, 12 cores per FPGA – Summer 2007 V4: 256 cores total – Written in RDL – Growing parameterization support – Public release Future versions – Use newer BEE3 FPGA platform Support for other processor cores.

RAMP Blue 5 RAMP Tutorial and Workshop, ASPLOS 2008 History of RAMP Blue Photos by Alex Krasnov, Dan Burke, Derek Chiou

RAMP Blue 6 RAMP Tutorial and Workshop, ASPLOS 2008 MicroBlaze v4 3-stage, RISC designed for FPGAs – Accounts for FPGA features & shortcomings fast carry chains lack of CAMs in cache – Short pipeline minimizes multiplexors in bypass logic Max clock rate of 100 MHz (~0.5 MIPS/MHz) on Virtex-II Pro Split I and D cache with configurable size, direct mapped – We use 2KB $I, 8KB $D) Optional single precision floating point unit Up to 8 independent fast simplex links (FSLs) with ISA support Configurable hardware debugging support (watch/breakpoints) – MDM (Microprocessor Debug Module) GCC tool chain support and ability to run uClinux

RAMP Blue 7 RAMP Tutorial and Workshop, ASPLOS 2008 Node Architecture

RAMP Blue 8 RAMP Tutorial and Workshop, ASPLOS 2008 Board Architecture

RAMP Blue 9 RAMP Tutorial and Workshop, ASPLOS 2008 Memory System DIMMs are shared, Memory is not – More MicroBlaze cores than DIMMs – No coherence, DIMMs are partitioned – Bank management isolates cores

RAMP Blue 10 RAMP Tutorial and Workshop, ASPLOS 2008 Double Precision FPU Shared FPU – Necessary due to size constraints – Similar to “reservation stations” Implemented with Xilinx CoreGen library FP components

RAMP Blue 11 RAMP Tutorial and Workshop, ASPLOS 2008 Network Characteristics Interface – S imple FSL channels – Currently programmed I/O, but could/should be DMA – Encapsulated Ethernet packets with source routes prepended Source routing between nodes – dimension-order style routing, non-adaptive  permanent link failure intolerant – FPGA-FPGA and board-to-board link failure rare (1 error / 1M packets) – CRC checks for corruption, software (GASNet/UDP) uses acks/time-outs and retransmits for reliable transport Cut through flow control with virtual channels for deadlock avoidance Topology of interconnect – Full cross-bar on chip – Ring on module – Intermodule: all-to-all (v1-2), 3D mesh (v3-on)

RAMP Blue 12 RAMP Tutorial and Workshop, ASPLOS 2008 Network Implementation Switch (8b wide) provides connections between all on-chip cores, 2 fpga-fpga links, & 4 module- module links

RAMP Blue 13 RAMP Tutorial and Workshop, ASPLOS 2008 Physical Network Topology InterModule – 10Gb/s serial links permit a wide variety of topologies – High latency (10’s of cycles) – All-To-All Topology Used in RAMP Blue v1-v2 Each FPGA is at most 4 on-module links + 1 serial link connection away from any other + Minimizes dependence on use of serial links: - Scales only to 17 modules total – 3-D mesh Used in RAMP Blue v3-on InterFPGA – High speed parallel I/O – Organized as a ring – Low latency (2-3 cycles) InterCore – All-to-all within an FPGA – Only 12 cores BEE2 Module MGT Link

RAMP Blue 14 RAMP Tutorial and Workshop, ASPLOS Module All-to-All module 15 module 14 module 0 User FPGA 4User FPGA 3User FPGA 2User FPGA 1

RAMP Blue 15 RAMP Tutorial and Workshop, ASPLOS 2008 Early “Ideal” Floorplan 16 Xilinx MicroBlazes per XC2VP70 FPGA Each group of 4 clustered around a memory interface FP Integrated with each core

RAMP Blue 16 RAMP Tutorial and Workshop, ASPLOS 2008 Scalability and Performance MicroBlaze cores on 1-84 Virtex-II Pro chips – 1-12 MicroBlaze cores per chip – 1-4 Virtex-II Pro chips per board – 1-21 BEE2 boards per machine tested, upper limit depends on network topology 2-3 MIPS per core, 2-3 GIPS per machine with cores (Whetstone) – Consequence of software FP interface, can be much better – PIO network interface further reduces performance of actual applications, can be much better

RAMP Blue 17 RAMP Tutorial and Workshop, ASPLOS 2008 Area and Power Control FPGA, 8 cores: – 33,027 (49%) LUTs, 18,690 (28%) FFs, 21,057 (63%) slices User FPGA, 8 cores: – 44,753 (67%) LUTs, 28,048 (42%) FFs, 28,580 (86%) slices Control FPGA, 12 cores: – 43,566 (65%) LUTs, 23,480 (35%) FFs, 25,650 (77%) slices User FPGA, 12 cores: – 61,891 (93%) LUTs, 37,198 (56%) FFs, 32,991 (99%) slices A at 5 V ↔ W per board, kW per machine with 21 boards

RAMP Blue 18 RAMP Tutorial and Workshop, ASPLOS 2008 Floor Planning Strategies Want just enough control over placement of blocks – Too little planning results in placement instability – Too much planning results in suboptimal performance and area Hybrid approach with various constraints – LOC constraints for I/O and timing-critical blocks: DIMMs, XAUIs – RLOC constraints for structural blocks: MicroBlaze cores – AREA_GROUP constraints for top-level blocks: FPU, network switch Routing constraints would be nice – Eliminate timing instability in critical blocks: DDR2 still has some uncertainty – 2 exact routes currently used in DDR2 infrastructure

RAMP Blue 19 RAMP Tutorial and Workshop, ASPLOS 2008 Floor Planning Results Control FPGAUser FPGA 8 cores 12 cores

RAMP Blue 20 RAMP Tutorial and Workshop, ASPLOS 2008 Build Statistics From our EDK XST cache: Core uniqueness based on instance name Does not include ISE-based RDL builds At least 700 synthesis cycles performed! NewCachedTotal Cores synthesized MicroBlaze cores synthesized Single MicroBlaze cores synthesized

RAMP Blue 21 RAMP Tutorial and Workshop, ASPLOS 2008 Software Development: – Early: Xilinx FPGA tools (EDK, ISE) – Final: RDL (RAMP Description Language) Allows parameterization First step in making RAMP Blue into a emulator System: – Each node boots its own copy of uClinux – Each node mounts an NFS file system – Unified Parallel C (UPC) Shared memory abstraction over messages framework – FPU code generated by custom GCC SoftFPU backend

RAMP Blue 22 RAMP Tutorial and Workshop, ASPLOS 2008 Applications Application: – Runs UPC (Unified Parallel C) version of a subset of NAS (NASA Advanced Scientific) Parallel Benchmarks (all class S, to date) CG Conjugate Gradient, IS Integer Sort 512 cores EP Embarrassingly Parallel, MG Multi-Grid 512, 1008 cores FT FFT <64 cores

RAMP Blue 23 RAMP Tutorial and Workshop, ASPLOS 2008 Software Debugging Instrumentation framework – iptables rules sample and duplicate packets – IP-in-IP tunnels forward packets up the control network – Works in conjunctions with modified EtherApe network analyzer On x86 server: – iptables -t mangle -A PREROUTING -i tunl0 -j ROUTE --oif dummy0 --gw – iptables -t mangle -A INPUT -p ! ipencap -i eth1 -m random --average 5 -j ROUTE --oif dummy0 --gw tee – iptables -t mangle -A OUTPUT -p ! ipencap -o eth1 -m random --average 5 - j ROUTE --oif dummy0 --gw tee

RAMP Blue 24 RAMP Tutorial and Workshop, ASPLOS 2008 Software Screenshot

RAMP Blue 25 RAMP Tutorial and Workshop, ASPLOS 2008 RDL & Emulation The “RAMP Description Language” (RDL) – Hierarchical structural netlisting language – Describes message passing distributed event simulations – System level: contains no behavioral spec. Tradeoffs – Costs Use of the RAMP target model Area, time and power to implement this model – Benefits Abstraction of locality & timing of communications System debugging & power tools Determinism, sharing and research – Goal: trade costs for benefits as needed

RAMP Blue 26 RAMP Tutorial and Workshop, ASPLOS 2008 Implementation Issues Large Hardware/Software System with many bugs: – Reliable low-level physical SDRAM controller has been a major challenge – A few MicroBlaze bugs in both gateware and GCC tool-chain (race conditions, OS bugs, GCC backend bugs) – FPU: Compilation Problems & Bad Results – RAMP Blue pushed the use of BEE2 modules to new levels - previously most aggressive users were for Radio Astronomy memory errors exposed memory controller calibration loop errors (tracked down to PAR problems) DIMM socket mechanical integrity problems Long “recompile” times hindered debugging – FPGA place and route takes 3-30 hours

RAMP Blue 27 RAMP Tutorial and Workshop, ASPLOS 2008 Debugging Summary Hardware: electrical/ mechanical Multimeter, oscilloscope Test suite Multi-board trials to determine statistical significance Gateware: general ChipScope, hardware events/harnesses Implement error tolerance: CRC, ECC, drop, replicate Gateware: processor ChipScope, software events/harnesses Trace buffers triggering on simple events like stalls Software: kernel printk if possible, binary disassembly Trap and use trace buffers Limited tool support Software: user printf, simple test programs Network analysis tools: tcpdump, EtherApe, low-level tools Instrumentation framework

RAMP Blue 28 RAMP Tutorial and Workshop, ASPLOS 2008 Future Work / Opportunities Processor/network interface currently very inefficient – DMA support should replace programmed I/O approach Many of the features for a RAMP (emulator) currently missing – Time dilation Ex: change relative speed of network, processor, memory) – Extensive HW supported monitoring – Virtual memory, other CPU/ISA models – Other network topologies – RDL implementation is a start Collaboration – Good starting point for processor+HW-accelerator architectures – At least one other group at Berkeley is already using it – Released version available now at:

RAMP Blue 29 RAMP Tutorial and Workshop, ASPLOS 2008 Release Contents Full build and demonstration framework: – TCL-based control FPGA design – RDL-based user FPGA design – PowerPC GCC compiler – MicroBlaze GCC compiler – x86 kernel – PowerPC kernel and userland – MicroBlaze kernel and userland – UPC runtime – EtherApe network analyzer Many dependencies everywhere Released as complete VM image

RAMP Blue 30 RAMP Tutorial and Workshop, ASPLOS 2008 Hardware Requirements 1 BEE2 board Management server: – 1 x86 CPU – 1-2 GB RAM – GB HDD – 2 NICs – 1 RS232 port (serial cable) – 1 LPT port (Parallel Cable IV) – 1 USB port (CF card reader)

RAMP Blue 31 RAMP Tutorial and Workshop, ASPLOS 2008 Software Requirements ISE and EDK – Backward compatibility not provided – Forward compatibility possible We support only Debian GNU/Linux! – Do not ask about Windows or Solaris – Do not ask about other distributions: we do not know – Xilinx supports only Red Hat… Kernel and ABI almost the same Works everywhere in our experience

RAMP Blue 32 RAMP Tutorial and Workshop, ASPLOS 2008 Features and Support Many parameters currently fixed – Number of MicroBlaze cores – Number of DIMMs and XAUIs – Not a limitation of RDL, specific to current setup – Likely to be improved in the future – Ask Greg Gibeling about RDL specifics Basic build and usage documentation – Detailed support by Documentation provided for single-board system – But really a multi-board system with configurable topology – Further support on individual basis

RAMP Blue 33 RAMP Tutorial and Workshop, ASPLOS 2008 Conclusions A First Step Towards – Developing a robust RAMP infrastructure for more complicated parallel systems Required debugging/insight capabilities Driver & source for general RAMP infrastructure – Reusable Gateware Much of the RAMP Blue gateware is directly applicable to future systems – Fixing bugs and reliability issues Exposed & corrected bugs in BEE2 platform and gateware Help in design of future RAMP hardware and gateware RAMP Blue represents the largest soft-core, FPGA based computing system ever built!

RAMP Blue 34 RAMP Tutorial and Workshop, ASPLOS 2008 Questions…