Chris Madill
Molecular Structure and Function, Hospital for Sick Children
Department of Biochemistry, University of Toronto
Supervised by Dr. Paul Chow
Electrical and Computer Engineering, University of Toronto
SHARCNET Symposium on GPU and CELL Computing 2008

Many scientific applications can be accelerated by targeting parallel machines
This work demonstrates a method for combining high performance computer clusters with FPGAs for maximum computational power
Coarse-grained parallelization allows applications to be distributed across hundreds or thousands of nodes
FPGAs can accelerate many computing tasks by 2 or 3 orders of magnitude over a CPU

[Figure: block diagrams of parallel machine organizations. Labels: Interconnection Network, MEM, CPU, ASP (GPU/FPGA), FPGA, GPU.]

FPGAs can speed up applications, however...
High barrier to entry for designing digital hardware
Developing monolithic FPGA designs is very daunting
How does one easily take advantage of FPGAs for accelerating HPC applications?

The Toronto Molecular Dynamics (TMD) machine is an investigation into high-performance computing based on a scalable network of FPGAs
Applications are defined as a simple collection of computing tasks
A task is roughly equivalent to a software process or thread
Major focus is facilitating the transition from cluster-based applications to the TMD machine

Step 1: Application Prototyping
  Software prototype of application developed
  Profiling identifies compute-intensive routines
Step 2: Application Refinement
  Partitioning into tasks communicating using MPI
  Communication patterns analyzed to determine network topology
Step 3: TMD Prototyping
  Tasks are ported to soft-processors on TMD
  On-chip communication network verified
Step 4: TMD Optimization
  Intensive tasks replaced with hardware engines
  Message Passing Engine (MPE) handles communication for hardware engines
  Hardware engines easily moved, replicated
[Figure: design flow from the application prototype, to processes A, B, and C communicating over MPI on a CPU cluster, to tasks A, C, and two copies of B on the FPGA network using TMD-MPI.]

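The profiling part of Step 1 can be illustrated with a minimal sketch. It assumes a Python software prototype, which the slides do not specify; `pairwise_energy`, `bookkeeping`, `prototype_step`, and `hottest_function` are hypothetical names, not part of the TMD code. The profiler ranks routines by their own run time, and the top entry is the candidate for a hardware engine in Step 4.

```python
import cProfile
import pstats


def pairwise_energy(n):
    """Hypothetical O(n^2) hotspot standing in for a compute-intensive routine."""
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1.0 / (1.0 + (i - j) ** 2)
    return total


def bookkeeping(n):
    """Hypothetical cheap O(n) routine."""
    return sum(range(n))


def prototype_step(n=400):
    """One step of the hypothetical software prototype."""
    return pairwise_energy(n), bookkeeping(n)


def hottest_function(func):
    """Profile func() and return the name of the routine with the largest own time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers);
    # index 2 is the time spent in the routine itself, excluding callees.
    (_, _, name), _ = max(stats.stats.items(), key=lambda kv: kv[1][2])
    return name


print(hottest_function(prototype_step))
```

Ranking by own time rather than cumulative time keeps a thin wrapper such as `prototype_step` from being reported as the hotspot.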
Use essential subset of MPI standard
Software library for tasks run on processors
Hardware Message Passing Engine (MPE) for hardware-based tasks
Tasks do not know (or care) whether remote tasks are run as software processes or hardware engines
MPI isolation of tasks facilitates C-to-gates compilers

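The point that tasks do not know how a remote task is implemented can be sketched as follows. This is an in-process model using threads and queues, not TMD-MPI itself; `Network`, `software_task`, `engine_task`, and `run` are hypothetical names, and `engine_task` is only a stand-in for a hardware engine. Rank 0 addresses its peer by rank number alone, so swapping the peer's implementation requires no change on rank 0's side.

```python
import queue
import threading


class Network:
    """Hypothetical in-process stand-in for a rank-addressed message-passing network."""

    def __init__(self, size):
        # One mailbox per (source, destination) pair, like a point-to-point channel.
        self.boxes = {(s, d): queue.Queue() for s in range(size) for d in range(size)}

    def send(self, src, dest, data):
        self.boxes[(src, dest)].put(data)

    def recv(self, src, dest):
        return self.boxes[(src, dest)].get(timeout=5)


def software_task(net, rank):
    """Rank 1 as a software process: squares each value received from rank 0."""
    values = net.recv(0, rank)
    net.send(rank, 0, [v * v for v in values])


def engine_task(net, rank):
    """Rank 1 as a stand-in for a hardware engine: same messages, different internals."""
    values = net.recv(0, rank)
    net.send(rank, 0, [v ** 2 for v in values])


def run(worker):
    """Rank 0 talks to rank 1 by rank number only; it never inspects the implementation."""
    net = Network(size=2)
    thread = threading.Thread(target=worker, args=(net, 1))
    thread.start()
    net.send(0, 1, [1, 2, 3])
    result = net.recv(1, 0)
    thread.join()
    return result


print(run(software_task) == run(engine_task))  # prints True
```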
The Xilinx Advanced Computing Platform (ACP) consists of modules that plug directly into a CPU socket
Direct access to the front-side bus (FSB)
CPU and FPGA are both peers in the system
Equal-priority main memory access

CPU does not have to orchestrate activity of FPGA
CPU does not have to relay data to and from FPGAs
FPGA not on slow connection to CPU
All tasks can run independently

[Equation relating F and U; the remaining symbols were lost in extraction.]

[Figure: target system. A quad-core CPU and MEM share the FSB with three Xilinx ACP modules, each containing a Comm FPGA and two user FPGAs. User FPGAs 1 to 4 are labeled with NBE 1 to NBE 8 and Comm blocks; user FPGAs 5 and 6 are labeled with Ewald and Comm blocks.]

Target system is a combination of software running on CPUs and FPGA hardware accelerators
Key to performance is identifying hotspots and adding corresponding hardware acceleration
Hardware engineer needs to focus on only a small part of the overall application
MPI facilitates hardware/software isolation and collaboration

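Why hotspot identification is the key to performance can be estimated with Amdahl's law, which the slides do not name. The sketch below assumes that a fraction `p` of run time is spent in the accelerated routine and that the hardware engine speeds that routine up by a factor `s`; the function name and the numbers are illustrative, not measurements from the TMD machine.

```python
def overall_speedup(p, s):
    """Amdahl's law: overall speedup when a fraction p of run time is accelerated s times."""
    if not 0.0 <= p <= 1.0 or s <= 0.0:
        raise ValueError("need 0 <= p <= 1 and s > 0")
    return 1.0 / ((1.0 - p) + p / s)


# Illustrative values only: a hotspot taking 90% of run time, accelerated 100x.
print(round(overall_speedup(0.90, 100.0), 2))  # prints 9.17
```

Even a 100x engine yields under 10x overall while 10% of the run time stays in software, which is why profiling comes before any hardware design.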
SOCRN
Prof. Paul Chow, Prof. Régis Pomès 1,2
David Chui, Christopher Comis, Sam Lee, Daniel Ly, Lesley Shannon, Mike Yan, Danny Gupta, Alireza Heiderbarghi, Alex Kaganov, Daniel Ly, Chris Madill 1,2, Daniel Nunes, Emanuel Ramalho, David Woods, Arun Patel, Manuel Saldaña
Groups: Arches Computing, TMD Group, Past Members
1: Molecular Structure and Function, The Hospital for Sick Children
2: Department of Biochemistry, University of Toronto

[Figure: TMD-MPI layers between Application and Hardware: MPI Application Interface, Point-to-Point MPI Functions, Send/Receive Implementation, FSL Hardware Interface.]
Layer 4: MPI Interface. All MPI functions implemented in TMD-MPI that are available to the application.
Layer 3: Collective Operations. Barrier synchronization, data gathering, and message broadcasts.
Layer 2: Communication Primitives. MPI_Send and MPI_Recv methods are used to transmit data between processes.
Layer 1: Hardware Interface. Low-level methods to communicate with FSLs for both on-chip and off-chip communication.

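The layering above can be sketched in a few lines: collectives (Layer 3) written purely in terms of send and receive (Layer 2), which in turn sit on FIFO links (Layer 1). This is a single-process model with threads, not the TMD-MPI source; the `Comm` class, its method names, and the linear broadcast and barrier algorithms are assumptions chosen for brevity.

```python
import queue
import threading


class Comm:
    """Hypothetical layered communicator, modeled in one process with threads."""

    def __init__(self, size):
        self.size = size
        # Layer 1 stand-in: one FIFO per (source, destination) pair.
        self._fifo = {(s, d): queue.Queue() for s in range(size) for d in range(size)}

    # Layer 2: point-to-point primitives built on the FIFOs.
    def send(self, src, dest, data):
        self._fifo[(src, dest)].put(data)

    def recv(self, src, dest):
        return self._fifo[(src, dest)].get(timeout=5)

    # Layer 3: collectives built only from send and recv.
    def bcast(self, rank, data, root=0):
        if rank == root:
            for dest in range(self.size):
                if dest != root:
                    self.send(root, dest, data)
            return data
        return self.recv(root, rank)

    def barrier(self, rank, root=0):
        if rank == root:
            for src in range(self.size):
                if src != root:
                    self.recv(src, root)   # wait for every task to arrive
        else:
            self.send(rank, root, None)    # report arrival to the root
        self.bcast(rank, None, root)       # root releases every task


def run_demo(size=4):
    """Layer 4 stand-in: application tasks that call only the interface above."""
    comm = Comm(size)
    results = [None] * size

    def task(rank):
        value = comm.bcast(rank, "params" if rank == 0 else None)
        comm.barrier(rank)
        results[rank] = value

    threads = [threading.Thread(target=task, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_demo())  # prints ['params', 'params', 'params', 'params']
```

Because the collectives never touch the FIFOs directly, replacing the Layer 1 stand-in with a different link leaves Layers 3 and 4 unchanged.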
Communication links are based on Fast Simplex Links (FSL)
  Unidirectional point-to-point FIFO
  Provides buffering and flow control
  Can be used to isolate different clock domains
FSLs simplify component interconnects
  Standardized interface, used by both hardware engines and processors
  Can assemble system modules rapidly
Application-specific network topologies can be defined

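The buffering and flow-control behavior of a unidirectional FIFO link can be modeled in software. The sketch below is a behavioral illustration only; the class name `SimplexFifo`, the flag names, and the depth are assumptions, and it does not model clock-domain crossing or the real FSL signal interface.

```python
from collections import deque


class SimplexFifo:
    """Hypothetical software model of a unidirectional point-to-point FIFO link."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.depth = depth
        self._words = deque()

    @property
    def full(self):
        """True when the producer must stall (back-pressure)."""
        return len(self._words) >= self.depth

    @property
    def exists(self):
        """True when at least one word is waiting for the consumer."""
        return len(self._words) > 0

    def write(self, word):
        """Producer side: returns False, and drops nothing, when the buffer is full."""
        if self.full:
            return False
        self._words.append(word)
        return True

    def read(self):
        """Consumer side: returns the oldest word, or None when the link is empty."""
        if not self.exists:
            return None
        return self._words.popleft()


link = SimplexFifo(depth=4)
accepted = [link.write(w) for w in range(6)]  # the last two writes hit back-pressure
drained = [link.read() for _ in range(4)]     # words come out in arrival order
print(accepted, drained)
```

The producer sees only `full` and the consumer sees only `exists`, which is what lets the two sides be developed and swapped independently.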
Inter-FPGA communication uses abstracted communication links
Communication is independent of physical link
  Single serial transceivers (FSL-over-Aurora)
  Bonded serial transceivers (FSL-over-XAUI)
  Parallel buses (FSL-over-Wires)
  FSL-over-10GbE coming soon…