A Scalable FPGA-based Multiprocessor Arun Patel 1, Christopher A. Madill 2,3, Manuel Saldaña 1, Christopher Comis 1, Régis Pomès 2,3, Paul Chow 1 Presented.

A Scalable FPGA-based Multiprocessor Arun Patel 1, Christopher A. Madill 2,3, Manuel Saldaña 1, Christopher Comis 1, Régis Pomès 2,3, Paul Chow 1 Presented By: Arun Patel and Christopher Madill {apatel, cmadill}@eecg.toronto.edu IEEE 2006 Conference on Field-Programmable Custom Computing Machines Napa Valley, California April 25 th, 2006 1: Department of Electrical and Computer Engineering, University of Toronto 2: Department of Structural Biology and Biochemistry, The Hospital for Sick Children 3: Department of Biochemistry, University of Toronto

4/25/2006IEEE FCCM 20062 Introduction –FPGAs can accelerate many computing tasks by up to 2 or 3 orders of magnitude –Supercomputers and computing clusters have been designed to improve computing performance. –Our work focuses on developing a powerful computing cluster based on a scalable network of FPGAs –Initial design will be tailored for performing Molecular Dynamics simulations

4/25/2006IEEE FCCM 20063 1.Calculate interatomic forces. 2.Calculate the net force. 3.Integrate Newton’s equations of motion. Molecular Dynamics – Combines empirical force calculations with Newton’s equations of motion. – Predict the time trajectory of small atomic systems. – Computationally demanding.   F 

4/25/2006IEEE FCCM 20064 Molecular Dynamics U =     + + + +

4/25/2006IEEE FCCM 20065 Why Molecular Dynamics? 2. Computationally Demanding 30 CPU Years 1. Inherently Parallelizable

4/25/2006IEEE FCCM 20066 Example of MD at Work

4/25/2006IEEE FCCM 20067 Motivation for Architecture Majority of hardware accelerators achieve ~10 2 -10 3 x improvement over S/W by –Pipelining a serially-executed algorithm - or - –Performing operations in parallel Such techniques do not address large- scale computing applications (such as MD) –Much greater speedups are required (10 4 -10 5 ) –Not likely with a single hardware accelerator Ideal solution for large-scale computing? –Scalability of modern HPC platforms –Performance of hardware acceleration

4/25/2006IEEE FCCM 20068 Large-Scale Computing Solutions Class 1 Machines –Supercomputers or clusters of workstations –~10-10 5 interconnected CPUs Interconnection Network

4/25/2006IEEE FCCM 20069 Large-Scale Computing Solutions Class 1 Machines –Supercomputers or clusters of workstations –~10-10 5 interconnected CPUs Class 2 Machines –Hybrid network of CPU and FPGA hardware –FPGA acts as external co-processor to CPU –Programming model still evolving Interconnection Network

4/25/2006IEEE FCCM 200610 Large-Scale Computing Solutions Class 1 Machines –Supercomputers or clusters of workstations –~10-10 5 interconnected CPUs Class 2 Machines –Hybrid network of CPU and FPGA hardware –FPGA acts as external co-processor to CPU –Programming model still evolving Class 3 Machines –Network of FPGA-based computing nodes –Recent area of academic and industrial focus Interconnection Network

4/25/2006IEEE FCCM 200611 The “TMD” Machine An investigation of a Class 3 architecture –Designed for applications that exhibit high compute-to- communication ratio –Made possible by integration of microprocessors, high-speed communication interfaces into modern FPGA packages Design Features –Distributed memory model –Low-latency point-to-point interconnection network –Provides abstraction of uniform, extensible FPGA fabric to system designers –Constructed entirely using commodity FPGA components –Does not address shared memory, external I/O issues

4/25/2006IEEE FCCM 200612 TMD “Computing Tasks” (1/2) Computing Tasks –Applications are defined as collection of computing tasks –Tasks communicate by passing messages Task Implementation Flexibility –Software processes executing on embedded microprocessors –Dedicated hardware computing engines Task Computing Engine Embedded Microprocessor Processor on CPU Node Class 3Class 1

4/25/2006IEEE FCCM 200613 TMD “Computing Tasks” (2/2) Computing Task Granularity –Tasks can vary in size and complexity –Not restricted to one task per FPGA FPGAsTasks A B C DEF GHI JKLM

4/25/2006IEEE FCCM 200614 TMD Communication Infrastructure Tier 1: Intra-FPGA Communication –Point-to-Point FIFOs are used as communication channels –Asynchronous FIFOs isolate clock domains –Application-specific network topologies can be defined

4/25/2006IEEE FCCM 200615 TMD Communication Infrastructure Tier 1: Intra-FPGA Communication –Point-to-Point FIFOs are used as communication channels –Asynchronous FIFOs isolate clock domains –Application-specific network topologies can be defined Tier 2: Inter-FPGA Communication –Multi-gigabit serial transceivers used for inter-FPGA communication –Fully-interconnected network topology using 2N*(N-1) pairs of traces

4/25/2006IEEE FCCM 200616 TMD Communication Infrastructure Tier 1: Intra-FPGA Communication –Point-to-Point FIFOs are used as communication channels –Asynchronous FIFOs isolate clock domains –Application-specific network topologies can be defined Tier 2: Inter-FPGA Communication –Multi-gigabit serial transceivers used for inter-FPGA communication –Fully-interconnected network topology using 2N*(N-1) pairs of traces Tier 3: Inter-Cluster Communication –Commercially-available switches interconnect cluster PCBs –Built-in features for large-scale computing: fault-tolerance, scalability

4/25/2006IEEE FCCM 200617 Inter-Task Communication Based on Message Passing Interface (MPI) –Popular message-passing standard for distributed applications –Implementations available for virtually every HPC platform TMD-MPI –Subset of MPI standard developed for TMD architecture –Software library for tasks implemented on embedded microprocessors –Hardware Message Passing Engine (MPE) for hardware computing tasks

4/25/2006IEEE FCCM 200618 TMD-MPI Software Implementation Application Hardware MPI Application Interface Point-to-Point MPI Functions Send/Receive Implementation FSL Hardware Interface Layer 4: MPI Interface All MPI functions implemented in TMD-MPI that are available to the application.

4/25/2006IEEE FCCM 200619 TMD-MPI Software Implementation Application Hardware MPI Application Interface Point-to-Point MPI Functions Send/Receive Implementation FSL Hardware Interface Layer 4: MPI Interface All MPI functions implemented in TMD-MPI that are available to the application. Layer 3: Collective Operations Barrier synchronization, data gathering and message broadcasts.

4/25/2006IEEE FCCM 200620 TMD-MPI Software Implementation Application Hardware MPI Application Interface Point-to-Point MPI Functions Send/Receive Implementation FSL Hardware Interface Layer 4: MPI Interface All MPI functions implemented in TMD-MPI that are available to the application. Layer 3: Collective Operations Barrier synchronization, data gathering and message broadcasts. Layer 2: Communication Primitives MPI_Send and MPI_Recv methods are used to transmit data between processes.

4/25/2006IEEE FCCM 200621 TMD-MPI Software Implementation Application Hardware MPI Application Interface Point-to-Point MPI Functions Send/Receive Implementation FSL Hardware Interface Layer 4: MPI Interface All MPI functions implemented in TMD-MPI that are available to the application. Layer 3: Collective Operations Barrier synchronization, data gathering and message broadcasts. Layer 2: Communication Primitives MPI_Send and MPI_Recv methods are used to transmit data between processes. Layer 1: Hardware Interface Low level methods to communicate with FSLs for both on and off-chip communication.

4/25/2006IEEE FCCM 200622 TMD Application Design Flow Step 1: Application Prototyping –Software prototype of application developed –Profiling identifies compute-intensive routines Application Prototype

4/25/2006IEEE FCCM 200623 TMD Application Design Flow Step 1: Application Prototyping –Software prototype of application developed –Profiling identifies compute-intensive routines Step 2: Application Refinement –Partitioning into tasks communicating using MPI –Each task emulates a computing engine –Communication patterns analyzed to determine network topology Application Prototype Process AProcess BProcess C

4/25/2006IEEE FCCM 200624 TMD Application Design Flow Step 1: Application Prototyping –Software prototype of application developed –Profiling identifies compute-intensive routines Step 2: Application Refinement –Partitioning into tasks communicating using MPI –Each task emulates a computing engine –Communication patterns analyzed to determine network topology Step 3: TMD Prototyping –Tasks are ported to soft-processors on TMD –Software refined to utilize TMD-MPI library –On-chip communication network verified Application Prototype Process AProcess BProcess C ABC

4/25/2006IEEE FCCM 200625 TMD Application Design Flow Step 1: Application Prototyping –Software prototype of application developed –Profiling identifies compute-intensive routines Step 2: Application Refinement –Partitioning into tasks communicating using MPI –Each task emulates a computing engine –Communication patterns analyzed to determine network topology Step 3: TMD Prototyping –Tasks are ported to soft-processors on TMD –Software refined to utilize TMD-MPI library –On-chip communication network verified Step 4: TMD Optimization –Intensive tasks replaced with hardware engines –MPE handles communication for hardware engines Application Prototype Process AProcess BProcess C ABC B

4/25/2006IEEE FCCM 200626 MD Software Implementation Atom Store r → r → Force Engine Atom Store r → F → Force Engine Atom Store r → F → F → F → mpiCC Interconnection Network Design Flow – Testing and validation – Parallel design – Software to hardware transition

4/25/2006IEEE FCCM 200627 Current Work XC2VP100 PPC-405 Force Engine Atom Store Force Engine Atom Store + TMD-MPI + ppc-g++ Force Engine C++ → HDL + TMD-MPE + Synthesis Replace software processes with hardware computing engines

4/25/2006IEEE FCCM 200628 Future Work – Phase 2 TMD Version 2 Prototype

4/25/2006IEEE FCCM 200629 Future Work – Phase 3 The final TMD architecture will contain a hierarchical network of FPGA chips

4/25/2006IEEE FCCM 200630 Acknowledgements SOCRN David Chui Christopher Comis Sam Lee Dr. Paul Chow Andrew House Daniel Nunes Manuel Saldaña Emanuel Ramalho Dr. Régis Pomès Christopher Madill Arun Patel Lesley Shannon TMD Group: Past Members:

A Scalable FPGA-based Multiprocessor Arun Patel 1, Christopher A. Madill 2,3, Manuel Saldaña 1, Christopher Comis 1, Régis Pomès 2,3, Paul Chow 1 Presented.

Similar presentations

Presentation on theme: "A Scalable FPGA-based Multiprocessor Arun Patel 1, Christopher A. Madill 2,3, Manuel Saldaña 1, Christopher Comis 1, Régis Pomès 2,3, Paul Chow 1 Presented."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

A Scalable FPGA-based Multiprocessor Arun Patel 1, Christopher A. Madill 2,3, Manuel Saldaña 1, Christopher Comis 1, Régis Pomès 2,3, Paul Chow 1 Presented.

Similar presentations

Presentation on theme: "A Scalable FPGA-based Multiprocessor Arun Patel 1, Christopher A. Madill 2,3, Manuel Saldaña 1, Christopher Comis 1, Régis Pomès 2,3, Paul Chow 1 Presented."— Presentation transcript:

Similar presentations

About project

Feedback