Configuration and Programming of Heterogeneous Multiprocessors on a Multi-FPGA System Using TMD-MPI
Manuel Saldaña, Daniel Nunes, Emanuel Ramalho, and Paul Chow
University of Toronto, Department of Electrical and Computer Engineering
3rd International Conference on ReConFigurable Computing and FPGAs (ReConFig06)
San Luis Potosí, Mexico, September 2006
Agenda
Motivation
Background: TMD-MPI, Classes of HPC, Design Flow
New Developments
Example Application: Heterogeneity test, Scalability test
Conclusions
Motivation: How Do We Program This?
A 64-MicroBlaze MPSoC with Ring and 2D-Mesh topologies on an XC4VLX160 (not the largest FPGA available!)
Motivation: How Do We Program This?
A 512-MicroBlaze multiprocessor system connected by a network
Background: Classes of HPC Machines
Class 1: supercomputers or clusters of workstations connected by an interconnection network
Class 2: hybrid network of CPU and FPGA hardware, where the FPGA acts as an external co-processor to the CPU
Class 3: FPGA-based multiprocessor, a recent area of academic and industrial focus
Background: MPSoC and MPI
An MPSoC (Class 3) has many similarities to a typical multiprocessor computer (Class 1), but also many special requirements: similar concepts, different implementations
MPI for MPSoC is desirable (TIMA labs, OpenFPGA, Berkeley BEE2, U. of Queensland, U. Rey Juan Carlos, UofT TMD, ...)
However, MPI is a broad standard designed for big machines, and full MPI implementations are too big for embedded systems
Background: TMD-MPI
The same application code runs on a Linux cluster (using MPICH) and on an MPSoC (using TMD-MPI)
Background: TMD-MPI
Use multiple chips to obtain massive resources; TMD-MPI hides the complexity of the multi-chip network
Background: TMD-MPI Implementation Layers
Application
MPI Application Interface: MPI_Barrier
Point-to-Point MPI: MPI_Send / MPI_Recv
TMD-MPI Communication Functions: csend / send
Hardware Access Functions: fsl_cput / fsl_put (macros)
Hardware: put / get (assembly instructions)
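The layering can be pictured as a call chain from the MPI interface down to the hardware access macros. The following is a minimal sketch, assuming simple signatures for csend and the FSL access layer; the actual TMD-MPI internals are not shown in the slides, and the hardware-access layer is stubbed here so the sketch compiles anywhere.

```c
/* Illustrative layering sketch (signatures are assumptions, not TMD-MPI source). */

/* Hardware Access Functions layer: on a MicroBlaze this would expand to an
 * FSL 'put' instruction; here it is a placeholder. */
static void fsl_put(unsigned int word, int channel)
{
    (void)word; (void)channel;   /* a real macro writes 'word' to FSL 'channel' */
}

/* TMD-MPI Communication Functions layer: stream a buffer word by word. */
static void csend(const unsigned int *buf, int nwords, int channel)
{
    for (int i = 0; i < nwords; i++)
        fsl_put(buf[i], channel);
}

/* Point-to-Point MPI layer: a send built on top of csend. */
static int tmd_mpi_send(const void *buf, int count, int dest)
{
    csend((const unsigned int *)buf, count, dest);  /* dest selects a channel */
    return 0;                                       /* MPI_SUCCESS            */
}
```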
Background: TMD-MPI MPI Functions Implemented
Point-to-Point: MPI_Send, MPI_Recv
Collective Operations: MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce
Miscellaneous: MPI_Init, MPI_Finalize, MPI_Comm_rank, MPI_Comm_size, MPI_Wtime
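A small program restricted to this subset might look as follows. This is only a sketch of the programming model; it compiles against any standard MPI implementation (e.g. MPICH on a Linux cluster) and uses no functions beyond those listed above.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();

    if (size > 1) {                      /* simple point-to-point exchange */
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* collective */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks, token %d, elapsed %f s\n", size, token, MPI_Wtime() - t0);

    MPI_Finalize();
    return 0;
}
```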
Background: Design Flow
A flexible hardware-software co-design flow
Previous work: Patel et al. [1] (FCCM 2006) and Saldaña et al. [2] (FPL 2006); continued in this work (ReConFig06)
New Developments
TMD-MPI for the MicroBlaze
TMD-MPI for the PowerPC405
TMD-MPE for hardware engines
New Developments: TMD-MPE and TMD-MPI light
A hardware engine with message-passing capability
New Developments
TMD-MPE uses the Rendezvous message-passing protocol
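As a rough illustration, the Rendezvous protocol can be sketched as a three-step handshake: the sender announces the message, waits for the receiver to grant the transfer, and only then sends the payload. The control-message names and helper functions below are assumptions for illustration, not the actual TMD-MPE packet format; the transport is stubbed so the sketch is self-contained.

```c
#include <stdio.h>
#include <string.h>

enum ctrl { REQ_TO_SEND, CLEAR_TO_SEND };   /* assumed control-message types */

/* Placeholder transport: in TMD-MPI/TMD-MPE these would be FSL or network accesses. */
static void send_ctrl(int peer, enum ctrl c) { printf("ctrl %d -> %d\n", c, peer); }
static void wait_ctrl(int peer, enum ctrl c) { printf("wait ctrl %d from %d\n", c, peer); }
static void send_data(int peer, const void *b, int n) { printf("send %d B -> %d\n", n, peer); (void)b; }
static void recv_data(int peer, void *b, int n) { printf("recv %d B <- %d\n", n, peer); memset(b, 0, n); }

/* Sender side of a Rendezvous transfer. */
void rendezvous_send(int dest, const void *buf, int nbytes)
{
    send_ctrl(dest, REQ_TO_SEND);    /* 1. announce the message (envelope only) */
    wait_ctrl(dest, CLEAR_TO_SEND);  /* 2. block until the receive is posted    */
    send_data(dest, buf, nbytes);    /* 3. transfer the payload, no buffering   */
}

/* Receiver side: an unmatched REQ_TO_SEND would be queued as "unexpected". */
void rendezvous_recv(int src, void *buf, int nbytes)
{
    wait_ctrl(src, REQ_TO_SEND);     /* wait for (or dequeue) the request        */
    send_ctrl(src, CLEAR_TO_SEND);   /* receive is posted: grant the transfer    */
    recv_data(src, buf, nbytes);     /* payload lands directly in the user buffer */
}
```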
New Developments
TMD-MPE includes:
message queues to keep track of unexpected messages
packetizing/depacketizing logic to handle large messages
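To give an idea of what the packetizing logic does, the sketch below splits one large message into fixed-size packets, each carrying a small header. The packet size and header fields here are assumptions for illustration only; the real logic is implemented in hardware inside TMD-MPE.

```c
#include <string.h>

#define PKT_PAYLOAD_WORDS 64          /* assumed maximum payload per packet */

struct packet {
    unsigned int dest, src, tag;      /* assumed routing/matching header    */
    unsigned int nwords;              /* payload words in this packet       */
    unsigned int payload[PKT_PAYLOAD_WORDS];
};

/* Break 'nwords' words of 'msg' into packets and hand each to the network. */
void packetize(const unsigned int *msg, unsigned int nwords,
               unsigned int dest, unsigned int src, unsigned int tag,
               void (*send_packet)(const struct packet *))
{
    while (nwords > 0) {
        struct packet p = { dest, src, tag, 0, {0} };
        p.nwords = (nwords < PKT_PAYLOAD_WORDS) ? nwords : PKT_PAYLOAD_WORDS;
        memcpy(p.payload, msg, p.nwords * sizeof(unsigned int));
        send_packet(&p);              /* depacketizing on the receive side
                                         reassembles the packets in order   */
        msg    += p.nwords;
        nwords -= p.nwords;
    }
}
```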
Heterogeneity Test
Heat Equation application / Jacobi iterations: observe the change of the temperature distribution over time
Heterogeneity Test
Heat Equation application / Jacobi iterations (system diagram: processing elements using TMD-MPI and TMD-MPE)
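The Jacobi kernel behind this test can be pictured as a one-dimensional decomposition: each processing element exchanges its boundary rows with its neighbours and then relaxes each interior point toward the average of its four neighbours. The grid dimensions, tags, and neighbour layout below are assumptions; this is a sketch of the method, not the actual application source.

```c
#include <mpi.h>

#define N 128                       /* assumed local interior rows        */
#define M 128                       /* assumed columns (no decomposition) */

/* One Jacobi iteration on a block of rows; 'up'/'down' are neighbour ranks
 * (MPI_PROC_NULL at the physical boundaries, which also keeps the blocking
 * send/recv ordering below from deadlocking). */
void jacobi_step(double u[N + 2][M], double unew[N + 2][M],
                 int up, int down, MPI_Comm comm)
{
    MPI_Status st;

    /* Exchange ghost rows with the neighbouring processing elements. */
    MPI_Send(u[1],     M, MPI_DOUBLE, up,   0, comm);
    MPI_Recv(u[N + 1], M, MPI_DOUBLE, down, 0, comm, &st);
    MPI_Send(u[N],     M, MPI_DOUBLE, down, 1, comm);
    MPI_Recv(u[0],     M, MPI_DOUBLE, up,   1, comm, &st);

    /* Jacobi update: every interior point becomes the average of its
     * four neighbours (the heat-equation relaxation step). */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j < M - 1; j++)
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                 u[i][j - 1] + u[i][j + 1]);
}
```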
Heterogeneity Test
MPSoC heterogeneous configurations (9 processing elements, single FPGA)
Heterogeneity Test
Execution time (chart comparing PPC405s, MicroBlazes, and Jacobi hardware engines)
Scalability Test
Heat Equation application on 5 FPGAs (XC2VP100), with 7 MicroBlazes + 2 PPC405s per FPGA
45 processing elements in total (35 MicroBlazes + 10 PPC405s)
Scalability Test
Fixed-size speedup up to 45 processors
UofT TMD Prototype
Conclusions
TMD-MPI and TMD-MPE enable parallel programming of heterogeneous MPSoCs across multiple FPGAs, including hardware engines
TMD-MPI hides the complexity of using heterogeneous links
The Heat Equation application code was executed on a Linux cluster and on our multi-FPGA system with minimal changes
TMD-MPI can be adapted to a particular architecture
The TMD prototype is a good platform for further research on MPSoCs
References
[1] Arun Patel, Christopher Madill, Manuel Saldaña, Christopher Comis, Régis Pomès, and Paul Chow. A Scalable FPGA-based Multiprocessor. In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), April 2006.
[2] Manuel Saldaña and Paul Chow. TMD-MPI: An MPI Implementation for Multiple Processors Across Multiple FPGAs. In IEEE International Conference on Field-Programmable Logic and Applications (FPL 2006), August 2006.
Thank you! (¡Gracias!)
Rendezvous Synchronization Overhead
Testing the Functionality
TMD-MPIbench round-trip tests: on-chip and off-chip communication, with internal RAM (BRAM) and external RAM (DDR)
TMD-MPI Implementation
TMD-MPI communication protocols
Communication Tests
TMD-MPIbench.c measures: round trips, bisection bandwidth, round trips with congestion (worst-case traffic scenario), all-node broadcasts, and synchronization performance (barriers/sec)
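The round-trip test can be pictured as a standard ping-pong loop between two ranks, timed with MPI_Wtime. The sketch below shows the idea; it is not the actual TMD-MPIbench.c source, and the message size and repetition count are arbitrary.

```c
#include <mpi.h>
#include <stdio.h>

#define REPS 1000                              /* assumed repetition count */

int main(int argc, char *argv[])
{
    int rank, buf = 0;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {                       /* rank 0 starts the ping-pong */
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)                             /* one-way latency = RTT / 2 */
        printf("latency: %f us\n", (MPI_Wtime() - t0) / REPS / 2 * 1e6);

    MPI_Finalize();
    return 0;
}
```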
Communication Tests: Latency
Testbed (internal link, @ 40 MHz): 17 μs
Testbed (external link): 22 μs
P3-NOW (100 Mb/s Ethernet): 75 μs
P4-Cluster (1000 Mb/s Gigabit Ethernet): 92 μs
Communication Tests
MicroBlaze throughput limit with external RAM
Communication Tests
MicroBlaze throughput limit with internal RAM vs. external RAM (chart annotation: memory access time)
Communication Tests
Measured bandwidth @ 40 MHz (chart annotations: startup overhead, frequency; compared against P4-Cluster and P3-NOW)
Communication Tests
Many variables are involved…
Background: TMD-MPI
TMD-MPI provides a parallel programming model for MPSoCs in FPGAs with the following features:
Portability - the application is unaffected by changes in hardware
Flexibility - to move from generic to application-specific implementations
Scalability - for large-scale applications
Reusability - no need to learn a new API for similar applications
Testing the Functionality
Hardware Testbed
New Developments: TMD-MPE
TMD-MPE usage and the network
Background: TMD-MPI
TMD-MPI:
is a lightweight subset of the MPI standard
is tailored to a particular application
does not require an operating system
has a small memory footprint (~8.7 KB)
uses a simple protocol
New Developments: TMD-MPE and TMD-MPI light