by Manuel Saldaña, Daniel Nunes, Emanuel Ramalho, and Paul Chow


Configuration and Programming of Heterogeneous Multiprocessors on a Multi-FPGA System Using TMD-MPI
by Manuel Saldaña, Daniel Nunes, Emanuel Ramalho, and Paul Chow
University of Toronto, Department of Electrical and Computer Engineering
3rd International Conference on ReConFigurable Computing and FPGAs (ReConFig06), San Luis Potosí, Mexico, September 2006

Agenda: Motivation; Background (TMD-MPI, Classes of HPC, Design Flow); New Developments; Example Application (Heterogeneity test, Scalability test); Conclusions.

Motivation: How do we program this? A 64-MicroBlaze MPSoC (ring and 2D-mesh topologies) on a Xilinx XC4VLX160, which is not even the largest FPGA available.

Motivation: How do we program this? A 512-MicroBlaze multiprocessor system connected through a network.

Background: Classes of HPC Machines. Class 1 machines: supercomputers or clusters of workstations joined by an interconnection network. Class 2 machines: a hybrid network of CPU and FPGA hardware, where the FPGA acts as an external co-processor to the CPU. Class 3 machines: FPGA-based multiprocessors, a recent area of academic and industrial focus.

Background: MPSoC and MPI. An MPSoC (Class 3) has many similarities to typical multiprocessor computers (Class 1), but also many special requirements: similar concepts, but different implementations. MPI for MPSoC is desirable (TIMA labs, OpenFPGA, Berkeley BEE2, U. of Queensland, U. Rey Juan Carlos, UofT TMD, ...), but MPI is a broad standard designed for big machines, and full MPI implementations are too big for embedded systems.

Background: TMD-MPI. The same application code runs on the MPSoC (using TMD-MPI) and on a Linux cluster (using MPICH).

Background: TMD-MPI. Use multiple chips to obtain massive resources; TMD-MPI hides the complexity of the resulting network of processors spread across FPGAs.

Background: TMD-MPI Implementation Layers. From the application down to the hardware: Application; MPI Application Interface (e.g., MPI_Barrier); Point-to-Point MPI (MPI_Send/MPI_Recv); TMD-MPI Communication Functions (csend/send); Hardware Access Functions (fsl_cput/fsl_put macros, put/get assembly instructions); Hardware.
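To make the layering concrete, here is a compile-only C sketch of how a blocking send could be wrapped down to the FSL access layer on a MicroBlaze. The names below the MPI level follow the slide (csend, fsl_cput, fsl_put), but every signature, the header format, and the stub bodies are assumptions for illustration, not the actual TMD-MPI source.

    /* Compile-only sketch of the layering on a MicroBlaze node: each layer is a
     * thin wrapper over the one below.  Signatures and header format are assumed. */

    /* Hardware access layer: on a real MicroBlaze these map to FSL put
     * instructions; here they are stubs so the sketch is self-contained. */
    static void fsl_cput(unsigned word) { (void)word; /* control word to FSL */ }
    static void fsl_put(unsigned word)  { (void)word; /* data word to FSL    */ }

    /* TMD-MPI communication layer: frame a buffer as a header plus payload. */
    static void csend(int dest, int tag, const unsigned *buf, int nwords)
    {
        fsl_cput((unsigned)((dest << 16) | (tag & 0xFFFF)));  /* assumed header */
        for (int i = 0; i < nwords; i++)
            fsl_put(buf[i]);                                  /* payload words  */
    }

    /* Point-to-point MPI layer seen by the application. */
    int MPI_Send(const void *buf, int count, int datatype, int dest, int tag, int comm)
    {
        (void)datatype; (void)comm;            /* single communicator assumed */
        csend(dest, tag, (const unsigned *)buf, count);
        return 0;                              /* MPI_SUCCESS */
    }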

Background: TMD-MPI. MPI Functions Implemented. Point-to-point: MPI_Send, MPI_Recv. Miscellaneous: MPI_Init, MPI_Finalize, MPI_Comm_rank, MPI_Comm_size, MPI_Wtime. Collective operations: MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce.
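A minimal program restricted to this subset looks like ordinary MPI code; as the earlier slide notes, the same source can run under MPICH on a Linux cluster and under TMD-MPI on the MPSoC. The reduction example below is illustrative only and is not taken from the TMD-MPI distribution.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal program that uses only calls from the TMD-MPI subset above. */
    int main(int argc, char *argv[])
    {
        int rank, size;
        double local, sum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = (double)rank;                  /* each PE contributes its rank */
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks over %d PEs = %f\n", size, sum);

        MPI_Finalize();
        return 0;
    }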

Background: Design Flow. A flexible hardware-software co-design flow. Previous work: Patel et al. [1] (FCCM 2006) and Saldaña et al. [2] (FPL 2006); this ReConFig06 paper extends that flow.

New Developments: TMD-MPI for the MicroBlaze, TMD-MPI for the PowerPC405, and TMD-MPE for hardware engines.

New Developments: TMD-MPE and TMD-MPI light. A hardware engine with message-passing support.

New Developments. The TMD-MPE uses the rendezvous message-passing protocol: the sender first issues a request to send and transfers the data only after the receiver signals that it is ready.
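The toy C program below walks through the three steps of that handshake in order; the message names and the shared "wire" variable are assumptions made purely to show the ordering, not the TMD-MPE implementation.

    #include <stdio.h>

    /* Toy, single-file illustration of the rendezvous handshake. */
    enum msg_type { REQ_TO_SEND, CLEAR_TO_SEND, PAYLOAD };

    static enum msg_type wire;                          /* stand-in for the network */
    static void send_msg(enum msg_type m) { wire = m; }
    static enum msg_type recv_msg(void)   { return wire; }

    int main(void)
    {
        /* 1. Sender announces the message (size/tag travel in the envelope). */
        send_msg(REQ_TO_SEND);
        printf("sender  : request to send\n");

        /* 2. Receiver replies only when a matching receive has been posted. */
        if (recv_msg() == REQ_TO_SEND) {
            send_msg(CLEAR_TO_SEND);
            printf("receiver: clear to send\n");
        }

        /* 3. Only now is the payload pushed onto the network, so the
              destination never has to buffer unexpected data. */
        if (recv_msg() == CLEAR_TO_SEND) {
            send_msg(PAYLOAD);
            printf("sender  : payload transfer\n");
        }
        return 0;
    }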

New Developments. The TMD-MPE includes message queues to keep track of unexpected messages, and packetizing/depacketizing logic to handle large messages.
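As a rough illustration of the packetizing step (which the TMD-MPE actually performs in hardware), the C sketch below splits a long message into fixed-size packets; the packet size, header fields, and transmit() hook are all assumptions.

    #include <string.h>

    #define PACKET_PAYLOAD_WORDS 16               /* payload words per packet (assumed) */

    struct packet {
        unsigned dest, tag, seq, nwords;          /* assumed header fields */
        unsigned payload[PACKET_PAYLOAD_WORDS];
    };

    static void transmit(const struct packet *p) { (void)p; /* push onto the network */ }

    /* Split a long message into packets; the depacketizer reassembles them at
     * the destination using the sequence number. */
    void mpe_send_packetized(unsigned dest, unsigned tag,
                             const unsigned *buf, unsigned nwords)
    {
        struct packet p = { dest, tag, 0, 0 };
        for (unsigned off = 0; off < nwords; off += PACKET_PAYLOAD_WORDS, p.seq++) {
            p.nwords = nwords - off;
            if (p.nwords > PACKET_PAYLOAD_WORDS)
                p.nwords = PACKET_PAYLOAD_WORDS;
            memcpy(p.payload, buf + off, p.nwords * sizeof *buf);
            transmit(&p);
        }
    }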

Heterogeneity Test: Heat Equation Application / Jacobi Iterations. Observe the change of the temperature distribution over time; processing elements communicate through TMD-MPI, or through the TMD-MPE in the case of hardware engines.
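For reference, one Jacobi sweep of the heat equation on a 2-D grid can be written as below; the grid size and data layout are assumptions, not the application's actual code. In the parallel version each processing element owns a band of rows and exchanges halo rows with its neighbours through MPI_Send/MPI_Recv before every sweep, regardless of whether the neighbour is a MicroBlaze, a PowerPC405, or a Jacobi hardware engine behind a TMD-MPE.

    #include <string.h>

    #define N 64                      /* grid dimension, chosen for illustration */

    /* One Jacobi sweep: every interior point becomes the average of its four
     * neighbours.  Boundary rows/columns hold fixed temperatures. */
    void jacobi_sweep(double u[N][N], double u_new[N][N])
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                u_new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                      u[i][j - 1] + u[i][j + 1]);

        /* copy the interior back so the next sweep reads the updated values */
        for (int i = 1; i < N - 1; i++)
            memcpy(&u[i][1], &u_new[i][1], (N - 2) * sizeof(double));
    }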

Heterogeneity Test: MPSoC Heterogeneous Configurations (9 Processing Elements, single FPGA).

Heterogeneity Test: execution time for configurations of PPC405s, Jacobi hardware engines, and MicroBlazes.

Scalability Test: Heat Equation Application on 5 FPGAs (XC2VP100), each with 7 MicroBlazes + 2 PPC405s, giving 45 processing elements (35 MicroBlazes + 10 PPC405s).

Scalability Test: Fixed-size Speedup up to 45 Processors.
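Fixed-size (strong-scaling) speedup and the corresponding parallel efficiency are the standard metrics behind such a plot (definitions only, not figures taken from the slides):

    S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}

where T(1) is the single-processor execution time and T(p) the execution time on p processing elements.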

UofT TMD Prototype.

Conclusions. TMD-MPI and TMD-MPE enable the parallel programming of heterogeneous MPSoCs across multiple FPGAs, including hardware engines. TMD-MPI hides the complexity of using heterogeneous links. The heat equation application code was executed on a Linux cluster and on our multi-FPGA system with minimal changes. TMD-MPI can be adapted to a particular architecture. The TMD prototype is a good platform for further research on MPSoCs.

References. [1] Arun Patel, Christopher Madill, Manuel Saldaña, Christopher Comis, Régis Pomès, and Paul Chow. A Scalable FPGA-based Multiprocessor. In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), April 2006. [2] Manuel Saldaña and Paul Chow. TMD-MPI: An MPI Implementation for Multiple Processors Across Multiple FPGAs. In IEEE International Conference on Field Programmable Logic and Applications (FPL 2006), August 2006.

Thank you! (¡Gracias!)

Rendezvous Synchronization Overhead.

Testing the Functionality: TMD-MPIbench round-trip tests covering on-chip and off-chip communication, with internal RAM (BRAM) and external RAM (DDR).

TMD-MPI Implementation: TMD-MPI communication protocols.

Communication Tests: TMD-MPIbench.c measures round trips, bisection bandwidth, round trips with congestion (worst-case traffic scenario), all-node broadcasts, and synchronization performance (barriers/sec).
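A round-trip (ping-pong) measurement of the kind TMD-MPIbench.c performs can be sketched as follows; the message size, repetition count, and output format are assumptions, not the actual benchmark source.

    #include <mpi.h>
    #include <stdio.h>

    #define REPS 1000                 /* repetitions per measurement (assumed) */

    /* Minimal ping-pong between ranks 0 and 1. */
    int main(int argc, char *argv[])
    {
        int rank, buf[256] = {0};
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, 256, MPI_INT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 256, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 256, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 256, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("average round-trip time: %g us\n", (t1 - t0) / REPS * 1e6);

        MPI_Finalize();
        return 0;
    }

Under MPICH this would be run with at least two ranks, e.g. mpirun -np 2 ./pingpong.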

Communication Tests. Latency: testbed internal link 17 µs (at 40 MHz); testbed external link 22 µs; P3 NOW with 100 Mb/s Ethernet 75 µs; P4 cluster with 1000 Mb/s Gigabit Ethernet 92 µs.

Communication Tests: MicroBlaze throughput limit with external RAM.

Communication Tests: MicroBlaze throughput limits with internal RAM and with external RAM, showing the effect of memory access time.

Communication Tests: measured bandwidth at 40 MHz compared with the P4 cluster and the P3 NOW, showing start-up overhead and operating-frequency effects.
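The start-up and bandwidth components seen in these plots correspond to the usual linear communication cost model (a standard approximation, not a formula from the slides):

    T(n) \approx t_{\text{startup}} + \frac{n}{B}

where n is the message size and B is the asymptotic link bandwidth.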


Many variables are involved…

Background: TMD-MPI. TMD-MPI provides a parallel programming model for MPSoCs in FPGAs with the following features: portability (the application is unaffected by changes in hardware), flexibility (to move from generic to application-specific implementations), scalability (for large-scale applications), and reusability (no need to learn a new API for similar applications).

Testing the Functionality: Hardware Testbed.


New Developments: TMD-MPE. How the TMD-MPE is used with the network.

Background: TMD-MPI. TMD-MPI is a lightweight subset of the MPI standard: it is tailored to a particular application, does not require an operating system, has a small memory footprint (~8.7 KB), and uses a simple protocol.

New Developments: TMD-MPE and TMD-MPI light.