A Scalable FPGA-based Multiprocessor for Molecular Dynamics Simulation
Arun Patel 1, Christopher A. Madill 2,3, Manuel Saldaña 1, Christopher Comis 1, Régis Pomès 2,3, Paul Chow 1

Presentation transcript:

A Scalable FPGA-based Multiprocessor for Molecular Dynamics Simulation
Arun Patel 1, Christopher A. Madill 2,3, Manuel Saldaña 1, Christopher Comis 1, Régis Pomès 2,3, Paul Chow 1
Presented by: Arun Patel
Connections 2006, The University of Toronto ECE Graduate Symposium, Toronto, Ontario, Canada, June 9th, 2006
1: Department of Electrical and Computer Engineering, University of Toronto
2: Department of Structural Biology and Biochemistry, The Hospital for Sick Children
3: Department of Biochemistry, University of Toronto

Introduction
– FPGAs can accelerate many computing tasks by up to two or three orders of magnitude
– Supercomputers and computing clusters have been designed to improve computing performance
– Our work focuses on developing a computing cluster based on a scalable network of FPGAs
– The initial design will be tailored for performing Molecular Dynamics simulations

Molecular Dynamics
– Combines empirical force calculations with Newton's equations of motion
– Predicts the time trajectory of small atomic systems
– Computationally demanding
Each time step:
1. Calculate the interatomic forces
2. Calculate the net force on each atom
3. Integrate the Newtonian equations of motion
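To make the three steps concrete, the following is a minimal C++ sketch of one MD time step, assuming a generic pairwise force law and a simple explicit integrator; the data layout and function names are illustrative and are not taken from the TMD implementation.

```cpp
#include <vector>
#include <cstddef>

struct Vec3 { double x, y, z; };

struct Atom {
    Vec3 r;   // position
    Vec3 v;   // velocity
    Vec3 f;   // accumulated net force
    double m; // mass
};

// One MD time step: (1) inter-atomic forces, (2) net force, (3) integration.
// pair_force() stands in for the empirical force law (e.g. a Lennard-Jones
// plus Coulomb term); it is a placeholder, not the model used on the TMD.
void md_step(std::vector<Atom>& atoms, double dt,
             Vec3 (*pair_force)(const Vec3&, const Vec3&)) {
    // Steps 1 and 2: evaluate pairwise forces and accumulate the net force.
    for (auto& a : atoms) a.f = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < atoms.size(); ++i) {
        for (std::size_t j = i + 1; j < atoms.size(); ++j) {
            Vec3 fij = pair_force(atoms[i].r, atoms[j].r);
            atoms[i].f.x += fij.x; atoms[i].f.y += fij.y; atoms[i].f.z += fij.z;
            atoms[j].f.x -= fij.x; atoms[j].f.y -= fij.y; atoms[j].f.z -= fij.z;
        }
    }
    // Step 3: integrate Newton's equations of motion (explicit Euler here;
    // production MD codes typically use a velocity Verlet scheme).
    for (auto& a : atoms) {
        a.v.x += dt * a.f.x / a.m; a.v.y += dt * a.f.y / a.m; a.v.z += dt * a.f.z / a.m;
        a.r.x += dt * a.v.x;       a.r.y += dt * a.v.y;       a.r.z += dt * a.v.z;
    }
}
```

The O(N²) pairwise loop is what makes the method computationally demanding and, because every pair can be evaluated independently, inherently parallelizable.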

Molecular Dynamics
– The total potential energy U of the system is evaluated as a sum of interaction terms between atoms
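For reference, a generic empirical force field of the kind used in MD expresses U as a sum of bonded and non-bonded terms; this is a standard textbook form, not necessarily the exact expression shown in the talk:

```latex
U(\mathbf{r}^N) =
    \sum_{\text{bonds}} k_b \,(b - b_0)^2
  + \sum_{\text{angles}} k_\theta \,(\theta - \theta_0)^2
  + \sum_{i<j} 4\varepsilon_{ij}\left[
        \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12}
      - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]
  + \sum_{i<j} \frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}}
```

The non-bonded double sums over atom pairs dominate the computational cost and are the natural targets for acceleration.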

Why Molecular Dynamics?
1. Inherently parallelizable
2. Computationally demanding: simulations of interest can require on the order of 30 CPU-years

Motivation for Architecture
Most hardware accelerators improve on software by either:
– Pipelining a serially executed algorithm, or
– Performing operations in parallel
Such techniques alone do not address large-scale computing applications such as MD:
– Much greater speedups are required than a single hardware accelerator can deliver
The ideal solution for large-scale computing combines:
– The scalability of modern HPC platforms
– The performance of hardware acceleration

The “TMD” Machine
An investigation of an FPGA-based architecture:
– Designed for applications that exhibit a high compute-to-communication ratio
– Made possible by the integration of microprocessors and high-speed communication interfaces into modern FPGA packages

Inter-Task Communication
Based on the Message Passing Interface (MPI):
– A popular message-passing standard for distributed applications
– Implementations are available for virtually every HPC platform
TMD-MPI:
– A subset of the MPI standard developed for the TMD architecture
– A software library for tasks implemented on embedded microprocessors
– A hardware Message Passing Engine (MPE) for hardware computing tasks

MD Software Implementation
– The software prototype is partitioned into Atom Store and Force Engine tasks that exchange position (r) and force (F) messages over the interconnection network (sketched below); the processes are compiled with mpiCC
Design flow:
– Testing and validation
– Parallel design
– Software-to-hardware transition
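A minimal sketch of that message pattern, written against the standard MPI API that the TMD-MPI subset mirrors: an Atom Store rank sends positions r to a Force Engine rank and receives the computed forces F back. The rank assignments, tags, and buffer sizes are illustrative only.

```cpp
#include <mpi.h>
#include <vector>

// Illustrative rank and tag assignments; the real task-to-rank mapping is
// decided when the application is partitioned.
const int ATOM_STORE = 0, FORCE_ENGINE = 1;
const int TAG_R = 0, TAG_F = 1;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;                       // atoms handled by this task pair
    std::vector<double> r(3 * n), f(3 * n);   // packed x,y,z coordinates/forces

    if (rank == ATOM_STORE) {
        // Send positions to the force engine, then wait for the forces.
        MPI_Send(r.data(), 3 * n, MPI_DOUBLE, FORCE_ENGINE, TAG_R, MPI_COMM_WORLD);
        MPI_Recv(f.data(), 3 * n, MPI_DOUBLE, FORCE_ENGINE, TAG_F, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        // ...integrate the equations of motion using f...
    } else if (rank == FORCE_ENGINE) {
        MPI_Recv(r.data(), 3 * n, MPI_DOUBLE, ATOM_STORE, TAG_R, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        // ...evaluate the inter-atomic forces from r into f...
        MPI_Send(f.data(), 3 * n, MPI_DOUBLE, ATOM_STORE, TAG_F, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Built with mpiCC, this prototype runs on a conventional cluster; keeping the interface down to plain sends and receives is what later allows the same tasks to move onto the TMD.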

Current Work
Replace software processes with hardware computing engines:
– Target platform: a Xilinx XC2VP100 FPGA with embedded PPC-405 processors
– Software tasks (Atom Store, Force Engine) are compiled with ppc-g++ and linked against TMD-MPI
– The Force Engine is converted from C++ to HDL, attached to a TMD-MPE, and synthesized

Acknowledgements
TMD Group and past members: Dr. Paul Chow, Dr. Régis Pomès, Arun Patel, Christopher Madill, Manuel Saldaña, Christopher Comis, Andrew House, Daniel Nunes, Emanuel Ramalho, David Chui, Sam Lee, Lesley Shannon
SOCRN

Large-Scale Computing Solutions
Class 1 Machines
– Supercomputers or clusters of workstations
– A large number of interconnected CPUs
Class 2 Machines
– A hybrid network of CPU and FPGA hardware
– The FPGA acts as an external co-processor to the CPU
– The programming model is still evolving
Class 3 Machines
– A network of FPGA-based computing nodes
– A recent area of academic and industrial focus

TMD Communication Infrastructure
Tier 1: Intra-FPGA Communication
– Point-to-point FIFOs are used as communication channels
– Asynchronous FIFOs isolate clock domains
– Application-specific network topologies can be defined
Tier 2: Inter-FPGA Communication
– Multi-gigabit serial transceivers are used for inter-FPGA communication
– A fully interconnected network topology uses 2N(N-1) pairs of traces (a quick check of this count follows below)
Tier 3: Inter-Cluster Communication
– Commercially available switches interconnect the cluster PCBs
– Built-in features for large-scale computing: fault tolerance, scalability
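Since the tier 2 wiring is fixed on the PCB, it is worth seeing how quickly the 2N(N-1) trace-pair count grows with the number of FPGAs per board. A trivial helper (the formula is taken directly from the slide; the exact definition of a "pair of traces" is the slide's):

```cpp
#include <cstdio>

// Pairs of traces for a fully interconnected board of n FPGAs,
// using the 2N(N-1) expression quoted on the slide.
int trace_pairs(int n) { return 2 * n * (n - 1); }

int main() {
    const int sizes[] = {2, 4, 8};
    for (int n : sizes) {
        std::printf("N = %d FPGAs -> %d pairs of traces\n", n, trace_pairs(n));
    }
    return 0;
}
```

The quadratic growth helps explain why full interconnection is confined to a single PCB, with commercial switches taking over between boards at tier 3.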

TMD “Computing Tasks” (1/2)
Computing Tasks
– Applications are defined as a collection of computing tasks
– Tasks communicate by passing messages
Task Implementation Flexibility
– Software processes executing on embedded microprocessors
– Dedicated hardware computing engines
– On a Class 3 machine a task maps to a computing engine or an embedded microprocessor; on a Class 1 machine it maps to a processor on a CPU node

TMD “Computing Tasks” (2/2)
Computing Task Granularity
– Tasks can vary in size and complexity
– Designs are not restricted to one task per FPGA (the slide maps tasks A through M onto a smaller set of FPGAs)

TMD-MPI Software Implementation
Software stack (top to bottom): application, MPI application interface, point-to-point MPI functions, send/receive implementation, FSL hardware interface, hardware.
Layer 4: MPI Interface – all MPI functions implemented in TMD-MPI that are available to the application
Layer 3: Collective Operations – barrier synchronization, data gathering, and message broadcasts
Layer 2: Communication Primitives – MPI_Send and MPI_Recv methods are used to transmit data between processes
Layer 1: Hardware Interface – low-level methods that communicate with the FSLs for both on-chip and off-chip communication
(A sketch of how these layers stack in code follows.)
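A hedged sketch of how the layers might stack in code: an application-level send decomposes into a point-to-point primitive that pushes header and payload words into an FSL. Every name below is a hypothetical stand-in (including fsl_write_word, which is stubbed with a printf so the sketch runs anywhere); only the layering itself is taken from the slide, and in the real library the top layer would carry the standard MPI_Send signature.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstddef>

// Layer 1 (hardware interface): push one word into a Fast Simplex Link.
// Stubbed here; on the real platform this would be the low-level FSL access routine.
static void fsl_write_word(int channel, std::uint32_t word) {
    std::printf("FSL %d <- 0x%08lx\n", channel, static_cast<unsigned long>(word));
}

// Layer 2 (communication primitives / send implementation): frame a buffer as
// a message and push it word by word onto the FSL channel toward the destination rank.
static void tmd_send_raw(int dest_rank, const std::uint32_t* buf,
                         std::size_t nwords, int tag) {
    const int channel = dest_rank;                                // illustrative routing
    fsl_write_word(channel, static_cast<std::uint32_t>(tag));     // header: message tag
    fsl_write_word(channel, static_cast<std::uint32_t>(nwords));  // header: payload length
    for (std::size_t i = 0; i < nwords; ++i)
        fsl_write_word(channel, buf[i]);                          // payload words
}

// Layer 4 (MPI interface): the role MPI_Send plays in TMD-MPI, with a simplified
// signature. Layer 3 collectives would be built from calls like this one.
static int tmd_mpi_send(const std::uint32_t* buf, int count, int dest, int tag) {
    tmd_send_raw(dest, buf, static_cast<std::size_t>(count), tag);
    return 0;
}

int main() {
    std::uint32_t payload[3] = {1, 2, 3};
    tmd_mpi_send(payload, 3, /*dest=*/1, /*tag=*/7);  // send three words toward rank 1
    return 0;
}
```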

TMD Application Design Flow
Step 1: Application Prototyping
– A software prototype of the application is developed
– Profiling identifies compute-intensive routines
Step 2: Application Refinement
– The application is partitioned into tasks communicating using MPI
– Each task emulates a computing engine
– Communication patterns are analyzed to determine the network topology
Step 3: TMD Prototyping
– Tasks are ported to soft processors on the TMD
– Software is refined to use the TMD-MPI library
– The on-chip communication network is verified
Step 4: TMD Optimization
– Compute-intensive tasks are replaced with hardware engines (see the sketch below)
– The MPE handles communication for the hardware engines
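The property step 4 relies on is that a task is addressed only by its rank, so a software process and a hardware engine are interchangeable behind the same messages. A sketch of that idea from the Atom Store side, with illustrative rank and tag values (not the actual TMD code):

```cpp
#include <mpi.h>
#include <vector>

// Whether the Force Engine is a soft-processor task (step 3) or a hardware
// engine fronted by a TMD-MPE (step 4), the Atom Store still just talks to
// "the task at FORCE_RANK" with the same send/receive pair.
const int FORCE_RANK = 1;        // illustrative rank assignment
const int TAG_R = 0, TAG_F = 1;  // illustrative message tags

void exchange_with_force_task(std::vector<double>& r, std::vector<double>& f) {
    MPI_Send(r.data(), static_cast<int>(r.size()), MPI_DOUBLE,
             FORCE_RANK, TAG_R, MPI_COMM_WORLD);
    MPI_Recv(f.data(), static_cast<int>(f.size()), MPI_DOUBLE,
             FORCE_RANK, TAG_F, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

Swapping the implementation behind FORCE_RANK is therefore invisible to the rest of the application, which is what makes the step-by-step migration practical.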

Future Work – Phase 2
– TMD Version 2 prototype

Future Work – Phase 3
– The final TMD architecture will contain a hierarchical network of FPGA chips