
A High-Speed Inter-Process Communication Architecture for FPGA-based Hardware Acceleration of Molecular Dynamics
Presented by: Chris Comis
September 23, 2005
Supervisor: Professor Paul Chow

2 Outline
1. Motivation
2. System-Level Overview
3. Protocol Development
4. Results
5. Integration into a Programming Model
6. Conclusions/Questions

3 What is Molecular Dynamics?
- A method of calculating the time-evolution of molecular configurations
- Useful in the analysis of protein folding
- Many applications in rational drug design

4 MD is Computationally Challenging
1. Forces (i.e. F=ma) are calculated between an atom and all other atoms in the system
   - An O(n^2) problem across 10,000+ atoms
2. Force calculations are performed at femtosecond timesteps
   - Interesting results may take several μs of simulation ( timesteps required)
MD simulations are typically run on supercomputers
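The scale of the problem on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a timestep of exactly 1 fs and a target of 1 μs of simulated time (the slide says only "femtosecond timesteps" and "several μs"), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of unique atom pairs in an n-atom system: n(n-1)/2, which grows as O(n^2). */
uint64_t pair_count(uint64_t n_atoms)
{
    return n_atoms * (n_atoms - 1) / 2;
}

/* Timesteps needed to cover sim_time_fs femtoseconds of simulated time
 * with a timestep of step_fs femtoseconds, rounded up.
 * ASSUMPTION: the caller supplies the timestep; the slides give no exact value. */
uint64_t timestep_count(uint64_t sim_time_fs, uint64_t step_fs)
{
    assert(step_fs > 0);
    return (sim_time_fs + step_fs - 1) / step_fs;
}
```

Under those assumptions, `pair_count(10000)` gives 49,995,000 pairwise interactions per timestep, and `timestep_count(1000000000, 1)` gives 10^9 timesteps for a single microsecond of simulated time.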

5 An FPGA-based MD Accelerator
- An ongoing collaborative project involves the development of an FPGA-based MD accelerator
- Advantages of an FPGA-based approach:
  1. Massively parallel computation
  2. Forces can be parallelized
  3. Force computations can be accelerated ~88x
  4. High-speed serial I/O (SERDES) may be leveraged

6 Area of Focus
- Develop a communication protocol using high-speed SERDES links
- Requirements:
  - Reliability
  - Light-weight
  - Minimal trip-time for small packets
  - Must be abstracted at the hardware and software levels


8 A Partial MD Simulator
- Diagram legend: blocks → computation; arrows → communication
- Computation blocks can be hardware, or software executed on MicroBlaze soft processors
- Software must be written using a programming model

9-14 System-Level Overview
- The MD simulator is simplified to a Producer/Consumer model
- The model is then adapted for SERDES development:
  1. Producer and consumer hardware blocks are implemented
  2. An FSL (FIFO) is used as an abstracted method of data transport with SERDES logic
  3. An OPB bus interface is added for register access of components
  4. Deep FIFOs are added for logging high-speed data


16 Protocol Overview
- A synchronous acknowledgement-based protocol was chosen
  - Simple and predictable
  - An inherent delay in waiting for acknowledgements
- To mask this delay:
  - Multiple producers are connected to the SERDES interface
  - The link is time-multiplexed across multiple producers
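Why time-multiplexing masks the acknowledgement delay can be seen with a simple back-of-the-envelope model. The sketch below is not taken from the slides: it assumes each producer has at most one unacknowledged packet in flight, that sending a packet occupies the link for `t_send`, and that the producer then waits `t_wait` for its acknowledgement. The function name and both parameters are hypothetical.

```c
#include <assert.h>

/* Fraction of time the link carries data when `producers` senders share it,
 * each sending for t_send and then idling for t_wait until its ack arrives.
 * ASSUMPTION: one outstanding packet per producer; times are in the same unit. */
double link_utilization(unsigned producers, double t_send, double t_wait)
{
    assert(t_send > 0.0 && t_wait >= 0.0);
    double u = producers * t_send / (t_send + t_wait);
    return u > 1.0 ? 1.0 : u;   /* the link cannot be more than fully busy */
}
```

In this model, a single producer whose acknowledgement wait equals its send time keeps the link only half busy, while two such producers interleaved on the same link can in principle fill it.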

17 Protocol Overview
- All data has a word width of 4 bytes
- Data packets:
  - Variable size (between 32 and 2016 bytes)
  - A 32-bit CRC is appended
- Acknowledgements:
  - 8 bytes in size
  - Can interrupt transmission of data packets
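The data packet format described above can be sketched in software as follows. This is an illustration under stated assumptions, not the hardware implementation: the slides do not say which 32-bit CRC is used, so the common IEEE 802.3 polynomial is assumed here; the byte order of the appended CRC and whether the 32 to 2016 byte range includes the CRC word are also assumptions. All function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_BYTES   4u     /* all data is one 4-byte word wide (from the slide) */
#define MIN_PAYLOAD  32u    /* smallest data packet in bytes (from the slide) */
#define MAX_PAYLOAD  2016u  /* largest data packet in bytes (from the slide) */

/* Bitwise CRC-32 with the reflected IEEE 802.3 polynomial.
 * ASSUMPTION: the actual CRC polynomial of the link is not given in the slides. */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

/* Returns 1 if len is word-aligned and within the allowed packet size range. */
int payload_len_ok(size_t len)
{
    return len >= MIN_PAYLOAD && len <= MAX_PAYLOAD && len % WORD_BYTES == 0;
}

/* Copies the payload into out and appends the CRC as one extra word
 * (little-endian here, which is an assumption). Returns the total number of
 * bytes written, or 0 if the length is illegal or out is too small. */
size_t build_packet(const uint8_t *payload, size_t len, uint8_t *out, size_t out_cap)
{
    if (!payload_len_ok(len) || out_cap < len + WORD_BYTES)
        return 0;
    memcpy(out, payload, len);
    uint32_t crc = crc32_ieee(payload, len);
    for (unsigned i = 0; i < WORD_BYTES; i++)
        out[len + i] = (uint8_t)(crc >> (8u * i));
    return len + WORD_BYTES;
}

/* Receiver-side check: recompute the CRC over the payload and compare it
 * with the trailing word. Returns 1 on a match, 0 otherwise. */
int packet_crc_ok(const uint8_t *pkt, size_t total_len)
{
    if (total_len < MIN_PAYLOAD + WORD_BYTES || total_len % WORD_BYTES != 0)
        return 0;
    size_t len = total_len - WORD_BYTES;
    uint32_t received = 0;
    for (unsigned i = 0; i < WORD_BYTES; i++)
        received |= (uint32_t)pkt[len + i] << (8u * i);
    return received == crc32_ieee(pkt, len);
}
```

A receiver built this way would treat any packet for which `packet_crc_ok` returns 0 as a CRC error and withhold the acknowledgement, leaving the transmitter to resend from its buffered start address.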

18 Transmit Logic
- Transmitter consists mainly of two components:
  1. Dual-port buffers: The start address of the packet is kept in case a resend is necessary
  2. Scheduler: Schedules ready packets in a round-robin fashion
- Diagram labels: From Producer via FSL; To Scheduler of SERDES Link
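The round-robin policy named on this slide can be sketched in a few lines. This is a software illustration only, not the hardware scheduler: it assumes three producers (the number used in the deck's verification tests) and a per-producer "packet ready" flag, and the names `rr_next` and `NUM_PRODUCERS` are hypothetical.

```c
#include <assert.h>

#define NUM_PRODUCERS 3   /* ASSUMPTION: three producers, as in the deck's test setup */

/* Round-robin pick: start searching one past the producer served last and
 * return the first producer with a packet ready, or -1 if none is ready.
 * Pass last_served = -1 before any producer has been served. */
int rr_next(const int ready[NUM_PRODUCERS], int last_served)
{
    assert(last_served >= -1 && last_served < NUM_PRODUCERS);
    for (int step = 1; step <= NUM_PRODUCERS; step++) {
        int candidate = (last_served + step) % NUM_PRODUCERS;
        if (ready[candidate])
            return candidate;
    }
    return -1;
}
```

Because the search always begins after the last producer served, a producer that is continuously ready cannot starve the others, although it is served again immediately when it is the only one with a packet waiting.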

19 Receive Logic
- Receiver consists mainly of two components:
  1. Dual-port buffers: The start address of the packet is kept in case errors occur
  2. Three-stage dataflow pipeline:
     - Stage 1: Determine if incoming data is properly formatted
     - Stage 2: Evaluate incoming data against all possible errors
     - Stage 3: Pass results to acknowledgement handler
- Diagram labels: From SERDES Link; To Consumer via FSL

20 Design Effort
- The majority of the design effort was in error handling:
  - Transmitter:
    - Determine which packet combinations corrupt the system
    - Establish a priority among conflicting packet types
  - Receiver:
    - Handle all possible combinations of transmission errors


22 Test Environment
- All SERDES tests were performed between Xilinx Virtex-II Pro XC2VP7 and XC2VP30 series FPGAs
- Ribbon cables were used to transfer serial data between non-impedance-controlled connectors

23 Reliability and Sustainability
- Verification test environment:
  - Send data concurrently from three producers to three respective consumers
  - Pseudo-random packet length
  - Consumers read from FSL at variable rates
- Reliability:
  - Run this test under extremely poor line conditions
- Sustainability:
  - Run this test under normal line conditions for a long period of time

24 Reliability
Reliability: 128-second Test Results

Type of Error                  Average # of Errors
Soft Error (x10^6)
Hard Error
Frame Error                    22
CRC Error
Receive Buffer Full (x10^6)
Lost Acknowledgment            81769

25 Sustainability
Sustainability: 8-hour Test Results

Measurement                                           Result
Resent Packets due to Receive Buffer Full (x10^6)
Successful Packets (x10^6)
Total Packets (x10^6)
Approximate Bit-Rate (x10^9)                          1.755

26 Comparison Against Other Communication Mechanisms
- Two configurations are used:
  - Configuration A: Saturate the channel with packets
  - Configuration B: Loop-back test
- Compare against:
  - Simple FPGA-based 100BaseT Ethernet
  - TCP/IP FPGA-based 100BaseT Ethernet
  - TCP/IP cluster-based Gigabit Ethernet

27 Throughput Results

28 One-way Trip Time Results

29 Area Consumption
- Each SERDES interface takes approximately 8% of a Xilinx XC2VP30
- Debug logic substantially increases area consumption:
  - FF usage increases 68%
  - LUT usage increases 43%

Area Measurement              FFs     LUTs
Area with Debug Logic
Area without Debug Logic


31 Integration into a Programming Model
- Hardware abstraction: FSL
- Software abstraction: An MPI-based programming model
- Modified MPI_Send and MPI_Recv function calls

    while (1) {
        /* Send 64 integers to rank 0 with tag 0. */
        MPI_Send(data_outgoing, 64, MPI_INT, 0, 0, MPI_COMM_WORLD);
        /* Block until 64 integers arrive from rank 0 with tag 0. */
        MPI_Recv(data_incoming, 64, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

32 Integration into a Programming Model
- Replaced producers and consumers with a MicroBlaze processor
- Several communication scenarios were tested

Scenario                                        Bit-Rate (Mbps)
MicroBlaze to MicroBlaze (no traffic)           4.30
MicroBlaze to MicroBlaze (traffic)              4.10
MicroBlaze to Hardware Consumer (no traffic)    7.78
Hardware Producer to MicroBlaze (no traffic)    8.90


34 Conclusions
- Final results:
  - Reliable and sustainable
  - Abstracted at the software and hardware level
  - 2074 FFs and 2244 LUTs required for SERDES logic only
  - Given a channel rate of 2.5 Gbps, maximum bidirectional throughput of 1.928 Gbps
  - Minimum packet trip-time of 1.23 μs

35 Acknowledgements
- Professor Régis Pomès, Chris Madill
- Professor Paul Chow, Professor C.Y. Chen, Lesley Shannon, Arun Patel, Manuel Saldaña, David Chui, Sam Lee, Andrew House, Nathalie Chan, Lorne Applebaum, Patrick Akl

References
- Y. Gu, T. VanCourt, M. C. Herbordt, FPGA Acceleration of Molecular Dynamics Computations, To appear: Proceedings of Field Programmable Logic and Applications, August

36 Transmitter Packet Collision Handling
- Packets are enclosed by 8B/10B control characters (K-characters)
- The type of packet is distinguished by the K-characters used
- Certain combinations of control characters cannot be nested:
  - Clock correction has priority over acknowledgement
  - Acknowledgement cannot interrupt the end of a data packet
  - Clock correction must avoid the beginning and end of a data packet
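The three nesting rules above amount to a small priority decision made each time the transmitter chooses what to send next. The sketch below is one possible reading of those rules, not the actual transmitter logic: the phase and choice names are hypothetical, and it assumes that "priority" means clock correction wins only when both requests are eligible at the same moment.

```c
#include <assert.h>

/* Where the transmitter currently is relative to a data packet. */
typedef enum { PHASE_IDLE, PHASE_DATA_START, PHASE_DATA_BODY, PHASE_DATA_END } tx_phase;

/* What the transmitter puts on the link next. */
typedef enum { SEND_IDLE, SEND_DATA, SEND_ACK, SEND_CLOCK_CORR } tx_choice;

/* Decide what to transmit next.
 * Rules from the slide:
 *   - clock correction has priority over acknowledgement
 *   - an acknowledgement cannot interrupt the end of a data packet
 *   - clock correction must avoid the beginning and end of a data packet
 * ASSUMPTION: a request that is not allowed in the current phase simply waits. */
tx_choice tx_arbitrate(tx_phase phase, int cc_pending, int ack_pending)
{
    assert((int)phase >= (int)PHASE_IDLE && (int)phase <= (int)PHASE_DATA_END);

    int cc_allowed  = (phase != PHASE_DATA_START && phase != PHASE_DATA_END);
    int ack_allowed = (phase != PHASE_DATA_END);

    if (cc_pending && cc_allowed)
        return SEND_CLOCK_CORR;
    if (ack_pending && ack_allowed)
        return SEND_ACK;
    return (phase == PHASE_IDLE) ? SEND_IDLE : SEND_DATA;
}
```

With this reading, both requests are held off while the end of a data packet is being sent, and a pending acknowledgement may still go out at the start of a data packet while clock correction waits for the packet body.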

37 Receiver Error Handling
- All combinations of errors at the receiver are handled correctly:
  - Data errors (CRC errors)
  - Disparity errors or invalid characters (soft errors)
  - Errors in framing (frame errors)
  - Channel failures (hard errors)
  - Lost acknowledgements/repeat packets
  - Receiver buffers full

38 Test Configuration A
- Send data concurrently from three producers to three respective consumers
- Producers write to FSL as fast as possible
- Consumers read from FSL as fast as possible
- Analyze best-case throughput results

39 Test Configuration B
- Send data from a producer to a consumer
- Delay a packet write from a producer until a packet has been completely received by the consumer on the same FPGA
- The resulting communication loop determines the round-trip time (and therefore the one-way trip time)
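The step from round-trip time to one-way trip time in the loop-back test is a simple halving. The sketch below assumes the loop is timed with a cycle counter running at a known clock frequency and that both directions of the link take equally long; the function name, the cycle-counter approach, and the clock frequency parameter are hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured round trip, in clock cycles, to a one-way trip time in
 * microseconds. Cycles divided by MHz gives microseconds directly.
 * ASSUMPTION: the outbound and return paths have equal latency. */
double one_way_trip_us(uint64_t round_trip_cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    double round_trip_us = (double)round_trip_cycles / clock_mhz;
    return round_trip_us / 2.0;
}
```

For example, a round trip of 200 cycles counted on a 100 MHz clock corresponds to 2 μs, and therefore to a one-way trip time of 1 μs.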