A High-Speed Inter-Process Communication Architecture for FPGA-based Hardware Acceleration of Molecular Dynamics
Presented by: Chris Comis
September 23, 2005
Supervisor: Professor Paul Chow
2 Outline
1. Motivation
2. System-Level Overview
3. Protocol Development
4. Results
5. Integration into a Programming Model
6. Conclusions/Questions
3 What is Molecular Dynamics?
- A method of calculating the time-evolution of molecular configurations
- Useful in the analysis of protein folding
- Many applications in rational drug design
4 MD is Computationally Challenging
1. Forces (i.e. F=ma) are calculated between an atom and all other atoms in the system
   - An O(n^2) problem across 10,000+ atoms
2. Force calculations are performed at femtosecond timesteps
   - Interesting results may take several μs of simulation (at 1 fs per step, each μs requires 10^9 timesteps)
- MD simulations are typically run on supercomputers
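The scale of the problem can be checked with a little arithmetic. The following is a minimal sketch, not part of the original slides; the function names are hypothetical, and it assumes an all-pairs evaluation in which each unique pair of atoms is visited once per timestep.

```c
#include <assert.h>
#include <stdint.h>

/* Number of unique atom pairs in an all-pairs force evaluation: n(n-1)/2. */
uint64_t pair_count(uint64_t n_atoms)
{
    return n_atoms * (n_atoms - 1) / 2;
}

/* Timesteps needed to cover sim_time_fs femtoseconds at step_fs per step. */
uint64_t timestep_count(uint64_t sim_time_fs, uint64_t step_fs)
{
    assert(step_fs > 0);
    return sim_time_fs / step_fs;
}
```

For 10,000 atoms, `pair_count(10000)` gives 49,995,000 pair interactions per timestep, and `timestep_count(1000000000, 1)` gives the 10^9 steps needed for one microsecond at 1 fs per step.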
5 An FPGA-based MD Accelerator
- An ongoing collaborative project involves the development of an FPGA-based MD Accelerator
- Advantages to an FPGA-based approach:
  1. Massively parallel computation
  2. Forces can be parallelized
  3. Force computations can be accelerated ~88x
  4. High-speed Serial I/O (SERDES) may be leveraged
6 Area of Focus
- Develop communication protocol using high-speed SERDES links
- Requirements:
  - Reliability
  - Light-weight
  - Minimal trip-time for small packets
  - Must be abstracted at the hardware and software levels
8 A Partial MD Simulator
[Figure: block diagram of a partial MD simulator. Blocks represent computation; arrows represent communication.]
- Computation blocks can be hardware or software executed on MicroBlaze soft processors
- Software must be written using a programming model
9 System-Level Overview
- The MD simulator is simplified to a Producer/Consumer model
- The model is then adapted for SERDES development:
  1. Producer and consumer hardware blocks are implemented
  2. An FSL (FIFO) is used as an abstracted method of data transport with SERDES logic
  3. An OPB bus interface is added for register access of components
  4. Deep FIFOs are added for logging high-speed data
16 Protocol Overview
- A synchronous acknowledgement-based protocol was chosen
  - Simple and predictable
  - An inherent delay in waiting for acknowledgements
- To mask this delay:
  - Multiple producers are connected to the SERDES interface
  - The link is time-multiplexed across multiple producers
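Why multiplexing masks the acknowledgement delay can be illustrated with a simple model. This sketch is an assumption-laden illustration, not the design's analysis: it supposes each producer sends one packet (taking `t_send`) and then idles for `t_wait` until its acknowledgement arrives, and that other producers can use the link during that wait. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Fraction of link time spent sending data when n_producers each follow a
 * send-then-wait-for-acknowledgement discipline on a shared link.
 * t_send and t_wait must be expressed in the same time unit. */
double link_utilization(int n_producers, double t_send, double t_wait)
{
    assert(n_producers > 0 && t_send > 0.0 && t_wait >= 0.0);
    double u = n_producers * t_send / (t_send + t_wait);
    return u > 1.0 ? 1.0 : u;   /* the link cannot be more than fully busy */
}
```

Under this model a single producer whose wait equals its send time keeps the link only 50% busy, while three such producers are enough to saturate it.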
17 Protocol Overview
- All data has a word width of 4 bytes
- Data packets:
  - Variable size (between 32 and 2016 bytes)
  - A 32-bit CRC is appended
- Acknowledgements:
  - 8 bytes in size
  - Can interrupt transmission of data packets
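The packet constraints above can be sketched in a few lines of C. The slides do not say which 32-bit CRC the design uses, so the polynomial here (the reflected CRC-32 used by Ethernet) is an assumption, as is the requirement that packet sizes be whole 4-byte words; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BYTES        4
#define MIN_PACKET_BYTES  32
#define MAX_PACKET_BYTES  2016

/* A legal data packet is a whole number of 4-byte words within the limits. */
bool packet_size_ok(size_t bytes)
{
    return bytes >= MIN_PACKET_BYTES && bytes <= MAX_PACKET_BYTES &&
           bytes % WORD_BYTES == 0;
}

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320 (Ethernet CRC).
 * Assumption: the actual design may use a different 32-bit CRC. */
uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

A transmitter would append `crc32_bytes(payload, len)` as the final word, and the receiver would recompute it over the received payload and compare.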
18 Transmit Logic
- Transmitter consists mainly of two components:
  1. Dual-port buffers: the start address of the packet is kept in case a resend is necessary
  2. Scheduler: schedules ready packets in a round-robin fashion
[Figure: transmit datapath, labeled "From Producer via FSL" and "To Scheduler of SERDES Link"]
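A round-robin scheduler of the kind described can be sketched as follows. This is a software illustration of the policy only, not the hardware implementation; the producer count of three is borrowed from the test setup, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PRODUCERS 3  /* assumption: matches the three-producer tests */

/* Pick the next producer with a ready packet, starting just after the one
 * served last. Returns the producer index, or -1 if nothing is ready.
 * Pass last_served = -1 before any producer has been served. */
int rr_next(const bool ready[NUM_PRODUCERS], int last_served)
{
    assert(last_served >= -1 && last_served < NUM_PRODUCERS);
    for (int i = 1; i <= NUM_PRODUCERS; i++) {
        int candidate = (last_served + i) % NUM_PRODUCERS;
        if (ready[candidate])
            return candidate;
    }
    return -1;
}
```

Because the search always begins after the most recently served producer, a producer that is continuously ready cannot starve the others.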
19 Receive Logic
- Receiver consists mainly of two components:
  1. Dual-port buffers: the start address of the packet is kept in case errors occur
  2. Three-stage dataflow pipeline:
     - Stage 1: Determine if incoming data is properly formatted
     - Stage 2: Evaluate incoming data against all possible errors
     - Stage 3: Pass results to acknowledgement handler
[Figure: receive datapath, labeled "From SERDES Link" and "To Consumer via FSL"]
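Keeping the packet's start address lets the receiver discard a partially written packet when an error is detected. The sketch below illustrates that idea with a simple linear buffer; the real design uses dual-port memories, and the depth, type, and function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_BUF_WORDS 1024  /* hypothetical buffer depth in 32-bit words */

typedef struct {
    uint32_t mem[RX_BUF_WORDS];
    unsigned committed;   /* end of the last good packet (saved start address) */
    unsigned write_ptr;   /* next free word for the packet in flight */
} rx_buffer;

/* Store one incoming word; returns false if the buffer is full. */
bool rx_push(rx_buffer *b, uint32_t word)
{
    assert(b != NULL);
    if (b->write_ptr >= RX_BUF_WORDS)
        return false;
    b->mem[b->write_ptr++] = word;
    return true;
}

/* Packet passed all checks: make it visible to the consumer. */
void rx_commit(rx_buffer *b)
{
    assert(b != NULL);
    b->committed = b->write_ptr;
}

/* Packet failed a check: rewind to the saved start address. */
void rx_rollback(rx_buffer *b)
{
    assert(b != NULL);
    b->write_ptr = b->committed;
}
```

The transmit side can use the same pattern in reverse: the read pointer is rewound to the saved start address when a resend is required.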
20 Design Effort
- Majority of design effort was in error handling:
  - Transmitter:
    - Determine which packet combinations corrupt the system
    - Establish a priority among conflicting packet types
  - Receiver:
    - Handle all possible combinations of transmission errors
22 Test Environment
- All SERDES tests were performed across Xilinx Virtex-II Pro XC2VP7 and XC2VP30 series FPGAs
- Ribbon cables were used to transfer serial data between non-impedance-controlled connectors
23 Reliability and Sustainability
- Verification test environment:
  - Send data concurrently from three producers to three respective consumers
  - Pseudo-random packet length
  - Consumers read from FSL at variable rates
- Reliability:
  - Run this test under extremely poor line conditions
- Sustainability:
  - Run this test under normal line conditions for a long period of time
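One way to generate pseudo-random packet lengths is a linear-feedback shift register, which is cheap to build in hardware. The slides do not say how the lengths were generated, so this is only a sketch under that assumption: it uses a standard 16-bit maximal-length Galois LFSR (taps 0xB400) and maps each state to a whole number of 4-byte words between 32 and 2016 bytes. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_WORDS 8     /* 32 bytes   / 4 bytes per word */
#define MAX_WORDS 504   /* 2016 bytes / 4 bytes per word */

/* Advance a 16-bit Galois LFSR by one step. The state must never be zero. */
uint16_t lfsr_next(uint16_t state)
{
    assert(state != 0);
    uint16_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= 0xB400u;
    return state;
}

/* Map an LFSR state to a legal packet length in bytes.
 * The modulo introduces a slight bias, which is acceptable for a test. */
unsigned packet_length_bytes(uint16_t state)
{
    unsigned words = MIN_WORDS + state % (MAX_WORDS - MIN_WORDS + 1);
    return words * 4;
}
```

Seeding each producer with a different nonzero state gives each one a different, repeatable sequence of packet lengths.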
24 Reliability: 128-second Test Results
Type of Error                  Average # of Errors
Soft Error (x10^6)             [value missing]
Hard Error                     [value missing]
Frame Error                    22
CRC Error                      [value missing]
Receive Buffer Full (x10^6)    [value missing]
Lost Acknowledgment            81769
25 Sustainability: 8-hour Test Results
Measurement                                          Result
Resent Packets due to Receive Buffer Full (x10^6)    [value missing]
Successful Packets (x10^6)                           [value missing]
Total Packets (x10^6)                                [value missing]
Approximate Bit-Rate (x10^9)                         1.755
26 Comparison Against Other Communication Mechanisms
- Two configurations are used:
  - Configuration A: Saturate the channel with packets
  - Configuration B: Loop-back test
- Compare against:
  - Simple FPGA-based 100BaseT Ethernet
  - TCP/IP FPGA-based 100BaseT Ethernet
  - TCP/IP Cluster-based Gigabit Ethernet
27 Throughput Results
28 One-way Trip Time Results
29 Area Consumption
- Each SERDES interface takes approximately 8% of a Xilinx XC2VP30
- Debug logic substantially increases area consumption:
  - FF usage increases 68%
  - LUT usage increases 43%
[Table: FF and LUT counts for the area with and without debug logic]
31 Integration into a Programming Model
- Hardware abstraction: FSL
- Software abstraction: an MPI-based programming model
  - Modified MPI_Send and MPI_Recv function calls

    /* Exchange 64 ints with rank 0 on every iteration. */
    while (1) {
        MPI_Send(data_outgoing, 64, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(data_incoming, 64, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }
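To give a feel for how a send call might sit on top of an FSL, here is a host-side sketch. It is not the modified MPI_Send from the project: the header layout (destination, tag, word count packed into one word), the FIFO depth, and every name below are hypothetical, and the FSL is replaced by an in-memory stand-in so the code can run on a workstation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FSL_DEPTH 128  /* hypothetical FIFO depth in 32-bit words */

typedef struct {
    uint32_t words[FSL_DEPTH];
    size_t   count;
} fsl_fifo;            /* host-side stand-in for one FSL channel */

/* Write one word to the stand-in FIFO; returns -1 if it is full. */
static int fsl_put(fsl_fifo *f, uint32_t w)
{
    if (f->count >= FSL_DEPTH)
        return -1;
    f->words[f->count++] = w;
    return 0;
}

/* Hypothetical send: one header word (dest, tag, count), then the payload.
 * Returns 0 on success, -1 if the FIFO filled up part-way through. */
int sketch_send(fsl_fifo *f, const int32_t *buf, int count, int dest, int tag)
{
    assert(f != NULL && buf != NULL);
    assert(count > 0 && count <= 0xFFFF);
    assert(dest >= 0 && dest <= 0xFF && tag >= 0 && tag <= 0xFF);
    uint32_t header = ((uint32_t)dest << 24) | ((uint32_t)tag << 16) |
                      (uint32_t)count;
    if (fsl_put(f, header) != 0)
        return -1;
    for (int i = 0; i < count; i++)
        if (fsl_put(f, (uint32_t)buf[i]) != 0)
            return -1;
    return 0;
}
```

Sending 64 integers, as in the loop above, produces a 256-byte payload, which falls inside the protocol's 32 to 2016 byte packet range.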
32 Integration into a Programming Model
- Replaced producers and consumers with a MicroBlaze processor
- Several communication scenarios were tested

Scenario                                         Bit-Rate (Mbps)
MicroBlaze to MicroBlaze (no traffic)            4.30
MicroBlaze to MicroBlaze (traffic)               4.10
MicroBlaze to Hardware Consumer (no traffic)     7.78
Hardware Producer to MicroBlaze (no traffic)     8.90
34 Conclusions
- Final Results:
  - Reliable and sustainable
  - Abstracted at the software and hardware level
  - 2074 FFs and 2244 LUTs required for SERDES logic only
  - Given a channel rate of 2.5Gbps, maximum bidirectional throughput of 1.928Gbps
  - Minimum packet trip-time of 1.23μs
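The throughput figure can be put in context with a short calculation. This sketch assumes that the 2.5 Gbps channel rate is the raw line rate and that 8B/10B encoding applies to it, so that only 8 of every 10 transmitted bits carry data; the slides do not state this explicitly, and the function names are hypothetical.

```c
#include <assert.h>

/* Data rate left after 8B/10B encoding: 8 data bits per 10 line bits. */
double payload_rate_8b10b(double line_rate)
{
    assert(line_rate >= 0.0);
    return line_rate * 8.0 / 10.0;
}

/* Achieved throughput as a fraction of the available capacity. */
double utilization(double throughput, double capacity)
{
    assert(capacity > 0.0);
    return throughput / capacity;
}
```

With these assumptions, 1.928 Gbps is about 77% of the 2.5 Gbps channel rate, and about 96% of the 2.0 Gbps that remains after 8B/10B encoding.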
35 Acknowledgements
- Professor Régis Pomès, Chris Madill
- Professor Paul Chow, Professor C.Y. Chen, Lesley Shannon, Arun Patel, Manuel Saldaña, David Chui, Sam Lee, Andrew House, Nathalie Chan, Lorne Applebaum, Patrick Akl
References
- Y. Gu, T. VanCourt, M. C. Herbordt, FPGA Acceleration of Molecular Dynamics Computations, To appear: Proceedings of Field Programmable Logic and Applications, August
36 Transmitter Packet Collision Handling
- Packets are enclosed by 8B/10B control characters (K-characters)
- The type of packet is distinguished by the K-characters used
- Certain combinations of control characters cannot be nested:
  - Clock correction has priority over acknowledgement
  - Acknowledgement cannot interrupt the end of a data packet
  - Clock correction must avoid the beginning and end of a data packet
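The three nesting rules can be expressed as a small arbitration function. This is one possible reading of the rules, not the actual transmitter logic: in particular, letting an acknowledgement through while a pending clock correction is blocked at the start of a data packet is an assumption, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { DATA_IDLE, DATA_START, DATA_BODY, DATA_END } data_phase;
typedef enum { SEND_IDLE, SEND_DATA, SEND_ACK, SEND_CLOCK_CORRECTION } tx_choice;

/* Decide what the transmitter sends next, given the phase of any data packet
 * in progress and which control sequences are waiting. */
tx_choice tx_arbitrate(data_phase phase, bool cc_pending, bool ack_pending)
{
    assert(phase >= DATA_IDLE && phase <= DATA_END);
    bool at_boundary = (phase == DATA_START || phase == DATA_END);

    /* Clock correction wins, but must avoid packet start and end. */
    if (cc_pending && !at_boundary)
        return SEND_CLOCK_CORRECTION;
    /* Acknowledgements may interrupt data, except at the packet end. */
    if (ack_pending && phase != DATA_END)
        return SEND_ACK;
    /* Otherwise continue the data packet, or idle if there is none. */
    return phase != DATA_IDLE ? SEND_DATA : SEND_IDLE;
}
```

For example, with both control sequences pending in the middle of a packet the function selects clock correction, while at the packet end it lets the data packet finish first.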
37 Receiver Error Handling
- All combinations of errors at the receiver are handled correctly:
  - Data errors (CRC errors)
  - Disparity errors or invalid characters (soft errors)
  - Errors in framing (frame errors)
  - Channel failures (hard errors)
  - Lost acknowledgements/repeat packets
  - Receiver buffers full
38 Test Configuration A
- Send data concurrently from three producers to three respective consumers
- Producers write to FSL as fast as possible
- Consumers read from FSL as fast as possible
- Analyze best-case throughput results
39 Test Configuration B
- Send data from a producer to a consumer
- Delay a packet write from a producer until a packet has been completely received by the consumer on the same FPGA
- The resulting communication loop determines the round-trip time (and therefore the one-way trip time)