Published by Vincent Chambers. Modified over 9 years ago.
Implementation Methodology for Emerging Reconfigurable Systems

Phillip M. Martin 1, Melissa C. Smith 1, Sadaf Alam 2, Pratul Agarwal 2
1 ECE Department, Clemson University {pmmarti, smithmc}@clemson.edu
2 Oak Ridge National Laboratory {sralam, agarwalpk}@ornl.gov

Abstract
The widespread adoption of reconfigurable (FPGA) accelerator devices for scientific computing faces two critical challenges: first, sustaining performance in the presence of data transfer overheads, and second, the availability of a portable interface in a high-level language to target multiple devices. The XtremeData XD1000 is one available system that addresses the first challenge by coupling the reconfigurable hardware with the host µP on a dual-core Opteron motherboard. The ImpulseC development environment addresses the challenge of portability and programming by enabling development of applications in C that can target multiple reconfigurable hardware platforms. We present a workflow methodology for designing and accelerating a production-level biomolecular simulation framework called LAMMPS in ImpulseC. In addition, we explore the design space and characterize the performance of our implementation on the XD1000 platform.

Introduction
The fundamental approach underlying most biomolecular simulators is classical Molecular Dynamics (MD). MD treats atoms as points with both mass and charge, which allows the use of classical mechanics. Predicting the behavior of these atoms requires a large number of force calculations, which can be summarized by the overall energy potential (Equation 1). Evaluating Equation 1 for every atom in every time step quickly becomes computationally intensive, even for a small number of atoms, when time steps are on the order of one femtosecond and the total simulation time is on the order of hundreds of nanoseconds.

Figure: Model of the rhodopsin protein (1LN6) used in the LAMMPS benchmarking run.

Figure: XtremeData XD1000 motherboard with interface bandwidths shown.

Implementation
During the LAMMPS rhodopsin protein benchmark, the pair_lj_charmm_coul_long:compute function, which computes the pairwise interactions in Equation 1, took almost 70% of the total execution time. Based on this result, the function was chosen as the first candidate for acceleration. To create the custom co-processor for accelerating the pair_lj_charmm_coul_long:compute function, the function must be decoupled from the original application code, and an interface must be built to marshal data between the host code and the accelerated code running on the FPGA module.

ImpulseC provides a modified C development environment that eases the implementation of such C algorithms on RC hardware such as the XtremeData XD1000. Based on the project settings and the user's choice of a streaming or shared-memory approach, ImpulseC automatically generates VHDL capable of targeting one or several different reconfigurable computing platforms. Taking advantage of these tools, the pair_lj_charmm_coul_long:compute function was ported with the ImpulseC toolset and simulated within the ImpulseC development environment to verify its functionality and estimate its performance.

The simulated design has a maximum combinational path of 64 clock cycles (i.e., computing one atom on the FPGA takes 64 clock cycles). This result was obtained with minimal optimization, leaving potential for further improvement. At a conservative clock rate of 20 MHz, 32K atoms (one time step) can be computed in about 100 ms with this FPGA implementation, neglecting communication delays between the host and the FPGA.

The effective speedup for this function is over an order of magnitude. This translates to a total run time of 70 seconds and a speedup of 2.78 for the entire application when running the rhodopsin benchmark, again neglecting transfer delay overheads.

Conclusion
With minimal optimization, an appreciable speedup of 3x is achievable for this program, with a wide margin for actual implementation overheads. Further performance optimization through pipelining and parallel processing on the FPGA is also plausible, given the additional communication bandwidth provided by the HyperTransport links between the XD1000 host and the FPGA. A fully optimized hardware version is under development, along with a more rigorous performance analysis.

Future Work
The final production-level agenda includes performance analysis and comparison of the XtremeData XD1000 platform and the DS1000 system by DRC. The two systems have similar specifications; the main difference is that the DRC DS1000 uses a Xilinx Virtex 4 FPGA, whereas the XtremeData XD1000 uses an Altera Stratix II FPGA. It is anticipated that this research will help pinpoint the platforms and the hardware and software strategies that best meet the needs of the scientific community. Additional long-term work will focus on the analysis of multiple RC modules (multicore) within a large-scale Opteron-based HPC system.
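
The timing and speedup figures quoted in the poster can be cross-checked with a short calculation. The sketch below is an illustration only, not part of the original work: it assumes "32K atoms" means 32,000 atoms, treats "almost 70%" as exactly 0.70, and applies Amdahl's law to relate the speedup of the accelerated function to the whole-application speedup. All function and constant names are hypothetical.

```python
"""Back-of-the-envelope check of the poster's timing and speedup figures."""

CYCLES_PER_ATOM = 64      # simulated design: 64 clock cycles per atom
CLOCK_HZ = 20e6           # conservative 20 MHz FPGA clock
ATOMS = 32_000            # assumption: "32K atoms" read as 32,000
KERNEL_FRACTION = 0.70    # assumption: "almost 70%" taken as 0.70
OVERALL_SPEEDUP = 2.78    # whole-application speedup quoted in the poster


def fpga_step_time(atoms, cycles_per_atom, clock_hz):
    """Seconds to process one time step, ignoring host/FPGA transfers."""
    return atoms * cycles_per_atom / clock_hz


def amdahl_speedup(fraction, kernel_speedup):
    """Whole-application speedup when `fraction` of the run time is
    accelerated by a factor of `kernel_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def kernel_speedup_for(fraction, overall_speedup):
    """Invert Amdahl's law: kernel speedup needed for a given overall speedup."""
    return fraction / (1.0 / overall_speedup - (1.0 - fraction))


if __name__ == "__main__":
    step = fpga_step_time(ATOMS, CYCLES_PER_ATOM, CLOCK_HZ)
    print(f"time per step: {step * 1e3:.1f} ms")

    implied = kernel_speedup_for(KERNEL_FRACTION, OVERALL_SPEEDUP)
    print(f"implied kernel speedup: {implied:.1f}x")

    # Upper bound if the accelerated function took no time at all.
    ceiling = 1.0 / (1.0 - KERNEL_FRACTION)
    print(f"ceiling on overall speedup: {ceiling:.2f}x")
```

Under these assumptions the per-step time comes out at roughly 100 ms, the implied speedup of the accelerated function is a little over 11x (consistent with "over an order of magnitude"), and the ceiling on whole-application speedup is about 3.3x, which fits the 3x figure in the Conclusion.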
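
To give a sense of the kind of work the accelerated function performs, here is a minimal sketch of a pairwise Lennard-Jones plus Coulomb energy sum with a distance cutoff. It is not the LAMMPS routine and not the authors' ImpulseC code: it omits the CHARMM switching function, the long-range Coulomb treatment, neighbor lists, and force evaluation. The function names, the reduced units, and the `coulomb_k` parameter are assumptions made for illustration.

```python
import math
from itertools import combinations


def lj_coulomb_pair(r, epsilon, sigma, qi, qj, coulomb_k=1.0):
    """Energy of one atom pair at separation r: 12-6 Lennard-Jones term
    plus a plain Coulomb term (coulomb_k is a unit-dependent constant)."""
    sr6 = (sigma / r) ** 6
    lj = 4.0 * epsilon * (sr6 * sr6 - sr6)
    coulomb = coulomb_k * qi * qj / r
    return lj + coulomb


def total_pair_energy(positions, charges, epsilon, sigma, cutoff, coulomb_k=1.0):
    """Sum pair energies over all unique pairs closer than `cutoff`.

    This is the naive O(N^2) loop; production codes use neighbor lists
    to avoid visiting pairs that are far apart."""
    energy = 0.0
    for i, j in combinations(range(len(positions)), 2):
        r = math.dist(positions[i], positions[j])
        if r < cutoff:
            energy += lj_coulomb_pair(
                r, epsilon, sigma, charges[i], charges[j], coulomb_k
            )
    return energy


if __name__ == "__main__":
    # Two neutral atoms at the Lennard-Jones minimum, r = 2^(1/6) * sigma,
    # where the pair energy equals -epsilon.
    r_min = 2.0 ** (1.0 / 6.0)
    positions = [(0.0, 0.0, 0.0), (r_min, 0.0, 0.0)]
    charges = [0.0, 0.0]
    print(total_pair_energy(positions, charges, epsilon=1.0, sigma=1.0, cutoff=2.5))
```

The inner body of the loop is a fixed sequence of arithmetic operations per pair, which is the sort of regular computation that lends itself to pipelining on an FPGA.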