Protein Explorer: A Petaflops Special Purpose Computer System for Molecular Dynamics Simulations
David Gobaud
Computational Drug Discovery
Stanford University
7 March 2006
Outline
- Overview
- Background
  - Delft Molecular Dynamics Processor
  - GRAPE
- Protein Explorer Summary
- MDGRAPE-3 Chip
  - Force Calculation Pipeline
  - J-Particle Memory and Control Units
- System Architecture
- Software
- Cost
- Questions
Overview
- Protein Explorer
  - Petaflops special-purpose computer system for molecular dynamics simulations
  - High-precision screening for drug design
  - Large-scale simulations of huge proteins/complexes
- PC cluster with special-purpose engines to perform the most time-consuming calculations
- Dedicated LSI MDGRAPE-3 chip performs force calculations at 165 Gflops or higher
- ETA 2006
Background
- PCs are universal machines
  - Various applications
  - Hardware can be designed independently of applications
- Obstacles to high performance
  - Memory bandwidth bottleneck
  - Heat dissipation problem
  - Can be overcome by developing specialized architectures
Delft Molecular Dynamics Processor (DMDP)
- Pioneered high-performance special-purpose systems
- Not able to achieve effective cost-performance
  - Demanded too much time and money in the development stage
  - Speed of development is a crucial factor affecting cost-performance because electronic device technology continues to develop rapidly
  - Almost all calculations were performed by the DMDP, making the hardware very complex
GRAPE (GRAvity PipE)
- One of the most successful attempts to develop high-performance special-purpose systems
- Specialized for simulations of classical particles
- Most time is spent on calculation of long-range forces (gravitational, Coulomb, and van der Waals)
  - Thus the special hardware performs only these calculations
  - Hardware is very simple and cost-effective
GRAPE (GRAvity PipE)
- In 1995, first machine to break the teraflops barrier in nominal peak performance
- Since 2001, the leader in performance has been the Molecular Dynamics Machine at RIKEN, at 78 Tflops
- A 64-Tflops GRAPE-6 was completed at the University of Tokyo
- Protein Explorer launched based on 2002 University of Tokyo success
Protein Explorer Summary
- Host PC cluster with special-purpose boards attached
- Boards calculate only non-bonded forces
- Very simple hardware and software
  - No detailed knowledge of hardware needed to write programs
- Communication time between host and boards is proportional to the number of particles, N
- Calculation time is proportional to
  - N^2 for direct summation of long-range forces
  - N*Nc for short-range forces, where Nc is the average number of particles within the cutoff radius
- 0.25 byte/1000 operations
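The scaling on this slide can be illustrated with a small direct-summation sketch. This is not the Protein Explorer library or the MDGRAPE-3 pipeline; the function name `coulomb_forces`, the unit system (Coulomb constant set to 1), and the naive cutoff test are all assumptions made for illustration. It counts how many pair interactions are actually evaluated, which is about N^2 without a cutoff and about N*Nc with one.

```python
import math

def coulomb_forces(positions, charges, cutoff=None):
    """Direct-summation Coulomb forces (hypothetical helper, Coulomb constant = 1).

    Returns (forces, evaluated), where `evaluated` counts the pair
    interactions that were actually computed.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    evaluated = 0
    for i in range(n):          # particle receiving the force
        for j in range(n):      # particle exerting the force
            if i == j:
                continue
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            # With a cutoff, only the ~Nc neighbours of each particle contribute.
            if cutoff is not None and r2 > cutoff * cutoff:
                continue
            evaluated += 1
            scale = charges[i] * charges[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                forces[i][k] += scale * d[k]
    return forces, evaluated

# Two like charges one unit apart repel each other along x.
forces, count = coulomb_forces([(0, 0, 0), (1, 0, 0)], [1.0, 1.0])
print(forces[0], count)
```

In this sketch the cutoff version still visits every pair to test the distance; a cell-index scheme such as the one the chip's cell-index controller supports would avoid visiting distant pairs at all.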
MDGRAPE-3 Chip - Force Calculation Pipeline
- 3 subtractor units
- 6 adder units
- 8 multiplier units
- 1 function-evaluation unit
- Can perform ~33 equivalent operations per cycle when it calculates the Coulomb force
MDGRAPE-3 Chip - Force Calculation Pipeline
- Most operations done in 32-bit single-precision floating-point format
- Force accumulation is in 80-bit fixed-point format
  - Can be converted to 64-bit double-precision floating point
- Coordinates stored in 40-bit fixed-point format
  - Makes implementation of periodic boundary conditions easy
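One way to see why fixed-point coordinates help with periodic boundaries: if the box length is mapped onto the full 40-bit range, integer wrap-around gives the minimum-image separation without any explicit branching on the box size. The sketch below assumes that mapping (the slide does not state how the chip scales coordinates), and the names `to_fixed`, `min_image_diff`, and `to_length` are hypothetical.

```python
BITS = 40
MOD = 1 << BITS          # number of representable coordinate values
HALF = 1 << (BITS - 1)   # half the box, in fixed-point units

def to_fixed(x, box):
    """Map a coordinate in [0, box) to a 40-bit unsigned integer (assumed scaling)."""
    return int(round(x / box * MOD)) % MOD

def min_image_diff(a, b):
    """Signed difference a - b with wrap-around, i.e. the minimum-image separation."""
    d = (a - b) % MOD
    if d >= HALF:        # interpret the top half of the range as negative
        d -= MOD
    return d

def to_length(d, box):
    """Convert a fixed-point difference back to a length."""
    return d * box / MOD

box = 10.0
a = to_fixed(9.5, box)
b = to_fixed(0.5, box)
# The naive difference is 9.0, but across the periodic boundary
# the two particles are only 1.0 apart.
print(to_length(min_image_diff(a, b), box))
```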
MDGRAPE-3 Chip - Force Calculation Pipeline
- Function Evaluator
  - Most important part of the pipeline
  - Allows calculation of an arbitrary smooth function
  - Has a memory unit, which contains a table of polynomial coefficients and exponents, and a hardwired pipeline for fourth-order polynomial evaluation
  - Interpolates an arbitrary smooth function g(x) using segmented fourth-order polynomials evaluated by Horner's method
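The idea of a segmented fourth-order table evaluated by Horner's method can be sketched in a few lines. This is only an illustration under stated assumptions: the segments here are uniform in x, the example function is g(x) = exp(-x), and the coefficients come from a Taylor expansion about each segment centre. The chip's actual segmentation and coefficient fitting are not described on the slide, and the names `horner`, `build_table`, and `evaluate` are hypothetical.

```python
import math

def horner(coeffs, dx):
    """Evaluate c0 + c1*dx + ... + c4*dx**4 with one multiply-add per coefficient."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * dx + c
    return result

def build_table(x_min, x_max, n_segments):
    """Fourth-order Taylor coefficients of exp(-x) about each segment centre."""
    width = (x_max - x_min) / n_segments
    table = []
    for i in range(n_segments):
        centre = x_min + (i + 0.5) * width
        g0 = math.exp(-centre)
        coeffs = [g0 * (-1) ** k / math.factorial(k) for k in range(5)]
        table.append((centre, coeffs))
    return table, width

def evaluate(table, x_min, width, x):
    """Look up the segment containing x, then apply Horner's method."""
    i = min(int((x - x_min) / width), len(table) - 1)
    centre, coeffs = table[i]
    return horner(coeffs, x - centre)

table, width = build_table(0.0, 4.0, 16)
print(evaluate(table, 0.0, width, 1.3), math.exp(-1.3))
```

Swapping the table contents changes the function being evaluated while the evaluation path stays the same, which is the property that lets a single hardwired pipeline serve different force laws.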
MDGRAPE-3 Chip - J-Particle Memory and Control Units
- 20 force calculation pipelines
- j-Particle Memory Unit
  - Holds 32,768 bodies
  - “Main Memory” of the chip
  - 6.6 Mbits, built from static RAM
- Cell-Index Controller
  - Controls j-particle memory; generates addresses
- Force Summation Unit
- Master Controller
  - Manages timings and inputs/outputs of the chip
MDGRAPE-3 Chip
- 2 virtual pipelines per physical pipeline
- Physical bandwidth of the j-particle unit is 2.5 Gbytes/sec, but virtual bandwidth will reach 100 Gbytes/sec
- 340 arithmetic units and 20 function-evaluator units, which work simultaneously
- 165 Gflops at 250 MHz
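The chip-level figures can be cross-checked with simple arithmetic. The sketch below reads the ~33 equivalent operations of the Coulomb pipeline as a per-cycle, per-pipeline figure, and assumes the virtual bandwidth arises because each j-particle read is shared by all 20 physical pipelines and their 2 virtual pipelines; both readings are assumptions that happen to reproduce the quoted numbers.

```python
PIPELINES = 20                 # force calculation pipelines per chip
VIRTUAL_PER_PHYSICAL = 2       # virtual pipelines per physical pipeline
OPS_PER_CYCLE = 33             # equivalent operations per pipeline (Coulomb force)
CLOCK_HZ = 250e6               # 250 MHz
PHYSICAL_BW_GBYTES = 2.5       # j-particle unit bandwidth, Gbytes/sec

# Adders, subtractors, and multipliers per pipeline: 6 + 3 + 8
arithmetic_units = PIPELINES * (6 + 3 + 8)

peak_gflops = PIPELINES * OPS_PER_CYCLE * CLOCK_HZ / 1e9

# Assumed: one j-particle read feeds every physical and virtual pipeline.
virtual_bw_gbytes = PHYSICAL_BW_GBYTES * PIPELINES * VIRTUAL_PER_PHYSICAL

print(arithmetic_units, peak_gflops, virtual_bw_gbytes)
```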
MDGRAPE-3 Chip
- Chip made by Hitachi
- 6M gates
- 10M bits of memory
- Chip size is ~220 mm^2
- Dissipates 20 watts at a core voltage of +1.2 V
- 0.12 W/Gflops, much better than the P4 3 GHz, which is 14 W/Gflops
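The power-efficiency comparison follows from the chip's 20 W dissipation and its 165 Gflops peak. A minimal check, using only figures quoted in these slides:

```python
CHIP_WATTS = 20.0          # power dissipation of one MDGRAPE-3 chip
CHIP_GFLOPS = 165.0        # peak performance of one chip
P4_WATTS_PER_GFLOPS = 14.0 # figure quoted for a 3 GHz Pentium 4

chip_watts_per_gflops = CHIP_WATTS / CHIP_GFLOPS
advantage = P4_WATTS_PER_GFLOPS / chip_watts_per_gflops

# Roughly 0.12 W/Gflops, i.e. on the order of a hundred times better than the P4
print(round(chip_watts_per_gflops, 2), round(advantage))
```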
System Architecture
- Host PC cluster will use Itanium or Opteron CPUs
  - 256 nodes with 512 CPUs in total
  - Performance of each node is 3.96 Tflops
  - Total reaches a petaflops
- Requires a 10-Gbit/sec network
  - InfiniBand, 10G Ethernet, or future Myrinet
- Network topology will be a 2D hyper-crossbar
- Each node has 24 MDGRAPE-3 chips
- MDGRAPE-3 chips connected via 2 PCI-X buses at 133 MHz
- A 19-inch rack can house 6 nodes
  - 43 racks total
- Power dissipation ~150 kW
- Occupies 100 m^2
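The node, system, and rack figures on this slide are consistent with each other, as a short calculation shows. It uses the 165 Gflops per-chip peak quoted earlier in the deck and assumes racks are filled six nodes at a time, with the last rack partly empty.

```python
import math

CHIP_GFLOPS = 165.0     # peak per MDGRAPE-3 chip
CHIPS_PER_NODE = 24
NODES = 256
NODES_PER_RACK = 6

node_tflops = CHIPS_PER_NODE * CHIP_GFLOPS / 1000.0
system_tflops = NODES * node_tflops

# Assumed: racks are filled completely before starting a new one.
racks = math.ceil(NODES / NODES_PER_RACK)

print(node_tflops, system_tflops, racks)
```

The total comes out slightly above 1,000 Tflops, which is where the petaflops label comes from.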
System Architecture
Protein Explorer Board
Software
- Very easy to create programs for
- All computational abilities provided in a library
- No special knowledge of device needed
Cost
- $20 million including labor
- Less than $10/Gflop
- At least ten times better than general-purpose computers, even when compared with the relatively cheap BlueGene/L ($140/Gflop)
Questions
- What is Myrinet?
- What is a two-dimensional hyper-crossbar network topology?
- How does this compare to massive distributed computing such as
  - Advantages?
  - Disadvantages?