Implementing Simplified Molecular Dynamics Simulation in Different Parallel Paradigms Chao Mei April 27th, 2006 CS498LVK
MD Simulation Recall Time evolution of a set of interacting atoms Interaction between molecules Intra: bonds and bends between atoms Inter: interaction of point charges Interaction can simply follow Newton’s law (F=ma) but in certain range
Intuitive Sequential Algorithm Input: an array of atoms Physical attributes: position, force, velocity etc. Algorithm in one timestep: for each atom A in the array for each atom B in the array if(distance(A,B)<=cutoff) do force calculation for A and B Update physical attributes of atoms
Ideas of Parallelization Data decomposition Chunking atom array Distribute every chunk to processors Atoms information be communicated between every chunk pair
Implementation in Shared-Memory Barrier before next time step Simple in OpenMP Only add compiler directives on sequential code Automatic chunking by compiler Not difficult in others (HPF, UPC, CAF, GAs etc.) Chunking done by programmer Different code form for remote array access
Access Conflicts in Shared Data Be careful! Add programming complexity
Implementations in Message-Passing Basic scheme of one time step Overlapping communication/computation Charm++ and GAs for every other processor i Send local atoms chunk processor i Receiving atoms chunk from processor i Compute force between local chunk and received one Update local atoms Barrier()
Other Paradigms Fit Moderately fit Probably not fit BSP STAPL DSM (Treadmarks, CID, CASHMERe) Probably not fit Cilk BSP: local computation->global communication->barrier Cilk: mainly for state-search problems (few communication between two processes) DSM: atoms chunk size less then one page! STAPL: atoms array can be stored in a vector so that fit good
Implementation Performance Simulation Input system not big and not real Overhead in multiple software layers Possible implementation issues Cause charm++ not good
Wait…think again! The above parallelization approach is not good for high-performance Possible wasteful communication
Better ideas of parallelization Space decomposition (1-away) based on cutoff
Pros/Cons of Better Ideas Reduced communication Update in every step change atoms’ position Re-decompose space at every step? Do every certain amount steps Possible imbalanced computation every cell contains different number of atoms
What We Needed for Implementation? Cell arrays (each with size of cutoff) A way to know which two cells need to exchange data A global table (can be indexed by two Cells) Neighbor list associated with each cell Load-balancing
Charm++ Perfectly Fits Better Ideas Object-oriented programming One-sided communication for overlapping computation Support for multidimensional array Data-driven computation Built-in load-balancing
Summary MD simulation is not very difficult to implement both in shared memory and message-passing paradigms Parallelization techniques matters Simple one not difficult to implement Better one motivates more features in language
Thank You!