Simulating extended time and length scales using parallel kinetic Monte Carlo and accelerated dynamics - Jacques G. Amar, University of Toledo

Presentation transcript:

Simulating extended time and length scales using parallel kinetic Monte Carlo and accelerated dynamics
Jacques G. Amar, University of Toledo
Kinetic Monte Carlo (KMC) is an extremely efficient method for carrying out dynamical simulations when the relevant thermally activated atomic-scale processes are known. It is used to model a variety of dynamical processes, from catalysis to thin-film growth.
Temperature-accelerated dynamics (TAD; Sorensen & Voter, 2000) may be used to carry out realistic simulations even when the relevant atomic-scale processes are extremely complicated and are not known in advance.
GOAL: to extend both of these techniques in order to carry out realistic simulations over larger system sizes and longer time scales.
*In collaboration with Yunsic Shim. Supported by NSF DMR.
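For reference, a minimal Python sketch (not from the talk) of a single rejection-free KMC step: an event is chosen with probability proportional to its rate and the clock advances by an exponentially distributed increment. The rate catalog shown is a hypothetical placeholder.

```python
import math
import random

def kmc_step(rates, t):
    """One rejection-free KMC step: choose an event with probability
    proportional to its rate, then advance the clock by dt = -ln(r)/R,
    where R is the total rate of all possible events."""
    total = sum(rates.values())
    r = random.uniform(0.0, total)
    cumulative = 0.0
    chosen = None
    for event, rate in rates.items():
        cumulative += rate
        if r <= cumulative:
            chosen = event
            break
    t += -math.log(1.0 - random.random()) / total
    return chosen, t

# Hypothetical rate catalog: one slow deposition event and two fast hop events.
rates = {"deposit_site_17": 1.0e-3, "hop_atom_4": 10.0, "hop_atom_9": 10.0}
event, t = kmc_step(rates, 0.0)
```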

Parallel Kinetic Monte Carlo
While standard KMC is extremely efficient, it is inherently a serial algorithm: no matter how large the system, only one event can be simulated at each step. In contrast, Nature is inherently parallel. We would like to use KMC to carry out simulations of thin-film growth over longer time and length scales. How can the KMC algorithm be "parallelized" in order to simulate larger system sizes and longer time scales?

Temperature-Accelerated Dynamics (TAD)
KMC simulations are limited by the requirement that a complete catalog of all relevant processes and their rate constants be specified; however, often not all relevant transition mechanisms are known. TAD allows realistic simulations of low-temperature processes over timescales of seconds and even hours. The computational work for TAD scales as N^3, where N is the number of atoms, so it can only be applied to extremely small systems (a few hundred atoms). How can the TAD algorithm be "parallelized" in order to simulate larger system sizes?

Parallel KMC - Domain Decomposition
Domain decomposition is a natural approach, since intuitively one expects that widely separated regions may evolve independently, "in parallel."
Problems: in parallel KMC, time evolves at different rates in different regions! How to deal with time synchronization? How to deal with conflicts between neighboring processors?

Parallel Discrete Event Simulation (PDES): Conservative Algorithm
Only update processors whose next event times correspond to local minima in the time horizon (Chang, 1979; Lubachevsky, 1985).
[Figure: time horizon for processors P1-P6 with pending event times t1-t6, measured from t = 0.]
Advantages: works for Metropolis Monte Carlo, since the acceptance probability depends on the local configuration but the event times do not.
Disadvantages: does not work for kinetic Monte Carlo, since event times depend on the local configuration. Fast events can "propagate" from processor to processor and lead to rollbacks.
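A small sketch of the conservative update rule just described, under the assumption that pending next-event times are stored in an array indexed by processor rank: a processor may proceed only when it holds a local minimum of the time horizon.

```python
def can_update(next_event_times, rank, neighbor_ranks):
    """Conservative PDES rule: processor `rank` may execute its pending
    event only if no neighboring processor has an earlier pending event."""
    return all(next_event_times[rank] <= next_event_times[n] for n in neighbor_ranks)

# Example time horizon for 4 processors in a ring: only the processors
# holding local minima (here ranks 1 and 3) may update this step.
times = [2.7, 0.9, 3.1, 1.4]
ready = [r for r in range(4) if can_update(times, r, [(r - 1) % 4, (r + 1) % 4])]
```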

Three approaches to parallel KMC
Rigorous algorithms:
Conservative asynchronous algorithm: Lubachevsky (1988), Korniss et al. (1999), Shim & Amar (2004)
Synchronous relaxation algorithm: Lubachevsky & Weiss (2001), Shim & Amar (2004)
Semi-rigorous algorithm:
Synchronous sublattice algorithm: Shim & Amar (2004)

Thin-film growth models studied
"Fractal model": deposition rate F per site per unit time, monomer hopping rate D, irreversible sticking/attachment (critical island size i = 1).
"Edge-diffusion model": same as above, with edge diffusion (relaxation) of singly bonded cluster atoms.
"Reversible attachment model": detachment of singly and multiply bonded atoms (bond-counting model).
D/F = 10^7
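To illustrate the kind of rate catalog these models imply, the sketch below assumes a standard bond-counting (Arrhenius) form for the hopping/detachment rate in the reversible model. The parameter names follow the slides (E_1, E_b), but the prefactor and the specific numbers are illustrative assumptions, not the rates actually used in the talk.

```python
import math

KB = 8.617e-5  # Boltzmann constant in eV/K

def hop_rate(n_bonds, T, nu0=1.0e12, E1=0.1, Eb=0.07):
    """Bond-counting sketch for the reversible-attachment model: an atom with
    n_bonds in-plane bonds hops (or detaches) with an Arrhenius rate whose
    barrier grows by Eb per bond.  Prefactor nu0 is a placeholder; E1 and Eb
    values are taken from the reversible-growth slide (T = 300 K)."""
    return nu0 * math.exp(-(E1 + n_bonds * Eb) / (KB * T))

# Monomer (0 bonds) vs singly bonded atom at 300 K: roughly a factor of 15 slower.
print(hop_rate(0, 300.0), hop_rate(1, 300.0))
```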

Methods of domain decomposition (2D)
Square decomposition (8 neighbors)
Strip decomposition (2 neighbors)
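A small sketch of what the two decompositions imply for communication partners, assuming periodic boundaries and a simple row-major numbering of processors: a 1D strip decomposition exchanges boundary data with 2 neighbors, a 2D square decomposition with 8.

```python
def strip_neighbors(rank, n_procs):
    """1D strip decomposition with periodic boundaries: 2 neighbors."""
    return [(rank - 1) % n_procs, (rank + 1) % n_procs]

def square_neighbors(rank, px, py):
    """2D square decomposition with periodic boundaries: 8 neighbors,
    including the diagonal ones."""
    ix, iy = rank % px, rank // px
    nbrs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nbrs.append(((ix + dx) % px) + ((iy + dy) % py) * px)
    return nbrs

print(strip_neighbors(0, 4))      # [3, 1]
print(square_neighbors(0, 4, 4))  # the 8 ranks surrounding processor 0
```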

Synchronous relaxation (SR) algorithm (Lubachevsky & Weiss, 2001)
All processors are 'in synch' at the beginning and end of each cycle. Iterative relaxation: at each iteration, processors use boundary information from the previous iteration. Relaxation is complete when the current iteration is identical to the previous iteration for all processors.
[Figure: one cycle (t = 0 to t = T) for two processors P1 and P2, showing their event times and a boundary event.]
Disadvantages:
Complex: requires keeping a list of all events and random numbers used in each iteration.
Algorithm does not scale: faster than the CA algorithm, but still slow due to global synchronization and the requirement of multiple iterations per cycle.
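A rough per-processor sketch of one SR cycle. Here run_cycle and exchange_boundary_info are assumed helpers (replaying the interval with a stored random-number stream, and swapping boundary-event lists with neighbors), and the convergence test is simplified to a local one; in the full algorithm it is global over all processors.

```python
def sr_relax_cycle(run_cycle, exchange_boundary_info, max_iter=100):
    """Sketch of one synchronous-relaxation cycle: repeatedly re-run the same
    interval [0, T) with the same random numbers, feeding in the boundary
    events produced by the previous iteration, until the boundary information
    received is identical to that of the previous iteration."""
    boundary_in = exchange_boundary_info(None)   # first guess: ignore neighbors
    previous = None
    local_events = []
    for _ in range(max_iter):
        local_events = run_cycle(boundary_in)            # replay cycle with stored RNG stream
        boundary_in = exchange_boundary_info(local_events)
        if boundary_in == previous:                      # identical to previous iteration: done
            break
        previous = boundary_in
    return local_events
```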

Parallel efficiency (PE) of the SR algorithm
The average calculation time per cycle (of length T) for the parallel simulation may be written:
t_av(N_p) = N_iter ⟨Δ⟩ (t_1p / n_av) + t_com
where ⟨Δ⟩/n_av ~ T^(-1/2) log(N_p)^(2/3), N_iter ~ T log(N_p), and t_com ~ (a + bT) log(N_p).
In the limit of zero communication time, fluctuations still play a role. Maximum PE:
PE_max = (1/N_iter) (n_av/⟨Δ⟩) ~ 1/log(N_p)
Optimize the PE by varying the cycle length T (feedback).

Parallel efficiency of the SR algorithm
[Figure: PE vs N_p for the fractal model and the edge-diffusion model; dashed curve: PE_ideal = 1/[ln(N_p)^1.1].]

Synchronous sublattice (SL) algorithm (Shim & Amar, 2004)
At the beginning of each synchronous cycle, one subregion (A, B, C, or D) is randomly selected. All processors update sites in the selected sublattice only, which eliminates conflicts between processors. Sublattice events in each processor are selected as in usual KMC. At the end of the synchronous cycle, processors communicate changes to neighboring processors.
2D (square) decomposition: 2 sends/receives per cycle. 1D (strip) decomposition: 1 send/receive per cycle.
Advantages: no global communication required; many events per cycle, so reduced communication overhead due to latency.
Disadvantages: not rigorous; PE still somewhat reduced due to fluctuations.

Synchronous sublattice algorithm (Shim & Amar, 2004)
[Figure: 4 processors, each divided into sublattices A, B, C, and D.]
Each processor sets its time t = 0 at the beginning of the cycle, then carries out KMC sublattice events (time increment Δt_i = -ln(r)/R_i) until the time of the next event exceeds the time interval T. Processors then communicate changes as necessary to neighboring processors.
The maximum time interval T is determined by the maximum possible single-event rate in the KMC simulation. For a simple model of deposition on a square lattice with deposition rate F per site and monomer hopping rate D, T = 1/D. Many possible events per cycle!
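A minimal per-processor sketch of one SL cycle under the description above. The helpers total_rate, execute_event, and exchange_boundary_changes are assumed, and shared_rng stands for a random-number generator seeded identically on all processors so that every processor selects the same sublattice.

```python
import math
import random

def sl_cycle(T, total_rate, execute_event, exchange_boundary_changes, shared_rng):
    """One synchronous-sublattice cycle: pick a sublattice (the same choice on
    every processor), run local KMC events restricted to that sublattice with
    time increments dt = -ln(r)/R until the next event would exceed T, then
    communicate boundary changes to neighboring processors."""
    sublattice = shared_rng.choice(["A", "B", "C", "D"])
    t = 0.0
    while True:
        R = total_rate(sublattice)                    # total rate of events in this sublattice
        if R <= 0.0:
            break
        t += -math.log(1.0 - random.random()) / R     # dt_i = -ln(r)/R_i
        if t > T:                                      # next event falls outside the cycle
            break
        execute_event(sublattice)                      # one KMC event, chosen as in serial KMC
    exchange_boundary_changes()                        # e.g. 1 send/receive for strip decomposition
```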

Comparison with serial results (fractal model, D/F = 10^5, L = 256)
1D strip decomposition; system size 256 x 256; processor size N_x x 256.
N_p = 4 (N_x = 64), N_p = 8 (N_x = 32), N_p = 16 (N_x = 16).

Reversible growth model: T = 300 K, D/F = 10^5, E_1 = 0.1 eV, and E_b = 0.07 eV. N_x = 64, N_y = 1024, N_p = 16.
[Figure: 128 by 512 portion of a 1k by 1k system.]

Parallel efficiency (PE) of the SL algorithm
The average time per cycle for the parallel simulation may be written:
t_av = t_1p + t_com + ⟨Δ⟩ (t_1p / n_av)
where ⟨Δ⟩ is the average delay per cycle due to fluctuations in the number of events in neighboring processors. The parallel efficiency (PE = t_1p / t_av) may be written:
PE = [1 + (t_com / t_1p) + ⟨Δ⟩/n_av]^(-1)
In the limit of no communication time, fluctuations still play an important role. Ideal PE:
PE_ideal = [1 + ⟨Δ⟩/n_av]^(-1), where ⟨Δ⟩/n_av ~ 1/n_av^(1/2)
[Figure: fluctuations in the numbers of events n_1 and n_2 in neighboring processors P1 and P2.]
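A toy evaluation of the efficiency expression above, using placeholder numbers rather than measured values, to show how communication overhead and fluctuations each cut into the efficiency.

```python
def sl_parallel_efficiency(t_1p, t_com, delta, n_av):
    """PE = [1 + t_com/t_1p + <Delta>/n_av]^(-1) from the slide; setting
    t_com = 0 gives the ideal efficiency limited only by fluctuations."""
    return 1.0 / (1.0 + t_com / t_1p + delta / n_av)

# Illustrative numbers: communication costing 10% of the per-cycle compute
# time and a fluctuation delay of 5 events out of 100 give PE ~ 0.87,
# while the zero-communication (ideal) limit is ~ 0.95.
print(sl_parallel_efficiency(t_1p=1.0, t_com=0.1, delta=5.0, n_av=100.0))
print(sl_parallel_efficiency(t_1p=1.0, t_com=0.0, delta=5.0, n_av=100.0))
```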

Results for ⟨Δ⟩/n_av (fractal model)
D/F dependence (N_p = 4): ⟨Δ⟩/n_av ~ (D/F)^(1/3)
N_p dependence (D/F = 10^5): ⟨Δ⟩/n_av saturates for large N_p

Parallel efficiency as a function of D/F (N_p = 4)
[Figure: PE_max vs D/F for the fractal model and the edge-diffusion model; the curve PE_max = 1/[1 + (D/F)^(1/3)/(N_x N_y)^(1/2)] is shown for comparison.]

Parallel efficiency as a function of N_p (D/F = 10^5)

Comparison of the SR and SL algorithms (fractal model, D/F = 10^5, N_x = 256, N_y = 1024)

Summary
We have studied 3 different algorithms for parallel KMC: conservative asynchronous (CA), synchronous relaxation (SR), and synchronous sublattice (SL).
The CA algorithm is not efficient due to rejection of boundary events.
The SL algorithm is significantly more efficient than the SR algorithm.
SR algorithm (global synchronization): PE ~ 1/log(N_p)^δ, where δ ≥ 1.
SL algorithm (local synchronization): PE independent of N_p!
For all algorithms, communication time, latency, and fluctuations play a significant role.
For more complex models, we expect that the parallel efficiency of the SR and SL algorithms will be significantly increased.

Future work
Extend the SL algorithm to simulations with realistic geometry in order to carry out pKMC simulations of Cu epitaxial growth, properly including fast processes such as edge diffusion.
Apply the SR and SL algorithms to parallel TAD simulations of Cu/Cu(100) growth at low temperature (collaboration with Art Voter): vacancy formation and mound regularization in low-temperature metal epitaxial growth.
Develop a hybrid algorithm combining the SR and SL algorithms.
Develop a local SR algorithm.
Implement the SL and SR algorithms on shared-memory machines.