Atomistic Protein Folding Simulations on the Submillisecond Timescale Using Worldwide Distributed Computing
Qing Lu
CMSC 838 Presentation
CMSC 838T – Presentation

Overview
- Overview of talk
  - Motivation
  - Challenge
  - Methods
    - Ensemble dynamics
    - Evaluation
  - Observations

Motivation
- Atomistic simulation of protein folding
  - Understand the dynamics of folding
  - Real-time folding in full atomic detail
  - Large-scale parallelization methods
- Benefits
  - Protein folding and disease
    - Proteins self-assemble in order to function
    - Protein misfolding leads to disease
  - Nanotechnology
    - Nanomachines
    - Self-assembly on the nanoscale

Challenge
- Difficulties
  - Limited by current computational techniques
    - The fastest proteins fold in microseconds
    - One CPU: about 1 ns/day, so about 30 years
    - 10,000-fold computational gap
      - 1,000 CPUs: 1 microsecond/day
  - Traditional parallelization scheme
    - Hard to scale to a large number of processors
    - Requires extremely fast communication
    - Complexity of coordination
    - Expensive supercomputers
      - Cost
      - Time-sharing
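
The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes a 10 microsecond folding time (chosen because it matches the quoted 10,000-fold gap at 1 ns/day) and ideal linear scaling across CPUs; the function name `days_to_simulate` is illustrative only.

```python
# Back-of-the-envelope check of the computational gap quoted on the slide.
NS_PER_US = 1_000  # nanoseconds per microsecond


def days_to_simulate(target_us, ns_per_day_per_cpu=1.0, n_cpus=1):
    """Wall-clock days needed to simulate target_us microseconds.

    Assumes ideal linear scaling across n_cpus, which traditional
    parallelization does not achieve in practice.
    """
    return target_us * NS_PER_US / (ns_per_day_per_cpu * n_cpus)


if __name__ == "__main__":
    # Assumption: a 10 microsecond folding event.
    one_cpu_days = days_to_simulate(10)
    print("one CPU:", one_cpu_days, "days =", one_cpu_days / 365.25, "years")
    print("1,000 CPUs:", days_to_simulate(10, n_cpus=1_000), "days")
```

Under these assumptions a single CPU needs 10,000 days, which is where the roughly 30-year figure comes from.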

Method
- Ensemble dynamics
  - A new simulation algorithm
  - Parallel simulation
- Heterogeneous network (the Internet)
  - Large-scale distributed platform

Simulation of Dynamics
- Free energy barrier
  - Progress from one state to another is a transition
  - Thermal fluctuations push the system over the free energy barrier
- Previous approaches: sampling
  - May get stuck in metastable free energy minima
  - Sampling is computationally expensive

Ensemble Dynamics
- Application scenario
  - Waiting time for transitions dominates the total time
  - Protein folding
    - Transition: free energy barrier crossing
  - Coupled simulations: transition coupling
- Algorithm
  - Start M independent simulations from an initial condition
  - Wait for the first simulation to cross the free energy barrier
    - Takes M times less than the average time to cross the barrier
  - Restart the M simulations from the new location after the transition
- Near-linear speedup in the number of processors
  - Exponential kinetics: f(t) = 1 - exp(-k t)
  - If k t is small, f(t) ≈ k t
  - With M simulations: M f(t) ≈ M k t folding events
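
The speedup claim can be illustrated numerically. The sketch below is not the presenter's code; it assumes each simulation's barrier-crossing time is exponentially distributed with rate k (in arbitrary inverse time units) and checks that the first crossing among M independent runs arrives, on average, M times sooner. All function names are illustrative.

```python
import math
import random


def folded_fraction(k, t):
    """Exponential kinetics: fraction of simulations that have crossed by time t."""
    return 1.0 - math.exp(-k * t)


def first_crossing_time(k, m, rng):
    """Time until the first of m independent simulations crosses the barrier."""
    return min(rng.expovariate(k) for _ in range(m))


def mean_first_crossing_time(k, m, trials=2000, seed=0):
    """Monte Carlo estimate of the mean first-crossing time for an ensemble of m."""
    rng = random.Random(seed)
    total = sum(first_crossing_time(k, m, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    k = 0.01  # assumed crossing rate
    for m in (1, 10, 100):
        estimate = mean_first_crossing_time(k, m)
        print(f"M={m:3d}  simulated={estimate:8.3f}  expected 1/(M k)={1 / (m * k):8.3f}")
```

The simulated mean should track 1/(M k), which is the near-linear speedup in M. The small-time approximation f(t) ≈ k t can be checked the same way with `folded_fraction`.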

Limitations
- Barrier crossing probability
  - Assumes exponential kinetics
- Correct transition detection
  - Transition: free energy barrier crossing
  - Signaled by a large variance in energy, compared against a threshold
  - Correct detection is not guaranteed
- Multiple possible transitions are not addressed
  - Selection of the first transition
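
To make the detection criterion concrete, here is one way a variance threshold might be applied. This is only a sketch under assumptions: the slide does not say how the variance is computed, so the sliding window, the population variance, and the names `detect_transition`, `window`, and `threshold` are all hypothetical.

```python
from statistics import pvariance


def detect_transition(energies, window, threshold):
    """Return the index of the first frame whose trailing window of energies
    has a variance above threshold, or None if no window exceeds it.

    Hypothetical criterion: a sliding-window population variance.
    """
    for end in range(window, len(energies) + 1):
        if pvariance(energies[end - window:end]) > threshold:
            return end - 1  # index of the last frame in the window
    return None


if __name__ == "__main__":
    # Toy energy trace: flat, then a jump to a different level.
    trace = [0.0] * 10 + [10.0] * 10
    print(detect_transition(trace, window=4, threshold=5.0))
```

The choice of threshold shows why detection is not guaranteed: set too high, real crossings are missed; set too low, ordinary fluctuations are reported as transitions.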

Distributed Computing
- Distributed simulations
  - M processors for each run
  - Simulate folding in atomic detail on each processor
  - Restart once a barrier-crossing event occurs
- Implementation
  - Worldwide distributed computing over the Internet
  - Started in October 2000
    - More than 200,000 participants
    - 10,000 CPU-years in the first 12 months

- Client-server architecture
  - Server assigns jobs (work units) to clients
  - Client sends back results after computation
  - ~100 KB of data transferred between client and server
- Why ensemble dynamics is a good fit
  - CPU-intensive jobs: a few hours, often days
  - Connection speed: a modem is good enough
  - Suitable for
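
The work-unit flow can be sketched in-process, without any networking. Everything here is hypothetical (the `WorkUnit`, `Result`, and `Server` classes, and the random stand-in for a molecular dynamics run); it only illustrates the pattern of a server handing M work units to clients and restarting the ensemble once any client reports a barrier crossing.

```python
import random
from dataclasses import dataclass


@dataclass
class WorkUnit:
    unit_id: int
    start_state: int  # hypothetical label for the current free-energy basin
    seed: int


@dataclass
class Result:
    unit_id: int
    crossed: bool


class Server:
    """Hands out one generation of work units and advances on the first crossing."""

    def __init__(self, n_clients, n_barriers):
        self.n_clients = n_clients
        self.n_barriers = n_barriers
        self.state = 0
        self._next_id = 0

    def done(self):
        return self.state >= self.n_barriers

    def assign(self):
        units = []
        for _ in range(self.n_clients):
            units.append(WorkUnit(self._next_id, self.state, seed=self._next_id))
            self._next_id += 1
        return units

    def collect(self, results):
        # Restart the whole ensemble from the new state if any client crossed.
        if any(r.crossed for r in results):
            self.state += 1


def run_work_unit(unit, p_cross):
    """Client side: a random draw stands in for hours of simulation."""
    rng = random.Random(unit.seed)
    return Result(unit.unit_id, rng.random() < p_cross)


def run_until_folded(server, p_cross, max_generations=10_000):
    generations = 0
    while not server.done() and generations < max_generations:
        results = [run_work_unit(u, p_cross) for u in server.assign()]
        server.collect(results)
        generations += 1
    return generations


if __name__ == "__main__":
    server = Server(n_clients=100, n_barriers=3)
    print("generations needed:", run_until_folded(server, p_cross=0.01))
```

Each work unit carries only a small description of the starting state and returns a small result, which is why a slow connection is not a bottleneck for this kind of job.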

Work
- Search for intelligent life outside Earth
  - Data analysis of signals
- Find drug therapies for HIV
  - How drugs interact with various HIV mutations
- Distributed projects
  - Divide and conquer
  - CPU-intensive jobs
  - Small pieces of data (kilobytes) transferred
  - Communication is not a major concern

Evaluation
- Based on the Tinker molecular dynamics code
  - Voluntary participants worldwide, over 400,000 CPUs
- Simulate folding and unfolding
  - Folding rates
  - Simulations on small proteins

Folding Rates (figure)

Folding & Unfolding (figure)

Observations
- Sampling
  - Too expensive to run for long timescales
  - Wastes too much time lingering in local energy minima
- Ensemble dynamics
  - Speeds up simulations of dynamics
  - Biological meaning of the simulation results?
  - Results on folding of large proteins?
  - Limitations: correct transition detection, transition probability
- A cheap way to achieve supercomputing power
  - Huge distributed computing platform: over 400,000 CPUs
  - An efficient approach for CPU-intensive jobs
- Complexity of problems and size of data increase rapidly
  - Finding better algorithms is preferable to buying supercomputers