Cluster-Based Protein Folding
Douglas Fuller and Brandon McKethan
Overview: What Is It?
- Screensaver-based distributed computing program run by Stanford
- Uses otherwise idle processing power to fold proteins through a finite number of frames
- The same work units are run on multiple computers to confirm accuracy
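The redundancy idea in the last bullet can be sketched as a simple agreement check. This is a minimal illustration, not the project's actual validation logic: the function name `accept_result`, the `min_agreement` threshold, and the sample energies are all hypothetical.

```python
from collections import Counter

def accept_result(results, min_agreement=2):
    """Return the most common result if enough copies agree, else None.

    `results` holds the values returned by different machines for the
    same work unit (hypothetical data; exact equality is assumed here,
    whereas real floating-point results would need a tolerance).
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= min_agreement else None

# Three hypothetical clients report an energy for the same work unit.
print(accept_result([-42.7, -42.7, -41.9]))  # two of three agree
print(accept_result([-42.7, -41.9]))         # no two results agree
```

A work unit whose copies disagree would simply be handed out again.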
What Does It Accomplish?
- Helps show how a linear amino acid chain folds into a 3D protein structure
- Applied to proteins involved in many diseases to elucidate how misfolding occurs
- Will eventually lead to mutation-to-phenotype simulations
Computational Aspects
- Monte Carlo simulation using lowest-energy-state calculations
- Non-parallel, unimolecular program
- Heuristic approaches
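To make "Monte Carlo simulation using lowest-energy-state calculations" concrete, here is a minimal sketch of one common variant, the Metropolis scheme. The slides do not say which Monte Carlo method the program uses, so Metropolis is an assumption, and `toy_energy` is a hypothetical one-dimensional stand-in for a real protein energy function.

```python
import math
import random

def toy_energy(x):
    # Hypothetical double-well energy; stands in for a real force field.
    return (x * x - 1.0) ** 2 + 0.3 * x

def metropolis_minimize(energy, x0, steps=5000, step_size=0.5,
                        temperature=0.1, seed=0):
    """Random walk that tracks the lowest-energy state visited."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_new = energy(candidate)
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann probability so the walk can escape local minima.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = candidate, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

best_x, best_e = metropolis_minimize(toy_energy, x0=1.0)
print(best_x, best_e)
```

Each step depends on the previous one, which is one reason a single trajectory like this is hard to parallelize.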
Implications of the Heuristic/Unimolecular Approach
- The environment of the cell and molecular interactions
- Solvents and extramolecular interactions cannot be ignored in the process of folding
- Many diseases arise from misfolding that is not influenced by the internal energy state
Diseases of Protein Misfolding May Require Multimolecular Interactions
- Cancers: B-RAF, Hsp-90 and 17-AAG
- Prions: infectious proteins
  - BSE (mad cow disease)
  - Kuru
  - Sheep scrapie
Computational Weight of Multimolecular Interactions
- The number of energy states and inter-/intramolecular interactions is much higher than in the unimolecular case
- Pushes the computational return time beyond acceptable limits for the project
- Desktop computing and the lowest common denominator
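A rough count shows how quickly the interactions grow. The sketch below assumes a naive all-pairs model with no distance cutoff, which is a simplification; the function names and atom counts are illustrative only.

```python
def pair_count(n_atoms):
    """Number of distinct atom pairs among n_atoms atoms."""
    return n_atoms * (n_atoms - 1) // 2

def interaction_counts(atoms_per_molecule):
    """Return (intramolecular pairs, intermolecular pairs).

    Assumes every atom interacts with every other atom (no cutoff).
    """
    intra = sum(pair_count(n) for n in atoms_per_molecule)
    total = pair_count(sum(atoms_per_molecule))
    return intra, total - intra

print(interaction_counts([1000]))        # one molecule alone
print(interaction_counts([1000, 1000]))  # two molecules together
```

Under this model, adding a second molecule of equal size roughly quadruples the total number of pairs, and each further molecule or solvent atom adds more.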
Cluster-Based Folding and Future Aspects of Folding
- Use cluster computing as the testing ground for truly parallel simulations
- Individual proteins are discrete units
- Allows the program to be refined while highly parallel desktop computing comes to fruition
  - 5-10 year timeframe
Post-Multithreading Possibilities
- Next truly discrete unit is the atom itself
- Atom-per-processor modeling vs. Monte Carlo
- Requires an extremely high number of processors: hundreds of thousands
- Once again, clusters provide the testing ground
Parallelizing a Computation
- Considered “re-bugging” your code
- Distribute work to multiple processors
- Requires communication to deal with dependencies
- Requires computation to distribute work and recombine results
- Now what?
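The distribute-and-recombine steps can be sketched with Python's standard `concurrent.futures` module. The problem (a sum of squares) and the helper names are placeholders chosen for illustration. A thread pool is used to keep the sketch simple; CPU-bound work in Python would normally use processes or separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, n_parts):
    """Split items into n_parts nearly equal contiguous chunks."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def partial_sum_of_squares(chunk):
    # Work done independently by one worker.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(items, n_workers=4):
    chunks = split(items, n_workers)              # distribute work
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)                          # recombine results

print(parallel_sum_of_squares(list(range(10))))
```

Note that `split` and the final `sum` are extra computation that the serial version never needed.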
Domain Decomposition
- Decide how to divide work
  - Spatially
  - Temporally
  - Other?
- Introduces overhead
- Can pessimize instead of optimize
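A small spatial decomposition shows where the overhead comes from. The sketch below splits a one-dimensional three-point smoothing into blocks. The smoothing problem and function names are illustrative assumptions, and the blocks run one after another in a single process, so only the data layout is shown.

```python
def smooth_serial(values):
    """Three-point average with the two end points held fixed."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out

def smooth_decomposed(values, n_blocks):
    """Same computation, done block by block over a spatial split."""
    n = len(values)
    bounds = [n * b // n_blocks for b in range(n_blocks + 1)]
    out = []
    for b in range(n_blocks):
        lo, hi = bounds[b], bounds[b + 1]
        # Ghost cells: each block also needs one value from each
        # neighbouring block. On a real cluster this copy would be
        # a message between nodes, which is the added overhead.
        g_lo, g_hi = max(lo - 1, 0), min(hi + 1, n)
        local = values[g_lo:g_hi]
        for i in range(lo, hi):
            if i == 0 or i == n - 1:
                out.append(values[i])
            else:
                j = i - g_lo
                out.append((local[j - 1] + local[j] + local[j + 1]) / 3.0)
    return out

data = [float(x * x) for x in range(10)]
print(smooth_decomposed(data, 3) == smooth_serial(data))
```

The more blocks there are, the larger the share of ghost-cell traffic relative to useful work, which is how a decomposition can pessimize.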
Cheat!
- “Embarrassingly parallel” code
  - Splits naturally into small pieces
  - Small pieces can ignore each other
  - Each small piece can be computed by a single node
- Problem: fold all proteins they care about
- Decomposition: individual proteins
- Dependencies: none!
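With one protein per task and no dependencies, the parallel code reduces to a plain map. In the sketch below, `fold_protein` is a hypothetical stand-in that returns a toy score rather than running a real simulation, and the sequences are made up. The thread pool stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

HYDROPHOBIC = set("AVLIMFWC")

def fold_protein(sequence):
    """Hypothetical stand-in for one independent folding run.

    Returns the sequence with a toy score (its count of hydrophobic
    residues). A real run would be an expensive simulation.
    """
    return sequence, sum(1 for residue in sequence if residue in HYDROPHOBIC)

def fold_all(sequences, n_workers=4):
    # No task needs data from any other task, so there is no
    # communication beyond handing out inputs and collecting outputs.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(fold_protein, sequences))

print(fold_all(["AVL", "GSK", "MKTAYIAK"]))
```

Because `pool.map` preserves input order, results line up with the input list without any extra bookkeeping.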
Domain Decomposition: Challenges
- Analyze dependencies
  - Communication patterns
  - Communication volume
  - Data distribution
  - Overlap computation/communication
- Consider system characteristics
  - Communication latency/bandwidth
  - Computational efficiency
  - Computation/communication ratio
- Do this all ahead of time?
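A back-of-the-envelope cost model ties these factors together. The sketch below is a deliberately crude assumption: work divides perfectly, each node sends a fixed number of messages, and computation and communication do not overlap. All parameter names and numbers are illustrative.

```python
def parallel_time(work_s, n_nodes, msgs_per_node, latency_s,
                  bytes_per_node, bandwidth_bytes_per_s):
    """Estimated run time in seconds under a simple no-overlap model."""
    compute = work_s / n_nodes
    if n_nodes == 1:
        return compute  # a single node has nobody to talk to
    communicate = (msgs_per_node * latency_s
                   + bytes_per_node / bandwidth_bytes_per_s)
    return compute + communicate

def speedup(work_s, n_nodes, **comm):
    return work_s / parallel_time(work_s, n_nodes, **comm)

comm = dict(msgs_per_node=10, latency_s=0.001,
            bytes_per_node=1e6, bandwidth_bytes_per_s=1e8)

# Large job: computation dominates, so four nodes help.
print(speedup(100.0, 4, **comm))
# Tiny job: communication dominates, so four nodes are slower than one.
print(speedup(0.01, 4, **comm))
```

A speedup below 1 in the second case is the situation where parallelizing makes the run slower.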
Domain Decomposition: Pitfalls
- Parallel overhead
- Computation waiting on communication
- Feed-forward dependencies
- Dynamic decomposition schemes
- Pick two: performance, portability, scalability