Swarms and Bundles: Bioinformatics and Biostatistics on Biowulf. David Hoover, Scientific Computing Branch, Division of Computer System Services, CIT, NIH.
Embarrassingly Parallel Problems: GWAS with huge numbers of SNPs; sequence analysis, assembly, and mapping; testing and validating statistical models; protein folding and threading; molecular docking and compound screening; tomographic reconstruction.
Characterization of Surface Protein 3 from the Malaria Parasite P. falciparum: protein folding calculations with Rosetta++, 100,000 CPU hours (Tsai et al., Mol. Biochem. Parasitology, online preprint 2008).
How to run multiple independent processes in parallel: 16 independent processes, each of the form input -> command -> output.
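The pattern on this slide can be tried on a single machine before moving to a cluster. The sketch below is only an illustration, not how Biowulf schedules work: the child command (a small Python process that squares its input) and the names `run_one` and `run_all` are hypothetical stand-ins for a real analysis program.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_one(index: int) -> str:
    """Run one independent command with its own input and return its output."""
    # Hypothetical command: a child Python process that squares its argument.
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(int(sys.argv[1]) ** 2)", str(index)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_all(n_processes: int = 16, workers: int = 4) -> List[str]:
    """Run n_processes independent commands, at most `workers` at a time."""
    # The threads only wait on child processes; the children do the real work.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, range(1, n_processes + 1)))


if __name__ == "__main__":
    for i, output in enumerate(run_all(), start=1):
        print(f"job{i}: {output}")
```

Because no command depends on another, the results are the same for any value of `workers`; only the elapsed time changes.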
Biowulf Cluster Batch System: each of the 16 processes is submitted to the batch system as its own script and writes its own output (job1 -> job1.out, ..., job16 -> job16.out).
Swarm: biowulf% swarm -f file (diagram: job1 through job4 run on Node 1 through Node 4, one job per node, writing job1.out through job4.out).
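A command file for `swarm -f file` can be generated rather than typed by hand. The sketch below assumes the file holds one independent command per line, which the slide does not state. The program name `mycommand` and the `job{i}.in` input names are hypothetical placeholders; only the `job{i}.out` naming follows the diagram.

```python
from typing import List


def build_command_lines(n_jobs: int) -> List[str]:
    """Return one independent shell command per job."""
    # "mycommand" and the .in file names are placeholders, not real Biowulf tools.
    return [f"mycommand < job{i}.in > job{i}.out" for i in range(1, n_jobs + 1)]


def command_file_text(n_jobs: int) -> str:
    """Join the commands into the text of a command file (assumed: one per line)."""
    return "\n".join(build_command_lines(n_jobs)) + "\n"


if __name__ == "__main__":
    # Print the file contents; redirect to a file to use it with swarm -f.
    print(command_file_text(4), end="")
```

Saving the printed text to a file and passing that file to `swarm -f` would correspond to the four jobs shown in the diagram.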
Bundled Swarm: biowulf% swarm -f file -b 4 (diagram: a single job, job1, runs on Node 1 and writes job1.out).
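The grouping that bundling implies can be sketched in a few lines. The code assumes that `-b 4` places four consecutive commands from the file into one job, which then runs them one after another. It illustrates the idea only and is not swarm's actual implementation; the function name `bundle` is hypothetical.

```python
from typing import List


def bundle(commands: List[str], per_bundle: int) -> List[List[str]]:
    """Split commands into consecutive groups of at most per_bundle commands."""
    if per_bundle < 1:
        raise ValueError("per_bundle must be at least 1")
    return [
        commands[start:start + per_bundle]
        for start in range(0, len(commands), per_bundle)
    ]


if __name__ == "__main__":
    commands = [f"command{i}" for i in range(1, 5)]
    # Under the assumption above, each inner list would run sequentially as one job.
    for job_number, group in enumerate(bundle(commands, 4), start=1):
        print(f"job{job_number}: {' ; '.join(group)}")
```

With four commands and a bundle size of 4, the sketch yields a single job, matching the diagram; 16 commands would yield four jobs.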
Swarm Facts: written and maintained by Helix Systems Staff; swarm introduced in late [...]; [...]% of all batch jobs run on the cluster since 2002 are swarm jobs; ~60% of all wall time is spent on swarm jobs; swarm has been shared with clusters around the world.
Swarm World Records: largest swarm, 683,445 commands; largest bundle, 24,000 commands per CPU.
Future Challenges: How to deal with larger multicore nodes? (diagram: Node 1, Node 2, Node 3)