Harnessing the Power of Condor for Human Genetics
  Bret A. Payseur
  Laboratory of Genetics
  University of Wisconsin
Our research: evolutionary genetics
  Analysis of DNA variation across human populations to understand:
    Roles of different evolutionary forces
    Prospects for finding genes that cause disease
  Analysis of crosses between mouse strains to understand:
    How anatomy evolves
    How new species arise
Our computational needs
  Multi-dimensional statistical inference: we measure many different (partially correlated) features of DNA variation
  Genome-scale analyses: we measure variation at thousands to millions of sites
  Replicates: we conduct population simulations to measure stochastic effects
Haplotype phasing
  Each human has two copies of each site on a chromosome (one from each parent)
  Example: Site 1 = A/T, Site 2 = G/C
Haplotype phasing
  We want to know which variant goes with which on the chromosome
  Example: Site 1 = A/T, Site 2 = G/C
Haplotype phasing
  Genotyping technology cannot distinguish between these two possibilities in individuals that vary at both sites
  Configuration 1: chromosomes A-G and T-C
  Configuration 2: chromosomes T-G and A-C
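The ambiguity can be made concrete with a small sketch that lists every haplotype pair consistent with an unphased genotype. This is only an illustration of the phasing problem, not the PHASE algorithm. The function name `phase_configurations` and the example genotype (Site 1 = A/T, Site 2 = G/C, read off the slide diagram) are assumptions made for this example.

```python
from itertools import product


def phase_configurations(genotype):
    """Enumerate the distinct haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one tuple per site.
    Returns a set of frozensets; each frozenset holds the two haplotype strings.
    """
    first, rest = genotype[0], genotype[1:]
    configs = set()
    # Fix the orientation of the first site so mirror images are not counted twice,
    # then try both orientations at every remaining site.
    for flips in product([False, True], repeat=len(rest)):
        hap1, hap2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            if flip:
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        configs.add(frozenset(("".join(hap1), "".join(hap2))))
    return configs


# Hypothetical individual that varies at both sites, as on the slide.
double_het = [("A", "T"), ("G", "C")]
for pair in sorted(sorted(p) for p in phase_configurations(double_het)):
    print(" / ".join(pair))
```

The number of possible configurations doubles with each additional heterozygous site, which is why a statistical method is needed for regions with about 100 variable sites.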
Solution: PHASE algorithm
  Uses a Markov chain Monte Carlo (MCMC) sampling scheme
  Uses coalescent simulations based on population genetic principles
  Identifies haplotypes for each individual with statistical uncertainty (posterior probability)
  State-of-the-art method in human genetics
Scope of problem
  Goal: reconstruct phase in a human dataset of genomic proportions
  Dataset is large
    720 regions of the genome
    100 variable sites per region
    3 populations
    60 individuals per population
  Computational approach is intensive
Scope of problem
  Average run time: 8 hours
  720 regions x 3 populations x 8 hours = 17,280 hours
Scope of problem
  Running full time on 5 Payseur lab computers: 144 days!
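The estimate can be reproduced with a few lines of arithmetic. The sketch below assumes, as the slide implies, that each lab computer runs one job at a time around the clock and that every run takes the 8-hour average; the variable names are illustrative.

```python
regions = 720          # regions of the genome
populations = 3        # populations analyzed per region
hours_per_run = 8      # average PHASE run time
lab_computers = 5      # machines available in the lab

total_runs = regions * populations               # one PHASE run per region and population
total_cpu_hours = total_runs * hours_per_run     # total compute time
wall_clock_days = total_cpu_hours / lab_computers / 24  # assumes perfect parallelism

print(f"{total_runs} runs, {total_cpu_hours} CPU hours, {wall_clock_days:.0f} days")
```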
ENTER CONDOR
Approach
  Create a submit file for each job (automated with a Perl script)
  Submit each job (automated with a Perl script)
CONDOR submit file
  universe = standard
  executable = PHASE
  error = phase.err
  log = phase.log
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files = phase.in
  transfer_output_files = phase.out
  Requirements = ((OpSys == "LINUX") && ((Arch == "INTEL") || (Arch == "X86_64")))
  Arguments = -MR -P1 phase.in phase.out
  queue
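The automation step might look like the sketch below. The talk used Perl scripts that are not shown, so this is a Python stand-in, not the lab's actual code. The template copies the submit file above, while the per-job naming scheme (`region001_popA`), the function names, and the output directory are hypothetical. Submission relies on the standard `condor_submit` command and is left as a separate function so that files can be generated and inspected first.

```python
import subprocess
from pathlib import Path

# Template modeled on the submit file from the slide; only the file names vary per job.
SUBMIT_TEMPLATE = """\
universe = standard
executable = PHASE
error = {job}.err
log = {job}.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = {job}.in
transfer_output_files = {job}.out
Requirements = ((OpSys == "LINUX") && ((Arch == "INTEL") || (Arch == "X86_64")))
Arguments = -MR -P1 {job}.in {job}.out
queue
"""


def job_name(region, population):
    """Hypothetical naming scheme: one job per region and population."""
    return f"region{region:03d}_{population}"


def write_submit_files(regions, populations, out_dir):
    """Write one submit file per (region, population) pair and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for region in regions:
        for population in populations:
            job = job_name(region, population)
            path = out_dir / f"{job}.sub"
            path.write_text(SUBMIT_TEMPLATE.format(job=job))
            paths.append(path)
    return paths


def submit_all(paths):
    """Submit each file with condor_submit (requires a working Condor installation)."""
    for path in paths:
        subprocess.run(["condor_submit", path.name], cwd=path.parent, check=True)
```

For the dataset described in the talk, `write_submit_files(range(1, 721), ["popA", "popB", "popC"], "submit")` would produce 2,160 submit files, with the population labels here being placeholders.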
Running on vanilla universe
  Huge increase in efficiency
  Challenge
    Run times often exceeded allocated CPU time
    Many jobs did not finish
CONDOR solution
  Relink PHASE with condor_compile to run in the standard universe, which allows checkpointing
  Expand machine pool to include X86_64/LINUX and INTEL/LINUX nodes
Result
                       Jobs finished    Required time
  Vanilla universe     500              2 months
  Standard universe    720              10 days
We have also used CONDOR to…
  Simulate genetic mapping of complex diseases in mice (Payseur and Place 2007; Genetics)
  Infer relationships among mouse strains used in biomedical research
We hope to use CONDOR for…
  EVERYTHING
Acknowledgments
  Miron Livny
  Zach Miller
  David Schwartz