Parallel Genehunter: Implementation of a linkage analysis package for distributed memory architectures
Michael Moran
CMSC 838T Presentation
May 9, 2003
Introduction
• Goals
  - Link genes to specific loci in the genome
  - Decrease time and memory requirements through parallelization
• Motivation
  - Locate genes for specific phenotypes
  - Test for inherited diseases and risk factors
  - Gene therapy
Talk Overview
• Introduction
• Talk Overview
• Genetic Linkage Problem
• Previous Work
• Parallel Genehunter
• Evaluation
• Observations
Genetic Linkage Problem
• Sexual reproduction
  - Offspring are created from two haploid gametes
  - Gametes are produced from diploid/polyploid cells during meiosis
Genetic Linkage Problem
• Recombination occurs in two ways
  1. Random segregation of chromatids
     - 2 x 23 human chromosomes => $2^{23}$ possible haploid combinations
     - Genes on different chromosomes recombine with probability 1/2
Genetic Linkage Problem
• Recombination occurs in two ways
  1. Random segregation of chromatids
  2. Crossover between homologous pairs of chromosomes
     - Genes on the same chromosome recombine with a probability that depends on their distance and location on the chromosome
Genetic Linkage Problem
• Given
  - This model of recombination
  - Data for a particular pedigree (family)
    - Phenotype information for each individual
    - Genetic markers for each individual
  - Recombination frequencies for each pair of markers
• Can we apply probabilistic methods to
  - Reconstruct the inheritance patterns
  - Link phenotypes to the markers
Previous Work
• Fisher, Haldane, Smith, Morton ( )
  - Methods to infer genetic maps using maximum likelihood estimators
• Elston, Stewart (1971)
  - Genetic linkage algorithm
    - Linear in pedigree size
    - Exponential in number of markers
• Lander, Green (1987)
  - Genetic linkage algorithm
    - Linear in number of markers
    - Exponential in pedigree size
Previous Work
• Genehunter (2001)
  - Implementation of Lander & Green
  - Analyzes a pedigree containing n non-founders
  - The inheritance of a gene by one non-founder can be summarized by two bits
  - The entire pedigree’s inheritance pattern can be summarized by 2n bits
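The two-bits-per-non-founder encoding can be sketched as packing bit pairs into a single integer. This is a minimal illustration only; the function names and the bit order (paternal bit first) are assumptions, not taken from Genehunter.

```python
def encode_inheritance(bit_pairs):
    """Pack (paternal_bit, maternal_bit) pairs, one per non-founder, into an integer.

    The bit order is a hypothetical choice made for this sketch.
    """
    vector = 0
    for paternal, maternal in bit_pairs:
        vector = (vector << 2) | (paternal << 1) | maternal
    return vector


def decode_inheritance(vector, n_nonfounders):
    """Recover the list of (paternal_bit, maternal_bit) pairs from the integer."""
    pairs = []
    for k in range(n_nonfounders):
        shift = 2 * (n_nonfounders - 1 - k)
        two_bits = (vector >> shift) & 0b11
        pairs.append((two_bits >> 1, two_bits & 1))
    return pairs


def pattern_count(n_nonfounders):
    """Number of distinct inheritance patterns: 2^(2n)."""
    return 1 << (2 * n_nonfounders)
```

With n = 2 non-founders there are 16 patterns, and each one is simply an index into a probability vector of that length.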
Previous Work
• 3 steps of Genehunter:
  - Step 1: For each marker, calculate the probability of each possible inheritance pattern. Store the probabilities in a vector of size $2^{2n}$.
  - Example (0: grandfather’s chromatid, 1: grandmother’s chromatid):
    Pr([0,0]) = 0.5, Pr([0,1]) = 0.5, Pr([1,0]) = 0, Pr([1,1]) = 0
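A toy version of step 1 reproduces the example probabilities above. It assumes, purely for illustration, that every pattern compatible with the marker data is equally likely; real marker data would weight patterns by genotype likelihoods. The `is_compatible` predicate is hypothetical.

```python
def marker_probabilities(n_bits, is_compatible):
    """Return a probability vector over all 2^n_bits inheritance patterns.

    Simplifying assumption: compatible patterns share the probability mass
    equally and incompatible patterns get zero.
    """
    size = 1 << n_bits
    weights = [1.0 if is_compatible(v) else 0.0 for v in range(size)]
    total = sum(weights)
    if total == 0:
        raise ValueError("no inheritance pattern is compatible with the data")
    return [w / total for w in weights]


# Two-bit example: only patterns whose first bit is 0 are compatible.
probs = marker_probabilities(2, lambda v: (v >> 1) == 0)
```

Here `probs` lists Pr([0,0]), Pr([0,1]), Pr([1,0]), Pr([1,1]) in that order.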
Previous Work
• 3 steps of Genehunter:
  - Step 2: For each marker, calculate the probability of each inheritance pattern conditional on all of the markers to the left, and on all of the markers to the right
  - For two markers’ inheritance vectors, each disagreeing bit requires a crossover event
  - The probability of transitioning between inheritance vectors i, j differing in d bits is $\theta^{d}(1-\theta)^{2n-d}$, where $\theta$ is the recombination fraction between the two markers
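The transition probability can be sketched from the rule that each disagreeing bit costs one crossover. The form used below, theta^d * (1 - theta)^(bits - d) with theta the recombination fraction between the two markers, is the standard Lander-Green expression and is stated here as an assumption; the function names are illustrative.

```python
def hamming_distance(i, j):
    """Number of bit positions in which inheritance vectors i and j disagree."""
    return bin(i ^ j).count("1")


def transition_probability(i, j, n_bits, theta):
    """Probability of moving from pattern i to pattern j between two markers.

    Each of the d disagreeing bits needs a recombination (probability theta);
    each of the remaining n_bits - d bits needs none (probability 1 - theta).
    """
    d = hamming_distance(i, j)
    return theta ** d * (1.0 - theta) ** (n_bits - d)
```

For any starting pattern, the probabilities over all destination patterns sum to 1, since the sum expands to (theta + (1 - theta)) raised to the number of bits.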
Previous Work
• 3 steps of Genehunter:
  - Step 2: For each marker, calculate the probability of each inheritance pattern conditional on all of the markers to the left, and on all of the markers to the right
  - $M_{i,j}$ = probability of transitioning between inheritance vectors i and j
  - $P_1$, $P_2$ = probability vectors over every inheritance pattern given markers 1 and 2, respectively
  - $P_{2|1} = P_2 \circ (M P_1)$, where $\circ$ is the element-wise product
  - Calculate the probabilities of each marker’s inheritance conditional on all others by Markov chain or FFT convolution
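A direct (quadratic-time) sketch of the update P_{2|1} = P_2 (M P_1) follows. It assumes the product with P_2 is element-wise, that M has entries theta^d * (1 - theta)^(bits - d), and it leaves the result unnormalized, as in the formula; all names are illustrative.

```python
def transition_matrix(n_bits, theta):
    """Build M with M[j][i] = theta^d * (1 - theta)^(n_bits - d),
    where d is the Hamming distance between patterns i and j (assumed form)."""
    size = 1 << n_bits
    matrix = []
    for j in range(size):
        row = []
        for i in range(size):
            d = bin(i ^ j).count("1")
            row.append(theta ** d * (1.0 - theta) ** (n_bits - d))
        matrix.append(row)
    return matrix


def condition_on_neighbor(p1, p2, theta):
    """Return the unnormalized vector P_2 * (M P_1), multiplied element by element."""
    size = len(p1)
    n_bits = size.bit_length() - 1  # size is assumed to be a power of two
    m = transition_matrix(n_bits, theta)
    mp1 = [sum(m[j][i] * p1[i] for i in range(size)) for j in range(size)]
    return [p2[j] * mp1[j] for j in range(size)]


result = condition_on_neighbor([0.5, 0.5, 0.0, 0.0], [0.25] * 4, theta=0.1)
```

The matrix has one entry per pair of patterns, which is why the explicit product is replaced by a transform-based convolution for realistic pedigrees.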
Previous Work
• 3 steps of Genehunter:
  - Step 3: For each marker, calculate the probability of the unknown gene being located at specific locations
  - Hypothesizes that the phenotype has a gene located at a particular location; by default tries 5 evenly spaced locations between each consecutive pair of markers
  - Calculates $P_D$, the probabilities of each inheritance pattern based on this phenotype (as in step 1)
  - For a location x between markers i and i+1, $p = \sum_{v} P_D(v)\, P_{x|1 \ldots i}(v)\, P_{x|i+1 \ldots m}(v)$, summing over inheritance patterns v
• Space requirement: $O(2^{2n})$, reduced to $O(2^{2n-f})$ by exploiting the symmetry of f founders
• Time requirement: $O(m 2^{2n})$, reduced to $O(m 2^{2n-f})$ with f founders
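To make the exponential space requirement concrete, the sketch below computes the size of one probability vector of length 2^(2n-f). The 8-byte entry size, the function name, and the example pedigree (14 non-founders, 4 founders) are assumptions chosen for illustration.

```python
def inheritance_vector_bytes(n_nonfounders, n_founders, bytes_per_entry=8, n_procs=1):
    """Bytes needed per processor for one probability vector of length 2^(2n - f).

    Assumes one floating-point entry of `bytes_per_entry` bytes per pattern and
    an even split across `n_procs` processors.
    """
    bits = 2 * n_nonfounders - n_founders
    total = (1 << bits) * bytes_per_entry
    return total // n_procs


serial_bytes = inheritance_vector_bytes(14, 4)                # 24-bit problem
per_proc_bytes = inheritance_vector_bytes(14, 4, n_procs=32)  # split 32 ways
```

Each added non-founder contributes two bits and therefore multiplies this figure by four.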
Parallel Genehunter
• Approach
  - Parallelize the 3 Genehunter steps separately
  - Divide each $2^{2n}$-sized marker vector evenly among the P processors
    - Allows greater distribution of memory than assigning O(m/P) entire vectors to each processor
Parallel Genehunter
• Parallelization of step 1
  - For each marker, calculate the probability of each possible inheritance pattern
  - Each processor calculates the probabilities for a particular set of $2^{2n}/P$ inheritance patterns for every marker
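The even split of each vector can be sketched as a contiguous block of pattern indices per processor. A contiguous block layout is an assumption made for this sketch; the slides only state that each processor handles 2^(2n)/P patterns.

```python
def local_range(rank, n_procs, n_bits):
    """Indices of the inheritance patterns owned by processor `rank`.

    Assumes a contiguous block layout and that n_procs divides 2^n_bits.
    """
    size = 1 << n_bits
    if size % n_procs != 0:
        raise ValueError("n_procs must divide the vector length")
    chunk = size // n_procs
    return range(rank * chunk, (rank + 1) * chunk)


# With 4 processors and a 4-bit problem, each rank owns 4 of the 16 patterns.
blocks = [list(local_range(rank, 4, 4)) for rank in range(4)]
```

Each processor then evaluates step 1 only for the indices in its own range, for every marker.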
Parallel Genehunter
• Parallelization of step 2
  - For each marker, calculate the probability of each inheritance pattern conditional on all of the markers to the left, and on all of the markers to the right
  - FFT convolution
    - As in serial Genehunter, the $2^{2n} \times 2^{2n}$ matrix-vector multiplication is replaced by an FFT-based convolution:
      1. 2 forward 1D FFTs on $2^{2n}$-length vectors
      2. Element-by-element multiplication
      3. Inverse FFT
    - Each 1D FFT is equivalent to a 2D FFT on a $P \times 2^{2n}/P$ matrix
    - There are well-known distributed algorithms for this FFT using all-to-all communication
  - Element-wise product in $P_{2|1} = P_2 \circ (M P_1)$
    - Trivially parallelized: each processor holds the same portion of each vector
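The three convolution steps and the 1D-to-2D equivalence can be sketched in a single process. This sketch assumes the relevant transform is the Walsh-Hadamard transform (the Fourier transform for bit vectors combined by XOR), since the transition probability depends only on which bits differ; the slides say only "FFT". The transpose stands in for all-to-all communication, and all names are illustrative.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a list of length 2^k."""
    a = list(values)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for k in range(start, start + h):
                x, y = a[k], a[k + h]
                a[k], a[k + h] = x + y, x - y
        h *= 2
    return a


def xor_convolve(f, g):
    """result[j] = sum_i f[i ^ j] * g[i], via two forward transforms,
    an element-by-element multiplication, and an inverse transform."""
    size = len(f)
    f_hat = fwht(f)
    g_hat = fwht(g)
    product = [x * y for x, y in zip(f_hat, g_hat)]
    return [x / size for x in fwht(product)]  # inverse = forward / size


def fwht_by_blocks(values, n_procs):
    """Same transform computed as a 2D transform on an n_procs x (N / n_procs) matrix.

    Row transforms are local to each simulated processor; the transposes
    play the role of all-to-all exchanges. n_procs must be a power of two.
    """
    cols = len(values) // n_procs
    rows = [fwht(values[r * cols:(r + 1) * cols]) for r in range(n_procs)]
    columns = [fwht(list(col)) for col in zip(*rows)]
    return [x for row in zip(*columns) for x in row]
```

Because the transform factors bit by bit, transforming the low-order bits within each block and then the high-order bits across blocks gives the same result as one long transform.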
Parallel Genehunter
• Parallelization of step 3
  - For each marker, calculate the probability of the unknown gene being located at specific locations
  - Computing $P_{x|1 \ldots i}$ and $P_{x|i+1 \ldots m}$
    - FFTs parallelized as in step 2
  - Final dot product $p = \sum_{v} P_D(v)\, P_{x|1 \ldots i}(v)\, P_{x|i+1 \ldots m}(v)$
    - Parallelized as in step 2: each processor holds the same portion of each vector
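Because every processor holds the same index range of all three vectors, the final product reduces to local partial sums followed by one global sum. The sketch below simulates this in a single process; the final `sum` stands in for a reduction across processors, and the function names and contiguous layout are assumptions.

```python
def local_partial(p_d, p_left, p_right, indices):
    """Partial sum of P_D(v) * P_left(v) * P_right(v) over one processor's indices."""
    return sum(p_d[v] * p_left[v] * p_right[v] for v in indices)


def distributed_triple_product(p_d, p_left, p_right, n_procs):
    """Simulate the parallel computation of p with n_procs processors.

    Assumes n_procs divides the vector length and a contiguous block layout.
    """
    chunk = len(p_d) // n_procs
    partials = [
        local_partial(p_d, p_left, p_right, range(r * chunk, (r + 1) * chunk))
        for r in range(n_procs)
    ]
    return sum(partials)  # stands in for a global reduction


p_d = [0.5, 0.25, 0.125, 0.125]
p_left = [0.25, 0.25, 0.25, 0.25]
p_right = [0.5, 0.5, 0.0, 0.0]
p = distributed_triple_product(p_d, p_left, p_right, n_procs=2)
```

Only one number per processor crosses the network in this step, so its communication cost is negligible next to the all-to-all exchanges of the transforms.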
Evaluation
• Experimental environment
  - Input data sets
    - Pedigree with 51 family members
    - {19,21,24}-bit data sets (# bits = 2n-f)
  - Computing facilities
    - Cplant cluster (Sandia National Laboratories)
      - DEC Alpha EV6 processors
      - Myrinet connection
Evaluation
• Runtimes for the 19-, 21-, and 24-bit problems
Observations
• Pro: Performs the Genehunter computation exactly
• Pro: Effective for “multipoint linkage” of phenotypes
• Con: Old-fashioned compared to protein-based methods (?)
• Pro: Distributes memory requirements
• Pro: More computers allow larger feasible inputs
• Con: Experiments based on 1 pedigree
• Pro: Efficient parallelization up to 32 or 64 processors
• Con: Allows pedigrees to grow by only 3 or 4 individuals in equal time
References
• Genetic Recombination
  - Dr. Craig Woodworth, Genetic Recombination in Eukaryotes, Lecture Notes, (
• Genehunter
  - K. Markianos, M.J. Daly, & L. Kruglyak. Efficient Multipoint Linkage Analysis Through Reduction of Inheritance Space. American Journal of Human Genetics 68,
• Parallel Genehunter
  - G. Conant, S. Plimpton, W. Old, A. Wagner, P. Fain, & G. Heffelfinger. Parallel Genehunter: Implementation of a Linkage Analysis Package for Distributed-Memory Architectures. Proceedings of the First IEEE Workshop on High Performance Computational Biology, International Parallel and Distributed Computing Symposium, 2002.
Questions?