Estimating Genealogies from Marker Data Dario Gasbarra Matti Pirinen Mikko Sillanpää Elja Arjas Biometry Group Department of Mathematics and Statistics.

Estimating Genealogies from Marker Data Dario Gasbarra Matti Pirinen Mikko Sillanpää Elja Arjas Biometry Group Department of Mathematics and Statistics University of Helsinki SCB Workshop 14.12.2006 Isaac Newton Institute, Cambridge

Outline of the presentation
- The problem
- Description of the method
  - Probability model
  - Computational aspects
- Example 1: unlinked markers
  - relatedness estimation (with a pedigree)
  - Gasbarra et al. (2006): Estimating Genealogies from Unlinked Marker Data: a Bayesian Approach (under revision)
- Example 2: linked markers
  - haplotyping
  - relatedness estimation (with IBD alleles)
  - Gasbarra et al. (2006): Estimating Genealogies from Linked Marker Data: a Bayesian Approach (under preparation)

A basic question in statistical genetics
- Consider a population evolving in time
- Inverse problem
  - Current state of the process is known
    - individuals alive at the moment
  - What was the path leading to this state?
    - family structures (pedigree)
    - inheritance patterns

Why is the recent past important?
- Relatedness estimation
  - In which parts of the genome does a group of individuals share alleles that are identical by descent (IBD)?
  - gene mapping
- Haplotyping
  - Ancestral meioses have formed the haplotypes of the contemporary individuals

Current methods on KNOWN pedigrees
- Exact calculations on known pedigrees
  - Elston-Stewart algorithm
    - a few markers, not too complex pedigrees
  - Lander-Green algorithm
    - small pedigrees, many markers
- Approximate calculations on known pedigrees
  - McMC methods (e.g. Simwalk2 [Sobel et al.], Loki [Heath])

What if the pedigree is not known? There may be only partial pedigree data available. Small pedigrees might share common ancestors already within a couple of generations backwards in time

What we do …
- Consider a sample of individuals from a population
  - Genotype data on (possibly linked) markers
- Model the pedigree and the gene flow explicitly, applying a construction which proceeds backwards in time
  - Recombinations modelled based on genetic distance
  - Non-random mating allowed
- Devise an McMC sampler with good mixing properties
  - For computational reasons, the reconstruction extends only tens of generations backwards in time

… and what we hope to get
- Obtain useful summary statistics
  - E.g. estimates of IBD probabilities between pairs of sampled individuals
- Use the algorithm to perform numerical integration over model unobservables
  - E.g. in gene mapping, when combined with a phenotype model, to account for shared ancestry

The frame of study
Assume that we have fixed
- A population whose size we know for T-1 (non-overlapping) generations backwards in time (T~10)
- N sampled individuals from the current generation
- A marker map with M markers and known recombination fractions
- Allele frequencies at the population level for each of the markers

A (prior) model for a possible history
- A configuration C consists of
  - a pedigree
  - allelic paths
- Specify probabilities for
  - Pedigree graph, P_g(C)
  - Recombination events, P_r(C)
  - Founder alleles, P_a(C)
- The total probability for C is P(C) = P_g(C) x P_r(C) x P_a(C)

A probability model for pedigrees
For fixed
- number of generations, T-1, backwards in time
- population size in each generation (number of ♂ and ♀)
- sample of size N from the current generation
- mating parameters α and β
To simulate a pedigree from the distribution:
- Proceed generation by generation from 0,…,T-1.
- Let children choose parents according to a Pólya urn scheme, where α affects the correlation of choices of fathers and β affects the correlation of choices of mothers given the choices of fathers.
Gasbarra D, Sillanpää M, Arjas E (2005) Backward Simulation of Ancestors of Sampled Individuals. Theor Pop Biol 67:75-83.

Children choosing fathers
- Suppose k children have chosen their fathers from among the N_m males of the population
- Ch(m) is the number of children that have chosen male m
- P(child k+1 chooses father m) ∝ α + Ch(m)
- Small α implies dominant males
- Large α implies that the number of offspring does not vary much between different males
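A minimal Python sketch of the father-choosing rule above, assuming only what the slide states: each child in turn picks male m with probability proportional to α + Ch(m). The function name `choose_fathers`, the population sizes and the two α values are illustrative assumptions, not taken from the presentation.

```python
import random

def choose_fathers(n_children, n_males, alpha, rng):
    """Sequentially assign fathers with P(next child picks m) proportional to alpha + Ch(m)."""
    counts = [0] * n_males              # Ch(m): children who have chosen male m so far
    fathers = []
    for _ in range(n_children):
        weights = [alpha + c for c in counts]
        m = rng.choices(range(n_males), weights=weights, k=1)[0]
        counts[m] += 1
        fathers.append(m)
    return fathers

rng = random.Random(1)
few = choose_fathers(200, 20, 0.05, rng)    # small alpha: a few dominant males
even = choose_fathers(200, 20, 50.0, rng)   # large alpha: offspring spread evenly
print("distinct fathers, small alpha:", len(set(few)))
print("distinct fathers, large alpha:", len(set(even)))
```

With a small α the early choices are reinforced strongly, so most children end up sharing a handful of fathers; with a large α the counts Ch(m) barely change the weights and the choice is close to uniform.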

Children choosing mothers
- Suppose k children have chosen their mothers from among the N_f females of the population
- Ch(m,f) is the number of children who have chosen male m and female f as their parents
- P(child k+1 chooses mother f | the father of child k+1 is m) ∝ Ch(m,f) + β
- Small β implies faithful males (monogamy in large populations)
- Large β implies random mating
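The mother-choosing rule can be sketched the same way. This is an illustration under the stated rule only (weight Ch(m,f) + β given the father m); the name `choose_mothers`, the fixed father assignment and the β values are assumptions made for the example.

```python
import random

def choose_mothers(fathers, n_females, beta, rng):
    """fathers[i] is the father already chosen by child i.
    P(child picks mother f | father m) is proportional to Ch(m, f) + beta."""
    pair_counts = {}                    # Ch(m, f): children with parents (m, f) so far
    mothers = []
    for m in fathers:
        weights = [pair_counts.get((m, f), 0) + beta for f in range(n_females)]
        f = rng.choices(range(n_females), weights=weights, k=1)[0]
        pair_counts[(m, f)] = pair_counts.get((m, f), 0) + 1
        mothers.append(f)
    return mothers

rng = random.Random(2)
fathers = [i % 5 for i in range(100)]                 # hypothetical: 5 fathers, 20 children each
faithful = choose_mothers(fathers, 50, 0.001, rng)    # small beta: near-monogamy
mixed = choose_mothers(fathers, 50, 1000.0, rng)      # large beta: close to random mating
print("distinct couples, small beta:", len(set(zip(fathers, faithful))))
print("distinct couples, large beta:", len(set(zip(fathers, mixed))))
```

Counting distinct (father, mother) couples is one simple way to see the effect of β: few couples when β is small, many when β is large.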

Examples with different parameters Left: a few dominant males + monogamy Middle: a few dominant males Right: Random mating

Probability for allelic paths
- For each non-founder haplotype in the pedigree form the expression
- Take the product of these over all haplotypes to obtain P_r(C)
- Consider all founder alleles and take the product of the corresponding population allele frequencies to get P_a(C) (founders are assumed to be in Hardy-Weinberg and linkage equilibrium)

Data
- Assume that we also have genotype data of the sampled individuals on M markers
- The (posterior) probability in our model is π(C) ∝ P_g(C) x P_r(C) x P_a(C) x I(C consistent with the data)
- We are able to sample efficiently from the prior but not from the posterior

Markov chain Monte Carlo sampling
- We generate a Markov chain whose state space consists of all configurations consistent with the data and whose stationary distribution is our posterior (Metropolis-Hastings algorithm)
- Highly dependent variables (close relatives and linked markers) require large block updates
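As a toy illustration of sampling from a prior restricted by a consistency indicator, the sketch below runs Metropolis-Hastings on ten integer "configurations". Everything here is an assumption for illustration: the states, the unnormalised prior, the rule that even states are "consistent", and the uniform (symmetric) proposal. The actual sampler uses pedigree and allele-path block proposals, which are not reproduced here.

```python
import random

def is_consistent(state):
    """Hypothetical stand-in for I(C consistent with the data)."""
    return state % 2 == 0

def metropolis_hastings(prior, start, n_iter, rng):
    """Target is proportional to prior[s] * I(s consistent); proposal is uniform over all states."""
    states = list(prior)
    x = start
    chain = []
    for _ in range(n_iter):
        y = rng.choice(states)                      # symmetric proposal
        # Inconsistent proposals have target probability zero and are always rejected.
        if is_consistent(y) and rng.random() < min(1.0, prior[y] / prior[x]):
            x = y
        chain.append(x)
    return chain

rng = random.Random(3)
prior = {s: s + 1 for s in range(10)}               # unnormalised toy prior
chain = metropolis_hastings(prior, start=0, n_iter=20000, rng=rng)
# Restricted to {0, 2, 4, 6, 8} the weights are 1, 3, 5, 7, 9 (sum 25), so state 8 has mass 9/25.
print("empirical frequency of state 8:", chain.count(8) / len(chain))
```

Because the chain starts in a consistent state and rejects every inconsistent proposal, it never leaves the consistent set, which mirrors the state space described above.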

Proposals
Different versions of proposals
- A (randomly chosen) group of children chooses (possibly new) parents and transmits their alleles to these parents
- All children of a fixed father/mother choose a (possibly new) mother/father and transmit their alleles to her/him
- One child at a time chooses parent(s) and transmits alleles
- All children within the group jointly choose new parents and transmit alleles
- Pedigree is not changed but new allele paths are proposed

Schematic representation of some updates in the MCMC algorithm

Example 1: Relatedness estimation with unlinked markers
- Simulated data
  - 20 generations ago a single founder population divided into 3 population isolates
  - Our sample contains 10 sibships of 3 individuals from each of the 3 populations (i.e. 90 individuals altogether)

Relatedness matrix estimated from pedigrees

Qualitative reconstruction with dendrogram

Same data analyzed by STRUCTURE (panels: 3 pop, 10 pop, 30 pop)

Real data example: individuals sampled from Eastern and Western Finland: 31 unlinked microsatellite markers

Example 2: The case of linked markers
- Simulated pedigree
  - 10 generations
  - Youngest generation: 39 individuals divided into 13 nuclear families
- Genotype data
  - 20 markers / 10 alleles
  - Recombination fraction 0.05

Reconstruction
We gave the algorithm
- The genotype data on the youngest generation
- The (correct) marker map
- The (correct) allele frequencies
- The population structure
The algorithm was run for 500,000 iterations

Reconstructing the pedigree

Reconstructing the haplotypes
- The accuracy of the haplotype reconstruction can be measured with the concept of switch distance (SD)
- The SD between two pairs of haplotypes is the number of phase relations between neighboring loci that need to be changed in order to turn the first pair of haplotypes into the other
- If the correct haplotypes were (111111,222222) then
  - (111222,222111) has SD=1
  - (112211,221122) has SD=2
  - (121212,212121) has SD=5
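The switch distance can be computed with a short Python function. This is a sketch under the definition given above: at every heterozygous locus it records whether the first reconstructed haplotype carries the allele of the first true haplotype, and it counts how often that orientation changes between neighboring heterozygous loci. The function name `switch_distance` and the string encoding of haplotypes are assumptions.

```python
def switch_distance(true_pair, est_pair):
    """Number of phase switches between two haplotype pairs of the same genotype."""
    t1, t2 = true_pair
    e1, e2 = est_pair
    if not (len(t1) == len(t2) == len(e1) == len(e2)):
        raise ValueError("all haplotypes must have the same number of loci")
    orientation = []
    for a, b, c, d in zip(t1, t2, e1, e2):
        if sorted((a, b)) != sorted((c, d)):
            raise ValueError("the two pairs imply different genotypes")
        if a != b:                       # only heterozygous loci carry phase information
            orientation.append(a == c)
    # A switch is a change of orientation between neighboring heterozygous loci.
    return sum(x != y for x, y in zip(orientation, orientation[1:]))

truth = ("111111", "222222")
print(switch_distance(truth, ("111222", "222111")))  # 1
print(switch_distance(truth, ("112211", "221122")))  # 2
print(switch_distance(truth, ("121212", "212121")))  # 5
```

The three calls reproduce the SD values of the examples listed above.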

Reconstructing the haplotypes The SDs between the reconstructed and the true haplotype pairs of the youngest generation (sum over all 39 individuals)

Reconstructing the IBD sharing
- We consider those alleles IBD (identical by descent) that trace back to a common ancestral allele at the founder level (9 generations backwards in time)
- It is possible to calculate a single quantity that measures the proportion of the genome that two individuals share (coefficient of relatedness r)
- It is also possible to compare the IBD sharing more accurately along the chromosome
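One way to turn reconstructed allele paths into these two summaries is sketched below. It assumes that each allele of a sampled individual has been labelled with the founder allele it descends from, and it uses the common convention r = 2 x (average kinship) for non-inbred individuals; the presentation does not state which formula was used, so both the convention and the name `ibd_sharing` are assumptions.

```python
def ibd_sharing(ind_a, ind_b):
    """Each individual is a pair of haplotypes; each haplotype is a sequence of
    founder-allele labels, one per marker. Returns (r, per-locus kinship)."""
    per_locus = []
    for a1, a2, b1, b2 in zip(ind_a[0], ind_a[1], ind_b[0], ind_b[1]):
        # Kinship at a locus: chance that a random allele of A and a random
        # allele of B descend from the same founder allele.
        shared = sum(x == y for x in (a1, a2) for y in (b1, b2))
        per_locus.append(shared / 4.0)
    r = 2.0 * sum(per_locus) / len(per_locus)
    return r, per_locus

# Hypothetical founder labels at 4 markers: the child carries one recombined
# parental haplotype (labels 1 and 2) and one haplotype from elsewhere (label 3).
parent = ([1, 1, 1, 1], [2, 2, 2, 2])
child = ([1, 1, 2, 2], [3, 3, 3, 3])
r, per_locus = ibd_sharing(parent, child)
print(r, per_locus)
```

The per-locus list gives the sharing along the chromosome, and r condenses it into a single number. In an McMC run these values would be averaged over the sampled configurations.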

Comparison with IBS-based estimators
Distribution of L_2 errors (741 values)
Sums: 1.93, 3.25, 3.27, 3.51
IBS-based estimators: Lynch (1988), Lynch and Ritland (1999), Wang (2002)

Reconstructing IBD

Future work
- Possibility of fixing some parts of the pedigree
- Extending partially known genotype data to the known pedigree
  - Pirinen, Gasbarra (2006): Finding consistent gene transmission patterns on large and complex pedigrees. IEEE Trans. Comp. Biol. Bioinf. 3:252-262

Future work
- Adding a QTL or phenotype model to the algorithm
- Allowing for mutations and considering evolutionary time scales (Ancestral Recombination Graph)
- Running many chains in parallel "at different temperatures" (McMcMC)
  - With 20 processors, McMcMC achieved a slightly better accuracy in 12 hours (of wall-clock time) than a single processor in 5 days
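For reference, the standard swap step of Metropolis-coupled MCMC can be sketched as follows. It assumes the usual construction in which chain i targets the posterior raised to the power 1/T_i; the presentation does not say how its temperatures were defined, so this is a generic illustration and `swap_acceptance` is a hypothetical name.

```python
import math

def swap_acceptance(log_pi_i, log_pi_j, temp_i, temp_j):
    """Probability of exchanging the states of chains i and j.

    log_pi_i, log_pi_j: log (unnormalised) posterior of the states currently
    held by chain i and chain j; chain k targets posterior ** (1 / temp_k)."""
    log_ratio = (1.0 / temp_i - 1.0 / temp_j) * (log_pi_j - log_pi_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

# The cold chain (T=1) holds the worse state, the hot chain (T=2) the better one:
print(swap_acceptance(-10.0, -8.0, 1.0, 2.0))   # swap always accepted
# Reverse situation: the swap is accepted with probability exp(-1).
print(swap_acceptance(-8.0, -10.0, 1.0, 2.0))
```

Hot chains move more freely and hand good states down to the cold chain through such swaps; only the cold chain (T=1) is used for inference.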

Thanks Dario Gasbarra Mikko Sillanpää Matti Pirinen
