Published by Lynette Dorsey. Modified over 9 years ago.
1
Population assignment likelihoods in a phylogenetic and demographic model. Jody Hey Rutgers University
2
DNA Barcoding is great!
But it is useful to keep in mind that species taxa are provisional: they are hypotheses to be revised with more data
Taxa are tools, not truth
Mitochondrial-based DNA barcodes
–Can be misleading due to chance factors (different genes have different histories)
–Can be misleading due to deterministic factors (mitochondria are a large target for natural selection)
3
A general problem…
You have some genetic data
For example, a gene sequenced multiple times
Or a microsatellite locus genotyped in a number of individuals
Suppose you are willing to assume that positive or balancing selection has not played a big role in the history of the data
What could you figure out about the history of the organisms from which the genes came?
4
A General Parameterization for questions on population demography, population divergence, speciation, population identification, etc.
X: genetic data (e.g. aligned sequences, microsatellites)
–may (or may not) come with population labels
–may (or may not) be given as diploid genotypes
–may include multiple loci for each sampled organism
P: population phylogeny
T: splitting times, i.e. the times of branch points in the phylogeny P
Θ: demography, i.e. population size and migration rate parameters
I: population labels, i.e. the assignment of genes to populations (which genes came from which populations or species)
G: genealogy, i.e. the gene tree for the data
G is a necessary ‘nuisance’ parameter: it provides a mathematical connection between X and (P, T, Θ, and I)
It is possible to calculate the probability of G as a function of P, T, Θ, and I, p(G|P,T,Θ,I), using coalescent models
It is possible to calculate the probability of a data set given G, p(X|G), using mutation models
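To make the coalescent term p(G|P,T,Θ,I) concrete, here is a minimal Python sketch for the simplest special case only: a single constant-size population with no splitting and no migration. This is an assumption made for illustration, not the full model of the talk. The function name and the `pair_rate` parameter (the rate at which any one pair of lineages coalesces, in whatever time units the genealogy uses) are hypothetical.

```python
import math

def log_prob_genealogy(interval_times, pair_rate):
    """Log density of one labelled genealogy under the basic coalescent.

    Assumes ONE constant-size population (no P, T, or migration).
    interval_times[0] is the time spent with n lineages, interval_times[1]
    the time with n - 1 lineages, and so on down to 2 lineages.
    """
    n = len(interval_times) + 1          # number of sampled gene copies
    log_p = 0.0
    for i, t in enumerate(interval_times):
        k = n - i                        # lineages present in this interval
        n_pairs = k * (k - 1) / 2        # pairs that could coalesce
        # One specific pair coalesces (rate pair_rate) after no pair
        # coalesced for time t (total rate n_pairs * pair_rate).
        log_p += math.log(pair_rate) - n_pairs * pair_rate * t
    return log_p

# Three gene copies: 0.5 time units with 3 lineages, then 1.0 with 2.
print(log_prob_genealogy([0.5, 1.0], pair_rate=2.0))
```

In the full model each population in the phylogeny has its own size, and migration and splitting events also contribute terms, so this function covers only the within-population coalescent part.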
5
Connecting Data to the General Model: Parts 1-3
For unlabeled data, without information on the number of populations or on which populations were sampled
1. Unlabeled data, for example:
Sequence1 ACgTACgACgCACgAAT
Sequence2 ACgTACgACgCACgAAT
Sequence3 ACCTTCgACgTACgAGT
Sequence4 ACgTTCgACgTACgAAT
Sequence5 ACCTTCgACgTACgAAT
Sequence6 ACgTTCgACgTATgAAT
2. Specify a random G with topology and branch lengths, for example:
[Figure: genealogy with tips in the order 5 6 4 3 1 2]
3. With a mutation model, and a value of G, we can calculate the probability of the data given G: p(X|G)
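As an illustration of how a mutation model links a genealogy to sequence data, the sketch below computes p(X|G) with Felsenstein's pruning algorithm under the Jukes-Cantor model. The choice of Jukes-Cantor, the nested-tuple tree encoding, the function names, and the branch lengths in the example are all assumptions for this sketch; the talk does not say which mutation model is used.

```python
import math

BASES = "ACGT"

def jc69_prob(same, branch_length):
    """Jukes-Cantor transition probability; length in substitutions per site."""
    e = math.exp(-4.0 * branch_length / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partials(node, site, seqs):
    """Conditional likelihoods of the data below `node` for each base at `node`.

    A leaf is a sequence name; an internal node is a pair
    ((left_child, left_branch_length), (right_child, right_branch_length)).
    """
    if isinstance(node, str):
        base = seqs[node][site].upper()
        return [1.0 if base == b else 0.0 for b in BASES]
    (left, bl_left), (right, bl_right) = node
    pl = partials(left, site, seqs)
    pr = partials(right, site, seqs)
    out = []
    for i in range(4):
        sum_l = sum(jc69_prob(i == j, bl_left) * pl[j] for j in range(4))
        sum_r = sum(jc69_prob(i == j, bl_right) * pr[j] for j in range(4))
        out.append(sum_l * sum_r)
    return out

def log_prob_data_given_genealogy(tree, seqs):
    """log p(X|G), treating sites as independent with uniform root frequencies.

    Raises ValueError if a site has likelihood 0 (e.g. differing tips joined
    by zero-length branches).
    """
    n_sites = len(next(iter(seqs.values())))
    total = 0.0
    for site in range(n_sites):
        root = partials(tree, site, seqs)
        total += math.log(sum(0.25 * x for x in root))
    return total

# Three of the example sequences on a genealogy with made-up branch lengths.
seqs = {
    "Sequence1": "ACgTACgACgCACgAAT",
    "Sequence3": "ACCTTCgACgTACgAGT",
    "Sequence6": "ACgTTCgACgTATgAAT",
}
tree = (((("Sequence1", 0.01), ("Sequence3", 0.01)), 0.02), ("Sequence6", 0.03))
print(log_prob_data_given_genealogy(tree, seqs))
```

Repeating this calculation for different genealogies shows which values of G make the observed sequences more or less probable.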
6
Connecting Data to the General Model: Parts 4 & 5
4. Specify a random phylogeny P with multiple populations and with splitting times T … for example:
[Figure: population phylogeny with sampled populations Pop 1, Pop 2, Pop 3, ancestral populations Pop (2,3) and Pop (2,3),1, splitting times T1 and T2, and population sizes N1, N2, N3, N(2,3), N(2,3),1]
5. With a phylogeny that depicts populations in time, we can also pick random values for population sizes and migration rates: Θ = {N1, N2, ... m1>2, m2>1, …}
7
Connecting Data to the General Model: Parts 6-8
6. Overlay the genealogy on the phylogeny
7. Add implied migration events and other random migration events to the phylogeny
8. Identify I, the data labels representing the populations containing the data
[Figure: genealogy with tips in the order 5 6 4 3 1 2, overlaid on the phylogeny of Pop 3, Pop 2, and Pop 1]
8
Calculating the likelihood of P, T, Θ, and I, given the data
If we can solve this then we can obtain maximum likelihood estimates of P, T, I, and Θ
We know how to calculate p(X|G) and p(G|P,T,I,Θ)
–The math is not the hard part
The greatest challenge is finding efficient ways to sample the space of genealogies and the space of P, T, Θ, and I
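The slide does not write the likelihood out, but the two quantities it names combine in the standard way: the nuisance parameter G is averaged out. The expression below is a sketch of that relationship, where the integral is shorthand for a sum over genealogy topologies together with an integral over branch times.

```latex
L(P, T, \Theta, I \mid X) = p(X \mid P, T, \Theta, I)
  = \int p(X \mid G)\, p(G \mid P, T, \Theta, I)\, dG
```

Because this integral cannot be evaluated directly for realistic data sets, it is approximated by sampling genealogies, which is why efficient sampling is the main challenge.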
9
Genetic Data and different types of data labels
Often population labels are known (come with data)
Aligned DNA Sequences    Population Labels
ACgTACgACgCACgAAT        A
ACgTACgACgCACgAAT        A
ACCTTCgACgTACgAGT        B
ACgTTCgACgTACgAAT        B
ACCTTCgACgTACgAAT        C
ACgTTCgACgTATgAAT        C
Population labels are already known and do not need to be estimated. Parameter I (population labels) is not included in the model.
10
Case 1: Data has no labeling at all
Aligned DNA Sequences    Population Labels
ACgTACgACgCACgAAT        ?
ACgTACgACgCACgAAT        ?
ACCTTCgACgTACgAGT        ?
ACgTTCgACgTACgAAT        ?
ACCTTCgACgTACgAAT        ?
ACgTTCgACgTATgAAT        ?
11
Case 2: No population labels, but data comes in diploid genotype pairs
Aligned DNA Sequences    Population Labels    Genotype Pairs
ACgTACgACgCACgAAT        ?                    Individual #1
ACgTACgACgCACgAAT        ?                    Individual #1
ACCTTCgACgTACgAGT        ?                    Individual #2
ACgTTCgACgTACgAAT        ?                    Individual #2
ACCTTCgACgTACgAAT        ?                    Individual #3
ACgTTCgACgTATgAAT        ?                    Individual #3
Gene copies are identified in genotype pairs only. Parameter I (population labels) is unknown (?) and needs to be estimated.
12
Two kinds of data sets without population labels
1. Alleles or gene copies provided without any additional information on populations
- e.g. locus may be haploid
- or, for whatever reason, data not collected in a way that yields diploid genotypes
2. Alleles or sequences provided in diploid (genotype) pairs
This is a common situation for population assignment
13
Case 1: Alleles or gene copies come without any additional information on populations
The only available information on population labels (parameter I) and all other parameters (P, T, Θ) is in the actual variation in the data
This is a lot to ask of a single-locus data set
With multiple loci, it can be possible to estimate P, T, Θ, and I
Can include information from a database on the same locus (loci), i.e. DNA barcoding
14
Case 2: Data comes in diploid (genotype) pairs
Such data contains two types of information for population identification:
–Patterns of variation (as in case 1)
–Knowledge that both gene copies from a single individual must come from the same population (assume no hybrids)
This problem (identifying populations based on diploid genotypes) is traditionally called population assignment
15
Population Assignment based on diploid genotype data
Many methods exist for population assignment, using allelic data, based on an assumption of Hardy-Weinberg equilibrium within populations
These methods do not otherwise incorporate phylogenetics or population genetics (no P, T, or Θ)
Have to overcome difficulty of not knowing the underlying allele frequencies
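For comparison, the Hardy-Weinberg style of assignment can be sketched in a few lines of Python. The sketch assumes the allele frequencies in each candidate population are already known and nonzero, which is exactly the difficulty the slide points out, and it treats loci as independent. The function names, the data layout, and the example frequencies are hypothetical.

```python
import math

def genotype_log_likelihood(genotype, allele_freqs):
    """Log probability of one diploid genotype under Hardy-Weinberg proportions.

    Assumes every allele in the genotype has a positive frequency.
    """
    a, b = genotype
    p, q = allele_freqs[a], allele_freqs[b]
    # Homozygote: p^2. Heterozygote: 2pq.
    return math.log(p * p) if a == b else math.log(2.0 * p * q)

def assign_individual(genotypes, freqs_by_pop):
    """Score one individual against each population and pick the best.

    genotypes: one (allele, allele) pair per locus.
    freqs_by_pop: {population: [ {allele: frequency} for each locus ]}.
    """
    scores = {}
    for pop, loci in freqs_by_pop.items():
        scores[pop] = sum(
            genotype_log_likelihood(g, f) for g, f in zip(genotypes, loci)
        )
    best = max(scores, key=scores.get)
    return best, scores

# Two made-up populations that differ in allele frequency at one locus.
freqs = {
    "Pop 1": [{"x": 0.9, "y": 0.1}],
    "Pop 2": [{"x": 0.1, "y": 0.9}],
}
best, scores = assign_individual([("x", "x")], freqs)
print(best, scores)
```

Nothing in this calculation uses a phylogeny, splitting times, or demographic parameters, which is the limitation the slide highlights.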
16
Considering the probability of a particular genotype configuration, D
1 ACgTACgACgCACgAAT
2 ACgTACgACgCACgAAT
3 ACCTTCgACgTACgAGT
4 ACgTTCgACgTACgAAT
5 ACCTTCgACgTACgAAT
6 ACgTTCgACgTATgAAT
6 sequences, 3 genotype pairs
The actual configuration D that comes with the data is one of many possible configurations.
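To see how many configurations are possible, the following Python sketch enumerates every way of grouping the gene copies into unordered genotype pairs. The function name is hypothetical; the count itself follows from elementary combinatorics, (2n)! / (2^n n!) for 2n gene copies.

```python
def genotype_configurations(copies):
    """Yield every way to group an even number of gene copies into pairs."""
    if not copies:
        yield []
        return
    first, rest = copies[0], copies[1:]
    # Pair the first copy with each possible partner, then recurse on the rest.
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for config in genotype_configurations(remaining):
            yield [(first, partner)] + config

configs = list(genotype_configurations([1, 2, 3, 4, 5, 6]))
print(len(configs))  # 15
print(configs[0])    # [(1, 2), (3, 4), (5, 6)]
```

For the six sequences above there are 15 configurations, of which the observed D is one.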
17
Calculating the probability of a particular genotype configuration, D
Assume that genes come together and form zygotes at random with respect to their time of common ancestry
–This is a genealogical version of the assumption of random mating that is usually made with respect to segregating alleles (e.g. in Hardy-Weinberg)
Assume that both gene copies within an individual are in the same population
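The second assumption acts as a constraint on the population labels I. The Python sketch below, with hypothetical function names and a made-up choice of three candidate populations, checks whether a labelling respects the genotype pairs and counts how many labellings remain once the constraint is applied.

```python
from itertools import product

def labels_respect_genotypes(config, labels):
    """True if both gene copies of every individual share one population."""
    return all(labels[a] == labels[b] for a, b in config)

def compatible_assignments(config, copies, populations):
    """Yield every labelling of gene copies that respects the genotype pairs."""
    for combo in product(populations, repeat=len(copies)):
        labels = dict(zip(copies, combo))
        if labels_respect_genotypes(config, labels):
            yield labels

copies = [1, 2, 3, 4, 5, 6]
config = [(1, 2), (3, 4), (5, 6)]           # three diploid individuals
populations = ["Pop 1", "Pop 2", "Pop 3"]   # assumed number of populations

n_all = len(populations) ** len(copies)
n_ok = sum(1 for _ in compatible_assignments(config, copies, populations))
print(n_all, n_ok)  # 729 27
```

With six gene copies and three candidate populations, pairing cuts the number of possible labellings from 729 to 27, which shows how much information the genotype pairs add.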
18
Given a genealogy, G, some genotype configurations are more probable than others under an assumption of random union of gametes