Phylogenetics and Coalescence. Goals Construct phylogenetic trees using the UPGMA method Use nucleotide sequences to construct phylogenetic trees using.

Phylogenetics and Coalescence

Goals
- Construct phylogenetic trees using the UPGMA method
- Use nucleotide sequences to construct phylogenetic trees using UPGMA, NJ, and Maximum Parsimony methods
- Use coalescent simulation to determine historical change in N_e
- Interpret coalescent trees to draw inferences about human migrations

Kamilah the gorilla
A female western lowland gorilla (Gorilla gorilla gorilla). Scally et al. 2012, Nature.
[Figure: three tree diagrams with tips labeled H, C, and G]

Phylogenetic Methods
Scope of the problem
- Number of possible unrooted bifurcating trees for n OTUs: $(2n-5)!! = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
- For 10 taxa -> 2,027,025 possible unrooted trees.
- Need an optimality criterion
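The tree count quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes the standard double-factorial count (2n-5)!! for unrooted bifurcating trees; the function name `num_unrooted_trees` is only illustrative.

```python
def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n labelled OTUs: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the product is 1 when n = 3).
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10):
    print(n, num_unrooted_trees(n))
```

The count grows so quickly that exhaustive evaluation of every tree is impractical for more than a handful of taxa, which is why an optimality criterion and a search strategy are needed.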

Phylogenetic methods
A. Distance methods
   1. Unweighted Pair Group Method using Arithmetic averages (UPGMA)
   2. Neighbor Joining (NJ)
   3. Minimum Evolution (ME)
B. Character-based methods
   1. Maximum Parsimony (MP)
   2. Maximum Likelihood (ML)
   3. Bayesian Method (BA)

UPGMA
Step 1: Generate data (sequence, genotype, or morphological) for each OTU.

Taxa        1 2 3 4 5 6 7
Human       T G C G T A T
Chimpanzee  T G G G T A T
Gorilla     T G C G C T T
Orangutan   T G C T G T G
Gibbon      T A G T A G C

Distances can be calculated under different substitution models: 1. number of nucleotide differences; 2. p-distance; 3. JC distance; 4. K2P distance; 5. F81; 6. HKY85; 7. GTR; etc.
Step 2: Calculate the p-distance for all pairs of taxa.

Taxa        1 2 3* 4 5 6 7
Human       T G C  G T A T
Chimpanzee  T G G  G T A T

Human and chimpanzee differ at 1 of 7 sites (site 3, marked *), so the p-distance is 1/7 = 0.142857143.
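A short sketch of the p-distance calculation, with the Jukes-Cantor (JC) correction added for comparison. The function names are illustrative; the JC formula used is the standard d = -(3/4) ln(1 - 4p/3).

```python
from math import log


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def jc_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor corrected distance; undefined once p reaches 0.75."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC correction")
    return -0.75 * log(1 - 4 * p / 3)


human, chimpanzee = "TGCGTAT", "TGGGTAT"
print(p_distance(human, chimpanzee))   # 1 difference in 7 sites
print(jc_distance(human, chimpanzee))  # slightly larger than the p-distance
```

The corrected distance exceeds the raw proportion because it accounts for multiple substitutions at the same site.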

Step 3: Calculate the distance matrix for all pairs of taxa and select the pair of taxa with the minimum distance as a new OTU.

Taxa  Hu      Ch      Go      Or      Gi
Hu    0
Ch    0.1428  0
Go    0.2857  0.4285  0
Or    0.5714  0.7142  0.4285  0
Gi    0.8571  0.7142  0.8571  0.7142  0

Tree so far: Human and Chimpanzee are joined, each with a branch length of 0.1428 / 2 = 0.071.

Step 4: Recalculate the distance matrix, treating human and chimpanzee as one OTU. Each new distance is the average of the original (Step 3) distances over the members of the merged OTU, for example d(Hu+Ch, Go) = (0.2857 + 0.4285) / 2 = 0.3571.

Taxa   Hu+Ch    Go      Or      Gi
Hu+Ch  0
Go     0.35714  0
Or     0.64285  0.4285  0
Gi     0.78571  0.8571  0.7142  0

Step 5: Select the pair of taxa with the minimum distance as a new OTU.
Tree so far: Human 0.071, Chimpanzee 0.071, Gorilla 0.179; the internal branch leading to (Human, Chimpanzee) is 0.107.

Step 6: Again select the pair of OTUs with the minimum distance as a new OTU and recalculate the distance matrix, averaging over the original (Step 3) pairwise distances, for example d((Hu+Ch)Go, Or) = (0.5714 + 0.7142 + 0.4285) / 3 = 0.5714.

Taxa       (Hu+Ch)Go  Or      Gi
(Hu+Ch)Go  0
Or         0.5714     0
Gi         0.8095     0.7142  0

Step 7: Again select the pair of taxa with the minimum distance as a new OTU.
Tree so far: Human 0.071, Chimpanzee 0.071, Gorilla 0.179, Orangutan 0.286; each internal branch is 0.107.

Step 8: Again select the pair of OTUs with the minimum distance as a new OTU and recalculate the distance matrix, for example d(((Hu+Ch)Go)Or, Gi) = (0.8571 + 0.7142 + 0.8571 + 0.7142) / 4 = 0.7857.

Taxa           ((Hu+Ch)Go)Or  Gi
((Hu+Ch)Go)Or  0
Gi             0.7857         0

Step 9: Again select the pair of OTUs with the minimum distance as a new OTU and make the final rooted tree.
Final tree: Human 0.071, Chimpanzee 0.071, Gorilla 0.179, Orangutan 0.286, Gibbon 0.393; each internal branch is 0.107.
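The nine steps can be condensed into a short script. The following is a minimal sketch, not the procedure used by MEGA or any other package: it uses p-distances, averages over the original pairwise distances at each merge, and places each node at half the distance between the merged clusters. The names `SEQS` and `upgma` are illustrative.

```python
from itertools import combinations

SEQS = {
    "Human": "TGCGTAT",
    "Chimpanzee": "TGGGTAT",
    "Gorilla": "TGCGCTT",
    "Orangutan": "TGCTGTG",
    "Gibbon": "TAGTAGC",
}


def p_distance(a, b):
    """Proportion of sites that differ between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Return (newick_string, list of (merged_members, distance))."""
    names = list(seqs)
    pair_dist = {frozenset((x, y)): p_distance(seqs[x], seqs[y])
                 for x, y in combinations(names, 2)}
    # Each cluster is (member taxa, node height, Newick text).
    clusters = [((name,), 0.0, name) for name in names]
    merges = []

    def average(c1, c2):
        # UPGMA distance: mean of all original distances between members.
        total = sum(pair_dist[frozenset((a, b))] for a in c1[0] for b in c2[0])
        return total / (len(c1[0]) * len(c2[0]))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        dist = average(a, b)
        height = dist / 2  # molecular clock: both sides are equally deep
        newick = (f"({a[2]}:{height - a[1]:.3f},"
                  f"{b[2]}:{height - b[1]:.3f})")
        merges.append((a[0] + b[0], dist))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((a[0] + b[0], height, newick))
    return clusters[0][2] + ";", merges


tree, merges = upgma(SEQS)
print(tree)
for members, dist in merges:
    print(members, round(dist, 4))
```

The merge distances printed by the loop should match the minimum values selected in Steps 3 to 9, and the branch lengths in the Newick string should match those on the final tree.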

Branch Supports
1. Bootstrap support
2. Jackknife support
3. Bremer support
4. Posterior probability support

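Bootstrap support
Step 1: Make "n" pseudo-replicates of the data by randomly resampling sites (columns) with replacement, and build a tree from each replicate.

Original data:
Taxa        1 2 3 4 5 6 7
Human       T G C G T A T
Chimpanzee  T G G G T A T
Gorilla     T G C G C T T

Pseudo-replicate 1:
Taxa        2 2 3 4 6 7 7
Human       G G C G A T T
Chimpanzee  G G G G A T T
Gorilla     G G C G T T T

Pseudo-replicate 2:
Taxa        1 3 5 6 7 2 4
Human       T C T A T G G
Chimpanzee  T G T A T G G
Gorilla     T C C T T G G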
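Column resampling is easy to sketch in code. The example below is illustrative only: `resample_columns` rebuilds an alignment from a given list of site indices (zero-based), and `bootstrap_replicate` draws those indices at random with replacement. Both names are hypothetical, and tree building for each replicate is left out.

```python
import random

ALIGNMENT = {
    "Human": "TGCGTAT",
    "Chimpanzee": "TGGGTAT",
    "Gorilla": "TGCGCTT",
}


def resample_columns(alignment, columns):
    """Build a new alignment from the listed site indices (zero-based)."""
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_replicate(alignment, rng):
    """Draw as many sites as the alignment has, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return columns, resample_columns(alignment, columns)


# Sites 2, 2, 3, 4, 6, 7, 7 in one-based numbering.
print(resample_columns(ALIGNMENT, [1, 1, 2, 3, 5, 6, 6]))

rng = random.Random(42)  # fixed seed so the run is repeatable
columns, replicate = bootstrap_replicate(ALIGNMENT, rng)
print([c + 1 for c in columns], replicate)
```

Every taxon is resampled with the same column indices, so the site patterns stay intact and only their frequencies change between replicates.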

Bootstrap support
Step 2: Build a consensus tree from the trees obtained from all pseudo-replicates.

Phylogenetic software available
1. PAUP
2. PHYLIP
3. MrBayes
4. MEGA

Problem 1. File mt_primates.meg contains the sequence data used to calculate the genetic distances in Example 1. Use MEGA to build phylogenetic trees based on:
(i) UPGMA.
(ii) The NJ method.
(iii) Maximum Parsimony.
a) Compute bootstrap confidence in the internal nodes of each tree.
b) Compare the trees derived using each of these methods. Which do you think is the most informative? Does the computational efficiency of the UPGMA method produce misleading results in this case?

Problem 2. File pdha1_human.meg contains haplotypes detected by sequencing a 4.2-kb region of the X-linked Pyruvate Dehydrogenase E1 α Subunit (PDHA1) gene in 16 African and 19 non-African males. Use MEGA to build a phylogenetic tree based on the NJ method and interpret the results in the light of hypotheses about the origin of modern humans.

UPGMA (Sokal and Sneath 1963) relies on the molecular clock assumption. The clock is often violated when the sequences are divergent!
Does the molecular clock hold? Yes: UPGMA. No: NJ, ME.
Distance methods do not perform well when the sequences are highly divergent or contain many alignment gaps, mainly because of difficulties in obtaining reliable distance estimates.
From Ziheng Yang, Computational Molecular Evolution, Oxford University Press, 2006.

Method    Optimality criterion
MP        Minimum number of changes, minimized over ancestral states
ML        Log likelihood score, optimized over branch lengths and model parameters
ME        Tree length (sum of branch lengths, often estimated by least squares)
Bayesian  Posterior probability, calculated by integrating over branch lengths and substitution parameters

The level of sequence divergence has a great impact on the performance of tree reconstruction methods. Highly similar sequences lack information, so that no method can recover the true tree with any confidence. Highly divergent sequences contain too much noise, as substitutions may have saturated.
From Ziheng Yang, Computational Molecular Evolution, Oxford University Press, 2006.

Coalescence
Wright-Fisher Model
- Until now we have implicitly used the Wright-Fisher Model
- Computationally expensive to simulate, since every gene copy is tracked forward in time through every generation

Wright Fisher

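The Discrete Coalescent
For a diploid population with 2N gene copies:

Probability that two genes have their MRCA j generations ago:
$P(T_2 = j) = \left(1 - \frac{1}{2N}\right)^{j-1} \frac{1}{2N}$
The first factor is the probability of no coalescence for j - 1 generations; the second is the probability of coalescence in the jth generation.

Probability that 2 genes out of k have a common ancestor j generations ago:
$P(T_k = j) = \left(1 - \frac{k(k-1)}{4N}\right)^{j-1} \frac{k(k-1)}{4N}$
The first factor is the probability of no coalescence among k lineages for j - 1 generations; the second is the probability of coalescence in the jth generation.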
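The geometric form of the discrete coalescent can be explored numerically. This sketch assumes 2N gene copies and the usual approximation that the per-generation coalescence probability among k lineages is C(k,2)/(2N), which holds when k is much smaller than 2N. The name `coalescence_pmf` and the argument `two_n` are illustrative.

```python
from math import comb


def coalescence_pmf(j: int, k: int, two_n: int) -> float:
    """P(first coalescence among k lineages happens exactly j generations ago)."""
    p = comb(k, 2) / two_n          # chance that some pair coalesces in one generation
    return (1 - p) ** (j - 1) * p   # j-1 generations without, then one with


two_n = 100
for k in (2, 5):
    probs = [coalescence_pmf(j, k, two_n) for j in range(1, 5001)]
    mean_wait = sum(j * pr for j, pr in enumerate(probs, start=1))
    print(k, round(sum(probs), 6), round(mean_wait, 2))
```

The mean waiting time is 1/p generations, so more lineages coalesce faster: with 2N = 100 the expected wait is 100 generations for k = 2 and 10 generations for k = 5.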

The Continuous Coalescent
- Can derive a continuous exponential function from the discrete geometric representation
- Waiting time (T) for k genes to have k-1 ancestors, with t in generations and 2N gene copies: $P(T_k \le t) = 1 - e^{-\frac{k(k-1)}{4N} t}$
(See math box 3.2 in Hamilton, 2009)
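A brief simulation illustrates the continuous approximation. The sketch assumes 2N gene copies, exponential waiting times with rate C(k,2)/(2N) per generation while k lineages remain, and a constant population size; `simulate_tmrca` and `expected_tmrca` are illustrative names.

```python
import random


def simulate_tmrca(k: int, two_n: int, rng: random.Random) -> float:
    """Sum exponential waiting times from k lineages down to 1 (in generations)."""
    total = 0.0
    for lineages in range(k, 1, -1):
        rate = lineages * (lineages - 1) / 2 / two_n
        total += rng.expovariate(rate)
    return total


def expected_tmrca(k: int, two_n: int) -> float:
    """Sum of expected waiting times, equal to 2 * two_n * (1 - 1/k)."""
    return sum(2 * two_n / (i * (i - 1)) for i in range(2, k + 1))


rng = random.Random(1)
sims = [simulate_tmrca(10, 1000, rng) for _ in range(20000)]
print(sum(sims) / len(sims), expected_tmrca(10, 1000))
```

Because each coalescence is drawn directly, only the sampled lineages are tracked, which is the source of the computational savings over forward Wright-Fisher simulation.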

Coalescent Applications
Coalescent topologies can depend on the combined effects of N_e and μ, migration rate, selection, and recombination rate.
Applications
- Estimating recombination rates
- Estimating historical migration rates between populations
- Estimating tMRCA
- Estimating historical effective population size
- Estimating strength of selection

Bayesian inference
- H, hypothesis: the probability of H may be affected by data
- E, evidence: corresponds to new data that were not used in computing the prior probability
- P(H), prior probability: the probability of H before E is observed
- P(H|E): posterior probability
- P(E|H): likelihood
Posterior is proportional to likelihood times prior.

MCMC and Bayesian Inference
In Bayesian inference, computing P(E) can be very complex in some cases, especially when the parameters that influence P(E) are unknown. We usually write P(E|θ) in place of P(E) to indicate the influence of θ.
Remember that we knew the proportion of the two types of coins in Nathan's story from Lab 8. Now imagine that we have no clue what the proportion is.

There is a toy factory that produces two types of coins. One type is normal, like any other coin in the world, and the other is "magic" because it produces a "heads" result on 90% of flips. However, there is no difference between the coins in appearance or weight. One night, Nathan decided to steal some coins from the factory, but he mixed up the two types. Now…

Now
H1: it's a normal coin
H2: it's a magic coin
The victim (owner of the factory) also told you in your investigation that "To maximize profit, 90% of the products in our pipeline are magic coins, and only 10% are normal coins."
Nathan tossed each coin ten times to decide its type.
Can you use the prior information to decide which coin is most likely to yield 70% heads?

We know P(E|H1,θ) and P(E|H2,θ), where E is 7 heads in 10 tosses. Use θ to represent the proportion of normal coins, so θ ∈ [0,1] and P(H1|θ) = θ. A simple solution is to find the θ value under which P(E|θ) is at its maximum.

θ = P(H1|θ)  P(H2|θ)  P(E|H1,θ)  P(E|H2,θ)  P(E|θ)  P(H1|E,θ)
0            1        0.117      0.057      0.057   0.00
0.1          0.9      0.117      0.057      0.063   0.19
0.2          0.8      0.117      0.057      0.069   0.34
0.3          0.7      0.117      0.057      0.075   0.47
0.4          0.6      0.117      0.057      0.081   0.58
0.5          0.5      0.117      0.057      0.087   0.67
0.6          0.4      0.117      0.057      0.093   0.75
0.7          0.3      0.117      0.057      0.099   0.83
0.8          0.2      0.117      0.057      0.105   0.89
0.9          0.1      0.117      0.057      0.111   0.95
1            0        0.117      0.057      0.117   1.00

Prior: P(H1|θ) and P(H2|θ). Posterior: P(H1|E,θ).
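The table can be regenerated from the binomial likelihoods. This sketch assumes E is 7 heads in 10 tosses and that θ is the proportion of normal coins; the function names are illustrative. Unrounded arithmetic can differ from the slide's table in the last printed digit (for θ = 0.1 the posterior is about 0.185).

```python
from math import comb


def binom_pmf(heads: int, flips: int, p_heads: float) -> float:
    """Probability of a given number of heads in a fixed number of flips."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads) ** (flips - heads)


LIKE_NORMAL = binom_pmf(7, 10, 0.5)  # P(E|H1)
LIKE_MAGIC = binom_pmf(7, 10, 0.9)   # P(E|H2)


def evidence_and_posterior(theta: float):
    """Return (P(E|theta), P(H1|E,theta)) for a proportion theta of normal coins."""
    evidence = theta * LIKE_NORMAL + (1 - theta) * LIKE_MAGIC
    return evidence, theta * LIKE_NORMAL / evidence


for step in range(11):
    theta = step / 10
    evidence, posterior = evidence_and_posterior(theta)
    print(f"{theta:.1f}  {evidence:.3f}  {posterior:.2f}")
```

P(E|θ) rises steadily with θ, so under this simple criterion the data favor a bag made up entirely of normal coins.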

From Data to Coalescence
- Suppose we observe n genes with k mutations
- We want to estimate θ = 4N_e μ but do not know its true value
- Can calculate the likelihood of θ for a range of possible values and find the one with the highest likelihood
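A grid search over θ can be sketched for the simplest case. The example rests on assumptions that are not in the slides: it takes n = 2 genes and the infinite-sites model, under which the probability of observing k differences is θ^k / (1 + θ)^(k+1). The names `likelihood_two_genes` and `grid_mle` are illustrative.

```python
def likelihood_two_genes(theta: float, k: int) -> float:
    """P(k differences between 2 genes | theta), infinite-sites assumption."""
    return theta**k / (1 + theta) ** (k + 1)


def grid_mle(k: int, grid):
    """Return the grid value of theta with the highest likelihood."""
    return max(grid, key=lambda theta: likelihood_two_genes(theta, k))


grid = [i / 10 for i in range(1, 201)]  # candidate theta values 0.1 to 20.0
k_observed = 3
best = grid_mle(k_observed, grid)
print(best, likelihood_two_genes(best, k_observed))
```

For larger samples the likelihood depends on the genealogy, which is unknown, and that is the motivation for sampling genealogies with MCMC.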

MCMC
1. Sample a new history from a distribution of histories (topologies + waiting times)
2. Divide the likelihood of this new history by the likelihood of the last history sampled
3. Move to the new history with a probability given by this likelihood ratio (always move if the ratio is 1 or greater); otherwise stay at the current history
4. Repeat steps 1-3
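The propose, compare, and accept loop can be illustrated on a much smaller problem than sampling genealogies. The sketch below swaps in a single scalar parameter, the proportion θ of normal coins from the coin example, with a flat prior on [0, 1] and 7 heads in 10 tosses as the data. It is a plain Metropolis sampler, not the algorithm implemented in BEAST, and the function names, step size, and seed are illustrative choices.

```python
import random
from math import comb


def likelihood(theta: float) -> float:
    """P(7 heads in 10 tosses | theta); zero outside [0, 1] (flat prior)."""
    if not 0.0 <= theta <= 1.0:
        return 0.0
    normal = comb(10, 7) * 0.5**10
    magic = comb(10, 7) * 0.9**7 * 0.1**3
    return theta * normal + (1 - theta) * magic


def metropolis(n_steps: int, step_size: float = 0.5, seed: int = 1):
    """Return post-burn-in samples of theta from a Metropolis chain."""
    rng = random.Random(seed)
    current = 0.5
    chain = []
    for _ in range(n_steps):
        # Step 1: propose a new value near the current one (symmetric proposal).
        proposal = current + rng.uniform(-step_size, step_size)
        # Step 2: likelihood ratio of the proposed to the current value.
        ratio = likelihood(proposal) / likelihood(current)
        # Step 3: accept with probability min(1, ratio).
        if rng.random() < ratio:
            current = proposal
        chain.append(current)
    burn_in = n_steps // 10  # discard the first tenth of the chain
    return chain[burn_in:]


samples = metropolis(20000)
print(sum(samples) / len(samples))
```

A rejected proposal still adds the current value to the chain; dropping those repeats would bias the sample toward rarely visited regions.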

MCMC Issues
- Beginning of chain not in highest density region: discard it as burn-in (~1/10 the chain length)
- Samples close to one another are correlated: check the effective sample size (>200)
[Figure: two example chains labeled "Good" and "Not so good"; source: sas.com]

Problem 3. Fossil and molecular evidence both strongly support the divergence of the human and chimpanzee (Pan troglodytes) lineages approximately 6 million years ago. However, the timings and locations of human expansions beyond Africa have proved controversial. Use the Bayesian MCMC software BEAST to derive coalescent trees for sequences from the X-linked Pyruvate Dehydrogenase E1-alpha Subunit gene that you also analyzed in Problem 2.
a) Have you effectively sampled parameter space for all estimates? How do you know? What might cause insufficient sampling?
b) What are the times to the most recent common ancestor of all Europeans? Africans? Human beings? The human/Pan split? Do these seem reasonable? What is your interpretation of these results? Use the available literature on human evolution to support your claims.
c) How has the effective population size of humans changed over time? What might this indicate?
d) How does the best-fit coalescent tree derived by TreeAnnotator compare to the Neighbor Joining tree you generated in Problem 2? What might account for the differences?
e) GRADUATE STUDENTS ONLY: Does this tree support the hypothesis of a single African origin of Eurasian populations?
   (i) If so, which African lineage is most closely related to the Eurasian lineages?
   (ii) Are the African and Eurasian lineages monophyletic?
   (iii) How do you interpret this result? What are the limitations of your inferences?
