Trees & Topologies Chapter 3, Part 2

A simple lineage
Consider a given gene in a sample of size n. How long does it take before this gene coalesces with another gene in the sample? Let W_n denote the time until the gene merges with another gene; for n = 2, E(W_2) = E(T_2) = 1. For a given n, the gene is involved in the first coalescent event with probability 2/n. If so, W_n = T_n, which provides the first term on the right-hand side of the recursion
$$E(W_n) = \frac{2}{n}\,E(T_n) + \left(1 - \frac{2}{n}\right) E(T_n + W_{n-1}).$$
With probability 1 - 2/n the gene is not involved in the first coalescent event, which implies that W_n = T_n + W_{n-1}: T_n plus the time W_{n-1} until the gene coalesces in a sample of size n-1. Since E(T_n) = 2/(n(n-1)), the recursion simplifies to E(W_n) = E(T_n) + (1 - 2/n) E(W_{n-1}) and has the solution E(W_n) = 2/n.
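The recursion for E(W_n) can be checked with exact arithmetic. The following is a minimal sketch, not part of the slides; the function name `expected_wait` is made up for illustration, and time is assumed to be measured in the units in which E(T_2) = 1.

```python
from fractions import Fraction

def expected_wait(n):
    """E(W_n) from E(W_n) = E(T_n) + (1 - 2/n) E(W_{n-1}), with E(W_2) = 1."""
    w = Fraction(1)  # E(W_2) = E(T_2) = 1
    for m in range(3, n + 1):
        e_tm = Fraction(2, m * (m - 1))      # E(T_m) = 2 / (m (m - 1))
        w = e_tm + (1 - Fraction(2, m)) * w  # one step of the recursion
    return w

# Compare the recursion with the stated solution 2/n for a few sample sizes.
for n in (2, 3, 5, 10):
    print(n, expected_wait(n), Fraction(2, n))
```

Using `Fraction` avoids rounding, so the comparison with 2/n is exact rather than approximate.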

Single lineage
How many coalescent events pass before the gene coalesces with another gene? Let C_n denote this number. The recursion is the same as for W_n, with E(T_n) replaced by 1 and W_n by C_n. The initial condition is E(C_2) = 1, because there is only one coalescent event in the history of a sample of size 2.
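Replacing E(T_n) by 1 gives E(C_n) = 1 + (1 - 2/n) E(C_{n-1}). The sketch below iterates this recursion exactly. The closed form (n+1)/3 used in the comparison is not stated in the slides; it is simply what the recursion yields, and the function name `expected_events` is illustrative.

```python
from fractions import Fraction

def expected_events(n):
    """E(C_n) from E(C_n) = 1 + (1 - 2/n) E(C_{n-1}), with E(C_2) = 1."""
    c = Fraction(1)  # a sample of size 2 has exactly one coalescent event
    for m in range(3, n + 1):
        c = 1 + (1 - Fraction(2, m)) * c
    return c

# The recursion reproduces (n + 1) / 3 for every n tried here.
for n in (2, 3, 4, 10):
    print(n, expected_events(n), Fraction(n + 1, 3))
```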

Disjoint subsamples
Consider a sample of size n that is divided into two disjoint subsamples, A and B, of sizes k and n-k, respectively. Last time we talked about nested subsamples, where the genes in B were also in A. This is the opposite situation: none of the genes in B are in A.
Figure: division of a sample into disjoint subsamples A and B, where all genes in subsample A coalesce before any of them coalesces with a gene in subsample B. T(A-1) denotes the time of the MRCA of subsample A; T(A-0) denotes the time of absorption of the last lineage of A into the genealogy of B.

Disjoint subsamples (cont'd)
The probability that all genes in A find a MRCA before coalescing with any gene in B is
$$q_{k,n-k} = \frac{2\,(k-1)!\,(n-k)!}{(k+1)\,(n-1)!}.$$
The probability that one of the two subsamples finds a MRCA before coalescing with members of the other subsample is
$$q_{k,n-k} + q_{n-k,k} - \frac{2}{(n-1)\binom{n}{k}}.$$
Here q_{n-k,k} is the probability that all genes in B find a MRCA before coalescing with any gene in A, so the roles of A and B are reversed. The last term (in the red box on the slide) is the probability that all genes in A and all genes in B find their own MRCAs before the two subsamples find a common MRCA. In other words, it is the probability that the k labelled genes of A and the n-k labelled genes of B descend from the two sides of the deepest split in the tree. This term is subtracted because it is included in both of the other terms (illustrated on the next slide).
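These probabilities can be cross-checked by first-step analysis on the ordinary coalescent jump chain, in which every pair of lineages is equally likely to coalesce. The sketch below is an illustration, not the slides' derivation. It assumes the closed forms q_{k,n-k} = 2(k-1)!(n-k)!/((k+1)(n-1)!) and 2/((n-1) C(n,k)) for the term that is subtracted, and all function names are made up.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def q_recursive(i, j):
    """P(the i lineages of A reach one ancestor before any A-B coalescence)."""
    if i == 1:
        return Fraction(1)
    pairs = comb(i + j, 2)
    p = Fraction(comb(i, 2), pairs) * q_recursive(i - 1, j)      # A-A event
    if j >= 2:
        p += Fraction(comb(j, 2), pairs) * q_recursive(i, j - 1)  # B-B event
    return p  # an A-B event contributes probability 0

@lru_cache(maxsize=None)
def both_recursive(i, j):
    """P(A and B each reach one ancestor before any A-B coalescence)."""
    if i == 1 and j == 1:
        return Fraction(1)
    pairs = comb(i + j, 2)
    p = Fraction(0)
    if i >= 2:
        p += Fraction(comb(i, 2), pairs) * both_recursive(i - 1, j)
    if j >= 2:
        p += Fraction(comb(j, 2), pairs) * both_recursive(i, j - 1)
    return p

def q_closed(k, m):
    """Assumed closed form for q_{k,m}, where m = n - k."""
    n = k + m
    return Fraction(2 * factorial(k - 1) * factorial(m),
                    (k + 1) * factorial(n - 1))

def both_closed(k, m):
    """Assumed closed form for the subtracted (red box) term."""
    n = k + m
    return Fraction(2, (n - 1) * comb(n, k))

def either_monophyletic(k, m):
    """P(A or B finds its MRCA before coalescing with the other subsample)."""
    return q_closed(k, m) + q_closed(m, k) - both_closed(k, m)

print(q_recursive(2, 2), q_closed(2, 2), either_monophyletic(2, 2))
```

For n = 4 and k = 2 the recursion and the closed form both give 2/9, which can also be confirmed by listing the possible orders of coalescence by hand.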

Disjoint subsamples (cont'd)
Figure: division of the sample into disjoint subsamples A and B such that all genes in A and all genes in B find their own MRCAs before the two subsamples find a common MRCA. The probability of this event is the last term of the equation on the previous slide.

Jump process of disjoint subsamples
We talked about the jump process last time; it describes which pair of genes coalesces at each coalescent event. With i ancestors of A and j ancestors of B, the jumps are:
(i,j) -> (i-1,j) with probability (i+1)/(i+j)
(i,j) -> (i,j-1) with probability (j-1)/(i+j)
The process starts in (k, n-k) and continues until it reaches (1,j) for some j. It eventually jumps to (0,j) for some j and finally reaches (0,1), where 0 denotes that subsample A has been fully absorbed into B. To simulate the genealogy of A and B, one might first simulate the jump process and subsequently simulate the times between events.
Three quantities are of interest here:
T(A-1) = time of the MRCA of A
T(A-0) = time until the last lineage of A is absorbed into the genealogy of B
ANC_B = the number of ancestors of B at the time of the MRCA of A
These quantities are key numbers in the description of the genealogy of the whole sample. One can give a full probabilistic treatment of them, deriving the densities of all three, but we are only going to look at their mean values. If both k and n-k are large, E(ANC_B) is approximately 3/x - 2, where x = k/n is the frequency of A in the sample.
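The jump process is easy to simulate, and E(ANC_B) can also be computed exactly from the same transition probabilities. The sketch below is illustrative: the function names are made up, and it assumes that once A is absorbed (i = 0) the remaining B lineages simply coalesce one at a time. The exact values agree with 1 + 3(n-k-1)/(k+2), a closed form that is not stated in the slides but satisfies the same recursion and reduces to 3/x - 2 when k and n-k are large.

```python
import random
from fractions import Fraction
from functools import lru_cache

def simulate_jump_chain(k, m, rng):
    """Simulate the states (i, j) from (k, m) down to (0, 1); needs k, m >= 1."""
    i, j = k, m
    path = [(i, j)]
    while (i, j) != (0, 1):
        # While A has lineages left, use the slide's jump probabilities;
        # after absorption (i == 0) only B lineages remain to coalesce.
        if i >= 1 and rng.random() < (i + 1) / (i + j):
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    return path

def anc_b_from_path(path):
    """Number of B ancestors at the first state where A has one ancestor."""
    return next(j for i, j in path if i == 1)

@lru_cache(maxsize=None)
def expected_anc_b(i, j):
    """Exact E(ANC_B) starting from i ancestors of A and j ancestors of B."""
    if i == 1:
        return Fraction(j)
    value = Fraction(i + 1, i + j) * expected_anc_b(i - 1, j)
    if j >= 2:
        value += Fraction(j - 1, i + j) * expected_anc_b(i, j - 1)
    return value

rng = random.Random(2024)
print(simulate_jump_chain(3, 3, rng))
print(float(expected_anc_b(19, 16)), 3 / (19 / 35) - 2)  # exact vs. 3/x - 2
```

The second print line compares the exact mean with the large-sample approximation for k = 19 and n = 35; for samples this small the two are close but not equal.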

Disjoint subsamples: example
4,200 bp of the PDHA1 gene were sequenced in 35 male individuals from different human populations: 16 individuals from African populations and 19 from non-African populations. The gene tree, constructed after removing one incompatible site and rooted using a chimpanzee sequence, is shown in the figure. Note that the MRCA is estimated to be almost 2 million years old. Haplotypes A and B were found in non-African populations only, the remaining haplotypes in African populations only.
What is the probability that a subsample A of size k = 19 finds a MRCA before coalescing with an ancestor of the n - k = 16 remaining genes? Using the equation for q_{k,n-k} on slide 5, we get 4.5 x 10^-11.
Figure: gene tree of the PDHA1 gene from a sample of Africans and non-Africans.
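The quoted value can be reproduced in a couple of lines. This sketch assumes the closed form q_{k,n-k} = 2(k-1)!(n-k)!/((k+1)(n-1)!), which is consistent with the number on the slide; the function name is illustrative.

```python
from math import factorial

def q_monophyletic(k, n):
    """P(a subsample of size k out of n finds its MRCA before mixing with the rest)."""
    return 2 * factorial(k - 1) * factorial(n - k) / ((k + 1) * factorial(n - 1))

# k = 19 non-African sequences in a sample of n = 35
print(f"{q_monophyletic(19, 35):.1e}")  # about 4.5e-11, the value quoted above
```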

A sample partitioned by a mutation
Now consider a sample of size n where a polymorphism divides the sample into two disjoint subsamples, A and B, of sizes k and n-k, respectively. Assume that just one mutation separates them and that the scaled mutation rate is very low. This is often believed to be the case for many SNPs: the mutation rate for a single locus might be as low as u = 10^-8 per generation, and with N = 10^4 we get theta = 4Nu = 4 x 10^-4, which is almost zero.
Let us for now assume that we know which allele is the oldest; say subsample B carries the oldest allele. For this configuration to arise, subsample A must find a MRCA before coalescing with any sequence in B, and a mutation must have occurred on the branch connecting the genealogy of A with that of B (see the figure on slide 4). In this case T(A-0) is replaced by T_MUT, the age of the mutation.
The expressions for the mean values are fairly unintuitive and difficult to handle, but they become more manageable if k and n-k are large. The figure on the next slide compares the mean values conditioned on a mutation with the values obtained without conditioning on a mutation causing the split between A and B.

Comparing the mean values
Figure: the mean values conditioned on the mutation compared with the values obtained without conditioning on a mutation causing the split between A and B. The dotted curve is the ratio of E(T_MUT | mutation) to E(T(A-0)). The solid curve is the ratio of E(ANC_B | mutation) to E(ANC_B). Since T_MUT and T(A-0) do not measure exactly the same time, the dotted curve falls below 1 for large frequencies.
Jump process conditional on the mutation:
(i,j) -> (i-1,j) with probability i/(i+j-1)
(i,j) -> (i,j-1) with probability (j-1)/(i+j-1)
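The ratio for ANC_B can be computed exactly for small samples by running the same recursion under both sets of jump probabilities. This is a rough sketch under two assumptions: that the jump probabilities listed on this slide are those of the process conditional on the mutation, and that the unconditional ones are (i+1)/(i+j) and (j-1)/(i+j) as on the earlier jump-process slide. The function names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def mean_anc_b(i, j, conditioned):
    """Expected number of B ancestors when A first has a single ancestor."""
    if i == 1:
        return Fraction(j)
    if conditioned:   # jump probabilities conditional on the mutation
        p_a, p_b = Fraction(i, i + j - 1), Fraction(j - 1, i + j - 1)
    else:             # jump probabilities without the mutation conditioning
        p_a, p_b = Fraction(i + 1, i + j), Fraction(j - 1, i + j)
    value = p_a * mean_anc_b(i - 1, j, conditioned)
    if j >= 2:
        value += p_b * mean_anc_b(i, j - 1, conditioned)
    return value

def anc_ratio(k, n):
    """E(ANC_B | mutation) / E(ANC_B) for a subsample of size k out of n."""
    return mean_anc_b(k, n - k, True) / mean_anc_b(k, n - k, False)

# Ratio at a few frequencies x = k/n for n = 40.
for k in (4, 10, 20, 30, 36):
    print(k / 40, float(anc_ratio(k, 40)))
```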

Unknown ancestral state
If we do not know which of the two alleles is older, the situation is slightly different. The probability that an allele found in k out of n genes is the oldest is k/n, so the probability that A carries the mutant allele is 1 - k/n = (n-k)/n. The jump process becomes:
(i,j) -> (i-1,j) with probability i/(i+j)
(i,j) -> (i,j-1) with probability j/(i+j)
Either i or j can become zero, but never both. Here j/(i+j) is the probability that A carries the mutant allele and i/(i+j) is the probability that B carries the mutant allele.
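One way to see that these jump probabilities agree with the k/n rule is to compute the probability that i reaches zero before j does. The sketch below does this by exact recursion; reading "i reaches zero first" as "A carries the mutant allele" is an interpretation of the slide, and the function name is made up.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def prob_a_absorbed_first(i, j):
    """P(i hits 0 before j does) under the unknown-ancestral-state jump process."""
    if i == 0:
        return Fraction(1)
    if j == 0:
        return Fraction(0)
    return (Fraction(i, i + j) * prob_a_absorbed_first(i - 1, j)
            + Fraction(j, i + j) * prob_a_absorbed_first(i, j - 1))

# With k = 19 and n - k = 16 the result should match (n - k) / n = 16/35.
print(prob_a_absorbed_first(19, 16), Fraction(16, 35))
```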

The age of the MRCA for two sequences
Before, we assumed that there was only one mutation in the sample and that the mutation rate was negligible. Now consider two sequences with S_2 = k segregating sites; we want to evaluate the time T_2 until their MRCA, conditional on the k mutations. In particular, the mean is
$$E(T_2 \mid S_2 = k) = \frac{1+k}{1+\theta},$$
which is larger or smaller than the unconditional mean E(T_2) = 1 according to whether k > theta or k < theta, respectively.
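The conditional mean can be checked numerically. The sketch below assumes the usual two-sequence model: T_2 is exponential with mean 1, and given T_2 = t the number of segregating sites is Poisson with mean theta*t. Under that assumption the conditional density of T_2 is proportional to t^k exp(-(1+theta)t), and the mean should come out as (1+k)/(1+theta). The function name and the integration settings are illustrative choices.

```python
import math

def conditional_mean_t2(k, theta, t_max=60.0, steps=100_000):
    """Midpoint-rule estimate of E(T_2 | S_2 = k)."""
    h = t_max / steps
    num = den = 0.0
    for s in range(steps):
        t = (s + 0.5) * h
        # prior density exp(-t) times Poisson likelihood, dropping the 1/k! constant
        w = math.exp(-t) * math.exp(-theta * t) * (theta * t) ** k
        num += t * w
        den += w
    return num / den

theta = 1.5
for k in (0, 1, 3, 6):
    print(k, conditional_mean_t2(k, theta), (1 + k) / (1 + theta))
```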

Probability of going from n ancestors to k ancestors
The probability that n ancestors at time 0 have k ancestors at time t is
$$P_{n,k}(t) = \sum_{i=k}^{n} e^{-i(i-1)t/2}\, \frac{(2i-1)(-1)^{i-k}\, k_{(i-1)}\, n_{[i]}}{k!\,(i-k)!\; n_{(i)}},$$
where n_[i] = n(n-1)...(n-i+1) and n_(i) = n(n+1)...(n+i-1).
Figure: probability of different numbers of ancestors, starting with seven ancestors at time 0. The curve that starts at probability 1 at time 0 and then descends towards 0 is the probability of seven ancestors. The curve that peaks first is the probability of six ancestors, followed by the curve for five ancestors, and so on. The curve that starts at 0 and converges to 1 as time increases is the probability of only one ancestor of the sample. At each time point the curves sum to 1.
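The curves in the figure can be recomputed from the transition probability of the number of ancestors. The sketch below uses the standard expression due to Tavaré (1984), which is assumed here to be the one on the slide since it uses the same falling and rising factorials n_[i] and n_(i). Time is in coalescent units, so i lineages coalesce at rate i(i-1)/2, and the function names are illustrative.

```python
import math

def falling(x, i):
    """x_[i] = x (x - 1) ... (x - i + 1)."""
    out = 1
    for r in range(i):
        out *= x - r
    return out

def rising(x, i):
    """x_(i) = x (x + 1) ... (x + i - 1)."""
    out = 1
    for r in range(i):
        out *= x + r
    return out

def prob_ancestors(n, k, t):
    """P(k ancestors at time t | n ancestors at time 0)."""
    total = 0.0
    for i in range(k, n + 1):
        coeff = ((2 * i - 1) * (-1) ** (i - k) * rising(k, i - 1) * falling(n, i)
                 / (math.factorial(k) * math.factorial(i - k) * rising(n, i)))
        total += math.exp(-i * (i - 1) * t / 2) * coeff
    return total

# One column of the figure: the distribution over 7, 6, ..., 1 ancestors at t = 0.3.
probs = [prob_ancestors(7, k, 0.3) for k in range(7, 0, -1)]
print(probs, sum(probs))
```

Evaluating the function on a grid of times for each k from 1 to 7 gives the seven curves described in the caption.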

Probability of going from n ancestors to k ancestors
Since the coalescent process is Markovian, it is easy to find the probability of a descending number of ancestors at a series of increasing times. Consider the probability that there are m ancestors at time t1 and k at time t. The probability of m ancestors at time t1, given that we started with n at time 0 and ended with k at time t, is P(m|n… ).
Figure: probability of different numbers of ancestors, starting with seven ancestors at time 0 and ending with four ancestors at time t (n = 7, k = 4). The curve that has probability 1 at time 0 is the probability of seven ancestors. The curve that has probability 1 at the right end of the interval is the probability of four ancestors. The curve that peaks first is the probability of six ancestors, followed by the curve for five ancestors. At each time point the curves add up to 1.
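By the Markov property, the probability of m ancestors at time t1, given n at time 0 and k at time t, can be written as a product of two transition probabilities divided by a third. The sketch below assumes that form and assumes the standard transition probability of the number of ancestors (Tavaré 1984), with time in coalescent units; `p_anc` and `p_bridge` are hypothetical names.

```python
from math import exp, factorial, prod

def p_anc(n, k, t):
    """P(k ancestors at time t | n ancestors at time 0)."""
    rise = lambda x, i: prod(x + r for r in range(i))  # x (x+1) ... (x+i-1)
    fall = lambda x, i: prod(x - r for r in range(i))  # x (x-1) ... (x-i+1)
    return sum(
        exp(-i * (i - 1) * t / 2) * (2 * i - 1) * (-1) ** (i - k)
        * rise(k, i - 1) * fall(n, i)
        / (factorial(k) * factorial(i - k) * rise(n, i))
        for i in range(k, n + 1)
    )

def p_bridge(m, n, k, t1, t):
    """P(m ancestors at t1 | n ancestors at time 0 and k ancestors at time t)."""
    return p_anc(n, m, t1) * p_anc(m, k, t - t1) / p_anc(n, k, t)

# n = 7 at time 0, k = 4 at time t = 0.5: distribution over m at t1 = 0.2.
dist = {m: p_bridge(m, 7, 4, 0.2, 0.5) for m in range(7, 3, -1)}
print(dist, sum(dist.values()))
```

Only m between k and n has positive probability, which is why the figure shows four curves for n = 7 and k = 4.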

Probability of going from n ancestors to k ancestors
Figure: the probability that a sample of three genes has two ancestors at time t. Each area is labelled with a set of sequences that have not yet found any common ancestor. Each circle is the event that a given pair of genes has not found a common ancestor by time t, and has probability e^{-t}. The intersection of all three circles, the area labelled 1,2,3, is the event that no genes have found common ancestors by time t; it has probability e^{-3t}. Each of the three intersections of two circles that exclude the third circle has probability (e^{-t} - e^{-3t})/2.
Note that the figure can be misleading, because the parts of each circle outside the intersections are empty: if two pairs of genes have found common ancestors at time t, then all three pairs have. Therefore the probabilities of all areas are defined, and it is a simple addition-subtraction exercise to obtain the probabilities of interest. The event that the three genes have two ancestors consists of the areas contained in exactly two circles. The area outside all circles is the event that all three genes have found a common ancestor.
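The addition-subtraction exercise can be written out and checked against a small simulation. In this sketch the function names are illustrative, and the simulation assumes the standard coalescent waiting times: exponential with rate 3 while there are three lineages and rate 1 while there are two.

```python
import math
import random

def three_gene_probabilities(t):
    """Probabilities of 3, 2 and 1 ancestors at time t for a sample of three genes."""
    e1, e3 = math.exp(-t), math.exp(-3 * t)
    three = e3                 # inside all three circles
    two = 3 * (e1 - e3) / 2    # the three areas that lie in exactly two circles
    one = 1 - two - three      # outside every circle
    return {3: three, 2: two, 1: one}

def simulate_two_ancestors(t, reps, rng):
    """Monte Carlo estimate of P(two ancestors at time t)."""
    hits = 0
    for _ in range(reps):
        t3 = rng.expovariate(3.0)  # time spent with three lineages
        t2 = rng.expovariate(1.0)  # time spent with two lineages
        if t3 <= t < t3 + t2:
            hits += 1
    return hits / reps

probs = three_gene_probabilities(0.5)
print(probs, simulate_two_ancestors(0.5, 20_000, random.Random(7)))
```

A single circle is the all-distinct area plus two of the two-circle areas, which adds up to e^{-t} as stated in the text.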

Questions? Slides are available on the Wiki at: http://compgen.unc.edu/Courses/index.php/Comp_790-087