Combinatorics & the Coalescent (26.2.02) Tree Counting & Tree Properties. Basic Combinatorics. Allele distribution. Polya Urns + Stirling Numbers. Number.

Slides:



Advertisements
Similar presentations
Many useful applications, especially in queueing systems, inventory management, and reliability analysis. A connection between discrete time Markov chains.
Advertisements

Hierarchical Dirichlet Processes
Chapter 2 Concepts of Prob. Theory
Population Genetics, Recombination Histories & Global Pedigrees Finding Minimal Recombination Histories Global Pedigrees Finding.
Combinatorics of Phylogenies for open problems Semple and Steel (2003)
Chapter 4 Probability and Probability Distributions
Sampling distributions of alleles under models of neutral evolution.
Preview What does Recombination do to Sequence Histories. Probabilities of such histories. Quantities of interest. Detecting & Reconstructing Recombinations.
Entropy Rates of a Stochastic Process
6.896: Probability and Computation Spring 2011 Constantinos (Costis) Daskalakis lecture 2.
Molecular Evolution Revised 29/12/06
Combinatorial optimization and the mean field model Johan Wästlund Chalmers University of Technology Sweden.
Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA
DATA ANALYSIS Module Code: CA660 Lecture Block 2.
Chapter 4: Stochastic Processes Poisson Processes and Markov Chains
Haplotyping via Perfect Phylogeny Conceptual Framework and Efficient (almost linear-time) Solutions Dan Gusfield U.C. Davis RECOMB 02, April 2002.
My wish for the project-examination It is expected to be 3 days worth of work. You will be given this in week 8 I would expect 7-10 pages You will be given.
Advanced Questions in Sequence Evolution Models Context-dependent models Genome: Dinucleotides..ACGGA.. Di-nucleotide events ACGGAGT ACGTCGT Irreversibility.
Incorporating Mutations
Introduction to Stochastic Models GSLM 54100
Applied Discrete Mathematics Week 10: Equivalence Relations
The Coalescent & Human Sequence Variation ( ) I. The Human Population & its Genome. The Existing Data: SNPs & Haplotypes. Reconstructing Haplotypes.
The Erdös-Rényi models
The Poisson Process. A stochastic process { N ( t ), t ≥ 0} is said to be a counting process if N ( t ) represents the total number of “events” that occur.
Chapter 8: Probability: The Mathematics of Chance Lesson Plan Probability Models and Rules Discrete Probability Models Equally Likely Outcomes Continuous.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Fall 2002CMSC Discrete Structures1 One, two, three, we’re… Counting.
ELEC 303, Koushanfar, Fall’09 ELEC 303 – Random Signals Lecture 4 – Counting Farinaz Koushanfar ECE Dept., Rice University Sept 3, 2009.
Lecture 3: population genetics I: mutation and recombination
Trees & Topologies Chapter 3, Part 1. Terminology Equivalence Classes – specific separation of a set of genes into disjoint sets covering the whole set.
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
Discrete Structures Lecture 12: Trees Ji Yanyan United International College Thanks to Professor Michael Hvidsten.
What is of interest to calculate ? for open problems Semple and Steel.
Week 11 What is Probability? Quantification of uncertainty. Mathematical model for things that occur randomly. Random – not haphazard, don’t know what.
Ch.6 Phylogenetic Trees 2 Contents Phylogenetic Trees Character State Matrix Perfect Phylogeny Binary Character States Two Characters Distance Matrix.
Models and their benefits. Models + Data 1. probability of data (statistics...) 2. probability of individual histories 3. hypothesis testing 4. parameter.
Getting Parameters from data Comp 790– Coalescence with Mutations1.
Gene tree discordance and multi-species coalescent models Noah Rosenberg December 21, 2007 James Degnan Randa Tao David Bryant Mike DeGiorgio.
1 Population Genetics Basics. 2 Terminology review Allele Locus Diploid SNP.
Chapter 01 Probability and Stochastic Processes References: Wolff, Stochastic Modeling and the Theory of Queues, Chapter 1 Altiok, Performance Analysis.
Chapter 01 Probability and Stochastic Processes References: Wolff, Stochastic Modeling and the Theory of Queues, Chapter 1 Altiok, Performance Analysis.
Relevant Subgraph Extraction Longin Jan Latecki Based on : P. Dupont, J. Callut, G. Dooms, J.-N. Monette and Y. Deville. Relevant subgraph extraction from.
Population genetics. coalesce 1.To grow together; fuse. 2.To come together so as to form one whole; unite: The rebel units coalesced into one army to.
CS 103 Discrete Structures Lecture 13 Induction and Recursion (1)
The generalization of Bayes for continuous densities is that we have some density f(y|  ) where y and  are vectors of data and parameters with  being.
Review of Probability. Important Topics 1 Random Variables and Probability Distributions 2 Expected Values, Mean, and Variance 3 Two Random Variables.
Main Menu Main Menu (Click on the topics below) Combinatorics Introduction Equally likely Probability Formula Counting elements of a list Counting elements.
Probability Theory Modelling random phenomena. Permutations the number of ways that you can order n objects is: n! = n(n-1)(n-2)(n-3)…(3)(2)(1) Definition:
By Mireya Diaz Department of Epidemiology and Biostatistics for EECS 458.
Coalescent theory CSE280Vineet Bafna Expectation, and deviance Statements such as the ones below can be made only if we have an underlying model that.
Testing the Neutral Mutation Hypothesis The neutral theory predicts that polymorphism within species is correlated positively with fixed differences between.
Restriction enzyme analysis The new(ish) population genetics Old view New view Allele frequency change looking forward in time; alleles either the same.
2/24/20161 One, two, three, we’re… Counting. 2/24/20162 Basic Counting Principles Counting problems are of the following kind: “How many different 8-letter.
Chapter 8: Probability: The Mathematics of Chance Lesson Plan Probability Models and Rules Discrete Probability Models Equally Likely Outcomes Continuous.
Fixed Parameters: Population Structure, Mutation, Selection, Recombination,... Reproductive Structure Genealogies of non-sequenced data Genealogies of.
A Little Intro to Statistics What’s the chance of rolling a 6 on a dice? 1/6 What’s the chance of rolling a 3 on a dice? 1/6 Rolling 11 times and not getting.
Recombination and Pedigrees Genealogies and Recombination: The ARG Recombination Parsimony The ARG and Data Pedigrees: Models and Data Pedigrees & ARGs.
Minimal Recombinations Histories and Global Pedigrees Finding Minimal Recombination Histories Acknowledgements Yun Song - Rune Lyngsø - Mike Steel - Carsten.
Discrete Structures Li Tak Sing( 李德成 ) Lectures
Trees & Topologies Chapter 3, Part 2. A simple lineage Consider a given gene of sample size n. How long does it take before this gene coalesces with another.
The Pigeonhole Principle
3. Random Variables (Fig.3.1)
Math a - Sample Space - Events - Definition of Probabilities
Polymorphism Polymorphism: when two or more alleles at a locus exist in a population at the same time. Nucleotide diversity: P = xixjpij considers.
COCS DISCRETE STRUCTURES
Testing the Neutral Mutation Hypothesis
The coalescent with recombination (Chapter 5, Part 1)
Recombination, Phylogenies and Parsimony
Trees & Topologies Chapter 3, Part 2
Trees & Topologies Chapter 3, Part 2
Presentation transcript:

Combinatorics & the Coalescent ( ) Tree Counting & Tree Properties. Basic Combinatorics. Allele distribution. Polya Urns + Stirling Numbers. Number af ancestral lineages after time t. Inclusion-Exclusion Principle.

A set of realisations (from Felsenstein)

Binomial Numbers 12345n n Binomial Expansion: Special Cases:

Recursion:Initialisation: n= k = r 34 n-1 n-r-1 34 n-1 n-r r r 34 6 n-r n n ++

The Exponential Distribution. The Exponential Distribution: R+ Expo(a) Density: f(t) = ae -at, P(X>t)= e -at Properties: X ~ Exp(a) Y ~ Exp(b) independent i. P(X>t 2 |X>t 1 ) = P(X>t 2 -t 1 ) (t 2 > t 1 ) ii. E(X) = 1/a. iii. P(X < Y) = a/(a + b). iv. min(X,Y) ~ Exp (a + b). v. Sums of k iid X i is  (k,a) distributed

The Standard Coalescent Two independent Processes Continuous: Exponential Waiting Times Discrete: Choosing Pairs to Coalesce WaitingCoalescing (4,5) (1,2)--(3,(4,5)) 1--2 {1}{2}{3}{4}{5} {1,2}{3,4,5} {1,2,3,4,5} {1,2}{3}{4,5} {1}{2}{3,4,5}

Tree Counting Tree: Connected undirected graph without cycles. k nodes (vertices) & k-1 edges. Nodes with one edge are leaves (tips) - the rest are internal. s1 s4 s6 s5 s3 s2 a2a1a3 a4 r Ignore root & branch lengths gives unrooted tree topology. If age ordering of internal nodes are retained this gives the coalescent topology. Labels of internal nodes are permutable without change of biological interpretation. If labels at leaves are ignored we have the shape of a tree. Most biological trees are bifurcating. Valency 3 (number of edges touching internal nodes) if made unrooted. Such unrooted trees have n-2 internal nodes & 2n-3 edges.

Counting by Bijection Bijection to a decision series: 321k1k1 Level 0 Level 1 Level 2 Level L 321k2k2 132N N=k 1 *k 2 *...*k L

Trees: Rooted, bifurcating & nodes time-ranked. k 1 ji k-1 1 (i,j) k-2 1 (i,j)(n,m) m n Recursion: T k = T k-1 Initialisation: T 1 = T 2 =1

Trees: Unrooted & valency Recursion: T n = (2n-5) T n-1 Initialisation: T 1 = T 2 = T 3 =1

Coalescent versus unrooted tree topologies 4 leaves: 3 unrooted trees & 18 coalescent topologies. 1 unrooted tree topology contains 6 coalescent topologies

External (  ) versus Internal  Branches. E(  ) = 2 E(  ) = Inner & outer branches Fu & Li (1993) Red - external. Others internal. Except for green branch, internal- external corresponds to singlet/non- singlet segregating sites if only one mutation can happen per position. ACTTGTACGA TCTTATACGA ACTTATACGA s n Let l i,n be length of i’th external branch in an n-tree. Obviously E(  ) = nE(l n,i ) (any i) l n-1,j + t n Pr= 1-2/n L n,i = t n Pr= 2/n

Probability of hanging Sub-trees. Kingman (1982b) 1243n t=0 t=t 1 12k For a coalescent with n leaves at time 0, with k ancestors at time t 1, let  be the groups of leaves of the k subtrees hanging from time t 1. Let 1, 2.., k be the number of leaves of these sub-trees. Example: n=8, k=3. Classes observed : 4, 3, 1 The basal division splits the leaves into (k,n-k) sets with probability: 1/(n-1).

Nested subsamples (Saunders et al.(1986) Adv.Appl.Prob ) t=0 Population Sub-sampleSample t=t 1 i’ i j j’ 2N Transitions i,j i-1,j i-1,j-1 i,j 1,1 2,12,2 3,13,23,3 4,14,24,34,4 5,15,25,35,45,5 6,16,26,36,46,56,6 7,17,27,37,47,57,67,7 8,18,28,38,48,58,68,78,8 9,19,29,39,49,59,69,79,89,9 i, j

Nested subsamples (Saunders et al.(1986) Adv.Appl.Prob ) Pr{MRCA(sub-sample) = MRCA(sample)} = Pr{MRCA(sub-sample) = MRCA(population)} =

Age of a Mutation Wiuf & Donnelly (1999) Wiuf (2000), Matthews (2000) The probability that there are k differences between two sequences. Going back in time 2 kinds of events can occur (mutations (  - or a coalescent (1). This gives a geometric distribution: --* *------* *----*----*----*--- Exp(1) Exp(  )

Polya Urns & Infinite Allele Model (Donnelly, Hoppe, ) The only observation made in the infinite allele models is identity/non-identity among all pairs of alleles. I.e. The central observation is a series of classes and their sizes. What is the next event - a duplication of an exiting type or a introduction of a ”new” allele. This model will give rise to distributions on partitions of {1,2,..,n} like {1,4,7}{2,3}{5}{6}. Since the labelling is arbitrary, only the information about the size of these groups is essential for instance represented as Expected number of mutations in unit interval (2N) is .

Classical Polya Urns Feller I. 213 Let X 0 be the initial configuration of the initial Urn. A step: take a random ball the urn and put it back together with an extra of the same colour. X k be the content after the k’th step. Let Y k be the colour of the k’th picked ball. i. P{Y k =j} = P{Y 1 =j}. ii. Sequences Y 1... Y k resulting in the same X k - has the same probability.

Labelling, Polya Urns & Age of Alleles (Donnelly, Hoppe, ) An Urn: 1   A ball is picked proportionally to its weight. Ordinary balls have weight 1. If the initial  -size ball is picked, it is replaced together with a completely new type. If an ordinary ball is picked, it is replaced together with a copy of itself. There is a simple relationship between the distribution of ”the alleles labeled with age ranking” is the same as ”the alleles labeled with size ranking” As they come By size By age

Ewens' formula. (1972 TPB ) P n (a 1,a 2,,a n ) = k is a minimal sufficient statistic for   the probability of the data conditioned on k is  -less and there is no simpler such statistic. P n (a 1,a 2,,a n ;k) = E n (k types) = P 5 (2,0,1,0,0) is the probability of seeing 2 singles and one allele in 3 copies in a sample of 5. Obviously, a 1 +2a 2 + +ia i +na n =n

Stirling Numbers Partitioning into k sets - Stirling Numbers (of second kind) - S n,k k unlabelled bins - all non-empty. Bell Numbers - B n - Partioning into any number of sets. k n n k Obviously: B

Stirling Numbers Basic Recursion: S n,k = kS n-1,k + S n-1,k-1 Initialisation: S n,1 = S n,n = 1. n-1 items - k classes: {..},{..},..,{..} (n-1,k-1): {..},{..},..,{..} (n,k) : {..},{..},..,{..} + ”n”

Ewens' formula - example. (1972 TPB ) P 5 (2,0,1,0,0) = P 5 (2,0,1,0,0;3) = Assume has been observed and that 0.5 mutation is expected per unit (2N) time. E 5 (k types) =

Ancestors to Ancestors Griffiths(1980), Tavaré(1984) h i,j = probability that i individuals has j ancestors after time t. i [k] = i(i-1)..(i-k+1) i (k) = i(i+1)..(i+k-1) Example: Disappearance of 7 lineages.

Y:# of Ancestors to time t. 3 methods of solution: i.Sum of different independent exponential distributions: ii. Distribution in markov chain: iii. Combination of known probabilities: a. Probability that i alleles has i/less ancestors. b. This probability is the same for all i-sets c. No coalescence within a set, implies no coalescence within all subsets. ij-11 i-1j+1j 1 t

3 Ancestors to 2 Ancestors : (3/2)(e -t - e -3t ) ?: (e -t - e -3t )/2 Exactly one coalescence:3(e -t - (e -t - e -3t )/2)-e -3t ) 1,2 2,3 1,3 1,2,3 (2,3) (1,2) (1,3) e -3t e -t (2,3) (1,3) (1,2) e -t ? ?? {1,2,3} {1,2}{1,3}{2,3} Jordan’s Sieve: A 1 : 3e -t - 2A 2 : 2 ((e -t + e -3t )/2) + 3A 3 : 3 e -3t

The exclusion-inclusion principle. 112I IIVenn Diagrams: {I + II} - {I} + {II} + {I&II} = III III {I+II+III} = {I}+ {II}+ {III} - ({I,II} + {I,III} + {II,III}) +{I,II,III}

Exclusion-inclusion& Jordan’s Sieve S j j=1,..,r the given sets, A k - sum of intersection of k sets in 1 sets A 1 - 2A 2 + 3A A 4 in 2 sets A 2 - 3A A 4 in 3 sets A A 4 in 4 sets A 4 in some set A 1 - A 2 + A 3 - A 4 Example: the elements above: Total number: In exactly m sets: (Jordan’s Sieve) (Jordan’s Sieve) exclusion-inclusion

Surviving Lineages Which probability statements can be made? Let s be subset of i {1,2,..i} and S(s) be the event that no coalescence has happened to s. Additionally, if s’ is a subset of s, then S(s) implies S(s’). {2,..,i} {1,2,..,i-1} {1,3..,i} {1,2,..,i} i 1 i-1 i Size number {1,2} e -t (i-1,i) e -t 2 j

Surviving Lineages where There aresets. We want events member of only one of them. Summation is over all k-subsets of {1,..,r} and intersection is between the k sets chosen.

0 t1t1 t P k (t 1 ) = h i,k (t 1 )* h k,j (t- t 1 )/ h i,j (t) Example: 7 --> 4 lineages.

Summary Tree Counting & Tree Properties. Basic Combinatorics. Allele distribution. Polya Urns + Stirling Numbers. Number af ancestral lineages after time t. Inclusion-Exclusion Principle.

Recommended Literature Bender(1974) Asymmptotic Methods in Enumeraion Siam Review vol Donnelly (1986) ”Theor.Pop.Biol. Ewens (1972) Theor.Pop.Biol. Ewens (1989) ”Population Genetics Theory - The Past and the Future” Feller ( ) Probability Theory and its Applications I + II Wiley Fu & Li (1993) ”Statistical Tests of Neutrality of Mutations” Genetics Griffiths (1980) Griffiths & Tavaré(1998) ”The Age of a mutation on a general coalescent tree. Griffiths & Tavaré(1999) ”The ages of mutations in gene trees” Griffiths & Tavaré(2001) ”The genealogy of a neutral mutation” Hoppe (1984) ”Polya-like urns and the Ewens’ sampling formula” J.Math.Biol Kingman (1982) ”On the Genealogy of Large Populations” Kingman (1982) ”The Coalescent” Stochastic Processes and their Applications Kingman (1982) Matthews,S.(1999) ”Times on Trees, and the Age of an Allele” Theor.Pop.Biol Möhle Pitman Schweinsberg Simonsen & Churchill (1997) Saunders et al.(1986) ”On the genealogy of nested subsamples from a haploid population” Adv.Apll.Prob Tajima (1983) Evolutionary Relationships of DNA Sequences in Finite Poulations Genetics Tavaré (1984) Line-of-Descent and Genealogical Processes, and Their Application in Population Genetics Models. Theor.Pop.Biol Thompson,R. (1998) ”Ages of mutations on a coalescent tree” Math.Bios van Lint & Wilson (1991) A Course in Combinatorics - Cambridge Wiuf (2000) On the Genealogy of a Sample of Neutral Rare Alleles. Theor.Pop.Biol Wiuf & Donnelly (1999) Conditional Genealogies and the Age of a Mutant. Theor. Pop.Biol