Slide 1: Thank you Elizabeth for the introduction, and hello everybody. So, I have been a PhD student with Charles Semple and Mike Steel at the UoC since February.
Slide 2: A central question in conservation biology is how to measure, predict and counter the loss of biodiversity as species face extinction. I am working on problems related to the so-called phylogenetic diversity, which measures how much of an evolutionary tree is spanned by a subset of species. I am particularly interested in questions that require tools from combinatorics, complexity theory, algorithms and probability theory.
Slide 3: My talk will present joint work with Mike Steel and Fabio Pardi. It is about the distribution of phylogenetic diversity under random extinction.
Slide 4: Before presenting the results, I’m going to describe the probabilistic model under consideration. First I would like to define the notion of phylogenetic diversity, or PD for short. We have a set X of present species and a rooted phylogenetic X-tree, which represents the evolutionary development of these taxa from their common ancestor. In my example in the figure, present species are illustrated by coloured circles. We also have lengths on the edges; more precisely, there is a map lambda which assigns a non-negative length to each edge. We denote these lengths by lambdas. For example, edge e has length lambda_e, and so on. Such a length doesn’t necessarily refer to the temporal duration of the development on the edge; rather, it may represent the amount of genetic change on that edge, or perhaps other features such as morphological diversity. For a subset X’ of X, the phylogenetic diversity of that species set is the sum of the lengths of the edges of the subtree that connects this subset and the root vertex. So phylogenetic diversity is a quantitative tool for measuring how genetically diverse a species set is. Now look at the figure again; the PD score of the subset containing the blue and the green species is the sum of the lengths lambda_a, lambda_b, lambda_d, and so on, because these are the lengths of the edges of the subtree connecting the blue and the green species and the root. The edges of that subtree are indicated by coloured lines in the figure.
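To make the definition concrete, here is a minimal sketch of the PD computation. The tree, the taxon names A to D, the edge names e1 to e6 and all lengths are made up for illustration; they are not the tree from the figure. Each edge is stored together with the set of leaves below it, so an edge lies on the subtree connecting X’ to the root exactly when that set meets X’.

```python
# Hypothetical rooted tree (not the one in the figure).
# Each edge maps to (length, set of leaves below that edge).
EDGES = {
    "e1": (1.0, {"A", "B"}),  # interior edge above A and B
    "e2": (2.0, {"A"}),       # pendant edge of A
    "e3": (1.5, {"B"}),       # pendant edge of B
    "e4": (0.5, {"C", "D"}),  # interior edge above C and D
    "e5": (3.0, {"C"}),       # pendant edge of C
    "e6": (1.0, {"D"}),       # pendant edge of D
}


def pd_score(edges, subset):
    """Sum the lengths of all edges that have at least one leaf of `subset` below them."""
    subset = set(subset)
    return sum(length for length, below in edges.values() if below & subset)


print(pd_score(EDGES, {"A", "C"}))  # e1 + e2 + e4 + e5 = 6.5
```

The empty set has PD 0 and the full taxon set has PD equal to the total length of the tree.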
Slide 5: In the ’Field of bullets’ model, or FOB model for short, we assume that each species is given a so-called survival probability, that is, we are given a map p that assigns to each taxon i a survival probability p_i. We construct a random set X’ by placing each element of X in X’ independently, with its survival probability. For example, taxon i will be in X’ with probability p_i. We regard X’ as the set of taxa that will still exist at some time in the future. In the simple FOB model, each taxon has the same probability of surviving, whereas the more realistic general FOB model allows each species to have its own survival probability. In both models, extinction events are independent.
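A small sketch of how one could draw the random set X’ under this model. The taxon names and probabilities are invented for illustration, and `sample_survivors` is a hypothetical helper name.

```python
import random


def sample_survivors(survival_prob, rng):
    """Draw X' by keeping each taxon independently with its own survival probability."""
    return {taxon for taxon, p in survival_prob.items() if rng.random() < p}


# General FOB model: each taxon has its own (made-up) survival probability.
general = {"A": 0.9, "B": 0.5, "C": 0.2, "D": 0.7}

# Simple FOB model: every taxon has the same survival probability.
simple = {taxon: 0.5 for taxon in general}

rng = random.Random(0)  # seeded only so that runs are repeatable
print(sample_survivors(general, rng))
print(sample_survivors(simple, rng))
```

The simple model is just the special case in which the map p is constant.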
Slide 6: Under the FOB model, we define the future phylogenetic diversity as the random variable which is the PD score of the random future taxon set X’. Let Phi denote this random variable. The figure shows the situation where the random survival set consists of the blue and the green species. As we have seen, the PD of this set is the sum of the lengths of the coloured edges. So there are edges that we count and other edges that we don’t count. It is easy to see that Phi can be written as the sum of the terms lambda_e times Y_e, where Y_e is the random variable which takes the value 1 if e lies on at least one path between an element of X’ and the root, and which is 0 otherwise. In our example, the future PD would be 0 x lambda_g + 1 x lambda_h and so on. So this is the FOB model. Now imagine that we have a sequence of such models, a sequence of trees of increasing size, each tree having its edge length function and its survival probability function, and consider the sequence of the corresponding future phylogenetic diversities. Our goal was to determine the distribution of Phi along these sequences, that is, as the size of the trees goes to infinity. Let’s forget about the sequences and the asymptotic behavior for a while and start with the mean and variance of Phi. In order to do that, please have a second look at the formula for Phi. Got it?
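The indicator form of Phi can be sketched as follows. Again the tree is a made-up example rather than the one in the figure, and `edge_indicators` and `future_pd` are hypothetical helper names. For a given survival set, each Y_e is 1 exactly when some survivor sits below edge e.

```python
# Hypothetical rooted tree: edge -> (length lambda_e, set of leaves below e).
EDGES = {
    "e1": (1.0, {"A", "B"}),
    "e2": (2.0, {"A"}),
    "e3": (1.5, {"B"}),
    "e4": (0.5, {"C", "D"}),
    "e5": (3.0, {"C"}),
    "e6": (1.0, {"D"}),
}


def edge_indicators(edges, survivors):
    """Return Y_e for every edge: 1 if a survivor lies below e, else 0."""
    survivors = set(survivors)
    return {e: int(bool(below & survivors)) for e, (_, below) in edges.items()}


def future_pd(edges, survivors):
    """Evaluate Phi = sum over edges of lambda_e * Y_e for one survival set."""
    y = edge_indicators(edges, survivors)
    return sum(length * y[e] for e, (length, _) in edges.items())


print(edge_indicators(EDGES, {"A", "C"}))
print(future_pd(EDGES, {"A", "C"}))  # 1.0 + 2.0 + 0.5 + 3.0 = 6.5
```

Note that the Y_e are not independent of each other: if a pendant edge is counted, every edge on the path from it to the root is counted as well.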
Slide 7: Since Y_e is a binary random variable with values 0 and 1, its mean is just the probability that it takes the value 1. Let us denote this probability by P_e. With this notation, we get this nice and simple formula for the expectation. The variance is also quite easy to compute. It includes a sum over all edges in the tree, and another one, which is a sum over edge pairs (e,f), such that e and f are distinct edges and the path from the root to f includes edge e, or equivalently, the species set below f is a proper subset of the species set below e. So, we know how to determine the main parameters of the distribution we set out to study.
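As a sanity check, here is a sketch that computes the mean and variance on a small made-up tree and compares them with a brute-force enumeration of all survival sets. The formulas on the slide are not reproduced in these notes, so the code uses what follows from the description: P_e is one minus the product of (1 - p_i) over the taxa i below e, the mean is the sum of lambda_e P_e, and for a pair (e,f) with f below e the covariance of Y_e and Y_f is P_f (1 - P_e), because Y_f = 1 forces Y_e = 1. Pairs of edges where neither is below the other have disjoint leaf sets, so they are independent and contribute nothing. The tree, the probabilities and the function names are assumptions for illustration.

```python
from itertools import product
from math import isclose, prod

# Hypothetical tree: edge -> (length, leaves below). Assumes distinct edges
# have distinct leaf sets below them (no vertices with a single child).
EDGES = {
    "e1": (1.0, {"A", "B"}),
    "e2": (2.0, {"A"}),
    "e3": (1.5, {"B"}),
    "e4": (0.5, {"C", "D"}),
    "e5": (3.0, {"C"}),
    "e6": (1.0, {"D"}),
}
P_SURVIVE = {"A": 0.9, "B": 0.5, "C": 0.2, "D": 0.7}  # made-up values


def mean_and_variance(edges, p):
    """Mean and variance of Phi from the edge probabilities P_e."""
    # P_e: probability that at least one taxon below e survives.
    P = {e: 1.0 - prod(1.0 - p[i] for i in below) for e, (_, below) in edges.items()}
    mean = sum(length * P[e] for e, (length, _) in edges.items())
    var = sum(length**2 * P[e] * (1.0 - P[e]) for e, (length, _) in edges.items())
    for e, (len_e, below_e) in edges.items():
        for f, (len_f, below_f) in edges.items():
            if below_f < below_e:  # leaf set of f is a proper subset: f is below e
                var += 2.0 * len_e * len_f * P[f] * (1.0 - P[e])
    return mean, var


def brute_force(edges, p):
    """Mean and variance of Phi by enumerating every possible survival set."""
    taxa = sorted(p)
    first = second = 0.0
    for keep in product([0, 1], repeat=len(taxa)):
        survivors = {t for t, k in zip(taxa, keep) if k}
        weight = prod(p[t] if k else 1.0 - p[t] for t, k in zip(taxa, keep))
        phi = sum(length for length, below in edges.values() if below & survivors)
        first += weight * phi
        second += weight * phi * phi
    return first, second - first**2


print(mean_and_variance(EDGES, P_SURVIVE))
print(brute_force(EDGES, P_SURVIVE))
```

The brute-force version is exponential in the number of taxa, whereas the formula only needs the edges and the ancestor pairs, which is what makes it usable on large trees.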
Slide 8: We have seen that Phi is a sum of many random variables. This suggests that for large trees, Phi might be normally distributed. It turns out that under two mild conditions, Phi is asymptotically normally distributed, even under the general model. The first condition is that most of the survival probabilities are not too extreme, so most of them are neither arbitrarily close to 0 nor arbitrarily close to 1. The second one is that the pendant edge lengths are, on average, not too small in relation to the largest edge length in the tree, where pendant edges are edges that are incident with a leaf of the tree. Of course, these conditions can be formulated in a mathematically precise way.
Slide 9: The first question one could ask is, does the result hold if we drop one or both of the conditions? The answer to this question is no. To see why, consider the situation where all species have survival probability 0 or 1. In this case the first condition fails. Since it leads to a degenerate distribution, it is clear that condition 1 can’t be dropped completely. For the second condition, consider the tree in the figure with n leaves. It has n-1 pendant edges that have a length of 1 over (n-1) squared, and two more edges that have length 1. Furthermore, assume that all species have the same survival probability, which is strictly between 0 and 1. Consider now the sequence of such trees as n gets larger and larger. In particular, consider the sequence of the corresponding future phylogenetic diversities. It can be seen that the sequence of these random variables does not converge to a normal distribution. I didn’t tell you the precise form of condition 2, but I can tell you that in this example, condition 1 is satisfied but condition 2 fails, and this implies that condition 2 can’t be removed.
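To get a feeling for the second example, here is a sketch that computes the exact distribution of Phi for one tree shape that matches the description. The figure is not reproduced in these notes, so the shape is an assumption: the root has one leaf attached by a pendant edge of length 1, and one interior edge of length 1 leading to a vertex that carries the other n-1 leaves on pendant edges of length 1/(n-1)^2. The function names are hypothetical as well.

```python
from math import comb


def phi_distribution(n, p):
    """Exact distribution of Phi on the ASSUMED tree shape, as {value: probability}.

    B = 1 if the leaf on the long pendant edge survives, K = number of
    survivors among the n-1 leaves on short pendant edges. Then
    Phi = B + [K >= 1] + K / (n-1)^2.
    """
    m = n - 1
    short = 1.0 / m**2
    dist = {}
    for k in range(m + 1):
        pk = comb(m, k) * p**k * (1.0 - p) ** (m - k)  # P(K = k), binomial
        for b, pb in ((0, 1.0 - p), (1, p)):
            value = b * 1.0 + (1.0 if k > 0 else 0.0) + k * short
            dist[value] = dist.get(value, 0.0) + pk * pb
    return dist


def mass_near(dist, centre, width):
    """Total probability of values within `width` of `centre`."""
    return sum(w for v, w in dist.items() if abs(v - centre) <= width)


dist = phi_distribution(n=201, p=0.5)
print(mass_near(dist, 1.0, 0.01), mass_near(dist, 2.0, 0.01))
```

Under this assumed shape, the short edges contribute at most 1/(n-1) in total, so for large n almost all of the probability sits in two clumps near 1 and near 2. A distribution that is essentially two-point stays two-point after centring and rescaling, which is consistent with the claim that there is no convergence to a normal distribution.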