Phylogenetic Trees Presenter: Michael Tung

Phylogenetic Trees Presenter: Michael Tung mtung@cs.stanford.edu

Overview Common definitions Motivation Phylogenetic Inference –Probabilistic model of evolution –The EM framework Simultaneous Alignment and Phylogeny –Improvements to SATCHMO

Def: Phylogenetic Tree N current day (species || sequences || letters) form the leaves of a (binary) tree N-1 internal nodes correspond to events of divergence (~ past-time species) A phylogenetic tree (T,t) is parameterized by a topology T (simply the set of edges) and a vector t (edge lengths)

Motivation Lots of sequence data! –Cost of collecting additional data is decreasing Little understanding of evolutionary divergence Inferring the Tree of Life (species tree)

Probabilistic Model of Evolution Evolution of a single position Standard Assumptions –Lack of Memory Transitions can be described by a single matrix of conditionals –Reversibility Assume a prior distribution over states

Probabilistic Interpretation What’s the probability of an observation given the tree? Probability of a complete assignment of nodes: But, we only see the leaves.

Probabilistic Interpretation Notice that the tree topology enforces local probabilistic influence A set of sequences are composed of the individual letters Problem: MSA required? We can perform inference on the graphical model by dynamic programming

DP on trees We can exploit the tree structure to compute these probabilities in linear time

Maximum Likelihood We want the most likely tree (in the probabilistic sense) The most likely tree is the one that maximizes the probability of your data occurring Assumption: each observation is independently drawn from this distribution (true?)

Maximum Likelihood Now, simply find the parameters (T,t) that maximize this likelihood. Well, not that simple. Iterative algorithm: Expectation Maximization(EM)

EM Expectation-Maximization is a framework for optimizing a model. E-step: estimate the posterior probability of the missing data using the current model M-step: maximize the expected log- likelihood using the posterior probabilities

Expected Log-likelihood Rewrite the likelihood Key Insight: the most likely tree is the Maximum Spanning Tree!

Edge case:Transforming a tree into an equivalent bifurcating tree Maximum spanning tree could return non- phylogenetic trees Simple transformation preserves likelihood

Avoiding local optima Greedy optimization (hill-climbing) doesn’t guarantee a global optima One solution is Simulated Annealing –Temperature parameter –Perturbed edge weights W

Summary of the structural EM Start with a “good” tree topology (perhaps NJ) Optimize edge lengths Optimize Tree –Make T a binary tree –Perturb weights Iterate

What's SATCHMO? SATCHMO = Simultaneous Alignment and Tree Construction using Hidden Markov mOdels! {set of sequences} to {tree with MSA at each internal node}

However......its too slow. SATCHMO is computationally prohibitive. For ~200 seqs: ClustalW takes around 3 minutes SATCHMO takes 1 hour and 30 minutes Solution: Pre-cluster sequences to jumpstart the tree construction Parallelization

Giving SATCHMO a jumpstart In a binary tree the work increases exponentially near the leaves. If we can cut even a small # of levels, we have done almost all of the work. How? Use a computationally palatable cluster Build down using BETE Build up using SATCHMO

Parallelizing Parallelizing SATCHMO Idea: Let's utilize the whole cluster instead of just one CPU on one machine. Make interactive mode feasible. Computation and data are distributed across machines. What is the parallelization architecture to optimize complexity/latency/caching behavior/bandwidth/paging/load-balancing ? Parallel caveats: Non-determinism Fragility Data type

Parallelizing the all-against-all: v1 An all-against-all computation takes place when we are computing the initial distance matrix. This is the most compute- intensive portion of SATCHMO. Each MSA needs to be scored against each HMM. Distribute the HMMs and MSAs. Rotate the MSAs. W1W1 W2W2 W3W3 M HMMs MSAs scores

Parallelizing the all-against-all: v2 Implementing blocked communications... W1W1 W2W2 W3W3 M HMMs MSAs scores

Parallelizing the all-against-all: v3 Implementing Load-balancing.. W1W1 W2W2 M HMMs MSAs scores HMMs MSAs work request indices

Performance Results Combination of jumpstarting and parallelization can do 200 seqs in 29.90 seconds. v3 performs the best (as expected) More perf testing remains to be done Perf testing being conducted on the NERSC Seaborg. Speedup plot!

Phylogenetic Trees Presenter: Michael Tung

Similar presentations

Presentation on theme: "Phylogenetic Trees Presenter: Michael Tung"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Phylogenetic Trees Presenter: Michael Tung

Similar presentations

Presentation on theme: "Phylogenetic Trees Presenter: Michael Tung"— Presentation transcript:

Similar presentations

About project

Feedback