
Bayesian Prob.

Example
Red box has 2 apples and 6 oranges; blue box has 3 apples and 1 orange. Pick a box, randomly select a fruit, and put it back.

Joint Prob.
Assume p(B=r) = 0.4 and p(B=b) = 0.6:
p(B=r, F=a) = 0.4*(2/8) = 0.1 (2/20)
p(B=r, F=o) = 0.4*(6/8) = 0.3 (6/20)
p(B=b, F=a) = 0.6*(3/4) = 9/20
p(B=b, F=o) = 0.6*(1/4) = 3/20

Joint prob. table:
B\F    a      o
r      2/20   6/20
b      9/20   3/20
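To make the arithmetic concrete, here is a minimal Python sketch of the joint probability table above; the names (P_BOX, FRUITS) and the code organization are illustrative, not from the slides.

```python
P_BOX = {"r": 0.4, "b": 0.6}              # prior p(B)
FRUITS = {"r": {"a": 2, "o": 6},          # red box: 2 apples, 6 oranges
          "b": {"a": 3, "o": 1}}          # blue box: 3 apples, 1 orange

def joint(box, fruit):
    """p(B=box, F=fruit) = p(B) * p(F|B)."""
    counts = FRUITS[box]
    return P_BOX[box] * counts[fruit] / sum(counts.values())

for b in "rb":
    for f in "ao":
        print(f"p(B={b}, F={f}) = {joint(b, f):.2f}")
# p(B=r,F=a)=0.10, p(B=r,F=o)=0.30, p(B=b,F=a)=0.45, p(B=b,F=o)=0.15
```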

Conditional Prob.
The blue box is picked; what is the prob. of selecting an apple?
p(F=a|B=b) = (9/20)/(9/20 + 3/20) = 3/4
Or, p(F=a|B=b) = p(F=a, B=b)/p(B=b)
p(F=a, B=b) = p(F=a|B=b)*p(B=b) = p(B=b|F=a)*p(F=a)   => Product Rule
Similarly, p(F=a|B=r) = 2/8 = 1/4
p(F=a) = p(F=a|B=r)*p(B=r) + p(F=a|B=b)*p(B=b)
       = Σ_x p(F=a, B=x)              => Sum Rule
       = Σ_x p(F=a|B=x)*p(B=x) = 11/20

Bayes' Theorem
When an apple is selected, which box is it from? That is, what are p(B=r|F=a) and p(B=b|F=a)?
Use the product rule: p(F=a, B=b) = p(F=a|B=b)*p(B=b) = p(B=b|F=a)*p(F=a)
p(B=r|F=a) = p(B=r, F=a)/p(F=a) = p(F=a|B=r)*p(B=r)/p(F=a)
In general:
p(X,Y) = p(X|Y)p(Y) = p(Y|X)p(X)          Product Rule
p(X) = Σ_Y p(X,Y)                         Sum Rule
p(Y|X) = p(X,Y)/p(X) = p(X|Y)p(Y)/p(X)    Bayes' Theorem
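A short sketch of the sum and product rules on the same example, hard-coding the joint table above; the function names are illustrative.

```python
JOINT = {("r", "a"): 0.10, ("r", "o"): 0.30,
         ("b", "a"): 0.45, ("b", "o"): 0.15}   # joint table from the slide

def marginal_fruit(fruit):
    """Sum rule: p(F) = sum_B p(B, F)."""
    return sum(JOINT[(b, fruit)] for b in ("r", "b"))

def posterior_box(box, fruit):
    """Bayes' theorem: p(B|F) = p(B, F)/p(F) = p(F|B) p(B)/p(F)."""
    return JOINT[(box, fruit)] / marginal_fruit(fruit)

print(marginal_fruit("a"))        # p(F=a) = 0.55 (= 11/20)
print(posterior_box("r", "a"))    # p(B=r|F=a) = 0.10/0.55 ~ 0.182
print(posterior_box("b", "a"))    # p(B=b|F=a) = 0.45/0.55 ~ 0.818
```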

Another Example
Predict helices and loops in a protein. Known info: helices have a high content of hydrophobic residues.
p_h and p_l: frequencies of an amino acid (AA) being in a helix or a loop
L_h and L_l: likelihoods that a sequence of N AAs is in a helix or a loop
L_h = ∏_N p_h,  L_l = ∏_N p_l
Rather than the likelihoods themselves, their ratio carries more info.
L_h/L_l: is the sequence more likely to be a helical or a loop region?
S = ln(L_h/L_l) = Σ_N ln(p_h/p_l): positive for a helical region
Partition a sequence into N-AA segments (N=300) and score each.
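A sketch of the log-odds score S. The per-residue frequencies below are made-up placeholders; a real table would cover all 20 amino acids and be estimated from known structures.

```python
import math

P_HELIX = {"L": 0.15, "A": 0.12, "K": 0.07, "S": 0.05}  # hypothetical p_h
P_LOOP  = {"L": 0.08, "A": 0.07, "K": 0.06, "S": 0.09}  # hypothetical p_l

def log_odds(segment):
    """S = sum over residues of ln(p_h/p_l); positive favors helix."""
    return sum(math.log(P_HELIX[aa] / P_LOOP[aa]) for aa in segment)

print(log_odds("LLAKS"))   # > 0 suggests a helical segment
```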

Bayesian Prob.
Two models, with two prior probs.: P_prior^0, P_prior^1
P_post^i = L_i P_prior^i / (L_0 P_prior^0 + L_1 P_prior^1)
Log-odds score: S′ = ln(L_1 P_prior^1 / L_0 P_prior^0)
   = ln(L_1/L_0) + ln(P_prior^1/P_prior^0) = S + ln(P_prior^1/P_prior^0)
The difference between S′ and S is simply an additive constant, so the ranking is identical whether we use S′ or S.
Warning: if P_prior^1 is small, S has to be high to make S′ positive.
When P_prior^0 = P_prior^1, S′ = S.
P_post^1 = 1/(1 + L_0 P_prior^0 / L_1 P_prior^1) = 1/(1 + exp(−S′))
S′ = 0 → P_post^1 = 1/2; S′ large and positive → P_post^1 ≈ 1
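The logistic relationship between S′ and the posterior, as a quick sketch:

```python
import math

def posterior_from_log_odds(s_prime):
    """P_post^1 = 1/(1 + exp(-S')), where S' = S + ln(prior1/prior0)."""
    return 1.0 / (1.0 + math.exp(-s_prime))

# S' = 0 gives 1/2; large positive S' approaches 1.
for s in (-2.0, 0.0, 2.0, 5.0):
    print(s, posterior_from_log_odds(s))
```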

Prior and Posterior Probs.
The previous example has two hypotheses (helix or loop): the sequence is described by models 0 and 1, which are defined by p_h and p_l.
Generalize to k hypotheses: models M_k (k = 0, 1, 2, …)
Given a test dataset D, what is the prob. that D is described by each of the models?
Known info: prior probs. P_prior(M_k) for each model, from other info sources
Compute the likelihood of D according to each model: L(D|M_k)
Of interest is not the prob. of D arising from M_k but the prob. of D being described by M_k.
Namely, P_post(M_k|D) ∝ L(D|M_k) P_prior(M_k): the posterior prob.
P_post(M_k|D) = L(D|M_k) P_prior(M_k) / Σ_i L(D|M_i) P_prior(M_i)   => Bayesian prob.
Basic principles: we make inference using posterior probs. If the posterior prob. of one model is higher, it can be chosen as the best model with confidence.
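The general k-model posterior is a single normalization. A minimal sketch with hypothetical likelihoods and priors:

```python
def posteriors(likelihoods, priors):
    """P_post(M_k|D) = L(D|M_k) P_prior(M_k) / sum_i L(D|M_i) P_prior(M_i)."""
    weighted = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Three hypothetical models:
print(posteriors([0.02, 0.05, 0.01], [0.5, 0.3, 0.2]))
# -> [0.370..., 0.555..., 0.074...]
```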

Max Likelihood Phylogeny

Maximum Likelihood (ML) Phylogeny
Given a model of sequence evolution and a proposed tree structure, compute the likelihood that the known sequences would have evolved on that tree.
ML chooses the tree that maximizes this likelihood.
Three sets of parameters:
- Tree topology
- Branch lengths
- Values of the parameters in the rate matrix

What is Likelihood in an ML Tree?
Given a model of sequence evolution at a site:
Likelihood of ancestor X: L(X) = P_XA(t_1) P_XG(t_2)
L(Y) = P_YG(t_4) Σ_X L(X) P_YX(t_3)
L(W) = Σ_Y Σ_Z L(Y) P_WY(t_5) L(Z) P_WZ(t_6)
Total likelihood for the site: L = Σ_W π_W L(W), where π_W is the equilibrium prob.
What we ultimately want are the posterior probs. of different clades (next slide).
[Figure: a five-leaf tree with observed bases A, G, G, T, T at the leaves; internal nodes X, Y, Z under root W; branch lengths t_1–t_6.]
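A sketch of this recursion (Felsenstein's pruning algorithm) under the Jukes-Cantor model. The tree shape follows the slide's figure, but the branch lengths and the choice of the JC model are our assumptions.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor transition prob. P_ab(t), t in substitutions/site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def cond_likelihood(node):
    """L(node)[s]: prob. of the leaves below node given state s at node.
    A leaf is a base string; an internal node is a list of (child, branch_t)."""
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in BASES}
    likes = {s: 1.0 for s in BASES}
    for child, t in node:
        cl = cond_likelihood(child)
        for s in BASES:
            likes[s] *= sum(jc_prob(s, x, t) * cl[x] for x in BASES)
    return likes

# Tree from the slide: X = (A, G), Y = (X, G), W = (Y, Z), Z = (T, T).
# All branch lengths (t1..t6 and the leaf branches under Z) are made up.
X = [("A", 0.1), ("G", 0.2)]          # t1, t2
Y = [(X, 0.1), ("G", 0.3)]            # t3, t4
Z = [("T", 0.1), ("T", 0.1)]          # assumed
W = [(Y, 0.2), (Z, 0.15)]             # t5, t6

L_W = cond_likelihood(W)
L = sum(0.25 * L_W[s] for s in BASES)   # L = sum_W pi_W L(W), pi_W = 1/4
print(L)
```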

Computing Posterior Prob.
The ML tree maximizes the total likelihood of the data given the tree, i.e., L(data|tree).
We want to compute the posterior prob.: P(tree|data). From Bayes' theorem,
P(tree|data) = L(data|tree) * P_prior(tree) / Σ L(data|tree) * P_prior(tree)
(the summation is over all possible trees)
Namely, posterior prob. ∝ L(data|tree) * P_prior(tree)
The problem is the summation over all possible trees.
Moreover, what we really want is, given the data, the posterior prob. that a particular clade of interest is present:
P_post(clade|data) = Σ_clade L(data|tree) * P_prior(tree) / Σ_all trees L(data|tree) * P_prior(tree)
(the numerator sums over trees containing the clade)
In practice, P_post(clade|data) = (# of sampled trees containing the clade) / (total # of trees in the sample)
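The "in practice" estimate is just a frequency count over a tree sample. A minimal sketch, representing each tree as the set of clades it contains (all data here hypothetical):

```python
def clade_posterior(tree_sample, clade):
    """Fraction of sampled trees that contain the clade."""
    clade = frozenset(clade)
    return sum(clade in tree for tree in tree_sample) / len(tree_sample)

# Three hypothetical sampled trees over taxa a-d, as sets of clades:
trees = [{frozenset("ab"), frozenset("abc")},
         {frozenset("ab"), frozenset("abd")},
         {frozenset("ac"), frozenset("acd")}]
print(clade_posterior(trees, "ab"))   # 2/3: clade (a,b) is in 2 of 3 trees
```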

Making trees using maximum likelihood
Maximum likelihood is an alternative to maximum parsimony. It is computationally intensive. A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process.
What are the tree topology and branch lengths that have the greatest likelihood of producing the observed data set?
ML is implemented in the TREE-PUZZLE program, as well as PAUP and PHYLIP. (Page 262)

Maximum likelihood: Tree-Puzzle
1. Reduce the problem to a series of quartets of sequences 1, 2, 3, 4; each quartet has three possible unrooted topologies.
- Construct all quartet trees: for N sequences there are C(N,4) possible quartets (N=12: 495), as counted in the snippet below.
- The three quartet topologies are weighted by their posterior probabilities.
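A quick check of the quartet count (standard binomial counting):

```python
from math import comb

# C(N, 4) quartets, each with 3 possible unrooted topologies:
for n in (8, 12, 20):
    print(n, comb(n, 4))   # 12 -> 495, matching the slide
```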

2. Puzzling step
- Start with one quartet tree (N−4 sequences remain).
- Add the remaining sequences to the branches systematically, estimating the support for each internal branch.
3. Generate a majority consensus tree
- Branch lengths and maximum likelihood values are estimated.

Quartet puzzling
- Likelihood mapping indicates the frequency with which quartets are successfully resolved.
- 495 points, corresponding to all possible quartets (N=12)
- Only 9.7% of quartets are unresolved.

An Example with 13 globins

ML Tree Generation Algorithm
We need to generate a large sample of trees in which the prob. of finding a tree is proportional to its likelihood * prior prob.
Metropolis Algorithm
- Start with a trial tree and compute its likelihood, L_1.
- Make a slight change (change a branch length, move a vertex, ...) and compute the likelihood L_2 of the modified tree.
- If L_2 > L_1, the new tree is accepted.
- Otherwise, the new tree is accepted with prob. L_2/L_1; if rejected, continue from the tree with likelihood L_1.
- Hill-climbing, but downhill moves are also allowed.
- Generates a sample in which trees appear with frequency proportional to their posterior prob.
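A sketch of the Metropolis loop in Python, working in log space; `log_likelihood` and `propose` are placeholders for a real tree likelihood (times the prior, if one is used) and a real tree perturbation.

```python
import math
import random

def metropolis_sample(init_tree, log_likelihood, propose, n_steps):
    """Accept an uphill move always; a downhill move with prob. L2/L1.
    Rejected steps re-record the current tree, so trees appear in the
    sample with frequency proportional to the (unnormalized) target."""
    tree = init_tree
    ll = log_likelihood(tree)
    sample = []
    for _ in range(n_steps):
        cand = propose(tree)            # e.g. tweak a branch length or do an NNI
        cand_ll = log_likelihood(cand)
        if math.log(random.random()) < cand_ll - ll:
            tree, ll = cand, cand_ll    # accepted
        sample.append(tree)
    return sample
```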

Bayesian inference of phylogeny
Calculate: Pr[Tree | Data] = Pr[Data | Tree] × Pr[Tree] / Pr[Data]
Pr[Tree | Data] is the posterior probability distribution of trees. Ideally this involves a summation over all possible trees. In practice, Markov chain Monte Carlo (MCMC) is run to estimate the posterior probability distribution.
Notably, Bayesian approaches require you to specify prior assumptions about the model of evolution.

Evaluation of Trees: Bootstrapping
Bootstrapping is a method of assessing the reliability of trees. The numbers on the rooted tree are called bootstrap percentages.
Distances estimated under a model are not exact, due to chance fluctuations. Bootstrapping addresses the question of whether these fluctuations influence the tree configuration.
Bootstrapping deliberately constructs sequence data sets that differ by small random fluctuations from the real sequences, and checks whether the same tree topology is obtained.
The randomized sequences are constructed by sampling alignment columns, as in the sketch below.
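A sketch of one bootstrap replicate: resample alignment columns with replacement and rebuild the sequences (the toy alignment is made up).

```python
import random

def bootstrap_replicate(alignment):
    """Sample the columns of an alignment with replacement."""
    n_cols = len(alignment[0])
    cols = [random.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

aln = ["ACGTACGT",
       "ACGAACGT",
       "TCGAACGA"]
print(bootstrap_replicate(aln))  # build a tree from each replicate, then
                                 # count how often each clade recurs
```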

Bootstrapping
Generate 100 or 1,000 randomized data sets, and compute what percentage of the resulting trees contain the same group.
- A 77% bootstrap value is considered reliable; e.g., at 24% it is doubtful that the taxa form a clade.
- 71% for human/chimpanzee/pygmy chimpanzee falls between two high figures:
  - Chimpanzee/pygmy chimpanzee always form a clade.
  - Gorilla/human/chimpanzee/pygmy chimpanzee always form a clade.
- (gorilla,(human,chimpanzees)) appears more frequently than (human,(gorilla,chimpanzees)) or (chimpanzees,(gorilla,human)); thus we can conclude that (human, chimpanzees) is more reliable.
Can construct a consensus tree: the frequency of each possible clade is determined, and the consensus tree is built by adding clades starting from the most frequent.

In 61% of the bootstrap resamplings, ssrbp and btrbp (pig and cow RBP) formed a distinct clade. In 39% of the cases, another protein joined the clade (e.g. ecrbp), or one of these two sequences joined another clade.

Tree Optimization
Evaluate trees according to the least-squared error (Fitch and Margoliash, 1967):
E = Σ_{i,j} (d_ij − d_ij^tree)² / d_ij²
Clustering methods such as NJ and UPGMA have a well-defined algorithm and produce one tree, but no optimality criterion.
Optimization approaches have a well-defined criterion, but no well-defined algorithm: one has to construct many alternative trees and test each against the criterion.
Other optimization criteria:
- Maximum likelihood: choose the tree on which the likelihood of observing the given sequences is highest.
- Parsimony: choose the tree that requires the fewest substitutions in the sequences.
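The least-squares criterion as code; the pairwise distances below are hypothetical.

```python
def fm_error(d_obs, d_tree):
    """Fitch-Margoliash error: E = sum_(i<j) (d_ij - d_ij^tree)^2 / d_ij^2."""
    return sum((d_obs[p] - d_tree[p]) ** 2 / d_obs[p] ** 2 for p in d_obs)

d_obs  = {("A", "B"): 0.30, ("A", "C"): 0.45, ("B", "C"): 0.40}  # observed
d_tree = {("A", "B"): 0.28, ("A", "C"): 0.47, ("B", "C"): 0.41}  # tree-implied
print(fm_error(d_obs, d_tree))   # ~0.007: smaller is a better fit
```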

Tree Space
The number of distinct trees grows as a double factorial n!! (product of odd numbers): for N species, (2N−5)!! unrooted trees and (2N−3)!! rooted trees.
N=7: 9*7*5*3*1 = 945; N=10: about 2.0*10^6
Consider a 'tree space': the set of all possible tree topologies.
Two trees are neighbors if they differ by a topological change known as a nearest-neighbor interchange (NNI). In an NNI, an internal branch of the tree is selected and a subtree is swapped with another at the other end of that internal branch.
Tree 4 is not a neighbor of tree 1; it is reached instead by subtree pruning and regrafting (SPR).
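A sketch of the double-factorial counts:

```python
def n_unrooted(n_taxa):
    """(2N-5)!! = 3 * 5 * ... * (2N-5) unrooted bifurcating topologies."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (7, 10, 20):
    print(n, n_unrooted(n))   # 7 -> 945, 10 -> 2027025, 20 -> ~2.2e20
```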

Optimization in Tree Space
Hill-climbing algorithm:
- Given an initial tree (from a distance matrix, for example), find a neighboring tree that is better.
- If one is found, move to this new tree and search its neighbors.
- Continue until a local optimum is reached (no better neighbors are found).
- Cannot guarantee a global optimum.
Heuristic search:
- Start with three random species and construct an unrooted tree.
- Add one species at a time, connecting it in the optimal way.
- Repeat with different initial random triples, each time producing a local optimum.
- Repeat this enough times, and one may claim a global optimum.

ML vs. Parsimony
- Parsimony is fast; ML requires each tree topology to be optimized.
- ML is model-based; parsimony's implicit model is equal substitution rates. Parsimony can incorporate models, but it is not clear what the weights should be.
- Parsimony tries to minimize the number of substitutions, irrespective of branch lengths. ML allows changes to be more likely on longer branches; on a long branch, there is no reason to minimize the number of substitutions.
- Parsimony is strong for evaluating trees based on qualitative characters.

Bayesian Network
Example experiment:
- Stimulant (ST): present/not present (a priori prob. over states x)
- Extracellular signal (SI): high/medium/low (observed; states y)
Inference: what is the prob. of ST being present when SI is high?
Need the a priori prob. (present = 0.4).
P(ST=p|SI=high) = p(SI=h|ST=p) p(ST=p) / Σ_x p(y|x) p(x) = 0.8
[Table: the conditional distribution p(SI|ST) over SI ∈ {high, med, low} for ST present/not present; the numbers used are given in the worked example at the end.]

Model Parameter Set
Discrete data: θ_SI = p(SI|ST), a conditional probability table over SI ∈ {high, med, low} for ST present/not present.
Continuous data: e.g., a Gaussian model.

Bayesian Network: Multiple Variables
Stimulant (ST), signal (SI), inhibitor (IN) of the signal, G protein-coupled receptor binding (RE), a G protein (GP), and the cellular response (CR).
Expressed relationships:
- ST may or may not generate a signal.
- The concentration of the signal may affect the level of the inhibitor.
- Whether the signal binds the receptor depends on the concentrations of both the signal and the inhibitor.
- GP should become active if the receptor is bound.
- An active GP initiates a cascade of reactions that causes the cellular response.

Conditional Independence
Conditional independence (CI): p(a,b|c) = p(a|c) p(b|c)
Three cases, illustrated by the regulation of three genes x, y, and z.
Serial (x → y → z):
- If the expression level of y is unknown, the level of x affects that of z.
- If y is known, z is conditionally independent of x.

Diverging (x ← y → z):
- If the expression level of y is unknown, the level of x affects that of z (they are co-regulated: if x is highly expressed, the likely level of y may be inferred, which in turn influences the expected level of z).
- If y is known, z is conditionally independent of x.

Converging (x → y ← z):
- If the expression level of y is unknown, the level of x does not help to infer that of z (x and z are independent).
- If y is known, the level of x does help to infer that of z: y depends on both x and z, and p(x,z|y) ≠ p(x|y) p(z|y).
- If y and x are known, they help to infer the value of z, and x and z are no longer independent:
p(z|x,y) = p(z) p(y|x,z) / Σ_z′ p(z′) p(y|x,z′) ≠ p(z)

Joint Prob. Distribution
A BN with n variables (nodes) x = {x_1, x_2, …, x_n} and model parameters θ = {θ_1, θ_2, …, θ_n} (θ_i: the set of parameters describing the distribution of x_i).
Each node x_i has parents pa(x_i).
p(x_1, x_2, …, x_n | θ) = Π_i p(x_i | pa(x_i), θ)   from conditional independence
P(ST,SI,IN,RE,GP,CR) = p(CR|GP) p(GP|RE) p(RE|SI,IN) p(IN|SI) p(SI|ST) p(ST)
=> a simpler expression for the joint prob., obtained by using conditional independence

Inference of GP being active, knowing ST is present
Given the states of some variables, the states of other variables can be inferred:
P(GP=active|ST=present) = Σ_x Σ_y Σ_z P(GP=active|RE=x) P(RE=x|IN=y, SI=z) P(IN=y|SI=z) P(SI=z|ST=present)
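A sketch of this marginalization. The slides give no CPT numbers except p(SI=high|ST) from the worked example on the next slide, so all other probabilities below are invented, and SI and IN are collapsed to two levels (h/l) for brevity.

```python
from itertools import product

# Hypothetical CPTs. p(SI=h|ST=p)=0.6 and p(SI=h|ST=n)=0.1 match the
# worked example below; everything else is made up for illustration.
P_SI = {"p": {"h": 0.6, "l": 0.4}, "n": {"h": 0.1, "l": 0.9}}  # p(SI|ST)
P_IN = {"h": {"h": 0.3, "l": 0.7}, "l": {"h": 0.6, "l": 0.4}}  # p(IN|SI)
P_RE_BOUND = {("h", "h"): 0.4, ("h", "l"): 0.9,                # p(RE=bound|SI,IN)
              ("l", "h"): 0.05, ("l", "l"): 0.2}
P_GP_ACTIVE = {"bound": 0.95, "unbound": 0.05}                 # p(GP=active|RE)

def p_gp_active(st):
    """P(GP=active|ST): sum the factored joint over SI, IN and RE."""
    total = 0.0
    for si, inh in product("hl", repeat=2):
        p_bound = P_RE_BOUND[(si, inh)]
        p_active = (p_bound * P_GP_ACTIVE["bound"]
                    + (1 - p_bound) * P_GP_ACTIVE["unbound"])
        total += P_SI[st][si] * P_IN[si][inh] * p_active
    return total

print(p_gp_active("p"))   # P(GP=active | ST=present) under these made-up CPTs
```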

Prob. of ST being present, given the signal is high
Posterior prob.:
P(ST=present|SI=high) = P(SI=h|ST=p) P(ST=p) / [P(SI=h|ST=p) P(ST=p) + P(SI=h|ST=n) P(ST=n)]
= 0.6*0.4 / [0.6*0.4 + 0.1*0.6] = 0.24/0.30 = 0.8