
Phylogenetic Analysis

General comments on phylogenetics
Phylogenetics is the branch of biology that deals with evolutionary relatedness.
It uses some measure of evolutionary relatedness, e.g., morphological features.

Phylogenetics on sequence data is an attempt to reconstruct the evolutionary history of those sequences.
Relationships between individual sequences are not necessarily the same as those between the organisms they are found in.

The ultimate goal is to use sequence data from many sequences to give information about the phylogenetic history of organisms.
Phylogenetic relationships are usually depicted as trees, with branches representing ancestors of “children”; the leaves at the bottom of the tree are the individual organisms.
Individual branch points are nodes.

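Phylogenetic trees
[Figure: a rooted tree with leaves A, B, C, D and an explicit time axis, next to an unrooted tree with the same leaves, for which the direction of time is undetermined.]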

We will only consider binary trees: edges split only into two branches (daughter edges).
Rooted trees have an explicit ancestor; the direction of time is explicit in these trees.
Unrooted trees do not have an explicit ancestor; the direction of time is undetermined in such trees.

Types of phylogenetic analysis methods
Phenetic (distance methods): trees are constructed based on observed characteristics, not on evolutionary history.
Cladistic (parsimony and maximum likelihood methods): trees are constructed based on fitting observed characteristics to some model of evolutionary history.

Similarity and Homology
The evolutionary relationship between sequences is inferred from the similarity of the sequences.
Similarity is a measurable quantity (e.g., % identity, alignment score, etc.).
Homology is the inference from sequence similarity data that sequences are evolutionarily related.

Sequence alignments
Aligning sequences gives information about:
– Similarity
– Areas of sequences that are conserved through evolution

The real problem …
How do we compare sequences?
Seq 1: CTGCACTA
Seq 2: CACTA or C---ACTA
Scoring tries to approximate evolution: scores for substitutions and for gaps (insertions/deletions).
Score = sum of terms for substitutions and for gaps (sequence as character string)

Sequence alignment I
Simplest scoring: 1 for match, 0 for no match
  CTGCACTA
     CACTA
or
  CTGCACTA
  C---ACTA
Score = 5 for both alignments

Sequence alignment II
Slightly more advanced scoring: +1 for match, 0 for no match, -1 for gap
  CTGCACTA
     CACTA     Score = 5
  CTGCACTA
  C---ACTA     Score = 2
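
The scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and the `free_end_gaps` switch are hypothetical, and leaving end gaps unpenalized is an assumption made so that the ungapped placement of CACTA scores 5, as on the slide.

```python
def score_alignment(a, b, match=1, mismatch=0, gap=-1, free_end_gaps=True):
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    cols = list(zip(a, b))
    if free_end_gaps:
        # Assumption: leading and trailing gap columns cost nothing.
        while cols and "-" in cols[0]:
            cols.pop(0)
        while cols and "-" in cols[-1]:
            cols.pop()
    total = 0
    for x, y in cols:
        if x == "-" or y == "-":
            total += gap        # insertion/deletion
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


print(score_alignment("CTGCACTA", "---CACTA"))  # ungapped placement
print(score_alignment("CTGCACTA", "C---ACTA"))  # three internal gaps
```

Setting `gap=0` gives the "simplest scoring" of the previous slide.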

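Identity scoring matrices: top, simple form; below, with mismatch penalty

     G  C  A  T
  G  1  0  0  0
  C  0  1  0  0
  A  0  0  1  0
  T  0  0  0  1

[Second matrix: same G, C, A, T layout with a mismatch penalty in the off-diagonal cells; the values did not survive in the transcript.]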

In-class exercise II
Using the “advanced scoring method”, calculate the scores for the following pairs of nucleotide sequences:
  CCTGGGCTATGC
  CAGGGTT-TGC
and
  CCTGGGCTATGC
  CA-GGG-TTTGC

What about proteins?
The chemistry of amino acids means that some substitutions in the sequence are better than others.
Substitution matrix: empirically derived scores for the frequency of substitution of each amino acid for all 19 others.

BLOSUM 62 Substitution matrix

In-class exercise III
Using the BLOSUM62 substitution matrix and a gap penalty of -2, score the following pairs of protein sequences (do not penalize end gaps):
  YIHMNVFLSFML
  RVGAANFPNPRL

  YIHMNVFLSFML
  FIHMNLFVSFML

  YIHMNVFLSFML
  IHMNLFV--SFML

  YIHMNVFLSFML
  IVLSMMFFLNHY

Dynamic programming: strategy
Break the alignment problem into small pieces.
Optimize the first piece.
Then extend into the second piece; since the first piece is optimized already, the program only needs to optimize the extension.
Continue until the end of the comparison.
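
The strategy above is commonly implemented as the Needleman-Wunsch global alignment algorithm. The slides do not name the algorithm, so treat that choice, the function name `needleman_wunsch`, and the default scores (+1 match, 0 mismatch, -1 gap, end gaps penalized) as assumptions. A minimal sketch:

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # f[i][j] = best score for aligning a[:i] with b[:j]
    f = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        f[i][0] = i * gap
    for j in range(1, cols):
        f[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Extend an already optimal piece by one column.
            f[i][j] = max(f[i - 1][j - 1] + sub,
                          f[i - 1][j] + gap,
                          f[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j, top, bottom = len(a), len(b), [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and f[i][j] == f[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return f[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("HEAE", "HEE")
print(score)
print(top)
print(bottom)
```

Each cell is filled from three already computed neighbours, which is the "optimize the extension only" step of the strategy.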

[Figure: dynamic programming matrix for aligning HEE against HEAE, filled in over three successive slides.]

Why multiple alignments?
Alignment of more than two sequences.
Usually gives better information about conserved regions and function (more data).
Better estimate of significance when using a sequence of unknown function.
Must use multiple alignments when establishing phylogenetic relationships.

Dynamic programming extended to many dimensions?
No: it uses up too much computer time and space.
E.g. 200 amino acids in a pairwise alignment: must evaluate $200^2 = 4 \times 10^4$ matrix elements.
If 3 sequences, $200^3 = 8 \times 10^6$ matrix elements.
If 6 sequences, $200^6 = 6.4 \times 10^{13}$ matrix elements.

Need to find a more efficient method.
Sacrifice certainty of the optimum alignment for certainty of a good alignment, but faster.

Feng-Doolittle algorithm
Does all pairwise alignments and scores them.
Converts pairwise scores to “distances”:
$D = -\log S_{\mathrm{eff}} = -\log\left[\frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\mathrm{max}} - S_{\mathrm{rand}}}\right]$
$S_{\mathrm{obs}}$ = pairwise alignment score
$S_{\mathrm{rand}}$ = expected score for random alignment
$S_{\mathrm{max}}$ = average of self-alignments of the two sequences

As $S_{\mathrm{obs}}$ approaches $S_{\mathrm{rand}}$ (increasing evolutionary distance), $S_{\mathrm{eff}}$ goes down; to make the distance measure positive, use the $-\log$.
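
As a small illustration of the distance formula, the sketch below computes $D$ from the three scores. The function name and the example scores are hypothetical; only the formula comes from the slides.

```python
import math


def feng_doolittle_distance(s_obs, s_rand, s_max):
    """D = -log[(S_obs - S_rand) / (S_max - S_rand)]."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # The observed score is no better than a random alignment.
        raise ValueError("S_eff must be positive to take the logarithm")
    return -math.log(s_eff)


# Made-up scores: identical sequences give D = 0, weaker alignments give larger D.
print(feng_doolittle_distance(100, 20, 100))
print(feng_doolittle_distance(60, 20, 100))
```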

Once the distances have been calculated, construct a guide tree (more in the phylogeny class); it tells what order to group the sequences in.
Sequences can be aligned with sequences or groups; groups can be aligned with groups.

Sequence-sequence alignments: dynamic programming.
Sequence-group alignments: all possible pairwise alignments between the sequence and the group are tried; the highest scoring pair determines how the sequence gets aligned to the group.
Group-group alignments: all possible pairwise alignments of sequences between the groups are tried; the highest scoring pair determines how the groups get aligned.

Example
[Figure: Seq1 and Seq2 are combined into Alignment 1, Seq3 and Seq4 into Alignment 2; Alignment 3 and the final alignment bring in the remaining groups and Seq5.]

Notice that this method does not guarantee the optimum alignment; just a good one.
Gaps are preserved from alignment to alignment: “once a gap, always a gap”

Distance methods
Measuring distance: just as when we talked about multiple alignment, distance represents all the differences at the various positions; these differences can be treated as equal or weighted according to empirical knowledge of substitution rates.

Another way to say this is that there is a set of distances $d_{ij}$ between each pair of sequences $i, j$ in the dataset. $d_{ij}$ can be the fraction $f$ of sites $u$ where the residues $x^i_u$ and $x^j_u$ of the two sequences differ; or $d_{ij}$ can be such a fraction but weighted in some way (e.g. Jukes-Cantor distance).
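
Both kinds of distance can be sketched briefly. The slides do not give the Jukes-Cantor formula; the standard form $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}f\right)$ is used here. The function names are hypothetical, and skipping columns that contain a gap is an assumption.

```python
import math


def p_distance(x, y):
    """Fraction f of compared sites at which two aligned sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns containing a gap are ignored.
    sites = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    return sum(a != b for a, b in sites) / len(sites)


def jukes_cantor(f):
    """Jukes-Cantor corrected distance for an observed fraction f."""
    if f >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for f >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


f = p_distance("ACGT", "ACGA")
print(f, jukes_cantor(f))
```

The corrected distance is always at least as large as $f$, because it accounts for multiple substitutions at the same site.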

Clustering algorithms
UPGMA: this is the distance clustering method that is used in pileup to make the guide tree.
$d_{ij}$ is the average distance between pairs of sequences found in two clusters, $C_i$ and $C_j$.
Text’s notation: $|C_i|$ = number of sequences in $C_i$

The algorithm in the text means just what we said before: find the closest distance between two sequences, cluster those; then find the next closest distance, cluster those; as sequences are added to existing clusters, find the average distance between existing clusters.
Work through the notation!
UPGMA assumes a molecular clock mechanism of evolution.
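
A minimal sketch of the clustering loop, assuming distances are supplied as a dictionary keyed by `frozenset` pairs of leaf names (a hypothetical layout chosen for brevity). The cluster distance is the average $d_{ij} = \frac{1}{|C_i||C_j|}\sum_{p \in C_i,\, q \in C_j} d_{pq}$, and each new node is placed at half that distance, which is where the molecular clock assumption enters.

```python
from itertools import combinations


def upgma(names, d):
    """Return the merges, in order, as (cluster_i, cluster_j, node_height)."""
    clusters = {name: (name,) for name in names}  # label -> member leaves
    merges = []

    def avg(ci, cj):
        # Mean of all leaf-to-leaf distances between the two clusters.
        pairs = [(p, q) for p in clusters[ci] for q in clusters[cj]]
        return sum(d[frozenset(pq)] for pq in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        i, j = min(combinations(clusters, 2), key=lambda ij: avg(*ij))
        height = avg(i, j) / 2
        members = clusters.pop(i) + clusters.pop(j)
        clusters[f"({i},{j})"] = members
        merges.append((i, j, height))
    return merges


# Made-up ultrametric distances for four sequences.
raw = {("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 6,
       ("A", "D"): 6, ("B", "C"): 6, ("B", "D"): 6}
dist = {frozenset(k): v for k, v in raw.items()}
for merge in upgma(["A", "B", "C", "D"], dist):
    print(merge)
```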

Neighbor-joining: corrects for UPGMA’s assumption of the same rate of evolution for each branch by modifying the distance matrix to reflect different rates of change.
The net difference between sequence $i$ and all other sequences is $r_i = \sum_k d_{ik}$.

The rate-corrected distance matrix is then $M_{ij} = d_{ij} - (r_i + r_j)/(n - 2)$.
Join the two sequences $i$ and $j$ whose $M_{ij}$ is minimal; then calculate the distance from this new node $k$ to all other sequences $m$ using $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$.
Again correct for rates and join nodes.
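
The joining loop can be sketched as follows. The data layout (a dictionary keyed by `frozenset` pairs) and the function name are hypothetical, and branch lengths are left out to keep the example short. The made-up distances come from a tree in which B and D sit on long branches, so the smallest raw distance (A to C) does not link true neighbours.

```python
from itertools import combinations


def neighbor_joining(names, d):
    """Return (joins, remaining_nodes); joins are (i, j, new_label) in order."""
    nodes = list(names)
    d = dict(d)  # work on a copy
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence r_i = sum_k d_ik
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}

        def corrected(pair):
            # M_ij = d_ij - (r_i + r_j) / (n - 2)
            i, j = pair
            return d[frozenset((i, j))] - (r[i] + r[j]) / (n - 2)

        i, j = min(combinations(nodes, 2), key=corrected)
        new = f"({i},{j})"
        for m in nodes:
            if m not in (i, j):
                # d_km = (d_im + d_jm - d_ij) / 2
                d[frozenset((new, m))] = (d[frozenset((i, m))]
                                          + d[frozenset((j, m))]
                                          - d[frozenset((i, j))]) / 2
        nodes = [m for m in nodes if m not in (i, j)] + [new]
        joins.append((i, j, new))
    return joins, nodes


raw = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
       ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}
dist = {frozenset(k): v for k, v in raw.items()}
joins, remaining = neighbor_joining(["A", "B", "C", "D"], dist)
print(joins[0], remaining)
```

With these distances the rate correction pairs A with B even though A and C have the smallest uncorrected distance.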

In-class exercise I
Retrieve the file named phylo2 from bioinfI.list in my directory.
Open it in the editor and select all the sequences.
Select Functions → Evolution → PAUPSearch; in Tree Optimality Criterion choose distance; in Method for Obtaining Best Tree choose heuristic.
Leave everything else as default (make sure the bootstrap option is not selected).
Select Run. Inspect the output.

Parsimony methods
Parsimony methods are based on the idea that the most probable evolutionary pathway is the one that requires the smallest number of changes from some ancestral state.
For sequences, this implies treating each position separately and finding the minimal number of substitutions at each position.

Example of parsimonious tree building
The tree on the left requires only one change; the tree on the right requires two: the left tree is the most parsimonious.
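
Counting the minimum number of changes at one position on a given tree can be done with Fitch's algorithm. The slides do not name a counting method, so this choice is an assumption, as are the nested-tuple tree encoding and the made-up character states; the example only mirrors the "one change versus two" comparison.

```python
def fitch(tree, states):
    """tree: a leaf name or a 2-tuple of subtrees; states: leaf -> character.
    Returns (candidate states at this node, minimum number of changes below it)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one substitution is required on this split.
    return left_set | right_set, left_cost + right_cost + 1


site = {"A": "G", "B": "G", "C": "T", "D": "T"}
print(fitch((("A", "B"), ("C", "D")), site)[1])  # groups identical states
print(fitch((("A", "C"), ("B", "D")), site)[1])  # splits them
```

Summing this count over all alignment positions gives the cost of a tree.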

Parsimony methods assign a cost to each tree available to the dataset, then screen those trees and select the most parsimonious.
Screening all the trees available to even a smallish dataset would take too much time; the branch and bound method builds trees with increasing numbers of leaves but abandons a topology whenever the current partial tree already has a bigger cost than the best complete tree found so far.

In-class exercise II
Use the same data set and program as in exercise I, but choose maximum parsimony.
Use heuristic for the tree building method.
Inspect your tree. Compare it to the distance-generated tree.

Maximum likelihood methods
Maximum likelihood reconstructs a tree according to an explicit model of evolution. For the given model, no other method will work as well.
But such models must be simple, because the method is computationally intensive.

Actually, all the other methods discussed implicitly use a simple model of evolution similar to the typical model made explicit in maximum likelihood:
All sites selectively neutral
All mutate independently, forward and reverse rates equal, given by $\mu$

Also assume discrete generations and that sites change independently.
Given this model, we can calculate the probability that a site with initial nucleotide $i$ will change to nucleotide $j$ within time $t$:
$P^t_{ij} = \delta_{ij} e^{-\mu t} + (1 - e^{-\mu t}) g_j$,
where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise, and where $g_j$ is the equilibrium frequency of nucleotide $j$.
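
The transition probability is easy to check numerically. In this sketch the rate is passed as `rate` and the equilibrium frequencies as a dictionary `g`; the function name, the parameter names, and the uniform frequencies in the example are assumptions.

```python
import math


def transition_prob(i, j, t, rate, g):
    """P^t_ij = delta_ij * exp(-rate*t) + (1 - exp(-rate*t)) * g_j."""
    delta = 1.0 if i == j else 0.0
    decay = math.exp(-rate * t)
    return delta * decay + (1.0 - decay) * g[j]


g = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform frequencies
row = [transition_prob("A", j, 0.5, 1.0, g) for j in "ACGT"]
print(row, sum(row))
```

At $t = 0$ the site keeps its state with probability 1, and for large $t$ the probability of ending in $j$ tends to $g_j$ regardless of the starting nucleotide.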

The likelihood that some site is in state $i$ at the $k$th node of a tree is $L_i(k)$.
The likelihoods for all states for each site for each node are calculated separately; the product of the likelihoods for each site gives the overall likelihood for the observed data.
Different tree topologies are searched to find the highest overall likelihood.
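
One common way to compute the per-node likelihoods is Felsenstein's pruning recursion; the slides do not name it, so treat it as an assumption. The sketch also assumes a rate of 1, uniform equilibrium frequencies of 0.25, and a hypothetical tree encoding in which an internal node is a pair of `(child, branch_length)` tuples.

```python
import math

BASES = "ACGT"


def p(i, j, t, rate=1.0, g=0.25):
    """Probability of ending in j after time t when starting in i."""
    decay = math.exp(-rate * t)
    return (decay if i == j else 0.0) + (1.0 - decay) * g


def partial(node, site):
    """Return {state i: L_i(node)} for one alignment column."""
    if isinstance(node, str):
        # Leaf: likelihood 1 for the observed base, 0 for the others.
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    (left, t_left), (right, t_right) = node
    l_left, l_right = partial(left, site), partial(right, site)
    return {
        i: sum(p(i, j, t_left) * l_left[j] for j in BASES)
           * sum(p(i, j, t_right) * l_right[j] for j in BASES)
        for i in BASES
    }


def site_likelihood(tree, site, g=0.25):
    """Average the root likelihoods over the equilibrium frequencies."""
    return sum(g * value for value in partial(tree, site).values())


def alignment_likelihood(tree, columns):
    """Sites are independent, so the overall likelihood is a product."""
    return math.prod(site_likelihood(tree, column) for column in columns)


tree = (((("a", 0.1), ("b", 0.2)), 0.3), ("c", 0.4))
columns = [{"a": "A", "b": "A", "c": "G"}, {"a": "C", "b": "C", "c": "C"}]
print(alignment_likelihood(tree, columns))
```

A tree search would repeat this calculation for each candidate topology and keep the one with the highest value.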

Maximum likelihood is maybe the “gold standard” for phylogenetic analysis; but because of its computational intensity it can only be used for select data and only after much initial fine tuning of many parameters of sequence alignments.
Often used to distinguish between several already generated trees.

Assessing trees
The bootstrap: randomly sample all positions (columns in an alignment) with replacement, meaning some columns can be repeated, but conserving the number of positions; build a large dataset of these randomized samples.

Bootstrap alignment process

Then use your method (distance, parsimony, likelihood) to generate another tree.
Do this a thousand or so times.
Note that if the assumptions the method is based on hold, you should always get the same tree from the bootstrapped alignments as you did originally.
The frequency of some feature of your phylogeny in the bootstrapped set gives some measure of the confidence you can have for this feature.
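
The resampling and counting steps can be sketched as below. `build_tree` and `has_feature` are hypothetical callables standing in for whichever tree method and feature test are being assessed; nothing here depends on a particular phylogenetics package.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping the length."""
    length = len(seqs[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(s[c] for c in cols) for s in seqs]


def bootstrap_support(seqs, build_tree, has_feature, replicates=1000, seed=0):
    """Fraction of bootstrap replicates whose tree shows the feature."""
    rng = random.Random(seed)
    hits = sum(
        bool(has_feature(build_tree(bootstrap_alignment(seqs, rng))))
        for _ in range(replicates)
    )
    return hits / replicates


alignment = ["ACGTAC", "ACGTTC", "TCGAAC"]
print(bootstrap_alignment(alignment, random.Random(1)))
```

Every sequence is resampled with the same column indices, so each resampled column is a complete column of the original alignment.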

In-class exercise III
Use the same dataset, select distance again. This time, select the bootstrap box.
In options, make sure to select the box labelled Save a file containing PAUP screen output.
Take defaults for everything else. Run.
Inspect your output. In particular, look at the paup.log file and compare it to the paupdisplay.figure file.

Repeat for the maximum parsimony method.
Were the original trees (not bootstrapped) meaningful?