Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.

Presentation transcript:

Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances Ancestral reconstructions and estimating branch lengths Lecture 23

Phylogenetic analysis and sequence alignment If two nucleic acid or protein molecules show a sufficient level of similarity, they most likely derive from a common ancestor. Three factors can confound this inference. The first is horizontal gene transfer, in which a DNA sequence is brought in from a remote source by viruses, mobile elements or another process. The second is an unequal rate of molecular evolution for the same gene in different species. The third is comparison between paralogous sequences. Once these three obstacles are taken into consideration, a comparison of similar sequences provides a sound foundation for reconstructing phylogenetic (evolutionary) relationships from molecular data.

Multiple sequence alignment and phylogenetic analysis Multiple alignment is a critical step in phylogenetic analysis. Although the procedure is reliable in many cases, it has a potential weakness: the result depends on the composition of the group of aligned sequences. If the group is biased towards a subset of very similar sequences, the alignment, including the positions of gaps, will be affected. Biased multiple alignments will inevitably affect phylogenetic reconstructions.

Unrooted phylogenetic tree [Figure: unrooted tree joining Seq A, Seq B, Seq C and Seq D, with a branch (edge) and a node labelled.]

Rooted phylogenetic tree [Figure: rooted tree drawn along a time axis.]

Possible rooted phylogenetic trees for 4 OTUs

The number of rooted trees is much higher than the number of unrooted trees.
Number of OTUs   Number of unrooted trees   Number of rooted trees
2                1                          1
3                1                          3
4                3                          15
5                15                         105
6                105                        945
7                945                        10,395
8                10,395                     135,135
9                135,135                    2,027,025
10               2,027,025                  34,459,425
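The counts above follow a simple closed form for strictly bifurcating trees: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n OTUs, where !! is the double factorial over odd numbers. The sketch below is only an illustration of that formula; the function names are chosen here and do not come from the lecture.

```python
def odd_double_factorial(k: int) -> int:
    """Return k * (k - 2) * ... * 3 * 1 for odd k; values of k below 2 give 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n_otus: int) -> int:
    """Number of distinct unrooted bifurcating topologies for n_otus labelled tips."""
    if n_otus < 2:
        raise ValueError("need at least 2 OTUs")
    return odd_double_factorial(2 * n_otus - 5)


def rooted_trees(n_otus: int) -> int:
    """Number of distinct rooted bifurcating topologies for n_otus labelled tips."""
    if n_otus < 2:
        raise ValueError("need at least 2 OTUs")
    return odd_double_factorial(2 * n_otus - 3)


if __name__ == "__main__":
    for n in range(2, 11):
        print(n, unrooted_trees(n), rooted_trees(n))
```

A rooted tree for n OTUs can be obtained by placing a root on any of the 2n-3 branches of an unrooted tree, which is why the rooted count for n equals the unrooted count for n+1.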

A strategy of phylogenetic reconstruction Choose a set of related sequences and obtain a multiple sequence alignment. If there is strong sequence similarity, use maximum parsimony methods. If not, but there is still clearly recognizable sequence similarity, use distance methods; otherwise use maximum likelihood methods. Finally, analyse how well the data support the prediction. Maximum parsimony produces better results when similarity between sequences is high and the amount of variation is small. Distance methods are generally better when variation between sequences is intermediate. Maximum likelihood is less sensitive to sequence variation but computationally very demanding.

Simplifying assumptions for phylogenetic reconstructions All nucleotide sites change independently. The substitution rate is constant over time and in different lineages. The base composition is at equilibrium. The conditional probabilities of nucleotide substitutions are the same for all sites and do not change over time. The number of gaps in the multiple alignment (MA) should not be too large. MA of similar sequences is preferable. The simplest evolutionary models are preferable: a) no reversals (A → T → A), b) no multiple steps (A → T → G). Gaps in MA are usually not scored because there is no suitable model for the evolutionary mechanisms that produce them. MOST OF THESE CONDITIONS ARE VIOLATED TO SOME DEGREE, AT LEAST FROM TIME TO TIME.

Measuring genetic distances Calculating distances between pairs of sequences is essential for phylogenetic reconstruction. Calculating a genetic distance amounts to asking “how much evolutionary change has occurred between two sequences?” If all the earlier assumptions held, the calculation would be simple. Unfortunately this is not the case, and evolutionary time itself makes a significant contribution.
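The simplest distance one could compute is the uncorrected proportion of differing sites (often called the p-distance). The lecture does not name this measure, so the sketch below is an assumption about a reasonable starting point; skipping columns that contain a gap character `-` is also a choice made here, in line with the earlier remark that gaps are usually not scored.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped columns are not scored
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differences / compared


if __name__ == "__main__":
    print(p_distance("ACGTACGT", "ACGTACGA"))
```

This value counts only the differences that are still visible, so it is a lower bound on the amount of change that actually occurred.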

Number of nucleotide substitutions between pairs of bovid mtDNA sequences. The observed number of substitutions is not linear with time.

The need to correct observed sequence differences Numerous distance correction techniques have been proposed to estimate the actual amount of evolutionary change. Many of the methods are interrelated.
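As one concrete instance of such a correction, the sketch below applies the Jukes-Cantor (JC69) formula, d = -(3/4) ln(1 - 4p/3), to an observed proportion of differing sites p. The slide text does not say which corrections it has in mind, so the choice of JC69 here is an assumption made for illustration; it is the simplest model, with equal base frequencies and a single substitution rate.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site under JC69, given observed proportion p.

    The formula is undefined for p >= 0.75, the level of difference expected
    between two unrelated random sequences under this model.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for observed in (0.05, 0.30, 0.60):
        print(observed, round(jukes_cantor_distance(observed), 4))
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as p grows, which reflects the increasing share of hidden multiple substitutions.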

Models of sequence evolution: basic approach Within a general framework that assumes the probability of a given nucleotide substitution remains constant over time, together with the other assumptions mentioned earlier, the substitution matrix is given by:

$$P_t = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}$$

In most models the matrix is symmetric: $p_{AC} = p_{CA}$. If the numbers 1-4 are assigned to the nucleotides (e.g. A = 1, C = 2, etc.), the value of each diagonal element is given by

$$p_{ii} = 1 - \sum_{j \neq i} p_{ij}$$

In other words, the probability of observing A at a given site at time 0 and again at time t is 1 minus the probability of observing the substitution of A by any of C, G, or T. The base composition of the sequences can be represented by a vector $f = [f_A, f_C, f_G, f_T]$. In some models the f values are equal, in others they differ.
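To make the matrix concrete, the sketch below fills it in for the Jukes-Cantor model, used here as an assumed example because it is the simplest case: all off-diagonal entries are equal and the base frequencies are all 1/4. The parameter `alpha` (rate of change to each of the three other bases per unit time) and the function name are introduced for illustration only.

```python
import math

BASES = "ACGT"  # row and column order of the matrix


def jc69_matrix(alpha: float, t: float) -> list[list[float]]:
    """Substitution probability matrix P_t under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay   # p_ii: same base observed after time t
    other = 0.25 - 0.25 * decay  # p_ij, i != j: a specific different base
    return [[same if i == j else other for j in range(4)] for i in range(4)]


if __name__ == "__main__":
    matrix = jc69_matrix(alpha=0.01, t=10.0)
    for base, row in zip(BASES, matrix):
        print(base, [round(p, 4) for p in row])
```

Each row sums to 1, so every diagonal element equals 1 minus the sum of the other entries in its row, and the matrix is symmetric, matching the two properties stated above.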

Different models estimating nucleotide substitutions among a pair of DNA sequences

Observed and expected numbers of nucleotide pairs between human and chimpanzee mtDNA sequences for 3 models. As the models add parameters, they approximate the observed pattern more closely.

Measuring evolutionary change on a tree The evolutionary distance between b and d is the sum of the edge lengths along the path in the tree between the two sequences.
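The path-sum definition can be sketched as a short traversal. The tree, the internal node names `x` and `y`, and all edge lengths below are hypothetical values invented for the example; they are not taken from the figure on the slide.

```python
def build_adjacency(edges):
    """Turn (node, node, length) triples into an undirected adjacency map."""
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    return adjacency


def path_length(adjacency, start, goal):
    """Sum of edge lengths along the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, distance = stack.pop()
        if node == goal:
            return distance
        for neighbour, length in adjacency[node]:
            if neighbour != parent:  # a tree has no cycles, so never step back
                stack.append((neighbour, node, distance + length))
    raise ValueError("nodes are not connected")


# Hypothetical unrooted tree: leaves a, b, c, d; internal nodes x, y.
EDGES = [("a", "x", 2), ("b", "x", 3), ("x", "y", 1), ("c", "y", 4), ("d", "y", 5)]

if __name__ == "__main__":
    tree = build_adjacency(EDGES)
    print(path_length(tree, "b", "d"))  # 3 + 1 + 5
```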

Inferring branch lengths There are many methods for inferring branch lengths, and they are directly related to the methods of tree construction. Here we focus on one approach, the parsimony method, because unlike the others it explicitly seeks to reconstruct the ancestral sequences rather than just the edge lengths. In the vast majority of cases the ancestral sequences are never known. Example sequences: 1 CGA, 2 CGA, 3 ATT, 4 TTT, 5 TGT, 6 TGG. Consensus sequence: TGT.

Inferring ancestor sequences: is consensus always correct? Given a star tree, the most frequent base at each position is the most parsimonious estimate of the ancestral nucleotide. If, however, the tree topology is known, the common ancestral sequence inferred by parsimony can differ from the consensus. A simple consensus is therefore not appropriate in many cases.
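The star-tree rule amounts to a column-wise majority vote. The sketch below applies it to the six three-base sequences from the previous slide; the function name is chosen here, and ties (none occur in this data) would be broken by whichever base appears first in the column, which is an arbitrary choice.

```python
from collections import Counter


def consensus(sequences):
    """Most frequent base at each aligned position (the star-tree estimate)."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("sequences must all have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


# Sequences 1-6 from the slide on inferring branch lengths.
SLIDE_SEQUENCES = ["CGA", "CGA", "ATT", "TTT", "TGT", "TGG"]

if __name__ == "__main__":
    print(consensus(SLIDE_SEQUENCES))
```

Because the vote ignores how the sequences are related, it can disagree with a reconstruction that takes the tree into account.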

Basic rules and assigning state sets to internal nodes The two basic rules of parsimony are: a) if two sequences (nodes) have the same state, their common ancestor is assigned that state; b) if the two sequences have different states, the state set of the ancestor contains both states, in other words its state is ambiguous.
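These two rules can be sketched for a single site as follows. Once internal nodes carry state sets, rule (a) has to be generalised to "keep the states the two children share"; that generalisation is the bottom-up pass usually attributed to Fitch, which the slides do not name, so treat the attribution as an assumption. The nested-tuple tree encoding, the leaf names and the leaf states are all hypothetical.

```python
def fitch_state_set(tree, leaf_states):
    """Bottom-up state-set assignment for one site on a rooted binary tree.

    tree is either a leaf name (str) or a (left, right) tuple of subtrees.
    Returns (state set at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_changes = fitch_state_set(tree[0], leaf_states)
    right_set, right_changes = fitch_state_set(tree[1], leaf_states)
    shared = left_set & right_set
    if shared:
        # Rule (a): the children agree on at least one state, so keep it.
        return shared, left_changes + right_changes
    # Rule (b): the children disagree; keep both possibilities, count a change.
    return left_set | right_set, left_changes + right_changes + 1


if __name__ == "__main__":
    example_tree = (("s1", "s2"), ("s3", "s4"))
    example_states = {"s1": "A", "s2": "A", "s3": "T", "s4": "A"}
    print(fitch_state_set(example_tree, example_states))
```

In this example the ancestor of s3 and s4 receives the ambiguous set {A, T}, and the root resolves to {A} with a single change on the tree.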

Resolving the ambiguity in ancestral reconstructions By going back up from the root of the tree the ambiguities can be resolved. However, in more complex situations this is not always possible.

Estimating branch lengths The distance method of estimating branch lengths (next lecture) uses direct or indirect measurements of sequence similarity (genetic distances). The parsimony method uses the number of events between two compared nodes. A lack of data about intermediate sequences prevents reconstruction of the true evolutionary history, so incorrect conclusions can be drawn (a). In reality at least 3 events took place between nodes 1 and 2 (b).

Summary Multiple alignment is an essential first step in any phylogeny reconstruction effort. Multiple substitutions at the same sequence position cause the observed differences to underestimate the actual number of substitutions that have taken place. The longer the time since the common ancestor, the more severe this problem becomes. Several correction methods can “soften” it. A number of basic assumptions used in phylogenetic reconstruction, such as a constant mutation/substitution rate and independence of substitutions at different sites, are generally not correct. The quality of a phylogenetic reconstruction depends on how many sequences were sampled. A lack of data may significantly bias the reconstruction.