Multiple Sequence Alignment & Phylogenetic Trees.


Multiple Sequence Alignment: Motivation
Similarity across several sequences is an indication of:
– a common structure/function;
– a common evolutionary source (protein families, shared homologous regions).

[Figure: coloured multiple alignment. High consensus: red. Low consensus: blue. Neutral: black. Consensus: the most common letter.]

Uses of Multiple Sequence Alignment
1. Determine consensus sequences: EMOTIF, Clustal, Pileup.
2. Build gene families: Blocks, Prints, Prodom, HSSP.
3. Develop phylogenies (clusters, evolutionary models): PHYLIP, MACAU.
4. Model protein structures: Hidden Markov Models (PFAM); profiles and templates (SCOP, FSSP); neural networks (PSI-PRED).

Multiple Alignment (Morphological Data)
Example:
LOON (bird): red eyes, feathers, 28 vertebrae
DOG: brown eyes, hair, 23 vertebrae
CROC: green eyes, scales, 28 vertebrae
We would construct the matrix (each distinct state of a character gets its own code):
LOON (bird): 000
DOG: 111
CROC: 220
With DNA sequences, every character has the same 4 possible states (A, C, G, T). Protein sequences have 20 possible states.

Multiple Sequence Alignment: Definition
A multiple alignment of sequences S1, S2, ..., Sk is a series of sequences S1’, S2’, ..., Sk’ with gaps such that:
– all Si’ sequences are of equal length;
– Sj’ is an extension of Sj, obtained by insertion of gaps.
Example: ACTCGT, CAGTG, ACATCG
AC__TCGT
_CAGT_G_
ACA_TCG_
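The two conditions in the definition can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `is_valid_alignment` and the use of `_` as the gap symbol (taken from the example) are assumptions made for illustration.

```python
def is_valid_alignment(originals, aligned, gap="_"):
    """Return True if `aligned` is a multiple alignment of `originals`."""
    if len(originals) != len(aligned):
        return False
    # Condition 1: all aligned rows have the same length.
    if len({len(row) for row in aligned}) > 1:
        return False
    # Condition 2: each row is its original sequence with gaps inserted,
    # so deleting the gaps must give back the original.
    return all(row.replace(gap, "") == seq
               for seq, row in zip(originals, aligned))


seqs = ["ACTCGT", "CAGTG", "ACATCG"]
rows = ["AC__TCGT", "_CAGT_G_", "ACA_TCG_"]
print(is_valid_alignment(seqs, rows))
```

The check says nothing about how good the alignment is; it only confirms that the rows satisfy the definition.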

The Size Problem
If we consider only short sequences and only two taxa, we can handle the comparison manually, for example with a 2-taxa matrix (Taxa 1 along one axis, Taxa 2 along the other). But if you were to do this for 75 taxa, you'd have to use a 75-dimensional space! In general, MSA methods are therefore based on pairwise alignments between the sequences.

Determining Score
Most alignment algorithms determine the cost of an alignment column-wise.
Example:
LOON: AAC
DOG: ACA
CROC: CCA
RAT: CAC
There is one difference (two states) in each of the columns, thus the column score for the alignment is 3.
Usually we will align the sequences in pairs, and then align the pairs.
Possible scoring schemes include:
– Sum of pairs: the sum of pairwise distances between all pairs of sequences.
– Distance from consensus: the consensus is a string of the most common character in each column.
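The column score and the two scoring schemes can be sketched in a few lines of Python. This is an illustration under assumptions, not the slides' own code: the pairwise distance is taken to be the number of mismatching positions, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def column_state_score(rows):
    """Sum over columns of (number of distinct states - 1)."""
    return sum(len(set(col)) - 1 for col in zip(*rows))


def sum_of_pairs(rows):
    """Mismatch count summed over all pairs of rows (assumed distance)."""
    return sum(sum(a != b for a, b in zip(r1, r2))
               for r1, r2 in combinations(rows, 2))


def consensus(rows):
    """Most common character per column; ties go to the first one seen."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def distance_from_consensus(rows):
    """Total number of mismatches between each row and the consensus."""
    cons = consensus(rows)
    return sum(sum(a != c for a, c in zip(row, cons)) for row in rows)


rows = ["AAC", "ACA", "CCA", "CAC"]  # LOON, DOG, CROC, RAT
print(column_state_score(rows))       # 3, as in the example
print(sum_of_pairs(rows))
print(distance_from_consensus(rows))
```

In this example every column is split two against two, so the consensus itself is not unique, although the distance from it is the same whichever character wins the tie.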

MSA Approaches
Progressive approach: Build the MSA starting from the most related sequences, and then progressively add less related sequences. Examples: ClustalW, Pileup. Problem: errors in the initial alignment are propagated to the MSA.
Iterative approach: Repeatedly realign subgroups of sequences. Objective: improve the MSA score according to the scoring scheme, e.g., the sum-of-pairs score. Subgroups are based on a phylogenetic tree or on random selection. Examples: MultAlin, DiAlign.

ClustalW Algorithm
1. Compute pairwise alignments for all pairs of sequences.
2. Build a phylogenetic guide tree such that similar sequences are neighbors in the tree and distant sequences are distant from each other in the tree.
3. Progressively align the sequences according to the branching order in the guide tree.
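The first two stages described above (comparing all pairs, then deriving a branching order) can be illustrated with a small Python sketch. It is not ClustalW's actual implementation: plain edit distance stands in for the pairwise alignment score, the join order is found by greedily merging the closest groups (average distance), and the names `edit_distance` and `guide_order` are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, a stand-in for a pairwise alignment score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def guide_order(seqs):
    """Return the order in which groups of sequence names get merged.

    `seqs` maps a unique name to a sequence. Each merge is a pair of
    tuples of names; similar sequences are merged before distant ones.
    """
    dist = {frozenset((a, b)): edit_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}

    def avg(c1, c2):
        # Average distance between the members of two groups.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2))
    return merges


print(guide_order({"a": "AAAA", "b": "AAAT", "c": "GGGG"}))
```

A progressive aligner would then align the sequences (and later, groups of already aligned sequences) in the order of the returned merges.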

[Figure: input data, pairwise alignment, multiple alignment.]

PHYLOGENETIC RECONSTRUCTION
Goal: Given a set of species, reconstruct the tree which best explains their evolutionary history.

EVOLUTION and PHYLOGENY
All organisms undergo a slow process of transformation through the ages: evolution. The process of speciation (creating new species) is described by phylogenetic trees. Trees are acyclic connected graphs.
Example: primate phylogenetic tree.
[Figure: tree with leaves chimpanzee, human, gorilla, orangutan, gibbon, siamang. Marked nodes: the common ancestor of human and chimp; the common ancestor of all six primates.]

Tree Features
Nodes: External nodes (tips of the tree) represent extant (existing) species. Internal nodes represent ancestral species (usually extinct).
Branches: Length corresponds to the number of mutations. A longer branch means more mutations, usually implying a longer evolutionary time. The typical time scale is mya (millions of years ago).
[Figure: the primate tree (chimpanzee, human, gorilla, orangutan, gibbon, siamang) with external nodes, internal nodes and a branch labelled.]

Phylogenetic Reconstruction
Goal: Given a set of taxa (a group of related biological species), build a tree which best represents the course of evolution for this set over time.
Trees: rooted or unrooted. Most reconstruction methods produce unrooted trees. To root a tree we need “external information” (e.g., an outgroup).
[Figure: an unrooted and a rooted tree of human, chimpanzee, gorilla and orangutan.]

Trees are Based on What?
Classical phylogenetic analysis: Darwin (On the Origin of Species, November 24, 1859) and his contemporaries based their work on morphological and physiological properties (e.g., cold/warm blood, existence of scales, number of teeth, existence of wings, etc.).
Modern biological methods are based on molecular features: homologous sequences (e.g., globins) in different species, using DNA or protein sequences.

Homologous genes have a common ancestor. However, gene duplication and loss events can obscure evolutionary events.

Input, Algorithm, Tree
Morphology-based input: n-by-m table, with rows = species, columns = properties.
Sequence-based input: n aligned sequences, one per species.
[Figure: properties table or aligned sequences, passed to an algorithm, giving a phylogenetic tree.]
Major types of algorithms:
– Distance-based methods: UPGMA, Neighbor Joining.
– Character-based methods: Maximum Parsimony, Maximum Likelihood.

The Methods
Distance: a tree that recursively combines the two nodes with the smallest distance.
Parsimony: a tree with the minimum total number of character changes between nodes.
Maximum likelihood: finds the most probable tree under a mutation model. The method of choice nowadays.

Distance Based Methods
Iterative process, n-1 stages. Each stage consists of two steps:
Step 1: Determine the closest pair of species v, u. “Merge” these two “neighbors” into a new species w.
Step 2: Update the distance matrix: determine the distances from the new species w to the n-2 other species.
There are many distance-based methods; the most popular are UPGMA and Bio-NJ. They differ in how the closest pair is chosen and in how ties are resolved.

UPGMA: Unweighted Pair Group Method with Arithmetic mean
Algorithm, 2 stages:
1. Build a simple distance matrix: the distance between a pair of species may be the number of sites in which they differ.
2. Construct a tree by iteratively clustering species with small distances (“neighbors”).
Example distance matrix:
     A    B    C
B    6
C    5    7
D   10   12    7
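Stage 1 can be sketched in Python as follows. The sketch assumes the sequences are already aligned to equal length and reuses the four 3-letter sequences from the scoring example purely as sample input; the function names are hypothetical.

```python
from itertools import combinations


def differing_sites(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(aligned):
    """Map each unordered pair of names to its number of differing sites."""
    return {frozenset((p, q)): differing_sites(aligned[p], aligned[q])
            for p, q in combinations(aligned, 2)}


aligned = {"LOON": "AAC", "DOG": "ACA", "CROC": "CCA", "RAT": "CAC"}
for pair, d in distance_matrix(aligned).items():
    print(sorted(pair), d)
```

Raw difference counts are only one possible distance; corrected distances derived from an evolutionary model can be substituted without changing stage 2.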

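EXAMPLE for UPGMA
Find the pair with the smallest distance: A and C (distance 5).
The node joining A and C is placed at height 2.5, half of their distance, so each of the two branches has length 2.5.
Merge A and C into AC and update the distance matrix: Dist(AC,x) = [dist(A,x) + dist(C,x)]/2.
Before:
     A    B    C
B    6
C    5    7
D   10   12    7
After:
     AC    B
B    6.5
D    8.5   12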

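EXAMPLE for UPGMA (continued)
Next pair: AC and B (distance 6.5). The new node is placed at height 3.25, so the branch to B has length 3.25.
Updated matrix: Dist(ACB,D) = (8.5 + 12)/2 = 10.25.
Next pair: ACB and D (distance 10.25). The root is placed at height 5.125, so the branch to D has length 5.125 and the branch from the root down to the ACB node has length 5.125 - 3.25 = 1.875.
Resulting tree, with node heights: (((A, C) 2.5, B) 3.25, D) 5.125.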
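The worked example can be reproduced with a short Python sketch. It follows the update rule exactly as written in the example, averaging the two merged clusters with equal weight; many descriptions of UPGMA instead weight the average by cluster size, which would change the ACB to D distance. The function name and the dictionary layout are assumptions for illustration.

```python
def upgma_slide_rule(dist):
    """Cluster with Dist(uv, x) = [Dist(u, x) + Dist(v, x)] / 2.

    `dist` maps frozenset({a, b}) to a distance. Returns the merges in
    order as (cluster_u, cluster_v, height_of_new_node).
    """
    dist = dict(dist)
    nodes = sorted({name for pair in dist for name in pair})
    merges = []
    while len(nodes) > 1:
        # Step 1: closest pair (ties broken by order of appearance).
        pairs = ((a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:])
        u, v = min(pairs, key=lambda p: dist[frozenset(p)])
        d_uv = dist[frozenset((u, v))]
        w = u + v  # name of the merged cluster, e.g. "A" + "C" -> "AC"
        # Step 2: distances from the merged cluster to every other node.
        for x in nodes:
            if x not in (u, v):
                dist[frozenset((w, x))] = (dist[frozenset((u, x))]
                                           + dist[frozenset((v, x))]) / 2
        nodes = [x for x in nodes if x not in (u, v)] + [w]
        merges.append((u, v, d_uv / 2))  # new node sits at half the distance
    return merges


example = {
    frozenset("AB"): 6, frozenset("AC"): 5, frozenset("BC"): 7,
    frozenset("AD"): 10, frozenset("BD"): 12, frozenset("CD"): 7,
}
for u, v, height in upgma_slide_rule(example):
    print(u, v, height)
```

The three heights printed are 2.5, 3.25 and 5.125, matching the example.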

UPGMA Properties
Builds a rooted tree.
The output tree is ultrametric: the distance between the root and any leaf is the same. This amounts to assuming a uniform molecular clock (the same rate of change on every lineage), which is usually too good to be true.
The tree is additive: the distance between any two nodes equals the sum of the lengths of the branches connecting them.

Neighbor Joining
Builds an additive tree which does not assume an equal molecular clock. The tree is unrooted.
The algorithm is similar to UPGMA, but the pair to merge is chosen by a corrected distance M rather than by the raw distance d. With N the current number of nodes:
r(A) = [Σ_x d(A,x)]/(N-2)
M(A,B) = d(A,B) - [r(A)+r(B)]
Merge the nodes A and B for which M(A,B) is smallest. The branch lengths from A and B to the new node AB are:
d(A,AB) = 0.5[d(A,B)+r(A)-r(B)]
d(B,AB) = d(A,B) - d(A,AB)

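Neighbor Joining
Set N to contain all leaves.
Iteration:
– Choose i, j such that M(i,j) is minimal.
– Create a new node k, and set its distance to every other node m.
– Remove i, j from N, and add k.
Terminate: when |N| = 2, connect the two remaining nodes.
[Figure: nodes i and j joined at a new node k, with another node m.]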

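Neighbor Joining Example
Distance matrix:
     A    B    C
B    6
C    5    7
D   10   12    7
Compute r for every node, N=4:
r(A) = 0.5*(6+5+10) = 10.5
r(B) = 0.5*(6+7+12) = 12.5
r(C) = 0.5*(5+7+7) = 9.5
r(D) = 0.5*(10+12+7) = 14.5
Compute M for every pair of nodes, e.g. M(A,B) = dist(A,B)-[r(A)+r(B)] = 6-(10.5+12.5) = -17.
M(C,D) = 7-(9.5+14.5) = -17 ties with M(A,B) for the minimum. In this example C and D are merged first.
[Figure: tree with leaves A, B, C, D.]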
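One iteration of the procedure on the example matrix can be sketched in Python. The formulas for r, M and the two branch lengths are the ones given above; the update `d(k,m) = 0.5[d(i,m) + d(j,m) - d(i,j)]` for the new node is the usual neighbor-joining rule and is an assumption here, since the transcript does not preserve that formula. All function names are hypothetical.

```python
def nj_r_and_m(names, d):
    """Return (r, M) for the current nodes; d maps frozenset pairs to distances."""
    n = len(names)
    r = {a: sum(d[frozenset((a, x))] for x in names if x != a) / (n - 2)
         for a in names}
    m = {(a, b): d[frozenset((a, b))] - (r[a] + r[b])
         for i, a in enumerate(names) for b in names[i + 1:]}
    return r, m


def nj_branch_lengths(i, j, d, r):
    """Branch lengths from i and from j to their new parent node."""
    d_i = 0.5 * (d[frozenset((i, j))] + r[i] - r[j])
    return d_i, d[frozenset((i, j))] - d_i


def nj_new_distances(i, j, names, d):
    """Distances from the new node to the others (assumed standard rule)."""
    return {x: 0.5 * (d[frozenset((i, x))] + d[frozenset((j, x))]
                      - d[frozenset((i, j))])
            for x in names if x not in (i, j)}


names = ["A", "B", "C", "D"]
d = {
    frozenset("AB"): 6, frozenset("AC"): 5, frozenset("BC"): 7,
    frozenset("AD"): 10, frozenset("BD"): 12, frozenset("CD"): 7,
}
r, m = nj_r_and_m(names, d)
print(r)                                    # r(A)=10.5, r(B)=12.5, r(C)=9.5, r(D)=14.5
print(m)                                    # M(A,B) and M(C,D) tie at -17
print(nj_branch_lengths("C", "D", d, r))    # branches of C and D to the new node
print(nj_new_distances("C", "D", names, d)) # new node to A and to B
```

Which of the tied pairs is merged first depends only on the tie-breaking rule; here the C, D merge from the example is applied by hand.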

If you break ties “systematically”, that is, according to the order of appearance in the matrix, and complete this procedure, you'd get the UPGMA tree shown on the left. If you broke ties randomly, you might get the tree shown on the right.

Maximum Parsimony
We are looking for an “evolutionary explanation” for the existing species that minimizes the number of mutations.
Evolutionary explanation: a tree together with a series of sequences at its internal nodes. The internal nodes stand for the steps required to generate the observed variation in the sequences.
This problem is NP-hard. However, for a given tree it is easy to find an assignment for the internal nodes that minimizes the number of mutations.

Calculating the minimal number of steps
[Figure: a tree annotated node by node, from the leaves upward.]
– The intersection of A and A is A, thus we assign A to the node. Length = 0.
– We add a length of 1. Length = 1.
– We add a length of 1. Length = 2.
– The intersection set of {A, C} and {C} is C.
– The intersection of {C, T} and {C} is (of course) C.
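The intersection-or-union counting shown above is commonly attributed to Fitch. A minimal Python sketch for a fixed rooted binary tree follows; the nested-tuple tree encoding and the function names are assumptions made for illustration, and the sample tree is not the one from the slide's figure.

```python
def fitch_site(tree, leaf_state):
    """Return (state_set, min_changes) for one site.

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {leaf_state[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], leaf_state)
    right_set, right_cost = fitch_site(tree[1], leaf_state)
    common = left_set & right_set
    if common:
        # Non-empty intersection: no extra change needed at this node.
        return common, left_cost + right_cost
    # Empty intersection: take the union and add a length of 1.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, seqs):
    """Sum the per-site minimum over all columns of aligned sequences."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(n_sites)
    )


tree = (("w", "x"), ("y", "z"))
print(fitch_site(tree, {"w": "A", "x": "A", "y": "C", "z": "T"}))
```

For this sample site the pair (w, x) contributes length 0, the pair (y, z) adds 1, and the root adds 1 more, for a total of 2.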

Maximum Parsimony Problems
For small datasets it is possible to evaluate all possible tree topologies. This is done by adding taxa to the growing tree in all possible locations. Specifically, when the number of taxa t = 4, there are 3 unrooted trees.
The number of possible trees increases rapidly with increasing t. Number of unrooted trees: (2t - 5)!/[2^(t-3) (t - 3)!]. When t = 10, the number is more than two million.
The most parsimonious tree is not always the true tree.
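The growth of the tree count can be checked directly from the formula. A small Python sketch (the function name is hypothetical):

```python
from math import factorial


def num_unrooted_trees(t):
    """Number of unrooted binary trees on t labelled taxa (t >= 3)."""
    if t < 3:
        raise ValueError("the formula applies for t >= 3")
    return factorial(2 * t - 5) // (2 ** (t - 3) * factorial(t - 3))


for t in (4, 5, 10, 20):
    print(t, num_unrooted_trees(t))
```

For t = 4 this gives 3 and for t = 10 it gives 2,027,025, in line with the figures quoted above.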

Maximum Likelihood
Uses probability calculations to find a tree that best accounts for the variation in a set of sequences. For each tree, the number of sequence changes is considered. Allows for variation in mutation rates, and can incorporate evolutionary models such as Jukes-Cantor.
As in maximum parsimony, the analysis is performed on each column of the alignment in turn, and all possible trees are considered. Computationally intensive!
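To make the probability calculation concrete, here is a Python sketch of the simplest case: the likelihood of just two aligned sequences under the Jukes-Cantor model, as a function of the distance d between them. It is an illustration under assumptions, not a tree search: d is measured in expected substitutions per site, the sequences are gap-free, base frequencies are 1/4 each, and the function names are hypothetical.

```python
import math


def jc_log_likelihood(a, b, d):
    """Log-likelihood of aligned sequences a, b at Jukes-Cantor distance d > 0."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    # Sites are treated as independent, so per-column log terms add up.
    return sum(math.log(0.25 * (p_same if x == y else p_diff))
               for x, y in zip(a, b))


def jc_distance(a, b):
    """Closed-form maximum-likelihood distance; needs under 75% mismatches."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


a = "ACGTACGTAC"
b = "ACGTACGTAA"
d_hat = jc_distance(a, b)
print(d_hat)
print(jc_log_likelihood(a, b, d_hat))
```

A full maximum-likelihood analysis repeats this kind of per-column calculation over whole trees, summing over the unknown states at internal nodes, which is where the computational cost comes from.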

Comparison
When the sequences are very similar, all methods will produce a tree close to the real tree.
When sequences are less related, neighbor joining and maximum likelihood are usually better than maximum parsimony.