Multiple Sequence Alignment and Phylogenetic Trees



Multiple Sequence Alignment (MSA)

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Like pairwise alignment, BUT compares n sequences instead of 2
- Rows represent individual sequences
- Columns represent the ‘same’ position
- Gaps are allowed in all sequences

How to find the best MSA
Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
Sequences without gaps:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
1*1 2* *0.5 Score=8
Sequences aligned with gaps:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
4*1 + 11*0.75 + 2*0.5, Score = 13.25
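The column scoring on this slide can be sketched in Python. This is a minimal sketch under two assumptions that the slide does not state explicitly: gaps never count toward a match, and a column scores the share of rows carrying its most common residue, with a residue seen only once scoring 0 (the slide's "1/4 = 0"). The names `column_score` and `alignment_score` are illustrative only.

```python
from collections import Counter

# The gapped alignment shown on the slide.
ALIGNED = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]

def column_score(column):
    """Share of rows with the most common residue; gaps are ignored (assumption)."""
    residues = [ch for ch in column if ch != "-"]
    if not residues:
        return 0.0
    top = Counter(residues).most_common(1)[0][1]
    # A residue that appears only once matches nothing, so the column scores 0.
    return 0.0 if top == 1 else top / len(column)

def alignment_score(rows):
    """Sum of the column scores over all columns of the alignment."""
    return sum(column_score(col) for col in zip(*rows))

print(alignment_score(ALIGNED))  # 13.25
```

Under these assumptions the gapped alignment has 4 columns scoring 1, 11 scoring 0.75 and 2 scoring 0.5, which matches the slide's total of 13.25.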

Alignment of 3 sequences:
Complexity: length A × length B × length C
Aligning 100 proteins, 1000 amino acids each
Complexity: 1000^100 = 10^300 table cells
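The size of the full dynamic-programming table can be checked with a few lines of Python. This is a sketch that takes the slide's rule (the product of the sequence lengths) at face value; `dp_table_cells` is a hypothetical helper name.

```python
from math import prod

def dp_table_cells(lengths):
    """Number of table cells when every sequence gets its own dimension."""
    return prod(lengths)

# 100 proteins of 1000 amino acids each.
cells = dp_table_cells([1000] * 100)
print(len(str(cells)) - 1)  # 300, so the table has 10^300 cells
```

A table of this size cannot be filled in practice, which is why a heuristic approach is needed.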

Feasible Approach
Based on pairwise alignment scores
– Build n by n table of pairwise scores
Align similar sequences first
– After alignment, consider as single sequence
– Continue aligning with further sequences
Progressive alignment (Feng & Doolittle).

– For n sequences, there are n × (n-1)/2 pairs
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
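The pair count can be made concrete by enumerating the pairs for the four example sequences. This is a small sketch; the numeric labels 1 to 4 are only a convenient way to name the sequences.

```python
from itertools import combinations

sequences = {
    1: "GTCGTAGTCGGCTCGAC",
    2: "GTCTAGCGAGCGTGAT",
    3: "GCGAAGAGGCGAGC",
    4: "GCCGTCGCGTCGTAAC",
}

# Every unordered pair of sequences needs one pairwise alignment score.
pairs = list(combinations(sequences, 2))
n = len(sequences)
print(pairs)
print(len(pairs), n * (n - 1) // 2)
```

Each of these pairs would get one entry in the n by n table of pairwise scores.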

1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C

CLUSTAL method
Higgins and Sharp 1988
– ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one
Applies Progressive Sequence Alignment

What do we need MSA for?

Phylogeny is the inference of evolutionary relationships.
Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are mainly used for phylogenetic analyses.
One tree of life
A sketch Darwin made soon after returning from his voyage on HMS Beagle (1831–36) showed his thinking about the diversification of species from a single stock (see Figure, overleaf). This branching, extended by the concept of common descent,

(Figures: Haeckel (1879), Pace (2001))

Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based on DNA and protein sequence data.
Molecular analysis: the chimpanzee is more closely related to the human than the gorilla is (tree leaves: Human, Chimpanzee, Gorilla, Orangutan)
Pre-molecular analysis: the great apes (chimpanzee, gorilla and orangutan) are separate from the human (tree leaves: Gorilla, Chimpanzee, Orangutan, Human)

What can we learn from a phylogenetic tree?

1. Determine the closest relatives of an organism in which we are interested
Was the extinct quagga more like a zebra or a horse?

Which species are closest to the human?
(Trees with leaves: Human, Chimpanzee, Gorilla, Orangutan and Gorilla, Chimpanzee, Orangutan, Human)

2. Help to find the relationships between species and identify new species
Example: Metagenomics
A new field in genomics that aims to study the genomes recovered from environmental samples.
A powerful tool for accessing the rich biodiversity of native environmental samples.

10^6 cells/ml seawater
10^7 virus particles/ml seawater
>99% uncultivated microbes

From: “The Sorcerer II Global Ocean Sampling Expedition: Metagenomic Characterization of Viruses within Aquatic Microbial Samples”, Williamson et al., PLoS ONE 2008

3. Discover the function of an unknown gene or protein
(Figure: tree with leaves RBP1_HS, RBP2_pig, RBP_RAT, ALP_HS, ALPEC_BV, ALPA1_RAT, ECBLC, Hypothetical protein X)

Relationships can be represented by a phylogenetic tree or dendrogram
(Figure: tree with leaves A, B, C, D, E, F)

Phylogenetic Tree Terminology
Graph composed of nodes & branches
Each branch connects two adjacent nodes
(Figure: tree with leaves A, B, C, D, E, F and root R)

Phylogenetic Tree Terminology
Rooted tree (leaves: Human, Chimp, Chicken, Gorilla)
Unrooted tree (leaves: Human, Chimp, Chicken, Gorilla)

Rooted vs. unrooted trees

How can we build a tree with molecular data?
- Trees based on DNA sequence (rRNA)
- Trees based on protein sequences

Approach 1 - Distance methods
Algorithms:
– UPGMA (rooted)
– Neighbor joining (unrooted)
Approach 2 - State methods
Algorithms:
– Maximum parsimony (MP)
– Maximum likelihood (ML)

Basic algorithm for constructing a rooted tree
Unweighted Pair Group Method using Arithmetic Averages (UPGMA)
Assumption: divergence of sequences is assumed to occur at a constant rate → distance to root is equal
Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACGCGTTGGGCGACGGTAAT
Sequence c ACGCATTGAATGATGATAAT
Sequence d ACACATTGAGTGTGATAATA
(Figure: tree with leaves a, b, c, d)

Basic Algorithm UPGMA
Sequences
Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACACATTGAGTGTGATCAAC
Sequence c ACACATTGAGTGAGGACAAC
Sequence d ACGCGTTGGGCGACGGTAAT
Distances *
Dab = 8
Dac = 7
Dad = 5
Dbc = 3
Dbd = 9
Dcd = 8
Distance Table
    a  b  c  d
a   0  8  7  5
b   8  0  3  9
c   7  3  0  8
d   5  9  8  0
* Can be calculated using different distance metrics

Selection step
    a  b  c  d
a   0  8  7  5
b   8  0  3  9
c   7  3  0  8
d   5  9  8  0
Choose the nodes with the shortest distance and fuse them.
(Figure: leaves a, d, c, b)
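The selection step can be sketched as a search for the smallest entry in the distance table. This is a minimal sketch; storing the table as a dictionary keyed by node pairs and the name `closest_pair` are choices made here for illustration.

```python
# Pairwise distances from the table above, one entry per unordered pair.
DIST = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

def closest_pair(dist):
    """Return the pair of nodes with the smallest distance."""
    return min(dist, key=dist.get)

pair = closest_pair(DIST)
print(pair, DIST[pair])  # ('b', 'c') 3
```

With this table the first nodes to fuse are b and c, at distance 3.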

Next Step
Then recalculate the distances from the remaining sequences (a and d) to the new node (e), and remove the fused nodes from the table.
D(EA) = (D(AC) + D(AB) - D(CB))/2
D(ED) = (D(DC) + D(DB) - D(CB))/2
Original table:
    a  b  c  d
a   0  8  7  5
b   8  0  3  9
c   7  3  0  8
d   5  9  8  0
Reduced table:
    a  d  e
a   0  5  6
d   5  0  7
e   6  7  0
(Figure: c and b fused into node e; a and d remain)
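The table update can be reproduced with the formula given on this slide. This is a sketch that applies the slide's rule, D(new,k) = (D(k,i) + D(k,j) - D(i,j))/2, exactly as written; it is not the plain arithmetic mean that some textbook descriptions of UPGMA use. The helper names `d` and `fuse` are hypothetical.

```python
DIST = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

def d(dist, x, y):
    """Look up a symmetric distance stored under either key order."""
    return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

def fuse(dist, i, j, new):
    """Replace nodes i and j by `new`, using the slide's update formula."""
    others = {node for pair in dist for node in pair} - {i, j}
    # Keep every distance that does not involve the fused nodes.
    reduced = {p: v for p, v in dist.items() if i not in p and j not in p}
    for k in sorted(others):
        reduced[(k, new)] = (d(dist, k, i) + d(dist, k, j) - d(dist, i, j)) / 2
    return reduced

reduced = fuse(DIST, "b", "c", "e")
print(reduced)  # {('a', 'd'): 5, ('a', 'e'): 6.0, ('d', 'e'): 7.0}
```

The result agrees with the reduced table: D(EA) = 6 and D(ED) = 7, while the distance between a and d stays 5.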

Next Step
In order to get a tree, un-fuse c and b by calculating their distances to the new node (e).
!!! The distances Dce and Dbe are calculated independently (the formula will be given in the tirgul, i.e. the recitation session)
    a  d  e
a   0  5  6
d   5  0  7
e   6  7  0
(Figure: c and b attached to node e; a and d not yet joined)

Next…
We want to fuse the next closest nodes
    a  d  e
a   0  5  6
d   5  0  7
e   6  7  0
(Figure: a and d fused into node f; c and b attached to node e)

Finally
We need to calculate the distance between e and f
D(EF) = (D(EA) + D(ED) - D(AD))/2
    f  e
f   0  4
e   4  0
(Figure: a and d attached to node f; c and b attached to node e)
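The whole procedure, from the first selection to the final distance between e and f, can be sketched as a loop. This is a sketch under stated assumptions: it uses the update formula from these slides rather than a size-weighted average, it stores distances under `frozenset` keys, and the function name `cluster` and the node labels passed in are illustrative.

```python
def cluster(dist, new_names):
    """Repeatedly fuse the closest pair, updating distances with the slides' rule.

    dist maps frozenset({x, y}) to a distance. Returns the merges as
    (left, right, new_node, distance) plus the table left after the last merge.
    """
    dist = dict(dist)
    merges = []
    for new in new_names:
        pair = min(dist, key=dist.get)      # selection step
        i, j = sorted(pair)
        dij = dist[pair]
        others = {node for p in dist for node in p} - {i, j}
        nxt = {p: v for p, v in dist.items() if not (p & {i, j})}
        for k in others:                    # distance from k to the new node
            nxt[frozenset((k, new))] = (
                dist[frozenset((k, i))] + dist[frozenset((k, j))] - dij
            ) / 2
        merges.append((i, j, new, dij))
        dist = nxt
    return merges, dist

TABLE = {
    frozenset("ab"): 8, frozenset("ac"): 7, frozenset("ad"): 5,
    frozenset("bc"): 3, frozenset("bd"): 9, frozenset("cd"): 8,
}
merges, remaining = cluster(TABLE, ["e", "f"])
print(merges)     # [('b', 'c', 'e', 3), ('a', 'd', 'f', 5)]
print(remaining)  # {frozenset({'e', 'f'}): 4.0}
```

The last remaining entry, D(EF) = 4, matches the final two-by-two table.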

(Figure: the resulting tree, with a and d joined at node f and c and b joined at node e)

IMPORTANT!
Usually we do not assume a constant mutation rate, so in order to choose the nodes to fuse we have to calculate the relative distance of each node to all other nodes.
Neighbor Joining (NJ) is an algorithm suited to cases where the rate of evolution varies.

Neighbor Joining (NJ)
- Reconstructs an unrooted tree
- Calculates branch lengths
- Based on pairwise distances
In each stage, the two nearest nodes of the tree are chosen and defined as neighbors in our tree. This is done recursively until all of the nodes are paired together.
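How "nearest" is judged relative to all other nodes can be illustrated with the selection criterion commonly given for neighbor joining, Q(i,j) = (n-2)·d(i,j) - Σk d(i,k) - Σk d(j,k). This formula does not appear in the slides, so the sketch below is an assumption about what the selection step computes; the distances are the example table used for UPGMA, and `q_matrix` is a hypothetical name.

```python
from itertools import combinations

DIST = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

def q_matrix(dist):
    """Neighbor-joining criterion: a raw distance corrected by each node's total distance."""
    nodes = sorted({node for pair in dist for node in pair})
    n = len(nodes)

    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    total = {x: sum(d(x, y) for y in nodes if y != x) for x in nodes}
    return {
        (i, j): (n - 2) * d(i, j) - total[i] - total[j]
        for i, j in combinations(nodes, 2)
    }

q = q_matrix(DIST)
for pair, value in q.items():
    print(pair, value)
# The pair(s) with the lowest Q are joined first.
print(min(q.values()))  # -32
```

For this table the pairs (a, d) and (b, c) tie at the lowest value, which is expected with four sequences: joining either pair describes the same unrooted tree.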

Advantages and disadvantages of the neighbor-joining method
Advantages
- It is fast and thus suited for large datasets
- Permits lineages with largely different branch lengths
Disadvantages
- Sequence information is reduced
- Gives only one possible tree

Problems with phylogenetic trees
- Using different regions of the same alignment may produce different trees.

Problems with phylogenetic trees

Problems with phylogenetic trees
(Figure: tree with leaves Bacillus, E.coli, Pseudomonas, Salmonella, Aeromonas, Lechevaliera, Burkholderias)

What to do?

Bootstrapping
A. We create new data sets by sampling N positions with replacement.
B. We generate such pseudo-data sets.
C. For each such data set we reconstruct a tree, using the same method.
D. We note the agreement between the tree reconstructed from the pseudo-data set and the original tree.
Note: we do not change the number of sequences!
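Steps A and B can be sketched as resampling the columns of an alignment. This is a minimal sketch; the example alignment is the four-sequence one from the MSA slides, the count of 100 pseudo-data sets and the fixed random seed are assumptions made here for illustration, and `bootstrap_alignment` is a hypothetical name. Tree reconstruction (steps C and D) is not shown.

```python
import random

ALIGNED = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]

def bootstrap_alignment(rows, rng):
    """Sample N column positions with replacement; every sequence is kept."""
    n_cols = len(rows[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column picks are applied to every row, so columns stay intact.
    return ["".join(row[i] for i in picks) for row in rows]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
replicates = [bootstrap_alignment(ALIGNED, rng) for _ in range(100)]
print(replicates[0])
```

Each pseudo-data set has the same number of sequences and the same length as the original alignment; only the mix of columns changes.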

Bootstrapped tree
(Figure labels: less reliable branch, highly reliable branch)

Open Questions
Do DNA and protein sequences from the same gene produce different trees?
Can different genes have different evolutionary histories?
