Phylogenetic analysis taken from and es/MSAPhylogeny.htm.


Phylogenetic analysis taken from and es/MSAPhylogeny.htm And Introduction to Bioinformatics course slides

Purpose of phylogenetics:
– Reconstruct the evolutionary relationships between species. Experience shows that closely related organisms have similar sequences, while more distantly related organisms have more dissimilar sequences.
– Estimate the time of divergence between two organisms since they last shared a common ancestor.
But…
– The theory and practical applications of the different models are not universally accepted.
– It is important to have a good alignment to start with (garbage in, garbage out).
– Trees based on an alignment of a gene represent the relationship between genes, and this is not necessarily the same as the relationship between the whole organisms.
– If trees are calculated from different genes of the same organisms, these trees may show different relationships.

Why is phylogeny important?
– Determining the tree of life (e.g., for a new organism)
– Determining gene function
– Understanding which parts of the gene/regulatory sequences are important
– Tracing the evolution of genes: horizontal gene transfer etc.

Protein or DNA?
As with Multiple Sequence Alignment, proteins are preferred:
– More informative
– Shorter in length
– Less chance of multiple mutations at the same site
When DNA?
– A non-coding sequence
– Proteins too similar

Terminology:
– node: represents a taxonomic unit. This can be a taxon (an existing species) or an ancestor (an unknown species that represents the ancestor of 2 or more species).
– branch: defines the relationship between the taxa in terms of descent and ancestry.
– topology: the branching pattern.
– branch length: often represents the number of changes that have occurred in that branch.
– root: the common ancestor of all taxa.
– distance scale: a scale that represents the number of differences between sequences (e.g. 0.1 means 10% difference).

Possible ways of drawing a tree : Unscaled branches : the length is not proportional to the number of changes.

Possible ways of drawing a tree : Scaled branches : the length of the branch is proportional to the number of changes (usually in PAMs). The distance between 2 species is the sum of the length of all branches connecting them.
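As an illustration of "the distance between 2 species is the sum of the length of all branches connecting them", here is a minimal Python sketch. The tree, its node names and its branch lengths are invented for the example; the child-to-parent dictionary is just one convenient representation, not something the slides prescribe.

```python
# Toy scaled tree (hypothetical values): child -> (parent, branch length).
# The root has no entry of its own.
TREE = {
    "human": ("anc1", 1.0),
    "chimp": ("anc1", 1.5),
    "anc1": ("root", 2.0),
    "gorilla": ("root", 3.0),
}

def distances_to_ancestors(tree, node):
    """Map every ancestor of `node` (including itself) to its distance from `node`."""
    result = {}
    travelled = 0.0
    while node is not None:
        result[node] = travelled
        parent, length = tree.get(node, (None, 0.0))
        travelled += length
        node = parent
    return result

def tree_distance(tree, a, b):
    """Sum of the branch lengths on the path connecting a and b."""
    up_a = distances_to_ancestors(tree, a)
    up_b = distances_to_ancestors(tree, b)
    # The path goes through the closest common ancestor, which minimises the sum.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)

print(tree_distance(TREE, "human", "chimp"))
print(tree_distance(TREE, "human", "gorilla"))
```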

Possible ways of drawing a tree : Rooted trees: the root is the common ancestor. The direction of each path from the root corresponds to evolutionary time. Unrooted tree: specifies the relationships among species and does not define the evolutionary path.

Rooted vs. unrooted trees

Rooted vs. unrooted: the position of the root does not affect the MP (maximum parsimony) score.

Intuition why rooting doesn’t change the score: the change will always be on the same branch, no matter where the root is positioned… [Figure: tree with leaves s1, s4, s3, s2, s5; gene number; states 1 or 0]

We want rooted trees! How can we root the tree?


Gorilla gorilla (gorilla), Homo sapiens (human), Pan troglodytes (chimpanzee), Gallus gallus (chicken)

Evaluate all 3 possible UNROOTED trees:
– Human, Chimp | Chicken, Gorilla
– Human, Gorilla | Chimp, Chicken
– Human, Chicken | Chimp, Gorilla
[Figure: the three unrooted trees, with the MP tree marked]

Rooting based on a priori knowledge. [Figure: the unrooted tree of Human, Chimp, Chicken, Gorilla and its rooted version]

Ingroup / Outgroup: Human, Chimp and Gorilla are the INGROUP; Chicken is the OUTGROUP.

Tree of life

Distance-based methods
– Compress all of the individual differences between pairs of sequences into a single number: the distance.
– Starting from an alignment, pairwise distances are calculated between DNA sequences as the sum of all base pair differences between two sequences (the most similar sequences are assumed to be closely related). This creates a distance matrix.
– From the distance matrix, a phylogenetic tree is calculated with clustering algorithms. These cluster methods construct a tree by linking the least distant pair of taxa, followed by successively linking more distant taxa.
– Algorithms: UPGMA clustering, Neighbor Joining.
– Assumes a molecular clock (this applies to UPGMA; Neighbor Joining does not require it).
– Tool: ClustalW!
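The clustering idea can be sketched in a few lines of Python. This is a bare-bones UPGMA that only returns the topology (no branch lengths), using the count of differing positions as the distance. The four short sequences are made up for the example and are not real gene data; real analyses would use a tool such as ClustalW or PHYLIP.

```python
from itertools import combinations

# Hypothetical aligned toy sequences (not real data).
SEQS = {
    "human": "ACGTACGT",
    "chimp": "ACGTACGA",
    "gorilla": "ACGTAGCA",
    "chicken": "TGGTTGCC",
}

def count_differences(a, b):
    """Distance = number of alignment positions where the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def upgma(seqs):
    """Repeatedly link the least distant pair of clusters; return a nested-parenthesis tree."""
    dist = {
        tuple(sorted((a, b))): float(count_differences(seqs[a], seqs[b]))
        for a, b in combinations(seqs, 2)
    }
    size = {name: 1 for name in seqs}  # cluster name -> number of taxa in it
    while len(size) > 1:
        a, b = min(dist, key=dist.get)  # least distant pair
        merged = f"({a},{b})"
        merged_size = size[a] + size[b]
        for c in size:
            if c in (a, b):
                continue
            d_ac = dist.pop(tuple(sorted((a, c))))
            d_bc = dist.pop(tuple(sorted((b, c))))
            # UPGMA: size-weighted average of the distances to the two merged clusters.
            dist[tuple(sorted((merged, c)))] = (size[a] * d_ac + size[b] * d_bc) / merged_size
        del dist[(a, b)]
        del size[a], size[b]
        size[merged] = merged_size
    return next(iter(size))

print(upgma(SEQS))
```

On this toy input, human and chimp are linked first, then gorilla, and chicken joins last.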

Cladistic methods
– Trees are calculated by considering the various possible pathways of evolution and are based on parsimony or likelihood methods.
– These methods use each alignment position as evolutionary information to build a tree.
– Parsimony: looks for the most parsimonious tree, i.e. the tree with the fewest evolutionary changes for all sequences to derive from a common ancestor. Slower than distance methods. Assumes molecular clock.
– Maximum Likelihood: looks for the tree with the maximum likelihood, i.e. the most probable tree. This is the slowest method of all but seems to give the best result and the most information about the tree. No molecular clock assumption.
– Tool: Phylip
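To make "the tree with the fewest evolutionary changes" concrete, here is a small sketch that scores a given tree with Fitch's counting algorithm, one common way to compute a parsimony score (the slides do not name a specific algorithm). The sequences are invented toy data, and trees are written as nested tuples with an arbitrary root.

```python
# Hypothetical aligned toy sequences (not real data).
ALIGNMENT = {
    "human": "ACGTACGT",
    "chimp": "ACGTACGA",
    "gorilla": "ACGTAGCA",
    "chicken": "TGGTTGCC",
}

def fitch_site(tree, states):
    """Return (possible ancestral states, minimum number of changes) for one alignment column."""
    if isinstance(tree, str):  # leaf
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on one of the two branches.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total number of changes over all alignment positions."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# The three possible topologies for four taxa.
TREES = {
    "human+chimp": (("human", "chimp"), ("gorilla", "chicken")),
    "human+gorilla": (("human", "gorilla"), ("chimp", "chicken")),
    "human+chicken": (("human", "chicken"), ("chimp", "gorilla")),
}
for label, tree in TREES.items():
    print(label, parsimony_score(tree, ALIGNMENT))
```

Moving the root of a topology, for example writing the first tree as `("human", ("chimp", ("gorilla", "chicken")))`, gives the same score, which matches the earlier remark that the root position does not affect the MP score.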

Two homologous DNA sequences which descended from an ancestral sequence and accumulated mutations since their divergence from each other. Note that although 12 mutations have accumulated, differences can be detected at only three nucleotide sites. Even the best evolutionary models can't solve this problem...
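The gap between accumulated and detectable mutations can be reproduced with a tiny simulation. This is only a sketch under simple assumptions that are not from the slides: every site is equally likely to mutate, every mutation picks a different base uniformly, and the sequence length, mutation counts and seed are arbitrary.

```python
import random

def mutate(seq, n_mutations, rng):
    """Apply n point mutations at random sites; the same site may be hit more than once."""
    seq = list(seq)
    for _ in range(n_mutations):
        i = rng.randrange(len(seq))
        seq[i] = rng.choice([base for base in "ACGT" if base != seq[i]])
    return "".join(seq)

def count_differences(a, b):
    """Number of sites at which a difference can actually be observed."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(1)  # arbitrary seed, for reproducibility only
ancestor = "".join(rng.choice("ACGT") for _ in range(10))
mutations_per_lineage = 6
seq1 = mutate(ancestor, mutations_per_lineage, rng)
seq2 = mutate(ancestor, mutations_per_lineage, rng)

total_mutations = 2 * mutations_per_lineage
observed = count_differences(seq1, seq2)
print("mutations accumulated:", total_mutations)
print("differences observed: ", observed)
```

Multiple hits at one site, back mutations and parallel changes in both lineages all hide mutations, so the observed count can never exceed the accumulated count and is usually smaller.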

Molecular clocks
– Assumption: constant rate of evolution
– Different rate for different genes
[Figure: change plotted against millions of years since divergence; Dickerson, 1971]

Human insulin

Insulin multiple alignment

Problems with molecular clocks: Surprisingly, insulin from the guinea pig evolved seven times faster than insulin from other species. Why? The answer is that guinea pig insulin does not bind two zinc ions, while insulin molecules from most other species do. There was a relaxation of the structural constraints on these molecules, and so the genes diverged rapidly.

Building trees with ClustalW [Screenshot annotations: “Place alignment here”, “Choose a tree here”]

PHYLIP
– A suite of phylogeny tools
– Both web servers and stand-alone applications
– Used for distance/parsimony/maximum likelihood
– /phylip-uk.html

Sequences

Bootstrapping
– Assigns confidence to individual tree branches
– Columns of the alignment are randomly sampled (with replacement) and the tree is recomputed
– Repeated for X iterations
– Bootstrap value of a branch = how many iterations had it
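The resampling step can be sketched as follows. To keep the example self-contained, the "tree" is reduced to a stand-in: the single least distant pair of taxa. A real analysis would recompute a full tree (e.g. with PHYLIP) on each resampled alignment. The sequences, iteration count and seed are arbitrary illustration values.

```python
import random
from collections import Counter
from itertools import combinations

# Hypothetical aligned toy sequences (not real data).
ALIGNMENT = {
    "human": "ACGTACGT",
    "chimp": "ACGTACGA",
    "gorilla": "ACGTAGCA",
    "chicken": "TGGTTGCC",
}

def resample_columns(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

def closest_pair(alignment):
    """Stand-in for a tree builder: the least distant pair of taxa, treated as one branch."""
    def differences(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return frozenset(min(combinations(sorted(alignment), 2), key=differences))

def bootstrap_support(alignment, n_iterations, seed=0):
    """Fraction of bootstrap iterations in which each branch (here: closest pair) appears."""
    rng = random.Random(seed)
    counts = Counter(
        closest_pair(resample_columns(alignment, rng)) for _ in range(n_iterations)
    )
    return {branch: n / n_iterations for branch, n in counts.items()}

for branch, support in bootstrap_support(ALIGNMENT, n_iterations=200).items():
    print(sorted(branch), support)
```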

Collections of homologous genes
– Entrez
– COG (Clusters of Orthologous Groups): results of BLAST all-vs-all between genomes. Genes within the same COG are “pairwise best hits”.
– RDP (ribosomal sequences): the “standard” sequences for doing species phylogeny. Focused on Bacteria.

Orthologs: Homologous sequences are orthologous if they were separated by a speciation event. If a gene exists in a species, and that species diverges into two species, then the copies of this gene in the resulting species are orthologous.

Orthologs: Orthologs will typically have the same or similar function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.

Orthologs [Figure: a speciation event splits an ancestor into descendant 1 (e.g., human) and descendant 2 (e.g., dog)]

Paralogs: Homologous sequences are paralogous if they were separated by a gene duplication event. If a gene in an organism is duplicated, then the two copies are paralogous.

Paralogs: Orthologs will typically have the same or similar function. This is not always true for paralogs: because the original selective pressure no longer acts on one copy of the duplicated gene, that copy is free to mutate and acquire new functions.

Paralogs [Figure: a duplication event]

(taken from NCBI)

Using BLAST and phylogeny to study gene evolution

Mol. Biol. Evol. (2005) 22:

Evolutionary rate and conservation: Functionally or structurally important sites are conserved. Conserved sites → “slow” evolving sites. Variable sites → “fast” evolving sites. Sites which are under a functional/structural constraint are conserved, and evolve slowly.

Conservation in an MSA
S1 KITAYCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN
S2 MPFERCELARTLKRMADADIRGVSLANWVCLAKWFWDGG
S3 MPFERCELARTLKRMMDADIRGVSLANWVCLAKWFWDGG
From the MSA (and the tree), one can determine how conserved a gene is.
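One simple way to put a number on conservation is the frequency of the most common residue in each column. The sketch below applies it to the three sequences from the slide; this particular score is an illustrative choice, not the measure used in the study discussed next.

```python
from collections import Counter

MSA = {
    "S1": "KITAYCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN",
    "S2": "MPFERCELARTLKRMADADIRGVSLANWVCLAKWFWDGG",
    "S3": "MPFERCELARTLKRMMDADIRGVSLANWVCLAKWFWDGG",
}

def column_conservation(msa):
    """For each column, the fraction of sequences sharing the most common residue."""
    sequences = list(msa.values())
    return [
        Counter(column).most_common(1)[0][1] / len(column)
        for column in zip(*sequences)
    ]

scores = column_conservation(MSA)
fully_conserved = [i for i, score in enumerate(scores) if score == 1.0]
print("columns:", len(scores))
print("fully conserved column indices:", fully_conserved)
```

Columns with score 1.0 are identical in all three sequences ("slow" sites); lower scores mark variable ("fast") sites.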

“Inverse relation between evolutionary rate and age of mammalian genes”: Protocol

Step 1 – BLAST: build the dataset of mammalian genes

Step 1 – BLAST: build the dataset of mammalian genes, based on mouse-human ortholog pairs. The orthologs are defined as pairs of reciprocal BLAST hits. Eliminate genes with more than one potential orthologous sequence. Select only genes for which the human protein was functionally annotated.
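The "pairs of reciprocal BLAST hits" definition can be sketched as below. The hit tables, gene identifiers and scores are hypothetical, and real BLAST output would first have to be parsed into (query, subject, score) tuples. The extra filters from the slide (more than one potential ortholog, functional annotation) are not implemented here.

```python
def best_hits(hits):
    """hits: iterable of (query, subject, score). Return query -> highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(human_vs_mouse, mouse_vs_human):
    """Keep (human, mouse) pairs where each gene is the other's best hit."""
    forward = best_hits(human_vs_mouse)
    backward = best_hits(mouse_vs_human)
    return sorted(
        (human, mouse) for human, mouse in forward.items() if backward.get(mouse) == human
    )

# Hypothetical hit tables: (query, subject, bit score).
HUMAN_VS_MOUSE = [
    ("hGeneA", "mGeneA", 250.0),
    ("hGeneA", "mGeneB", 90.0),
    ("hGeneB", "mGeneA", 120.0),
    ("hGeneB", "mGeneB", 110.0),
]
MOUSE_VS_HUMAN = [
    ("mGeneA", "hGeneA", 245.0),
    ("mGeneB", "hGeneB", 115.0),
]

print(reciprocal_best_hits(HUMAN_VS_MOUSE, MOUSE_VS_HUMAN))
```

Here hGeneB's best mouse hit is mGeneA, but mGeneA's best human hit is hGeneA, so only the hGeneA/mGeneA pair is kept.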

Step 2 – Calculate conservation

Step 2 – Calculate evolutionary rates (conservation). For each orthologous pair: alignment at the amino acid level, then measure the evolutionary rate. The dataset contained 6,776 human-mouse gene pairs.

Step 3 – Assignment of temporal categories: How old is each gene? Used BLAST to find homologs in 6 different eukaryotic genomes.

Caenorhabditis elegans, Schizosaccharomyces pombe, Takifugu rubripes, Drosophila melanogaster, Arabidopsis thaliana, Saccharomyces cerevisiae

What is Old? Presence of any homolog in all the 6 genomes. What is Presence? Using an e-value cutoff in BLAST. Categories: OLD, METAZOANS, DEUTEROSTOMES, TETRAPODS.

METAZOANS: Organisms whose bodies consist of many cells, as distinct from Protozoa, which are unicellular; also commonly called animals. DEUTEROSTOMES: The second of the two main groups of bilaterally symmetrical animals. The name derives from 'deutero' (second) 'stome' (mouth), referring to the origin of the definitive mouth as an opening independent from the blastopore of the embryo. TETRAPODS: Any four-legged animals, including mammals, birds, reptiles and amphibians.

[Figure: tree of Human, Mouse, Fish, Insect, Worm, Yeast, Plant, with the groups Tetrapods, Deuterostomes, Metazoa and Old (eukaryotes)]
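A rough sketch of how such an age assignment could look in code. Only the rule for OLD (a homolog in all 6 genomes) is stated on the slides. The rules for the other three categories are an assumption read off the tree figure (most distant group in which a homolog is detected), and the "UNCLASSIFIED" fallback is invented here for cases the slides do not cover.

```python
FISH = {"Takifugu rubripes"}
INVERTEBRATES = {"Drosophila melanogaster", "Caenorhabditis elegans"}
NON_ANIMALS = {
    "Schizosaccharomyces pombe",
    "Saccharomyces cerevisiae",
    "Arabidopsis thaliana",
}
ALL_GENOMES = FISH | INVERTEBRATES | NON_ANIMALS

def temporal_category(genomes_with_homolog):
    """Assign an age category from the set of genomes in which BLAST found a homolog."""
    found = set(genomes_with_homolog) & ALL_GENOMES
    if found == ALL_GENOMES:
        return "OLD"            # rule stated on the slide
    if found & NON_ANIMALS:
        return "UNCLASSIFIED"   # assumption: partial yeast/plant hits are not covered by the slides
    if found & INVERTEBRATES:
        return "METAZOANS"      # assumption: homolog in fly or worm
    if found & FISH:
        return "DEUTEROSTOMES"  # assumption: homolog in fish only
    return "TETRAPODS"          # assumption: no homolog outside human/mouse

print(temporal_category(ALL_GENOMES))
print(temporal_category({"Takifugu rubripes", "Drosophila melanogaster"}))
print(temporal_category({"Takifugu rubripes"}))
print(temporal_category(set()))
```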

Results

Negative correlation between “age” of genes and the rate of evolution. [Figure: evolutionary rate for each age category]

Control: changing the sensitivity of the BLAST detection to a more conservative one did not significantly affect the result.

Explanations

Either functional constraints remained constant throughout the evolutionary history of each gene, but the newer genes are less constrained than older genes; or functional constraints are not constant: rather, they are weak at the time of origin of a gene and become progressively more stringent with age.

Eran Elhaik, Niv Sabath, and Dan Graur Mol. Biol. Evol. 23(1):1–

Goal: To show that these results are an artifact caused by our inability to detect similarity when genetic distances are large.

Simulation

The evolutionary process [Figure, built up over five slides: a tree with leaves Rat, Dog, Cat, Mouse, Fly and a table of replacement probabilities (Ala, Arg, Val, …). The states at the internal nodes (V, V, L, L) are filled in step by step, ending with L, L, I, M, V at the leaves.]

The evolutionary process... and repeat the process for all positions (assume: each position evolves independently):
Rat   L M T G S H M G N F I I
Mouse L M T G S G M A N H V I
Cat   I M T G S H I G Y A M F
Dog   M M T G S G I G L T R A
Fly   V M T G S W R G R M Y A

The aim of the simulations: generate sequences with the following phylogenetic relationships [Figure: tree with leaves D, A, B, E, C]. All the genes originated in the common ancestor of A, B, C, D, E and are, thus, of equal age. Some are similar to the human and mouse orthologous genes; the others are remote homologs from increasingly distant taxa (similar to fish, insect, yeast…).

Simulation: They simulated genes with 101 different rates. High rate → higher likelihood of an amino acid replacement in each branch.
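The simulation idea can be sketched like this. Everything specific here is an assumption made for illustration: the tree shape (A and B as close relatives, C, D and E increasingly distant), the branch lengths, the uniform choice of replacement amino acid (the slides use a table of replacement probabilities), and the rule that the per-site replacement probability on a branch is rate × branch length, capped at 1.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def evolve(seq, rate, branch_length, rng):
    """Evolve a sequence along one branch; each position evolves independently."""
    p_replace = min(1.0, rate * branch_length)
    out = []
    for aa in seq:
        if rng.random() < p_replace:
            aa = rng.choice([x for x in AMINO_ACIDS if x != aa])
        out.append(aa)
    return "".join(out)

def simulate(node, parent_seq, rate, rng, leaves=None):
    """node = (name, branch length, children). Return {leaf name: sequence}."""
    name, branch_length, children = node
    seq = evolve(parent_seq, rate, branch_length, rng)
    if leaves is None:
        leaves = {}
    if not children:
        leaves[name] = seq
    for child in children:
        simulate(child, seq, rate, rng, leaves)
    return leaves

# Hypothetical tree ((((A,B),C),D),E) with made-up branch lengths.
TREE = ("root", 0.0, [
    ("n3", 1.0, [
        ("n2", 1.0, [
            ("n1", 1.0, [("A", 0.5, []), ("B", 0.5, [])]),
            ("C", 1.5, []),
        ]),
        ("D", 2.5, []),
    ]),
    ("E", 3.5, []),
])

def identity(a, b):
    """Fraction of identical positions between two sequences of equal length."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)  # arbitrary seed
ancestor = "".join(rng.choice(AMINO_ACIDS) for _ in range(200))
for rate in (0.0, 0.05, 0.2):
    leaves = simulate(TREE, ancestor, rate, rng)
    print(rate, {name: round(identity(leaves["A"], seq), 2) for name, seq in leaves.items()})
```

All simulated genes are equally old by construction, yet at higher rates the distant relatives of A share fewer identical positions with it, which is what makes them harder to detect by a similarity search.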

After simulating the sequences: use BLAST, in the same way that Alba and Castresana used it, to detect homology between gene A and genes C, D and E.

Only one difference, the group names: OLD → SENIORS, METAZOANS → ADULTS, DEUTEROSTOMES → TEENAGERS, TETRAPODS → TODDLERS

Results

Same as Alba and Castresana

But all the simulated genes are of the same “age”. What is the problem???

We can only count genes that are identified as homologous by the protocol… BLAST

Alba and Castresana may have, thus, failed to spot the vast majority of homologs from among the fastest evolving genes.

The vast majority of the fastest evolving genes are undetectable even when the cutoffs are extremely permissive.

Conclusion

The inverse relationship between evolutionary rate and gene age is an artifact caused by our inability to detect similarity when genetic distances are large.

Since genetic distance increases with time of divergence and rate of evolution, it is difficult to identify homologs of fast evolving genes in distantly related taxa. Thus, fast evolving genes may be misclassified as “new”.

So, the only conclusion that can be drawn from Alba and Castresana’s study is that slowly evolving genes evolve slowly!!!