Phylogenetics workshop: Protein sequence phylogeny Darren Soanes.

Presentation transcript:

Phylogenetics workshop: Protein sequence phylogeny Darren Soanes

Parts of a tree The plural of taxon is taxa.

Phylogenetic tree: evolutionary family tree Nodes in the tree represent speciation events, where an ancestral lineage gives rise to daughter lineages.

Relationships in trees

Rooting a tree Root: the most recent common ancestor of all the taxa in a tree. Outgroup: a taxon outside the group of interest. All the members of the group of interest are more closely related to each other than they are to the outgroup, so the outgroup is used to root the tree.

Outgroup

Rooted and unrooted trees [Figure: a rooted tree and an unrooted tree]

Cladogram Phylogram

Evolution of Amino Acid Sequences Amino acid sequences change due to mutations in the DNA sequence. Amino acid sequences evolve more slowly than DNA sequences. Selection acts on protein sequences. Gene trees can be created using protein sequences.

DNA mutations (1) Synonymous substitution – change in DNA sequence that does not affect the amino acid sequence, often in the third position of a codon, e.g. CCG (Pro)→CCA (Pro). Non-synonymous substitution - change in DNA sequence that does affect the amino acid sequence, often in the first or second position of a codon, e.g. CCG (Pro)→CAG (Gln).

Genetic Code

DNA mutations (2) A non-synonymous substitution is also called a missense mutation. Nonsense mutation: a stop codon is introduced into the middle of a sequence, e.g. TGG (Trp)→TAG (Stop). Insertion / deletion (indel): causes a frame shift if its length is not a multiple of three bases. Nonsense and frame-shift mutations usually produce non-functional proteins.
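The substitution types described on the two "DNA mutations" slides can be checked with a short script. This is a minimal sketch that assumes the standard genetic code; the function name `classify_substitution` and the label "stop-lost" are illustrative and do not come from the slides.

```python
from itertools import product

# Standard genetic code, written in the usual TCAG table order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_before, codon_after):
    """Classify a codon change by its effect on the encoded amino acid."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"   # a stop codon is introduced
    if before == "*":
        return "stop-lost"  # illustrative label, not used in the slides
    return "missense"       # non-synonymous amino acid change


# The three examples given in the slides:
print(classify_substitution("CCG", "CCA"))  # Pro -> Pro
print(classify_substitution("CCG", "CAG"))  # Pro -> Gln
print(classify_substitution("TGG", "TAG"))  # Trp -> Stop
```

Indels are not handled here, because they change the reading frame rather than a single codon.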

Amino acid substitution matrices (1) Substitutions between amino acids that are similar in properties are more common. Cysteine, glycine and tryptophan rarely change. Substitution matrices score how likely one amino acid is to change to another.

Families of amino acids

Amino acid substitution matrices (2) Amino acid substitution matrices are derived empirically from alignments of sets of closely related protein sequences. Examples include Dayhoff, BLOSUM (used in BLAST searches), WAG, JTT and LG. Different matrices are suitable for proteins encoded by mitochondrial genomes, e.g. MtREV.

BLOSUM 62 Matrix

Rates of amino acid change The rate of substitution varies at different positions in an amino acid sequence. A proportion of sites are likely to be invariant; these generally have an essential role in the function of the protein. A gamma distribution models the variation of rates at different sites. Sites are sorted into gamma rate categories.

Structure of thrombin showing catalytic triad (conserved in serine proteases)

Phylogenetic analysis Phylogenetic analysis programs take an alignment of protein sequences and attempt to produce a phylogenetic tree showing the evolutionary relationships between the sequences. The user can select the amino acid substitution matrix and the number of gamma rate categories; the program will estimate the proportion of invariant sites. Programs use these parameters and the protein alignment to estimate the evolutionary distance between sequences, and calculate the topology and branch lengths of the final tree.

Distance Methods Evolutionary distance is calculated for all pairs of taxa. UPGMA: assumes the rate of substitution is constant. Least squares: allows different rates of substitution in different branches. Minimum evolution (ME): the topology with the smallest sum of branch lengths is chosen. ME can take a long time to compute; the neighbour joining (NJ) method is a simplified version of ME and is much quicker.
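To make the distance approach concrete, the sketch below computes an uncorrected pairwise distance (the proportion of differing sites) and clusters taxa with UPGMA. It is a simplified illustration: real programs correct distances using a substitution model, and this version returns only the topology as nested tuples, without branch lengths. The function names and the toy distances are assumptions made for the example.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    dist maps frozenset({name_i, name_j}) to a distance.
    Returns the topology as nested 2-tuples of taxon names.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


toy = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}
print(upgma(["A", "B", "C"], toy))
```

Because UPGMA assumes a constant rate of substitution, it can group taxa incorrectly when rates differ between lineages, which is one reason the other distance methods exist.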

Maximum parsimony For each topology, the smallest number of amino acid substitutions that could explain the evolutionary process is calculated. The topology that requires the smallest number of substitutions is chosen as the best one.
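One common way to count the minimum number of substitutions on a given topology is Fitch's algorithm, sketched below for rooted binary trees written as nested tuples. The slides do not name a specific algorithm, so treat this as one possible illustration; the toy alignment and function names are invented for the example.

```python
def fitch_score(tree, states):
    """Minimum number of substitutions for one alignment column.

    tree: nested 2-tuples with taxon names (strings) as leaves.
    states: dict mapping taxon name to the residue in this column.
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # No shared state: one substitution is needed on this branch pair.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree, alignment):
    """Sum the per-column Fitch scores over the whole alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(length)
    )


toy_alignment = {"A": "AK", "B": "AK", "C": "GR", "D": "GR"}
for topology in ((("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))):
    print(topology, parsimony_score(topology, toy_alignment))
```

The topology with the lower score would be preferred under maximum parsimony.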

Maximum likelihood (ML) For each topology the likelihood is calculated that the known sequences could have evolved on that tree (branch lengths and substitution rate parameters optimised). Topology with the best likelihood score is chosen. Takes a long time to compute ML of every possible tree. Heuristic methods such as quartet puzzling reduce the number of candidate trees. Programs that use ML methods: PhyML, RAxML, TreePuzzle (uses quartet puzzling).
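The need for heuristics becomes clear when the number of possible topologies is counted. For n taxa there are (2n-5)!! distinct unrooted bifurcating trees, a standard combinatorial result. The helper below is a small sketch for exploring how fast this grows; its name is made up for the example.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa labelled taxa: (2n-5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```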

Bootstrapping Tests the reliability of a tree. The initial protein alignment is resampled (by sampling columns at random, with replacement). Tree construction is repeated for each resampled alignment. For each group of taxa in the original tree, it is determined what percentage of the resampled trees contain the same group. Alternative: approximate likelihood-ratio test.
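The resampling step can be sketched as follows. Columns are drawn at random with replacement, so each replicate has the same length as the original alignment. The `infer_clades` argument is a hypothetical callable standing in for whatever tree-building program is used; it is assumed to return the set of taxon groups (as frozensets) found in the tree it builds.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping rows in step."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_support(alignment, infer_clades, replicates=100, seed=0):
    """Percentage of replicate trees containing each group from the original tree.

    infer_clades is a placeholder for a real tree-building step.
    """
    rng = random.Random(seed)
    original = infer_clades(alignment)
    counts = {clade: 0 for clade in original}
    for _ in range(replicates):
        replicate_clades = infer_clades(bootstrap_alignment(alignment, rng))
        for clade in original:
            if clade in replicate_clades:
                counts[clade] += 1
    return {clade: 100 * n / replicates for clade, n in counts.items()}
```

In practice the tree-building programs usually perform this procedure themselves, so a script like this is mainly useful for understanding what the support values mean.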

Bayesian methods A sample is taken of a large number of trees with high likelihood. Posterior probabilities are calculated for different events of interest. The Markov Chain Monte Carlo method is used to generate samples of trees. MrBayes uses these methods.

Taxon sampling Take initial protein sequence. Decide which range of species you are interested in. Use BLAST to find homologous sequences in databases, either NCBI database or individual genome databases.

FASTA formatted file
>YJL052W_Saccharomyces_cerevisiae
MIRIAINGFGRIGRLVLRLALQRKDIEVVAVNDPFISNDYAAYMVKYDSTHGRYKGTVSH
DDKHIIIDGVKIATYQERDPANLPWGSLKIDVAVDSTGVFKELDTAQKHIDAGAKKVVIT
APSSSAPMFVVGVNHTKYTPDKKIVSNASCTTNCLAPLAKVINDAFGIEEGLMTTVHSMT
ATQKTVDGPSHKDWRGGRTASGNIIPSSTGAAKAVGKVLPELQGKLTGMAFRVPTVDVSV
VDLTVKLEKEATYDQIKKAVKAAAEGPMKGVLGYTEDAVVSSDFLGDTHASIFDASAGIQ
LSPKFVKLISWYDNEYGYSARVVDLIEYVAKA*
>YJR009C_Saccharomyces_cerevisiae
MVRVAINGFGRIGRLVMRIALQRKNVEVVALNDPFISNDYSAYMFKYDSTHGRYAGEVSH
DDKHIIVDGHKIATFQERDPANLPWASLNIDIAIDSTGVFKELDTAQKHIDAGAKKVVIT
APSSTAPMFVMGVNEEKYTSDLKIVSNASCTTNCLAPLAKVINDAFGIEEGLMTTVHSMT
ATQKTVDGPSHKDWRGGRTASGNIIPSSTGAAKAVGKVLPELQGKLTGMAFRVPTVDVSV
VDLTVKLNKETTYDEIKKVVKAAAEGKLKGVLGYTEDAVVSSDFLGDSNSSIFDAAAGIQ
LSPKFVKLVSWYDNEYGYSTRVVDLVEHVAKA*
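A FASTA file like the one above can be read with a few lines of code: each record starts with a ">" header line, followed by one or more lines of sequence. The parser below is a minimal sketch; it assumes that a trailing "*" marks the stop codon and can be dropped, and the function name is illustrative.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Drop a trailing '*' (stop codon marker) if present.
            records[name].append(line.rstrip("*"))
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1
MIRI
AING*
>seq2
MVRV*
"""
for header, sequence in parse_fasta(example).items():
    print(header, len(sequence), sequence)
```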

Multiple sequence alignment Take a FASTA file of the sequences you are interested in. Align the sequences using ClustalW, Muscle or TCoffee.

Sampling of conserved blocks To get reliable trees non-aligned and poorly conserved areas of sequence need to be removed. Gblocks samples highly conserved blocks of sequence.
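The idea of keeping only well-aligned, conserved positions can be illustrated with a much simpler filter than the one Gblocks applies. The sketch below keeps a column only if it has no gaps and its most common residue reaches a chosen fraction of the sequences. The threshold and function names are assumptions for this example, and the real program uses more elaborate, block-based criteria.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction=0.5):
    """Indices of gap-free columns whose commonest residue meets min_fraction."""
    seqs = list(alignment.values())
    keep = []
    for i in range(len(seqs[0])):
        column = [seq[i] for seq in seqs]
        if "-" in column:
            continue  # drop any column containing a gap
        top_count = Counter(column).most_common(1)[0][1]
        if top_count / len(column) >= min_fraction:
            keep.append(i)
    return keep


def trim_alignment(alignment, min_fraction=0.5):
    """Return the alignment restricted to the conserved columns."""
    keep = conserved_columns(alignment, min_fraction)
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


toy = {"A": "MK-LV", "B": "MKALI", "C": "MRALT"}
print(trim_alignment(toy))
```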

Sequence alignment and sampling conserved block

Which substitution model should I use? ModelGenerator takes your sequence alignment and calculates the best amino acid substitution model to use.

Creating tree Take the alignment produced by Gblocks and use your program of choice to generate a tree (using the substitution model suggested by ModelGenerator and specifying the number of gamma rate categories; 4 is sufficient). File format problems: different programs use different file formats, so use Readseq to convert between them. Use a tree viewing program to look at a graphical representation of the tree (TreeView, TreeDyn).

TOR gene duplication events in fungi TOR: protein kinase, subunit of a complex that regulates cell growth in response to nutrient availability and cellular stresses

Workshop task Looking at evolution of genes encoding two types of phosphoglycerate mutase in fungi.

Two types of phosphoglycerate mutase (PGM) Both catalyse the same overall reaction: 3-phosphoglycerate → 2-phosphoglycerate
Cofactor-dependent PGM (dPGM) uses 2,3-bisphosphoglycerate (2,3BPG) as a cofactor: 3PG + P-Enzyme → 2,3BPG + Enzyme → 2PG + P-Enzyme
Cofactor-independent PGM (iPGM) has two bound Mn(II) ions at its active site: 3PG + Enzyme → PG + P-Enzyme → 2PG + Enzyme

Two types of phosphoglycerate mutase (PGM) dPGM found in yeasts and vertebrates iPGM found in filamentous fungi, plants and some invertebrates Both can be found in bacteria. No sequence similarity between the two forms of the enzyme.

Structure of iPGM

Structure of dPGM

Task Use BLAST search to find PGM protein sequences in a sample of fungal species. Use these to create phylogenetic trees showing the evolution of genes encoding these enzymes.

Taxon sampling (get sequences with BLAST) → Alignment (ClustalW) → Sampling conserved positions (Gblocks) → Determine substitution model (ModelGenerator) → Create tree (PhyML) → Visualise tree (TreeDyn)