Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology August 18, 2006, Stanford University, CA AUTOMATED ASSEMBLY OF GENE FAMILIES.

Presentation transcript:

Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology August 18, 2006, Stanford University, CA AUTOMATED ASSEMBLY OF GENE FAMILIES AND DETECTION OF HORIZONTALLY TRANSFERRED GENES Superfamily of ATP synthases for 317 taxa of bacteria and archaea

Outline:
Tree of Life
- 16S rRNA tree
- Rooting the Tree of Life
- How tree-like is organismal evolution?
- What is horizontal gene transfer?
- What is an organismal lineage in light of HGT?
Automated Methods of Assembling Orthologous Gene Families
- Reciprocal BLAST hit method: problems with paralogs
- BranchClust: phylogenetic algorithm for assembling gene families
Methods of HGT Detection
- Overview of methods for HGT detection: AU test, symmetric difference of Robinson and Foulds, bipartition analysis
- HGT in silico experiments

The Tree of Life according to SSU ribosomal RNA (+): strictly bifurcating; no reticulation; only extant lineages; based on a single molecular phylogeny; branch length is not proportional to time. To root: cenancestor as placed by ancient duplicated genes (ATPases, signal recognition particles, EF).

SSU-rRNA Tree of Life [figure: SSU-rRNA tree spanning the three domains Archaea, Bacteria, and Eucarya, including mitochondria and chloroplast; root marked ORIGIN; scale in changes per nt; additional labels: CPS, V/A-ATPase, Prolyl RS, Lysyl RS, Mitochondria, Plastids]. Fig. modified from Norman Pace

What is HGT? HGT stands for Horizontal Gene Transfer. Genes can be passed vertically, from an ancestor to its descendants. Genes can also be passed horizontally, by exchange of genes between different species.

How Tree-like is Organismal Evolution? Horizontal Gene Transfer; Mosaic Genomes. Science 280, p. 672ff (1998)

How many common genes? Three Escherichia coli strains: CFT073 (uropathogenic), EDL933 (enterohemorrhagic), and K12 MG1655 (laboratory strain). “… only 39.2% of their combined (nonredundant) set of proteins actually are common to all three strains.” Welch RA, et al. Proc Natl Acad Sci U S A. 2002; 99:

What is an “organismal lineage” in light of horizontal gene transfer? Over very short time intervals an organismal lineage can be defined as the majority consensus of genes. Organismal Lineage

Rope as a metaphor to describe an organismal lineage (Gary Olsen) Individual fibers = genes that travel for some time in a lineage. While no individual fiber (gene) present at the beginning might be present at the end, the rope (or the organismal lineage) nevertheless has continuity.

However, the genome as a whole will acquire the character of the incoming genes (the rope turns solidly red over time).

From: Bill Martin (1999) BioEssays 21,

Selection of Orthologous Gene Families (COG, or Cluster of Orthologous Groups)
All automated methods for assembling sets of orthologous genes are based on sequence similarities.
- BLAST hits (SCOP database)
- Triangular circular BLAST significant hits
- Sequence identity of 30% and greater
- Similarity complemented by HMM-profile analysis
- Pfam database
- Reciprocal BLAST hit method

Reciprocal BLAST Hit Method: often fails in the presence of paralogs. [figure: two cases, one yielding 1 gene family and the other 0 gene families]

Families of ATP-synthases. Case of 2 bacteria and 2 archaea species (Escherichia coli, Bacillus subtilis, Methanosarcina mazei, Sulfolobus solfataricus), each carrying ATP-A (catalytic subunit) and ATP-B (non-catalytic subunit), shown together with ATP-F. Neither ATP-A nor ATP-B is selected by the RBH method.
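A minimal sketch of why paralogs break the reciprocal best hit (RBH) criterion. The two genomes X and Y, the gene labels, and all bit scores below are hypothetical illustration values, not data from the talk; the only assumption is that a family is accepted when every pair of its members is a reciprocal best hit.

```python
from itertools import combinations


def best_hit(scores, query, target_genome):
    """Return the gene of target_genome that scores highest against query."""
    candidates = {
        target: score
        for (q, target), score in scores.items()
        if q == query and target[0] == target_genome
    }
    return max(candidates, key=candidates.get) if candidates else None


def is_rbh_family(scores, genes):
    """True if every pair of genes is a reciprocal best hit of each other."""
    for a, b in combinations(genes, 2):
        if best_hit(scores, a, b[0]) != b or best_hit(scores, b, a[0]) != a:
            return False
    return True


def symmetric(pair_scores):
    """Expand undirected pair scores into a directed (query, target) table."""
    scores = {}
    for (a, b), s in pair_scores.items():
        scores[(a, b)] = s
        scores[(b, a)] = s
    return scores


# Hypothetical bit scores; genes are (genome, name) tuples.
SCORES = symmetric({
    (("X", "ATP-A"), ("Y", "ATP-A")): 100,
    (("X", "ATP-A"), ("Y", "ATP-B")): 120,  # paralog scores higher
    (("X", "ATP-B"), ("Y", "ATP-B")): 90,
    (("X", "ATP-B"), ("Y", "ATP-A")): 80,
    (("X", "ATP-F"), ("Y", "ATP-F")): 200,  # no paralogs
})

for name in ("ATP-A", "ATP-B", "ATP-F"):
    family = [("X", name), ("Y", name)]
    print(name, is_rbh_family(SCORES, family))
```

With these toy scores only the single-copy ATP-F family passes; one cross-paralog best hit is enough to reject both ATP-A and ATP-B.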

Phylogenetic Tree [figure: tree of the ATP-A, ATP-B, and ATP-F sequences from Escherichia coli, Bacillus subtilis, Methanosarcina mazei, and Sulfolobus solfataricus, divided into the family of ATP-A, the family of ATP-B, and the family of ATP-F]

BranchClust Algorithm [figure: genome i is compared against a dataset of N genomes (genome 1, genome 2, genome 3, ..., genome N); the BLAST hits are assembled into a superfamily tree]

BranchClust Algorithm

BranchClust Algorithm: Root positions [figure: superfamily of penicillin-binding protein (13 gamma proteobacteria) and superfamily of DNA-binding protein (13 gamma proteobacteria)]

BranchClust Algorithm: Comparison of the best BLAST hit method and the BranchClust algorithm
Number of selected families, by number of taxa (A: Archaea, B: Bacteria):
2A 2B: reciprocal best BLAST hit 80; BranchClust 414 (all complete)
13B: (263 complete, 409 with n ≥ 8)
16B 14A: (60 complete, 126 with n ≥ 24)

BranchClust Algorithm: ATP-synthases, Examples of Clustering. Datasets: 13 gamma proteobacteria; 30 taxa (16 bacteria and 14 archaea); 317 bacteria and archaea.

BranchClust Algorithm Typical Superfamily for 30 taxa (16 bacteria and 14 archaea) 59:30 33:19 53:26 55:21 37:19 36:21

BranchClust Algorithm: Gene Annotation. Superfamily of penicillin-binding protein for 13 gamma proteobacteria

CLUSTER FAMILY
>gi| | peptidoglycan synthetase FtsI [Buchnera aphidicola str. Bp (Baizongia pistaciae)]
>gi| | Peptidoglycan synthetase ftsI precursor [Escherichia coli CFT073]
>gi| | penicillin-binding protein 3 [Haemophilus influenzae Rd KW20]
>gi| | FtsI [Pasteurella multocida subsp. multocida str. Pm70]
>gi| | penicillin-binding protein 3 [Pseudomonas aeruginosa PAO1]
>gi| | division specific transpeptidase [Salmonella typhimurium LT2]
>gi| | penicillin-binding protein 3 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi| | hypothetical protein WGLp212 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi| | penicillin-binding protein 3 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi| | penicillin-binding protein 3 [Xanthomonas axonopodis pv. citri str. 306]
>gi| | penicillin binding protein 3 [Xylella fastidiosa 9a5c]
>gi| | penicillin-binding protein 3 [Yersinia pestis CO92]
>gi| | peptidoglycan synthetase [Yersinia pestis KIM]
COMPLETE: 13
>>>>> IN-PARALOGS
>gi| | putative penicillin-binding protein 3 [Salmonella typhimurium LT2]
>gi| | penicillin-binding protein 3A [Pseudomonas aeruginosa PAO1]

CLUSTER FAMILY
>gi| | Penicillin-binding protein 2 [Escherichia coli CFT073]
>gi| | penicillin-binding protein 2 [Haemophilus influenzae Rd KW20]
>gi| | Pbp2 [Pasteurella multocida subsp. multocida str. Pm70]
>gi| | penicillin-binding protein 2 [Pseudomonas aeruginosa PAO1]
>gi| | cell elongation-specific transpeptidase [Salmonella typhimurium LT2]
>gi| | penicillin-binding protein 2 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi| | hypothetical protein WGLp172 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi| | penicillin-binding protein 2 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi| | penicillin-binding protein 2 [Xanthomonas axonopodis pv. citri str. 306]
>gi| | penicillin binding protein 2 [Xylella fastidiosa 9a5c]
>gi| | penicillin-binding protein 2 [Yersinia pestis CO92]
>gi| | peptidoglycan synthetase, penicillin-binding protein 2 [Yersinia pestis KIM]
INCOMPLETE: 12
>>>>> IN-PARALOGS
>gi| | putative penicillin-binding protein [Salmonella typhimurium LT2]

BranchClust Algorithm: Implementation and Usage
The BranchClust algorithm is implemented in Perl with the use of the BioPerl module for parsing trees and is freely available on the web-site.
Required:
1. BioPerl module for parsing trees, Bio::TreeIO.
2. Taxa recognition file gi_numbers.out, which must be present in the current directory. To create this file, read the Taxa recognition file section on the web-site.
Usage: at the command line type:
# perl branch_clust.pl
Output: families.list, clusters.out, clusters.log
Batch processing: an example of a wrapper can be found on the web-site.

BranchClust Algorithm: Data Flow
1. Download n complete genomes (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria) in FASTA format (*.faa).
2. Put all n genomes in one database.
3. Take one starting genome and BLAST it against the database consisting of the n genomes.
4. Parse the BLAST output with the requirement that all n taxa be present. Result: superfamilies.
5. Align with ClustalW.
6. Reconstruct the superfamily tree: ClustalW (quick distance method) or Phyml (maximum likelihood).
7. Parse with BranchClust. Result: gene families.
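The parsing step of the data flow (keep only those queries whose hits cover all n taxa) might be sketched as follows. The tuple layout of a hit, the genome labels, and the gene names are assumptions made for illustration; a real pipeline would first parse them out of the BLAST report.

```python
from collections import defaultdict


def superfamilies(hits, genomes):
    """Group hits by query gene and keep queries whose hits cover every genome.

    hits    -- iterable of (query_gene, subject_genome, subject_gene) tuples
    genomes -- iterable with the labels of all n genomes in the database
    """
    by_query = defaultdict(set)
    for query, subject_genome, subject_gene in hits:
        by_query[query].add((subject_genome, subject_gene))

    required = set(genomes)
    return {
        query: members
        for query, members in by_query.items()
        if {genome for genome, _ in members} >= required
    }


# Hypothetical parsed BLAST hits for a starting genome g1 and n = 3 genomes.
HITS = [
    ("g1_atpA", "g1", "g1_atpA"),
    ("g1_atpA", "g2", "g2_atpA"),
    ("g1_atpA", "g2", "g2_atpB"),   # paralog lands in the same superfamily
    ("g1_atpA", "g3", "g3_atpA"),
    ("g1_orfX", "g1", "g1_orfX"),
    ("g1_orfX", "g2", "g2_orfX"),   # no hit in g3, so this query is dropped
]

FAMILIES = superfamilies(HITS, ["g1", "g2", "g3"])
print(sorted(FAMILIES))
```

Paralogs are deliberately kept at this stage: separating them into families is left to the tree-based step.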

Why do we need gene families? How many genes are common between different species? Do all the common genes share the common history? How do we reconstruct the tree of life? How can we detect genes that were horizontally transferred?

Methods of HGT Detection
Parametric methods:
- GC-content analysis
- analysis of single nucleotide composition (SNC) and dinucleotide composition (DNC)
- codon usage bias
- other measures based on sequence composition
Phylogenetic methods:
- AU test
- SPR metric (NP-hard problem)
- Symmetric difference of Robinson and Foulds
- Bipartition spectrum analysis
- Quartet spectrum analysis
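As an illustration of the parametric measures, GC content and dinucleotide composition can be computed directly from a nucleotide sequence. This is a minimal sketch; the function names are chosen here for illustration, and how a deviating gene would be flagged against the genome-wide background is not shown.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0


def dinucleotide_frequencies(seq):
    """Relative frequencies of overlapping dinucleotides (DNC)."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Skip dinucleotides containing ambiguity codes such as N.
    pairs = [p for p in pairs if set(p) <= set("ACGT")]
    counts = Counter(pairs)
    total = len(pairs)
    return {p: c / total for p, c in counts.items()} if total else {}


print(gc_content("ATGCGC"))
print(dinucleotide_frequencies("ATAT"))
```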

AU test: The AU test, or approximately unbiased test of phylogenetic tree selection, was proposed for assessing the confidence of tree selection (Hidetoshi Shimodaira, 2002). For each tree, the AU test produces a P-value ranging from zero to one (P1 and P2 for trees 1 and 2). The greater the P-value, the better the data support that tree as the true tree. One would expect that a tree with a different topology would have a small P-value. If tree 1 is the genome tree and tree 2 is the gene tree, then P1 measures the support for the genome tree being the true tree for the gene family. Accepted requirement for HGT detection: P-value below a cutoff in the range 1E-2 to 1E-4.

Some Metrics to Compare Tree Topologies: SPR distance; symmetric difference of Robinson and Foulds (bipartition distance); quartet distance.
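The symmetric difference of Robinson and Foulds can be sketched as the number of non-trivial bipartitions found in one tree but not in the other. The nested-tuple tree representation and the function names are assumptions for illustration; real analyses would read Newick files with a tree library.

```python
def leaves(tree):
    """Set of leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of a tree, treated as unrooted."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        if len(side) > 1 and len(other) > 1:
            # Store the side without the anchor taxon so that the same
            # split always gets the same representation.
            splits.add(other if anchor in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def robinson_foulds(tree1, tree2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


T1 = (("A", "B"), ("C", ("D", "E")))
T1_REROOTED = ("A", ("B", ("C", ("D", "E"))))
T2 = (("A", "E"), ("C", ("D", "B")))

print(robinson_foulds(T1, T1_REROOTED), robinson_foulds(T1, T2))
```

Because only bipartitions are compared, the position of the root does not affect the distance.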

SPR metric (Subtree Pruning and Regrafting): the difference in tree topology measured by the number of SPR operations required to transform one tree into another. It is very hard computationally, and few tools are available to calculate it (there is Robert Beiko's program). Also, the SPR distance alone would not consider support levels. That is why bipartitions come onto the scene.

Spectral Analyses of Phylogenetic Data: start from the phylogenetic information present in genomes; break the information into small quanta of information (bipartitions or embedded quartets); analyze the spectra to detect transferred genes and the plurality consensus.

BIPARTITION OF A PHYLOGENETIC TREE
Bipartition (or split): a division of a phylogenetic tree into two parts that are connected by a single branch. It divides a dataset into two groups, but it does not consider the relationships within each of the two groups. The number of non-trivial bipartitions for N genomes is equal to $2^{N-1} - N - 1$.
Examples for five genomes: **... ***.. *...* *.*.*
Bipartitions can be divided into conflicting and non-conflicting:
- non-conflicting (can coexist in one tree): **... and ***..
- conflicting (cannot coexist in one tree): **... (AB|CDE) and *...* (AE|CDB)
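The count and the conflict rule can be checked with a short sketch. It assumes the star/dot notation used on the slide (one character per genome) and uses the standard criterion that two bipartitions can coexist in one tree exactly when at least one of the four pairwise intersections of their sides is empty; the function names are illustrative.

```python
def count_nontrivial_bipartitions(n):
    """Number of non-trivial bipartitions for n genomes: 2^(n-1) - n - 1."""
    return 2 ** (n - 1) - n - 1


def compatible(p, q):
    """True if two star/dot bipartitions can coexist in one tree."""
    a = {i for i, c in enumerate(p) if c == "*"}
    b = {i for i, c in enumerate(q) if c == "*"}
    everything = set(range(len(p)))
    a_rest, b_rest = everything - a, everything - b
    # Compatible if some side of p is disjoint from some side of q.
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))


print(count_nontrivial_bipartitions(5))    # five genomes
print(count_nontrivial_bipartitions(13))   # thirteen genomes
print(compatible("**...", "***.."))        # non-conflicting pair
print(compatible("**...", "*...*"))        # conflicting pair
```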

The Tree Drawing Process: try to infer phylogeny; obtain “likely” trees; choose or make a consensus; “best” tree.

Obtaining Bootstrap Support for Branches
Resampling, e.g. bootstrapping: resampling simulates examining extra sequence from the original data.
Now bipartitions have weights (taxa A B C D E F G H):
.**..... 65
***..... 75
...**... 75
.....*** 100
......** 100
*.*..... 75
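How bipartitions acquire weights can be sketched as a simple tally: the support of a bipartition is the percentage of bootstrap trees that contain it. The input format (one set of star/dot patterns per bootstrap tree) and the four toy replicates are assumptions for illustration.

```python
from collections import Counter


def bootstrap_support(replicate_splits):
    """Percentage of bootstrap trees in which each bipartition occurs.

    replicate_splits -- list with one collection of bipartition patterns
                        per bootstrap tree
    """
    counts = Counter()
    for splits in replicate_splits:
        counts.update(set(splits))   # count each pattern once per tree
    n = len(replicate_splits)
    return {pattern: 100.0 * c / n for pattern, c in counts.items()}


# Four hypothetical bootstrap trees for five taxa.
REPLICATES = [
    {"**...", "***.."},
    {"**...", "...**"},
    {"**...", "***.."},
    {"*...*", "***.."},
]

SUPPORT = bootstrap_support(REPLICATES)
for pattern, value in sorted(SUPPORT.items()):
    print(pattern, value)
```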

Data Flow
1. For every gene family, align sequences (ClustalW).
2. For every gene family, reconstruct the Maximum Likelihood (ML) tree and generate 100 bootstrap samples (phyml).
3. For every gene family, extract bipartition information from each bootstrapped tree, and compose a bipartition matrix.
4. The bipartition matrix is generated [rows: fam 1 ... fam N; columns: bipartitions such as **....., *.*...., etc.].
5. Do “Lento” plot analysis.
Results: select gene families; majority consensus bipartitions; detected conflicts = HGT events.
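The tally behind a “Lento” plot might be sketched as follows: for each supported bipartition, count the gene families that support it and the gene families that carry a supported bipartition conflicting with it. The matrix layout, the 70% cutoff, and the toy values are assumptions for illustration, not the talk's actual implementation.

```python
def compatible(p, q):
    """True if two star/dot bipartitions can coexist in one tree."""
    a = {i for i, c in enumerate(p) if c == "*"}
    b = {i for i, c in enumerate(q) if c == "*"}
    everything = set(range(len(p)))
    return any(
        not (x & y)
        for x in (a, everything - a)
        for y in (b, everything - b)
    )


def lento_tally(matrix, threshold=70.0):
    """Return {bipartition: (supporting families, conflicting families)}.

    matrix -- {family: {bipartition pattern: bootstrap support in percent}}
    """
    supported = {
        fam: {p for p, s in row.items() if s > threshold}
        for fam, row in matrix.items()
    }
    patterns = set().union(*supported.values())
    tally = {}
    for p in patterns:
        support = sum(1 for s in supported.values() if p in s)
        conflict = sum(
            1 for s in supported.values()
            if any(not compatible(p, q) for q in s)
        )
        tally[p] = (support, conflict)
    return tally


# Hypothetical bipartition matrix for three gene families and five genomes.
MATRIX = {
    "fam1": {"**...": 95, "***..": 80},
    "fam2": {"**...": 90, "...**": 40},
    "fam3": {"*...*": 85},
}

TALLY = lento_tally(MATRIX)
for pattern in sorted(TALLY):
    print(pattern, TALLY[pattern])
```

Here fam3 is the candidate for a transfer: its only supported bipartition conflicts with the bipartitions supported by the other families.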

13 gamma proteobacteria
1. Buchnera aphidicola str. Bp (Baizongia pistaciae)
2. Escherichia coli CFT073
3. Haemophilus influenzae Rd KW20
4. Pasteurella multocida subsp. multocida str. Pm70
5. Pseudomonas aeruginosa PAO1
6. Salmonella typhimurium LT2
7. Vibrio cholerae O1 biovar eltor str. N16961
8. Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis
9. Xanthomonas campestris pv. campestris str. ATCC 33913
10. Xanthomonas axonopodis pv. citri str. 306
11. Xylella fastidiosa 9a5c
12. Yersinia pestis KIM
13. Yersinia pestis CO92
1.37E+10 possible unrooted tree topologies

“Lento”-plot of 34 supported bipartitions (out of 4082 possible) for 13 gamma-proteobacterial genomes (258 putative orthologs): E. coli, Buchnera, Haemophilus, Pasteurella, Salmonella, Yersinia pestis (2 strains), Vibrio, Xanthomonas (2 sp.), Pseudomonas, Wigglesworthia. There are 13,749,310,575 possible unrooted tree topologies for 13 genomes.
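The number of topologies quoted for 13 genomes follows from the standard count of unrooted, strictly bifurcating trees on N leaves, $(2N-5)!! = 3 \cdot 5 \cdots (2N-5)$. A short check, with an illustrative function name:

```python
def unrooted_topologies(n):
    """Number of unrooted bifurcating tree topologies for n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least three leaves")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


print(unrooted_topologies(13))
```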

Consensus Tree and Horizontally Transferred Gene. Consensus: clusters of eight significantly supported bipartitions (only 258 genes analyzed). Phylogeny of a putatively transferred gene: virulence factor homologs (mviN).

Are the detected transfers mainly false positives, or are they the tip of an iceberg of many transfer events most of which go undetected by current methods? Here we explore how well these methods perform using in silico transfers between the leaves of a gamma proteobacterial phylogeny. What are the Actual HGT Rates?

HGT in silico: Testing Methods of Detection: AU test; symmetric difference of Robinson and Foulds; bipartition analysis.

HGT in silico: AU test. 236 families, 13 gamma proteobacteria. Only two families out of 236 showed a conflict at the significance level of 5 × 10^-4; 5 conflicts were found at the significance level of 0.01, and 26 conflicts at the significance level of 0.05.

HGT in silico: AU test. 236 families, 13 gamma proteobacteria. In silico transfers between Escherichia coli and Xylella fastidiosa, and between Pseudomonas aeruginosa and Vibrio cholerae. Log(AU-value) = -4. Only 10% of AU-values are less than …; …% of AU-values are less than 10^-4.

HGT in silico: AU test. Power of Detection [figure panels: significance level < 1e-4; significance level < 1e-2]

HGT in silico: Robinson Foulds Metric. Distribution of the number of different bipartitions in the original dataset [figure axis: number of different bipartitions]. Power of Detection [figure panel: significance level < 1e-2]

HGT in silico: Bipartition Analysis. Power of Detection [figure panels: bootstrap support >70%; bootstrap support >90%]

Acknowledgements: NSF Microbial Genetics; NASA Exobiology & AISR Programs. Gogarten Lab: Pascal Lapierre, Gregory Fournier, Alireza Ghodsi Senejani, Holly E. Gardner, Tim Harlow, Kristen Swithers, Kaiyuan Shi. Prof. Peter Gogarten, University of Connecticut.