Translational evidence and the accuracy of prokaryotic gene annotation Luciano Brocchieri Department of Molecular Genetics & Microbiology and Genetics.


Translational evidence and the accuracy of prokaryotic gene annotation
Luciano Brocchieri, Department of Molecular Genetics & Microbiology and Genetics Institute, University of Florida, Gainesville, FL 32610

From gene prediction to genome annotation
Computational gene predictions, e.g., GeneMark2.5 (Borodovsky and McIninch 1993), GeneMarkHMM (Lukashin and Borodovsky 1998), Glimmer3.0 (Delcher et al. 2007), Prodigal (Hyatt et al. 2010), etc.: union of predictions (comprehensive compilation); intersection of predictions (robust predictions).
Evolutionary conservation: annotations modeled on closely related species; long-range conservation indicative of functionality.
Expression: microarrays; RNA-seq; proteomics.
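The union and intersection of predictions mentioned above can be sketched with plain set operations. This is only an illustration: identifying a gene by its strand and stop coordinate is an assumption made here (predictors often disagree on starts), and the function names are hypothetical, not part of any of the cited tools.

```python
from typing import Dict, Set, Tuple

# Hypothetical gene key: (strand, stop coordinate). This keying is an assumption.
Gene = Tuple[str, int]


def combine_predictions(predictions: Dict[str, Set[Gene]]):
    """Return (union, intersection) of the gene sets from several predictors."""
    sets = list(predictions.values())
    union = set().union(*sets)                       # comprehensive compilation
    intersection = set.intersection(*sets) if sets else set()  # robust predictions
    return union, intersection


def exclusive_predictions(predictions: Dict[str, Set[Gene]]):
    """Genes predicted by exactly one method, keyed by method name."""
    exclusive = {}
    for name, genes in predictions.items():
        others = set().union(*(g for n, g in predictions.items() if n != name))
        exclusive[name] = genes - others
    return exclusive
```

The union favors sensitivity, the intersection favors specificity, and the exclusive sets isolate genes supported by a single method.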

Missing genes in genome annotation
Extensive conservation analysis of genomic ORFs from 1,300 bacterial chromosomes has revealed conservation across distantly related genomes of 40,000 ORFs not represented in genome annotations (Warren et al., BMC Bioinformatics 2010).
More than 52,000 genes predicted by Glimmer3.0 and not included in 1,574 bacterial chromosome annotations are confirmed by evolutionary conservation and functional characterization (Wood et al., Biology Direct 2012).
Significant 3-base periodicity identifies more than 68,000 conserved ORFs in annotated intergenic regions of 2,000 prokaryotic chromosomes (Oden and Brocchieri, Bioinformatics 2015).

NPACT and the identification of coding regions by 3-base periodicity
ORFs not included in gene annotations can be identified by significant 3-base periodicity in the sequence (Oden and Brocchieri 2015, in revision).
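To make the idea of 3-base periodicity concrete, the sketch below measures G+C content separately at each of the three phase positions of a sequence. It is not the NPACT algorithm, whose statistics are not described in these slides; the function names and the simple max-minus-min contrast are assumptions chosen for illustration only.

```python
def position_gc(seq: str):
    """G+C fraction at each of the three phase positions (0, 1, 2) of seq."""
    seq = seq.upper()
    fractions = []
    for phase in range(3):
        bases = seq[phase::3]  # every third base, starting at this phase
        gc = sum(1 for b in bases if b in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


def periodicity_contrast(seq: str) -> float:
    """Difference between the highest and lowest phase-specific G+C fraction.

    A large contrast means composition depends on position modulo 3,
    which is the kind of signal expected in protein-coding sequence.
    """
    fractions = position_gc(seq)
    return max(fractions) - min(fractions)
```

A sequence with no phase dependence gives a contrast near zero. Deciding what counts as significant would need a statistical test, which is not attempted here.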

Why are genes missed in genome annotation?
Missed genes do not depend on date of annotation (Wood et al., Biology Direct 2012).
Lack of sensitivity of computational gene predictors.
Lack of consistency among computational gene predictors.
Lack of specificity of computational gene predictors.
Stringent criteria (e.g., on consistency or conservation) for acceptance during annotation.
Problems with the annotation pipelines.

Gene annotation and conservation
We define a gene to be conserved if it:
Has sequence similarity with E-value ≤ 1.0E-6.
Is conserved in length: 1/1.2 ≤ [length target] / [length query] ≤ 1.2.
Is conserved across genera or phyla.
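The conservation definition can be written as a small predicate. This is a minimal sketch that assumes all three criteria must hold together, and that the caller has already determined whether a hit comes from a different genus or phylum; the function and parameter names are hypothetical.

```python
def is_conserved(evalue: float, query_len: int, target_len: int,
                 across_genera_or_phyla: bool,
                 max_evalue: float = 1.0e-6, max_ratio: float = 1.2) -> bool:
    """Apply the three conservation criteria to one query/target hit.

    Assumption: the criteria are combined with a logical AND.
    """
    if query_len <= 0:
        return False
    ratio = target_len / query_len  # [length target] / [length query]
    return bool(
        evalue <= max_evalue
        and 1.0 / max_ratio <= ratio <= max_ratio
        and across_genera_or_phyla
    )
```

For example, a hit with E-value 1e-10 and lengths 300 (query) and 330 (target) in another genus passes, while the same hit with a target length of 400 fails the length criterion.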

Conservation by class of prediction
Genes exclusively predicted by one method tend to be less conserved. Glimmer3.0 predicts substantially more exclusive genes than other methods; a greater number of these, but a smaller fraction, are conserved.

Gene predictions and periodicity in Pseudomonas aeruginosa strains

Experimental evidence of expression in P. aeruginosa PAO1: RNA-seq

Expression of predicted genes by length and conservation classes
[Figure: ORFs with RNA-seq coverage; panels: Published annotation, Newly identified ORFs]

What do we learn about gene predictions from transcription in bacteria?
Unexpected patterns. Contradictory patterns of expression of well-defined protein-coding genes.

What do we learn about gene predictions from transcription in bacteria?
The problem of antisense transcription. In the case of the prediction of H-443*A, sequence features are more convincing than RNA-seq expression evidence. 'Pervasive transcription' in bacterial genomes (see Wade and Grainger, Nature Reviews 2014) limits the detection power of RNA-seq.

Ribosome footprinting (Ingolia et al., Science 2009)
Schematic representation of ribosome footprinting. In the application to P. aeruginosa, tetracycline replaces cycloheximide.

Ribosome footprints at initiation sites
The antibiotic tetracycline inhibits translation elongation, stalling actively translating ribosomes.

Ribosome footprints at initiation sites
However, tetracycline does not prevent additional ribosomes from being recruited to the initiation site.

Ribosome footprints at initiation sites
The accumulation of ribosomes results in increased numbers of profile reads corresponding to the initiation site.

Ribosome footprint coverage in P. aeruginosa
Example of ribosome footprint coverage in P. aeruginosa PAO1, showing the relation with S-profiles, annotated genes, and newly identified ORFs. [Figure y-axis: number of reads]

Ribosome footprint coverage by codon position
Metagene analysis of ribosome-footprint coverage. Coverage is averaged over all genes, relative to the start of translation.

Ribosome footprint coverage by codon position: center of reads
Metagene analysis of coverage by read center + 2 nt. Coverage is averaged over all genes, relative to the start of translation.
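The metagene averaging described on these two slides can be sketched as follows. The sketch makes several assumptions not stated in the slides: only forward-strand genes are handled, coordinates are nucleotide positions rather than codon positions, "center + 2 nt" is taken as the read midpoint shifted 2 nt toward the 3' end, and the function names and window sizes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def read_position(read_start: int, read_length: int, offset: int = 2) -> int:
    """Reduce a forward-strand read to one nucleotide: its center plus `offset`."""
    return read_start + read_length // 2 + offset


def metagene_profile(gene_starts: List[int], read_positions: Iterable[int],
                     upstream: int = 30, downstream: int = 60) -> Dict[int, float]:
    """Average read count at each offset relative to the start of translation.

    The average is taken over all genes in `gene_starts` (forward strand only).
    """
    counts = Counter(read_positions)  # reads assigned to each genome position
    profile = {}
    for off in range(-upstream, downstream + 1):
        total = sum(counts[start + off] for start in gene_starts)
        profile[off] = total / len(gene_starts) if gene_starts else 0.0
    return profile
```

With this representation, a peak near offset 0 in the returned profile would correspond to read accumulation at initiation sites.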

Translational evidence by ribosome footprinting in P. aeruginosa
Ribosome-footprint read-count patterns identify mRNA translation, translation-initiation sites, and translational pausing.

Ribosome-footprint-coverage patterns are robustly reproducible
Similar patterns of coverage of groEL are observed in independent biological replicates. What drives ribosome-footprint coverage patterns?

Newly identified genes in P. aeruginosa
Examples of RFP-based gene discovery in P. aeruginosa PAO1, showing the relation with S-profiles and annotated genes. [Figure x-axis: position relative to predicted start of translation]

Identification of new genes by ribosome-footprint evidence
A new gene is found to be expressed 5' of eco, the gene for Ecotin, a protease inhibitor localized to the periplasmic space.

Translational evidence for newly identified ORFs

Scoring RFP expression
"Strength" of evidence decreases for poorly translated mRNA.

Scoring RFP expression
"Strength" of the evidence of expression is measured by an "Expression Index", built from two quantities:
Count of RFP reads in codon positions [-2, +2] / 5;
Count of RFP reads in codon positions [+8, len/2] / (len/2 - 8).
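The two read densities that enter the Expression Index can be computed as below. The slides do not preserve how the two quantities are combined into the index, so this sketch stops at the densities. It assumes that codon position 0 is the predicted start codon, that `len` is the gene length in codons, and that the denominators are used exactly as written on the slide; the function name is hypothetical.

```python
from typing import Dict, Tuple


def rfp_densities(codon_counts: Dict[int, int], gene_len_codons: int) -> Tuple[float, float]:
    """Return (start_density, body_density) for one gene.

    codon_counts maps codon position relative to the start codon
    (0 = start, an assumption) to the number of RFP reads at that codon.
    """
    half = gene_len_codons // 2
    if half <= 8:
        raise ValueError("gene too short: len/2 must exceed 8 codons")
    # Reads in codon positions [-2, +2], divided by 5.
    start_density = sum(codon_counts.get(p, 0) for p in range(-2, 3)) / 5
    # Reads in codon positions [+8, len/2], divided by (len/2 - 8) as on the slide.
    body_density = sum(codon_counts.get(p, 0) for p in range(8, half + 1)) / (half - 8)
    return start_density, body_density
```

Short genes are rejected rather than scored, because the second denominator would be zero or negative.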

Expression of predicted genes by length and conservation classes
[Figure: ORFs with Expression Index ≥ 12.0; panels: Published annotation, Newly identified ORFs]

Conservation and expression of genes annotated in Pseudomonas aeruginosa PAO1
Number and fraction of conserved or expressed genes among all genes annotated in P. aeruginosa PAO1. [Figure labels: Conserved, Expressed; values as extracted: ,457/5,567 and ,208/5,567]

Conservation and expression of predicted genes not included in annotations by class of prediction
Number and fraction of conserved or expressed genes among all genes predicted by different sets of predictors in P. aeruginosa PAO1. [Figure labels: Conserved, Expressed]

Identification of translation-initiation sites by ribosome footprinting
Hypothetical gene: RFP evidence of translation from an alternative start at +600.

Start of translation identification by RFP read accumulation
                 Annotated   Newly identified
Same start       85.0%       77.8%
Different start  15.0%       22.2%
Among genes with evidence of translation, ribosome footprints confirm the predicted start of translation for 85% of annotated genes and for 78% of the newly identified ORFs.

Alternative start of translation?
RFP read patterns suggest that translation of cysH [phospho-adenylylsulphate (PAPS) reductase] starts 75 nucleotides downstream of the computationally predicted start.

FliA is the sigma factor of RNA polymerase for transcription of flagellum genes. CheY is involved in the transmission of sensory signals to the flagellar motor.

Post-transcriptional control of translation after oxidative stress

Lab members
Steve Oden: Postdoctoral associate. Development of gene finding methods and software, gene content analysis in human and prokaryotes.
Nathan Bird: Programmer with Acceleration.com.
Anna Picca: Postdoctoral associate. RNA-seq and ribosome profiling.
Ying Zhang: Postdoctoral associate. RNA-seq.
Collaborators
Silvia Tornaletti (UF Dept. of Medicine). RNA biology.
Shouguang Jin (UF Dept. of Molecular Genetics and Microbiology). P. aeruginosa samples and advice.
Funding
Thanks to NIH R01 GM087485-01A2; MGM, Genetics Institute, College of Medicine.