Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012.

Slides:



Advertisements
Similar presentations
© Wiley Publishing All Rights Reserved. Using Nucleotide Sequence Databases.
Advertisements

An Introduction to Bioinformatics Finding genes in prokaryotes.
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Gene Prediction Preliminary Results Computational Genomics February 20, 2012.
Gene Expression and Control Part 2
Predicting Genes in Mycobacteriophages December 8, In Silico Workshop Training D. Jacobs-Sera.
BIOS816/VBMS818 Lecture 7 – Gene Prediction Guoqing Lu Office: E115 Beadle Center Tel: (402) Website:
Gene Identification Lab
ECE 501 Introduction to BME
Eukaryotic Gene Finding
Gene expression.
Protein Synthesis The genetic code – the sequence of nucleotides in DNA – is ultimately translated into the sequence of amino acids in proteins – gene.
Gene Structure and Identification
Essentials of the Living World Second Edition George B. Johnson Jonathan B. Losos Chapter 13 How Genes Work Copyright © The McGraw-Hill Companies, Inc.
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Genome Annotation BBSI July 14, 2005 Rita Shiang.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar.
Protein Synthesis Transcription and Translation DNA Transcription RNA Translation Protein.
Module 3 Sequence and Protein Analysis (Using web-based tools) Working with Pathogen Genomes - Uruguay 2008.
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Genomics: Gene prediction and Annotations Kishor K. Shende Information Officer Bioinformatics Center, Barkatullah University Bhopal.
Monday, October 18, 1:43:47 PM Outline for today Lec 06
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
Advancing Science with DNA Sequence Metagenome definitions: a refresher course Natalia Ivanova MGM Workshop September 12, 2012.
Fig.1.8 DNA STRUCTURE 5’ 3’ Antiparallel DNA strands Hydrogen bonds between bases DOUBLE HELIX 5’ 3’
Central Dogma DNA  RNA  Protein. …..Which leads to  Traits.
Chapter 7 Gene Expression and Control Part 2. Transcription: DNA to RNA  The same base-pairing rules that govern DNA replication also govern transcription.
Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop January 31, 2012.
Genome Annotation Rosana O. Babu.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
Srr-1 from Streptococcus. i/v nonpolar s serine (polar uncharged) n/s/t polar uncharged s serine (polar uncharged) e glutamic acid (neg. charge) sserine.
Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop September 16, 2008.
From Genomes to Genes Rui Alves.
Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar.
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
How can we find genes? Search for them Look them up.
ORF Calling. Why? Need to know protein sequence Protein sequence is usually what does the work Functional studies Crystallography Proteomics Similarity.
Metagenome analysis Natalia Ivanova MGM Workshop February 2, 2012.
Fgenes++ pipelines for automatic annotation of eukaryotic genomes Victor Solovyev, Peter Kosarev, Royal Holloway College, University of London Softberry.
Copyright © by Holt, Rinehart and Winston. All rights reserved. ResourcesChapter menu Flow of Genetic Information The flow of genetic information can be.
RNA and Gene Expression BIO 224 Intro to Molecular and Cell Biology.
Lecture 4: Transcription in Prokaryotes Chapter 6.
Lesson Four Structure of a Gene. Gene Structure What is a gene? Gene: a unit of DNA on a chromosome that codes for a protein(s) –Exons –Introns –Promoter.
RNA Makin’ Proteins DNAMutations Show off those Genes!
(H)MMs in gene prediction and similarity searches.
Finding genes in the genome
Annotation of eukaryotic genomes
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Bacterial infection by lytic virus
ORF Calling.
bacteria and eukaryotes
Annotating The data.
Bacterial infection by lytic virus
A Quest for Genes What’s a gene? gene (jēn) n.
Lesson Four Structure of a Gene.
Lesson Four Structure of a Gene.
Transfer of RNA molecules serve as interpreters during translation
Microbial Genome Annotation
Predicting Genes in Actinobacteriophages
Introduction to Bioinformatics II
Gene Prediction.
Cuong Nguyen, Deng Xin, Dongmei, Zheng Wang
What do you with a whole genome sequence?
Structure of the Genome
Presentation transcript:

Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence 1.I ntroduction 2. Tools out there 3. Basic principles behind tools and known problems 4. Metagenomes Outline

Advancing Science with DNA Sequence Sequence features in prokaryotic genomes:  stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)  protein-coding genes (CDSs)  transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)  translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)  pseudogenes (tRNA and protein-coding genes)  … Finding the genes in microbial genomes Well-annotated bacterial genome in Artemis genome viewer:

Advancing Science with DNA Sequence 1.Introduction 2. Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site) 3. Known problems 4. Metagenomes Outline

Advancing Science with DNA Sequence IMG-ER RAST JCVI Annotation Service service/ NCBI PGAAP tml Publicly available genome annotation services

Advancing Science with DNA Sequence “Non-coding” RNAs 23S, 16S and 5S rRNAs tRNAs (tmRNA, RNaseP RNA component) Protein-coding genes Topological features (signal peptides, transmembrane regions) Repeats (CRISPRs, IS elements) Regulatory RNAs (riboswitches) Pseudogenes and frameshifts Features predicted by most pipelines

Advancing Science with DNA Sequence Best way – covariance models (captures the secondary structure) Used for small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component) Rfam database, INFERNAL search tool tRNAScan-SE Does not work for large RNAs (23S, 16S). Alternatives: BLASTn RNAmmer RNA prediction

Advancing Science with DNA Sequence Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction) Open reading frame (ORF): reading frame between a start and stop codon CDS (not ORF!) prediction

Advancing Science with DNA Sequence Two major approaches to CDS prediction Two major approaches to prediction of protein-coding genes: ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs) Advantages: finds “unique” genes; high sensitivity; very fast! Limitations: often misses “unusual” genes; high rate of false positives “evidence-based” (ORFs with translations homologous to the known proteins are CDSs) Advantages: finds “unusual” genes (e. g. horizontally transferred); relatively low rate of false positive predictions Limitations: cannot find “unique” genes; low sensitivity on short genes; prone to propagation of false positive results of ab initio annotation tools; slow!

Advancing Science with DNA Sequence How ab initio tools work – very briefly Statistical model of coding and non- coding regions Statistical model(s) of other components (RBS) Additional algorithms for refinement of predictions (overlap resolution, etc.) Prokaryotic gene model used by all ab initio gene finders Ribosome-binding site within certain distance of the start codon; One of 3 start codons; One of 3 stop codons; No frame interruptions Ribosome binding site Start codon: ATG, GTG, TTG Stop codon: TAG, TAA, TGA open reading frame

Advancing Science with DNA Sequence Gene finders: ab initio tools; evidence-based refinement Ab initio tools used by the pipelines: Glimmer family (Glimmer2, Glimmer3, RBS finder) ->NCBI, RAST, JCVI GeneMark family (GeneMark-hmm, GeneMarkS) ->NCBI PRODIGAL -> IMG-ER, NCBI Evidence-based refinement: mostly undocumented in-house developed tools. Types of corrections: missed genes (RAST, JCVI, NCBI), frameshifts (JCVI, NCBI), start sites (RAST)

Advancing Science with DNA Sequence Known problems of all annotation pipelines RNAs –Incomplete rRNAs –Trans-spliced tRNA in archaeal genomes –Small structural RNAs not predicted at all Protein-coding genes that don’t fit into prokaryotic gene model used by ab initio gene finders –no RBS (leaderless transcripts) –interrupted translation frame sequencing errors or translational exceptions –non-canonical start Ribosome binding site Start codon: ATG, GTG, TTG Stop codon: TAG, TAA, TGA open reading frame GenomeSequencing center 16S rRNA, nt Synechococcus sp. CC9311UCSD, TIGR1477 Synechococcus sp. CC9605JGI1440 Synechococcus elongatus PCC 7942JGI1490 Synechococcus sp. JA-2-3BA(2-13)TIGR1323 Synechococcus sp. JA-3-3AbTIGR1324 Synechococcus sp. RCC307Genoscope1498 Synechococcus sp. WH7803Genoscope1497, 1464

Advancing Science with DNA Sequence Symptoms of gene finding problems Some type of mandatory features (rRNAs, tRNAs, CDSs) is missing “Truncated” genes (shorter than homologs) => unusual translation initiation features (non- canonical start codons, leaderless transcripts) Many “unique” genes without protein family assignment or BLASTp hit => sequencing errors (frameshifts) Undetected selenocysteines, programmed frameshifts in ~50 well-conserved protein families

Advancing Science with DNA Sequence Supplemental tools  TIS (translation initiation site) prediction/correction TICO TriTISA Two tools often disagree about the best TIS, especially in high GC genomes  Operon prediction JPOP  Proteins with unusual translational features – selenocysteine-containing genes bSECISearch

Advancing Science with DNA Sequence Metagenomes sequenced with new technologies: low-coverage problems Both 454 and Illumina require high sequence coverage in order to achieve high sequence quality (25x to >100x) High sequence coverage cannot be achieved for metagenome data How does this affect metagenome annotation? ~70% of 454 Titanium reads have at least 1 sequencing artifact (basecalls in homopolymeric runs), there is no clear pattern of error distribution >100 bp Illumina reads have ~3% error rate, error rate is higher towards the end of the read, the majority of errors are substitutions metagenome genome sequence coverage

Advancing Science with DNA Sequence Just one example… predicted gene 3 frameshifts 4-read contig, 1476 nt, no misassembly Contig has 27 homopolymers (3 nt and more), 3 of them have errors No correlation with homopolymer type or error type Reads were quality trimmed prior to assembly

Advancing Science with DNA Sequence Metagenome annotation tools (more details will be given) GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs) MetaGene FragGeneScan Full-service annotation pipelines IMG/M-ER – “metagenome gene calling” + other options MG-RAST CAMERA annotation pipeline