Download presentation
Presentation is loading. Please wait.
Published byMuriel Neal Modified over 9 years ago
1
Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop January 31, 2012
2
Advancing Science with DNA Sequence 1.I ntroduction 2. Tools out there 3. Basic principles behind tools and known problems 4. Metagenomes Outline
3
Advancing Science with DNA Sequence Sequence features in prokaryotic genomes: stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA) protein-coding genes (CDSs) transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends) translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins) pseudogenes (tRNA and protein-coding genes) … Finding the genes in microbial genomes Well-annotated bacterial genome in Artemis genome viewer:
4
Advancing Science with DNA Sequence 1.Introduction 2. Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site) 3. Known problems 4. Metagenomes Outline
5
Advancing Science with DNA Sequence IMG-ER http://img.jgi.doe.gov/ RAST http://rast.nmpdr.org/ JCVI Annotation Service http://www.jcvi.org/cms/research/projects/annotation- service/ RefSeq http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.h tml Publicly available genome annotation services
6
Advancing Science with DNA Sequence Large structural RNAs (23S and 16S rRNAs) BLASTn RNAmmer http://www.cbs.dtu.dk/services/RNAmmer/ Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component) Rfam database, INFERNAL search tool http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/ tRNAScan-SE http://lowelab.ucsc.edu/tRNAscan-SE/ What they provide and how they do it - RNAs
7
Advancing Science with DNA Sequence Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction) Open reading frame (ORF): reading frame between a start and stop codon What they provide and how they do it – protein-coding gens (CDSs - not ORFs!)
8
Advancing Science with DNA Sequence Gene finders: ab initio tools; evidence-based refinement Ab initio tools used by the pipelines: Glimmer family (Glimmer2, Glimmer3, RBS finder) ->NCBI, RAST, JCVI http://glimmer.sourceforge.net/ GeneMark family (GeneMark-hmm, GeneMarkS) ->NCBI http://exon.gatech.edu/GeneMark/ PRODIGAL -> IMG-ER, NCBI http://compbio.ornl.gov/prodigal/ Evidence-based refinement: mostly undocumented in-house developed tools. Types of corrections: missed genes (RAST, JCVI, NCBI), frameshifts (JCVI, NCBI), start sites (RAST)
9
Advancing Science with DNA Sequence 1.Introduction 2.Tools out there 3.Basic principles behind tools 4.Metagenomes Outline
10
Advancing Science with DNA Sequence What is ab initio gene finder? Two major approaches to prediction of protein-coding genes: ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs) Advantages: finds “unique” genes; high sensitivity; very fast! Limitations: often misses “unusual” genes; high rate of false positives “evidence-based” (ORFs with translations homologous to the known proteins are CDSs) Advantages: finds “unusual” genes (e. g. horizontally transferred); relatively low rate of false positive predictions Limitations: cannot find “unique” genes; low sensitivity on short genes; prone to propagation of false positive results of ab initio annotation tools; slow!
11
Advancing Science with DNA Sequence How ab initio tools work – very briefly Statistical model of coding and non-coding regions (codon or dicodon frequencies, hidden Markov models of different lengths) Statistical model architecture Additional algorithms for refinement of predictions (RBS finder, overlap resolution, etc.) Prokaryotic gene model used by all ab initio gene finders Ribosome-binding site within certain distance of the start codon; One of 3 start codons; One of 3 stop codons; No frame interruptions Ribosome binding site Start codon: ATG, GTG, TTG Stop codon: TAG, TAA, TGA open reading frame
12
Advancing Science with DNA Sequence Known problems of all annotation pipelines RNAs –Incomplete rRNAs –Trans-spliced tRNA in archaeal genomes –Small structural RNAs not predicted at all Protein-coding genes that don’t fit into prokaryotic gene model used by ab initio gene finders –no RBS (leaderless transcripts) –interrupted translation frame sequencing errors or translational exceptions –non-canonical start Ribosome binding site Start codon: ATG, GTG, TTG Stop codon: TAG, TAA, TGA open reading frame GenomeSequencing center 16S rRNA, nt Synechococcus sp. CC9311UCSD, TIGR1477 Synechococcus sp. CC9605JGI1440 Synechococcus elongatus PCC 7942JGI1490 Synechococcus sp. JA-2-3BA(2-13)TIGR1323 Synechococcus sp. JA-3-3AbTIGR1324 Synechococcus sp. RCC307Genoscope1498 Synechococcus sp. WH7803Genoscope1497, 1464
13
Advancing Science with DNA Sequence Symptoms of gene finding problems Some type of mandatory features (rRNAs, tRNAs, CDSs) is missing “Truncated” genes (shorter than homologs) => funky translation initiation features (non-canonical start codons, leaderless transcripts) Many “unique” genes without protein family assignment or BLASTp hit => sequencing errors (frameshifts) Undetected selenocysteines, programmed frameshifts in ~50 well-conserved protein families
14
Advancing Science with DNA Sequence Supplemental tools TIS (translation initiation site) prediction/correction TICO http://tico.gobics.de/http://tico.gobics.de/ TriTISA http://mech.ctb.pku.edu.cn/protisa/TriTISA Two tools often disagree about the best TIS, especially in high GC genomes Operon prediction JPOP http://csbl.bmb.uga.edu/downloads/#jpop http://www.cse.wustl.edu/~jbuhler/research/operons/ http://www.sph.umich.edu/~qin/hmm/ Proteins with unusual translational features – selenocysteine-containing genes bSECISearch http://genomics.unl.edu/bSECISearch/
15
Advancing Science with DNA Sequence Metagenomes sequenced with new technologies: low-coverage problems Both 454 and Illumina require high sequence coverage in order to achieve high sequence quality (25x to >100x) High sequence coverage cannot be achieved for metagenome data How does this affect metagenome annotation? ~70% of 454 Titanium reads have at least 1 sequencing artifact (basecalls in homopolymeric runs), there is no clear pattern of error distribution >100 bp Illumina reads have ~3% error rate, error rate is higher towards the end of the read, the majority of errors are substitutions metagenome genome sequence coverage
16
Advancing Science with DNA Sequence Just one example… predicted gene 3 frameshifts 4-read contig, 1476 nt, no misassembly Contig has 27 homopolymers (3 nt and more), 3 of them have errors No correlation with homopolymer type or error type Reads were quality trimmed prior to assembly
17
Advancing Science with DNA Sequence Metagenome annotation tools (more details will be given) GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs) http://exon.gatech.edu/GeneMark/ MetaGene http://metagene.cb.k.u-tokyo.ac.jp/metagene/ FragGeneScan http://omics.informatics.indiana.edu/FragGeneScan/ Full-service annotation pipelines IMG/M-ER – “metagenome gene calling” + other options http://img.jgi.doe.gov/submit MG-RAST http://metagenomics.nmpdr.org/ CAMERA annotation pipeline http://camera.calit2.net/
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.