Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.

Gene prediction in metagenomic fragments: A large scale machine learning approach
Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern and Peter Meinicke
Orphelia: predicting genes in metagenomic sequencing reads
Katharina J. Hoff, Thomas Lingner, Peter Meinicke and Maike Tech
José Lugo-Martínez, I609, Week 9, March 9th, 2010

Outline
- Introduction
- Background
- Orphelia
- Effect of sequencing errors
- Conclusion

Metagenomics (revisited)
- Simultaneously characterize all single-species genomes of a particular habitat
  - Without prior cultivation!!
- Phylogenetic origin may be unknown
- Identification of protein-coding genes
  - Location of genes unknown!!
- Identification of metabolic pathways

So far, we have discussed …
- Environmental sample
- Sequencing
- Who is there? Phylogenetic profiling
- What are they doing? Functional profiling
… Today

What’s the Problem?
- Phylogenetic origin is unknown
  - Most reads cannot be assembled into longer contigs
  - How to assemble reads?
- This implies: analysis of single sequencing reads
- But ORF-based approaches will overlook most reads
- Need gene prediction approaches for metagenomics
  - Fast and accurate

Possible Approaches (1)
- Homology-based
  - BLAST search against databases of known proteins
  - BLAST search against sample
  - Clustering of sample and database sequences
- Limited to already known genes, and/or computationally expensive

Possible Approaches (2)
- Model-based methods
  - GeneMark: derives an adapted monocodon usage model from GC-content
  - MetaGene: extracts ORFs and scores them, then calculates the final ORF combination from different scores
  - FragGeneScan: Mina Rho (IU)
  - Orphelia: fragment-oriented, based on a two-stage machine learning approach
    - Linear discriminants and neural networks

Orphelia Overview
(Figure annotation: added in 2nd paper)

Pipeline
1. Extract “ORFs”
2. Score all candidates (likely genes vs. likely random “ORFs”)
3. Selection of candidates
4. Final prediction

“ORF” Extraction
- Begin: start codon (ATG, CTG, GTG, or TTG)
- Followed by >18 subsequent triplets
- End: stop codon (TGA, TAG, or TAA)
- But also consider incomplete “ORFs” of length ≥60 bp that lack a start and/or stop codon
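The extraction rules above can be sketched in Python. This is a minimal illustration, not the Orphelia implementation: the function name `extract_orfs`, the labels in the returned tuples, and the forward-strand-only scan are assumptions made for this example, and the default `min_inner_triplets=19` is one literal reading of ">18 subsequent triplets".

```python
START_CODONS = {"ATG", "CTG", "GTG", "TTG"}
STOP_CODONS = {"TGA", "TAG", "TAA"}

def extract_orfs(seq, min_inner_triplets=19, min_incomplete_len=60):
    """Return (start, end, kind) candidates on the forward strand (end exclusive).

    Hypothetical helper: every start codon upstream of a stop is reported as
    its own candidate, and incomplete candidates may touch the fragment edges.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        stretch = []          # codon positions seen since the last stop codon
        open_upstream = True  # True while no stop has been seen in this frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                for s in stretch:
                    inner = (pos - s) // 3 - 1  # triplets between start and stop
                    if seq[s:s + 3] in START_CODONS and inner >= min_inner_triplets:
                        found.append((s, end, "complete"))
                # Stretch reaches the fragment start: candidate lacking a start codon.
                if open_upstream and stretch and end - stretch[0] >= min_incomplete_len:
                    found.append((stretch[0], end, "no_start"))
                stretch = []
                open_upstream = False
            else:
                stretch.append(pos)
        if stretch:
            # Stretch reaches the fragment end: candidates lacking a stop codon.
            end = stretch[-1] + 3
            for s in stretch:
                if seq[s:s + 3] in START_CODONS and end - s >= min_incomplete_len:
                    found.append((s, end, "no_stop"))
            if open_upstream and end - stretch[0] >= min_incomplete_len:
                found.append((stretch[0], end, "no_start_no_stop"))
    return found
```

To cover both strands, the same function could be run on the reverse complement of the fragment, with coordinates mapped back afterwards.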

“ORFs” Identification
(Figure: candidate “ORFs” with labeled STOP and TIS positions)

Scoring of Candidates

Step 1: Linear Discriminants
- Feature preprocessing
- Training linear discriminants
- Example: monocodon linear discriminant
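As a rough illustration of the monocodon example, the sketch below turns an in-frame sequence into a 64-dimensional codon frequency vector and scores it with a linear function. The names `monocodon_frequencies` and `linear_discriminant_score` are hypothetical, and the weights and bias are placeholders: in practice they would come from training on annotated genomes, which is not shown here.

```python
from itertools import product

# Fixed ordering of the 64 codons so that feature vectors are comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def monocodon_frequencies(orf):
    """Relative frequency of each codon in an in-frame sequence."""
    orf = orf.upper()
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases such as N
            counts[codon] += 1
            total += 1
    return [counts[c] / total if total else 0.0 for c in CODONS]

def linear_discriminant_score(features, weights, bias=0.0):
    """Linear score w . x + b; weights and bias are assumed to be trained elsewhere."""
    return sum(w * x for w, x in zip(weights, features)) + bias
```

A score of this kind is a single number per candidate, which is what makes it usable as one input feature for the second stage.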

Step 2: Neural Network
- Input: feature vector x := [definition not preserved]
- Output: probability that the “ORF” is protein-coding
- Training of Orphelia versions
  - Net300 (a.k.a. Orphelia 300): trained on 300 bp fragments for predicting genes in 454 reads
  - Net700 (a.k.a. Orphelia 700): trained on 700 bp fragments for predicting genes in Sanger reads

Gene Candidate Selection Algorithm
Initially:
  C_i = all “ORFs” of fragment i, along with their gene probability (p > 0.5)
  G_i = ∅ (empty)
Selection Algorithm(C_i, G_i):
  while C_i is not empty
    1. determine the “ORF” with the highest probability among all “ORFs” in C_i
    2. remove the selected “ORF” from C_i and add it to G_i
    3. remove all “ORFs” from C_i that overlap with the selected “ORF” by more than o_max bp
Result: G_i = list of genes for fragment i
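The greedy selection above can be written out in a few lines of Python. This is a sketch under assumptions: candidates are represented as `(start, end, probability)` tuples with an exclusive end, and the names `overlap` and `select_genes` are made up for this example. The value of `o_max` is not given on the slide, so it is left as a required parameter.

```python
def overlap(a, b):
    """Number of overlapping positions between two (start, end, ...) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def select_genes(candidates, o_max, p_min=0.5):
    """Greedy selection of non-conflicting candidates for one fragment.

    candidates: list of (start, end, probability) tuples, end exclusive.
    """
    # C_i: only candidates with gene probability above the threshold.
    remaining = [c for c in candidates if c[2] > p_min]
    genes = []  # G_i
    while remaining:
        # 1. candidate with the highest probability
        best = max(remaining, key=lambda c: c[2])
        # 2. move it from C_i to G_i
        genes.append(best)
        # 3. drop everything that overlaps it by more than o_max bp
        remaining = [c for c in remaining
                     if c is not best and overlap(c, best) <= o_max]
    return genes
```

For example, with an assumed `o_max` of 60, a candidate that overlaps the top-scoring one by 10 bp survives, while one that overlaps it by 200 bp is discarded.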

Performance
- Use fragments with known annotation
- Compare prediction to annotation
- Measures: sensitivity and specificity
  - TP: reading frame and/or stop codon of prediction match annotation
  - FP: predicted gene does not occur in annotation
  - FN: annotated gene was not predicted
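The formulas for the two measures did not survive in this transcript. The sketch below assumes the convention common in gene prediction benchmarks, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); this fits the slide, which defines TP, FP and FN but no true negatives. The function names are illustrative.

```python
def sensitivity(tp, fn):
    """Fraction of annotated genes that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tp, fp):
    """Fraction of predictions that match the annotation: TP / (TP + FP).

    Assumed convention: this is "specificity" as used in gene prediction,
    which other fields would call precision.
    """
    return tp / (tp + fp) if (tp + fp) else 0.0
```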

Test Species
- Fragments randomly excised from annotated genomes to 1x genome coverage
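One way to read "1x genome coverage" is that the total length of all excised fragments roughly equals the genome length. A minimal sketch under that assumption, with a hypothetical function name and uniform random start positions, could look like this:

```python
import random

def excise_fragments(genome, fragment_len, coverage=1.0, seed=0):
    """Randomly excise fixed-length fragments up to the requested coverage.

    Returns a list of (start, fragment) pairs; the start positions make it
    possible to look up the known annotation for each fragment afterwards.
    """
    if fragment_len <= 0 or fragment_len > len(genome):
        raise ValueError("fragment_len must be between 1 and the genome length")
    rng = random.Random(seed)  # fixed seed for reproducible test sets
    # Number of fragments so that total fragment length ~ coverage * genome length.
    n_fragments = int(coverage * len(genome) / fragment_len)
    fragments = []
    for _ in range(n_fragments):
        start = rng.randrange(0, len(genome) - fragment_len + 1)
        fragments.append((start, genome[start:start + fragment_len]))
    return fragments
```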

Results

Sensitivity on different fragment lengths

Specificity on different fragment lengths

Web server: http://orphelia.gobics.de/

Limitations
- Does not annotate rRNA and tRNA genes
- All model-based methods are susceptible to sequencing errors

Effect of sequencing errors
- Traditional gene prediction methods have been subject to a benchmark study on real sequencing reads with typical errors
- However, such a comparison has not been conducted for specialized tools
  - Gene prediction accuracy is mostly measured on error-free DNA fragments

Two major sequencing techniques
- Sanger sequencing
  - Average read length of ~700 nt
  - Error rates from 0.001% to > 1% (depending on the software used for post-processing of reads)
- Pyrosequencing
  - Shorter reads, ~450 nt
  - Error rate of 0.49% for reads of 100-200 nt
  - Metagenomics simulation software MetaSim produces reads with an error rate of 2.8%

Accuracy Results

Accuracy Results by GC-content

Conclusions
- Orphelia
  - High gene prediction accuracy on short DNA fragments
  - High gene prediction specificity
- Accounting for realistic sequencing error rates will significantly influence prediction performance

Additional References
- http://orphelia.gobics.de/
- K. J. Hoff, T. Lingner, P. Meinicke, M. Tech, “Orphelia: predicting genes in metagenomic sequencing reads”, Nucleic Acids Research, 2009, 37, W101–W105.
- K. J. Hoff, “The effect of sequencing errors on metagenomic gene prediction”, BMC Genomics, 2009, 10:520.
- H. Noguchi, J. Park, T. Takagi, “MetaGene: prokaryotic gene finding from environmental genome shotgun sequences”, Nucleic Acids Research, 2006, 34(19), 5623–5630.
- J. Besemer, M. Borodovsky, “Heuristic approach to deriving models for gene finding”, Nucleic Acids Research, 1999, 27(19), 3911–3920.

Questions?
