Published by May Pope. Modified over 8 years ago.
1
Gene prediction in metagenomic fragments: A large scale machine learning approach
Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern and Peter Meinicke
Orphelia: predicting genes in metagenomic sequencing reads
Katharina J. Hoff, Thomas Lingner, Peter Meinicke and Maike Tech
José Lugo-Martínez
I609, Week 9, March 9, 2010
2
Outline
- Introduction
- Background
- Orphelia
- Effect of sequencing errors
- Conclusion
3
Metagenomics (revisited)
- Simultaneously characterize all single-species genomes of a particular habitat
- Without prior cultivation!
- Phylogenetic origin may be unknown
- Identification of protein-coding genes: the location of genes is unknown!
- Identification of metabolic pathways
4
So far, we have discussed …
- Environmental sample
- Sequencing
- Who is there? (Phylogenetic profiling)
- What are they doing? (Functional profiling)
… Today
5
What’s the Problem?
- We don’t know the phylogenetic origin
- Most reads cannot be assembled into longer contigs (how to assemble reads?)
- This implies analysis of single sequencing reads
- But ORF-based approaches will overlook most reads, because short reads rarely contain a complete ORF
- We need gene prediction approaches for metagenomics that are fast and accurate
6
Possible Approaches (1): Homology-based
- BLAST search against databases of known proteins
- BLAST search against the sample
- Clustering of sample and database sequences
- Limited to already known genes and/or computationally expensive
7
Possible Approaches (2): Model-based methods
- GeneMark: derives an adapted monocodon usage model from GC content
- MetaGene: extracts ORFs and scores them, then calculates the final ORF combination from the different scores
- FragGeneScan: Mina Rho (IU)
- Orphelia: fragment-oriented, based on a two-stage machine learning approach (linear discriminants and neural networks)
8
Orphelia Overview
[Figure: overview diagram; label: added in 2nd paper]
9
Pipeline
1. Extract “ORFs”
2. Score all candidates (likely genes vs. likely random “ORFs”)
3. Selection of candidates
4. Final prediction
10
“ORF” Extraction
- Begin: start codon (ATG, CTG, GTG, or TTG)
- Followed by >18 subsequent triplets
- End: stop codon (TGA, TAG, or TAA)
- But also consider incomplete “ORFs” of length ≥60 bp that lack a start and/or stop codon
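The extraction rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Orphelia's implementation: it scans the forward strand only (the reverse complement could be passed in separately), and the function name `extract_orfs`, the tuple layout, and the constant names are hypothetical. It also treats every start codon in a stop-free stretch as its own candidate.

```python
START_CODONS = {"ATG", "CTG", "GTG", "TTG"}
STOP_CODONS = {"TGA", "TAG", "TAA"}
MIN_TRIPLETS_AFTER_START = 18  # slide: "followed by >18 subsequent triplets"
MIN_INCOMPLETE_LENGTH = 60     # slide: incomplete "ORFs" of length >= 60 bp


def extract_orfs(fragment):
    """Return forward-strand candidates as (begin, end, has_start, has_stop).

    Coordinates are 0-based and half-open; `end` includes the stop codon
    when one is present.
    """
    seq = fragment.upper()
    candidates = []
    for frame in range(3):
        segment = []      # codon positions since the last stop codon
        left_open = True  # True until the first stop codon in this frame
        # None acts as a sentinel for the end of the fragment.
        for pos in list(range(frame, len(seq) - 2, 3)) + [None]:
            codon = None if pos is None else seq[pos:pos + 3]
            if codon is not None and codon not in STOP_CODONS:
                segment.append(pos)
                continue
            has_stop = codon is not None
            if segment:
                end = pos + 3 if has_stop else segment[-1] + 3
                for begin in segment:
                    if seq[begin:begin + 3] not in START_CODONS:
                        continue
                    triplets_after_start = (segment[-1] - begin) // 3
                    if has_stop and triplets_after_start > MIN_TRIPLETS_AFTER_START:
                        candidates.append((begin, end, True, True))
                    elif not has_stop and end - begin >= MIN_INCOMPLETE_LENGTH:
                        candidates.append((begin, end, True, False))
                # A stretch that reaches the left fragment border may lack its start.
                if left_open and end - segment[0] >= MIN_INCOMPLETE_LENGTH:
                    candidates.append((segment[0], end, False, has_stop))
            segment = []
            left_open = False
    return candidates
```

Usage: `extract_orfs(read)` returns a list of coordinate tuples; a candidate with `has_start` or `has_stop` set to `False` is an incomplete “ORF” that runs into a fragment border.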
11
“ORFs” Identification
[Figure labels: STOP, TIS]
12
Scoring of Candidates
13
Step 1: Linear Discriminants
- Feature preprocessing
- Training linear discriminants
- Example: monocodon linear discriminant
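To make the monocodon example concrete, here is a rough sketch of what a monocodon feature and a linear discriminant score could look like. The slide does not give the preprocessing or the training procedure, so everything here is an assumption: the feature is taken to be the relative frequency of each of the 64 codons in the candidate's reading frame, and the weights are supplied by the caller rather than trained. The names `monocodon_frequencies` and `linear_discriminant` are hypothetical.

```python
from itertools import product

# All 64 codons in a fixed order, so that feature vectors are comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def monocodon_frequencies(orf):
    """Relative codon frequencies of a candidate, read in frame from position 0."""
    seq = orf.upper()
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases such as N
            counts[codon] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


def linear_discriminant(features, weights, bias=0.0):
    """Linear score w . x + b; the weights would come from training (not shown)."""
    return sum(w * x for w, x in zip(weights, features)) + bias
```

The score of one discriminant is a single number per candidate; in a two-stage design such scores would then serve as inputs to the second stage.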
14
Step 2: Neural Network
- Input: feature vector x :=
- Output: probability that the “ORF” is protein-coding
- Training of Orphelia versions:
  - Net300 (a.k.a. Orphelia 300): trained on 300 bp fragments, for predicting genes in 454 reads
  - Net700 (a.k.a. Orphelia 700): trained on 700 bp fragments, for predicting genes in Sanger reads
15
Gene Candidate Selection Algorithm
Initially:
- C_i = all “ORFs” of fragment i with gene probability p > 0.5, along with their probabilities
- G_i = ∅ (empty)
Selection Algorithm(C_i, G_i):
while C_i is not empty:
1. determine the “ORF” with the highest probability among all “ORFs” in C_i
2. remove the selected “ORF” from C_i and add it to G_i
3. remove all “ORFs” from C_i that overlap with the selected “ORF” by more than o_max bp
Result: G_i = list of genes for fragment i
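The greedy selection above translates almost directly into Python. This is a minimal sketch, assuming each candidate is a `(begin, end, probability)` tuple with half-open coordinates; the function names and the default `max_overlap=60` are illustrative assumptions, since the slide does not state a value for o_max.

```python
def overlap(a, b):
    """Number of positions shared by two half-open intervals (begin, end, ...)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def select_genes(candidates, max_overlap=60, min_probability=0.5):
    """Greedy selection of non-conflicting candidates for one fragment.

    candidates: list of (begin, end, probability) tuples.
    max_overlap: stands in for o_max; the default is an assumed value.
    """
    # C_i: only candidates with gene probability p > 0.5 are considered.
    remaining = [c for c in candidates if c[2] > min_probability]
    genes = []  # G_i
    while remaining:
        best = max(remaining, key=lambda c: c[2])   # step 1
        remaining.remove(best)                      # step 2
        genes.append(best)
        remaining = [c for c in remaining           # step 3
                     if overlap(c, best) <= max_overlap]
    return genes
```

For example, with candidates `(0, 300, 0.9)`, `(100, 400, 0.8)` and `(290, 600, 0.7)`, the second is dropped because it overlaps the best candidate by 200 bp, while the third overlaps by only 10 bp and is kept.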
16
Performance
- Use fragments with known annotation
- Compare prediction to annotation
- Measures: sensitivity and specificity
- TP: reading frame and/or stop codon of prediction match annotation
- FP: predicted gene does not occur in annotation
- FN: annotated gene was not predicted
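The formulas on this slide did not survive as text. The sketch below assumes the definitions commonly used in gene prediction benchmarks, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP), i.e. what other fields call precision. The function names are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of annotated genes that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp):
    """Fraction of predictions that match the annotation: TP / (TP + FP).

    Assumed gene-prediction convention; elsewhere this is called precision.
    """
    return tp / (tp + fp) if (tp + fp) else 0.0
```

With 90 true positives, 10 false positives and 30 false negatives, this gives a sensitivity of 0.75 and a specificity of 0.9.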
17
Test Species
Fragments were randomly excised from annotated genomes to 1x genome coverage
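A simple way to picture the excision step: draw random start positions until the total fragment length roughly equals the genome length. The sketch below assumes fixed-length fragments drawn uniformly from the forward strand; the function name `excise_fragments` and its parameters are hypothetical, and the papers' actual sampling procedure may differ.

```python
import random


def excise_fragments(genome, fragment_length, coverage=1.0, seed=None):
    """Draw random fixed-length fragments until the requested coverage is reached.

    Returns (start, sequence) pairs so each fragment can be matched
    against the genome annotation afterwards.
    """
    if fragment_length <= 0 or fragment_length > len(genome):
        raise ValueError("fragment_length must be between 1 and the genome length")
    rng = random.Random(seed)
    # 1x coverage: total fragment length is about one genome length.
    n_fragments = int(coverage * len(genome) / fragment_length)
    fragments = []
    for _ in range(n_fragments):
        start = rng.randrange(len(genome) - fragment_length + 1)
        fragments.append((start, genome[start:start + fragment_length]))
    return fragments
```

Keeping the start coordinate with each fragment is what makes it possible to compare predictions to the known annotation later.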
18
Results
19
Sensitivity on different fragment lengths
20
Specificity on different fragment lengths
21
Web server: http://orphelia.gobics.de/
22
Limitations
- Orphelia does not annotate rRNA and tRNA genes
- All model-based methods are susceptible to sequencing errors
23
Effect of sequencing errors
- Traditional gene prediction methods have been subject to a benchmark study on real sequencing reads with typical errors
- However, such a comparison has not been conducted for specialized tools
- Gene prediction accuracy is mostly measured on error-free DNA fragments
24
Two major sequencing techniques
- Sanger sequencing
  - Average read length of ~700 nt
  - Error rates from 0.001% to >1% (depends on the software used for post-processing of reads)
- Pyrosequencing
  - Shorter reads, ~450 nt
  - Error rate of 0.49% for reads of 100-200 nt
- The metagenomics simulation software MetaSim produces reads with an error rate of 2.8%
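To get a feel for what these error rates mean for a single read, one can inject errors at a chosen per-base rate. The sketch below is deliberately simplistic and is not MetaSim's error model: it introduces uniform substitutions only, whereas real reads also contain insertions and deletions, which shift the reading frame. The function name `add_substitution_errors` is hypothetical.

```python
import random


def add_substitution_errors(read, error_rate, seed=None):
    """Replace each A/C/G/T base with a different base with probability error_rate.

    Substitutions only; insertions and deletions are not simulated here.
    """
    rng = random.Random(seed)
    noisy = []
    for base in read.upper():
        if base in "ACGT" and rng.random() < error_rate:
            noisy.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)
```

At a rate of 0.028 (the 2.8% quoted above), a 450 nt read would carry about 12 to 13 substitutions on average.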
25
Accuracy Results
26
Accuracy Results by GC-content
27
Conclusions
- Orphelia
  - High gene prediction accuracy on short DNA fragments
  - High gene prediction specificity
- Accounting for realistic sequencing error rates will significantly influence prediction performance
28
Additional References
- http://orphelia.gobics.de/
- K. J. Hoff, T. Lingner, P. Meinicke, M. Tech, “Orphelia: predicting genes in metagenomic sequencing reads”, Nucleic Acids Research, 2009, 37, W101–W105.
- K. J. Hoff, “The effect of sequencing errors on metagenomic gene prediction”, BMC Genomics, 2009, 10:520.
- H. Noguchi, J. Park, T. Takagi, “MetaGene: prokaryotic gene finding from environmental genome shotgun sequences”, Nucleic Acids Research, 2006, 34(19), 5623–5630.
- J. Besemer, M. Borodovsky, “Heuristic approach to deriving models for gene finding”, Nucleic Acids Research, 1999, 27(19), 3911–3920.
29
Questions?