Published by Annabel Higgins. Modified over 8 years ago.
1
Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models
Arthur Brady and Steven L. Salzberg
Nature Methods 6(9): 673-678, 2009
I609 – Week 7: Paper 8
02/23/2010
Presented by Vikas Rao Pejaver
2
Outline
- Background: problem statement; previous methods; Markov chains & interpolated Markov models
- Methods: datasets; training and testing; weighted voting in PhymmBL
- Results: synthetic data; classification accuracy; acid mine data
- Summary
- Discussion
3
Background – Problem statement
- Given a metagenomic sample consisting of short reads, classify the reads so that DNA fragments from the same species can be grouped together and assembled.
- The authors distinguish between 'classification' and 'binning':
  - Classification: each group is assigned a specific label.
  - Binning: the dataset is expected to be divided into smaller clusters, but the groups may remain unlabeled.
4
Background – Previous methods
- TETRA [1]: uses z-scores of tetranucleotide frequencies.
- MetaClust [2]: combines several measures (GC content, dinucleotide relative abundances, raw counts, statistical evaluations of short k-mers, and chaos game representations) and performs clustering.
- CompostBin [3]: uses a weighted PCA algorithm to project the high-dimensional DNA composition data into a lower-dimensional space, then applies the normalized cut clustering algorithm.
- Barcodes for genomes [4]: uses combined frequency distributions of k-mers and their reverse complements.
- The issue with all of these methods is that they cannot capture local sequence variations and thus work only for longer reads.

References
1. Teeling et al. TETRA: A Web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5: 163 (2004)
2. Woyke et al. Symbiosis insights through metagenomic analysis of a microbial consortium. Nature 443(7114): 950-955 (2006)
3. Chatterji et al. CompostBin: A DNA composition-based algorithm for binning environmental shotgun reads. In Proc. of the 12th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2008), pp. 17-28
4. Zhou et al. Barcodes for genomes and applications. BMC Bioinformatics 9: 546 (2008)
5
Background – Previous methods
- CARMA [1]: matches reads to known Pfam domains (shown to correctly classify only 6% of reads).
- PhyloPythia [2]: an SVM-based method that uses oligonucleotide frequencies to perform classification (however, it works well only for long reads).
- BLAST [3]: assigns taxonomy based on the best match obtained when aligning a read to a sequence database (however, not all sequences may be represented in current databases).
- MEGAN [4]: similar in approach to BLAST, but uses information from multiple high-scoring BLAST hits.

References
1. Krause, L. et al. Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res. 36, 2230–2239 (2008)
2. McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P. & Rigoutsos, I. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 4, 63–72 (2007)
3. Tringe, S.G. et al. Comparative metagenomics of microbial communities. Science 308, 554–557 (2005)
4. Huson, D.H., Auch, A.F., Qi, J. & Schuster, S.C. MEGAN analysis of metagenomic data. Genome Res. 17, 377–386 (2007)
6
Background – Markov chains & IMMs
- Interpolated Markov models (IMMs) are used in gene finding and were previously implemented in GLIMMER [1].
- For this particular problem, IMMs provide two major advantages:
  1. Frequencies of oligonucleotides of different sizes can be modeled.
  2. Non-adjacent patterns can be modeled.

References
1. Delcher, A.L., Harmon, D., Kasif, S., White, O. & Salzberg, S.L. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27, 4636–4641 (1999)
7
Background – Markov chains & IMMs
[Figure: Markov chain over states X_1, X_2, ..., X_{n-1}, X_n. Images from lecture slides from I529, Spring 2009, Haixu Tang]
Q. What if the probability at a particular state depended on the n states before it?
A. We would have an nth-order Markov chain.
Q. What if n was not fixed?
A. We would have an IMM!
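The nth-order Markov chain from the Q&A above can be illustrated with a minimal Python sketch. It is not code from the paper or the lecture; the function names, the add-one pseudocount, and the four-letter alphabet are assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: unambiguous DNA bases only


def train_markov_chain(sequence, order):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        counts[context][sequence[i]] += 1
    return counts


def log_probability(sequence, counts, order, pseudocount=1.0):
    """Log-probability of `sequence` under the trained chain.

    The first `order` bases are skipped because they lack a full context.
    The pseudocount (an illustrative smoothing choice) keeps unseen
    contexts from producing a zero probability.
    """
    total = 0.0
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        seen = counts.get(context, {})
        numerator = seen.get(sequence[i], 0) + pseudocount
        denominator = sum(seen.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total
```

A read that resembles the training sequence scores higher than one that does not, which is the property a classifier built on such chains relies on. The fixed `order` is exactly the limitation that an IMM relaxes.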
8
Background – Markov chains & IMMs
where k is the order, S_x is the oligomer ending at position x, P_k(S_x) is the estimate, obtained from the training data, of the probability of the base located at x in the kth-order model, and λ_k(S_{x-1}) is the numeric weight associated with the k-mer ending at position x – 1.

Reference
Delcher, A.L., Harmon, D., Kasif, S., White, O. & Salzberg, S.L. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27, 4636–4641 (1999)
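The equation that the "where" clause on this slide describes was an image and is not part of this transcript. Assuming it is the standard IMM recursion from the cited GLIMMER work, with λ denoting the numeric weight, it would read:

```latex
\mathrm{IMM}_k(S_x) \;=\; \lambda_k(S_{x-1}) \, P_k(S_x)
  \;+\; \bigl[\, 1 - \lambda_k(S_{x-1}) \,\bigr] \, \mathrm{IMM}_{k-1}(S_x)
```

Read this way, a weight near 1 means the kth-order estimate is trusted, while a weight near 0 makes the model fall back on the lower-order estimate. That fallback is how the effective order is allowed to vary along the sequence.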
9
Methods – Datasets
Synthetic metagenome test set:
- Based on a core library of bacterial and archaeal genomes from 539 distinct species
- No overlap between training and test sets
- 5 randomly selected 'reads' from each chromosome/plasmid were used to construct the test sets
- Test sets were filtered so that only species with at least two sister species within the clade were considered
Acid mine drainage test set:
- Entire set of raw sequence reads from NCBI
- 166,345 reads remained after vector sequence cleaning, quality filtering, and removal of very short reads
- True positives were defined by aligning reads to the draft genomes of three species using MUMmer
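The read-sampling step for the synthetic test set can be sketched roughly as follows. This is an illustration only: the function name, the `read_length` and `seed` parameters, and uniform sampling of start positions are assumptions, since the slide states only that 5 reads were randomly selected per chromosome or plasmid.

```python
import random


def sample_reads(sequence, read_length, n_reads=5, seed=None):
    """Draw `n_reads` substrings of `read_length` bases from `sequence`.

    Start positions are sampled uniformly with replacement (an assumption;
    the slide does not say how positions were chosen).
    """
    if len(sequence) < read_length:
        raise ValueError("sequence is shorter than the requested read length")
    rng = random.Random(seed)  # local generator, so results are reproducible
    last_start = len(sequence) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(sequence[start:start + read_length])
    return reads
```

Calling `sample_reads(chromosome, read_length=400)` once per chromosome or plasmid, for a read length of interest, would yield a test set of the kind described; the value 400 here is only a placeholder.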
10
Methods – Training and testing
- 1,146 IMMs were built and 1,146 molecular sequences were included in the BLAST database.
- For PhymmBL, composition-based measures (G+C content, dinucleotide frequency, etc.) were included in training to boost accuracy, but no significant improvements were observed.
- Surprisingly, no correlation was found between training set size and either prediction accuracy or its standard deviation (figure in supplementary material).
- Accuracy depends more on evolutionary diversity than on training set size.
11
Methods – Weighted voting
- Phymm: the basic score reflects the probability that a query sequence was generated from the same distribution as the one used to train a given IMM.
- PhymmBL: uses a weighted scoring scheme:
  Score = IMM + 1.2(4 – log(E))
  where IMM is the score of the best-matching IMM and E is the lowest E-value from BLAST.
- The constants were determined experimentally and fixed so that the scores from one method do not dominate those from the other.
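As a rough illustration of the weighted score on this slide, the sketch below computes Score = IMM + 1.2(4 – log(E)) in Python. Several details are assumptions rather than facts from the slide: the logarithm is taken as base 10, E-values of zero are clamped to a small floor, and the hypothetical `best_label` helper picks the label with the highest combined score.

```python
import math


def phymmbl_score(imm_score, blast_e_value, e_value_floor=1e-300):
    """Combine an IMM score with a BLAST E-value as on the slide.

    Assumptions: base-10 logarithm, and a floor on the E-value so that
    an E-value reported as 0 does not raise a math domain error.
    """
    e_value = max(blast_e_value, e_value_floor)
    return imm_score + 1.2 * (4.0 - math.log10(e_value))


def best_label(candidates):
    """Return the label with the highest combined score.

    `candidates` maps a label to an (imm_score, blast_e_value) pair;
    this structure is hypothetical and used only for illustration.
    """
    return max(candidates, key=lambda label: phymmbl_score(*candidates[label]))
```

Under these assumptions, a smaller E-value raises the combined score, so a strong BLAST hit can outweigh a modest difference in IMM scores, and the reverse is also possible.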
12
Results – Synthetic data
13
Results – Classification accuracy
14
Results – Acid mine data
15
[Figure panels: BLAST, Phymm, PhymmBL]
16
Summary
- IMMs are effective for non-binary classification problems as well.
- BLAST is still better, but PhymmBL would allow characterization of novel sequences.
- Phylogenetic classification is achieved without gene finding, domain matching, etc.
- The '1000 bp' read barrier is surpassed by both methods.
- This will improve downstream analyses of metagenomes.
17
Discussion
- As mentioned in the paper, the relationships between raw scores and predictive accuracies across various read lengths need further investigation. Any ideas?
- As Prof. Ye mentioned at the beginning of this course, despite such high classification accuracy with Phymm and PhymmBL, subsequent assembly has remained poor. Why?
- Can we use alternative approaches in the context of the metagenomics pipeline, perhaps a simultaneous classification and assembly approach?
18
Thank you! Metagenomics has been compared to ‘a reinvention of the microscope in the expanse of research questions it opens to investigation’