A Robust and Accurate Binning Algorithm for Metagenomic Sequences with Arbitrary Species Abundance Ratio Zainab Haydari Dr. Zelikovsky Summer 2011
Metagenomics Environmental Genomics Metagenomes: genetic material recovered directly from environmental samples
Metagenomics (Data Analysis) A major difficulty of metagenomics is that most bacteria (up to 99%) found in environmental samples are unknown and cannot be cultivated and isolated under laboratory conditions. One possible solution is to directly sequence DNA fragments of multiple species obtained from the mixed environmental DNA sample. The task is then the identification and taxonomic characterization of the DNA fragments that result from sequencing a sample of mixed species. Binning: group DNA fragments from similar species.
History Similarity-based methods: align each DNA fragment to known reference genomes. Limited by the availability of known microorganism genomes (<1%).
History Composition-based methods: group DNA fragments using genetic features such as genome structure or composition. Low availability and reliability of taxonomic markers. Some species may share multiple markers with other species.
What’s the best? The more promising approach is to use unsupervised binning algorithms based on the occurrence frequencies of l-mers. It rests on the observation that the l-mer distributions of fragments from the same genome are more similar to each other than the l-mer distributions of fragments from two unrelated species.
l-mer Based Methods Tetra (Teeling et al., 2004) MetaCluster (Yang et al., 2009) MetaCluster 2.0 (Yang et al., 2010) AbundanceBin (Wu and Ye, 2010) MetaCluster 3.0 (Yang et al., 2011)
MetaCluster 3.0 Based on l-mer frequencies. Two steps: top-down clustering, then bottom-up merging.
MetaCluster 3.0
l-mer frequency The DNA composition features of each DNA fragment are represented by the l-mer frequencies of the fragment. With n(l) kinds of l-mers, the DNA feature vector is [f_1, f_2, …, f_n(l)], where f_w is the frequency of the w-th l-mer and n(l) is the number of different l-mers.
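The feature vector described above can be sketched in a few lines of Python. This is only an illustration: the function name `lmer_frequencies` is hypothetical, and it assumes that each of the 4**l words over A, C, G, T gets its own entry (no merging of reverse-complement l-mers) and that windows containing other symbols are skipped.

```python
from itertools import product


def lmer_frequencies(fragment, l=4):
    """Return the l-mer frequency vector [f_1, ..., f_n(l)] of a DNA fragment.

    Assumption: n(l) = 4**l, one entry per word over A, C, G, T in
    lexicographic order; windows with other symbols (e.g. N) are ignored.
    """
    lmers = ["".join(p) for p in product("ACGT", repeat=l)]
    index = {w: i for i, w in enumerate(lmers)}
    counts = [0] * len(lmers)
    total = 0
    seq = fragment.upper()
    # Slide a window of length l over the fragment and count each word.
    for i in range(len(seq) - l + 1):
        w = seq[i:i + l]
        if w in index:
            counts[index[w]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(lmers)
    # Normalise counts to relative frequencies.
    return [c / total for c in counts]
```

For l = 4 this gives a vector of 256 entries per fragment, which is the input to the distance computation on the following slides.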
l-mer frequency l=4 is best for DNA fragment sizes of 1000 to 10,000 (Chor et al. & Zhou et al.). Observation: the l-mer distributions of DNA substrings (fragments) from the same genome are similar. The l-mer distributions of reads are compared using the Spearman distance.
Spearman distance distribution The difference between the l-mer distributions of two fragments, A: (a_1, a_2, …, a_n(l)) and B: (b_1, b_2, …, b_n(l)). Let r(a_i) be the rank of a_i in the sorted list of the a_i’s and r(b_i) be the rank of b_i in the sorted list of the b_i’s; the distance is computed from these ranks. The smaller the value of the metric, the more similar the vectors are. For vectors of size k, the distance value can range between 0 and k(k+1).
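The exact formula did not survive on this slide, so the following is a sketch under an explicit assumption: it uses the footrule form of the Spearman distance, the sum of absolute differences between the ranks of corresponding entries. The helper names `ranks` and `spearman_distance` are hypothetical, and ties are broken by position, which may differ from the original implementation.

```python
def ranks(values):
    """Rank of each entry in the sorted list (0-based; ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r


def spearman_distance(a, b):
    """Sum of absolute rank differences between two feature vectors.

    Assumption: footrule form; the slide's own formula is not available.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    ra, rb = ranks(a), ranks(b)
    return sum(abs(x - y) for x, y in zip(ra, rb))
```

Because only ranks enter the sum, a single entry with an unexpectedly large value changes the distance no more than any other entry that moves by the same number of rank positions.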
Spearman distance distribution Benefits compared with other distance metrics that rely on the exact value of each entry in the feature vectors: it is less sensitive to entries with unexpectedly large values, and it gives a more global view of the distance between two feature vectors with respect to all the entries.
Spearman distance distribution Observation from an empirical study of 1000 genomes: the distribution of Spearman distances between the l-mer distributions of two fragments from the same species, and the corresponding distribution for fragments from species of different families, can each be approximated by a normal distribution. There is a significant difference between these two distributions.
Spearman distance distribution
Top-down clustering K-median algorithm: cluster the fragments into k’ clusters. Repeatedly assign each feature vector to the closest cluster, then select a feature vector as the center of each cluster with the following objective function:
Top-down clustering The K-median algorithm is a greedy algorithm. It is repeated several dozen times with different initial cluster centers, and the clustering with the minimum objective function value is selected.
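A minimal sketch of this kind of restarted K-median clustering is given below. The objective function on the slide was lost, so the code assumes a common choice: each center is the cluster member with the smallest total distance to the other members, and the overall cost is the sum of distances from every vector to its center. The function name `k_median` and the parameters `restarts`, `max_iter` and `seed` are hypothetical, and `dist` is any distance function on feature vectors (for example a Spearman distance).

```python
import random


def k_median(vectors, k, dist, restarts=30, max_iter=100, seed=0):
    """Greedy K-median with random restarts; returns (labels, centers, cost).

    Assumption: cost = sum of distances from each vector to its cluster center.
    Centers are indices into `vectors`.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        # Pick k distinct vectors as the initial centers.
        centers = rng.sample(range(len(vectors)), k)
        labels = None
        for _ in range(max_iter):
            # Assignment step: each vector goes to its closest center.
            new_labels = [
                min(range(k), key=lambda c: dist(v, vectors[centers[c]]))
                for v in vectors
            ]
            if new_labels == labels:
                break
            labels = new_labels
            # Update step: the new center is the member with the smallest
            # total distance to the other members of its cluster.
            for c in range(k):
                members = [i for i, lab in enumerate(labels) if lab == c]
                if members:
                    centers[c] = min(
                        members,
                        key=lambda m: sum(dist(vectors[m], vectors[j]) for j in members),
                    )
        cost = sum(dist(v, vectors[centers[lab]]) for v, lab in zip(vectors, labels))
        # Keep the restart with the minimum objective value.
        if best is None or cost < best[2]:
            best = (labels, list(centers), cost)
    return best
```

Each restart is cheap but can stop in a poor local optimum, which is why the result with the minimum objective over several dozen restarts is kept.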
Top-down clustering (k’ determination ) Distance between each fragment and the center from the same species: Distance between each fragment and the center from the different species:
Top-down clustering (k’ determination) The expected number of false positives: Since the expected number of false positives decreases with the value of k’, MetaCluster 3.0 increases the value of k’ until the expected number of false positives in a cluster is ≤ tn/k’. Set t = 5% so that the expected accuracy is over 95% for the first phase. k’ can be much larger than the number of species, so a bottom-up merging phase follows.
Bottom-up merging Goal: merge clusters from the same species into one cluster. Merging is based on an intercluster similarity measure, the intercluster distance: the average of all distances between pairs of DNA fragments A in C_1 and B in C_2.
Bottom-up merging Known k: greedily merge the pair of clusters with the minimum intercluster distance until k clusters remain.
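The greedy merging for a known k can be sketched as follows. The names `intercluster_distance` and `merge_to_k` are hypothetical; clusters are plain lists of feature vectors, and `dist` is whatever pairwise distance is in use. The sketch recomputes all pairwise averages on every round, which is simple but not efficient.

```python
def intercluster_distance(c1, c2, dist):
    """Average of all distances between a member of c1 and a member of c2."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def merge_to_k(clusters, k, dist):
    """Greedily merge the closest pair of clusters until only k remain."""
    clusters = [list(c) for c in clusters]  # work on a copy
    while len(clusters) > k:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        # Pick the pair with the minimum intercluster distance.
        i, j = min(
            pairs,
            key=lambda p: intercluster_distance(clusters[p[0]], clusters[p[1]], dist),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i stays valid
    return clusters
```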
Bottom-up merging Unknown k: based on the observation that the Spearman distance between two fragments from the same species is smaller than that between fragments from different species. Let the average intracluster distances of C_1 and C_2 be d_1 and d_2. Merge two clusters if and only if the intercluster distance dist(C_1, C_2) is similar to d_1 and d_2, that is, α·dist(C_1, C_2) ≤ average(d_1, d_2) for some threshold α.
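The merge test for an unknown k translates almost directly into code. In this sketch the function names are hypothetical, the threshold `alpha` is passed in rather than derived (how it is calculated is not recoverable from the slides), and a cluster with a single member is assumed to have intracluster distance 0.

```python
from itertools import combinations


def intracluster_distance(cluster, dist):
    """Average distance over all pairs inside one cluster.

    Assumption: a cluster with fewer than two members has distance 0.0.
    """
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def should_merge(c1, c2, dist, alpha):
    """True iff alpha * dist(C1, C2) <= average(d1, d2)."""
    d1 = intracluster_distance(c1, dist)
    d2 = intracluster_distance(c2, dist)
    inter = sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
    return alpha * inter <= (d1 + d2) / 2
```

With this rule, two clusters whose members are about as far from each other as they are from members of their own cluster get merged, while clusters separated by a much larger distance stay apart.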
Bottom-up merging Calculating threshold
Results Source: 240 randomly selected bacterial genomes, with complete references from NCBI, were used to generate 1080 test datasets. Comparison between MetaCluster 2.0, MetaCluster 3.0 and AbundanceBin. Different species with abundance ratios varying from 1:1 to 1:24.
Result Overall performance of all datasets
Result i) Family: DNA fragments from the same order but different families ii) Order: DNA fragments from the same class but different orders iii) Class: DNA fragments from different classes
Result Performance for Class, Order and Family datasets
A novel method Based on the Shannon entropy Shannon entropy: The information content of a signal. Entropy is a measure of “disorder” in a signal. Entropy shows the “diversity”.
Algorithm For each read: calculate the Shannon entropy of its k-mers for k starting from 1, and choose the largest Shannon entropy. Sort the reads by Shannon entropy. Cluster the reads based on the entropy.
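The first steps of this algorithm can be sketched as below. Several details are assumptions, not taken from the slides: the entropy uses the standard Shannon formula with a base-2 logarithm over the relative frequencies of the k-mers observed in the read, k runs from 1 to a hypothetical upper limit `max_k`, and the final clustering step is left out because the slides do not specify it. All function names are hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(read, k):
    """Shannon entropy (in bits) of the k-mer distribution of one read."""
    n = len(read) - k + 1
    if n <= 0:
        return 0.0
    counts = Counter(read[i:i + k] for i in range(n))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def max_entropy(read, max_k=6):
    """Largest entropy over k = 1..max_k (max_k is an assumed cut-off)."""
    return max(kmer_entropy(read, k) for k in range(1, max_k + 1))


def sort_reads_by_entropy(reads, max_k=6):
    """Return the reads ordered by their largest k-mer entropy, ascending."""
    return sorted(reads, key=lambda r: max_entropy(r, max_k))
```

A read made of a single repeated base has entropy 0 for every k, while a read that uses all four bases equally often reaches 2 bits already at k = 1.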
Example A DNA read: CACGACACGCCATTGACTAGCAGTGTCTGATGCAGAAACC. The entropy is calculated for l-mers of each length: S_1, S_2, S_3, S_4, … Reads that come from the same species are expected to have similar entropy distributions.
Implementation I have implemented the algorithm. The developed code calculates the entropy for different l-mers automatically. The Entropy is defined as,
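The formula itself is missing from this slide. Assuming the standard Shannon definition applied to the l-mers of a read, with p_w the relative frequency of l-mer w in the read and the sum running over the l-mers that occur, the entropy for a given l would read as follows. The base of the logarithm is an assumption; it only rescales the values.

```latex
S_l = -\sum_{w} p_w \log_2 p_w ,
\qquad
p_w = \frac{\text{number of occurrences of } w \text{ in the read}}{\text{total number of } l\text{-mers in the read}}
```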
Results I used data from a dataset that contains 2000 DNA fragments from "Acinetobacter_baumannii_SDF". The length of each DNA fragment is 1000 bp.
Result The Shannon entropy for different k for 2 reads from the same species: