NoDupe algorithm to detect and group similar mass spectra.
Reducing the number of similar spectra in proteomic experiments: Why? Identifying peptides from spectral collections is time-consuming. Detecting similarities reduces the number of spectra to be processed. The dynamic exclusion feature of the mass spectrometer does not eliminate all duplicate spectra: a. Peptides may elute over a period of time. b. The peptide mixture may have high complexity.
MS/MS spectra from the same peptide may look different Differences in signal-to-noise ratio. Variations in collision energy. Random noise.
Finding the degree of similarity between two spectra A dot product comparison is used to measure similarity. A vector is built for each spectrum. Greater angles between the vectors imply greater differences between the spectra. Angles nearing zero imply considerable similarity.
NoDupe Algorithm Implemented in the Java programming language. Spectra are grouped based on their similarity. Preprocessing is done to reduce complexity. Optionally removes duplicate spectra from each LC run, retaining only one representative spectrum.
NoDupe: Preprocessing All fragment ions in a run are assigned to bins 1.0057 m/z wide. Intensities of peaks that fall in the same bin are added together. Peak intensities are normalized by the sum of the intensities of all peaks. Smaller peaks are emphasized. Peaks of very low intensity are removed. The sum of the square roots of the intensities is calculated. Only significant peaks are retained; the rest are discarded.
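The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the NoDupe source: the slide gives only the bin width, so the use of a square root as the emphasis step, the `min_fraction` noise threshold, and the `keep_fraction` rule for deciding which peaks count as significant are all assumptions made for the example.

```python
import math
from collections import defaultdict

BIN_WIDTH = 1.0057  # m/z bin width stated on the slide


def preprocess(peaks, min_fraction=0.001, keep_fraction=0.9):
    """Reduce a spectrum to its significant binned peaks.

    peaks: iterable of (mz, intensity) pairs.
    min_fraction and keep_fraction are hypothetical parameters;
    the slide does not give the actual thresholds.
    Returns a dict mapping bin index -> transformed intensity.
    """
    # Assign peaks to bins and add intensities that share a bin.
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / BIN_WIDTH)] += intensity

    total = sum(binned.values())
    if total == 0:
        return {}

    # Normalize by the summed intensity, drop very small peaks, and take
    # square roots so that smaller peaks carry relatively more weight
    # (assumed form of the emphasis step).
    rooted = {
        b: math.sqrt(v / total)
        for b, v in binned.items()
        if v / total >= min_fraction
    }

    # Keep the most intense peaks until they account for keep_fraction
    # of the sum of square roots (assumed significance rule).
    root_total = sum(rooted.values())
    kept, running = {}, 0.0
    for b, v in sorted(rooted.items(), key=lambda kv: kv[1], reverse=True):
        if running >= keep_fraction * root_total:
            break
        kept[b] = v
        running += v
    return kept
```

Lowering `keep_fraction` discards more of the weaker peaks, which mirrors the trade-off described on the slide between reducing complexity and keeping enough signal to compare spectra.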
Results of preprocessing
NoDupe: Finding similarities Scans are sorted based on the precursor m/z. Spectral contrast angles are calculated for pairs of spectra whose precursors are within 3 m/z of each other. i_a: peak intensity of spectrum A. i_b: peak intensity of spectrum B. θ: spectral contrast angle. For identical spectra, θ = 0. For completely dissimilar spectra, θ = π/2.
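The formula that this legend belongs to is not present in the text. A standard definition of the spectral contrast angle that agrees with the legend and with both limiting cases is the normalized dot product, written here with i_a and i_b as the intensities of spectra A and B in the same m/z bin and the sums taken over all bins. This is a reconstruction under that assumption, not a copy of the slide:

$$
\cos\theta = \frac{\sum i_a \, i_b}{\sqrt{\sum i_a^{2} \; \sum i_b^{2}}}
$$

For identical spectra the numerator equals the denominator, so cos θ = 1 and θ = 0. When the two spectra share no peaks, every product i_a i_b is zero, so cos θ = 0 and θ = π/2.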
Spectral contrast angles
The similarity angle cutoff is taken as 1.1 radians.
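A rough sketch of how the pairwise comparison might be organized is shown below. The 3 m/z precursor window and the 1.1 cutoff come from the slides; the angle is computed as the arccosine of the normalized dot product, and the cutoff is assumed to be in radians with smaller angles counting as similar. The function names and the `(scan_id, precursor_mz, vector)` tuple layout are hypothetical.

```python
import math

ANGLE_CUTOFF = 1.1       # cutoff from the slides, assumed to be radians
PRECURSOR_WINDOW = 3.0   # m/z window from the slides


def contrast_angle(a, b):
    """Spectral contrast angle between two {bin: intensity} vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(
        sum(v * v for v in a.values()) * sum(v * v for v in b.values())
    )
    if norm == 0:
        return math.pi / 2  # treat an empty spectrum as completely dissimilar
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def similar_pairs(spectra):
    """Return (scan_id, scan_id) pairs whose angle is below the cutoff.

    spectra: list of (scan_id, precursor_mz, {bin: intensity}) tuples.
    """
    ordered = sorted(spectra, key=lambda s: s[1])  # sort by precursor m/z
    pairs = []
    for i, (id_a, mz_a, vec_a) in enumerate(ordered):
        for id_b, mz_b, vec_b in ordered[i + 1:]:
            if mz_b - mz_a > PRECURSOR_WINDOW:
                break  # sorted order: no later scan can be in the window
            if contrast_angle(vec_a, vec_b) < ANGLE_CUTOFF:
                pairs.append((id_a, id_b))
    return pairs
```

Sorting first is what makes the early `break` valid: once one scan falls outside the window, every later scan does too, so most pairs are never compared.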
NoDupe: Selecting representative spectra A match count is calculated for each spectrum. Duplicates are detected based on the match count. Ties are broken based on the number of peaks removed during preprocessing.
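The selection step could look something like the sketch below. It assumes that a spectrum's match count is the number of other spectra in its group that it was found similar to, and that ties go to the spectrum that lost the larger share of its peaks in preprocessing (the direction suggested by the group-of-two example later in the slides). The function name and argument layout are hypothetical.

```python
from collections import Counter


def pick_representative(group, pairs, fraction_removed):
    """Choose one representative scan from a group of similar spectra.

    group: list of scan ids in the group.
    pairs: iterable of (scan_id, scan_id) pairs judged similar.
    fraction_removed: dict scan_id -> fraction of peaks removed
        during preprocessing (used only to break ties).
    """
    members = set(group)
    counts = Counter()
    for a, b in pairs:
        if a in members and b in members:
            counts[a] += 1
            counts[b] += 1
    # Highest match count wins; ties go to the larger fraction removed.
    return max(group, key=lambda s: (counts[s], fraction_removed[s]))
```

In a group of two, both spectra have a match count of one, so the tie-break alone decides which spectrum is kept.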
Samples used Gel band sample: protein complex from stable HEK 293 cells. Microtubule-associated protein sample: MAP purified from bovine brains. Rat hippocampus sample: protein from rat brains. Sample complexity varied from 18.3 to 34.6 spectra/min.
Experimental process LC separations were done for all three samples. The 2to3 algorithm was applied to remove spectral copies with incorrect charge state assignments. NoDupe was then used to reduce the number of spectra.
Observations A large number of peaks were removed. For the peptide VAAPEEHPVLLTEAPLNPK, approximately 70% of the peaks in the spectra were removed. Both the number of peaks and the relative standard deviation decreased. The relative standard deviation decreased from 26% to 20%.
Observations: Clusters Average cluster size was found to be around 4. Spectral pairs were the most common kind of cluster. Two-thirds of the spectra were not significantly similar to any other spectrum. High-confidence peptides were lost when duplicate spectra were removed.
Identifications lost 4 to 14% of the identifications were lost. Without removing the duplicate spectra, 5 to 19% of the identifications were lost. The angle is found to be 0.847.
For group size 2 Since there are only two spectra in this group, their match counts tie and the more representative one is chosen by the tie-break. Scan 491 is chosen because only 21% of its peaks remain, as opposed to 24% for the other spectrum. Since pairs are common, there might be a significant loss of protein identifications.
Lost spectra Scan 4892 was not found to be similar enough by NoDupe.
Duplicate spectra and peptides identified
Where it can be used Grouping results in substantial savings in time. Instead of finding the best sequence for each spectrum, a search can find the sequence that best matches all of the spectra in a group. The time savings are greater when the database is large. A narrower mass window can be used. Alleviates random matching. Spectral libraries will be more effective if they contain representative spectra rather than randomly chosen ones. Spectra that are in the same group but receive different identifications by de novo examination can be flagged.
Acknowledgments The paper presented was “Similarity among tandem mass spectra from proteomic experiments: detection, significance, and utility” by David L. Tabb, Michael J. MacCoss, Christine C. Wu, Scott D. Anderson, and John R. Yates. Thanks to Prof. Haixu Tang for guiding me.