Softberry Mass Spectra (SMS) processing tools
This is a collaborative project with Universal Prediction Limited (UK) (http://www.universal-prediction.com) for the analysis of mass spectra data.
Processing mass spectra: main analysis steps
1. Calibration
2. Data resampling
3. Data smoothing
4. Detection of the baseline and its subtraction from the intensity
5. Normalization
6. Peak identification
7. Peak alignment
8. Sample classification and patient outcome prediction from MS data
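Calibration: removing systematic noise from equipment
A spectrum with a set of calibration peaks is used to find the transform that moves these peaks to their known MZ positions. The same transform is then applied to the sample spectra.
[Figure: raw calibration spectrum with spectrum peak locations and calibration peak locations marked; sample spectrum before (raw data) and after (calibrated data) the calibration transform.]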
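The slide does not say which kind of transform is fitted. As a minimal sketch, assuming a linear transform fitted by least squares, the calibration could look like the code below. The function names and the calibration peak values are hypothetical and chosen only for illustration.

```python
def fit_linear_calibration(observed, known):
    """Least-squares fit of known = a * observed + b (assumed linear transform)."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(known) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, known))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def apply_calibration(mz_values, a, b):
    """Apply the fitted transform to every MZ value of a spectrum."""
    return [a * mz + b for mz in mz_values]


# Hypothetical calibration peaks: the observed positions are stretched by 0.1%
# and shifted by +0.5 relative to the known positions.
known = [1000.0, 2000.0, 3000.0]
observed = [k * 1.001 + 0.5 for k in known]

a, b = fit_linear_calibration(observed, known)
calibrated = apply_calibration(observed, a, b)
```

With this synthetic distortion the fitted transform moves the observed peaks back onto the known positions up to floating-point error. A real instrument may need a higher-order transform.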
Resampling: removing excessive data, transform to common MZ scale
Data resampling discards excessive data points and brings the MZ values to a common scale. As a result, different spectra have the same number of MZ values and are therefore comparable. Reducing the number of spectrum points lowers the noise and eliminates excessive data while preserving the spectrum shape.
[Figure: initial data and resampled data.]
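The resampling method itself is not described on the slide. One common choice, assumed here only for illustration, is linear interpolation onto a uniform MZ grid shared by all spectra. The helper names and the sample values are hypothetical.

```python
from bisect import bisect_right


def common_grid(start, stop, count):
    """Uniform MZ grid with `count` points from start to stop inclusive."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def resample(mz, intensity, grid):
    """Linearly interpolate a spectrum onto `grid`.

    `mz` must be sorted in ascending order. Grid points outside the measured
    range take the intensity of the nearest end point.
    """
    out = []
    for g in grid:
        if g <= mz[0]:
            out.append(intensity[0])
        elif g >= mz[-1]:
            out.append(intensity[-1])
        else:
            j = bisect_right(mz, g)  # mz[j - 1] <= g < mz[j]
            t = (g - mz[j - 1]) / (mz[j] - mz[j - 1])
            out.append(intensity[j - 1] + t * (intensity[j] - intensity[j - 1]))
    return out


# Hypothetical spectrum with unevenly spaced MZ values; the intensities lie on
# the line intensity = 10 * (mz - 100), so interpolation is easy to check.
mz = [100.0, 100.4, 101.1, 101.9, 103.0]
intensity = [0.0, 4.0, 11.0, 19.0, 30.0]

grid = common_grid(100.0, 103.0, 4)  # 100, 101, 102, 103
resampled = resample(mz, intensity, grid)
```

Every spectrum resampled onto the same grid has the same number of points, which is what makes the later steps comparable across samples.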
Smoothing: random noise elimination
The smoothing procedure eliminates random noise from the data. During smoothing, the intensity value at each point MZ_i is replaced by the average over several neighboring points.
[Figure: initial data and smoothed data.]
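Averaging over neighboring points can be sketched as a centered moving average. The window size and the handling of the spectrum edges are assumptions, since the slide does not specify them.

```python
def moving_average(intensity, half_window=2):
    """Replace each value by the mean of itself and its neighbors.

    The window covers `half_window` points on each side and is truncated at
    the edges of the spectrum (an assumption; other edge rules are possible).
    """
    n = len(intensity)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = intensity[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical noisy intensities with one spike at index 6.
noisy = [10.0, 12.0, 8.0, 11.0, 9.0, 10.0, 30.0, 10.0, 9.0, 11.0]
smoothed = moving_average(noisy, half_window=1)
```

A wider window removes more noise but also flattens narrow peaks, so the window is usually kept small relative to the peak width.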
Baseline processing
This step eliminates systematic artifacts that arise from the matrix and chemicals used in the experiments or from detector overload. These artifacts produce background noise that can be significant for some MZ values. The baseline is detected and subtracted from the intensity.
[Figure: initial data, baseline, and baseline-subtracted data.]
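The slide does not name the baseline detection algorithm. The sketch below assumes a simple sliding-window minimum as the baseline estimate and clips negative values to zero after subtraction. Both choices, and the sample values, are illustrative assumptions.

```python
def estimate_baseline(intensity, half_window=3):
    """Estimate the baseline as the minimum intensity in a sliding window."""
    n = len(intensity)
    return [
        min(intensity[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]


def subtract_baseline(intensity, baseline):
    """Subtract the baseline and clip negative results to zero."""
    return [max(0.0, y - b) for y, b in zip(intensity, baseline)]


# Hypothetical spectrum: two peaks on top of a slowly decreasing background.
signal = [5.0, 5.0, 6.0, 25.0, 6.0, 5.0, 4.0, 4.0, 18.0, 4.0, 3.0]
baseline = estimate_baseline(signal, half_window=2)
corrected = subtract_baseline(signal, baseline)
```

The window must be wider than the widest peak, otherwise the peak itself is treated as baseline and removed.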
Normalization: bring spectrum intensity to common scale
Normalization brings peak intensity values to a common scale, which makes it possible to compare data from different spectra.
[Figure: initial data and normalized data.]
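The normalization rule is not given on the slide. A minimal sketch, assuming scaling by total intensity so that every spectrum sums to the same target value, is shown below. The function name and the `target` parameter are hypothetical.

```python
def normalize_total_intensity(intensity, target=100.0):
    """Scale a spectrum so that its intensities sum to `target`."""
    total = sum(intensity)
    if total == 0:
        return list(intensity)  # nothing to scale in an all-zero spectrum
    scale = target / total
    return [y * scale for y in intensity]


# Two hypothetical spectra with the same shape but a tenfold intensity gap.
spectrum_a = [1.0, 3.0, 6.0]
spectrum_b = [10.0, 30.0, 60.0]

norm_a = normalize_total_intensity(spectrum_a)
norm_b = normalize_total_intensity(spectrum_b)
```

After scaling, the two spectra coincide, so differences that remain between real samples reflect shape rather than overall signal strength.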
Peak identification
This step searches the spectrum for peaks with a high signal-to-noise ratio. Peaks are identified as local maxima of the spectrum.
[Figure: sample intensity with peak locations marked.]
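Local-maximum detection with a signal-to-noise filter can be sketched as follows. The noise estimate (the median intensity of the spectrum) and the threshold `min_snr` are assumptions made for this example, not values from the slides.

```python
from statistics import median


def find_peaks(mz, intensity, min_snr=3.0):
    """Return (mz, intensity) pairs for local maxima above a noise threshold.

    Noise is approximated by the median intensity of the whole spectrum
    (an illustrative assumption).
    """
    noise = median(intensity)
    peaks = []
    for i in range(1, len(intensity) - 1):
        is_local_max = (intensity[i] > intensity[i - 1]
                        and intensity[i] >= intensity[i + 1])
        if is_local_max and noise > 0 and intensity[i] / noise >= min_snr:
            peaks.append((mz[i], intensity[i]))
    return peaks


# Hypothetical spectrum: two real peaks and one small bump at index 1.
mz = [1000.0 + i for i in range(9)]
intensity = [1.0, 1.2, 1.0, 9.0, 1.1, 1.0, 1.3, 6.0, 1.0]

peaks = find_peaks(mz, intensity)
```

The small bump at MZ 1001 is a local maximum but fails the signal-to-noise test, so only the two strong peaks are kept.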
Peak alignment
When several spectra are analyzed, the question naturally arises whether they share common peaks. To answer it, peak locations and intensities must be compared across the spectra of interest. All spectra to be compared must have been processed through the previous steps with the same parameters.
[Figure: Sample 1 and Sample 2 with common peaks and a sample-specific peak marked.]
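One simple way to find common peaks, assumed here for illustration, is to pool the peak positions of all samples, sort them, and start a new group whenever the gap to the previous peak exceeds a tolerance. The tolerance value, the function names, and the peak positions are hypothetical.

```python
def align_peaks(peaks_by_sample, tolerance=2.0):
    """Group peaks from all samples whose MZ values lie close together.

    A peak joins the current group if it is within `tolerance` of the last
    peak already in that group; otherwise it starts a new group.
    """
    pooled = sorted(
        (mz, name) for name, mzs in peaks_by_sample.items() for mz in mzs
    )
    groups = []
    for mz, name in pooled:
        if groups and mz - groups[-1][-1][0] <= tolerance:
            groups[-1].append((mz, name))
        else:
            groups.append([(mz, name)])
    return groups


def summarize(groups, n_samples):
    """Per-group statistics; `common` is True if every sample contributes."""
    summary = []
    for group in groups:
        masses = [mz for mz, _ in group]
        samples = {name for _, name in group}
        summary.append({
            "mean_mass": sum(masses) / len(masses),
            "min_mass": min(masses),
            "max_mass": max(masses),
            "num_peaks": len(group),
            "common": len(samples) == n_samples,
        })
    return summary


# Hypothetical peak positions for two samples.
peaks_by_sample = {
    "sample1": [1770.2, 2009.5, 2986.0],
    "sample2": [1770.9, 2010.1],
}
groups = align_peaks(peaks_by_sample)
summary = summarize(groups, n_samples=len(peaks_by_sample))
```

Here the first two groups contain a peak from each sample (common peaks), while the third group contains a peak found only in sample1 (a specific peak).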
Using LDA with MS data to predict patient outcome
Dataset description. MS data were taken from the work of Gammerman et al, 2008.
Control data: We used 153 control samples (no ovarian cancer detected) as the ‘NO’ dataset. In this work we treated the control samples as a general pool of healthy people.
Cancer patient data: The full dataset contains 75 samples from 18 patients with identified ovarian cancer (OC), taken from 0 to 75 months prior to diagnosis. To train the LDA classifier we used the samples from these patients taken 1 to 6 months prior to diagnosis (‘YES’ dataset, 17 samples).
MS data processing
We used the algorithms described in Gammerman et al, 2008 to preprocess mass spectra from 228 samples in total. The processing included:
1. Calibration
2. Resampling
3. Smoothing
4. Normalization
5. Peak identification
6. Peak alignment and peak group detection
As a result, 374 peak groups were detected across all sample data.
List of top 20 peak groups with highest representation in analyzed samples
Peak Group Index
PeakID  MeanMass   MinMass    MaxMass    NumPeaks (of 228)  Max Intensity
5       3191.554   3188.161   3193.358   211                45.57914
20      1770.479   1769.719   1772.318   195                29.40414
18      2009.877   2009.076   2012.017   193                30.74098
24      825.7725   825.2985   826.2407   189                26.30554
42      3333.192   3329.355   3334.906   184                19.21943
2       2026.901   2025.914   2029.441   177                53.74245
37      2267.009   2266.025   2268.258                      20.36678
17      2985.741   2983.11    2989.592   167                31.3781
90      2552.984   2551.655   2554.576   157                11.41295
8       1894.954   1894.057   1896.423   147                42.05795
78      2114.491   2111.304   2116.45                       13.52459
7       1863.654   1862.77    1864.733   144                42.50182
10      1449.102   1448.24    1451.12    136                35.4617
56      1584.659   1582.731   1586.55    133                15.79827
55      2567.124   2563.25    2568.585   132                16.01036
23      944.728    944.0944   945.2638   130                27.91649
3       2647.657   2646.315   2648.923   126                50.25482
6       6647.589   6635.569   6651.674   121                44.14933
12      1395.111   1394.238   1397.255   120                34.98699
Selection of LDA features for classification of cancer and non-cancer samples
We used an LDA function with 2 prediction features:
1. Logarithm of the CA125 serum tumor marker level.
2. Logarithm of the MS signal intensity within the peak group MZ range. For each MS spectrum: if a peak was present in the peak group, we took the logarithm of its intensity; if no peak was detected, we used the average signal intensity over the MZ range corresponding to the peak group in its place; if all intensity values within the peak group MZ range were zero, we set the log intensity value to -10.
Thus, the LDF (linear discriminant function) is LDF = a1*x1 + a2*x2 + b, where x1 is log(CA125 level) and x2 is log(peak intensity) for a given peak group.
We tested the utility of the x2 feature (MS intensity) for each of the peak groups with the largest number of peaks (listed on the previous slide); 20 peak groups were tested in total. For each peak group we calculated the LDF values and classified the samples as cancer or non-cancer. The classification performance (fraction of true predictions) was estimated for each of the 20 peak groups.
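The three-way rule for the intensity feature can be sketched as a small function. The natural logarithm is assumed (the data table uses the name lnCA125 for the other feature), and treating an empty MZ range like an all-zero one is an added assumption. The function name and the spectrum values are hypothetical.

```python
import math

ZERO_LOG_INTENSITY = -10.0  # fallback value stated in the text


def log_intensity_feature(peak_intensity, mz, intensity, mz_min, mz_max):
    """Log-intensity feature x2 for one sample and one peak group.

    peak_intensity: intensity of the detected peak in the group, or None
                    if no peak was detected for this sample.
    """
    if peak_intensity is not None and peak_intensity > 0:
        return math.log(peak_intensity)
    # No peak detected: fall back to the average signal in the MZ range.
    in_range = [y for x, y in zip(mz, intensity) if mz_min <= x <= mz_max]
    mean = sum(in_range) / len(in_range) if in_range else 0.0
    if mean <= 0:
        return ZERO_LOG_INTENSITY  # all-zero (or empty) range
    return math.log(mean)


# Hypothetical spectrum around the MZ range [2983.0, 2989.6].
mz = [2982.0, 2984.0, 2986.0, 2988.0, 2990.0]
intensity = [1.0, 2.0, 4.0, 2.0, 1.0]

with_peak = log_intensity_feature(4.0, mz, intensity, 2983.0, 2989.6)
without_peak = log_intensity_feature(None, mz, intensity, 2983.0, 2989.6)
all_zero = log_intensity_feature(None, mz, [0.0] * 5, 2983.0, 2989.6)
```

The fixed value of -10 keeps samples without any signal in the range usable as inputs, since the logarithm of zero is undefined.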
Example of data input for LDA analysis
The input for LDA classification is a table containing (1) sample index, (2) time of sampling (before diagnosis for cancer patients), (3) patient index (case), (4) logarithm of the CA125 level (denoted lnCA125), and (5-24) logarithm of the mass spectra peak intensity for each of the 20 peak groups (denoted MZ_NNNN, where NNNN is the mean mass value of the peak group).
Peak group 17 was found to provide the best LDA classification performance when used together with the CA125 level:
LDF = a1*logCA125 + a2*logPi17 + b; a1 = 5.247, a2 = -0.006, b = -18.639
Peak group 17 has mean mass 2985.741 (range 2983.11 to 2989.592), was detected in 167 of 228 samples, and has a maximum intensity of 31.3781.
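Evaluating the discriminant function with the quoted coefficients is a one-line computation. The sketch below assumes natural logarithms and a decision threshold of zero (positive LDF values indicating cancer, as the later slides use them). The sample CA125 and intensity values are hypothetical.

```python
import math

# Coefficients quoted in the text for CA125 combined with peak group 17.
A1, A2, B = 5.247, -0.006, -18.639


def ldf(ln_ca125, ln_intensity):
    """Linear discriminant function LDF = a1*x1 + a2*x2 + b."""
    return A1 * ln_ca125 + A2 * ln_intensity + B


def classify(ln_ca125, ln_intensity):
    """Assumed decision rule: a positive LDF value means 'cancer'."""
    return "cancer" if ldf(ln_ca125, ln_intensity) > 0 else "non-cancer"


# Hypothetical samples with the same peak intensity but different CA125.
low = classify(math.log(10.0), math.log(20.0))
high = classify(math.log(100.0), math.log(20.0))

# CA125 level at which LDF crosses zero for that peak intensity.
boundary_ca125 = math.exp((-B - A2 * math.log(20.0)) / A1)
```

Because a2 is small compared with a1, the decision boundary in this model depends almost entirely on the CA125 level, with the peak intensity shifting it only slightly.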
LDA classification example for peak group #17
This peak group is defined for peaks with MZ values in the range [2983.0, 2989.6]. The distribution of CA125 and peak intensities for control samples (blue points) and OC patients (1-6 months before diagnosis; red points) is shown below. Classification results are also shown: 7 control samples were classified as disease (< 5%).
[Figure: scatter plot with axes Log(CA125) and Log(I) at MZ=2986; legend: Cancer detected, Control.]
Number of samples=171 (control(0)=154; disease(1)=17)
Fraction of true predictions: 0.959064 [164]
Class 0:
  Fraction of true positives: 0.954545 [147]
  Fraction of false negatives: 0.045455 [7]
Class 1:
  Fraction of true positives: 1.000000 [17]
  Fraction of false negatives: 0.000000 [0]
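The reported fractions follow directly from the sample counts in brackets. The short check below recomputes them and also expresses them as sensitivity and specificity with cancer as the positive class. The variable names are chosen for this example.

```python
# Counts reported in the classification output above.
control_correct, control_wrong = 147, 7   # class 0 (control)
cancer_correct, cancer_wrong = 17, 0      # class 1 (disease)

n_control = control_correct + control_wrong
n_cancer = cancer_correct + cancer_wrong
n_total = n_control + n_cancer

# Fraction of true predictions over all samples.
accuracy = (control_correct + cancer_correct) / n_total

# With cancer as the positive class:
specificity = control_correct / n_control        # controls classified as control
false_positive_rate = control_wrong / n_control  # controls classified as disease
sensitivity = cancer_correct / n_cancer          # cancer samples classified as disease

print(f"samples={n_total} accuracy={accuracy:.6f}")
print(f"specificity={specificity:.6f} false positive rate={false_positive_rate:.6f}")
print(f"sensitivity={sensitivity:.6f}")
```

In these terms, the 7 misclassified control samples are false positives, and all 17 cancer samples in the training window are detected.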
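LDF values distribution for control and cancer samples
LDF calculated for features: CA125 and peak #17 intensity.
[Figure: distribution of LDF values for control samples and for samples taken 1-6 months prior to cancer detection; axes: LDF values and number of samples; the non-cancer and cancer regions are marked.]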
LDF value vs time before diagnosis
The LDF values were calculated for all samples from OC patients (18 patients). The results are shown below. The X axis is the time before diagnosis; the Y axis is the LDF value. For most samples the LDF value exceeds zero within 10 months before diagnosis. 5 samples show no increase in LDF values over this period (the corresponding patients have a small number of samples). One patient (ID 3480) has an LDF value greater than zero over the whole period. Thus positive LDF values based on CA125 and the MS peak intensity in the range [2983.0, 2989.6] can be used as OC markers for prognosis within 6 months.
[Figure: LDF value against time (months) for each patient ID; the cancer and non-cancer regions are marked, with the annotation "Wrong classification of non-cancer case for some OC patients at time=0".]