Published by Alfred Jennings. Modified over 8 years ago.
1
Serum Diagnosis of Chronic Fatigue Syndrome (CFS) Using Array-based Proteomics
Pingzhao Hu, W Le, S Lim, B Xing, CMT Greenwood and J Beyene
Hospital for Sick Children Research Institute and University of Toronto
The Sixth International Conference for the Critical Assessment of Microarray Data Analysis (CAMDA 2006)
Duke University, Durham, NC, U.S.A., June 8-9, 2006
2
Outline
1. Objectives
2. Data Set
3. Methods & Results
   3.1 Preprocessing (identification of biomarkers)
   3.2 Classification model
4. Conclusions
3
Objectives
- Identify biomarkers for CFS/CFS-like diseases using SELDI-TOF MS technology
- Evaluate how well the identified biomarkers distinguish patients with CFS/CFS-like disease from healthy people
- Determine the best experimental protocol for large-sample studies by choosing the best Fraction/Chip/Laser Energy combinations
4
Data Set

Group            CFS/CFS-like    Healthy control    Quality Control (QC)
Samples          31              32                 9
Replicates       2               2
Total spectra    2976            3072               864

Fraction: f1, f2, f3, f4, f5, f6
Chip: H50, IMAC30, High stringency CM10, Low stringency CM10
Laser energy: High, Low
QC samples: same as CFS data

- Each spectrum has ~30000 m/z values for high laser energy and ~20000 m/z values for low laser energy (note: the number of m/z values is larger in fractions f1 and f2)
- Each combination (Fraction/Chip/Laser energy) includes 144 spectra, i.e. (31+32+9)*2
- f1 and f2 were not analyzed because their numbers of spectra and m/z values differ from those of the other fractions
- QC samples were not analyzed here
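The spectrum counts on this slide can be checked with a few lines of arithmetic. The sketch below assumes that every sample, QC samples included, was run in duplicate on each of the 6 x 4 x 2 Fraction/Chip/Laser energy combinations. That assumption is implied by the (31+32+9)*2 formula but is not stated explicitly for the QC group.

```python
# Counts taken from the Data Set slide; two replicates for QC is an assumption.
samples = {"CFS/CFS-like": 31, "Healthy control": 32, "QC": 9}
replicates = 2
fractions = ["f1", "f2", "f3", "f4", "f5", "f6"]
chips = ["H50", "IMAC30", "High stringency CM10", "Low stringency CM10"]
laser_energies = ["High", "Low"]

# Number of Fraction/Chip/Laser energy combinations.
combinations = len(fractions) * len(chips) * len(laser_energies)

# Spectra per group over all combinations, and spectra within one combination.
total_spectra = {group: n * replicates * combinations for group, n in samples.items()}
spectra_per_combination = sum(samples.values()) * replicates

print(total_spectra)
print(spectra_per_combination)
```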
5
Data Analysis Pipeline
Preprocessing
- Baseline subtraction (already done)
- Trimming low m/z values
- Normalization
- Peak finding and alignment
- Quantification of aligned peaks
- Merging replicate samples
Classification
- Do a 10-fold cross-validation (CV)
- For each step of CV:
  - Split the preprocessed samples into training and test sets
  - Perform biomarker selection on the training set using t-tests
  - Build a prediction model on the training set with a kernel-based K-nearest neighbor (KNN) classifier
  - Evaluate performance on the test set
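The classification loop can be sketched as follows. The point it illustrates is that biomarker selection happens inside each fold, on the training samples only. The function names and the `select_features` / `fit_predict` callback signatures are hypothetical, chosen for illustration; the study itself was carried out with R packages.

```python
import random

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]

def cross_validate(X, y, select_features, fit_predict, n_folds=10, seed=0):
    """Return the CV accuracy; feature selection only ever sees the training fold."""
    correct = 0
    for test_idx in k_fold_indices(len(y), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Biomarker selection on the training set (returns column indices).
        keep = select_features(X_train, y_train)
        X_train_sel = [[row[j] for j in keep] for row in X_train]
        X_test_sel = [[X[i][j] for j in keep] for i in test_idx]
        # Build the predictor on the training set and label the test set.
        predictions = fit_predict(X_train_sel, y_train, X_test_sel)
        correct += sum(p == y[i] for p, i in zip(predictions, test_idx))
    return correct / len(y)
```

Selecting biomarkers on all samples before splitting would let test-set information leak into the model, which is why the selection step sits inside the loop.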
6
Trimming low m/z values
- Low laser energy allows peaks in the low mass range to be well visualized
- High laser energy improves visualization of peaks in the high mass range
- Many studies (e.g. Baggerly et al. 2003) indicated that there is a noisy m/z region near the lower limit where the machine cannot record stably
- For the above reasons, we trimmed low m/z values using the following thresholds:
  - For the low laser energy condition, we trimmed m/z values less than 100
  - For the high laser energy condition, we trimmed m/z values less than 2000
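A minimal sketch of the trimming step, assuming each spectrum is stored as two parallel lists of m/z values and intensities (the function name and data layout are illustrative, not taken from the slides):

```python
def trim_low_mz(mz, intensity, laser_energy):
    """Drop points whose m/z lies below the threshold for the given laser energy."""
    thresholds = {"low": 100.0, "high": 2000.0}  # thresholds stated on the slide
    cutoff = thresholds[laser_energy]
    kept = [(m, x) for m, x in zip(mz, intensity) if m >= cutoff]
    return [m for m, _ in kept], [x for _, x in kept]
```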
7
Global Normalization (Li 2005)
- Given a spectrum with intensities $X_i$ ($i = 1, \ldots, n$) for all $n$ m/z values, the normalized intensities $X_i^{\mathrm{norm}}$ can be computed by $X_i^{\mathrm{norm}} = s \cdot X_i$, where $s = \dfrac{\text{median of the total intensities among all spectra}}{\text{total intensity of the current spectrum}}$
- Multiplying the raw intensities by the factor $s$ gives every compared spectrum the same total intensity, namely the median total across spectra
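The scaling can be sketched as below, assuming each spectrum is a list of intensities with a non-zero total (the function name is illustrative):

```python
from statistics import median

def global_normalize(spectra):
    """Scale each spectrum so that its total intensity equals the median total."""
    totals = [sum(spectrum) for spectrum in spectra]
    target = median(totals)
    # s = target / total is the per-spectrum scaling factor.
    return [[target / total * x for x in spectrum]
            for spectrum, total in zip(spectra, totals)]
```

After normalization every spectrum sums to the same value, so intensity differences between spectra no longer reflect differences in overall signal.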
8
Peak finding -- Why
- The height of the peak intensity at a given m/z value indicates the presence and the approximate amount of the corresponding protein or peptide in the sample
- However, not every peak at an m/z value is related to a protein or even a part of a protein
- We need to search for those peaks that may represent a protein or a part of a protein
9
Peak finding -- Algorithm (Tuszynski 2006)
10
Peak Alignment -- Why
- Assume two peaks, R1 at m/z value L1 and R2 at m/z value L2, are detected in two spectra, respectively
- It is known that the m/z value of the same peak in different spectra may have a small shift (0.1%-0.3%)
- The shift must be adjusted so that peaks within the given shift interval, say m/z*(1-0.2%, 1+0.2%), are aligned to have the same m/z value
- The objective of alignment is to estimate the common m/z value L3 of the peaks in the given shift interval across spectra
[Figure: peaks R1 and R2 at m/z values L1 and L2, aligned to the common value L3]
11
Alignment -- Algorithm: maximal cliques & real representations (Li 2005, Gentleman 2001)
- Find maximal cliques: {1,2}, {3,4,5,6}, {7,8}, {8,9}
- Real representations: find the common m/z region for each maximal clique and estimate the aligned peak centers using maximum likelihood estimation (MLE)
[Figure: peak intervals R1-R9, with one group of peaks marked "Not a maximum clique" and the aligned peak centers indicated]
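The clique-finding step can be sketched with a sweep over interval endpoints. The sketch assumes that each detected peak is represented by the interval m/z*(1-0.2%, 1+0.2%) and that two peaks are connected when their intervals overlap. The slide does not spell out the edge definition, and the m/z values in the example are invented to reproduce the cliques listed above. The MLE estimation of the peak centers is not shown.

```python
def maximal_cliques(peak_mz, shift=0.002):
    """Maximal sets of peaks whose shift intervals all overlap one another.

    peak_mz maps a peak id to its m/z value. Each clique is a sorted list of ids.
    """
    events = []
    for peak_id, mz in peak_mz.items():
        events.append((mz * (1 - shift), 0, peak_id))  # 0: interval opens
        events.append((mz * (1 + shift), 1, peak_id))  # 1: interval closes
    events.sort()  # at equal positions, openings are processed before closings

    active, grew, cliques = set(), False, []
    for _, kind, peak_id in events:
        if kind == 0:
            active.add(peak_id)
            grew = True
        else:
            # The active set is maximal only if it grew since the last closing.
            if grew:
                cliques.append(sorted(active))
                grew = False
            active.remove(peak_id)
    return cliques

# Hypothetical peak positions chosen to mirror the cliques on the slide.
example = {1: 1000.0, 2: 1002.0, 3: 2000.0, 4: 2001.0, 5: 2002.0,
           6: 2003.0, 7: 5000.0, 8: 5015.0, 9: 5030.0}
print(maximal_cliques(example))
```

Peak 8 appears in two cliques because its interval overlaps those of peaks 7 and 9, while peaks 7 and 9 do not overlap each other; {7,8,9} is therefore not a clique.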
12
Quantify aligned peaks for individual spectra
- Each aligned peak location (m/z) can be treated as an interval, m/z*(1-0.2%, 1+0.2%)
- The intensity at each aligned peak location in an individual spectrum can be quantified by the maximum intensity within the interval
Figure legend:
- Black: raw m/z values with intensities
- Red: aligned peak m/z value; it has no associated peak in the raw data
- Blue: left and right limits of the interval around the aligned peak
- The intensity is quantified as the intensity of the aligned peak location (red)
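A minimal sketch of the quantification rule, assuming a spectrum is given as parallel lists of m/z values and intensities. Returning 0.0 when no raw point falls inside the interval is an assumption made for illustration, not something stated on the slide.

```python
def quantify_peak(mz, intensity, aligned_mz, shift=0.002):
    """Maximum raw intensity inside the interval aligned_mz * (1 - shift, 1 + shift)."""
    lower, upper = aligned_mz * (1 - shift), aligned_mz * (1 + shift)
    window = [x for m, x in zip(mz, intensity) if lower <= m <= upper]
    return max(window) if window else 0.0  # empty-window fallback is an assumption
```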
13
Merging replicate samples
- After quantification of the aligned peaks in all individual spectra, we averaged the intensities of the two replicates for each sample
- The averaged intensities were used to build our prediction model
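The averaging can be sketched as follows, assuming the quantified intensities are stored in a dictionary keyed by (sample id, replicate number). That layout and the function name are hypothetical.

```python
from statistics import mean

def merge_replicates(quantified):
    """Average the peak intensities of all replicates belonging to one sample.

    quantified: {(sample_id, replicate): [intensity per aligned peak]}
    returns:    {sample_id: [mean intensity per aligned peak]}
    """
    by_sample = {}
    for (sample_id, _replicate), values in quantified.items():
        by_sample.setdefault(sample_id, []).append(values)
    return {sample_id: [mean(column) for column in zip(*replicates)]
            for sample_id, replicates in by_sample.items()}
```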
14
Predictor -- K-Nearest Neighbor (KNN) Method
- To classify a new input vector (observation) v, examine the k closest training data points to v and assign v to the most frequently occurring class
- The neighborhood is defined by a mathematical distance measure
- Deficiency: the individual points in a neighborhood may have very different similarities to v (distances from v), but they all have the same influence on the prediction
[Figure: query point x with neighborhoods for k=1 and k=6]
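A minimal sketch of the plain majority-vote rule, assuming Euclidean distance and feature vectors stored as lists (the function name is illustrative):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, v, k=3):
    """Assign v to the most frequent class among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], v))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Every one of the k neighbors casts one equal vote here, regardless of how far it is from v, which is the deficiency noted on the slide.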
15
Predictor -- Kernel-based KNN Method (Hechenbichler and Schliep 2004)
- To classify a new observation v, find the k+1 nearest neighbors of v according to Euclidean distance d
- The (k+1)st neighbor is used to standardize the k smallest distances: $D(i) = D(v, v_{(i)}) = d(v, v_{(i)}) / d(v, v_{(k+1)})$, $i = 1, \ldots, k$
- Transform the normalized distance D(i) into a weight using a Gaussian kernel function K(.): $w(i) = K(D(i))$
- Assign a prediction label to v based on $\hat{y} = \arg\max_r \sum_{i=1}^{k} w(i)\, I(y_{(i)} = r)$, where $y_{(i)}$ is the class of the i-th neighbor, either CFS/CFS-like (r=1) or non-CFS (r=0)
- k is implicit in the weights: if k is too large, it is effectively reduced automatically, because a small number of neighbors with large weights dominate the others, whose very small weights have no influence on the prediction
- We set k=3 in this study
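The weighted vote can be sketched as below. The kernel used here is the unnormalized Gaussian exp(-D^2/2), which is an assumption: the kknn package may scale distances differently before applying its kernel. The sketch also assumes at least k+1 training points and a non-zero distance to the (k+1)st neighbor.

```python
import math

def kernel_knn_predict(train_X, train_y, v, k=3):
    """Kernel-weighted KNN: closer neighbors receive larger voting weights."""
    ranked = sorted((math.dist(x, v), label) for x, label in zip(train_X, train_y))
    scale = ranked[k][0]  # distance to the (k+1)st neighbor, used for standardization
    scores = {}
    for distance, label in ranked[:k]:
        D = distance / scale                # standardized distance
        weight = math.exp(-D * D / 2)       # Gaussian kernel (assumed form)
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)      # class with the largest summed weight

# Two very close class-1 points outweigh three distant class-0 points at k=5.
X = [[0.01], [0.02], [3.0], [3.1], [3.2], [3.3]]
y = [1, 1, 0, 0, 0, 0]
print(kernel_knn_predict(X, y, [0.0], k=5))
```

In this toy example an unweighted 5-nearest-neighbor vote would pick class 0 by three votes to two, whereas the distance weights favor the two much closer class-1 points.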
16
Results
We first define some concepts used in this section:
- Condition: experimental protocol (Fraction/Chip/Laser energy)
- Biomarkers: the aligned peaks
- Differentially expressed biomarkers: aligned peaks with t-test p-values less than 0.05
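The selection rule can be sketched in plain Python. The slides say only "t-test", so the pooled-variance (Student) two-sample test is an assumption here, as are the function names. The p-value is obtained by numerically integrating the t density with Simpson's rule, which avoids any dependency beyond the standard library.

```python
import math
from statistics import mean, variance

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """P(|T| > |t|), integrating the density over [0, |t|] with Simpson's rule."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)

def t_test_p(group_a, group_b):
    """Two-sample t-test p-value with pooled variance (assumed variant)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return two_sided_p(t, df)

def select_biomarkers(X, y, alpha=0.05):
    """Indices of aligned peaks whose p-value is below alpha (labels: 1 = case, 0 = control)."""
    selected = []
    for j in range(len(X[0])):
        cases = [row[j] for row, label in zip(X, y) if label == 1]
        controls = [row[j] for row, label in zip(X, y) if label == 0]
        if t_test_p(cases, controls) < alpha:
            selected.append(j)
    return selected
```

The sketch assumes each peak has non-zero variance in at least one group; a constant peak would cause a division by zero.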
17
The number of biomarkers identified in each condition, and the number significant (p<0.05)
High Laser Energy

Condition (1)               # of biomarkers identified    # of differentially expressed biomarkers (p<0.05)
H50-F4                      235                           40
IMAC-F5                     613                           10
CM10 High Stringency-F4     315                           29
CM10 High Stringency-F5     1220                          52
CM10 High Stringency-F6     1053                          24

(1) Only conditions (32 conditions in total) with at least 2 differentially expressed peaks (p<=0.05) are listed
18
The number of biomarkers identified in each condition, and the number significant (p<0.05) -- Cont.
Low Laser Energy

Condition                   # of biomarkers    # p<0.05
H50-F4                      351                9
H50-F6                      299                9
CM10 High Stringency-F3     417                20
CM10 High Stringency-F4     409                4
CM10 High Stringency-F5     809                12
CM10 High Stringency-F6     706                40
CM10 Low Stringency-F3      328                5
CM10 Low Stringency-F4      376                3
CM10 Low Stringency-F5      370                12
IMAC-F3                     391                9
IMAC-F4                     418                7
IMAC-F5                     449                36
IMAC-F6                     341                4
19
Comments
- Using low laser energy, 13 conditions (Fraction/Chip) yielded at least two differentially expressed peaks/biomarkers (p<=0.05)
- Using high laser energy, only 5 conditions (Fraction/Chip) yielded at least two differentially expressed peaks/biomarkers (p<=0.05)
20
Performance of the kernel-based KNN predictors using selected biomarkers in each condition

Laser energy    Chip                    Fraction    Accuracy (%)    AUC (%)    # of biomarkers (1)
High            H50                     F4          66.7            64.6       16
High            CM10 High Stringency    F6          68.3            70.6       3
Low             H50                     F6          79.4            77.6       8
Low             H50                     F4          61.9            58.1       15
Low             IMAC                    F5          61.9            62.3       3
Low             CM10 Low Stringency     F3          60.3            67.9       2
Low             CM10 Low Stringency     F4          61.9            60.9       4
Low             CM10 Low Stringency     F5          60.3            59.6       15
Low             CM10 High Stringency    F3          69.8            76.1       13
Low             CM10 High Stringency    F6          61.9            61.1       5

(1) The number of biomarkers selected in each of the 10 cross-validations
Only conditions with accuracy above 60% are listed
21
Biomarkers used in building the prediction model for condition: H50, Low laser energy, and F6

m/z            p-value (t-test)*    Times**
500.4684134    0.002301636          10
526.3221611    0.005358258          10
500.9326675    0.011475143          10
7784.425103    0.013897058          10
501.8618258    0.018222865          10
502.7918491    0.02142476           8
527.5130904    0.023476904          6
499.5405451    0.026054199          8
526.7982502    0.042868796          3
501.3971417    0.055007619          1
791.5120389    0.062007361          1
6915.027324    0.101860914          1
4276.036433    0.10220098           1
15483.83627    0.305629302          1

* 9 of 299 peaks (after alignment) have p-values less than 0.05
** The number of times the biomarker was picked in the 10 cross-validations

Three m/z value ranges seem to be interesting:
- 499-503
- 526-528
- 7784-7785
22
Conclusions
- Based on our analysis, the best combination (laser energy, chip and fraction) appears to be Low laser energy/H50/Fraction 6
- For this combination we identified 9 differentially expressed biomarkers (p-value<=0.05), which are located in 3 m/z value ranges:
  - 499-503
  - 526-528
  - 7784-7785
- Using 14 biomarkers identified from this combination, our predictor can reach ~80% accuracy
23
Limitations
- For all combinations of experimental protocol, we used the same m/z shift interval, m/z*(1-0.2%, 1+0.2%). A better choice might be obtained by estimating the interval for each combination from the QC samples
- We did not take the multiple testing issue into account in this analysis
24
Acknowledgements
We used the following R packages to perform the analysis in this study:
- caMassClass (Jarek Tuszynski)
- PROcess (Xiaochun Li)
- kknn (Klaus Hechenbichler and Klaus Schliep)
This research was supported by funding from the Ontario Genomics Institute and Genome Canada, through the Centre for Applied Genomics.