Serum Diagnosis of Chronic Fatigue Syndrome (CFS) Using Array-based Proteomics
Pingzhao Hu, W Le, S Lim, B Xing, CMT Greenwood and J Beyene
Hospital for Sick Children Research Institute and University of Toronto
The Sixth International Conference for the Critical Assessment of Microarray Data Analysis (CAMDA 2006)
Duke University, Durham, NC, U.S.A.
June 8-9, 2006

Outline
1. Objectives
2. Data Set
3. Methods & Results
   3.1 Preprocessing (identification of biomarkers)
   3.2 Classification model
4. Conclusions

Objectives
- Identify biomarkers for CFS/CFS-like diseases using SELDI-TOF MS technology
- Evaluate the performance of the identified biomarkers in distinguishing patients with CFS/CFS-like diseases from healthy people
- Determine the best experimental protocol for large sample studies by choosing the best Fraction/Chip/Laser Energy combinations

Data Set

Group                  Samples   Replicates   Total spectra
CFS/CFS-like           31        2            2976
Healthy control        32        2            3072
Quality Control (QC)   9         2            864

Fractions: f1, f2, f3, f4, f5, f6
Chips: H50, IMAC30, High stringency CM10, Low stringency CM10
Laser energy: High, Low
(QC samples: same fractions, chips and laser energies as the CFS data)

- Each spectrum has ~30000 m/z values for high laser energy and ~20000 m/z values for low laser energy (note: the number of m/z values in fractions f1 and f2 is larger).
- Each combination (Fraction/Chip/Laser energy) includes 144 spectra: (31+32+9)*2.
- Fractions f1 and f2 were not analyzed, since their numbers of spectra and m/z values differ from those of the other fractions.
- QC samples were not analyzed here.

Data Analysis Pipeline
- Preprocessing
  - Baseline subtraction (already done)
  - Trimming low m/z values
  - Normalization
  - Peak finding and alignment
  - Quantification of aligned peaks
  - Merging replicate samples
- Classification
  - Do a 10-fold cross-validation (CV)
  - For each step of CV
    - Split samples of preprocessed data into training and test sets
    - Perform biomarker selection on the training set using t-tests
    - Build the prediction model on the training set: kernel-based K-nearest neighbor (KNN) classifier
    - Evaluate performance using the test set
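
The classification half of the pipeline can be sketched as a loop in which biomarker selection is repeated inside every fold, so the held-out samples never influence which peaks are chosen. This is a minimal illustration, not the code used in the study: the function name `cross_validate` and the two callbacks `select` and `fit_predict` are hypothetical placeholders for the t-test selection and the kernel-based KNN classifier, and the fold assignment by shuffling is an assumption.

```python
import random


def cross_validate(X, y, select, fit_predict, n_folds=10, seed=0):
    """Overall accuracy from n-fold CV with feature selection inside each fold.

    X: list of samples, each a list of aligned-peak intensities.
    y: list of class labels (e.g. 1 = CFS/CFS-like, 0 = control).
    select(X_train, y_train) -> list of column indices      (hypothetical callback)
    fit_predict(X_train, y_train, X_test) -> list of labels  (hypothetical callback)
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)                 # assumed: random fold assignment
    folds = [idx[f::n_folds] for f in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        cols = select(X_tr, y_tr)                    # selection sees training data only
        X_tr_sel = [[row[c] for c in cols] for row in X_tr]
        X_te_sel = [[X[i][c] for c in cols] for i in test_idx]
        preds = fit_predict(X_tr_sel, y_tr, X_te_sel)
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)
```

Because `select` is called once per fold, a peak can be chosen in some folds and not in others, which is why a per-peak selection count over the 10 folds is a meaningful summary.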

Trimming low m/z values
- Low laser energy allows peaks in the low mass range to be well visualized
- High laser energy improves visualization of peaks in the high mass range
- Many studies (e.g. Baggerly et al. 2003) indicated that there is a noisy m/z region near the lower limit, where the machine cannot record stably.
- For the above reasons, we trimmed low m/z values using the following thresholds:
  - For the low laser energy condition, we trimmed m/z values less than 100
  - For the high laser energy condition, we trimmed m/z values less than 2000
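
The trimming rule above amounts to a simple filter on each spectrum. The sketch below assumes a spectrum is stored as two parallel lists of m/z values and intensities; the function name `trim_low_mz` and the `"low"`/`"high"` keys are illustrative choices, while the two thresholds are the ones stated on the slide.

```python
def trim_low_mz(mz, intensity, laser_energy):
    """Drop points whose m/z is below the threshold for the given laser energy."""
    thresholds = {"low": 100.0, "high": 2000.0}   # thresholds from the slide
    cutoff = thresholds[laser_energy]
    kept = [(m, x) for m, x in zip(mz, intensity) if m >= cutoff]
    return [m for m, _ in kept], [x for _, x in kept]
```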

Global Normalization (Li 2005)
- Given a spectrum with intensities $X_i$ ($i = 1, \ldots, n$) for all $n$ m/z values, the normalized intensities $X_i^{\mathrm{norm}}$ can be computed by

  $$X_i^{\mathrm{norm}} = s \cdot X_i, \qquad s = \frac{\text{median of the total intensities among all spectra}}{\text{total intensity of the current spectrum}}$$

- Multiplying the raw intensities by the factor s equalizes the total intensity among the compared spectra, setting each total to the median (or mean) of the totals.
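
As a small worked illustration of this scaling, the sketch below rescales every spectrum so that its total intensity equals the median total. It assumes all spectra are baseline-subtracted lists of intensities with a positive total; the function name `global_normalize` is illustrative and is not the PROcess API.

```python
from statistics import median


def global_normalize(spectra):
    """Scale each spectrum by s = median(total intensities) / its own total."""
    totals = [sum(s) for s in spectra]
    target = median(totals)
    return [[x * target / t for x in s] for s, t in zip(spectra, totals)]
```

For example, three spectra with totals 2, 4 and 8 are scaled by 2, 1 and 0.5, after which all three have total intensity 4.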

Peak finding -- Why
- The height of peak intensities at certain m/z values indicates the presence and the approximate amount of the corresponding proteins or peptides in the sample
- However, not all peaks at an m/z value are related to a protein or even a part of a protein
- We need to search for those peaks that may represent a protein or a part of a protein

Peak finding-- Algorithm (Tuszynski 2006)

Peak Alignment -- Why
- Assume two peaks, R1 at m/z value L1 and R2 at m/z value L2, are detected in two spectra, respectively.
- It is known that the m/z value of the same peak in different spectra may have a small shift (0.1%-0.3%).
- The shift must be adjusted so that peaks in the given shift interval (say, m/z*(1-0.2%, 1+0.2%)) are aligned to have the same m/z value
- The objective of alignment is to estimate the common m/z value L3 of the peaks in the given shift interval across spectra
[Figure: peaks R1 and R2 at m/z values L1 and L2, with the common aligned value L3]

Alignment -- Algorithm: Maximal cliques & real representations (Li 2005, Gentleman 2001)
- Find maximal cliques: {1,2}, {3,4,5,6}, {7,8}, {8,9}
- Real representations: find the common m/z region for each maximal clique and estimate the aligned peak centers using maximum likelihood estimation (MLE)
[Figure: peaks R1 to R9; labels "Not a maximum clique" and "Aligned peak centers"]
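
One way to picture the clique step: give every detected peak the shift interval m/z*(1-0.2%, 1+0.2%), call two peaks connected when their intervals overlap, and sweep over the sorted interval endpoints to list the maximal cliques. The sketch below does that and then reports, per clique, the common region and a center. It is an illustration under stated assumptions rather than the PROcess implementation: the function names are hypothetical, and the center is taken as the plain mean of the member m/z values as a stand-in for the MLE mentioned on the slide.

```python
def maximal_cliques(peaks_mz, tol=0.002):
    """Maximal cliques of the interval graph built from m/z*(1-tol, 1+tol)."""
    events = []
    for i, m in enumerate(peaks_mz):
        events.append((m * (1 - tol), 0, i))   # 0 = interval opens (sorts first on ties)
        events.append((m * (1 + tol), 1, i))   # 1 = interval closes
    events.sort()
    active, cliques, grew = set(), [], False
    for _, kind, i in events:
        if kind == 0:
            active.add(i)
            grew = True
        else:
            if grew:                           # active set is maximal right before a close
                cliques.append(sorted(active))
                grew = False
            active.remove(i)
    return cliques


def aligned_centers(peaks_mz, tol=0.002):
    """Return (center, common_region, member_indices) for each maximal clique."""
    out = []
    for clique in maximal_cliques(peaks_mz, tol):
        members = [peaks_mz[i] for i in clique]
        lo = max(m * (1 - tol) for m in members)   # common m/z region of the clique
        hi = min(m * (1 + tol) for m in members)
        center = sum(members) / len(members)       # assumed: mean as stand-in for the MLE
        out.append((center, (lo, hi), clique))
    return out
```

With nine peaks at 1000, 1001, 2000, 2001, 2002, 2003, 5000, 5015 and 5030, the sweep yields the same clique pattern as the slide (using 0-based indices): two peaks, then four, then two overlapping pairs that share the middle peak of the last three.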

Quantify aligned peaks for individual spectra
- Each aligned peak location (m/z) can be treated as an interval, m/z*(1-0.2%, 1+0.2%)
- The intensity at each aligned peak location in an individual spectrum can be quantified by the maximum intensity in the interval.
[Figure legend]
- Black: raw m/z values with intensities
- Red: aligned peak m/z value. It has no associated peak in the raw data
- Blue: left and right limits of the interval around the aligned peak
- The intensity is quantified as the intensity of the aligned peak location (red)
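
The interval-maximum rule can be written in a few lines. This sketch assumes a spectrum is given as parallel lists of m/z values and intensities; the function name `quantify_peak` is illustrative, and returning 0.0 when no raw point falls inside the interval is an assumption, since the slides do not say how that case was handled.

```python
def quantify_peak(mz, intensity, center, tol=0.002):
    """Maximum raw intensity inside center*(1-tol, 1+tol)."""
    lo, hi = center * (1 - tol), center * (1 + tol)
    window = [x for m, x in zip(mz, intensity) if lo <= m <= hi]
    return max(window) if window else 0.0      # assumed fallback for an empty window
```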

Merging replicate samples
- After quantification of the aligned peaks in all individual spectra, we averaged the intensities of the two replicates for each sample.
- The averaged intensities were used to build our prediction model.
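
A minimal sketch of the replicate merge, assuming each quantified spectrum is stored as a `(sample_id, intensities)` pair with intensities listed in the same aligned-peak order; the function name `merge_replicates` and this data layout are illustrative.

```python
from collections import defaultdict


def merge_replicates(spectra):
    """Average the aligned-peak intensities of all replicates of each sample."""
    grouped = defaultdict(list)
    for sample_id, intensities in spectra:
        grouped[sample_id].append(intensities)
    merged = {}
    for sample_id, reps in grouped.items():
        # zip(*reps) walks the replicates peak by peak
        merged[sample_id] = [sum(vals) / len(vals) for vals in zip(*reps)]
    return merged
```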

Predictor -- K-Nearest Neighbor (KNN) Method
- To classify a new input vector (observation) v, examine the k closest training data points to v and assign the object to the most frequently occurring class
- The neighborhood is defined based on a mathematical distance measure
- Deficiency: the individual points in a neighborhood may have very different similarities to v (distances from v), but they all have the same influence on the prediction
[Figure: query point x with neighborhoods for k=1 and k=6]
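
For reference, the unweighted rule described above looks like this when written out. The sketch assumes Euclidean distance and resolves tied votes in favour of the class encountered first among the nearest neighbours; both are choices made for illustration, and the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, v, k=3):
    """Majority vote among the k training points closest to v."""
    dists = sorted((math.dist(x, v), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Every one of the k neighbours contributes exactly one vote here, regardless of how far it is from v, which is the deficiency the slide points out.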

Predictor -- Kernel-based KNN Method (Hechenbichler and Schliep 2004)
- To classify a new observation v, examine the k+1 nearest neighbors to v according to Euclidean distance (d)
- The (k+1)st neighbor is used to standardize the k smallest distances: D(i) = D(v, v(i)) = d(v, v(i)) / d(v, v(k+1)), i = 1, ..., k
- Transform the normalized distance D(i) into a weight using a Gaussian kernel function K(.): w(i) = K(D(i))
- Assign a prediction label to v by the weighted majority vote

  $$\hat{y} = \arg\max_{r} \sum_{i=1}^{k} w(i)\, I\big(y(i) = r\big)$$

  where y(i) is the class of the i-th neighbor and r is either CFS/CFS-like (r=1) or NON-CFS (r=0)
- The choice of k is partly absorbed by the weights: if k is too large, it is effectively reduced automatically, because the small number of neighbors with large weights dominate the others, whose very small weights have no influence on the prediction
- We set k=3 in the study
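
The weighting scheme can be sketched as follows. This is an illustration of the steps listed on the slide, not the `kknn` package itself: the kernel is taken to be the standard normal density applied directly to the standardized distance, whereas the R package may rescale distances before applying its kernel; the function names are hypothetical, and the training set is assumed to contain at least k+1 points.

```python
import math


def gaussian_kernel(d):
    """Standard normal density, used here as the weight function K(.)."""
    return math.exp(-0.5 * d * d) / math.sqrt(2 * math.pi)


def kernel_knn_predict(train_X, train_y, v, k=3):
    """Weighted vote among the k nearest neighbours of v."""
    neighbours = sorted(
        ((math.dist(x, v), label) for x, label in zip(train_X, train_y)),
        key=lambda t: t[0],
    )[: k + 1]
    scale = neighbours[k][0]                   # distance to the (k+1)st neighbour
    scores = {}
    for d, label in neighbours[:k]:
        D = d / scale if scale > 0 else 0.0    # standardized distance in [0, 1]
        scores[label] = scores.get(label, 0.0) + gaussian_kernel(D)
    return max(scores, key=scores.get)         # class with the largest summed weight
```

Compared with the unweighted vote, a neighbour that sits almost on top of v now counts for more than one near the edge of the neighbourhood.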

Results
- We first define some concepts used in this section
  - Condition: experimental protocol (Fraction/Chip/Laser energy)
  - Biomarkers: here, the aligned peaks
  - Differentially expressed biomarkers: aligned peaks with t-test p-values less than 0.05
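
To make the p<0.05 rule concrete, the sketch below computes a two-sample t-test for one aligned peak using only the standard library. The slides do not say which variant of the t-test was used, so the equal-variance (pooled) Student test here is an assumption; the p-value is obtained by numerically integrating the t density with Simpson's rule, and all function names are illustrative. Both groups are assumed to have at least two values and non-zero pooled variance.

```python
import math
from statistics import mean, variance


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps                              # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * t_pdf(j * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def t_test(group_a, group_b):
    """Pooled-variance two-sample t statistic and two-sided p-value (assumed variant)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_sided_p(t, df)
```

A peak would then be flagged as differentially expressed when the p-value returned for its CFS/CFS-like and control intensities falls below 0.05.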

The number of biomarkers identified in each condition, and the number significant (p<0.05)

High Laser Energy

Condition (1)               # of biomarkers identified   # of differentially expressed biomarkers (p<0.05)
H50-F4                      235                          40
IMAC-F5                     613                          10
CM10 High Stringency-F4     315                          29
CM10 High Stringency-F5     1220                         52
CM10 High Stringency-F6     1053                         24

(1) Only conditions (of 32 conditions in total) with at least 2 differentially expressed peaks (p<=0.05) are listed

The number of biomarkers identified in each condition, and the number significant (p<0.05) -- Cont.

Low Laser Energy

Condition                   # of biomarkers   # p<0.05
H50-F4                      351               9
H50-F6                      299               9
CM10 High Stringency-F3     417               20
CM10 High Stringency-F4     409               4
CM10 High Stringency-F5     809               12
CM10 High Stringency-F6     706               40
CM10 Low Stringency-F3      328               5
CM10 Low Stringency-F4      376               3
CM10 Low Stringency-F5      370               12
IMAC-F3                     391               9
IMAC-F4                     418               7
IMAC-F5                     449               36
IMAC-F6                     341               4

Comments
- Using low laser energy, there are 13 conditions (Fraction/Chip) that identified at least two differentially expressed peaks/biomarkers (p<=0.05)
- Using high laser energy, there are only 5 conditions (Fraction/Chip) that identified at least two differentially expressed peaks/biomarkers (p<=0.05)

Performance of the kernel-based KNN predictors using selected biomarkers in each condition

Laser energy   Condition                  Accuracy (%)   AUC (%)   # of biomarkers (1)
High           H50-F4                     66.7           64.6      16
High           CM10 High Stringency-F6    68.3           70.6      3
Low            H50-F6                     79.4           77.6      8
Low            H50-F4                     61.9           58.1      15
Low            IMAC-F5                    61.9           62.3      3
Low            CM10 Low Stringency-F3     60.3           67.9      2
Low            CM10 Low Stringency-F4     61.9           60.9      4
Low            CM10 Low Stringency-F5     60.3           59.6      15
Low            CM10 High Stringency-F3    69.8           76.1      13
Low            CM10 High Stringency-F6    61.9           61.1      5

(1) The number of biomarkers selected in each of the 10 cross-validations
Only conditions with accuracy above 60% are listed

Biomarkers used in building the prediction model for condition: H50, Low laser energy, and F6

m/z            p-value (t-test)*   Times**
500.4684134    0.002301636         10
526.3221611    0.005358258         10
500.9326675    0.011475143         10
7784.425103    0.013897058         10
501.8618258    0.018222865         10
502.7918491    0.02142476          8
527.5130904    0.023476904         6
499.5405451    0.026054199         8
526.7982502    0.042868796         3
501.3971417    0.055007619         1
791.5120389    0.062007361         1
6915.027324    0.101860914         1
4276.036433    0.10220098          1
15483.83627    0.305629302         1

* 9 of 299 peaks (after alignment) have p-values less than 0.05
** The number of times the biomarker was picked in the 10 CV folds

Three m/z value ranges seem to be interesting:
- 499-503
- 526-528
- 7784-7785

Conclusions
- Based on our analysis, the best combination (laser energy, chip and fraction) appears to be Low laser energy/H50/Fraction 6
- For this combination we identified 9 differentially expressed biomarkers (p-value<=0.05), which are located in 3 m/z value ranges:
  - 499-503
  - 526-528
  - 7784-7785
- Using the 14 biomarkers identified from this combination, our predictor can reach ~80% accuracy.

Limitations
- For all combinations of experimental protocol, we used the same m/z shift interval, m/z*(1-0.2%, 1+0.2%). A better choice may be obtained by estimating it for each combination from the QC samples
- We did not take the multiple testing issue into account in this analysis

Acknowledgements
- We used the following R packages to perform the analysis in this study
  - caMassClass (Jarek Tuszynski)
  - PROcess (Xiaochun Li)
  - kknn (Klaus Hechenbichler and Klaus Schliep)
- This research was supported by funding from the Ontario Genomics Institute and Genome Canada, through the Centre for Applied Genomics.
