Serum Diagnosis of Chronic Fatigue Syndrome (CFS) Using Array-based Proteomics
Pingzhao Hu, W Le, S Lim, B Xing, CMT Greenwood and J Beyene
Hospital for Sick Children Research Institute and University of Toronto
The Sixth International Conference for the Critical Assessment of Microarray Data Analysis (CAMDA 2006)
Duke University, Durham, NC, U.S.A., June 8-9, 2006

Outline
1. Objectives
2. Data Set
3. Methods & Results
   3.1 Preprocessing (identification of biomarkers)
   3.2 Classification model
4. Conclusions

Objectives
- Identify biomarkers for CFS/CFS-like disease using SELDI-TOF MS technology
- Evaluate the performance of the identified biomarkers in distinguishing CFS/CFS-like patients from healthy people
- Determine the best experimental protocol for large sample studies by choosing the best Fraction/Chip/Laser Energy combination

Data Set
- Samples: 31 CFS/CFS-like, 32 healthy controls, 9 quality control (QC); 2 replicates each
- Fractions: f1, f2, f3, f4, f5, f6
- Chips: H50, IMAC30, High stringency CM10, Low stringency CM10
- Laser energy: High, Low
- QC samples: same fractions/chips/laser energies as the CFS data

- Each spectrum has ~30000 m/z values for high energy and ~20000 m/z values for low energy (note: the number of m/z values in fractions f1 and f2 is larger).
- Each combination (Fraction/Chip/Laser energy) includes 144 ((31+32+9)*2) spectra.
- f1 and f2 were not analyzed, since they have a different number of spectra and m/z values from the other fractions.
- QC samples were not analyzed here.

Data Analysis Pipeline
- Preprocessing
  - Baseline subtraction (already done)
  - Trimming low m/z values
  - Normalization
  - Peak finding and alignment
  - Quantification of aligned peaks
  - Merging replicate samples
- Classification (a base-R sketch of this loop follows below)
  - Perform 10-fold cross-validation (CV); for each step of the CV:
    - Split the samples of the preprocessed data into training and test sets
    - Perform biomarker selection on the training set using t-tests
    - Build the prediction model on the training set: a kernel-based K-nearest neighbor (KNN) classifier
    - Evaluate performance on the test set
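To make the classification loop concrete, here is a minimal sketch in base R of 10-fold CV with per-fold t-test biomarker selection. The names X (a samples-by-aligned-peaks matrix), y (a two-level class factor), and classify() are hypothetical stand-ins, not the code actually used in the study.

  cv_accuracy <- function(X, y, k_folds = 10, alpha = 0.05, classify) {
    folds <- sample(rep(1:k_folds, length.out = nrow(X)))  # random fold labels
    acc <- numeric(k_folds)
    for (f in 1:k_folds) {
      train <- folds != f
      # biomarker selection on the training set only: two-sample t-test per peak
      pvals <- apply(X[train, ], 2, function(x) t.test(x ~ y[train])$p.value)
      sel <- which(pvals < alpha)
      # fit on the training set, predict the held-out test set
      pred <- classify(X[train, sel, drop = FALSE], y[train],
                       X[!train, sel, drop = FALSE])
      acc[f] <- mean(pred == y[!train])
    }
    mean(acc)  # CV estimate of classification accuracy
  }

Selecting biomarkers inside each fold, rather than once on the full data, keeps the test sets untouched by feature selection and avoids an optimistic bias in the accuracy estimate.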

Trimming low m/z values
- Low laser energy allows peaks in the low mass range to be well visualized.
- High laser energy improves visualization of peaks in the high mass range.
- Many studies (e.g. Baggerly et al. 2003) have indicated that there is a noisy m/z region near the lower limit where the machine cannot record stably.
- For these reasons, we trimmed low m/z values using the following thresholds: for the low laser energy condition, we removed m/z values less than 100; for the high laser energy condition, we removed m/z values less than 2000.
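In code, this step is a simple threshold filter; a sketch (the function and variable names are illustrative):

  trim_low_mz <- function(mz, intensity, laser = c("low", "high")) {
    cutoff <- if (match.arg(laser) == "low") 100 else 2000  # thresholds from above
    keep <- mz >= cutoff
    list(mz = mz[keep], intensity = intensity[keep])
  }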

Global Normalization (Li 2005)
- Given a spectrum with intensities X_i (i=1,...,n) for all n m/z values, the normalized intensities X_i_norm can be computed by X_i_norm = s*X_i, where s = (median of the total intensities among all spectra) / (total intensity of the current spectrum).
- Multiplying the raw intensities by the factor s equalizes the median of the total intensity among the compared spectra.
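A sketch of this normalization over a list of spectra, each assumed to be a list holding mz and intensity vectors (the representation is an assumption, not the PROcess data structure):

  normalize_spectra <- function(spectra) {
    totals <- sapply(spectra, function(sp) sum(sp$intensity))
    med <- median(totals)
    lapply(spectra, function(sp) {
      s <- med / sum(sp$intensity)      # the factor s from the formula above
      sp$intensity <- sp$intensity * s  # equalizes total intensity across spectra
      sp
    })
  }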

Peak finding -- Why
- The height of the peak intensity at a certain m/z value indicates the presence and the approximate amount of the corresponding protein or peptide in the sample.
- However, not every peak at an m/z value is related to a protein, or even a part of a protein.
- We therefore need to search for those peaks that may represent a protein or a part of a protein.

Peak finding -- Algorithm (Tuszynski 2006)
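The algorithm itself is implemented in the caMassClass R package (see Acknowledgements). As a generic illustration of peak finding, not the caMassClass algorithm, a simple local-maximum detector might look like this:

  find_peaks <- function(mz, intensity, span = 5, snr = 2) {
    # a point is declared a peak if it is the maximum within +/- span points
    # and exceeds snr times a crude noise estimate
    n <- length(intensity)
    noise <- mad(diff(intensity))
    peaks <- integer(0)
    for (i in (span + 1):(n - span)) {
      win <- intensity[(i - span):(i + span)]
      if (intensity[i] == max(win) && intensity[i] > snr * noise)
        peaks <- c(peaks, i)
    }
    list(mz = mz[peaks], intensity = intensity[peaks])
  }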

Peak Alignment -- Why
- Assume two peaks, R1 at m/z value L1 and R2 at m/z value L2, are detected in two spectra, respectively.
- It is known that the m/z value of the same peak in different spectra may have a small shift (0.1%-0.3%).
- The shift must be adjusted so that peaks within a given shift interval (say, m/z*(1-0.2%, 1+0.2%)) are aligned to the same m/z value.
- The objective of alignment is to estimate the common m/z value L3 of the peaks in the given shift interval across spectra.
[Figure: peaks R1 (at L1) and R2 (at L2) in two spectra, aligned to a common m/z value L3]

Alignment -- Algorithm: maximal cliques & real representations (Li 2005, Gentleman 2001)
- Find maximal cliques of peaks whose m/z values fall within the shift interval, e.g. {1,2}, {3,4,5,6}, {7,8}, {8,9}.
- Real representations: find the common m/z region for each maximal clique and estimate the aligned peak centers using maximum likelihood estimation (MLE).
[Figure: peaks R1-R9 grouped into cliques with their aligned peak centers; one grouping is marked as not a maximal clique]
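The clique search and MLE step are beyond a slide-sized example, but the core idea, grouping pooled peak locations that lie within the 0.2% shift interval and replacing each group with a single center, can be sketched with a greedy simplification (mean centers instead of MLE; an illustration, not the Li 2005 algorithm):

  align_peaks <- function(peak_mz_list, tol = 0.002) {
    mz <- sort(unlist(peak_mz_list))  # pool peak m/z values from all spectra
    centers <- numeric(0)
    cur_sum <- mz[1]; cur_n <- 1
    for (x in mz[-1]) {
      center <- cur_sum / cur_n
      if (x <= center * (1 + tol)) {          # within the shift interval: same group
        cur_sum <- cur_sum + x; cur_n <- cur_n + 1
      } else {                                # too far: close the group, start a new one
        centers <- c(centers, center)
        cur_sum <- x; cur_n <- 1
      }
    }
    c(centers, cur_sum / cur_n)               # one common m/z value per aligned peak
  }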

Quantify aligned peaks for individual spectra
- Each aligned peak location (m/z) can be treated as an interval, m/z*(1-0.2%, 1+0.2%).
- The intensity of each aligned peak location in an individual spectrum is quantified as the maximum intensity within that interval.
[Figure: black: raw m/z values with intensities; red: the aligned peak m/z value, which has no associated peak in the raw data; blue: left and right ends of the aligned peak's interval. The maximum intensity in the interval is taken as the intensity at the aligned peak location (red).]
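A sketch of this quantification for one spectrum and one aligned peak center (names are illustrative):

  quantify_peak <- function(mz, intensity, center, tol = 0.002) {
    in_win <- mz >= center * (1 - tol) & mz <= center * (1 + tol)
    if (any(in_win)) max(intensity[in_win]) else NA_real_  # maximum in the interval
  }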

Merging replicate samples
- After quantification of the aligned peaks in all individual spectra, we averaged the intensities of the two replicates for each sample.
- The averaged intensities were used to build our prediction model.

Predictor -- K-Nearest Neighbor (KNN) Method
- To classify a new input vector (observation) v, examine the k closest training data points to v and assign the object to the most frequently occurring class.
- The neighborhood is defined based on a mathematical distance measure.
- Deficiency: the individual points in a neighborhood may have very different similarities to v (distances from v), yet they all have the same influence on the prediction.
[Figure: a query point x with its neighborhoods for k=1 and k=6]

Predictor -- Kernel-based KNN Method (Hechenbichler and Schliep 2004)
- To classify a new observation v, examine the k+1 nearest neighbors of v according to Euclidean distance d.
- The (k+1)-st neighbor is used for standardization of the k smallest distances: D(i) = D(v, v(i)) = d(v, v(i)) / d(v, v(k+1)), i=1,...,k.
- Transform each normalized distance D(i) into a weight w(i) = K(D(i)) using a Gaussian kernel function K(.).
- Assign a prediction label to v by weighted majority vote: yhat(v) = argmax_r sum_{i=1..k} w(i)*I(y(i)=r), where y(i) is the class of the i-th nearest neighbor and r can be either CFS/CFS-like (r=1) or NON-CFS (r=0).
- k is implicitly hidden in the weights: if k is too large, it is effectively adjusted to a smaller value, since a small number of neighbors with large weights dominates the others (a very small weight means no influence on the prediction).
- We set k=3 in this study.
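The study used the kknn package; to make the weighting explicit, here is a base-R re-implementation of the idea (an illustrative sketch, not the package code):

  kernel_knn <- function(X_train, y_train, v, k = 3) {
    d <- sqrt(rowSums(sweep(X_train, 2, v)^2))   # Euclidean distances to v
    o <- order(d)
    D <- d[o[1:k]] / d[o[k + 1]]                 # standardize by the (k+1)-st neighbor
    w <- dnorm(D)                                # Gaussian kernel weights
    votes <- tapply(w, y_train[o[1:k]], sum)     # weighted votes per class
    names(which.max(votes))                      # predicted class label
  }

With k=3, each test sample is assigned the class whose three nearest training samples carry the most kernel weight; a distant third neighbor contributes almost nothing, which softens the fixed-k deficiency of plain KNN.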

Results
We first define some concepts used in this section:
- Condition: an experimental protocol (Fraction/Chip/Laser energy)
- Biomarkers: here, the aligned peaks
- Differentially expressed biomarkers: aligned peaks with t-test p-values less than 0.05

The number of biomarkers identified in each condition, and the number significant (p<0.05)

High Laser Energy

  Condition                  # of biomarkers identified   # of differentially expressed biomarkers (p<0.05)
  H50-F                      -                            -
  IMAC-F                     -                            -
  CM10 High Stringency-F     -                            -
  CM10 High Stringency-F     -                            -
  CM10 High Stringency-F     -                            -

Only conditions (out of 32 total) with at least 2 differentially expressed peaks (p<=0.05) are listed.

The number of biomarkers identified in each condition, and the number significant (p<0.05) -- cont.

Low Laser Energy

  Condition                  # of biomarkers   # p<0.05
  H50-F4                     351               9
  H50-F6                     299               9
  CM10 High Stringency-F3    417               20
  CM10 High Stringency-F4    409               4
  CM10 High Stringency-F5    809               12
  CM10 High Stringency-F     -                 -
  CM10 Low Stringency-F3     328               5
  CM10 Low Stringency-F4     376               3
  CM10 Low Stringency-F      -                 -
  IMAC-F                     -                 -
  IMAC-F                     -                 -
  IMAC-F                     -                 -
  IMAC-F                     -                 -

Comments
- Using low laser energy, there are 13 conditions (Fraction/Chip) that identified at least two differentially expressed peaks/biomarkers (p<=0.05).
- Using high laser energy, there are only 5 such conditions (Fraction/Chip).

Performance of the kernel-based KNN predictors using the selected biomarkers in each condition

  Condition                                   Accuracy (%)   AUC (%)   # of biomarkers*
  High_Laser_Energy-H50-F                     -              -         -
  High_Laser_Energy-CM10_High_Stringency-F    -              -         -
  Low_Laser_Energy-H50-F                      -              -         -
  Low_Laser_Energy-H50-F                      -              -         -
  Low_Laser_Energy-IMAC-F                     -              -         -
  Low_Laser_Energy-CM10_Low_Stringency-F      -              -         -
  Low_Laser_Energy-CM10_Low_Stringency-F      -              -         -
  Low_Laser_Energy-CM10_Low_Stringency-F      -              -         -
  Low_Laser_Energy-CM10_High_Stringency-F     -              -         -
  Low_Laser_Energy-CM10_High_Stringency-F     -              -         -

* The number of biomarkers selected in each of the 10 cross-validations.
Only conditions with accuracy larger than 60% are listed.

Biomarkers used in building the prediction model for condition: H50, low laser energy, F6

  m/z   p-value (t-test)*   Times**
[Table: the m/z values, t-test p-values, and selection counts of the 9 significant peaks]

* 9 of the 299 peaks (after alignment) have p-values less than 0.05.
** The number of times the biomarker was picked in the 10 CV folds.

Three m/z value ranges seem to be interesting.

Conclusions
- Based on our analysis, the best combination (laser energy, chip and fraction) appears to be low laser energy / H50 / Fraction 6.
- We identified 9 differentially expressed biomarkers (p-value <= 0.05), located in 3 m/z value ranges.
- Using the 14 biomarkers identified from this combination, our predictor reaches ~80% accuracy.

Limitations
- For all combinations of the experimental protocol, we used the same m/z shift interval, m/z*(1-0.2%, 1+0.2%). A better choice might be obtained by estimating it for each combination from the QC samples.
- We did not take the multiple testing issue into account in this analysis.

Acknowledgements
- We used the following R packages to perform the analysis in this study:
  - caMassClass (Jarek Tuszynski)
  - PROcess (Xiaochun Li)
  - kknn (Klaus Hechenbichler and Klaus Schliep)
- This research was supported by funding from the Ontario Genomics Institute and Genome Canada, through the Centre for Applied Genomics.