
A Multivariate Biomarker for Parkinson’s Disease M. Coakley, G. Crocetti, P. Dressner, W. Kellum, T. Lamin The Michael L. Gargano 12 th Annual Research.


1 A Multivariate Biomarker for Parkinson’s Disease
M. Coakley, G. Crocetti, P. Dressner, W. Kellum, T. Lamin
The Michael L. Gargano 12th Annual Research Day
Friday, May 2nd, 2014

2 Introduction
›Genomic analysis for the selection of genes associated with Parkinson’s Disease (PD)
›Adoption of multivariate techniques
›Comparison between several classification algorithms

3 The Data
›Microarray expression data from Affymetrix
›Expression dataset from GenBank (GEO accession GSE6613), generated using the MAS5 algorithm
›105 samples, 22,283 measurements of gene expression, from three groups:
–Parkinson’s disease group (50 patients)
–Healthy control group (22)
–Neurodegenerative control group (33)
Layout of the expression matrix (rows are genes, columns are samples):
Gene\Sample   S_i – PD   S_j – CNTL   S_z – N-CNTL
Gene 1        P_1,i      P_1,j        P_1,z
Gene 2        P_2,i      P_2,j        P_2,z
…
Gene n        P_n,i      P_n,j        P_n,z

4 Data Preparation
›Filtering: removed noisy probesets (measurements) using “Filtering by Present calls” with a threshold of 25%: only probesets called present in at least 25% of the samples are kept.
›After filtering, the number of probesets dropped from 22,283 to 8,100.
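The filtering step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes MAS5-style detection calls ('P' present, 'M' marginal, 'A' absent) are available per probeset and per sample, that marginal calls do not count as present, and the function name `filter_by_present_calls` and the toy calls are hypothetical.

```python
def filter_by_present_calls(calls, threshold=0.25):
    """Return the indices of probesets called Present ('P') in at least
    `threshold` of the samples.

    `calls` holds one row per probeset; each row has one detection call
    per sample: 'P' (present), 'M' (marginal) or 'A' (absent).
    Assumption: marginal calls are not counted as present.
    """
    kept = []
    for i, row in enumerate(calls):
        present_fraction = sum(1 for c in row if c == "P") / len(row)
        if present_fraction >= threshold:
            kept.append(i)
    return kept


# Toy example: 4 probesets measured on 4 samples (hypothetical calls).
calls = [
    ["P", "P", "A", "P"],  # 75% present: kept
    ["A", "A", "A", "A"],  # 0% present: dropped
    ["P", "A", "A", "A"],  # 25% present: kept (meets the threshold)
    ["M", "A", "A", "A"],  # marginal only: dropped
]
kept = filter_by_present_calls(calls)
print(kept)
```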

5 Normality Assumption & Normalization
›The data showed strong right skewness
›We applied a logarithmic-scale transformation
›We normalized the data using z-scores, both for outlier detection (z-score > 5) and for algorithmic optimization
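A minimal sketch of this normalization for a single probeset is shown below. The slide does not state the logarithm base or whether the cutoff is applied to the absolute z-score, so base 2 and |z| > 5 are assumptions here; the helper names and the toy values are hypothetical.

```python
import math
import statistics


def log_zscore(values):
    """Log-transform raw intensities, then standardize to z-scores.

    Assumption: base-2 logarithm and sample standard deviation.
    """
    logged = [math.log2(v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)
    return [(x - mean) / sd for x in logged]


def flag_outliers(z_scores, cutoff=5.0):
    """Return the positions whose absolute z-score exceeds the cutoff."""
    return [i for i, z in enumerate(z_scores) if abs(z) > cutoff]


# Toy example: 40 samples at the same intensity plus one extreme sample.
intensities = [100.0] * 40 + [6400.0]
z = log_zscore(intensities)
outliers = flag_outliers(z)
print(outliers)
```

With n samples the largest attainable z-score is (n - 1) / sqrt(n), so a cutoff of 5 can only trigger when there are enough samples; the 105 samples in this study are well above that limit.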

6 Univariate Analysis
›Identify which (single) gene is associated with PD
›This corresponds to running 8,100 hypothesis tests:
–H0: μ_A = μ_B against the alternative H1: μ_A > μ_B
–For this test we use the t-statistic t, with critical region t ≥ z_α
›Since we have 3 classes, a gene is selected if:
–It is up-regulated in PD (Parkinson’s Disease) when compared with the other classes
–It is up-regulated in the other classes but down-regulated in PD
›The result of this analysis does not indicate which class contains the up-regulated gene(s), so we need to check.
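One such test for a single gene might look like the sketch below. The exact t-statistic used on the slide is not recoverable from the transcript, so the unequal-variance (Welch) form is an assumption, as are the function names and the toy expression values; the critical value z_α is taken from the standard normal distribution, as the slide's critical region suggests.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Two-sample t-statistic for H0: mu_A = mu_B (Welch form, assumed)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def is_up_regulated(a, b, alpha=0.05):
    """One-sided test of H1: mu_A > mu_B with critical region t >= z_alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return welch_t(a, b) >= z_alpha


# Toy log-expression values for one gene (hypothetical numbers).
pd_group = [8.1, 8.4, 7.9, 8.3, 8.2]
control_group = [7.0, 7.2, 6.9, 7.1, 7.3]

up_in_pd = is_up_regulated(pd_group, control_group)
up_in_control = is_up_regulated(control_group, pd_group)
print(up_in_pd, up_in_control)
```

Running the test in both directions, as in the last two calls, is one way to check which class carries the up-regulation.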

7 Upregulated Features
We identified 60 genes out of the 22,283 measured!

8 Problems of Univariate Analysis in Genomics
›In array-based differential expression analysis, the problem is to generate a list of differentially expressed genes that is as meaningful and complete as possible.
›Suppose we have 1,000 genes, of which we might expect 40 to be truly differentially expressed, and we test each with a t-test at a significance level of 0.05. Of the remaining 960 non-differentially expressed genes we can expect 5% errors, or 0.05 × 960 = 48 false positives.
›There are more false positives than truly differentially expressed genes: this is called the multiple hypothesis testing problem.
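The arithmetic of this example can be checked with a few lines. The function name is hypothetical, and the share of false positives in the final list additionally assumes that all 40 truly differentially expressed genes are detected, which the slide does not state.

```python
def expected_false_positives(n_genes, n_truly_different, alpha):
    """Expected number of null genes that pass a per-gene test at level alpha."""
    n_null = n_genes - n_truly_different
    return alpha * n_null


false_positives = expected_false_positives(1000, 40, 0.05)

# Assumption: every truly differentially expressed gene is detected.
true_positives = 40
false_share = false_positives / (false_positives + true_positives)

print(round(false_positives), round(false_share, 3))
```

Under that assumption, more than half of the reported genes would be false positives.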

9 Univariate vs Multivariate
›In univariate analysis we consider the effect of each gene, individually, against the target (PD).
›The effect of a disease is rarely the result of a single gene.
›Even if good univariate leads are found (the 60 genes), this rarely turns into the identification of useful pathways.
›We don’t have information on any group of genes that, together, might be involved in the development of PD.
›Multivariate approaches test for groups of variables that, simultaneously, explain a particular output.
›Multivariate theory is much more complex.

10 Multivariate Mining on Genomics
We are trying to identify a subset of genes (as small as possible) that can be used as a classification model to differentiate the classes in the original data set.
›Wrapper Subset Evaluator (WSE): an implementation of the forward wrapper method for feature selection, used to create an optimal subset.
›Correlation-based Feature Selection (CFS): these algorithms evaluate different combinations of features to identify an optimal subset. The feature subsets to be evaluated are generated using different search techniques. We used the Best First and Greedy search methods with a forward direction.
›R-Support Vector Machine (RSVM): a non-probabilistic binary linear classifier, in its recursive version.
No matter which algorithm you select, it must use multivariate hypothesis testing.
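The forward wrapper idea can be illustrated with a small greedy search. This is a sketch under stated assumptions, not the WSE implementation used in the study: the inner classifier (nearest centroid), the scoring scheme (leave-one-out accuracy), the stopping rule, the function names and the toy data are all chosen here for illustration.

```python
def nearest_centroid_loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier that only
    sees the given feature indices."""
    correct = 0
    for held_out in range(len(X)):
        # Per-class sums and counts, computed without the held-out sample.
        sums, counts = {}, {}
        for i, (row, label) in enumerate(zip(X, y)):
            if i == held_out:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def squared_distance(label):
            return sum(
                (X[held_out][f] - sums[label][k] / counts[label]) ** 2
                for k, f in enumerate(features)
            )

        predicted = min(sums, key=squared_distance)
        correct += predicted == y[held_out]
    return correct / len(X)


def forward_wrapper_selection(X, y, evaluate):
    """Greedy forward search: add the feature that most improves the
    evaluator's score, and stop when no candidate improves it."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [(evaluate(X, y, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy data: only feature 1 separates the two classes (hypothetical values).
X = [
    [5.0, 1.0, 3.0], [2.0, 1.2, 3.1], [4.0, 0.9, 2.9],
    [5.1, 3.0, 3.0], [2.1, 3.2, 3.1], [3.9, 2.9, 2.9],
]
y = ["CNTL", "CNTL", "CNTL", "PD", "PD", "PD"]

selected, score = forward_wrapper_selection(X, y, nearest_centroid_loo_accuracy)
print(selected, score)
```

Because the classifier itself scores each candidate subset, a wrapper tends to return small subsets tuned to that classifier, at the cost of many model evaluations.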

11 Multivariate Analysis: Evaluating several Classification Models
Algorithm, number of features selected, and selected subset:
›WSE (6 features): 200639_s_at, 202690_s_at, 203303_at, 207730_x_at, 211275_s_at, 217301_x_at
›SVM (20 features): 202581_at, 214800_x_at, 208843_s_at, 220897_at, 207205_at, 212994_at, 219055_at, 220471_s_at, 212176_at, 204031_s_at, 201186_at, 219156_at, 219186_at, 217142_at, 206342_x_at, 33814_at, 213891_s_at, 213340_s_at, 217552_x_at, 211989_at
›CFS (39 features): 200994_at, 201935_s_at, 202169_s_at, 202213_s_at, 202258_s_at, 202347_s_at, 202690_s_at, 202727_s_at, 202778_s_at, 203104_at, 203116_s_at, 203153_at, 203273_s_at, 203303_at, 203992_s_at, 204255_s_at, 207205_at, 207416_s_at, 208666_s_at, 209048_s_at, 209303_at, 210647_x_at, 210858_x_at, 211406_at, 213596_at, 214800_x_at, 215158_s_at, 216341_s_at, 216524_x_at, 216600_x_at, 217301_x_at, 217819_at, 217922_at, 218236_s_at, 218680_x_at, 219055_at, 220529_at, 221192_x_at, AFFX-HSAC07/X00351_M_at

12 Multivariate Analysis: Evaluating several Classification Models
›We used 10-fold cross-validation during the feature selection process.
›In k-fold cross-validation the original data set is split into k equal-size partitions. Of the k partitions, one is retained as a validation set for testing the model, and the remaining k-1 are used to train it. The cross-validation is repeated k times, so that each partition is used once for validation, and the results are averaged.
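The splitting scheme can be sketched as below. The results slides report stratified cross-validation, so this sketch keeps class proportions similar across folds; the round-robin assignment without shuffling, the function names and the `score_fold` callback are assumptions made for illustration (a real run would normally shuffle the samples first).

```python
def stratified_k_fold(labels, k):
    """Yield (train_indices, test_indices) for k folds with similar
    class proportions in each fold."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    # Deal the samples of each class round-robin across the folds.
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for i in indices:
            folds[position % k].append(i)
            position += 1

    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)


def cross_validate(labels, k, score_fold):
    """Average a per-fold score over the k train/test splits."""
    scores = [score_fold(train, test) for train, test in stratified_k_fold(labels, k)]
    return sum(scores) / len(scores)


# Class sizes from the data set: 50 PD, 22 healthy controls, 33 neurodegenerative controls.
labels = ["PD"] * 50 + ["HC"] * 22 + ["ND"] * 33
folds = list(stratified_k_fold(labels, 10))
print([len(test) for _, test in folds])
```

With 105 samples and k = 10 the folds cannot be exactly equal, so some hold 11 samples and others 10.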

13 Multivariate Analysis Results – WSE
Algorithm: WSE
Classification measurements:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      69    65.7143 %
Incorrectly Classified Instances    36    34.2857 %
Kappa statistic                     0.4011
Mean absolute error                 0.3245
Root mean squared error             0.4182
Relative absolute error             77.0204 %
Root relative squared error         91.1559 %
Total Number of Instances           105
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.182     0         1           0.182    0.308       HC
0.515     0.111     0.68        0.515    0.586       ND
0.96      0.509     0.632       0.96     0.762       PD
=== Confusion Matrix ===
a    b    c    <-- classified as
4    6    12   | a = HC
0    17   16   | b = ND
0    2    48   | c = PD
The Kappa statistic is a rate of agreement between tests (here, between the predicted and the actual classes), corrected for the agreement expected by chance.
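The reported Kappa value can be recomputed from the WSE confusion matrix. The sketch below uses the standard Cohen's kappa definition, (observed agreement - chance agreement) / (1 - chance agreement); the function name is hypothetical.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected if predictions were independent of the true class.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# WSE confusion matrix from this slide (class order: HC, ND, PD).
wse = [
    [4, 6, 12],
    [0, 17, 16],
    [0, 2, 48],
]
kappa = cohen_kappa(wse)
print(round(kappa, 4))
```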

14 Multivariate Analysis Results – CFS
Algorithm: CFS
Classification measurements:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      59    73.75 %
Incorrectly Classified Instances    21    26.25 %
Kappa statistic                     0.5769
Mean absolute error                 0.175
Root mean squared error             0.4183
Relative absolute error             41.4038 %
Root relative squared error         91.0133 %
Total Number of Instances           80 (why 80 and not 105?)
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.529     0.063     0.692       0.529    0.6         HC
0.76      0.127     0.731       0.76     0.745       ND
0.816     0.238     0.756       0.816    0.785       PD
=== Confusion Matrix ===
a    b    c    <-- classified as
9    3    5    | a = HC
1    19   5    | b = ND
3    4    31   | c = PD

15 Multivariate Analysis Results – RSVM
Algorithm: R-SVM
Classification measurements:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      87    82.8571 %
Incorrectly Classified Instances    18    17.1429 %
Kappa statistic                     0.7228
Mean absolute error                 0.2667
Root mean squared error             0.3432
Relative absolute error             63.2941 %
Root relative squared error         74.8253 %
Total Number of Instances           105
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.727     0.036     0.842       0.727    0.78        HC
0.788     0.056     0.867       0.788    0.825       ND
0.9       0.2       0.804       0.9      0.849       PD
=== Confusion Matrix ===
a    b    c    <-- classified as
16   0    6    | a = HC
2    26   5    | b = ND
1    4    45   | c = PD
This looks like a good starting point. Further investigation is warranted to understand the relationships between the 20 selected genes.
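The per-class rates in the detailed accuracy table follow directly from the confusion matrix. The sketch below recomputes them for the R-SVM result using the usual definitions (recall = TP rate, precision, FP rate and F-measure); the function name and the dictionary layout are hypothetical.

```python
def per_class_metrics(matrix, class_names):
    """Per-class TP rate (recall), FP rate, precision and F-measure from a
    confusion matrix (rows = actual class, columns = predicted class)."""
    n = sum(sum(row) for row in matrix)
    metrics = {}
    for k, name in enumerate(class_names):
        true_positives = matrix[k][k]
        actual = sum(matrix[k])
        predicted = sum(row[k] for row in matrix)
        false_positives = predicted - true_positives

        recall = true_positives / actual
        precision = true_positives / predicted if predicted else 0.0
        fp_rate = false_positives / (n - actual)
        denominator = precision + recall
        f_measure = 2 * precision * recall / denominator if denominator else 0.0

        metrics[name] = {
            "tp_rate": recall,
            "fp_rate": fp_rate,
            "precision": precision,
            "f_measure": f_measure,
        }
    return metrics


# R-SVM confusion matrix from this slide (class order: HC, ND, PD).
rsvm = [
    [16, 0, 6],
    [2, 26, 5],
    [1, 4, 45],
]
metrics = per_class_metrics(rsvm, ["HC", "ND", "PD"])
for name, values in metrics.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

Read this way, the PD row shows the trade-off behind the headline accuracy: 45 of 50 PD patients are recognized, while 11 of the 55 non-PD samples are also labelled PD.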

16 Conclusions
›Multivariate models are necessary tools in genomic studies.
›Among the algorithms tested in this study, RSVM clearly came out as an effective model to adopt in biomarker discovery, with the important ability to successfully discriminate between PD and other neurodegenerative diseases.
›This research cannot stop here, and the natural next step is to look for the biological interpretation of this result.

17 Thank you

