Presentation is loading. Please wait.

Presentation is loading. Please wait.

Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical.

Similar presentations


Presentation on theme: "Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical."— Presentation transcript:

1 Classification of Microarray Data

2 Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical Analysis Fit to Model (time series) Expression Index Calculation Advanced Data Analysis ClusteringPCAClassification Promoter Analysis Meta analysisSurvival analysisRegulatory Network Normalization Image analysis The DNA Array Analysis Pipeline Comparable Gene Expression Data

3 Outline Example: Childhood leukemia Classification –Method selection –Feature selection –Cross-validation –Training and testing Example continued: Results from a leukemia study

4 Childhood Leukemia Cancer in the cells of the immune system Approx. 35 new cases in Denmark every year 50 years ago – all patients died Today – approx. 78% are cured Risk groups: –Standard, Intermediate, High, Very high, Extra high Treatment –Chemotherapy –Bone marrow transplantation –Radiation

5 Prognostic Factors Good prognosisPoor prognosis Immuno-phenotypeprecursor BT Age1-9≥10 Leukocyte countLow (<50*10 9 /L)High (>100*10 9 /L) Number of chromosomesHyperdiploidy (>50) Hypodiploidy (<46) Translocationst(12;21)t(9;22), t(1;19) Treatment responseGood response: Low MRD Poor response: High MRD

6 Prognostic factors: Immunophenotype Age Leukocyte count Number of chromosomes Translocations Treatment response Risk Classification Today Risk group: Standard Intermediate High Very high Extra High Patient: Clinical data Immuno- phenotype Morphology Genetic measurements Microarray technology

7 Study of Childhood Leukemia Diagnostic bone marrow samples from leukemia patients Platform: Affymetrix Focus Array –8793 human genes Immunophenotype –18 patients with precursor B immunophenotype –17 patients with T immunophenotype Outcome 5 years from diagnosis –11 patients with relapse –18 patients in complete remission

8 Reduction of data Dimension reduction –PCA Feature selection (gene selection) –Significant genes: t-test –Selection of a limited number of genes

9 Microarray Data ClassprecursorBT... Patient1Patient2Patient3Patient4... Gene11789.5963.92079.53243.9... Gene246.452.022.327.1... Gene3215.6276.4245.1199.1... Gene4176.9504.6420.5380.4 Gene54023.03768.64257.84451.8 Gene612.612.137.738.7... Gene8793312.5415.91045.41308.0

10 Outcome: PCA on all Genes

11 Outcome Prediction: CC against the Number of Genes

12 Assumptions: –Data e.g.. Gaussian distributed –Variance and covariance the same for all classes Calculation of class probability Linear discriminant analysis

13 Calculation of a centroid for each class Calculation of the distance between a test sample and each class centroid Class prediction by the nearest centroid method Nearest Centroid

14 Based on distance measure –For example Euclidian distance Parameter k = number of nearest neighbors –k=1 –k=3 –k=... –Always an odd number Prediction by majority vote K-Nearest Neighbor

15 Support Vector Machines Machine learning Relatively new and highly theoretic Works on non-linearly separable data Finding a hyperplane between the two classes by minimizing of the distance between the hyperplane and closest points

16 Neural Networks Gene1 Gene2 Gene3... B T

17 Comparison of Methods Linear discriminant analysis Nearest centroid KNN Neural networks Support vector machines Simple method Based on distance calculation Good for simple problems Good for few training samples Distribution of data assumed Advanced methods Involve machine learning Several adjustable parameters Many training samples required (e.g.. 50-100) Flexible methods

18 Cross-validation Given data with 100 samples Cross-5-validation: Training: 4/5 of data (80 samples) Testing: 1/5 of data (20 samples) - 5 different models and performance results Leave-one-out cross-validation (LOOCV) Training: 99/100 of data (99 samples) Testing: 1/100 of data (1 sample) - 100 different models

19 Validation Definition of –True and False positives –True and False negatives Actual classBBTT Predicted classBTTB TPFNTNFP

20 Sensitivity: the ability to detect "true positives" TP TP + FN Specificity: the ability to avoid "false positives" TN TN + FP Positive Predictive Value (PPV) TP TP + FP Sensitivity and Specificity

21 Accuracy Definition: TP + TN TP + TN + FP + FN Range: [0 … 1]

22 Definition:TP·TN - FN·FP √(TN+FN)(TN+FP)(TP+FN)(TP+FP) Range: [-1 … 1] Matthews correlation coefficient

23 Overview of Classification Expression data Subdivision of data for cross-validation into training sets and test sets Feature selection (t-test) Dimension reduction (PCA) Training of classifier: - using cross-validation - choice of method - choice of optimal parameters Testing of classifierIndependent test set

24 Important Points Avoid over fitting Validate performance –Test on an independent test set –Use cross-validation Include feature selection in cross-validation Why? –To avoid overestimation of performance! –To make a general classifier

25 Multiclass classification Same prediction methods –Multiclass prediction often implemented –Often bad performance Solution –Division in small binary (2-class) problems

26 Study of Childhood Leukemia: Results Classification of immunophenotype (precursorB and T) –100% accuracy During the training When testing on an independent test set –Simple classification methods applied K-nearest neighbor Nearest centroid Classification of outcome (relapse or remission) –78% accuracy (CC = 0.59) –Simple and advanced classification methods applied

27 Summary When do we use classification? Reduction of dimension often necessary –T-test, PCA Methods –K nearest neighbor –Nearest centroid –Support vector machines –Neural networks Validation –Cross-validation –Accuracy and correlation coefficient

28 Coffee break Next: Exercise

29 Prognostic factors: Immunophenotype Age Leukocyte count Number of chromosomes Translocations Treatment response Risk classification in the future ? Risk group: Standard Intermediate High Very high Ekstra high Patient: Clincal data Immunopheno- typing Morphology Genetic measurements Microarray technology Customized treatment


Download ppt "Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical."

Similar presentations


Ads by Google