Download presentation
Presentation is loading. Please wait.
1
Classification of Microarray Data
2
Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical Analysis Fit to Model (time series) Expression Index Calculation Advanced Data Analysis ClusteringPCAClassification Promoter Analysis Meta analysisSurvival analysisRegulatory Network Normalization Image analysis The DNA Array Analysis Pipeline Comparable Gene Expression Data
3
Outline Example: Childhood leukemia Classification –Method selection –Feature selection –Cross-validation –Training and testing Example continued: Results from a leukemia study
4
Childhood Leukemia Cancer in the cells of the immune system Approx. 35 new cases in Denmark every year 50 years ago – all patients died Today – approx. 78% are cured Risk groups: –Standard, Intermediate, High, Very high, Extra high Treatment –Chemotherapy –Bone marrow transplantation –Radiation
5
Prognostic Factors Good prognosisPoor prognosis Immuno-phenotypeprecursor BT Age1-9≥10 Leukocyte countLow (<50*10 9 /L)High (>100*10 9 /L) Number of chromosomesHyperdiploidy (>50) Hypodiploidy (<46) Translocationst(12;21)t(9;22), t(1;19) Treatment responseGood response: Low MRD Poor response: High MRD
6
Prognostic factors: Immunophenotype Age Leukocyte count Number of chromosomes Translocations Treatment response Risk Classification Today Risk group: Standard Intermediate High Very high Extra High Patient: Clinical data Immuno- phenotype Morphology Genetic measurements Microarray technology
7
Study of Childhood Leukemia Diagnostic bone marrow samples from leukemia patients Platform: Affymetrix Focus Array –8793 human genes Immunophenotype –18 patients with precursor B immunophenotype –17 patients with T immunophenotype Outcome 5 years from diagnosis –11 patients with relapse –18 patients in complete remission
8
Reduction of data Dimension reduction –PCA Feature selection (gene selection) –Significant genes: t-test –Selection of a limited number of genes
9
Microarray Data ClassprecursorBT... Patient1Patient2Patient3Patient4... Gene11789.5963.92079.53243.9... Gene246.452.022.327.1... Gene3215.6276.4245.1199.1... Gene4176.9504.6420.5380.4 Gene54023.03768.64257.84451.8 Gene612.612.137.738.7... Gene8793312.5415.91045.41308.0
10
Outcome: PCA on all Genes
11
Outcome Prediction: CC against the Number of Genes
12
Assumptions: –Data e.g.. Gaussian distributed –Variance and covariance the same for all classes Calculation of class probability Linear discriminant analysis
13
Calculation of a centroid for each class Calculation of the distance between a test sample and each class centroid Class prediction by the nearest centroid method Nearest Centroid
14
Based on distance measure –For example Euclidian distance Parameter k = number of nearest neighbors –k=1 –k=3 –k=... –Always an odd number Prediction by majority vote K-Nearest Neighbor
15
Support Vector Machines Machine learning Relatively new and highly theoretic Works on non-linearly separable data Finding a hyperplane between the two classes by minimizing of the distance between the hyperplane and closest points
16
Neural Networks Gene1 Gene2 Gene3... B T
17
Comparison of Methods Linear discriminant analysis Nearest centroid KNN Neural networks Support vector machines Simple method Based on distance calculation Good for simple problems Good for few training samples Distribution of data assumed Advanced methods Involve machine learning Several adjustable parameters Many training samples required (e.g.. 50-100) Flexible methods
18
Cross-validation Given data with 100 samples Cross-5-validation: Training: 4/5 of data (80 samples) Testing: 1/5 of data (20 samples) - 5 different models and performance results Leave-one-out cross-validation (LOOCV) Training: 99/100 of data (99 samples) Testing: 1/100 of data (1 sample) - 100 different models
19
Validation Definition of –True and False positives –True and False negatives Actual classBBTT Predicted classBTTB TPFNTNFP
20
Sensitivity: the ability to detect "true positives" TP TP + FN Specificity: the ability to avoid "false positives" TN TN + FP Positive Predictive Value (PPV) TP TP + FP Sensitivity and Specificity
21
Accuracy Definition: TP + TN TP + TN + FP + FN Range: [0 … 1]
22
Definition:TP·TN - FN·FP √(TN+FN)(TN+FP)(TP+FN)(TP+FP) Range: [-1 … 1] Matthews correlation coefficient
23
Overview of Classification Expression data Subdivision of data for cross-validation into training sets and test sets Feature selection (t-test) Dimension reduction (PCA) Training of classifier: - using cross-validation - choice of method - choice of optimal parameters Testing of classifierIndependent test set
24
Important Points Avoid over fitting Validate performance –Test on an independent test set –Use cross-validation Include feature selection in cross-validation Why? –To avoid overestimation of performance! –To make a general classifier
25
Multiclass classification Same prediction methods –Multiclass prediction often implemented –Often bad performance Solution –Division in small binary (2-class) problems
26
Study of Childhood Leukemia: Results Classification of immunophenotype (precursorB and T) –100% accuracy During the training When testing on an independent test set –Simple classification methods applied K-nearest neighbor Nearest centroid Classification of outcome (relapse or remission) –78% accuracy (CC = 0.59) –Simple and advanced classification methods applied
27
Summary When do we use classification? Reduction of dimension often necessary –T-test, PCA Methods –K nearest neighbor –Nearest centroid –Support vector machines –Neural networks Validation –Cross-validation –Accuracy and correlation coefficient
28
Coffee break Next: Exercise
29
Prognostic factors: Immunophenotype Age Leukocyte count Number of chromosomes Translocations Treatment response Risk classification in the future ? Risk group: Standard Intermediate High Very high Ekstra high Patient: Clincal data Immunopheno- typing Morphology Genetic measurements Microarray technology Customized treatment
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.