1
Multiclass classification of microarray data with repeated measurements: application to cancer
Ka Yee Yeung & Roger E Bumgarner
Genome Biology 2003, 4:R83
2
Sample Classification
* Use gene expression measurements from microarray experiments to classify biological samples (e.g. types of tumors).
* Goals
  * Utilize repeated measurements
  * Multiclass classification
  * Remove redundancy
  * No assumption of distribution
3
Shrunken Centroid Classification
* Feature selection
  * Consider features individually
  * Calculate overall centroid and each class centroid
  * “Shrink” class centroids by factor Δ
  * Compare shrunken class centroids to overall centroid
  * If significantly different, feature is predictive for the class
  * Estimate optimum Δ using 10-fold cross validation
* Classification
  * Calculate standardized, squared difference of sample to each shrunken class centroid for selected features
  * Assign to class with nearest centroid
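The feature selection and classification steps on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`soft_threshold`, `fit_shrunken_centroids`, `classify`), the soft-thresholding rule, the pooled within-class standard deviation, and the small constant `eps` guarding against division by zero are all assumptions made for the sketch. Class priors are ignored, Δ is passed in rather than chosen by cross validation, and at least two classes are assumed.

```python
import math


def soft_threshold(d, delta):
    """Move d toward zero by delta; anything within delta of zero becomes 0."""
    return math.copysign(max(abs(d) - delta, 0.0), d)


def fit_shrunken_centroids(X, y, delta, eps=1e-8):
    """X: list of samples (lists of feature values); y: class labels.

    Assumes at least two classes and more samples than classes.
    """
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    overall = [sum(row[j] for row in X) / n for j in range(p)]
    members = {k: [x for x, lab in zip(X, y) if lab == k] for k in classes}
    centroid = {k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
                for k, rows in members.items()}
    # Pooled within-class standard deviation per feature (eps is an assumption).
    s = [math.sqrt(sum((x[j] - centroid[lab][j]) ** 2 for x, lab in zip(X, y))
                   / (n - len(classes))) + eps
         for j in range(p)]
    shrunk_d, shrunk = {}, {}
    for k in classes:
        m_k = math.sqrt(1.0 / len(members[k]) - 1.0 / n)
        # Standardized difference between class centroid and overall centroid.
        d = [(centroid[k][j] - overall[j]) / (m_k * s[j]) for j in range(p)]
        shrunk_d[k] = [soft_threshold(dj, delta) for dj in d]
        shrunk[k] = [overall[j] + m_k * s[j] * shrunk_d[k][j] for j in range(p)]
    # A feature is kept if at least one class centroid survives shrinkage.
    selected = [j for j in range(p)
                if any(shrunk_d[k][j] != 0 for k in classes)]
    return {"classes": classes, "shrunk": shrunk, "s": s, "selected": selected}


def classify(x, model):
    """Assign x to the class with the nearest shrunken centroid."""
    def distance(k):
        return sum((x[j] - model["shrunk"][k][j]) ** 2 / model["s"][j] ** 2
                   for j in model["selected"])
    return min(model["classes"], key=distance)
```

In this sketch a larger `delta` zeroes out more standardized differences, so fewer features are selected. Choosing `delta` by 10-fold cross validation, as the slide describes, would wrap `fit_shrunken_centroids` and `classify` in a loop over candidate values.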
4
Redundancy & Error Estimation
* Uncorrelated Shrunken Centroid (USC)
  * Removes redundant genes
  * For each set of relevant genes
    * Compute pairwise correlations
    * Remove least relevant gene from pairs with correlation above given threshold
  * Use cross-validation to determine best pair (shrinkage factor, correlation threshold)
* Error Weighted Uncorrelated SC (EWUSC)
  * The standard deviation of the sample mean is used to down weight the most variable genes and experiments
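The redundancy-removal step of USC can be sketched as a greedy filter. This is an illustrative reading of the slide, not the authors' code: the names `pearson` and `remove_redundant`, the use of Pearson correlation, the use of signed rather than absolute correlation, and the greedy most-relevant-first order are assumptions. The relevance scores are taken as given (for example, from the shrunken centroid step).

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def remove_redundant(profiles, relevance, rho):
    """Keep genes in order of decreasing relevance, dropping any gene whose
    correlation with an already-kept gene exceeds rho.

    profiles:  dict gene -> expression values across samples
    relevance: dict gene -> relevance score (higher is more relevant)
    rho:       correlation threshold (hypothetical parameter name)
    """
    kept = []
    for gene in sorted(profiles, key=lambda g: relevance[g], reverse=True):
        if all(pearson(profiles[gene], profiles[other]) <= rho for other in kept):
            kept.append(gene)
    return kept
```

Because genes are visited from most to least relevant, the gene dropped from any highly correlated pair is always the less relevant one. Selecting the pair (shrinkage factor, correlation threshold) by cross-validation would repeat this filter over a grid of both values.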
5
Experiments
* Datasets
  * Synthetic datasets, varying:
    * Biological noise level
    * Technical noise level
    * Number of repeated measurements
    * Percent of relevant genes
  * Real Datasets
    * Multiple tumor dataset – 7,129 genes, 123 samples, 11 classes (types of tumors)
    * Breast cancer dataset – 25,000 genes, 97 samples, 2 classes (good or poor prognosis)
* Evaluation Criteria
  * Prediction Accuracy
  * Number of relevant features selected
  * Feature stability
6
Synthetic data results
* Removing redundant genes (USC)
  = Similar accuracy
  + Using same or fewer genes
* Error weighting results on synthetic datasets
  * Two types of error defined
    * Technical noise – variation over repeated measurements (λ)
      * Low (1) or High (5, 10)
      + Handled “technical noise” well (similar accuracy, fewer genes)
    * Biological noise – signal to noise ratio (α)
      * 20 to 1, 2 to 1, or 1 to 1
      * Accuracy was worse with increased “biological noise”, despite an increasing number of repeated measurements
* Criticism
  * Noise is the same over the entire dataset; it should vary for different genes
  * Each dataset would have some high signal to noise genes
7
Real Data Results
* Removing redundant genes (USC)
  = Similar, but varying accuracy
  + Using many fewer genes
* Error weighting – Real Datasets
  * Multiple tumor data
    + Improved accuracy
    + Improved feature stability
    = Using similar number of genes
  * Breast cancer data
    + Improved accuracy
    = Similar feature stability
    – Using increased number of genes