Published by Harry Phillip Banks. Modified over 9 years ago.
1
Multifactor Dimensionality Reduction
Laura Mustavich
Introduction to Data Mining
Final Project Presentation
April 26, 2007
2
The Inspiration For a Method
3
The Nature of Complex Diseases
- Most common diseases are complex
- They are caused by multiple genes
- These genes often interact with one another
- This interaction is termed epistasis
4
Epistasis
When an allele at one locus masks the effect of an allele at another locus
5
The Failure of Traditional Methods
- Traditional gene-hunting methods have been successful for rare Mendelian (single-gene) diseases
- They have been unsuccessful for complex diseases:
  - Because many genes interact to cause the disease, the effect of any single gene is too small to detect
  - They do not take this interaction into account
6
MDR: The Algorithm
7
Multifactor Dimensionality Reduction
- A data mining approach to identify interactions among discrete variables that influence a binary outcome
- A nonparametric alternative to traditional statistical methods such as logistic regression
- Driven by the need to improve the power to detect gene-gene interactions
8
Multifactor Dimensionality Reduction
9
MDR Step 0
Divide the data (genotypes, discrete environmental factors, and affection status) into 10 distinct subsets for cross-validation
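Step 0 can be sketched as follows. This is a minimal illustration, not the MDR software's own code: the function names, the fixed seed, and the use of integer record IDs are assumptions made for the example.

```python
import random

def split_into_folds(records, k=10, seed=0):
    """Shuffle the records and deal them into k disjoint, roughly equal subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Taking every k-th record keeps fold sizes within one of each other.
    return [shuffled[i::k] for i in range(k)]

def train_test_pairs(folds):
    """Yield (training, testing) pairs, holding out each subset exactly once."""
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Hypothetical record IDs standing in for 79 complete genotype records.
folds = split_into_folds(range(79), k=10)
pairs = list(train_test_pairs(folds))
```

Each of the 10 subsets serves once as the testing set while the other 9 together form the training set.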
11
MDR Step 1
Select a set of n genetic or environmental factors (those suspected of interacting epistatically) from the set of all variables in the training set
13
MDR Step 2
Create a contingency table for these multilocus genotypes, counting the number of affected and unaffected individuals with each multilocus genotype
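A minimal sketch of the contingency table in Step 2, assuming each record is a pair of a genotype dictionary and a status (1 = affected, 0 = unaffected). The record layout and the marker names `snp1` and `snp2` are hypothetical.

```python
from collections import Counter

def contingency_table(records, factors):
    """Count affected and unaffected individuals per multilocus genotype.

    records: list of (genotypes, status) pairs, status 1 = case, 0 = control.
    factors: names of the n selected factors.
    """
    cases, controls = Counter(), Counter()
    for genotypes, status in records:
        # The multilocus genotype is the combination of genotypes at the chosen factors.
        cell = tuple(genotypes[f] for f in factors)
        (cases if status == 1 else controls)[cell] += 1
    return cases, controls

toy = [
    ({"snp1": "AA", "snp2": "CT"}, 1),
    ({"snp1": "AA", "snp2": "CT"}, 1),
    ({"snp1": "AA", "snp2": "CT"}, 0),
    ({"snp1": "AG", "snp2": "CC"}, 0),
]
cases, controls = contingency_table(toy, ["snp1", "snp2"])
```

In this toy data, the cell `("AA", "CT")` holds 2 cases and 1 control, and the cell `("AG", "CC")` holds 1 control and no cases.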
15
MDR Step 3
Calculate the ratio of cases to controls for each multilocus genotype
17
MDR Step 4
Label each multilocus genotype as “high-risk” or “low-risk”, depending on whether its case-control ratio is above a certain threshold
This is the dimensionality reduction step: it reduces the n-dimensional genotype space to 1 dimension with 2 levels
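Steps 3 and 4 can be sketched together. The slide does not state the threshold, so the default of 1.0 here is an assumption (the overall case-control ratio of the sample is another plausible choice), as is sending ties to "low-risk".

```python
def label_cells(cases, controls, threshold=1.0):
    """Label each multilocus genotype by its case-control ratio.

    cases, controls: dicts mapping a genotype cell to a count.
    threshold: assumed cut-off for the case-control ratio.
    """
    labels = {}
    for cell in set(cases) | set(controls):
        n_case = cases.get(cell, 0)
        n_ctrl = controls.get(cell, 0)
        # n_case / n_ctrl > threshold, rewritten to avoid dividing by zero.
        labels[cell] = "high-risk" if n_case > threshold * n_ctrl else "low-risk"
    return labels

toy_cases = {("AA",): 3, ("AG",): 1}
toy_controls = {("AA",): 1, ("AG",): 2, ("GG",): 4}
labels = label_cells(toy_cases, toy_controls)
```

However many factors are combined, every individual ends up described by a single two-level variable, which is what makes the later classification step cheap.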
19
MDR Step 5
Use labels to classify individuals as cases or controls, and calculate the misclassification rate
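A sketch of Step 5, assuming the same hypothetical record layout of (genotype dictionary, status) pairs and a label dictionary keyed by multilocus genotype. Treating genotypes that never appeared in training as "low-risk" is an assumption of this example.

```python
def misclassification_rate(records, factors, labels, default="low-risk"):
    """Predict case (1) for high-risk genotypes, control (0) otherwise,
    and return the fraction of individuals predicted wrongly."""
    errors = 0
    for genotypes, status in records:
        cell = tuple(genotypes[f] for f in factors)
        predicted = 1 if labels.get(cell, default) == "high-risk" else 0
        errors += predicted != status
    return errors / len(records)

toy_records = [
    ({"g": "AA"}, 1),  # high-risk, truly a case: correct
    ({"g": "AA"}, 0),  # high-risk, truly a control: error
    ({"g": "GG"}, 0),  # low-risk, truly a control: correct
    ({"g": "AG"}, 1),  # unseen genotype, falls back to low-risk: error
]
toy_labels = {("AA",): "high-risk", ("GG",): "low-risk"}
rate = misclassification_rate(toy_records, ["g"], toy_labels)
```

Applied to a training set this gives the training error, and applied to the held-out subset it gives the prediction error.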
21
Repeat steps 1-5 for:
- All possible combinations of n factors
- All possible values of n
- All 10 training and testing sets
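The exhaustive search over factor combinations can be sketched with the standard library. The `max_order` cap is an assumption of this example: in practice the search stops at some largest n rather than trying every possible value.

```python
from itertools import combinations

def candidate_models(variables, max_order):
    """Yield every combination of 1 to max_order variables."""
    for n in range(1, max_order + 1):
        yield from combinations(variables, n)

# Hypothetical variable names.
models = list(candidate_models(["a", "b", "c"], max_order=2))
```

Each yielded tuple is one candidate model, to be evaluated through steps 1-5 on every training and testing set.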
22
The Best Model
- Minimizes prediction error: the average misclassification rate across the 10 cross-validation testing subsets
- Maximizes cross-validation consistency: the number of times a particular model was the best model across the cross-validation subsets
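Both selection criteria are simple summaries over the cross-validation runs. The sketch below assumes that the best model from each training set and its error on the matching testing set have already been recorded. The model names and error values are made up for illustration.

```python
from collections import Counter

def cv_consistency(best_per_fold):
    """Return the most frequently selected model and how many times it was selected."""
    return Counter(best_per_fold).most_common(1)[0]

def prediction_error(test_errors):
    """Average misclassification rate over the testing subsets."""
    return sum(test_errors) / len(test_errors)

# Hypothetical outcome of 10 cross-validation runs.
best_per_fold = ["model_A"] * 6 + ["model_B"] * 4
test_errors = [0.2, 0.3, 0.25, 0.2, 0.3, 0.25, 0.2, 0.3, 0.25, 0.25]

model, consistency = cv_consistency(best_per_fold)
error = prediction_error(test_errors)
```

Here `model_A` has a cross-validation consistency of 6 out of 10.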
23
Hypothesis test of best model
Evaluate the magnitude of the cross-validation consistency and prediction error estimates by permutation testing:
- Randomize the disease labels
- Repeat the MDR analysis many times to get the distributions of cross-validation consistency and prediction error expected by chance
- Use these distributions to determine p-values for the actual cross-validation consistency and prediction error
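The permutation procedure can be sketched generically. In this illustration `statistic` is a hypothetical callable standing in for a full MDR rerun on the permuted labels, and the add-one correction is a common convention rather than something stated on the slide. For a statistic where smaller is better, such as prediction error, the comparison would be reversed.

```python
import random

def permutation_p_value(observed, statistic, labels, n_permutations=1000, seed=0):
    """Estimate P(statistic >= observed) when disease labels are assigned at random."""
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real genotype-phenotype association
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction: the estimate can never be exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Toy stand-in for an MDR rerun: agreement between labels and a fixed prediction.
predicted = [1, 1, 1, 1, 0, 0, 0, 0]
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]

def agreement(labels):
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)

p = permutation_p_value(agreement(true_labels), agreement, true_labels)
```

Because perfect agreement is rare under shuffled labels, `p` comes out small in this toy case.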
24
Permutation Testing: An Illustration
Sample quantiles:
Quantile | Value
--- | ---
0% | 0.045754
25% | 0.168814
50% | 0.237763
75% | 0.321027
90% | 0.423336
95% | 0.489813
99% | 0.623899
99.99% | 0.872345
100% | 1
Observed value: 0.4500
The probability of seeing a result as extreme as, or more extreme than, 0.4500 simply by chance is between 5% and 10%
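The bracketing argument on this slide can be reproduced from the quantile table alone. The helper name is hypothetical, and it assumes the observed value lies within the tabulated range.

```python
# (cumulative probability, value) pairs copied from the slide.
quantiles = [
    (0.0, 0.045754), (0.25, 0.168814), (0.50, 0.237763),
    (0.75, 0.321027), (0.90, 0.423336), (0.95, 0.489813),
    (0.99, 0.623899), (0.9999, 0.872345), (1.0, 1.0),
]

def p_value_bounds(observed, quantiles):
    """Return (lower, upper) bounds on P(statistic >= observed)."""
    below = max(q for q, v in quantiles if v <= observed)
    above = min(q for q, v in quantiles if v >= observed)
    return 1 - above, 1 - below

lower, upper = p_value_bounds(0.4500, quantiles)
```

The observed 0.4500 falls between the 90% quantile (0.423336) and the 95% quantile (0.489813), so the bounds are approximately 0.05 and 0.10.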
25
Strengths
- Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multilocus data
- Nonparametric: no parameters are estimated
- Assumes no particular genetic model
- False positives arising from multiple testing are kept low by cross-validation and permutation testing
26
Weaknesses
- Computationally intensive (especially with >10 loci)
- The curse of dimensionality: predictive ability decreases with high dimensionality and small samples, because many genotype cells contain no data
27
MDR Software
28
The Authors
Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Hahn, Ritchie, Moore, 2003.
www.sourceforge.net
30
Values Calculated by MDR
Measure | Formula | Interpretation
--- | --- | ---
Balanced Accuracy | (Sensitivity+Specificity)/2 | Fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class
Accuracy | (TP+TN)/(TP+TN+FP+FN) | Proportion of instances correctly classified
Sensitivity | TP/(TP+FN) | Proportion of actual positives correctly classified
Specificity | TN/(TN+FP) | Proportion of actual negatives correctly classified
Odds Ratio | (TP*TN)/(FP*FN) | Compares whether the probability of a certain event is the same for two groups
χ² | | Chi-squared score for the attribute constructed by MDR from this attribute combination
Precision | TP/(TP+FP) | Proportion of predicted positives that are actual positives
Kappa | 2(TP*TN-FP*FN)/[(TP+FN)(FN+TN)+(TP+FP)(FP+TN)] | A function of total accuracy and random accuracy
F-Measure | 2*TP/(2*TP+FP+FN) | A function of sensitivity and precision
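The measures can be computed directly from the four cells of a confusion matrix. This is a sketch under stated assumptions, not the MDR software's code. Kappa uses the standard Cohen form, with the difference TP*TN - FP*FN in the numerator. The chi-squared value uses the usual Pearson formula for a 2x2 table without continuity correction, which the slide does not spell out. The example counts (TP=14, TN=15, FP=1, FN=0) are one set inferred to be consistent with the Ami whole-dataset statistics reported later in the deck, not values taken from the source. The function assumes both classes and both predictions are non-empty.

```python
import math

def mdr_measures(tp, tn, fp, fn):
    """Compute the reported measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # An empty off-diagonal cell makes the odds ratio infinite.
        "odds_ratio": math.inf if fp * fn == 0 else (tp * tn) / (fp * fn),
        # Pearson chi-squared for a 2x2 table, no continuity correction (assumed).
        "chi_squared": total * (tp * tn - fp * fn) ** 2
        / ((tp + fn) * (fp + tn) * (tp + fp) * (fn + tn)),
        "precision": tp / (tp + fp),
        "kappa": 2 * (tp * tn - fp * fn)
        / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

m = mdr_measures(tp=14, tn=15, fp=1, fn=0)
```

With these counts the balanced accuracy is 0.96875, kappa is 14/15, and the chi-squared score is 26.25.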
31
Sign Test
n = number of cross-validation intervals
c = number of cross-validation intervals with testing accuracy ≥ 0.5
p = the probability of observing c or more cross-validation intervals with testing accuracy ≥ 0.5 if each case were actually classified at random
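The sign test p-value is an upper-tail binomial probability with success probability 0.5: the sum over k from c to n of C(n, k), divided by 2^n. The sketch below assumes this standard form. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb

def sign_test_p(c, n=10):
    """P(at least c of n intervals reach accuracy >= 0.5) under random classification."""
    return sum(comb(n, k) for k in range(c, n + 1)) / 2 ** n

p_values = {c: round(sign_test_p(c), 4) for c in (0, 3, 5, 8, 10)}
```

With n = 10 this gives 1.0 for c = 0, 0.9453 for c = 3, 0.623 for c = 5, 0.0547 for c = 8, and 0.001 for c = 10, matching the sign test values in the results tables.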
32
The Problem of Alcoholism: A Case Study
33
Genes Associated With Alcoholism
Alcohol → (ADH enzymes) → Acetaldehyde → (ALDH2 enzyme) → Acetate
The ADH (alcohol dehydrogenase) and ALDH2 (acetaldehyde dehydrogenase 2) genes are involved in alcohol metabolism and are associated with alcoholism
34
ADH Genes
[Figure: the seven ADH genes (ADH7, ADH6, ADH4, ADH5, ADH1B, ADH1A, ADH1C) within 370 kb on chromosome 4, drawn 5’ to 3’ and labeled by class, Class I through Class V]
35
Taste Receptors and Aversion to Alcohol
PTC (TAS2R38)
Tasters: alcohol tastes bitter, so they drink less alcohol
Non-tasters: alcohol tastes sweet, so they drink more alcohol
- A person must be willing to drink in order to be an alcoholic
- TAS2R38 affects the amount of alcohol a person is willing to drink
- Therefore, it is related to alcoholism, although no direct association has been found
- We hope to provide a direct link between TAS2R38 and alcoholism by demonstrating that it acts epistatically with other genes associated with alcoholism
36
Actual Analysis
37
Data
- A sample of cases and controls (alcoholics and non-alcoholics) from three East Asian populations: the Ami, Atayal, and Taiwanese
- Genotyped for 98 markers within several genes: ALDH2, all ADH genes, and 2 taste receptor genes, TAS2R16 and TAS2R38 (PTC)
38
Computational Limitations
1. The software package has a problem reading missing data
- I was forced to use only complete records, which reduced my (already small) sample to 79 complete records
39
Computational Limitations
2. The computation time is far too long for higher-order models, especially with many attributes
- I was advised to restrict my attributes to markers within ADH1C and the 2 taste receptor genes, which left me with 36 attributes
- I considered models only up to order 4
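A quick count shows why the search grows so fast even after restricting to 36 attributes and order 4. The figure of three genotypes per marker is an assumption (it holds for a biallelic marker), used only to illustrate how sparse the genotype cells become.

```python
from math import comb

attributes = 36
max_order = 4

# Number of candidate models at each order: 36 choose n.
models_per_order = {n: comb(attributes, n) for n in range(1, max_order + 1)}
total_models = sum(models_per_order.values())

# Each model is evaluated on all 10 cross-validation splits.
evaluations = total_models * 10

# Assumed: 3 genotypes per marker, so an order-4 model has 3**4 genotype cells.
cells_at_order_4 = 3 ** max_order
```

There are 36, 630, 7,140, and 58,905 models at orders 1 to 4, or 66,711 in total, which means 667,110 model evaluations over 10 splits. Under the three-genotype assumption an order-4 model has 81 genotype cells, more than the 79 complete records available, so some cells are necessarily empty.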
40
Summary of Results: All Populations
Order | Model | Training Bal. Acc. | Testing Bal. Acc. | Sign Test (p) | CV Consistency
--- | --- | --- | --- | --- | ---
1 | X.04..ADH1C.dwstrm.Te | 0.6049 | 0.4278 | 0 (1.0000) | 5/10
2 | X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te | 0.7076 | 0.4438 | 3 (0.9453) | 6/10
3 | X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te, X.04..ADH1C.rs3762896 | 0.785 | 0.3186 | 1 (0.9990) | 4/10
4 | X.07..TAS2R16.C_11431, X.07..PTC.C_8876291_1, X.07..PTC.C_8876482_1, X.04..ADH1C.dwstrm.Te | 0.8453 | 0.3564 | 2 (0.9893) | 6/10
Instances: 79
Attributes: 36
Ratio: 1.3235
41
Summary of Results: Ami
Order | Model | Training Bal. Acc. | Testing Bal. Acc. | Sign Test (p) | CV Consistency
--- | --- | --- | --- | --- | ---
1 | X.07..TAS2R16.C_11431 | 0.7331 | 0.4598 | 5 (0.6230) | 5/10
2 | X.07..TAS2R16.C_11431, X.04..ADH1C.C_2688508 | 0.8284 | 0.3476 | 2 (0.9893) | 3/10
3 | X.07..TAS2R16.C_11431, X.07..PTC.C_8876467_1, X.04..ADH1C.C_2688508 | 0.9688 | 0.9545 | 10 (0.0010) | 10/10
4 | X.07..TAS2R16.C_11431, X.07..TAS2R16.C_11431.1, X.07..PTC.C_8876467_1, X.04..ADH1C.C_2688508 | 0.9722 | 0.8712 | 8 (0.0547) | 9/10
Instances: 30
Attributes: 36
Ratio: 0.8750
42
Cross Validation Statistics
Measure | Training Set | Testing Set
--- | --- | ---
Balanced Accuracy | 0.9688 | 0.9545
Accuracy | 0.9667 | 0.95
Sensitivity | 1 | 1
Specificity | 0.9375 | 0.9091
Odds Ratio | ∞ | ∞
χ² | 23.6250 (p < 0.0001) | 1.6364 (p = 0.2008)
Precision | 0.9333 | 0.9
Kappa | 0.9333 | 0.9
F-Measure | 0.9655 | 0.9474
Sign Test: 10 (p = 0.0010)
Cross-validation Consistency: 10/10
43
Whole Dataset Statistics:
Training Balanced Accuracy: 0.9688
Training Accuracy: 0.9667
Training Sensitivity: 1.0000
Training Specificity: 0.9375
Training Odds Ratio: ∞
Training χ²: 26.2500 (p < 0.0001)
Training Precision: 0.9333
Training Kappa: 0.9333
Training F-Measure: 0.9655
44
Graphical Model
45
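Classification Rules
Each row reads as a rule, for example: IF A\A AND C\G AND C\C THEN 0
X.07..TAS2R16.C_11431 | X.07..PTC.C_8876467_1 | X.04..ADH1C.C_2688508 | Class
--- | --- | --- | ---
A\A | C\G | C\C | 0
A\A | C\G | C\T | 1
A\A | C\G | T\T | 0
A\A | G\G | C\C | 0
A\A | G\G | C\T | 0
A\A | G\G | T\T | 1
A\G | C\C | C\T | 1
A\G | C\G | C\C | 0
A\G | C\G | C\T | 0
A\G | C\G | T\T | 0
A\G | G\G | C\C | 1
A\G | G\G | C\T | 1
A\G | G\G | T\T | 0
G\G | C\G | C\T | 1
G\G | | C\C | 0
G\G | | C\T | 1
G\G | | T\T | 1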
46
Locus Dendrogram
47
Future Work
- Simulations to calculate the power of MDR, especially in relation to sample size
- Comparison of MDR with logistic regression and other proposed methods to detect epistasis, on both the current data set and simulated data
- Research into how different methods of searching the sample space can be incorporated into the MDR implementation to improve computational feasibility