JM - http://folding.chmcc.org

Machine Learning for Studies of Genotype-Phenotype Correlations
Jarek Meller
Division of Biomedical Informatics, Children's Hospital Research Foundation & Department of Biomedical Engineering, UC
Outline
- Motivating story: correlating inputs and outputs
- Learning with a teacher (supervised learning)
- Model selection, feature selection and generalization
- k-Nearest Neighbors, Least Squares regression, Support Vector Machines and some other machine learning approaches
- Genotype-phenotype correlations and predictive fingerprints of phenotypes
- Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001
- Early results for JRA SNP data (D. Glass et al.)
Of statistics and machine learning
- t-Test vs. regression or decision trees
- Assessment vs. predictive models
[Figure: treatment group mean vs. control group mean; continuous variables vs. discrete (categorical) variables]
Choice of the model, problem representation and feature selection: another simple example
[Figure: scatter plots with axes height, weight and estrogen, testosterone, separating F vs. M and adults vs. children]
Three phases in supervised learning protocols
- Training data: examples with class assignments are given
- Learning: (i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type; (ii) adaptive parameters in the model need to be optimized to provide correct classification of training examples (e.g. minimizing the number of misclassified training vectors)
- Validation: cross-validation, independent control sets and other measures of "real" accuracy and generalization should be used to assess the success of the model and the training phase (finding a trade-off between accuracy and generalization is not trivial)
Model complexity, training set size and generalization
Examples of machine learning algorithms for classification and regression problems
- Linear perceptron, Least Squares
- LDA/FDA (Linear/Fisher Discriminant Analysis): simple linear cuts, kernel non-linear generalizations
- SVM (Support Vector Machines): optimal, wide-margin linear cuts, kernel non-linear generalizations
- Decision trees: logical rules
- k-NN (k-Nearest Neighbors): simple, non-parametric
- Neural networks: general non-linear models, adaptivity, "artificial brain"
Decision trees provide a piecewise linear solution
Support Vector Machines provide a wide-margin solution (separating hyperplane w·x + b = 0)
Optimizing adaptable parameters in the model
- Find a model y(x; w) that describes the objects of each class as a function of the features x and adaptive parameters (weights) w.
- Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C (e.g. if y(x; w) > 0.5 then C=1, i.e. likely to suffer from a stroke or heart attack in the next 5 years).
Training accuracy vs. generalization
Case Study: Sporadic Breast Cancer
- Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001
- Study based on 200 white women with sporadic primary invasive breast cancer who were treated at Vanderbilt University Medical Center during 1982-96
- Patients with sporadic breast cancer were frequency age-matched to control patients at Vanderbilt University Medical Center who had been hospitalized for various acute and chronic illnesses
- Analysis focused on the genes: COMT (MIM 116790), 22q11.2; CYP1A1 (MIM 108330), 15q22-qter; CYP1B1 (MIM 601771), 2p21-22; GSTM1 (MIM 138350), 1p13.3; and GSTT1 (MIM 600436), 22q11.2
- Case-control study (machine learning to the rescue)
Polymorphisms in the genes of interest
Genes involved in oxidative metabolism of estrogens
Genotype representation and identification of predictive loci (fingerprints): MDR
Main effects (individual SNPs and the chi-square test)
For the simulated data shown before:

           High risk   Low risk   Total
  AA           27          24       51
  Aa           36          38       74
  aa           21          24       45
  Total        84          86      170

Chi-square statistic: sum of (O - E)^2 / E over all cells.
Genotype/haplotype representations
Example genotypes for three biallelic loci: AABBCC, AaBBCC, AABbCC, aaBBCC, AAbbCC, aabbcc, ...
Vector representation: (x, y, z) with x, y, z = 0, 1, 2 (one component per locus).
In general, 3^n genotypes for n biallelic loci, i.e. highly dimensional representations ...
Multiple loci and more complex fingerprints
Cross-validation results
The role of gene-gene interactions in multifactorial disease: towards even more complex traits ...
CYP1A1, GSTM1, and GSTT1 polymorphisms were examined before in a case-control study of 328 white and 108 African American women, using multiple logistic-regression analysis (Bailey et al. 1998b). None of the enzyme genotypes, individually or combined, were associated with an increased risk for breast cancer. However, COMT and CYP1B1 were not included in the analysis, because their roles in the catechol-estrogen pathway and/or their various polymorphisms were only recently elucidated.
Here, the influence of each genotype on disease risk appears to depend on the genotypes at each of the other loci: gene-gene interactions.
Complexity of the model and power calculations (adapted from Ritchie et al.)
- In logistic regression, as each additional main effect is included in the model, the number of possible interaction terms grows exponentially. On the other hand, simulation studies by Peduzzi et al. (1996) suggest that having fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients.
- Hosmer and Lemeshow (2000) suggest that logistic-regression models should contain no more than P < min(n1, n0)/10 parameters, where n1 is the number of events of type 1 and n0 is the number of events of type 0.
- For the 200 cases and the 200 controls evaluated in the present study, this formula suggests that no more than 19 parameters should be estimated in a logistic-regression model.
Complexity of the model and power calculations (adapted from Ritchie et al.), continued
- The number of regression terms needed to describe the interactions among a subset of size k of n biallelic loci is (n choose k) × 2^k (Wade 2000).
- Thus, for 10 genes, we would need 20 parameters to model the main effects (assuming two dummy variables per biallelic locus), 180 parameters to model the two-way interactions, 960 parameters to model the three-way interactions, 3,360 parameters to model the four-way interactions, and so forth.
- The MDR method avoids the problems associated with the use of parametric statistics to model high-order interactions. At the same time, MDR involves sampling (evaluation) of different combinations of loci - exponential scaling anyway ...
Some conclusions from Ritchie et al.
"If MDR is going to be used for genome scans with hundreds to thousands of single-nucleotide polymorphisms, then it will be necessary to develop machine learning strategies to optimize the selection of polymorphisms to be modeled, since an exhaustive search of all possible combinations will not be possible. We are currently exploring the use of parallel genetic algorithms (Cantú-Paz 2000) as a robust machine learning approach."
Feature selection and aggregation, inferring a classifier (approximator), validating prediction using cross-validation and independent new data, i.e., applying machine learning approaches ...
Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs
Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs (Muse and Gibson, 2004)
Merging bottom-up and top-down approaches
- Main effects and interactions (for limited k-tuples): "statistics-based" approach, collaboration with Jack Collins and his group (NCI)
- Selection of loci/SNPs (feature selection) based on the initial (limited) statistical analysis: use haplotype-based tag SNPs
- Combining promising features into a complex pattern (predictive fingerprint): machine learning
Some early results for JRA (joint work with the Rheumatology and Human Genetics Divisions)
- 771 SNPs from chromosome 2 and 765 from chromosome 7, respectively (regions around previously implicated loci with high LOD scores for associations with JRA subtypes)
- Haplotype blocks identified and representative SNPs derived
- Feature selection based on chi-square statistics and other measures
- Training and assessment using cross-validation on a set of about 200 data points (in several classes), case-control type of study, multiple machine learning methods applied
- No significant correlation of individual SNPs with clinical classes observed
- The top 20 SNPs, when combined into a classifier, yield classification accuracy of about 70% for the problem of distinguishing between joint erosion and lack thereof (for affected individuals; baseline 62%)
- Much less success for the classification into JRA subtypes, i.e., it appears that the SNPs included in the study cannot be used to predict whether a person is likely to have a specific (clinically defined) disease subtype (e.g., poly vs. pauci)
Haplotype-based tag SNPs on chr2 vs. joint erosion ...
Next steps ...
- Use larger data sets with careful selection of informative SNPs using prior knowledge and feature selection algorithms
- Use expression profiling to define "molecular" phenotypes, to define classes and find predictive patterns in SNPs
- Validate, validate, validate ...