Modeling Gene Interactions in Disease CS 686 Bioinformatics
Some Definitions
Data mining: extracting hidden patterns and useful information from large data sets. Ex: clustering, machine learning. Should not be: "Torturing data until it confesses... and if you torture it enough, it will confess to anything" - Jeff Jonas, IBM
Machine learning: the ability of a program to learn from experience. Ex: neural networks, decision trees, rule-based methods, multifactor dimensionality reduction (MDR).
Methods
Regression methods: model the relationship between a dependent variable and one or more independent variables.
Data mining methods: search the space of possible models efficiently. Better with non-linear and high-dimensional data, or data with many potential interactions.
Exhaustive search: evaluate every possible model and keep the best one.
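To make the cost of exhaustive search concrete, here is a minimal Python sketch that scores every subset of up to two variables and keeps the best. The function names, the SNP labels, and the toy scoring function are hypothetical, chosen only for illustration; a real analysis would plug in a model-fit score.

```python
from itertools import combinations
from math import comb


def exhaustive_search(variables, score, max_order=2):
    """Score every subset of 1..max_order variables and return the best one."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, max_order + 1):
        for subset in combinations(variables, k):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score


def toy_score(subset):
    """Hypothetical score: only the pair (A, B) carries any signal."""
    return 1.0 if set(subset) == {"A", "B"} else 0.0


snps = ["A", "B", "C", "D"]  # hypothetical variable names
best, best_val = exhaustive_search(snps, toy_score)

# Number of two-variable models among 1000 candidate variables.
n_pairs_1000 = comb(1000, 2)
```

The count of candidate models grows combinatorially with the number of variables and the interaction order, which is why exhaustive search becomes impractical for high-order interactions.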
Linear regression
Relates the outcome to a linear combination of the parameters (but not necessarily of the independent variables).
Ex: Let y = incidence of disease, with n data points and independent variables A, B.
1) $y_i = b_0 + b_1 A_i + \varepsilon_i$, $i = 1, \dots, n$
2) $y_i = b_0 + b_2 B_i^2 + \varepsilon_i$, $i = 1, \dots, n$
where $b_0$, $b_1$, $b_2$ are parameters and $\varepsilon_i$ is the error term. In both examples the disease is modeled as linear in the parameters, although model 2 is quadratic in the variable B.
Linear regression
Given a sample, we estimate the parameters (ex: by least squares) to arrive at the fitted linear regression model [1].
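As a rough illustration of least-squares estimation, the sketch below fits a one-predictor model with the closed-form formulas, first on A and then on the squared values of B (the quadratic case is still linear in the parameters). The data values are made up and noise-free, and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (closed form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Model 1: y = b0 + b1 * A. Made-up data generated from y = 1 + 2A.
A = [0, 1, 2, 3]
y1 = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_line(A, y1)

# Model 2: y = b0 + b2 * B^2. Regress on the transformed predictor B^2.
B = [0, 1, 2, 3]
y2 = [0.5 + 3.0 * b ** 2 for b in B]
c0, c2 = fit_line([b ** 2 for b in B], y2)
```

Because the made-up data contain no error term, the estimates recover the generating parameters; with real data they would only approximate them.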
Multiple regression
Relates the outcome to a linear combination of several predictor variables.
Ex: Let y = incidence of disease, with n data points and independent variables $x_1, x_2, \dots, x_p$.
$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + \varepsilon_i$, $i = 1, \dots, n$
Best-fit line: for each unit increase in $x_{ip}$, with the other predictors held fixed, $y_i$ is expected to increase by $b_p$.
Logistic regression [1]
Often used when the outcome is binary; relates the log-odds of an event to a linear combination of predictor variables.
Ex: $\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x_B + \gamma x_C + i x_B x_C$, where p is the probability of the event, $x_B$ and $x_C$ are measured binary indicator variables, the regression coefficients $\beta$ and $\gamma$ represent main effects, and $i$ represents the interaction.
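The logistic model can be turned back into a probability by inverting the log-odds. The following Python sketch does this for the four combinations of the two binary indicators. The coefficient values are hypothetical and were picked so that the main effects are zero and only the interaction term matters; `interaction` stands for the coefficient written as i in the slide.

```python
from math import exp, log


def disease_probability(x_b, x_c, alpha, beta, gamma, interaction):
    """Invert ln(p / (1 - p)) = alpha + beta*x_b + gamma*x_c + interaction*x_b*x_c."""
    log_odds = alpha + beta * x_b + gamma * x_c + interaction * x_b * x_c
    return 1.0 / (1.0 + exp(-log_odds))


# Hypothetical coefficients: no main effects, pure interaction.
alpha, beta, gamma, interaction = -2.0, 0.0, 0.0, 3.0

probs = {
    (x_b, x_c): disease_probability(x_b, x_c, alpha, beta, gamma, interaction)
    for x_b in (0, 1)
    for x_c in (0, 1)
}

# Log-odds recovered from the probability when both indicators are 1.
log_odds_both = log(probs[(1, 1)] / (1.0 - probs[(1, 1)]))
```

With these values the probability only changes when both indicators equal 1, which is the pattern an interaction term is meant to capture.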
Other statistical methods [1]
Bayesian model selection: a statistical approach incorporating both prior distributions for parameters and observed data into the model.
Maximum likelihood: a statistical method used to make inferences about the combination of parameter values resulting in the highest probability of obtaining the observed data.
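A small Python sketch may help with the maximum likelihood idea. It assumes a simple binomial model (each of n individuals is a case with the same probability p) and made-up counts, then searches a grid of candidate values of p for the one that makes the observed data most probable. The model choice and the numbers are assumptions for illustration only.

```python
from math import log


def log_likelihood(p, cases, n):
    """Binomial log-likelihood of p, up to a constant that does not depend on p."""
    return cases * log(p) + (n - cases) * log(1.0 - p)


cases, n = 30, 100  # made-up data: 30 cases among 100 individuals

# Candidate values 0.001, 0.002, ..., 0.999.
grid = [k / 1000 for k in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, cases, n))
```

Under this model the grid search lands on the sample proportion cases / n, which is the known closed-form maximum likelihood estimate for a binomial probability.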
Modeling Terminology [1]
Saturated: a statistical model that is as full as possible (saturated) with parameters.
Marginal effects: the effects of one parameter averaged over the possible values taken by other parameters.
Entropy: the uncertainty associated with a random variable.
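Entropy can be quantified in more than one way; the sketch below assumes the usual Shannon definition in bits, $H = -\sum_k p_k \log_2 p_k$, and applies it to a few discrete distributions. The genotype frequencies 0.25, 0.50, 0.25 are used purely as an example input.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


h_fair = entropy([0.5, 0.5])              # two equally likely outcomes
h_genotype = entropy([0.25, 0.50, 0.25])  # three genotypes at example frequencies
h_certain = entropy([1.0])                # a single certain outcome
```

A variable with one certain outcome has zero entropy, and entropy rises as the outcomes become more evenly spread.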
Modeling Terminology [1]
Cross-validation: partitioning a data set into n subsets, then using each subset in turn as the test set while using the other n-1 to train.
Overfitting: a model that provides a good fit to a specific data set but generalizes poorly.
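The cross-validation procedure described above can be sketched in a few lines of Python. This version only produces the train/test index splits; fitting and scoring a model on each split is left out. The function name `k_fold_splits` and the interleaved assignment of samples to subsets are illustrative choices, not part of the original slides.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train, test) index lists, using each subset once as the test set."""
    indices = list(range(n_samples))
    # Assign samples to subsets in an interleaved fashion: 0, 1, ..., n_folds-1, 0, ...
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(n_samples=10, n_folds=5))
```

In practice the samples are usually shuffled before partitioning. Comparing training performance with the held-out test performance across splits is one common way to spot overfitting.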
Marginal Effects [2]
Marginal penetrance. Ex: the probability P(D | A = Aa), irrespective of what value B has.
Table II. Penetrance values for combinations of genotypes from two single nucleotide polymorphisms exhibiting interactions in the absence of independent main effects. Columns: genotypes AA (0.25), Aa (0.50), aa (0.25), followed by the marginal penetrance of each B genotype. Rows: genotypes BB (0.25), Bb (0.50), bb (0.25), followed by the marginal penetrance of each A genotype. Genotype frequencies are given in parentheses.
Marginal penetrance values for the A, B genotypes.
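A marginal penetrance is a frequency-weighted average of the two-locus penetrances over the genotypes at the other locus. The Python sketch below computes it for a hypothetical penetrance table. The penetrance values are invented for illustration and are not the values of Table II; only the genotype frequencies (0.25, 0.50, 0.25) come from the slide. The two loci are assumed to be independent.

```python
freq_a = {"AA": 0.25, "Aa": 0.50, "aa": 0.25}
freq_b = {"BB": 0.25, "Bb": 0.50, "bb": 0.25}

# Hypothetical penetrances P(D | A genotype, B genotype); NOT the Table II values.
penetrance = {
    ("AA", "BB"): 0.0, ("Aa", "BB"): 0.1, ("aa", "BB"): 0.0,
    ("AA", "Bb"): 0.1, ("Aa", "Bb"): 0.0, ("aa", "Bb"): 0.1,
    ("AA", "bb"): 0.0, ("Aa", "bb"): 0.1, ("aa", "bb"): 0.0,
}


def marginal_penetrance_a(a):
    """P(D | A = a): average over B genotypes, weighted by their frequencies."""
    return sum(freq_b[b] * penetrance[(a, b)] for b in freq_b)


def marginal_penetrance_b(b):
    """P(D | B = b): average over A genotypes, weighted by their frequencies."""
    return sum(freq_a[a] * penetrance[(a, b)] for a in freq_a)


marginal_a = {a: marginal_penetrance_a(a) for a in freq_a}
marginal_b = {b: marginal_penetrance_b(b) for b in freq_b}
```

In this invented table every marginal penetrance comes out the same, so neither locus shows a main effect on its own even though disease risk depends on the genotype combination.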
Weka [3]
A collection of visualization tools and algorithms for data analysis and predictive modeling.
Preprocessing: tools for reading data in a variety of formats and transforming it.
Classification: algorithms include regression, neural network, support vector machine, decision tree. Display includes ROC curves.
Clustering: k-means, expectation maximization.
Visualization: includes scatter plot, bar graph.
References
[1] Cordell, 2009. Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics.
[2] McKinney et al., 2006. Machine Learning for Detecting Gene-Gene Interactions: A Review. Biomedical Genomics and Proteomics.
[3] Weka site: