SNP Data Analysis
Meredith L. Wilcox, FIU, Department of Epidemiology/Biostatistics
Presentation Outline
Part I. Preliminary Analysis: Introduction, Methods, Results, Discussion
Part II. Ongoing Research: Background, Purpose, Software
Part I. Preliminary Analysis: Introduction, Methods, Results, Discussion
Part I. Introduction
Purpose of the preliminary analysis:
- To develop two initial prediction models for the leukemia SNP data: logistic regression and a Naïve Bayes classifier
- To determine which of the two models is the better prediction model
Overall goal: to improve SNP data analysis
Part I. Introduction: Data
SNP data consisting of 220 binary variables
- Output variable: CACO (leukemia present/absent)
- Input variables: sex (F/M) and 218 SNPs (dominant/not dominant)
- Observations: 485 individuals
- Data are complete; no missing values
Part I. Methods: Models (fitted using R)
Logistic regression
- Assumption: observations are independent
- Only a main-effects model was considered (no interactions among input variables)
Naïve Bayes classifier
- Assumption: input variables are independent given the outcome
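To make the Naïve Bayes assumption concrete, here is a minimal Python sketch for binary inputs. It is an illustration only, not the R code used in the analysis: the function names, the toy data, and the Laplace smoothing constant `alpha` are all assumptions introduced here.

```python
import math

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate P(y) and P(x_j = 1 | y) for binary inputs.

    alpha is a Laplace smoothing constant (an assumption, not from the slides).
    """
    model = {}
    n = len(y)
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = (len(rows) + alpha) / (n + 2 * alpha)
        p_one = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(len(X[0]))]
        model[c] = (prior, p_one)
    return model

def predict_proba(model, x):
    """Return P(y = 1 | x), treating inputs as independent given the outcome."""
    log_joint = {}
    for c, (prior, p_one) in model.items():
        s = math.log(prior)
        for xj, p in zip(x, p_one):
            s += math.log(p if xj == 1 else 1.0 - p)
        log_joint[c] = s
    # Normalize in a numerically stable way.
    m = max(log_joint.values())
    w = {c: math.exp(v - m) for c, v in log_joint.items()}
    return w[1] / (w[0] + w[1])

# Toy data: two binary inputs, binary outcome (hypothetical values).
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]]
y = [1, 1, 0, 0, 1, 0]
model = fit_naive_bayes(X, y)
p = predict_proba(model, [1, 0])
```

The class-conditional probabilities are simply multiplied together, which is exactly where the conditional independence assumption enters.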
Part I. Methods: Variable Selection and Goodness of Fit Measures
Variable selection
- Selection based on log likelihood score
- 10 input variables per model
Goodness of fit measures
- 4-fold cross-validation
- Area under the ROC curve
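The slides do not say how the log likelihood score was computed. One possible reading, sketched below in Python, is to score each binary variable on its own and keep the top-ranked ones. With a single binary predictor, a logistic regression with an intercept reproduces the observed outcome proportions in each group, so its maximized log likelihood has a closed form. The function names and toy data are hypothetical.

```python
import math

def binary_loglik(k, n):
    """Maximized Bernoulli log likelihood for k events in n trials (0*log 0 taken as 0)."""
    ll = 0.0
    if 0 < k:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log((n - k) / n)
    return ll

def univariate_loglik(x, y):
    """Log likelihood of a one-predictor logistic model with binary x and y."""
    ll = 0.0
    for level in (0, 1):
        ys = [yi for xi, yi in zip(x, y) if xi == level]
        ll += binary_loglik(sum(ys), len(ys))
    return ll

def top_k(columns, y, k=10):
    """Rank variables by log likelihood (higher is better) and keep the best k."""
    scores = {name: univariate_loglik(x, y) for name, x in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data with made-up variable names.
y = [1, 1, 1, 0, 0, 0, 1, 0]
columns = {
    "snp_a": [1, 1, 1, 0, 0, 0, 1, 0],  # tracks the outcome exactly
    "snp_b": [1, 0, 1, 0, 1, 0, 1, 0],  # partly informative
    "snp_c": [1, 1, 0, 0, 1, 1, 0, 0],  # uninformative
}
ranked = top_k(columns, y, k=2)
```

Scoring variables one at a time ignores correlations among SNPs, which is one reason stepwise procedures may do better.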
Part I. Results: Variables Selected
The same 10 SNPs were identified as input variables for both logistic regression and Naïve Bayes:
1. TFRC_rs406721
2. TGFB1_rs1982072
3. HFE_rs807212
4. HLA_DRB1_DQA1_rs2395225
5. LTF_rs1042073
6. TFRC_rs3326
7. HFE
8. RXRB_rs421446
9. ACP1_rs11553746
10. DQA1_3UTR
Part I. Results: Cross-Validation
Average training and test error were computed for each model.
Logistic regression had the lower average training and test error.
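A 4-fold cross-validation loop that reports average training and test error could look like the Python sketch below. Several things here are assumptions: the error is taken to be the misclassification rate, folds are formed by interleaving indices without shuffling, and a majority-class rule stands in for the real models through the hypothetical `fit` and `predict` arguments.

```python
def k_fold_indices(n, k=4):
    """Split indices 0..n-1 into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(X, y, fit, predict, k=4):
    """Return (mean training error, mean test error) over k folds."""
    folds = k_fold_indices(len(y), k)
    train_errs, test_errs = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])

        def error_rate(idx):
            wrong = sum(predict(model, X[i]) != y[i] for i in idx)
            return wrong / len(idx)

        train_errs.append(error_rate(train_idx))
        test_errs.append(error_rate(test_idx))
    return sum(train_errs) / k, sum(test_errs) / k

# Placeholder model: always predict the majority class of the training data.
def fit_majority(X, y):
    return int(2 * sum(y) >= len(y))

def predict_majority(model, x):
    return model

# Toy data (hypothetical values).
X = [[i % 2] for i in range(12)]
y = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
train_err, test_err = cross_validate(X, y, fit_majority, predict_majority)
```

Any pair of fitting and prediction functions with the same signatures can be passed in, so both models can be compared on identical folds.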
Part I. Results: ROC Curve
AUROC: logistic regression (LR) = 0.79; Naïve Bayes (NB) = 0.70
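For reference, the area under the ROC curve equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The Python sketch below computes it that way; the scores and labels are made-up toy values, not the study's predictions.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Count case/control pairs ranked correctly; ties contribute one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities and true outcomes (hypothetical).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auroc(scores, labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.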
Part I. Discussion
Limitations of the preliminary analysis:
- Better methods for variable selection are available (stepwise procedures: forward selection, backward elimination)
- Recessive, additive, and heterozygous properties of genes were not included in the analysis
- The biological relationship between disease and genes was not considered in variable selection
- Interaction was not considered in the logistic regression model
Part II. Ongoing Research: Background, Purpose, Software
Part II. Background: Bayesian Networks
A probabilistic graphical model consisting of two components:
- a directed acyclic graph (DAG)
- a set of local probability distributions
Example of a DAG (nodes = random variables; arcs = conditional dependencies)
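The two components combine as follows: the joint probability of all variables is the product of each node's local distribution given its parents in the DAG. The Python sketch below illustrates this on a hypothetical three-node network with binary variables; the node names and probabilities are invented for the example.

```python
from itertools import product

# Component 1: the DAG, stored as the list of parents of each node (hypothetical).
parents = {"SNP_A": [], "SNP_B": ["SNP_A"], "Disease": ["SNP_A", "SNP_B"]}

# Component 2: local distributions, stored as P(node = 1 | parent values).
cpt = {
    "SNP_A": {(): 0.3},
    "SNP_B": {(0,): 0.2, (1,): 0.6},
    "Disease": {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.7},
}

def joint_probability(assignment):
    """P(assignment) as the product of local conditional probabilities."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

# Sanity check: the joint distribution sums to 1 over all 2^3 assignments.
total = sum(joint_probability(dict(zip(parents, values)))
            for values in product((0, 1), repeat=len(parents)))
```

Storing only local distributions keeps the number of parameters small compared with a full joint table, which matters when there are many SNPs.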
Part II. Background: Markov Condition
P(SNP1, L | SNP3) = P(SNP1 | SNP3) * P(L | SNP3)
That is, SNP1 and L are conditionally independent given SNP3.
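The factorization can be checked numerically. The Python sketch below assumes a network in which SNP3 is the only parent of both SNP1 and L (the actual DAG on the slide is not recoverable, so this structure and all probabilities are assumptions). It builds the joint table, then recovers each conditional probability from the table by summation and compares both sides of the equation.

```python
from itertools import product

# Hypothetical local distributions: P(SNP3 = 1), P(SNP1 = 1 | SNP3), P(L = 1 | SNP3).
p_snp3 = 0.3
p_snp1_given = {0: 0.2, 1: 0.6}
p_l_given = {0: 0.1, 1: 0.5}

def bern(p, v):
    """Probability that a Bernoulli(p) variable takes value v."""
    return p if v == 1 else 1.0 - p

# Joint table keyed by (snp1, l, snp3).
joint = {
    (s1, l, s3): bern(p_snp3, s3) * bern(p_snp1_given[s3], s1) * bern(p_l_given[s3], l)
    for s1, l, s3 in product((0, 1), repeat=3)
}

NAMES = ("snp1", "l", "snp3")

def prob(**fixed):
    """Marginal probability of the fixed values, summing the joint over the rest."""
    return sum(p for key, p in joint.items()
               if all(key[NAMES.index(n)] == v for n, v in fixed.items()))

# Largest difference between the two sides over all value combinations.
max_gap = 0.0
for s1, l, s3 in product((0, 1), repeat=3):
    lhs = prob(snp1=s1, l=l, snp3=s3) / prob(snp3=s3)
    rhs = (prob(snp1=s1, snp3=s3) / prob(snp3=s3)) * (prob(l=l, snp3=s3) / prob(snp3=s3))
    max_gap = max(max_gap, abs(lhs - rhs))
```

Under this structure `max_gap` is zero up to floating-point rounding; adding a direct arc between SNP1 and L would generally break the equality.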
Part II. Purpose
To analyze SNP data using the following models:
- Bayesian Networks
- Multifactor Dimensionality Reduction
- Support Vector Machine
To compare the prediction capability of the above models to other widely used models
Part II. Software
Development of R code for Bayesian network analysis:
- To search for the “best” Bayesian network
- To utilize a search over variable orderings with an MCMC method
- Arc reversal
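As an illustration of the arc reversal move only (not the MCMC search over orderings, and not the R code under development), the Python sketch below reverses one arc and rejects the move if the result is no longer acyclic. The graph, node names, and function names are hypothetical.

```python
def is_acyclic(nodes, arcs):
    """Kahn's algorithm: True if the directed graph contains no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, child in arcs:
        indegree[child] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for parent, child in arcs:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
    # Every node is visited only when no cycle blocks the ordering.
    return visited == len(nodes)

def reverse_arc(nodes, arcs, arc):
    """Return a new arc set with `arc` reversed, or None if that creates a cycle."""
    parent, child = arc
    candidate = (arcs - {arc}) | {(child, parent)}
    return candidate if is_acyclic(nodes, candidate) else None

# Hypothetical starting network; each arc is a (parent, child) pair.
nodes = ["SNP1", "SNP3", "L"]
arcs = {("SNP3", "SNP1"), ("SNP3", "L"), ("SNP1", "L")}

ok = reverse_arc(nodes, arcs, ("SNP1", "L"))       # still a DAG
blocked = reverse_arc(nodes, arcs, ("SNP3", "L"))  # would create a cycle, so None
```

In a structure search, each accepted candidate would then be scored against the data and kept or discarded according to the search rule.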
Thank you.