Meredith L. Wilcox FIU, Department of Epidemiology/Biostatistics

SNP Data Analysis

Presentation Outline
Part I. Preliminary Analysis
- Introduction
- Methods
- Results
- Discussion
Part II. Ongoing Research
- Purpose
- Background
- Software

Part I. Preliminary Analysis
- Introduction – Methods – Results – Discussion -

Part I. Introduction
Purpose of the preliminary analysis:
- To develop two initial prediction models for the leukemia SNP data:
  - Logistic regression
  - Naïve Bayes classifier
- To determine which of the two models is the better prediction model
Overall goal: to improve SNP data analysis

Part I. Introduction
Data: SNP data consisting of 220 binary variables
- Output variable: CACO (leukemia present/absent)
- Input variables:
  - Sex (F/M)
  - 218 SNPs (dominant/not dominant)
- Observations: 485 individuals
- Data are complete; no missing values

Part I. Methods
Models using R
- Logistic regression
  - Assumption: observations are independent
  - Only simple logistic regression considered (no interaction among input variables)
- Naïve Bayes classifier
  - Assumption: input variables are independent given the outcome
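The conditional-independence assumption behind the Naïve Bayes classifier can be illustrated with a minimal sketch for binary inputs. This is not the R code used in the analysis; it is a pure-Python illustration, and the function names, the Laplace smoothing constant `alpha`, and the toy data are all assumptions introduced here.

```python
import math

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate P(y) and P(x_j = 1 | y) for binary inputs.

    alpha is a Laplace smoothing constant (an assumption, not from the slides).
    Both outcome classes (0 and 1) must be present in y.
    """
    n_features = len(X[0])
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(y)
        cond = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, cond)
    return model

def predict_proba(model, x):
    """Return P(y = 1 | x), multiplying per-variable terms as if the
    inputs were independent given the outcome."""
    log_post = {}
    for c, (prior, cond) in model.items():
        lp = math.log(prior)
        for xj, pj in zip(x, cond):
            lp += math.log(pj if xj == 1 else 1.0 - pj)
        log_post[c] = lp
    m = max(log_post.values())  # subtract the max for numerical stability
    z = sum(math.exp(v - m) for v in log_post.values())
    return math.exp(log_post[1] - m) / z

# Toy data (made up): two binary inputs, six individuals
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]]
y = [1, 1, 0, 0, 1, 0]
nb_model = fit_naive_bayes(X, y)
```

Calling `predict_proba(nb_model, [1, 0])` returns the estimated probability of being a case for an individual with the first input present and the second absent.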

Part I. Methods
Variable Selection
- Selection based on log-likelihood score
- 10 input variables per model
Goodness-of-Fit Measures
- 4-fold cross-validation
- Area under the ROC curve
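The slides do not say exactly how the log-likelihood score was computed. One plausible reading, sketched here as an assumption, is a univariate ranking: score each binary input by the maximized log-likelihood of a one-variable logistic regression and keep the 10 best. With a single binary predictor plus an intercept, the fitted probability in each level equals the observed case proportion, so no iterative fitting is needed. The function names and the column names in the usage note are hypothetical.

```python
import math

def binary_loglik(x, y):
    """Maximized log-likelihood of a logistic regression of binary y on a
    single binary x (intercept included)."""
    total = 0.0
    for level in (0, 1):
        ys = [yi for xi, yi in zip(x, y) if xi == level]
        n, k = len(ys), sum(ys)
        # k cases contribute log(k/n) each; n - k controls contribute log((n-k)/n)
        for count in (k, n - k):
            if count > 0:
                total += count * math.log(count / n)
    return total

def top_k_by_loglik(columns, y, k=10):
    """Rank named binary columns by log-likelihood (higher is better), keep k."""
    scores = {name: binary_loglik(col, y) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For example, `top_k_by_loglik({"snp_a": [...], "snp_b": [...]}, caco, k=10)` would return the names of the ten highest-scoring inputs, where `caco` is the list of 0/1 outcomes.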

Part I. Results
Variables Selected
The same 10 SNPs were identified as input variables for both logistic regression and Naïve Bayes:
1. TFRC_rs                 TFRC_rs3326
2. TGFB1_rs                HFE
3. HFE_rs                  RXRB_rs421446
4. HLA_DRB1_DQA1_rs        ACP1_rs
5. LTF_rs                  DQA1_3UTR

Part I. Results
Cross-Validation
- Average training and test error computed for each model
- Logistic regression had the lower average training and test error
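The averaging of training and test error over folds can be sketched as follows. This is an illustration only: the slides do not define the error measure, so misclassification rate is assumed here, and `fit` and `predict` are hypothetical callables standing in for either model.

```python
import random

def k_fold_indices(n, k=4, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=4, seed=0):
    """Return (mean training error, mean test error) over k folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns 0 or 1.
    Error is the misclassification rate (an assumption).
    """
    folds = k_fold_indices(len(y), k, seed)
    train_errs, test_errs = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])

        def error(rows):
            wrong = sum(predict(model, X[i]) != y[i] for i in rows)
            return wrong / len(rows)

        train_errs.append(error(train))
        test_errs.append(error(held_out))
    return sum(train_errs) / k, sum(test_errs) / k
```

With 485 individuals and 4 folds, this split gives one fold of 122 and three folds of 121, so each model is trained on roughly three quarters of the data and tested on the rest.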

Part I. Results
ROC Curve
AUROC: logistic regression (LR) = 0.79; Naïve Bayes (NB) = 0.70
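For readers unfamiliar with the measure, the AUROC equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. A minimal sketch of that definition follows; the function name is an assumption, and this is not the routine used to produce the values above.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    scores: predicted probabilities or any ranking score
    labels: 1 for case, 0 for control (both must be present)
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.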

Part I. Discussion
Limitations of the preliminary analysis:
- Better methods for variable selection are available (stepwise procedures: forward selection, backward elimination)
- Recessive, additive, and heterozygous properties of genes were not included in the analysis
- The biology linking disease and genes was not considered in variable selection
- Interaction was not considered in the logistic regression model

Part II. Ongoing Research
- Background – Purpose – Software -

Part II. Background
Bayesian Networks
Probabilistic graphical model consisting of two components:
- a directed acyclic graph (DAG)
- a set of local probability distributions
[Figure: example of a DAG; nodes = random variables, arcs = conditional dependencies]
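The two components can be made concrete with a small sketch: a DAG stored as a parent list per node, and one local distribution per node giving P(node = 1 | parents). The joint probability is the product of the local terms. The three-node network, its node names, and all probabilities below are made up for illustration.

```python
# Hypothetical network: SNP_A -> Disease <- SNP_B
parents = {"SNP_A": [], "SNP_B": [], "Disease": ["SNP_A", "SNP_B"]}

# Local distributions: P(node = 1 | parent values), made-up numbers
cpt = {
    "SNP_A": {(): 0.3},
    "SNP_B": {(): 0.2},
    "Disease": {(0, 0): 0.05, (0, 1): 0.20, (1, 0): 0.30, (1, 1): 0.60},
}

def is_acyclic(graph):
    """Repeatedly remove nodes with no remaining parents; if none can be
    removed while nodes remain, the graph contains a cycle."""
    remaining = {n: set(ps) for n, ps in graph.items()}
    while remaining:
        roots = [n for n, ps in remaining.items() if not ps]
        if not roots:
            return False
        for r in roots:
            del remaining[r]
        for ps in remaining.values():
            ps.difference_update(roots)
    return True

def joint_probability(assignment):
    """P(assignment) as the product of each node's local probability."""
    prob = 1.0
    for node, ps in parents.items():
        p1 = cpt[node][tuple(assignment[p] for p in ps)]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob
```

For instance, `joint_probability({"SNP_A": 1, "SNP_B": 0, "Disease": 1})` multiplies P(SNP_A = 1), P(SNP_B = 0), and P(Disease = 1 | SNP_A = 1, SNP_B = 0).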

Part II. Purpose
To analyze SNP data using the following models:
- Bayesian networks
- Multifactor dimensionality reduction
- Support vector machine
To compare the prediction capability of the above models to other widely used models

Part II. Software
Development of R code for Bayesian network analysis:
- To search for the "best" Bayesian network
- Utilize search over variable orderings with an MCMC method
- Arc reversal
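As an illustration of the arc-reversal idea only, the sketch below proposes reversing one arc of a DAG and rejects the proposal if it would create a cycle. It is not the R code under development: the function names and graph representation are assumptions, and the scoring and acceptance step that an MCMC search would need is deliberately left out.

```python
import random

def is_acyclic(graph):
    """True if the parent-set graph has no directed cycle."""
    remaining = {n: set(ps) for n, ps in graph.items()}
    while remaining:
        roots = [n for n, ps in remaining.items() if not ps]
        if not roots:
            return False
        for r in roots:
            del remaining[r]
        for ps in remaining.values():
            ps.difference_update(roots)
    return True

def reverse_arc(graph, parent, child):
    """Return a copy with parent -> child replaced by child -> parent,
    or None if the reversed graph would contain a cycle."""
    if parent not in graph[child]:
        raise ValueError("no arc from %s to %s" % (parent, child))
    new = {n: set(ps) for n, ps in graph.items()}
    new[child].discard(parent)
    new[parent].add(child)
    return new if is_acyclic(new) else None

def propose_arc_reversal(graph, rng):
    """Pick one existing arc uniformly at random and try to reverse it."""
    arcs = sorted((p, c) for c, ps in graph.items() for p in ps)
    if not arcs:
        return None
    parent, child = rng.choice(arcs)
    return reverse_arc(graph, parent, child)

# Hypothetical DAG with arcs A -> B, B -> C, A -> C (node -> set of parents)
dag = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
```

In this toy graph, reversing B -> C keeps the graph acyclic, while reversing A -> C would close the loop A -> B -> C -> A and is rejected. A call such as `propose_arc_reversal(dag, random.Random(1))` returns either a new graph or `None`.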

Thank you.
