Meredith L. Wilcox FIU, Department of Epidemiology/Biostatistics


1 SNP Data Analysis
Meredith L. Wilcox, FIU, Department of Epidemiology/Biostatistics

2 Presentation Outline
Part I. Preliminary Analysis: Introduction, Methods, Results, Discussion
Part II. Ongoing Research: Background, Purpose, Software

3 Part I. Preliminary Analysis
- Introduction – Methods – Results – Discussion -

4 Part I. Introduction
Purpose of the preliminary analysis:
- To develop two initial prediction models for the leukemia SNP data:
  - Logistic regression
  - Naïve Bayes classifier
- To determine which of the two is the better prediction model
Overall goal: to improve SNP data analysis

5 Part I. Introduction
Data: SNP data consisting of 220 binary variables
- Output variable: CACO (leukemia present/absent)
- Input variables: sex (F/M) and 218 SNPs (dominant/not dominant)
- Observations: 485 individuals
- Data are complete; no missing values

6 Part I. Methods
Models using R
- Logistic regression
  - Assumption: observations are independent
  - Only simple logistic regression considered (no interaction among input variables)
- Naïve Bayes classifier
  - Assumption: input variables are independent given the outcome
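
The Naïve Bayes assumption above can be illustrated with a minimal sketch. It is written in Python rather than the R used in the talk, and it is not the author's code. The function names, the Laplace smoothing constant `alpha`, and the toy data are all assumptions made for illustration.

```python
import math

def train_naive_bayes(X, y, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on 0/1 inputs and a 0/1 outcome.

    alpha is a Laplace smoothing constant (an assumption, not from the talk).
    """
    n_features = len(X[0])
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = (len(rows) + alpha) / (len(X) + 2 * alpha)
        # P(feature j = 1 | class c), smoothed so it is never exactly 0 or 1.
        p_one = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, p_one)
    return model

def predict_proba(model, x):
    """Return P(outcome = 1 | x), treating inputs as independent given the outcome."""
    log_joint = {}
    for c, (prior, p_one) in model.items():
        total = math.log(prior)
        for xj, pj in zip(x, p_one):
            total += math.log(pj if xj == 1 else 1.0 - pj)
        log_joint[c] = total
    # Normalise in log space for numerical stability.
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return math.exp(log_joint[1] - m) / z

# Toy data (made up for illustration): feature 0 tracks the outcome, feature 1 is noise.
X_toy = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
y_toy = [1, 1, 1, 0, 0, 0]
nb_model = train_naive_bayes(X_toy, y_toy)
```

Because the inputs are treated as independent given the outcome, the class-conditional likelihood is a product of per-variable terms, which the code accumulates as a sum of logs.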

7 Part I. Methods
Variable selection
- Selection based on log-likelihood score
- 10 input variables per model
Goodness-of-fit measures
- 4-fold cross-validation
- Area under the ROC curve (AUROC)
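
As a rough illustration of 4-fold cross-validation on 485 observations, here is a sketch in Python (not the R used in the talk). The helper names and the `fit_and_score` callback are hypothetical; the talk does not say how the folds were formed, so a seeded random shuffle is assumed.

```python
import random

def k_fold_indices(n, k=4, seed=0):
    """Split indices 0..n-1 into k shuffled folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, fit_and_score, k=4, seed=0):
    """Average training and test error over k folds.

    fit_and_score(train_idx, test_idx) is a hypothetical callback that fits a
    model on train_idx and returns (training_error, test_error).
    """
    folds = k_fold_indices(n, k, seed)
    train_errs, test_errs = [], []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        tr, te = fit_and_score(train_idx, test_idx)
        train_errs.append(tr)
        test_errs.append(te)
    return sum(train_errs) / k, sum(test_errs) / k

folds = k_fold_indices(485, k=4)
```

Each observation lands in exactly one test fold, so every individual is used for testing once and for training three times.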

8 Part I. Results
Variables selected
The same 10 SNPs were identified as input variables for both logistic regression and Naïve Bayes:
1. TFRC_rs             TFRC_rs3326
2. TGFB1_rs            HFE
3. HFE_rs              RXRB_rs421446
4. HLA_DRB1_DQA1_rs    ACP1_rs
5. LTF_rs              DQA1_3UTR

9 Part I. Results
Cross-validation
Average training/test error for each model
Logistic regression: lower average training/test error

10 Part I. Results
ROC curve
AUROC: logistic regression (LR) = 0.79; Naïve Bayes (NB) = 0.70
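
For readers unfamiliar with AUROC, the sketch below computes it from labels and predicted scores by comparing every case with every control. It is in Python rather than the R used in the talk, the function name and toy inputs are made up, and it does not reproduce the reported values.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    Equals the probability that a randomly chosen case scores higher than a
    randomly chosen control, counting ties as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 case/control pairs are ranked correctly.
toy_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a higher value indicates better discrimination between cases and controls.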

11 Part I. Discussion
Limitations of the preliminary analysis:
- Better methods for variable selection are available (stepwise procedures: forward selection, backward elimination).
- Recessive, additive, and heterozygous properties of genes were not included in the analysis.
- The biology linking disease and genes was not considered in variable selection.
- Interaction was not considered in the logistic regression model.

12 Part II. Ongoing Research
- Background – Purpose – Software -

13 Part II. Background
Bayesian networks
A probabilistic graphical model consisting of two components:
- a directed acyclic graph (DAG)
- a set of local probability distributions
Example of a DAG (nodes = random variables; arcs = conditional dependencies)

14 Part II. Background
Markov condition
Each node is conditionally independent of its non-descendants given its parents. For example:
P(SNP1, L | SNP3) = P(SNP1 | SNP3) * P(L | SNP3)
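
The factorization can be checked numerically on a small made-up network. The sketch below is in Python rather than the R used in the talk. It assumes the structure SNP1 <- SNP3 -> L, which is one graph consistent with the formula but is not confirmed by the transcript, and every probability value is hypothetical.

```python
from itertools import product

# Hypothetical conditional probability tables for SNP1 <- SNP3 -> L.
p_snp3 = {0: 0.7, 1: 0.3}             # P(SNP3 = value)
p_snp1_given_snp3 = {0: 0.2, 1: 0.6}  # P(SNP1 = 1 | SNP3 = value)
p_l_given_snp3 = {0: 0.1, 1: 0.5}     # P(L = 1 | SNP3 = value)

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(snp1, l, snp3):
    """Joint probability as the product of the local distributions."""
    return (p_snp3[snp3]
            * bern(p_snp1_given_snp3[snp3], snp1)
            * bern(p_l_given_snp3[snp3], l))

def markov_gap():
    """Largest |P(SNP1,L|SNP3) - P(SNP1|SNP3) P(L|SNP3)| over all values."""
    gap = 0.0
    for snp1, l, snp3 in product((0, 1), repeat=3):
        # Marginal P(SNP3), then the conditionals computed from the joint.
        p3 = sum(joint(a, b, snp3) for a, b in product((0, 1), repeat=2))
        lhs = joint(snp1, l, snp3) / p3
        p1_given_3 = sum(joint(snp1, b, snp3) for b in (0, 1)) / p3
        pl_given_3 = sum(joint(a, l, snp3) for a in (0, 1)) / p3
        gap = max(gap, abs(lhs - p1_given_3 * pl_given_3))
    return gap
```

With this structure the gap is zero up to floating-point error, because SNP3 is the only parent of both SNP1 and L.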

15 Part II. Purpose
To analyze SNP data using the following models:
- Bayesian networks
- Multifactor dimensionality reduction
- Support vector machine
To compare the prediction capability of the above models with that of other widely used models

16 Part II. Software
Development of R code for Bayesian network analysis
- To search for the “best” Bayesian network
- Utilize search over variable orderings with an MCMC method
- Arc reversal
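
The talk does not describe the search code itself. As a hedged illustration of what an arc-reversal move involves, the Python sketch below (not the R under development) reverses one arc and rejects the proposal if the graph would no longer be acyclic. The function names and the example graph are assumptions, and the scoring and MCMC acceptance steps are omitted.

```python
def is_acyclic(nodes, arcs):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, child in arcs:
        indegree[child] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        # Remove the outgoing arcs of the node just processed.
        for parent, child in arcs:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == len(nodes)

def reverse_arc(nodes, arcs, arc):
    """Return a new arc set with `arc` reversed, or None if that creates a cycle."""
    parent, child = arc
    proposal = (set(arcs) - {arc}) | {(child, parent)}
    return proposal if is_acyclic(nodes, proposal) else None

# Hypothetical example graph: SNP3 is a parent of both SNP1 and L.
nodes = ["SNP1", "SNP3", "L"]
arcs = {("SNP3", "SNP1"), ("SNP3", "L")}
```

In a full search, each accepted proposal would then be scored against the data; that part depends on choices the slides do not specify.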

17 Thank you.

