Published by Noah Lee. Modified over 6 years ago.
1
SNP Data Analysis
Meredith L. Wilcox
FIU, Department of Epidemiology/Biostatistics
2
Presentation Outline
Part I. Preliminary Analysis
  - Introduction
  - Methods
  - Results
  - Discussion
Part II. Ongoing Research
  - Purpose
  - Background
  - Software
3
Part I. Preliminary Analysis
- Introduction – Methods – Results – Discussion -
4
Part I. Introduction
Purpose of the preliminary analysis:
  - To develop two initial prediction models for the leukemia SNP data:
      - Logistic regression
      - Naïve Bayes classifier
  - To determine which of the two models is the better prediction model
Overall goal: to improve SNP data analysis
5
Part I. Introduction
Data: SNP data consisting of 220 binary variables
  - Output variable: CACO (leukemia present/absent)
  - Input variables:
      - Sex (F/M)
      - 218 SNPs (dominant/not dominant)
  - Observations: 485 individuals
  - Data are complete; no missing values
6
Part I. Methods
Models (fitted using R):
  - Logistic regression
      - Assumption: observations are independent
      - Only simple logistic regression considered (no interactions among input variables)
  - Naïve Bayes classifier
      - Assumption: input variables are independent given the outcome
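The two model assumptions above can be made concrete with a small sketch. The slides report fitting the models in R and do not show code, so the following is only an illustration in plain Python: the function names, the Laplace smoothing constant `alpha`, the gradient-ascent settings, and the synthetic two-feature data set are all assumptions, not the author's implementation.

```python
import math
import random

def fit_naive_bayes(X, y, alpha=1.0):
    """Bernoulli naive Bayes; alpha is an assumed Laplace smoothing constant."""
    n, p = len(y), len(X[0])
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = (len(rows) + alpha) / (n + 2 * alpha)
        # Smoothed estimate of P(feature_j = 1 | class c)
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(p)]
        model[c] = (prior, probs)
    return model

def predict_naive_bayes(model, x):
    """P(y = 1 | x), treating inputs as independent given the outcome."""
    log_post = {}
    for c, (prior, probs) in model.items():
        lp = math.log(prior)
        for xj, pj in zip(x, probs):
            lp += math.log(pj if xj == 1 else 1.0 - pj)
        log_post[c] = lp
    m = max(log_post.values())
    e0, e1 = math.exp(log_post[0] - m), math.exp(log_post[1] - m)
    return e1 / (e0 + e1)

def predict_logistic(w, x):
    """P(y = 1 | x) for weights w, where w[0] is the intercept."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Logistic regression without interaction terms, by batch gradient ascent."""
    n, p = len(y), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for x, label in zip(X, y):
            err = label - predict_logistic(w, x)
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * x[j]
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Synthetic data (hypothetical): feature 0 is associated with the outcome,
# feature 1 is pure noise.
random.seed(0)
X, y = [], []
for _ in range(200):
    label = random.randint(0, 1)
    f0 = 1 if random.random() < (0.8 if label else 0.2) else 0
    X.append([f0, random.randint(0, 1)])
    y.append(label)

nb = fit_naive_bayes(X, y)
w = fit_logistic(X, y)
print(predict_naive_bayes(nb, [1, 0]), predict_logistic(w, [1, 0]))
```

Both functions return a probability of the outcome being present, so either model can be thresholded for classification or scored directly with a ROC curve.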
7
Part I. Methods
Variable Selection
  - Selection based on log-likelihood score
  - 10 input variables per model
Goodness of Fit Measures
  - 4-fold cross-validation
  - Area under the ROC curve
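As a rough illustration of the 4-fold cross-validation step, the sketch below splits 485 observations into four folds and averages training and test error across them. It assumes misclassification rate at a 0.5 threshold as the error measure, which the slides do not specify. The helper names, the random seed, the toy data, and the stand-in frequency-table model (`fit_rate`, `predict_rate`) are hypothetical.

```python
import random

def k_fold_indices(n, k=4, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=4, seed=0):
    """Return (mean training error, mean test error) over k folds."""
    folds = k_fold_indices(len(y), k, seed)
    train_errs, test_errs = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])

        def error(rows):
            # Assumed error measure: misclassification rate at a 0.5 threshold
            wrong = sum((predict(model, X[i]) >= 0.5) != (y[i] == 1) for i in rows)
            return wrong / len(rows)

        train_errs.append(error(train))
        test_errs.append(error(held_out))
    return sum(train_errs) / k, sum(test_errs) / k

def fit_rate(X, y):
    """Stand-in model: frequency of y = 1 for each value of the first input."""
    rates = {}
    for v in (0, 1):
        labels = [lab for x, lab in zip(X, y) if x[0] == v]
        rates[v] = sum(labels) / len(labels) if labels else 0.5
    return rates

def predict_rate(model, x):
    return model[x[0]]

# Toy data (hypothetical): one binary input that agrees with the outcome
# about 80% of the time, for 485 observations as in the data set described.
rng = random.Random(1)
X = [[rng.randint(0, 1)] for _ in range(485)]
y = [x[0] if rng.random() < 0.8 else 1 - x[0] for x in X]

folds = k_fold_indices(len(y))
train_err, test_err = cross_validate(X, y, fit_rate, predict_rate)
print([len(f) for f in folds], train_err, test_err)
```

Any pair of `fit` and `predict` functions with the same shape can be passed in, so the same loop could compare two models on identical folds.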
8
Part I. Results
Variables Selected
The same 10 SNPs were identified as input variables for both logistic regression and Naïve Bayes:
  1. TFRC_rs              TFRC_rs3326
  2. TGFB1_rs             HFE
  3. HFE_rs               RXRB_rs421446
  4. HLA_DRB1_DQA1_rs     ACP1_rs
  5. LTF_rs               DQA1_3UTR
9
Part I. Results
Cross-Validation
[Figure: average training/test error for each model]
Logistic regression had the lower average training and test error.
10
Part I. Results
ROC Curve
AUROC: LR = 0.79; NB = 0.70
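For readers unfamiliar with how an AUROC value such as those above is obtained, here is a minimal sketch. It uses the rank interpretation of the area under the ROC curve: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function name and example scores are hypothetical, and the slides do not say which R routine produced the reported values.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # case ranked above control
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for two controls and two cases
example = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(example)  # 0.75
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation of cases from controls.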
11
Part I. Discussion
Limitations of the preliminary analysis:
  - Better methods for variable selection are available (stepwise procedures: forward selection, backward elimination)
  - Recessive, additive, and heterozygous properties of genes were not included in the analysis
  - The biology linking disease and genes was not considered in variable selection
  - Interaction was not considered in the logistic regression model
12
Part II. Ongoing Research
- Background – Purpose – Software -
13
Part II. Background
Bayesian Networks
Probabilistic graphical model consisting of two components:
  - a directed acyclic graph (DAG)
  - a set of local probability distributions
[Figure: example of a DAG (nodes = random variables; arcs = conditional dependencies)]
14
Part II. Background
Markov Condition
P(SNP1, L | SNP3) = P(SNP1 | SNP3) * P(L | SNP3)
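The factorization above can be checked numerically on a toy network. The sketch below assumes a DAG in which SNP3 is the only parent of both SNP1 and L. The slide's figure is not available, so that structure and all probability values are hypothetical. It builds the joint distribution from the local distributions and then confirms that P(SNP1, L | SNP3) equals P(SNP1 | SNP3) * P(L | SNP3) for every combination of values.

```python
from itertools import product

# Hypothetical local distributions for the assumed DAG: SNP3 -> SNP1, SNP3 -> L
p_s3 = {0: 0.7, 1: 0.3}            # P(SNP3)
p_s1_given_s3 = {0: 0.2, 1: 0.6}   # P(SNP1 = 1 | SNP3)
p_l_given_s3 = {0: 0.1, 1: 0.5}    # P(L = 1 | SNP3)

def bern(p, v):
    """Probability that a binary variable with P(1) = p takes value v."""
    return p if v == 1 else 1.0 - p

# Joint distribution implied by the DAG, keyed by (SNP3, SNP1, L)
joint = {
    (s3, s1, l): p_s3[s3] * bern(p_s1_given_s3[s3], s1) * bern(p_l_given_s3[s3], l)
    for s3, s1, l in product((0, 1), repeat=3)
}

def prob(s3=None, s1=None, l=None):
    """Marginal probability, summing the joint over any unspecified variable."""
    return sum(p for (a, b, c), p in joint.items()
               if (s3 is None or a == s3)
               and (s1 is None or b == s1)
               and (l is None or c == l))

# Largest gap between the two sides of the Markov condition
max_gap = 0.0
for s3, s1, l in product((0, 1), repeat=3):
    lhs = prob(s3, s1, l) / prob(s3=s3)
    rhs = (prob(s3=s3, s1=s1) / prob(s3=s3)) * (prob(s3=s3, l=l) / prob(s3=s3))
    max_gap = max(max_gap, abs(lhs - rhs))

total = sum(joint.values())
print(max_gap, total)
```

The gap is zero up to floating-point rounding because the joint was built from the factorized form. With a different DAG, for example one with an arc between SNP1 and L, the equality would not hold in general.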
15
Part II. Purpose
To analyze SNP data using the following models:
  - Bayesian networks
  - Multifactor dimensionality reduction
  - Support vector machine
To compare the prediction capability of the above models with that of other widely used models
16
Part II. Software
Development of R code for Bayesian network analysis:
  - To search for the "best" Bayesian network
  - To use a search over variable orderings with an MCMC method
  - Arc reversal
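The slides do not describe how the planned R code performs arc reversal, so the following is only a sketch of the basic move, written in Python with hypothetical names. It reverses one arc in a DAG and rejects the result if it would create a directed cycle, since a Bayesian network structure must remain acyclic. The example graph, including the arc from SNP1 to L, is an assumption made for illustration.

```python
def has_cycle(children):
    """Detect a directed cycle by depth-first search; children maps node -> set of children."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in children}

    def visit(node):
        color[node] = GREY
        for child in children[node]:
            if color[child] == GREY:
                return True          # back edge: a cycle exists
            if color[child] == WHITE and visit(child):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in children)

def reverse_arc(children, u, v):
    """Return a copy of the graph with arc u -> v reversed, or None if that creates a cycle."""
    if v not in children[u]:
        raise ValueError("arc %s -> %s is not in the graph" % (u, v))
    proposal = {node: set(cs) for node, cs in children.items()}
    proposal[u].discard(v)
    proposal[v].add(u)
    return None if has_cycle(proposal) else proposal

# Hypothetical three-node DAG: SNP3 -> SNP1, SNP3 -> L, SNP1 -> L
dag = {"SNP3": {"SNP1", "L"}, "SNP1": {"L"}, "L": set()}

accepted = reverse_arc(dag, "SNP1", "L")   # gives L -> SNP1, still acyclic
rejected = reverse_arc(dag, "SNP3", "L")   # would close L -> SNP3 -> SNP1 -> L
print(accepted, rejected)
```

In a structure search, a move like this would typically be one of several proposals (alongside adding or deleting an arc), each scored before being accepted. The scoring and the MCMC acceptance step are not shown here.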
17
Thank you.