Modeling Gene Interactions in Disease (CS 686 Bioinformatics)

Some Definitions
Data mining: extracting hidden patterns and useful information from large data sets. Examples: clustering, machine learning. What it should not be: "Torturing data until it confesses... and if you torture it enough, it will confess to anything" - Jeff Jonas, IBM.
Machine learning: the ability of a program to learn from experience. Examples: neural networks, decision trees, rule-based methods, MDR.

Methods
Regression methods: model the relationship between a dependent variable and one or more independent variables.
Data mining methods: search the space of possible models efficiently; better suited to non-linear and high-dimensional data, or data with many potential interactions.
Exhaustive search: evaluate all possible models and keep the best one (see the sketch below).
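As a rough illustration of the exhaustive-search idea, the hypothetical Python sketch below scores every SNP pair with a cross-validated logistic model that includes an interaction term. The genotype matrix `X` (coded 0/1/2) and binary phenotype `y` are assumed inputs, not part of the original slides.

```python
# Hypothetical sketch of an exhaustive search over all SNP pairs.
# Assumes X is an (n_samples, n_snps) array of genotypes coded 0/1/2
# and y is a binary phenotype vector; names are illustrative only.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def exhaustive_pair_search(X, y, cv=5):
    """Score every SNP pair with a cross-validated logistic model."""
    best_pair, best_score = None, -np.inf
    for i, j in combinations(range(X.shape[1]), 2):
        pair = X[:, [i, j]]
        # Include the product term so the model can capture an interaction.
        features = np.column_stack([pair, pair[:, 0] * pair[:, 1]])
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                features, y, cv=cv).mean()
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score
```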

Linear regression
Models the outcome as a linear combination of the parameters (though not necessarily of the independent variables).
Ex: let y = incidence of disease, with n data points and independent variables A and B:
1) y_i = b_0 + b_1 A_i + ε_i,   i = 1, …, n
2) y_i = b_0 + b_2 B_i^2 + ε_i,   i = 1, …, n
where b_0, b_1, b_2 are parameters and ε_i is the error term. In both examples the disease model is linear in the parameters, although model (2) is quadratic in the variable B.

Linear regression
Given a sample, we estimate the parameters b_0, b_1, … (e.g., by least squares) to arrive at the fitted linear regression model [1].
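As a minimal sketch (not from the slides), the NumPy snippet below fits model (1) from the previous slide by ordinary least squares; the data are simulated purely for illustration and all names and values are hypothetical.

```python
# Minimal least-squares sketch for model (1): y_i = b_0 + b_1 * A_i + e_i.
# The data here are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0, 2, size=100)                 # hypothetical predictor
y = 0.5 + 1.2 * A + rng.normal(0, 0.1, 100)     # true b_0 = 0.5, b_1 = 1.2

design = np.column_stack([np.ones_like(A), A])  # intercept column plus A
(b0_hat, b1_hat), *_ = np.linalg.lstsq(design, y, rcond=None)
print(b0_hat, b1_hat)                           # estimates close to 0.5 and 1.2
```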

Multiple regression
Relates the outcome to a linear combination of two or more predictor variables.
Ex: let y = incidence of disease, with n data points and independent variables x_1, x_2, …, x_p:
y_i = b_0 + b_1 x_i1 + b_2 x_i2 + … + b_p x_ip + ε_i,   i = 1, …, n
Best-fit interpretation: holding the other predictors fixed, for each unit increase in x_ip the predicted outcome is expected to increase by b_p.
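A similar sketch for multiple regression with two hypothetical predictors, using scikit-learn on simulated data; the fitted coefficients illustrate the per-unit interpretation above.

```python
# Sketch of multiple regression with two hypothetical predictors x1, x2.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                        # columns x1, x2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)

model = LinearRegression().fit(X, y)
# Each coefficient is the expected change in y per unit increase in that
# predictor, holding the other predictor fixed.
print(model.intercept_, model.coef_)
```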

Logistic regression [1]
Often used when the outcome is binary; relates the log-odds of the probability of an event to a linear combination of predictor variables.
Ex: ln(p / (1 − p)) = α + β·xB + γ·xC + δ·xB·xC, where xB and xC are measured binary indicator variables, the regression coefficients β and γ represent main effects, and δ represents the interaction.
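To make the interaction model concrete, here is a hedged sketch using the statsmodels formula interface on simulated data; the formula "y ~ xB * xC" expands to the main effects plus the xB:xC interaction. Variable names and effect sizes are illustrative, not taken from the slides.

```python
# Sketch of the logistic interaction model from the slide,
# ln(p / (1 - p)) = a + b*xB + c*xC + d*xB*xC, fit on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
xB = rng.integers(0, 2, 500)                   # binary indicator for locus B
xC = rng.integers(0, 2, 500)                   # binary indicator for locus C
logit_p = -1.0 + 0.5 * xB + 0.5 * xC + 1.5 * xB * xC
p = 1 / (1 + np.exp(-logit_p))
y = rng.binomial(1, p)                         # simulated disease status

data = pd.DataFrame({"y": y, "xB": xB, "xC": xC})
# "xB * xC" expands to both main effects plus the xB:xC interaction term.
fit = smf.logit("y ~ xB * xC", data=data).fit()
print(fit.params)                              # intercept, main effects, interaction
```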

Other statistical methods [1]
Bayesian model selection: a statistical approach incorporating both prior distributions for the parameters and the observed data into the model.
Maximum likelihood: a statistical method used to make inferences about the combination of parameter values that gives the highest probability of obtaining the observed data.
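A toy maximum-likelihood sketch (illustrative only, not from the slides): estimate the success probability of binomial 0/1 data by minimizing the negative log-likelihood with SciPy; the MLE should coincide with the sample mean.

```python
# Toy maximum-likelihood sketch: estimate the success probability p of a
# Bernoulli sample by minimizing the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import bernoulli

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=200)             # simulated 0/1 data, true p = 0.3

def neg_log_likelihood(p):
    return -np.sum(bernoulli.logpmf(x, p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6),
                         method="bounded")
print(result.x, x.mean())                      # MLE equals the sample mean
```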

Modeling Terminology [1]
Saturated model: a statistical model that is as full as possible (saturated) with parameters.
Marginal effects: the effects of one parameter averaged over the possible values taken by the other parameters.
Entropy: the uncertainty associated with a random variable.
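For the entropy definition, a short worked example: H(X) = -Σ p_i log2(p_i), computed here for two illustrative distributions.

```python
# Entropy of a discrete random variable: H(X) = -sum_i p_i * log2(p_i).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # ignore zero-probability outcomes
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))          # 1 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))          # about 0.47 bits: less uncertain
```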

Modeling Terminology [1]
Cross-validation: partitioning a data set into n subsets, then using each subset in turn as the test set while training on the other n − 1.
Overfitting: a model that provides a good fit to a specific data set but generalizes poorly to new data.
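A brief scikit-learn sketch of n-fold cross-validation (here n = 5) on synthetic data; the dataset and classifier are placeholders, not part of the original slides.

```python
# Sketch of n-fold cross-validation (n = 5) with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# Each of the 5 folds is held out once as the test set while the other
# 4 folds are used for training; a large train/test gap suggests overfitting.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```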

Marginal Effects [2]
Marginal penetrance: Ex: the probability P(D | A = Aa), irrespective of what value B takes.
Table II. Penetrance values for combinations of genotypes from two single nucleotide polymorphisms exhibiting interactions in the absence of independent main effects. Rows are B genotypes BB (0.25), Bb (0.50), bb (0.25); columns are A genotypes AA (0.25), Aa (0.50), aa (0.25); genotype frequencies are given in parentheses. The table margins give the marginal penetrance values for the A and B genotypes. (Individual penetrance cell values omitted.)
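Since the cell values of Table II are not shown here, the sketch below uses an illustrative penetrance matrix (a classic interaction-without-main-effects pattern, not the values from the original table) to show how marginal penetrance is computed as a genotype-frequency-weighted average.

```python
# Hypothetical worked example of marginal penetrance. The penetrance values
# below are illustrative only, not taken from Table II.
import numpy as np

geno_freq_B = np.array([0.25, 0.50, 0.25])           # P(BB), P(Bb), P(bb)
# Rows: AA, Aa, aa; columns: BB, Bb, bb. Entries are P(D | A, B).
penetrance = np.array([[0.0, 0.1, 0.0],
                       [0.1, 0.0, 0.1],
                       [0.0, 0.1, 0.0]])

# Marginal penetrance of each A genotype: average over B weighted by its
# genotype frequencies, P(D | A) = sum_B P(D | A, B) * P(B).
marginal_A = penetrance @ geno_freq_B
print(marginal_A)    # identical for AA, Aa, aa: no marginal (main) effect
```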

Weka [3]
A collection of visualization tools and algorithms for data analysis and predictive modeling.
Preprocessing tools for reading data in a variety of formats and transforming it.
Classification algorithms include regression, neural networks, support vector machines, and decision trees; display includes ROC curves.
Clustering: k-means, expectation maximization.
Visualization includes scatter plots and bar graphs.

References
[1] Cordell, H. 2009. Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics.
[2] McKinney et al. 2006. Machine Learning for Detecting Gene-Gene Interactions: A Review. Biomedical Genomics and Proteomics.
[3] Weka site: