Logistic Regression: To classify gene pairs

Presentation transcript:

Logistic Regression: To classify gene pairs 24-04-2007

Introduction
Logistic Regression Classifier
Biopython Libraries
Gene pairs classification
  Into classes:
  OP (if they belong to the same operon)
  NOP (otherwise)

Operon
A genetic structure that contains one or more structural genes
Associated with each operon are promoter and operator sequences
Classification aim: to identify whether the genes within a gene pair belong to the same operon
Illustration from http://www.agen.ufl.edu/

Logistic Regression Model
A set of input (predictor) variables
  $x_1$: distance between the genes
  $x_2$: gene expression score
Logit score
  $S = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
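A minimal sketch of how a logit score can be turned into a class probability. It assumes the standard logistic function p = 1 / (1 + e^(-S)), which the slide does not write out; the function names are illustrative and not taken from the presentation.

```python
import math

def logit_score(x1, x2, beta0, beta1, beta2):
    """Linear logit score S = beta0 + beta1*x1 + beta2*x2."""
    return beta0 + beta1 * x1 + beta2 * x2

def probability_op(score):
    """Standard logistic function: maps a score S to a probability in (0, 1).

    Written in two branches so that math.exp never receives a large
    positive argument (which would overflow).
    """
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

# A score of zero sits exactly on the decision boundary.
print(probability_op(0.0))
```

Under this assumption, positive scores give a probability above 0.5 for class OP and negative scores give one below 0.5.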

Training the model
Focus on Bacillus subtilis operons
Training data gathered from Operon DB, located at http://odb.kuicr.kyoto-u.ac.jp
Finding values for the beta coefficients
  Done through maximum likelihood estimation (MLE) of the class probabilities (class OP vs. class NOP, given the data)
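The idea of fitting the beta coefficients by maximum likelihood can be sketched as follows. The slides do not say which optimizer was used, so plain gradient ascent is substituted here purely for brevity. The toy data are made up and pre-scaled (they are not from Operon DB), and all names are hypothetical.

```python
import math

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def log_likelihood(betas, xs, ys):
    """Sum of log P(y_i | x_i) under the model (y = 1 for OP, 0 for NOP)."""
    total = 0.0
    for (x1, x2), y in zip(xs, ys):
        p = sigmoid(betas[0] + betas[1] * x1 + betas[2] * x2)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def train(xs, ys, learning_rate=0.1, iterations=2000):
    """Increase the log-likelihood by gradient ascent (an assumed optimizer)."""
    betas = [0.0, 0.0, 0.0]
    n = len(xs)
    for _ in range(iterations):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), y in zip(xs, ys):
            err = y - sigmoid(betas[0] + betas[1] * x1 + betas[2] * x2)
            grad[0] += err        # derivative with respect to beta0
            grad[1] += err * x1   # derivative with respect to beta1
            grad[2] += err * x2   # derivative with respect to beta2
        betas = [b + learning_rate * g / n for b, g in zip(betas, grad)]
    return betas

# Made-up, pre-scaled toy data: (distance, expression score) per gene pair.
xs = [(-1.2, 0.9), (-0.8, 0.7), (-0.5, 0.8), (0.4, 0.2),
      (-0.3, 0.3), (0.6, 0.2), (0.9, -0.1), (1.3, 0.3)]
ys = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = OP, 0 = NOP

betas = train(xs, ys)
print(betas, log_likelihood(betas, xs, ys))
```

With raw (unscaled) distances, a fixed learning rate like this one would likely need tuning, which is one reason library implementations tend to use second-order methods instead.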

Model Accuracy
Training using the entire dataset yields:
  $S = 0.6212327 - 0.0007425 x_1 + 4.2325169 x_2$
To calculate the accuracy of the model:
  10-fold cross validation
  Leave-one-out cross validation
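As an illustration, the fitted coefficients could be applied to a new gene pair roughly like this. The sketch assumes that x1 is the distance between the genes and x2 is the gene expression score (the order in which the predictors are listed), and that a pair is called OP when its probability is at least 0.5. Neither the threshold nor the input units are stated in the slides, and `classify_pair` is a hypothetical helper.

```python
import math

# Coefficients reported for the model trained on the entire dataset.
BETA0, BETA1, BETA2 = 0.6212327, -0.0007425, 4.2325169

def classify_pair(distance, expression_score, threshold=0.5):
    """Return (label, probability of OP) for one gene pair.

    The 0.5 threshold is an assumption, not a value from the presentation.
    """
    s = BETA0 + BETA1 * distance + BETA2 * expression_score
    if s >= 0:
        p = 1.0 / (1.0 + math.exp(-s))
    else:
        e = math.exp(s)
        p = e / (1.0 + e)
    return ("OP" if p >= threshold else "NOP"), p

# Hypothetical inputs, chosen only to exercise both outcomes.
print(classify_pair(0.0, 0.0))
print(classify_pair(10000.0, 0.0))
```

The negative coefficient on x1 and the positive one on x2 mean that, under these assumptions, a larger distance lowers the probability of OP while a higher expression score raises it.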

Model Testing Results (10-fold cross validation)
Average Type I error rate (false positive rate): 19%
Average specificity (probability that a pair of class NOP is classified correctly): 0.81
Average Type II error rate (false negative rate): 4%
Average sensitivity (probability that a pair of class OP is classified correctly): 0.96
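The four quantities above are related in pairs: the Type I error rate is 1 minus the specificity, and the Type II error rate is 1 minus the sensitivity. A small sketch, treating OP as the positive class and using hypothetical confusion-matrix counts chosen only so that the rates match the averages listed (the real counts are not given in the slides):

```python
def error_rates(tp, fn, tn, fp):
    """Compute classification rates from confusion-matrix counts.

    tp/fn: OP pairs classified correctly/incorrectly.
    tn/fp: NOP pairs classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "type_ii_error_rate": 1.0 - sensitivity,  # false negative rate
        "specificity": specificity,
        "type_i_error_rate": 1.0 - specificity,   # false positive rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Hypothetical counts (100 OP pairs, 100 NOP pairs), for illustration only.
rates = error_rates(tp=96, fn=4, tn=81, fp=19)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Overall accuracy depends on the class balance as well, so it cannot be recovered from sensitivity and specificity alone.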

Model Testing Results (Leave-one-out cross validation)
Accuracy = 90%
Sensitivity = 94%
False negative rate = 6%
Specificity = 82%
False positive rate = 18%
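The two validation schemes differ only in the number of folds: leave-one-out is k-fold cross validation with k equal to the number of samples. A minimal sketch of the index splitting, assuming contiguous folds with no shuffling (how the presentation split its data is not stated, and the function names are illustrative):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k roughly equal contiguous folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def leave_one_out_splits(n_samples):
    """Leave-one-out is k-fold with one sample per fold."""
    return k_fold_splits(n_samples, n_samples)

# Each sample appears in exactly one test fold.
for train_idx, test_idx in k_fold_splits(25, 10):
    print(len(train_idx), test_idx)
```

In each round the model would be trained on the training indices and evaluated on the held-out indices, with the per-fold rates averaged afterwards. In practice the samples are usually shuffled before splitting.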

Conclusions & Notes
Classifier performed well
More than two variables may be needed to improve performance
Classifier works on Bacillus subtilis genes only
  Due to the difference in gene length between organisms
Operon DataBase (dataset source)
  Uses 5 variables for classification (improves accuracy)
  Aims to have all known operon information
  Therefore has a large training set for multiple organisms
  Allows web users to perform gene pair classification
  Located at http://odb.kuicr.kyoto-u.ac.jp

Thank You
Questions?

References
1. Operon. From http://en.wikipedia.org/wiki/Operon
2. Garson G. Logistic regression. 1998. From http://www2.chass.ncsu.edu/garson/pa765/logistic.htm
3. Hoon M. The logistic regression model. From http://www.biopython.org/DIST/docs/cookbook/LogisticRegression.html
4. Okuda S. Operon DataBase.
5. Schneider J. Cross validation. 1997. From http://www.cs.cmu.edu/~schneide/tut5/node42.html
Illustration: http://www.agen.ufl.edu/~chyn/age2062/lect/lect_07/lac_op.jpg