Evaluating Classifiers for Disease Gene Discovery


Evaluating Classifiers for Disease Gene Discovery Lon Turnbull and Kino Coursey lt0013@unt.edu, kino@daxtrom.com University of North Texas

Biocomputing, Fall 2005 CSCD 4930.004/CSCE 5933.007 Biol 4930.773/Biol 5905.773 Instructors: Armin Mikler and Kaja Abbas

Outline
An interesting hypothesis
What is a disease gene?
Can disease genes be classified using machine learning tools?
If so, can we do better?
Classifiers + Data Analysis + Conclusions

Hypothesis It has been suggested that the genes which have some relationship to hereditary disease might have common variations in their DNA sequence structure.

What is a disease gene? Any gene that has mutated in such a way that the proteins created from it are dysfunctional. However, mutation can happen to any gene, so can one actually search for physical characteristics of a “disease” gene?

Reviewed Paper A research group used the alternating decision tree (ADTree) algorithm from Weka to test the hypothesis. Their automatic classifier, which they called PROSPECTR, correctly identified on average 70% of the genes marked with a disease phenotype. They found that about 40% of their chosen features showed statistically significant differences.

PROSPECTR results
Feature                         Ratio
Gene encodes signal peptide     2.06
Gene length                     1.42
5' CpG islands                  1.33
Protein length                  1.29
Exon number                     1.25
cDNA length                     1.15
Distance to neighboring gene    1.13
3' UTR length                   1.09

Question Can we do better with other methods of classification?

Classification Methods
ADTree: Alternating decision tree, optimized for two-class problems.
J48: Weka's implementation of the C4.5 decision tree learner.
Logistic: Linear logistic regression.
SMO: Sequential Minimal Optimization algorithm for training a support vector classifier.
Naïve Bayes: Standard probabilistic Naïve Bayes classifier.
Ibk-K: K-nearest-neighbor classifier (Weka's IBk, k=5).
PART: Obtains rules from partial decision trees built using C4.5 heuristics.

Test Data
A training set of 1,084 genes known to be associated with a disease and 1,084 genes not known to be associated with disease.
A set of 675 disease genes listed in the Human Gene Mutation Database (HGMD) and 675 genes not known to be involved in disease.
A set based on oligogenic disorders: 54 genes known to be associated with an oligogenic disorder and 54 genes not known to be associated with disease.
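Because each set pairs an equal number of disease and non-disease genes, a classifier that always predicts the same class would be right half the time. The short Python sketch below makes that baseline explicit. The set sizes come from the slide; the dictionary keys and the function name are illustrative choices, not part of the original study.

```python
# Sizes (disease genes, non-disease genes) of the three sets listed on the slide.
# The short names used as keys are labels chosen here for illustration.
DATASETS = {
    "training": (1084, 1084),
    "HGMD": (675, 675),
    "oligogenic": (54, 54),
}


def majority_baseline(n_disease, n_control):
    """Accuracy (%) obtained by always predicting the larger class."""
    total = n_disease + n_control
    return 100.0 * max(n_disease, n_control) / total


baselines = {name: majority_baseline(*sizes) for name, sizes in DATASETS.items()}

for name, value in baselines.items():
    total_genes = sum(DATASETS[name])
    print(f"{name}: {total_genes} genes, majority-class baseline {value:.1f}%")
```

Any percent-correct figure reported on these sets is therefore best read against a 50% floor rather than against zero.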

Classifier interpretation There are four possible results from a classification analysis. A selected gene either:
Matches a disease gene (true positive).
Matches a non-disease gene (true negative).
Is selected to match a disease gene but does not do so (false positive).
Is selected to match a non-disease gene but does not do so (false negative).
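The four outcomes above are the cells of a two-class confusion matrix. The Python sketch below counts them and derives percent correct, sensitivity, and specificity. The labels and the six example genes are made up for illustration and are not data from the study.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="disease"):
    """Count the four possible outcomes of a two-class classification."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            # Predicted disease: correct (TP) or a false alarm (FP).
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # Predicted non-disease: correct (TN) or a missed disease gene (FN).
            counts["TN" if truth != positive else "FN"] += 1
    return counts


def summarize(counts):
    """Derive common summary rates from the four counts."""
    tp, tn, fp, fn = (counts[k] for k in ("TP", "TN", "FP", "FN"))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # share of disease genes found
        "specificity": tn / (tn + fp),  # share of non-disease genes found
    }


# Made-up labels for illustration only; not data from the study.
actual = ["disease", "disease", "disease", "normal", "normal", "normal"]
predicted = ["disease", "disease", "normal", "normal", "normal", "disease"]

counts = confusion_counts(actual, predicted)
metrics = summarize(counts)
print(dict(counts))
print(metrics)
```

Under this reading, the 70% figure quoted for PROSPECTR (disease genes correctly identified) corresponds to sensitivity, while a "percent total correct" figure corresponds to accuracy.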

Validity The analysis of an independent data set ought to produce results similar to those for the training set. If not, the analysis is suspect.

Validity If the analysis is valid, we would expect that classification using only the successful subset of features found by the PROSPECTR application would result in an improved solution. The removal of non-relevant features ought to decrease the number of mismatches.

All data analyzed

The Best Classifier Results
Classifier     Percent total correct    Difference with best features
J48            88.7                     -15.1
PART           80.7                     -10.6
ADTree         75.5                     -3.1
Ibk-K          75.4                     -0.32
Naïve Bayes    73.0                     -12.0
SMO            72.3                     -6.0
Logistic       70.0                     -4.9
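The table can be turned into per-classifier accuracies for the reduced feature set with a little arithmetic. The sketch below assumes, and the slide does not state this explicitly, that "Percent total correct" was measured with all features and that the difference column is the reduced-feature accuracy minus the all-feature accuracy, in percentage points. The variable and function names are illustrative.

```python
# (percent correct, difference with best features), copied from the table.
RESULTS = {
    "J48": (88.7, -15.1),
    "PART": (80.7, -10.6),
    "ADTree": (75.5, -3.1),
    "Ibk-K": (75.4, -0.32),
    "Naive Bayes": (73.0, -12.0),
    "SMO": (72.3, -6.0),
    "Logistic": (70.0, -4.9),
}


def reduced_accuracy(all_features_pct, difference):
    """ASSUMPTION: difference = reduced-feature accuracy - all-feature accuracy."""
    return round(all_features_pct + difference, 2)


reduced = {name: reduced_accuracy(*row) for name, row in RESULTS.items()}

# Classifiers ordered by the derived reduced-feature accuracy, best first.
ranking = sorted(reduced, key=reduced.get, reverse=True)

for name in ranking:
    all_pct, diff = RESULTS[name]
    print(f"{name:12s} all features {all_pct:5.1f}%  reduced {reduced[name]:6.2f}%  ({diff:+.2f})")
```

Every difference in the table is negative, so under this assumption no classifier improved when restricted to the features PROSPECTR found most informative, which runs against the expectation stated on the Validity slide.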

Conclusions We have shown that another classifier (J48 in the results above) performs better than ADTree, the one chosen by the PROSPECTR method. The features that showed the largest differences in the PROSPECTR study were most likely a statistical anomaly. It seems that using these machine learning methods to classify disease genes is not very productive. At best, they need to be combined with some other independent method.

References
Euan Adie et al., Speeding disease gene discovery by sequence based candidate prioritization, BMC Bioinformatics 2005, 6:55.
Hammond MP, Birney E, Genome information resources - developments at Ensembl, Trends in Genetics 2004, 20:268-272.
http://www.biomedcentral.com/1471-2105/6/55
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=gnd
http://www.genetics.med.ed.ac.uk/prospectr/

Questions?

What causes disease? Causes of disease are a continuum of genetic activity interacting with nongenetic factors. Source: The Metabolic and Molecular Bases of Inherited Disease, 8th ed., Vol. 1, Chapter 1. RC 627.8.M47.2001

Weka A collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a data set or called from your own Java code. Contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Well-suited for developing new machine learning schemes. Is open source software issued under the GNU General Public License.