Evaluating classifiers for disease gene discovery

Lon Turnbull and Kino Coursey
lt0013@unt.edu, kino@daxtrom.com
University of North Texas

Introduction

Determining that a gene is probably involved in some genetic disease is an important bioinformatics task. In this poster we extend the work of the PROSPECTR project, which defined classifiers to estimate the likelihood that candidate regions are capable of causing a genetic disease. While that work focused only on limited-depth decision trees, we examine the other classifiers in the Weka machine learning tool set and determine the quality and accuracy of each. Improving the accuracy of this classification task allows the high-probability sites to be given priority when searching for disease genes.

Classifiers

1. ADTree: alternating decision tree, optimized for two-class problems.
2. J48: a variant of the C4.5 decision tree induction algorithm.
3. Logistic: linear logistic regression, a form of regression for classification.
4. SMO: Sequential Minimal Optimization algorithm for training a support vector classifier.
5. Naïve Bayes: standard probabilistic Naïve Bayes classifier.
6. IBk: k-nearest-neighbor classifier (k = 5); selects the class held by the majority of the 5 nearest training instances under a distance metric.
7. PART: obtains rules from partial decision trees built using C4.5 heuristics.

Classification analysis of correct responses

There are two sets of results, because we also performed an analysis using a reduced feature set restricted to the more prominent features found in the PROSPECTR results. The bar graphs show the numbers of genes placed in the desired classification, disease genes (the red column), and in the undesired classification, non-disease genes (the purple column). The two middle columns are the mismatches: blue is disease genes classified as non-disease, and green is non-disease genes misclassified as disease. The green columns are the least desirable cases.

There are two obvious methods of picking a good classifier:

1. Since there are equal numbers of genes of each type in the samples, ideally the red and purple columns ought to be the same height. The greater the difference, the worse the classifier.
2. The classifier with the smallest number of mismatches can also be considered the better choice.

With these criteria, on the full data set classifier 2 is the best for the first two data sets, and classifier 1 is a little better than classifier 2 on the oligogenic set. Unexpectedly, a different set of classifiers predominates on the reduced data set: classifier 4 is best for the first two data sets and classifier 3 is better for the last. Note well: in most cases the number of mismatches is relatively large. There was no improvement with the reduced feature set, which suggests that, despite their statistical significance, the features showing the largest differences were most likely a statistical anomaly.

Discussion

Classifier 2 (J48) is clearly the winner, as it correctly classified 88.75% of the genes. In most cases the reduced feature set gave fewer correct classifications. This may mean either that there is a methodological problem or that the apparently optimal features are due to anomalous matching in the data set. The problem is underlined by the fact that the best classifier showed the largest change when the reduced features were analyzed. (A sketch of running this comparison programmatically follows below.)
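A comparison like this can be scripted against the Weka Java API rather than run through the Explorer GUI. The following is a minimal sketch, not the poster's actual code: training.arff and hgmd_test.arff are placeholder file names for ARFF exports of the gene feature vectors, and the ADTree import path follows the Weka 3.4 layout of the time (in later Weka releases ADTree ships as an optional package under the same class name).

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.ADTree;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierComparison {
    public static void main(String[] args) throws Exception {
        // Placeholder names: the balanced 1,084 + 1,084 training set and
        // the 675 + 675 HGMD independent test set, exported as ARFF files.
        Instances train = new DataSource("training.arff").getDataSet();
        Instances test  = new DataSource("hgmd_test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);  // disease / non-disease label
        test.setClassIndex(test.numAttributes() - 1);

        // The seven classifiers from the poster, in the same numbering.
        Classifier[] classifiers = {
            new ADTree(), new J48(), new Logistic(), new SMO(),
            new NaiveBayes(), new IBk(5), new PART()
        };

        for (int i = 0; i < classifiers.length; i++) {
            Classifier c = classifiers[i];
            c.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(c, test);
            // Confusion matrix rows are actual classes and columns are
            // predictions, so the off-diagonal entries are the two mismatch
            // counts (the blue and green bar-graph columns).
            double[][] cm = eval.confusionMatrix();
            System.out.printf("%d. %-12s %6.2f%% correct, mismatches: %d / %d%n",
                i + 1, c.getClass().getSimpleName(), eval.pctCorrect(),
                (int) cm[0][1], (int) cm[1][0]);
        }
    }
}

Running the same loop with the oligogenic test set, and again with the reduced-feature ARFF files, would reproduce the counts behind the six bar-graph panels.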
Data used for testing

Training set: 1,084 genes known to be associated with a disease and 1,084 genes not known to be associated with disease.

HGMD (independent test set 1): 675 disease genes listed in the Human Gene Mutation Database (HGMD) and 675 genes not known to be involved in disease.

Oligogenic (independent test set 2): a set based on oligogenic disorders, containing 54 genes known to be associated with an oligogenic disorder and 54 genes not known to be associated with genetic disease.

Hypothesis

It has been suggested that genes which have some relationship to hereditary disease might have common variations in their DNA sequence structure. A research group [1] used the alternating decision tree algorithm from Weka to test this hypothesis. They used 24 distinct features to test about 18,000 genes not known to be involved in disease against the 1,084 Ensembl [2] genes also listed in OMIM. On average, 70% of the disease genes were correctly identified by their automatic classifier, which they called PROSPECTR. Can we do better with other methods of classification? We expected that focusing on a set of features highly predictive of disease genes would enhance the results, that is, decrease the number of misclassified genes. This did not happen.

What are disease genes?

Disease genes are genes that have been mutated so that the body, or some part of it, no longer functions correctly. Most of the more than 100 known genetic disorders are the direct result of a mutation in one gene. It is much more difficult to find the basis of diseases that have a complex pattern of inheritance, where more than one gene must be mutated before susceptibility to a disease is expressed.

PROSPECTR results

They tested a number of DNA features in an attempt to find differences between disease genes and non-disease genes. The table below gives, for the 9 of the 24 features that had statistically significant differences, the ratio of the median in the disease set to the median in the control set; the larger the ratio, the greater the dependence. (A sketch of reproducing such a feature ranking with Weka appears at the end of this transcript.)

Feature                               Ratio
Gene encodes signal peptide           2.06
Gene length                           1.42
5' CpG islands                        1.33
Protein length                        1.29
Exon number                           1.25
cDNA length                           1.15
Distance to neighboring gene          1.13
3' UTR length                         1.09
Protein identity with BRH in mouse    1.09

Results for all features
[Bar graphs for the full feature set; see Classification analysis of correct responses.]

Results for the reduced set of features
[Bar graphs for the reduced feature set; see Classification analysis of correct responses.]

Conclusions

We have shown that classifier 2 performs better than classifier 1, the one chosen by the PROSPECTR method. Either the reduced features are not a well-chosen subset for these analysis methods, or using these machine learning methods to classify disease genes is not very productive.

References

[1] Euan Adie et al., Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 2005, 6:55. http://www.biomedcentral.com/1471-2105/6/55
[2] Hammond MP, Birney E: Genome information resources - developments at Ensembl. Trends in Genetics 2004, 20:268-272.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=gnd
http://www.genetics.med.ed.ac.uk/prospectr/

Contact

Biocomputing, Fall 2005, CSCD 4930.004 / CSCE 5933.007
For more information contact: Armin R. Mikler, University of North Texas
Email: mikler@cs.unt.edu
Web: http://cerl.unt.edu
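The reduced feature set in this study was picked by hand from the PROSPECTR ratios above. Weka's attribute selection API can produce a comparable ranking directly from the training data. The following is a minimal sketch under two assumptions: it reuses the placeholder training.arff from the earlier example, and it ranks by information gain, since the poster does not state which criterion was used to pick the prominent features.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureRanking {
    public static void main(String[] args) throws Exception {
        // Placeholder name for the ARFF export of the training set.
        Instances data = new DataSource("training.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());  // rank by information gain
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(9);  // keep 9 features, matching the table above
        selector.setSearch(ranker);
        selector.SelectAttributes(data);  // Weka's capitalized method name

        // selectedAttributes() returns the chosen indices plus the class index.
        for (int idx : selector.selectedAttributes()) {
            if (idx != data.classIndex()) {
                System.out.println(data.attribute(idx).name());
            }
        }
    }
}

Whether such a data-driven subset matches the nine PROSPECTR features would itself be a useful check on the statistical-anomaly explanation offered in the Discussion.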