Classification of GPCRs at Family and Subfamily Levels Using Decision Trees & Naïve Bayes Classifiers
Betty Yee Man Cheng, Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell

The Problem & Motivation
- Classify GPCRs using n-grams at the
  - Family level
  - Level I subfamily level
- Compare the performance of decision trees and a naïve Bayes classifier to the SVM and BLAST results presented in Karchin's paper
- Determine how much classification accuracy is affected by using much simpler classifiers

Baseline: Karchin et al. Paper
- Karchin, Karplus and Haussler. "Classifying G-protein coupled receptors with support vector machines." Bioinformatics, vol. 18, no. 1, 2002.
- Compares the performance of a 1-NN classifier (BLAST), profile HMMs and SVMs in classifying GPCRs at the Level I and II subfamily levels (as well as the superfamily level)
- Concludes that while SVMs are the most computationally expensive, they are necessary to get annotation-quality classification

Decision Trees
[Example decision tree figure from Machine Learning, Tom Mitchell, McGraw Hill, 1997]

Why use Decision Trees?
- Easy to interpret the biological significance of results
- Nodes of the tree tell us which are the most discriminating features
- Prune the decision tree to avoid overfitting and improve accuracy on testing data
- Used the C4.5 software in this experiment
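As a concrete illustration of this setup, here is a minimal sketch of training a pruned decision tree on amino-acid n-gram counts. scikit-learn stands in for C4.5, and the pruning value, function name and inputs are illustrative assumptions rather than the configuration actually used in the experiments.

    # Sketch: pruned decision tree over amino-acid n-gram counts (a rough C4.5 stand-in;
    # scikit-learn, the ccp_alpha value and the function name are assumptions, not the original setup).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    def train_pruned_tree(sequences, families, n_max=3):
        """Fit a cost-complexity-pruned tree on 1..n_max amino-acid n-gram counts."""
        vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, n_max), lowercase=False)
        X = vectorizer.fit_transform(sequences)      # rows = sequences, columns = n-gram counts
        tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=1e-3, random_state=0)
        tree.fit(X, families)                        # ccp_alpha > 0 prunes, echoing C4.5's pruned trees
        return tree, vectorizer

The n-gram tested at the root of the fitted tree is then the single most discriminating feature, which is the interpretability argument made above.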

Naïve Bayes Classifier
- Example instance: Outlk = sun; Temp = cool; Humid = high; Wind = strong
- Used Rainbow for this experiment
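Rainbow is a text-classification toolkit; a rough modern stand-in for the same idea, multinomial naïve Bayes over n-gram frequency features, might look like the sketch below (the pipeline, smoothing value and names are assumptions for illustration, not the original code).

    # Sketch: multinomial naïve Bayes over amino-acid n-gram frequencies
    # (an illustrative stand-in for the Rainbow toolkit used in the experiments).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def train_naive_bayes(sequences, families, n_min=2, n_max=3):
        """Fit P(family) * prod P(n-gram | family) with Laplace (add-one) smoothing."""
        model = make_pipeline(
            CountVectorizer(analyzer="char", ngram_range=(n_min, n_max), lowercase=False),
            MultinomialNB(alpha=1.0),   # alpha=1.0 is Laplace smoothing
        )
        return model.fit(sequences, families)
        # e.g. train_naive_bayes(train_seqs, train_labels).predict(test_seqs)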

Family Level Classification

  Family                          # of Proteins    % of GPCRs
  Class A
  Class B                               83            6.12%
  Class C                               28            2.06%
  Class D                               11            0.81%
  Class E                                4            0.29%
  Class F                               45            3.32%
  Orphan A                              35            2.58%
  Orphan B                               2            0.15%
  Bacterial Rhodopsin                   23            1.70%
  Drosophila Odorant Receptors          31            2.29%
  Nematode Chemoreceptors                1            0.07%
  Ocular Albinism Proteins               2            0.15%
  Plant Mlo Receptors                   10            0.74%

Family Level – Classes A-E
Decision Trees
- Using counts of n-grams only
- No sequence length information

  N-grams        Unpruned Tree    Pruned Tree
  1-grams           95.70%          95.90%
  1,2-grams         95.30%          95.60%
  1,2,3-grams       96.70%

More Difficult … All Families

Decision Trees
  N-grams        Unpruned Tree    Pruned Tree
  1-grams           88.60%          89.40%
  1,2-grams         88.60%          89.50%
  1,2,3-grams       88.20%          89.30%

Naïve Bayes
  N-grams          Laplace    Witten-Bell
  2-grams           86.52%      90.30%
  2,3-grams         95.41%      95.19%
  2,3,4-grams       90.59%      94.44%
  2,3,4,5-grams     90.15%      93.56%
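The two Naïve Bayes columns refer to how probability mass is reserved for n-grams that never occur in a family's training data. As a hedged recap of the standard definitions (these formulas come from the general smoothing literature, not from the slides): Laplace (add-one) smoothing estimates P(n-gram | family) = (c + 1) / (N + V), where c is the n-gram's count in the family, N the total n-gram count and V the vocabulary size, while Witten-Bell smoothing reserves a total probability mass of T / (N + T) for unseen n-grams, where T is the number of distinct n-gram types observed in the family.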

Class A/B = Orphan A/B?
- Train decision trees on proteins in all classes except Orphan A and B
- Test on Orphan A and B proteins
- Are they classified as Class A and Class B?

  N-grams       Orphan A predictions                                         Orphan B predictions
  1-grams       31 Class A, 3 Class B, 1 Plant                               1 Class B, 1 Plant
  1,2-grams     30 Class A, 1 Class C, 1 Class D, 2 Class F, 1 Drosophila    2 Class A
  1,2,3-grams   28 Class A, 1 Class B, 3 Class C, 3 Drosophila               2 Class A

Feature Selection Needed
- A large number of features is problematic in most learning algorithms for training classifiers
- Reducing the number of features tends to reduce overfitting
- Reduce the number of features by feature extraction or feature selection
  - Feature extraction: new features are combinations or transformations of the given features
  - Feature selection: new features are a subset of the original features
- C4.5 cannot handle all 1, 2, 3, 4-grams
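For concreteness, the two routes can be sketched with scikit-learn stand-ins; the function, the value of k and the use of scikit-learn's chi2 (which differs from the thresholded chi-square described on the next slide) are assumptions for illustration only.

    # Sketch: feature selection keeps a subset of the original n-gram columns,
    # while feature extraction builds new features as combinations of them.
    from sklearn.feature_selection import SelectKBest, chi2    # selection
    from sklearn.decomposition import TruncatedSVD             # extraction

    def reduce_features(X, y, k=600, mode="selection"):
        """Reduce an (n_sequences x n_ngrams) count matrix X to k features."""
        if mode == "selection":
            return SelectKBest(chi2, k=k).fit_transform(X, y)  # keeps the k highest-scoring n-grams
        return TruncatedSVD(n_components=k).fit_transform(X)   # builds k linear combinations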

Some Feature Selection Methods
- Document Frequency
- Information Gain (a.k.a. Expected Mutual Information)
- Chi-Square
- Correlation Coefficient
- Relevancy Score
Best in document classification

Chi-Square
- Chi-square = Sum [(Observed - Expected)² / Expected]
- Expected =
- Observed = # of sequences in the family with > i occurrences of n-gram j
- For each n-gram x, find the i_x with the largest chi-square value
- Sort n-grams according to these chi-square values
- Binary features: for each chosen n-gram x, whether x has more than i_x occurrences
- Frequency features: frequency of each chosen n-gram x
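A minimal sketch of one reading of this procedure follows. Since the slide's definition of the expected count did not survive, a standard independence-based expectation (family size times the overall rate of exceeding the threshold) is assumed, and the threshold range, top_k default and variable names are likewise illustrative.

    # Sketch of the thresholded chi-square selection described above (one reading of the slide;
    # the expected-count formula and the threshold range are assumptions).
    import numpy as np

    def chi_square_select(counts, families, thresholds=range(0, 5), top_k=600):
        """counts: dense (n_sequences, n_ngrams) occurrence counts; families: one label per sequence.
        Returns (indices of the top_k n-grams, best threshold i_x for each)."""
        families = np.asarray(families)
        best_chi = np.zeros(counts.shape[1])
        best_i = np.zeros(counts.shape[1], dtype=int)
        for i in thresholds:
            exceeds = counts > i                        # which sequences have > i occurrences of each n-gram
            overall_rate = exceeds.mean(axis=0)
            chi = np.zeros(counts.shape[1])
            for fam in np.unique(families):
                in_fam = families == fam
                observed = exceeds[in_fam].sum(axis=0)  # sequences in the family with > i occurrences
                expected = overall_rate * in_fam.sum() + 1e-9   # assumed: family size * overall rate
                chi += (observed - expected) ** 2 / expected
            better = chi > best_chi
            best_chi[better], best_i[better] = chi[better], i
        order = np.argsort(best_chi)[::-1][:top_k]      # sort n-grams by their best chi-square value
        return order, best_i[order]

    def binary_features(counts, selected, thresholds):
        """Binary feature per chosen n-gram x: does the sequence have more than i_x occurrences?"""
        return (counts[:, selected] > thresholds).astype(int)

The frequency-feature variant would simply keep counts[:, selected] instead of thresholding.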

Effect of Chi-Square Selected Binary 1,2,3-grams in Decision Trees [plot]

Effect of Chi-Square Selected Frequency 1,2,3-grams in Decision Trees [plot]

Effect of Chi-Square Selected Binary 1,2,3-grams in Naïve Bayes [plot]

Effect of Chi-Square Selected Frequency 1,2,3-grams in Naïve Bayes [plot]

4-grams Not Useful!
- 4-grams are not useful in GPCR classification at the family level according to chi-square
- [Table: Top X | Number of N-grams | Percent of which are Size N]
- Number of 1, 2 and 3-grams = 9723 n-grams

Sequence Length is not Discriminating at Family Level
- Decision trees using 1, 2, 3-grams and sequence length
- Length was used in only 1 out of 10 folds, at level 11 in the decision tree
- Addition of length changed accuracy from 89.3% to 89.4%

Sequence Length is not Discriminating at Family Level [plot]

Family Level: Our Results
- BLAST as a 1-NN classifier: 94.32%
  - Uses the SWISS-PROT database
- Decision Tree: 91%
  - 600 top 1,2,3-gram frequency features from Chi-square
- Naïve Bayes: 96%
  - 2500 top 1,2,3-gram frequency features from Chi-square

Level I Subfamily Classification
- 15 Class A Level I subfamilies
  - 1207 sequences
- 4 Class C Level I subfamilies
  - 62 sequences
- Other sequences
  - 149 sequences
  - Archaeal rhodopsins, G-alpha proteins
- 2-fold cross validation, dataset split as given in the paper

Level I Subfamily Results

  Classifier                                Accuracy
  SVM                                        88.4%
  BLAST                                      83.3%
  SAM-T2K HMM                                69.9%
  kernNN                                     64.0%
  Decision Trees (700 freq 1,2,3-grams)      78.40% (77.25% without chi-square)
  Naïve Bayes (2500 freq 1,2,3-grams)        89.51%

Level II Subfamily Classification
- 65 Class A Level II subfamilies
  - 1133 sequences
- 6 Class C Level II subfamilies
  - 37 sequences
- Other sequences
  - 248 sequences
  - Archaeal rhodopsins
  - G-alpha proteins
  - Class A and Class C sequences with no Level II subfamily classification or in Level II subfamilies with only 1 member
- 2-fold cross validation

Level II Subfamily Results

  Classifier                            Accuracy
  SVM                                    86.3%
  BLAST                                  74.5%
  SAM-T2K HMM                            70.0%
  kernNN                                 51.0%
  Decision Trees (600 freq features)     69.0% (66.0% without chi-square)
  Naïve Bayes (2500 freq features)       82.36%

Conclusions
- The Naïve Bayes classifier seems to perform better than Decision Trees in classifying GPCRs, especially when used with Chi-square feature selection
- At the Level I subfamily level, Naïve Bayes surprisingly did better than SVM!
- At the Level II subfamily level, it seems SVM may be needed to gain an additional 4% accuracy; however, further experiments are needed, as varying the number of features in Naïve Bayes may make up the difference

References
- C4.5 software
- Karchin, Karplus and Haussler. "Classifying G-protein coupled receptors with support vector machines." Bioinformatics, vol. 18, no. 1, 2002.
- Mitchell, Tom. Machine Learning. McGraw Hill, 1997.
- Rainbow software
- Sebastiani, Fabrizio. "A Tutorial on Automated Text Categorisation." Proceedings of ASAI-99.