1
Classification of GPCRs at Family and Subfamily Levels Using Decision Trees & Naïve Bayes Classifiers
Betty Yee Man Cheng
Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell
2
The Problem & Motivation
- Classify GPCRs using n-grams at the:
  - Family level
  - Level I subfamily level
- Compare the performance of decision trees and a naïve Bayes classifier to the SVM and BLAST results presented in Karchin's paper
- Determine the extent of the effect of using much simpler classifiers
3
Baseline: Karchin et al. Paper
- Karchin, Karplus and Haussler. "Classifying G-protein coupled receptors with support vector machines." Bioinformatics, Vol. 18, No. 1, 2002, pp. 147-159.
- Compares the performance of a 1-NN classifier (BLAST), profile HMMs and SVMs in classifying GPCRs at the level I and level II subfamily levels (as well as the superfamily level)
- Concludes that while SVMs are the most computationally expensive, they are necessary to obtain annotation-quality classification
4
Decision Trees
(Example decision tree figure from: Machine Learning, Tom Mitchell, McGraw Hill, 1997. Slides: http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html)
5
Why use Decision Trees?
- Easy to interpret the biological significance of the results
  - Nodes of the tree tell us which features are the most discriminating
- Prune the decision tree to avoid overfitting and improve accuracy on testing data
- Used the C4.5 software in this experiment: http://www.cse.unsw.edu.au/~quinlan/
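To make the pipeline concrete, here is a minimal sketch of classifying sequences by their n-gram counts with a decision tree. The talk used C4.5; scikit-learn's DecisionTreeClassifier stands in for it here, and the sequences and labels are toy placeholders, not the GPCR dataset.

```python
# Minimal sketch: n-gram count features + a decision tree.
# C4.5 was used in the talk; sklearn's DecisionTreeClassifier is a stand-in,
# and the sequences/labels below are toy placeholders, not the GPCR data.
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def ngram_counts(sequence, n_values=(1, 2, 3)):
    """Count occurrences of every n-gram of each requested size."""
    counts = Counter()
    for n in n_values:
        for i in range(len(sequence) - n + 1):
            counts[sequence[i:i + n]] += 1
    return counts

def to_vector(counts, vocabulary):
    """Project a Counter of n-gram counts onto a fixed vocabulary."""
    return [counts[g] for g in vocabulary]

# Toy training data.
train_seqs = ["MNGTEGPNFYVPFSNKT", "MSTSKSDDGLAR", "MNLTQAARLLPW"]
train_labels = ["Class A", "Class B", "Class A"]

all_counts = [ngram_counts(s) for s in train_seqs]
vocabulary = sorted(set().union(*all_counts))
X = [to_vector(c, vocabulary) for c in all_counts]

tree = DecisionTreeClassifier().fit(X, train_labels)
print(tree.predict([to_vector(ngram_counts("MNGTESPNFKT"), vocabulary)]))
```

The learned tree's internal nodes then name the most discriminating n-grams directly, which is the interpretability argument made on this slide.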
6
Naïve Bayes Classifier
(Example figure: classifying the instance Outlook = sun, Temp = cool, Humid = high, Wind = strong)
- Used the Rainbow toolkit for this experiment: http://www-2.cs.cmu.edu/~mccallum/bow/
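As a rough illustration of what the classifier computes, here is a hand-rolled naïve Bayes over sequence 2-grams, predicting the class maximizing P(class) × Π P(n-gram | class). The talk used Rainbow; this toy version, with Laplace (add-one) smoothing (one of the two smoothers compared later) and made-up data, is only a sketch.

```python
# Toy naive Bayes over 2-grams: predict argmax_c P(c) * prod_g P(g | c).
# Rainbow was used in the talk; this is only an illustrative sketch,
# with Laplace (add-one) smoothing and made-up training data.
import math
from collections import Counter, defaultdict

def train_nb(sequences, labels, n=2):
    prior = Counter(labels)                # class counts -> P(c)
    gram_counts = defaultdict(Counter)     # per-class n-gram counts
    vocab = set()
    for seq, label in zip(sequences, labels):
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            gram_counts[label][gram] += 1
            vocab.add(gram)
    return prior, gram_counts, vocab, n

def predict_nb(model, seq):
    prior, gram_counts, vocab, n = model
    total = sum(prior.values())
    best_label, best_score = None, -math.inf
    for label in prior:
        score = math.log(prior[label] / total)          # log P(c)
        denom = sum(gram_counts[label].values()) + len(vocab)
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            # Laplace-smoothed log P(g | c)
            score += math.log((gram_counts[label][gram] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb(["MNGTEGPNFY", "MSTSKSDDGL", "MNLTQAARLL"], ["A", "B", "A"])
print(predict_nb(model, "MNGTEGP"))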
7
Family Level Classification

Family                       | # of Proteins | % of GPCRs
Class A                      | 1081          | 79.72%
Class B                      | 83            | 6.12%
Class C                      | 28            | 2.06%
Class D                      | 11            | 0.81%
Class E                      | 4             | 0.29%
Class F                      | 45            | 3.32%
Orphan A                     | 35            | 2.58%
Orphan B                     | 2             | 0.15%
Bacterial Rhodopsin          | 23            | 1.70%
Drosophila Odorant Receptors | 31            | 2.29%
Nematode Chemoreceptors      | 1             | 0.07%
Ocular Albinism Proteins     | 2             | 0.15%
Plant Mlo Receptors          | 10            | 0.74%
8
Family Level – Classes A-E
Decision Trees
- Using counts of n-grams only
- No sequence length information

N-grams       | Unpruned Tree | Pruned Tree
1-grams       | 95.70%        | 95.90%
1, 2-grams    | 95.30%        | 95.60%
1, 2, 3-grams | 96.70%        |
9
More Difficult … All Families

Decision Trees
N-grams       | Unpruned Tree | Pruned Tree
1-grams       | 88.60%        | 89.40%
1, 2-grams    | 88.60%        | 89.50%
1, 2, 3-grams | 88.20%        | 89.30%

Naïve Bayes
N-grams          | Laplace | Witten-Bell
2-grams          | 86.52%  | 90.30%
2, 3-grams       | 95.41%  | 95.19%
2, 3, 4-grams    | 90.59%  | 94.44%
2, 3, 4, 5-grams | 90.15%  | 93.56%
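For the two smoothing columns in the naïve Bayes table, the sketch below contrasts the two estimates of P(n-gram | class). The exact variant Rainbow implements may differ slightly; this follows a common formulation of Witten-Bell smoothing.

```python
# Two smoothed estimates of P(gram | class), as contrasted in the table.
# `counts` is a Counter of n-gram frequencies within one class;
# `vocab_size` is the number of possible n-grams. Illustrative only.
def laplace(counts, gram, vocab_size):
    # Add-one smoothing: every possible n-gram gets one phantom count.
    n_tokens = sum(counts.values())
    return (counts[gram] + 1) / (n_tokens + vocab_size)

def witten_bell(counts, gram, vocab_size):
    # Reserve probability mass T/(N+T) for unseen n-grams and split it
    # among the Z n-gram types never observed in this class.
    n_tokens = sum(counts.values())      # N: total n-gram occurrences
    n_types = len(counts)                # T: distinct n-grams seen
    n_unseen = vocab_size - n_types      # Z: n-grams never seen
    if counts[gram] > 0:
        return counts[gram] / (n_tokens + n_types)
    return n_types / (n_unseen * (n_tokens + n_types))
```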
10
Class A/B = Orphan A/B?
- Train decision trees on proteins in all classes except Orphan A and B
- Test on the Orphan A and B proteins
- Are they classified as Class A and Class B?

N-grams       | Orphan A                                                  | Orphan B
1-grams       | 31 Class A, 3 Class B, 1 Plant                            | 1 Class B, 1 Plant
1, 2-grams    | 30 Class A, 1 Class C, 1 Class D, 2 Class F, 1 Drosophila | 2 Class A
1, 2, 3-grams | 28 Class A, 1 Class B, 3 Class C, 3 Drosophila            | 2 Class A
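A sketch of this hold-out protocol, reusing the toy n-gram/decision-tree pipeline from the earlier slide; `dataset` is a hypothetical placeholder for the real list of (sequence, family label) pairs.

```python
# Hold-out check: train without the orphan families, then tally what the
# orphan sequences get labeled as. Reuses ngram_counts/to_vector from the
# earlier decision-tree sketch; `dataset` below is a hypothetical stand-in.
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

dataset = [("MNGTEGPNFYVPF", "Class A"), ("MSTSKSDDGLAR", "Class B"),
           ("MNLTQAARLLPW", "Class A"), ("MKTIIALSYIFC", "Orphan A"),
           ("MDEKRNIYQVLN", "Orphan B")]

train = [(s, l) for s, l in dataset if not l.startswith("Orphan")]
counts = [ngram_counts(s) for s, _ in train]
vocabulary = sorted(set().union(*counts))
X = [to_vector(c, vocabulary) for c in counts]
tree = DecisionTreeClassifier().fit(X, [l for _, l in train])

for orphan in ("Orphan A", "Orphan B"):
    held_out = [s for s, l in dataset if l == orphan]
    preds = tree.predict([to_vector(ngram_counts(s), vocabulary) for s in held_out])
    print(orphan, Counter(preds))   # how many orphans look like each class
```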
11
Feature Selection Needed
- A large number of features is problematic for most learning algorithms when training classifiers
- Reducing the number of features tends to reduce overfitting
- Reduce the number of features by feature extraction or feature selection
  - Feature extraction: new features are combinations or transformations of the given features
  - Feature selection: new features are a subset of the original features
- C4.5 cannot handle all 1, 2, 3, 4-grams
12
Some Feature Selection Methods
- Document Frequency
- Information Gain (a.k.a. Expected Mutual Information)
- Chi-Square (best in document classification)
- Correlation Coefficient
- Relevancy Score
13
Chi-Square
- Chi-square statistic: Sum[(Observed − Expected)² / Expected]
  - Observed = # of sequences in the family with > i occurrences of n-gram j
  - Expected = the count predicted if n-gram occurrence were independent of family
- For each n-gram x, find the threshold i_x with the largest chi-square value
- Sort the n-grams according to these chi-square values
- Binary features: for each chosen n-gram x, whether x occurs more than i_x times
- Frequency features: the frequency of each chosen n-gram x
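Below is a sketch of this scoring under the stated definitions: for an n-gram and a threshold i, the binary event "more than i occurrences" is compared per family against the count expected if the event were independent of family. This uses the standard two-by-K contingency form; the slide's exact variant may sum only the positive cells.

```python
# Chi-square score for one (n-gram, threshold) pair across all families,
# plus the per-n-gram threshold search described on the slide. Sketch only.
def chi_square(sequences, labels, gram, threshold):
    families = sorted(set(labels))
    has_feature = [seq.count(gram) > threshold for seq in sequences]
    overall_rate = sum(has_feature) / len(sequences)
    score = 0.0
    for fam in families:
        members = [k for k, lab in enumerate(labels) if lab == fam]
        observed = sum(has_feature[k] for k in members)
        # Both cells of the contingency row: "> threshold" and "<= threshold".
        for obs, rate in ((observed, overall_rate),
                          (len(members) - observed, 1 - overall_rate)):
            expected = len(members) * rate
            if expected > 0:
                score += (obs - expected) ** 2 / expected
    return score

def best_threshold(sequences, labels, gram, max_i=5):
    # i_x from the slide: the occurrence threshold with the largest chi-square.
    return max(range(max_i + 1),
               key=lambda i: chi_square(sequences, labels, gram, i))
```

N-grams would then be sorted by their score at i_x, and the top-ranked ones kept as binary or frequency features.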
14
Effect of Chi-Square Selected Binary 1,2,3-grams in Decision Trees (chart)
15
Effect of Chi-Square Selected Frequency 1,2,3-grams in Decision Trees (chart)
16
Effect of Chi-Square Selected Binary 1,2,3-grams in Naïve Bayes (chart)
17
Effect of Chi-Square Selected Frequency 1,2,3-grams in Naïve Bayes (chart)
18
4-grams Not Useful!
- 4-grams are not useful in GPCR classification at the family level, according to chi-square
- (Chart: of the top X n-grams ranked by chi-square, the percent that are of size N)
- Number of 1, 2, and 3-grams = 21 + 21² + 21³ = 9723 n-grams
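A one-liner to check the count, assuming the 21-letter residue alphabet used on the slide:

```python
# Total number of distinct 1-, 2- and 3-grams over a 21-letter alphabet.
print(sum(21 ** k for k in (1, 2, 3)))   # 9723
```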
19
Sequence Length is not Discriminating at Family Level
- Decision trees using 1, 2, 3-grams and sequence length
- Length was used in only 1 out of 10 folds, at level 11 in the decision tree
- Adding length changed accuracy from 89.3% to 89.4%
21
Family Level: Our Results
- BLAST as a 1-NN classifier: 94.32%
  - Uses the SWISS-PROT database
- Decision Tree: 91%
  - 600 top 1,2,3-gram frequency features from chi-square
- Naïve Bayes: 96%
  - 2500 top 1,2,3-gram frequency features from chi-square
22
Level I Subfamily Classification
- 15 Class A Level I subfamilies: 1207 sequences
- 4 Class C Level I subfamilies: 62 sequences
- Other sequences: 149 sequences
  - Archaeal rhodopsins, G-alpha proteins
- 2-fold cross validation, with the dataset split as given in the paper
23
Level I Subfamily Results

Classifier                            | Accuracy
SVM                                   | 88.4%
BLAST                                 | 83.3%
SAM-T2K HMM                           | 69.9%
kernNN                                | 64.0%
Decision Trees (700 freq 1,2,3-grams) | 78.40% (77.25% without chi-square)
Naïve Bayes (2500 freq 1,2,3-grams)   | 89.51%
24
Level II Subfamily Classification
- 65 Class A Level II subfamilies: 1133 sequences
- 6 Class C Level II subfamilies: 37 sequences
- Other sequences: 248 sequences
  - Archaeal rhodopsins
  - G-alpha proteins
  - Class A and Class C sequences with no Level II subfamily classification, or in Level II subfamilies with only 1 member
- 2-fold cross validation
25
Level II Subfamily Results

Classifier                         | Accuracy
SVM                                | 86.3%
BLAST                              | 74.5%
SAM-T2K HMM                        | 70.0%
kernNN                             | 51.0%
Decision Trees (600 freq features) | 69.0% (66.0% without chi-square)
Naïve Bayes (2500 freq features)   | 82.36%
26
Conclusions
- The naïve Bayes classifier seems to perform better than decision trees in classifying GPCRs, especially when used with chi-square feature selection
- At the Level I subfamily level, naïve Bayes surprisingly did better than the SVM!
- At the Level II subfamily level, it seems an SVM may be needed to gain the additional 4% accuracy; however, further experiments are needed to check this, as varying the number of features in naïve Bayes may make up the difference
27
References
- C4.5 Software. http://www.cse.unsw.edu.au/~quinlan/
- Karchin, Karplus and Haussler. "Classifying G-protein coupled receptors with support vector machines." Bioinformatics, Vol. 18, No. 1, 2002, pp. 147-159.
- Mitchell, Tom. Machine Learning. McGraw Hill, 1997. http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
- Rainbow Software. http://www-2.cs.cmu.edu/~mccallum/bow/
- Sebastiani, Fabrizio. "A Tutorial on Automated Text Categorisation." Proceedings of ASAI-99, 1999. http://citeseer.nj.nec.com/sebastiani99tutorial.html