
1 Classification of GPCRs at Family and Subfamily Levels Using Decision Trees & Naïve Bayes Classifiers. Betty Yee Man Cheng, Language Technologies Institute, CMU. Advisors: Judith Klein-Seetharaman, Jaime Carbonell

2 The Problem & Motivation
- Classify GPCRs using n-grams at the family level and the level I subfamily level
- Compare the performance of decision trees and a naïve Bayes classifier to the SVM and BLAST results presented in Karchin's paper
- Determine how much accuracy is lost by using much simpler classifiers
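
The n-gram features referred to throughout the slides are counts of short amino-acid substrings. A minimal sketch of how such counts could be computed (a hypothetical helper, not the authors' code):

from collections import Counter

def ngram_counts(sequence, max_n=3):
    # Count every overlapping substring of length 1..max_n in a protein sequence.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(sequence) - n + 1):
            counts[sequence[i:i + n]] += 1
    return counts

# Example on a short, made-up fragment of a GPCR sequence:
print(ngram_counts("MNGTEGPNFYVPFSNKTGVV", max_n=2).most_common(5))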

3 Baseline: Karchin et al. Paper
- Karchin, Karplus and Haussler. "Classifying G-protein coupled receptors with support vector machines." Bioinformatics, Vol. 18, no. 1, 2002, pp. 147-159.
- Compares the performance of a 1-NN classifier (BLAST), profile HMMs and SVMs in classifying GPCRs at the level I and level II subfamily levels (as well as the superfamily level)
- Concludes that while SVMs are the most computationally expensive, they are necessary to get annotation-quality classification

4 Decision Trees
Figure (from Machine Learning, Tom Mitchell, McGraw Hill, 1997): example decision tree. http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html

5 Why use Decision Trees?
- Easy to interpret the biological significance of results: the nodes of the tree tell us which features are the most discriminating
- Prune the decision tree to avoid overfitting and improve accuracy on testing data
- Used the C4.5 software in this experiment: http://www.cse.unsw.edu.au/~quinlan/
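
The experiments used Quinlan's C4.5 directly; as a rough illustration of the same idea, the sketch below trains scikit-learn's DecisionTreeClassifier (a CART implementation, not C4.5) on toy n-gram count vectors, with a depth limit standing in for pruning. The feature names and data are made up.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: rows = sequences, columns = n-gram counts; labels = GPCR families.
X = [[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 3, 5]]
y = ["Class A", "Class B", "Class A", "Class B"]

tree = DecisionTreeClassifier(max_depth=3)  # depth limit as a crude stand-in for pruning
tree.fit(X, y)

# Printing the tree shows which n-gram features the splits use,
# i.e. which features are the most discriminating.
print(export_text(tree, feature_names=["ngram_LV", "ngram_GN", "ngram_DRY"]))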

6 Naïve Bayes Classifier
Example instance: Outlk = sun; Temp = cool; Humid = high; Wind = strong
Used Rainbow for this experiment: http://www-2.cs.cmu.edu/~mccallum/bow/
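
The original experiments used the Rainbow/Bow toolkit; a minimal sketch of the same kind of model with scikit-learn's MultinomialNB on toy n-gram count vectors (data and labels here are hypothetical):

from sklearn.naive_bayes import MultinomialNB

# Toy data: rows = sequences, columns = n-gram counts; labels = GPCR families.
X_train = [[5, 0, 2], [0, 4, 1], [6, 1, 0], [1, 3, 2]]
y_train = ["Class A", "Class B", "Class A", "Class B"]

nb = MultinomialNB(alpha=1.0)  # alpha=1.0 corresponds to Laplace (add-one) smoothing
nb.fit(X_train, y_train)
print(nb.predict([[4, 0, 1]]))  # predicted family for a new n-gram count vector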

7 Family Level Classification

Family | # of Proteins | % of GPCRs
Class A | 1081 | 79.72%
Class B | 83 | 6.12%
Class C | 28 | 2.06%
Class D | 11 | 0.81%
Class E | 4 | 0.29%
Class F | 45 | 3.32%
Orphan A | 35 | 2.58%
Orphan B | 2 | 0.15%
Bacterial Rhodopsin | 23 | 1.70%
Drosophila Odorant Receptors | 31 | 2.29%
Nematode Chemoreceptors | 1 | 0.07%
Ocular Albinism Proteins | 2 | 0.15%
Plant Mlo Receptors | 10 | 0.74%

8 Family Level – Classes A-E
- Decision trees
- Using counts of n-grams only
- No sequence length information

N-grams | Unpruned Tree | Pruned Tree
1-grams | 95.70% | 95.90%
1,2-grams | 95.30% | 95.60%
1,2,3-grams | 96.70% |

9 More Difficult … All Families

Decision Trees
N-grams | Unpruned Tree | Pruned Tree
1-grams | 88.60% | 89.40%
1,2-grams | 88.60% | 89.50%
1,2,3-grams | 88.20% | 89.30%

Naïve Bayes
N-grams | Laplace | Witten-Bell
2-grams | 86.52% | 90.30%
2,3-grams | 95.41% | 95.19%
2,3,4-grams | 90.59% | 94.44%
2,3,4,5-grams | 90.15% | 93.56%
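
The Laplace and Witten-Bell columns refer to how the naïve Bayes class-conditional n-gram probabilities are smoothed for unseen n-grams. A rough sketch of the two estimates for a single class, assuming a fixed n-gram vocabulary (this approximates the idea, not the Rainbow implementation):

def laplace(count, total, vocab_size):
    # Add-one (Laplace) estimate of P(ngram | class).
    return (count + 1) / (total + vocab_size)

def witten_bell(count, total, seen_types, vocab_size):
    # Witten-Bell: reserve probability mass in proportion to the number of
    # distinct n-gram types seen; unseen types share the reserved mass.
    if count > 0:
        return count / (total + seen_types)
    unseen_types = vocab_size - seen_types
    return seen_types / (unseen_types * (total + seen_types))

# Toy numbers: 100 n-gram tokens in the class, 40 distinct types seen,
# a vocabulary of 9723 possible 1-3 grams (as on the 4-grams slide).
print(laplace(3, 100, 9723))          # probability of an n-gram seen 3 times
print(witten_bell(0, 100, 40, 9723))  # probability of an unseen n-gram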

10 Class A/B = Orphan A/B?
- Train decision trees on proteins in all classes except Orphan A and B
- Test on the Orphan A and B proteins
- Are they classified as Class A and Class B?

N-grams | Orphan A | Orphan B
1-grams | 31 Class A, 3 Class B, 1 Plant | 1 Class B, 1 Plant
1,2-grams | 30 Class A, 1 Class C, 1 Class D, 2 Class F, 1 Drosophila | 2 Class A
1,2,3-grams | 28 Class A, 1 Class B, 3 Class C, 3 Drosophila | 2 Class A

11 Feature Selection Needed
- A large number of features is problematic for most learning algorithms used to train classifiers
- Reducing the number of features tends to reduce overfitting
- The number of features can be reduced by feature extraction or feature selection
  - Feature extraction – new features are combinations or transformations of the given features
  - Feature selection – new features are a subset of the original features
- C4.5 cannot handle all 1, 2, 3, 4-grams

12 Some Feature Selection Methods
- Document Frequency
- Information Gain (a.k.a. Expected Mutual Information)
- Chi-Square
- Correlation Coefficient
- Relevancy Score
Best in document classification

13 Chi-Square
- Chi² = Sum [(Expected – Observed)² / Expected]
- Expected = (# of sequences in the family) × (# of sequences with > i occurrences of n-gram j) / (total # of sequences)
- Observed = # of sequences in the family with > i occurrences of n-gram j
- For each n-gram x, find the threshold i_x with the largest chi-square value
- Sort the n-grams by these chi-square values
- Binary features – for each chosen n-gram x, whether x occurs more than i_x times
- Frequency features – the frequency of each chosen n-gram x
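
A rough sketch of the selection procedure described above, using hypothetical helper functions (not the authors' code): for one n-gram, score the binary feature "occurs more than i times" against the family label with a chi-square statistic, and keep the threshold i_x that maximizes it.

import numpy as np

def chi_square_score(counts_per_seq, family_labels, threshold):
    # Chi-square statistic for the binary feature "n-gram occurs > threshold times",
    # computed over a 2 x (number of families) contingency table.
    above = np.array([c > threshold for c in counts_per_seq])
    chi2 = 0.0
    for fam in sorted(set(family_labels)):
        in_fam = np.array([lab == fam for lab in family_labels])
        for flag in (True, False):
            observed = np.sum(in_fam & (above == flag))
            expected = in_fam.sum() * np.mean(above == flag)
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def best_threshold(counts_per_seq, family_labels):
    # Pick the occurrence threshold i_x with the largest chi-square value.
    return max(range(int(max(counts_per_seq)) + 1),
               key=lambda i: chi_square_score(counts_per_seq, family_labels, i))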

14 Effect of Chi²-Selected Binary 1,2,3-grams in Decision Trees

15 Effect of Chi²-Selected Frequency 1,2,3-grams in Decision Trees

16 Effect of Chi²-Selected Binary 1,2,3-grams in Naïve Bayes

17 Effect of Chi²-Selected Frequency 1,2,3-grams in Naïve Bayes

18 4-grams Not Useful!
- 4-grams are not useful in GPCR classification at the family level according to chi-square
- Table: top X n-grams vs. the percentage of each size N among them
- Number of 1-, 2- and 3-grams = 21 + 21² + 21³ = 9723 n-grams

19 Sequence Length is not Discriminating at Family Level
- Decision trees using 1, 2, 3-grams and sequence length
- Length was used in only 1 out of 10 folds, at level 11 in the decision tree
- Adding length changed the accuracy from 89.3% to 89.4%

20 Sequence Length is not Discriminating at Family Level

21 Family Level: Our Results
- BLAST as a 1-NN classifier: 94.32% (using the SWISS-PROT database)
- Decision tree: 91% (600 top 1,2,3-gram frequency features from chi-square)
- Naïve Bayes: 96% (2500 top 1,2,3-gram frequency features from chi-square)

22 Level I Subfamily Classification
- 15 Class A level I subfamilies (1207 sequences)
- 4 Class C level I subfamilies (62 sequences)
- Other sequences (149 sequences): archaea rhodopsins, G-alpha proteins
- 2-fold cross-validation, with the dataset split as given in the paper
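
A minimal sketch of the 2-fold cross-validation setup with a naïve Bayes classifier; the count matrix, subfamily labels, and random split below are toy placeholders rather than the published split.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

X = np.random.randint(0, 5, size=(20, 50))          # toy n-gram count matrix
y = np.random.choice(["Amine", "Peptide"], size=20)  # toy subfamily labels

accs = []
for train_idx, test_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    model = MultinomialNB().fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))
print("2-fold accuracy:", np.mean(accs))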

23 Level I Subfamily Results

Classifier | Accuracy
SVM | 88.4%
BLAST | 83.3%
SAM-T2K HMM | 69.9%
kernNN | 64.0%
Decision Trees (700 freq 1,2,3-grams) | 78.40% (77.25% without chi-square)
Naïve Bayes (2500 freq 1,2,3-grams) | 89.51%

24 Level II Subfamily Classification
- 65 Class A level II subfamilies (1133 sequences)
- 6 Class C level II subfamilies (37 sequences)
- Other sequences (248 sequences):
  - Archaea rhodopsins
  - G-alpha proteins
  - Class A and Class C sequences with no level II subfamily classification, or in level II subfamilies with only 1 member
- 2-fold cross-validation

25 Level II Subfamily Results

Classifier | Accuracy
SVM | 86.3%
BLAST | 74.5%
SAM-T2K HMM | 70.0%
kernNN | 51.0%
Decision Trees (600 freq features) | 69.0% (66.0% without chi-square)
Naïve Bayes (2500 freq features) | 82.36%

26 Conclusions
- The naïve Bayes classifier seems to perform better than decision trees in classifying GPCRs, especially when used with chi-square feature selection
- At the level I subfamily level, naïve Bayes surprisingly did better than the SVM!
- At the level II subfamily level, it seems an SVM may be needed to gain the additional 4% accuracy; however, further experiments are needed, as varying the number of features in naïve Bayes may make up the difference

27 References
- C4.5 software: http://www.cse.unsw.edu.au/~quinlan/
- Karchin, Karplus and Haussler. "Classifying G-protein coupled receptors with support vector machines." Bioinformatics, Vol. 18, no. 1, 2002, pp. 147-159.
- Mitchell, Tom. Machine Learning. McGraw Hill, 1997. http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
- Rainbow software: http://www-2.cs.cmu.edu/~mccallum/bow/
- Sebastiani, Fabrizio. "A Tutorial on Automated Text Categorisation." Proceedings of ASAI-99, 1999. http://citeseer.nj.nec.com/sebastiani99tutorial.html

