
Slide 1: Semi-supervised learning for protein classification
Brian R. King and Chittibabu Guda, Ph.D.
Department of Computer Science, University at Albany, SUNY
Gen*NY*sis Center for Excellence in Cancer Genomics, University at Albany, SUNY

Slide 2: The problem
Develop computational models of characteristics of protein structure and function from sequence alone, using machine-learned classifiers.
- Input: data. Output: a model (function) h : X → Y.
- Traditional approach: supervised learning.

Challenges:
- Experimentally determined data are expensive, limited, and subject to noise/error.
- Large repositories of unannotated data (Swiss-Prot 54.5: 289,473 sequences; TrEMBL 37.5: 5,035,267 sequences).
- Data representation, bias from unbalanced/underrepresented classes, etc.

AIM: Develop a method that uses both labeled and unlabeled data while improving performance given the challenges presented by small, unbalanced data.

Slide 3: Solution
Semi-supervised learning: use both D_l (labeled) and D_u (unlabeled) data for model induction.

Method: a generative, Bayesian probabilistic model (sketched below)
- Based on ngLOC, a supervised naive Bayes classification method.
- Input/feature representation: sequence → n-gram model.
- Assumptions: multinomial distribution; sequences and n-grams are i.i.d.
- Parameters estimated with expectation maximization (EM).

Test setup: prediction of subcellular localization, eukaryotic non-plant sequences only.
- D_l: data annotated with subcellular localization for eukaryotic, non-plant sequences.
  - DL-2: EXT/PLA (~5,500 sequences, balanced)
  - DL-3: GOL [65%] / LYS [14%] / POX [21%] (~600 sequences, unbalanced)
- D_u: drawn from ~75K eukaryotic, non-plant protein sequences.

Comparative method: transductive SVM (TSVM).
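The slide names the pieces (n-gram counts, multinomial naive Bayes, EM over labeled plus unlabeled data) but not how they fit together. Below is a minimal sketch of one plausible assembly, assuming scikit-learn and SciPy; ngLOC's actual model and smoothing details differ, and the function name semisupervised_nb, the trigram setting n=3, and the fixed iteration count are illustrative choices, not the authors'.

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def semisupervised_nb(seqs_l, y_l, seqs_u, n=3, n_iter=10):
    """EM over a multinomial naive Bayes model of character n-gram counts.

    seqs_l/seqs_u are lists of amino-acid sequence strings; y_l holds the
    localization labels (e.g. "EXT", "PLA") for the labeled sequences.
    """
    # Shared n-gram vocabulary over labeled and unlabeled sequences
    vec = CountVectorizer(analyzer="char", ngram_range=(n, n))
    X = vec.fit_transform(list(seqs_l) + list(seqs_u))
    X_l, X_u = X[: len(seqs_l)], X[len(seqs_l):]

    # Initialize the model from labeled data only
    nb = MultinomialNB()
    nb.fit(X_l, y_l)
    classes = nb.classes_

    for _ in range(n_iter):
        # E-step: posterior class probabilities for each unlabeled sequence
        post = nb.predict_proba(X_u)

        # M-step: refit on labeled data plus soft-labeled unlabeled data.
        # Each unlabeled sequence appears once per class, weighted by its
        # posterior for that class (soft labels via sample_weight).
        X_all = sp.vstack([X_l] + [X_u] * len(classes))
        y_all = np.concatenate(
            [np.asarray(y_l)] + [np.full(X_u.shape[0], c) for c in classes]
        )
        w_all = np.concatenate(
            [np.ones(X_l.shape[0])] + [post[:, k] for k in range(len(classes))]
        )
        nb = MultinomialNB()
        nb.fit(X_all, y_all, sample_weight=w_all)

    return vec, nb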

Slide 4: Algorithms based on EM
EM-λ on DL-3 data:
- λ controls the effect of unlabeled data on parameter adjustments (see the weight change sketched below).
- All labeled data (~600 sequences); varied amounts of unlabeled data.
- EM-λ outperforms TSVM on this problem (TSVM failed to converge on large amounts of unlabeled data, despite parameter selection).
- Note: TSVM performed very well on binary, balanced classification problems.

Basic EM on DL-2:
- Varied amounts of labeled data; 25,000 unlabeled sequences.
- Most improvement when labeled data is limited.
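The slide does not spell out the EM-λ update. One standard formulation, and a plausible reading of "λ controls the effect of UL data", scales the unlabeled contribution in the M-step; in the sketch from Slide 3 that is a one-line change to the sample weights. The parameter name lam is hypothetical.

# Hypothetical EM-λ variant of the M-step weights from the earlier sketch:
# unlabeled posteriors are scaled by lam in [0, 1], so lam = 0 recovers the
# purely supervised model and lam = 1 recovers basic EM.
w_all = np.concatenate(
    [np.ones(X_l.shape[0])] + [lam * post[:, k] for k in range(len(classes))]
)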

Slide 5: Algorithm: EM-CS
- The core ngLOC method outputs a confidence score (CS).
- Improve running time through intelligent selection of unlabeled instances: use instance x_i only if CS(x_i) > CS_thresh.
- Test on DL-3 data. First, determine the range of CS scores through cross-validation without unlabeled data: 33.5-47.8 (dependent on the level of similarity in the data and the size of the dataset).
- Using only sequences that meet or exceed CS_thresh significantly reduces the unlabeled data required (97.5% eliminated).
- Note: it is possible to reduce the unlabeled data too much.
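The slide defines the filter (keep x_i only when CS(x_i) exceeds CS_thresh) but not the CS formula, which is ngLOC-specific. The stand-in below scores confidence as the margin between the top two class posteriors, scaled to a rough 0-100 range; that scoring rule is an assumption for illustration, not the ngLOC definition, and select_confident is a hypothetical helper built on the Slide 3 sketch.

def select_confident(nb, X_u, cs_thresh):
    """Keep only unlabeled instances whose confidence score exceeds cs_thresh.

    Stand-in CS: scaled margin between the top two class posteriors.
    The real ngLOC confidence score is defined differently.
    """
    post = np.sort(nb.predict_proba(X_u), axis=1)
    cs = 100.0 * (post[:, -1] - post[:, -2])  # posterior margin, scaled
    return X_u[cs > cs_thresh]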

Slide 6: Conclusion
Benefits:
- Probabilistic: can extract "high-confidence" unlabeled sequences, which is difficult with SVM or TSVM.
- Knowledge can be extracted from the model: discriminative n-grams and anomalies, via information-theoretic measures such as KL divergence. Again, difficult with SVM or TSVM.
- Computational resources: time significantly lower than SVM and TSVM; space dependent on the n-gram model. Can use large amounts of unlabeled data.
- Applicable toward prediction of any structural or functional characteristic.
- Outputs a global model (transduction is not global!).
- Most substantial gain with limited labeled data.

Current work in progress:
- TSVMs: improve performance on smaller, unbalanced data; select an improved lower-dimensional feature-space representation.
- Ensemble classifiers, Bayesian model averaging, mixture of experts.

