Discriminatively Trained Markov Model for Sequence Classification
Oksana Yakhnenko, Adrian Silvescu, Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University
ICDM 2005
Research supported in part by grants from the National Science Foundation (IIS ) and the National Institutes of Health (GM066387).

Outline
Background
Markov Models
Generative vs. Discriminative Training
Discriminative Markov Model
Experiments and Results
Conclusion

Sequence Classification
Σ – an alphabet
s ∈ Σ* – a sequence
C = {c_1, c_2, …, c_n} – a set of class labels
Goal: given a training set D = {⟨s, c⟩} of labeled sequences, produce a hypothesis h: Σ* → C and assign c = h(s) to an unknown sequence s from Σ*
Applications
– computational biology: protein function prediction, protein structure classification, …
– natural language processing: speech recognition, spam detection, …

Generative Models
Learning phase:
– Model the process that generates the data: assume the parameters specify a probability distribution for the data
– Learn the parameters that maximize the joint probability of example and class: P(x, c)
[Figure: graphical model of the parameters θ generating the data]

Generative Models
Classification phase: assign the most likely class to a novel sequence s,
c = argmax_{c_j} P(c_j | s)
Simplest case – the Naïve Bayes assumption:
– assume all features in s are independent given c_j
– estimate each class-conditional feature probability P(s_i | c_j) from the training data
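To make the classification rule concrete, here is a minimal sketch in Python of a Naïve Bayes classifier over sequences (a Markov model of order 0). All function and variable names are hypothetical, and the add-one smoothing is an added assumption: per-class symbol counts are collected in one pass, and a new sequence is assigned the class with the highest log joint probability.

    from collections import defaultdict
    from math import log

    def train_naive_bayes(data, alphabet):
        """data: list of (sequence, class_label) pairs. Returns log P(c) and log P(symbol | c)."""
        class_counts = defaultdict(int)
        symbol_counts = defaultdict(lambda: defaultdict(int))
        for seq, c in data:
            class_counts[c] += 1
            for a in seq:
                symbol_counts[c][a] += 1
        n = sum(class_counts.values())
        log_prior = {c: log(class_counts[c] / n) for c in class_counts}
        log_cond = {}
        for c in class_counts:
            total = sum(symbol_counts[c].values()) + len(alphabet)  # add-one (Laplace) smoothing
            log_cond[c] = {a: log((symbol_counts[c][a] + 1) / total) for a in alphabet}
        return log_prior, log_cond

    def classify_naive_bayes(seq, log_prior, log_cond):
        """Assign the most likely class: argmax_c  log P(c) + sum_i log P(s_i | c)."""
        return max(log_prior, key=lambda c: log_prior[c] + sum(log_cond[c][a] for a in seq))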

Markov Models
Capture dependencies between elements in the sequence
The joint probability decomposes into a product of the probability of each element given its predecessors
[Figures: graphical models of a Markov model of order 0 (Naïve Bayes), order 1, and order 2]

Markov Models of Order k−1
In general, with k−1 dependencies the full likelihood is
P(s, c) = P(s_1 s_2 … s_{k−1}, c) · Π_i P(s_{i+k−1} | s_i s_{i+1} … s_{i+k−2}, c)
Two types of parameters (the sufficient statistics) have closed-form solutions and can be estimated in one pass through the data:
– P(s_1 s_2 … s_{k−1}, c)
– P(s_{i+k−1} | s_i s_{i+1} … s_{i+k−2}, c)
Good accuracy and expressive power in protein function prediction tasks [Peng & Schuurmans, 2003], [Andorf et al., 2004]
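As an illustration, here is a minimal sketch of that one-pass estimation and of evaluating the full likelihood of a sequence. The names, dictionary layout, and add-one smoothing are assumptions made for the sketch, not details taken from the paper.

    from collections import defaultdict
    from math import log

    def estimate_markov_counts(data, k):
        """One pass over data = [(sequence, class_label)]: count k-grams, their
        (k-1)-gram contexts, and the initial (k-1)-gram block, per class."""
        kgram = defaultdict(int)       # (class, context, symbol) -> count
        context = defaultdict(int)     # (class, context) -> count
        initial = defaultdict(int)     # (class, first k-1 symbols) -> count
        class_count = defaultdict(int)
        for seq, c in data:
            class_count[c] += 1
            initial[(c, seq[:k - 1])] += 1
            for i in range(len(seq) - k + 1):
                ctx, sym = seq[i:i + k - 1], seq[i + k - 1]
                kgram[(c, ctx, sym)] += 1
                context[(c, ctx)] += 1
        return kgram, context, initial, class_count

    def log_joint(seq, c, counts, k, alphabet_size):
        """log P(s, c) = log P(c) + log P(s_1..s_{k-1} | c) + sum_i log P(s_{i+k-1} | s_i..s_{i+k-2}, c)."""
        kgram, context, initial, class_count = counts
        n = sum(class_count.values())
        lp = log(class_count[c] / n)
        lp += log((initial[(c, seq[:k - 1])] + 1) / (class_count[c] + 1))  # crude smoothing of the initial block
        for i in range(len(seq) - k + 1):
            ctx, sym = seq[i:i + k - 1], seq[i + k - 1]
            lp += log((kgram[(c, ctx, sym)] + 1) / (context[(c, ctx)] + alphabet_size))
        return lp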

Generative vs. Discriminative Models
Generative:
– parameters are chosen to maximize the full (joint) likelihood of the features and the class
– less likely to overfit
Discriminative:
– solve the classification problem directly: model the class given the data (least squared error, maximum margin between classes, most likely class given the data, etc.)
– more likely to overfit
[Figures: graphical models of the generative and discriminative variants]

How to Turn a Generative Trainer into a Discriminative One
A generative model gives the joint probability P(s, c)
Find a function that models the class given the data: P(c | s) = P(s, c) / Σ_{c'} P(s, c')
There is no closed-form solution that maximizes the class-conditional probability
– use an optimization technique to fit the parameters
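In code this conversion is just a normalization of the joint scores over the classes, done in log space for numerical stability. A small sketch, assuming a log-joint function like the one above (hypothetical names):

    from math import exp

    def class_conditional(seq, classes, log_joint_fn):
        """P(c | s) = P(s, c) / sum_c' P(s, c'), computed from per-class log joint scores."""
        scores = {c: log_joint_fn(seq, c) for c in classes}
        m = max(scores.values())                     # subtract the max before exponentiating
        exps = {c: exp(v - m) for c, v in scores.items()}
        z = sum(exps.values())
        return {c: exps[c] / z for c in exps}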

Examples
Naïve Bayes ↔ Logistic Regression [Ng & Jordan, 2002]
– with sufficient data, discriminative models outperform generative ones
Bayesian Network ↔ Class-conditional Bayesian Network [Grossman & Domingos, 2004]
– set the parameters to maximize the full likelihood (closed-form solution); use the class-conditional likelihood to guide the structure search
Markov Random Field ↔ Conditional Random Field [Lafferty et al., 2001]

Discriminative Markov Model
1. Initialize the parameters with the full-likelihood maximizers
2. Use gradient ascent to choose parameters that maximize log P(c | S):
P(k-gram)_{t+1} = P(k-gram)_t + α ∂CLL / ∂P(k-gram)
3. Reparameterize the P's in terms of weights, because
– probabilities need to lie in the [0, 1] interval
– probabilities need to sum to 1
4. To classify, use the weights to compute the most likely class

Reparameterization
[Figure: a (k−1)-gram window s_1 s_2 … s_{k−1} and a k-gram window s_i s_{i+1} … s_{i+k−1}]
Write each probability as a normalized exponential of a weight:
e^w / Z_w = P(s_{i+k−1} | s_i … s_{i+k−2}, c)
e^u / Z_u = P(s_1 s_2 … s_{k−1}, c)
where the Z's are normalizers.
1. Initialize the weights from the joint-likelihood (generative) estimates
2. Use gradient updates on the w's and u's instead of on the probabilities:
w_{t+1} = w_t + ∂CLL / ∂w
u_{t+1} = u_t + ∂CLL / ∂u
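A minimal sketch of the reparameterization, with hypothetical names: the weights are initialized to the logs of the generative estimates, and probabilities are recovered with a softmax-style normalization e^w / Z_w, which keeps them in [0, 1] and summing to 1 by construction.

    from math import log, exp

    def init_weights_from_probs(prob_table):
        """prob_table: {(class, context): {symbol: P(symbol | context, class)}}.
        Setting w = log P makes e^w / Z_w reproduce the generative estimates exactly."""
        return {key: {sym: log(p) for sym, p in dist.items()} for key, dist in prob_table.items()}

    def probs_from_weights(weights, key):
        """Recover P(symbol | context, class) = e^w / Z_w after any gradient update."""
        w = weights[key]
        m = max(w.values())
        exps = {sym: exp(v - m) for sym, v in w.items()}
        z = sum(exps.values())
        return {sym: exps[sym] / z for sym in exps}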

Parameter Updates
On-line, per-sequence updates
The final update for each weight follows the gradient of the conditional log-likelihood (CLL)
The CLL is maximized when:
– the weights stay close to the probability estimates
– the probability of the true class given the sequence is close to 1
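For orientation, under a standard softmax parameterization the gradient of the conditional log-likelihood with respect to a weight has the familiar observed-minus-expected form. The sketch below is that generic form (it ignores the contribution of the local normalizers Z_w), not necessarily the exact update derived in the paper:

    \mathrm{CLL} = \sum_{(s,\,c) \in D} \log P(c \mid s),
    \qquad
    \frac{\partial}{\partial w_{g,c'}} \log P(c \mid s)
      \approx \bigl(\mathbb{1}[c' = c] - P(c' \mid s)\bigr)\, n_g(s),

where n_g(s) is the number of occurrences of the k-gram g in the sequence s and c is its true class.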

Algorithm
Training:
1. Initialize the parameters with the estimates from the generative model
2. Until a termination condition is met:
– for each sequence s in the data, update the parameters w and u with the gradient updates ∂CLL/∂w and ∂CLL/∂u
Classification:
– given a new sequence S, use the weights to compute c = argmax_{c_j} P(c_j | S)
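Putting the pieces together, here is a compact sketch of the training loop and of classification. It uses the simplified log-linear gradient from the previous sketch (class score = sum of k-gram weights, softmax over classes), so it approximates rather than reproduces the paper's exact updates; all names are hypothetical, and the weights are initialized to zero here instead of to the generative estimates.

    from collections import defaultdict
    from math import exp

    def kgram_counts(seq, k):
        counts = defaultdict(int)
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
        return counts

    def train_discriminative(data, classes, k, alpha=0.01, iterations=30):
        """On-line gradient ascent on the conditional log-likelihood log P(c | s).
        w[c][g] is the weight of k-gram g for class c."""
        w = {c: defaultdict(float) for c in classes}
        for _ in range(iterations):
            for seq, true_c in data:
                counts = kgram_counts(seq, k)
                scores = {c: sum(w[c][g] * n for g, n in counts.items()) for c in classes}
                m = max(scores.values())
                exps = {c: exp(scores[c] - m) for c in classes}
                z = sum(exps.values())
                p = {c: exps[c] / z for c in classes}       # P(c | s) under the current weights
                for c in classes:                           # observed-minus-expected update
                    grad = (1.0 if c == true_c else 0.0) - p[c]
                    for g, n in counts.items():
                        w[c][g] += alpha * grad * n
        return w

    def classify(seq, w, classes, k):
        counts = kgram_counts(seq, k)
        return max(classes, key=lambda c: sum(w[c][g] * n for g, n in counts.items()))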

Data
Protein function data: families of human kinases; 290 examples, 4 classes [Andorf et al., 2004]
Subcellular localization [Hua & Sun, 2001]:
– prokaryotic: 997 examples, 3 classes
– eukaryotic: 2427 examples, 4 classes
Reuters text categorization data: the 10 classes with the largest number of examples [Lewis, 1997]

Experiments
Overfitting check:
– 90% of the data for training, 10% for validation
– record the accuracy on the training and validation data and the value of the negative CLL at each iteration
Performance comparison:
– compare against an SVM that uses k-grams as feature inputs (equivalent to a string kernel) and against the generative Markov model
– 10-fold cross-validation
– collective classification for the kinase data
– one-against-all for the localization and text data
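A small sketch of the one-against-all protocol with 10-fold cross-validation, written with plain-Python folds; the train_fn and classify_fn callbacks and the contiguous fold split are assumptions for illustration:

    def ten_fold_indices(n, folds=10):
        """Split range(n) into (up to) `folds` contiguous test folds."""
        size = (n + folds - 1) // folds
        return [list(range(i, min(i + size, n))) for i in range(0, n, size)]

    def one_against_all_cv(data, target_class, train_fn, classify_fn, folds=10):
        """Relabel the data as target vs. rest, run k-fold cross-validation, return mean accuracy."""
        labeled = [(seq, c == target_class) for seq, c in data]
        accuracies = []
        for test_idx in ten_fold_indices(len(labeled), folds):
            held_out = set(test_idx)
            train = [ex for i, ex in enumerate(labeled) if i not in held_out]
            test = [labeled[i] for i in test_idx]
            model = train_fn(train)
            correct = sum(classify_fn(seq, model) == label for seq, label in test)
            accuracies.append(correct / len(test))
        return sum(accuracies) / len(accuracies)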

CLL and Accuracy on Training vs. Validation Sets
[Plot: training and validation accuracy and negative CLL per iteration, on the subcellular localization data for the "nuclear" class]

Results on Overfitting
Overfitting occurs in most cases
Accuracy on unseen data increases and then drops, while the accuracy on the training data and the CLL continue to increase
Accuracy on the validation data peaks early (after 5–10 iterations), long before the CLL converges
Use early termination as a form of regularization

Experiments
Pick the parameters that yield the highest accuracy on the validation data within the first 30 iterations, or after convergence (whichever happens first)
[Figure: the training data in one cross-validation pass is split into 90% train and 10% validation, with the held-out fold used as the test set]
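A sketch of that selection rule, with hypothetical names: hold out 10% of the training fold for validation, run up to 30 passes, and keep the model from the pass with the highest validation accuracy.

    import copy

    def train_with_early_stopping(data, init_model, train_pass_fn, classify_fn, max_iter=30):
        """Split the training data 90/10, run up to max_iter passes of gradient updates,
        and keep the model from the pass with the best validation accuracy
        (early termination as regularization)."""
        split = int(0.9 * len(data))
        train, val = data[:split], data[split:]
        model, best_model, best_acc = init_model, copy.deepcopy(init_model), 0.0
        for _ in range(max_iter):
            model = train_pass_fn(model, train)            # one pass of per-sequence updates
            acc = sum(classify_fn(seq, model) == c for seq, c in val) / len(val)
            if acc > best_acc:
                best_acc, best_model = acc, copy.deepcopy(model)
        return best_model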

Results
Collective classification for the protein function data; one-against-all for localization and Reuters
Evaluated with several performance measures: accuracy, specificity, sensitivity, and correlation coefficient
[Table: results on the kinase (protein function prediction) data]

Results
[Tables: results on the prokaryotic and eukaryotic subcellular localization data]

Results
[Table: results on the Reuters data]

Results – Performance
Kinase:
– 2% improvement over the generative Markov model
– the SVM outperforms the discriminative model by 1%
Prokaryotic:
– small improvement over the generative Markov model and the SVM on the extracellular class; similar performance to the SVM on the other classes
Eukaryotic:
– 4%, 2.5%, and 7% improvements in accuracy over the generative Markov model on the cytoplasmic, extracellular, and mitochondrial classes
– comparable to the SVM

Results – Performance
Reuters:
– the generative and discriminative approaches have very similar accuracy
– the discriminative model shows higher sensitivity, the generative model higher specificity
Performance is close to that of the SVM, without the SVM's computational cost

Results – Time and Space
The generative Markov model needs one pass through the training data
The SVM needs several passes through the data:
– requires a kernel computation
– computing the full kernel matrix may not be feasible for k > 3
– computing the kernel on demand can significantly slow down each iteration
The discriminative Markov model needs a few passes through the training data
– O(sequence length × alphabet size) per sequence

Conclusion
Parameters are initialized in one pass through the data
Training requires only a few passes through the data
Significantly outperforms the generative Markov model on large datasets
Accuracy is comparable to an SVM with a string kernel
– significantly faster to train than the SVM
– practical for larger datasets
Combines the strengths of generative and discriminative training

Future Work
Development of more sophisticated regularization techniques
Extension of the algorithm to higher-dimensional topological data (2D/3D)
Application to other tasks in molecular biology and related fields where more data is available

Thank you!