Slide 1: Learning with Ambiguously Labeled Training Data
Kshitij Judah, Ph.D. student
Advisor: Prof. Alan Fern
Qualifier Oral Presentation, EECS, Oregon State University

Slide 2: Outline
- Introduction and Motivation
- Problem Definition
- Learning with Partial Labels
- Semi-Supervised Learning
- Multiple-Instance Learning
- Conclusion

Slide 3: Introduction and Motivation
Supervised learning setup (diagram from CS534 Machine Learning slides): training data D is given to a learning algorithm, which uses a loss function to produce a hypothesis f; f is then applied to a new test point.

Slide 4: Introduction and Motivation
Obtaining labeled training data is difficult. Reasons:
- It requires substantial human effort (Information Retrieval)
- It requires expensive tests (Medical Diagnosis, Remote Sensing)
- Experts disagree about the labels (Information Retrieval, Remote Sensing)

Slide 5: Introduction and Motivation
More reasons:
- Specifying a single correct class is not possible, but pointing out incorrect classes is (Medical Diagnosis)
- Labeling is not possible at the instance level (Drug Activity Prediction)
Objective: present the space of learning problems and algorithms that deal with ambiguously labeled training data.

Slide 6: Space of Learning Problems
(Taxonomy diagram.) Learning problems are organized by the kind of supervision: direct supervision (labels, either deterministic or probabilistic) versus indirect supervision (no labels). The branches shown include supervised learning, semi-supervised learning, learning with partial labels, multiple-instance learning, reinforcement learning, session-based learning, and unsupervised learning.

Slide 7: Problem Definition
Let X = X_1 × ... × X_n denote an instance space, X_i being the domain of the i-th feature. An instance x is a vector (x_1, ..., x_n). Y denotes a discrete-valued class variable whose domain {y_1, ..., y_m} is the set of all classes. Training data D is a set of examples (x, L), where L is a vector of length m whose k-th entry is 1 if y_k is a possible class of x and 0 otherwise. In the unambiguously labeled case, exactly one entry of L is 1.

Slide 8: Problem Definition
In the case of ambiguous labeling, more than one entry of L can be 1. The learning task is to use D to select an optimal hypothesis h from a hypothesis space H; h is then used to classify a new instance. Optimality refers to expected classification accuracy.
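To make the representation concrete, here is a minimal sketch (not from the talk) of candidate-label vectors L as rows of a 0/1 matrix; the class names and the toy rows are illustrative assumptions.

```python
import numpy as np

# m = 3 classes; each row is the label vector L for one training instance.
classes = ["healthy", "flu", "pneumonia"]   # hypothetical class names
L = np.array([
    [0, 1, 0],   # unambiguous: exactly one entry is 1, the true label is "flu"
    [0, 1, 1],   # ambiguous: the true label is either "flu" or "pneumonia"
    [1, 1, 1],   # fully unlabeled: every class is a candidate
])

for row in L:
    print([c for c, bit in zip(classes, row) if bit == 1])
```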

Slide 9: Learning with Partial Labels

Slide 10: Problem Definition
Each training instance x has a set of candidate class labels S associated with it, and S is assumed to contain the true label of x. This is different from multiple-label learning, because here each instance belongs to exactly one class. Applications: medical diagnosis, bioinformatics, information retrieval.

Slide 11: Taxonomy of Algorithms for Learning with Partial Labels
- Probabilistic generative approaches: mixture models + EM
- Probabilistic discriminative approaches: logistic regression for partial labels

Slide 12: Probabilistic Generative Approaches
Learn the joint probability distribution p(x, y), which can be used either to model the data or to perform classification; it is learned by estimating p(y) and p(x | y). Classification of a new instance x is performed using Bayes rule, p(y | x) ∝ p(x | y) p(y), and assigning x to the class that maximizes p(y | x).

Slide 13: Probabilistic Generative Approaches
Some examples of generative models:
- Gaussian distribution
- Mixture models, e.g. the Gaussian mixture model
- Directed graphical models, e.g. Naïve Bayes, HMMs
- Undirected graphical models, e.g. Markov networks
Applications: robotics, speech recognition, computer vision, forecasting, time series prediction, etc.

Slide 14: Mixture Models
In a mixture model, the probability density function of x is a weighted sum of component densities, p(x | θ) = Σ_{k=1}^{K} α_k p_k(x | θ_k), where the mixing weights α_k sum to 1. Model parameters are often learned using maximum likelihood estimation (MLE).

Slide 15: Mixture Models
In the MLE approach, assuming the instances in D are independently and identically distributed (i.i.d.), the joint density of D is p(D | θ) = Π_i p(x_i | θ). The goal is to find the parameters θ that maximize the log of this quantity, log p(D | θ) = Σ_i log Σ_k α_k p_k(x_i | θ_k).

Slide 16: Expectation-Maximization Algorithm (EM)
This log-likelihood is difficult to optimize directly because it contains a log of sums. It is an incomplete-data likelihood: the component that generated each instance is unknown. It is converted into a complete-data likelihood by introducing a random variable z_i that tells, for each x_i, which component generated it. If we observed z, the complete-data log-likelihood would be Σ_i log( α_{z_i} p_{z_i}(x_i | θ_{z_i}) ), which is easy to maximize.

Slide 17: Expectation-Maximization Algorithm (EM)
We do not, however, observe z directly; it is a hidden variable. EM is used to handle hidden variables. EM is an iterative algorithm with two basic steps.
E-step: given the current estimate of the model parameters θ^(t) and the data D, compute the expected complete-data log-likelihood Q(θ, θ^(t)) = E_z[ log p(D, z | θ) | D, θ^(t) ], where the expectation is taken with respect to the marginal distribution of each z_i.

Slide 18: Expectation-Maximization Algorithm (EM)
M-step: maximize the expectation computed in the E-step, θ^(t+1) = argmax_θ Q(θ, θ^(t)). The two steps are repeated until convergence.
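The E/M recursion above can be made concrete with a toy example. Below is a minimal EM sketch for a one-dimensional mixture of two Gaussians (my own illustration, not code from the talk); the data, initialization, and iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])  # toy data

K = 2
alpha = np.full(K, 1.0 / K)          # mixing weights
mu = np.array([-1.0, 1.0])           # component means
var = np.ones(K)                     # component variances

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility of each component k for each point x_i.
    dens = np.stack([alpha[k] * gauss(x, mu[k], var[k]) for k in range(K)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters by maximizing the expected
    # complete-data log-likelihood.
    Nk = resp.sum(axis=0)
    alpha = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(alpha, mu, var)   # should recover weights near 0.5/0.5 and means near -2 and 3
```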

Slide 19: Mixture Models + EM for Partial Labels
For each instance x_i, let C_i be the set of possible mixture components that could have generated it (its candidate set). EM is modified so that z_i is restricted to take values from C_i. Modified E-step: the responsibility of component k for x_i is computed as usual if k ∈ C_i and is set to 0 otherwise, with the remaining responsibilities renormalized.
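A sketch of what the modified E-step could look like in code, under the assumption of one mixture component per class; the array names and toy numbers are mine, not the paper's.

```python
import numpy as np

def restricted_e_step(dens, candidate_mask):
    """E-step for partially labeled data (sketch).

    dens[i, k]           : alpha_k * p_k(x_i | theta_k), as in the standard E-step
    candidate_mask[i, k] : 1 if component/class k is in the candidate set of x_i
    Responsibilities outside the candidate set are forced to 0 and the rest
    are renormalized.
    """
    masked = dens * candidate_mask
    return masked / masked.sum(axis=1, keepdims=True)

dens = np.array([[0.2, 0.5, 0.3],
                 [0.1, 0.1, 0.8]])
mask = np.array([[1, 1, 0],      # candidate set {class 0, class 1}
                 [1, 1, 1]])     # fully unlabeled instance
print(restricted_e_step(dens, mask))
```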

Slide 20: Pros and Cons of Generative Models
Pros:
- Clear probabilistic framework
- Effective when the model assumptions are correct
Cons:
- Complex models with many parameters
- Difficult to specify a model and verify its correctness
- Unlabeled data can hurt performance if the model is incorrect
- Parameter estimation is difficult: EM gets stuck in local maxima and is computationally expensive
- No guarantee of better classification

Slide 21: Discriminative Approaches
Discriminative approaches attempt to directly learn the mapping required for classification. They are simpler than generative models and more focused on the classification or regression task. Two types:
- Probabilistic: learn the posterior class probability p(y | x)
- Non-probabilistic: learn a classifier directly
Applications: text classification, information retrieval, machine perception, time series prediction, bioinformatics, etc.

Slide 22: Logistic Regression (from CS534 Machine Learning slides)
Learns the conditional class distribution p(y | x). For the two-class case y ∈ {0, 1}, the probabilities are p(y = 1 | x) = 1 / (1 + exp(-(w·x + b))) and p(y = 0 | x) = 1 - p(y = 1 | x). Under the 0-1 loss function, LR is a linear classifier. For the multi-class case, we learn weight parameters for each class.
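For concreteness, here is a minimal two-class logistic regression trained by gradient ascent on the average log-likelihood; this is a generic sketch of the model described above, not the CS534 code, and the learning rate and toy data are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Two-class logistic regression: p(y = 1 | x) = sigmoid(w.x + b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + b)
        w += lr * X.T @ (y - p) / len(y)   # gradient of the average log-likelihood
        b += lr * np.mean(y - p)
    return w, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = fit_logistic(X, y)
print(np.mean((sigmoid(X @ w + b) > 0.5) == y))   # training accuracy on toy data
```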

Slide 23: Maximum Entropy Logistic Regression
Learns p(y | x) so that it has maximum entropy while being consistent with the partial labels in D: the predicted distribution is close to uniform over the classes in the candidate set and zero everywhere else. The presence of unlabeled data therefore results in a model with low discrimination capacity.
Disadvantages: does not use unlabeled data to enhance learning; instead, it uses unlabeled data in an unfavorable way that decreases the discrimination capacity of the model.

Slide 24: Minimum Commitment Logistic Regression
Learns p(y | x) so that the classes in the candidate set are predicted with high total probability; no preference is made among distributions that satisfy this property. The presence of unlabeled data has no effect on the discrimination capacity of the model.
Disadvantage: does not use unlabeled data to enhance learning.
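One way to write down the "high probability on the candidate set" criterion is to penalize the log of the probability mass the model places on each candidate set; this is my own illustrative formulation consistent with the slide, not necessarily the exact objective of the cited method.

```python
import numpy as np

def candidate_set_nll(probs, candidate_mask, eps=1e-12):
    """Partial-label loss sketch: -log of the probability mass on the candidate
    set of each instance.  Minimizing it pushes the predicted distribution onto
    the candidate classes without preferring any distribution inside the set.

    probs[i, k]          : model's p(y = k | x_i); rows sum to 1
    candidate_mask[i, k] : 1 if class k is in the candidate set of x_i
    """
    mass = (probs * candidate_mask).sum(axis=1)
    return -np.log(mass + eps).mean()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.3, 0.4]])
mask = np.array([[1, 1, 0],
                 [0, 0, 1]])
print(candidate_set_nll(probs, mask))
```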

Slide 25: Self-Consistent Logistic Regression
Learns p(y | x) so that the entropy of the predicted distribution is minimized; learning is driven by the predictions produced by the learner itself.
Advantage: can make use of unlabeled data to enhance learning.
Disadvantage: if the initial estimate is incorrect, the algorithm will go down the wrong path.

Slide 26: Semi-Supervised Learning

Slide 27: Problem Definition
Semi-supervised learning (SSL) is a special case of learning with partial labels: the training data consists of a labeled set D_L and an unlabeled set D_U. For each instance in D_L, the candidate set contains only the true class label; for each instance in D_U, the candidate set is the full set of classes. Usually |D_U| >> |D_L|, because unlabeled data is easily available compared to labeled data (e.g. information retrieval). SSL approaches make use of D_U in conjunction with D_L to enhance learning. SSL has been applied to many domains, such as text classification, image classification, and audio classification.

Slide 28: Taxonomy of Algorithms for Semi-Supervised Learning
- Probabilistic generative approaches: mixture models + EM
- Discriminative approaches: semi-supervised support vector machines (S3VMs), self-training, co-training

Slide 29: Mixture Models + EM for SSL
Nigam et al. proposed a probabilistic generative framework for document classification. Assumptions: (1) documents are generated by a mixture model; (2) there is a one-to-one correspondence between mixture components and classes. The probability of a document d is p(d | θ) = Σ_k p(c_k | θ) p(d | c_k, θ), and p(d | c_k, θ) was modeled using Naïve Bayes.

Slide 30: Mixture Models + EM for SSL
θ is estimated by MLE, maximizing the log-likelihood of D = D_L ∪ D_U, where labeled documents contribute the probability of their known class and unlabeled documents contribute a sum over all classes. EM is used to maximize this likelihood.
Applications: text classification, remote sensing, face orientation discrimination, pose estimation, audio classification.
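A compact sketch, in the spirit of Nigam et al. but not their implementation, of semi-supervised EM with a multinomial Naïve Bayes document model; the function name, smoothing, and toy data are assumptions.

```python
import numpy as np

def ssl_nb_em(Xl, yl, Xu, n_classes, n_iter=20, alpha=1.0):
    """Semi-supervised EM with multinomial Naive Bayes (sketch).
    Xl, Xu: word-count matrices of labeled / unlabeled documents; yl: labels."""
    n_feat = Xl.shape[1]

    def m_step(X, resp):
        # resp[i, c] = P(class c | doc i); hard 0/1 for labeled documents.
        class_counts = resp.sum(axis=0)
        priors = (class_counts + alpha) / (class_counts.sum() + alpha * n_classes)
        word_counts = resp.T @ X                      # expected word counts per class
        cond = (word_counts + alpha) / (
            word_counts.sum(axis=1, keepdims=True) + alpha * n_feat)
        return np.log(priors), np.log(cond)

    def e_step(X, log_prior, log_cond):
        log_post = X @ log_cond.T + log_prior         # unnormalized log P(c | doc)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    resp_l = np.eye(n_classes)[yl]                    # labeled responsibilities stay fixed
    log_prior, log_cond = m_step(Xl, resp_l)          # initialize from labeled data only
    for _ in range(n_iter):
        resp_u = e_step(Xu, log_prior, log_cond)      # E-step on unlabeled documents
        log_prior, log_cond = m_step(np.vstack([Xl, Xu]),
                                     np.vstack([resp_l, resp_u]))
    return log_prior, log_cond

# Tiny synthetic run (two classes, three vocabulary words).
rng = np.random.default_rng(0)
Xl = rng.poisson(lam=[[3, 1, 0], [0, 1, 3]], size=(2, 3))
yl = np.array([0, 1])
Xu = rng.poisson(lam=2, size=(10, 3))
log_prior, log_cond = ssl_nb_em(Xl, yl, Xu, n_classes=2)
print(np.exp(log_prior))
```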

Slide 31: Mixture Models + EM for SSL
Problem: unlabeled data decreased performance. Reason: an incorrect modeling assumption, namely one mixture component per class when a class actually includes multiple sub-topics. Solutions: decrease the effect of unlabeled data, or correct the modeling assumption by allowing multiple mixture components per class.

Slide 32: SVMs
(Figure: a maximum-margin separating hyperplane between class -1 and class +1, with the margin marked.)

Slide 33: SVMs
Idea: do the following:
A. Maximize the margin
B. Classify points correctly
C. Keep data points outside the margin
These requirements are combined in the SVM optimization problem (standard soft-margin form): minimize ||w||^2 / 2 + C Σ_i max(0, 1 - y_i (w·x_i + b)) over w and b.

Slide 34: SVMs
A is achieved by minimizing ||w||^2; B and C are achieved using the hinge loss max(0, 1 - y f(x)). The constant C controls the trade-off between training error and generalization error. (From the SSL tutorial by Xiaojin Zhu at ICML 2007.)

Slide 35: S3VMs
Also known as transductive SVMs. (From the SSL tutorial by Xiaojin Zhu at ICML 2007.)

Slide 36: S3VMs
Idea: do the following:
A. Maximize the margin
B. Correctly classify the labeled data
C. Keep data points (labeled and unlabeled) outside the margin
These requirements are combined in the S3VM optimization problem: minimize ||w||^2 / 2 + C_1 Σ_{labeled} max(0, 1 - y_i f(x_i)) + C_2 Σ_{unlabeled} max(0, 1 - |f(x_j)|) over w and b.

Slide 37: S3VMs
A is achieved by minimizing ||w||^2; B is achieved using the hinge loss on the labeled data; C is achieved using the hinge loss for labeled data and the hat loss max(0, 1 - |f(x)|) for unlabeled data. (From the SSL tutorial by Xiaojin Zhu at ICML 2007.)
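To make the loss terms concrete, here is a sketch that only evaluates the S3VM objective for a given linear classifier (it does not solve the hard optimization problem from the previous slide); the weights, trade-off constants, and toy points are assumptions.

```python
import numpy as np

def s3vm_objective(w, b, X_l, y_l, X_u, C_l=1.0, C_u=0.5):
    """S3VM objective (sketch):
    ||w||^2 / 2                  -> large margin
    hinge loss on labeled data   -> max(0, 1 - y * f(x)), y in {-1, +1}
    hat loss on unlabeled data   -> max(0, 1 - |f(x)|), pushes points outside the margin
    """
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    hinge = np.maximum(0.0, 1.0 - y_l * f_l).sum()
    hat = np.maximum(0.0, 1.0 - np.abs(f_u)).sum()
    return 0.5 * w @ w + C_l * hinge + C_u * hat

w, b = np.array([1.0, -1.0]), 0.0
X_l = np.array([[2.0, 0.0], [0.0, 2.0]]); y_l = np.array([+1, -1])
X_u = np.array([[0.2, 0.1], [3.0, -3.0]])   # first point lies inside the margin
print(s3vm_objective(w, b, X_l, y_l, X_u))
```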

Slide 38: S3VMs
Pros:
- Same mathematical foundations as SVMs, and hence applicable to tasks where SVMs are applicable
Cons:
- The optimization is difficult; finding an exact solution is NP-hard
Applications: text classification, bioinformatics, biological entity recognition, image retrieval, etc.

Slide 39: Self-Training
A learner uses its own predictions to teach itself. Self-training algorithm:
1. Train a classifier h on the labeled data D_L
2. Use h to predict labels for the unlabeled data D_U
3. Add the most confidently labeled instances from D_U to D_L
4. Repeat until a stopping criterion is met
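A generic sketch of the loop above for any classifier exposing a scikit-learn-style fit / predict_proba / classes_ interface; the batch size, round count, and stopping rule are assumptions.

```python
import numpy as np

def self_train(clf, X_l, y_l, X_u, n_per_round=10, n_rounds=20):
    """Self-training loop (sketch): repeatedly train, label unlabeled data,
    and move the most confident predictions into the labeled set."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(n_rounds):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)
        pred = clf.classes_[proba.argmax(axis=1)]
        top = np.argsort(-conf)[:n_per_round]        # most confident unlabeled instances
        X_l = np.vstack([X_l, X_u[top]])
        y_l = np.concatenate([y_l, pred[top]])
        X_u = np.delete(X_u, top, axis=0)
    clf.fit(X_l, y_l)
    return clf

# Usage sketch (hypothetical): self_train(LogisticRegression(), X_l, y_l, X_u)
```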

Slide 40: Self-Training
Pros:
- One of the simplest SSL algorithms
- Can be used with any classifier
Cons:
- Errors in the initial predictions can build up and affect subsequent learning. A possible solution: identify and remove mislabeled examples from the self-labeled training set during each iteration, e.g. the SETRED algorithm [Ming Li and Zhi-Hua Zhou].
Applications: object detection in images, word sense disambiguation, semantic role labeling, named entity recognition.

Slide 41: Co-Training
Idea: two classifiers h1 and h2 collaborate with each other during learning. h1 and h2 use disjoint subsets F1 and F2 of the feature set, and each subset is assumed to be:
- Sufficient for the classification task at hand
- Conditionally independent of the other given the class

Slide 42: Co-Training
Example: web page classification. F1 can be the text present on the web page; F2 can be the anchor text on pages that link to the page. The two feature sets are sufficient for classification and conditionally independent given the class of the web page, because each is generated by a different person who knows the class.

Slide 43: Co-Training
Co-training algorithm:
1. Train h1 on D_L using feature set F1, and h2 on D_L using feature set F2
2. Use h1 to predict labels for the unlabeled data
3. Use h2 to predict labels for the unlabeled data
4. Add h1's most confidently labeled instances to the labeled training set
5. Add h2's most confidently labeled instances to the labeled training set
6. Repeat until a stopping criterion is met
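A rough sketch of the loop above with a shared labeled pool, as in Blum and Mitchell's basic scheme; the confidence-based selection, pool handling, and round counts are simplifying assumptions of mine. X1_* and X2_* hold the two feature views of the same instances, and clf1/clf2 are scikit-learn-style classifiers.

```python
import numpy as np

def co_train(clf1, clf2, X1_l, X2_l, y_l, X1_u, X2_u, n_per_round=5, n_rounds=10):
    """Co-training loop (sketch): each classifier labels the unlabeled examples
    it is most confident about, and those examples (with both views) join the
    shared labeled pool."""
    y_l = y_l.copy()
    for _ in range(n_rounds):
        if len(X1_u) == 0:
            break
        clf1.fit(X1_l, y_l)
        clf2.fit(X2_l, y_l)
        p1, p2 = clf1.predict_proba(X1_u), clf2.predict_proba(X2_u)
        picks = np.unique(np.concatenate([
            np.argsort(-p1.max(axis=1))[:n_per_round],
            np.argsort(-p2.max(axis=1))[:n_per_round],
        ]))
        # Each picked example gets the label of the more confident classifier.
        use1 = p1[picks].max(axis=1) >= p2[picks].max(axis=1)
        labels = np.where(use1,
                          clf1.classes_[p1[picks].argmax(axis=1)],
                          clf2.classes_[p2[picks].argmax(axis=1)])
        X1_l = np.vstack([X1_l, X1_u[picks]])
        X2_l = np.vstack([X2_l, X2_u[picks]])
        y_l = np.concatenate([y_l, labels])
        X1_u = np.delete(X1_u, picks, axis=0)
        X2_u = np.delete(X2_u, picks, axis=0)
    clf1.fit(X1_l, y_l)
    clf2.fit(X2_l, y_l)
    return clf1, clf2
```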

Slide 44: Co-Training
Pros:
- Effective when the required feature split exists
- Can be used with any classifier
Cons:
- What if a feature split does not exist? Answers: (1) use a random feature split; (2) use the full feature set
- How should confidently labeled instances be selected? Answer: use more than two classifiers and take a majority vote, e.g. tri-training uses three classifiers [Zhou and Li] and democratic co-learning uses multiple classifiers [Zhou and Goldman]
Applications: web page classification, named entity recognition, semantic role labeling

Slide 45: Multiple-Instance Learning

Slide 46: Problem Definition
The training data is a set of objects. Each object has multiple variants called instances, and the set of instances of an object is called a bag. A bag is positive if at least one of its instances is positive, and negative otherwise. Goal: learn a classifier that labels new bags.
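A tiny sketch of the bag-labeling rule: a bag's label is the logical OR of its (hidden) instance labels, and the same rule turns an instance-level classifier into a bag classifier. The toy classifier and bag are hypothetical.

```python
import numpy as np

def bag_label(instance_labels):
    """MIL rule: a bag is positive iff at least one instance is positive."""
    return int(any(instance_labels))

def classify_bag(h, bag):
    """Apply the same rule to the predictions of an instance-level classifier h."""
    return int(any(h(x) for x in bag))

h = lambda x: int(x[0] > 0.5)                    # hypothetical instance classifier
bag = [np.array([0.1, 0.3]), np.array([0.9, 0.2])]
print(bag_label([0, 1]), classify_bag(h, bag))   # both 1: one positive instance suffices
```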

Slide 47: Why is MIL an Ambiguous-Label Problem?
To learn the bag classifier, we learn an instance-level classifier that maps an instance to a class, but the training data does not contain labels at the instance level.

Slide 48: Taxonomy of Algorithms for Multiple-Instance Learning
- Probabilistic generative approaches: Diverse Density (DD), EM Diverse Density (EM-DD)
- Probabilistic discriminative approaches: multiple-instance logistic regression
- Non-probabilistic discriminative approaches: Axis-Parallel Rectangles (APR), multiple-instance SVM (MISVM)

Slide 49: Diverse Density Algorithm (DD)
(Figure: instances plotted in a 2-D feature space with axes x1 and x2, with two candidate target points, A and B.)

Slide 50: Diverse Density Algorithm (DD)
Diverse density is a probabilistic quantity: the diverse density of a candidate point measures how many different positive bags have instances near it and how far the negative instances are from it. The point of maximum diverse density can be located using gradient ascent.
Cons: computationally expensive.
Applications: person identification, stock selection, natural scene classification.
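Below is a noisy-or style computation of diverse density at a candidate target point, in the spirit of Maron and Lozano-Pérez; the distance-based instance model and the scale parameter are assumptions, and locating the maximum would still require gradient ascent over t.

```python
import numpy as np

def diverse_density(t, positive_bags, negative_bags, scale=1.0):
    """Diverse density of a candidate target point t (noisy-or sketch).
    Each bag is an array whose rows are instances."""
    def pr_near_t(bag):
        # Probability that an instance "is" the target decays with distance to t.
        d2 = ((bag - t) ** 2).sum(axis=1)
        return np.exp(-scale * d2)

    dd = 1.0
    for bag in positive_bags:
        dd *= 1.0 - np.prod(1.0 - pr_near_t(bag))   # some instance should be near t
    for bag in negative_bags:
        dd *= np.prod(1.0 - pr_near_t(bag))         # no instance should be near t
    return dd

t = np.array([1.0, 1.0])
pos = [np.array([[1.1, 0.9], [5.0, 5.0]]), np.array([[0.9, 1.2]])]
neg = [np.array([[4.0, 4.0], [6.0, 1.0]])]
print(diverse_density(t, pos, neg))
```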

Slide 51: EM Diverse Density (EM-DD)
Views the knowledge of which instance is responsible for the bag label as a missing-data problem and uses EM to maximize diverse density. EM-DD algorithm:
E-step: given the current estimate of the target location, find the most likely instance in each bag responsible for the bag's label.
M-step: find the new target location that maximizes diverse density given those responsible instances.

Slide 52: EM Diverse Density (EM-DD)
Pros:
- Computationally less expensive than DD
- Avoids local maxima
- Shown to outperform DD and other MIL algorithms on the musk data and on artificial real-valued data

Slide 53: Multiple-Instance Logistic Regression
Learn a logistic regression classifier over instances, then use a softmax function to combine the instance output probabilities into bag probabilities. The MIL property is satisfied: a bag has a high probability of being positive if at least one of its instances has a high probability of being positive.
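A sketch of the softmax combination step: per-instance logistic probabilities are combined with a soft approximation of the max, so a single confident positive instance makes the bag probability high. The weight vector, bag, and sharpness parameter alpha are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_combine(p, alpha=5.0):
    """Soft approximation of max(p): alpha -> infinity gives the hard max,
    alpha = 0 gives the mean."""
    w = np.exp(alpha * p)
    return (p * w).sum() / w.sum()

def bag_probability(w, b, bag, alpha=5.0):
    """P(bag positive): combine per-instance logistic probabilities so that the
    bag probability is high whenever at least one instance probability is high."""
    p = sigmoid(bag @ w + b)
    return softmax_combine(p, alpha)

w, b = np.array([2.0, -1.0]), 0.0
bag = np.array([[0.1, 0.2], [3.0, 0.0]])   # the second instance looks positive
print(bag_probability(w, b, bag))
```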

Slide 54: Axis-Parallel Rectangles (APR)
Proposed by Dietterich et al. for drug activity prediction. Three algorithms:
- GFS elim-count APR
- GFS elim-kde APR
- Iterated discrimination APR

Slide 55: Axis-Parallel Rectangles (APR)
(Figure: an axis-parallel rectangle in a 2-D feature space with axes x1 and x2.)

Slide 56: Iterated Discrimination APR
An inside-out algorithm: start with a single positive instance and grow an APR to include additional positive instances. Three basic steps:
- Grow: construct the smallest APR that covers at least one instance from each positive bag (a rough sketch of this step is given below)
- Discriminate: select the most discriminating features using the APR
- Expand: expand the final APR to improve generalization
The algorithm works in two phases.
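A loose sketch of the "grow" idea only: keep the smallest axis-parallel rectangle (per-feature [min, max] box) that covers at least one instance from each positive bag, adding from each bag the instance that enlarges the box least. The seed choice and greedy criterion are simplifications of mine, not Dietterich et al.'s exact procedure.

```python
import numpy as np

def apr_of(points):
    """Smallest axis-parallel rectangle containing `points`: per-feature [min, max]."""
    pts = np.asarray(points)
    return pts.min(axis=0), pts.max(axis=0)

def apr_size(lo, hi):
    return (hi - lo).sum()

def grow_apr(positive_bags, seed):
    """Greedy grow step (sketch): from each positive bag, add the instance that
    enlarges the APR the least, so every positive bag is covered."""
    chosen = [seed]
    for bag in positive_bags:
        best = min(bag, key=lambda x: apr_size(*apr_of(chosen + [x])))
        chosen.append(best)
    return apr_of(chosen)

def inside(lo, hi, x):
    return bool(np.all(x >= lo) and np.all(x <= hi))

pos_bags = [np.array([[1.0, 1.0], [9.0, 9.0]]), np.array([[1.5, 0.8]])]
lo, hi = grow_apr(pos_bags, seed=np.array([1.2, 1.1]))
print(lo, hi, inside(lo, hi, np.array([1.3, 0.9])))
```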

Slide 57: Iterated Discrimination APR
(Figure: the grow step illustrated in a 2-D feature space with axes x1 and x2.)

Slide 58: Iterated Discrimination APR
Discriminate: greedily select the feature with the highest discrimination power; a feature's discrimination power depends on how many negative instances it places outside the APR and how far outside they are.
Expand: expand the final APR to improve generalization, using kernel density estimation.

Slide 59: Multiple-Instance SVM (MISVM)
Based on an idea similar to S3VMs. S3VMs place no constraint on how the unlabeled data is disambiguated; MISVM maintains the MIL constraint: all instances from a negative bag are negative, and at least one instance from each positive bag is positive. The goal is to find the maximum-margin classifier that satisfies the MIL constraint.

Slide 60: Multiple-Instance SVM (MISVM)
MISVM optimization problem (sketch): minimize ||w||^2 / 2 + C Σ_i ξ_i over w, b, ξ and the unknown instance labels y_i, subject to y_i (w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all instances, y_i = -1 for every instance in a negative bag, and y_i = +1 for at least one instance in each positive bag.

Slide 61: Multiple-Instance SVM (MISVM)
Two notions of margin: instance-level and bag-level. (Figure from [Andrews et al., NIPS 2002].)

Slide 62: Conclusion
- Motivated the problem of learning from ambiguously labeled data
- Studied the space of such learning problems
- Presented a taxonomy of proposed algorithms for each problem

Slide 63: Thank you