Multiple-Instance Learning
Paper 1: A Framework for Multiple-Instance Learning [Maron and Lozano-Perez, 1998]
Paper 2: EM-DD: An Improved Multiple-Instance Learning Technique [Zhang and Goldman, 2001]

Multiple-Instance Learning (MIL)
A variation on supervised learning:
- Supervised learning: every training instance is individually labeled.
- MIL: each training example is a set (or bag) of instances, with a single label equal to the maximum label among all instances in the bag.
Goal: learn to accurately predict the labels of previously unseen bags.
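As a toy illustration of the bag-labeling rule (a minimal sketch with made-up data, not taken from either paper): with Boolean labels, a bag is positive exactly when at least one of its instances is positive.

```python
# Toy illustration of the MIL bag-labeling rule (hypothetical data).
# Instance labels are hidden at training time; this only shows how the
# bag label is defined: with 0/1 labels it is the maximum over instances,
# i.e., the bag is positive iff at least one instance is positive.
instance_labels = [0, 0, 1]
bag_label = max(instance_labels)  # -> 1: the bag is positive
```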

MIL Setup
Training data: D = {<B_1, l_1>, ..., <B_m, l_m>}, i.e., m bags, where bag B_i has label l_i.
Boolean labels: positive bags B_i^+, negative bags B_i^-.
If bag B_i^+ = {B_i1^+, ..., B_ij^+, ..., B_in^+}, then B_ij^+ is the j-th instance in B_i^+, and B_ijk^+ is the value of the k-th feature of instance B_ij^+.
Real-valued labels: l_i = max(l_i1, l_i2, ..., l_in).
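A minimal sketch of how this training data could be represented in Python (the variable names are illustrative, not taken from the papers):

```python
import numpy as np

# Each bag is an (n_instances, n_features) array; each bag has one 0/1 label.
# bags[i][j] is instance B_ij and bags[i][j, k] is feature value B_ijk.
bags = [
    np.array([[0.9, 0.1], [0.5, 0.4]]),  # B_1: two instances, two features
    np.array([[0.2, 0.8]]),              # B_2: one instance
]
labels = np.array([1, 0])                # l_1 = 1 (positive), l_2 = 0 (negative)

positive_bags = [b for b, l in zip(bags, labels) if l == 1]
negative_bags = [b for b, l in zip(bags, labels) if l == 0]
```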

Diverse Density Algorithm [Maron and Lozano-Perez, 1998]
Main idea: find a point in feature space that has a high Diverse Density, i.e.,
- a high density of positive instances (“close” to at least one instance from each positive bag), and
- a low density of negative instances (“far” from every instance in every negative bag).
The higher a point's diverse density, the higher the probability that it is the target concept.

A Motivating Example for DD
The goal is to find an area with both a high density of positive points and a low density of negative points. The difficulty with using regular density, which simply adds up the contributions of the positive bags and subtracts those of the negative bags, is illustrated in panel (b) of the figure.

Diverse Density
Assume the target concept is a single point t, and let x be some point in feature space. Then
Pr(x = t | B_1^+, ..., B_n^+, B_1^-, ..., B_n^-)    (1)
is the probability that x is the target concept given the training examples. We can find t by maximizing this probability over all points x.

Probabilistic Measure of Diverse Density
Using Bayes' rule, maximizing (1) is equivalent to maximizing
Pr(B_1^+, ..., B_n^+, B_1^-, ..., B_n^- | x = t)    (2)
Further assuming that the bags are conditionally independent given t, the best hypothesis is
argmax_x  ∏_i Pr(B_i^+ | x = t) ∏_i Pr(B_i^- | x = t)    (3)

General Definition of DD
Again using Bayes' rule (and assuming a uniform prior over concept locations), (3) is equivalent to
argmax_x  ∏_i Pr(x = t | B_i^+) ∏_i Pr(x = t | B_i^-)    (4)
A point x has high Diverse Density if every positive bag has an instance close to x and no instance of any negative bag is close to x.
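For reference, steps (1) through (4) can be collected into a single chain; this only restates the slides' assumptions (a uniform prior over concept locations and conditional independence of the bags given t):

```latex
\begin{align*}
\arg\max_x \Pr(x = t \mid B_1^+,\dots,B_n^+,B_1^-,\dots,B_n^-)
  &= \arg\max_x \Pr(B_1^+,\dots,B_n^+,B_1^-,\dots,B_n^- \mid x = t)
     && \text{(Bayes' rule, uniform prior)} \\
  &= \arg\max_x \prod_i \Pr(B_i^+ \mid x = t)\,\prod_i \Pr(B_i^- \mid x = t)
     && \text{(conditional independence)} \\
  &= \arg\max_x \prod_i \Pr(x = t \mid B_i^+)\,\prod_i \Pr(x = t \mid B_i^-)
     && \text{(Bayes' rule per bag, uniform prior).}
\end{align*}
```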

Noisy-or Model
The causal probability of instance j in bag B_i:
Pr(x = t | B_ij) = exp(-||B_ij - x||^2)
A positive bag's contribution:
Pr(x = t | B_i^+) = Pr(x = t | B_i1^+, B_i2^+, ...) = 1 - ∏_j (1 - Pr(x = t | B_ij^+))
A negative bag's contribution:
Pr(x = t | B_i^-) = Pr(x = t | B_i1^-, B_i2^-, ...) = ∏_j (1 - Pr(x = t | B_ij^-))
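Putting the noisy-or model together with objective (4) gives a directly computable Diverse Density, sketched below with NumPy (plain Euclidean distance for now; the feature weighting on the next slide is a drop-in replacement for the squared distance; all names are illustrative):

```python
import numpy as np

def instance_prob(x, bag):
    """Pr(x = t | B_ij) = exp(-||B_ij - x||^2) for every instance in one bag."""
    return np.exp(-np.sum((bag - x) ** 2, axis=1))

def diverse_density(x, positive_bags, negative_bags):
    """Noisy-or Diverse Density of a candidate concept point x (1-D array)."""
    dd = 1.0
    for bag in positive_bags:
        # A positive bag contributes a lot only if some instance is near x.
        dd *= 1.0 - np.prod(1.0 - instance_prob(x, bag))
    for bag in negative_bags:
        # A negative bag contributes a lot only if all its instances are far from x.
        dd *= np.prod(1.0 - instance_prob(x, bag))
    return dd
```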

Feature Relevance
“Closeness” depends on the features. Problem: some features may be irrelevant, and some may matter more than others.
||B_ij - x||^2 = ∑_k w_k (B_ijk - x_k)^2
Solution: weight the features according to their relevance, and find the best weighting by choosing the weights that maximize Diverse Density.
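In the sketch above, this amounts to swapping the plain squared distance for a weighted one (illustrative; `w` holds one non-negative scale per feature):

```python
def weighted_instance_prob(x, bag, w):
    """exp(-sum_k w_k * (B_ijk - x_k)^2) for every instance in one bag."""
    return np.exp(-np.sum(w * (bag - x) ** 2, axis=1))
```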

Label Prediction
Predict the label of an unknown bag B_i under hypothesis t:
Label(B_i | t) = max_j { exp[ -∑_k (w_k (B_ijk - t_k))^2 ] }
where w_k is a scale factor indicating the importance of feature (dimension) k.
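A sketch of this prediction rule (names illustrative; `t` is the learned concept point and `w` the learned per-feature scales):

```python
def predict_label(bag, t, w):
    """Label(B_i | t) = max_j exp(-sum_k (w_k * (B_ijk - t_k))^2)."""
    sq_dists = np.sum((w * (bag - t)) ** 2, axis=1)
    return np.max(np.exp(-sq_dists))
```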

Finding the Maximum DD
Use gradient ascent with multiple starting points:
- The maximum DD peak is made up of contributions from some set of positive instances.
- Start an ascent from every positive instance; one of them, namely the instance that contributes most to the peak, is likely to be close enough to the maximum to allow a climb directly onto it.
- While this heuristic is sensible for maximizing with respect to location, maximizing with respect to the feature scalings may still get stuck in local maxima.
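One possible realization of this multi-start search, reusing the `diverse_density` sketch above (a sketch only: it uses SciPy's quasi-Newton optimizer on the negative log of DD rather than the papers' exact optimization routine):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_dd(x, positive_bags, negative_bags):
    # Maximizing DD is the same as minimizing its negative log.
    return -np.log(diverse_density(x, positive_bags, negative_bags) + 1e-300)

def maximize_dd(positive_bags, negative_bags):
    best_x, best_val = None, np.inf
    # Restart from every instance of every positive bag: one of these
    # starting points is expected to lie near the true DD maximum.
    for bag in positive_bags:
        for start in bag:
            res = minimize(neg_log_dd, start,
                           args=(positive_bags, negative_bags), method="L-BFGS-B")
            if res.fun < best_val:
                best_x, best_val = res.x, res.fun
    return best_x
```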

Experiments

Figure 3(a) shows the regular density surface for the data set in Figure 2, where it is clearly difficult to find the peak. Figure 3(b) plots the DD surface, where it is easy to pick out the global maximum, which is the desired concept.

Performance Evaluation
The table below lists the average accuracy over twenty runs, compared with the performance of the two principal algorithms reported in [Dietterich et al., 1997] (iterated-discrim APR and GFS elim-kde APR), as well as the MULTINST algorithm from [Auer, 1997].

EM-DD [Zhang and Goldman, 2001]
In the MIL setting, the label of a bag is determined by its "most positive" instance, i.e., the instance with the highest probability of being positive among all instances in that bag. The difficulty of MIL comes from the ambiguity of not knowing which instance that is. In [Zhang and Goldman, 2001], the knowledge of which instance determines the bag's label is modeled with a set of hidden variables, which are estimated using an Expectation-Maximization (EM) style approach. Combining this EM-style approach with the DD algorithm yields the EM-DD algorithm.

EM-DD Algorithm
An Expectation-Maximization algorithm [Dempster, Laird and Rubin, 1977]:
- Start with an initial guess h (which can be obtained with the original DD algorithm), set to some appropriate instance from a positive bag.
- E-step: use h to pick, from each bag, the one instance that is most likely (under the generative model) to be responsible for the bag's label.
- M-step: run the two-step gradient ascent (quasi-Newton) search of the standard DD algorithm to find a new hypothesis h' that maximizes DD(h').
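A compact sketch of this loop, reusing the helpers sketched earlier (illustrative only: the actual EM-DD iterates until the DD value stops improving and also re-optimizes the feature scales in the M-step, both of which are simplified away here):

```python
def em_dd(positive_bags, negative_bags, h, n_iters=20):
    """EM-DD sketch: h is an initial concept point, e.g., from the DD algorithm."""
    for _ in range(n_iters):
        # E-step: from each bag, keep only the instance most likely (under the
        # current hypothesis h) to be responsible for the bag's label.
        pos_sel = [bag[[np.argmax(instance_prob(h, bag))]] for bag in positive_bags]
        neg_sel = [bag[[np.argmax(instance_prob(h, bag))]] for bag in negative_bags]
        # M-step: run the single-instance-per-bag DD optimization starting from h.
        res = minimize(neg_log_dd, h, args=(pos_sel, neg_sel), method="L-BFGS-B")
        h = res.x
    return h
```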

Comparison of Performance

Thank you!