
Learning on the Test Data: Leveraging “Unseen” Features Ben Taskar, Ming Fai Wong, Daphne Koller

Introduction Most statistical learning models assume that data instances are IID samples from some fixed distribution. In many cases, the data are collected from different sources, at different times and locations, and under different circumstances. We usually build a statistical model of features under the assumption that future data will exhibit the same regularities as the training data. In many data sets, however, there are scope-limited features whose predictive power applies only to a certain subset of the data.

Examples 1. Classifying news articles chronologically: Suppose the task is to classify news articles that arrive over time. New events, people and places appear (and disappear) in bursts over time. The training data might consist of articles taken over some time period; these are only somewhat representative of future articles. The test data may contain features that were never observed in the training data. 2. Classifying customers into categories: Our training data might be collected from one geographical region, which may not represent the distribution in other regions.

In benchmark experiments we can sidestep this difficulty by mixing all the examples and selecting the training and test sets randomly. But this homogeneity cannot be ensured in real-world tasks, where only the non-representative training data is actually available. The test data may contain many features that were never, or only rarely, observed in the training data, yet these features can still be useful for classification. For example, in the news article task these local features might include the names of places or people currently in the news; in the customer task, they might include purchases of products that are specific to a region.

Scoped Learning Suppose we want to classify news articles collected over time. The phrase “XXX said today” might appear in many places in the data, for different values of “XXX”. Such features are called scope-limited features or local features. Another example: suppose there are two labels, grain and trade. Words like corn or wheat often appear in phrases like “tons of wheat”, so we can learn that a word appearing in the context “tons of xxx” is likely to be associated with the label grain. If we then find a phrase like “tons of rye” in the test data, we can infer that rye has some positive interaction with the label grain, even though rye may never have appeared in training. Scoped learning is a probabilistic framework that combines traditional IID (global) features with scope-limited (local) features.
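As an illustration only (the regular expression, function name and sample sentence below are hypothetical, not from the paper), extracting such “tons of xxx” context features from raw text might look like this minimal Python sketch:

```python
import re
from collections import Counter

def context_features(text, pattern=r"tons of (\w+)"):
    # Toy extraction of scope-limited features of the form "tons of <word>":
    # each matched word becomes a local feature whose association with a
    # label must be inferred within the current scope.
    return Counter(w.lower() for w in re.findall(pattern, text, flags=re.IGNORECASE))

print(context_features("Exports rose to 500 tons of wheat and 200 tons of rye."))
# Counter({'wheat': 1, 'rye': 1})
```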

The intuitive procedure for using the local features is to use the evidence from the global (IID) features to infer the roles that the local features play within a particular subset of the data. When the data exhibit scope, the authors report significant gains in performance over traditional models that use only IID features. All data instances within a particular scope exhibit some structural regularity, and we assume that new instances from the same scope will exhibit the same regularity.

General Framework: Notion of scope: We assume that data instances are sampled from some set of scopes, each of which is associated with some data distribution. Different distributions share a probabilistic model for some set of global features, but can contain a different probabilistic model for a scope-specific set of local features. These local features may be rarely or never seen in the scopes comprising the training data.

Let X denote the global features, Z the local features, and Y the class variable. For each global feature X_i there is a parameter γ_i; additionally, for each scope S and each local feature Z_i there is a parameter λ_i^S. The distribution of Y given all the features and weights is then the log-linear (logistic) model P(Y = y | x, z, γ, λ^S) ∝ exp( Σ_i γ_i x_i y + Σ_i λ_i^S z_i y ).
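As a minimal sketch of this conditional model (the names and numbers below are illustrative assumptions, not from the paper), for binary labels y ∈ {−1, +1} the probability can be computed as:

```python
import numpy as np

def scoped_logistic_prob(x, z, gamma, lam, y=1):
    # Score combines global contributions (gamma . x) with scope-specific
    # local contributions (lam . z); labels are assumed to be in {-1, +1}.
    score = np.dot(gamma, x) + np.dot(lam, z)
    return 1.0 / (1.0 + np.exp(-y * score))   # logistic form of P(Y = y | x, z)

gamma = np.array([0.8, -0.3])        # global weights, learned on training data
lam   = np.array([0.0, 0.5, 0.0])    # scope-specific local weights
x, z  = np.array([1.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(scoped_logistic_prob(x, z, gamma, lam, y=1))
```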

Probabilistic model: We assume that the global weights can be learned from the training data, so their values are fixed when we encounter a new scope; the local feature weights are unknown and are treated as hidden variables in the graphical model. Idea: The evidence from the global features about the labels of some instances is used to modify our beliefs about the roles of the local features present in those instances, so that they are consistent with the labels. By learning about the roles of these features, we can then propagate this information to improve accuracy on instances that are harder to classify using global features alone.

To implement this idea, we define a joint distribution over λ^S and y_1, ..., y_m. Why use Markov random fields? Here the associations between the variables are correlational rather than causal, and Markov random fields are commonly used to model such undirected interactions (e.g., spatial interactions or interacting features).

Markov Network Let V = (V_d, V_c) denote a set of random variables, where V_d are discrete and V_c are continuous variables, respectively. A Markov network over V defines a joint distribution over V, assigning a density over V_c for each possible assignment v_d to V_d. A Markov network M is an undirected graph whose nodes correspond to V. It is parameterized by a set of potential functions φ_1(C_1), ..., φ_l(C_l) such that each C_k ⊆ V is a fully connected subgraph, or clique, in M, i.e., each pair V_i, V_j ∈ C_k is connected by an edge in M. Here we assume that each φ(C) is a log-quadratic function. The Markov network then represents the distribution P(v) = (1/Z) Π_k φ_k(c_k), where Z is the normalizing partition function.

In our case the log-quadratic model consists of three types of potentials:
1) φ(γ_i, Y_j, X_i^j) = exp(γ_i Y_j X_i^j), which relates each global feature X_i^j of instance j to its weight γ_i and the class variable Y_j of that instance.
2) φ(λ_i, Y_j, Z_i^j) = exp(λ_i Y_j Z_i^j), which relates each local feature Z_i^j to its weight λ_i and the label Y_j.
3) Finally, as the local feature weights are assumed to be hidden, we introduce a prior over their values; consistent with the log-quadratic form, this is a Gaussian prior P(λ_i) ∝ exp(−(λ_i − μ_i)² / (2σ²)).
Overall, our model specifies a joint distribution proportional to the product of these priors and potentials: P(λ^S, y_1, ..., y_m | x, z) ∝ Π_i P(λ_i) · Π_j [ Π_i φ(γ_i, y_j, x_i^j) · Π_i φ(λ_i, y_j, z_i^j) ].
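Purely as an illustration of these definitions (a sketch under the Gaussian-prior assumption above; the helper names and toy numbers are hypothetical), the potentials and the unnormalized joint for a small scope could be written as:

```python
import numpy as np

def global_potential(gamma_i, y_j, x_ij):
    return np.exp(gamma_i * y_j * x_ij)          # phi(gamma_i, Y_j, X_i^j)

def local_potential(lam_i, y_j, z_ij):
    return np.exp(lam_i * y_j * z_ij)            # phi(lambda_i, Y_j, Z_i^j)

def weight_prior(lam_i, mu_i, sigma=1.0):
    return np.exp(-(lam_i - mu_i) ** 2 / (2 * sigma ** 2))  # Gaussian prior

def unnormalized_joint(ys, lams, X, Z, gamma, mus):
    # Product of the priors over local weights and all global/local potentials.
    p = np.prod([weight_prior(l, m) for l, m in zip(lams, mus)])
    for j, y in enumerate(ys):                   # instances
        for i, g in enumerate(gamma):            # global features
            p *= global_potential(g, y, X[j, i])
        for i, l in enumerate(lams):             # local features
            p *= local_potential(l, y, Z[j, i])
    return p

X = np.array([[1.0, 0.0], [0.0, 1.0]])            # 2 instances x 2 global features
Z = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])  # 2 instances x 3 local features
print(unnormalized_joint([+1, -1], [0.2, 0.1, -0.3], X, Z,
                         gamma=np.array([0.5, -0.4]), mus=[0.0, 0.0, 0.0]))
```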

Markov network for two instances, two global features and three local features

The graph can be simplified further when we account for variables whose values are fixed: the global feature weights are learned from the training data, so their values are fixed, and we also observe all the feature values. The resulting Markov network is shown below, assuming that instance (x_1, z_1, y_1) contains the local features Z_1 and Z_2, and instance (x_2, z_2, y_2) contains Z_2 and Z_3. (Figure: Markov network over the nodes Y_1, Y_2 and λ_1, λ_2, λ_3.)

This can be reduced further: when Z_i^j = 0 there is no interaction between Y_j and the weight λ_i, so we can simply omit the edge between λ_i and Y_j. The resulting Markov network is shown below. (Figure: pruned Markov network over Y_1, Y_2 and λ_1, λ_2, λ_3.)
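A small sketch of this pruning step (hypothetical helper, assuming binary feature indicators): for each instance we keep only the edges to local weights whose feature actually occurs in it.

```python
def build_label_weight_edges(Z):
    # For each instance j, list the local weights lambda_i it interacts with,
    # i.e. those with Z_i^j != 0; edges with Z_i^j = 0 contribute a constant
    # potential and can be dropped from the network.
    return {j: [i for i, z in enumerate(row) if z != 0] for j, row in enumerate(Z)}

# Instance 1 contains Z1, Z2 and instance 2 contains Z2, Z3 (0-indexed below).
Z = [[1, 1, 0],
     [0, 1, 1]]
print(build_label_weight_edges(Z))   # {0: [0, 1], 1: [1, 2]}
```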

In this model, we can see that the labels of all of the instances are correlated with the local feature weights of the features they contain, and thereby with each other. Thus, for example, if we obtain evidence (from global features) about the label Y_1, it would change our posterior beliefs about the local feature weight λ_2, which in turn would change our beliefs about the label Y_2. Thus, by running probabilistic inference over this graphical model, we obtain updated beliefs both about the local feature weights and about the instance labels.

Learning the Model: Learning global feature weights: We simply learn the global weights γ from the training data, using standard logistic regression; maximum-likelihood (ML) estimation finds the weights γ that maximize the conditional likelihood of the labels given the global features. Learning local feature distributions: We can exploit regularities among features by learning a model that predicts the prior of the local feature weights using meta-features (features of features). More precisely, we learn a model that predicts the prior mean μ_i for λ_i from some set of meta-features m_i. As our predictive model for the mean μ_i we use a linear regression model, setting μ_i = w · m_i.
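As a hedged sketch of this two-part learning procedure (using scikit-learn with random placeholder data; the variable names and the choice of regression targets are assumptions, not the paper's exact recipe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)

# Global feature weights gamma: standard logistic regression on training data.
X_train = rng.random((200, 50))                  # placeholder global features
y_train = rng.integers(0, 2, size=200)           # placeholder labels
gamma = LogisticRegression(max_iter=1000).fit(X_train, y_train).coef_.ravel()

# Prior means for local feature weights: linear regression on meta-features.
# M holds one row of meta-features per training feature (e.g. "appears in the
# context 'tons of ...'"); lam_hat holds the weights those features received
# when treated as ordinary features (one plausible choice of target).
M = rng.random((50, 8))
lam_hat = rng.standard_normal(50)
meta_model = LinearRegression().fit(M, lam_hat)  # learns w with mu_i = w . m_i

# At test time, a never-before-seen local feature with meta-features m_new
# gets the prior mean predicted from them.
m_new = rng.random(8)
mu_new = meta_model.predict(m_new.reshape(1, -1))[0]
print(gamma.shape, mu_new)
```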

Using the model Step 1: Given a training set, we first learn the model; in the training set, the local and global features are treated identically. When applying the model to the test set, however, our first decision is to determine which features are local and which are global. Step 2: Our next step is to generate the Markov network for the test set; probabilistic inference over this model infers the effect of the local features. Step 3: We use Expectation Propagation for inference: it maintains approximate beliefs (marginals) over the nodes of the Markov network and iteratively adjusts them to achieve local consistency.
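The Expectation Propagation used in the paper is more involved; purely to illustrate how global evidence and local weights can inform each other at test time, here is a simplified mean-field-style alternation (a substitute approximation, not the paper's algorithm), under the Gaussian-prior assumption from above:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def approximate_inference(X, Z, gamma, mu, sigma2=1.0, n_iters=20):
    # Alternate between soft label beliefs q_j = P(Y_j = +1) and the expected
    # local weights lambda_i, holding the other fixed.
    q = sigmoid(X @ gamma)                # initial beliefs from global features
    lam = mu.copy()                       # start local weights at prior means
    for _ in range(n_iters):
        ey = 2.0 * q - 1.0                # expected labels in {-1, +1}
        # Conditional Gaussian mean of lambda_i given the expected labels:
        # mu_i + sigma^2 * sum_j E[y_j] * Z_i^j.
        lam = mu + sigma2 * (Z.T @ ey)
        # Updated beliefs use both global and (inferred) local evidence.
        q = sigmoid(X @ gamma + Z @ lam)
    return q, lam

# Toy test scope: 3 instances, 2 global features, 3 local features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.2, 0.1]])
Z = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
q, lam = approximate_inference(X, Z, gamma=np.array([1.5, -1.5]),
                               mu=np.zeros(3), sigma2=0.1)
print(np.round(q, 3), np.round(lam, 3))
```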

Experimental Results: Reuters: The Reuters news articles data set contains a substantial number of documents hand-labeled with the categories grain, crude, trade, and money-fx. Using this data set, six experimental setups are created by using all possible pairings of the four chosen categories. The documents are ordered chronologically, and the resulting sequence is divided into nine time segments with roughly the same number of documents in each segment.

WebKB2: This data set consists of hand-labeled web pages from the Computer Science department web sites of four schools (Berkeley, CMU, MIT and Stanford), categorized into faculty, student, course and organization. Six experimental setups are created by using all possible pairings of the four categories.