Matching Words and Pictures
Rose Zhang, Ryan Westerman
MOTIVATION
Why do we care?
Users make requests based on image semantics, but most technology at the time failed to categorize images by the objects in them.
Image semantics are requested in different ways:
  by object kind (princess) or by identity (Princess of Wales)
  by the things that are visible or by what the image is about
Users don’t really care about histograms or textures.
Useful in practice, for example in newspaper archiving.
This paper aims to close the gap between how users search for images and how the images themselves are categorized.
Proposed Applications
Automated Image Annotation: categorizes images in large image archives.
Browsing Support: organizes collections of similar images so they are easier for a user to browse.
Auto-Illustration: takes descriptive text and automatically provides a closely matching image.
MODELS
So, how do the authors propose to improve existing technology?
Hierarchical Aspect Model
A generative approach to clustering documents, and the basis of the first proposed model. The original model, shown here, was used to cluster documents based on word occurrences.
Clusters appear as boxes at the bottom of the figure. Each box is associated with the triangular leaf node above it, so the entire cluster is associated with the path from that leaf node to the root node.
The path from root to leaf signifies the words most likely to be found in a document belonging to that cluster. Nodes closer to the root are shared by more clusters, so they generate words that are likely to belong to more than one cluster and are less specific to a particular document topic. Words at a leaf node are likely unique to that cluster's documents.
Image from T. Hofmann, Learning and representing topic. A hierarchical mixture model for word occurrence in document databases.
Multi-Modal Hierarchical Aspect Model
Classifies images instead of documents: the hierarchical model generates images and their associated text.
Higher-level nodes emit more generally applicable words.
Clusters represent groupings of images based on their annotations. The limited text of real annotations makes the emission of words a more difficult task.
The model is trained using Expectation Maximization (EM): given a multi-modal set of data, the EM algorithm estimates each distribution’s mean and variance, thereby separating the data into clusters and abstraction levels.
Dividing regions: the image is cut into regions. Each pixel is a node, adjacent nodes are connected by an edge weighted by how similar the pixels are, and the regions are found with a min cut.
Each of the 8 largest regions is represented by 40 features (based on color, size, position, texture, shape, etc.); these region representations are called blobs.
Citation: T. Hofmann. Learning and representing topic. A hierarchical mixture model for word occurrence in document databases. In Workshop on learning from text and the web, CMU, 1998.
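To make the EM step on this slide concrete, here is a minimal Python sketch. It is not the hierarchical model itself: it assumes 1-D data and a plain Gaussian mixture, and the function name `em_gmm_1d` and the toy data are invented for illustration.

```python
import math


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with EM (illustrative, not the paper's model)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [
                weights[j]
                * math.exp(-((x - means[j]) ** 2) / (2 * variances[j]))
                / math.sqrt(2 * math.pi * variances[j])
                for j in range(k)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate mean, variance and weight of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
            weights[j] = nj / len(data)
    return means, variances, weights


# Toy data with two obvious groups (hypothetical values).
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```

In the slides' setting the same alternation applies, but the hidden variables are the cluster and the abstraction level rather than a single component index.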
Generating words from pictures
This model generates a set of observations D associated with document d.
Notation: c = cluster index, w = word in document (image) d, b = image region in d, l = abstraction level, D = set of observations for d, B = set of blobs for d, W = set of words for d, where D = B ∪ W. The exponents normalize for the differing number of words and blobs in each image.
p(x|l,c) can be thought of as the probability of x given a node. Words and blobs both come from the hierarchical model.
p(w|l,c): we use frequency tables.
p(b|l,c), aka the blob emission probabilities: we use a Gaussian distribution over the features of the region.
p(l|d): the probability of an abstraction level given the document.
The model relies on documents specific to the training set, so it is good for search but bad for documents not in the training set: p(l|d) would have to be refit for each new document. (If the document is not in the model, where does it go in terms of cluster assignment in the hierarchical model?)
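The equation on this slide did not survive extraction. As a hedged sketch only, the symbol definitions above suggest a mixture over clusters with a per-item mixture over abstraction levels, roughly of the following form. The exponent names e_w and e_b are placeholders introduced here for the normalizing exponents; their exact definition is not given in the slide text.

```latex
p(D \mid d) = \sum_{c} p(c)
  \prod_{w \in W} \Big[ \sum_{l} p(w \mid l, c)\, p(l \mid d) \Big]^{e_w}
  \prod_{b \in B} \Big[ \sum_{l} p(b \mid l, c)\, p(l \mid d) \Big]^{e_b}
```

Read this as: pick a cluster c, then explain every word and every blob of the document by mixing the emissions of the nodes on that cluster's root-to-leaf path, weighted by p(l|d).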
What about documents not in the training set?
Replace d with c in the equation, so that the generative model no longer depends on d.
This doesn’t significantly decrease the quality of results.
It makes the equation simpler: compute a cluster-dependent average during training rather than calculating a value for each document.
It saves memory when the number of documents is large.
Replacing d with c makes sense because each document is in one cluster (though technically we only have a Gaussian distribution over which cluster the document is in).
Image based word prediction
Assume a new document with a set of blobs, B.
We are applying the model to documents outside the training set, so this equation is based on the equation on the previous slide.
w = words in the vocabulary.
p(w|B) is proportional to the probability of the word in a cluster, which is then expanded.
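A rough Python sketch of this kind of word prediction follows. It is an illustration under strong assumptions, not the paper's implementation: blobs are single numbers rather than 40-feature vectors, each cluster has only two levels (a shared root and its own leaf), and the dictionary layout, `TOY_MODEL`, and `predict_words` are all hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def predict_words(blobs, model):
    """Score each word by summing over clusters: p(w|c) * p(B|c) * p(c)."""
    scores = {}
    for cluster in model:
        # p(B|c) p(c): each blob is explained by a mixture over levels.
        weight = cluster["prior"]
        for b in blobs:
            weight *= sum(
                p_level * gaussian(b, mean, var)
                for p_level, (mean, var) in zip(cluster["p_level"], cluster["blob_params"])
            )
        # p(w|c): mixture over levels of the per-node word frequency tables.
        for p_level, table in zip(cluster["p_level"], cluster["word_tables"]):
            for word, p_word in table.items():
                scores[word] = scores.get(word, 0.0) + weight * p_level * p_word
    total = sum(scores.values())
    return {word: s / total for word, s in scores.items()}


ROOT_WORDS = {"sky": 0.6, "grass": 0.4}  # shared root node emits general words
TOY_MODEL = [
    {"prior": 0.5, "p_level": [0.5, 0.5],
     "blob_params": [(0.5, 1.0), (0.9, 0.01)],   # (mean, variance) per level
     "word_tables": [ROOT_WORDS, {"tiger": 1.0}]},
    {"prior": 0.5, "p_level": [0.5, 0.5],
     "blob_params": [(0.5, 1.0), (0.1, 0.01)],
     "word_tables": [ROOT_WORDS, {"water": 1.0}]},
]

prediction = predict_words([0.85, 0.95], TOY_MODEL)
```

With blobs near 0.9 the first cluster dominates, so "tiger" ends up with a higher probability than "water", while the root words remain likely under either cluster.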
Multi-Modal Dirichlet Allocation
MoM-LDA is a generative model for an image and its corresponding words.
Process:
1. Choose one of J mixture components, c ∼ Multinomial(η).
2. Conditioned on c, choose a mixture over K factors, θ ∼ Dir(α_c).
3. For each of the N words:
   a. Choose one of K factors, z_n ∼ Multinomial(θ).
   b. Choose one of V words w_n from the probability of w_n given z_n and c.
4. For each of the M blobs:
   a. Choose a factor s_m ∼ Multinomial(θ).
   b. Choose a blob b_m from a Gaussian distribution conditioned on s_m and c.
The diagram shows the process of generating blobs and words using this method. M, N, and I are “plates” and represent the repetition over M blobs, N words, and I images. The mixture component c and θ are sampled once per image; s and z are then sampled once per blob and once per word, respectively.
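The generative process above can be sketched in Python with only the standard library. This is a toy illustration: blobs are 1-D Gaussian draws rather than feature vectors, and `TOY_PARAMS` and the function names are assumptions made up for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_image(params, n_words, n_blobs, rng):
    c = sample_index(params["eta"], rng)                 # step 1: mixture component
    theta = sample_dirichlet(params["alpha"][c], rng)    # step 2: mixture over factors
    words = []
    for _ in range(n_words):                             # step 3: words
        z = sample_index(theta, rng)
        v = sample_index(params["word_probs"][c][z], rng)
        words.append(params["vocab"][v])
    blobs = []
    for _ in range(n_blobs):                             # step 4: blobs
        s = sample_index(theta, rng)
        mean, std = params["blob_params"][c][s]
        blobs.append(rng.gauss(mean, std))
    return c, words, blobs


TOY_PARAMS = {
    "eta": [0.5, 0.5],                                   # J = 2 components
    "alpha": [[1.0, 1.0], [2.0, 0.5]],                   # K = 2 factors each
    "vocab": ["tiger", "grass", "water"],                # V = 3 words
    "word_probs": [[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
                   [[0.1, 0.1, 0.8], [0.3, 0.4, 0.3]]],  # p(w | z, c)
    "blob_params": [[(0.9, 0.1), (0.4, 0.1)],
                    [(0.1, 0.1), (0.5, 0.2)]],           # (mean, std) per (c, s)
}

c, words, blobs = generate_image(TOY_PARAMS, n_words=3, n_blobs=4, rng=random.Random(0))
```

Note that c and θ are drawn once per call, matching the "once per image" plate structure, while z and s are redrawn for every word and blob.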
Predictions using MoM-LDA
Given an image and a MoM-LDA model, we can:
Compute an approximate posterior over mixture components, φ
Compute an approximate Dirichlet over factors, γ
Using the following formula, we calculate the distribution over words given an image. J represents all mixture components, and K represents all factors.
Terminology: the individual distributions that are combined to form a mixture distribution are called the mixture components, and the probabilities (or weights) associated with each component are called the mixture weights.
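The formula itself is missing from the extracted slide. One plausible reading, offered here only as an assumption, is a double sum over the J components and K factors in which φ weights the components and the normalized γ weights the factors. The Python sketch below implements that reading; the argument layout (`gamma[c][k]`, `word_probs[c][k][v]`) is hypothetical.

```python
def predict_word_distribution(phi, gamma, word_probs):
    """Assumed form: p(w|B) = sum_c phi[c] * sum_k (gamma[c][k] / sum(gamma[c])) * p(w|k,c)."""
    vocab_size = len(word_probs[0][0])
    dist = [0.0] * vocab_size
    for c, phi_c in enumerate(phi):
        gamma_total = sum(gamma[c])
        for k, gamma_ck in enumerate(gamma[c]):
            factor_weight = gamma_ck / gamma_total  # expected factor proportion
            for v in range(vocab_size):
                dist[v] += phi_c * factor_weight * word_probs[c][k][v]
    return dist


# Toy values: J = 2 components, K = 2 factors, V = 3 words.
phi = [0.7, 0.3]
gamma = [[2.0, 1.0], [1.0, 1.0]]
word_probs = [[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
              [[0.1, 0.1, 0.8], [0.3, 0.4, 0.3]]]
word_dist = predict_word_distribution(phi, gamma, word_probs)
```

Because φ and each normalized γ row sum to 1, the result is itself a distribution over the vocabulary.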
Simple Correspondence Models
Goal: move from annotation to object recognition by predicting words for specific regions instead of for the entire image. Take the set of observations (words + blobs) from annotation and plug it into the correspondence models.
Discrete translation: match a word to a blob using a joint probability table.
Hierarchical clustering: use the hierarchical model, but for blobs instead of images (see equation). There is no direct link between word and blob, but "tiger" might always appear with an orange, stripey region, and both are then always at a shared node.
The equation is similar to the word prediction equation. It differs from the image prediction equation, which has words in a cluster: here, words must be in a node.
But discrete translation purposely ignores potentially useful training data, and hierarchical clustering uses data that the model was not trained to represent.
Integrated Correspondence and Hierarchical Clustering
Strategy 1: if a node contributes little to the image region, then it also contributes little to the word.
Change the p(D|d) equation (from the beginning) to account for how blobs affect words. p(l|d) can again be altered to p(l|c).
Blobs are independent, but words depend on blobs, so the equation for D (which needs both words and blobs) has to be redone.
Only the set of observations changes: the simple correspondence equation is still used to get the correspondence model.
Integrated Correspondence: Strategy 2
Strategy 2: pair each word with a region.
The training algorithm needs to change so that w and b are paired for the p(w,b) calculation; w and b are then paired in the p(D) equation.
Changing the training algorithm: add graph matching.
1. Create a bipartite graph with words on one side and image regions on the other, with edges weighted by the negative log probabilities from the equation. (The log of a number between 0 and 1 is negative, so the negated weights are non-negative costs.)
2. Find the minimum-cost assignment in the graph matching.
3. Resume the EM algorithm.
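The matching step can be sketched in a few lines of Python. This is a brute-force illustration that tries every assignment, which is only practical for a handful of words and regions; a real system would more likely use a dedicated assignment algorithm such as the Hungarian method. The function name and the toy probability table are assumptions, and the sketch requires strictly positive probabilities and no more words than regions.

```python
import math
from itertools import permutations


def match_words_to_regions(joint_prob):
    """joint_prob[i][j] = p(word i, region j); returns the region index chosen for each word."""
    n_words = len(joint_prob)
    n_regions = len(joint_prob[0])
    # Edge weights of the bipartite graph: negative log probabilities.
    cost = [[-math.log(p) for p in row] for row in joint_prob]
    # Exhaustive search for the minimum-cost one-to-one assignment.
    return min(
        permutations(range(n_regions), n_words),
        key=lambda assignment: sum(cost[i][j] for i, j in enumerate(assignment)),
    )


# Toy table: rows are the words "tiger" and "grass", columns are three regions.
toy_joint = [[0.6, 0.1, 0.1],
             [0.2, 0.5, 0.1]]
assignment = match_words_to_regions(toy_joint)  # region index per word
```

Minimizing the sum of negative logs is the same as maximizing the product of the paired probabilities, which is why the log transform is used for the edge weights.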
Integrated Correspondence: NULL
Sometimes a region has no words, or the numbers of words and regions are different.
Assign NULL when the annotation with the highest probability is still too low (but what about outliers?).
There is a tendency for errors where the NULL image region generates every word, or every image region generates the NULL word.
Can delete words generated by the NULL image region, or regions that generate the NULL word.
EVALUATION
Experiment
160 CDs, each with 100 images on a specific subject
Excluded words which occurred <20 times in the test set
Vocabulary of about 155 words
Data split of the 160 CDs:
80 CDs: 75% of images used for training, 25% kept as the standard held-out set
80 CDs: novel held-out set
Evaluating the model
Notation: N = number of documents, q(w|B) = computed predictive model, p(w) = target distribution, K = number of words for the image.
Annotation models are evaluated on both well represented and not well represented data.
Correspondence models: assume poor annotation means poor correspondence; otherwise the results would have to be graded manually. Remember that the simple correspondence model is based on the annotation model, but for individual blobs.
Equation 1: a negative E_KL means the model is worse than the empirical distribution; a positive value means it is better. Is the model actually “learning”? Compare the error of the empirical distribution with the error of the model. Empirical = the word occurrence density that came with the training set.
Unfortunately, we don’t know p(w), so assume the actual words are predicted uniformly and the others not at all: p(w) = 1/K for the observed words, and 0 otherwise.
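The comparison against the empirical distribution can be illustrated with a short Python sketch. It assumes the uniform target described on the slide (p(w) = 1/K on observed words) and reports the empirical error minus the model error, so that positive means the model is better. The function names and toy numbers are invented here, and the exact normalization used in the paper's equations is not recoverable from the slide text.

```python
import math


def kl_from_target(observed_words, predicted):
    """KL divergence from the uniform target on observed words to a predicted distribution."""
    k = len(observed_words)
    # Words with p(w) = 0 contribute nothing, so only observed words are summed.
    return sum((1 / k) * math.log((1 / k) / predicted[w]) for w in observed_words)


def kl_improvement(documents, empirical):
    """Average of (empirical error - model error) over documents; positive = model is better."""
    diffs = [
        kl_from_target(words, empirical) - kl_from_target(words, model_pred)
        for words, model_pred in documents
    ]
    return sum(diffs) / len(diffs)


# Toy example with a four-word vocabulary (hypothetical numbers).
empirical = {"tiger": 0.25, "grass": 0.25, "sky": 0.25, "water": 0.25}
model_pred = {"tiger": 0.4, "grass": 0.4, "sky": 0.1, "water": 0.1}
score = kl_improvement([(["tiger", "grass"], model_pred)], empirical)
```

Here the model puts more mass on the two observed words than the empirical distribution does, so the score is positive.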
Evaluating word prediction
Notation: N = vocabulary size, n = number of words for the image, r = number of words predicted correctly, w = number of words predicted incorrectly.
Equation 1 returns:
0 if everything or nothing is predicted
1 for predicting exactly the actual word set
-1 for predicting the complement of the word set
Equation 2: larger values are better.
Ideally, a better fitting model predicts words better. Compare to the empirical distribution again.
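The two equations are not preserved in the slide text. The Python sketch below uses forms that are consistent with the properties listed above: a normalized score r/n - w/(N - n), and a coverage score r/n for Equation 2. Treat both as assumptions about what the slide showed; the function names are made up.

```python
def normalized_score(r, w, n, vocab_size):
    """Assumed form of Equation 1: reward correct words, penalize incorrect ones."""
    return r / n - w / (vocab_size - n)


def coverage_score(r, n):
    """Assumed form of Equation 2: fraction of the actual words that were predicted."""
    return r / n


# Check the listed properties with N = 155 and an image annotated with n = 4 words.
N, n = 155, 4
everything = normalized_score(r=n, w=N - n, n=n, vocab_size=N)   # all words predicted
nothing = normalized_score(r=0, w=0, n=n, vocab_size=N)          # no words predicted
exact = normalized_score(r=n, w=0, n=n, vocab_size=N)            # exactly the word set
complement = normalized_score(r=0, w=N - n, n=n, vocab_size=N)   # only the wrong words
```

The penalty term is what stops a model from scoring well simply by predicting the whole vocabulary.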
RESULTS
Annotation results
Looking at how changing the training method (number of iterations) changes model error. Four different annotation models were tested, but the results looked mostly the same, so here is one.
Training: train the model using a subset of the training data, then use that model as the starting point for the next set.
Held-out set: most benefit after 10 iterations.
Novel held-out set: results get worse (error increases) as the number of iterations increases, which shows an inability to generalize.
Better to simultaneously learn models for blobs and their linkage to words.
Normalized word prediction
Refuse-to-predict level, X
Designed to handle situations where an annotation does not mention an object
Requires a minimum probability P to predict a word: P = 10^(-X/10)
The extremes result in predicting everything and predicting nothing
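A small Python sketch of the thresholding idea, assuming X is the refuse-to-predict level and P = 10^(-X/10) is the minimum probability a word needs in order to be predicted. The function name and the toy probabilities are illustrative only.

```python
def predict_with_refusal(word_probs, level):
    """Keep only words whose probability reaches the threshold P = 10 ** (-level / 10)."""
    threshold = 10 ** (-level / 10)
    return [word for word, p in word_probs.items() if p >= threshold]


toy_probs = {"tiger": 0.5, "grass": 0.05, "sky": 0.001}
strict = predict_with_refusal(toy_probs, level=0)     # threshold 1.0
moderate = predict_with_refusal(toy_probs, level=5)   # threshold about 0.316
lenient = predict_with_refusal(toy_probs, level=40)   # threshold 0.0001
```

Low levels refuse almost every word and high levels accept almost every word, which matches the two extremes mentioned on the slide.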
Correspondence Results
Discrete translation did the worst.
Paired word-blob emissions did better than annotation-based methods.
Dependence of words on blobs performed the best.
Example figure panels: good annotation but bad correspondence; good results; complete failure.
CRITIQUE
Experimental Decisions
Uses only ⅜ of the available data for training, and separates ½ of the total data for novel testing.
Approximates correspondence performance by annotation performance: there is no true evaluation of the correspondence results, i.e. of how well each image region was actually labeled.
Some correspondence errors are less bad than others (cat for tiger vs. car for tiger), but no evaluation was done on this.
No absolute scale to compare errors between models or to future results. At what point do two errors differ enough to be considered a true difference? There is no scale for E_KL, so a difference in the thousandths place could be huge or ...
Small vocabulary of 155 words means limited applications even with good results.
Questionable Evaluation
p(w), the target distribution, is unknown, so the paper assumes p(w) = 1/K for observed words and p(w) = 0 for all other words.
What is log(0)?
This issue could have been solved by smoothing p(w).
Image from: https://courses.lumenlearning.com/waymakercollegealgebra/chapter/characteristics-of-logarithmic-functions/
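As one possible way to smooth p(w), the Python sketch below adds a small constant to every vocabulary word and renormalizes, so that no target probability is exactly zero. This is a generic additive-smoothing example, not something the paper does; `smoothed_target`, `epsilon`, and the toy vocabulary are assumptions, and the observed words are assumed to be unique and contained in the vocabulary.

```python
def smoothed_target(observed_words, vocabulary, epsilon=1e-3):
    """Uniform target on observed words, with epsilon added to every word before renormalizing."""
    observed = set(observed_words)
    k = len(observed)
    norm = 1 + epsilon * len(vocabulary)
    return {
        word: ((1 / k if word in observed else 0.0) + epsilon) / norm
        for word in vocabulary
    }


toy_vocab = ["tiger", "grass", "sky", "water"]
target = smoothed_target(["tiger", "grass"], toy_vocab)
```

The choice of epsilon trades off fidelity to the original target against how harshly near-zero predictions are treated.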
Future Research
Moving from unsupervised input to a semi-supervised model
Research into evaluation methods which don’t require manual checking of labeled images
More robust datasets for word/image matching
Q&A