Matching Words with Pictures Chun Li Teng Chao Ji
Introduction Learning the joint distribution of image regions and words has many applications. While text and images are separately ambiguous, jointly they tend not to be.
Introduction Key points about how users request images: 1. Users request images both by object kinds and by object identities. 2. Users request images both by what they depict and by what they are about. 3. Queries based on image histograms, texture, or overall appearance are rarely what users want. 4. Text associated with images is extremely useful.
Introduction Several practical applications for methods that link text and images: 1. Automated image annotation: archivists receive pictures and annotate them with useful keywords. 2. Browsing support: organizing a collection so that people can browse it effectively. 3. Auto-illustration: a tool that automatically suggests images to illustrate blocks of text could expose image collections to casual users.
Introduction Two Main Tasks: 1. Annotation: predict annotations for an entire image using all the information present. 2. Correspondence: associate particular words with particular image substructures. This can be viewed as a form of object recognition.
Input Representation and Preprocessing The features represent the major visual properties: 1. Size is represented by the portion of the image covered by the region. 2. Position is represented by the coordinates of the region's center of mass, normalized by the image dimensions. 3. Color is represented by the average and standard deviation of (R, G, B), (L, a, b), and (r = R/(R+G+B), g = G/(R+G+B)). 4. Texture is represented by the average and variance of 16 filter responses. 5. Shape is represented by the ratio of area to perimeter squared, the moment of inertia, and the ratio of the region's area to that of its convex hull.
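A few of the listed features can be sketched in code. The following is a minimal, illustrative implementation of the size, position, color, and chromaticity features for one region (it is not the authors' preprocessing pipeline, and `region_features` is a hypothetical helper name):

```python
import numpy as np

def region_features(image, mask):
    """Compute a subset of the listed region features (illustrative sketch).

    image: H x W x 3 float array with RGB values in [0, 1]
    mask:  H x W boolean array marking the region's pixels
    """
    h, w, _ = image.shape
    ys, xs = np.nonzero(mask)
    pixels = image[mask]                       # N x 3 array of region pixels

    size = mask.sum() / (h * w)                # portion of the image covered
    # center of mass, normalized by the image dimensions
    pos = (ys.mean() / h, xs.mean() / w)
    # color: average and standard deviation of (R, G, B)
    color_mean = pixels.mean(axis=0)
    color_std = pixels.std(axis=0)
    # chromaticity: r = R/(R+G+B), g = G/(R+G+B)
    total = pixels.sum(axis=1, keepdims=True) + 1e-8
    rg = (pixels[:, :2] / total).mean(axis=0)
    return np.concatenate([[size], pos, color_mean, color_std, rg])
```

The texture and shape features would follow the same pattern, with a filter bank and region-geometry computations respectively.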
Annotation Models Multi-Modal Hierarchical Aspect Models Mixture of Multi-Modal Latent Dirichlet Allocation
Multi-Modal Hierarchical Aspect Models Images and co-occurring text are generated by nodes arranged in a tree structure.
Multi-Modal Hierarchical Aspect Models The nodes generate both image regions (using a Gaussian distribution) and words (using a multinomial distribution). Each cluster is associated with a path from a leaf to the root. Nodes close to the root are shared by many clusters, while nodes closer to the leaves are shared by few clusters.
Multi-Modal Hierarchical Aspect Models Notation: c indexes clusters; w indexes the words in document d; b indexes the image regions (blobs) in document d; l indexes levels. D is the set of observations for the document, W is the set of words for the document, and B is the set of blobs for the document. Exponents are introduced to normalize for differing numbers of words and blobs in each image: N_wd denotes the number of words in document d, while N_w denotes the maximum number of words in any document.
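With this notation, the document likelihood the slide alludes to can be sketched as follows (a reconstruction in the spirit of Barnard et al.'s hierarchical model; the exponent N_w/N_wd normalizes the word contribution, and N_b/N_bd does the same for blobs):

```latex
p(D) = \sum_{c} p(c)\,
  \prod_{w \in W} \Big( \sum_{l} p(w \mid l, c)\, p(l \mid c, d) \Big)^{N_w / N_{wd}}
  \prod_{b \in B} \Big( \sum_{l} p(b \mid l, c)\, p(l \mid c, d) \Big)^{N_b / N_{bd}}
```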
Mixture of Multi-Modal Latent Dirichlet Allocation A graphical probabilistic model
Mixture of Multi-Modal Latent Dirichlet Allocation Notation: c is the parameter of the Dirichlet prior on the per-document topic distributions; θ is the topic distribution for a document; z is the topic for a word in the document; s is the topic for a blob in the document; b is a specific blob; w is a specific word; M is the number of blobs; N is the number of words; I is the entire document (the image regions plus the words).
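The generative process behind this notation can be sketched as a sampler. This is an illustrative simplification with made-up parameters (K, V, the blob Gaussians, and the function name are all assumptions, not the authors' implementation): draw a per-document θ, then draw a topic for each blob and each word from θ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameters: K topics, V word types, blobs as 2-D Gaussians.
K, V = 3, 6
alpha = np.ones(K)                                # Dirichlet prior on topic proportions
word_probs = rng.dirichlet(np.ones(V), size=K)    # p(w | z): one multinomial row per topic
blob_means = rng.normal(size=(K, 2))              # p(b | s): one Gaussian mean per topic

def generate_document(n_blobs, n_words):
    """Sample one (blobs, words) document from the sketched generative process."""
    theta = rng.dirichlet(alpha)                  # per-document topic distribution θ
    s = rng.choice(K, size=n_blobs, p=theta)      # topic s for each blob
    blobs = blob_means[s] + rng.normal(scale=0.1, size=(n_blobs, 2))
    z = rng.choice(K, size=n_words, p=theta)      # topic z for each word
    words = [rng.choice(V, p=word_probs[t]) for t in z]
    return blobs, words
```

Because blobs and words share the same θ, observing the blobs of an image carries information about which words are likely, which is what the annotation task exploits.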
Mixture of Multi-Modal Latent Dirichlet Allocation Let φ denote the approximate posterior over mixture components, and γc denote the corresponding approximate posterior Dirichlet parameters. The distribution over words given an image (that is, a collection of blobs) is then computed from these variational quantities.
Simple Correspondence Models Goal: build models that can predict words for specific image regions. Method: Step 1: vector-quantize the representations of image regions. Step 2: exploit the analogy with statistical lexicon learning.
Simple Correspondence Models Discrete-data translation: use K-means to vector-quantize the set of features representing each image region, labeling each region with a single token.
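Step 1 can be sketched with a plain K-means over the region feature vectors (a minimal numpy version with a simplistic first-k initialization; `kmeans_tokens` is a hypothetical name, not the authors' code):

```python
import numpy as np

def kmeans_tokens(features, k, n_iter=50):
    """Vector-quantize region feature vectors into k blob tokens via plain K-means."""
    centers = features[:k].astype(float).copy()   # simplistic init: first k points
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        # assign each region to its nearest center
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned regions
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return labels, centers
```

After quantization, each image becomes a bag of blob tokens paired with a bag of words, which is exactly the setting of statistical lexicon learning.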
Simple Correspondence Models Problem: missing data. We must construct a joint probability table linking word tokens to blob tokens; however, the data set does not provide explicit correspondences.
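This missing-correspondence problem is the same one faced in statistical machine translation, and EM resolves it by treating the word-blob alignments as hidden variables. A hedged sketch in the spirit of the classic translation-table EM (an illustration, not the authors' exact algorithm):

```python
from collections import defaultdict

def em_translation(pairs, n_iter=20):
    """Estimate p(word | blob token) from images whose word-blob pairing is unknown.

    pairs: list of (words, blob_tokens) per image, e.g. (["sky", "sea"], [0, 1])
    """
    vocab = {w for ws, _ in pairs for w in ws}
    t = defaultdict(lambda: 1.0 / len(vocab))      # uniform init of p(w | b)
    for _ in range(n_iter):
        count = defaultdict(float)                 # expected word-blob co-occurrence counts
        total = defaultdict(float)
        for ws, bs in pairs:                       # E-step: soft alignments within each image
            for w in ws:
                norm = sum(t[(w, b)] for b in bs)
                for b in bs:
                    c = t[(w, b)] / norm
                    count[(w, b)] += c
                    total[b] += c
        for (w, b), c in count.items():            # M-step: renormalize per blob token
            t[(w, b)] = c / total[b]
    return t
```

Images where a word appears with few blobs pin down its translation, and EM propagates that evidence to the ambiguous images.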
Correspondence from a Hierarchical Clustering Model
Correspondence from a Hierarchical Clustering Model Hierarchical clustering models do not explicitly model word-region relationships, but they do encode correspondence to some extent through co-occurrence.
Correspondence from a Hierarchical Clustering Model Problem: because correspondence is encoded only through co-occurrence, different regions in the same image tend to predict similar words.
Simple Correspondence Models Conclusion: none of the methods described above is wholly satisfactory for learning correspondence. Improvement: strengthen the relationship between words and image regions when building up the models.
Integrating Correspondence and Hierarchical Clustering First approach: linking word emission and region emission probabilities with mixture weights
Integrating Correspondence and Hierarchical Clustering Second approach: Paired Word and Region Emission at Nodes
Matching Words with Pictures Goal: to match words with pictures.
Matching Words with Pictures Summary of the two tasks and their methods: Annotation: 1. Multi-Modal Hierarchical Aspect Models 2. Mixture of Multi-Modal Latent Dirichlet Allocation. Correspondence: 1. Discrete-data translation 2. Correspondence from a hierarchical clustering model. Problem: missing correspondence data.
Matching Words with Pictures Two ways to improve the hierarchical model: 1. Linking word emission and region emission probabilities with mixture weights 2. Paired word and region emission at nodes.
Matching Words with Pictures The end THANKS