Matching Words and Pictures

Matching Words and Pictures
Rose Zhang, Ryan Westerman

MOTIVATION

Why do we care?
- Users make requests based on image semantics, but most technology at the time failed to categorize images by the objects they contain.
- The semantics of images are requested in different ways: by object kind (a princess) and by identity (the Princess of Wales), by what is visible in an image and by what the image is about.
- Users do not really care about histograms or textures.
- Useful in practice, for example in newspaper archiving.
This paper aims to resolve the discrepancy between how users search for images and how the images themselves are categorized. Users typically make requests based on the semantics of an image, while the technology of the time typically did not categorize images based on object semantics.

Proposed Applications
- Automated image annotation: allows categorization of images in large image archives.
- Browsing support: facilitates organizing collections of similar images for easier browsing.
- Auto-illustration: automatically provides an image based on descriptive text.
The proposed applications for the approaches in this paper include automated image annotation, which could be used to categorize images in large archives; browsing support, which groups images in a way that is easier for a user to browse through; and auto-illustration, which could take a description of an image and find a close match.

MODELS
So, how do the authors propose to improve existing technology?

Hierarchical Aspect Model
A generative approach to clustering documents. Clusters appear as boxes at the bottom of the figure. The path from the root to the leaf above a cluster gives the words most likely to be found in a document belonging to that cluster. Words at the leaf node are likely unique to those documents, while words at the root node are shared across all clusters.
The first proposed model is built around a hierarchical aspect model. The original model, shown here, was used to cluster documents based on word occurrences. A cluster of documents is represented by a square at the bottom of the figure. Each square is associated with the node above it, so the entire cluster is associated with the path from that leaf node to the root node. Because nodes closer to the root are shared by more clusters, they generate words that are likely to belong to more than one cluster and are less specific to a particular document topic. Image from T. Hofmann, "Learning and representing topic: a hierarchical mixture model for word occurrence in document databases."

Multi-Modal Hierarchical Aspect Model
- Generates words to cluster images instead of documents.
- Higher-level nodes emit more generally applicable words.
- Clusters represent groupings of annotations for images.
- The model is trained using Expectation Maximization (EM).
In the multi-modal version of this hierarchical model, we are classifying images instead of documents. Clusters represent groupings of images based on their annotations, which makes emitting words a more difficult task given the limited text of real annotations. EM algorithm: given a multi-modal data set, EM estimates each distribution's mean and variance, thereby separating the data into clusters and abstraction levels. Dividing regions: the image is cut into regions (each pixel is a node, adjacent nodes are connected by an edge weighted by how similar the pixels are, and then you find the minimum cut). The 8 largest regions are each represented by 40 features (based on color, size, position, texture, shape, etc.), and these feature vectors are called blobs (a simplified sketch of such a descriptor follows below). The hierarchical model generates images and their associated text. Citation: T. Hofmann, "Learning and representing topic: a hierarchical mixture model for word occurrence in document databases," Workshop on Learning from Text and the Web, CMU, 1998.
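To make the blob representation concrete, here is a minimal sketch of a region descriptor. It is an illustrative approximation only: the function name and the particular features are assumptions, and the paper's actual blobs use 40 features including texture and shape measures omitted here.

```python
import numpy as np

def blob_features(image, mask):
    """Toy blob descriptor for one segmented region (not the paper's 40 features).

    image: HxWx3 float array with values in [0, 1]
    mask:  HxW boolean array marking the region's pixels
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    pixels = image[mask]                    # region pixels, shape (n, 3)

    size = mask.sum() / (h * w)             # relative region size
    cy, cx = ys.mean() / h, xs.mean() / w   # normalized centroid position
    mean_color = pixels.mean(axis=0)        # average color
    std_color = pixels.std(axis=0)          # color spread within the region

    return np.concatenate([[size, cy, cx], mean_color, std_color])
```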

Generating words from pictures
Symbols: c = cluster indices, w = words in document (image) d, b = image region indices in d, l = abstraction level, D = set of observations for d, B = set of blobs for d, W = set of words for d, where D = B ∪ W; the exponents normalize the differing number of words and blobs in each image (see the reconstruction below).
- This model generates the set of observations D associated with document d.
- It relies on documents specific to the training set: good for search, bad for documents not in the training set, since the model would have to be refit for each new document.
- p(x|l,c) can be thought of as an emission probability at a given node. Words and blobs come from the hierarchical model: for p(w|l,c) we use frequency tables; for p(b|l,c), the blob emission probabilities, we use a Gaussian distribution over the features of the region.
- p(l|d) is the probability of an abstraction level given the document, and p(l|d) must be refit for each new document (if a document is not in the model, where does it go in terms of cluster assignment in the hierarchy?).
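The slide's equation is not reproduced in the transcript. Based on the symbol legend above, the document model has roughly the following form (a hedged reconstruction, not a verbatim copy of the paper's equation):

p(D \mid d) \;\propto\; \sum_{c} p(c)\, \prod_{w \in W} \Big[ \sum_{l} p(w \mid l, c)\, p(l \mid d) \Big]^{1/|W|} \prod_{b \in B} \Big[ \sum_{l} p(b \mid l, c)\, p(l \mid d) \Big]^{1/|B|}

The 1/|W| and 1/|B| exponents are one way to realize the "normalize the differing number of words and blobs" remark on the slide.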

What about documents not in the training set?
- Make the model generative so that the equation does not depend on d: replace d with c, which does not significantly decrease the quality of the results.
- This makes the equation simpler: compute a cluster-dependent average during training rather than calculating p(l|d) for each document (see the sketch below).
- It also saves memory when the number of documents is large.
Replacing d with c makes sense because each document belongs to one cluster (though technically we only have a probability distribution over which cluster the document is in).
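Under the same reconstruction as above, replacing the document-specific vertical distribution p(l|d) with a cluster-level average p(l|c) gives something like:

p(D) \;\propto\; \sum_{c} p(c)\, \prod_{w \in W} \Big[ \sum_{l} p(w \mid l, c)\, p(l \mid c) \Big]^{1/|W|} \prod_{b \in B} \Big[ \sum_{l} p(b \mid l, c)\, p(l \mid c) \Big]^{1/|B|}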

Image based word prediction
- Assume a new document with blobs B.
- We are applying this to documents outside the training set, so this equation is based on the document-free equation from the previous slide.
- w ranges over words in the vocabulary.
- The prediction is proportional to the probability of the word within a cluster, weighted by how well that cluster explains the blobs (expanded in the sketch below).
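The prediction equation itself is not reproduced in the transcript; a hedged reconstruction consistent with the description is:

p(w \mid B) \;\propto\; \sum_{c} p(w \mid c)\, p(c \mid B), \qquad p(w \mid c) = \sum_{l} p(w \mid l, c)\, p(l \mid c), \qquad p(c \mid B) \;\propto\; p(c) \prod_{b \in B} \sum_{l} p(b \mid l, c)\, p(l \mid c)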

Multi-Modal Dirichlet Allocation Process
1. Choose one of J mixture components c ∼ Multinomial(η).
2. Conditioned on c, choose a mixture over K factors, θ ∼ Dir(α_c).
3. For each of the N words: choose one of K factors z_n ∼ Multinomial(θ), then choose one of V words w_n from p(w_n | z_n, c).
4. For each of the M blobs: choose a factor s_m ∼ Multinomial(θ), then choose a blob b_m from a Gaussian distribution conditioned on s_m and c.
MoM-LDA is a generative model for an image and its corresponding words. The diagram shows the process of generating blobs and words using this method. M, N, and I are plates, representing the repetition of M blobs, N words, and I images. The mixture component c and the factor mixture θ are sampled once per image; then s and z are generated once per blob and per word, respectively. A sketch of this sampling process follows below.
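A minimal sketch of this generative process, assuming placeholder parameters (eta, alpha, word_probs, blob_means, blob_cov are hypothetical names and shapes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_image(eta, alpha, word_probs, blob_means, blob_cov, n_words, n_blobs):
    """Sample one image's words and blobs under a MoM-LDA-style process.

    eta        : (J,)       prior over mixture components
    alpha      : (J, K)     Dirichlet parameters, one row per component
    word_probs : (J, K, V)  word emission probabilities p(w | z, c)
    blob_means : (J, K, F)  Gaussian means for blob features
    blob_cov   : (F, F)     shared covariance (a simplification)
    """
    J, K, V = word_probs.shape
    c = rng.choice(J, p=eta)             # 1. pick a mixture component
    theta = rng.dirichlet(alpha[c])      # 2. mixture over the K factors

    words, blobs = [], []
    for _ in range(n_words):             # 3. one factor per word, then a word
        z = rng.choice(K, p=theta)
        words.append(rng.choice(V, p=word_probs[c, z]))
    for _ in range(n_blobs):             # 4. one factor per blob, then a blob
        s = rng.choice(K, p=theta)
        blobs.append(rng.multivariate_normal(blob_means[c, s], blob_cov))
    return c, words, blobs
```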

Predictions using MoM-LDA
Given an image and a MoM-LDA model, we can:
- Compute an approximate posterior over mixture components, φ.
- Compute an approximate Dirichlet over factors, γ.
- Use these to calculate the distribution over words given the image, where J ranges over all mixture components and K over all factors (see the sketch below).
On mixtures and factors: the individual distributions that are combined to form the mixture distribution are called the mixture components, and the probabilities (or weights) associated with each component are called the mixture weights.
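The formula itself is not reproduced in the transcript; one plausible reconstruction from the description (approximate posterior φ over components, approximate Dirichlet parameters γ over factors) is:

p(w \mid \text{image}) \;\approx\; \sum_{j=1}^{J} \phi_j \sum_{k=1}^{K} \frac{\gamma_{jk}}{\sum_{k'} \gamma_{jk'}}\, p(w \mid z = k, c = j)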

Simple Correspondence Models
Simple correspondence models predict words for specific regions instead of for the entire image.
- Discrete translation: match a word to a blob using a joint probability table.
- Hierarchical clustering: use the hierarchical model, but for blobs instead of whole images (see the equation sketch below).
However, discrete translation purposely ignores potentially useful training data, and hierarchical clustering uses data that the model was not trained to represent.
Moving from annotation to object recognition: take the set of observations (words + blobs) from annotation and plug them into correspondence models. With hierarchical clustering there is no direct link between a word and a blob, but "tiger" might always appear together with an orange stripey region, and both are always found at a shared node. The equation is similar to the word prediction equation, but different from the image prediction equation, which allows words anywhere in a cluster; here the word must be emitted at the same node as the blob.
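The equation referenced on the slide is not reproduced in the transcript; a hedged reconstruction of region-level word prediction under the hierarchical model (word and blob tied to the same node) is:

p(w \mid b) \;\propto\; \sum_{c} p(c) \sum_{l} p(w \mid l, c)\, p(b \mid l, c)\, p(l \mid c)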

Integrated Correspondence and Hierarchical Clustering
Strategy 1: if a node contributes little to an image region, then it should also contribute little to the words.
- Change the p(D|d) equation (from the beginning) to account for how blobs affect words; p(l|d) can also be altered to p(l|c).
- Then apply the simple correspondence equation.
- The equation for D has to be redone (it needs both words and blobs): blobs are independent, but words depend on blobs (one way to read this is sketched below).
- All that changes is the set of observations; the simple correspondence equation is still used to obtain the correspondence model.
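One way to make "words depend on blobs" concrete (a reconstruction, not the paper's exact notation) is to emit each word from the node posterior induced by the image's blobs:

p(w \mid B, c) \;\approx\; \frac{1}{|B|} \sum_{b \in B} \sum_{l} p(w \mid l, c)\, p(l \mid b, c), \qquad p(l \mid b, c) \;\propto\; p(b \mid l, c)\, p(l \mid c)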

Integrated Correspondence: Strategy 2
Strategy 2: pair each word with a region.
- The training algorithm must be changed to pair w and b for the p(w,b) calculation.
- Changing the training algorithm means adding graph matching: create a bipartite graph with words on one side and image regions on the other, with edges weighted by the negative log probabilities from the equation (the log of a number between 0 and 1 is negative, so these weights are positive costs).
- Find the minimum-cost assignment in the bipartite graph, then resume the EM algorithm with w and b paired in the p(D) equation. A sketch of this matching step follows below.
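The matching step can be sketched with an off-the-shelf assignment solver. The probability matrix below is made up for illustration; in the actual training loop it would come from the current EM iterate.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_words_with_regions(word_region_probs):
    """Pair each word with one region via a minimum-cost bipartite matching.

    word_region_probs[i, j] is the model's current estimate of how likely
    word i goes with region j. Costs are negative log probabilities, so the
    minimum-cost assignment picks the jointly most probable pairing.
    """
    costs = -np.log(np.clip(word_region_probs, 1e-12, None))
    word_idx, region_idx = linear_sum_assignment(costs)
    return list(zip(word_idx, region_idx))

# Example: 3 words, 4 regions (rows are words, columns are regions).
probs = np.array([[0.6, 0.1, 0.1, 0.2],
                  [0.1, 0.7, 0.1, 0.1],
                  [0.2, 0.1, 0.1, 0.6]])
print(pair_words_with_regions(probs))   # [(0, 0), (1, 1), (2, 3)]
```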

Integrated Correspondence: NULL
- Sometimes a region has no corresponding word, or the number of words and regions differ.
- Assign NULL when even the annotation with the highest probability is still too low (but what about outliers?).
- A tendency for error: the NULL image region ends up generating every word, or every image region generates the NULL word.
- Words generated by the NULL image region, or regions that generate the NULL word, can be deleted.

EVALUATION

Experiment
- 160 CDs, each with 100 images on a specific subject.
- Words occurring fewer than 20 times in the test set were excluded, leaving a vocabulary of about 155 words.
- Data split: of the 160 CDs, 80 CDs are reserved as the novel held-out set; the images on the other 80 CDs are split into 75% training and 25% standard held-out.

Evaluating the model
Symbols: N = number of documents, q(w|B) = the computed predictive distribution, p(w) = the target distribution, K = number of words for the image.
- Annotation models are evaluated on both well-represented and poorly represented data.
- The correspondence models assume that poor annotation implies poor correspondence; otherwise correspondence would have to be graded manually. Remember that the simple correspondence model is based on the annotation model, but applied to individual blobs.
- Equation 1 is a KL-based measure: a negative E_KL means the model is worse than the empirical distribution, a positive value means it is better. Is the model actually "learning"? The empirical distribution is the word occurrence density that came with the training set, and the model's error is compared against it (see the sketch below).
- Unfortunately, p(w) is not known, so the actual words are assumed to be predicted uniformly and all other words not at all: p(w) = 1/K for observed (i.e., generated) words, and 0 otherwise.
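The evaluation equations are not reproduced in the transcript. One plausible form of the KL-based measure, consistent with "negative means worse than empirical, positive means better," is the average reduction in KL divergence relative to the empirical word distribution:

E_{KL} = \frac{1}{N} \sum_{d} \Big[ D_{KL}\big(p \,\|\, q_{\mathrm{emp}}\big) - D_{KL}\big(p \,\|\, q(\cdot \mid B_d)\big) \Big], \qquad D_{KL}(p \,\|\, q) = \sum_{w} p(w) \log \frac{p(w)}{q(w)}

with p(w) = 1/K on the K observed words and 0 elsewhere.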

Evaluating word prediction
Symbols: N = vocabulary size, n = number of words for the image, r = number of words predicted correctly, w = number of words predicted incorrectly.
- Equation 1, the normalized score, returns 0 if everything or nothing is predicted, 1 for predicting exactly the actual word set, and -1 for predicting the complement of the word set (see the reconstruction below).
- Equation 2: larger values are better.
- Ideally, a better-fitting model predicts words better; again, the comparison is against the empirical distribution.
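The formula is not reproduced in the transcript, but the normalized score consistent with the stated properties (0 for predicting everything or nothing, 1 for exactly the actual word set, -1 for its complement) is:

E_{NS} = \frac{r}{n} - \frac{w}{N - n}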

RESULTS

Annotation results
- Train the model using a subset of the training data, then use that model as the starting point for the next subset.
- Held-out set: most of the benefit comes within the first 10 iterations.
- The novel held-out data show an inability to generalize.
- It is better to simultaneously learn the models for blobs and their linkage to words.
This looks at how changing the training method (the number of iterations) changes model error. Four different annotation models were tested, but the plots looked mostly the same, so only one is shown. On the novel set, results decrease (error increases) as the number of iterations increases.

Normalized word prediction: refuse-to-predict level
- Designed to handle situations where an annotation does not mention an object that is present.
- Requires a minimum probability before a word is predicted: P = 10^(-X/10).
- The extremes correspond to predicting everything and predicting nothing.

Correspondence Results
- Discrete translation did the worst.
- Paired word-blob emission did better than the annotation-based methods.
- Making word emission depend on blobs performed the best.
(The example images on the slide were labeled "good annotation, bad correspondence", "good results", and "complete failure".)

CRITIQUE

Experimental Decisions
- Only ⅜ of the available data is used for training, with ½ of the total data set aside for novel testing.
- Correspondence performance is only approximated by annotation performance; there is no true evaluation of the correspondence results, i.e., no check of how well each individual image region was labeled.
- There is no absolute scale on which to compare errors between models or against future results. At what point do two errors differ enough to be considered a true difference? With no scale for E_KL, a difference in the thousandths place could be huge or negligible.
- Some correspondence errors are less bad than others ("cat" for a tiger versus "car" for a tiger), but no evaluation distinguishes them.
- A small vocabulary of 155 words means limited applications even with good results.

Questionable Evaluation
- p(w), the target distribution, is unknown, so the paper assumes p(w) = 1/K for observed words and p(w) = 0 for all other words.
- But what is log(0)? Assigning zero probability produces undefined logarithm terms in the evaluation.
- This issue could have been avoided by smoothing p(w) (one possible scheme is sketched below).
- Ideally, a better-fitting model predicts words better; again, the comparison is against the empirical distribution.
Image from: https://courses.lumenlearning.com/waymakercollegealgebra/chapter/characteristics-of-logarithmic-functions/
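One simple way to realize the smoothing suggestion (the specific scheme here is an assumption, not something the slides or the paper specify) is to reserve a small mass ε for the unobserved words so that no probability is exactly zero:

\tilde{p}(w) = \frac{1 - \epsilon}{K} \ \text{for the } K \text{ observed words}, \qquad \tilde{p}(w) = \frac{\epsilon}{N - K} \ \text{for the remaining } N - K \text{ vocabulary words}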

Future Research
- Moving from unsupervised input to a semi-supervised model.
- Research into evaluation methods which do not require manual checking of labeled images.
- More robust datasets for word/image matching.

Q&A