Formulating Semantic Image Annotation as a Supervised Learning Problem Gustavo Carneiro and Nuno Vasconcelos CVPR ‘05 Presentation by: Douglas Turnbull.


Formulating Semantic Image Annotation as a Supervised Learning Problem Gustavo Carneiro and Nuno Vasconcelos CVPR ‘05 Presentation by: Douglas Turnbull CSE Department, UCSD Topic in Vision and Learning November 3, 2005

What is Image Annotation? Given an image, what are the words that describe the image?

What is Image Retrieval? Given a database of images and a query string (e.g. words), what are the images that are described by the words? Query String: “jet”

Problem: Image Annotation & Retrieval Given the low cost of digital cameras and hard disk space, billions of consumers have the ability to create and store digital images. There are already billions of digital images stored on personal computers and in commercial databases. How do we store images in, and retrieve images from, a large database?

Problem: Image Annotation & Retrieval In general, people do not spend time labeling, organizing, or annotating their personal image collections. Label: Images are often stored with the name produced by the digital camera: –“DSC jpg” When they are labeled, they are given vague names that rarely describe the content of the image: –“GoodTimes.jpg”, “China05.txt” Organize: No standard scheme exists for filing images. Individuals use ad hoc methods: “Christmas2005Photos” and “Sailing_Photos” It is hard to merge image collections since the taxonomies (e.g. directory hierarchies) differ from user to user.

Problem: Image Annotation & Retrieval In general, people do not spend time labeling, organizing, or annotating their personal image collections. Annotate: Explicit Annotation: Rarely do we explicitly annotate our images with captions. –An exception is when we create web galleries, e.g. posting wedding photos online. Implicit Annotation: Sometimes we do implicitly annotate images when we embed them in text (as is the case with webpages). –Web-based search engines make use of this implicit annotation when they index images, e.g. Google Image Search, Picsearch.

Problem: Image Annotation & Retrieval If we can’t depend on human labeling, organization, or annotation, we will have to resort to “content-based image retrieval”: –We extract feature vectors from each image. –Based on these feature vectors, we use statistical models to characterize the relationship between a query and image features. How do we specify a meaningful query to be able to navigate this image feature space?

Problem: Image Annotation & Retrieval Content-Based Image Retrieval: How do we specify a query? Query-by-sketch: Sketch a picture, extract features from the sketch, and use the features to find similar images in the database. This requires that: 1. we have a good drawing interface handy, 2. everybody is able to draw, 3. the quick sketch is able to capture the salient nature of the desired query. Not a very feasible approach.

Problem: Image Annotation & Retrieval Content-Based Image Retrieval: How do we specify a query? Query-by-text: Input words into a statistical model that relates words to image features. This requires that: 1. we have a keyboard, 2. we have a statistical model that can relate words to image features, 3. words can be used to capture the salient nature of the desired query. A number of research systems have been developed that learn a relationship between content-based image features and text for the purpose of image annotation and retrieval. - Mori, Takahashi, Oka (1999) - Duygulu, Barnard, de Freitas (2002) - Blei, Jordan (2003) - Feng, Manmatha, Lavrenko (2004)

Outline Notation and Problem Statement Three General Approaches to Image Annotation 1.Supervised One vs. All (OVA) Models 2.Unsupervised Models using Latent Variables 3.Supervised M-ary Model Estimating P(image features|words) Experimental Setup and Results Automatic Music Annotation

Notation and Problem Statement

x_i = vector of image features for one region; x = {x_1, x_2, …} = vector of feature vectors. w_i = one word; w = {w_1, w_2, …} = vector of words. [Figures: image and caption; image regions]

Notation and Problem Statement


Image Regions Multiple Instance Learning: this region has no visual aspect of “jet”. Weak Labeling: this image depicts sky even though the caption does not contain “sky”.

Outline Notation and Problem Statement Three General Approaches to Image Annotation 1.Supervised One vs. All (OVA) Models 2.Unsupervised Models using Latent Variables 3.Supervised M-ary Model Estimating P(image features|words) Experimental Setup and Results Automatic Music Annotation

Supervised OVA Models Early research posed the problem as a supervised learning problem: train a classifier for each semantic concept. Binary Classification/Detection Problems: Holistic Concepts: landscape/cityscape, indoor/outdoor scenes. Object Detection: horses, buildings, trees, etc. Much of the early work focused on feature design and used existing models developed by the machine learning community (SVM, KNN, etc.) for classification.

Supervised OVA Models

Pro: Easy to implement. Can design features and tune the learning algorithm for each classification task. Notion of optimal performance on each task. Data sets represent a basis of comparison – e.g. OCR data sets. Con: Doesn’t scale well with a large vocabulary: requires training and using L classifiers. Hard to compare the posterior probabilities output by L separate classifiers, so there is no natural ranking of keywords. Weak labeling is a problem: images not labeled with a keyword are placed in the negative set D_0.

Unsupervised Models The goal is to estimate the joint distribution P(x, w). We introduce a latent (i.e. hidden) variable L that encodes S hidden states of the world, e.g. a “Sky” state or a “Jet” state. A state defines a joint distribution of image features and keywords, e.g. P(x = (blue, white, fuzzy), w = (“sky”, “cloud”, “blue”) | “Sky” state) will have high probability. We can sum over the S states to find the joint distribution. Learning is based on expectation maximization (EM): 1) E-step: update the strength of association between each image-caption pair and each state. 2) M-step: maximize the likelihood of the joint distribution for each state. Annotation involves finding the most probable words under the joint distribution model.
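The sum-over-states computation above can be sketched with a toy model. Everything below — the two state names, the discrete feature and word alphabets, and all the probability values — is invented for illustration, not taken from any of the cited systems:

```python
import numpy as np

# Toy latent-state model: 2 hidden states ("sky", "jet"),
# 3 discrete visual features, 4 keywords.
p_state = np.array([0.6, 0.4])            # P(L)
p_feat = np.array([[0.7, 0.2, 0.1],       # P(x | L = "sky")
                   [0.2, 0.3, 0.5]])      # P(x | L = "jet")
p_word = np.array([[0.5, 0.3, 0.1, 0.1],  # P(w | L = "sky")
                   [0.1, 0.1, 0.4, 0.4]]) # P(w | L = "jet")

def joint(x, w):
    """P(x, w) = sum_s P(s) P(x|s) P(w|s), assuming features and
    words are conditionally independent given the state."""
    return float(np.sum(p_state * p_feat[:, x] * p_word[:, w]))

def annotate(x):
    """P(w | x) for each keyword, given one observed feature x."""
    scores = np.array([joint(x, w) for w in range(4)])
    return scores / scores.sum()
```

Feature 0 is most likely under the “sky” state, so its posterior puts most mass on the sky-associated words; feature 2 favors the “jet” words.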

Unsupervised Models Multiple-Bernoulli Relevance Model (MBRM) – (Feng, Manmatha, Lavrenko CVPR ’04) The simplest unsupervised model, which achieves the best results. Each of the D images in the training set is a “not-so-hidden” state. Assume conditional independence between image features and keywords given the state. MBRM eliminates the need for EM since we don’t need to find the strength of association between image-caption pairs and states. Parameter estimation is straightforward: P(X|L) is estimated using a Gaussian kernel, and P(W|L) reduces to counting. The algorithm becomes essentially “smoothed k-nearest neighbor”.
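A minimal sketch of the “smoothed k-nearest neighbor” view: each training image contributes a Gaussian kernel density over features plus its caption words, and a new image is scored by density-weighted word counting. The bandwidth and data shapes are arbitrary choices, and MBRM’s actual multiple-Bernoulli word model is simplified here to plain counts:

```python
import numpy as np

def kernel_density(x, samples, h=1.0):
    """P(x | image d): average of Gaussian kernels of bandwidth h
    placed on each of image d's region feature vectors."""
    d = samples.shape[1]
    norm = (2 * np.pi) ** (d / 2) * h ** d
    diffs = (x - samples) / h
    return np.exp(-0.5 * (diffs ** 2).sum(axis=1)).mean() / norm

def annotate(x, training):
    """training: list of (region_features, caption_words) pairs,
    one per training image. Each image votes for its caption words
    with weight equal to its kernel density at x."""
    scores = {}
    for samples, words in training:
        p = kernel_density(x, samples)
        for w in words:
            scores[w] = scores.get(w, 0.0) + p
    return max(scores, key=scores.get)
```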

Unsupervised Models Pros: More scalable than Supervised OVA in the size of the vocabulary. Natural ranking of keywords. Weaker demands on the quality of labeling – robust to a weakly labeled dataset. Cons: No guarantees of optimality since keywords are not explicitly treated as classes. –Annotation: what is a good annotation? –Retrieval: what are the best images given a query string?

Supervised M-ary Model Critical idea: why introduce latent variables when a keyword directly represents a semantic class? Define a random variable W which takes values in {1,…,L} such that W = i if x is labeled with keyword w_i. The class-conditional distributions P_{X|W}(x|i) are estimated using the images that have keyword w_i. To annotate a new image with features x, the Bayes decision rule is invoked: i*(x) = argmax_i P_{X|W}(x|i) P_W(i). Unlike Supervised OVA, which consists of solving L binary decision problems, we are solving one decision problem with L classes. The keywords compete to represent the image features.
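The Bayes decision rule can be sketched as follows; the three keywords, their 1-D Gaussian class-conditional densities, and the priors are hypothetical stand-ins for the learned P_{X|W} densities:

```python
import numpy as np

# Hypothetical class-conditional densities P(x | w_i): one 1-D
# Gaussian per keyword, plus keyword priors P(w_i).
means = {"sky": 0.0, "jet": 5.0, "tiger": 10.0}
priors = {"sky": 0.5, "jet": 0.3, "tiger": 0.2}

def log_gaussian(x, mu, sigma=1.0):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def annotate(x):
    """One decision problem with L classes: all keywords compete,
    and we pick argmax_i P(x | w_i) P(w_i)."""
    scores = {w: log_gaussian(x, means[w]) + np.log(priors[w]) for w in means}
    return max(scores, key=scores.get)
```

Because every score is a posterior in the same classification problem, the values are directly comparable — the property the OVA setup lacks.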

Supervised M-ary Model Pros: Natural ranking of keywords – similar to unsupervised models; posterior probabilities are relative to the same classification problem. Does not require training of non-class models – the non-class models are the Y_i = 0 classes in Supervised OVA, and they are the computational bottleneck there. Robust to a weakly labeled data set, since images that contain a concept but are not labeled with the keyword do not adversely affect learning. Learning the density estimates P_{X|W}(x|i) is computationally equivalent to learning density estimates for each image in the MBRM model – it relies on the Mixture Hierarchies method (Vasconcelos ’01). When the vocabulary size (L) is smaller than the training set size (D), annotation is computationally more efficient than the most efficient unsupervised algorithm.

Outline Notation and Problem Statement Three General Approaches to Image Annotation 1.Supervised One vs. All (OVA) Models 2.Unsupervised Models using Latent Variables 3.Supervised M-ary Model Estimating P(image features|words) Experimental Setup and Results Automatic Music Annotation

Density Estimation For Supervised M-ary learning, we need to find the class-conditional density estimates P_{X|W}(x|i) using a training data set D_i – all the images in D_i have been labeled with w_i. Two questions: 1) Given that a number of the image regions from images in D_i will not exhibit visual properties that relate to w_i, can we even estimate these densities? e.g. an image labeled “jet” will have regions where only sky is present. 2) What is the “best” way to estimate these densities? –“best”: the estimate can be calculated using a computationally efficient algorithm. –“best”: the estimate is accurate and general.

Density Estimation Multiple Instance Learning: a bag of instances receives a label for the entire bag if one or more instances deserve that label. This makes the data noisy, but with enough averaging we can get a good density estimate. For example: 1. Suppose all images have three regions. 2. Every image annotated with “jet” has one region with jet-like features (e.g. mu = 20, sigma = 3). 3. The other two regions are uniformly distributed with mu ~ U(-100, 1000) and sigma ~ U(0.1, 10). 4. If we average 1000 images, the “jet” distribution emerges.
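The averaging argument in the example above can be checked with a small simulation. The region counts and distribution parameters follow the slide; the seed, histogram binning, and everything else are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n_images = 1000
regions = []
for _ in range(n_images):
    # one genuinely "jet" region per image
    regions.append(rng.normal(20, 3))
    # two noise regions, each drawn from its own random Gaussian
    for _ in range(2):
        mu = rng.uniform(-100, 1000)
        sigma = rng.uniform(0.1, 10)
        regions.append(rng.normal(mu, sigma))
regions = np.array(regions)

# Pool all 3000 regions: the noise spreads over ~1100 units while
# the 1000 jet regions pile up near 20, so the histogram peak
# recovers the "jet" mode despite 2/3 of the data being noise.
counts, edges = np.histogram(regions, bins=np.arange(-120, 1030, 5))
i = int(np.argmax(counts))
peak = (edges[i] + edges[i + 1]) / 2
```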

Density Estimation For word w_i, we have D_i images, each of which is represented by a vector of feature vectors. The authors discuss four methods of estimating P_{X|W}(x|i): 1. Direct Estimation 2. Model Averaging – a) Histograms, b) Naïve Averaging 3. Mixture Hierarchies

Density Estimation 1) Direct Estimation All feature vectors for all images represent one distribution. Need to do some heuristic smoothing, e.g. use a Gaussian kernel. Does not scale well with training set size or the number of vectors per image. [Figure: smoothed kNN density over two feature dimensions]

Density Estimation 2) Model Averaging Each image l in D_i represents an individual distribution. We average the image distributions to find one class distribution. The paper mentions two techniques: 1) Histograms – partition the space and count. Data sparsity problems for high-dimensional feature vectors. 2) Naïve Averaging using Mixture Models. Slow annotation time, since there will be KD Gaussians if each image mixture has K components. [Figures: histogram, smoothed kNN, and mixture densities over two feature dimensions]

Density Estimation 3) Mixture Hierarchies – (Vasconcelos 2001) Each image l in D_i represents an individual mixture of K Gaussian distributions. We combine “redundant” mixture components using EM: –E-step: compute a weight between each of the KD child components and the T parent components. –M-step: maximize the parameters of the T components using the weights. The final distribution is one mixture of T Gaussians for each keyword w_i, where T << KD.
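A rough sketch of the combination step, under a deliberate simplification: each child component is reduced to its mean plus its share of probability mass, and a T-component mixture is fit to those weighted points with ordinary weighted EM. The real hierarchical EM of Vasconcelos ’01 also propagates the child covariances; the 1-D synthetic data and all constants below are mine:

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend 50 per-image mixtures contributed 2 components each:
# 100 child means clustered around two "true" modes, 0 and 10,
# each child carrying an equal share of probability mass.
child_means = np.concatenate([rng.normal(0, 0.5, 50), rng.normal(10, 0.5, 50)])
child_w = np.full(100, 1 / 100)

# Weighted EM for a T-component Gaussian mixture over the child means.
T = 2
mu = np.array([1.0, 8.0])
var = np.ones(T)
pi = np.full(T, 1 / T)
for _ in range(50):
    # E-step: responsibility of parent t for each child component
    ll = -0.5 * ((child_means[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
    r = pi * np.exp(ll)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update parents, weighting children by their mass
    wr = r * child_w[:, None]
    nk = wr.sum(axis=0)
    mu = (wr * child_means[:, None]).sum(axis=0) / nk
    var = (wr * (child_means[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / nk.sum()
```

The T = 2 parents converge to the two underlying modes, illustrating how KD redundant components collapse to a compact class mixture.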

Outline Notation and Problem Statement Three General Approaches to Image Annotation 1.Supervised One vs. All (OVA) Models 2.Unsupervised Models using Latent Variables 3.Supervised M-ary Model Estimating P(image features|words) Experimental Setup and Results Automatic Music Annotation

Experimental Setup Corel Stock Photos Data Set 5,000 images – 4,500 for training, 500 for testing. Captions of 1-5 words per image from a vocabulary of L = 371 keywords. Image Features: –Convert from RGB to the YBR color space. –Compute the 8 x 8 discrete cosine transform (DCT) of each region and channel. –The result is a 3*64 = 192 dimensional feature vector for each image region. –The 64 lowest-frequency features are retained.
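A sketch of the feature computation for one 8×8 region, assuming an orthonormal DCT-II and one plausible low-frequency ordering (interleaving the three channels by frequency u+v); the paper's exact coefficient selection and the YBR conversion are not reproduced here:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis; C @ block @ C.T is the 2-D DCT."""
    j = np.arange(n)
    C = np.sqrt(2 / n) * np.cos(np.pi * (2 * j[None, :] + 1) * j[:, None] / (2 * n))
    C[0] /= np.sqrt(2)
    return C

def region_features(block, n_keep=64):
    """block: 8x8x3 region in a luminance/chrominance space.
    Computes 64 DCT coefficients per channel (3*64 = 192), orders
    them so the lowest frequencies (smallest u+v) come first, and
    keeps the first n_keep."""
    C = dct_matrix()
    coeffs = np.stack([C @ block[:, :, c] @ C.T for c in range(3)])  # (3, 8, 8)
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
    order = np.argsort((u + v).ravel(), kind="stable")
    feats = coeffs.reshape(3, 64)[:, order]  # low frequencies first
    return feats.T.ravel()[:n_keep]          # interleave channels, truncate
```

On a constant block the only nonzero coefficients are the three DC terms, which land at the front of the vector — a quick sanity check on the ordering.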

Experimental Setup Two (simplified) tasks: Annotation: given a new image, what are the best five words that describe the image? Retrieval: given a one-word query, what are the images that match the query? Evaluation Metrics: |w_H| – number of images that have been annotated with w by humans. |w_A| – number of images that have been automatically annotated with w. |w_C| – number of images that have been automatically annotated with w AND were annotated with w by humans. Recall = |w_C|/|w_H|, Precision = |w_C|/|w_A|. Mean Recall and Mean Precision are averaged over all the words found in the test set.
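The per-word metrics are a direct transcription of these definitions (the data representation — one (human_words, auto_words) set pair per test image — is my own choice):

```python
def word_recall_precision(w, annotations):
    """annotations: list of (human_words, auto_words) set pairs,
    one per test image. Returns (recall, precision) for keyword w."""
    wH = sum(1 for h, a in annotations if w in h)             # |w_H|
    wA = sum(1 for h, a in annotations if w in a)             # |w_A|
    wC = sum(1 for h, a in annotations if w in h and w in a)  # |w_C|
    recall = wC / wH if wH else 0.0
    precision = wC / wA if wA else 0.0
    return recall, precision
```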

Other Annotation Systems 1. Co-occurrence (1999) – Mori, Takahashi, Oka Early work that clusters sub-images (block-based decomposition) and counts word frequencies for each cluster. 2. Translation (2002) – Duygulu, Barnard, de Freitas, Forsyth –“Vocabulary of Blobs”: automatic segmentation -> feature vectors -> clustering -> Blobs. –An image is made of Blobs; words are associated with Blobs -> new caption. –“Blobs” are the latent states. [Figures: block-based decomposition vs. automatic segmentation]

Other Annotation Systems 3. CRM (2003) – Lavrenko, Manmatha, Jeon Continuous-space Relevance Model, a “smoothed kNN” algorithm. Image features are modeled using kernel-based densities over color, shape, and texture features from automatic image segmentation. Word features are modeled using a multinomial distribution. “Training images” are the latent states. 4. CRM-rect (2004) – Feng, Manmatha, Lavrenko Same as CRM but using block-based decomposition rather than segmentation. 5. MBRM (2004) – Feng, Manmatha, Lavrenko Multiple-Bernoulli Relevance Model. Same as CRM-rect but uses a multiple-Bernoulli distribution to model word features, which shifts the emphasis to the presence of a word rather than its prominence.

New Annotation Systems 6. CRM-rect-DCT (2005) – Carneiro, Vasconcelos CRM-rect with DCT features. 7. Mix-Hier (2005) – Carneiro, Vasconcelos Supervised M-ary learning, density estimation using Mixture Hierarchies, DCT features.

Annotation Results Examples of Image Annotations:

Annotation Results Performance of annotation systems on the Corel test set: 500 images, 260 keywords, generating 5 keywords per image. Recall = |w_C|/|w_H|, Precision = |w_C|/|w_A|. Gain of 16% in recall at the same or better level of precision. Gain of 12% in words with positive recall, i.e. words found in both the human and automatic annotations at least once.

Annotation Results Annotation computation time: MBRM is O(TR), where T is the training set size; Mix-Hier is O(CR), where C is the size of the vocabulary; R is the number of image regions per image. Unlike MBRM, Mix-Hier’s annotation time does not grow with the training set size. Complexity is measured in seconds to annotate a new image.

Retrieval Results First five ranked images for “mountain”, “pool”, “blooms”, and “tiger”

Retrieval Results Mean Average Precision For each word w_i, find all n_a,i images that have been automatically annotated with word w_i. Out of the n_a,i images, let n_c,i be the number of images that have been annotated with w_i by humans. The precision of w_i is n_c,i / n_a,i. If we have L words in our vocabulary, mean average precision is (1/L) * sum_i (n_c,i / n_a,i). Mix-Hier does 40% better on words with positive recall.
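Written out, with w ranging over a toy vocabulary (the data representation and variable names are mine):

```python
def mean_average_precision(vocabulary, annotations):
    """annotations: list of (human_words, auto_words) set pairs.
    Averages each word's retrieval precision n_c,i / n_a,i over the
    vocabulary; a word that is never auto-annotated contributes 0."""
    precisions = []
    for w in vocabulary:
        n_a = sum(1 for h, a in annotations if w in a)
        n_c = sum(1 for h, a in annotations if w in h and w in a)
        precisions.append(n_c / n_a if n_a else 0.0)
    return sum(precisions) / len(precisions)
```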

Outline Notation and Problem Statement Three General Approaches to Image Annotation 1.Supervised One vs. All (OVA) Models 2.Unsupervised Models using Latent Variables 3.Supervised M-ary Model Estimating P(image features|words) Experimental Setup and Results Automatic Music Annotation

Automatic Music Annotation Annotation: given a song, what are the words that describe the music? –Automatic music reviews. Retrieval: given a text query, what are the songs that are best described by the query? –Song recommendation, playlist generation, music retrieval. Feature extraction involves applying filters to digital audio signals; Fourier, wavelet, and gammatone are common filterbank transforms. Music may be “more difficult” to annotate since music is inherently subjective: –Music evokes different thoughts and feelings in different listeners. –An individual’s experience with music changes all the time. –All music is art, unlike most digital images: the Corel data set consists of concrete “object” and “landscape” scenes, while a comparable image dataset might have to focus on modern art (Pollock, Mondrian, Dali).

Automatic Music Annotation Computer Hearing (aka Machine Listening, Computer Audition): Music is one subdomain of sound –Sound effects, human speech, animal vocalizations, and environmental sounds all represent other subdomains of sound. Annotation is one problem –Query-by-humming, audio monitoring, sound segmentation, and speech-to-text are examples of other computer hearing problems.

Automatic Music Annotation Computer Hearing and Computer Vision are closely related: 1. Large public and private databases exist that are rapidly growing in size. 2. Digital medium: sound is 2D – intensity (amplitude) & time, or frequency & magnitude – and is often represented in 3D – magnitude, time, and frequency. An image is 3D – 2 spatial dimensions and an intensity (color). Video is 4D – 2 spatial dimensions, an intensity, and time. 3. Video is comprised of both images and sound. 4. Feature extraction techniques are similar: applying filters to a digital medium.

Works Cited: Carneiro, Vasconcelos. “Formulating Semantic Image Annotation as a Supervised Learning Problem” (CVPR ’05). Vasconcelos. “Image Indexing with Mixture Hierarchies” (CVPR ’01). Feng, Manmatha, Lavrenko. “Multiple Bernoulli Relevance Models for Image and Video Annotation” (CVPR ’04). Blei, Jordan. “Modeling Annotated Data” (SIGIR ’03).