Object Recognition as Machine Translation / Matching Words and Pictures
Heather Dunlop
16-721: Advanced Perception
April 17, 2006

Machine Translation
Altavista’s Babel Fish:
– There are three more weeks of classes!
– Il y a seulement trois semaines supplémentaires de classes!
– ¡Hay solamente tres más semanas de clases!
– Ci sono soltanto tre nuove settimane dei codici categoria!
– Es gibt nur drei weitere Wochen Kategorien!

Statistical Machine Translation
– Statistically link words in one language to words in another
– Requires aligned bitext, e.g. the Hansard of the Canadian parliament

Statistical Machine Translation
– Assuming an unknown one-to-one correspondence between words, come up with a joint probability distribution linking words in the two languages
– Missing data problem: solution is EM
  – Given the translation probabilities, estimate the correspondences
  – Given the correspondences, estimate the translation probabilities

Multimedia Translation
– Data: words are associated with images, but correspondences are unknown
[Figure: image annotated with the words sun, sea, sky]

Auto-Annotation
– Predicting words for the images
[Figure: image with the predicted words tiger, grass, cat]

Region Naming
– Can also be applied to object recognition
– Requires a large data set

Browsing

Auto-Illustration Moby Dick

Data Sets of Annotated Images
– Corel data set
– Museum image collections
– News photos (with captions)

First Paper
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary, by Pinar Duygulu, Kobus Barnard, Nando de Freitas, David Forsyth
– A simple model for annotation and correspondence

Overview

Input Representation
– Segment with Normalized Cuts
– Only use regions larger than a threshold (typically 5-10 per image)
– Form vector representation of each region
– Cluster regions with k-means to form blob tokens
[Figure: blob tokens alongside the word tokens sun, sky, waves, sea]

Input Representation
Represent each region with a feature vector:
– Size: portion of the image covered by the region
– Position: coordinates of center of mass
– Color: avg. and std. dev. of (R,G,B), (L,a,b) and (r=R/(R+G+B), g=G/(R+G+B))
– Texture: avg. and variance of 16 filter responses
– Shape: area / perimeter^2, moment of inertia, region area / area of convex hull
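The tokenization step (region feature vectors clustered with k-means into blob tokens) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `nearest_centroid` and `kmeans_blob_tokens`, the random initialisation, and the toy 2-D features are all assumptions made for the example. Real region vectors would hold the size, position, color, texture and shape entries listed above.

```python
import numpy as np

def nearest_centroid(features, centroids):
    """Index of the closest centroid (squared Euclidean distance) for each row."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_blob_tokens(features, k, n_iter=50, seed=0):
    """Plain k-means. Returns (centroids, blob token index per region)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Hypothetical initialisation: k distinct regions picked at random.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        tokens = nearest_centroid(features, centroids)
        updated = centroids.copy()
        for c in range(k):
            members = features[tokens == c]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                updated[c] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Recompute assignments so the tokens match the returned centroids.
    return centroids, nearest_centroid(features, centroids)

# Toy usage: four regions forming two obvious groups.
regions = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, tokens = kmeans_blob_tokens(regions, k=2)
```

Each region is then represented only by its token index, which plays the role of a "word" in the image-side vocabulary.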

Tokenization

Assignments
– Each word is predicted with some probability by each blob

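Expectation Maximization
– Select word with highest probability to assign to each blob
– Likelihood: p(w|b) = \prod_{n=1}^{N} \prod_{j=1}^{M_n} \sum_{i=1}^{L_n} p(a_{nj}=i) \, t(w=w_{nj} | b=b_{ni})
  – p(a_{nj}=i): probability that blob b_{ni} translates to word w_{nj}
  – t(w=w_{nj} | b=b_{ni}): probability of obtaining word w_{nj} given an instance of blob b_{ni}
  – N = # of images, M_n = # of words in image n, L_n = # of blobs in image n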

Expectation Maximization
– Initialize the translation probabilities to blob-word co-occurrences
– Iterate:
  – Given the translation probabilities, estimate the correspondences
  – Given the correspondences, estimate the translation probabilities
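The EM loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's code: it assumes a uniform prior over which blob in an image produced each word, stores the lexicon as a table `t[b, w]` approximating p(word w | blob b), and the names `em_lexicon` and `normalise_rows` are hypothetical.

```python
import numpy as np

def normalise_rows(m):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, sums, out=np.zeros_like(m), where=sums > 0)

def em_lexicon(images, n_blobs, n_words, n_iter=20):
    """images: list of (blob_ids, word_ids) pairs, one pair per training image."""
    # Initialise to blob-word co-occurrence counts.
    t = np.zeros((n_blobs, n_words))
    for blobs, words in images:
        for b in blobs:
            for w in words:
                t[b, w] += 1.0
    t = normalise_rows(t)
    for _ in range(n_iter):
        counts = np.zeros_like(t)
        for blobs, words in images:
            for w in words:
                # E-step: given t, estimate which blob in this image produced word w.
                post = np.array([t[b, w] for b in blobs])
                post = post / post.sum()
                # Accumulate expected blob-word correspondences.
                for b, p in zip(blobs, post):
                    counts[b, w] += p
        # M-step: given the soft correspondences, re-estimate the translation table.
        t = normalise_rows(counts)
    return t

# Toy usage: blob 0 appears alone with word 0, which disambiguates the first image.
toy_images = [([0, 1], [0, 1]), ([0], [0])]
t = em_lexicon(toy_images, n_blobs=2, n_words=2)
```

In the toy data the second image contains only blob 0 and word 0, so EM shifts probability mass toward the pairing blob 0 / word 0 and, by elimination, blob 1 / word 1.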

Word Prediction
On a new image:
– Segment
– For each region:
  – Extract features
  – Find the corresponding blob token using nearest neighbor
  – Use the word posterior probabilities to predict words
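A minimal sketch of the prediction steps for an already segmented image, assuming region features, k-means centroids and a lexicon table `t[b, w]` approximating p(word | blob) are available. Segmentation and feature extraction are out of scope here, and `predict_words` along with the toy values are hypothetical.

```python
import numpy as np

def predict_words(region_features, centroids, t, vocab):
    """Map each region to its nearest blob token, then to that blob's best word."""
    region_features = np.asarray(region_features, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Nearest-neighbour lookup of the blob token for every region.
    dists = ((region_features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    blobs = dists.argmin(axis=1)
    # Pick the word with the highest posterior probability for each blob.
    return [vocab[int(np.argmax(t[b]))] for b in blobs]

# Toy usage with two blob tokens and a two-word vocabulary.
toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_t = np.array([[0.9, 0.1], [0.2, 0.8]])
words = predict_words([[1.0, 1.0], [9.0, 8.0]], toy_centroids, toy_t, ["sky", "grass"])
```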

Refusing to Predict
– Require: p(word|blob) > threshold
  – i.e. assign a null word to any blob whose best predicted word lies below the threshold
– This prunes the vocabulary, so fit a new lexicon
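The thresholding rule can be illustrated as below. The threshold value, the `NULL_WORD` marker and both function names are assumptions for the sketch, and refitting the lexicon on the pruned vocabulary is not shown.

```python
NULL_WORD = "<null>"  # hypothetical marker for "refuse to predict"

def predict_or_refuse(word_probs, vocab, threshold):
    """Best word for one blob, or NULL_WORD if its probability is not above threshold."""
    best = max(range(len(vocab)), key=lambda w: word_probs[w])
    if word_probs[best] > threshold:
        return vocab[best]
    return NULL_WORD

def pruned_vocabulary(t, vocab, threshold):
    """Words that at least one blob still predicts once the threshold is applied."""
    kept = {predict_or_refuse(row, vocab, threshold) for row in t}
    kept.discard(NULL_WORD)
    return sorted(kept)

# Toy usage: the middle blob has no confident word, so "water" drops out.
toy_vocab = ["sky", "water", "grass"]
toy_t = [[0.7, 0.2, 0.1], [0.34, 0.33, 0.33], [0.1, 0.1, 0.8]]
predictions = [predict_or_refuse(row, toy_vocab, 0.5) for row in toy_t]
kept_words = pruned_vocabulary(toy_t, toy_vocab, 0.5)
```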

Indistinguishable Words
– Visually indistinguishable: cat and tiger, train and locomotive
– Indistinguishable with our features: eagle and jet
– Entangled correspondence: polar – bear, mare/foals – horse
– Solution: cluster similar words
  – Obtain similarity matrix
  – Compare words with symmetrised KL divergence
  – Apply N-Cuts on matrix to get clusters
  – Replace word with its cluster label
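As a rough sketch of the word-comparison step, the snippet below computes a symmetrised KL divergence between every pair of words. It assumes each word is represented by a probability distribution over blob tokens (the slide does not say which distribution is compared), adds a small `eps` for numerical safety, and stops at the distance matrix. Turning distances into similarities and running N-Cuts are left out.

```python
import numpy as np

def symmetrised_kl(p, q, eps=1e-12):
    """KL(p || q) + KL(q || p) for two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def word_distance_matrix(word_dists):
    """Pairwise symmetrised KL between the rows of word_dists."""
    n = len(word_dists)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = symmetrised_kl(word_dists[i], word_dists[j])
    return d

# Toy usage: words 0 and 1 have similar blob distributions, word 2 does not.
toy_word_dists = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
dist = word_distance_matrix(toy_word_dists)
```

Words with a small mutual distance (such as the first two rows here) are the candidates for merging into one cluster label.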

Experiments
– Train with 4500 Corel images
  – 4-5 words for each image
  – 371 words in vocabulary
  – 5-10 regions per image
  – 500 blobs
– Test on 500 images

Auto-Annotation
– Determine the most likely word for each blob
– If the probability of the word is greater than some threshold, use it in the annotation

Measuring Performance Do we predict the right words?

Region Naming / Correspondence

Measuring Performance
– Do we predict the right words?
– Are they on the right blobs?
  – Difficult to measure because the data set contains no correspondence information
  – Must be done by hand on a smaller data set
  – Not practical to count false negatives

Successful Results

Unsuccessful Results

Refusing to Predict

Clustering

Merging Regions

Results
– light bar = average number of times blob predicts word in correct place
– dark bar = average number of times blob predicts word which is in the image

Second Paper
Matching Words and Pictures, by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, Michael I. Jordan
– Comparing lots of different models for annotation and correspondence

Annotation Models
– Multi-modal hierarchical aspect models
– Mixture of multi-modal LDA

Multi-Modal Hierarchical Aspect Model
– cluster = a path from a leaf to the root

Multi-Modal Hierarchical Aspect Model
– All observations are produced independently of one another
– I-0: as above
– I-1: cluster dependent level structure
  – p(l|d) replaced with p(l|c,d)
– I-2: generative model
  – p(l|d) replaced with p(l|c)
  – allows prediction for documents not in training set
[Equation labels: document, observations, clusters, levels, normalization, Gaussian, frequency tables]

Multi-Modal Hierarchical Aspect Model
– Model fitting is done with EM
– Word prediction:
[Equation label: set of observed blobs]

Mixture of Multi-Modal LDA
[Graphical model labels: multinomial, Dirichlet, multinomial, multivariate Gaussian, mixture component and hidden factor]

Mixture of Multi-Modal LDA
– Distribution parameters estimated with EM
– Word prediction:
[Equation labels: posterior over mixture components, posterior Dirichlet]

Correspondence Models
– Discrete translation
– Hierarchical clustering
– Linking word and region emission probabilities
– Paired word and region emission

Discrete Translation
– Similar to first paper
– Use k-means to vector-quantize the set of features representing an image region
– Construct a joint probability table linking word tokens to blob tokens
– Data set doesn’t provide explicit correspondences
  – Missing data problem => EM

Hierarchical Clustering
– Again, using vector-quantized image regions
– Word prediction:

Linking Word and Region Emission
– Words emitted conditioned on observed blobs
– D-0: as above (D for dependent)
– D-1: cluster dependent level distributions
  – Replace p(l|c,d) with p(l|d)
– D-2: generative model
  – Replace p(l|d) with p(l)
[Diagram labels: B, U, W]

Paired Word and Region Emission at Nodes
– Observed words and regions are emitted in pairs: D={(w,b)}
– C-0: as above (C for correspondence)
– C-1: cluster dependent level structure
  – p(l|d) replaced with p(l|c,d)
– C-2: generative model
  – p(l|d) replaced with p(l|c)

Wow, That’s a Lot of Models!
– Multi-modal hierarchical: I-0, I-1, I-2
– Multi-modal LDA
– Discrete translation
– Hierarchical clustering
– Linked word and region emission: D-0, D-1, D-2
– Paired word and region emission: C-0, C-1, C-2
– Count = 12
– Why so many?

Evaluation Methods
Annotation performance measures:
– KL divergence between predicted and target distributions
– Word prediction measure
  – n = # of words in image
  – r = # of words predicted correctly
  – # of words predicted is set to # of actual keywords
– Normalized classification score
  – w = # of words predicted incorrectly
  – N = vocabulary size
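The formulas for these measures are not part of the transcript. The sketch below assumes the commonly cited definitions: word prediction measure r/n, normalized classification score r/n - w/(N - n), and KL divergence summed as target * log(target / predicted). Treat all three as assumptions to be checked against the paper; the function names are hypothetical.

```python
import math

def kl_divergence(target, predicted, eps=1e-12):
    """Sum over words of target * log(target / predicted); eps guards against log(0)."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(target, predicted) if p > 0)

def word_prediction_measure(actual, predicted):
    """r / n: fraction of the image's actual words that were predicted."""
    actual, predicted = set(actual), set(predicted)
    return len(actual & predicted) / len(actual)

def normalised_classification_score(actual, predicted, vocab_size):
    """r / n - w / (N - n), with w the number of wrongly predicted words."""
    actual, predicted = set(actual), set(predicted)
    n = len(actual)
    r = len(actual & predicted)
    w = len(predicted - actual)
    return r / n - w / (vocab_size - n)

# Toy usage: 4 actual keywords, 4 predictions, 2 of them correct, vocabulary of 10.
actual_words = ["tiger", "grass", "cat", "water"]
predicted_words = ["tiger", "grass", "sky", "sun"]
pr = word_prediction_measure(actual_words, predicted_words)
ns = normalised_classification_score(actual_words, predicted_words, vocab_size=10)
```

Under these definitions the normalized score is 1 for a perfect prediction, 0 for predicting every word in the vocabulary, and -1 for predicting exactly the wrong words.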

Results
– Methods using clustering are very reliant on having images that are close to the training data
– MoM-LDA has strong resistance to over-fitting
– D-0 (linked word and region emission) appears to give best results, taking all measures and data sets into consideration

Successful Results

Unsuccessful Results
– good annotation, poor correspondence
– complete failure

N-cuts vs. Blobworld
[Figure panels: Normalized Cuts, Blobworld]

N-cuts vs. Blobworld

Browsing Results
– Clustering by text only
– Clustering by image features only

Browsing Results
– Clustering by both text and image features

Search Results
– query: tiger, river
[Figure: retrieved images, labelled with the following word sets]
– tiger, cat, water, grass
– tiger, cat, grass, trees
– tiger, cat, water, grass
– tiger, cat, grass, forest
– tiger, cat, water, grass

Auto-Illustration Results
– Passage from Moby Dick:
  “The large importance attached to the harpooneer's vocation is evinced by the fact, that originally in the old Dutch Fishery, two centuries and more ago, the command of a whale-ship!…”
– Words extracted from the passage using natural language processing tools:
  large importance attached fact old dutch century more command whale ship was per son was divided officer word means fat cutter time made days was general vessel whale hunting concern british title old dutch official present rank such more good american officer boat night watch ground command ship deck grand political sea men mast

Auto-Illustration Results
– Top-ranked images retrieved using all extracted words:

Conclusions
– Lots of different models developed
  – Hard to tell which is best
– Can be used with any set of features
– Numerous applications:
  – Auto-annotation
  – Region naming (aka object recognition)
  – Browsing
  – Searching
  – Auto-illustration
– Improvements in translation from visual to semantic representations lead to improvements in image access