Inference Network Approach to Image Retrieval
Don Metzler, R. Manmatha
Center for Intelligent Information Retrieval
University of Massachusetts, Amherst

Motivation
- Most image retrieval systems assume:
  - An implicit “AND” between query terms
  - Equal weight for all query terms
  - A query made up of a single representation (keywords or an image)
- Example: “tiger grass” => “find images of tigers AND grass, where each is equally important”
- How can we search with queries made up of both keywords and images?
- How do we perform the following queries?
  - “swimmers OR jets”
  - “tiger AND grass, with more emphasis on tigers than grass”
  - “find me images of birds that are similar to this image”

Related Work
- Inference networks
- Semantic image retrieval
- Kernel methods

Inference Networks
- Inference Network Framework [Turtle and Croft ’89]
  - Formal information retrieval framework
  - Used in the INQUERY search engine
  - Allows structured queries (phrases, term weighting, synonyms, etc.), for example:
    #wsum( 2.0 #phrase ( image retrieval ) 1.0 model )
  - Handles multiple document representations (full text, abstracts, etc.)
- MIRROR [deVries ’98]
  - General multimedia retrieval framework based on the inference network framework
  - Probabilities based on clustering of metadata + feature vectors

Image Retrieval / Annotation
- Co-occurrence model [Mori, et al]
- Translation model [Duygulu, et al]
- Correspondence LDA [Blei and Jordan]
- Relevance model-based approaches
  - Cross-Media Relevance Models (CMRM) [Jeon, et al]
  - Continuous Relevance Models (CRM) [Lavrenko, et al]

Goals
- Input
  - Set of annotated training images
  - User’s information need
    - Terms
    - Images
    - “Soft” Boolean operators (AND, OR, NOT)
    - Weights
  - Set of test images with no annotations
- Output
  - Ranked list of test images relevant to the user’s information need

Data
- Corel data set †
  - 4500 training images (annotated)
  - 500 test images
  - 374-word vocabulary
- Each image automatically segmented using normalized cuts
  - Each image represented as a set of representation vectors
  - 36 geometric, color, and texture features
  - Same features used in similar past work
† Available at: http://vision.cs.arizona.edu/kobus/research/data/eccv_2002/

Features
- Geometric (6)
  - area
  - position (2)
  - boundary/area
  - convexity
  - moment of inertia
- Color (18)
  - avg. RGB x 2 (6)
  - std. dev. of RGB (3)
  - avg. L*a*b x 2 (6)
  - std. dev. of L*a*b (3)
- Texture (12)
  - mean oriented energy, 30 deg. increments (12)

Image representation
[Figure: example image with the annotation “cat, grass, tiger, water”]
- Annotation vector (binary, same for each segment)
- Representation vector (real-valued, one per image segment)

Image Inference Network
- J: representation vectors for the image (continuous, observed)
- q_w: word w appears in the annotation (binary, hidden)
- q_r: representation vector r describes the image (binary, hidden)
- q_op: query operator is satisfied (binary, hidden)
- I: user’s information need is satisfied (binary, hidden)
[Figure: network diagram with nodes J, q_r1 … q_rk, q_w1 … q_wk, q_op1, q_op2, and I. The “Image Network” is fixed (based on the image); the “Query Network” is dynamic (based on the query).]

Example Instantiation
[Figure: network instantiated with the operator nodes #or and #and and the term nodes tiger and grass]

What needs to be estimated?
- P(q_w | J)
- P(q_r | J)
- P(q_op | J)
- P(I | J)
[Figure: the inference network with nodes J, q_r1 … q_rk, q_w1 … q_wk, q_op1, q_op2, and I]

P(q_w | J) [ e.g., P( tiger | [image] ) ]
- Probability that term w appears in the annotation, given image J
- Apply Bayes’ Rule and use non-parametric density estimation
- Assumes the representation vectors are conditionally independent given that term w annotates the image
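The equation on this slide did not survive in the transcript. A hedged reconstruction from the bullets (Bayes' rule, then conditional independence of the image's representation vectors r_1, …, r_n given q_w) might look like the following. The symbols n, V (the vocabulary), and the prior P(q_w) are assumptions introduced here, and the normalization over the vocabulary is only suggested by the later per-image tables, whose probabilities sum to roughly 1.

```latex
P(q_w \mid J)
  = \frac{P(J \mid q_w)\, P(q_w)}{P(J)}
  \approx \frac{P(q_w) \prod_{i=1}^{n} P(r_i \mid q_w)}
               {\sum_{w' \in V} P(q_{w'}) \prod_{i=1}^{n} P(r_i \mid q_{w'})}
```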

How can we compute P(r_i | q_w)?
[Figure: training set representation vectors, with the representation vectors associated with images annotated by w marked; the plot labels an area of high likelihood and an area of low likelihood]

P(q_w | J) [final form]
- Σ assumed to be diagonal, estimated from training data
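A minimal sketch of how such an estimate could be computed, assuming Gaussian kernels with a shared diagonal covariance placed on every training vector that belongs to an image annotated with w. The kernel choice, the way the variance is estimated, the prior, the normalization over the vocabulary, and all function names are assumptions for illustration; the slide's own formula was not preserved.

```python
import numpy as np

def log_kernel(points, center, var):
    """Log of Gaussian kernels (diagonal covariance `var`) centered on each
    row of `points`, evaluated at `center`."""
    diff = points - center
    dim = points.shape[1]
    return -0.5 * (dim * np.log(2.0 * np.pi) + np.log(var).sum()
                   + (diff * diff / var).sum(axis=1))

def log_p_r_given_w(r, vectors_w, var):
    """log P(r | q_w): mean of the kernels on the training vectors of images
    annotated with w, computed with log-sum-exp for numerical stability."""
    logs = log_kernel(vectors_w, r, var)
    peak = logs.max()
    return peak + np.log(np.exp(logs - peak).mean())

def p_w_given_image(image_vectors, vectors_by_word, prior, var):
    """P(q_w | J) for every word, assuming the image's vectors are
    conditionally independent given w and normalizing over the vocabulary."""
    words = sorted(vectors_by_word)
    scores = np.array([
        np.log(prior[w])
        + sum(log_p_r_given_w(r, vectors_by_word[w], var) for r in image_vectors)
        for w in words
    ])
    probs = np.exp(scores - scores.max())
    return dict(zip(words, probs / probs.sum()))

# Toy data (hypothetical): 2-D vectors instead of the 36 features on the slides.
vectors_by_word = {
    "tiger": np.array([[0.0, 0.0], [0.2, -0.1], [-0.1, 0.3]]),
    "water": np.array([[5.0, 5.0], [4.8, 5.1], [5.2, 4.7]]),
}
all_vectors = np.vstack(list(vectors_by_word.values()))
var = all_vectors.var(axis=0)   # assumed estimate of the diagonal of Σ
prior = {"tiger": 0.5, "water": 0.5}
image = np.array([[0.1, 0.1], [0.0, 0.2]])
print(p_w_given_image(image, vectors_by_word, prior, var))
```

Working in log space matters here because the product over an image's representation vectors underflows quickly when each vector has 36 dimensions.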

Regularized estimates…
- P(q_w | J) estimates are good, but not comparable across images
Image 1:
term   P(q_w | J)
cat    0.45
grass  0.35
tiger  0.15
water  0.05
Image 2:
term   P(q_w | J)
cat    0.90
grass  0.05
tiger  0.01
water  0.03
- Is the 2nd image really 2x more “cat-like”?
- Probabilities are relative per image

Regularized estimates…
- Impact Transformations
  - Used in information retrieval
  - “Rank is more important than value” [Anh and Moffat]
- Idea:
  - rank each term according to P(q_w | J)
  - give higher probabilities to higher ranked terms
  - P(q_w | J) ≈ 1 / rank(q_w)
- Zipfian assumption on relevant words
  - a few words are very relevant
  - a medium number of words are somewhat relevant
  - many words are not relevant

Regularized estimates…
Image 1:
term   P(q_w | J)  1/rank
cat    0.45        0.48
grass  0.35        0.24
tiger  0.15        0.16
water  0.05        0.12
Image 2:
term   P(q_w | J)  1/rank
cat    0.90        0.48
grass  0.05        0.24
tiger  0.01        0.12
water  0.03        0.16
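The 1/rank columns can be reproduced with a short sketch. The values in the tables match 1/rank divided by the sum of 1/rank over the four terms (1 + 1/2 + 1/3 + 1/4), so the normalization step below is inferred from the numbers rather than stated on the slides. The function name is hypothetical, and ties are broken arbitrarily.

```python
def rank_regularize(probs):
    """Replace each estimate by 1/rank (rank 1 = highest probability),
    then normalize so the regularized values sum to 1."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    raw = {term: 1.0 / rank for rank, term in enumerate(ordered, start=1)}
    total = sum(raw.values())
    return {term: value / total for term, value in raw.items()}

image_1 = {"cat": 0.45, "grass": 0.35, "tiger": 0.15, "water": 0.05}
image_2 = {"cat": 0.90, "grass": 0.05, "tiger": 0.01, "water": 0.03}
for estimates in (image_1, image_2):
    regularized = rank_regularize(estimates)
    print({term: round(value, 2) for term, value in regularized.items()})
```

After the transformation both images give "cat" the same score, which is the point of the slide: only the ordering of terms within an image carries over, so scores become comparable across images.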

P(q_r | J) [ e.g., P( [image segment] | [image] ) ]
- Probability that representation vector r is observed, given J
- Use non-parametric density estimation again
- Impose a density over J’s representation vectors, just as in the previous case
- Estimates may be poor
  - Based on a small sample (~10 representation vectors)
  - Naïve and simple, yet somewhat effective

Model Comparison
- Relevance modeling-based
  - CMRM, CRM
  - General form:
- Fully non-parametric
  - Model used here
  - General form:

Query Operators
- “Soft” Boolean operators
  - #and / #wand (weighted and)
  - #or
  - #not
- One node is added to the query network for each operator present in the query
- Many others possible
  - #max, #sum, #wsum
  - #syn, #odn, #uwn, #phrase, etc.

#or( #and ( tiger grass ) )
[Figure: query network with the operator nodes #or and #and and the term nodes tiger and grass]

Operator Nodes
- Combine probabilities from term and image nodes
- Closed forms derived from corresponding link matrices
- Allows efficient inference within the network
- Par(q) = set of q’s parent nodes
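The closed-form expressions themselves were not preserved in the transcript. As a rough sketch, the forms commonly given for inference-network operators of this kind are shown below; treat them as assumptions rather than the exact equations from the slide. Each function takes the beliefs of the nodes in Par(q), and the function names and example beliefs are hypothetical.

```python
import math

def op_not(p):
    """#not: complement of a single parent belief."""
    return 1.0 - p

def op_and(parents):
    """#and: product of the parent beliefs."""
    return math.prod(parents)

def op_or(parents):
    """#or: one minus the probability that every parent is false."""
    return 1.0 - math.prod(1.0 - p for p in parents)

def op_wand(parents, weights):
    """#wand: weighted geometric combination, weights normalized to sum to 1."""
    total = sum(weights)
    return math.prod(p ** (w / total) for p, w in zip(parents, weights))

def op_wsum(parents, weights):
    """#wsum: weighted arithmetic mean of the parent beliefs."""
    return sum(w * p for p, w in zip(parents, weights)) / sum(weights)

# Hypothetical beliefs for one image.
beliefs = {"tiger": 0.8, "grass": 0.5}
pair = [beliefs["tiger"], beliefs["grass"]]
print("#and :", op_and(pair))
print("#or  :", op_or(pair))
print("#wand:", op_wand(pair, [2.0, 1.0]))  # more emphasis on tiger than grass
```

Because each operator is a simple function of its parents' beliefs, a query tree can be scored bottom-up in one pass per image, which is one way to read the slide's claim about efficient inference.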

… but where do they come from?
[Figure: node Q with parent nodes A and B]
Link matrix for Q:
A      B      P(Q = true | a, b)
false  false  0
false  true   0
true   false  0
true   true   1
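To illustrate how a closed form can fall out of a link matrix, the sketch below marginalizes over all parent assignments, assuming the parents are independent. With the AND-style link matrix (1 only when both parents are true) this reduces to the product of the parent beliefs. The OR matrix and all names are assumptions added for comparison.

```python
from itertools import product

def belief_from_link_matrix(link, parent_beliefs):
    """P(Q = true) = sum over parent assignments of
    P(Q = true | assignment) * product of P(parent takes that state)."""
    total = 0.0
    for states in product([False, True], repeat=len(parent_beliefs)):
        weight = 1.0
        for state, belief in zip(states, parent_beliefs):
            weight *= belief if state else 1.0 - belief
        total += link[states] * weight
    return total

# Keys are (A, B) assignments; values are P(Q = true | a, b).
AND_LINK = {(False, False): 0.0, (False, True): 0.0,
            (True, False): 0.0, (True, True): 1.0}
OR_LINK = {(False, False): 0.0, (False, True): 1.0,
           (True, False): 1.0, (True, True): 1.0}

p_a, p_b = 0.8, 0.5
print(belief_from_link_matrix(AND_LINK, [p_a, p_b]), p_a * p_b)
print(belief_from_link_matrix(OR_LINK, [p_a, p_b]), 1 - (1 - p_a) * (1 - p_b))
```

The explicit sum has 2^n terms for n parents, so the closed forms are what keep evaluation cheap for operators with many parents.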

Results - Annotation
Model                    Translation  CMRM  CRM   InfNet
# words with recall > 0  49           66    107   117
Results on full vocabulary:
Mean per-word recall     0.04         0.09  0.19  0.24
Mean per-word precision  0.06         0.10  0.16  0.17
F-measure                0.05         0.09  0.17  0.20
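The F-measure row appears consistent with the harmonic mean of the mean per-word precision and recall in the rows above it. That is an observation from the numbers, not something stated on the slide, and the short check below only verifies the arithmetic.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# (mean per-word recall, mean per-word precision) as reported in the table.
reported = {
    "Translation": (0.04, 0.06),
    "CMRM": (0.09, 0.10),
    "CRM": (0.19, 0.16),
    "InfNet": (0.24, 0.17),
}
for model, (recall, precision) in reported.items():
    print(model, round(f_measure(precision, recall), 2))
```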

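[Figure: three example images with their top five annotation words and probabilities]
- Image 1: foals (0.46), mare (0.33), horses (0.20), field (1.9E-5), grass (4.9E-6)
- Image 2: railroad (0.67), train (0.27), smoke (0.04), locomotive (0.01), ruins (1.7E-5)
- Image 3: sphinx (0.99), polar (5.0E-3), stone (1.0E-3), bear (9.7E-4), sculpture (6.0E-4)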

Results - Retrieval
Precision @ 5 retrieved images
Model       1 word  2 word  3 word
CMRM        0.1989  0.1306  0.1494
CRM         0.2480  0.1902  0.1888
InfNet      0.2525  0.1672  0.1727
InfNet-reg  0.2547  0.1964  0.2170
Mean Average Precision
Model       1 word  2 word  3 word
CMRM        0.1697  0.1642  0.2030
CRM         0.2353  0.2534  0.3152
InfNet      0.2484  0.2155  0.2478
InfNet-reg  0.2633  0.2649  0.3238
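For readers unfamiliar with the two measures, here is a small sketch using their usual definitions. The slides do not spell out how relevance was judged, so the ranked list and relevant set below are hypothetical, and the function names are introduced only for illustration.

```python
def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears,
    divided by the total number of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical ranking of six test images for one query.
ranked = ["img_a", "img_b", "img_c", "img_d", "img_e", "img_f"]
relevant = {"img_a", "img_c", "img_f"}
print(precision_at_k(ranked, relevant, k=5))
print(average_precision(ranked, relevant))
```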

Future Work
- Use rectangular segmentation and improved features
- Different probability estimates
  - Better methods for estimating P(q_r | J)
  - Use CRM to estimate P(q_w | J)
- Apply to documents with both text and images
- Develop a method/testbed for evaluating more “interesting” queries

Conclusions
- General, robust model based on the inference network framework
- Departure from the implied “AND” between query terms
- Unique non-parametric method for estimating network probabilities
- Pros
  - Retrieval (inference) is fast
  - Makes no assumptions about the distribution of the data
- Cons
  - Estimation of term probabilities is slow
  - Requires sufficient data to get a good estimate
