Semantics of words and images Presented by Gal Zehavi & Ilan Gendelman
What is semantics? Semantics – “a branch of philosophy dealing with the relations between signs and what they refer to (their meaning)” (Webster)
Content 1) Motivation
Motivation Object recognition, better segment aggregation, image reconstruction (example images: waterfall, bear, fish)
Motivation More motivation… Data mining: Content-Based Image Retrieval (CBIR) – higher precision / higher recall / quicker search. Applications: Biomedicine (X-ray, pathology, CT, MRI, …), Government (radar, aerial, trademark, …), Commercial (fashion catalogs, journalism, …), Cultural (museums, art galleries, …), Education and training, Entertainment, WWW (100 billion images?!), …
Motivation Even more motivation… Auto-illustrator (e.g., illustrating "Moby Dick") and auto-annotator (example keywords: ocean, helicopter, shark, man, bridge, mountain)
Content 1) Motivation 2)Introduction to Semantics
Introduction to Semantics Different aspects of semantics…
Introduction to Semantics Specific Objects The elephant played chess
Introduction to Semantics Family of Objects The tree stood alone on the hill. This car is as fast as the wind.
Introduction to Semantics Scenarios "A couple on the beach" – easy to imagine (statistically clear), but the variety of matching images is large…
Introduction to Semantics Semantics from Context The captain was on the bridge.
Introduction to Semantics Abstract Semantics Vacation Strength Experience ?
Content 1) Motivation 2) Introduction to Semantics 3)Difficulties
Difficulties Difficulties in Text–Image Semantics
Ambiguity: the same keywords can describe very different images (e.g., corals; couple, beach, sunset).
Level of abstraction: the same image can be annotated at many levels, e.g., men, bottle, liquid / drivers, car race, champagne / competitors, race, alcohol / M. Schumacher, Formula 1, Dom-Perignon / celebration, winning, happiness.
Content 1) Motivation 2) Introduction to Semantics 3) Difficulties 4)Possible Approaches
Possible Approaches
Search by segment features / image feature similarity: + low complexity; – misses real image semantics.
Semantics through user interaction: + refined visualisation; + higher level of abstraction; – a complex user interface is required.
Query by example: + relates to the user's visualisation; – misses real image semantics.
Searching images by text only: + uses existing text-search infrastructure; + no image-specific processing; – images must appear in a textual context; – misses real image semantics and features.
Content 1) Motivation 2) Introduction to Semantics 3) Difficulties 4) Possible Approaches 5)Models
Models
Object recognition as machine translation – "Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary", P. Duygulu, K. Barnard, N. de Freitas and D. Forsyth (2002).
Learning semantics by hierarchical clustering of words and image regions – "Learning the Semantics of Words and Pictures", K. Barnard, D. Forsyth (2001); "Clustering Art", K. Barnard, P. Duygulu, D. Forsyth (2001).
Content 1) Motivation 2) Introduction to Semantics 3) Difficulties 4) Possible Approaches 5) Models – Model #1: Object recognition as machine translation
Model #1 Object recognition as machine translation Description: learning a lexicon for a fixed image vocabulary
Model #1 Our goal is to describe a scene using the lexicon we have learned (e.g., labelling regions as sand, sky, sea, mountain/forest, rock).
Model #1 How do we do it? By applying a method similar to statistical machine translation: learning word correspondences from many aligned sentence pairs and accumulating co-occurrence statistics.
Model #1 The blob notion First we segment the image into regions using a min-cut segmentation method.
Model #1 Assigning Regions to Blobs How do we assign a region to a blob? Eliminate regions smaller than a threshold; define a set of features; discretise each feature distribution; cluster the resulting finite-dimensional feature vectors; each cluster = a blob.
Model #1 In the article's experiment we discretise the 33 different features using the k-means algorithm; the following features are included: region color, convexity, standard deviation, first moment, region orientation energy, region size, location (see the sketch below).
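The slides do not reproduce the full 33-feature vector; the following is a minimal Python sketch, assuming NumPy, an RGB image array, and a boolean region mask, of how a few of the listed features (region color, standard deviation, size, location) might be computed before discretization:

```python
import numpy as np

def region_features(image, mask):
    """Compute a small illustrative feature vector for one region.

    image: H x W x 3 float array (RGB); mask: H x W boolean array.
    Only a few of the 33 features mentioned in the paper are sketched here.
    """
    pixels = image[mask]                      # N x 3 array of region pixels
    ys, xs = np.nonzero(mask)

    mean_color = pixels.mean(axis=0)          # region color (3 values)
    std_color = pixels.std(axis=0)            # standard deviation (3 values)
    size = mask.sum() / mask.size             # relative region size
    location = [ys.mean() / mask.shape[0],    # normalized centroid (row, col)
                xs.mean() / mask.shape[1]]

    return np.concatenate([mean_color, std_color, [size], location])
```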
Model #1 To apply the method to images we need to discretise the image data. Using the Corel database with 371 words and 4500 images, each with 5 to 10 segments, ~35,000 segments are clustered into 500 blobs.
Model #1 The k-means algorithm Fits the best k centres to a continuous histogram: given a distribution of n-dimensional vectors, we can form k clusters.
Model #1 K-means Input: continuous data; output: discrete (clustered) data. Acknowledgments to Andrew W. Moore, Carnegie Mellon University
Model #1 K-means Iterative algorithm: choose k (e.g. 5); randomly guess k center locations; then iterate: each data point "belongs" to its nearest center; each center finds the centroid of its points; the centroid defines the new center. Acknowledgments to Andrew W. Moore, Carnegie Mellon University
Model #1 K-means An example Acknowledgments to Andrew W. Moore, Carnegie Mellon University
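A minimal k-means sketch following the iterative steps above (random centers, point assignment, centroid update); the function name and convergence check are choices of this sketch, and in practice a library implementation such as scikit-learn's KMeans would typically be used:

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Cluster points (N x D array) into k clusters; returns centers and labels."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]  # random guess

    for _ in range(n_iters):
        # each data point "belongs" to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # each center moves to the centroid of its points
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Usage idea: turning ~35,000 region feature vectors into 500 blobs
# centers, blob_ids = kmeans(feature_vectors, k=500)
```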
Model #1 Now for every image we have a set of blobs matched with a set of words, so the translation can begin.
Model #1 Notation
The set of words: w^n = (w^n_1, w^n_2, ..., w^n_m), where m is the length of the word string.
The set of blobs: b^n = (b^n_1, b^n_2, ..., b^n_l), where l is the length of the blob string.
The alignment: a^n = (a^n_1, a^n_2, ..., a^n_m), where n indexes the n-th image.
The event a^n_j = i means that the j-th word in the candidate translation translates the i-th blob.
Model #1 More about the alignment space A(w, b) For b = (b_1, b_2, ..., b_l) and w = (w_1, w_2, ..., w_m), an alignment a = (a_1, a_2, ..., a_m) is a sequence taking values in {0, ..., l}, representing a discrete function that maps each word position to a blob (the slide illustrates one possible a connecting w_1 ... w_m to b_1 ... b_l).
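To make the alignment space concrete, a small sketch that enumerates all alignments for short strings; treating index 0 as "aligned to nothing" is an assumption of this illustration:

```python
from itertools import product

def alignments(m, l):
    """All alignments a = (a_1, ..., a_m): each word position maps to a blob
    index in {0, ..., l} (here 0 is read as 'aligned to no blob')."""
    return list(product(range(l + 1), repeat=m))

print(len(alignments(3, 2)))   # (l+1)^m = 27 possible alignments
print(alignments(2, 1))        # [(0, 0), (0, 1), (1, 0), (1, 1)]
```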
Model #1 The likelihood function We call the conditional probability p(w|b) the likelihood function, since it gives the probability that a set of words is the translation of a given set of blobs. How do we generate it?
Model #1 Example: given a set of blobs b = (b_1, b_2, b_3) and a lexicon (book, chair, sky, tree, sun, fish, ship, ring, sea, cloud, ...), a possible translation w = (w_1, w_2, w_3) (e.g. sky, cloud, sun) is generated step by step: first the string length m with P(m|b), then alternately an alignment position and a word:
P(w, a | b) = P(m|b) · P(a_1|b, m) P(w_1|a_1, b, m) · P(a_2|a_1, w_1, b, m) P(w_2|a_2, w_1, b, m) · P(a_3|a_2, w_2, b, m) P(w_3|a_3, w_2, b, m)
Model #1 Finally we get, in full generality: P(w|b) = Σ_{a ∈ A(w,b)} P(m|b) · ∏_{j=1..m} P(a_j | a_1..a_{j-1}, w_1..w_{j-1}, b, m) · P(w_j | a_1..a_j, w_1..w_{j-1}, b, m). The problem with this formulation is the enormous number of model parameters.
Model #1 A simpler model Assumptions: 1) disregard the context of the blobs and the words; 2) assume the alignment depends only on the position of the word being translated; 3) assume the translated strings can have any length.
Model #1 Under these assumptions, for the same example (blobs b_1, b_2, b_3 given, lexicon as before, possible word translation w_1 w_2 w_3): the string-length term becomes constant, P(m|b) = const; the word term depends only on the aligned blob, t(w_3 | b_{a_3}) = P(w_3 | a_3, w_2, b, m); and the alignment term depends only on the position, P(a_3 | 3, b, m) = P(a_3 | a_2, w_2, b, m).
Model #1 Our mathematical goal is to build a probability table t(word | blob) that represents, for a given blob, the distribution over all its possible translations (the slide's example shows entries such as sun 0.9, ball 0.95, man 0.01, earth 0.89 for circle-like blobs).
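One simple way to hold this table is a word-by-blob array of probabilities; the numbers and vocabulary below are illustrative only, not the paper's values:

```python
import numpy as np

vocab = ["sun", "ball", "man", "earth"]          # illustrative lexicon
t = np.array([                                    # t[w, b] = P(word w | blob b)
    [0.90, 0.05],                                 # sun
    [0.05, 0.80],                                 # ball
    [0.01, 0.05],                                 # man
    [0.04, 0.10],                                 # earth
])
t = t / t.sum(axis=0, keepdims=True)              # each blob's column sums to 1

def annotate(blob_id, top=2):
    """Return the most probable words for a blob, with their probabilities."""
    order = np.argsort(t[:, blob_id])[::-1][:top]
    return [(vocab[i], float(t[i, blob_id])) for i in order]

print(annotate(0))   # e.g. [('sun', 0.9), ('ball', 0.05)]
```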
Model #1 E-step: define the expectation of the complete-data log likelihood, and compute it.
Model #1 M-step: maximize the expectation we computed. Taking into consideration the constraints (each blob's translation probabilities, and the alignment probabilities, must sum to 1), we use Lagrange multipliers to maximize the likelihood function; the resulting Lagrangian gives the equations that must be solved for the maximum.
Model #1 Solving this set of equations yields an updated set of parameters; iterating the E- and M-steps converges to a (local) maximum. A sketch of this EM loop is given below.
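A minimal sketch of the resulting EM loop in the spirit of the simplified model (uniform alignment probabilities, so only the table t(w|b) is estimated); the data layout and function name are assumptions of this sketch:

```python
import numpy as np

def train_translation_table(pairs, n_words, n_blobs, n_iters=20):
    """EM for a simple lexicon model.

    pairs: list of (word_ids, blob_ids) per image (lists of integer ids).
    Returns t[w, b] = P(word w | blob b).
    """
    t = np.full((n_words, n_blobs), 1.0 / n_words)    # uniform initialization

    for _ in range(n_iters):
        counts = np.zeros_like(t)                     # expected co-occurrence counts
        for words, blobs in pairs:
            for w in words:
                # E-step: posterior over which blob the word w translates,
                # assuming all alignments are a priori equally likely
                p = t[w, blobs]
                p = p / p.sum()
                np.add.at(counts, (w, blobs), p)      # accumulate expected counts
        # M-step: renormalize each blob's column to sum to 1
        col = counts.sum(axis=0, keepdims=True)
        t = counts / np.where(col > 0, col, 1.0)
    return t
```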
Model #1 Further refinements Some words may never be predicted with the highest probability for any blob: choose a smaller lexicon and rerun the process. Assign the NULL word using a threshold on P(word | blob).
Model #1 Indistinguishable words Some words are visually indistinguishable, practically indistinguishable, or have entangled correspondence (e.g., polar – bear, mare/foals – horse): cluster similar words and rerun the process.
Model #1 - Results Experimental Results Settings: 4500 Corel images with 4-5 keywords each; 371-word vocabulary; typically 5-10 regions per image; 500 blobs; 33 features per region.
Model #1 - Results Annotation Recall / Precision (chart comparing precision and recall for original words, refitted words, and clustered words). 500 test images; only 80 words were ever predicted.
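For reference, per-image annotation precision and recall over keyword sets can be computed as follows (a generic sketch, not the paper's exact evaluation script):

```python
def precision_recall(predicted, actual):
    """Per-image annotation precision and recall over keyword sets."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

print(precision_recall(["sun", "sky", "sea"], ["sun", "sea", "waves", "sand"]))
# (0.666..., 0.5)
```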
Model #1 - Results Correspondence (charts of prediction rate for original words and for clustered words), 100 test images, null threshold 0.2. Dark blue – total # of times a blob predicts a word that is one of the image keywords; light blue – average # of times a blob predicts the word correctly in the right place.
Model #1 - Results Some Results Successful results
Model #1 - Results Some Results Non-successful results
Model #1 - Results Some Results Assigning null
Model #1 - Results Some Results Clustering words – 1st iteration
Model #1 - Results Some Results Clustering words – 2nd iteration
Model #1 - Results Some Results Clustering words – 3rd iteration
Content 1) Motivation 2) Introduction to Semantics 3) Difficulties 4) Possible Approaches 5) Models – Model #1: Object recognition as machine translation; Model #2: Learning semantics by hierarchical clustering of words and image regions
Model #2 Learning semantics by hierarchical clustering of words and image segments Description: statistical modelling of word and image-feature occurrence and co-occurrence, organizing image collections into clusters using a hierarchical model.
Model #2 Objective Indexing image databases by integrating the semantic information provided by visual content and associated text; organizing images in a way that exposes as much semantic structure to the user as possible.
Model #2 Given: a set of images + associated text for each image → processing → indexed data, which supports browsing, search (query items: image / text), auto-annotation and auto-illustration.
Model #2 Hierarchical Structure Encourages semantic perception – levels of generalization (general → specific); a useful structure for browsing; a natural data organization (coarse → fine).
Model #2 Hierarchical Structure Hierarchy of occurrences
Model #2 Initial Processing Image segmentation → segments → blobs; words and blobs are the items; per image, item counts form a distribution histogram.
Model #2 Hierarchy Creation Histogram discretization: item occurrences are discretized with k-means.
Model #2 Hierarchy Creation Building the tree graph (e.g., a 4-level tree).
Model #2 Hierarchy Creation Creating hierarchy trees (e.g., a 3-level tree) from co-occurrence probabilities, eliminating edges with P < threshold or keeping a fixed tree fan-out.
Model #2 Hierarchy Creation Cluster = leaf; each leaf = the path from the root to it, i.e. a cluster of items.
Model #2 Hierarchy Creation Modelling data as being generated by the nodes along a path (e.g., sky → sun, sea → waves).
Model #2 Hierarchy Creation Adjacent clusters share their upper nodes (e.g., sky / sun, sea) and differ near the leaves (waves vs. rocks); a toy sketch of such a path model follows.
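A toy sketch of this path-based structure, with made-up node distributions, showing how two adjacent clusters share the upper nodes and differ at the leaf:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A tree node holding a distribution over items (words and blobs)."""
    item_probs: dict            # item -> probability of being emitted by this node
    children: list = field(default_factory=list)

def path_to_leaf(root, choices):
    """Follow child indices from the root to a leaf; the path defines a cluster."""
    path, node = [root], root
    for c in choices:
        node = node.children[c]
        path.append(node)
    return path

# Illustrative toy hierarchy: general items near the root, specific ones at leaves.
leaf_waves = Node({"waves": 0.7, "surf": 0.3})
leaf_rocks = Node({"rocks": 0.6, "cliff": 0.4})
mid_sea = Node({"sun": 0.5, "sea": 0.5}, children=[leaf_waves, leaf_rocks])
root = Node({"sky": 1.0}, children=[mid_sea])

# Two adjacent clusters share the upper nodes (sky, sun, sea) and differ at the leaf.
print([n.item_probs for n in path_to_leaf(root, [0, 0])])   # sky / sun, sea / waves
print([n.item_probs for n in path_to_leaf(root, [0, 1])])   # sky / sun, sea / rocks
```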
Model #2 Indexing Document indexing: P(c|d) = (# of document items in cluster c) / (# of document items), a normalized count.
Model #2 Indexing Calculating the cluster prior: P(c) = (1 / n_d) · Σ_d P(c|d), where n_d = total # of documents. An indexing sketch follows.
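A small sketch of this indexing step, assuming each item has already been assigned to a cluster (the dictionary-based layout is a choice of this sketch):

```python
from collections import Counter

def index_documents(docs, clusters):
    """docs: {doc_id: list of items}; clusters: {item: cluster_id}.

    Returns p_c_given_d[doc_id][cluster_id] and the prior p_c[cluster_id].
    """
    p_c_given_d = {}
    for doc_id, items in docs.items():
        counts = Counter(clusters[item] for item in items)
        total = len(items)
        p_c_given_d[doc_id] = {c: n / total for c, n in counts.items()}

    n_d = len(docs)                      # total number of documents
    p_c = Counter()
    for dist in p_c_given_d.values():
        for c, p in dist.items():
            p_c[c] += p / n_d            # average of P(c|d) over documents
    return p_c_given_d, dict(p_c)
```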
Model #2 Using the Model Browsing: navigating the tree structure and node items (e.g., Ocean → Dolphins / Whales / Corals / etc.; Ocean → Dolphins → Tail / Head / etc.).
Model #2 Using the Model Search: given a set of observations Q, compute the probability P(Q|d) for each document d in the database; thresholding yields the set of documents matching the observations Q.
Model #2 Using the Model Search: the conditional probability p(Q|d) (the likelihood function) is built from two quantities: p(item | node) = the probability of the item being emitted by the node, and p(node | d) = (# of items from the node in the document) / (# of document items); these are combined over the query items and the document's nodes. A scoring sketch follows.
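A hedged sketch of one plausible reading of this scoring rule: mix each node's emission probability by the document's node proportions and multiply over the query items (the exact combination used in the paper may differ):

```python
from collections import Counter

def score_query(query_items, doc_items, node_of_item, p_item_given_node):
    """Score p(Q|d) for one document.

    node_of_item: {item: node_id} for the document's items.
    p_item_given_node: {(item, node_id): probability}.
    """
    node_counts = Counter(node_of_item[item] for item in doc_items)
    total = len(doc_items)
    p_node_given_d = {n: c / total for n, c in node_counts.items()}

    score = 1.0
    for q in query_items:
        # sum over the document's nodes of p(q | node) * p(node | d)
        score *= sum(p_item_given_node.get((q, n), 0.0) * p
                     for n, p in p_node_given_d.items())
    return score
```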
Model #2 - Results Experimental Results Settings: Corel database: 300 Corel images with 4-5 keywords each, 64 clusters. SF Fine Art Museum database: training on 8405 museum images with attached text, 3319-word vocabulary, 256 clusters. Typically 5-10 regions per image; processing takes on the order of hours.
Model #2 - Results Browsing Results Do the clusters found make sense to humans? Comparing 64 clusters against 64 random sets: 94% accuracy.
Model #2 - Results Browsing Results Successful clusters
Model #2 - Results Browsing Results Non-successful cluster
Model #2 - Results Browsing Results Does clustering on image segments and words have an advantage over either alone? (Shown: clustering by text only, and clustering by image features only.)
Model #2 - Results Browsing Results Clustering by both text and image features.
Model #2 - Results Search Results Query: tiger, river → retrieved images with keywords (tiger, cat, water, grass), (tiger, cat, water, trees), (tiger, cat, water, grass), (tiger, cat, grass, forest), (tiger, cat, water, grass).
Model #2 - Results Auto-Annotation Associating words with images – example keyword/annotation sets: (grass, tiger, cat, forest); (hippo, bull, mouth, walk); (flower, coralberry, leaves, plant); (tiger, grass, cat, people, water, Bengal, buildings); (water, hippos, rhino, river, grass, reflection, plain); (fish, reef, church, wall, people, water, landscape).
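A simplified sketch of the auto-annotation idea: score each cluster by how well it explains the image's blobs, then emit that cluster's most probable words (the cluster-scoring details here are assumptions, not the paper's exact procedure):

```python
def auto_annotate(image_blobs, p_blob_given_cluster, p_word_given_cluster,
                  p_cluster, top=4):
    """Annotate an image with the most probable words of its best cluster.

    p_blob_given_cluster: {cluster: {blob: probability}}
    p_word_given_cluster: {cluster: {word: probability}}
    p_cluster: {cluster: prior probability}
    """
    scores = {}
    for c, prior in p_cluster.items():
        score = prior
        for b in image_blobs:
            score *= p_blob_given_cluster[c].get(b, 1e-6)   # smoothed blob likelihood
        scores[c] = score
    best = max(scores, key=scores.get)
    words = p_word_given_cluster[best]                      # word -> probability
    return sorted(words, key=words.get, reverse=True)[:top]
```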
Content 1) Motivation 2) Introduction to Semantics 3) Difficulties 4) Possible Approaches 5) Models – Model #1: Object recognition as machine translation; Model #2: Learning semantics by hierarchical clustering of words and image regions – Summary
The End