INTRODUCTION
[Figure: annotated image]
Caption: "a flat landscape with a dry meadow in the foreground, a lagoon behind it and many clouds in the sky"

GOAL
o Sky
o Cloud
o Landscape
o Lagoon
o Meadow
To correctly map image entities to caption entities

MOTIVATION
o Improve image retrieval by complementing images with their accompanying text
o Use the mapping to transfer knowledge and information between text and images
o Deduce geometric and spatial relations in images using captions

DATASET
o Segmented and Annotated IAPR TC-12 Benchmark data set (Escalante et al., 2010), consisting of about 20,000 photographs with a wide variety of themes
o Each image has a short caption that describes its content, most often consisting of one to three sentences separated by semicolons
o Each segmented image region is labelled with one of 275 predefined image labels

PREPROCESSING
o The Stanford CoreNLP pipeline is applied to the captions to extract the caption entities. It consists of a part-of-speech tagger, lemmatizer, named entity recognizer (Finkel et al., 2005), dependency parser, and coreference solver.
o The cross product of all image entities with all caption entities is taken to get (IE_i, CE_j) pairs
o The test set is built by manually annotating 200 randomly selected images
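The cross-product step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `candidate_pairs` is hypothetical, and the entity lists are assumed values loosely based on the example image and caption from the introduction.

```python
from itertools import product


def candidate_pairs(image_entities, caption_entities):
    """Return every (image entity, caption entity) combination."""
    return list(product(image_entities, caption_entities))


# Assumed example data: region labels of one image and the lemmatized
# nouns of its caption (illustrative values, not taken from the data set).
image_entities = ["sky", "cloud", "landscape", "lagoon", "meadow"]
caption_entities = ["landscape", "meadow", "foreground", "lagoon", "cloud", "sky"]

pairs = candidate_pairs(image_entities, caption_entities)
print(len(pairs))  # 5 image entities x 6 caption entities
```

Every pair produced this way is a candidate link that the scoring functions then have to rank.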

RANKING ENTITY PAIRS
INITIAL RANKING
o Using semantic distance: the 8 metrics provided by WS4J (PATH, WUP, RES, JCN, HSO, LIN, LCH and LESK)
o Using statistical associations: co-occurrence counts, pointwise mutual information (PMI) and a simplified Student’s t-score
o All pairs are ranked using the above 8+3 scoring functions
RERANKING
o The spatial features of the images (topological, horizontal and vertical) are considered both on their own and aggregated with the corresponding syntactic features of the caption entities
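The statistical association scores can be illustrated with a short Python sketch. The slides do not give the exact formulas, so this assumes the common textbook definitions PMI = log2(N * C(x,y) / (C(x) * C(y))) and t = (C(x,y) - C(x) * C(y) / N) / sqrt(C(x,y)). The function names, the counts, and the choice to score unseen pairs as minus infinity are all assumptions made for illustration.

```python
import math


def pmi(pair_count, image_count, caption_count, total):
    """Pointwise mutual information (base 2) computed from raw counts."""
    if pair_count == 0:
        return float("-inf")  # assumed convention for pairs never seen together
    return math.log2(pair_count * total / (image_count * caption_count))


def t_score(pair_count, image_count, caption_count, total):
    """Simplified t-score: (observed - expected) / sqrt(observed)."""
    if pair_count == 0:
        return float("-inf")  # assumed convention, avoids division by zero
    expected = image_count * caption_count / total
    return (pair_count - expected) / math.sqrt(pair_count)


def rank_pairs(counts, total, score):
    """Sort (image label, caption word) pairs by descending score."""
    return sorted(counts, key=lambda pair: score(*counts[pair], total), reverse=True)


# Hypothetical counts: (co-occurrences, image label count, caption word count).
TOTAL_IMAGES = 1000
COUNTS = {
    ("sky", "lagoon"): (1, 100, 20),
    ("sky", "sky"): (40, 100, 80),
    ("sky", "meadow"): (5, 100, 50),
}

ranked_by_pmi = rank_pairs(COUNTS, TOTAL_IMAGES, pmi)
ranked_by_t = rank_pairs(COUNTS, TOTAL_IMAGES, t_score)
print(ranked_by_pmi)
print(ranked_by_t)
```

Each scoring function yields its own ranking of the candidate pairs; the same `rank_pairs` helper could be reused with any of the other scoring functions by passing a different `score` callable.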

Reranking Example

IMPROVING THE RESULTS
o The highest accuracy, 78.6%, was obtained with the HSO similarity metric
o Further reranking is done by creating an ensemble of the classifiers based on all scoring functions, using a hard voting heuristic
o The number of votes for each classifier was picked from [0, 3], and all possible combinations were tested
o The highest accuracy thus achieved was 88.76%
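The vote search described above can be sketched as a weighted hard-voting ensemble with an exhaustive search over vote counts. This is only an illustration under stated assumptions: vote counts are taken to be the integers 0 to 3, ties go to the label proposed by the earliest classifier, and the function names and toy predictions are hypothetical rather than taken from the paper.

```python
from itertools import product


def weighted_vote(predictions, weights):
    """Return the label with the most votes; predictions[i] gets weights[i] votes."""
    tally = {}
    for label, weight in zip(predictions, weights):
        tally[label] = tally.get(label, 0) + weight
    # On a tie, max() keeps the label that was inserted first (assumed tie-break).
    return max(tally, key=tally.get)


def accuracy(all_predictions, gold, weights):
    """Fraction of items for which the weighted vote matches the gold label."""
    hits = sum(weighted_vote(p, weights) == g for p, g in zip(all_predictions, gold))
    return hits / len(gold)


def best_weights(all_predictions, gold, n_classifiers, max_votes=3):
    """Try every assignment of 0..max_votes votes per classifier."""
    best, best_acc = None, -1.0
    for weights in product(range(max_votes + 1), repeat=n_classifiers):
        if not any(weights):
            continue  # skip the all-zero assignment, which expresses no preference
        acc = accuracy(all_predictions, gold, weights)
        if acc > best_acc:
            best, best_acc = weights, acc
    return best, best_acc


# Toy data: one row per image region, one column per classifier (hypothetical).
gold = ["sky", "lagoon", "meadow", "cloud"]
all_predictions = [
    ("sky", "cloud", "cloud"),
    ("lagoon", "lagoon", "meadow"),
    ("meadow", "landscape", "landscape"),
    ("landscape", "cloud", "cloud"),
]

print(accuracy(all_predictions, gold, (1, 1, 1)))
print(best_weights(all_predictions, gold, n_classifiers=3))
```

With 11 scoring functions and 4 possible vote counts each, such a search covers 4^11 (about 4.2 million) assignments, which is still small enough to enumerate exhaustively.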


FIN
The slides are part of a paper review for the course CS671. All the work is authored by: Weegar, Rebecka, Kalle Åström, and Pierre Nugues. "Linking Entities Across Images and Text." CoNLL 2015 (2015): 185.
This presentation in no way claims ownership of any content in the slides.