Above and below the object level
Object level: Single object recognition (ImageNet)
Below the object level: recognizing local configurations
The human visual system makes highly effective use of limited information: it can recognize not only whole objects, but also severely reduced sub-configurations, reduced in size or resolution. These are 'configurations' rather than well-defined object parts.
Minimizing variability
Motivation: reduced configurations are of interest because they are useful for the interpretation of complex scenes. We will see more of this below, but the basic reason is that they reduce variability: generalization was much better with the reduced images.
Searching for Minimal Images
Images are reduced in small steps (e.g. from 40 to 35 pixels) and tested on over 15,000 Mechanical Turk subjects ('Atoms of Recognition', PNAS 2016). A sketch of the search procedure follows the recognition rates below.
Pairs: parent – MIRC, child – 'sub-MIRC'
Recognition rates, MIRC vs. sub-MIRC:
  0.79 vs. 0.00
  0.88 vs. 0.14
  0.88 vs. 0.16
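The following is a minimal sketch of the kind of search described above, assuming each patch has five reduced descendants (four corner crops and one lower-resolution copy) and a 50% recognition threshold; recognition_rate stands in for the Mechanical Turk measurement and is hypothetical, as are the exact crop factor and threshold.

```python
# Sketch of a minimal-image (MIRC) search: a patch is a MIRC if it is reliably
# recognized but none of its slightly reduced descendants are.
# recognition_rate() is a hypothetical stand-in for the Mechanical Turk measurement.

from PIL import Image

RECOGNITION_THRESHOLD = 0.5   # assumed threshold for 'recognized'
CROP_FACTOR = 0.8             # assumed size reduction per descendant

def descendants(patch):
    """Generate reduced versions of a patch: four corner crops and one lower-resolution copy."""
    w, h = patch.size
    cw, ch = int(w * CROP_FACTOR), int(h * CROP_FACTOR)
    crops = [patch.crop(box) for box in
             [(0, 0, cw, ch), (w - cw, 0, w, ch),
              (0, h - ch, cw, h), (w - cw, h - ch, w, h)]]
    low_res = patch.resize((cw, ch))  # reduced-resolution descendant
    return crops + [low_res]

def find_mircs(patch, recognition_rate, found=None):
    """Recursively descend from a full object image and collect MIRCs."""
    if found is None:
        found = []
    if recognition_rate(patch) < RECOGNITION_THRESHOLD:
        return found
    children = descendants(patch)
    if all(recognition_rate(c) < RECOGNITION_THRESHOLD for c in children):
        found.append(patch)   # recognizable, but no recognizable descendant: a MIRC
    else:
        for child in children:
            find_mircs(child, recognition_rate, found)
    return found
```

A patch is kept as a MIRC when it is recognized but none of its descendants are, which is what produces the sharp parent/child gap in the rates above.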
On average 16 MIRCs per class, after removing highly overlapping ones (Jaccard overlap > 0.5; a sketch of this filtering appears below).
Cover
Average 16.9 MIRCs per class. The cover is highly redundant, but each individual MIRC is non-redundant: every feature in it is important. The number of visual elements is the same at different scales.
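A small sketch of the overlap filtering mentioned above, assuming each MIRC is represented by a bounding box (x0, y0, x1, y1) in the source image; the greedy keep-first strategy is an assumption.

```python
# Sketch of removing highly overlapping MIRCs using Jaccard overlap (intersection-over-union).
# Boxes are (x0, y0, x1, y1); the greedy keep-first strategy is an assumption.

def jaccard(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_overlapping(mirc_boxes, max_jaccard=0.5):
    """Keep a MIRC only if it does not overlap an already kept one by more than max_jaccard."""
    kept = []
    for box in mirc_boxes:
        if all(jaccard(box, k) <= max_jaccard for k in kept):
            kept.append(box)
    return kept
```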
Testing computational models
DNNs do not reach human-level recognition on minimal images
Recognition of minimal images does not emerge from training any of the existing models tested. The large gap at the minimal level (between MIRCs and sub-MIRCs) is not reproduced, and recognition accuracy is lower than humans'. The representations used by existing models do not capture the differences that human recognition is sensitive to.
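As an illustration of this kind of test, the sketch below scores a pretrained ImageNet classifier on MIRC / sub-MIRC image pairs and measures the accuracy gap; the choice of ResNet-50 and the file layout are assumptions, not the models or data used in the study.

```python
# Sketch: measure a pretrained classifier's accuracy drop from MIRCs to sub-MIRCs.
# The use of ResNet-50 and the file paths are assumptions for illustration only.

import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

def top1_correct(image_path, target_class_index):
    """Return True if the model's top-1 prediction matches the target ImageNet class."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        logits = model(preprocess(img).unsqueeze(0))
    return logits.argmax(dim=1).item() == target_class_index

def accuracy(pairs, which):
    """pairs: list of (mirc_path, sub_mirc_path, class_index); which: 0 for MIRC, 1 for sub-MIRC."""
    hits = [top1_correct(p[which], p[2]) for p in pairs]
    return sum(hits) / len(hits)

# Hypothetical usage:
# pairs = [("mirc_eagle.png", "sub_mirc_eagle.png", 22), ...]
# gap = accuracy(pairs, 0) - accuracy(pairs, 1)   # compare with the large human MIRC/sub-MIRC gap
```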
Minimal Images: Internal Interpretation
Humans can interpret detailed sub-structures within the 'atomic' MIRC. This is another basic ability that people have and that current feed-forward models cannot reproduce.
Internal Interpretation
This is another limitation of current models and a related task that people can do; it also tells us about the features and representations that humans use. Examples of internal interpretations were produced automatically by a model and validated on Mechanical Turk. These structures do not appear in the same way in false detections (false detections were obtained using the Felzenszwalb detector). The internal interpretations perceived by humans cannot be produced by existing feed-forward models.
Current recognition models are feed-forward only. This is likely to limit their ability to provide interpretations of fine details. A model that can produce interpretations of MIRCs uses a top-down stage (back to V1): Ben-Yossef et al., CogSci 2015, a model for full local image interpretation.
Above the object level
Image Captioning
Automatic caption (2017): 'A brown horse standing next to a building.'
Automatic caption: 'a man is standing in front woman in white shirt.'
Stealing
Human description: 'Two women sitting at restaurant table while a man in a black shirt takes one of their purses off the chair, while they are not looking.'
Components: man, woman, purse, chair
Properties: young, dark hair, red
Relations: man grabs purse, purse hanging off chair, woman sitting on chair
This is what vision produces, and we identify the event as 'stealing'. The scene structure links the components: purse hanging off chair, purse is red, man grabbing purse, woman sitting on chair, woman has dark hair, woman not looking. We also connect each component to the image, or to something in working memory.
Stealing in abstract form: person A grabs object X, person B owns X, and person B is not looking. Given object X and two people, we also need to know that person B is unaware of the grabbing. This representation is abstract and cognitive (shared with non-sighted people) and supports broad generalization.
The 'ownership' relation is added to the structure (the woman owns the purse, alongside hanging-off, grabbing, sitting, and not-looking). This cognitive addition contributes to identifying the event as 'stealing'.
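To make this structure concrete, here is a small sketch of the kind of scene graph described above: components with properties, plus labelled relations such as 'grabs', 'hanging-off', and 'owns'. The class names and the simple 'stealing' rule are illustrative assumptions, not a model presented in the talk.

```python
# Sketch of the scene structure described above: components with properties,
# and labelled relations between them. The 'stealing' rule at the end is an
# illustrative assumption, not an implemented model.

from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    properties: set = field(default_factory=set)

@dataclass
class SceneGraph:
    components: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)   # (subject, relation, object) triples

    def add(self, name, *properties):
        self.components[name] = Component(name, set(properties))

    def relate(self, subj, relation, obj):
        self.relations.append((subj, relation, obj))

    def has(self, subj, relation, obj):
        return (subj, relation, obj) in self.relations

scene = SceneGraph()
scene.add("man", "young")
scene.add("woman", "dark-hair")
scene.add("purse", "red")
scene.add("chair")
scene.relate("man", "grabs", "purse")
scene.relate("purse", "hanging-off", "chair")
scene.relate("woman", "sitting-on", "chair")
scene.relate("woman", "not-looking-at", "man")
scene.relate("woman", "owns", "purse")          # the cognitive addition

def looks_like_stealing(g: SceneGraph) -> bool:
    """A grabs X, B owns X, and B is unaware of (not looking at) the grabbing."""
    for a, rel, x in g.relations:
        if rel != "grabs":
            continue
        for b, rel2, x2 in g.relations:
            if rel2 == "owns" and x2 == x and b != a and g.has(b, "not-looking-at", a):
                return True
    return False

print(looks_like_stealing(scene))   # True for this scene
```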