Download presentation
Presentation is loading. Please wait.
Published byShannon Pope Modified over 9 years ago
1
1 Viewing Vision-Language Integration as a Double-Grounding case Katerina Pastra Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K. Language Technology Group, Institute for Language and Speech Processing (ILSP), Athens, Greece
2
2 Vision-Language Integration What is computational V-L integration? (definition) How is it achieved? (state of the art, trends, needs) Why is it needed in AI agents? (explanatory/theoretical ground) How far can we go? (implementation suggestions, the VLEMA prototype) Content integration vs. technical integration à Small part within cognitive architectures à Small part of the integration story still, lack of an AI study of V-L integration
3
3 The notion of integration State of the art V-L integration prototypes: a)Deal with blocksworlds or miniworlds (scalability issues) b)Work with already abstracted/analysed visual input c)Rely on integration resources for V-L association Descriptive Definition Computational V-L integration is a process of associating visual and corresponding linguistic representations It may take the form of one of four integration processes according to the integration purpose to be served
4
4 Classification of V-L integration prototypes System typeIntegration Process Performance Enhancement Medium x analysis Medium y analysis (NL IU, or NL IU) Medium Translation Source medium analysis Target medium gen. (image language or image language) Multimedia Generation Abstracted data Multimedia generation (tabular data or knowledge representation) Situated Dialogue Multimedia analysis Medium/multimedia gen. (NL analysis and shared visual scene action/MM)
5
5 Why do agents need V-L integration ? Inherent characteristics of integrated media: Does each one of them lack something that the other can compensate for? Gains for an agent in communication: Are agents with V-L integration abilities more intelligent? Why do we need to know ? to decide on the significance of such a mechanism for an artificial agent to get the theoretical ground needed for research that is currently done mostly ad hoc and in isolation within different AI sub-areas
6
6 Inherent Characteristics Images: - reference object: physical or mental - lack inherent means of indicating focus/salience (cf. indexical-deictic mechanisms in vision theories) - lack inherent means of indicating type:token distinctions (i.e. level of abstraction) Language: - reference object: mental - has subtle mechanisms for indicating level of abstraction - has mechanisms for controlling attendance to details, focus etc. - lacks direct access to the physical world (cf. indexicals)
7
7 From Symbol Grounding… From the Symbol Grounding debate we get the following : - Language lacks direct access to the physical world - Language needs such access to express intentionality - Symbol grounding is a process of associating symbols/language with percepts (visual percepts) - Symbol grounding provides language direct access to the physical world - An agent must perform symbol grounding on its own to be intrinsically intentional (must go beyond instantiation of associations to inference)
8
Visual Perception Representations Linguistic Representations Association Direct Access Grounding
9
9 Shifting the focus from symbols… Relying on the inherent characteristics of images, one may argue that : - Images lack controlled access to mental aspects of the world - Images need such access to express intentionality - Image grounding is a process of associating images with language - Image grounding provides images controlled access to the mental world - An agent must perform image-grounding on its own to be intrinsically intentional
10
Visual Representations Linguistic Representations Association Direct Access Uncontrolled Access Grounding
11
11 From Symbol Grounding to Double Grounding The Double-Grounding Theory: - Double-grounding is a process of associating symbolic with iconic representations - Double-grounding provides language a direct access to the physical world, and at the same time it provides vision a controlled access to mental aspects of the world - Vision-language integration is a case of double-grounding - V-L integration compensates for features images and language inherently lack on their own – it is necessary for expressing and understanding intentionality in V-L MM situations
12
12 V-L integration abilities are needed for an agent to be intentional in MM situations Exploring how V-L integration can be achieved computationally, one realises that this research issue involves not only the perceptual and linguistic modules of a cognitive architecture, but also the learning and reasoning ones. The corresponding AI communities need, therefore, to join forces for addressing the challenges in endowing agents with their own V-L integration abilities Viewing integration as a double-grounding case
13
13 The AI quest for V-L Integration In relying on human created data, state of the art V-L integration systems avoid core integration challenges and therefore fail to perform real integration Can we do better? How far can we go ??? Challenging current practices means that a prototype should: work with real visual scenes analyse its visual data automatically associate images and language automatically Is it feasible to develop such a prototype ???
14
14 An optimistic answer VLEMA: A Vision-Language intEgration MechAnism Input: automatically re-constructed static scenes in 3D (VRML format) from RESOLV (robot-surveyor) Integration task: Medium Translation from images (3D sitting rooms) to text (what and where in EN) Domain: estates surveillance Horizontal prototype Implemented in shell programming and ProLog
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.