
1 Viewing Vision-Language Integration as a Double-Grounding case
Katerina Pastra
Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K.
Language Technology Group, Institute for Language and Speech Processing (ILSP), Athens, Greece

2 Vision-Language Integration
• What is computational V-L integration? (definition)
• How is it achieved? (state of the art, trends, needs)
• Why is it needed in AI agents? (explanatory/theoretical ground)
• How far can we go? (implementation suggestions, the VLEMA prototype)
Content integration vs. technical integration
→ Small part within cognitive architectures
→ Small part of the integration story; still, lack of an AI study of V-L integration

3 The notion of integration
State of the art V-L integration prototypes:
a) Deal with blocksworlds or miniworlds (scalability issues)
b) Work with already abstracted/analysed visual input
c) Rely on integration resources for V-L association
Descriptive Definition
• Computational V-L integration is a process of associating visual and corresponding linguistic representations
• It may take the form of one of four integration processes according to the integration purpose to be served

4 Classification of V-L integration prototypes
System type | Integration process
Performance Enhancement | Medium x analysis → Medium y analysis (NL → IU, or NL ← IU)
Medium Translation | Source medium analysis → Target medium generation (image → language or image ← language)
Multimedia Generation | Abstracted data (tabular data or knowledge representation) → Multimedia generation
Situated Dialogue | Multimedia analysis → Medium/multimedia generation (NL analysis and shared visual scene → action/MM)
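As a rough illustration only, the four system types on this slide can be held in a small lookup structure that pairs each type with its input-to-output process. The names `SystemType`, `IntegrationProcess` and `CLASSIFICATION` are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from enum import Enum


class SystemType(Enum):
    """The four prototype classes named on the slide."""
    PERFORMANCE_ENHANCEMENT = "Performance Enhancement"
    MEDIUM_TRANSLATION = "Medium Translation"
    MULTIMEDIA_GENERATION = "Multimedia Generation"
    SITUATED_DIALOGUE = "Situated Dialogue"


@dataclass(frozen=True)
class IntegrationProcess:
    """Hypothetical record: what the system takes in and what it produces."""
    source: str
    target: str

    def describe(self) -> str:
        return f"{self.source} -> {self.target}"


# Source/target wording follows the slide; the structure itself is an assumption.
CLASSIFICATION = {
    SystemType.PERFORMANCE_ENHANCEMENT: IntegrationProcess(
        "Medium x analysis", "Medium y analysis"),
    SystemType.MEDIUM_TRANSLATION: IntegrationProcess(
        "Source medium analysis", "Target medium generation"),
    SystemType.MULTIMEDIA_GENERATION: IntegrationProcess(
        "Abstracted data", "Multimedia generation"),
    SystemType.SITUATED_DIALOGUE: IntegrationProcess(
        "Multimedia analysis", "Medium/multimedia generation"),
}
```

Looking up `CLASSIFICATION[SystemType.MEDIUM_TRANSLATION].describe()` returns the process string for the class that the VLEMA prototype is later said to belong to.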

5 Why do agents need V-L integration?
• Inherent characteristics of integrated media: Does each one of them lack something that the other can compensate for?
• Gains for an agent in communication: Are agents with V-L integration abilities more intelligent?
Why do we need to know?
• to decide on the significance of such a mechanism for an artificial agent
• to get the theoretical ground needed for research that is currently done mostly ad hoc and in isolation within different AI sub-areas

6 Inherent Characteristics
• Images:
- reference object: physical or mental
- lack inherent means of indicating focus/salience (cf. indexical-deictic mechanisms in vision theories)
- lack inherent means of indicating type:token distinctions (i.e. level of abstraction)
• Language:
- reference object: mental
- has subtle mechanisms for indicating level of abstraction
- has mechanisms for controlling attendance to details, focus etc.
- lacks direct access to the physical world (cf. indexicals)

7 From Symbol Grounding…
From the Symbol Grounding debate we get the following:
- Language lacks direct access to the physical world
- Language needs such access to express intentionality
- Symbol grounding is a process of associating symbols/language with percepts (visual percepts)
- Symbol grounding provides language direct access to the physical world
- An agent must perform symbol grounding on its own to be intrinsically intentional (must go beyond instantiation of associations to inference)

8 [Diagram. Labels: Visual Perception Representations; Linguistic Representations; Association; Direct Access; Grounding]

9 Shifting the focus from symbols…
Relying on the inherent characteristics of images, one may argue that:
- Images lack controlled access to mental aspects of the world
- Images need such access to express intentionality
- Image grounding is a process of associating images with language
- Image grounding provides images controlled access to the mental world
- An agent must perform image-grounding on its own to be intrinsically intentional

10 [Diagram. Labels: Visual Representations; Linguistic Representations; Association; Direct Access; Uncontrolled Access; Grounding]

11 From Symbol Grounding to Double Grounding
The Double-Grounding Theory:
- Double-grounding is a process of associating symbolic with iconic representations
- Double-grounding provides language with direct access to the physical world and, at the same time, provides vision with controlled access to mental aspects of the world
- Vision-language integration is a case of double-grounding
- V-L integration compensates for features that images and language inherently lack on their own; it is necessary for expressing and understanding intentionality in V-L MM situations
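A minimal sketch of the association at the heart of double-grounding, assuming symbols are plain strings and iconic representations are identified by hypothetical region IDs. The class `DoubleGrounding` and its methods are illustrative names, not part of the presentation. The sketch only stores associations in both directions; it does not model the inference step that the symbol-grounding slide says an agent needs in order to be intrinsically intentional.

```python
from dataclasses import dataclass, field


@dataclass
class DoubleGrounding:
    """Toy two-way association between symbols and iconic (visual) items."""
    symbol_to_icons: dict = field(default_factory=dict)
    icon_to_symbols: dict = field(default_factory=dict)

    def associate(self, symbol: str, icon_id: str) -> None:
        # Record the link in both directions so either medium can reach the other.
        self.symbol_to_icons.setdefault(symbol, set()).add(icon_id)
        self.icon_to_symbols.setdefault(icon_id, set()).add(symbol)

    def ground_symbol(self, symbol: str) -> set:
        # Language -> vision: which visual items does this symbol point to?
        return self.symbol_to_icons.get(symbol, set())

    def ground_image(self, icon_id: str) -> set:
        # Vision -> language: which symbols (at any level of abstraction)
        # have been attached to this visual item?
        return self.icon_to_symbols.get(icon_id, set())


grounding = DoubleGrounding()
grounding.associate("chair", "region_1")
grounding.associate("furniture", "region_1")  # a more abstract label for the same region
grounding.associate("chair", "region_2")
```

Here `grounding.ground_symbol("chair")` gives the regions the word refers to, while `grounding.ground_image("region_1")` gives the labels available for that region, including the more abstract one.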

12 Viewing integration as a double-grounding case
• V-L integration abilities are needed for an agent to be intentional in MM situations
• Exploring how V-L integration can be achieved computationally, one realises that this research issue involves not only the perceptual and linguistic modules of a cognitive architecture, but also the learning and reasoning ones
• The corresponding AI communities therefore need to join forces to address the challenges in endowing agents with their own V-L integration abilities

13 The AI quest for V-L Integration
In relying on human-created data, state-of-the-art V-L integration systems avoid core integration challenges and therefore fail to perform real integration.
Can we do better? How far can we go?
Challenging current practices means that a prototype should:
• work with real visual scenes
• analyse its visual data automatically
• associate images and language automatically
Is it feasible to develop such a prototype?

14 An optimistic answer
VLEMA: A Vision-Language intEgration MechAnism
• Input: automatically reconstructed static scenes in 3D (VRML format) from RESOLV (robot-surveyor)
• Integration task: Medium Translation from images (3D sitting rooms) to text (what and where, in English)
• Domain: estates surveillance
• Horizontal prototype
• Implemented in shell programming and Prolog
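To make the "what and where" translation task concrete, here is a toy sketch in Python. It is not VLEMA's method, which the slide says was implemented in shell programming and Prolog. It assumes objects have already been recognised and labelled, and it assumes a coordinate convention (negative `x` is left of the room centre, negative `z` is the back of the room). `SceneObject`, `where` and `describe_scene` are hypothetical names, and no VRML parsing is attempted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    label: str  # assumed: the object has already been recognised and named
    x: float    # assumed: negative = left of room centre, positive = right
    z: float    # assumed: negative = back of the room, positive = front


def where(obj: SceneObject) -> str:
    """Map coordinates to a coarse verbal location under the assumed convention."""
    side = "left" if obj.x < 0 else "right"
    depth = "back" if obj.z < 0 else "front"
    return f"{depth} {side}"


def describe_scene(objects) -> list:
    """Produce one English 'what and where' sentence per object."""
    return [f"There is a {o.label} at the {where(o)} of the room." for o in objects]


room = [SceneObject("sofa", -1.0, 2.0), SceneObject("table", 0.5, -1.5)]
sentences = describe_scene(room)
```

The hard parts that the slides point to, namely analysing real visual data and associating it with language automatically, are exactly what this sketch takes as given through the `label` field.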

