
1 Vision-Language Integration in AI: a reality check Katerina Pastra and Yorick Wilks Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K.

2 Setting the context
Artificial Intelligence: from technical integration of modalities → multimodal meaning integration; from Multimedia → Intellimedia + Intelligent Interfaces.
Purpose: intelligent, natural, coherent communication.
We focus on vision and language integration:
- Visual modalities = images (visual perception and/or visualisation representations, physically realised as e.g. 2D/3D graphics, photos…)
- Linguistic modalities = text and/or speech

3 The problem
- What is computational V-L integration? (definition)
- How is it achieved computationally? (state of the art, practices, tendencies, needs)
- How far can we go? (implementation suggestions, the VLEMA prototype)
Multimodal integration is an old AI aspiration (cf. Kirsch 1964). There is a wide variety of V-L integration prototypes in AI, but no AI study of V-L integration, and no reality check.

4 In search of a definition
Defining computational V-L integration: could a review of related applied AI research hold the answer?
Related work: Srihari 1994, a review of V-L integration prototypes, but:
- limited number of prototypes reviewed
- suggestions and implementations are mixed
- no clear focus on how integration is achieved
- systems classified according to input type
- includes cases of quasi-integration
What criteria should such a review use?

5 The notion of quasi-integration
Quasi-integration: fusion of results obtained by modality-dependent processes. The results are intersected or combined, or the results of one process merely constrain the search space of another.
Example (video summarisation): NLP identifies key phrases in the transcript ("The basketball player…", "Our champion came first…", "…and the soccer player."), and IP identifies key frames among the frames that correspond to the extracted key sentences; together they yield a video summary.
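To make the contrast with full integration concrete, here is a minimal Python sketch of quasi-integration for the video-summary example above; every name (Sentence, nlp_key_sentences, the timestamp window) is a hypothetical stand-in, not taken from any reviewed system:

```python
# Quasi-integration sketch: two modality-dependent processes run
# independently and their results are fused only at the end.
from collections import namedtuple

Sentence = namedtuple("Sentence", "text time")
Frame = namedtuple("Frame", "image time")

def nlp_key_sentences(transcript):
    """Stand-in NLP process: pick sentences containing key phrases."""
    key_phrases = ("champion", "player")
    return [s for s in transcript if any(p in s.text for p in key_phrases)]

def ip_key_frames(frames, sentences, window=2.0):
    """Stand-in IP process: keep frames whose timestamps fall near a key
    sentence, i.e. the NLP result constrains the visual search space."""
    return [f for f in frames
            if any(abs(f.time - s.time) <= window for s in sentences)]

def video_summary(transcript, frames):
    sentences = nlp_key_sentences(transcript)     # language-only analysis
    keyframes = ip_key_frames(frames, sentences)  # vision constrained by NLP
    return sentences, keyframes   # results merely juxtaposed: no shared
                                  # vision-language representation is built
```

The fusion happens only at the output: neither module touches a representation that links words to visual content, which is what distinguishes this from the integration resources of slides 8 and 9.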

6 Defining integration through classification
Specifics of the review:
- It is diachronic: from SHRDLU (Winograd '72) to the conversational robots of the new millennium (e.g. Shapiro and Ismail 2003, Roy et al. 2003)
- It crosses over diverse AI areas and applications: more than 60 prototypes reviewed, from IR to Robotics
- Main criterion for considering a prototype for review: V-L integration must be essential for the task the prototype is built for
- System classification criterion: the integration purpose served

7 Classification of V-L integration prototypes
System type: integration process
- Performance Enhancement: medium x analysis → medium y analysis (NL → IU, or IU → NL)
- Medium Translation: source medium analysis → target medium generation (image → language, or language → image)
- Multimedia Generation: abstracted data → multimedia generation (from tabular data or a knowledge representation)
- Situated Dialogue: multimedia analysis → medium/multimedia generation (NL analysis and a shared visual scene → action/MM)
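The four-way classification can be made concrete as a data structure; a minimal Python sketch, where the type names come from the slide above and the encoding itself is a hypothetical illustration:

```python
# The four system types from the review, encoded as integration-process
# signatures (input -> output). Hypothetical encoding, for illustration only.
from enum import Enum

class IntegrationType(Enum):
    PERFORMANCE_ENHANCEMENT = "medium x analysis -> medium y analysis"
    MEDIUM_TRANSLATION = "source medium analysis -> target medium generation"
    MULTIMEDIA_GENERATION = "abstracted data -> multimedia generation"
    SITUATED_DIALOGUE = "multimedia analysis -> medium/multimedia generation"

# e.g. VLEMA (slide 13) performs MEDIUM_TRANSLATION: image -> language.
```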

8 Examples
- PICTION. Input: photos, captions (EN). Integration resources: integrated KB. Integration mechanisms: semantic networks. Output: face identification.
- SOCCER. Input: soccer video, trajectories. Resources: GSD/frame, event model. Mechanisms: event model instantiation, verbalisation history... Output: textual (GER) event description.
- MAGIC. Input: patient file. Resources: schemas. Mechanisms: schema instantiation, media selector, co-reference. Output: speech or text (EN), animation.
- CASSIE. Input: speech (EN), 3D blocks. Resources: KL-PML association lists. Mechanisms: unification. Output: object identification, limited conversation.

9 Beyond differences
The reviewed prototypes involve different visual and linguistic modalities, perform different tasks, and serve different integration purposes, but similar integration resources are used (though represented and instantiated differently):
- Integration resources = associations between visual and corresponding linguistic information, e.g. words/concepts and visual features or image models. Form: lists, integrated KBs, scene/event models in a KR formalism.
- Integration mechanisms = KR instantiation, translation rules, media selection, coordination…
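The simplest form such a resource can take, an association list pairing terms with visual models, fits in a few lines of Python; the entries and feature names below are invented for illustration, not drawn from any reviewed system:

```python
# Integration resources as a simple association list: each linguistic
# term is paired with the visual evidence that counts as its referent.
# Entries and feature names are invented for illustration.
ASSOCIATIONS = {
    "heater": {"shape": "box", "aspect": "wide", "location": "wall"},
    "sofa":   {"shape": "box", "aspect": "wide", "location": "floor"},
    "ball":   {"shape": "sphere"},
}

def name_object(visual_features):
    """Integration mechanism in miniature: instantiate the association
    whose visual model matches the observed features."""
    for word, model in ASSOCIATIONS.items():
        if all(visual_features.get(k) == v for k, v in model.items()):
            return word
    return None

# name_object({"shape": "box", "aspect": "wide", "location": "wall",
#              "colour": "white"})  ->  "heater"
```

The point of the sketch is that the resource (the table) and the mechanism (the matching loop) are separable, which is exactly the regularity the slide observes across otherwise dissimilar systems.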

10 A descriptive definition
Descriptive definition = (a) intensional definition (what the term is, i.e. its genus and differentia) + (b) extensional definition (what the term applies to).
(a) Computational vision-language integration is a process of associating visual and corresponding linguistic pieces of information (indirect backing from Cognitive Science: cf. the notion of learned associations in Minsky's "Society of Mind", 1986, and Jackendoff's theory of associating concepts and 3D models, 1987).
(b) Computational vision-language integration may take the form of one of four integration processes, according to the integration purpose to be served.

11 The AI quest for V-L integration
Argument: by relying on human-created data, state-of-the-art V-L integration systems avoid the core integration challenges and therefore fail to perform real integration:
- Simulated or manually abstracted visual input is used, to avoid difficulties in image analysis
- Applications are restricted to blocksworlds/miniworlds, raising scaling issues
- Manually constructed integration resources are used, to avoid difficulties in associating V and L
The real difficulties in integration (the correspondence problem etc.) lie exactly where developers intervene…

12 How far can we go?
Challenging current practices in V-L integration system development requires an ambitious system specification. A prototype should:
- work with real visual scenes
- analyse its visual data automatically
- associate images and language automatically
Is it feasible to develop such a prototype?

13 An optimistic answer
VLEMA: A Vision-Language intEgration MechAnism
- Input: automatically reconstructed static 3D scenes (VRML format) from RESOLV (a robot surveyor)
- Integration task: medium translation from images (3D sitting rooms) to text (what and where, in EN)
- Domain: estate surveillance
- Horizontal prototype
- Implemented in shell scripts and Prolog

14 The Input

15 System Architecture
[Architecture diagram: Data Transformations → Object Segmentation → Object Naming → Description, drawing on OntoVis + KB; sample description: "…a heater … and a sofa with 3 seats…"]
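Read as a pipeline, the architecture suggests four stages. The following Python sketch runs end-to-end on a toy scene purely to illustrate the data flow; the scene format, the naming rule standing in for the OntoVis + KB lookup, and all function names are assumptions, since the actual VLEMA was implemented in shell scripts and Prolog:

```python
# Hypothetical sketch of the pipeline on the slide: data transformations ->
# object segmentation -> object naming -> description. None of this reflects
# the real (shell + Prolog) implementation.

def segment(boxes):
    """Object segmentation stand-in: here, each 3D bounding box
    (width, height, depth, x, y, z) is taken as one candidate object."""
    return [{"size": (w, h, d), "pos": (x, y, z)}
            for (w, h, d, x, y, z) in boxes]

def name_objects(objects):
    """Object naming stand-in for the OntoVis + KB lookup: attach a term
    to each object from its crude visual properties."""
    for obj in objects:
        w, h, d = obj["size"]
        _, y, _ = obj["pos"]
        obj["name"] = "heater" if y > 0.3 and h < w else "sofa"
    return objects

def describe(objects):
    """'What and where' generation stand-in."""
    items = " and ".join(f"a {o['name']}" for o in objects)
    return f"This is a general view of a room. We can see {items}."

scene = [(1.0, 0.5, 0.2, 0.0, 0.5, 0.0),   # wide, shallow, mounted high
         (2.0, 0.9, 0.9, 1.5, 0.0, 0.0)]   # deep, resting on the floor
print(describe(name_objects(segment(scene))))
# -> This is a general view of a room. We can see a heater and a sofa.
```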

16 The Output
Wed Jul 7 13:22:22 GMTDT 2004
VLEMA V1.0 Katerina Pastra@University of Sheffield
Description of the automatically constructed VRML file "development-scene.wrl":
This is a general view of a room. We can see the front wall, the left-side wall, the floor, a heater on the lower part of the front wall, and a sofa with 3 seats. The heater is shorter in length than the sofa. It is on the right of the sofa.
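The comparative and spatial statements in this output ("shorter in length than", "on the right of") are the kind of relations derivable from the named objects' bounding geometry; a minimal sketch, assuming a viewer-relative x axis and hypothetical object records, not VLEMA's actual rules:

```python
# Sketch of deriving comparative/spatial relations from bounding geometry.
# Axis convention (x grows to the viewer's right) and all names are
# assumptions made for illustration.

def length_relation(a, b):
    """Compare extents along each object's longest horizontal axis."""
    la = max(a["size"][0], a["size"][2])
    lb = max(b["size"][0], b["size"][2])
    if la < lb:
        return f"The {a['name']} is shorter in length than the {b['name']}."
    if la > lb:
        return f"The {a['name']} is longer in length than the {b['name']}."
    return f"The {a['name']} and the {b['name']} are equal in length."

def side_relation(a, b):
    """Left/right placement along the viewer-relative x axis."""
    side = "right" if a["pos"][0] > b["pos"][0] else "left"
    return f"It is on the {side} of the {b['name']}."

heater = {"name": "heater", "size": (1.0, 0.5, 0.2), "pos": (2.0, 0.5, 0.0)}
sofa   = {"name": "sofa",   "size": (2.0, 0.9, 0.9), "pos": (0.0, 0.0, 0.0)}
print(length_relation(heater, sofa))  # The heater is shorter in length...
print(side_relation(heater, sofa))    # It is on the right of the sofa.
```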

17 Conclusion
*** Could occasional reality checks re-direct (part of) AI research? ***
- A descriptive definition of V-L integration in AI; a theoretical, explanatory one is given in K. Pastra (2004), "Viewing Vision-Language Integration as a Double-Grounding Case", Proceedings of the AAAI Fall Symposium Series, Washington DC.
- A review and critique of the state of the art in AI
- The VLEMA prototype: a baseline for future research challenging current practices

