Image Video & Multimedia Systems Laboratory
Multimedia Knowledge Laboratory
Informatics and Telematics Institute

Exploitation of Knowledge in Video Recordings
Dr. Alexia Briassouli, Dr. Yiannis Kompatsiaris
Multimedia Knowledge Laboratory, CERTH-ITI
October 24, 2008, Thessaloniki, Greece
Evolution of Content
- 1-2 exabytes (millions of terabytes) of new information produced worldwide annually
- 80 billion digital images captured each year
- Over 1 billion images related to commercial transactions available through the Internet; this number is estimated to increase tenfold in the next two years
- New films produced each year worldwide; available films; television and radio stations; 100 billion hours of audiovisual content
- Personal content, sport, news, movies, web, mobile
Multimedia Content
- Networks, storage & devices
- Segmentation, KA analysis, labeling
- Cross-media analysis, context, reasoning
- Metadata generation & representation
- Content adaptation and distribution: multiple terminals & networks
- Hybrid / content-based retrieval, recommendations and personalization
- Semantic technology in markets: Web 2.0 photo and video applications
Need for Annotation + Metadata
"The value of information depends on how easily it can be found, retrieved, accessed, filtered or managed in an active, personalized way"
Video Analysis
Video analysis that exploits knowledge provides significant advantages:
- Improved accuracy of semantics extracted from video
- Higher-level concepts inferred by combining knowledge with video processing: knowledge about behavior, event detection
- More efficient storage, access, retrieval and dissemination of multimodal data, thanks to the (automatically generated) annotations
Video Analysis in JUMAS
Text-Based Indexing
Manual annotation (most commonly used; necessary in a number of applications):
+ Straightforward
+ High/semantic level
+ Efficient during content creation
- Time consuming
- Operator/application dependent
- Text-related problems (synonyms etc.)
Annotation using captions and related text (web, video, documents etc.):
+ Straightforward
+ High/semantic level
+ Multimodal approach
- Text-processing restrictions and limitations
- Captions must exist
Addressing the Semantic Gap
The semantic gap for multimedia: how to map automatically generated numerical low-level features to higher-level, human-understandable semantic concepts.
Example: from the dominant color descriptor of a sky region to "this image contains a sky region and is a holiday image".
Problem Definition
Semantic image analysis: how to translate automatically extracted visual descriptions into human-like conceptual ones.
- Low-level features provide cues that strengthen or weaken evidence based on visual similarity
- Prior knowledge is needed to support disambiguation of the semantics
Semantic Analysis
- Knowledge infrastructure (multimedia ontology); manual annotation; models
- Single-modality analysis and knowledge extraction, a common view: feature extraction, text/image analysis, segmentation, SVMs
- Evidence generation ("Vehicle", "Building"): classifier fusion, global vs. local
- Modality fusion with context: "Ambulance"
- Reasoning: fusion of annotations, consistency checking, higher-level concepts/events ("Emergency scene")
- Additional analysis information: multimedia content annotation tools; training and (statistical) modeling of the domain from annotated multimedia content; algorithms, features, context
Knowledge from Video Analysis
Implicit semantics from video, derived via machine learning methods, i.e. based on training: SVMs, HMMs, neural networks, Bayesian networks.
- Training uses appropriate data, relevant to the semantics of interest
- Training finds models that connect low-level features (e.g. motion trajectories) with high-level annotations
- These models are then applied to test data
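The train-then-apply pipeline above can be sketched with a deliberately simple stand-in classifier: a nearest-centroid rule takes the place of an SVM/HMM so the example stays dependency-free, and the two-dimensional "motion" feature vectors and their labels are invented for illustration.

```python
# Minimal sketch of learning a mapping from low-level feature
# vectors to high-level annotations. A nearest-centroid rule
# stands in for the SVM/HMM named on the slide; the feature
# vectors (mean speed, trajectory straightness) are invented.

def train(samples):
    """samples: list of (feature_vector, label). Returns a centroid per label."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(model, vec):
    """Assign the label whose centroid is closest to vec."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda lab: dist2(model[lab], vec))

# Training data: (mean speed, straightness) -> high-level annotation
training = [
    ([0.2, 0.9], "walking"), ([0.3, 0.8], "walking"),
    ([1.5, 0.2], "running"), ([1.7, 0.3], "running"),
]
model = train(training)
print(classify(model, [0.25, 0.85]))  # closest to the "walking" centroid
```

The same two-step shape (fit models on annotated data, then apply them to unseen test data) carries over directly when the stand-in is replaced by a real SVM.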
Classification Results (aceMedia)
Segment's hypothesis set over the defined concepts: Natural-Person, Sailing-Boat, Sand, Building, Pavement, Road, Body-Of-Water, Cliff, Cloud, Mountain, Sea, Sky, Stone, Waterfall, Wave, Dried-Plant, Dried-Plant-Snowed, Foliage, Grass, Tree, Trunk, Snow, Sunset, Car, Ground, Lamp-Post, Statue
Frame Region - Concept Association
- Region feature vector formed from local descriptors
- An individual SVM is introduced for every defined local concept, receiving the region feature vector as input
- Training is identical to the global-concept training case
- Every region is evaluated by all trained SVMs, and the segment's local-concept hypothesis set is created
Segment's hypothesis set: Ground: 0.89, Grass: 0.44, Mountain: 0.21, Boat: 0.07, Smoke: 0.41, Dirty-Water: 0.18, Trunk: 0.12, Foam: 0.19, Debris: 0.34, Mud: 0.31, Water: 0.42, Sky: 0.22, Ashes: 0.11, Subtitles: 0.24, Flames: 0.13, Vehicle: 0.12, Building:, Foliage: 0.84, Person: 0.32, Road: 0.39
Initial Region-Concept Association
- Region feature vector formed from local descriptors
- An individual SVM is introduced for every defined concept, receiving the region feature vector as input
- Training is identical to the global training case
- Every region is evaluated by all trained SVMs, and the segment's concept hypothesis set is created
Segment's hypothesis set: Building: 0.89, Roof: 0.29, Grass: 0.21, Tree: 0.07, Stone: 0.41, Ground: 0.15, Dried-plant: 0.12, Sky: 0.19, Person: 0.34, Trunk: 0.31, Vegetation: 0.42, Rock: 0.22, Boat: 0.11, Sand: 0.44, Sea: 0.13, Wave: 0.12
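The per-concept evaluation step above can be sketched as follows: one scoring function per concept stands in for one trained SVM per concept, every scorer is applied to the region feature vector, and the scores are ranked into a hypothesis set. The prototype vectors and the region features are invented for illustration.

```python
# Sketch of forming a segment's concept hypothesis set: one scorer
# per defined concept (standing in for one trained SVM each) is
# applied to the region feature vector; all scores are ranked.
# Prototypes and the region feature vector are invented toy data.

def make_scorer(prototype):
    """Return a scorer mapping distance-to-prototype into (0, 1]."""
    def score(vec):
        d2 = sum((x - y) ** 2 for x, y in zip(vec, prototype))
        return 1.0 / (1.0 + d2)
    return score

# One "model" per defined concept (hypothetical prototype vectors).
concept_models = {
    "Building": make_scorer([0.8, 0.1]),
    "Grass":    make_scorer([0.1, 0.9]),
    "Sky":      make_scorer([0.0, 0.2]),
}

region_features = [0.7, 0.2]  # hypothetical region feature vector

# Hypothesis set: every concept scored, then ranked by confidence.
hypotheses = sorted(
    ((c, scorer(region_features)) for c, scorer in concept_models.items()),
    key=lambda cs: cs[1], reverse=True,
)
print(hypotheses[0][0])  # best-ranked concept for this region
```

Keeping the full ranked set, rather than only the top concept, is what lets later reasoning steps revise weak decisions with context.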
Knowledge for Video Analysis
Explicit semantics from video, based on previously known models:
- Explicitly defined models, rules, facts
- Rules from preliminary scripts and standards from similar cases
Explicit and implicit knowledge can be combined with results from low-level video processing to extract meaningful high-level knowledge.
System Overview
Multimedia content -> video analysis (face recognition, motion segmentation etc.) + knowledge infrastructure (explicit or implicit) -> semantic multimedia description
Video Analysis: Motion Analysis
- Motion detection and tracking: detection of when motion occurs
- Motion segmentation: object segmentation based on motion characteristics; generation of 'active regions'
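The motion-detection step can be sketched as simple frame differencing: pixels whose intensity changes more than a threshold between consecutive frames are marked as moving, and "motion occurs" when enough pixels change. Real systems use far more robust statistics; the tiny one-row "frames" and thresholds here are invented for illustration.

```python
# Sketch of motion detection by frame differencing: pixels whose
# intensity changes by more than a threshold between consecutive
# frames are marked as moving. The toy "frames" are invented.

def moving_pixels(prev, curr, threshold=10):
    """Return indices of pixels that changed by more than threshold."""
    return [i for i, (p, c) in enumerate(zip(prev, curr)) if abs(c - p) > threshold]

def motion_occurs(prev, curr, threshold=10, min_pixels=1):
    """Detect *when* motion occurs: enough pixels changed."""
    return len(moving_pixels(prev, curr, threshold)) >= min_pixels

frame_a = [100, 100, 100, 100]   # static background
frame_b = [100, 180, 175, 100]   # an object entered the middle

print(moving_pixels(frame_a, frame_b))  # [1, 2]
print(motion_occurs(frame_a, frame_a))  # False
```

Collecting the changed-pixel sets over time is one way to build the 'active regions' mentioned above.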
Activity Areas from Motion Analysis
Sub-Activity Areas
After statistical processing for temporal localization of motion and events:
- People walking towards each other
- People meet
- People leave together
Fight Sequence
Video Processing (1)
Pre-processing:
- Separate video from audio
- Split video into frames
- Noise removal via spatiotemporal filtering
Scene/shot detection:
- Shot = frames taken by a single camera; detect transitions between frames; uses only low-level information
- Scene = story-telling unit; uses higher-level knowledge and semantics
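Shot detection from low-level information alone can be sketched as a histogram-difference test between consecutive frames, a common baseline. The grey-level "frames", bin count, and threshold below are invented toy values.

```python
# Sketch of low-level shot detection: build a grey-level histogram
# per frame and flag a shot boundary (cut) when the histogram
# difference between consecutive frames exceeds a threshold.
# Frames, bin count and threshold are invented toy data.

def histogram(frame, bins=4, max_val=256):
    """Count pixels per intensity bin."""
    h = [0] * bins
    for px in frame:
        h[px * bins // max_val] += 1
    return h

def shot_boundaries(frames, threshold=0.5):
    """Indices i where the transition frames[i-1] -> frames[i] is a cut."""
    cuts = []
    for i in range(1, len(frames)):
        h1, h2 = histogram(frames[i - 1]), histogram(frames[i])
        diff = sum(abs(a - b) for a, b in zip(h1, h2)) / len(frames[i])
        if diff > threshold:
            cuts.append(i)
    return cuts

frames = [
    [10, 12, 11, 10],      # dark shot
    [11, 10, 12, 13],
    [200, 210, 205, 199],  # cut to a bright shot
    [201, 208, 203, 200],
]
print(shot_boundaries(frames))  # cut detected at frame 2
```

Grouping the resulting shots into scenes is exactly where the higher-level knowledge mentioned above comes in, since histograms alone cannot recognize a story-telling unit.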
Video Processing (2)
- Spatial segmentation: segmentation of images and video frames; extracts objects based on color and texture features
- Motion segmentation: groups pixels with similar motion
- Spatiotemporal segmentation: finds objects over several frames through a combination of motion and appearance features; merges spatial and motion segmentation results
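Motion segmentation ("grouping pixels with similar motion") can be sketched as clustering per-pixel motion vectors. A greedy grouping with a fixed tolerance stands in for a proper clustering algorithm, and the per-pixel motion vectors are invented.

```python
# Sketch of motion segmentation: pixels carrying similar motion
# vectors are grouped into the same region. Greedy grouping with a
# fixed tolerance stands in for real clustering; the per-pixel
# motion vectors (dx, dy) are invented toy data.

def group_by_motion(motions, tol=1.0):
    """motions: list of (dx, dy) per pixel. Returns a region label per pixel."""
    labels, reps = [], []  # reps: representative motion per region
    for dx, dy in motions:
        for r, (rx, ry) in enumerate(reps):
            if (dx - rx) ** 2 + (dy - ry) ** 2 <= tol ** 2:
                labels.append(r)
                break
        else:  # no existing region is close enough: start a new one
            reps.append((dx, dy))
            labels.append(len(reps) - 1)
    return labels

# Three pixels moving right, one static: two motion regions.
motions = [(5.0, 0.1), (5.2, 0.0), (0.0, 0.0), (4.9, -0.1)]
print(group_by_motion(motions))  # [0, 0, 1, 0]
```

Merging these motion-based regions with color/texture regions from spatial segmentation is the spatiotemporal combination described above.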
Knowledge in Video Analysis (1)
Low-level features can be combined with knowledge and rules for higher-level results:
- Spatiotemporally segmented objects can be used for object recognition
- Face/gesture recognition after training with faces/gestures of significance
- Motion in specific parts of a video (e.g. near the court entrance, near the prisoner's seat) has additional significance: this requires prior knowledge of which parts of the video frames are important and why
Knowledge in Video Analysis (2)
Knowledge structures can provide additional information about the relations between different low-level features:
- Interactions, e.g. two motions in opposite directions or the relation between extracted gestures, may mean something: people meeting, fighting, pointing, gesticulating
- Face recognition combined with prior knowledge can show who is present when an event occurs
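The kind of rule described above ("two motions in opposite directions may mean people meeting") can be sketched as an explicit predicate over tracked object positions. The tracks and thresholds are invented toy values, not part of any real rule base.

```python
# Sketch of an explicit knowledge rule on top of low-level motion:
# two tracked objects moving in roughly opposite directions while
# getting closer are labelled a possible "meeting" interaction.
# Tracks (lists of (x, y) positions) are invented toy data.

def approaching(track_a, track_b):
    """True if the distance between the two tracks is shrinking."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    return dist(track_a[-1], track_b[-1]) < dist(track_a[0], track_b[0])

def opposite_directions(track_a, track_b):
    """True if the overall displacement vectors point opposite ways."""
    va = (track_a[-1][0] - track_a[0][0], track_a[-1][1] - track_a[0][1])
    vb = (track_b[-1][0] - track_b[0][0], track_b[-1][1] - track_b[0][1])
    return va[0] * vb[0] + va[1] * vb[1] < 0  # negative dot product

def possible_meeting(track_a, track_b):
    return approaching(track_a, track_b) and opposite_directions(track_a, track_b)

# Person A walks right, person B walks left, towards each other.
a = [(0, 0), (2, 0), (4, 0)]
b = [(10, 0), (8, 0), (6, 0)]
print(possible_meeting(a, b))  # True
```

The rule only fires on low-level track data, so the same predicate can be reused across domains once tracking is in place.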
Conclusions
- The combined use of video processing with knowledge can lead to richer and more accurate high-level descriptions of multimedia data
- It can be used in many more applications than currently, because the knowledge introduces flexibility and adaptability into the system: the same algorithms and low-level features can provide much more information when used in combination with explicit and implicit knowledge
Thank you!
CERTH-ITI / Multimedia Knowledge Laboratory
Video Analysis State of the Art
Spatiotemporal segmentation: find spatiotemporally homogeneous objects, i.e. with similar appearance and motion:
- Apply spatial segmentation on each frame
- Match segmented objects in successive frames using low-level features (e.g. similar color, texture, continuous motion)
- Use motion information: project the position of an object into the current/next frames
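The matching step above can be sketched as pairing each segment in frame t with the most similar segment in frame t+1 under a low-level feature distance (here a mean-color distance; the segment descriptors are invented for illustration).

```python
# Sketch of matching segmented objects across successive frames:
# each segment in the current frame is paired with the segment in
# the next frame whose low-level descriptor (here, a mean color)
# is closest. Segment ids and mean colors are invented toy data.

def color_dist(c1, c2):
    """Squared distance between two RGB mean colors."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2))

def match_segments(curr, nxt):
    """curr, nxt: {segment_id: mean_color}. Returns curr_id -> nxt_id."""
    return {
        cid: min(nxt, key=lambda nid: color_dist(cfeat, nxt[nid]))
        for cid, cfeat in curr.items()
    }

frame_t  = {"sky": (120, 160, 220), "grass": (40, 150, 40)}
frame_t1 = {"regionA": (42, 148, 44), "regionB": (118, 158, 223)}

print(match_segments(frame_t, frame_t1))
# "sky" pairs with regionB, "grass" with regionA
```

Adding the motion-projection cue from the slide would simply add a predicted-position distance to `color_dist` in the combined matching cost.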
Video Analysis State of the Art
Spatial segmentation in images and video frames:
- Region-based: most methods group similar features like color, texture and location, based on homogeneity of intensity, texture and position
- Gradient/edge-based: detect changes in the spatial distribution of features, e.g. pixel illumination
- Some methods combine region and edge information
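The gradient/edge-based approach can be sketched as thresholding intensity differences between horizontally adjacent pixels, a crude 1-D gradient; a real detector would use 2-D operators such as Sobel. The tiny grey-level image is invented for illustration.

```python
# Sketch of gradient/edge-based segmentation: mark an edge wherever
# the horizontal intensity gradient exceeds a threshold. A crude
# 1-D difference stands in for a proper 2-D operator (e.g. Sobel);
# the tiny grey-level image is invented toy data.

def horizontal_edges(image, threshold=50):
    """image: list of rows of grey levels. Returns (row, col) edge sites."""
    edges = []
    for r, row in enumerate(image):
        for c in range(len(row) - 1):
            if abs(row[c + 1] - row[c]) > threshold:
                edges.append((r, c))
    return edges

image = [
    [10, 10, 200, 200],   # dark region | bright region
    [12, 11, 205, 198],
]
print(horizontal_edges(image))  # edge between columns 1 and 2 in each row
```

Region-based methods would instead grow the two homogeneous areas directly; hybrid methods use edge sites like these to stop region growing.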