1
An Architecture for Mining Resources Complementary to Audio-Visual Streams J. Nemrava, P. Buitelaar, N. Simou, D. Sadlier, V. Svátek, T. Declerck, A. Cobet, T. Sikora, N. O'Connor, V. Tzouvaras, H. Zeiner, J. Petrák
2
Introduction
Video retrieval can strongly benefit from textual sources related to the A/V stream. Vast textual resources available on the web can be used for fine-grained event recognition. A good example is sports-related video:
Summaries of matches
Tabular (lists of players, cards, substitutions)
Textual (minute-by-minute reports)
3
Available Resources
Audio-Video Streams: A/V analysis captures features from the video using suitable detectors
Primary Complementary: directly attached to the media (overlay text, spoken commentaries)
Secondary Complementary: independent from the media (written commentaries, summaries, analysis)
4
Audio-Video Analysis
Crowd image detector
Speech-band audio activity
On-screen graphics tracking
Motion activity measure
Field-line orientation
Close-up detector
5
Primary Complementary Resources
Video track:
Overlay text: OCR text-region detection, time synchronization; 16 frames are merged to separate static from moving objects in the video
Textual information such as overlay text and player numbers provides an additional primary resource
Audio track:
Speech commentaries
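As a concrete illustration of the frame-merging step, the sketch below (an assumption about the implementation, not the authors' code) uses the temporal variance across 16 consecutive grayscale frames: pixels belonging to a static overlay barely change over the window, while moving scene content does. The threshold value is illustrative.

```python
import numpy as np

def overlay_text_mask(frames, var_threshold=50.0):
    """Separate static overlay text from moving scene content.

    frames: 16 consecutive grayscale frames as 2-D uint8 arrays.
    Pixels whose intensity barely changes across the window are likely
    part of a static overlay (score box, clock); high-variance pixels
    belong to the moving scene. var_threshold is an assumed tuning value.
    """
    stack = np.stack([f.astype(np.float32) for f in frames])  # (16, H, W)
    variance = stack.var(axis=0)                              # per-pixel temporal variance
    return variance < var_threshold                           # True = candidate overlay pixel
```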
6
Secondary Complementary Resources
Tabular:
Summaries, lists of players, goals, cards
"Meta" information: location, referee, attendance, date
7
Secondary Complementary Resources
Unstructured:
Several minute-by-minute sources
Text analysis and event extraction using SProUT, an ontology-based IE tool
Player actions and player names, in German and English
Example: 'A beautiful pass by Ruud Gullit set up the first Rijkaard header.'
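SProUT itself is a grammar- and ontology-driven extraction system; as a rough, hypothetical stand-in for what it produces on the example sentence, the sketch below pairs a small action vocabulary with capitalized player names. The patterns and vocabulary are illustrative assumptions, not SProUT's actual rules.

```python
import re

# Hypothetical stand-in for ontology-based extraction: match a small
# vocabulary of football actions next to capitalized player names.
ACTIONS = r"(pass|header|shot|cross|tackle|save|goal)"
PLAYER = r"([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)"

def extract_events(sentence):
    events = []
    # "<action> by <Player>" pattern, e.g. "pass by Ruud Gullit"
    for m in re.finditer(ACTIONS + r"\s+by\s+" + PLAYER, sentence):
        events.append((m.group(2), m.group(1)))
    # "<Player> <action>" pattern, e.g. "Rijkaard header"
    for m in re.finditer(PLAYER + r"\s+" + ACTIONS, sentence):
        events.append((m.group(1), m.group(2)))
    return events

print(extract_events("A beautiful pass by Ruud Gullit set up the first Rijkaard header."))
# [('Ruud Gullit', 'pass'), ('Rijkaard', 'header')]
```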
8
Ontology
SProUT uses the SmartWeb football ontology, covering:
Player actions
Referee actions
Trainer actions
9
Architecture Overview
10
Architecture Overview
11
Reasoning over Complementary Resources of Football Games
Textual sources (per coarse-grained minute):
Extraction of semantic concepts from unstructured texts using DFKI's ontology-based information extraction tool
Video analysis (for every second) - DCU:
Crowd image detector - values in [0, 1]
Speech-band audio activity - values in [0, 1]
Motion activity measure - values in [0, 1]
Close-up - values in [0, 1]
Field-line orientation - values in [0, 90]
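A minimal sketch of how one second of detector output could be carried through the pipeline; the detector names and ranges follow the slide, while the container itself is an assumption:

```python
from dataclasses import dataclass

@dataclass
class DetectorSample:
    """One second of video-analysis output (illustrative layout)."""
    second: int
    crowd: float        # crowd image detector, in [0, 1]
    audio: float        # speech-band audio activity, in [0, 1]
    motion: float       # motion activity measure, in [0, 1]
    close_up: float     # close-up detector, in [0, 1]
    line_angle: float   # field-line orientation, in degrees [0, 90]
```

Since textual events arrive per coarse-grained minute and detector samples per second, one minute groups roughly sixty such samples for alignment.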
12
Video Analysis Fuzzification
A period of 20 seconds is evaluated at a time.
A threshold value was set according to the detector's mean value during the game; the top value was mapped onto [0, 1].
A similar process is applied to the motion, close-up and crowd detectors.
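A sketch of this fuzzification step, assuming a linear rescaling between the game-wide mean (the threshold) and the observed top value; the slide does not spell out the exact mapping, so treat this as one plausible reading:

```python
def fuzzify_window(values, threshold, top):
    """Map a 20-second window of detector values onto [0, 1].

    threshold: the detector's mean value over the whole game (per slide)
    top:       the detector's maximum value over the game (per slide)
    The linear rescaling below is an assumption about the mapping.
    """
    window_max = max(values)
    if window_max <= threshold:
        return 0.0
    return min((window_max - threshold) / (top - threshold), 1.0)
```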
13
Video Analysis Fuzzification - Line Angle
Values between 0 and 7 indicate Middle Field
Values between 17 and 27 indicate End of Field
Fuzzification according to occurrences within the 20-second period
Example:
Middle Field: 13 occurrences, fuzzy value = 0.65
End of Field: 4 occurrences, fuzzy value = 0.2
Other: 3 occurrences, fuzzy value = 0.15
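This step is fully determined by the slide's example (memberships are occurrence shares over the 20 observations: 13/20 = 0.65, 4/20 = 0.2, 3/20 = 0.15), so it can be sketched directly:

```python
def fuzzify_line_angle(angles):
    """Fuzzify field-line orientation over a 20-second window
    (one angle reading per second)."""
    counts = {"MiddleField": 0, "EndOfField": 0, "Other": 0}
    for a in angles:
        if 0 <= a <= 7:
            counts["MiddleField"] += 1
        elif 17 <= a <= 27:
            counts["EndOfField"] += 1
        else:
            counts["Other"] += 1
    n = len(angles)
    # Membership of each zone is its share of the observations.
    return {zone: c / n for zone, c in counts.items()}
```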
14
Declaring the Alphabet
Concepts = {Scoringopportunity, Outofplay, Handball, Kick, Scoregoal, Cross, Foul, Clear, Cornerkick, Dribble, Freekick, Header, Trap, Shot, Throw, Pass, Ballpossession, Offside, Charge, Lob, Challenge, Booked, Goalkeeperdive, Block, Save, Substitution, Tackle, EndOfField, MiddleField, Other, Crowd, Motion, CloseUp, Audio}
Roles = {consistOf}
Individuals = {min0, sec20, sec40, sec60, min1, sec80, sec100, sec120, min2, sec140, sec160, sec180, min3, sec200, …}
15
Knowledge Representation - ABox
⟨min1 : Kick ≥ 1⟩
⟨min1 : Scoregoal ≥ 1⟩
⟨sec80 : Audio ≥ 0.06⟩
⟨sec80 : Crowd ≥ 0.231⟩
⟨sec80 : Motion ≥ 0.060⟩
⟨sec80 : EndOfField ≥ 0.05⟩
⟨(min1, sec60) : consistOf ≥ 1⟩
⟨(min1, sec80) : consistOf ≥ 1⟩
⟨(min1, sec100) : consistOf ≥ 1⟩
⟨(min1, sec120) : consistOf ≥ 1⟩
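A minimal in-memory encoding of the fuzzy assertions above, assuming (individual, concept) pairs mapped to lower bounds on membership degrees; a real system would hand these assertions to a fuzzy-DL reasoner such as FiRE rather than plain dictionaries:

```python
# Fuzzy concept assertions: (individual, Concept) -> lower bound on membership.
concept_assertions = {
    ("min1", "Kick"): 1.0,
    ("min1", "Scoregoal"): 1.0,
    ("sec80", "Audio"): 0.06,
    ("sec80", "Crowd"): 0.231,
    ("sec80", "Motion"): 0.060,
    ("sec80", "EndOfField"): 0.05,
}

# Fuzzy role assertions: (subject, object, role) -> lower bound.
role_assertions = {
    ("min1", "sec60", "consistOf"): 1.0,
    ("min1", "sec80", "consistOf"): 1.0,
    ("min1", "sec100", "consistOf"): 1.0,
    ("min1", "sec120", "consistOf"): 1.0,
}
```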
16
Knowledge Representation - TBox
17
Query Examples
18
Architecture Overview
19
Cross-Media Features
Basic idea: identify which video detectors are most prominent for which event class. For instance, for CORNERKICK the "end-zone" video detector should be significantly high.
Strategy (sketched in code below):
Analyze the distribution of video detectors over event classes
Identify significant detectors for each class
Feed the result back into the video event detection algorithm
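A sketch of the distribution-analysis step in the strategy above. The data layout (events as (minute, class) pairs from the textual sources, per-second detector samples keyed by minute) is an assumption; the slide only states that detector distributions are profiled per event class:

```python
from collections import defaultdict

def detector_profile(events, samples):
    """Mean value of each detector per event class.

    events:  list of (minute, event_class) extracted from textual sources
    samples: dict minute -> list of per-second detector dicts
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for minute, event in events:
        for sample in samples.get(minute, []):
            for detector, value in sample.items():
                sums[event][detector] += value
            counts[event] += 1
    return {event: {d: v / counts[event] for d, v in dets.items()}
            for event, dets in sums.items()}
```

Under the slide's expectation, the CORNERKICK class should then show a noticeably high mean for the end-of-field detector.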
20
Cross-Media Features
The purpose of the cross-media descriptors is to capture the features and relations in multimodal data, so that complementary information can be retrieved when dealing with only one of the data sources.
Build a model to classify events in video independently of the video itself.
Cross-media features are used for event-type classification of video segments by fuzzy reasoning with the FiRE inference engine; FiRE is focused on event retrieval.
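FiRE's actual interface is not shown in the slides; as a hedged illustration of fuzzy event-type matching, the sketch below evaluates a conjunctive, FiRE-style concept over fuzzified detector values using the Gödel t-norm (min), which is the standard conjunction in fuzzy DLs:

```python
def matches_event(rule, observation):
    """Degree to which a fuzzified 20-second observation supports an event class.

    rule: the set of detectors a concept definition conjoins (a hypothetical
    stand-in for a FiRE concept); conjunction is the Goedel t-norm (min).
    """
    return min(observation.get(d, 0.0) for d in rule)

cornerkick_rule = {"EndOfField", "Crowd", "Audio"}           # illustrative rule
obs = {"EndOfField": 0.8, "Crowd": 0.7, "Audio": 0.4, "Motion": 0.2}
print(matches_event(cornerkick_rule, obs))                   # 0.4
```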
21
Thank you for your attention