MUSCLE movie database is a multimodal movie corpus collected to develop content-based multimedia processing like: - speaker clustering - speaker turn.


1 The MUSCLE movie database is a multimodal movie corpus collected to develop content-based multimedia processing, such as:
- speaker clustering
- speaker turn detection
- visual speech activity detection
- face detection
- facial feature detection
- face clustering
- scene segmentation
- saliency detection
- multimodal dialogue detection

2 This database covers four different modalities:
- audio
- video
- audiovisual
- text
The video annotation tools ANVIL and Anthropos 7 Editor are described.

3 VIDEO ANNOTATION TOOL
ANVIL: video annotation tool
- It offers hierarchical multi-layered annotation
- The annotation board shows colour-coded elements on multiple tracks in time alignment
- ANVIL can import data from PRAAT and XWaves

4 Anthropos 7 Editor
- Anthropos 7 Editor is an annotation tool for MPEG-7 data
- It makes viewing and editing MPEG-7 data easier
- To visualise information, Anthropos 7 Editor uses the Timeline Area. Information based on a single frame is visualised in the Video Area, static movie information in the Static Information Area.
- These areas communicate with each other
- Anthropos 7 Editor can visualise the ROI (region of interest) of each actor. The user can interact using the mouse.
- Every image region encompassing an actor can be overlaid as a box, and the user can modify it

5 MUSCLE movie database specifications
- Concepts like dialogue and saliency must be described independently: audio-only, video-only and also audio-visual description

6 Dialogue annotation
- 54 movie scenes extracted from 8 movies
- The language for all scenes is English
- Duration of each scene is 24-123 seconds
- Each movie scene was separated into two different files: an audio file and a video file

7 MUSCLE movie database description

Movie title            Dialogue scenes   Non-dialogue scenes   Scenes per movie
Analyze That                  4                   2                    6
Cold Mountain                 5                   1                    6
Jackie Brown                  3                   3                    6
Lord of the Rings I           5                   3                    8
Platoon                       4                   2                    6
Secret Window                 4                   6                   10
The Prestige                  4                   2                    6
American Beauty              10                   0                   10
Total number                 39                  19                   58
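The table can be checked mechanically. The following is a minimal sketch, not part of the original slides: the counts are copied from the table, while the names `SCENES`, `column_totals` and `inconsistent_rows` are hypothetical helpers introduced only for illustration.

```python
# Scene counts per movie, copied from the table above:
# (dialogue scenes, non-dialogue scenes, scenes per movie).
SCENES = {
    "Analyze That": (4, 2, 6),
    "Cold Mountain": (5, 1, 6),
    "Jackie Brown": (3, 3, 6),
    "Lord of the Rings I": (5, 3, 8),
    "Platoon": (4, 2, 6),
    "Secret Window": (4, 6, 10),
    "The Prestige": (4, 2, 6),
    "American Beauty": (10, 0, 10),
}


def column_totals(scenes):
    """Sum the dialogue, non-dialogue and per-movie columns."""
    dialogue = sum(d for d, _, _ in scenes.values())
    non_dialogue = sum(n for _, n, _ in scenes.values())
    total = sum(t for _, _, t in scenes.values())
    return dialogue, non_dialogue, total


def inconsistent_rows(scenes):
    """Return titles whose two scene counts do not add up to the row total."""
    return [title for title, (d, n, t) in scenes.items() if d + n != t]


print(column_totals(SCENES))
print(inconsistent_rows(SCENES))
```

Each row is internally consistent, and the column sums match the "Total number" row of the table.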

8 Types of dialogues for audio:
- CD (clean dialogue): dialogue with low-level audio background
- BD (dialogue with background): dialogue in the presence of noisy background or music
- a monologue is classified as CM (clean monologue) or BM (monologue with background)
- all scenes not labeled CD or BD are considered non-dialogue
Types of dialogues for video:
- CD: 2 actors present in the scene
- BD: at least two actors are present
- monologue types for video are labeled CM or BM
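The labelling rule on this slide can be sketched as a small enumeration. This is an illustrative sketch only: the class and function names are hypothetical, and the expansion of CD as "clean dialogue" is an assumption made by analogy with CM (clean monologue).

```python
from enum import Enum


class SceneLabel(Enum):
    """Scene labels named on the slide; the CD expansion is an assumption."""
    CD = "clean dialogue"
    BD = "dialogue with background"
    CM = "clean monologue"
    BM = "monologue with background"


# Per the slide, every scene not labeled CD or BD counts as non-dialogue.
DIALOGUE_LABELS = frozenset({SceneLabel.CD, SceneLabel.BD})


def is_dialogue(label):
    """Return True only for scenes labeled CD or BD."""
    return label in DIALOGUE_LABELS


for label in SceneLabel:
    print(label.name, "dialogue" if is_dialogue(label) else "non-dialogue")
```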

9 Metadata for audio files:
- Speech activity data: speech intervals (defined by the start and end time)
Metadata for video files:
- Lip activity data (defined by the start and end time and frame)
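An interval defined by a start and end time, with a mapping from time to frame, could be represented as follows. This is a sketch under stated assumptions: the `Interval` class and `time_to_frame` helper are hypothetical, and the default frame rate of 25 fps is an assumption, since the slides do not state the frame rate of the clips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A speech or lip activity interval, in seconds from the scene start."""
    start: float
    end: float

    def __post_init__(self):
        # Reject intervals that end before they start.
        if self.end < self.start:
            raise ValueError("interval ends before it starts")

    @property
    def duration(self):
        return self.end - self.start


def time_to_frame(seconds, fps=25.0):
    """Map a time stamp to the nearest frame index (fps is an assumed value)."""
    return int(round(seconds * fps))


speech = Interval(start=1.0, end=3.5)
print(speech.duration, time_to_frame(speech.start), time_to_frame(speech.end))
```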

10 States to label lip activity intervals:
- 0: back of actor’s head visible
- 1: actor’s frontal face is visible
- 2: actor’s frontal face visible + lip activity
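The three states map naturally onto an integer enumeration. The sketch below is illustrative: the names are hypothetical, state 0 is read as "back of actor's head visible" (an assumption about the garbled slide text), and the `(start, end, state)` tuple layout is invented for the example.

```python
from enum import IntEnum


class LipState(IntEnum):
    """Lip activity states 0-2 as listed on the slide."""
    BACK_OF_HEAD = 0               # assumed reading of state 0
    FRONTAL_FACE = 1               # frontal face visible
    FRONTAL_FACE_LIP_ACTIVITY = 2  # frontal face visible + lip activity


def lip_active_intervals(labelled):
    """Keep only the (start, end) pairs labelled with state 2.

    `labelled` is a list of (start, end, state) tuples; this layout is a
    hypothetical one chosen for the example.
    """
    return [
        (start, end)
        for start, end, state in labelled
        if LipState(state) is LipState.FRONTAL_FACE_LIP_ACTIVITY
    ]


example = [(0.0, 1.2, 0), (1.2, 2.0, 1), (2.0, 4.5, 2)]
print(lip_active_intervals(example))
```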

11 Afterwards:
- Face tracking info is extracted from the scenes
- The extracted info is processed by a human annotator
- The face of each actor in a dialogue or monologue is assigned a bounding box
- Data is saved in XML MPEG-7 format
- The two files (audio, video) are merged into one XML file for each scene
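The merge of the audio and video descriptions into one XML file per scene might look roughly like the sketch below. It uses Python's standard `xml.etree.ElementTree`. The element and attribute names (`Scene`, `AudioDescription`, `VideoDescription`, `SpeechInterval`, `BoundingBox`) are hypothetical placeholders and do not follow the actual MPEG-7 schema, which the slides do not show.

```python
import xml.etree.ElementTree as ET


def merge_scene_descriptions(audio_xml, video_xml, scene_id):
    """Wrap the audio and video description trees in one scene element.

    All element names are illustrative, not real MPEG-7 descriptors.
    """
    audio_root = ET.fromstring(audio_xml)
    video_root = ET.fromstring(video_xml)
    merged = ET.Element("Scene", {"id": scene_id})
    merged.append(audio_root)
    merged.append(video_root)
    return ET.tostring(merged, encoding="unicode")


AUDIO = '<AudioDescription><SpeechInterval start="0.0" end="2.4"/></AudioDescription>'
VIDEO = '<VideoDescription><BoundingBox actor="A" x="10" y="20" w="64" h="64"/></VideoDescription>'

print(merge_scene_descriptions(AUDIO, VIDEO, "scene-01"))
```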

12 Saliency annotation
- Based on detection of "pop-out" events (abrupt changes, abnormalities, e.g. in speech, environmental noises etc.)

13
- 3 movie clips (27 mins) from 3 different movies of different genres
- Chosen carefully to represent all cases of saliency
- Audio content includes: speech in a dialogue, with background sounds like music and noises
- The background sounds: animals, knocking, cars etc.
- Visual content: abrupt scene changes, editing effects, e.g. computer-made light

14
- clips are annotated by two different annotators
- an event considered salient is annotated separately
- for audio: salience depends on how important the annotator judges the sound to be in the scene
- for visual: pop-out colour and motion; sudden events can be regarded as salient
- silence is not annotated

15
- ANVIL used for saliency detection
- 3 main saliency categories of the annotation scheme: visual, audio, generic saliency
- Audio saliency is annotated using the auditory sense
- Visual saliency using the visual sense
- Generic saliency using both modalities simultaneously

16 Audio saliency
- Description of the audio in the scene
- Chosen categories: dialogue, music, noise, sound effect, environmental sound, machine sound, background sound, unclassified sound, mixed sound
- The annotator can choose more than one sound
- Speech saliency is measured by intensity and loudness of voice

17 Visual saliency
- Description of the object’s motion
- Pop-out events annotated as well
Visual saliency attributes:
- Motion: Start-Stop, Stop-Start, Impulsive event, Static, Moving, Other
- Changes of cast (binary decision)
- Pop-out event (binary decision)
- Saliency Factor: None, Low, Mid, High
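One way to hold a visual saliency annotation with the attributes listed above is a small validated record. The value sets are copied from the slide; the class name, field names and validation behaviour are hypothetical and do not reflect how ANVIL stores its annotations.

```python
from dataclasses import dataclass

# Allowed values, copied from the slide.
MOTION_VALUES = frozenset(
    {"Start-Stop", "Stop-Start", "Impulsive event", "Static", "Moving", "Other"}
)
SALIENCY_FACTORS = ("None", "Low", "Mid", "High")


@dataclass(frozen=True)
class VisualSaliency:
    """One visual saliency annotation (hypothetical record layout)."""
    motion: str
    change_of_cast: bool   # binary decision
    pop_out_event: bool    # binary decision
    saliency_factor: str

    def __post_init__(self):
        # Reject values outside the annotation scheme.
        if self.motion not in MOTION_VALUES:
            raise ValueError(f"unknown motion value: {self.motion!r}")
        if self.saliency_factor not in SALIENCY_FACTORS:
            raise ValueError(f"unknown saliency factor: {self.saliency_factor!r}")


annotation = VisualSaliency("Impulsive event", False, True, "High")
print(annotation)
```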

18 Generic saliency
- A low-level description of saliency
- Description features are: audio, visual, audiovisual
- Saliency is measured as high, mid or low

