
Sriram Tata SID: 800448062

Introduction: Large digital video libraries require tools for representing, searching, and retrieving content. One possibility is the query-by-example (QBE) approach, in which users provide (usually visual) examples of the content they seek. Since most users wish to search in terms of semantic concepts rather than by visual content, however, work in video retrieval has begun to shift from QBE to query-by-keyword (QBK) approaches, which allow users to specify their queries in terms of a limited vocabulary of semantic concepts. This paper presents an overview of an ongoing IBM project that is developing a trainable QBK system for the labeling and retrieval of generic multimedia semantic concepts in video.

Motivation: In prior work, the emphasis has been on extracting semantics from individual modalities, in some instances using both audio and visual modalities. This paper combines audio and video content analysis with information retrieval in a unified setting for the semantic labeling of multimedia content.

Researchers' Approach: The researchers approach semantic labeling as a machine-learning problem. The assumption is that an a priori defined set of atomic semantic concepts, such as objects, scenes, and events, is broad enough to cover the semantic query space of interest. These atomic concepts are annotated manually in audio, speech, and/or video within a set of "training" videos.

Challenges: First, low-level features appropriate for labeling atomic concepts must be identified, since different features may be appropriate for different concepts, and suitable schemes for modeling these features must be selected; techniques are also needed for segmenting objects automatically from video. Second, high-level concepts must be linked to the presence of other concepts, and statistical models for combining the concept models into a high-level model must be chosen. Third, cutting across these levels, information from multiple modalities must be integrated, or fused.

Semantic-Content Analysis System: The proposed IBM system for semantic-content analysis and retrieval comprises three components: (1) tools for defining a lexicon of semantic concepts and annotating examples of those concepts within a set of training videos; (2) schemes for automatically learning representations of the semantic concepts in the lexicon from the labeled examples; and (3) tools supporting data retrieval using the semantic concepts.

Lexicon of semantic concepts: The lexicon of semantic concepts defines the working set of intermediate- and high-level concepts, covering events, scenes, and objects.

Annotation: Manually labeled training data are required in order to learn a representation of each concept in the lexicon. Annotation of visual data is performed at the shot level; since object concepts such as rockets and cars may occupy only a region within a shot, the tools also allow users to associate object labels with an individual region in a key-frame image by drawing manual bounding boxes (MBBs). Annotation of audio data is performed by specifying the time spans over which each audio concept, such as speech, occurs; speech segments are then manually transcribed. Multimodal annotation follows, with synchronized playback of audio and video during the annotation process.

Learning semantic concepts from features: Mapping low-level features to semantics is a challenging problem. For the labeled training data, useful features must be extracted and used to construct a representation of each atomic concept. For this purpose, the paper uses human knowledge to determine which types of features are appropriate for each concept. Atomic concepts are modeled using features from a single modality; the integration of cues from multiple modalities occurs only within models of high-level concepts.

Modeling techniques: Probabilistic modeling of semantic concepts and events uses models such as Gaussian mixture models (GMMs), hidden Markov models (HMMs), and Bayesian networks; discriminant approaches include support vector machines (SVMs).

Probabilistic modeling for semantic classification: A semantic concept is modeled as a class-conditional probability density function over a feature space. GMMs are used for independent observation vectors and HMMs for time-series data. A GMM defines the probability density function of an n-dimensional observation vector x given a model M as

p(x \mid M) = \sum_{i=1}^{k} \pi_i \, \mathcal{N}(x; \mu_i, \Sigma_i),

where \mu_i is an n-dimensional mean vector, \Sigma_i is an n \times n covariance matrix, and \pi_i is the mixing weight for the i-th Gaussian.
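
As a concrete illustration of this decision rule, here is a minimal Python sketch (not the paper's code) of GMM-based scoring: a concept model and a null model are each evaluated on a feature vector and compared through a likelihood ratio. The model layout and threshold are assumptions for illustration.

    from scipy.stats import multivariate_normal

    def gmm_density(x, weights, means, covs):
        # p(x | M) = sum_i pi_i * N(x; mu_i, Sigma_i)
        return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
                   for w, m, c in zip(weights, means, covs))

    def classify(x, concept_model, null_model, threshold=1.0):
        # Flag the concept as present when the likelihood ratio of the
        # concept GMM to the null GMM exceeds a threshold (an assumed rule;
        # each model is a (weights, means, covs) tuple).
        p_true = gmm_density(x, *concept_model)
        p_null = gmm_density(x, *null_model)
        return p_true / p_null > threshold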

An HMM [20] allows us to model a sequence of observations (x_1, x_2, \ldots, x_n) as having been generated by an unobserved state sequence s_1, \ldots, s_n with a unique starting state s_0, giving the probability of the model M generating the output sequence as

P(x_1, \ldots, x_n \mid M) = \sum_{s_1, \ldots, s_n} \prod_{i=1}^{n} p(s_i \mid s_{i-1}) \, q(x_i \mid s_{i-1}, s_i),

where the emission probability q(x_i \mid s_{i-1}, s_i) can be modeled using a GMM, for instance, and p(s_i \mid s_{i-1}) are the state transition probabilities.
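
To show how the sum over state sequences is evaluated in practice, the following sketch implements the standard forward algorithm in log space. It is a simplification: the emission model here is q(x_t | s_t) rather than the transition-dependent q(x_i | s_{i-1}, s_i) above, and the array layout is an assumption; each row of log_b could come from a per-state GMM.

    import numpy as np

    def forward_log_likelihood(log_b, log_A, log_pi):
        # log_b:  (T, S) log emission scores, log q(x_t | state s)
        # log_A:  (S, S) log transition probabilities, log p(s_t | s_{t-1})
        # log_pi: (S,)   log initial-state probabilities
        T, S = log_b.shape
        alpha = log_pi + log_b[0]                       # log alpha_1
        for t in range(1, T):
            m = alpha.max()                             # logsumexp trick
            alpha = np.log(np.exp(alpha - m) @ np.exp(log_A)) + m + log_b[t]
        m = alpha.max()
        return m + np.log(np.exp(alpha - m).sum())      # log P(x_1..x_T | M)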

Discriminant techniques: Support Vector Machines: The reliable estimation of the class-conditional parameters of the previous section requires large amounts of training data for each class, which may not be available for many semantic concepts of interest, so SVMs with radial basis function kernels are one alternative. An SVM finds a best-fitting hyperplane that maximizes generalization capability while minimizing misclassification errors. Given a set of training samples (x_1, \ldots, x_n) and their corresponding labels (y_1, \ldots, y_n) with y_i \in \{-1, 1\}, an SVM maps the samples to a higher-dimensional space using a predefined nonlinear mapping \Phi(x) and finds a linear hyperplane w \cdot \Phi(x) + b separating the two classes in this high-dimensional space by solving

\min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \; \xi_i \ge 0,

where the slack variables \xi_i capture the misclassification cost.
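
A minimal sketch of this formulation using scikit-learn's SVC with a radial basis function kernel; the features, labels, and hyperparameters below are placeholders rather than the paper's settings.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))        # placeholder training features
    y = np.where(X[:, 0] > 0, 1, -1)          # placeholder labels in {-1, +1}

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X, y)

    # The signed distance to the hyperplane w . Phi(x) + b can serve
    # directly as a concept score for ranking shots.
    scores = clf.decision_function(X)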

Learning visual concepts: For static visual scenes or objects, the class-conditional density functions of the feature vector under the true and null hypotheses are modeled as mixtures of multidimensional Gaussians. The paper compares the performance of GMMs and SVMs for the classification of static scenes and objects. In both cases, the features being modeled are extracted from regions in the video or from the entire frame, depending on the type of concept.

Learning audio concepts: The scheme for modeling audio-based atomic concepts, such as silence, rocket engine explosion, or music, begins with the annotated audio training set. One scheme for incorporating duration modeling is the HMM. Representing concepts using speech: Speech cues may be derived from one of two sources: manual transcriptions, such as closed captioning, or the output of automatic speech recognition (ASR) on the speech segments of the audio. The transcriptions must be split into documents and preprocessed for retrieval. Documents are defined here in two ways: the words corresponding to a shot, or the words occurring symmetrically around the center of a shot.

Representing concepts using speech: This document construction scheme gives a straightforward mapping between documents and shots. The procedure for labeling a particular semantic concept using speech information alone assumes the a priori definition of a set of query terms pertinent to that concept. One straightforward scheme for obtaining such a set of query terms automatically is to use the most frequent words occurring within shots annotated with that concept.
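
The two steps just described (building shot-level documents from time-stamped transcript words, then harvesting frequent words from positively annotated shots as query terms) might look like the following sketch; the data structures are assumptions, not the paper's implementation.

    from collections import Counter

    def shot_documents(words, shot_bounds):
        # words: list of (timestamp, token); shot_bounds: list of (start, end).
        # Each document collects the transcript words falling within a shot.
        return [[tok for t, tok in words if start <= t < end]
                for start, end in shot_bounds]

    def query_terms(docs, positive_shots, stopwords=frozenset(), k=10):
        # Query terms = the most frequent non-stopword tokens in the shots
        # manually annotated with the concept.
        counts = Counter(tok for i in positive_shots for tok in docs[i]
                         if tok not in stopwords)
        return [tok for tok, _ in counts.most_common(k)]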

Learning multimodal concepts: Up to this point, concepts have been modeled in individual modalities. Each of these models is used to generate scores for its concept in unseen video. One or more of these concept scores are then combined, or fused, within models of high-level concepts, which may in turn contribute scores to other high-level concepts.

Inference using graphical models: A Bayesian network is used to combine audio, visual, and textual information. Bayesian networks allow us to graphically specify a particular form of the joint probability density function. The figure on the original slide (not reproduced here) represents just one of many possible Bayesian network model structures for integrating scores from atomic concept models.
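
As a toy illustration of one such network, the sketch below fuses binarized audio and visual soft decisions through a conditional probability table for a rocket-launch node; the CPT values and threshold are invented for illustration and are not the paper's learned parameters.

    import numpy as np

    # cpt[a, v] = P(rocket launch | audio evidence a, visual evidence v),
    # with a, v in {0, 1}. Values are illustrative only.
    cpt = np.array([[0.02, 0.30],
                    [0.25, 0.90]])

    def launch_posterior(audio_score, visual_score, thresh=0.5):
        a = int(audio_score > thresh)    # discretize the soft decisions
        v = int(visual_score > thresh)
        return cpt[a, v]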

Classifying concepts using SVMs: In this approach, the scores from all the intermediate concept classifiers are concatenated into a vector, which is used as the feature vector for the SVM, as illustrated in the figure on the original slide.

Classifying concepts using SVMs: A cluster in the feature space maps into a one-dimensional cluster of scores for any given classifier. Given a set of classifiers, the combination of these one-dimensional score clusters maps into a cluster in a semantic feature space. We can then view the fusion SVM as operating in this new "feature" space and finding a new decision boundary; the original slide illustrates this for a two-dimensional feature space and two classifiers.
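
A brief sketch of this score-space fusion, assuming each intermediate classifier exposes a scikit-learn-style decision_function; the names and shapes are assumptions.

    import numpy as np
    from sklearn.svm import SVC

    def semantic_features(X, concept_classifiers):
        # One column of scores per intermediate concept model: each shot's
        # low-level features map to a point in the semantic score space.
        return np.column_stack([clf.decision_function(X)
                                for clf in concept_classifiers])

    # The fusion SVM is then trained in this semantic space, e.g.:
    # fusion_svm = SVC(kernel="rbf").fit(semantic_features(X_tr, models), y_tr)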

Experimental Results: We now demonstrate the application of the semantic-content analysis framework to the task of detecting several semantic concepts in the NIST TREC Video 2001 corpus. Annotation is applied at the level of camera shots. A total of 7 videos comprising 1248 video shots are used: the sequences entitled anni005, anni006, anni009, anni010, nad28, nad30, and nad55 in the TREC 2001 corpus. Examination of the corpus supports the hypothesis that the integration of cues from multiple modalities is necessary to achieve good concept labeling and retrieval performance.

Visual shot detection: Shot segmentation of these videos was performed using the IBM CueVideo toolkit. Key frames are selected from each shot, and low-level features representing color, structure, and shape are extracted. Audio feature detection: The low-level features used to represent audio are 24-dimensional mel-frequency cepstral coefficients (MFCCs), common in ASR systems.
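
For reference, extracting 24 MFCCs per frame can be sketched with librosa as below; the file name and front-end defaults are assumptions, since the exact analysis parameters are not given here.

    import librosa

    # Load one shot's audio track (hypothetical file) at its native rate.
    y, sr = librosa.load("shot_audio.wav", sr=None)
    # 24 mel-frequency cepstral coefficients per frame, shape (24, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24)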

Lexicon: The current lexicon comprises more than fifty semantic concepts for describing events, sites, and objects with cues in audio, video, and/or speech. Only a subset is used in these experiments: (i) visual concepts: rocket object, fire/smoke, sky, outdoor; (ii) audio concepts: rocket engine explosion, music, speech, noise; (iii) multimodal concept: rocket launch.

Retrieval using models for visual features (Results: GMM versus SVM classification): These results concern the detection of visual concepts. GMM classification builds a GMM for the positive and the negative hypotheses for each feature type for each semantic concept; results across features from these multiple classifiers are then merged using the naive Bayes approach, as sketched below. The table on the original slide shows the overall retrieval effectiveness for a variety of intermediate visual semantic concepts with SVM and GMM classifiers.
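
Under the naive Bayes assumption that feature types are conditionally independent given the concept, the merge amounts to summing per-feature-type log-likelihood ratios; a sketch follows, reusing the gmm_density helper from the earlier GMM sketch (the dict layout is hypothetical).

    import numpy as np

    def naive_bayes_score(feature_vectors, models):
        # feature_vectors: {feature_type: x}; models: {feature_type:
        # (concept_gmm, null_gmm)}, each GMM a (weights, means, covs) tuple.
        score = 0.0
        for ftype, x in feature_vectors.items():
            concept, null = models[ftype]
            score += (np.log(gmm_density(x, *concept))
                      - np.log(gmm_density(x, *null)))
        return score    # rank shots by this summed log-likelihood ratio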

Results: GMM versus SVM classification: The original slides show precision-recall curves for four visual concepts: outdoor, sky, rocket object, and fire/smoke. [Figure: precision-recall curves for (a) outdoor and (b) sky.]

Results: GMM versus SVM classification: [Figure: precision-recall curves for (a) rocket object and (b) fire/smoke.]

Retrieval using models for audio features: This section presents two sets of results. The first examines the effect of minimum-duration modeling on intermediate concept retrieval; the second examines different schemes for fusing scores from multiple audio-based intermediate concept models in order to retrieve the high-level rocket launch concept.

Results: minimum duration modeling: The figure on the original slide compares retrieval of the rocket engine explosion concept using HMM and GMM scores, respectively. Notice that the HMM model has significantly higher precision than the GMM model at all recall values.

Results: fusion of scores from multiple audio models: The figure on the original slide compares implicit and explicit fusion of the atomic audio concepts for retrieval of the high-level rocket launch concept.

Retrieval using speech: This section presents two sets of results: retrieval of the rocket launch concept using manually produced ground-truth transcriptions, and retrieval using transcriptions produced by ASR.

Retrieval using fusion of multiple modalities: This section presents results for the rocket launch concept, which is inferred from concept models based on multiple modalities, under two different integration schemes: Bayesian network integration and SVM integration.

Bayesian network integration: A Bayesian network is used to combine the soft decision of the visual classifier for rocket object with the soft decision of the audio classifier for explosion in a model of the rocket launch concept. The figure on the original slide illustrates the results of using the Bayesian network for fusion, showing precision-recall values for the first 100 documents retrieved.

SVM integration: For fusion with an SVM, the scores from all the semantic models across the audio, video, and text modalities are concatenated into a 9-dimensional feature vector.

Results: The table on the original slide shows the figure of merit (FOM) for both fusion models, making it clear that the fusion models are superior to the retrieval results of the individual modalities.

Results: The figure on the original slide gives qualitative evidence of the success of the SVM model: among the top 20 images retrieved, 19 are rocket launch shots.

Conclusion: This paper presented an overview of a trainable QBK system for labeling semantic concepts within unrestricted video. The experimental results suffice to show that information from multiple modalities (visual, audio, speech, and potentially video text) can be successfully integrated to improve semantic labeling performance over that achieved by any single modality. Finally, the proposed fusion scheme achieves more than a 10% relative improvement over the best unimodal concept detector.

Thank You