Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues. By W. H. Adams, Giridharan Iyengar, Ching-Yung Lin, Milind Ramesh Naphade, Chalapathy Neti, Harriet J. Nock, and John R. Smith.

Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues. By W. H. Adams, Giridharan Iyengar, Ching-Yung Lin, Milind Ramesh Naphade, Chalapathy Neti, Harriet J. Nock, and John R. Smith. Presented by: Archana Reddy Jammula

WALKING THROUGH… INTRODUCTION, SEMANTIC-CONTENT ANALYSIS SYSTEM, EXPERIMENTAL RESULTS, CONCLUSIONS

INTRODUCTION Large digital video libraries require tools for representing, searching, and retrieving content. The field is shifting from query-by-example (QBE) to query-by-keyword (QBK).

OVERVIEW An IBM project: a trainable QBK system for the labeling and retrieval of generic multimedia semantic-concepts in video. The focus is on detection of semantic-concepts using information cues from multiple modalities.

RELATED WORK A novel probabilistic framework for semantic video indexing by learning probabilistic multimedia representations of semantic events to represent keywords and key concepts A library of examples approach A rule-based system for indexing basketball videos by detecting semantics in audio. A framework for detecting sources of sounds in audio using such cues as onset and offset. A hidden-Markov-model (HMM) framework for generalized sound recognition. Usage of tempo to characterize motion pictures.

INTRODUCTION OF NEW APPROACH Prior work emphasizes the extraction of semantics from individual modalities, in some instances using audio and visual modalities. The new approach combines audio and visual content analysis with textual information retrieval in a unified setting for the semantic labeling of multimedia content. It proposes a novel representation of semantic-concepts using a basis of other semantic-concepts, together with a novel discriminant framework to fuse the different modalities.

APPROACH Visualize it as a machine-learning problem. Assumption: an a priori definition of a set of atomic semantic-concepts (objects, scenes, and events) that is broad enough to cover the semantic query space of interest, including high-level concepts. Steps: The set of atomic concepts is annotated manually in audio, speech, and/or video within a set of "training" videos. The annotated training data is then used to develop explicit statistical models of these atomic concepts; each such model can then be used to automatically label occurrences of the corresponding concept in new videos.

CHALLENGES Low-level features appropriate for labeling atomic concepts must be identified, and appropriate schemes for modeling these features must be selected. High-level concepts must be linked to the presence/absence of other concepts, and statistical models for combining these concept models into a high-level model must be chosen. Cutting across these levels, information from multiple modalities must be integrated or fused.

SEMANTIC-CONCEPT ANALYSIS SYSTEM Three components: (i) tools for defining a lexicon of semantic-concepts and annotating examples of those concepts within a set of training videos; (ii) schemes for automatically learning the representations of semantic-concepts in the lexicon based on the labeled examples; (iii) tools supporting data retrieval using the defined semantic-concepts.

LEXICON OF SEMANTIC-CONCEPTS A working set of intermediate- and high-level concepts covering events, scenes, and objects. Concepts are defined independently of the modality in which their cues occur, although some are more naturally expressed in one modality than another. Imposing a hierarchy is difficult; where one is imposed, it is defined in terms of feature extraction or by deriving concepts from other concepts.

The semantic-concept Parade is defined as: -a collection of people -music -a context in which the clip is interpreted as a parade

ANNOTATING A CORPUS Annotation of visual data is performed at the shot level; annotation of audio data is performed by specifying the time spans over which each audio concept (such as speech) occurs. Multimodal annotation follows, with synchronized playback of audio and video during the annotation process. Media Streams presents the lexicon of semantic-concepts as a set of icons.

LEARNING SEMANTIC-CONCEPTS FROM FEATURES Atomic concepts are modeled using features from a single modality; the integration of cues from multiple modalities occurs only within models of high-level concepts (a late-integration approach). The focus is on the joint analysis of audio, visual, and textual modalities for the semantic modeling of video. Modeling approaches: 1. Probabilistic modeling of semantic-concepts and events using models such as GMMs and HMMs. 2. Bayesian networks and discriminant approaches such as support vector machines (SVMs).

PROBABILISTIC MODELING FOR SEMANTIC CLASSIFICATION LOGIC: Model a semantic-concept as a class-conditional probability density function over a feature space. Given the set of semantic-concepts and a feature observation, choose the label whose class-conditional density yields the maximum likelihood of the observed feature. As the true class-conditional densities are not available, their forms must be assumed; the common choices are: GMMs for independent observation vectors, HMMs for time-series data.
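A minimal sketch of this maximum-likelihood decision rule, using illustrative 1-D Gaussian class models (the paper's actual features and models are higher-dimensional):

```python
import math

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x; mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def ml_label(x, class_models):
    """Pick the concept whose class-conditional density gives the
    maximum likelihood of the observed feature x."""
    return max(class_models, key=lambda c: gaussian_pdf(x, *class_models[c]))

# Hypothetical per-concept (mean, variance) models.
models = {"music": (0.0, 1.0), "engine": (5.0, 1.0)}
```

An observation near a concept's mean is labeled with that concept; in the real system the densities come from GMMs or HMMs trained on annotated data.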

GMM - PROBABILITY DENSITY FUNCTION A GMM defines the probability density function of an n-dimensional observation vector x given a model M as a weighted sum of Gaussian components: p(x | M) = Σᵢ wᵢ N(x; μᵢ, Σᵢ), where the mixture weights wᵢ sum to one.
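A minimal 1-D sketch of this density (two illustrative components; the paper's models operate on higher-dimensional feature vectors):

```python
import math

def gmm_pdf(x, weights, means, variances):
    """Density of a 1-D Gaussian mixture: p(x|M) = sum_i w_i N(x; mu_i, var_i)."""
    return sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )
```

With equal-weight components at -1 and +1 (unit variance), the density at 0 is exactly one unit-Gaussian evaluation away from either mean.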

HMM - PROBABILITY DENSITY FUNCTION An HMM models a sequence of observations (x₁, x₂, ..., xₙ) as having been generated by an unobserved state sequence s₁, ..., sₙ with a unique starting state s₀; the probability of model M generating the output sequence is P(x₁, ..., xₙ | M) = Σ over state sequences Πᵢ P(sᵢ | sᵢ₋₁) P(xᵢ | sᵢ).
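The sum over state sequences is computed efficiently by the forward algorithm; a discrete-observation sketch with toy states and symbols (not the paper's audio models):

```python
def forward_likelihood(obs, pi, A, B):
    """P(obs | M) via the forward algorithm.
    pi[s]: initial state probabilities, A[s][t]: transition probabilities,
    B[s][o]: emission probabilities."""
    # Initialize with the first observation.
    alpha = {s: pi[s] * B[s][obs[0]] for s in pi}
    # Recurse: sum over predecessor states, then emit the next symbol.
    for o in obs[1:]:
        alpha = {t: sum(alpha[s] * A[s][t] for s in alpha) * B[t][o] for t in pi}
    return sum(alpha.values())
```

A sanity check on correctness: the likelihoods of all possible observation sequences of a fixed length must sum to one.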

DISCRIMINANT TECHNIQUES: SUPPORT VECTOR MACHINES Flaw of probabilistic modeling for semantic classification: reliable estimation of the class-conditional parameters requires large amounts of training data for each class, and the forms assumed for the class-conditional distributions may not be the most appropriate. Benefit of discriminant techniques such as support vector machines: a more discriminant learning approach requires fewer parameters and assumptions and may yield better results.

SVM technique: separates two classes using a linear hyperplane. The classes are labeled by the sign of w · x + b, with the hyperplane w · x + b = 0 as the decision boundary.
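The decision rule, sketched with an assumed already-trained weight vector and bias (training the SVM itself is outside this snippet):

```python
def svm_decision(x, w, b):
    """Linear SVM decision: label by the sign of w . x + b;
    the hyperplane w . x + b = 0 separates the two classes."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

In the paper's setting, x would be a feature (or concept-score) vector, and w, b come from maximum-margin training on the annotated examples.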

LEARNING AUDIO CONCEPTS The scheme for modeling audio-based atomic concepts starts with an annotated audio training set. An HMM is trained for each audio concept; during testing, two distinct schemes are used to compute the confidences of the different hypotheses.

REPRESENTING CONCEPTS USING SPEECH Speech cues may be derived from one of two sources: manual transcriptions, or the results of automatic speech recognition (ASR) on the speech segments of the audio. The procedure for labeling a particular semantic-concept using speech information alone assumes an a priori definition of a set of query terms pertinent to that concept.

Cont.. SCHEME FOLLOWED: A scheme for obtaining such a set of query terms automatically is to use the most frequent words occurring within shots annotated with a particular concept; the set might also be derived using human knowledge or WordNet. Tagging, morphologically analyzing, and applying the stop list to this set of words yields a set of query terms Q for use in retrieving the concept of interest. Retrieval of shots containing the concept then proceeds by ranking documents against Q according to their score, as in standard text retrieval, which provides a ranking of shots.
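A toy sketch of this retrieval step. The stop list, the term-overlap score, and the shot transcripts below are illustrative stand-ins for the paper's tagging, morphological analysis, and text-retrieval scoring:

```python
STOP_WORDS = {"the", "a", "of", "and", "is", "to"}  # illustrative stop list

def query_terms(text):
    """Crude stand-in for tagging/morphology/stop-listing:
    lowercase, split, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def rank_shots(shots, Q):
    """Score each shot transcript by query-term overlap with Q and rank,
    as in standard text retrieval (ties broken by shot id)."""
    scored = [(len(query_terms(t) & Q), sid) for sid, t in shots.items()]
    return [sid for score, sid in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Shots whose ASR or manual transcript mentions the concept's query terms rise to the top of the ranking.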

LEARNING MULTIMODAL CONCEPTS Information cues from one or more modalities are integrated. We can build richer models that exploit the interrelationships between atomic concepts, which may not be possible if we model these high-level concepts directly in terms of their features.

(a) Bayesian Networks INFERENCE USING GRAPHICAL MODELS Models used: Bayesian networks of various topologies and parameterizations. Advantage: a Bayesian network allows us to graphically specify a particular form of the joint probability density function. The joint probability encoded by the Bayesian network is P(E, A, V, T) = P(E) P(A|E) P(V|E) P(T|E).
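Given that factorization, the posterior on the event E follows from Bayes' rule by normalizing over E. A binary-variable sketch with illustrative conditional probability tables (the actual CPT values would be learned from training data):

```python
def posterior_event(a, v, t, priors, cpts):
    """P(E | A=a, V=v, T=t) under the factorization P(E)P(A|E)P(V|E)P(T|E).
    priors[e] = P(E=e); cpts[m][e] = P(modality m = 1 | E = e); all binary."""
    def joint(e):
        p = priors[e]
        for m, obs in (("A", a), ("V", v), ("T", t)):
            q = cpts[m][e]
            p *= q if obs == 1 else (1 - q)
        return p
    j1, j0 = joint(1), joint(0)
    return j1 / (j1 + j0)  # normalize over E
```

When all three modality detectors fire, the posterior on the event rises well above the prior, which is exactly the fusion effect the network is meant to capture.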

(b) Support Vector Machines CLASSIFYING CONCEPTS USING SVMS Scores from all the intermediate concept classifiers are concatenated into a vector, which is used as the feature vector for the SVM.

EXPERIMENTAL RESULTS

THE CORPUS Dataset: a subset of the NIST Video TREC 2001 corpus, which comprises production videos derived from sources such as NASA and the Open Video Consortium. 7 videos comprising 1248 video shots; the 7 videos describe NASA activities, including its space program. The most pertinent audio cues are: music (84% of manually labeled audio samples) and rocket engine explosion (60% of manually labeled audio samples).

PREPROCESSING AND FEATURE EXTRACTION Visual shot detection and feature extraction: color, structure, shape. Audio feature extraction.
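The slide names color among the visual features. As one illustration of such a descriptor (an assumption for exposition, not the paper's exact feature), a quantized joint RGB histogram:

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel (0-255) into `bins` levels and build a
    normalized joint color histogram of length bins**3."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    n = len(pixels)
    return [h / n for h in hist]
```

The histogram sums to one, so it can feed directly into the GMM/SVM concept models as a fixed-length feature vector.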

LEXICON The lexicon in this experiment comprises more than 50 semantic-concepts for describing events, sites, and objects with cues in audio, video, and/or speech. A subset is described in the visual, audio, and multimodal concept experiments.

EVALUATION METRICS Retrieval performance is measured using precision-recall curves. A figure of merit (FOM) of retrieval effectiveness is used to summarize performance, defined as the average precision over the top 100 retrieved documents.
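A sketch of this figure of merit. Definitions of average precision vary slightly; this version averages precision-at-rank over the relevant documents found in the top `depth`, which matches the "over the top 100 retrieved documents" reading:

```python
def average_precision(relevance, depth=100):
    """Average precision over the top `depth` retrieved documents:
    the mean of precision-at-rank at each relevant hit.
    `relevance` is a ranked list of 0/1 relevance judgments."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance[:depth], start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

For a ranking with relevant documents at ranks 1 and 3, the precisions at the hits are 1/1 and 2/3, giving an average precision of 5/6.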

RETRIEVAL USING MODELS FOR VISUAL FEATURES Results: GMM versus SVM classification Depicts the overall retrieval effectiveness for a variety of intermediate (visual) semantic-concepts with SVM and GMM classifiers.

PRECISION-RECALL CURVES

RETRIEVAL USING MODELS FOR AUDIO FEATURES Results: Minimum duration modeling Fig: Effect of duration modeling

IMPLICIT VERSUS EXPLICIT FUSION PERFORMANCE GRAPH Fig: Implicit versus explicit fusion

RESULTS: FUSION OF SCORES FROM MULTIPLE AUDIO MODELS FOM results: audio retrieval, different intermediate concepts.

RESULTS: FUSION OF SCORES FROM MULTIPLE AUDIO MODELS FOM results: audio retrieval, GMM versus HMM performance and implicit versus explicit fusion.

RETRIEVAL USING SPEECH Two sets of results: The retrieval of the rocket launch concept using manually produced ground truth transcriptions Retrieval using transcriptions produced using ASR.

Cont.. FOM results: speech retrieval using human knowledge based query.

BAYESIAN NETWORK INTEGRATION All random variables are assumed to be binary-valued. The scores emitted by the individual classifiers (rocket object and rocket engine explosion) are mapped into the 0-1 range using the precision-recall curve as a guide: acceptable operating points on the precision-recall curve are mapped to the 0.5 probability value.
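One way to realize such a mapping, sketched here as a simple piecewise-linear calibration that sends the chosen operating-point threshold to 0.5 (the exact mapping the authors use is not specified on this slide, so this is an assumption):

```python
def calibrate(score, threshold, max_score):
    """Piecewise-linear map of a raw classifier score into [0, 1]:
    scores at the chosen operating point (threshold) map to 0.5,
    interpolating linearly below and above it (illustrative assumption)."""
    if score <= threshold:
        return 0.5 * score / threshold if threshold > 0 else 0.0
    return 0.5 + 0.5 * (score - threshold) / (max_score - threshold)
```

The calibrated value can then stand in for P(concept present) when the score enters the Bayesian network as evidence.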

SVM INTEGRATION For fusion using SVMs, the procedure is to take the scores from all semantic models and concatenate them into a 9-dimensional feature vector.
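The concatenation step is simple but order-sensitive; a sketch (the nine model names below are hypothetical placeholders, since the slide does not list them):

```python
def fusion_vector(model_scores, model_order):
    """Concatenate per-model confidence scores into one fixed-order
    feature vector for the fusion SVM."""
    return [model_scores[m] for m in model_order]

# Hypothetical names for the nine unimodal concept models.
ORDER = ["rocket_object", "sky", "fire_smoke", "engine_noise", "music",
         "speech", "explosion", "launch_text", "outdoors"]
```

Keeping a single fixed ordering for training and testing is what makes the concatenated scores a valid feature space for the fusion SVM.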

Fig: Fusion of audio, text, and visual models using the SVM fusion model for rocket launch retrieval.

Fig: The top 20 video shots of rocket launch/take-off retrieved using multimodal detection based on the SVM model. Nineteen of the top 20 are rocket launch shots.

SVM INTEGRATION cont.. FOM results for unimodal retrieval and the two multimodal fusion models.

CONCLUSION Feasibility of the framework was demonstrated for the semantic-concept rocket launch: -For concept classification using information in single modalities -For concept classification using information from multiple modalities. Experimental results show that information from multiple modalities can be successfully integrated to improve semantic labeling performance over that achieved by any single modality.

FUTURE WORK Schemes must be identified for automatically determining the low-level features which are most appropriate for labeling atomic concepts and for determining atomic concepts which are related to higher-level semantic-concepts. The scalability of the scheme and its extension to much larger numbers of semantic-concepts must also be investigated.

THANK YOU