Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video
Carnegie Mellon University
A. Hauptmann, R.V. Baron, M.-Y. Chen, M. Christel, P. Duygulu, C. Huang, R. Jin, W.-H. Lin, T. Ng, N. Moraveji, N. Papernick, C.G.M. Snoek, G. Tzanetakis, J. Yang, R. Yang, and H.D. Wactlar

Overview (1/3): TRECVID 2003 Tasks
- Shot boundary determination: identify the shot boundaries in the given video clip(s)
- Story segmentation: identify story boundaries and types (miscellaneous or news)
- High-level feature extraction: Outdoors, News subject face, People, Building, Road, Animal, ...
- Search: given the search test collection and a multimedia statement of information need (topic), return a ranked list of common reference shots from the test collection

Overview (2/3): Search
- Interactive Search
- Manual Search

Overview (3/3)
- Semantic classifiers: most are trained on keyframes
- Interactive Search: allows more effective browsing and visualization of the results of text queries, using a variety of filter strategies
- Manual Search: uses multiple retrieval agents (color, texture, ASR, OCR, and some of the classifiers, e.g. anchor, Person X), negative pseudo-relevance feedback, and co-retrieval
- Even the text-based baseline using the Okapi formula performed better than the other groups' systems

Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (1/3)
Audio features
- These features assist the extraction of the following medium-level audio-based features: music, male speech, female speech, and noise
- Based on the magnitude spectrum calculated using a Short-Time Fourier Transform (STFT)
- Consist of features that summarize the overall spectral characteristics: spectral centroid, rolloff, relative subband energies, and the Mel-Frequency Cepstral Coefficients (MFCCs); two of these are sketched below
- Male/female discrimination uses the Average Magnitude Difference Function (AMDF)
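As a concrete illustration, here is a minimal sketch of two of the named spectral features (centroid and rolloff) computed from an STFT magnitude spectrum. The frame size, hop size, and 0.85 rolloff threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def stft_magnitude(signal, frame_size=1024, hop=512):
    """Magnitude spectrum of each Hann-windowed frame."""
    window = np.hanning(frame_size)
    frames = [signal[i:i + frame_size] * window
              for i in range(0, len(signal) - frame_size, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def spectral_centroid(mag):
    """Center of mass of the spectrum, per frame."""
    bins = np.arange(mag.shape[1])
    return (mag * bins).sum(axis=1) / (mag.sum(axis=1) + 1e-10)

def spectral_rolloff(mag, threshold=0.85):
    """Bin below which `threshold` of the spectral energy lies, per frame."""
    cumulative = np.cumsum(mag, axis=1)
    target = threshold * cumulative[:, -1:]
    return (cumulative < target).sum(axis=1)
```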

Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (2/3)
Low-level image features
- The color feature is the mean and variance of each color channel in HSV (Hue-Saturation-Value) color space over a 5×5 image tessellation (sketched below)
- Another low-level feature is the Canny edge direction histogram
Face features
- Schneiderman's face detector algorithm
- The size and position of the largest face are used as additional face features
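A minimal sketch of the color feature described above: per-channel mean and variance over a 5×5 grid. OpenCV is assumed here purely for the color-space conversion; the paper does not name a library.

```python
import cv2
import numpy as np

def color_moments_5x5(bgr_image):
    """Mean and variance of H, S, V in each cell of a 5x5 tessellation."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV).astype(np.float32)
    h, w = hsv.shape[:2]
    features = []
    for row in range(5):
        for col in range(5):
            cell = hsv[row * h // 5:(row + 1) * h // 5,
                       col * w // 5:(col + 1) * w // 5]
            for channel in range(3):           # H, S, V
                features.append(cell[:, :, channel].mean())
                features.append(cell[:, :, channel].var())
    return np.array(features)                  # 5*5*3*2 = 150 values
```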

Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (3/3)
Text-based features
- The most reliable high-level feature
- Automatic Speech Recognition transcripts (ASR) and Video Optical Character Recognition (VOCR)
- VOCR uses Manber and Wu's approximate string matching technique, so that e.g. "Clinton" may retrieve "Cllnton", "Ciintonfi", "Cltnton", and "Clinton"
- However, it also retrieves incorrect text like "EIICKINSON" (for "DICKINSON") and "Cincintoli" (for "Cincinnati")
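The paper cites Manber and Wu's approximate string matching (the bitap/agrep algorithm). As a simpler stand-in, this sketch matches VOCR tokens against a query term by plain Levenshtein distance; the tolerance of 2 edits is an illustrative assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def approx_match(query, ocr_tokens, max_edits=2):
    """Return OCR tokens within max_edits of the query, e.g. 'Cllnton'
    and 'Cltnton' both match 'Clinton'."""
    q = query.lower()
    return [t for t in ocr_tokens if edit_distance(q, t.lower()) <= max_edits]
```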

Fisher Linear Discriminant for Anchors and Commercials (1/2)
- Multimodal combination approach: apply FLD to every feature set and synthesize new feature vectors
- Use these synthesized feature vectors to represent the content, then apply standard feature-vector classification approaches (see the sketch below)
- Two different SVM-based classifiers:
  - anchor: color histogram, face info, and speaker info
  - commercial: color histogram and audio features
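A minimal sketch of this combination scheme, assuming scikit-learn: FLD (LinearDiscriminantAnalysis) projects each feature set, the projections are concatenated into a synthesized vector, and an SVM classifies it. Feature-set names are placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def train_fld_svm(feature_sets, labels):
    """feature_sets: dict name -> (n_shots, dim) array; labels: (n_shots,)."""
    # One FLD projector per feature set (e.g. color, face, speaker).
    projectors = {name: LinearDiscriminantAnalysis().fit(X, labels)
                  for name, X in feature_sets.items()}
    # Concatenate the projected values into the synthesized feature vector.
    fused = np.hstack([projectors[name].transform(feature_sets[name])
                       for name in sorted(feature_sets)])
    classifier = SVC(probability=True).fit(fused, labels)
    return projectors, classifier
```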

Fisher Linear Discriminant for Anchors and Commercials (2/2)
[Figures: FLD weights for anchor detection; anchor and commercial classifier results]

Feature Classifiers (1/7)
Baseline SVM classifier with common annotation data
- SVM with a degree-2 polynomial kernel
- Uses only image features (no face)
- Performs a video-based cross-validation with portions of the common annotation data (see the sketch below)
Feature             MAP
Outdoors            0.112
Buildings           0.071
Roads               0.028
Vegetation          0.112
Cars                0.040
Aircraft            0.059
Sports              0.051
Weather News        0.017
Physical violence   0.012
Animals             0.017
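A sketch of this baseline under scikit-learn assumptions: a degree-2 polynomial SVM, cross-validated so that shots from the same video never appear in both training and validation folds (GroupKFold with video ids as groups). average_precision_score stands in for the MAP evaluation.

```python
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score

def video_cv_ap(X, y, video_ids, n_splits=5):
    """X: (n_shots, dim) features; y: binary labels; video_ids: groups."""
    scores = []
    for train, test in GroupKFold(n_splits=n_splits).split(X, y, groups=video_ids):
        model = SVC(kernel="poly", degree=2, probability=True).fit(X[train], y[train])
        scores.append(average_precision_score(
            y[test], model.predict_proba(X[test])[:, 1]))
    return sum(scores) / len(scores)
```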

Feature Classifiers (2/7)
Building detection
- Explores a classifier adapted from the man-made structure detection method of Kumar and Hebert
- The method produces binary detection outputs for each of 22×16 grid cells; 5 features are extracted from these binary outputs (sketched below):
  - number of positive grid cells
  - area of the bounding box that includes all the positive cells
  - x and y coordinates of the center of mass of the positive cells
  - ratio of the width and height of the bounding box
  - compactness
- 462 images are used as positive examples and 495 images as negative examples, classified by FLD and SVM
- MAP: man-made structures classifier vs. baseline SVM
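A sketch of the five listed features (six numbers, since the center of mass contributes x and y) computed from the binary grid; the precise definition of "compactness" is not given on the slide, so positive cells per bounding-box cell is an assumed interpretation.

```python
import numpy as np

def grid_features(binary_grid):
    """binary_grid: (16, 22) array, 1 = positive man-made-structure cell."""
    ys, xs = np.nonzero(binary_grid)
    if len(xs) == 0:
        return np.zeros(6)
    width = xs.max() - xs.min() + 1
    height = ys.max() - ys.min() + 1
    n_positive = len(xs)                        # number of positive cells
    bbox_area = width * height                  # bounding-box area
    center_x, center_y = xs.mean(), ys.mean()   # center of mass
    aspect = width / height                     # width/height ratio
    compactness = n_positive / bbox_area        # assumed definition
    return np.array([n_positive, bbox_area, center_x, center_y,
                     aspect, compactness])
```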

Feature Classifiers (3/7)
Plane detection, using additional still image data
- Uses the image features described above
- 3368 plane examples selected from the Web, the Corel data set, and the University of Oxford data set as positive examples; 3516 negative examples
- By FLD and SVM: MAP vs. the baseline
Car detection
- Modifies the Schneiderman face detector algorithm
- Outperforms the baseline: MAP vs. the baseline

Feature Classifiers (4/7)
Zoom detection
- Uses MPEG motion vectors to estimate the probability of a zoom pattern (see the sketch below)
- MAP
Female speech
- Uses an SVM trained on the LIMSI-provided speech features, together with the face characteristics
- MAP 0.465
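A hedged sketch of one way to score a zoom pattern from motion vectors: in a zoom, macroblock motion vectors point roughly toward (zoom out) or away from (zoom in) the frame center, so the mean radial alignment of the vector field is a simple zoom score. The 0.7 threshold is an illustrative assumption, not the paper's value.

```python
import numpy as np

def zoom_score(positions, motion_vectors, frame_center):
    """positions, motion_vectors: (n, 2) arrays of macroblock centers and
    their MPEG motion vectors. Returns the mean |cosine| between each
    vector and the radial direction from the frame center."""
    radial = (positions - frame_center).astype(float)
    radial /= np.linalg.norm(radial, axis=1, keepdims=True) + 1e-10
    mv = motion_vectors / (np.linalg.norm(motion_vectors, axis=1,
                                          keepdims=True) + 1e-10)
    return float(np.abs((radial * mv).sum(axis=1)).mean())

def is_zoom(positions, motion_vectors, frame_center, threshold=0.7):
    return zoom_score(positions, motion_vectors, frame_center) > threshold
```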

Feature Classifiers (5/7)
Text and timing for the Weather News, Outdoors, Sporting Event, Physical Violence, and Person X classifiers
- Models based only on text information are better than random baselines on the development data

Feature Classifiers (6/7)
- Timing information captures the implicit temporal structure of broadcast news, which is especially strong for weather reports and sports

Feature Classifiers (7/7)
- For each shot, predictions from both the text-based and the timing-based classifiers have to be considered
- Except for weather news, the results suggest that the text information of the broadcast news in the shot may not be enough to detect these high-level features

News Subject Monologues (1/2)
- Based on the LIMSI speech annotations, a voice-over detector and a frequent-speaker detector were developed
- VOCR is applied to extract overlaid text in the hope of finding people's names

News Subject Monologues (2/2)
- Another feature measures the average amount of motion in a camera shot, based on frame differences
- The commercial and anchor detectors are also used
- Individual detectors and features are combined using two well-known classifier combination schemes, namely stacking and bagging (see the sketch below)
- MAP 0.616
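A minimal sketch of the stacking half of that combination, assuming scikit-learn: base classifiers over the detector outputs are combined by a meta-classifier. The base estimators here are placeholders, not the paper's choices.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Inputs would be per-shot vectors of detector outputs (voice-over,
# frequent speaker, VOCR names, motion, commercial, anchor).
stacked = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression())
# stacked.fit(detector_outputs_train, labels_train)
# monologue_scores = stacked.predict_proba(detector_outputs_test)[:, 1]
```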

Finding Person X in Broadcast News (1/3)
- Uses text information from a transcript and face information
- Relationship between the name of person X and time: S is one shot, T_S the time of its key frame, and T_O the time at which the person's name occurs
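The slide does not reproduce the paper's exact scoring function over T_S and T_O; a Gaussian decay over their time difference is one plausible form, with sigma as an assumed parameter.

```python
import math

def p_text(t_s, name_times, sigma=10.0):
    """Text-based score of a shot with key frame at time t_s (seconds),
    given the times at which person X's name occurs in the transcript.
    Assumed form: Gaussian decay over |T_S - T_O|."""
    if not name_times:
        return 0.0
    return max(math.exp(-((t_s - t_o) ** 2) / (2 * sigma ** 2))
               for t_o in name_times)
```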

Finding Person X in Broadcast News (2/3)
More limited face recognition based on video shots
- Collect sample faces {F_1, F_2, ..., F_n} for person X and all faces {f_1, f_2, ..., f_m} from the I-frames of news shots whose P_text is larger than zero
- Build the eigenspace for those faces {f_1, ..., f_m, F_1, ..., F_n} and represent them by the eigenfaces {eigf_1, ..., eigf_m, eigF_1, ..., eigF_n} (sketched below)
- Combine rank scores and estimate which shots have a high probability of containing that face
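A minimal sketch of the eigenface step, assuming scikit-learn's PCA: the eigenspace is fit over the candidate faces plus the sample faces of person X, and each candidate is scored by its distance to the nearest sample face in that space. The number of components is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def eigenface_scores(candidate_faces, sample_faces, n_components=20):
    """Both inputs: (count, pixels) arrays of vectorized, aligned faces.
    Returns one score per candidate; higher = closer to person X."""
    all_faces = np.vstack([candidate_faces, sample_faces])
    pca = PCA(n_components=min(n_components, len(all_faces))).fit(all_faces)
    cand = pca.transform(candidate_faces)
    samp = pca.transform(sample_faces)
    # Distance from each candidate to its nearest sample face in eigenspace.
    dists = np.linalg.norm(cand[:, None, :] - samp[None, :, :], axis=2)
    return -dists.min(axis=1)
```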

Finding Person X in Broadcast News (3/3)
- Using "Madeleine Albright" as person X, 20 faces obtained from a Google image search serve as sample query faces

Learning Combination Weights in Manual Retrieval (1/5)
Shot-based video retrieval:
- A set of features is extracted for each shot
- Each shot is associated with a vector of individual retrieval scores from different media search modules
- Finally, these retrieval scores are fused into a final ordered list via some aggregation algorithm

Learning Combination Weights in Manual Retrieval (2/5)
- The weighted Borda fuse model is the basic combination approach for multiple search modules, i.e. for each shot the final score is the weighted sum of the rank-based scores from each module: Score(s) = Σ_j w_j · Borda_j(s) (see the sketch below)
Similarity measures
- For video frames, a harmonic mean of the Euclidean distances from each query image (color, texture, edge) is computed as the distance between query and video frames
- For text, retrieval over CC (closed caption) and OCR transcripts uses the Okapi BM-25 formula
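A minimal sketch of weighted Borda fusion: each module contributes a rank-based score (N minus rank) for the shots it returns, and a shot's final score is the weighted sum across modules. Module names and weights here are placeholders.

```python
def weighted_borda_fuse(module_rankings, weights):
    """module_rankings: dict module -> list of shot ids, best first;
    weights: dict module -> float. Returns shot ids by fused score."""
    scores = {}
    for module, ranking in module_rankings.items():
        n = len(ranking)
        for rank, shot in enumerate(ranking):
            # Borda score: best-ranked shot gets n, next n-1, and so on.
            scores[shot] = scores.get(shot, 0.0) + weights[module] * (n - rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = weighted_borda_fuse(
    {"text": ["s3", "s1", "s7"], "color": ["s1", "s3", "s9"]},
    {"text": 2.0, "color": 1.0})
```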

Learning Combination Weights in Manual Retrieval (3/5)
Negative Pseudo-Relevance Feedback (NPRF)
- NPRF is effective at providing a more adaptive similarity measure for image retrieval
- A better strategy for sampling negative examples is proposed, inspired by Maximal Marginal Relevance: Maximal Marginal Irrelevance (MMIR), sketched below
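A hedged sketch of how MMIR-style negative sampling could work, mirroring MMR's relevance/diversity trade-off: negatives are picked greedily so that each new pick is both dissimilar to the query and dissimilar to the negatives already picked. The lambda value and cosine similarity are assumptions, not details from the slide.

```python
import numpy as np

def mmir_negatives(query_vec, candidate_vecs, k=5, lam=0.5):
    """Greedily pick k negative-feedback examples from candidate_vecs."""
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    chosen, remaining = [], list(range(len(candidate_vecs)))
    while remaining and len(chosen) < k:
        # Maximize irrelevance to the query and dissimilarity to picks so far.
        best = max(remaining, key=lambda i:
                   -lam * sim(query_vec, candidate_vecs[i])
                   - (1 - lam) * max((sim(candidate_vecs[i], candidate_vecs[j])
                                      for j in chosen), default=0.0))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```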

Learning Combination Weights in Manual Retrieval (4/5)
The value of intermediate-level detectors
- The text-based feature is good at global ranking, and the other features are useful for refining the ranking afterwards
Learning weights for each modality in video retrieval
- Baseline: set weights based on query type
  - Person query: w = (text 2, face 1, color 1, anchor 0)
  - Non-person query: w = (text 2, face -1, color 1, anchor -1)
  - Aircraft and animal: w = (text 2, face -1, edge 1, anchor -1)

Learning Combination Weights in Manual Retrieval (5/5)
Learning weights using a labeled training set
- Supervised learning algorithm on the development set
Co-retrieval
- A set of video shots is first labeled as relevant using text-based features, and the results are augmented by learning with the other visual and intermediate-level features
[Figure: experimental results]

Interactive TREC Video Retrieval Evaluation for 2003 (1/2)
The interface has the following features:
- Storyboards of images spanning video story segments
- Emphasis on shots matching a user's query, to reduce the image count
- Resolution and layout under user control
- Additional filtering provided through shot classifiers
- Display of filter count and distribution to guide manipulation of storyboard views

Interactive TREC Video Retrieval Evaluation for 2003 (2/2)
[Figure: the interactive search interface]

Conclusions
We believe the browsing interfaces and the image-based search improvements made for 2003 led to the increased performance of the new system, as these strategies allowed relevant content to be found even when it had no associated narrative or text metadata.