Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video
Carnegie Mellon University
A. Hauptmann, R.V. Baron, M.-Y. Chen, M. Christel, P. Duygulu, C. Huang, R. Jin, W.-H. Lin, T. Ng, N. Moraveji, N. Papernick, C.G.M. Snoek, G. Tzanetakis, J. Yang, R. Yang, and H.D. Wactlar
Overview (1/3)
- Shot boundary determination: identify the shot boundaries in the given video clip(s)
- Story segmentation: identify the story boundaries and types (miscellaneous or news)
- High-level feature extraction: Outdoors, News subject face, People, Building, Road, Animal, ...
- Search: given the search test collection and a multimedia statement of info. need (topic), return a ranked list of common reference shots from the test collection
Overview (2/3)
- Search
  - Interactive Search
  - Manual Search
Overview (3/3)
- Semantic classifiers: most are trained on keyframes
- Interactive Search: allows more effective browsing and visualization of the results of text queries, using a variety of filter strategies
- Manual Search: uses multiple retrieval agents (color, texture, ASR, OCR, and some of the classifiers, e.g. anchor, Person X), negative pseudo-relevance feedback, and co-retrieval
- Even the text-based baseline using the OKAPI formula performed better than other groups
Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (1/3)
- Audio features
  - Assist the extraction of the following medium-level audio-based features: music, male speech, female speech, and noise
  - Based on the magnitude spectrum calculated using a Short-Time Fourier Transform (STFT)
  - Consist of features that summarize the overall spectral characteristics: spectral centroid, rolloff, relative subband energies, and the Mel-Frequency Cepstral Coefficients (MFCCs)
  - Male/female discrimination uses the Average Magnitude Difference Function (AMDF)
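As an illustration of these spectral summaries, here is a minimal sketch of the spectral centroid and rolloff computed over STFT magnitude frames; the frame length, hop size, window, and 85% rolloff fraction are illustrative assumptions, not values reported by the authors.

```python
# Sketch: spectral centroid and rolloff from STFT magnitude frames.
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Magnitude spectrum of each Hann-windowed frame."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

def spectral_centroid(mag, sample_rate):
    """Center of mass of the spectrum, per frame (in Hz)."""
    freqs = np.fft.rfftfreq(2 * (mag.shape[1] - 1), d=1.0 / sample_rate)
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)

def spectral_rolloff(mag, sample_rate, fraction=0.85):
    """Frequency below which `fraction` of the spectral energy lies."""
    cumulative = np.cumsum(mag, axis=1)
    threshold = fraction * cumulative[:, -1:]
    bins = (cumulative < threshold).sum(axis=1)
    freqs = np.fft.rfftfreq(2 * (mag.shape[1] - 1), d=1.0 / sample_rate)
    return freqs[np.minimum(bins, len(freqs) - 1)]
```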
Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (2/3)
- Low-level image features
  - The color feature is the mean and variance of each color channel in HSV (Hue-Saturation-Value) color space over a 5*5 image tessellation
  - Another low-level feature is the Canny edge direction histogram
- Face features
  - Schneiderman's face detector algorithm
  - The size and position of the largest face are used as additional face features
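A sketch of the 5*5-grid color feature follows, assuming OpenCV for the HSV conversion: the per-cell mean and variance of each channel give a 5*5*3*2 = 150-dimensional vector.

```python
# Sketch: per-cell HSV mean/variance over a 5*5 image tessellation.
import cv2
import numpy as np

def hsv_grid_feature(bgr_image, grid=5):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV).astype(np.float32)
    h, w = hsv.shape[:2]
    feats = []
    for row in range(grid):
        for col in range(grid):
            cell = hsv[row * h // grid:(row + 1) * h // grid,
                       col * w // grid:(col + 1) * w // grid]
            feats.extend(cell.reshape(-1, 3).mean(axis=0))  # per-channel mean
            feats.extend(cell.reshape(-1, 3).var(axis=0))   # per-channel variance
    return np.asarray(feats)
```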
Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (3/3)
- Text-based features
  - The most reliable high-level feature
  - Automatic Speech Recognition (ASR) transcripts, Video Optical Character Recognition (VOCR)
- Video OCR (VOCR)
  - Manber and Wu's approximate string matching technique, e.g. a query for "Clinton" may retrieve "Cllnton", "Ciintonfi", "Cltnton", and "Clinton"
  - However, it also matches incorrect text like "EIICKINSON" (for "DICKINSON") and "Cincintoli" (for "Cincinnati")
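The sketch below reproduces the matching behavior the slide describes with a plain dynamic-programming edit distance; Manber and Wu's actual technique (the bit-parallel agrep algorithm) reaches the same decisions faster, and the 2-error budget here is an assumption.

```python
# Sketch: accept a noisy VOCR token if it is within k edits of the query.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(query, ocr_tokens, max_errors=2):
    q = query.lower()
    return [t for t in ocr_tokens if edit_distance(q, t.lower()) <= max_errors]

# e.g. fuzzy_match("Clinton", ["Cllnton", "Cltnton", "DICKINSON"]) keeps the
# first two tokens and rejects the third.
```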
Fisher Linear Discriminant for Anchors and Commercials (1/2)
- Multimodal combination approach: apply FLD to every feature set and synthesize new feature vectors
- Use these synthesized feature vectors to represent the content, then apply standard feature-vector classification approaches
- Two different SVM-based classifiers:
  - Anchor: color histogram, face info., and speaker info.
  - Commercial: color histogram and audio features
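A minimal sketch of this pipeline, assuming scikit-learn: each modality's features pass through their own Fisher Linear Discriminant, the projections are concatenated into the synthesized vector, and an SVM is trained on the result. The RBF kernel is an assumption; the slides do not name one.

```python
# Sketch: per-modality FLD projection, then an SVM on the concatenation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def train_fld_svm(feature_sets, labels):
    """feature_sets: list of (n_samples, dim_i) arrays, one per modality."""
    flds = [LinearDiscriminantAnalysis().fit(X, labels) for X in feature_sets]
    synthesized = np.hstack([fld.transform(X)
                             for fld, X in zip(flds, feature_sets)])
    clf = SVC(kernel="rbf", probability=True).fit(synthesized, labels)
    return flds, clf

def predict_fld_svm(flds, clf, feature_sets):
    synthesized = np.hstack([fld.transform(X)
                             for fld, X in zip(flds, feature_sets)])
    return clf.predict_proba(synthesized)[:, 1]
```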
Fisher Linear Discriminant for Anchors and Commercials (2/2)
- FLD weights for anchor detection (table in the original slides)
- Anchor and commercial classifier results (table in the original slides)
Feature Classifiers (1/7)
- Baseline SVM classifier with common annotation data
  - SVM with a degree-2 polynomial kernel
  - Uses only image features (no face)
  - Performs video-based cross-validation with portions of the common annotation data

Feature            MAP
Outdoors           0.112
Buildings          0.071
Roads              0.028
Vegetation         0.112
Cars               0.040
Aircraft           0.059
Sports             0.051
Weather News       0.017
Physical Violence  0.012
Animals            0.017
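A minimal sketch of the baseline setup, assuming scikit-learn: a degree-2 polynomial-kernel SVM scored with grouped cross-validation so that shots from the same video never straddle a train/test split (one plausible reading of "video-based cross-validation").

```python
# Sketch: degree-2 polynomial SVM with video-grouped cross-validation.
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.svm import SVC

def baseline_cv_scores(X, y, video_ids, folds=5):
    """X: image features per shot; video_ids: which video each shot came from."""
    svm = SVC(kernel="poly", degree=2)
    cv = GroupKFold(n_splits=folds)
    return cross_val_score(svm, X, y, groups=video_ids, cv=cv,
                           scoring="average_precision")
```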
Feature Classifiers (2/7)
- Building detection
  - Explores a classifier that adapts the man-made structure detection method of Kumar and Hebert
  - This method produces binary detection outputs for each of 22*16 grid cells; 5 features are extracted from the binary outputs (sketched below):
    - number of positive grid cells
    - area of the bounding box that includes all the positive grid cells
    - x and y coordinates of the center of mass of the positive grid cells
    - ratio of width to height
    - compactness
  - 462 images are used as positive examples and 495 images as negative examples, classified by FLD and SVM
  - MAP 0.042 (man-made structures) vs. 0.071 (baseline SVM)
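The grid-level feature extraction is simple enough to sketch directly; "compactness" is not defined on the slide, so positive-cell count divided by bounding-box area is used here as an assumed definition (the x/y center counts as one feature pair, giving 6 numbers for the 5 listed features).

```python
# Sketch: the 5 grid-level features from a 22*16 boolean detection grid.
import numpy as np

def grid_features(detections):
    """detections: (16, 22) boolean array, True where a cell fired."""
    ys, xs = np.nonzero(detections)
    if len(xs) == 0:
        return np.zeros(6)
    width = xs.max() - xs.min() + 1
    height = ys.max() - ys.min() + 1
    n_positive = len(xs)
    bbox_area = width * height
    center_x, center_y = xs.mean(), ys.mean()   # center of mass
    aspect = width / height
    compactness = n_positive / bbox_area        # assumed definition
    return np.array([n_positive, bbox_area, center_x, center_y,
                     aspect, compactness])
```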
Feature Classifiers (3/7)
- Plane detection using additional still-image data
  - Uses the image features described above
  - 3368 plane examples, selected from the web, the Corel data set, and the University of Oxford data set, serve as positive examples; 3516 negative examples
  - With FLD and SVM, MAP 0.008 vs. 0.059 (baseline)
- Car detection
  - Modifies the Schneiderman face detector algorithm
  - Outperforms the baseline with MAP 0.114 vs. 0.040
Feature Classifiers (4/7)
- Zoom detection
  - Uses MPEG motion vectors to estimate the probability of a zoom pattern
  - MAP 0.632
- Female speech
  - Uses an SVM trained on the LIMSI-provided speech features, together with the face characteristics
  - MAP 0.465
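One plausible realization of the zoom detector: during a zoom, motion vectors point radially outward (zoom-in) or inward (zoom-out) from the frame center, so the mean normalized radial component of the vectors separates zooms from pans. The threshold below is an illustrative assumption.

```python
# Sketch: zoom detection from the radial pattern of MPEG motion vectors.
import numpy as np

def zoom_score(positions, motion_vectors):
    """positions: (n, 2) block centers; motion_vectors: (n, 2) MPEG vectors."""
    offsets = positions - positions.mean(axis=0)     # offset from mean block position (~frame center)
    radial = (offsets * motion_vectors).sum(axis=1)  # outward component
    norm = (np.linalg.norm(offsets, axis=1) *
            np.linalg.norm(motion_vectors, axis=1) + 1e-12)
    return np.mean(radial / norm)  # near +1: zoom-in, near -1: zoom-out

def is_zoom(positions, motion_vectors, threshold=0.7):
    return abs(zoom_score(positions, motion_vectors)) > threshold
```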
Feature Classifiers (5/7)
- Text and timing for the Weather News, Outdoors, Sporting Event, Physical Violence, and Person X classifiers
- Models based only on text info. perform better than random baselines on the development data
Feature Classifiers (6/7)
- Timing info. exploits the implicit temporal structure of broadcast news; weather reports and sports, in particular, tend to occur at characteristic points in a broadcast
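One simple way to realize such a timing-based classifier, sketched below under the assumption that each shot carries its offset into the broadcast: estimate from development data the probability that a feature (e.g. weather news) occurs in each time bin, and use that histogram as a prior for test shots. The bin count and broadcast duration are placeholders.

```python
# Sketch: a histogram prior over time-of-broadcast for a high-level feature.
import numpy as np

def fit_timing_prior(offsets, labels, n_bins=30, duration=1800.0):
    bins = np.linspace(0.0, duration, n_bins + 1)
    idx = np.clip(np.digitize(offsets, bins) - 1, 0, n_bins - 1)
    pos = np.bincount(idx[labels == 1], minlength=n_bins)
    tot = np.bincount(idx, minlength=n_bins)
    return bins, (pos + 1) / (tot + 2)   # Laplace-smoothed P(feature | bin)

def timing_score(bins, prior, offset):
    i = np.clip(np.digitize(offset, bins) - 1, 0, len(prior) - 1)
    return prior[i]
```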
Feature Classifiers (7/7)
- For each shot, predictions from both the text-based and timing-based classifiers have to be considered
- Except for weather news, the results suggest that the text info. of the broadcast news in a shot may not be enough to detect these high-level features
News Subject Monologues (1/2)
- Based on the LIMSI speech annotations, a voice-over detector and a frequent-speaker detector were developed
- VOCR is applied to extract overlaid text, in the hope of finding people's names
News Subject Monologues (2/2)
- Another feature measures the average amount of motion in a camera shot, based on frame differences
- The commercial and anchor detectors are also used
- Individual detectors and features are combined using two well-known classifier combination schemes, namely stacking and bagging
- MAP 0.616
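For the stacking half of that combination, a minimal sketch assuming scikit-learn: a meta-classifier is trained on the base detectors' output scores. The slides do not name the meta-learner; logistic regression is assumed here as a common default.

```python
# Sketch: stacking a meta-classifier on top of base detector scores
# (voice-over, frequent speaker, motion, commercial, anchor, ...).
from sklearn.linear_model import LogisticRegression

def fit_stacker(detector_scores, labels):
    """detector_scores: (n_shots, n_detectors) array of base detector outputs."""
    return LogisticRegression().fit(detector_scores, labels)

def monologue_scores(stacker, detector_scores):
    return stacker.predict_proba(detector_scores)[:, 1]
```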
Finding Person X in Broadcast News (1/3)
- Uses text info. from the transcript and face info.
- Relationship between the name of person X and time
  - S: one shot; T_S: time of the key frame; T_O: time of occurrence of the person's name
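The slides do not give the exact scoring function relating T_S and T_O; a common choice, assumed in the sketch below, is a text score P_text that decays as the nearest name occurrence gets farther from the shot's key frame.

```python
# Sketch: an assumed P_text that decays with the gap between the key-frame
# time T_S and the closest occurrence time T_O of person X's name.
import math

def p_text(t_s, name_times, window=30.0):
    """t_s: key-frame time (s); name_times: times person X's name is spoken."""
    if not name_times:
        return 0.0
    gap = min(abs(t_s - t_o) for t_o in name_times)
    return math.exp(-gap / window) if gap <= 2 * window else 0.0
```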
Finding Person X in Broadcast News (2/3)
- More limited face recognition based on video shots
  - Collect sample faces {F_1, F_2, ..., F_n} for person X, and all faces {f_1, f_2, ..., f_m} from the I-frames of news shots whose P_text is larger than zero
  - Build the eigenspace for those faces {f_1, ..., f_m, F_1, ..., F_n} and represent them by the eigenfaces {eigf_1, ..., eigf_m, eigF_1, ..., eigF_n}
  - Combine rank scores and estimate which shots have a high probability of containing that face
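An eigenface sketch of that comparison, assuming scikit-learn's PCA: all faces are projected into a shared eigenspace, and each shot face is scored by its distance to the nearest sample face of person X. The component count and nearest-neighbor scoring are assumptions.

```python
# Sketch: shared eigenspace over shot faces and person-X sample faces.
import numpy as np
from sklearn.decomposition import PCA

def eigenface_scores(shot_faces, sample_faces, n_components=20):
    """Both inputs: (n, h*w) arrays of flattened, aligned gray-scale faces."""
    all_faces = np.vstack([shot_faces, sample_faces])
    pca = PCA(n_components=min(n_components, len(all_faces))).fit(all_faces)
    shot_proj = pca.transform(shot_faces)      # eigf_1 ... eigf_m
    sample_proj = pca.transform(sample_faces)  # eigF_1 ... eigF_n
    # distance of each shot face to its closest sample face of person X
    dists = np.linalg.norm(shot_proj[:, None] - sample_proj[None], axis=2)
    return -dists.min(axis=1)  # higher score = more likely person X
```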
Finding Person X in Broadcast News (3/3)
- Using "Madeleine Albright" as person X, 20 faces obtained from a Google image search serve as sample query faces
Learning Combination Weights in Manual Retrieval (1/5)
- In shot-based video retrieval, a set of features is extracted
- Each shot is associated with a vector of individual retrieval scores from different media search modules
- Finally, these retrieval scores are fused into a final ordered list via some aggregation algorithm
Learning Combination Weights in Manual Retrieval (2/5)
- Uses the weighted Borda fuse model as the basic combination approach for multiple search modules, i.e. each shot's final score is a weighted sum of its per-module rank scores
- Similarity measures
  - For video frames, a harmonic mean of the Euclidean distances from each query image (color, texture, edge) is computed as the distance between the query and a video frame
  - For text, matching against closed-caption (CC) and OCR transcripts is done using the OKAPI BM-25 formula
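A sketch of both pieces: weighted Borda fusion over the per-module rankings, and the harmonic mean of Euclidean distances to the query images. Module names and weights here are placeholders.

```python
# Sketch: weighted Borda fusion and the harmonic-mean image distance.
import numpy as np

def weighted_borda_fuse(module_rankings, weights):
    """module_rankings: {module: list of shot ids, best first}."""
    scores = {}
    for module, ranking in module_rankings.items():
        n = len(ranking)
        for rank, shot in enumerate(ranking):
            scores[shot] = scores.get(shot, 0.0) + weights[module] * (n - rank)
    return sorted(scores, key=scores.get, reverse=True)

def harmonic_mean_distance(frame_feat, query_feats):
    """Harmonic mean of Euclidean distances to each query image's features."""
    dists = [np.linalg.norm(frame_feat - q) for q in query_feats]
    return len(dists) / sum(1.0 / (d + 1e-12) for d in dists)
```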
Learning Combination Weights in Manual Retrieval (3/5)
- Negative Pseudo-Relevance Feedback (NPRF)
  - NPRF is effective at providing a more adaptive similarity measure for image retrieval
  - A better strategy for sampling negative examples is proposed, inspired by Maximal Marginal Relevance: Maximal Marginal Irrelevance (MMIR)
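A sketch of MMIR-style negative sampling, by analogy with Maximal Marginal Relevance: from the low-ranked candidates, iteratively pick examples that are both far from the query and far from the negatives already chosen, keeping the negative set diverse. The trade-off parameter and the use of Euclidean distance are assumptions.

```python
# Sketch: diversity-aware negative sampling in the spirit of MMIR.
import numpy as np

def mmir_negatives(query, candidates, n_neg=10, lam=0.5):
    """candidates: (n, d) feature vectors of low-ranked shots."""
    chosen = []
    remaining = list(range(len(candidates)))
    while remaining and len(chosen) < n_neg:
        def score(i):
            far_from_query = np.linalg.norm(candidates[i] - query)
            far_from_chosen = min(
                (np.linalg.norm(candidates[i] - candidates[j]) for j in chosen),
                default=far_from_query)
            return lam * far_from_query + (1 - lam) * far_from_chosen
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return [candidates[i] for i in chosen]
```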
Learning Combination Weights in Manual Retrieval (4/5)
- The value of intermediate-level detectors
  - The text-based feature is good at global ranking; the other features are useful for refining the ranking afterwards
- Learning weights for each modality in video retrieval
  - Baseline: set weights based on query type (see the lookup sketch below)
    - Person query: w = (text 2, face 1, color 1, anchor 0)
    - Non-person query: w = (text 2, face -1, color 1, anchor -1)
    - Aircraft and animal: w = (text 2, face -1, edge 1, anchor -1)
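Transcribed as a lookup table (classification of a query into these types is assumed to happen upstream):

```python
# The baseline per-query-type modality weights, transcribed from the slide.
QUERY_TYPE_WEIGHTS = {
    "person":          {"text": 2, "face": 1,  "color": 1, "anchor": 0},
    "non-person":      {"text": 2, "face": -1, "color": 1, "anchor": -1},
    "aircraft-animal": {"text": 2, "face": -1, "edge": 1,  "anchor": -1},
}

def weights_for(query_type):
    return QUERY_TYPE_WEIGHTS[query_type]
```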
Learning Combination Weights in Manual Retrieval (5/5)
- Learning weights using a labeled training set
  - Supervised learning algorithm on the development set
- Co-retrieval
  - A set of video shots is first labeled as relevant using the text-based features, and the results are augmented by learning with the other visual and intermediate-level features
- Experimental results (table in the original slides)
Interactive TREC Video Retrieval Evaluation for 2003 (1/2)
- The interface has the following features:
  - Storyboards of images spanning video story segments
  - Emphasis on shots matching a user's query, to reduce the image count
  - Resolution and layout under user control
  - Additional filtering provided through shot classifiers
  - Display of filter counts and distributions to guide manipulation of storyboard views
Interactive TREC Video Retrieval Evaluation for 2003 (2/2)
(interface screenshot in the original slides)
Conclusions
- We believe the browsing interfaces and the image-based search improvements made for 2003 led to the increased performance of the new system, as these strategies allowed relevant content to be found even when it had no associated narrative or text metadata