Ming-yu Chen and Jun Yang School of Computer Science Carnegie Mellon University Carnegie Mellon Feature Extraction Techniques CMU at TRECVID 2004.

Slides:

Advertisements

Similar presentations

Image Retrieval With Relevant Feedback Hayati Cam & Ozge Cavus IMAGE RETRIEVAL WITH RELEVANCE FEEDBACK Hayati CAM Ozge CAVUS.

Advertisements

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki

Presented by Xinyu Chang

Human Identity Recognition in Aerial Images Omar Oreifej Ramin Mehran Mubarak Shah CVPR 2010, June Computer Vision Lab of UCF.

Detecting Categories in News Video Using Image Features Slav Petrov, Arlo Faria, Pascal Michaillat, Alex Berg, Andreas Stolcke, Dan Klein, Jitendra Malik.

Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots Chao-Yeh Chen and Kristen Grauman University of Texas at Austin.

AN IMPROVED AUDIO Jenn Tam Computer Science Dept. Carnegie Mellon University SOAPS 2008, Pittsburgh, PA.

Broadcast News Parsing Using Visual Cues: A Robust Face Detection Approach Yannis Avrithis, Nicolas Tsapatsoulis and Stefanos Kollias Image, Video & Multimedia.

Content-based Video Indexing, Classification & Retrieval Presented by HOI, Chu Hong Nov. 27, 2002.

Toward Semantic Indexing and Retrieval Using Hierarchical Audio Models Wei-Ta Chu, Wen-Huang Cheng, Jane Yung-Jen Hsu and Ja-LingWu Multimedia Systems,

Event prediction CS 590v. Applications Video search Surveillance – Detecting suspicious activities – Illegally parked cars – Abandoned bags Intelligent.

Chapter 11 Beyond Bag of Words. Question Answering n Providing answers instead of ranked lists of documents n Older QA systems generated answers n Current.

Tracking Objects with Dynamics Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 04/21/15 some slides from Amin Sadeghi, Lana Lazebnik,

1 CS 430: Information Discovery Lecture 22 Non-Textual Materials 2.

LYU0103 Speech Recognition Techniques for Digital Video Library Supervisor : Prof Michael R. Lyu Students: Gao Zheng Hong Lei Mo.

A Study of Approaches for Object Recognition

Supervised by Prof. LYU, Rung Tsong Michael Department of Computer Science & Engineering The Chinese University of Hong Kong Prepared by: Chan Pik Wah,

Face Detection: a Survey Speaker: Mine-Quan Jing National Chiao Tung University.

Project 4 out today –help session today –photo session today Project 2 winners Announcements.

Fig. 2 – Test results Personal Memory Assistant Facial Recognition System The facial identification system is divided into the following two components:

Visual Information Retrieval Chapter 1 Introduction Alberto Del Bimbo Dipartimento di Sistemi e Informatica Universita di Firenze Firenze, Italy.

Presented by Zeehasham Rasheed

Computer Vision I Instructor: Prof. Ko Nishino. Today How do we recognize objects in images?

Smart Traveller with Visual Translator for OCR and Face Recognition LYU0203 FYP.

Computer Vision Group University of California Berkeley Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA Greg Mori and Jitendra Malik.

Face Processing System Presented by: Harvest Jang Group meeting Fall 2002.

A fuzzy video content representation for video summarization and content-based retrieval Anastasios D. Doulamis, Nikolaos D. Doulamis, Stefanos D. Kollias.

DVMM Lab, Columbia UniversityVideo Event Recognition Video Event Recognition: Multilevel Pyramid Matching Dong Xu and Shih-Fu Chang Digital Video and Multimedia.

Image Processing David Kauchak cs458 Fall 2012 Empirical Evaluation of Dissimilarity Measures for Color and Texture Jan Puzicha, Joachim M. Buhmann, Yossi.

Information Retrieval in Practice

Computer vision.

TEMPORAL VIDEO BOUNDARIES -PART ONE- SNUEE KIM KYUNGMIN.

What’s Making That Sound ?

MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.

GM-Carnegie Mellon Autonomous Driving CRL TitleAutomated Image Analysis for Robust Detection of Curbs Thrust AreaPerception Project LeadDavid Wettergreen,

Bridge Semantic Gap: A Large Scale Concept Ontology for Multimedia (LSCOM) Guo-Jun Qi Beckman Institute University of Illinois at Urbana-Champaign.

The MPEG-7 Color Descriptors

Prakash Chockalingam Clemson University Non-Rigid Multi-Modal Object Tracking Using Gaussian Mixture Models Committee Members Dr Stan Birchfield (chair)

EADS DS / SDC LTIS Page 1 7 th CNES/DLR Workshop on Information Extraction and Scene Understanding for Meter Resolution Image – 29/03/07 - Oberpfaffenhofen.

Exploiting Ontologies for Automatic Image Annotation M. Srikanth, J. Varner, M. Bowden, D. Moldovan Language Computer Corporation

Marcin Marszałek, Ivan Laptev, Cordelia Schmid Computer Vision and Pattern Recognition, CVPR Actions in Context.

Carnegie Mellon TRECVID 2004 Workshop – November 2004 Mike Christel, Jun Yang, Rong Yan, and Alex Hauptmann Carnegie Mellon University

Finding Better Answers in Video Using Pseudo Relevance Feedback Informedia Project Carnegie Mellon University Carnegie Mellon Question Answering from Errorful.

Multimodal Information Analysis for Emotion Recognition

1 CS 430: Information Discovery Lecture 22 Non-Textual Materials: Informedia.

Soccer Video Analysis EE 368: Spring 2012 Kevin Cheng.

PSEUDO-RELEVANCE FEEDBACK FOR MULTIMEDIA RETRIEVAL Seo Seok Jun.

Dynamic Captioning: Video Accessibility Enhancement for Hearing Impairment Richang Hong, Meng Wang, Mengdi Xuy Shuicheng Yany and Tat-Seng Chua School.

Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video TRECVID 2003 Carnegie Mellon University A. Hauptmann, R.V. Bron, M.-Y. Chen, M.Christel,

Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.

Boosted Particle Filter: Multitarget Detection and Tracking Fayin Li.

Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.

1 CS 430 / INFO 430 Information Retrieval Lecture 17 Metadata 4.

Cell Segmentation in Microscopy Imagery Using a Bag of Local Bayesian Classifiers Zhaozheng Yin RI/CMU, Fall 2009.

ECE 417 Lecture 1: Multimedia Signal Processing

Efficient Image Classification on Vertically Decomposed Data

Traffic Sign Recognition Using Discriminative Local Features Andrzej Ruta, Yongmin Li, Xiaohui Liu School of Information Systems, Computing and Mathematics.

Tracking Objects with Dynamics

Semantic Video Classification

Introductory Seminar on Research: Fall 2017

Cheng-Ming Huang, Wen-Hung Liao Department of Computer Science

Efficient Image Classification on Vertically Decomposed Data

The Open World of Micro-Videos

Multimedia Information Retrieval

Paper Reading Dalong Du April.08, 2011.

Midterm Exam Closed book, notes, computer Similar to test 1 in format:

Announcements Project 4 out today Project 2 winners help session today

Midterm Exam Closed book, notes, computer Similar to test 1 in format:

Week 7 Presentation Ngoc Ta Aidean Sharghi

Presentation transcript:

Ming-yu Chen and Jun Yang School of Computer Science Carnegie Mellon University Carnegie Mellon Feature Extraction Techniques CMU at TRECVID 2004

Carnegie Mellon Outline Low level features Generic high level feature extractions Uni-modal Multi-modal Multi-concept Specialized approach for person finding Failure analysis

Carnegie Mellon Low level features overview Low level features CMU distributed 16 feature sets available to all TRECVID participants Development set: Test set: These features were used for all our submissions We encourage people to compare against these features to eliminate confusion about better features vs better algorithms

Carnegie Mellon Low level features Image features Color histogram Texture Edge Audio features FFT MFCC Motion features Kinetic energy Optical flow Detector features Face detection VOCR detection

Carnegie Mellon Image features 5 by 5 grids for key-frame per shot Color histogram (*.hsv, *.hvc, *.rgb) 5 by 5, 125 bins color histogram HSV, HVC, and RGB color space 3125 dimensions (5*5*125) row-wise grids _CNN.hsv , , , , , , , , , , , , , ,………………… , , , , , , , , , , , , , , ………………….. Texture (*.texture_5x5_bhs) Six orientated Gabor filters Edge (*.cannyedge) Canny edge detector, 8 orientations

Carnegie Mellon Audio features & Motion features Every 20 msecs (512 windows at HZ sampling rate) FFT (*.FFT) – Short Time Fourier Transform MFCC (*.MFCC) – Mel-Frequency cepstral coefficients SFFT (*.SFFT) – simplified FFT Kinetic energy (*.kemotion) Capture the pixel difference between frames Mpeg motion (*.mpgmotion) Mpeg motion vector extracted from p-frame Optical flow (*.opmotion) Capture optical flow in each grid

Carnegie Mellon Detector features Face detector (*.faceinfo) Detecting faces in the images VOCR detector (*.vocrinfo and *.mpg.txt) Detecting and recognizing VOCR

Carnegie Mellon Closed caption alignment and Shot mapping Closed caption alignment (*.wordtime) Each word in the closed caption file is assigned an approximate time in millisecond Shot Break (*.shots) Provides the mapping table of the shot We encourage people to utilize these features Eliminate confusion of better features or better algorithms Encourage more participants who can emphasize their efforts on algorithms

Carnegie Mellon Outline Low level features Generic high level feature extractions Uni-modal Multi-modal Multi-concept Specialized approach for person finding Failure analysis

Carnegie Mellon Generic high level feature extraction Uni-Modal Features Multi-modal Features Feature Tasks Timing Kinetic Motion Video OCR Face Feature SFFT MFCC Optical Motion Canny Edge HSV/HVC/GRB Color Gabor Texture Transcript Concept 1 OOOOOO 1. Boat / Ship OOOOOO Concept 2 Concept 3 Concept 4 Concept 168 SVM-based Combination Multi-concepts Combination 2. Madeleine Albright 3. Bill Clinton 10. Road Visual Info. Audio Info. Textual Info. Structural Info. Common Annotation

Carnegie Mellon Multi-concepts Learning Bayesian Networks from 168 common annotation concepts Pick top 4 most related concepts to combine with the target concept Boat/ShipBoat, Water_Body, Sky, Cloud TrainCar_Crash, Man_Made_scene, Smoke, Road BeachSky, Water_Body, Nature_Non-Vegetation, Cloud Basket ScoredCrowd, People, Running, Non-Studio_Setting Airplane TakeoffAirplane, Sky, Smoke, Space_Vehicle_Launch People Walking/running Walking, Running, People, Person Physical violenceGun_Shot, Building, Gun, Explosion RoadCar, Road_Traffic, Truck, Vehicle_Noise

Carnegie Mellon Top result for TRECVID tasks Boat*TrainBeachBasketAirplaneWalkingViolenceRoad Uni- modal Multi- modal Multi- concept Uni-modal gets 2 best over CMU results Multi-modal gets 3 best, but includes Boat/Ship which is the best overall all Multi-concept gets 6 best

Carnegie Mellon Outline Low-level features Generic high-level feature extractions Specialized approaches for person finding Failure analysis

Carnegie Mellon A Text Retrieval Approach Search for shots with names appearing in transcript Vector-based IR model with TF*IDF weighting shots transcript Temporal Mismatch Drawback Faces do not always temporally co-occur with names Cause false alarms and misses

Carnegie Mellon Expand the Text Retrieval Results Propagate the text score to neighbor shots based on the distribution Timing score = F ( Distance (shot, name) ) after the name before the name name position Model 1: Gaussian model (trained using Maximum Likelihood) Model 2: Linear model (different gradients set on two sides)

Carnegie Mellon Context Information Sometimes, a news story has the name but not the face E.g., “.... a big pressure on Clinton administration...” Cause many false alarms Related to the context “Clinton administration” Given the context, how likely a story has the face? Collect bigrams of type “___ Clinton” or “ Clinton ___” Compute P i (Clinton appears in the story | bigram i ) BigramP (face | bigram) Clinton says Clinton made Clinton administration

Carnegie Mellon Multimodal Features Multimodal features provide weak clues for person search Face detection – shots w/o detected faces are less likely to be the results Facial recognition – matching detected faces with facial model based on Eigenface representation Anchor classifier – anchor shots rarely have intended faces Video OCR – Fuzzy match by edit distance between video OCR and the target name

Carnegie Mellon Combining Multimodal Info. with Text Search Updated Text Score: R’ = R * Timing Score * Avg (P bigram_i) Linear combination of all the features with text score Features normalized into (pseudo-) probabilities [0,1] Feature selection based on chi-square statistics Combinational weights trained by logistic regression Featuresweight Updated text score6.14 Face similarity3.94 Face detection0.50 Anchor detection-5.65

Carnegie Mellon Performance Comparison One of our “Albright” runs is the best among all submissions Combining multimodal features helps both tasks Context helps “Clinton” but hurts “Albright” Probably due to sparse training data for “Albright”

Carnegie Mellon Outline Low-level features Generic high-level feature extractions Specialized approaches for person finding Failure analysis

Carnegie Mellon Performance Drop on Anchor Classifier 95% cross-validation accuracy on development set 10 videos in testing set has 0 or 1 detected anchor Average # of anchor shots per video is 20-30

Carnegie Mellon Different Data Distribution Different images: change on background, anchor, clothes Common types (development set) Outliers (testing set) Similar images, but probably different MPEG encoding “Peter Jennings” has similar clothes and background in both sets In videos with “Peter Jennings” as the anchor 19 detected per video in development set 13 detected per video in testing set

Carnegie Mellon Other Semantic Features Similar performance drop observed on Commercial, Sport News, etc Compromises both high-level feature extraction and search Possible solutions Get consistent data next year Rely less on sensitive image features (color, etc) Rely more on robust features -- “Video grammar” Timing, e.g., sports news are in the same temporal session Story structure, e.g., the first shot of a story is usually an anchor shot Speech, e.g., anchor’s speaker ID is easy to identify Constraints, e.g, weather forecast appear only once Re-encoding MPEG video

Carnegie Mellon Q & A Thank you!