Taking Advances in Multimedia Content Analysis to Product: Challenges and Solutions Ajay Divakaran Vision and Multi-Sensor Systems December 1, 2009.


1 Taking Advances in Multimedia Content Analysis to Product: Challenges and Solutions Ajay Divakaran Vision and Multi-Sensor Systems December 1, 2009

2 Acknowledgements Mitsubishi Electric – Research Labs USA, Japan Huifang Sun Isao Otsuka, Tokumichi Murakami Regunathan Radhakrishnan, Ziyou Xiong, Lexing Xie, Kadir Peker, Kevin Wilson Many others in the Multimedia Community – Especially all attendees today Harpreet Singh Sawhney 2009 Sarnoff Corp. Copyrighted

3 Outline Introduction and Motivation Video Browsing Enabled DVD Recorder – A Case Study Future Possibilities

4 Consumer Video Browsing Summarization Content Indexing

5 Web-based vs Personal View World Personal Content WORLDWIDE CONTENT

6 Consumer Video Browsing What is the business model? – Rapid Browsing as yet another feature Smart FF, REW etc.? – Advertisement based – Is that a contradiction?

7 Existing Products Browsing enabled DVD and Blu-ray recorders – Mitsubishi – Sports and Music – Sony – Sports and Music – Hitachi – Sports Content-based Editing – Microsoft Web-based “Skins” for Media Players Video Editing Tools for Broadcasters – Not the focus of this talk Limited to locally stored content mostly

8 Highlights Playback Enabled DVD Recorder: A Case Study Video: Highlights Playback Enabled DVD Recorder

9 A Video-Browsing Enabled Personal Video Recorder Ajay Divakaran

10 OUTLINE Background and Motivation Basic Concept – Audio Classification Framework – Sports Highlight Extraction – Data Structure of Meta-data Realization on Target Platform Demonstration Further Topics if time permits Conclusion

11 Background : Video Summarization Compact viewable representation of content Past work relies heavily on intuitively gathered domain knowledge Unscripted Content - Sports – based on detection of specific events – Based on video analysis [Ekin, Pan] – Based on audio analysis [Z.Xiong] – Based on closed caption text analysis [Babaguchi] – Unsupervised structure discovery-play/break [L.Xie] Unscripted Content - Surveillance – Object segmentation & tracking followed by unusual event detection [E. Chang] Scripted Content – News, Drama, Films etc – Focus on generation of TOC (Table of Contents) – Audio-visual feature extraction for semantic boundary extraction (MCA community) – Table of Contents Extraction (e.g. Rui et al, Dimitrova et al, S-F. Chang et al)

12 Scripted and Unscripted content News Drama Sports Surveillance Existence of structure Ease of discovery scripted unscripted

13 Summarizing “Scripted” and “Unscripted” content. Scripted (news, drama, movies etc.): extract semantic units (e.g. news story, scene), then abstract semantic units (e.g. skims, key frames). Summary: flexible traversal through semantic units using abstractions. Unscripted (sports and surveillance): event detection (e.g. goals, home runs), then rank events (down to uninteresting). Summary: flexible traversal through ranked events.

14 Representation for “Scripted” content

15 Representation for “Unscripted” content : Key audio-visual markers Cheering! Applause!

16 Background and Motivation Current PVRs can store up to 200 hours of high quality video content. The storage capacity is expected to grow further (e.g. HDD, Blu-ray, etc.). The requirement for long duration recorded content is … (1) To browse the expected scenes (Highlight Search). (2) To grasp the entire content quickly (Summarized Playback). It is necessary to develop a technology that can detect highlight scenes from contents automatically and help browse the content. We propose to use only audio analysis.

17 How to detect highlight scenes from contents The EXCITED SPEECH part of the content is a powerful indicator of highlight scenes. Excited Comment Cheering! Interesting events lead to a human reaction consisting of a mixture of cheering and the commentator’s excited speech. This method works across a wide range of sports.

18 Basic Concept Cheering! Audio Classification Importance Level SHOT! Excited Comment Applause! Multimedia Data Meta-data Storage Medium MPEG-2 AC-3

19 System Block Diagram. Figure: current DVD recorder system block diagram. Labels: System Encoder; MPEG-2 Video Encoder; Packetizer; PS MIX; Write Buffer; ATA/ATAPI I/F; Video Signal; Video ES; Audio Signal; Video PES; Audio ES; Audio PES; PS; HDD, DVD Disk; MCU; DVD Codec LSI; ATA/ATAPI Format; Audio DSP; Dolby AC-3 Audio Encoder; Importance Level calc.; Audio Classification (GMM); MDCT Feature Extraction; SCR; Meta-Data; Importance Level; Time Stamp Info.; Broadcast video; Entirely Hard-Wired; Based on a programmable DSP. MDCT: Modified Discrete Cosine Transform. GMM: Gaussian Mixture Model.

20 Audio Classification Framework Time Domain, Frequency Domain. Audio classes: Applause, Cheering, Music, Speech, Excited Speech. Training: Sample Audio (AC-3, variety of content), MDCT, MDCT Coef., training the GMM classifiers using MDCT data. Classification: Input Audio (AC-3), MDCT, comparing likelihoods, Audio Class. Classify the input audio into 5 audio classes using GMM. Applause or cheering with excited speech from commentator.
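The "comparing likelihoods" step on this slide could be sketched roughly as follows. This is a minimal illustration, not the product implementation: the diagonal-covariance GMM form, the function names (`gmm_log_likelihood`, `classify_frame`, `make_toy_models`) and the toy 2-D parameters are all assumptions; the real system works on MDCT coefficients from AC-3 audio with trained models.

```python
import numpy as np

# The five audio classes named on the slide.
CLASSES = ["applause", "cheering", "music", "speech", "excited_speech"]


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    weights: (K,), means: (K, D), variances: (K, D). Hypothetical layout.
    """
    x = np.asarray(x, dtype=float)
    diff = x - means
    # Per-component Gaussian log-density with diagonal covariance.
    log_comp = -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                       + np.sum(diff ** 2 / variances, axis=1))
    a = np.log(weights) + log_comp
    m = np.max(a)
    # Log-sum-exp over components for numerical stability.
    return m + np.log(np.sum(np.exp(a - m)))


def classify_frame(x, models):
    """Return the class whose GMM gives the highest log-likelihood."""
    scores = {name: gmm_log_likelihood(x, *params)
              for name, params in models.items()}
    return max(scores, key=scores.get)


def make_toy_models(dim=2):
    """Toy stand-in for trained models: one Gaussian per class, spaced apart."""
    models = {}
    for i, name in enumerate(CLASSES):
        weights = np.array([1.0])
        means = np.full((1, dim), 5.0 * i)
        variances = np.ones((1, dim))
        models[name] = (weights, means, variances)
    return models
```

With the toy models, a feature vector near (10, 10) falls closest to the third class, so `classify_frame([10.1, 9.9], make_toy_models())` selects `"music"`.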

21 Sports Highlight Extraction Framework Data stream of the result of Audio Classification: … 2 2 2 2 3 1 3 3 3 2 2 2 2 2 2 2 2 2 4 4 1 3 3 3 1 1 3 … Sliding time window W along time t. Importance Level = (percentage of the significant audio class in the window) x (audio energy), plotted against playback time. Excited Speech is effective for sports content.
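One way to read this slide is that the importance level at each time step combines the share of the significant audio class inside a sliding window with the audio energy. The sketch below follows that reading under stated assumptions: a trailing window, the mean energy over the window, and the hypothetical function name `importance_levels`. The slide does not specify these details.

```python
import numpy as np


def importance_levels(labels, energy, significant_class, window):
    """Importance level per time step (illustrative, not the product formula).

    labels: per-frame audio class indices from the classifier.
    energy: per-frame audio energy.
    significant_class: class index treated as significant (e.g. excited speech).
    window: sliding window length in frames (assumed to be a trailing window).
    """
    labels = np.asarray(labels)
    energy = np.asarray(energy, dtype=float)
    hits = (labels == significant_class).astype(float)
    out = np.zeros(len(labels))
    for t in range(len(labels)):
        lo = max(0, t - window + 1)
        # Fraction of the window occupied by the significant class.
        fraction = hits[lo:t + 1].mean()
        # Assumption: weight by the mean energy over the same window.
        out[t] = fraction * energy[lo:t + 1].mean()
    return out
```

For labels `[2, 2, 3, 3]`, unit energy, significant class 3 and a window of 2 frames, the levels rise from 0 to 0.5 to 1.0 as the window fills with the significant class.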

22 Realization on Target Platform. Figure: importance level plot (importance level, high to low, against playback time) with a SLICE LEVEL set by the user’s preference. Highlight Scene: portion of the plot above the slice level. Highlight Search function: skipping to the start point manually. Summarized Playback function: play, skip automatically, play, ...
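The slice-level idea can be sketched as a simple threshold over the importance level plot: runs above the slice level become highlight scenes, which summarized playback would play while skipping the rest. The function name `highlight_segments` and the half-open `(start, end)` index convention are assumptions made for illustration.

```python
def highlight_segments(importance, slice_level):
    """Return (start, end) index pairs where importance exceeds slice_level.

    end is exclusive. Raising slice_level yields fewer, shorter highlights,
    which matches the user-adjustable slice level on the slide.
    """
    segments = []
    start = None
    for t, value in enumerate(importance):
        if value > slice_level and start is None:
            start = t                      # a highlight begins
        elif value <= slice_level and start is not None:
            segments.append((start, t))    # the highlight ends
            start = None
    if start is not None:
        segments.append((start, len(importance)))
    return segments
```

For example, `highlight_segments([0.1, 0.6, 0.7, 0.2, 0.9], 0.5)` gives two segments, and raising the slice level to 0.8 leaves only the last one.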

23 Proposed New Features (1) To extract highlight scenes automatically. (2) To browse the expected scenes. (Highlight Search) (3) To grasp the entire content quickly. (Summarized Playback) New Keys for the Features a) Summarized Playback b) Highlight Scene Skip c) Highlight Scene Menu d) Importance Level Adjust Prototype Model Importance Level Plot Remote Control Prototype Model Development

24 Importance Level Plot Result Baseball: RBIs, Home Run. Horse Racing: Finish. Commentator’s talk. 30 min. We obtained satisfactory accuracy on over 30 games from Japanese and U.S. broadcast sports content.

25 Mitsubishi ‘Raku-Reco’ [Highlight Playback] Importance Level Plot Current Position Slice Level Highlight Scenes The product DVR-HE50W was released in September 2005.

26 MELCO DVD Recorder 2006

27 MELCO Highlights Enabled Cellphone 2007

28 Highlight Playback for MELCO Cellphone 2006

29 Highlight Playback for Cellphones

30 Highlight Playback Blu-Ray Recorders 2008

31 Performance Evaluation What is a Good Summary? Are there Objective Measures? – Yes, they exist and No, they are not sufficient User Studies will be key – Key issues are Functionalities and their effectiveness – Controlled tests of demonstration setup with groups of non-expert users – Several Iterations will be required – Hope to gauge the usefulness and efficacy of proposed functionalities – Evidence from our past studies shows that such studies are insightful

32 Experimental Set-Up Interface displayed on Big-Screen TV All Subjects were sports aficionados One of us sat with the subject The rest of us observed from a remote hook-up Wide variety of sports tried – Soccer – American Football – Ice Hockey – Baseball – Tennis

33 Status of MERL Video Summarization MERL Techniques well accepted in community – Book published, Over 20 papers, 6 Invited Book Chapters, Over 30 patents filed, 10 issued – Best Paper awards at ICIP, ICASSP and ICCE – 4 Ph.D’s stemming from research – MERL researchers are members of PC’s of prestigious conferences such as ACM Multimedia, and IEEE ICME MERL advantages – Simplicity stemming from compressed domain computation – Accuracy comparable to best – Strong in both audio and video analysis

34 General Response Missed Highlights bigger concern than False Highlights Quick recovery from false highlights essential Overall system enjoyable despite its mistakes Users liked: – The Action Map because of the flexibility it affords – Auto-skip and Highlight skip features Suitability to different sports – Best Suited to: – Soccer, Baseball, Hockey

35 Conclusion Users enjoyed system despite flaws The general consistency in the results indicates validity Our hypothesis that fair accuracy combined with a good interface would work well, was confirmed.

36 Product Impact Highlight Playback Feature of Mitsubishi DVD Recorder DVR-HE50W for Japanese Market – World’s first DVD Recorder and mainstream PVR to have sports highlights extraction capability – Recommended as best buy in its category by HiVi Magazine – Uniformly positive reviews in Japanese press Tokusengai magazine rates Mitsubishi sports highlights playback as the best

37 Current work: Genre-independent Smart Skipping Smart Skipping – Adaptive content segmentation at multiple resolutions – Genre independent “one-size fits all” technique – Audio-Based to exploit current platform Current approach – Use hand-labeled data to train a support vector machine (SVM) scene-change classifier. – Hand-labeled data comes from several different genres, so learned classifier works across genres. – Feature representation is compatible with sports highlights work.

38 System overview Audio features – MFCCs, semantic classes Video features – shot change locations Labeled ground truth data – 7 hours of content from sitcoms, news programs, dramas, music videos, etc.

39 Blind Summarization – Content-adaptive summarization Content Specificity enables high summary quality Content-specificity leads to profusion of techniques Difficult to support multiple non-overlapping techniques in practical systems Content-independent techniques are known to compromise quality of summarization Thus the trade-off is between summary quality and content-specificity Our approach is to combine specific and general techniques – Have a large common core of processing – Postpone content-specific processing to as late a stage as possible

40 Problem formulation Motivated by the observation that “interesting” events in unscripted multimedia happen sparsely in a background of “uninteresting” events. – A burst of audience reaction in the vicinity of “interesting” events in sports audio. – A burst of screaming sound in the vicinity of “suspicious” events in surveillance audio content. Given a time series of observations from a usual background process (P1) and an unusual foreground process (P2)

41 System Summary Developed a system that extracts highlight scenes automatically by using audio features and plays back desired highlight scenes. Presented a reasonably accurate highlight extraction algorithm for sports contents by using audio detection of audience reaction. This technology is easy to incorporate into DVD, HDD, Blu-ray Disc recorders. Improvement of the audio classification accuracy and extension of our framework beyond sports video, are avenues for further improvement.

42 Challenges A unified, non-heuristic way of summarizing content from diverse genres based on: Enhancement of current audio-only approach Adding Visual Object Detection (goalposts, faces etc.) Fusing Audio-Visual Cues to improve accuracy Extending current framework beyond sports to variety show, dramas, news etc.

43 Proposed Hierarchical Video Representation Video with Audio Track Play / Break Audio-Visual Markers Highlights Highlight Groups Feature extraction & segmentation Key audio-visual object detection Audio-visual markers association Grouping Play Break Visual Marker Audio Marker A highlight

44 Proposed Analysis Framework Audio: Audio Marker Detection (Applause, Cheers, Excited Speech). Video: Visual Marker Detection (Baseball Catcher, Soccer Goalpost, Golfer Bending to Hit). A-V Marker Association. Highlight Candidates. Which sport is it? Finer Resolution Highlights: Golf Swings, Golf Putts, Non-highlights; Strikes, Balls, Ball-hits, Non-highlights; Corner Kicks, Penalty Kicks, Non-highlights.

45 What is a good visual marker? That has strong association with “interesting” events That has low variability

46 Visual Marker 1: Squatting Baseball Catcher Catchers Pitches Pitches that are followed by hits Hits that lead to runs/home-runs

47 Faces as a Visual Feature for Video Analysis Visual features are much lower than audio in semantics, BUT, Faces are: Arguably, semantically the most meaningful visual class Humans are the most popular subjects of consumer video Arguably, the best studied visual class for detection Powerful computer vision techniques for detection and Arguably, MERL has the most effective face detector Viola-Jones detector, fast and accurate

48 MERL Face Detector Viola-Jones face detector – One of the fastest and most accurate – Detectors for ‘frontal face’, ‘right face’, ‘left face’, ‘slanted faces’,… – Face Recognition (MERL technology) can be incorporated in the future – FaceAPI – easy integration and development (We developed a plug-in for a popular video editing software) Same object detection framework for faces, baseball catcher, etc. Can increase speed by changing parameters About 15fps, can be increased – accuracy trade off Only I-frames can be sufficient for content analysis

49 Applications – News Video Segmentation First, classify using face count Same as commercial detection. Mostly 1-face segments for news, though. Cluster 1-face segments using face location and size Effectively classifies different scene types Anchor shots appear in one cluster, weather report in another, outside correspondents another, etc. Detected 11 story introductions, missed 2 Styles vary. Will test across many news video sources. Temporal smoothing and error filtering to generate smooth summaries In general, applicable to static-scene programs Promising results with talk-show programs (e.g. monologues vs. guest segments) Interviews, documentaries, etc.

50 Scene composition clusters. (x: face x-location. y: face size)

51 Audio-Visual Marker Association Why? – To establish the beginning and end of the event of interest – to reduce false alarms – to use audience reaction to gauge significance Two cases – A video marker overlaps with an audio marker Associate them – A video marker followed by several audio markers Choose the nearest one that is not too far apart

52 Audio-visual marker association cases. Case 1: more than 50% overlap between audio marker and visual marker. Case 2: the gap t between the visual marker and the audio marker candidate is less than a threshold, obtained from the statistics of play durations of a game. Candidates that satisfy neither case are discarded segments.
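The two association cases can be sketched as below. This is an illustrative reading only: markers are assumed to be `(start, end)` tuples in seconds, the 50% overlap is assumed to be measured against the shorter of the two markers (the slide does not say which), and the names `overlap_fraction` and `associate` are hypothetical.

```python
def overlap_fraction(a, b):
    """Overlap of intervals a and b as a fraction of the shorter interval."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    shorter = min(a[1] - a[0], b[1] - b[0])
    if shorter <= 0:
        return 0.0
    return max(0.0, overlap) / shorter


def associate(visual, audio_markers, gap_threshold):
    """Return the audio marker associated with a visual marker, or None.

    Case 1: an audio marker overlaps the visual marker by more than 50%.
    Case 2: otherwise take the nearest following audio marker, provided the
    gap t between them is below gap_threshold.
    """
    for audio in audio_markers:
        if overlap_fraction(visual, audio) > 0.5:
            return audio
    following = [a for a in audio_markers if a[0] >= visual[1]]
    if following:
        nearest = min(following, key=lambda a: a[0] - visual[1])
        if nearest[0] - visual[1] < gap_threshold:
            return nearest
    return None  # the candidate is discarded
```

With a visual marker at (10, 20), an audio marker at (14, 22) is associated through case 1, while audio markers at (23, 30) and (40, 45) with a 5-second threshold associate the first one through case 2.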

53 What type of user goals? Skim (highlights, news) – automatic summary, quick overview Content selection – decide what to watch – Movie trailer, sitcom jokes, instructional video abstract Browse segments – find out what elements included – Stages of how-to, questions in game show, songs in music, guests in morning show, … Content segment selection – locate and watch only desired part – Songs in variety show, segments of a talk show, verdict in court TV Remember previously viewed – quickly remember which … – Series episodes – sitcoms, children’s, cartoons… – Home video browsing, searching Many times a mixture of above, or switching between

54 Problem formulation We want to find scene boundaries (boundaries between semantically different segments of a show). – We will use scene boundaries for smart browsing. For different genres, different features are useful. – Short music segments indicate scene changes in some sitcoms. – Speaker changes indicate scene changes in news programs. To create a single system that works for all genres, we hand-label a training set of diverse video content and use a support vector machine to learn an optimal classifier.

55 System overview Audio features – MFCCs, semantic classes Video features – shot change locations Labeled ground truth data – 7 hours of content from sitcoms, news programs, dramas, music videos, etc.

56 Audio Texture Analysis based temporal segmentation Audio features –semantic classes

57 Results Training data + SVM allows us to test many combinations of features on many genres. A combination of features (semantic histograms plus Bhattacharyya distance) was found to perform much better than any single audio feature. Video Shot detection improves the results
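As a rough illustration of the feature combination named on this slide, the sketch below builds histograms of semantic class labels and compares two of them with the Bhattacharyya distance, D = -ln(sum_i sqrt(p_i q_i)). Comparing the window before a candidate boundary with the window after it is an assumption about how such a feature might be used; the function names are hypothetical and the SVM stage is not shown.

```python
import numpy as np


def semantic_histogram(labels, num_classes):
    """Normalized histogram of semantic class labels (non-negative ints)."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes)
    counts = counts.astype(float)
    return counts / counts.sum()


def bhattacharyya_distance(p, q, eps=1e-12):
    """Bhattacharyya distance between two discrete distributions.

    eps caps the distance when the histograms do not overlap at all.
    """
    coefficient = np.sum(np.sqrt(p * q))
    return -np.log(max(coefficient, eps))
```

Two windows with the same label mix give a distance near zero, while windows with disjoint labels (say, all speech before and all music after) give a large distance, which is the kind of signal a scene-change classifier could use.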

58 Performance Evaluation What is a Good Summary? Are there Objective Measures? – Yes, they exist and No, they are not sufficient User Studies will be key – Key issues are Functionalities and their effectiveness – Controlled tests of demonstration setup with groups of non-expert users – Several Iterations will be required – Hope to gauge the usefulness and efficacy of proposed functionalities – Evidence from Time Tunnel studies shows that such studies are insightful

59 Experimental Set-Up Interface displayed on Big-Screen TV All Subjects were sports aficionados One of us sat with the subject The rest of us observed from a remote hook-up Wide variety of sports tried – Soccer – American Football – Ice Hockey – Baseball – Tennis

60 Starting the session Demonstration of remote-control based highlights playback to subject Remote control has highlight skip button in addition to usual FF, REW etc.

61 Three Simple Tasks You have ten minutes to watch the content before you leave for work. You have thirty minutes to watch the content You have a full afternoon to watch the content The user was allowed to choose the content

62 Outline of Questions for Subjects Not a rigorous scientific experiment but intended to get a general idea of user response Salient Questions: How much of sports video do you record every week/month? What was your overall impression of the interface? Do you think it captured all the highlights?

63 What is a Highlight? Recorded both detailed comments and yes-no answers P1 thought that some highlights seemed “random.” P3 said that crowds react to things that he does not always care about. P4 “didn’t like” false and missing highlights. He said, “show me home runs and touchdowns.” Unanimous in liking the highlights when found “correctly”

64 General Response Missed Highlights bigger concern than False Highlights Quick recovery from false highlights essential Overall system enjoyable despite its mistakes Users liked: – The Action Map because of the flexibility it affords – Auto-skip and Highlight skip features Suitability to different sports – Best Suited to: – Soccer, Baseball, Hockey

65 Conclusion Users enjoyed system despite flaws The general consistency in the results indicates validity Our hypothesis that fair accuracy combined with a good interface would work well, was confirmed.

66 Blind Summarization – Content-adaptive summarization Content Specificity enables high summary quality Content-specificity leads to profusion of techniques Difficult to support multiple non-overlapping techniques in practical systems Content-independent techniques are known to compromise quality of summarization Thus the trade-off is between summary quality and content-specificity Our approach is to combine specific and general techniques – Have a large common core of processing – Postpone content-specific processing to as late a stage as possible

67 Problem formulation Motivated by the observation that “interesting” events in unscripted multimedia happen sparsely in a background of “uninteresting” events. – A burst of audience reaction in the vicinity of “interesting” events in sports audio. – A burst of screaming sound in the vicinity of “suspicious” events in surveillance audio content. Given a time series of observations from a usual background process (P1) and an unusual foreground process (P2)

68 Proposed Analysis & Representation Framework

69 Outlier subsequence detection in time series. Figure: input time series (… 12123232345432536543 …), windowed with W_L and W_S; N context models M_1, M_2, …, M_i, …, M_N; compute affinity matrix (N x N); clustering; outlier subsequence detection; detected transitions and outlier subsequences.

70 Examples of time series from audio Low-level features Mid-level semantic labels at different time resolutions Applause Cheering Music Speech Speech & Music MFCC MDCT Comparing Likelihoods Training Examples Variety of content - Input Audio Audio Class 111234551111…

71 Cluster indicator vector for a soccer clip

72 Mining surveillance video

73 Using confidence metric to rank outliers I – set of inliers; M_i – inlier context model; M_j – outlier to be ranked. When … has only a finite support, use … to rank outliers. Figure labels: P_{d,i}, d_1(M_i, M_j), d_2(M_i, M_k)

74 Systematic acquisition of domain knowledge Input content Feature Extraction Mining framework P1 P2 p3 Train Supervised Models

75 Acquired audio classes from surveillance data

76 Scene segmentation for situation comedy content: Music detection

77 Conclusions and Future work Developed Content-adaptive analysis framework based on time-series analysis Postpones content-specific processing to a late stage Works for sports, surveillance, sitcoms, news and other genres where there are one or two key audio-visual object classes High computational complexity compared to content-specific techniques Currently more useful as a test-bed rather than as a practical system Future challenge will be to maintain flexibility while making computational complexity comparable to content-specific techniques Could incorporate visual techniques

78 TimeTunnel: Motivation However - many portions of the interface for accessing DV have not changed Some exceptions include: – Chapter browsing on DVDs (random access) – 30-second skip function for PVRs In summary – the basic method for fast-forwarding and rewinding through digital video is the same as fast-forwarding and rewinding through analogue video

79 Augmenting Fast-forward and Rewind Display images from multiple positions in the video at the same time A combination of temporal and spatial layout – Uses conventional fast-forwarding view in background (slideshow) – Adds context by laying out neighboring frames in a trail Figure labels: Temporal keyhole vs. Spatial array; One variation of our interface

80 Augmenting Fast-forward and Rewind

81 User Evaluation Published the results of an early evaluation in UIST 2003 – Subjects were more accurate at finding a specific location in a video with our technique than with traditional DVD fast-forward – Reduced navigation error by about 25%

82 User Evaluation Recently, subjects used DC images from standard definition video to fast-forward and rewind through video Preferences differed about the specifics of the layout, but some commonality emerged

83 Experiment SAC has a very large design space This initial experiment used a vanilla version of SAC to convince ourselves that there was merit in this realm Hypothesis – SAC will allow users to more accurately reach a desired position in a recorded video than a traditional fast-forward interface will. – SAC will allow users to more quickly reach a desired position in a recorded video than a traditional fast-forward interface will. 15 subjects fast-forwarding through 3-4 minute video Displayed on TV Remote control as input “Please fast-forward to the start of the fireworks show” and “Please watch this program and fast-forward through the commercials” were typical instructions

84 Results: Accuracy average 6.87 vs. 9.18, t(13) = -2.183, p = 0.023

85 Web-based vs Personal View World Personal Content WORLDWIDE CONTENT

86 Limitations of Current Technology Content Extraction: Multi-media content extraction, indexing and retrieval at the Web scale is non-existent – Semantic content extraction from Image/Video data at a large scale in infancy at best – Joint exploitation with text, audio and other data in limited domains: News Videos Indexing: Sub-linear, near real-time indexing for high dimensional data feasible only for modest sized repositories: ~1M images – Video indexing not even tried More Computation not the answer Projections: state of the art with 6 MPEG-1 streams, 24/7, 1 year – Duration of Content: 52560 hours (24 TB of MPEG-1 video) – Database Size: 1420 GB – Indexing Time (1 processor): 289080 seconds (80 hours)
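The projection figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only re-derives the content duration and the indexing time in hours from the numbers given; the final ratio (hours of content indexed per processor-hour) is a derived quantity that the slide does not state.

```python
# Figures taken from the slide.
streams = 6
hours_per_stream = 24 * 365           # one stream recorded 24/7 for one year
indexing_seconds = 289080

# Duration of content across all streams.
total_hours = streams * hours_per_stream

# Indexing time on one processor, converted to hours.
indexing_hours = indexing_seconds / 3600

# Derived: hours of content indexed per processor-hour (not on the slide).
content_hours_per_indexing_hour = total_hours / indexing_hours
```

The first two values reproduce the slide's 52560 hours and roughly 80 hours.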

87 Limitations of Current Technology Learning: Unsupervised, incremental extraction and assimilation of new content at large scale not available – Google-like auto indexing of large scale multi-modal content needed Triggers: Real-time triggers and alerts for events of interest needed Architecture: Distributed algorithms and architectures for meta-data extraction and real-time retrieval need to be discovered

88 Feasibility: Technology Nuggets Real-time Feature extraction and matching Multi-modal association of imagery, video, blogs across Internet – Geo-location based association Incremental meta-data extraction techniques Distributed Architecture for very large databases In-time or real-time triggering for user-specified event Robustness w.r.t. viewpoint, illumination changes through embeddings based representations – Locality-sensitive hashing trees, Randomized trees Logic leading to link graphs Hierarchical Event Representation and Recognition Scalable and Modular Framework Real-time extraction & matching of learned multi-modal features Technology Nuggets Techniques

89 Notional Metrics Duration of Content: Years? Months? Database Size: ? Indexing Time, Indexing Time/Query Feature: Linear? Sub-linear?

90 Notional Metrics (Continued) Robustness: to imprecision of meta-data, imprecision of retrieval, etc. User Feedback: minimize number of required user interactions. Retrieval Accuracy vs Speed Trade-off. Graceful Degradation.

91 Future Directions Scale Will Increase – Personally created and captured content – Personally recorded Broadcast Content – Web Content Personal vs Web Content – No real distinction Variety of devices – Conventional TV’s – PC’s – Handheld’s – Projectors?

92 Technical Challenges Accuracy Computational Complexity Hardware vs. Software – MCA at edge? – MCA at server? User Interfaces

93 Questions 90 minutes of soccer – 2 minute digest 9000 minutes of soccer – 200 minute digest? Boundary line between content creation and analysis? Interactivity in multiple modes? Multimedia Content Analysis should – Convert Multimedia into a set of searchable text words? – Provide Multimedia responses to user needs?

