Temporal Compression Of Speech: An Evaluation IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 4, MAY 2008 Simon Tucker and Steve Whittaker Speaker: 吳麟佑

Outline I. Introduction II. Compression Techniques III. Experimental Procedure IV. Results V. Summary and Discussion

I. Introduction Efficient browsing of speech recordings is problematic. The linear nature of speech, coupled with the lack of abstraction that the medium affords, means that listeners have to listen to long segments of a recording to locate points of interest.

Several large-scale projects have built complex visual tools to help users review multimedia meeting records. These systems record audio and video media, along with other relevant meeting events such as logs of personal notes, whiteboard markings, and presentation slides.

One limitation of these systems is that they require complex visual displays, along with storage and indexing of multiple data types. In contrast, our focus here is on efficient access to speech using small displays such as those available on PDAs and mobile phones.

We explore different temporal compression methods that aim to reduce the amount of time it takes to listen to a speech recording while retaining all of its important information. We investigate two different compression techniques: 1. excision, where unimportant portions of the recording are removed; and 2. speed-up, where the playback rate is altered while keeping speaker pitch constant.

1. Excision Excision reduces the recording length by removing automatically selected portions of audio data, effectively compressing it. One simple excision technique removes the between-word silences in the recording.

For the corpus used in these experiments, on average only 25% of each meeting recording was silence, which limits the compression achievable by silence removal alone. We also explored a new temporal compression technique using semantic information at a coarse-grained word level, as opposed to the sentence-level compression techniques described in the literature.

This technique uses meeting-specific discourse features to produce an extractive summary, which determines which parts of the recording are played to listeners. While such summarization approaches have proved promising for broadcast news, it is unclear how well they generalize to meeting data, which is acoustically more complex and much less structured.

2. Speed-up Speed-up constructs a new recording in which the speaking rate has been artificially altered. Techniques for altering speech rate range from simple sample-frequency alteration (which changes both the playback rate and speaker pitch) to more complex frequency-domain procedures.

Alteration is largely carried out in the time domain; a technique popular for its efficiency and quality is the synchronized overlap-add (SOLA) method, which successively overlaps small segments of the recording. User studies have shown linear speed-up to be effective: users can comprehend information played at twice its normal rate, and after exposure to sped-up speech they prefer it to the normal speech rate.
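
A minimal sketch of this kind of time-domain, pitch-preserving speed-up, assuming a mono float signal; the frame, overlap, and search sizes are illustrative values, not parameters from the paper, and the correlation step is simplified to a raw inner product:

```python
import numpy as np

def sola_speed_up(x, rate, frame_len=1024, overlap=256, search=128):
    """Speed up a mono signal by `rate` (> 1) while preserving pitch,
    using a basic synchronized overlap-add (SOLA) scheme."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return x.copy()
    synth_hop = frame_len - overlap               # hop through the output
    anal_hop = int(synth_hop * rate)              # hop through the input
    out = np.zeros(int(len(x) / rate) + 2 * frame_len + search)
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in

    out[:frame_len] = x[:frame_len]               # first frame copied as-is
    out_pos = synth_hop
    i = 1
    while i * anal_hop + frame_len <= len(x):
        frame = x[i * anal_hop : i * anal_hop + frame_len]
        # Search a small window for the shift that best aligns the new frame
        # with the audio already written (the "synchronized" part of SOLA).
        best_off, best_corr = 0, -np.inf
        for off in range(-search, search + 1):
            start = out_pos + off
            if start < 0 or start + frame_len > len(out):
                continue
            corr = float(np.dot(out[start:start + overlap], frame[:overlap]))
            if corr > best_corr:
                best_corr, best_off = corr, off
        start = out_pos + best_off
        if start + frame_len > len(out):          # out of pre-allocated room
            break
        # Cross-fade the overlapping region, then append the rest of the frame.
        out[start:start + overlap] = (out[start:start + overlap] * fade_out
                                      + frame[:overlap] * fade_in)
        out[start + overlap:start + frame_len] = frame[overlap:]
        out_pos = start + synth_hop
        i += 1
    return out[:out_pos + overlap]
```

Linear playback at twice real time then amounts to calling this with rate=2.0 over the whole recording.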

More recent work explores nonlinear speed-up. While linear techniques apply constant compression throughout the recording, nonlinear approaches vary compression continuously according to external factors. We explore the effects of speed-up within an utterance, motivated by the finding that listeners can often infer what will be said towards the end of an utterance.

Finally, we devise novel hybrid techniques that combine speed-up and excision. Having identified unimportant utterances or silences, we can speed through these instead of excising them altogether.

II. Compression Techniques A. Acoustic Excision B. Semantic Excision C. Acoustic Speed-Up D. Semantic Speed-Up E. Hybrid Acoustic Speed-Up F. Hybrid Semantic Speed-Up

A. Acoustic Excision Acoustic excision removes silences, on the grounds that these do not convey important information. Standard silence excision techniques are constrained by the amount of silence in the original recording. To overcome this limitation, we developed a technique that measures the silence similarity of audio segments. We then set a threshold, and segments above the threshold are labeled as silence-like and excised.
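
A minimal sketch of the thresholding idea, assuming short-time energy as a stand-in for the silence-similarity measure; the feature, frame size, and function name are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def silence_excise(x, sr, target_fraction, frame_ms=20):
    """Keep only the least silence-like frames so that roughly
    `target_fraction` of the original duration remains."""
    x = np.asarray(x, dtype=float)
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Low short-time energy ~ high silence similarity (illustrative proxy).
    energy = np.sum(frames ** 2, axis=1)

    # Place the threshold so that the desired fraction of frames survives.
    n_keep = max(1, int(round(target_fraction * n_frames)))
    keep = np.argsort(energy)[::-1][:n_keep]      # highest-energy frames
    keep = np.sort(keep)                          # preserve temporal order

    return frames[keep].reshape(-1)

# Example: compress a 16 kHz recording to 70% of its original length.
# shortened = silence_excise(x, 16000, 0.7)
```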

B. Semantic Excision Semantic excision uses a meeting transcript and information retrieval techniques to identify less important speech segments, which are then excised from the recording.
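
A hedged sketch of transcript-based utterance scoring, assuming a simple tf-idf style significance score as the importance measure; the paper's exact IR weighting is not reproduced here, and word count is used as a rough proxy for spoken duration:

```python
import math
from collections import Counter

def score_utterances(utterances):
    """Rank transcript utterances by a simple tf-idf significance score;
    low-scoring utterances are candidates for excision."""
    docs = [u.lower().split() for u in utterances]
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    n_docs = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        tfidf = sum(count * math.log(n_docs / df[w]) for w, count in tf.items())
        scores.append(tfidf / max(1, len(doc)))          # length-normalised
    return scores

def excise_unimportant(utterances, target_fraction):
    """Keep the highest-scoring utterances until roughly `target_fraction`
    of the total word count (a proxy for duration) is reached."""
    scores = score_utterances(utterances)
    order = sorted(range(len(utterances)), key=lambda i: scores[i], reverse=True)
    budget = target_fraction * sum(len(u.split()) for u in utterances)
    kept, used = set(), 0
    for i in order:
        if used >= budget:
            break
        kept.add(i)
        used += len(utterances[i].split())
    return [u for i, u in enumerate(utterances) if i in kept]
```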

C. Acoustic Speed-Up We explore constant speed-up and speech-rate speed-up. With constant speed-up, clips compressed at the low compression rate are played back at a constant 1.4 times real time, while clips compressed at the high rate are played back at 2.5 times real time, corresponding to roughly 70% and 40% of the original duration, respectively.
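
The relationship between playback rate and compressed duration is simple to check; a tiny sketch using the rates from the slide (the function names are illustrative):

```python
def compressed_fraction(rate):
    """Fraction of the original duration remaining after playback at `rate`."""
    return 1.0 / rate

def rate_for_fraction(fraction):
    """Playback rate needed to shorten a recording to `fraction` of its length."""
    return 1.0 / fraction

print(compressed_fraction(1.4))   # ~0.71, i.e. roughly 70% of the original
print(compressed_fraction(2.5))   # 0.40, i.e. 40% of the original
print(rate_for_fraction(0.4))     # 2.5
```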

D. Semantic Speed-Up Psycholinguistic studies indicate that the more of an utterance listeners have heard, the better they are able to predict what will next be said, suggesting that end-of-utterance information is partially redundant.
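
One way to realise within-utterance speed-up is to ramp the playback rate across each utterance, so that later, more predictable material is played faster. A hedged sketch with illustrative segment counts and ramp endpoints, not the paper's actual schedule:

```python
import numpy as np

def utterance_rate_schedule(n_segments, start_rate=1.0, end_rate=2.0):
    """Linearly ramp the playback rate across the segments of one utterance."""
    return np.linspace(start_rate, end_rate, n_segments)

def compressed_duration(segment_durations, rates):
    """Duration of the utterance after each segment is sped up by its rate."""
    return float(np.sum(np.asarray(segment_durations) / np.asarray(rates)))

# Example: a 6 s utterance split into six 1 s segments, ramped from 1x to 2x.
rates = utterance_rate_schedule(6, 1.0, 2.0)
print(rates)                                   # [1.  1.2 1.4 1.6 1.8 2. ]
print(compressed_duration([1.0] * 6, rates))   # ~4.2 s instead of 6 s
```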

E. Hybrid Acoustic Speed-Up Silence speed-up identifies silence-like segments as described above but, instead of excising them, plays them back at an increased rate. It otherwise works in the same way as silence excision: all frames are classified as having either high or low silence similarity, and the high-similarity frames are sped up rather than removed.

F. Hybrid Semantic Speed-Up A similar technique, insignificant utterance speed-up, identifies unimportant utterances and presents them sped up. It mirrors insignificant utterance excision, except that the utterances are divided into important and unimportant groups according to our importance measure, and the balance between the two groups is chosen so that the required level of compression is achieved.
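
A hedged sketch of how the speed-up applied to the unimportant portion might be derived from a target compression level, assuming the important utterances are left at normal rate (an illustrative calculation, not the paper's exact procedure): if a fraction p of the recording is important and the target compressed length is a fraction c of the original, the unimportant portion must be sped up by r = (1 - p) / (c - p).

```python
def unimportant_rate(important_fraction, target_fraction):
    """Speed-up factor for the unimportant portion so that the whole recording
    compresses to `target_fraction`, with important speech left at 1x."""
    p, c = important_fraction, target_fraction
    if c <= p:
        raise ValueError("target must leave room beyond the important speech")
    return (1.0 - p) / (c - p)

# Example: 30% of the recording is important and the target length is 70%
# of the original; the remaining 70% must then be played at 1.75x.
print(unimportant_rate(0.3, 0.7))   # 0.7 / 0.4 = 1.75
```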

III. Experimental Procedure A. Stimuli Seventeen two-minute clips were manually selected from the meeting corpus used in these experiments; two minutes was chosen because it gives listeners sufficient time to judge both the effort of listening and their overall understanding of the excerpt.

B. Procedure Experiments took place in an acoustically isolated booth, with clips being presented to listeners diotically over Sennheiser HD250 headphones. A Matlab script was used to both present the excerpts and collect the results.

IV. Results The overall results for the experiment are shown in Fig. 1. It is apparent from the graphs that there is little difference between subjective judgments of effort and understanding, possibly because subjects considered that something that was difficult to understand also required substantial effort to process.

Fig. 1. Overall results of the experiment (subjective judgments of effort and understanding).

V. Summary and Discussion We carried out an exploratory study developing new temporal compression algorithms and comparing their effects on the perceived understanding of compressed speech. In particular, the silence excision, word-level excision, and hybrid techniques described here are unique to this paper, as is the subjective comparison between speed-up and excision manipulations. Subjects found most techniques acceptable at low compression levels, with differences only emerging strongly at higher levels.

Unhappy Ending