Cross-Modal (Visual-Auditory) Denoising. Dana Segev, Yoav Y. Schechner, Michael Elad. Technion – Israel Institute of Technology

Presentation transcript:

1 Cross-Modal (Visual-Auditory) Denoising. Dana Segev, Yoav Y. Schechner, Michael Elad. Technion – Israel Institute of Technology

2 Digits sequence. Noisy digits sequence. Denoised by the state-of-the-art algorithm of Cohen & Berdugo. Segev, Schechner, Elad, Cross-Modal Denoising

3 Use one modality to denoise another? Use video to denoise a soundtrack?

4 Noise: very intense, non-stationary, unknown, from an unseen source. Single microphone.

5 Input: video and very noisy audio, over time (sec). Algorithm: cross-modal, example-based. Output: denoised audio, for human and machine hearing.

8 Training: example set (E). Input: test set (I).

10 ~syllable (0.25 sec)

11 Xylophone

12 Xylophone sound

13-16 Examples (figure sequence)

17 Cross-modal representation. Generating multimodal features. Cross-modal pattern recognition. Rendering a denoised signal. Learning feature statistics.

18 Input video: video feature-space. Input audio: audio feature-space. Time axis in sec.

19 Input audio-video: audio-video feature-space. Time axis in sec.

20 Training audio-video: audio-video examples in feature-space. Time axis in sec.

21-23 Feature-space (figure sequence)

24-25 Nearest neighbor in feature-space
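A minimal sketch of the nearest-neighbor lookup named on this slide, assuming each input segment and each training example is already a fixed-length audio-video feature vector. The function names and the toy 3-D vectors are hypothetical; the l2 distance follows the distance measure listed in the experimental setup later in the deck.

```python
import math

def l2_distance(a, b):
    """Euclidean (l2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, examples):
    """Return the index of the example feature vector closest to `query`."""
    best_index, best_dist = -1, float("inf")
    for k, example in enumerate(examples):
        d = l2_distance(query, example)
        if d < best_dist:
            best_index, best_dist = k, d
    return best_index

# Hypothetical feature vectors, one per training example segment.
examples = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
query = [0.9, 1.2, 0.8]
match = nearest_neighbor(query, examples)
```

The linear scan is enough for an illustration; a large example set would likely call for an indexed search structure.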

26-27 Examples (figure sequence)

28 Noisy audio. Clean segment.

29 Noisy audio. Clean segment. Denoised.
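One way to picture the rendering step, as a sketch only: each noisy input segment is replaced by the clean audio of the training example selected for it, and the selected clean segments are concatenated. The names `render_denoised`, `match_indices` and `clean_segments` are hypothetical, and plain concatenation is an assumption; the slides do not say how segment boundaries are handled.

```python
def render_denoised(match_indices, clean_segments):
    """Concatenate the clean example segments chosen for each input segment.

    match_indices[t] is the index of the example selected for input segment t.
    clean_segments[k] is the clean audio (a list of samples) of example k.
    """
    output = []
    for k in match_indices:
        output.extend(clean_segments[k])
    return output

# Hypothetical clean example segments, two samples each.
clean_segments = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
denoised = render_denoised([2, 0], clean_segments)
```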

30 Examples

31-34 Examples and input (figure sequence)

35 Bartender experiment

36 Examples and input (figure)

37 Cross-modal representation. Generating multimodal features. Cross-modal pattern recognition (NN). Rendering a denoised signal. Learning feature statistics.

38 Feature-space

39 Feature-space. For the k-th example segment:

40 Feature-space with syllable clusters bi, fif, ty, two, ar. For the k-th example segment: bi-fif-ty-two

41 Feature-space with syllable clusters bi, fif, ty, two, ar. Current cluster versus next cluster, each over bi, ty, fif, two, ar.

42 Syllable consecutive probability. Current cluster versus next cluster, each over bi, ty, fif, two, ar. The probability for transition between clusters is defined in terms of the number of examples in the training set.
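A hedged sketch of how cluster-to-cluster transition probabilities could be estimated from training data: count how often cluster r immediately follows cluster q, then normalize. The exact normalization on the slide did not survive extraction, so the per-row normalization used here is an assumption, as are the function name and the toy label sequence.

```python
from collections import Counter

def transition_probabilities(cluster_sequence, num_clusters):
    """Row-normalized counts of consecutive cluster pairs (q -> r)."""
    counts = Counter(zip(cluster_sequence[:-1], cluster_sequence[1:]))
    P = [[0.0] * num_clusters for _ in range(num_clusters)]
    for q in range(num_clusters):
        row_total = sum(counts[(q, r)] for r in range(num_clusters))
        if row_total == 0:
            continue  # cluster q never has a successor in the training data
        for r in range(num_clusters):
            P[q][r] = counts[(q, r)] / row_total
    return P

# Hypothetical cluster labels of consecutive training segments:
# 0 = bi, 1 = fif, 2 = ty, 3 = two, 4 = ar
training_labels = [0, 1, 2, 3, 0, 4, 0, 1, 2, 3]
P = transition_probabilities(training_labels, 5)
```

In this toy sequence "bi" is followed by "fif" twice and by "ar" once, so the row for "bi" splits its probability mass between those two successors.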

43 Hidden Markov Model. Diagram labels: P, time delay. Syllable sequence: bi, fif, ty, two, bi.

44 Diagram labels: P, time delay, audio noise. Syllable sequence: bi, fif, ty, two, bi.

45 Hidden Markov Model + audio noise. Diagram labels: P, time delay. Syllable sequence: bi, fif, ty, two, bi.

46-48 Examples and input (figure sequence)

49-50 Input video

51 Input video. Vector of indices.

52 A cost function: a data term and a regularization term.

53 A cost function: a data term and a regularization term. Optimal vector of indices.
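The formula itself is missing from the transcript. As an illustration only, a cost of this kind is often written as a sum of per-segment data costs plus a weighted sum of pairwise regularization costs. All symbols below are assumptions rather than the authors' notation: m = (m_1, ..., m_T) is the vector of example indices, D is the data term, R is the regularization term, and lambda is a weight.

```latex
\hat{\mathbf{m}} \;=\; \arg\min_{\mathbf{m}} \;
  \sum_{t=1}^{T} D(t, m_t) \;+\; \lambda \sum_{t=2}^{T} R(m_{t-1}, m_t)
```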

54 Examples and input: nodes, edges. Complexity: … Dynamic programming. Complexity: …
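A sketch of the dynamic-programming idea, under the assumption that the cost is a per-segment data cost plus a pairwise cost between consecutive choices (a Viterbi-style recursion). The function name and both cost tables are hypothetical. This version takes O(T * K^2) time for T input segments and K examples; the complexity expressions on the slide itself were lost in extraction.

```python
def best_index_sequence(data_cost, transition_cost):
    """Minimize sum_t data_cost[t][k_t] + sum_t transition_cost[k_{t-1}][k_t].

    data_cost[t][k]: cost of assigning example k to input segment t.
    transition_cost[j][k]: cost of example k directly following example j.
    Returns (list of chosen example indices, total cost).
    """
    T, K = len(data_cost), len(data_cost[0])
    cost = list(data_cost[0])  # best cost of any path ending in each example
    back = []                  # back[t-1][k]: best predecessor of k at step t
    for t in range(1, T):
        new_cost, pointers = [], []
        for k in range(K):
            j_best = min(range(K), key=lambda j: cost[j] + transition_cost[j][k])
            new_cost.append(cost[j_best] + transition_cost[j_best][k] + data_cost[t][k])
            pointers.append(j_best)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the best final example.
    k = min(range(K), key=lambda i: cost[i])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path, min(cost)

# Toy problem: the data term alone prefers 0, 1, 0, but switching is expensive.
data_cost = [[0, 2], [2, 0], [0, 2]]
transition_cost = [[0, 5], [5, 0]]
path, total = best_index_sequence(data_cost, transition_cost)
```

With these toy costs the pairwise term outweighs the data term, so the cheapest path stays on example 0 throughout.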

55-57 Examples and input (figure sequence)

58 Cross-modal representation. Generating multimodal features. Cross-modal pattern recognition. Rendering a denoised signal. Learning feature statistics.

59 Features
Audio features. Requirements: sensitivity to sound perception; dimension reduction. Speech features: MFCCs. Music features: spectrogram of each segment.
Visual features. Requirements: focusing on the motion of interest; dimension reduction. Speech features: DCT coefficients. Music features: the spatial trajectory of a hitting rod.

60 MFCCs – Mel-frequency Cepstral Coefficients. Audio signal → signal spectrum → Mel-frequency filter bank → log(·) → DCT → MFCCs
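The chain on this slide (spectrum, mel filter bank, log, DCT) can be sketched in plain Python for a single short frame. This is an illustration under stated assumptions, not the authors' implementation: the mel formula with constants 2595 and 700 is one common convention, the filter and coefficient counts are arbitrary, no window function is applied, and the naive DFT is only practical for short frames.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Squared magnitude of a naive DFT, bins 0 .. N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        spectrum.append(abs(s) ** 2)
    return spectrum

def mel_filter_bank(num_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale; one row per filter."""
    top = hz_to_mel(sample_rate / 2.0)
    mels = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = [[0.0] * (n_fft // 2 + 1) for _ in range(num_filters)]
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):        # rising edge of the triangle
            bank[i - 1][b] = (b - left) / (center - left)
        for b in range(center, right):       # falling edge of the triangle
            bank[i - 1][b] = (right - b) / (right - center)
    return bank

def dct_ii(x):
    """Unnormalized DCT-II of a list of numbers."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
            for k in range(n)]

def mfcc(frame, sample_rate, num_filters=10, num_coeffs=6):
    power = power_spectrum(frame)
    bank = mel_filter_bank(num_filters, len(frame), sample_rate)
    energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # small offset avoids log(0)
    return dct_ii(log_energies)[:num_coeffs]

# Hypothetical 64-sample frame of a 440 Hz tone sampled at 8000 Hz.
frame = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(64)]
coeffs = mfcc(frame, 8000)
```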

61 Spectrogram of each segment. Xylophone signal → spectrogram → spectrogram accumulation.

62 Speech: the given movie

63 Speech: locking on the object of interest

64-65 Speech: extracting global motion by tracking

66 Speech: extracting features. DCT coefficients which highly represent motion between frames.

67 Xylophone: the given movie

68 Xylophone: locking on the object of interest

69-70 Xylophone: extracting global motion by tracking (axes X, Y, Z)

71 Xylophone: extracting features. Hitting rod spatial coordinates X, Y, Z.

72 Speech: a corpus of a limited number of words and syllables (digits and bar beverages). Video rate 25 fps, audio rate 8000 Hz. K-means clustering, 350 clusters. Distance measure: l2 norm.
Xylophone: a corpus of a limited number of sounds. Video rate 25 fps, audio rate 16000 Hz. Distance measure: l2 norm.
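For readers unfamiliar with the clustering step, here is a minimal k-means (Lloyd's algorithm) sketch using the l2 norm. It is a generic illustration: the initialization, the iteration count and the toy 2-D points are assumptions, and the example uses 2 clusters rather than the 350 reported for speech.

```python
import math
import random

def kmeans(points, num_clusters, num_iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, num_clusters)]
    labels = [0] * len(points)
    for _ in range(num_iterations):
        # Assignment step: nearest centroid under the l2 norm.
        for n, p in enumerate(points):
            labels[n] = min(range(num_clusters),
                            key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(num_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels

# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels = kmeans(points, 2)
```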

73 Xylophone. Training duration: 103 sec. Testing duration: 100 sec. Music from song by GNR: SNR = 0.9. Xylophone melody: SNR = 1.
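The slides report SNR values without stating the convention. The sketch below assumes a plain ratio of average signal power to average noise power and also shows the conversion to decibels; whether the reported numbers are power ratios, amplitude ratios or something else is not stated in the source. The function names and sample values are hypothetical.

```python
import math

def snr_ratio(clean, noise):
    """Average power of the clean signal divided by average power of the noise."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    return signal_power / noise_power

def to_decibels(ratio):
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(ratio)

# Toy signals: the noise has twice the amplitude, so four times the power.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [2.0, -2.0, 2.0, -2.0]
ratio = snr_ratio(clean, noise)
```

Under this convention a ratio below 1 means the noise carries more power than the signal, which matches the "very intense" noise described at the start of the deck.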

74 Speech: Digits. Training duration: 60 sec. Testing duration: 240 sec. Noisy and denoised. SNR = 0.07.

75 Speech: Bartender. Training duration: 48 sec. Testing duration: 350 sec. Music from song by Phil Collins, male speech, white Gaussian. SNR = 0.59, SNR = 0.3, SNR = 0.38.

76 Input: video and very noisy audio, over time (sec). Algorithm: example-based, hidden Markov model. Output: denoised audio, for human and machine hearing.