Multiple Audio Sources Detection and Localization Guillaume Lathoud, IDIAP Supervised by Dr Iain McCowan, IDIAP.

Presentation transcript:


Outline
Context and problem.
Approach.
–Discretize: (sector, time frame, frequency bin).
–Example.
Experiments.
–Multiple loudspeakers.
–Multiple humans.
Conclusion.

Context
Automatic analysis of recordings:
–Meeting annotation.
–Speaker tracking for speech acquisition.
–Surveillance applications.
Questions to answer:
–Who? What? Where? When?
Location can be used for very precise segmentation.

Microphone Array

Why Multiple Sources?
Spontaneous multi-party speech:
–Short.
–Sporadic.
–Overlaps.
Problem: frame-level multisource localization and detection.
One frame = 16 ms.
Many localization methods exist… But:
–Speech is wideband.
–Detection issue: how many?


Sector-based Approach
Question: is there at least one active source in a given sector?
Answer it for each frequency bin separately.

Frame-level Analysis
[Figure: grid with axes s (sector of space) and f (frequency bin).]
One time frame every 16 ms.
Discretize both space and frequency.
Sparsity assumption [Roweis 03].
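The time and frequency discretization can be sketched as follows. This is only an illustration: the 16 kHz sample rate, the Hann window, the non-overlapping frames and the name `frame_spectra` are assumptions, not details given in the slides.

```python
import cmath
import math


def frame_spectra(signal, sample_rate=16000, frame_ms=16):
    """Split one channel into consecutive 16 ms frames and return, for each
    frame, its complex spectrum (one value per non-negative frequency bin)."""
    n = int(sample_rate * frame_ms / 1000)  # samples per frame
    n_bins = n // 2 + 1
    # Hann window (an assumption; the slides do not name a window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    spectra = []
    for start in range(0, len(signal) - n + 1, n):
        frame = [signal[start + i] * window[i] for i in range(n)]
        # Naive DFT, O(n^2): acceptable for a sketch, use an FFT in practice.
        spectrum = [
            sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n_bins)
        ]
        spectra.append(spectrum)
    return spectra
```

Each entry `spectra[t][k]` then corresponds to one (time frame, frequency bin) cell; the sector dimension is added by the per-bin analysis.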

Frequency Bin Analysis
Compute the phase between 2 microphones: $\theta(f)$ in $[-\pi, +\pi]$.
Repeat for all $P$ microphone pairs: $\Theta(f) = [\theta_1(f) \dots \theta_P(f)]$, with $P = M(M-1)/2$ for $M$ microphones.
For each sector $s$, compare the measured phases $\Theta(f)$ with the centroid $\Gamma_s$: pseudo-distance $d(\Theta(f), \Gamma_s)$.
[Figure: for frequency bin $f$, one pseudo-distance per sector: $d(\Theta(f), \Gamma_1)$, $d(\Theta(f), \Gamma_2)$, $d(\Theta(f), \Gamma_3)$, …, $d(\Theta(f), \Gamma_7)$.]
Apply sparsity assumption:
–The best one only is active.
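The per-bin steps on this slide can be sketched as below. It is a minimal illustration, not the author's implementation: the function names are hypothetical, the centroids are taken as given (one list of P phase values per sector), and the pseudo-distance is assumed to be the sum over pairs of the squared sine of half the phase mismatch.

```python
import cmath
import math
from itertools import combinations


def pair_phases(bin_values):
    """Phase difference for each of the P = M(M-1)/2 microphone pairs.

    bin_values holds one complex spectrum value per microphone, all taken at
    the same frequency bin of the same time frame."""
    return [cmath.phase(a * b.conjugate()) for a, b in combinations(bin_values, 2)]


def pseudo_distance(phases, centroid):
    """Sum over pairs of sin^2 of half the phase mismatch (assumed form)."""
    return sum(math.sin((t - g) / 2) ** 2 for t, g in zip(phases, centroid))


def active_sector(bin_values, centroids):
    """Sparsity assumption: only the closest sector is active in this bin."""
    phases = pair_phases(bin_values)
    distances = [pseudo_distance(phases, c) for c in centroids]
    return min(range(len(centroids)), key=distances.__getitem__)
```

Using half the mismatch inside the sine makes the measure insensitive to 2π wrapping of the phases, so no explicit unwrapping is needed.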


Real Data: Single Speaker
With sparsity assumption (this work).
Without sparsity assumption [SAPA 04], similar to [ICASSP 01].


Real Data: Multiple Loudspeakers
Metric | Ideal | Result
>=1 detected | 100% | 100%
Average nb detected | |
>=1 detected | 100% | 99.8%
Average nb detected | |
Each pair of rows is for a different number of loudspeakers simultaneously active.
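The slides do not define the two metrics. One plausible reading, given here only as an assumption, is that ">=1 detected" is the percentage of frames in which at least one truly active source is found, and "Average nb detected" is the mean number of truly active sources found per frame. The function name and the set-based data layout are hypothetical.

```python
def detection_metrics(detected, truth):
    """detected[t] and truth[t] are sets of sector indices for frame t:
    sectors marked active by the detector, and sectors that really contain
    an active source. Only frames with at least one true source are scored.

    Returns (percentage of frames with >=1 correct detection,
             average number of correct detections per frame)."""
    scored = [(d, g) for d, g in zip(detected, truth) if g]
    if not scored:
        return 0.0, 0.0
    hits = [len(d & g) for d, g in scored]
    at_least_one = 100.0 * sum(1 for h in hits if h >= 1) / len(scored)
    average_nb = sum(hits) / len(scored)
    return at_least_one, average_nb
```

With two sources always active, the ideal values under this reading would be 100% and 2.0.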


Real data: Humans
Metric | Ideal | Result
>=1 detected | ~89.4% | 90.8%
Average nb detected | ~ |
>=1 detected | ~96.5% | 95.1%
Average nb detected | ~2.0 | 1.6
Each pair of rows is for a different number of speakers simultaneously active (includes short silences).

Conclusion
Sector-based approach.
Localization and detection.
Effective on real multispeaker data.
Current work:
–Optimize centroids.
–Multi-level implementation.
–Compare multilevel with existing methods.
Possible integration with Daimler.

Thank you!

Pseudo-distance
Measured phases: $\Theta(f) = [\theta_1(f) \dots \theta_P(f)]$ in $[-\pi, +\pi]^P$.
For each sector, a centroid: $\Gamma_s = [\gamma_{s,1} \dots \gamma_{s,P}]$.
$d(\Theta(f), \Gamma_s) = \sum_p \sin^2\bigl( (\theta_p(f) - \gamma_{s,p}) / 2 \bigr)$
$\cos(x) = 1 - 2 \sin^2(x / 2)$ $\Rightarrow$ argmax beamformed energy = argmin $d$
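The link between beamformed energy and the pseudo-distance can be checked with a short derivation. This is a sketch under assumptions: a delay-sum beamformer applied to unit-magnitude (phase-only) spectra from $M$ microphones. The per-microphone phases $\phi_m(f)$ and steering phases $\alpha_{s,m}$ are notation introduced here, with $\theta_p(f) = \phi_m(f) - \phi_n(f)$ the measured phase and $\gamma_{s,p} = \alpha_{s,m} - \alpha_{s,n}$ the centroid value for pair $p = (m, n)$, $m < n$.

```latex
\begin{align*}
E_s(f) &= \Bigl|\sum_{m=1}^{M} e^{j(\phi_m(f) - \alpha_{s,m})}\Bigr|^2
        = M + 2\sum_{p=1}^{P} \cos\bigl(\theta_p(f) - \gamma_{s,p}\bigr) \\
       &= M + 2\sum_{p=1}^{P} \Bigl(1 - 2\sin^2\frac{\theta_p(f) - \gamma_{s,p}}{2}\Bigr)
        = M + 2P - 4\, d\bigl(\Theta(f), \Gamma_s\bigr)
\end{align*}
```

Since $M$ and $P$ do not depend on the sector, the sector that maximizes $E_s(f)$ is the one that minimizes $d$.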

Delay-sum vs Proposed (1/3)
With optimized centroids (this work).
With delay-sum centroids (this work).

Delay-sum vs Proposed (2/3)
Metric | Ideal | Delay-sum | Proposed
>=1 detected | 100% | 99.9% | 100%
Average nb detected | | |
>=1 detected | 100% | 99.2% | 99.8%
Average nb detected | | |
Each pair of rows is for a different number of loudspeakers simultaneously active.

Delay-sum vs Proposed (3/3)
Metric | Ideal | Delay-sum | Proposed
>=1 detected | ~89.4% | 80.0% | 90.8%
Average nb detected | ~ | |
>=1 detected | ~96.5% | 86.7% | 95.1%
Average nb detected | ~ | |
Each pair of rows is for a different number of humans simultaneously active.

Energy and Localization