Published by Clarence O’Connor. Modified over 9 years ago.
1
Multiple Audio Sources Detection and Localization. Guillaume Lathoud, IDIAP. Supervised by Dr Iain McCowan, IDIAP.
2
Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.
3
Context Automatic analysis of recordings: –Meeting annotation. –Speaker tracking for speech acquisition. –Surveillance applications.
4
Context Automatic analysis of recordings: –Meeting annotation. –Speaker tracking for speech acquisition. –Surveillance applications. Questions to answer: –Who? What? Where? When? Location can be used for very precise segmentation.
5
Microphone Array
7
Why Multiple Sources? Spontaneous multi-party speech: –Short. –Sporadic. –Overlaps.
8
Why Multiple Sources? Spontaneous multi-party speech: –Short. –Sporadic. –Overlaps. Problem: frame-level multisource localization and detection. One frame = 16 ms.
9
Why Multiple Sources? Spontaneous multi-party speech: –Short. –Sporadic. –Overlaps. Problem: frame-level multisource localization and detection. One frame = 16 ms. Many localization methods exist, but: –Speech is wideband. –Detection issue: how many sources?
10
Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.
11
Sector-based Approach Question: is there at least one active source in a given sector?
12
Sector-based Approach Question: is there at least one active source in a given sector? Answer it for each frequency bin separately.
13
Frame-level Analysis. One time frame every 16 ms. Discretize both space and frequency. [Figure: grid with axes s (sector of space) and f (frequency bin).]
14
Frame-level Analysis. One time frame every 16 ms. Discretize both space and frequency. Sparsity assumption [Roweis 03]. [Figure: grid with axes s (sector of space) and f (frequency bin).]
15
Frame-level Analysis. One time frame every 16 ms. Discretize both space and frequency. Sparsity assumption [Roweis 03]. [Figure: grid with axes s (sector of space) and f (frequency bin). Values shown in the figure: 0 9 2 0 10 0 1.]
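The framing step can be sketched as follows. This is a minimal illustration, not the presented implementation: the 16 kHz sampling rate, the non-overlapping frames, and the function name `split_into_frames` are assumptions. Only the 16 ms frame period comes from the slides.

```python
import numpy as np

def split_into_frames(signal, sample_rate_hz=16000, frame_ms=16):
    """Cut a multichannel recording into consecutive, non-overlapping frames.

    signal: array of shape (M, T), M microphones and T samples.
    Returns an array of shape (n_frames, M, frame_len).
    The sampling rate is an assumption; the slides only state 16 ms per frame.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000))
    n_frames = signal.shape[1] // frame_len
    # Drop the incomplete tail so every frame has the same length.
    trimmed = signal[:, : n_frames * frame_len]
    frames = trimmed.reshape(signal.shape[0], n_frames, frame_len)
    # Put the frame index first: (n_frames, M, frame_len).
    return frames.transpose(1, 0, 2)
```

Under these assumptions one frame holds 256 samples, so a real FFT of a frame yields 129 frequency bins.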
17
Frequency Bin Analysis. Compute the phase between 2 microphones: $\theta(f) \in [-\pi, +\pi]$. Repeat for all P microphone pairs: $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)]$, with $P = M(M-1)/2$.
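One way to obtain the per-pair phases for a single frame is sketched below. It assumes the phase of a pair is taken as the angle of the cross-spectrum between the two microphones. The function name `pairwise_phases` and the array layout are illustrative choices, not taken from the slides.

```python
import itertools
import numpy as np

def pairwise_phases(frame):
    """Phase difference per microphone pair and frequency bin.

    frame: array of shape (M, N), one time frame for M microphones.
    Returns (theta, pairs): theta has shape (P, N // 2 + 1) with values in
    [-pi, pi], and pairs lists the P = M(M-1)/2 microphone index pairs.
    """
    spectra = np.fft.rfft(frame, axis=1)  # shape (M, N // 2 + 1)
    pairs = list(itertools.combinations(range(frame.shape[0]), 2))
    # Angle of the cross-spectrum X_a(f) * conj(X_b(f)) for each pair (a, b).
    theta = np.array(
        [np.angle(spectra[a] * np.conj(spectra[b])) for a, b in pairs]
    )
    return theta, pairs
```

With M = 8 microphones this gives P = 28 pairs, so each frequency bin is described by a 28-dimensional phase vector.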
18
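Frequency Bin Analysis. Compute the phase between 2 microphones: $\theta(f) \in [-\pi, +\pi]$. Repeat for all P microphone pairs: $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)]$, with $P = M(M-1)/2$. For each sector s, compare the measured phases $\Theta(f)$ with the centroid $\Phi_s$: pseudo-distance $d(\Theta(f), \Phi_s)$. [Figure: for frequency bin f, the pseudo-distances $d(\Theta(f), \Phi_1)$, $d(\Theta(f), \Phi_2)$, $d(\Theta(f), \Phi_3)$, …, $d(\Theta(f), \Phi_7)$, one per sector.]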
19
Frequency Bin Analysis. Compute the phase between 2 microphones: $\theta(f) \in [-\pi, +\pi]$. Repeat for all P microphone pairs: $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)]$, with $P = M(M-1)/2$. For each sector s, compare the measured phases $\Theta(f)$ with the centroid $\Phi_s$: pseudo-distance $d(\Theta(f), \Phi_s)$. Apply the sparsity assumption: –Only the best sector is active.
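The per-bin decision can be sketched as follows. This is an illustration under stated assumptions: the pseudo-distance is taken as the sum over pairs of sin² of half the phase difference, as on the backup slide, and the centroids are stored per frequency bin in an array of shape (S, P, F). The function names and that layout are hypothetical, and how centroids are built is not covered here.

```python
import numpy as np

def pseudo_distance(theta_f, centroids_f):
    """Pseudo-distance between measured phases and each sector centroid.

    theta_f: shape (P,), measured phases for one frequency bin.
    centroids_f: shape (S, P), centroid phases of the S sectors for that bin.
    Returns shape (S,): sum over pairs of sin^2(phase difference / 2).
    """
    return np.sum(np.sin((theta_f - centroids_f) / 2.0) ** 2, axis=1)

def assign_bins(theta, centroids):
    """Apply the sparsity assumption: each bin goes to its closest sector.

    theta: shape (P, F); centroids: shape (S, P, F) (assumed layout).
    Returns (best, counts): best[f] is the winning sector of bin f and
    counts[s] is the number of bins attributed to sector s.
    """
    n_sectors, _, n_bins = centroids.shape
    best = np.empty(n_bins, dtype=int)
    for f in range(n_bins):
        best[f] = np.argmin(pseudo_distance(theta[:, f], centroids[:, :, f]))
    counts = np.bincount(best, minlength=n_sectors)
    return best, counts
```

The per-sector counts are one possible input to a detection decision. The slides do not state the decision rule, so none is sketched here.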
20
Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.
21
Real Data: Single Speaker. [Figure: without sparsity assumption [SAPA 04], similar to [ICASSP 01].]
22
Real Data: Single Speaker. [Figures: with sparsity assumption (this work); without sparsity assumption [SAPA 04], similar to [ICASSP 01].]
23
Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.
24
Real Data: Multiple Loudspeakers
25
Task 2: Multiple Loudspeakers
2 loudspeakers simultaneously active
Metric                     Ideal   Result
>=1 detected               100%
Average number detected    2.0
26
Real Data: Multiple Loudspeakers
2 loudspeakers simultaneously active
Metric                     Ideal   Result
>=1 detected               100%    100%
Average number detected    2.0     1.9
27
Real Data: Multiple Loudspeakers
Metric                       Ideal   Result
2 loudspeakers simultaneously active
  >=1 detected               100%    100%
  Average number detected    2.0     1.9
3 loudspeakers simultaneously active
  >=1 detected               100%    99.8%
  Average number detected    3.0     2.5
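The two metrics in these tables can be computed from a per-frame count of detected sources, as sketched below. The reading of the metrics is an assumption: ">=1 detected" is taken as the fraction of frames with at least one detected source, and "average number detected" as the mean count per frame. The function name `detection_metrics` is hypothetical.

```python
import numpy as np

def detection_metrics(n_detected_per_frame):
    """Summarize per-frame detection counts (assumed metric definitions).

    n_detected_per_frame: sequence of non-negative integers, one per frame.
    Returns the fraction of frames with at least one detection and the
    average number of detections per frame.
    """
    n = np.asarray(n_detected_per_frame)
    return {
        "at_least_one": float(np.mean(n >= 1)),
        "average_detected": float(np.mean(n)),
    }
```

For example, counts of 2, 2, 1 and 0 over four frames give 75% of frames with at least one detection and an average of 1.25 detections per frame.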
28
Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.
29
Real data: Humans
30
2 speakers simultaneously active (includes short silences)
Metric                     Ideal    Result
>=1 detected               ~89.4%   90.8%
Average number detected    ~1.3     1.3
31
Real data: Humans
Metric                       Ideal    Result
2 speakers simultaneously active (includes short silences)
  >=1 detected               ~89.4%   90.8%
  Average number detected    ~1.3     1.3
3 speakers simultaneously active (includes short silences)
  >=1 detected               ~96.5%   95.1%
  Average number detected    ~2.0     1.6
32
Conclusion Sector-based approach. Localization and detection. Effective on real multispeaker data.
33
Conclusion Sector-based approach. Localization and detection. Effective on real multispeaker data. Current work: –Optimize centroids. –Multi-level implementation. –Compare multilevel with existing methods.
34
Conclusion Sector-based approach. Localization and detection. Effective on real multispeaker data. Current work: –Optimize centroids. –Multi-level implementation. –Compare multilevel with existing methods. Possible integration with Daimler.
35
Thank you!
36
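Pseudo-distance. Measured phases $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)]$ in $\mathbb{R}^P$. For each sector, a centroid $\Phi_s = [\phi_{s,1} \ldots \phi_{s,P}]$. $d(\Theta(f), \Phi_s) = \sum_p \sin^2\left( (\theta_p(f) - \phi_{s,p}) / 2 \right)$. Since $\cos(x) = 1 - 2\sin^2(x/2)$: argmax beamformed energy = argmin d.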
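The link between beamformed energy and the pseudo-distance can be spelled out as follows. This is a sketch under assumptions not stated on the slide: each microphone spectrum is reduced to unit magnitude with phase $\alpha_m(f)$, the delay-sum beamformer steered at sector s applies phases $\beta_{s,m}$, and for pair p = (a, b) we have $\theta_p(f) = \alpha_a(f) - \alpha_b(f)$ and $\phi_{s,p} = \beta_{s,a} - \beta_{s,b}$. The symbols $\alpha$, $\beta$ and $E_s$ are introduced here for illustration.

```latex
\begin{align*}
E_s(f) &= \Bigl|\sum_{m=1}^{M} e^{j(\alpha_m(f) - \beta_{s,m})}\Bigr|^2
        = M + 2\sum_{p=1}^{P} \cos\bigl(\theta_p(f) - \phi_{s,p}\bigr) \\
       &= M + 2\sum_{p=1}^{P} \Bigl(1 - 2\sin^2\frac{\theta_p(f) - \phi_{s,p}}{2}\Bigr)
        = M + 2P - 4\, d\bigl(\Theta(f), \Phi_s\bigr) \\
       &= M^2 - 4\, d\bigl(\Theta(f), \Phi_s\bigr)
        \qquad \text{since } P = M(M-1)/2 .
\end{align*}
```

Under these assumptions the energy decreases linearly with d, so the sector that maximizes the beamformed energy is the one that minimizes the pseudo-distance.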
37
Delay-sum vs Proposed (1/3). [Figures: with optimized centroids (this work); with delay-sum centroids (this work).]
38
Delay-sum vs Proposed (2/3)
Metric                       Ideal   Delay-sum   Proposed
2 loudspeakers simultaneously active
  >=1 detected               100%    99.9%       100%
  Average number detected    2.0     1.8         1.9
3 loudspeakers simultaneously active
  >=1 detected               100%    99.2%       99.8%
  Average number detected    3.0     1.9         2.5
39
Delay-sum vs Proposed (3/3)
Metric                       Ideal    Delay-sum   Proposed
2 humans simultaneously active
  >=1 detected               ~89.4%   80.0%       90.8%
  Average number detected    ~1.3     1.0         1.3
3 humans simultaneously active
  >=1 detected               ~96.5%   86.7%       95.1%
  Average number detected    ~2.0     1.4         1.6
40
Energy and Localization