‘Missing Data’ speech recognition in reverberant conditions using binaural interaction. Sue Harding, Jon Barker and Guy J. Brown, Speech and Hearing Research Group, University of Sheffield.


1  ‘Missing Data’ speech recognition in reverberant conditions using binaural interaction
Sue Harding, Jon Barker and Guy J. Brown
Speech and Hearing Research Group, University of Sheffield

2  ‘Missing data’ recognition
- Acoustic features (the masked signal) and the corresponding missing data mask are passed to a decoder based on HMMs.
- The method works well when masks are created using a priori knowledge of the target and masker (noise or speech).
- Problem: how can the mask be created without prior knowledge of the target and masking sources?
[Figure: spectrogram and a priori mask (frequency channel vs. time frames) for three talkers: ‘one two eight oh’ mixed at SNR 0 dB with ‘eight eight four three’ and ‘four two one eight’]
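The a priori mask is essentially a local-SNR comparison between the unmixed sources. A minimal sketch (assuming access to separate target and masker energy spectrograms from the same filterbank, which is exactly the a priori knowledge the method tries to avoid; the 0 dB threshold is an illustrative choice):

```python
import numpy as np

def a_priori_mask(target_spec, masker_spec, threshold_db=0.0):
    """Binary a priori mask: 1 where the target dominates the masker.

    target_spec, masker_spec: (channels, frames) energy spectrograms of the
    *unmixed* target and masker -- available only with a priori knowledge.
    """
    eps = 1e-12  # avoid log of zero
    local_snr_db = 10.0 * np.log10((target_spec + eps) / (masker_spec + eps))
    return (local_snr_db > threshold_db).astype(float)

# Toy example: the target dominates in channel 0, the masker in channel 1.
target = np.array([[4.0, 4.0], [0.1, 0.1]])
masker = np.array([[1.0, 1.0], [1.0, 1.0]])
mask = a_priori_mask(target, masker)
```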

3  Source configuration
- Missing data masks are created using spatial location cues to separate a target source from one or two masking sources.
- The target source is assumed to be at azimuth 0 degrees (straight ahead) for simplicity; the masking source is at some other azimuth.
- Corpus: TIDigits, male speakers.
- Reverberation added to all sources using Roomsim: room size 6 m x 4 m x 3 m, all surfaces ‘acoustic plaster’, reverberation time 0.34 seconds.
- Receiver: KEMAR head (MIT data).
- Front end: 64-channel gammatone filterbank, centre frequencies 50 Hz to 8 kHz, equally spaced on the ERB scale.
[Figure: room layout showing target, masker 1, masker 2 and receiver]
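Centre frequencies equally spaced on the ERB scale can be sketched with the Glasberg and Moore ERB-rate formula (a common choice for gammatone front ends; the exact constants used in the original system are an assumption here):

```python
import numpy as np

def erb_rate(f_hz):
    """Glasberg & Moore ERB-rate scale: number of ERBs below f_hz."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def inverse_erb_rate(e):
    """Inverse of erb_rate: ERB-rate value back to frequency in Hz."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def gammatone_centre_freqs(f_low=50.0, f_high=8000.0, n_channels=64):
    """Centre frequencies equally spaced on the ERB scale."""
    erbs = np.linspace(erb_rate(f_low), erb_rate(f_high), n_channels)
    return inverse_erb_rate(erbs)

cfs = gammatone_centre_freqs()
```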

4  Missing data mask creation (1)
- Spatial location cues: interaural time difference (ITD) and interaural level difference (ILD).
- ITD: cross-correlation between the left and right binaural signals; use the biggest peak in each channel of the cross-correlogram.
- ILD: ratio of energy at the left and right ears.
[Figure: left- and right-ear spectrograms and skeleton cross-correlogram for two talkers mixed at SNR 0 dB, one at azimuth 0 degrees and one at azimuth 40 degrees]
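A toy per-channel ITD/ILD computation, assuming one frame of one gammatone channel from each ear (circular cross-correlation is used only because the toy signal is periodic, and the ITD sign convention is arbitrary in this sketch):

```python
import numpy as np

def itd_ild(left, right, fs=16000, max_lag=16):
    """Estimate ITD (seconds) and ILD (dB) for one channel and frame.

    ITD: lag of the largest cross-correlation peak within +/- max_lag
    samples (circular correlation, adequate for the periodic toy signal).
    ILD: ratio of left-ear to right-ear energy, in dB.
    """
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = np.array([np.sum(left * np.roll(right, lag)) for lag in lags])
    itd = lags[np.argmax(xcorr)] / fs
    eps = 1e-12
    ild = 10.0 * np.log10((np.sum(left ** 2) + eps) /
                          (np.sum(right ** 2) + eps))
    return itd, ild

# Toy signal: the right-ear copy is delayed by 4 samples, at equal level.
fs = 16000
n = np.arange(64)
left = np.sin(2 * np.pi * n / 32)   # two full periods
right = np.roll(left, 4)
itd, ild = itd_ild(left, right, fs=fs, max_lag=8)
```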

5  Missing data mask creation (2)
Steps in mask creation:
1. Create probability distributions giving the probability that an ITD/ILD combination was produced by a target at azimuth 0.
2. Use the distributions to create a missing data mask from the ITD/ILD values of each test utterance.
[Figure: example ITD/ILD distribution and resulting ILD/ITD localisation mask]

6  Training data (probability distribution)
- Target at azimuth 0 degrees; masker at azimuth 5, 10, 20, 40, -5, -10, -20 or -40 degrees.
- Target and masker mixed at SNR 0, 10 or 20 dB (120 pairs of utterances from male speakers, matched for length).
[Figure: target, masker 1 and receiver configuration]

7  Missing data mask creation (3)
Steps in creating the probability distribution (the probability that an ITD/ILD combination was produced by a target at azimuth 0):
1. Identify the ITD and ILD for each frequency channel and time frame of a set of training utterances, with the target source at azimuth 0 and the masker at another azimuth.
2. Identify whether each time-frequency element belongs to the target or the masker (using the a priori mask).
3. Assign each ITD/ILD pair to a bin to create 2-D histograms (one per frequency channel): histogram 1 holds all observations of ITD/ILD (produced by target plus masker); histogram 2 holds only the observations produced by the target.
4. Histogram 2 divided by histogram 1 gives the probability distribution.
[Figure: a priori mask and example ITD/ILD histogram bins]
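Steps 3 and 4 above amount to a ratio of two 2-D histograms per channel; a minimal sketch (bin edges and the toy data are illustrative):

```python
import numpy as np

def build_itd_ild_distribution(itd_all, ild_all, target_flags,
                               bins_itd, bins_ild):
    """P(target at azimuth 0 | ITD, ILD) for one channel, as a histogram ratio.

    itd_all, ild_all: ITD/ILD observations for this channel over the
    training mixtures (histogram 1).
    target_flags: True where the a priori mask assigns the element to
    the target (histogram 2).
    """
    h_all, _, _ = np.histogram2d(itd_all, ild_all,
                                 bins=[bins_itd, bins_ild])   # histogram 1
    h_tgt, _, _ = np.histogram2d(itd_all[target_flags],
                                 ild_all[target_flags],
                                 bins=[bins_itd, bins_ild])   # histogram 2
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(h_all > 0, h_tgt / h_all, 0.0)

# Toy data: ITD near 0 is mostly target; ITD near 0.5 ms is all masker.
itd_all = np.array([0.0, 0.0, 0.0, 0.0, 0.5, 0.5])
ild_all = np.zeros(6)
target_flags = np.array([True, True, True, False, False, False])
p = build_itd_ild_distribution(itd_all, ild_all, target_flags,
                               np.array([-0.25, 0.25, 0.75]),
                               np.array([-1.0, 1.0]))
```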

8  Missing data mask creation (4)
[Figure: example probability distributions (ILD vs. ITD) for 9 frequency channels, centre frequencies 82, 223, 430, 731, 1169, 1807, 2736, 4090 and 6061 Hz]

9  Test data (target plus one masker)
- Target at azimuth 0 degrees; one masker at azimuth 5, 7.5, 10, 15, 20, 30 or 40 degrees.
- Target and masker mixed at SNR 0 dB (240 pairs of utterances from male speakers, matched for length).
- Similar to the training data, but additional azimuths are used for the masker.
[Figure: target, masker 1 and receiver configuration]

10  Missing data mask creation (5)
Steps in creating the missing data mask for each test utterance:
1. Identify the ITD/ILD for each time-frequency element of the test utterance (mixed speakers).
2. Look up the ITD/ILD in the probability distribution and use the probability as the (soft) missing data mask value for that element.
[Figure: resulting ILD/ITD localisation mask compared with the a priori mask]
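Step 2 above is a table lookup; a sketch assuming a single shared probability table for brevity (the method builds one table per frequency channel, and the toy values are illustrative):

```python
import numpy as np

def soft_mask_from_distribution(itd, ild, p_table, bins_itd, bins_ild):
    """Soft missing data mask by looking up P(target | ITD, ILD).

    itd, ild: (channels, frames) estimates for a test mixture.
    p_table: histogram-ratio table of shape (itd_bins, ild_bins).
    """
    i = np.clip(np.digitize(itd, bins_itd) - 1, 0, p_table.shape[0] - 1)
    j = np.clip(np.digitize(ild, bins_ild) - 1, 0, p_table.shape[1] - 1)
    return p_table[i, j]

# Toy lookup: two time-frequency elements falling in different bins.
p_table = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
bins_itd = np.array([0.0, 1.0, 2.0])
bins_ild = np.array([0.0, 1.0, 2.0])
mask = soft_mask_from_distribution(np.array([[0.5, 1.5]]),
                                   np.array([[0.5, 1.5]]),
                                   p_table, bins_itd, bins_ild)
```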

11  Training data (recogniser)
- Target at azimuth 0 degrees (reverberation added); no masker.
- 4228 utterances from 55 male speakers.
- Recogniser: HMMs with 8 states and 10 mixtures.
[Figure: target and receiver configuration]

12  Experiment – ITD v. ILD
- Probability distributions created with ITD only, ILD only, or both.
[Figure: localisation masks produced using ITD only, ILD only and combined ILD/ITD]

13  Experiment – ITD v. ILD
- Both ITD and ILD cues are required for the best performance.
- The method generalises to azimuths not included in the training data for the probability distribution.
[Figure: % accuracy vs. masker azimuth (5 to 40 degrees) for ILD only, ITD only, combined ILD/ITD, the a priori mask, and an MFCC baseline]

14  Test data – additional masker (target plus two maskers)
- Target at azimuth 0 degrees; one masker at azimuth 5, 7.5, 10, 15, 20, 30 or 40 degrees.
- Second masker at azimuth -10 or +10, mixed with the first masker at SNR 0 dB.
- Target and combined maskers mixed at SNR 0 dB (240 sets of 3 utterances from male speakers, matched for length).
[Figure: target, masker 1, masker 2 and receiver configurations]

15  Experiment – one or two maskers
- Same probability distribution as before, i.e. trained on single-masker data.
- Left ear used for recognition.
[Figure: % accuracy vs. azimuth of first masker for ILD/ITD and a priori masks, with a single masker or with two maskers (second masker at azimuth 10 or -10)]

16  Ear selection for recognition (1)
- Two input signals (left and right ear) are available for recognition; acoustic features (the masked signal) plus the missing data mask are passed to the missing data recogniser.
- Previously the left ear was used (generally furthest from the masker, i.e. least affected by it).
- If the masker configuration isn't known, which ear should be used?
[Figure: single-masker and two-masker configurations with receiver]

17  Ear selection for recognition (2)
Options:
- Select either the left or the right ear.
- Combine left and right features by selecting the quietest (i.e. least affected by the maskers), either per time frame or per time-frequency element.
[Figure: left- and right-ear spectrograms and per-frame/per-element selections for a target at azimuth 0 with maskers at azimuths 30 and -10; dark areas show elements of the left-ear signal that are quieter in the right-ear signal]
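The per-frame and per-element options can be sketched as follows (the feature arrays and the energy-based notion of "quietest" are assumptions of this illustration; the corresponding mask values would be selected the same way):

```python
import numpy as np

def select_quietest(left_feats, right_feats, per='element'):
    """Combine left- and right-ear features by choosing the quieter one.

    left_feats, right_feats: (channels, frames) energy features.
    per='element': pick the quieter ear per time-frequency element;
    per='frame': pick whichever ear has less total energy in each frame.
    """
    if per == 'element':
        return np.minimum(left_feats, right_feats)
    use_right = right_feats.sum(axis=0) < left_feats.sum(axis=0)
    return np.where(use_right, right_feats, left_feats)

# Toy features: the right ear is quieter overall in frame 1, and quieter
# in one individual element of frame 0.
left = np.array([[1.0, 5.0], [1.0, 5.0]])
right = np.array([[2.0, 1.0], [0.0, 1.0]])
per_element = select_quietest(left, right, per='element')
per_frame = select_quietest(left, right, per='frame')
```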

18  Test data – ear selection (target plus two maskers)
- Target at azimuth 0 degrees; one masker at azimuth 5, 7.5, 10, 15, 20, 30 or 40 degrees.
- Second masker at azimuth -10 or +10 (asymmetrical), or at the negative azimuth matching the first masker (symmetrical), mixed with the first masker at SNR 0 dB.
- Target and combined maskers mixed at SNR 0 dB (240 sets of 3 utterances from male speakers, matched for length).
[Figure: asymmetrical and symmetrical two-masker configurations]

19  Experiment – ear selection (1)
[Figure: % accuracy vs. azimuth of first masker (5 to 40 degrees) for left and right ears, with two asymmetrical maskers (second at azimuth 10 or -10) and with two symmetrical maskers]

20  Experiment – ear selection (2)
[Figure: % accuracy vs. azimuth of first masker for left ear, right ear, and composite per-element and per-frame selection, in asymmetrical (second masker at azimuth 10 or -10) and symmetrical configurations, plus mean % accuracy for each selection method]

21  Summary and conclusions
- The missing data method using spatial localisation cues gives good performance in difficult conditions: reverberation, multiple maskers, and sources close in location and pitch.
- Both ITD and ILD are needed for the best performance.
- The simple method of determining the probability distribution generalises well to unseen azimuths and additional maskers (and is not sensitive to the reverberation surface of the training data).
- Performance is affected by the ear used: it is best to combine the ears element by element (do humans do the same?).
Further work:
- Conditions with less symmetry in the target/receiver/room configuration.
- More reverberant surfaces.

