Liverpool University
The Department
Centre for Cognitive Neuroscience, Department of Psychology, Liverpool University
Overall aim: understanding human information processing
Expertise
Auditory Scene Analysis (ASA)
–Perception experiments
–Modelling
Speech Perception
Audio-Visual Integration
–Models of AV information fusion
–Applying these models to ASA
Work at Liverpool: Task 1.3, active/passive speech perception
Research question: do human listeners actively predict the time course of background noise to aid speech recognition?
Current state: perceptual evidence for 'predictive scene analysis'; Elvira Perez will explain all
Planned work: database of environmental noise to test computational models
Work at Liverpool: Task 1.4, envelope information & binaural processing
Research question: what features do listeners use to track a target speaker in the presence of competing signals?
Patti Adank (Aug. 03 – July 04)
Current state: tested the hypothesis that 'jitter' is a stream segregation or stream formation cue; tech report finalised in July 04
Work at Liverpool: Task 2.2, reliability of auditory cues in multi-cue scenarios
Research question: how are cues perceptually integrated? A combination of experimentation and modelling
Current state: experimental data and models on audio-visual motion signal integration (non-HOARSE work)
Ongoing work: MLE models for speech feature integration (Elvira)
Planned work: collaboration with Patras (John Worley) on integration of location and pitch segregation cues
Work at Liverpool: Task 4.1, informing speech recognition
Research question: how can data derived from perception experiments be applied to machine learning?
Current state: just starting to 'predict' environmental noises (using the Aurora noises); recording a database of natural scenes for analysis and modelling (with Sheffield)
… over to Elvira
Environmental Noise
Two-pronged approach:
–Elvira: is there perceptual evidence for active noise modelling in listeners?
–Georg (+ Sheffield): noise modelling based on the database
Baseline Data
Typical noise databases are not very representative:
–Size severely limited (e.g. Aurora)
–Unrealistic scenarios (fighter jets, foundries)
Database of environmental noise:
–Transport noises: A320-200, ICE, Saab 9-3, …
–Social places: departure lounges, hotel lobby, pub
–Private journeys: urban walk, country walk
–Buildings: offices, corridors
–…
Aim is to have about 10-20 minutes of representative data for typical situations.
Recordings
Soundman OKM II binaural microphones
Sony D3 DAT recorder
48 kHz stereo recordings
Digital transfer to PC
Analysis
Previous work:
–Auditory filterbank (linear, Mel-scale, 32 channels)
–Linear prediction, within channels and across channels
Planned work:
–Auditory filterbank
–Non-linear prediction using neural networks
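As a minimal sketch of the 'linear prediction within channels' step (assuming numpy/scipy; the auditory filterbank front end is omitted and smoothed noise stands in for one channel envelope):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """Linear-prediction coefficients via the autocorrelation method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    return solve_toeplitz(r[:order], r[1:order + 1])

def predict(env, order=8):
    """One-step-ahead linear prediction of a channel envelope."""
    a = lpc(env, order)
    pred = np.zeros_like(env)
    for n in range(order, len(env)):
        pred[n] = a @ env[n - order:n][::-1]   # most recent sample first
    return pred

# Toy example: a slowly fluctuating 'envelope' made from smoothed noise
rng = np.random.default_rng(0)
env = np.abs(np.convolve(rng.standard_normal(2000), np.ones(50) / 50, "same"))
p = predict(env)
gain = np.var(env[8:]) / np.var(env[8:] - p[8:])
print(f"within-channel prediction gain: {10 * np.log10(gain):.1f} dB")
```

Across-channel prediction would fit coefficients over neighbouring channels instead of past samples, and the planned non-linear version would replace the linear predictor with a small network.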
Using Envelope Information for ASA (Patti Adank)
Background:
–Brungart & Darwin (resource allocation task?)
–Two simultaneous sentences: track one
–Segregation benefits from pitch differences and speaker differences
Key question: an operational definition of speaker characteristics
Speaker characteristics
Vocal tract shape
–Difficult to quantify / computationally extract
Speaking style (intonation, stress, accent, …)
–Difficult to extract measures for very short segments
Voice characteristics
–F0 – of course…
–Shimmer (amplitude modulation)
–Jitter (roughness – random GCI variation)
–Breathiness (open quotient during voiced speech)
All relatively easy to extract computationally
All relatively easy to control in speech re-synthesis
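Since the slide claims these measures are relatively easy to extract, here is a minimal sketch using the standard 'local' definitions (as in Praat); GCI detection is assumed to have been done already:

```python
import numpy as np

def jitter_shimmer(periods, peak_amps):
    """Local jitter and shimmer from consecutive glottal cycles.

    periods   : array of cycle durations (s), from glottal closure instants
    peak_amps : array of per-cycle peak amplitudes
    """
    # Jitter (local): mean absolute difference between consecutive
    # periods, as a fraction of the mean period
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    # Shimmer (local): the same measure applied to peak amplitudes
    shimmer = np.mean(np.abs(np.diff(peak_amps))) / np.mean(peak_amps)
    return jitter, shimmer

# Toy example: a 100 Hz voice with ~1% random GCI perturbation
rng = np.random.default_rng(1)
T0 = 0.01
periods = T0 * (1 + 0.01 * rng.standard_normal(200))
amps = 1 + 0.05 * rng.standard_normal(200)
j, s = jitter_shimmer(periods, amps)
print(f"jitter: {100*j:.2f}%  shimmer: {100*s:.2f}%")
```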
No. 1 choice: Jitter
–Dan Ellis's computational model segregates by glottal closure instants: the model groups coincident energy in an auditory filterbank
–Could 'jitter' be useful for segregation?
Jitter as a primary segregation cue
Double-vowel experiment:
–5 synthetic vowels (Assmann & Summerfield)
–Synthesized with a range of 5 pitch levels and 5 jitter levels
Results:
–Pitch difference aids segregation
–Jitter difference does not
[Figure: mean percent correct (± 1 SE) as a function of jitter level: baseline (0%), 0.5%, 1%, 2%, 4%]
Jitter analogous to location cues
Location cues are not primary segregation cues:
–Segregate on pitch first, then
–use location cues for stream formation
Experiment:
–Brungart & Darwin task (e.g. 2001): "Ready Tiger go to White One now" vs. "Ready Arrow go to Red Four now", but with the speech resynthesized using Praat (same speaker, different sentences)
–Jitter does not aid stream formation
[Figure: % correct colour/number combinations]
Informing Speech Recognition
Jitter is not the no. 1 candidate for informing speech recognition…
Task 2.2: Reliability of auditory cues in multi-cue scenarios
Ernst & Banks (Nature, 2002):
–Maximum likelihood estimation (MLE) is a good model for visual/somatosensory cue integration
–We adapted this for AV integration: mouse-catching experiment; MLE a good model (Hofbauer et al., JEP:HPP, 2004)
–Want to look at speech cue integration in collaboration with Sheffield
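A minimal sketch of the Ernst & Banks-style MLE fusion rule (assuming independent Gaussian cues; numpy only, and the numbers are made up):

```python
import numpy as np

def mle_fuse(estimates, variances):
    """Fuse independent Gaussian cue estimates by maximum likelihood.

    Each cue is weighted inversely to its variance (Ernst & Banks, 2002).
    """
    v = np.asarray(variances, dtype=float)
    w = (1.0 / v) / np.sum(1.0 / v)        # normalized inverse-variance weights
    fused = np.dot(w, estimates)
    fused_var = 1.0 / np.sum(1.0 / v)      # never exceeds the best single cue
    return fused, fused_var, w

# Toy example: an auditory and a visual estimate of the same event
est, var, w = mle_fuse(estimates=[10.0, 14.0], variances=[1.0, 4.0])
print(f"fused = {est:.2f}, variance = {var:.2f}, weights = {w}")
```

Because each cue is weighted by its reliability, the fused estimate is never less reliable than the best single cue, which is what makes MLE an attractive model for multi-cue integration.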
Sequential Grouping
Present listeners with a vowel-nasal combination:
–vary the vowel F2 from 800 to 2000 Hz
–attach a nasal /m/ or /n/ to the vowel without a transition
–ask listeners what they hear (/vn/ or /vm/)
[Figure: stimulus schematic; vowel F2 in steps from 800 to 2000 Hz, other formants at 375, 2000 and 2700 Hz; 200 ms vowel, 100 ms nasal]
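A hedged sketch of how such stimuli could be generated with a simple source-filter synthesizer (the sample rate, bandwidths, and nasal formant values are illustrative assumptions; the original stimuli were presumably made with a proper formant synthesizer):

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000  # sample rate (assumption; not stated on the slide)

def resonator(x, freq, bw=90.0):
    """Second-order digital formant resonator."""
    r = np.exp(-np.pi * bw / FS)
    theta = 2.0 * np.pi * freq / FS
    return lfilter([1.0 - r], [1.0, -2.0 * r * np.cos(theta), r ** 2], x)

def segment(formants, dur, f0=100.0):
    """Impulse-train source at f0 through a cascade of formant resonators."""
    src = np.zeros(int(dur * FS))
    src[::int(FS / f0)] = 1.0            # one glottal pulse every 1/f0 s
    y = src
    for f in formants:
        y = resonator(y, f)
    return y / (np.abs(y).max() + 1e-12)

# 200 ms vowel (F1 = 375 Hz, F3 = 2700 Hz, F2 here at 2000 Hz) followed
# abruptly, with no formant transition, by a 100 ms nasal murmur
# (the nasal formant values are illustrative guesses, not from the slide)
vowel = segment([375.0, 2000.0, 2700.0], dur=0.200)
nasal = segment([250.0, 1000.0, 2200.0], dur=0.100)
stimulus = np.concatenate([vowel, nasal])
```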
Hypothesis
If listeners organise formants by continuity, then:
–the /o/ should lead to /m/, while
–the /e/ should lead to /n/, with the second formant of the nasal remaining unassigned
If proximity is a cue, there should be a changeover at around 1400 Hz
[Figure: stimulus schematic as before]
Formants as a representation?
If sequential grouping of formants explains the perceptual change from /m/ to /n/ for high vowel F2s, then transitions should 'undo' this change.
[Figure: schematic of formant tracks over time]
Transitions in /vm/ syllables
Synthetic /v-m/ segments as before, but with 0, 2.5, 5, 10, 20 ms formant transitions
7 fluent German speakers, 200 trials each
Experimental results fit the prediction
Transitions??
'Formant transitions' of 5 ms have an effect
The synthetic speech was synthesized at 100 Hz, so a 5 ms formant transition is only half a glottal period??
–Confirmed that the transition has to coincide with the energetic part of the glottal period
Do subjects use a 'transition', or just energy in the appropriate band (1-2 kHz)?
Formant transitions?
Take the /em/ stimulus without transitions (heard as /en/) and add a chirp in place of the F2 transition (0, 5, 10, 20, 40 ms):
–down chirp is an FM sinusoid, 2 kHz → 1 kHz
–control is an FM sinusoid, 1 kHz → 2 kHz
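Generating the chirp replacements is straightforward; a sketch with scipy (the sample rate is an assumption, and the 0 ms condition is simply no chirp):

```python
import numpy as np
from scipy.signal import chirp

FS = 16000  # sample rate (assumption)

def fm_chirp(dur_ms, f_start, f_end):
    """FM sinusoid sweeping linearly from f_start to f_end over dur_ms."""
    t = np.arange(int(FS * dur_ms / 1000.0)) / FS
    return chirp(t, f0=f_start, t1=t[-1], f1=f_end, method="linear")

down = fm_chirp(20, 2000.0, 1000.0)   # stands in for the F2 transition
up   = fm_chirp(20, 1000.0, 2000.0)   # control: wrong sweep direction
```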
Model prediction
ASA: the chirp should be segregated:
–listeners should hear 'vowel-nasal' plus chirp
–listeners should find it difficult to report the 'time of chirp'
Down Chirp
7 listeners, 200 trials each
Result:
–the chirp is perceived by listeners
–and integrated into the percept: /en/ is heard as /em/
Up Chirp
7 listeners, 200 trials each
Result:
–the chirp is perceived by listeners and integrated into the percept
–results less clear-cut than in the down-chirp case
What does it all mean?
Subjects:
–hear /em/ when the chirp is added (any chirp!)
–hear the chirp as a separate sound
–can identify the direction of the chirp
Chirps are able to replace the formant transition, even though:
–their spectral and fine time structure are different
–the up direction is inconsistent with the expected F2
Multiresolution scene analysis
Speech recognition does not require detail
Scene analysis does…
Top-down scene analysis:
–take an unknown/noisy speech sample (/em/)
–run it through the recogniser
–at each time frame, compare the observed data with the best guess and remove data outside the expected range
[Figure: original, matched and unmatched spectrograms (Sue Harding)]
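A minimal numpy sketch of the masking step described above (the 6 dB tolerance is an illustrative parameter, not from the slide):

```python
import numpy as np

def top_down_mask(observed, predicted, tol_db=6.0):
    """Keep spectro-temporal cells where the observation lies within
    tol_db of the recogniser's best guess; flag the rest as unmatched."""
    mask = np.abs(observed - predicted) <= tol_db
    matched = np.where(mask, observed, np.nan)   # NaN marks 'unmatched'
    return matched, mask

# Toy example: a clean log-spectrogram plus an intruding noise burst
rng = np.random.default_rng(2)
clean = rng.uniform(30.0, 60.0, size=(32, 100))  # stands in for the best guess
noisy = clean + rng.normal(0.0, 2.0, size=clean.shape)
noisy[10:14, 40:60] += 20.0                      # the intrusion
matched, mask = top_down_mask(noisy, clean)
print(f"{100 * (~mask).mean():.1f}% of cells flagged as unmatched")
```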
MLE framework
Propose to test an MLE model for ASA cue integration
Cue integration as a weighted sum (weights λ) of the component probabilities
[Figure: spectrogram schematic; ASA says: ignore this bit]
Hypothetical Example
/m/ percept: labial transition, p(m) = 0.8, λ = 0.7; formant structure, p(m) = 0.7, λ = 0.3
/n/ percept: velar transition, p(n) = 0.8, λ = 0.7; formant structure, p(m) = 0.7, λ = 0.3
Chirp: unknown transition, p(n or m) = 0, λ = 0.0; formant structure, p(m) = 0.7, λ = 1.0
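The arithmetic of the example in a few lines of Python (taking the orphaned "= 0.7"-style values in the original slide as the weights λ from the previous slide):

```python
def combine(cues):
    """Weighted sum of per-cue probabilities; the weights (lambda) sum to 1."""
    return sum(w * p for p, w in cues)

# (p, lambda) pairs from the slide
labial = combine([(0.8, 0.7), (0.7, 0.3)])  # transition + formant structure
chirp  = combine([(0.0, 0.0), (0.7, 1.0)])  # unknown transition is ignored
print(f"p(m) with labial transition:  {labial:.2f}")   # 0.77
print(f"p(m) with unknown transition: {chirp:.2f}")    # 0.70
```

With the transition cue uninformative (λ = 0.0), the formant structure alone carries the decision, which is consistent with the chirp stimuli still being heard as /em/.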
MLE experiment (Elvira)
[Figure: spectrogram schematics of the planned stimuli]
Taking it further (back)
Transition cues: prior probability high for speech, low for non-speech
Localisation cues: prior probability is low
What does it all mean?
Duplex perception is:
–nothing special
–entirely consistent with a probabilistic scene analysis viewpoint
Could imagine a fairly high-impact publication on this topic
Training activity on 'data fusion'?
Where to go from here
Would like to collaborate on principled testing of these (and related) ideas
–Sheffield?? IDIAP??
Is this any different from missing-data recognition?
–Bochum??
Want to 'warm up' duplex perception?
–Most useful: a hands-on modeller
EEG / MEG Study
We argue that scene analysis informs speech perception; we would therefore expect non-speech signals to be processed/evaluated before speech is recognised
EEG / MEG data should show:
–differential processing of speech / non-speech signals
–perhaps an effect of the chirps on the latency of the speech-driven auditory evoked potential (field)
We have a really neat stimulus:
–the /em/-/en/ signals can be listened to as speech or as a non-speech signal
–a non-speech component changes the speech identity
(very!) Preliminary data
Four conditions:
–/em/ with 20 ms formant transitions
–/em/ with no formant transitions (/en/ percept)
–/em/ with no formant transition + 20 ms up chirp (/em/ percept)
–/em/ with no formant transition + 20 ms down chirp (/em/ percept)
Two tasks:
–identify the /em/s
–identify the signals containing chirps
16-channel EEG recordings, 200 stimuli each
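For reference, a minimal sketch of the stimulus-locked averaging behind the plots that follow (assuming the continuous EEG is available as a numpy array with known onset samples; this is illustrative, not the lab's actual analysis pipeline):

```python
import numpy as np

def erp(eeg, onsets, fs, tmin=-0.1, tmax=0.5):
    """Average stimulus-locked epochs into an evoked response.

    eeg    : continuous recording, shape (n_channels, n_samples)
    onsets : stimulus onset times in samples (with enough margin at both ends)
    """
    pre, post = int(-tmin * fs), int(tmax * fs)
    epochs = np.stack([eeg[:, t - pre:t + post] for t in onsets])
    # Baseline-correct each epoch on its pre-stimulus interval
    epochs -= epochs[:, :, :pre].mean(axis=2, keepdims=True)
    return epochs.mean(axis=0)          # shape (n_channels, pre + post)

# e.g. evoked_chirp = erp(data_16ch, chirp_onsets, fs=500)
```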
Predictions
If 'speech is special' we should see significant task-dependent differences
May also see significant differences between stimuli leading to the same percept
–the effect of the chirp might delay speech recognition?
Here we go:
[Figure: speech vs non-speech evoked responses at left/right temporal electrodes T7 and T8]
[Figure: speech vs non-speech evoked responses at left/right temporal electrodes TP7 and TP8]
[Figure: speech vs non-speech evoked responses at frontal electrodes F1 and F2]
[Figure: occipital electrodes O1 and O2 (control…)]
No evidence for differences in early (sensory) processing
EEG Conclusions
The (very!) preliminary data look very promising
Need to get more subjects
Refine the paradigm (the sequence is currently too fast)
–Would an MMN study be appropriate?
Would like to:
–look at source localisation (MEG Helsinki, fMRI Liverpool)
–get more channels (MEG Helsinki)