Hallucinations in Auditory Perception!!! Malcolm Slaney, Yahoo! Research and Stanford CCRMA
Hadoop
One Dimensional (waveform): pressure vs. time. Two Dimensional (not a spectrogram): cochlear place vs. time, after cochlear processing. Three Dimensional (neural movie): cochlear place vs. autocorrelation lag over time, after correlogram processing.
Correlogram. Axes: center frequency (distance down cochlea) vs. time interval in seconds (autocorrelation lag). With help from Richard O. Duda
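The correlogram axes above (cochlear channel against autocorrelation lag) can be illustrated with a minimal sketch. It assumes the cochlear channel outputs are already available as lists of samples; the cochlear filterbank itself is not modeled, and the toy half-wave rectified sinusoids, the function names, and the window and lag sizes are all illustrative assumptions rather than the talk's implementation.

```python
import math

def short_time_autocorrelation(signal, start, window, max_lag):
    """Autocorrelation of one windowed segment for lags 0..max_lag-1."""
    frame = signal[start:start + window + max_lag]
    result = []
    for lag in range(max_lag):
        total = 0.0
        for n in range(window):
            # Skip products that would run past the end of the signal.
            if n + lag < len(frame):
                total += frame[n] * frame[n + lag]
        result.append(total)
    return result

def correlogram_frame(channels, start, window, max_lag):
    """One correlogram frame: rows are cochlear channels, columns are lags."""
    return [short_time_autocorrelation(ch, start, window, max_lag)
            for ch in channels]

# Toy stand-in for cochlear channel outputs (assumption, not a cochlear model).
SAMPLE_RATE = 8000

def toy_channel(freq_hz, n_samples=800):
    return [max(0.0, math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
            for n in range(n_samples)]

channels = [toy_channel(f) for f in (100.0, 200.0, 400.0)]
frame = correlogram_frame(channels, start=0, window=400, max_lag=100)

# For the 100 Hz channel the strongest non-zero lag is one period (80 samples).
best_lag = max(range(20, 100), key=lambda k: frame[0][k])
```

Stacking such frames over successive start positions gives the "neural movie" view: one channel-by-lag picture per time step.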
Success Reconstructing from correlogram –NIPS Keynote
Problems: Continuation –Tone and Noise –Parliament Cough. Hear two voices? What do you hear? –Waveforms? –Ideas?
Cochlear Processing: pressure vs. time becomes cochlear place vs. time. Correlogram Processing: cochlear place vs. time becomes cochlear place vs. autocorrelation lag, over time.
Speech Examples: Wedding, Sine, Natural
What Vowel is This? Word 1 Word 2 Word 3 Peter Ladefoged
McGurk
Speech Object Sinewave Speech Object Wedding Vision Audio Locate Ventriloquism Vision Audio Locate Dots Vision Speech McGurk Speech Environment Vowel?
ASR. Word model showing phonemes for the word “one”: /w/ /ʌ/ /n/. Acoustic (phoneme) model for the phoneme /ʌ/: states S1, S2, S3. Language model for the words: “one”, “two”, “three”
Conventional Scene Analysis Slide by Dan Ellis (Columbia)
Barker—ASR
Goto—CASA with MIDI MIDI Sequence
Old plus New Principle Slide by Dan Ellis (Columbia)
Ellis—Prediction Driven
Saliency
Saliency Example Time-frequency display Saliency map shows high-interest locations
Saliency Maps Longer tones better Missing parts salient Modulation more salient Forward masking works
Sound Examples Birds Calls Cows Horse Waterfall
Saliency Comparison Details of saliency comparison Model predictions
Relational Network (Simple) (diagram labels: X Y Z M M X M Y M Z m) Patches of neurons. Each measures one quantity. Bidirectional relations for feedback/feedforward. Thanks to Rodney Douglas
Relational Network (example) Input here Relational Feedback Relational specification Relational feedback
ASR Relational Network Cochlea Delay Phone Recognizer Word Recognizer A patch of neurons (one of N output) Note: We don’t know how to represent delays Phone Recognizer Bidirectional links enforce phoneme/word constraints
Desired Results /A/ Phoneme Patch /I/ Phoneme Patch AI Word Patch IA Word Patch Phoneme Input AI A Relational Feedback A Without With
Simulation
Simulation 2
Simulation 3
Grossberg— ART
Statistical Means ICA –Different distributions One Microphone –GMM models of distribution
Conventional
Better?
Thanks
Pitch
Silicon Frequency Response Tone ramps into two cochleas
Cochlear Best Frequency
Cochlear Rate Profiles Left Cochlea Right Cochlea Spikes per utterance
Hardware Overview Cochlea Learning Phoneme Word PCI-AER (for remapping) Cochlea Shih-Chii Liu Giacomo Indiveri Implemented in MATLAB
LSH Movie
By Lloyd Watts Auditory Map
Please do more Neurophysiology! David Jerry Prabhakar
Timbre definition Sound color –Instruments –Vowels Static Dynamic Timbre Pitch Loudness All sound
Multi-Dimensional Scaling of Timbre Measure –Distances Estimate –Positions Art –Label axis Spectral flux Decay Spectral centroid McAdams et al. (1995)
Desired perception model Compact (parsimonious) Three Properties –Predictive Explain distance perception –Simple model Orthogonal axis –Linear model Interpolate sounds A B ? Test Euclidean distance Assumption
Experimental Contrast Old Way New Way Sound Parameter space Perception Sound Perception Guess a model that fits the data Model
Spectral shape using MFCC A huge tapestry hung in her hallway. Time (frames)
MFCC and LFC. MFCC: Sound → Spectrum → Filterbank → log10 → DCT → MFCC. LFC: Sound → Spectrum → DCT → LFC
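The two processing chains on this slide can be sketched side by side. This is a minimal illustration under stated assumptions, not the authors' code: the triangular filter shape, the mel formula 2595·log10(1 + f/700), the filter and coefficient counts, the unnormalized DCT-II, and all function names are assumptions chosen for the example. The LFC chain follows the slide literally (DCT applied straight to the spectrum, with no filterbank or log stage).

```python
import math

def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II: c[k] = sum_i values[i] * cos(pi*k*(i+0.5)/N)."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank_energies(spectrum, sample_rate, n_filters):
    """Triangular filters spaced evenly on the mel scale (an assumed design)."""
    n_bins = len(spectrum)
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    edges = [mel_to_hz(top_mel * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    energies = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        total = 0.0
        for b, mag in enumerate(spectrum):
            f = nyquist * b / (n_bins - 1)   # center frequency of bin b
            if lo < f <= mid:
                total += mag * (f - lo) / (mid - lo)
            elif mid < f < hi:
                total += mag * (hi - f) / (hi - mid)
        energies.append(total)
    return energies

def mfcc(spectrum, sample_rate, n_filters=20, n_coeffs=13):
    """Spectrum -> filterbank -> log10 -> DCT."""
    energies = mel_filterbank_energies(spectrum, sample_rate, n_filters)
    log_energies = [math.log10(e + 1e-12) for e in energies]  # avoid log(0)
    return dct_ii(log_energies, n_coeffs)

def lfc(spectrum, n_coeffs=13):
    """Spectrum -> DCT, as drawn on the slide."""
    return dct_ii(spectrum, n_coeffs)

flat_spectrum = [1.0] * 129          # toy magnitude spectrum
mfcc_flat = mfcc(flat_spectrum, sample_rate=16000)
lfc_flat = lfc(flat_spectrum)
```

For a flat spectrum all LFC coefficients above the zeroth vanish, which is one way to see that the higher coefficients encode only the shape of the spectrum and not its overall level.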
Kernel function of DCT Spectrum –superposition of DCT kernels Cepstrum coefficients –Coefficients for superposition
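The superposition statement can be written out explicitly. The form below assumes an unnormalized DCT-II over N spectral samples S[n] (log filterbank outputs for MFCC, spectrum samples for LFC); the slides do not specify the normalization, so the 1/N and 2/N factors are an assumption tied to that choice.

```latex
c_k = \sum_{n=0}^{N-1} S[n]\,\cos\!\left(\frac{\pi k\,(n+\tfrac{1}{2})}{N}\right),
\qquad
S[n] = \frac{1}{N}\,c_0 + \frac{2}{N}\sum_{k=1}^{N-1} c_k\,\cos\!\left(\frac{\pi k\,(n+\tfrac{1}{2})}{N}\right)
```

Each cosine is one DCT kernel, and the cepstrum coefficient c_k is the weight of the k-th kernel in the sum that rebuilds the spectral shape.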
Parameter space: MFCC C6= C3=
Parameter space: LFC C6=0 C3=
Synthesize stimuli Harmonics: pitch and vibrato –Amplitude weighted by the spectral shape flat weighted Desired spectral shape Vertical - frequency, Horizontal - amplitude
Experiment procedures Paired stimuli (AB, AG, AD, …) Rate dissimilarities using 0-9 scale 10 subjects –Quiet office –Individual sessions (headphones)
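A minimal additive-synthesis sketch of the idea on this slide: a harmonic series at a given pitch, optional vibrato, and each harmonic's amplitude read off a desired spectral shape. The function name, the sinusoidal vibrato, and the example shapes are illustrative assumptions; the slides do not give the actual synthesis parameters.

```python
import math

def synthesize_harmonic_tone(f0_hz, spectral_shape, duration_s, sample_rate,
                             vibrato_hz=0.0, vibrato_depth=0.0):
    """Sum of harmonics of f0, each weighted by spectral_shape(frequency).

    spectral_shape: callable mapping frequency in Hz to a linear amplitude.
    vibrato_depth:  fractional pitch deviation (0.01 means +/- 1 percent).
    """
    n_samples = int(duration_s * sample_rate)
    nyquist = sample_rate / 2.0
    # Keep only harmonics that stay below Nyquist at the highest vibrato pitch.
    max_f0 = f0_hz * (1.0 + vibrato_depth)
    n_harmonics = 0
    while (n_harmonics + 1) * max_f0 < nyquist:
        n_harmonics += 1

    samples = []
    phase = 0.0  # running phase of the fundamental, in radians
    for n in range(n_samples):
        t = n / sample_rate
        inst_f0 = f0_hz * (1.0 + vibrato_depth *
                           math.sin(2 * math.pi * vibrato_hz * t))
        value = 0.0
        for h in range(1, n_harmonics + 1):
            value += spectral_shape(h * inst_f0) * math.sin(h * phase)
        samples.append(value)
        phase += 2 * math.pi * inst_f0 / sample_rate
    return samples

# "Flat" versus "weighted" spectral shapes (the weighted shape is hypothetical).
flat = lambda f: 1.0
weighted = lambda f: math.exp(-f / 1000.0)

flat_tone = synthesize_harmonic_tone(200.0, flat, 0.05, 8000)
weighted_tone = synthesize_harmonic_tone(200.0, weighted, 0.05, 8000,
                                         vibrato_hz=5.0, vibrato_depth=0.01)
```

In the experiment the weighting would come from a spectral shape generated from chosen cepstral coefficients rather than from the toy exponential used here.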
2D linear regression Known values: x, y, d - estimate a and b Residual from Euclidean model Euclidean Fitting C3 C6 Perceptual Judgement d Model prediction
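One way to read the fitting step is as a weighted Euclidean model, d² ≈ a·x² + b·y², where x and y are the differences in the two cepstral coefficients (C3 and C6) between a pair of stimuli and d is the judged dissimilarity. That functional form is an assumption made for this sketch; the slide states only that x, y, and d are known and a and b are estimated. Under that assumption the model is linear in a and b and can be solved with the 2×2 normal equations.

```python
import math

def fit_weighted_euclidean(xs, ys, ds):
    """Least-squares fit of d^2 = a*x^2 + b*y^2 (no intercept)."""
    u = [x * x for x in xs]
    v = [y * y for y in ys]
    t = [d * d for d in ds]
    suu = sum(ui * ui for ui in u)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    svv = sum(vi * vi for vi in v)
    sut = sum(ui * ti for ui, ti in zip(u, t))
    svt = sum(vi * ti for vi, ti in zip(v, t))
    det = suu * svv - suv * suv
    if det == 0:
        raise ValueError("x^2 and y^2 are collinear; a and b are not identifiable")
    a = (sut * svv - suv * svt) / det
    b = (suu * svt - suv * sut) / det
    return a, b

def residuals(xs, ys, ds, a, b):
    """Judged distance minus the distance predicted by the fitted model."""
    return [d - math.sqrt(max(0.0, a * x * x + b * y * y))
            for x, y, d in zip(xs, ys, ds)]

# Synthetic check data generated from a = 2.0, b = 0.5 (not experimental data).
xs = [0, 1, 2, 3, 1, 2]
ys = [1, 0, 1, 2, 3, 3]
ds = [math.sqrt(2.0 * x * x + 0.5 * y * y) for x, y in zip(xs, ys)]

a_hat, b_hat = fit_weighted_euclidean(xs, ys, ds)
errs = residuals(xs, ys, ds, a_hat, b_hat)
```

Small residuals across all stimulus pairs would support the Euclidean assumption; systematic residuals would indicate that perceived distance is not a simple weighted sum over the two axes.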
Results summary Tristimulus model MFCC LFC
Experiment results MFCC better Still good Redundant dimension? MFCC: most successful timbre model Less linearity for high coeffs
Remix Examples Abba Gimme Gimme Madonna Hung Up Tracy Young Remix of Hung Up Tracy Young Remix 2 of Hung Up
Specificity Spectrum Cover songs Remixes Look for specific exact matches Bag of Features model Our work (nearest neighbor) Fingerprinting Genre
Cross-Correlation 2M songs –3 minutes –10 frames/second 72 Billion
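The scale on this slide can be checked with a few lines of arithmetic. The three listed factors multiply to 3.6 billion frames; the slide's figure of 72 billion is exactly 20 times that, and the slide does not say what the remaining factor of 20 counts, so it is left unnamed here.

```python
songs = 2_000_000
seconds_per_song = 3 * 60
frames_per_second = 10

total_frames = songs * seconds_per_song * frames_per_second
slide_total = 72_000_000_000

# Ratio between the slide's figure and the product of the listed factors.
unexplained_factor = slide_total // total_frames

print(f"{total_frames:,} frames; slide total is {unexplained_factor}x that")
```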
Curse of Dimensionality Histogram of distances between Gaussian data –Normalized to the mean Nearest Neighbor Ill-posed?
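The effect behind this slide is easy to reproduce: as dimensionality grows, pairwise distances between Gaussian points concentrate around their mean, so the nearest neighbor is barely nearer than any other point. The sketch below uses only the standard library; the point counts, dimensions, and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

def normalized_pairwise_distances(n_points, n_dims, rng):
    """All pairwise Euclidean distances between Gaussian points, divided by their mean."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)]
              for _ in range(n_points)]
    dists = [math.dist(points[i], points[j])
             for i in range(n_points)
             for j in range(i + 1, n_points)]
    mean = statistics.fmean(dists)
    return [d / mean for d in dists]

rng = random.Random(0)
spread_by_dim = {}
for n_dims in (2, 20, 200):
    ratios = normalized_pairwise_distances(60, n_dims, rng)
    # Standard deviation of distance/mean: the width of the histogram.
    spread_by_dim[n_dims] = statistics.pstdev(ratios)

for n_dims, spread in spread_by_dim.items():
    print(f"{n_dims:4d} dims: relative spread of distances = {spread:.3f}")
```

The relative spread shrinks as the dimension increases, which is why the histogram on the slide narrows and why nearest-neighbor search can look ill-posed in high dimensions.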
Distractors
Correlogram. Axes: center frequency (distance down cochlea) vs. time interval in seconds (autocorrelation lag).