System integration – current status and future priorities
  Alastair H. Moore
  Technical Project Meeting, Berlin, May 2016
[Figure: block diagram of the current system. Labelled blocks: Naoqi, playrec, Matlab audio DSP, Python ASR, Dialogue manager, Ego sphere, Python Interface, C++ visual localisation, Face detector. Platforms: OSX, Ubuntu. Labelled signals: 12 channel digital audio, m/c (multichannel) audio buffers, mono audio, transcription, synth speech, audio DOA & saliency, visual DOA & saliency, DOA, image frames, video stream, face positions, motor commands (head pose).]
playrec
  Matlab interface to PortAudio
  Has a dynamic internal buffer structure
    ‘rec’ or ‘playrec’ to record audio
    ‘getrec’ to retrieve the audio
  Allows online processing in Matlab: buffers are stored until requested, so no buffers are missed
  If sound is output to the soundcard (as is currently done for auditioning the processed audio), the number of buffers sets a trade-off between latency and the risk of buffer underrun (audio glitches)
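The latency side of the buffer trade-off can be estimated with simple arithmetic. The sketch below is illustrative only: the function name, the buffer length of 512 frames and the 48 kHz sample rate are assumptions, not values taken from the project. It assumes every queued output buffer must play out before newly processed audio is heard.

```python
def buffering_latency_ms(num_buffers, frames_per_buffer, sample_rate_hz):
    """Approximate delay (ms) added by a queue of output buffers.

    Assumes all queued buffers are played before new audio is heard.
    """
    return 1000.0 * num_buffers * frames_per_buffer / sample_rate_hz


if __name__ == "__main__":
    # Hypothetical settings: 512-frame buffers at 48 kHz.
    for n in (2, 4, 8):
        latency = buffering_latency_ms(n, 512, 48000)
        print(f"{n} buffers -> {latency:.1f} ms")
```

More buffers give the processing more slack before an underrun occurs, at the cost of a proportionally longer delay.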
Audio localisation -> ego sphere
  Matlab
    Spherical harmonic domain
    Pseudo-intensity vectors
    DPD-MUSIC
    Single-source direction of arrival (DOA) written to EARS map object
    Map object written to XML file
  Python
    Reads XML file
    Converts DOAs to the required co-ordinate system
    Sends to egosphere
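The Python step (read XML, convert DOAs) might look roughly like the sketch below. The XML layout (`source` elements with `azimuth_deg` and `elevation_deg` attributes) and the target convention (unit vector with x forward, y left, z up) are assumptions made for illustration; the actual EARS map schema and the egosphere co-ordinate system are not given in these slides.

```python
import math
import xml.etree.ElementTree as ET


def read_doas(xml_text):
    """Return (azimuth_deg, elevation_deg) pairs from a hypothetical map file."""
    root = ET.fromstring(xml_text)
    return [
        (float(src.get("azimuth_deg")), float(src.get("elevation_deg")))
        for src in root.iter("source")
    ]


def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert azimuth/elevation in degrees to a Cartesian unit vector.

    Assumed convention: x forward, y left, z up.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


if __name__ == "__main__":
    # Hypothetical file contents, not the real EARS map format.
    example = '<map><source azimuth_deg="90" elevation_deg="0"/></map>'
    for az, el in read_doas(example):
        print(doa_to_unit_vector(az, el))
```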
Audio localisation -> ego sphere
  Scope for improvement
    Use confidence of localisation estimate as ‘saliency’ parameter in egosphere
      May need to add parameter to map object
    Avoid sending any DOAs when SNR is poor or there is no speech activity
    Incorporate tracking: audio-only or audio-visual
      Need interface to get visual DOAs into Matlab
Audio enhancement -> ASR
  Matlab
    Spherical harmonic domain beamforming
      1st order (relatively wide beams)
      Fixed look direction (chosen for robustness of demo)
      Limited to 5 kHz
    Coherent-to-diffuse ratio-based post filter
      Uses simulated HRTFs
    Enhanced audio written to TCP/IP pipe as a continuous stream of small blocks
Audio enhancement -> ASR
  Python script
    Reads audio from pipe
    Endpointing using basic energy-based voice activity detector
    Sends audio to Google ASR
    Transcription sent to Naoqi dialogue system
    ‘Holds off’ further ASR while Nao speaks
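A basic energy-based endpointer of the kind mentioned above can be sketched as follows. This is not the project's script: the threshold of -40 dB, the hangover of 5 frames and all function names are assumptions chosen for illustration. Frames are lists of samples in the range -1 to 1; an utterance ends once the energy stays below the threshold for `hangover_frames` consecutive frames.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame in dB (small offset avoids log(0))."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + 1e-12)


def endpoint(frames, threshold_db=-40.0, hangover_frames=5):
    """Return (start, end) frame indices of utterances; end is exclusive."""
    segments = []
    start = None       # index of first active frame of the current utterance
    silent_run = 0     # consecutive inactive frames since last active frame
    for i, frame in enumerate(frames):
        if frame_energy_db(frame) > threshold_db:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= hangover_frames:
                # Close the utterance just after its last active frame.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Stream ended while an utterance was still open.
        segments.append((start, len(frames) - silent_run))
    return segments


if __name__ == "__main__":
    silent, loud = [0.0] * 160, [0.5] * 160
    print(endpoint([silent] * 3 + [loud] * 4 + [silent] * 6 + [loud] * 2))
```

Each returned segment marks the frames that would be sent to the recogniser as one utterance. The hangover keeps short pauses within a phrase from splitting it.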
Audio enhancement -> ASR
  Scope for improvement
    Steer beam using DOAs
      Can it be done robustly?
    Post filter with higher-frequency HRTFs
    Acoustic echo cancellation to avoid ‘hold off’ period
    Add dereverberation?
Visual localisation -> egosphere
Egosphere behaviour
  DOAs arrive from audio and video subsystems
  All DOAs are attended to (looked at) with priority according to saliency
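One simple way to model "attend to every DOA, most salient first" is a priority queue. The sketch below is an assumption about how such behaviour could be expressed, not a description of the egosphere implementation; the class name, method names and saliency scale are hypothetical.

```python
import heapq
import itertools


class AttentionQueue:
    """Queue of DOA targets served in order of decreasing saliency."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so equal-saliency targets are served in arrival order.
        self._counter = itertools.count()

    def push(self, azimuth_deg, elevation_deg, saliency, source):
        # heapq is a min-heap, so saliency is negated to pop the largest first.
        entry = (-saliency, next(self._counter), (azimuth_deg, elevation_deg, source))
        heapq.heappush(self._heap, entry)

    def pop(self):
        """Return ((azimuth_deg, elevation_deg, source), saliency)."""
        neg_saliency, _, target = heapq.heappop(self._heap)
        return target, -neg_saliency

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = AttentionQueue()
    queue.push(30.0, 0.0, 0.4, "audio")
    queue.push(-45.0, 10.0, 0.9, "video")
    queue.push(10.0, 5.0, 0.4, "video")
    while len(queue):
        print(queue.pop())
```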
[Figure: system block diagram. Labelled blocks: Naoqi, playrec, naolab, Matlab audio DSP, Python ASR, Dialogue manager, Ego sphere, Python Interface, C++ visual localisation, Face detector. Platforms: Ubuntu / OSX, Ubuntu. Labelled signals: 12 channel digital audio, audio stream, synth speech, video stream, synchronised mono video + face positions, motor commands (head pose).]