1 Perceptive Context for Pervasive Computing
Trevor Darrell, Vision Interface Group, MIT AI Lab

2 Human-centered Interfaces
- Free users from desktop and wired interfaces
- Allow natural gesture and speech commands
- Give computers awareness of users
- Work in open and noisy environments
  - outdoors: H21, next to a construction site!
  - indoors: crowded meeting room (E21)
- Vision's role: provide perceptive context

3 Perceptive Context
- Who is there? (presence, identity)
- What is going on? (activity)
- Where are they? (individual location)
- Which person said that? (audiovisual grouping)
- What are they looking / pointing at? (pose, gaze)
Today:
- Tracking speakers with an audio-visual microphone array
- Tracking faces for gaze-aware dialog interfaces
- Speaker verification with higher-order joint audio-visual statistics

4 Tracking speakers
- Track location and short-term identity
- Should work under heavy, fast lighting variation
  - stereo-based methods
  - new technique for dense background modeling
- Estimate trajectories of 3D foreground points from multiple views over time
- Guide active cameras and the microphone array
- Recognize activity and participant roles

5 Range-based stereo person tracking
- Range can be insensitive to fast illumination change
- Compare range values to a known background model
- Project foreground into a 2D overhead plan view (sketch below)
[figure: intensity, range, and foreground images; plan view]
- Merge data from multiple stereo cameras
- Group into trajectories
- Examine height for sitting/standing
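
To make the plan-view step concrete, here is a minimal sketch (not the talk's implementation): bin foreground 3D points into an overhead occupancy grid and keep a per-cell height maximum for the later sitting/standing test. The cell size, extent, and axis conventions are illustrative assumptions.

    import numpy as np

    def plan_view(points_3d, cell=0.05, extent=10.0):
        """Bin foreground 3D points (x, y, z in meters, z up) into a 2D
        overhead occupancy map, tracking the max height seen per cell."""
        n = int(extent / cell)
        occupancy = np.zeros((n, n), np.int32)
        height = np.zeros((n, n), np.float32)
        for x, y, z in points_3d:
            i, j = int(x / cell), int(y / cell)
            if 0 <= i < n and 0 <= j < n:
                occupancy[i, j] += 1                 # another point over this cell
                height[i, j] = max(height[i, j], z)  # tallest point so far
        return occupancy, height

Connected components over the occupancy map then give per-person blobs to group into trajectories, and a threshold on the height map separates sitting from standing.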

6 Fast/dense stereo foreground
[figure: left image (reference), right image]
Standard stereo searches exhaustively over disparity at every pixel of every frame, but if the background is predictable, we can prune most of the search!
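
A minimal sketch of the pruning idea, under illustrative assumptions (a per-pixel background disparity model bg_disp, an SAD matching cost, and an acceptance threshold tau; none of these names come from the talk): test the background disparity first, and fall back to the exhaustive search only where it fails.

    import numpy as np

    def sad_cost(left, right, y, x, d, w=3):
        """Sum of absolute differences between a left patch at (y, x)
        and the right patch shifted by disparity d."""
        pl = left[y-w:y+w+1, x-w:x+w+1].astype(np.int32)
        pr = right[y-w:y+w+1, x-d-w:x-d+w+1].astype(np.int32)
        return int(np.abs(pl - pr).sum())

    def foreground_disparity(left, right, bg_disp, d_max=64, tau=250, w=3):
        """Pruned stereo: accept the background disparity where it still
        matches; run the full disparity search only elsewhere."""
        h, wid = left.shape
        disp = np.zeros((h, wid), np.int32)
        fg = np.zeros((h, wid), bool)
        for y in range(w, h - w):
            for x in range(w + d_max, wid - w):
                d_bg = int(bg_disp[y, x])
                if sad_cost(left, right, y, x, d_bg, w) < tau:
                    disp[y, x] = d_bg      # background still explains this pixel
                    continue               # search pruned
                costs = [sad_cost(left, right, y, x, d, w) for d in range(d_max)]
                disp[y, x] = int(np.argmin(costs))
                fg[y, x] = True            # new foreground range value
        return disp, fg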

7 Fast/dense stereo foreground
[figure: intensity, background, and new images; range and foreground depth]

8 Sparse Stereo Model
[figure: scene, range image]
What to do when the background model has an undefined range value at a pixel where the new image has a valid one?
- conservative: call it background
- liberal: call it foreground

9 Conservative Segmentation
Type II errors! (misses most of the person)
[figure: background model; foreground under lighting change; foreground with a new person]

10 Liberal Segmentation
Type I errors! (false positives)
[figure: background model; foreground with a new person; foreground under lighting change]
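
A minimal sketch of the two policies, with illustrative assumptions (an integer sentinel for missing range, a depth-difference threshold eps); the hard case is a valid new measurement over an undefined background pixel.

    import numpy as np

    INVALID = -1  # illustrative sentinel for "stereo returned no range here"

    def segment(range_new, range_bg, policy="conservative", eps=50.0):
        """Label foreground pixels given a new range image and a (possibly
        sparse) background range model."""
        valid_new = range_new != INVALID
        valid_bg = range_bg != INVALID
        both = valid_new & valid_bg
        fg = np.zeros(range_new.shape, bool)
        # Where both are valid: foreground = clearly closer than background.
        fg[both] = (range_bg[both] - range_new[both]) > eps
        # Valid new range over an undefined background pixel:
        undecided = valid_new & ~valid_bg
        fg[undecided] = (policy == "liberal")  # conservative keeps it background
        return fg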

11 Dense Stereo Model Acquisition
Different gain settings yield different regions of undefined range values.

12 Dense Stereo Model Acquisition
Combine the valid measurements from observations at different gain and/or illumination settings.
[figure: partial range images at each setting summed into one dense background model]
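
One way to realize the combination step, as a sketch (same illustrative INVALID sentinel as above; keeping the first valid sample per pixel is the simplest rule, and a median over valid samples would be a natural refinement):

    import numpy as np

    INVALID = -1  # illustrative sentinel for "no range measurement"

    def build_dense_background(range_images):
        """Merge range images captured at different gain/illumination
        settings, keeping the first valid measurement per pixel."""
        merged = np.full_like(range_images[0], INVALID)
        for r in range_images:
            take = (merged == INVALID) & (r != INVALID)
            merged[take] = r[take]
        return merged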

13 State of the Art (cont'd)
If you want really dense range backgrounds from a single stereo view…

14 Visibility Constraints

15 Visibility Constraints for Virtual Backgrounds

16 Simple background subtraction

17 Virtual Background Segmentation

18 Range-based stereo person tracking
- Range can be insensitive to fast illumination change
- Compare range values to a known background model
- Project foreground into a 2D overhead plan view
[figure: intensity, range, and foreground images; plan view]
- Merge data from multiple stereo cameras
- Group into trajectories
- Examine height for sitting/standing

19 Multiple stereo views

20 Merged Plan-view segmentation

21 Points -> trajectories -> active sensing
[diagram: spatio-temporal points -> trajectories -> active camera motion, microphone array steering, activity classification]

22 Test Environment

23 Active camera tracking

24 Audio input in noisy environments
- Acquire high-quality audio from untethered, moving speakers
- "Virtual" headset microphones for all users

25 Solutions
- Wireless close-talking microphone
- Shotgun microphone
- Microphone array
Our solution: a large, vision- and audio-guided microphone array

26 Our approach
- Large-aperture array with non-linear geometry
  - allows selection of a 3-D volume of space
  - can select on distance as well as direction (more than plain beamforming)
- Integrated with vision tracking
  - makes real-time localization of multiple sources feasible
  - known array geometry plus target location => a simple system
  - precalibrate the array with a known source tone
- Related work
  - small-aperture vision-guided microphone arrays (Waibel)
  - large-aperture audio-guided arrays (Silverman)

27 Microphone Arrays
- Microphones at known locations, synchronized in time
- Together they act as an electronically focused directional receiver

28 Array focusing
Delay-and-sum beamforming: compensate for each microphone's propagation delay so that the target signal adds coherently. In standard form, with x_m the signal at microphone m, \tau_m its time-of-flight delay, and w_m its weight:

    y(t) = \sum_m w_m \, x_m(t + \tau_m)

29 Delay and sum array processing
- Calibrate using cross-correlation analysis with a single-source presentation
- Compute each channel's delay and weight from the geometry of the array and the target
  - delay: time of flight
  - weight: estimated SNR, based on distance
- The filtered source is the delayed and weighted sum of all microphones (sketch below).
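
A minimal delay-and-sum sketch under the slide's assumptions (known array geometry and target location); integer-sample delays and 1/distance weights are simplifications of whatever the real system uses:

    import numpy as np

    def delay_and_sum(signals, mic_pos, target, fs, c=343.0):
        """Delay-and-sum beamformer.  signals: (M, N) array of synchronized
        channels, mic_pos: (M, 3) microphone positions in meters,
        target: (3,) focus point, fs: sample rate in Hz."""
        dists = np.linalg.norm(mic_pos - target, axis=1)
        delays = dists / c                      # time of flight per microphone
        shifts = np.round((delays - delays.min()) * fs).astype(int)
        weights = 1.0 / dists                   # crude SNR proxy: closer is louder
        out = np.zeros(signals.shape[1])
        for sig, s, w in zip(signals, shifts, weights):
            out[: len(out) - s] += w * sig[s:]  # advance channel by s samples
        return out / weights.sum()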

30 Beamforming Example
[figure: received signals; delayed signals; delayed-and-summed signal]

31 Array Size
- Beam width ∝ (array span)^-1
- Large arrays select fully bounded volumes
- Small arrays select directional beams
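
The proportionality is the usual diffraction-style aperture relation; as a worked example (mine, not the talk's), in LaTeX form:

    \theta \;\approx\; \frac{\lambda}{D},
    \qquad\text{e.g.}\quad
    f = 1\,\text{kHz} \Rightarrow \lambda \approx 0.34\,\text{m},\;
    D = 3\,\text{m} \Rightarrow \theta \approx 0.11\,\text{rad} \approx 6.5^\circ

so beam width indeed falls as (array span)^-1 at a fixed wavelength, which is why a room-sized array can bound a small volume rather than just a direction.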

32 Related Work
- Small-aperture vision-guided microphone arrays (Bub, Hunke, and Waibel)
- Large-aperture audio-guided arrays (Silverman et al.)

33 Beamforming Demonstration
First person moves on an oval path while counting; second person is stationary while reciting the alphabet.
- Result from a single microphone at the center of the room: [audio clip]
- Result from the microphone array with its focus fixed at the initial position of the moving speaker: [audio clip]
[figure: output power (dB) vs. position (meters)]

34 Array Steering Audio-only – max-power search

35 Audio-only steering is hard.
[figure: array output power (dB) vs. position (meters)]

36 Audio-only steering is hard.
[figure: array output power (dB) vs. position (meters)]

37 Hybrid Localization
- Vision-only steering isn't perfect either
  - joint calibration error
  - it tracks the person, not the mouth
- Correct the vision-based estimate with a limited search in the audio domain, implemented as gradient ascent on array output power (sketch below)
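
The talk doesn't spell the search out, so this is a hedged sketch of a limited gradient-ascent refinement around the vision estimate; power_at, the step size, and the iteration count are illustrative assumptions.

    import numpy as np

    def refine_focus(power_at, x0, step=0.05, iters=20):
        """Refine a vision-based target estimate x0 (3-vector, meters) by
        gradient ascent on beamformer output power.  power_at(x) should
        evaluate array output power focused at x, e.g.
        np.mean(delay_and_sum(signals, mic_pos, x, fs)**2)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            grad = np.zeros(3)
            for k in range(3):                  # central finite differences
                dx = np.zeros(3)
                dx[k] = step
                grad[k] = (power_at(x + dx) - power_at(x - dx)) / (2 * step)
            norm = np.linalg.norm(grad)
            if norm < 1e-9:                     # flat: stop searching
                break
            x = x + step * grad / norm          # small step uphill in power
        return x

Keeping the step small keeps the search local, which is the point: vision gets within a head's width, and audio power picks out the mouth.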

38 System flow (single target)
[diagram: video streams feed a vision-based tracker; its estimate seeds a gradient ascent search in array output power over the audio streams; the refined focus drives the delay-and-sum beamformer]

39 Results
Single microphone: [audio clip]
Hybrid tracking with beamforming: [audio clip]

Localization Technique    SNR (dB)
Single microphone           -6.6
Video only                  -4.4
Audio-Video Hybrid           2.3

40 Results continued

41 Status
- Fully 3-D, multimodal sound-source localization and separation system
- Real-time implementation of delay-and-sum array processing
Future work:
- Compare to a commercial linear array
- More sophisticated beamforming (null steering)
- Connect to automatic speech recognition (in progress)
- Incorporate single-channel source separation techniques
  - AVMI
  - ICA
  - source modeling

42 Today
- Tracking speakers with an audio-visual microphone array ✓
- Tracking faces for gaze-aware dialog interfaces [John Fisher]
- Speaker verification with higher-order joint audio-visual statistics

43 Brightness and depth motion constraints
[figure: brightness frames I_t, I_t+1 with gradient ∇I; depth frames Z_t, Z_t+1 with gradient ∇Z]

44 Brightness and depth motion constraints
[figure: brightness frames I_t, I_t+1 with gradient ∇I; depth frames Z_t, Z_t+1 with gradient ∇Z; parameter space, with the pose update y_t driven by y_t-1]
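
The full tracker recovers 3D pose, but the two constancy constraints combine in the same way in the simplest case; as a hedged sketch, a joint least-squares solve for a single 2D translation from stacked brightness and depth constraints (variable names are the usual optical-flow ones, not the talk's code):

    import numpy as np

    def joint_flow(I_t, I_t1, Z_t, Z_t1):
        """Least-squares 2D motion (u, v) from stacked brightness and depth
        constancy constraints:
            Ix*u + Iy*v + It = 0   and   Zx*u + Zy*v + Zt = 0
        (the depth constraint here ignores motion in z for brevity)."""
        Iy, Ix = np.gradient(I_t.astype(float))
        Zy, Zx = np.gradient(Z_t.astype(float))
        It = I_t1.astype(float) - I_t
        Zt = Z_t1.astype(float) - Z_t
        A = np.stack([np.r_[Ix.ravel(), Zx.ravel()],
                      np.r_[Iy.ravel(), Zy.ravel()]], axis=1)
        b = -np.r_[It.ravel(), Zt.ravel()]
        (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
        return u, v

The depth rows behave exactly like extra brightness rows, which is what makes the combined tracker robust where either cue alone is weak (textureless regions for brightness, fronto-parallel surfaces for depth).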

45 New bounded-error tracking algorithm
[figure: influence region; open-loop vs. closed-loop 2D tracking]
Track relative to all previous frames that are close in pose space.

46 Closed-loop 3D tracker
Track the user's head gaze for hands-free pointing…

47 Head-driven cursor
Related projects: Schiele; Kjeldsen; Toyama
Current application: a second pointer, or scrolling / focus of attention…

48 Head-driven cursor

49 Task

50 Single cursor

51 Two hand cursors

52 Head-hand cursors

53 Gaze-aware dialog interface
- The interface agent responds to the user's gaze
  - the agent should know when it's being attended to
  - turn-taking pragmatics
  - anaphora / object reference
- Prototype
  - E21 interface "sam"
  - current experiments with the face tracker on the meeting-room table
- Wizard-of-Oz (WOZ) initial user tests
- Integrating with wall cameras and the hand-gesture interface

54 Is that you talking?
- New single-channel algorithm to reject stray utterances
- Match video to audio!
  - audio-visual synchrony detection
  - analyze the mutual information between the two signals
- Find a maximally informative subspace projection between audio and video (sketch below)… [Fisher and Darrell]
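
Fisher and Darrell's approach learns maximally informative nonparametric projections; as a simplified stand-in under a Gaussian assumption, the top canonical-correlation pair gives the maximally informative linear 1-D projections, and the mutual information then has a closed form. A sketch (feature matrices and the regularizer are illustrative):

    import numpy as np

    def av_synchrony_score(A, V, reg=1e-6):
        """Gaussian-assumption synchrony score between audio features A
        (T x da) and video features V (T x dv): the top squared canonical
        correlation rho^2 of the two streams gives the mutual information
        of the most-correlated 1-D projections, I = -0.5 * log(1 - rho^2)."""
        A = A - A.mean(axis=0)
        V = V - V.mean(axis=0)
        T = len(A)
        Caa = A.T @ A / T + reg * np.eye(A.shape[1])
        Cvv = V.T @ V / T + reg * np.eye(V.shape[1])
        Cav = A.T @ V / T
        # Eigenvalues of Caa^-1 Cav Cvv^-1 Cva are the squared canonical corrs.
        M = np.linalg.solve(Caa, Cav) @ np.linalg.solve(Cvv, Cav.T)
        rho2 = float(np.max(np.real(np.linalg.eigvals(M))))
        rho2 = min(rho2, 1.0 - 1e-12)           # numerical guard
        return -0.5 * np.log(1.0 - rho2)

A high score says the audio and the face track move together; a stray utterance from an off-camera speaker scores low, which is the verification cue.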

55 Perceptual context
Take-home message: vision provides perceptual context that makes applications aware of their users.
- Activity: adapting outdoor activity classification [Grimson and Stauffer] to the indoor domain…
- So far: detection, ID, head pose, audio enhancement, and synchrony verification…
- Soon:
  - gaze: add eye tracking on the pose-stabilized face
  - pointing: arm gestures for selection and navigation

