1 Perceptive Context for Pervasive Computing
Trevor Darrell, Vision Interface Group, MIT AI Lab

2 Human-centered Interfaces
- Free users from desktop and wired interfaces
- Allow natural gesture and speech commands
- Give computers awareness of users
- Work in open and noisy environments
  - outdoors: H21, next to a construction site!
  - indoors: crowded meeting room (E21)
- Vision's role: provide perceptive context

3 Perceptive Context
- Who is there? (presence, identity)
- What is going on? (activity)
- Where are they? (individual location)
- Which person said that? (audiovisual grouping)
- What are they looking / pointing at? (pose, gaze)
Today:
- Tracking speakers with an audio-visual microphone array
- Tracking faces for gaze-aware dialog interfaces
- Speaker verification with higher-order joint audio-visual statistics

4 Tracking speakers
- Track location and short-term identity
- Should work under heavy, fast lighting variation
  - stereo-based methods
  - new technique for dense background modeling
- Estimate trajectories of 3D foreground points from multiple views over time
- Guide active cameras and the microphone array
- Recognize activity and participant roles

5 Range-based stereo person tracking
- Range can be insensitive to fast illumination change
- Compare range values to a known background model
- Project foreground into a 2D overhead plan view (sketch below)
[figure: intensity, range, and foreground images; plan view]
- Merge data from multiple stereo cameras
- Group into trajectories
- Examine height for sitting/standing
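
To make the plan-view step concrete, here is a minimal sketch (not the talk's implementation): bin foreground 3D points into an overhead occupancy grid and keep a per-cell height maximum for the later sitting/standing test. The cell size, extent, and axis conventions are illustrative assumptions.

    import numpy as np

    def plan_view(points_3d, cell=0.05, extent=10.0):
        """Bin foreground 3D points (x, y, z in meters, z up) into a 2D
        overhead occupancy map, tracking the max height seen per cell."""
        n = int(extent / cell)
        occupancy = np.zeros((n, n), np.int32)
        height = np.zeros((n, n), np.float32)
        for x, y, z in points_3d:
            i, j = int(x / cell), int(y / cell)
            if 0 <= i < n and 0 <= j < n:
                occupancy[i, j] += 1                 # another point over this cell
                height[i, j] = max(height[i, j], z)  # tallest point so far
        return occupancy, height

Connected components over the occupancy map then give per-person blobs to group into trajectories, and a threshold on the height map separates sitting from standing.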

6 Fast/dense stereo foreground
[figure: left image (reference), right image]
Standard stereo searches exhaustively over disparity at every pixel of every frame, but if the background is predictable, we can prune most of the search!
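
A minimal sketch of the pruning idea, under illustrative assumptions (a per-pixel background disparity model bg_disp, an SAD matching cost, and an acceptance threshold tau; none of these names come from the talk): test the background disparity first, and fall back to the exhaustive search only where it fails.

    import numpy as np

    def sad_cost(left, right, y, x, d, w=3):
        """Sum of absolute differences between a left patch at (y, x)
        and the right patch shifted by disparity d."""
        pl = left[y-w:y+w+1, x-w:x+w+1].astype(np.int32)
        pr = right[y-w:y+w+1, x-d-w:x-d+w+1].astype(np.int32)
        return int(np.abs(pl - pr).sum())

    def foreground_disparity(left, right, bg_disp, d_max=64, tau=250, w=3):
        """Pruned stereo: accept the background disparity where it still
        matches; run the full disparity search only elsewhere."""
        h, wid = left.shape
        disp = np.zeros((h, wid), np.int32)
        fg = np.zeros((h, wid), bool)
        for y in range(w, h - w):
            for x in range(w + d_max, wid - w):
                d_bg = int(bg_disp[y, x])
                if sad_cost(left, right, y, x, d_bg, w) < tau:
                    disp[y, x] = d_bg      # background still explains this pixel
                    continue               # search pruned
                costs = [sad_cost(left, right, y, x, d, w) for d in range(d_max)]
                disp[y, x] = int(np.argmin(costs))
                fg[y, x] = True            # new foreground range value
        return disp, fg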

7 Fast/dense stereo foreground
[figure: intensity, background, and new images; range and foreground depth]

8 Sparse Stereo Model
[figure: scene, range image]
What to do when the background model has an undefined range value at a pixel where the new image has a valid one?
- conservative: call it background
- liberal: call it foreground

9 Conservative Segmentation
Type II errors! (misses most of the person)
[figure: background model; foreground under lighting change; foreground with a new person]

10 Liberal Segmentation
Type I errors! (false positives)
[figure: background model; foreground with a new person; foreground under lighting change]
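
A minimal sketch of the two policies, with illustrative assumptions (an integer sentinel for missing range, a depth-difference threshold eps); the hard case is a valid new measurement over an undefined background pixel.

    import numpy as np

    INVALID = -1  # illustrative sentinel for "stereo returned no range here"

    def segment(range_new, range_bg, policy="conservative", eps=50.0):
        """Label foreground pixels given a new range image and a (possibly
        sparse) background range model."""
        valid_new = range_new != INVALID
        valid_bg = range_bg != INVALID
        both = valid_new & valid_bg
        fg = np.zeros(range_new.shape, bool)
        # Where both are valid: foreground = clearly closer than background.
        fg[both] = (range_bg[both] - range_new[both]) > eps
        # Valid new range over an undefined background pixel:
        undecided = valid_new & ~valid_bg
        fg[undecided] = (policy == "liberal")  # conservative keeps it background
        return fg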

11 Dense Stereo Model Acquisition
Different gain settings yield different regions of undefined range values.

12 Dense Stereo Model Acquisition
Combine the valid measurements from observations at different gain and/or illumination settings.
[figure: partial range images at each setting summed into one dense background model]
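
One way to realize the combination step, as a sketch (same illustrative INVALID sentinel as above; keeping the first valid sample per pixel is the simplest rule, and a median over valid samples would be a natural refinement):

    import numpy as np

    INVALID = -1  # illustrative sentinel for "no range measurement"

    def build_dense_background(range_images):
        """Merge range images captured at different gain/illumination
        settings, keeping the first valid measurement per pixel."""
        merged = np.full_like(range_images[0], INVALID)
        for r in range_images:
            take = (merged == INVALID) & (r != INVALID)
            merged[take] = r[take]
        return merged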

13 State of the Art (cont'd)
If you want really dense range backgrounds from a single stereo view…

14 Visibility Constraints

15 Visibility Constraints for Virtual Backgrounds

16 Simple background subtraction

17 Virtual Background Segmentation

18 Range-based stereo person tracking
- Range can be insensitive to fast illumination change
- Compare range values to a known background model
- Project foreground into a 2D overhead plan view
[figure: intensity, range, and foreground images; plan view]
- Merge data from multiple stereo cameras
- Group into trajectories
- Examine height for sitting/standing

19 Multiple stereo views

20 Merged Plan-view segmentation

21 Points -> trajectories -> active sensing
[diagram: spatio-temporal points -> trajectories -> active camera motion, microphone array steering, activity classification]

22 Test Environment

23 Active camera tracking

24 Audio input in noisy environments
- Acquire high-quality audio from untethered, moving speakers
- "Virtual" headset microphones for all users

25 Solutions
- Wireless close-talking microphone
- Shotgun microphone
- Microphone array
Our solution: a large, vision- and audio-guided microphone array

26 Our approach
- Large-aperture array with non-linear geometry
  - allows selection of a 3-D volume of space
  - can select on distance as well as direction (more than plain beamforming)
- Integrated with vision tracking
  - makes real-time localization of multiple sources feasible
  - known array geometry plus target location => a simple system
  - precalibrate the array with a known source tone
- Related work
  - small-aperture vision-guided microphone arrays (Waibel)
  - large-aperture audio-guided arrays (Silverman)

27 Microphone Arrays
- Microphones at known locations, synchronized in time
- Together they act as an electronically focused directional receiver

28 Array focusing
Delay-and-sum beamforming: compensate for each microphone's propagation delay so that the target signal adds coherently. In standard form, with x_m the signal at microphone m, \tau_m its time-of-flight delay, and w_m its weight:

    y(t) = \sum_m w_m \, x_m(t + \tau_m)

29 Delay and sum array processing
- Calibrate using cross-correlation analysis with a single-source presentation
- Compute each channel's delay and weight from the geometry of the array and the target
  - delay: time of flight
  - weight: estimated SNR, based on distance
- The filtered source is the delayed and weighted sum of all microphones (sketch below).
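
A minimal delay-and-sum sketch under the slide's assumptions (known array geometry and target location); integer-sample delays and 1/distance weights are simplifications of whatever the real system uses:

    import numpy as np

    def delay_and_sum(signals, mic_pos, target, fs, c=343.0):
        """Delay-and-sum beamformer.  signals: (M, N) array of synchronized
        channels, mic_pos: (M, 3) microphone positions in meters,
        target: (3,) focus point, fs: sample rate in Hz."""
        dists = np.linalg.norm(mic_pos - target, axis=1)
        delays = dists / c                      # time of flight per microphone
        shifts = np.round((delays - delays.min()) * fs).astype(int)
        weights = 1.0 / dists                   # crude SNR proxy: closer is louder
        out = np.zeros(signals.shape[1])
        for sig, s, w in zip(signals, shifts, weights):
            out[: len(out) - s] += w * sig[s:]  # advance channel by s samples
        return out / weights.sum()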

30 Beamforming Example
[figure: received signals; delayed signals; delayed-and-summed signal]

31 Array Size
- Beam width ∝ (array span)^-1
- Large arrays select fully bounded volumes
- Small arrays select directional beams
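
The proportionality is the usual diffraction-style aperture relation; as a worked example (mine, not the talk's), in LaTeX form:

    \theta \;\approx\; \frac{\lambda}{D},
    \qquad\text{e.g.}\quad
    f = 1\,\text{kHz} \Rightarrow \lambda \approx 0.34\,\text{m},\;
    D = 3\,\text{m} \Rightarrow \theta \approx 0.11\,\text{rad} \approx 6.5^\circ

so beam width indeed falls as (array span)^-1 at a fixed wavelength, which is why a room-sized array can bound a small volume rather than just a direction.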

32 Related Work
- Small-aperture vision-guided microphone arrays (Bub, Hunke, and Waibel)
- Large-aperture audio-guided arrays (Silverman et al.)

33 Beamforming Demonstration
First person moves on an oval path while counting; second person is stationary while reciting the alphabet.
- Result from a single microphone at the center of the room: [audio clip]
- Result from the microphone array with its focus fixed at the initial position of the moving speaker: [audio clip]
[figure: output power (dB) vs. position (meters)]

34 Array Steering Audio-only – max-power search

35 Audio-only steering is hard.
[figure: array output power (dB) vs. position (meters)]

36 Audio-only steering is hard.
[figure: array output power (dB) vs. position (meters)]

37 Hybrid Localization
- Vision-only steering isn't perfect either
  - joint calibration error
  - it tracks the person, not the mouth
- Correct the vision-based estimate with a limited search in the audio domain, implemented as gradient ascent on array output power (sketch below)
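
The talk doesn't spell the search out, so this is a hedged sketch of a limited gradient-ascent refinement around the vision estimate; power_at, the step size, and the iteration count are illustrative assumptions.

    import numpy as np

    def refine_focus(power_at, x0, step=0.05, iters=20):
        """Refine a vision-based target estimate x0 (3-vector, meters) by
        gradient ascent on beamformer output power.  power_at(x) should
        evaluate array output power focused at x, e.g.
        np.mean(delay_and_sum(signals, mic_pos, x, fs)**2)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            grad = np.zeros(3)
            for k in range(3):                  # central finite differences
                dx = np.zeros(3)
                dx[k] = step
                grad[k] = (power_at(x + dx) - power_at(x - dx)) / (2 * step)
            norm = np.linalg.norm(grad)
            if norm < 1e-9:                     # flat: stop searching
                break
            x = x + step * grad / norm          # small step uphill in power
        return x

Keeping the step small keeps the search local, which is the point: vision gets within a head's width, and audio power picks out the mouth.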

38 System flow (single target)
[diagram: video streams feed a vision-based tracker; its estimate seeds a gradient ascent search in array output power over the audio streams; the refined focus drives the delay-and-sum beamformer]

39 Results
Single microphone: [audio clip]
Hybrid tracking with beamforming: [audio clip]

Localization Technique    SNR (dB)
Single microphone           -6.6
Video only                  -4.4
Audio-Video Hybrid           2.3

40 Results continued

41 Status
- Fully 3-D, multimodal sound-source localization and separation system
- Real-time implementation of delay-and-sum array processing
Future work:
- Compare to a commercial linear array
- More sophisticated beamforming (null steering)
- Connect to automatic speech recognition (in progress)
- Incorporate single-channel source separation techniques
  - AVMI
  - ICA
  - source modeling

42 Today
- Tracking speakers with an audio-visual microphone array ✓
- Tracking faces for gaze-aware dialog interfaces [John Fisher]
- Speaker verification with higher-order joint audio-visual statistics

43 Brightness and depth motion constraints
[figure: brightness frames I_t, I_t+1 with gradient ∇I; depth frames Z_t, Z_t+1 with gradient ∇Z]

44 Brightness and depth motion constraints
[figure: brightness frames I_t, I_t+1 with gradient ∇I; depth frames Z_t, Z_t+1 with gradient ∇Z; parameter space, with the pose update y_t driven by y_t-1]
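
The full tracker recovers 3D pose, but the two constancy constraints combine in the same way in the simplest case; as a hedged sketch, a joint least-squares solve for a single 2D translation from stacked brightness and depth constraints (variable names are the usual optical-flow ones, not the talk's code):

    import numpy as np

    def joint_flow(I_t, I_t1, Z_t, Z_t1):
        """Least-squares 2D motion (u, v) from stacked brightness and depth
        constancy constraints:
            Ix*u + Iy*v + It = 0   and   Zx*u + Zy*v + Zt = 0
        (the depth constraint here ignores motion in z for brevity)."""
        Iy, Ix = np.gradient(I_t.astype(float))
        Zy, Zx = np.gradient(Z_t.astype(float))
        It = I_t1.astype(float) - I_t
        Zt = Z_t1.astype(float) - Z_t
        A = np.stack([np.r_[Ix.ravel(), Zx.ravel()],
                      np.r_[Iy.ravel(), Zy.ravel()]], axis=1)
        b = -np.r_[It.ravel(), Zt.ravel()]
        (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
        return u, v

The depth rows behave exactly like extra brightness rows, which is what makes the combined tracker robust where either cue alone is weak (textureless regions for brightness, fronto-parallel surfaces for depth).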

45 New bounded-error tracking algorithm
[figure: influence region; open-loop vs. closed-loop 2D tracking]
Track relative to all previous frames that are close in pose space.

46 Closed-loop 3D tracker
Track the user's head gaze for hands-free pointing…

47 Head-driven cursor
Related projects: Schiele; Kjeldsen; Toyama
Current application: a second pointer, or scrolling / focus of attention…

48 Head-driven cursor

49 Task

50 Single cursor

51 Two hand cursors

52 Head-hand cursors

53 Gaze-aware dialog interface
- The interface agent responds to the user's gaze
  - the agent should know when it's being attended to
  - turn-taking pragmatics
  - anaphora / object reference
- Prototype
  - E21 interface "sam"
  - current experiments with the face tracker on the meeting-room table
- Wizard-of-Oz (WOZ) initial user tests
- Integrating with wall cameras and the hand-gesture interface

54 Is that you talking?
- New single-channel algorithm to reject stray utterances
- Match video to audio!
  - audio-visual synchrony detection
  - analyze the mutual information between the two signals
- Find a maximally informative subspace projection between audio and video (sketch below)… [Fisher and Darrell]
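
Fisher and Darrell's approach learns maximally informative nonparametric projections; as a simplified stand-in under a Gaussian assumption, the top canonical-correlation pair gives the maximally informative linear 1-D projections, and the mutual information then has a closed form. A sketch (feature matrices and the regularizer are illustrative):

    import numpy as np

    def av_synchrony_score(A, V, reg=1e-6):
        """Gaussian-assumption synchrony score between audio features A
        (T x da) and video features V (T x dv): the top squared canonical
        correlation rho^2 of the two streams gives the mutual information
        of the most-correlated 1-D projections, I = -0.5 * log(1 - rho^2)."""
        A = A - A.mean(axis=0)
        V = V - V.mean(axis=0)
        T = len(A)
        Caa = A.T @ A / T + reg * np.eye(A.shape[1])
        Cvv = V.T @ V / T + reg * np.eye(V.shape[1])
        Cav = A.T @ V / T
        # Eigenvalues of Caa^-1 Cav Cvv^-1 Cva are the squared canonical corrs.
        M = np.linalg.solve(Caa, Cav) @ np.linalg.solve(Cvv, Cav.T)
        rho2 = float(np.max(np.real(np.linalg.eigvals(M))))
        rho2 = min(rho2, 1.0 - 1e-12)           # numerical guard
        return -0.5 * np.log(1.0 - rho2)

A high score says the audio and the face track move together; a stray utterance from an off-camera speaker scores low, which is the verification cue.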

55 Perceptual context
Take-home message: vision provides perceptual context that makes applications aware of their users.
- Activity: adapting outdoor activity classification [Grimson and Stauffer] to the indoor domain…
- So far: detection, ID, head pose, audio enhancement, and synchrony verification…
- Soon:
  - gaze: add eye tracking on the pose-stabilized face
  - pointing: arm gestures for selection and navigation

