1
Towards building user-seeing computers
CRV’05 Workshop on Face Processing in Video, August 8-11, 2005, Victoria, BC, Canada. Gilles Bessens and Dmitry Gorodnichy, Computational Video Group, Institute for Information Technology, National Research Council Canada.
2
What it means “to see”
When humans lose the sense of touch or hearing, they can still communicate using vision. The same holds for computers: when information cannot be entered into a computer using hands or speech, vision could provide a solution, if only computers could see... Users with accessibility needs (e.g. residents of the SCO Ottawa Health Center) will benefit the most, but other users would benefit too.
Seeing tasks:
1. Where - to see where the user is: {x,y,z…}
2. What - to see what the user is doing: {actions}
3. Who - to see who the user is: {names}
Our goal: to build systems which can do all three tasks.
[Figure: perceptual user interface (PUI) schematic - the monitor observes the user's position (x, y, z) and orientation (α, β, γ), raises binary ON/OFF events, and performs recognition/memorization ("Unknown User!").]
3
Wish-list and constraints
Users want computers to be able to:
1. Automatically detect and recognize a user:
a) to load the user’s personal Windows settings (e.g. font size, application window layout), which is very tedious to set up for users with disabilities;
b) to find the user’s range of motion in order to map it to the computer control coordinates.
2. Enable written communication: e.g. typing a message in a browser or on the Internet.
3. Enable navigation in the Windows environment: selecting items from window menus and pushing buttons of Windows applications.
4. Detect visual cues from users (intentional blinks, mouth opening, repetitive or predefined motion patterns) for hands-free remote control:
a) mouse-type “clicks”;
b) a vision-based lexicon;
c) computer control commands: “go to next/last window”, “copy/cut/paste”, “start Editor”, “save and quit”.
But limitations should be acknowledged:
- computer limitations: the system should run in real time (>10 fps);
- user mobility limitations: users have a limited range of motion; besides, the camera’s field of view and resolution are limited;
- environmental limitations: lighting and background change. To accommodate these constraints, we developed a state-transition machine that switches among the face detection, face recognition and face tracking modules (see the sketch below).
Other considerations:
a) the need to compensate for missing feedback - the feeling of touch that holding a physical mouse provides;
b) the need for limited-motion-based cursor control and key entry.
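The mode switching can be pictured as a small finite-state machine. Below is a minimal sketch of such a controller; the state names, transition conditions, and function signature are assumptions for illustration, not the system's actual interfaces.

```python
# Minimal sketch (assumption, not the authors' code): a state-transition
# machine that switches between detection, recognition and tracking modes
# so the whole pipeline stays above 10 fps under changing conditions.

from enum import Enum, auto

class Mode(Enum):
    DETECT = auto()     # scan the whole frame for a face (slow, robust)
    RECOGNIZE = auto()  # identify the detected user
    TRACK = auto()      # follow the face/nose locally (fast)

def next_mode(mode, face_found, user_known, track_lost):
    """Pick the processing mode for the next frame."""
    if mode is Mode.DETECT:
        return Mode.RECOGNIZE if face_found else Mode.DETECT
    if mode is Mode.RECOGNIZE:
        return Mode.TRACK if user_known else Mode.RECOGNIZE
    # Mode.TRACK: fall back to full detection when tracking fails
    return Mode.DETECT if track_lost else Mode.TRACK
```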
4
Evolution of seeing computers
1998. Proof-of-concept colour-based skin tracking [Bradski’98] - not precise.
2001. Motion-based segmentation & localization - not precise. Several skin colour models developed - reached their limits.
2001. Rapid face detection using rectangular wavelets of intensities [fg02].
2002. Subpixel-accuracy convex-shape nose tracking [Nouse™, fg02, ivc04].
2002. Stereo face tracking using projective vision [w. Roth, ivc04].
2003. Second-order change detection [Double-blink, ivc04].
2003-now. Neuro-biological recognition of low-resolution faces [avbpa05, fpiv04, fpiv05].
Figure: typical results for face detection using the colour, motion and intensity components of video, with six different webcams.
5
Nouse™ “Nose as Mouse”: the good news
The precision and convenience of tracking the convex-shape nose feature allow one to use the nose as a mouse (or as a joystick handle).
Tracking pipeline (sketched in code below):
image → motion, colour, edges, Haar wavelets → nose search box (x, y, width, height) → convex-shape template matching → nose tip detection (I, J; pixel precision) → integration over continuous intensity → (X, Y) (sub-pixel precision)
[Image credits: S. A. LA NACION, all rights reserved; rated by Planeta Digital (Aug. 2003).]
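The two-stage localization (pixel-precision template matching followed by sub-pixel refinement through intensity integration) can be sketched as follows. This is an illustrative reconstruction, not the released Nouse™ code; the function name, window size, and the intensity-weighted centroid as the integration step are assumptions.

```python
# Minimal sketch (assumption): stage 1 finds the nose tip at pixel
# precision with template matching; stage 2 refines to sub-pixel precision
# by integrating intensity over a small window around the match.

import numpy as np
import cv2

def find_nose_tip(gray, template, win=5):
    # Stage 1: pixel-precision match (I, J) at the template centre
    res = cv2.matchTemplate(gray, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(res)
    i = max_loc[1] + template.shape[0] // 2   # row of the nose tip
    j = max_loc[0] + template.shape[1] // 2   # column of the nose tip

    # Stage 2: sub-pixel refinement (X, Y) by intensity-weighted centroid
    # over a (2*win+1)^2 window; assumes the match is not at the border.
    patch = gray[i-win:i+win+1, j-win:j+win+1].astype(np.float64)
    w = patch - patch.min() + 1e-9            # brightness as the weight
    ys, xs = np.mgrid[-win:win+1, -win:win+1]
    Y = i + (w * ys).sum() / w.sum()
    X = j + (w * xs).sum() / w.sum()
    return X, Y
```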
6
Main face recognition challenge
[Figure: image-based biometric modalities and face recognition performance - an ICAO-conformant passport photograph (presently used for forensic identification) contrasted with images obtained from surveillance cameras (of the 9/11 hijackers) and from TV. NB: VCD resolution is 320x240 pixels.]
7
Keys to resolving FRiV problem
12 pixels between the eyes should be sufficient - the nominal face resolution (a resampling sketch follows below).
To beat low resolution and quality, use lessons from the human vision system:
1) efficient visual attention mechanisms;
2) decisions based on accumulating results over several frames (rather than on one frame);
3) efficient neuro-associative mechanisms:
a) to accumulate learning data over time by adjusting synapses, and
b) to associate a visual stimulus with a semantic meaning based on the computed synaptic values,
using non-linear processing, massively distributed collective decision making, and synaptic plasticity.
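For illustration, here is a minimal sketch of rescaling a face crop to the nominal 12-pixels-between-the-eyes resolution; the function name and the use of OpenCV's resize are assumptions.

```python
# Minimal sketch (assumption): rescale a face crop so the inter-eye
# distance becomes the nominal 12 pixels argued to suffice for
# recognition in video.

import numpy as np
import cv2

NOMINAL_EYE_DIST = 12  # pixels between the eyes

def to_nominal_resolution(face, left_eye, right_eye):
    """face: grayscale crop; eye coordinates in (x, y) pixels within it."""
    eye_dist = np.hypot(right_eye[0] - left_eye[0],
                        right_eye[1] - left_eye[1])
    scale = NOMINAL_EYE_DIST / eye_dist
    h, w = face.shape[:2]
    new_size = (max(1, round(w * scale)), max(1, round(h * scale)))
    return cv2.resize(face, new_size, interpolation=cv2.INTER_AREA)
```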
8
Lessons from biological vision
Saliency-based localization and rectification - implemented.
Fovea vision: accumulation over time and space - implemented.
Local brightness adjustment - implemented (one assumed variant is sketched below).
Recognition decision at time t depends on the recognition decision at time t-1 - implemented.
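As one concrete reading of the "local brightness adjustment" lesson, the sketch below removes a smoothed illumination estimate from the face image. The slide does not give the actual normalization used, so this is only an assumed variant.

```python
# Minimal sketch (assumption): local brightness adjustment by subtracting
# a coarse illumination estimate, keeping only local contrast.

import numpy as np
import cv2

def normalize_brightness(gray, ksize=15):
    g = gray.astype(np.float32)
    illumination = cv2.blur(g, (ksize, ksize))  # coarse local brightness
    detail = g - illumination                   # local contrast only
    out = cv2.normalize(detail, None, 0, 255, cv2.NORM_MINMAX)
    return out.astype(np.uint8)
```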
9
Lessons from biological memory
The brain stores information using synapses connecting the neurons. The brain has up to 10^13 interconnected neurons. Neurons are either at rest or activated, depending on the values of the other neurons Y_j and the strengths of the synaptic connections C_ij:
Y_i = sign( Σ_j C_ij Y_j ),  Y_i ∈ {+1, −1}
The brain is thus a network of “binary” neurons evolving in time from an initial state (e.g. a stimulus coming from the retina) until it reaches a stable state - an attractor. What we remember are attractors! This is the associative principle we all live by (sketched in code below).
Refs: Hebb’49, Little’74,’78, Willshaw’71 - implemented?..
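A minimal sketch of such attractor dynamics, in the style of a Hopfield network; the synchronous update schedule and stopping rule are assumptions.

```python
# Minimal sketch (assumption): a binary network evolving from an initial
# stimulus until it reaches a stable state (an attractor).

import numpy as np

def recall(C, y, max_steps=100):
    """C: NxN synaptic matrix; y: initial state in {+1,-1}^N."""
    y = y.copy()
    for _ in range(max_steps):
        y_new = np.where(C @ y >= 0, 1, -1)  # Y_i = sign(sum_j C_ij Y_j)
        if np.array_equal(y_new, y):         # stable state reached
            return y_new
        y = y_new
    return y
```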
10
From visual image to saying name
From a neuro-biological perspective, memorization and recognition are two stages of the associative process: from receptor stimulus R to effector stimulus E - e.g. from a face image to saying “Dmitry”.
Main associative principle (the same in the brain and in the computer):
stimulus neuron X_i ∈ {+1, −1} → response neuron Y_j ∈ {+1, −1}, synaptic strength −1 < C_ij < +1.
11
How to update weights? Learning rules range from biologically plausible to mathematically justifiable models of learning:
- Hebb (correlation learning): ΔC_ij = (1/N) V_i V_j, for a memorized pattern V
- Generalized Hebb, and better rules derived from it
- Widrow-Hoff’s (delta) rule: ΔC_ij = (η/N) (V_i − S_i) V_j, where S_i = Σ_j C_ij V_j
- Projection learning: both incremental and taking into account the relevance of the training stimuli and their attributes (see the sketch below)
Refs: Amari’71,’77, Kohonen’72, Personnaz’85, Kanter-Sompolinsky’86, Gorodnichy’95-’99
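The sketch below gives runnable forms of three of these rules for an auto-associative network. The original slide's formulas were images, so these are standard textbook forms (Hebbian, Widrow-Hoff, and pseudo-inverse/projection learning), not necessarily the exact variants the authors used.

```python
# Minimal sketch (assumption): three weight-update rules for memorizing a
# pattern v in {+1,-1}^N into the synaptic matrix C.

import numpy as np

def hebb_update(C, v):
    """Correlation (Hebbian) learning: C_ij += v_i v_j / N."""
    return C + np.outer(v, v) / len(v)

def delta_update(C, v, lr=0.1):
    """Widrow-Hoff (delta) rule: move the network response C@v toward v."""
    return C + lr * np.outer(v - C @ v, v) / len(v)

def projection_update(C, v):
    """Projection (pseudo-inverse) learning: C becomes the projector onto
    the span of memorized patterns; incremental, and weighs the stimulus
    by how novel it is relative to what is already stored."""
    e = v - C @ v                # component of v not yet memorized
    nrm = e @ e
    if nrm < 1e-12:              # v already lies in the stored subspace
        return C
    return C + np.outer(e, e) / nrm
```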
12
Testing FRiV framework
Two test settings:
- TV program annotation;
- the IIT-NRC 160x120 video-based facial database (one video clip used to memorize a person, another to recognize).
13
From video input to neural output
1. Face-looking regions are detected using rapid classifiers.
2. They are verified to have skin colour and not to be static.
3. Face rotation is detected and corrected; the face is eye-aligned, resampled to 12-pixels-between-the-eyes resolution, and extracted.
4. The extracted face is converted to a binary feature vector (receptor).
5. This vector is then appended with a name-tag vector (effector).
6. The synapses of the associative neuron network are updated.
Time-weighted decision (sketched in code below):
a) neural mode: all neurons with post-synaptic potential (PSP) greater than a certain threshold, S_j > S_0, are considered “winning”;
b) max mode: the neuron with the maximal PSP wins;
c) time-filtered: the average or median of several consecutive frame decisions, each made according to a) or b), is used;
d) PSP time-filtered: the technique of a) or b) is applied to PSPs averaged over several consecutive frames, instead of the PSPs of individual frames;
e) any combination of the above.
Scoring (over the frames of the 2nd video clip of a pair):
S10 - number of frames in which the face is associated with the correct person (the one seen in the 1st video clip of the pair), without any association with other seen persons - best (non-hesitant) case;
S11 - ... in which the face is associated not with one individual but with several, one of which is the correct one - good (hesitating) case;
S01 - ... in which the face is associated with someone else - worst case;
S02 - ... in which the face is associated with several individuals, none of which is correct - wrong but hesitating case;
S00 - ... in which the face is not associated with any of the seen faces - not bad case.
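A minimal sketch of decision modes a)-d), assuming per-frame PSP vectors of the name neurons are already computed; the function names are illustrative.

```python
# Minimal sketch (assumption): time-weighted decision modes over the
# post-synaptic potentials (PSPs) S of the name neurons.

import numpy as np

def neural_mode(S, S0):
    """a) every name neuron with PSP above the threshold S0 'wins'."""
    return np.flatnonzero(S > S0)

def max_mode(S):
    """b) the single neuron with the maximal PSP wins."""
    return np.array([np.argmax(S)])

def time_filtered(frame_winners):
    """c) median of per-frame max-mode decisions over consecutive frames."""
    return int(np.median([w[0] for w in frame_winners]))

def psp_time_filtered(S_frames, S0):
    """d) average PSPs over consecutive frames, then apply mode a)."""
    return neural_mode(np.mean(S_frames, axis=0), S0)
```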
14
Perceptual Vision Interface Nouse™
Combining the results: evolved from a single demo program to a hands-free perceptual operating system. Combines all the techniques presented and provides a clear vision for other to-be-developed seeing computers. Requires more man-power for tuning and software design, contingent on extra funding...
Operation flow (a coordinate-mapping sketch follows below):
Nouse connected → user's face detected → user recognized → Nouse initialization and calibration (user's motion range obtained, Nouse zero position (0,0) set) → face position converted to (X,Y) (used for typing and cursor control) → visual patterns analyzed (for hands-free commands).
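A minimal sketch of the calibration-to-cursor step: converting the tracked nose position, within the user's calibrated motion range, into screen coordinates. The linear mapping and all names here are assumptions.

```python
# Minimal sketch (assumption): map the tracked nose position to screen
# coordinates, given the calibrated zero position and motion range.

def nose_to_cursor(nose_xy, zero_xy, motion_range, screen=(1024, 768)):
    """nose_xy, zero_xy: tracked/calibrated nose positions (pixels);
    motion_range: (half-width, half-height) of the comfortable motion."""
    cursor = []
    for p, z, r, s in zip(nose_xy, zero_xy, motion_range, screen):
        t = (p - z) / (2.0 * r) + 0.5            # normalize to [0, 1]
        cursor.append(int(min(max(t, 0.0), 1.0) * (s - 1)))
    return tuple(cursor)

# Usage example: nose drifted 20 px right and 10 px up of the zero position,
# with a calibrated +/-40 x +/-30 px motion range.
print(nose_to_cursor((340, 230), (320, 240), (40, 30)))  # -> (767, 255)
```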