Observer motion problem

1 Observer motion problem
From image motion, compute:
• observer translation (Tx, Ty, Tz)
• observer rotation (Rx, Ry, Rz)
• depth at every location Z(x, y)

2 Observer just translates toward FOE
The directions of the velocity vectors intersect at the FOE (the heading point). But this simple strategy doesn't work if the observer also rotates.
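In the pure-translation case, the heading can be recovered by intersecting the flow directions. Below is a minimal sketch under that assumption; the names (estimate_foe, points, flows) are illustrative, not from the slides. It finds the FOE as the least-squares intersection of the lines through each image point along its velocity direction.

```python
import numpy as np

def estimate_foe(points, flows):
    """points: (n, 2) image positions; flows: (n, 2) velocity vectors.
    Returns the least-squares intersection of the flow-direction lines."""
    d = flows / np.linalg.norm(flows, axis=1, keepdims=True)
    # projector onto the perpendicular of each direction: I - d d^T
    P = np.eye(2) - d[:, :, None] * d[:, None, :]
    A = P.sum(axis=0)                                # normal equations
    b = (P @ points[:, :, None]).sum(axis=0).ravel()
    return np.linalg.solve(A, b)                     # estimated FOE
```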

3 Observer motion problem, revisited
From image motion, compute:  Observer translation (Tx Ty Tz)  Observer rotation (Rx Ry Rz)  Depth at every location Z(x,y) Observer undergoes both translation + rotation

4 Equations of observer motion
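For reference, the standard equations of the motion field (Longuet-Higgins & Prazdny, 1980), assuming a camera with unit focal length: a point at image position (x, y) with depth Z(x, y) has image velocity (vx, vy)

vx = (−Tx + x·Tz)/Z(x,y) + x·y·Rx − (1 + x²)·Ry + y·Rz
vy = (−Ty + y·Tz)/Z(x,y) + (1 + y²)·Rx − x·y·Ry − x·Rz

The translational component scales with 1/Z; the rotational component is independent of depth.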

5 Longuet-Higgins & Prazdny
Along a depth discontinuity, velocity differences depend only on observer translation. Velocity differences point to the focus of expansion.
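Both claims follow directly from the equations above: subtracting the velocities at two depths Z1 and Z2 on either side of the discontinuity (same image position), the rotational terms cancel because they do not depend on Z:

Δvx = (x·Tz − Tx)·(1/Z1 − 1/Z2)
Δvy = (y·Tz − Ty)·(1/Z1 − 1/Z2)

so Δv is parallel to (x − Tx/Tz, y − Ty/Tz), i.e. it lies along the line through (x, y) and the focus of expansion at (Tx/Tz, Ty/Tz).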

6 What is a chair?

7 Alignment methods
Find an object model and geometric transformation that best match the viewed image:
• V — viewed object (image)
• Mi — object models
• Tij — allowable transformations between the viewed object and the models
• F — measure of fit between V and the expected appearance of model Mi under the transformation Tij
GOAL: find a combination of Mi and Tij that maximizes the fit F

8 Alignment method: recognition process
1. Find the best transformation Tij for each model Mi (optimizing over possible views)
2. Find the Mi whose best Tij gives the best match to the image V
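A minimal sketch of this search, assuming point-based models with known correspondences, 3×3 homogeneous transformation matrices, and a squared-distance fit; the function names are illustrative, and the slides leave the real measure F abstract.

```python
import numpy as np

def fit(image_points, model_points, T):
    """Measure of fit F: negative sum of squared distances between the
    image points and the model points transformed by T (3x3 matrix)."""
    ones = np.ones((len(model_points), 1))
    homog = np.hstack([model_points, ones])          # (n, 3)
    predicted = (homog @ T.T)[:, :2]                 # apply Tij
    return -np.sum((image_points - predicted) ** 2)

def recognize(image_points, models, transformations):
    """Search for the combination of Mi and Tij that maximizes F."""
    best_i, best_T, best_F = None, None, -np.inf
    for i, M in enumerate(models):                   # over object models Mi
        for T in transformations:                    # over allowable Tij
            F = fit(image_points, M, T)
            if F > best_F:
                best_i, best_T, best_F = i, T, F
    return best_i, best_T, best_F
```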

9 Recognition by linear combination of views
Model views M1 and M2; LC is a linear combination of M1 and M2 that best matches the novel view.

10 Predicting object appearance
Two known views of an obelisk. Recognition process:
(1) compute α, β that predict P1 and P2
(2) use α, β to predict the other points
(3) evaluate the fit of the model to the image
xP1,I0 = α·xP1,I1 + β·xP1,I2
xP2,I0 = α·xP2,I1 + β·xP2,I2
(xP,I denotes the x-coordinate of point P in image I)
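A sketch of steps (1) and (2), assuming orthographic projection and working with x-coordinates only (as in the equations above); function names are illustrative.

```python
import numpy as np

def solve_alpha_beta(x1_anchor, x2_anchor, x0_anchor):
    """Step (1): from the two anchor points P1, P2, solve the 2x2 system
    x0 = alpha*x1 + beta*x2 for alpha and beta."""
    A = np.column_stack([x1_anchor, x2_anchor])
    return np.linalg.solve(A, x0_anchor)

def predict_points(x1_all, x2_all, alpha, beta):
    """Step (2): predict the remaining points of the novel view."""
    return alpha * x1_all + beta * x2_all
```

Step (3) then compares the predicted coordinates with the observed image to evaluate the fit of the model.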

11 Why is face recognition hard?
(slide images: changing pose, changing illumination, aging, clutter, occlusion, changing expression)
- Why hard? A particular person can appear very different from image to image. Some of the factors that lead to variability:
- Different points of view or pose relative to the viewer.
- Changing illumination, which can make you look more like someone else than like yourself (the illuminations here are fairly extreme, but see the Tony Blair images from the web).
- Different accessories (glasses), changing facial expression, age (young and adventurous to old and grumpy), hair changes.
- Most of the time people are not looking right toward you or the camera; they are going about their daily business, appearing with different poses, illuminations, and sizes (seen at a distance: very small, low resolution).
- Scenes are cluttered with lots of objects, and faces may be occluded (a bit of face next to Rafael Reif, MIT president, in news about the new College of Computing at MIT; a security guard is occluded).
- Recognition in unconstrained environments adds to the challenge of detecting and recognizing faces.

12 Eigenfaces for recognition (Turk & Pentland) Principal Components Analysis (PCA)
Goal: reduce the dimensionality of the data while retaining as much information as possible from the original dataset. PCA allows us to compute a linear transformation that maps data from a high-dimensional space to a lower-dimensional subspace.
- Continuing the tour: this approach to face recognition was championed by Turk & Pentland in the early 90s. Known as the eigenface approach, it is based on the technique of principal components analysis.
- At some level we think of an image as a two-dimensional thing, but it is really a very high-dimensional entity if we treat each image pixel as a dimension that can take on a range of values corresponding to the brightness of the image at that location.
- Much of the information stored in individual pixels is redundant, especially for nearby pixels. So, for the purpose of representing the image for recognition, it would be nice to map the high-dimensional data onto a lower-dimensional subspace that removes much of the redundancy in the image; PCA is one technique for doing this.
- Without going into the details of the method, a very simple example captures the basic intuition: imagine measuring the brightness of two pixels across lots of images and plotting all the combinations of brightness. If the pixels are nearby, chances are they are highly correlated, and most of the variation in brightness is captured by where the brightness lies along one dimension, the first principal component. We can then project all the data onto that one dimension, which may be adequate for whatever task the data are used for.
- Why the term eigenfaces? The PCA method involves computing the eigenvectors and eigenvalues of a covariance matrix constructed from the data. The first principal component is associated with the eigenvector with the largest eigenvalue (the direction of the red line) and captures the direction of largest variance in the data. With multi-dimensional data, the second principal component captures the direction of the next largest variation in the data, and so on.
- What is the analogy for images of faces?
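A small runnable illustration of the two-pixel example from the notes, using synthetic data (all names and numbers are illustrative): the eigenvector of the covariance matrix with the largest eigenvalue is the first principal component, and most of the variance survives projection onto it.

```python
import numpy as np

rng = np.random.default_rng(0)
p1 = rng.uniform(0, 255, 500)            # brightness at pixel 1
p2 = p1 + rng.normal(0, 10, 500)         # nearby pixel 2: highly correlated
X = np.column_stack([p1, p2])

Xc = X - X.mean(axis=0)                  # center the data
cov = Xc.T @ Xc / (len(X) - 1)           # covariance matrix
evals, evecs = np.linalg.eigh(cov)       # eigenvalues / eigenvectors
pc1 = evecs[:, np.argmax(evals)]         # first principal component

projected = Xc @ pc1                     # 2-D data mapped to 1 dimension
print(evals / evals.sum())               # fraction of variance per component
```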

13 Eigenfaces for recognition (Turk & Pentland)
Perform PCA on a large set of training images to create a set of eigenfaces, Ei(x,y), that span the data set. The first components capture most of the variation across the data set; later components capture subtle variations.
Ψ(x,y): the average face (across all faces)
Each face image F(x,y) can be expressed as a weighted combination of the eigenfaces Ei(x,y):
F(x,y) = Ψ(x,y) + Σi wi·Ei(x,y)
- The analogous eigenvectors can be portrayed as images, which Turk & Pentland referred to as eigenfaces. They look creepy, but they capture the variations that exist across a database of face images.
- In the upper left corner is the average face: the brightness at each location is the average of the brightness at that same location across the entire stack of image samples (generated from a database where each image had a dark background). Before PCA, this average is subtracted from each image in the training set.
- To its right is the first eigenface. Starting with the average face, adding in different amounts of the first eigenface captures a large part of the variation in face images; adding some amount of the next eigenface captures more variation in the dataset, and as a bit of each one is added in, more of the variation is captured.
- There is some intuition about the contribution that some eigenfaces make: e.g. the first may capture the overall skin tone of an individual, others a beard, glasses, etc.
- Typically about 25 eigenfaces capture enough of the variation to get reasonable recognition performance.
- Once you have the eigenfaces, computing the weights for a new image is a direct calculation; but how is this representation then used for recognition?
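A minimal sketch of that computation, assuming the training faces arrive as one flattened array; here the eigenfaces come from an SVD of the mean-subtracted data (equivalent to eigenvectors of the covariance matrix), and the names are illustrative.

```python
import numpy as np

def compute_eigenfaces(faces, k=25):
    """faces: (n, h*w) array of flattened training images; k <= n."""
    psi = faces.mean(axis=0)                      # average face Ψ(x,y)
    A = faces - psi                               # subtract average first
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return psi, Vt[:k]                            # first k eigenfaces Ei

def face_weights(face, psi, E):
    """Direct calculation of the weights: wi = Ei · (F − Ψ)."""
    return E @ (face - psi)
```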

14 Representing individual faces
Each face image F(x,y) can be expressed as a weighted combination of the eigenfaces Ei(x,y):
F(x,y) = Ψ(x,y) + Σi wi·Ei(x,y)
Recognition process: compute the weights wi for the novel face image, then find the image m in the face database with the most similar weights, e.g. by Euclidean distance: minimize Σi (wi − wi,m)².
- The top repeats what was just said: an original face image and the set of the first k eigenfaces computed from the training set, each associated with a separate weight. Multiplying each eigenface by its weight and adding all the weighted images together, along with the mean image, gives a rough depiction of the face, not exact: by preserving only a subset of the eigenfaces, some subtle variations are lost.
- The weights are a signature for this person. With a different set of weights for each individual in the database of faces we can recognize, we then search through the weights of the known individuals for the one that yields the closest match.
- Different metrics can be used; the simplest is Euclidean distance: with i indexing the weights and m a sample individual in the database, measure how different the two sets of weights are and find the identity m that minimizes the difference.
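A sketch of that matching step under the Euclidean-distance metric described above (names are illustrative):

```python
import numpy as np

def identify(w_novel, db_weights):
    """db_weights: (m, k) array of stored weight vectors, one per known
    individual. Returns the identity minimizing Σi (wi − wi,m)²."""
    dists = np.linalg.norm(db_weights - w_novel, axis=1)
    m = int(np.argmin(dists))
    return m, dists[m]
```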

15 Face detection: Viola & Jones
Multiple view-based classifiers based on simple features that best discriminate faces vs. non-faces. The most discriminating features are learned from thousands of samples of face and non-face image windows.
Attentional mechanism: a cascade of increasingly discriminating classifiers improves performance.
- The goal is to construct a classifier that takes a small image window and determines whether or not it is a face. Imagine a strategy that scans the image, looking at each small patch and asking: is this a face? In a sample published result, it found all the faces, plus a soccer ball that looks a little like a face.
- To design the classifier, ask: what features of the image in a window are most effective at distinguishing faces from non-faces? The features best for finding faces in roughly frontal view may differ from those best for detecting a profile face, so multiple classifiers are designed, specialized for frontal or profile views (keep this in the back of your mind).
- They started out with thousands of simple features that could be used for classification, and learned the most effective ones by training the classifier with lots of example image patches labeled as faces or non-faces.
- Another key aspect of the approach is an attentional mechanism that greatly reduces the amount of computation needed. Suppose a couple hundred features are sufficient to construct a classifier with good performance (for each window: 200 features, then a decision). There is no point wasting time computing all 200 features for every subwindow of the image; don't bother computing them for regions of roughly uniform brightness. It would be nice to use a few features as a quick filter that removes from consideration the many regions that are clearly not faces, then focus computational resources on the more promising regions that may contain faces, using the more discriminating features to determine whether each really is a face.

16 Viola & Jones use simple features
Use simple rectangle features:
Σ I(x,y) in the gray area − Σ I(x,y) in the white area
within 24 × 24 image sub-windows
• Initially consider 160,000 potential features per sub-window!
• Features are computed very efficiently
Which features best distinguish face vs. non-face? Learn the most discriminating features from thousands of samples of face and non-face image windows.
- What features are used? Simple rectangular features: the squares represent a 24×24 image sub-window, and each feature has 2-4 rectangular sub-regions at particular locations in the window.
- To calculate the value of a feature, add the intensities in the gray areas, subtract the sum of the intensities in the white areas, and compare the difference in brightness between the regions to a threshold. Is the difference larger than a certain amount? That can provide evidence for whether or not the window is a face.
- Think of the first feature as measuring the change in brightness between the left and right sub-regions. Where does it have a large value? Wherever there is a large difference between adjacent vertical strips of the image (left/right face boundaries, the side hairline).
- A number of different geometric configurations of positive and negative regions are defined, at different sizes and orientations, providing an initial set of 160,000 potential features of this sort: a lot, but very simple and very efficient to compute. (There is a lot of redundancy in the regions whose intensities are added; computing an intermediate representation of the sums over particular image regions avoids the redundant computations.)
- Which features best distinguish whether a window is a face or not? This is learned. The two most informative are shown here; see where they lie on a typical face. Why are they good at distinguishing faces? A face typically has an eye region darker than the area below it, and the bridge of the nose between the eyes is typically brighter than the eyes; these templates capture those relationships (video).
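The "intermediate representation" mentioned in the notes is the integral image. A sketch of why the features are so cheap to compute: any rectangle sum costs at most four array lookups, independent of the rectangle's size (function names are illustrative).

```python
import numpy as np

def integral_image(I):
    """ii(r, c) = sum of I over all pixels above and to the left."""
    return I.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of I over rows r0..r1, cols c0..c1 (inclusive): 4 lookups."""
    total = ii[r1, c1]
    if r0 > 0:            total -= ii[r0 - 1, c1]
    if c0 > 0:            total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

# A two-rectangle feature is then rect_sum(gray) - rect_sum(white).
```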

17 Learning the best features
A weak classifier using one feature:
h(x, f, p, θ) = 1 if p·f(x) < p·θ, else 0
x = image window, f = feature, p = +1 or −1, θ = threshold
n training samples (x1, w1, 1) … (xn, wn, 0), with equal initial weights and known classes.
Iterate: normalize the weights → find the next best weak classifier → use the classification errors to update the weights (AdaBoost).
Final classifier: ~200 features yield good results for a "monolithic" classifier.
- What is the learning process? It is iterative, progressively finding the next best feature until the system does a good enough job classifying image patches as faces or non-faces.
- First define a weak classifier that just uses a single feature: h is the classifier, with value 1 for face and 0 for non-face, a function of four things: the window x, the feature f, the threshold θ used to decide face or not, and an extra p = ±1. A window may be more likely to be a face if the feature is larger than the threshold or smaller than it, depending on the specific feature; p = ±1 allows the decision to be based on being either larger or smaller than the threshold.
- To learn good features we need training data: n training samples (large), sub-windows of known class (1 = face, 0 = non-face), each with an associated weight. The weights start out equal and change later: certain examples may be more challenging to classify (the soccer ball), and we want to give them more weight in the training process.
- The process is iterative, with the weights normalized to sum to 1. Find the next weak classifier (a combination of f, θ, p) that does the best job of correctly identifying the class of each image window in the training set: sum the difference between what the feature indicates (0/1) and the correct class (0/1) over all windows, each multiplied by the current weight associated with the sample, ε = Σi wi·|h(xi) − yi|. This expression measures the incorrect classifications; we want to minimize the discrepancy between the true class and what the classifier gives us based on this feature.
- It will not be perfect: some samples will be classified wrongly. Increase the weight on those samples for the next time around, when searching for the next best feature. This strategy is AdaBoost.
- This yields lots of weak classifiers based on single features, each doing a better or worse job of classifying. Define a final classifier that integrates the evidence from all the features, weighing the contribution of each feature according to how good it was at distinguishing face from non-face during training; sum all the evidence and check whether it is larger than a threshold. Yes? Then it's a face. About 200 features give good results for this formulation of the classifier; a test set of novel image patches is used to evaluate the performance (% correct).
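A sketch of selecting one weak classifier, assuming the feature values have been precomputed for all training windows; the brute-force search over (f, θ, p) is for clarity, not efficiency, and all names are illustrative.

```python
import numpy as np

def best_weak_classifier(feature_vals, y, w):
    """feature_vals: (n_features, n_samples) precomputed values f(x);
    y: 0/1 labels; w: current sample weights. Returns the (f, theta, p)
    minimizing the weighted error ε = Σi wi·|h(xi) − yi|."""
    w = w / w.sum()                              # normalize the weights
    best_err, best = np.inf, None
    for f, vals in enumerate(feature_vals):
        for theta in np.unique(vals):            # candidate thresholds
            for p in (+1, -1):                   # direction of comparison
                h = (p * vals < p * theta).astype(int)
                err = np.sum(w * np.abs(h - y))  # weighted error
                if err < best_err:
                    best_err, best = err, (f, theta, p)
    return best_err, best

# After each round, the weights of misclassified samples are increased,
# so the next feature focuses on the harder examples (AdaBoost).
```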

18 “Attentional cascade” of increasingly discriminating classifiers
Early classifiers use a few highly discriminating features and a low threshold:
• the 1st classifier uses two features and removes 50% of the non-face windows
• later classifiers distinguish the harder examples
• increases efficiency
• allows use of many more features
• a cascade of 38 classifiers, using ~6,000 features
- As said, we don't want to compute the values of all 200 features everywhere in the image; that is not efficient. So they used the idea of an attentional cascade: start with all sub-windows, use a small number of features to reject the many sub-windows that are clearly non-faces, and only preserve the more promising ones for further analysis with additional features.
- The first two features remove half of the non-face windows while preserving ~100% of the real faces; the second layer uses about 10 more features and rejects about 80% of the remaining non-faces.
- Given that features are computed for fewer and fewer windows, in the long run this allows many more features to be used. In practice, a cascade of 38 classifiers using about 6,000 features in total yields a high level of performance.
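A sketch of how the cascade saves computation: a window is rejected as soon as any stage falls below its threshold, so the full feature set is only evaluated on windows that look face-like all the way through. The stage representation here is illustrative, not Viola & Jones's actual API.

```python
def cascade_detect(window, stages):
    """stages: list of (classify, threshold) pairs, ordered from the
    cheapest, most permissive stage to the most discriminating one."""
    for classify, threshold in stages:
        if classify(window) < threshold:
            return False      # rejected early: no more features computed
    return True               # survived all stages: report a face
```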

19 Feature based vs. holistic processing
Tanaka & Simonyi (2016), Sinha et al. (2006)
Composite face effect, face inversion effect, whole-part effect

20 Feature based vs. holistic processing
Composite face effect: identical top halves are seen as different when aligned with different bottom halves; when misaligned, the top halves are perceived as identical.
Face inversion effect: inversion disrupts the recognition of faces more than of other objects; prosopagnosics do not show the effect.
Whole-part effect: identification of a "studied" face is significantly better in the whole vs. part condition.
- (Kanade/features vs. PCA/holistic.) A question in perceptual studies, not fully resolved: do we recognize faces based on features, or by a holistic process that uses global, configural information about faces?
- Perceptual phenomena have been used to argue for holistic processing, a fuzzy notion in the literature; these are examples of the effects people point to.
- Composite face effect (left pair of images): are the two top halves the same or different? The tops are identical, but when aligned with different bottoms they appear different (an illusion; they really are the same). When the bottoms are misaligned, people are more likely to make the correct judgment. The argument: we derive a holistic representation of the face when the parts are aligned, which prevents access to the detailed parts needed for the comparison. In the demo with famous people, a small amount of information, e.g. an eye, is sufficient to recognize a famous person, but mixing Woody Allen with Oprah Winfrey disrupts recognition: it interferes with the diagnostic capability of the individual features.
- Inversion effect: it is more difficult to recognize inverted faces, and the disparity between our ability to recognize upright vs. inverted faces is greater than the disparity for other kinds of objects. The argument: inversion disrupts the holistic arrangement of the collection of features (eyes, nose, mouth). The inversion effect develops in early childhood, and prosopagnosics don't exhibit it (the learning process over time does not occur, which contributes to their difficulty).

21 The power of averages, Burton et al. (2005)
- Burton et al.: is the internal representation of individuals based on an average of the examples seen? They averaged famous people, grabbing examples from the internet: roughly frontal views, but otherwise unconstrained (lighting, expression, time). They did not average the raw images.
- They averaged texture (which captures the typical variation of brightness across the face) and shape separately (the shape step was manual: a generic grid whose key points are moved to locations on the face). (1) Morph each image to a common geometric shape, do the same for each exemplar, then average (the individuals do not all look like Tony Blair, but the average captures his essence), though the individual shape is lost (e.g. a large forehead). (2) Average the locations of the key points on the grid to get the average shape, captured by the mesh. (3) Morph the average texture to the grid of the average face.
- This is quite effective for recognizing famous faces. A PCA-based recognition system trained with averages performs better at recognizing new images of an individual, and it improves as more exemplars are combined in the average. They also tested a commercial system, FaceVACS, with a database of ~31,000 photos of famous faces covering ~3,600 celebrities; an online version lets you upload an image and returns the closest matching photo from the database (a black box). With 460 individual images of famous people it was 54% correct; with average faces as input it reached 100% correct, and an average computed from the 46% of images that failed individually still reached 80% correct.
- Why? What is it about the average that is advantageous? Averaging factors out things, e.g. lighting variations and expression, that are not diagnostic of identity.
- In perceptual experiments on recognizing famous faces, average reaction times are faster and there is less error identifying average faces vs. exemplars. How could the construction of such a representation be automated? How would it evolve over time as someone becomes more familiar?
(figure labels: PCA performance, average "shape", average "texture")

22 Human recognition of average faces
Burton et al. (2005). Figure panels: performance with texture + shape images; performance with shape-free images.

23 What is an artificial neural network?
Network of simple neuron-like computing elements… that can learn to associate inputs with desired outputs.
- What is an ANN? A network of simple neuron-like computing elements. What is the basis for this? We learned earlier about the structure of real neurons: dendrites are where other neurons connect to the cell at synapses, causing electrical changes that propagate to the cell body, where the information is integrated to produce a response in the form of spikes that propagate along the axon to other cells. Inputs can excite or inhibit the electrical activity of the cell.
- In a somewhat analogous way, each element in a neural net receives a number of inputs that are integrated to produce an output, and the inputs are integrated in a way that can increase or decrease the output of each unit.
- ANNs are typically organized in layers: an input layer, an output layer (which may have many output units), and a layer in between called the hidden layer, because it hides between the inputs we provide and the outputs we can observe.

24 Computing in a ”typical” neural network
How does each unit integrate its inputs to produce an output?
activation: s = w0·1 + w1·I1 + w2·I2 + … + wn·In
output: sigmoid(s) = 1/(1 + e^−s), between 0 and 1
(figure: inputs +1, I1, I2, …; weights w0, w1, w2, …; unit H; output 1 or 0)
- The sum of the weights times the inputs, the quantity above (and there could be more terms), is often referred to as the activation value of the unit (a term sometimes also used with real neurons).
- As described so far, the unit would apply a threshold at 0: a step function that jumps from 0 to 1 at activation = 0.
- Rather than a sudden jump in behavior with a slight change in activation, we can make a smooth transition, still between 0 and 1: as the activation increases from a negative value, the response of the unit slowly increases, more steeply around 0, but with no sudden jump. This is the sigmoid function, 1/(1 + e^−s), where s is the sum of the weighted inputs (the activation).
- In the end, this is what simple neural networks actually do: sum of weighted inputs → sigmoid function → output between 0 and 1.
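The whole unit in a few lines, assuming numpy (names are illustrative):

```python
import numpy as np

def unit_output(inputs, weights, bias):
    """One unit: weighted sum of inputs (the activation) -> sigmoid."""
    s = bias + np.dot(weights, inputs)   # activation: w0·1 + Σ wi·Ii
    return 1.0 / (1.0 + np.exp(-s))      # sigmoid squashes to (0, 1)
```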

25 Learning to recognize input patterns
Feedforward processing; the network weights can be learned from training examples (mapping inputs to correct outputs).
Back-propagation: an iterative algorithm that progressively reduces the error between the computed and desired output until performance is satisfactory. On each iteration:
• compute the output of the current network and assess performance
• compute weight adjustments from the hidden to the output layer that reduce the output errors
• compute weight adjustments from the input to the hidden units that improve the hidden layer
• change the network weights, incorporating a rate parameter
- How do we train the network to solve a particular problem? So far we have seen how it computes in the forward direction, from inputs to outputs. How can it learn a pattern of weights that enables it to produce the correct outputs for each set of inputs? It learns from a set of training examples (pairs of an input image and the correct digit label).
- Start with random values for the weights and use an iterative strategy that progressively reduces the error between computed and desired outputs until performance is good enough.
- How are the weights adjusted to improve performance? The most common strategy is the back-propagation algorithm. On each iteration, take the current network and compute the output for each input, then assess performance (measure the errors between computed and desired outputs). If it is not good enough, work backwards: determine the changes in the weights from the hidden to the output layers that reduce the errors, then move further back and determine the adjustments of the weights from the input to the hidden layers that improve the contributions of the hidden layer to reducing the final output.
- After figuring out how to adjust the weights, (typically) change them all at once. A final rate parameter controls how aggressively to make the changes: a scale factor (change a lot on each iteration, or be more conservative); one can graph the effect of different rates. A sketch of the whole procedure follows.
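A minimal runnable sketch of those four steps for one hidden layer, assuming numpy; bias terms are omitted for brevity and all names are illustrative, so this is an illustration of the procedure, not the exact network from the slides.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train(X, Y, n_hidden=4, lr=0.5, iters=10000, seed=0):
    """X: (n, d) inputs; Y: (n, o) desired outputs; lr: rate parameter."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], n_hidden))   # input -> hidden
    W2 = rng.normal(0, 0.5, (n_hidden, Y.shape[1]))   # hidden -> output
    for _ in range(iters):
        H = sigmoid(X @ W1)              # forward pass
        out = sigmoid(H @ W2)
        err = out - Y                    # assess performance
        d2 = err * out * (1 - out)       # adjustments, hidden -> output
        d1 = (d2 @ W2.T) * H * (1 - H)   # propagate error back to hidden
        W2 -= lr * H.T @ d2              # change weights, scaled by the
        W1 -= lr * X.T @ d1              # rate parameter lr
    return W1, W2
```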

