Perception Vision, Sections Speech, Section 24.7
Computer Vision §“the process by which descriptions of physical scenes are inferred from images of them.” -- S. Zucker §“produces from images of the external 3D world a description that is useful to the viewer and not cluttered by irrelevant information”
Typical Applications §Medical Image Analysis §Aerial Photo Interpretation §Material Handling §Inspection §Navigation
Multimedia Applications §Image compression §Video teleconferencing §Virtual classrooms
Image pixelation
Pixel values
How to recognize faces?
Problem Background §M training images §Each image is N x N pixels §Each image is normalized for face position, orientation, scale, and brightness §There are several pictures of each face, in different “moods”
Your Task §Determine if the test image contains a face §If it contains a face, is it a face of a person in our database? §If it is a person in our database, which one? §Also, what is the probability that it is Jim?
Image Space §An N x N image can be thought of as a point in an $N^2$-dimensional image space §Each pixel is a feature with a gray-scale value §Example: a 512 x 512 image is a point in a 262,144-dimensional space; each pixel can be 0 (black) to 255 (white)
Nearest Neighbor §The most likely match is the nearest neighbor §But that would take too much processing §Since all images are faces, they will have very high similarity
Face Space §Lower dimensionality to both simplify the storage and generalize the answer §Use eigenvectors to distill the 20 most distinctive metrics §Make a 20-item array for each face that contains the values of 20 features that most distinguish faces. §Now each face can be stored in 20 words
The average face §Training images are $I_1, I_2, \ldots, I_M$ §Average image is $A = \frac{1}{M}\sum_{i=1}^{M} I_i$
Weight of an image in each feature §For k = 1, ..., 20 features, compute the similarity between the input image, $I$, and the kth eigenvector, $E_k$: $w_k = E_k^T (I - A)$
Image in Face Space §“Only” 20-dimensional space §$W = [w_1, w_2, \ldots, w_{20}]^T$, a column vector of weights that indicate the contribution of each of the 20 eigenfaces to $I$ §Each image is projected from a point in high-dimensional space into face space §20 features * 32 bits = 640 bits per image
Reconstructing image I §$I \approx A + \sum_{k=1}^{M'} w_k E_k$, where M’ is the number of eigenfaces kept §If M’ < M, we can only approximate I §Good enough for recognizing faces
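The projection into face space and the approximate reconstruction can be sketched in a few lines. This is a minimal illustration, not the original system's code: the function names `project` and `reconstruct` are made up here, images are flat lists of pixel values, and the eigenfaces are assumed to be orthonormal.

```python
def project(image, average, eigenfaces):
    """Return the weights w_k = E_k . (I - A), one per eigenface."""
    centered = [p - a for p, a in zip(image, average)]
    return [sum(e_i * c_i for e_i, c_i in zip(e, centered)) for e in eigenfaces]


def reconstruct(weights, average, eigenfaces):
    """Return the approximation A + sum_k w_k * E_k."""
    approx = list(average)
    for w, e in zip(weights, eigenfaces):
        approx = [x + w * e_i for x, e_i in zip(approx, e)]
    return approx


# Toy 3-pixel "images" with only 2 eigenfaces (hypothetical values).
average = [1.0, 1.0, 1.0]
eigenfaces = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image = [3.0, 4.0, 5.0]

weights = project(image, average, eigenfaces)
approx = reconstruct(weights, average, eigenfaces)
```

With fewer eigenfaces than pixel dimensions, the third pixel of the toy image cannot be recovered, which is the sense in which the reconstruction is only approximate.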
Picking the 20 Eigenfaces §Principal Component Analysis (also called the Karhunen-Loève transform) §Create 20 images that maximize the information content in eigenspace §Normalize by subtracting the average face §Compute the covariance matrix, C §Find the eigenvectors of C that have the 20 largest eigenvalues
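The PCA steps listed above (subtract the average, form the covariance matrix, keep the eigenvectors with the largest eigenvalues) can be sketched with NumPy. This is an illustrative sketch under stated assumptions: the function name `top_eigenfaces` is hypothetical, each image is assumed to be flattened into one row, and the covariance matrix is built directly, which is only practical for small images.

```python
import numpy as np


def top_eigenfaces(images, k):
    """Return (average, eigenfaces, eigenvalues) for the k largest eigenvalues.

    images: array-like of shape (M, D), one flattened image per row.
    eigenfaces: array of shape (k, D), one unit-length eigenvector per row.
    """
    X = np.asarray(images, dtype=float)
    average = X.mean(axis=0)
    Y = X - average                      # normalize by subtracting the average face
    C = Y.T @ Y / X.shape[0]             # D x D covariance matrix
    values, vectors = np.linalg.eigh(C)  # eigh returns eigenvalues in ascending order
    order = np.argsort(values)[::-1][:k] # indices of the k largest eigenvalues
    return average, vectors[:, order].T, values[order]


# Toy data: three 2-pixel "images" lying on a line (hypothetical values).
average, eigenfaces, eigenvalues = top_eigenfaces([[0, 0], [2, 2], [4, 4]], k=1)
```

For real N x N images the D x D matrix has $N^2$ rows and columns, so practical implementations usually work with a much smaller M x M matrix built from the M training images instead; that refinement is omitted here.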
Build a database of faces §Given a training set of face images, compute the 20 eigenvectors with the largest eigenvalues, $E_1, E_2, \ldots, E_{20}$ (offline, because it is slow) §For each face in the training set, compute the point in eigenspace, $W = [w_1, w_2, \ldots, w_{20}]$ (offline, because it is big)
Categorizing a test face §Given a test image, $I_{test}$, project it into the 20-space by computing $W_{test}$ §Find the closest face in the database to the test face: $d = \min_k \lVert W_{test} - W_k \rVert$, where $W_k$ is the point in facespace associated with the kth person and $\lVert \cdot \rVert$ denotes the Euclidean distance in facespace
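Distance from facespace §Find the distance of the test image from eigenspace: $dffs = \lVert Y - Y_f \rVert$, where $Y = I_{test} - A$ is the mean-adjusted test image and $Y_f = \sum_{k=1}^{20} w_k E_k$ is its projection onto facespace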
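One way to compute the distance from face space is to project the mean-adjusted image onto the eigenfaces and measure what is left over. The sketch below assumes orthonormal eigenfaces and flat lists of pixel values; the function name `distance_from_face_space` is made up for illustration.

```python
import math


def distance_from_face_space(image, average, eigenfaces):
    """Return the Euclidean distance between the mean-adjusted image
    and its projection onto the span of the eigenfaces."""
    y = [p - a for p, a in zip(image, average)]
    y_f = [0.0] * len(y)
    for e in eigenfaces:
        w = sum(e_i * y_i for e_i, y_i in zip(e, y))    # weight for this eigenface
        y_f = [f + w * e_i for f, e_i in zip(y_f, e)]   # accumulate the projection
    return math.dist(y, y_f)


# Toy example: the third pixel lies entirely outside the 2-D face space.
dffs = distance_from_face_space(
    [3.0, 4.0, 5.0], [1.0, 1.0, 1.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
)
```

A face-like image should be well described by the eigenfaces and so have a small residual, while an arbitrary image typically leaves a large one.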
Is this a face? §If dffs < threshold1: if d < threshold2, the test image is a face that is very close to the nearest neighbor, so classify it as that person; otherwise the image is a face, but not one we recognize §Else the image probably does not contain a face
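The two-threshold decision rule can be written out directly. This is a minimal sketch: the `classify` function, the dictionary-based database, and the threshold values in the example are all hypothetical, and `dffs` is assumed to have been computed already.

```python
import math


def classify(w_test, dffs, database, threshold1, threshold2):
    """Apply the two-threshold rule.

    w_test:   weights of the test image in face space
    dffs:     distance of the test image from face space
    database: dict mapping a person's name to that person's weight vector
    """
    if dffs >= threshold1:
        return "not a face"
    # Nearest neighbor in face space (Euclidean distance).
    name, d = min(
        ((n, math.dist(w_test, w)) for n, w in database.items()),
        key=lambda pair: pair[1],
    )
    if d < threshold2:
        return name
    return "unknown face"


# Hypothetical two-person database in a 2-D face space.
database = {"jim": [1.0, 0.0], "ann": [0.0, 1.0]}
result = classify([0.9, 0.1], dffs=0.5, database=database,
                  threshold1=1.0, threshold2=0.5)
```

In practice both thresholds would have to be tuned on held-out images; the values here are placeholders.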
Face Recognition Accuracy §Using 20-dimensional facespace resulted in about 95% correct classification on a database of 7500 images of 3000 people §If there are several images per person, the average W for that person helps improve accuracy
Edge Detection §Finding simple descriptions of objects in complex images: find edges, then interrelate edges
Causes of edges §Depth discontinuity: one surface occludes another §Surface orientation discontinuity: the edge of a block §Reflectance discontinuity: texture or color changes §Illumination discontinuity: shadows
Examples of edges
Finding Edges §Image intensity along a line §First derivative of intensity §Smoothed via convolving with a Gaussian
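The idea of smoothing with a Gaussian and then looking for peaks in the first derivative can be tried on a single line of pixel intensities. This is a rough 1-D sketch, not a full edge detector: the function names, kernel radius, sigma, and threshold are all illustrative choices.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    k = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]


def smooth(signal, kernel):
    """Convolve signal with a symmetric kernel, clamping indices at the borders."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, kv in enumerate(kernel):
            idx = min(max(i + j - r, 0), n - 1)
            acc += kv * signal[idx]
        out.append(acc)
    return out


def find_edges(signal, sigma=1.0, radius=3, threshold=10.0):
    """Return indices where the smoothed first derivative has a strong local peak."""
    s = smooth(signal, gaussian_kernel(sigma, radius))
    n = len(s)
    # Central-difference derivative; the two border samples are left at zero.
    d = [0.0] + [(s[i + 1] - s[i - 1]) / 2.0 for i in range(1, n - 1)] + [0.0]
    return [
        i for i in range(1, n - 1)
        if abs(d[i]) > threshold and abs(d[i]) >= abs(d[i - 1]) and abs(d[i]) > abs(d[i + 1])
    ]


# A dark region followed by a bright region: one step edge between indices 9 and 10.
line = [10.0] * 10 + [200.0] * 10
edges = find_edges(line)
```

Smoothing first matters because differentiating raw pixel values amplifies noise; the Gaussian suppresses small fluctuations so that only substantial intensity changes survive the threshold.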
Pixels on edges
Edges found
Human-Computer Interfaces §Handwriting recognition §Optical Character Recognition §Gesture recognition §Gaze tracking §Face recognition
Vision Conclusion §Machine Vision is so much fun, we have a full semester course in it §Current research in vision modeling is very active; more breakthroughs are needed
Speech Recognition Section 24.7
Speech recognition goal §Find a sequence of words that maximizes P(words | signal)
Signal Processing §“Toll quality” was the Bell Labs definition of digitized speech good enough for long-distance calls (“toll” calls): a sampling rate of 8000 samples per second and a quantization factor of 8 bits per sample §Too much data to analyze to find utterances directly
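The data volume implied by these toll-quality figures is easy to check. The short calculation below uses only the sampling rate and sample size stated above; the function name is made up for illustration.

```python
SAMPLES_PER_SECOND = 8000
BITS_PER_SAMPLE = 8


def bytes_per_minute(samples_per_second, bits_per_sample):
    """Return the raw data rate in bytes per minute."""
    bytes_per_second = samples_per_second * bits_per_sample // 8
    return bytes_per_second * 60


raw_rate = bytes_per_minute(SAMPLES_PER_SECOND, BITS_PER_SAMPLE)
```

That works out to 480,000 bytes per minute, which is where a figure of roughly 500 KB/minute for raw speech comes from.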
Computational Linguistics §Human speech is limited to a repertoire of about 40 to 50 sounds, called phones §Our problem: What speech sounds did the speaker utter? What words did the speaker intend? What meaning did the speaker intend?
Finding features
Vector Quantization §The 255 most common clusters of feature values are labeled C1, …, C255 §Send only the 8-bit label §One byte per frame (a 100-fold improvement over the 500 KB/minute of the raw signal)
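Vector quantization replaces each frame's feature vector with the label of the nearest cluster center. The sketch below assumes a codebook of cluster centers has already been learned; the `quantize` function and the tiny three-entry codebook are hypothetical.

```python
import math


def quantize(frame, codebook):
    """Return the index of the codebook entry closest to the frame's feature vector."""
    return min(range(len(codebook)), key=lambda i: math.dist(frame, codebook[i]))


# Hypothetical 2-D codebook; a real one would have up to 256 entries,
# so that every label fits in a single byte.
codebook = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
labels = [quantize(f, codebook) for f in [[9.0, 8.0], [1.0, 9.0], [0.5, 0.5]]]
```

Only the labels need to be stored or transmitted, at the cost of losing the exact feature values within each cluster.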
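How to Wreck a Nice Beach §By Bayes’ rule, $P(words \mid signal) = \frac{P(signal \mid words)\,P(words)}{P(signal)}$ §where P(signal) is a constant (it is the signal we received) §So we want $\arg\max_{words} P(signal \mid words)\,P(words)$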
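Because P(signal) is the same for every candidate, choosing the best word sequence only requires comparing the product of an acoustic score and a language-model score. The numbers below are invented purely for illustration, and `best_hypothesis` is a hypothetical helper.

```python
def best_hypothesis(acoustic, language):
    """Return the candidate maximizing P(signal | words) * P(words).

    acoustic: dict mapping a word sequence to P(signal | words)
    language: dict mapping a word sequence to P(words)
    """
    return max(acoustic, key=lambda words: acoustic[words] * language[words])


# Made-up probabilities: the acoustics slightly favor the wrong phrase,
# but the language model strongly favors the right one.
acoustic = {"wreck a nice beach": 0.6, "recognize speech": 0.5}
language = {"wreck a nice beach": 1e-8, "recognize speech": 1e-5}
winner = best_hypothesis(acoustic, language)
```

The example shows why the language model matters: two phrases that sound nearly identical are separated almost entirely by how plausible they are as text.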
Unigram Frequency §Word frequency §Even though his handwriting was sloppy, Woody Allen’s bank hold-up note probably should not have been interpreted as “I have a gub”: the word “gun” is common; the word “gub” is unlikely
Language model §Use the language model to compare P(“wreck a nice beach”) and P(“recognize speech”) §Use naïve Bayes to assess the likelihood for each word that it will appear in this context
Bigram model §Want $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$; approximate it by $P(w_i \mid w_{i-1})$ §Easy to train: simply count the number of times each word pair occurs §“I has” is unlikely, “I have” is likely §“an gun” is unlikely, “a gun” is likely
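Training a bigram model by counting word pairs can be sketched as follows. The tiny corpus and the function names are made up for illustration, and no smoothing is applied, so unseen pairs get probability zero.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count how often each word appears as a context and how often each pair occurs."""
    contexts, pairs = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        contexts.update(words[:-1])           # every word that has a successor
        pairs.update(zip(words, words[1:]))   # adjacent word pairs
    return contexts, pairs


def bigram_prob(prev, word, contexts, pairs):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if contexts[prev] == 0:
        return 0.0
    return pairs[(prev, word)] / contexts[prev]


# Hypothetical three-sentence training corpus.
contexts, pairs = train_bigrams(["I have a gun", "I have a dog", "I saw a gun"])
p_have = bigram_prob("i", "have", contexts, pairs)
p_has = bigram_prob("i", "has", contexts, pairs)
```

A real system would train on a far larger corpus and smooth the counts so that rare but legitimate pairs are not ruled out entirely.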
Trigram §Some trigrams are very common; only track the most common trigrams §Use a weighted sum of the unigram, bigram, and trigram estimates
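The weighted sum can be written as a simple linear interpolation of the three estimates. The weights below are arbitrary placeholders chosen to sum to 1, and `interpolated_prob` is a hypothetical helper; in practice the weights would be tuned on held-out data.

```python
def interpolated_prob(p_unigram, p_bigram, p_trigram, weights=(0.1, 0.3, 0.6)):
    """Return c1 * P(w) + c2 * P(w | prev) + c3 * P(w | prev2, prev)."""
    c1, c2, c3 = weights
    if abs(c1 + c2 + c3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return c1 * p_unigram + c2 * p_bigram + c3 * p_trigram


# The trigram was never seen (probability 0), but the bigram and unigram
# estimates still give the word a non-zero score.
p = interpolated_prob(0.01, 0.2, 0.0)
```

Backing the trigram estimate with bigram and unigram estimates keeps the model from assigning zero probability to a word sequence just because that exact trigram was not tracked.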
Near the end of the semester §Time flies like an arrow §Fruit flies like a banana §It is currently hard to incorporate parts of speech and sentence grammar into the probability calculation: there is lots of ambiguity, but humans seem to do it
Conclusion §Speech recognition technology is changing very quickly §Highly parallel §Amenable to hardware implementations