What is computer vision?

Presentation transcript:

1 What is computer vision?
The search for the fundamental visual features, and the two fundamental applications built on them: reconstruction and recognition.

2 2012: the current mania and euphoria of the AI revolution
2012, at the annual ImageNet benchmark gathering: an improvement of about 10% (from roughly 75 to 85) when computer vision researchers used machine learning techniques to recognize objects in large collections of images.
Go back to 1998 (a decade and a half earlier!) and LeNet: text (hand-written and printed) is actually visual. And why did it take so long?
A silent hardware revolution: the GPU, driven, sadly, by video gaming. Nvidia (a GPU maker) is now in the driving seat of this AI revolution!
2016: AlphaGo beats professional players, a narrow AI program.
This is re-shaping AI, computer science, and the digital revolution.
(Cf. 1587, A Year of No Significance (万历十五年), Ray Huang (黄仁宇); China: A Macro History (中国大历史).)

3 Visual matching and recognition for understanding
Finding visually similar things in different images: visual similarities.
Visual matching: find the 'same' thing under different viewpoints; better defined, with no semantics per se.
Visual recognition: find pre-trained 'labels', i.e. semantics. We define 'labels', then 'learn' from labeled data, and finally classify new inputs into those 'labels'.

4 The state-of-the-art of visual classification and recognition
Anything you can clearly define and label.
Then show a few thousand examples (labeled data) of this thing to the computer.
The computer then recognizes a new image, not seen before, about as well as humans, sometimes even better.
This is done by deep neural networks.

5 References
Convolutional Neural Networks for Visual Recognition, Stanford
Deep Learning Tutorial (LeNet), Montreal
Pattern Recognition and Machine Learning, Bishop
Sparse and Redundant Representations, Elad
Pattern Recognition and Neural Networks, Ripley
Pattern Classification, Duda and Hart, various editions
A Wavelet Tour of Signal Processing: The Sparse Way, Mallat
Introduction to Applied Mathematics, Strang
Some figures and texts in the slides are adapted from these references.

6 Classification and recognition
Where is it, for the input x? Make a decision, either by comparing probabilities (a > b) or by a classification surface (f(x) > 0 or < 0): forward inference.
How to compute it? Estimate the classification surface f(x) = 0, a (nonlinear, high-dimensional) optimization problem, often over a differentiable log-likelihood: backward learning.
What to minimize? The justification is often probabilistic, and Bayesian.

7 A (parameterized) score function mapping the data to class scores: forward inference, modeling
A loss function (objective) measuring the quality of a particular set of parameters based on the ground-truth labels.
Optimization: minimize the loss over the parameters, with regularization: backward learning.

8 The dataset of pairs (x, y) is given and fixed
The weights start out as random numbers and can change. During the forward pass, the score function computes the class scores, stored in a vector f. The loss function contains two components: the data loss computes the compatibility between the scores f and the labels y, and the regularization loss is a function of the weights only. During gradient descent, we compute the gradient of the loss with respect to the weights (and optionally with respect to the data, if we wish) and use it to perform a parameter update. A minimal sketch of one such step follows below.
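To make the forward/backward picture concrete, here is a minimal numpy sketch of one training step for a linear classifier: a score function W x + b, a softmax data loss plus an L2 regularization loss, and one gradient-descent parameter update. All names and hyper-parameters (W, b, lam, lr) are illustrative assumptions, not the slide's exact setup.

```python
import numpy as np

def train_step(W, b, X, y, lam=1e-3, lr=1e-2):
    N = X.shape[0]
    # Forward pass: class scores, one row per example.
    f = X @ W + b                              # shape (N, C)
    # Softmax data loss (cross-entropy), numerically stabilized.
    f -= f.max(axis=1, keepdims=True)
    p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)
    data_loss = -np.log(p[np.arange(N), y]).mean()
    reg_loss = 0.5 * lam * np.sum(W * W)       # depends on the weights only
    # Backward pass: gradient of the loss w.r.t. the parameters.
    dscores = p.copy()
    dscores[np.arange(N), y] -= 1.0
    dscores /= N
    dW = X.T @ dscores + lam * W
    db = dscores.sum(axis=0)
    # Parameter update: one gradient-descent step.
    W -= lr * dW
    b -= lr * db
    return data_loss + reg_loss, W, b
```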

9 Bayesian decision
P(ω_j | x) = P(x | ω_j) P(ω_j) / P(x)
posterior = likelihood × prior / evidence
Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2.
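A tiny numeric sketch of the two-class decision rule; the priors and Gaussian class-conditional likelihoods below are made-up numbers for illustration only.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Made-up priors and class-conditional densities for two classes w1, w2.
priors = np.array([0.6, 0.4])                            # P(w1), P(w2)
likelihood = lambda x: np.array([gaussian_pdf(x, 0.0, 1.0),
                                 gaussian_pdf(x, 2.0, 1.0)])

def decide(x):
    joint = likelihood(x) * priors            # P(x | w_j) * P(w_j)
    posterior = joint / joint.sum()           # divide by the evidence P(x)
    return posterior.argmax() + 1             # decide w1 or w2

print(decide(0.3), decide(1.8))               # -> 1 2
```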

10 Optimization (supervised learning)
Minimize a loss function.
The number of errors: the zero-one loss.
The zero-one loss is not differentiable, so instead we maximize the log-likelihood, i.e. minimize the negative log-likelihood.
We use the gradient of this function.
Stochastic gradient descent uses a few examples at a time instead of the entire training set (see the sketch below).
The loss function should be regularized: the problem is otherwise ill-posed with non-unique solutions; regularization acts as a smoothness constraint and helps avoid overfitting.
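A minimal sketch of a minibatch stochastic-gradient-descent loop, reusing the hypothetical train_step from the earlier sketch; the batch size and number of epochs are arbitrary assumptions.

```python
import numpy as np

def sgd(W, b, X, y, epochs=10, batch=64, lam=1e-3, lr=1e-2):
    N = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(N)
        for start in range(0, N, batch):
            sel = idx[start:start + batch]      # a few examples at a time
            loss, W, b = train_step(W, b, X[sel], y[sel], lam, lr)
    return W, b
```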

11 Optimization (supervised learning)
Training, validation, and testing data.
Hyper-parameters (chosen on the validation set, never on the test set).
Overfitting to the training data versus generalization to unseen data.
Regularization.
A minimal split-and-select sketch follows below.
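A minimal sketch of choosing one hyper-parameter (here, the regularization strength lam) on a validation split; sgd and train_step are the hypothetical helpers from the earlier sketches, and the 80/20 split and candidate values are assumptions.

```python
import numpy as np

def accuracy(W, b, X, y):
    return np.mean((X @ W + b).argmax(axis=1) == y)

def select_lambda(X, y, lams=(1e-4, 1e-3, 1e-2, 1e-1)):
    N = X.shape[0]
    idx = np.random.permutation(N)
    tr, va = idx[:int(0.8 * N)], idx[int(0.8 * N):]   # 80/20 train/validation split
    best = None
    for lam in lams:
        W = 0.01 * np.random.randn(X.shape[1], y.max() + 1)
        b = np.zeros(y.max() + 1)
        W, b = sgd(W, b, X[tr], y[tr], lam=lam)
        acc = accuracy(W, b, X[va], y[va])            # validate, never touch the test set
        if best is None or acc > best[0]:
            best = (acc, lam)
    return best[1]
```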


13 Fundamental linear classifiers
Binary linear classifier: y = f(x) = w · x + b.
The classification surface is a hyperplane f(x) = 0.
Geometry: 3-d and n-d. Linear algebra: linear spaces.
The decision is a nonlinear thresholding of the linear score, via a nonlinear distance function or a probability-like sigmoid.
We can take the (nonlinear) distance function to the hyperplane and interpret it as a probability, as sketched below.
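A minimal sketch of a binary linear classifier whose score is squashed into a probability by a sigmoid; the weights w, bias b, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0])            # illustrative weights
b = 0.5                              # illustrative bias

def classify(x):
    score = w @ x + b                # linear score; f(x) = 0 is the hyperplane
    prob = sigmoid(score)            # probability-like interpretation
    return prob > 0.5, prob          # thresholding score at 0 <=> prob at 0.5

print(classify(np.array([2.0, 0.1])))   # (True, ...): the positive side of the plane
```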

14 A single neuron is a linear classifier
w · x + b: a linear classifier, a neuron.
It is a dot product of two vectors, a scalar product.
It is a template matching, a correlation: the template w against the input vector x.
It is also an algebraic distance, not the geometric one, which is nonlinear (therefore the solution is usually nonlinear!).
The dot product acts as a similarity between two points: one the data, the other the representative template.
'Linear' means that the decision surface is linear, a hyperplane. The solution, i.e. the training, is usually not linear at all; it depends on the loss function (softmax or SVM) and is solved iteratively with numerical gradients.

15 A biological neuron and its mathematical model.

16 From two to N classes
Binary linear classifier: y = f(x) = w · x + b; the classification surface is a hyperplane f(x) = 0.
Multi-class: output a vector function y = f(x) = f(W x + b).
The normalized exponentials (softmax): s(f(x)) = (s ∘ f)(x), where s is a kind of normalization.
In W x + b, each row of W is a linear classifier, a neuron.
We can take the nonlinear distance function to the hyperplane and interpret it as a probability; a small sketch follows below.
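A minimal sketch of the multi-class case: W x + b produces one score per class (one row of W per classifier), and softmax normalizes the scores into probabilities. The numbers are made up.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())         # shift for numerical stability
    return e / e.sum()

W = np.array([[ 1.0, -0.5],                   # each row: one class's neuron
              [ 0.2,  0.8],
              [-1.0,  0.3]])
b = np.array([0.0, 0.1, -0.2])

x = np.array([0.5, 1.5])
scores = W @ x + b                            # one linear classifier per row
probs = softmax(scores)                       # normalized exponentials
print(scores, probs, probs.argmax())
```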

17 Is a linear classifier straightforward?
Only the inference 'scoring' function is linear.
There are no 'analytical' (closed-form) solutions of the loss functions.
Not equalities, but inequalities.

18 The two common linear classifiers, with different loss functions
SVM: an uncalibrated score.
Softmax: multi-class logistic regression, a normalized class probability for each label.
In practice they usually perform comparably; both losses are sketched below.
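A minimal sketch of the two loss functions evaluated on a single example's class scores; the margin of 1 in the SVM loss and the example numbers are assumptions.

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])     # made-up class scores f
y = 0                                   # the correct class

# Multi-class SVM (hinge) loss with margin 1: works on uncalibrated scores.
margins = np.maximum(0, scores - scores[y] + 1.0)
margins[y] = 0.0
svm_loss = margins.sum()

# Softmax (cross-entropy) loss: scores turned into normalized probabilities.
e = np.exp(scores - scores.max())
probs = e / e.sum()
softmax_loss = -np.log(probs[y])

print(svm_loss, softmax_loss)
```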

19 Activation (nonlinearity) functions
Sigmoid (logistic) function s(x) = 1/(1 + e^{-x}): normalized to between 0 and 1, so it is naturally probability-like; sigmoid for two classes, softmax for N classes, e^{x_i} / Σ_j e^{x_j}.
It is a normalization of the output data; recall the similar consideration for input data normalization (whitening).
The activation (nonlinearity) function is not necessarily the logistic sigmoid between 0 and 1; others include tanh (centered), ReLU, ...
Sigmoid: kills gradients, rarely used any more.
Tanh: 2 s(2x) - 1, centered between -1 and 1, better.
ReLU: max(0, x), very popular recently; do not set the learning rate too high.
In practice: it is rare to mix different activations; use ReLU.
Networks now typically have on the order of 100 million parameters and 10 to 20 layers.
(The sigmoid is the two-class softmax with x1 = 0 and x2 = x.)
A small sketch of these activations follows below.
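A minimal numpy sketch of the three activations mentioned above, including a check of the identity tanh(x) = 2 s(2x) - 1; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes to (0, 1)

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0      # same as np.tanh, centered in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                # max(0, x)

x = np.linspace(-3, 3, 7)
print(sigmoid(x))
print(tanh(x), np.allclose(tanh(x), np.tanh(x)))   # verifies 2 s(2x) - 1 = tanh(x)
print(relu(x))
```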


21 From linear to non-linear classifiers
Go higher, and stay linear: find a map or transform x ↦ φ(x) that makes the classes linearly separable, but in higher dimensions (a small sketch follows below).
A complete basis of polynomials: too many parameters for the limited training data.
Kernel methods, support vector machines, ...
Or learn the nonlinearity at the same time as the linear classifier: multilayer neural networks.
Multilayer neural networks implement linear classifiers, but in a space where the inputs have been mapped nonlinearly.
They are universal nonlinear approximators, from at least three layers (two hidden layers) onwards.
They admit simple algorithms where the form of the nonlinearity can be learned from the training data.
They are extremely powerful, have nice theoretical properties, and apply well to a vast array of applications.
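A minimal sketch of the "go higher and linear" idea: a hand-picked quadratic feature map φ makes a radially-separated dataset linearly separable in the lifted space. The map and the synthetic data are illustrative assumptions.

```python
import numpy as np

# Points inside a circle of radius 1 vs outside: not linearly separable in 2-D.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X ** 2).sum(axis=1) < 1.0

# Quadratic feature map phi(x) = (x1^2, x2^2): in this lifted space the classes
# are separated by the linear rule z1 + z2 < 1, i.e. a hyperplane.
Z = X ** 2
pred = Z.sum(axis=1) < 1.0
print((pred == y).mean())        # 1.0: linearly separable after the lift
```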

22 Multi Layer Perceptrons
An N-layer neural network does not count the input layer.
But it does count the output layer: the output layer represents the class-score vector and has no activation function (or the identity activation).
Activation is a kind of data normalization.
It is clearer to count the hidden layers:
a one-layer network, f1(x): a linear classifier (then s1), no hidden layer;
a two-layer network, f2 ∘ s1 ∘ f1 (x): one hidden layer;
a three-layer network, f3 ∘ s2 ∘ f2 ∘ s1 ∘ f1 (x): two hidden layers.
For a model f(x): forward inference computes f(x), and backward learning computes ∇f(x).
A two-layer forward pass is sketched below.
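A minimal forward pass of a two-layer network f2(s1(f1(x))): one hidden layer with a ReLU nonlinearity and an identity activation on the output layer. All sizes and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 3, 4, 2                      # inputs, hidden units, classes
W1, b1 = 0.1 * rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = 0.1 * rng.standard_normal((C, H)), np.zeros(C)

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)   # f1 then s1 (ReLU): the hidden layer
    return W2 @ h + b2                 # f2: the output layer, class scores

print(forward(rng.standard_normal(D)))
```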

23 A 2-layer Neural Network: one hidden layer of 4 neurons (or units), one output layer with 2 neurons, and three inputs. The network has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.

24 A 3-layer neural network with three inputs, two hidden layers of 4 neurons each, and one output layer. Notice that in both cases there are connections (synapses) between neurons across adjacent layers, but not within a layer. The network has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
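A minimal helper that reproduces the parameter counts of both fully-connected networks above; the layer-size lists are the ones just described.

```python
def count_params(sizes):
    """sizes = [inputs, hidden..., outputs] of a fully-connected network."""
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])                 # one bias per non-input neuron
    return weights, biases, weights + biases

print(count_params([3, 4, 2]))      # (20, 6, 26) for the 2-layer network
print(count_params([3, 4, 4, 1]))   # (32, 9, 41) for the 3-layer network
```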

25 From a regular network to CNN: a visual machine
The whole network is governed by a single differentiable loss function: from the raw pixels to the class scores.
Each layer transforms an input to an output with some differentiable function.
Full connectivity does not scale with image size and the number of layers, and it quickly leads to over-fitting.

26 A regular 3-layer Neural Network.
A CNN arranges its neurons in three dimensions (width, height, depth). Every layer of a CNN transforms the 3D input volume to a 3D output volume. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).

27 We used to convert an input image into a feature vector, 1D
That was feature selection.
We now input the image directly, as 2D.
The neurons are arranged from 1D to 2D, and on to 3D volumes.
Converting input images into feature vectors loses the spatial neighborhood structure.
The complexity increases to cubic, yet the connectivity becomes local, which reduces the complexity.

28 CNN
INPUT [32x32x3] holds the raw pixel values of the image, in this case an image of width 32, height 32, and three channels R, G, B.
The CONV layer computes the output of neurons that are connected to local regions in the input, each computing a dot product between its weights and the small region it is connected to in the input volume. This may result in a volume such as [32x32x12] if we decide to use 12 filters.
The RELU layer applies an elementwise activation function, max(0, x). This leaves the size of the volume unchanged ([32x32x12]).
The POOL layer performs a down-sampling along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
The FC (fully-connected) layer computes the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. Each neuron in this layer is connected to all the numbers in the previous volume.
The shape arithmetic is sketched below.
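A minimal sketch of the shape arithmetic behind this pipeline, using the standard output-size formula (W - F + 2P)/S + 1; the filter size, padding, and stride values are assumptions chosen to reproduce the volumes quoted above.

```python
def conv_out(w, f, p, s):
    """Spatial output size of a conv/pool layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# INPUT [32x32x3] -> CONV, 12 filters of 3x3, pad 1, stride 1 -> [32x32x12]
print(conv_out(32, f=3, p=1, s=1), 12)
# RELU keeps the volume unchanged: [32x32x12]
# POOL 2x2, stride 2 -> [16x16x12]
print(conv_out(32, f=2, p=0, s=2), 12)
# FC -> [1x1x10] class scores
```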

29 INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
Here * indicates repetition, and POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, and K >= 0 (and usually K < 3). Some common ConvNet architectures that follow this pattern are sketched below.
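The original list of examples did not survive extraction; the following instantiations of the pattern are illustrative assumptions written in the same notation, not a list of particular published networks.

INPUT -> FC  (a plain linear classifier: N = 0, M = 0, K = 0)
INPUT -> CONV -> RELU -> FC
INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC
INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC  (stacked CONV layers before each POOL, as in deeper networks)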

30 The initial volume stores the raw image pixels (left) and the last volume stores the class scores (right). Each volume of activations along the processing path is shown as a column. Since it is difficult to visualize 3D volumes, we lay out each volume's slices in rows. The last-layer volume holds the scores for each class, but only the sorted top 5 scores are visualized, with the label of each printed. The architecture shown is a tiny VGG Net; a full web-based demo is available online.

31 CNN layers
Some layers have no parameters: the RELU and POOL layers implement a fixed function.
Some layers contain parameters: the CONV and FC layers.

32 The Convolutional Layer
Local connectivity: the receptive field of the neuron, i.e. the filter size. The connections are local in space (width and height), but always full along the depth.
A set of learnable filters.
Parameter sharing.
A worked parameter count follows below.
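A minimal worked example of how local connectivity plus parameter sharing keeps the parameter count small; the filter size and counts are assumptions consistent with the [32x32x3] -> [32x32x12] example above.

```python
def conv_params(f, depth_in, k):
    """Parameters of a CONV layer: K filters of size F x F x depth_in, plus K biases."""
    return k * (f * f * depth_in) + k

# 12 filters of size 3x3 over a depth-3 input, shared across all 32x32 positions.
print(conv_params(f=3, depth_in=3, k=12))        # 336 parameters
# A fully-connected layer from 32*32*3 inputs to 32*32*12 outputs, for contrast:
print(32*32*3 * 32*32*12 + 32*32*12)             # ~37.8 million parameters
```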

33 The "convolution"
For a 3D input image, the convolution is 2D within each channel; each channel has a different filter (kernel), and the per-channel convolutions are then summed over all channels to produce a scalar for the nonlinear activation (see the sketch below).
Do we need additional linear-combination parameters across channels?
A convolution can be defined in 1, 2, 3, and N dimensions.
This 2D convolution is different from a real 3D convolution, which would integrate spatio-temporal information; the standard CNN convolution has only 'spatial' spreading.
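A naive sketch of the CNN "convolution" at one output location: a 2D filter per channel, correlated with the local patch of that channel, then summed across channels into a single scalar (before bias and activation). The shapes are illustrative: a 3-channel input and a 3x3 receptive field.

```python
import numpy as np

def conv_at(x, w, b, r, c):
    """x: (H, W, D) input volume; w: (F, F, D) filter; (r, c): top-left of the patch."""
    F = w.shape[0]
    patch = x[r:r+F, c:c+F, :]            # local in space, full in depth
    return np.sum(patch * w) + b          # per-channel products, summed over all channels

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))
w = rng.standard_normal((3, 3, 3))
print(conv_at(x, w, b=0.1, r=10, c=20))
```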

34 The Pooling Layer
Reduce the spatial size.
Reduce the amount of parameters.
Avoid over-fitting.
Backpropagation through a max: only route the gradient to the input that had the highest value in the forward pass (see the sketch below).
It is unclear whether pooling is essential.
Data normalization or PCA/whitening is common in general neural networks, but in CNNs the benefit of a 'normalization layer' has been shown to be minimal as well.
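A minimal sketch of 2x2 max pooling on a single channel, plus the backward "routing" of the gradient to the position that won the forward max; the sizes are illustrative.

```python
import numpy as np

def maxpool2x2(x):
    """x: (H, W) with even H, W. Returns the pooled map and an argmax mask for backprop."""
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(H // 2, W // 2, 4)
    out = blocks.max(axis=2)
    mask = blocks == out[..., None]          # marks the winner inside each 2x2 window
    return out, mask

def maxpool2x2_backward(dout, mask):
    """Route each upstream gradient only to the input that had the highest value."""
    H2, W2, _ = mask.shape
    dblocks = mask * dout[..., None]
    return dblocks.reshape(H2, W2, 2, 2).transpose(0, 2, 1, 3).reshape(H2 * 2, W2 * 2)

x = np.arange(16.0).reshape(4, 4)
out, mask = maxpool2x2(x)
print(out)                                   # [[ 5.  7.], [13. 15.]]
dx = maxpool2x2_backward(np.ones_like(out), mask)
print(dx)                                    # nonzero only at the max positions
```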

35 Computational complexity
Memory is the bottleneck: a GPU has only a few GB. A back-of-the-envelope estimate follows below.
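A back-of-the-envelope sketch of why memory is the bottleneck: storing the activations of a single [224x224x64] conv volume in float32, over a batch of images. The layer size and batch size are made-up but typical-looking assumptions.

```python
# Activation memory of one conv volume, float32 (4 bytes per value).
h, w, depth, batch = 224, 224, 64, 128
bytes_per_image = h * w * depth * 4
total = bytes_per_image * batch
print(bytes_per_image / 2**20, "MiB per image")   # ~12.25 MiB
print(total / 2**30, "GiB for the batch")         # ~1.5 GiB for this one layer alone
```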

36 CNN applications: transfer learning, fine-tuning the CNN
Keep some early layers: early layers contain more generic features (edges, color blobs), common to many visual tasks.
Fine-tune the later layers, which are more specific to the details of the classes.
CNN as a feature extractor: remove the last fully-connected layer; the remaining activations form a kind of descriptor or 'CNN code' for the image. AlexNet gives a 4096-dimensional descriptor (see the sketch below).
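A minimal sketch of the "CNN as feature extractor" idea, assuming PyTorch/torchvision are available: drop the last fully-connected layer of a pretrained AlexNet so the network outputs a 4096-dimensional descriptor per image. The module layout and weight name are torchvision's, not the slide's.

```python
import torch
import torchvision.models as models

# Pretrained AlexNet; older torchvision versions use pretrained=True instead.
model = models.alexnet(weights="IMAGENET1K_V1")
# Remove the final classification layer, keeping the 4096-d "CNN code" output.
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])
model.eval()

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)           # stands in for a preprocessed image
    code = model(x)
print(code.shape)                              # torch.Size([1, 4096])
```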

37 Open questions
It is only empirical that deeper is better; images contain hierarchical structures.
Overfitting and generalization: meaningful data matter. What are the intrinsic laws?
Networks are non-convex and need regularization.
Smaller networks are hard to train with local methods: their local minima are bad in loss, not stable, with large variance.
Bigger networks are easier: they have more local minima, but better ones, more stable, with smaller variance.
Go as big as the computational power and the data allow!


