What is computer vision?

What is computer vision? The search for the fundamental visual features, and the two fundamental applications of reconstruction and recognition.
Features, Reconstruction, Recognition.

2012: the current mania and euphoria of the AI revolution.
In 2012, at the annual gathering, there was an improvement of 10% (from 75 to 85): computer vision researchers used machine learning techniques to recognize objects in a large amount of images.
Go back to 1998 (1.5 decades earlier!): textual characters (hand-written and printed) are actually visual! So why the long wait?
A silent hardware revolution: the GPU, somewhat sadly driven by video gaming. Nvidia (a GPU maker) is now in the driving seat of this AI revolution!
2016: AlphaGo beats professionals, a narrow AI program.
Re-shaping AI, computer science, and the digital revolution ...
1587, A Year of No Significance, Ray Huang (万历十五年, 黄仁宇); China: A Macro History (中国大历史).

Visual matching and recognition for understanding
Finding the visually similar things in different images --- visual similarities.
Visual matching: find the 'same' thing under different viewpoints; better defined, with no semantics per se.
Visual recognition: find the pre-trained 'labels', i.e. the semantics. We define the 'labels', then 'learn' from labeled data, and finally classify into these 'labels'.

The state of the art of visual classification and recognition
Anything you can clearly define and label.
Then show a few thousand examples (labeled data) of this thing to the computer.
The computer then recognizes a new image, not seen before, as well as humans do, or even better!
This is done by deep neural networks.

References
CNN for Visual Recognition, Stanford, http://cs231n.github.io/neural-networks-1/
Deep Learning Tutorial, LeNet, Montreal, http://www.deeplearning.net/tutorial/mlp.html
Pattern Recognition and Machine Learning, Bishop
Sparse and Redundant Representations, Elad
Pattern Recognition and Neural Networks, Ripley
Pattern Classification, Duda and Hart, various editions
A Wavelet Tour of Signal Processing: The Sparse Way, Mallat
Introduction to Applied Mathematics, Strang
Some figures and texts in the slides are cut/pasted from these references.

Classification and recognition
Which class is it, for the input x? Make a decision, either by probability (a > b) or by a classification surface (f(x) > 0 or f(x) < 0): this is the forward inference.
How to compute it? Estimate the classification surface f(x) = 0, a (nonlinear and high-dimensional) optimization problem (often of a differentiable log-likelihood): this is the backward learning.
What to minimize? The justification is often probabilistic, and Bayesian.

A (parameterized) score function maps the data to class scores: forward inference, the modeling.
A loss function (objective) measures the quality of a particular set of parameters based on the ground-truth labels.
Optimization minimizes the loss over the parameters, with a regularization: backward learning.

The dataset of pairs (x, y) is given and fixed. The weights start out as random numbers and can change. During the forward pass the score function computes the class scores, stored in a vector f. The loss function contains two components: the data loss computes the compatibility between the scores f and the labels y, and the regularization loss is only a function of the weights. During gradient descent, we compute the gradient of the loss with respect to the weights (and optionally with respect to the data, if we wish) and use it to perform a parameter update.
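As a concrete illustration, here is a minimal numpy sketch of this setup, assuming a linear score function f = Wx + b, a softmax data loss, and L2 regularization; the dataset, shapes, learning rate, and regularization strength are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Toy fixed dataset: N examples of dimension D, labels in {0, ..., C-1}.
N, D, C = 100, 32 * 32 * 3, 10            # assumed, CIFAR-10-like sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)

# The weights start out as small random numbers and can change.
W = 0.01 * rng.standard_normal((D, C))
b = np.zeros(C)
reg, lr = 1e-3, 1e-2                       # assumed hyper-parameters

for step in range(100):
    # Forward pass: the score function computes class scores f.
    f = X @ W + b                          # (N, C)

    # Data loss: softmax cross-entropy between the scores f and the labels y.
    f -= f.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)
    data_loss = -np.log(p[np.arange(N), y]).mean()

    # Regularization loss: a function of the weights only.
    loss = data_loss + 0.5 * reg * np.sum(W * W)

    # Backward pass: gradient on the weights, then a parameter update.
    dp = p.copy()
    dp[np.arange(N), y] -= 1.0
    dp /= N
    dW = X.T @ dp + reg * W
    db = dp.sum(axis=0)
    W -= lr * dW
    b -= lr * db
    if step % 20 == 0:
        print(step, loss)
```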

Bayesian decision
P(ω_j | x) = P(x | ω_j) P(ω_j) / P(x)
posterior = likelihood × prior / evidence
Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2.
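A small worked example of the decision rule; the priors and likelihoods below are hypothetical numbers chosen only to illustrate posterior = likelihood * prior / evidence.

```python
# Two classes w1, w2 with assumed priors and class-conditional likelihoods at an observed x.
prior = {"w1": 0.3, "w2": 0.7}
likelihood = {"w1": 0.8, "w2": 0.2}        # P(x | w_j), hypothetical values

evidence = sum(likelihood[w] * prior[w] for w in prior)           # P(x)
posterior = {w: likelihood[w] * prior[w] / evidence for w in prior}

decision = max(posterior, key=posterior.get)   # decide w1 iff P(w1|x) > P(w2|x)
print(posterior, "->", decision)
```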

Optimization (supervised learning)
Minimize a loss function: the number of errors, the zero-one loss.
The zero-one loss is not differentiable, so we instead maximize the log-likelihood, or equivalently minimize the negative log-likelihood.
We use the gradient of this function.
Stochastic gradient descent uses a few examples at a time instead of the entire training set.
The loss function should be regularized (against the non-uniqueness of an ill-posed solution, as a smoothness constraint, or to avoid overfitting).
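A minimal sketch of a minibatch stochastic gradient descent loop; the helper loss_and_grad, the batch size, and the learning rate are assumptions introduced for illustration.

```python
import numpy as np

def sgd(params, X, y, loss_and_grad, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Minimize a regularized negative log-likelihood with minibatch SGD.

    loss_and_grad(params, X_batch, y_batch) is a user-supplied (hypothetical)
    function returning (loss, dict of gradients keyed like params).
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]   # a few examples at a time
            loss, grads = loss_and_grad(params, X[idx], y[idx])
            for k in grads:
                params[k] -= lr * grads[k]          # gradient step on each parameter
    return params
```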

Optimization (supervised learning)
Training, validation, and testing data.
Hyper-parameters.
Overfitting to the data.
Generalization.
Regularization.

Fundamental linear classifiers
Binary linear classifier: y = f(x) = w · x + b.
The classification surface is a hyper-plane, f(x) = 0.
Geometry: 3-D and n-D. Linear algebra: linear spaces.
For linear classifiers, the decision is a nonlinear thresholding, via a nonlinear distance function or a probability-like sigmoid.
We can take the nonlinear distance function to the hyperplane and interpret this distance as a probability.

A single neuron is a linear classifier
w · x + b: a linear classifier, a neuron.
It is a dot product of two vectors, a scalar product: a template matching, a correlation between the template w and the input vector x.
It is also an algebraic distance, not the geometric one, which is nonlinear (therefore the solution is usually nonlinear!).
The dot product is a metric of distance between two points: one is the data, the other a representative.
'Linear' means that the decision surface is linear, a hyper-plane. The solution, i.e. the training, is usually not linear at all; it depends on the loss function (softmax or SVM) and is solved iteratively by a numerical gradient.
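A minimal sketch of a single neuron as a binary linear classifier, with the dot-product score squashed by a sigmoid into a probability-like value; the weights, bias, and input below are arbitrary illustrative numbers.

```python
import numpy as np

def neuron(x, w, b):
    """Single neuron: dot-product score w.x + b, then a sigmoid squashing."""
    score = np.dot(w, x) + b                 # linear, algebraic "distance" to the hyperplane
    prob = 1.0 / (1.0 + np.exp(-score))      # probability-like output in (0, 1)
    return prob

w = np.array([0.5, -1.2, 0.3])               # assumed template / weight vector
b = 0.1
x = np.array([1.0, 0.2, -0.7])               # assumed input
decision = 1 if neuron(x, w, b) > 0.5 else 0 # threshold at the hyperplane f(x) = 0
```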

A biological neuron and its mathematical model.

From two to N classes
Binary linear classifier: y = f(x) = w · x + b; the classification surface is a hyper-plane, f(x) = 0.
Multi-class: output a vector function, y = f(x) = f(W x + b), followed by the normalized exponentials (softmax), s(f(x)) = (s ∘ f)(x), where s is a kind of normalization.
In W x + b, each row of W is a linear classifier, a neuron.
We can take the nonlinear distance function to the hyperplane and interpret this distance as a probability.
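A small sketch of the multi-class case, assuming illustrative sizes: each row of W is one linear classifier, and the softmax normalizes the raw scores into class probabilities.

```python
import numpy as np

def softmax(scores):
    """Normalized exponentials: turn raw class scores into probabilities."""
    e = np.exp(scores - scores.max())        # subtract max for numerical stability
    return e / e.sum()

def multiclass_scores(x, W, b):
    """W x + b: each row of W is one linear classifier (one neuron)."""
    return W @ x + b

# Assumed sizes: 4-dimensional input, 3 classes.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 4)), np.zeros(3)
x = rng.standard_normal(4)
probs = softmax(multiclass_scores(x, W, b))  # sums to 1; the predicted class is the argmax
```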

Is a linear classifier straightforward?
Only the inference 'scoring' function is linear.
There are no 'analytical' (closed-form) solutions of the loss functions.
The constraints are not equalities but inequalities.

The two common linear classifiers, with different loss functions
SVM: an uncalibrated score.
Softmax: multi-class logistic regression, a normalized class probability for each label.
They usually perform comparably.

Activation (nonlinearity) functions
The sigmoid logistic function s(x) = 1 / (1 + e^(-x)) is normalized to between 0 and 1, so it is naturally probability-like: sigmoid for two classes, and softmax for N classes, e^(x_i) / Σ_j e^(x_j).
It is a normalization of the output data; remember the similar consideration for input data normalization (whitening).
The activation (nonlinearity) function is not necessarily the logistic sigmoid between 0 and 1; others include tanh (centered), ReLU, ...
Sigmoid: kills gradients, not used much any more.
Tanh: 2 s(2x) - 1, centered between -1 and 1, better.
ReLU: max(0, x), very popular recently; do not set too high a learning rate.
Practices: different activations can be mixed at some rate, but use ReLU.
Networks now typically have about 100 million parameters and 10 to 20 layers.
Note that the sigmoid is the softmax with x_1 = 0 and x_2 = x.
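For reference, a short sketch of the three activation functions mentioned above (sigmoid, tanh written as 2·s(2x) − 1, and ReLU).

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, squashes to (0, 1); saturates and kills gradients."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """tanh(x) = 2 * sigmoid(2x) - 1, zero-centered in (-1, 1)."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    """ReLU, max(0, x): cheap and non-saturating for x > 0."""
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```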

From linear to non-linear classifiers
Go higher and stay linear: find a map or transform x ↦ φ(x) that makes the classes linearly separable, but in higher dimensions.
A complete basis of polynomials: too many parameters for the limited training data.
Kernel methods, support vector machines, ...
Learn the nonlinearity at the same time as the linear classifiers: multilayer neural networks.
Multilayer neural networks implement linear classifiers, but in a space where the inputs have been mapped nonlinearly.
They are universal nonlinear approximators, from at least 3 layers (two hidden layers) onwards.
They admit simple algorithms where the form of the nonlinearity can be learned from the training data.
They are extremely powerful, have nice theoretical properties, and apply well to a vast array of applications.

Multi-Layer Perceptrons
An N-layer neural network does not count the input layer, but it does count the output layer: the output layer represents the class-score vector and has no activation function (or the identity activation function).
Activation is a kind of data normalization.
It is better to count the hidden layers:
A one-layer network, f_1(x): linear classifiers (then s_1); no hidden layer.
A two-layer network, f_2 ∘ s_1 ∘ f_1 (x): one hidden layer.
A three-layer network, f_3 ∘ s_2 ∘ f_2 ∘ s_1 ∘ f_1 (x): two hidden layers.
For a model f(x), forward inference computes f(x), and backward learning computes ∇f(x).
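A minimal sketch of the two-layer composition f_2 ∘ s_1 ∘ f_1, assuming ReLU as the hidden activation s_1 and arbitrary illustrative layer sizes; the output layer has no activation, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 3, 4, 2                          # input dim, hidden units, classes (assumed)
W1, b1 = rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = rng.standard_normal((C, H)), np.zeros(C)

def forward(x):
    """Two-layer MLP: f2( s1( f1(x) ) ); the output layer has no activation."""
    h = np.maximum(0.0, W1 @ x + b1)       # f1 then s1 (ReLU): the one hidden layer
    return W2 @ h + b2                     # f2: raw class scores

scores = forward(rng.standard_normal(D))
```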

A 2-layer neural network with three inputs, one hidden layer of 4 neurons (or units), and one output layer with 2 neurons. The network has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.

A 3-layer neural network with three inputs, two hidden layers of 4 neurons each and one output layer. Notice that in both cases there are connections (synapses) between neurons across layers, but not within a layer. The network has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
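A small check of the parameter counts quoted in the two examples above, assuming fully-connected layers (weights = fan-in × fan-out per layer, plus one bias per neuron).

```python
def count_params(layer_sizes):
    """Count weights and biases of a fully-connected network, e.g. [3, 4, 4, 1]."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

assert count_params([3, 4, 2]) == 26       # the 2-layer example above
assert count_params([3, 4, 4, 1]) == 41    # the 3-layer example above
```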

From a regular network to a CNN: a visual machine
The whole network is governed by a differentiable loss function, from the raw pixels to the class scores.
Each layer transforms an input to an output with some differentiable function.
Full connectivity does not scale up with the image size and the number of layers, and it quickly leads to over-fitting.

A regular 3-layer Neural Network. A CNN arranges its neurons in three dimensions (width, height, depth). Every layer of a CNN transforms the 3D input volume to a 3D output volume. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).

We used to convert an input image into a 1-D feature vector: that was feature selection.
We now input the image directly, in 2-D.
The neurons are arranged from 1-D to 2-D, and on to 3-D.
Converting input images into feature vectors loses the spatial neighborhood-ness.
The complexity increases to cubic; yet the connectivities become local to reduce the complexity!

CNN
INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three channels R, G, B.
CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the small region they are connected to in the input volume. This may result in a volume such as [32x32x12] if we decided to use 12 filters.
RELU layer will apply an elementwise activation function, max(0, x). This leaves the size of the volume unchanged ([32x32x12]).
POOL layer will perform a down-sampling along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. Each neuron in this layer will be connected to all the numbers in the previous volume.
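A sketch of this layer stack in PyTorch (the framework is an assumption; the slides do not name one), matching the [32x32x3] → [32x32x12] → [16x16x12] → [1x1x10] volumes described above; note that PyTorch stores volumes channels-first.

```python
import torch
import torch.nn as nn

# INPUT [32x32x3] -> CONV (12 filters) -> RELU -> POOL -> FC (10 class scores)
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),   # [3, 32, 32] -> [12, 32, 32]
    nn.ReLU(),                                    # elementwise max(0, x), same size
    nn.MaxPool2d(2),                              # spatial down-sampling -> [12, 16, 16]
    nn.Flatten(),
    nn.Linear(12 * 16 * 16, 10),                  # FC layer -> 10 class scores
)

x = torch.randn(1, 3, 32, 32)                     # one RGB image, CIFAR-10 sized
scores = tiny_cnn(x)                              # shape [1, 10]
```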

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
where * indicates repetition and POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, and K >= 0 (and usually K < 3). Common ConvNet architectures you may see follow this pattern.
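A hedged PyTorch sketch of a builder for this pattern; the channel width, kernel size, and FC width are arbitrary illustrative choices, and the optional POOL? is always applied here for simplicity.

```python
import torch.nn as nn

def build_convnet(N=2, M=2, K=1, in_ch=3, width=32, num_classes=10, spatial=32):
    """INPUT -> [[CONV -> RELU]*N -> POOL]*M -> [FC -> RELU]*K -> FC."""
    layers, ch = [], in_ch
    for _ in range(M):
        for _ in range(N):
            layers += [nn.Conv2d(ch, width, kernel_size=3, padding=1), nn.ReLU()]
            ch = width
        layers.append(nn.MaxPool2d(2))            # each POOL halves the spatial size
        spatial //= 2
    layers.append(nn.Flatten())
    feat = ch * spatial * spatial
    for _ in range(K):
        layers += [nn.Linear(feat, 256), nn.ReLU()]
        feat = 256
    layers.append(nn.Linear(feat, num_classes))   # final FC: class scores
    return nn.Sequential(*layers)
```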

The initial volume stores the raw image pixels (left) and the last volume stores the class scores (right). Each volume of activations along the processing path is shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and print the labels of each one. The full web-based demo is shown in the header of our website. The architecture shown here is a tiny VGG Net.

CNN layers
Some layers do not have parameters: the RELU and POOL layers implement a fixed function.
Some layers contain parameters: the CONV and FC layers.

The Convolutional Layer
Local connectivity: the receptive field of the neuron, or the filter size. The connections are local in space (width and height), but always full in depth.
A set of learnable filters.
Parameter sharing.

The “convolution”
For a 3-D input volume, the convolution is 2-D within each channel; each channel has a different filter (kernel), and the per-channel convolutions are then summed over all channels to produce a scalar for the nonlinear activation.
Do we need linear-combination parameters?
A convolution can be defined in 1, 2, 3, and N dimensions.
This 2-D convolution is different from a real 3-D convolution, which would integrate spatio-temporal information; the standard CNN convolution spreads only 'spatially'.
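A minimal numpy sketch of this per-channel filtering summed across channels, for a single filter at a single output position (no padding or stride handling; purely illustrative, and written as cross-correlation, as CNNs usually compute).

```python
import numpy as np

def conv_at(x, w, b, r, c):
    """One output scalar at position (r, c): per-channel 2-D correlation, summed over channels.

    x: input volume  (H, W, C_in)
    w: one filter    (k, k, C_in), a different k x k kernel per channel
    b: scalar bias
    """
    k = w.shape[0]
    patch = x[r:r + k, c:c + k, :]             # local receptive field, full in depth
    return float(np.sum(patch * w) + b)        # sum over height, width, and all channels

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))           # assumed input volume
w = rng.standard_normal((3, 3, 3))             # assumed 3x3 filter over 3 channels
out = conv_at(x, w, 0.0, r=10, c=10)           # the scalar fed to the nonlinearity
```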

The Pooling Layer
Reduce the spatial size, reduce the amount of parameters, avoid over-fitting.
Backpropagation through a max: only route the gradient to the input that had the highest value in the forward pass.
It is unclear whether pooling is essential. Data normalization or PCA/whitening is common in general NNs, but in CNNs the benefit of a 'normalization layer' has been shown to be minimal as well.
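A small numpy sketch of 2×2 max pooling and its backward pass, which routes the upstream gradient only to the input that had the highest value in each window; shapes and values are illustrative.

```python
import numpy as np

def maxpool2x2_forward(x):
    """x: (H, W) with even H, W. Returns the pooled map and the argmax mask for backprop."""
    H, W = x.shape
    windows = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(H // 2, W // 2, 4)
    out = windows.max(axis=2)
    mask = windows == out[..., None]           # marks the winning input in each 2x2 window
    return out, mask

def maxpool2x2_backward(dout, mask, shape):
    """Route the upstream gradient only to the inputs that won the max in the forward pass."""
    H, W = shape
    dwin = mask * dout[..., None]              # (H/2, W/2, 4)
    return dwin.reshape(H // 2, W // 2, 2, 2).transpose(0, 2, 1, 3).reshape(H, W)

x = np.arange(16, dtype=float).reshape(4, 4)
out, mask = maxpool2x2_forward(x)
dx = maxpool2x2_backward(np.ones_like(out), mask, x.shape)
```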

Computational complexity
The memory is the bottleneck: a GPU has only a few GB.

CNN applications
Transfer learning: fine-tuning the CNN.
Keep some early layers: early layers contain more generic features (edges, color blobs), common to many visual tasks.
Fine-tune the later layers, which are more specific to the details of the classes.
CNN as feature extractor: remove the last fully-connected layer, and use the activations as a kind of descriptor or 'CNN code' for the image. AlexNet gives a 4096-dimensional descriptor.
http://cs231n.github.io/neural-networks-1/
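A hedged PyTorch/torchvision sketch of both uses described above: freezing the generic early layers while replacing and fine-tuning the last fully-connected layer, and dropping that last layer to read out a 4096-dimensional descriptor; torchvision's AlexNet layout and the 20-class target task are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(weights="IMAGENET1K_V1")   # pretrained AlexNet from torchvision

# Transfer learning: freeze the generic early (convolutional) layers ...
for p in model.features.parameters():
    p.requires_grad = False
# ... and fine-tune only the later, task-specific part; replace the last FC layer.
model.classifier[6] = nn.Linear(4096, 20)         # assumed new task with 20 classes

# CNN as feature extractor: drop the last FC layer to get a 4096-D descriptor (CNN code).
extractor = nn.Sequential(model.features, model.avgpool, nn.Flatten(),
                          *list(model.classifier.children())[:-1])
with torch.no_grad():
    code = extractor(torch.randn(1, 3, 224, 224)) # shape [1, 4096]
```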

Open questions
It is only empirical that deeper is better; images contain hierarchical structures.
Overfitting and generalization: meaningful data! Intrinsic laws.
Networks are non-convex and need regularization.
Smaller networks are hard to train with local methods: their local minima are bad in loss, not stable, with large variance.
Bigger networks are easier: more local minima, but better ones, more stable, with small variance.
Go as big as the computational power, and the data, allow!
