1
Hierarchical Neural Networks for Object Recognition and Scene “Understanding”
2
Object Recognition Task: Given an image containing foreground objects, predict one of a set of known categories. "Airplane" "Motorcycle" "Fox" From Mick Thomure, Ph.D. Defense, PSU, 2013
3
Selectivity: ability to distinguish categories “Bird” “No-Bird” 3 From Mick Thomure, Ph.D. Defense, PSU, 2013
4
Invariance: tolerance to variation (in-category variation, rotation, scaling, translation) From Mick Thomure, Ph.D. Defense, PSU, 2013
5
What makes object recognition a hard task for computers?
6
From: http://play.psych.mun.ca/~jd/4051/The%20Primary%20Visual%20Cortex.ppt
7
From: http://psychology.uwo.ca/fmri4newbies/Images/visualareas.jpg
8
Hubel and Wiesel’s discoveries Based on single-cell recordings in cats and monkeys Found that in V1, most neurons are one of the following types: – Simple: Respond to edges at particular locations and orientations within the visual field – Complex: Also respond to edges, but are more tolerant of location and orientation variation than simple cells – Hypercomplex or end-stopped: Are selective for a certain length of contour Adapted from: http://play.psych.mun.ca/~jd/4051/The%20Primary%20Visual%20Cortex.ppt
10
Neocognitron Hierarchical neural network proposed by K. Fukushima in 1980. Inspired by Hubel and Wiesel's studies of visual cortex.
11
HMAX Riesenhuber, M. & Poggio, T. (1999), "Hierarchical Models of Object Recognition in Cortex" Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., and Poggio, T. (2006), "Robust Object Recognition with Cortex-Like Mechanisms" HMAX: A hierarchical neural-network model of object recognition. Meant to model human vision at the level of the "immediate recognition" capabilities of the ventral visual pathway, independent of attention or other top-down processes. Inspired by the earlier "Neocognitron" model of Fukushima (1980).
12
General ideas behind the model "Immediate" visual processing is feedforward and hierarchical: low levels detect simple features, which are combined hierarchically into increasingly complex features to be detected. Layers of the hierarchy alternate between "selectivity" (detecting specific features) and "invariance" (to position, scale, orientation). The size of receptive fields increases along the hierarchy. The degree of invariance increases along the hierarchy.
13
HMAX State-of-the-art performance on common benchmarks. 1500+ references to HMAX since 1999. Many extensions and applications: biometrics, remote sensing, modeling visual processing in biology. From Mick Thomure, Ph.D. Defense, PSU, 2013
14
Template Matching: selectivity for visual patterns. Pooling: invariance to transformation, by combining multiple inputs. (Figure: a template is matched against ON/OFF input patterns; matching responses are then pooled.) From Mick Thomure, Ph.D. Defense, PSU, 2013
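These two operations can be made concrete with a short NumPy sketch (normalized correlation as the template match and a plain max as the pooling are illustrative choices, not the exact HMAX definitions):

    import numpy as np

    def template_match(patch, template):
        """Selectivity: the response is high only when the patch resembles the template."""
        patch = patch.ravel() - patch.mean()
        template = template.ravel() - template.mean()
        denom = np.linalg.norm(patch) * np.linalg.norm(template) + 1e-8
        return float(patch @ template) / denom        # normalized correlation in [-1, 1]

    def pool_max(responses):
        """Invariance: combine many match responses by taking the maximum,
        so the exact position of the best match no longer matters."""
        return float(np.max(responses))

    # Toy usage: match a 3x3 vertical-edge template at every offset, then pool.
    rng = np.random.default_rng(0)
    image = rng.normal(size=(8, 8))
    template = np.array([[-1.0, 0.0, 1.0]] * 3)
    responses = [template_match(image[r:r + 3, c:c + 3], template)
                 for r in range(6) for c in range(6)]
    print(pool_max(responses))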
15
S1 Layer: edge detection. (Figure: Input → S1 edge detectors.) From Mick Thomure, Ph.D. Defense, PSU, 2013
16
C1 Layer: local pooling. Some tolerance to position and scale. (Figure: Input → S1 → C1.) From Mick Thomure, Ph.D. Defense, PSU, 2013
17
S2 Layer: prototype matching. Match a learned dictionary of shape prototypes. (Figure: Input → S1 → C1 → S2.) From Mick Thomure, Ph.D. Defense, PSU, 2013
18
C2 Layer: global pooling. Activity is the max response to an S2 prototype at any location or scale. (Figure: Input → S1 → C1 → S2 → C2.) From Mick Thomure, Ph.D. Defense, PSU, 2013
19
Properties of C2 Activity 1. Activity reflects degree of match 2. Location and size do not matter 3. Only best match counts (Figure: prototype, input, and resulting activation.) From Mick Thomure, Ph.D. Defense, PSU, 2013
22
Classifier: predict object category. The output of the C2 layer forms a feature vector to be classified. Some possible classifiers: SVM, boosted decision stumps, logistic regression. (Figure: Input → S1 → C1 → S2 → C2 → classifier.) From Mick Thomure, Ph.D. Defense, PSU, 2013
23
Gabor Filters Gabor filter: essentially a localized Fourier transform, focusing on a particular spatial frequency, orientation, and scale in the image. Each filter has an associated wavelength λ (spatial frequency 1/λ), scale s, and orientation θ. The response measures the extent to which spatial frequency 1/λ is present at orientation θ and scale s, centered about pixel (x, y).
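For reference, the two-dimensional Gabor function used in the HMAX implementations of Serre et al. has the form

    G(x, y) = exp( -(X² + γ²·Y²) / (2σ²) ) · cos( 2π·X / λ ),  where  X = x·cosθ + y·sinθ  and  Y = -x·sinθ + y·cosθ,

with θ the orientation, λ the wavelength (1/λ the spatial frequency), σ the width of the Gaussian envelope (which sets the scale), and γ the aspect ratio; the specific parameter values are tabulated in their papers.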
24
Example Gabor filters for different parameters: θ = 0 with λ = 5, 10, 15; and λ = 10 with θ = 0, 45, 90. Generated with http://matlabserver.cs.rug.nl/cgi-bin/matweb.exe
25
Examples of Gabor filters of different orientations and scales From http://www.cs.cmu.edu/~efros/courses/LBMV07/presentations/0130SerreC2Features.ppt
26
HMAX: How to set parameters for the Gabor filters Sample the parameter space. Apply the corresponding filters to stimuli commonly used to probe cortical cells (e.g., gratings, bars, edges). Select parameter values that capture the tuning properties of V1 simple cells, as reported in the neuroscience literature.
27
Learning V1 Simple Cells via Sparse Coding Olshausen & Field, 1996 Hypotheses: – Any natural image I(x,y) can be represented by a linear superposition of (not necessarily orthogonal) basis functions ϕ_i(x,y): I(x,y) = Σ_i a_i ϕ_i(x,y) – The ϕ_i span the image space (i.e., any image can be reconstructed with an appropriate choice of the a_i) – The a_i are as statistically independent as possible – Any natural image has a representation that is sparse (i.e., can be represented by a small number of non-zero a_i)
28
Test of hypothesis: use these criteria to learn a set of ϕ_i from a database of natural images. Use gradient descent with the following cost function: E = Σ_{x,y} [ I(x,y) - Σ_i a_i ϕ_i(x,y) ]² + λ Σ_i S(a_i / σ_i). The first term is the cost of incorrect reconstruction; the second is the cost of non-sparseness (using too many a_i), where S is a nonlinear function and σ_i is a scaling constant.
29
Training set: a set of 12x12 image patches from natural images. Start with a large random set (144) of ϕ_i. For each patch, – Find the set of a_i that minimizes E with respect to the a_i – With these a_i fixed, update the ϕ_i using the learning rule Δϕ_i(x,y) = η · a_i · [ I(x,y) - Î(x,y) ], where Î(x,y) = Σ_j a_j ϕ_j(x,y) is the current reconstruction and η is a learning rate.
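A minimal sketch of this alternating optimization in Python/NumPy; the random stand-in patches, the step sizes, and the choice S(u) = log(1 + u²) are illustrative assumptions, not Olshausen & Field's exact settings:

    import numpy as np

    rng = np.random.default_rng(0)
    P, N, K = 12 * 12, 500, 144            # pixels per patch, number of patches, number of basis functions
    patches = rng.normal(size=(N, P))      # stand-in for whitened natural-image patches
    patches -= patches.mean(axis=1, keepdims=True)

    phi = rng.normal(size=(K, P))          # basis functions, initialized randomly
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)
    lam, sigma, eta_a, eta_phi = 0.1, 1.0, 0.05, 0.01

    def d_sparse(a):                       # derivative of S(a/sigma) with S(u) = log(1 + u**2)
        u = a / sigma
        return (2 * u / (1 + u ** 2)) / sigma

    for epoch in range(20):
        for x in patches:
            a = np.zeros(K)
            for _ in range(30):            # inner loop: minimize E over the coefficients a_i
                a += eta_a * (phi @ (x - a @ phi) - lam * d_sparse(a))
            # outer update: nudge each basis function toward the residual it should explain
            phi += eta_phi * np.outer(a, x - a @ phi)
            phi /= np.linalg.norm(phi, axis=1, keepdims=True)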
30
From http://redwood.berkeley.edu/bruno/papers/current-opinion.pdf These resemble receptive fields of V1 simple cells: they are (1) localized, (2) orientation-specific, (3) frequency- and scale-specific.
31
S1 units: Gabor filters (one per pixel). 16 scales / frequencies, 4 orientations (0°, 45°, 90°, 135°). Units form a pyramid of scales, from 7x7 to 37x37 pixels in steps of two pixels. The response of an S1 unit is the absolute value of the filter response.
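A sketch of how such an S1 filter bank might be built in Python/NumPy, using the Gabor form given above; the envelope/wavelength ratio (0.8), aspect ratio (0.3), and the wavelength-per-size rule are illustrative guesses, not the values tabulated by Serre et al.:

    import numpy as np

    def gabor_kernel(size, wavelength, theta, gamma=0.3):
        """Gabor kernel: Gaussian envelope times an oriented cosine grating."""
        sigma = 0.8 * wavelength                      # assumed envelope/wavelength ratio
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        X = x * np.cos(theta) + y * np.sin(theta)
        Y = -x * np.sin(theta) + y * np.cos(theta)
        g = np.exp(-(X**2 + gamma**2 * Y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * X / wavelength)
        g -= g.mean()
        return g / np.linalg.norm(g)

    def s1_response(image_patch, kernel):
        """S1 response: absolute value of the filter response (patch and kernel same size)."""
        return abs(float(np.sum(image_patch * kernel)))

    # 16 filter sizes (7x7 .. 37x37 in steps of 2) x 4 orientations, as on the slide.
    sizes = range(7, 38, 2)
    orientations = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    bank = {(s, th): gabor_kernel(s, wavelength=s / 2.0, theta=th)
            for s in sizes for th in orientations}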
32
C1 unit: maximum value of a group of S1 units, pooled over slightly different positions and scales. 8 scales / frequencies, 4 orientations.
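A minimal sketch of the pooling step (the 8x8 pooling cell and the pairing of two adjacent scale bands are assumptions about the general scheme, not the exact HMAX pooling grid):

    import numpy as np

    def local_max_pool(s1_map, cell=8):
        """Max over each cell x cell block of positions: tolerance to small shifts."""
        h, w = s1_map.shape
        h, w = h - h % cell, w - w % cell              # trim so the map tiles evenly
        blocks = s1_map[:h, :w].reshape(h // cell, cell, w // cell, cell)
        return blocks.max(axis=(1, 3))

    def c1_unit(s1_scale_a, s1_scale_b, cell=8):
        """Pool over position within each scale band, then over two adjacent bands
        (assumes both S1 maps have the same spatial size)."""
        return np.maximum(local_max_pool(s1_scale_a, cell), local_max_pool(s1_scale_b, cell))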
33
From S. Bileschi, Ph.D. Thesis The S1 and C1 model parameters are meant to match empirical neuroscience data.
34
S2 layer Recall that each S1 unit responds to an oriented edge at a particular scale Each S2 unit responds to a particular group of oriented edges at various scales, i.e., a shape S1 units were chosen to cover a “full” range of scales and orientations How can we choose S2 units to cover a “full” range of shapes?
36
HMAX's answer: Choose S2 shapes by randomly sampling patches from "training images". Extract C1 features in each selected patch. This gives a p×p×4 array, for the 4 orientations.
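A sketch of this sampling ("imprinting") step; the function and argument names are illustrative, not part of HMAX:

    import numpy as np

    def imprint_prototypes(c1_maps, num_prototypes, patch_size, rng=None):
        """Sample random patch_size x patch_size x 4 patches of C1 activity from training images."""
        rng = rng if rng is not None else np.random.default_rng()
        prototypes = []
        for _ in range(num_prototypes):
            c1 = c1_maps[rng.integers(len(c1_maps))]   # one training image's C1 layer: (H, W, 4)
            r = rng.integers(c1.shape[0] - patch_size + 1)
            c = rng.integers(c1.shape[1] - patch_size + 1)
            prototypes.append(c1[r:r + patch_size, c:c + patch_size, :].copy())
        return prototypes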
37
S2 prototype P_i, with 4 orientations
38
(Figure: the input's C1 pyramid, scales 1-8.)
40
S2 prototype P_i, with 4 orientations. Input image to classify: C1 layer with 4 orientations, 8 scales. Calculate the similarity between P_i and patches of the input's C1 layer, independently at each position and each scale. Similarity (radial basis function): response = exp( -β · ||X - P_i||² ), where X is a C1 patch of the same size as P_i.
42
Result: At each position in the C1 layer of the input image, we have an array of 4x8 values. Each value represents the "degree" to which shape P_i is present at the given position, orientation, and scale. Now, repeat this process for each P_i, to get N such arrays.
43
C2 unit: For each P_i, calculate the maximum value over all positions, orientations, and scales. The result is N values, corresponding to the N prototypes.
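Combining the last two steps, a sketch of the S2/C2 computation for one prototype (β, the unit stride, and the brute-force loop are illustrative simplifications):

    import numpy as np

    def c2_response(c1_scales, prototype, beta=1.0):
        """C2 value for one prototype: best radial-basis match over all positions and scales."""
        p = prototype.shape[0]
        best = -np.inf
        for c1 in c1_scales:                           # one (H, W, 4) C1 map per scale band
            for r in range(c1.shape[0] - p + 1):
                for c in range(c1.shape[1] - p + 1):
                    patch = c1[r:r + p, c:c + p, :]
                    s2 = np.exp(-beta * np.sum((patch - prototype) ** 2))   # RBF similarity
                    best = max(best, s2)
        return best

    def c2_vector(c1_scales, prototypes, beta=1.0):
        """One C2 feature per prototype; this vector is what the classifier sees."""
        return np.array([c2_response(c1_scales, p, beta) for p in prototypes])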
44
Support Vector Machine SVM classification: To classify an input image (e.g., "face" or "not face"), give the C2 feature vector representing the image to a trained support vector machine (SVM).
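For example, using scikit-learn (the data here are random placeholders; the linear kernel and the feature scaling are assumptions, not specified by the slide):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # X: one row of C2 features per training image; y: 1 = "face", 0 = "not face".
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(200, 500)), rng.integers(0, 2, size=200)   # placeholder data

    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(X, y)
    print(clf.predict(X[:5]))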
45
AdaBoost Classifier Boosting classification: To classify an input image (e.g., "face" or "not face"), give the C2 feature vector representing the image to a classifier trained by AdaBoost.
46
Visual tasks: (1) Part-based object detection: Detect different types of "part-based" objects, either alone or in "clutter". Data sets contain images that either contain or do not contain a single instance of the target object. The task is to decide whether the target object is present or absent. (2) Texture-based object recognition: Recognize different types of "texture-based" objects (e.g., grass, trees, buildings, roads). The task is to classify each pixel with an object label. (3) Whole-scene recognition: Recognize all objects ("part-based" and "texture-based") in a scene.
47
Databases Caltech 5 (five object categories: leaves, cars, faces, airplanes, motorcycles) Caltech 101 (101 object categories) MIT Streetscenes MIT car and face databases
49
Sample images from the MIT Streetscenes database
50
Sample images from the Caltech 101 dataset
51
Results: sample results for part-based objects. From Serre et al., "Object recognition with features inspired by visual cortex". Accuracy is reported at the equilibrium point (i.e., at the point where the false positive rate equals the false negative rate).
52
How to do multiclass classification with SVMs?
53
Two main methods: – One versus All – One versus One (“all pairs”)
54
One Versus All For N categories: – Train N SVMs: “Category 1” versus “Not Category 1” “Category 2” versus “Not Category 2” etc. Run new example through all of them. Prediction is the category with the highest decision score.
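A sketch of the one-versus-all decision rule using scikit-learn's LinearSVC (the data are random placeholders; 4 categories assumed):

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(300, 50))
    y_train = rng.integers(0, 4, size=300)            # 4 categories, labeled 0..3
    x_new = rng.normal(size=(1, 50))

    # Train one binary SVM per category: "category k" versus "not category k".
    svms = [LinearSVC(dual=False).fit(X_train, (y_train == k).astype(int)) for k in range(4)]

    # Predict the category whose SVM gives the highest decision score.
    scores = [clf.decision_function(x_new)[0] for clf in svms]
    print("predicted category:", int(np.argmax(scores)))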
55
One Versus One (All pairs) Train N * (N - 1)/2 SVMs: "Category 1" versus "Category 2", "Category 1" versus "Category 3",..., "Category 2" versus "Category 3", etc. To predict the category for a new instance, run the SVMs in a "decision tree". (Figure: example tree in which the root compares Category 1 vs. Category 2 and the winner is then compared with Category 3.)
56
Whole Scene Interpretation Streetscenes project (Bileschi, 2006)
57
A dense sampling of square windows of all possible scales and at all possible positions is cropped from the test image, converted to C1 and C2 features, and passed through each possible SVM (or boosting) classifier (one per category). Result: a real-valued detection strength for each possible category at each possible location and each possible scale. An object is considered "present" at a position and scale if its detection strength is above a threshold (thresholds are determined empirically). "Local neighborhood suppression" is used to remove redundant detections. Experiment 1: Use C1 features. Experiment 2: Use C2 features.
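A sketch of the windowing, thresholding, and local-suppression logic (the feature extractor and classifier are passed in as stand-ins; the stride, threshold, and suppression radius are illustrative):

    import numpy as np

    def detect(image, feature_fn, score_fn, window_sizes, stride=8, threshold=0.5, min_dist=32):
        """Dense multi-scale window scan, thresholding, and local neighborhood suppression."""
        detections = []
        H, W = image.shape[:2]
        for size in window_sizes:
            for r in range(0, H - size + 1, stride):
                for c in range(0, W - size + 1, stride):
                    score = score_fn(feature_fn(image[r:r + size, c:c + size]))
                    if score > threshold:              # detection strength above threshold
                        detections.append((score, r + size // 2, c + size // 2, size))
        # Local neighborhood suppression: keep the strongest detection, drop nearby redundant ones.
        detections.sort(reverse=True)
        kept = []
        for det in detections:
            if all(abs(det[1] - k[1]) + abs(det[2] - k[2]) > min_dist for k in kept):
                kept.append(det)
        return kept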
58
Sample results (here, detecting “car”) (Bileschi, 2006)
59
Results on crop data: Cars. Legend: Standard Model (C1) = uses C1 features; HoG = Histogram of Oriented Gradients (Dalal & Triggs); Standard Model (C2) = uses 444 S2 prototypes of 4 different sizes; Part-Based = part-based model of Leibe et al.; Grayscale = raw grayscale values (normalized in size and histogram-equalized); Local patch correlation = similar to the system of Torralba.
61
Improvements to HMAX (Mutch and Lowe, 2006): "Sparsify" features at different layers; localize C2 features; do feature selection for the SVMs.
62
“Sparsify” S2 Inputs: Use only dominant C1 orientation at each position
63
Localize C2 features – Assumption: "The system is 'attending' close to the center of the object. This is appropriate for datasets such as the Caltech 101, in which most objects of interest are central and dominant." – "For more general detection of objects within complex scenes, we augment it with a search for peak responses over object location using a sliding window."
64
Select C2 features that are highly weighted by the SVM – All-pairs SVM consists of m(m-1)/2 binary SVMs. – Each represents a separating hyperplane in d dimensions. – The d components of the (unit length) normal vector to this hyperplane can be interpreted as feature weights; the higher the kth component (in absolute value), the more important feature k is in separating the two classes. – Drop features with low weight, with weight averaged over all binary SVMs. – Multi-round tournament: in each round, the SVM is trained, then at most half the features are dropped.
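A sketch of the weight-based tournament for a single binary SVM (the multiclass version averages |weights| over all pairwise SVMs; the stopping size and the random placeholder data are illustrative):

    import numpy as np
    from sklearn.svm import LinearSVC

    def tournament_select(X, y, min_features=50):
        """Repeatedly train a linear SVM, rank features by |weight|, drop the weakest half."""
        keep = np.arange(X.shape[1])
        while len(keep) > min_features:
            clf = LinearSVC(dual=False).fit(X[:, keep], y)
            weight = np.abs(clf.coef_).mean(axis=0)    # importance = |component of hyperplane normal|
            order = np.argsort(weight)[::-1]
            keep = keep[order[:max(min_features, len(keep) // 2)]]
        return keep

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(200, 400)), rng.integers(0, 2, size=200)   # placeholder data
    print(len(tournament_select(X, y)))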
65
Number of SVM features – optimized over all categories
66
Results Mutch and Lowe, 2006
67
Comparative results on Caltech 101 dataset
68
From Mutch and Lowe, 2006
69
Future improvements? Could try to improve SVM by using more complex kernel functions (people have done this). “We lean towards future enhancements that are biologically realistic. We would like to be able to transform images into a feature space in which a simple classifier is good enough.” “Even our existing classifier is not entirely plausible, as an all-pairs model does not scale well as the number of classes increases.”
70
How to learn good “prototype” (S2) features?
71
Good Features for Object Recognition We need the right number of discriminative features. Too few features: the classifier cannot distinguish categories. Too many features: the classifier overfits the training data. Irrelevant features: increase error and compute time. From Mick Thomure, Ph.D. Defense, PSU, 2013
72
Prototype Learning Given example images, learn a small set of prototypes that maximizes performance of the classifier. 72 From Mick Thomure, Ph.D. Defense, PSU, 2013
73
Learning Prototypes by Imprinting Record, or imprint, arbitrary patches of training images. This is very common: Zhu & Zhang (2008), Faez et al. (2008), Gu et al. (2009), Huang et al. (2008), Wu et al. (2007), Lian & Li (2008), Moreno et al. (2007), Wijnhoven & With (2009), Serre et al. (2007), Mutch & Lowe (2008). From Mick Thomure, Ph.D. Defense, PSU, 2013
74
Shape Hypothesis (Serre et al., 2007): invariant representations with imprinted shape prototypes are key to the model’s success. This is assumed in most of the literature, but has yet to be tested! 74 From Mick Thomure, Ph.D. Defense, PSU, 2013
75
Question: Can hierarchical visual models be improved by learning prototypes? 1. Is the shape hypothesis correct? Compare imprinted and "shape-free" prototypes. 2. Do more sophisticated learning methods do better than imprinting? Conduct a study of different prototype learning methods. From Mick Thomure, Ph.D. Defense, PSU, 2013
76
Glimpse Hierarchical visual model implementation Fast, parallel, open-source Object recognition performance is similar to existing models. Reusable framework can express wide range of models. https://pythonhosted.org/glimpse/ 76 From Mick Thomure, Ph.D. Defense, PSU, 2013
77
Are invariant representations with imprinted shape prototypes key to the model’s success? 77 From Mick Thomure, Ph.D. Defense, PSU, 2013
78
Compare imprinted and "shape-free" prototypes, which are generated randomly. (Figure: an imprinted prototype and its imprinted region vs. a shape-free prototype built from random edges.) From Mick Thomure, Ph.D. Defense, PSU, 2013
79
Datasets: Synthetic tasks of Pinto et al. (2011) Tests viewpoint-invariant object recognition Addresses shortcomings in existing benchmark datasets Tunable difficulty Difficult for current computer vision systems From Mick Thomure, Ph.D. Defense, PSU, 2013
80
Task: Face Discrimination (Face1 vs. Face2) From Mick Thomure, Ph.D. Defense, PSU, 2013
81
Task: Category Recognition (Cars vs. Airplanes) From Mick Thomure, Ph.D. Defense, PSU, 2013
82
Variation Levels Difficulty increases with the amount of change in location, scale, and rotation. From Mick Thomure, Ph.D. Defense, PSU, 2013
83
Results: Face1 vs. Face2 Error bars: one standard error. Performance: mean accuracy over five independent trials, using 4075 prototypes. From Mick Thomure, Ph.D. Defense, PSU, 2013
85
Results: Cars vs. Planes Error bars: one standard error. Performance: mean accuracy over five independent trials, using 4075 prototypes. From Mick Thomure, Ph.D. Defense, PSU, 2013
86
Summary High performance on problems of invariant object recognition is possible with unlearned, "shape-free" features. Why do random prototypes work? This is still an open question. From Mick Thomure, Ph.D. Defense, PSU, 2013
87
Do more sophisticated learning methods do better than imprinting? Many methods could be applied: Feature Selection (Mutch & Lowe, 2008) k-means (Louie, 2003) Hebbian Learning (Brumby et al., 2009) STDP (Masquelier & Thorpe, 2007) PCA, ICA, Sparse Coding 87 From Mick Thomure, Ph.D. Defense, PSU, 2013
91
Feature Selection 1. Imprint prototypes 2. Compute features 3. Evaluate features 4. Measure performance Select: the most discriminative prototypes. Compute: Glimpse's performance on test images using only the most discriminative prototypes. From Mick Thomure, Ph.D. Defense, PSU, 2013
92
Results: Face1 vs. Face2 Error bars: one standard error. Performance: mean accuracy over five independent trials (except feature selection). Feature selection: ~10,000x compute cost. From Mick Thomure, Ph.D. Defense, PSU, 2013
93
Results: Cars vs. Planes Error bars: one standard error. Performance: mean accuracy over five independent trials (except feature selection). Feature selection: ~10,000x compute cost. From Mick Thomure, Ph.D. Defense, PSU, 2013
94
Feature Selection Advantage: creates discriminative prototypes. Drawbacks: computationally expensive; cannot synthesize new prototypes. Generally consistent with previous work. From Mick Thomure, Ph.D. Defense, PSU, 2013
95
k-means 1. Choose many prototypes from training images. 2. Identify k clusters of similar prototypes (iterative optimization process). 3. Create new prototypes by combining the prototypes in each cluster (average of prototypes). Now, use the k new prototypes in Glimpse to perform object recognition. From Mick Thomure, Ph.D. Defense, PSU, 2013
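A sketch using scikit-learn's KMeans (imprint_prototypes is the hypothetical patch sampler sketched earlier; patches are flattened for clustering and reshaped afterwards):

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_prototypes(imprinted, k, patch_size, n_orientations=4):
        """Cluster a large pool of imprinted patches; each cluster mean becomes a new prototype."""
        X = np.stack([p.ravel() for p in imprinted])               # each patch -> one row
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        return [c.reshape(patch_size, patch_size, n_orientations) for c in km.cluster_centers_]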
96
k-means Advantages: fast and scalable; used in similar networks (Coates & Ng, 2012); related to sparse coding. Drawback: prototypes may not be discriminative. From Mick Thomure, Ph.D. Defense, PSU, 2013
97
Performance was no better than imprinting, and was sometimes worse. The reason is unclear. Consistent with previous work? From Mick Thomure, Ph.D. Defense, PSU, 2013
98
Can k-means be improved? ("No-Child" vs. "Child") Methods: investigate different weightings. Results: found only a marginal increase in performance over k-means. From Mick Thomure, Ph.D. Defense, PSU, 2013
99
Summary Feature Selection: very helpful, but expensive k-means: fast, no improvement over imprinting Extended k-means: results similar to k-means 99 From Mick Thomure, Ph.D. Defense, PSU, 2013