Learning to make specific predictions using Slow Feature Analysis.

Slow: temporally invariant abstractions
Fast: quickly changing input
Memory/prediction hierarchy with temporal invariances
But… how does each module work: learn, map, and predict?

My (old) module:
1. Quantize high-dim input space
2. Map to low-dim output space
3. Discover temporal sequences in input space
4. Map sequences to low-dim sequence language
5. Feedback = same map run backwards
Problems:
- Sequence-mapping (step #4) depends on several previous steps → brittle, not robust
- Sequence-mapping not well-defined statistically

New module design: Slow Feature Analysis (SFA)
Pros of SFA:
- Nearly guaranteed to find some slow features
- No quantization
- Defined over the entire input space
- Hierarchical "stacking" is easy
- Statistically robust building blocks (simple polynomials, Principal Components Analysis, variance reduction, etc.)
→ a great way to find invariant functions
→ invariants change slowly, hence are easily predictable

BUT… no feedback!
- Can't get specific output from invariant input
- It's hard to take a low-dim signal and turn it into the right high-dim one (underdetermined)
Here's my solution (straightforward, probably done before somewhere): do feedback with a separate map.

First, show it working… then show how & why.
Input space: 20-dim "retina" (pixel 21 wraps around to pixel 1)
Input shapes: Gaussian blurs (wrapped) of 3 different widths
Input sequences: constant-velocity motion (0.3 pixels/step)
[Figure: example frames at T = 0, 2, 4 and at T = 23, 25, 27]
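To make the toy setup concrete, here is a minimal sketch of how such stimuli could be generated. Assumptions of mine, not from the slides: the specific blob widths, the random starting positions, and the helper names wrapped_gaussian / make_sequence.

```python
import numpy as np

N_PIX = 20                    # "retina" size; pixel 21 wraps around to pixel 1
WIDTHS = [1.0, 2.0, 3.0]      # three blob widths (illustrative values)

def wrapped_gaussian(center, width, n_pix=N_PIX):
    """Gaussian blob on a circular 1-D retina (wrap-around distance)."""
    x = np.arange(n_pix)
    d = np.minimum(np.abs(x - center), n_pix - np.abs(x - center))
    return np.exp(-0.5 * (d / width) ** 2)

def make_sequence(n_steps, width, start=0.0, velocity=0.3):
    """Constant-velocity motion of one blob, one frame per time step."""
    centers = (start + velocity * np.arange(n_steps)) % N_PIX
    return np.stack([wrapped_gaussian(c, width) for c in centers])

# Example training set: 200 frames for each width, concatenated.
rng = np.random.default_rng(0)
X = np.vstack([make_sequence(200, w, start=rng.uniform(0, N_PIX))
               for w in WIDTHS])                  # shape (600, 20)
```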

Sanity check: the slow features extracted match the generating parameters:
- Slow feature #1 tracks the Gaussian std. dev. ("what")
- Slow feature #2 tracks the Gaussian center position ("where")
(… so far, this is plain vanilla SFA, nothing new…)
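For reference, a bare-bones quadratic SFA in the spirit of the "plain vanilla" step above: expand the input quadratically, whiten, and keep the directions whose temporal differences have the smallest variance. This is a sketch, not the author's implementation; it reuses X from the data sketch above, and the conditioning tolerance and function names are my choices.

```python
import numpy as np

def quadratic_expand(X):
    """Monomials of degree 1 and 2: [x_i] + [x_i * x_j for i <= j]."""
    n, d = X.shape
    cross = np.stack([X[:, i] * X[:, j]
                      for i in range(d) for j in range(i, d)], axis=1)
    return np.hstack([X, cross])

def sfa(X, n_slow=2):
    """Return a function S mapping 2-D arrays of frames to their slow features."""
    Z = quadratic_expand(X)
    mu = Z.mean(axis=0)
    Zc = Z - mu
    # Whiten the expanded signal (drop ill-conditioned directions).
    U, s, _ = np.linalg.svd(np.cov(Zc.T))
    keep = s > 1e-8 * s.max()
    W_white = U[:, keep] / np.sqrt(s[keep])
    Zw = Zc @ W_white
    # Slowest directions = smallest eigenvalues of the time-difference covariance.
    dZ = np.diff(Zw, axis=0)
    _, eigvec = np.linalg.eigh(np.cov(dZ.T))   # eigenvalues in ascending order
    W = W_white @ eigvec[:, :n_slow]
    return lambda frames: (quadratic_expand(frames) - mu) @ W

S = sfa(X, n_slow=2)   # the two slow features should track blob width and position
```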

New contribution: predict all pixels of the next image (T=5), given the previous images (T = 0, 2, 4)…
Reference prediction: just reuse the previous image ("tomorrow's weather is just like today's").

Plot the ratio: (mean-squared prediction error) / (mean-squared reference error)
- Median ratio over all points = 0.06 (including discontinuities)
- …over high-confidence points = 0.03 (toss worst 20%)
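The reported ratio could be computed roughly as below. This is a sketch: it assumes `frames` holds the true images and `preds` the model's predictions for frames 1..T, and it reports the median per-frame ratio in the spirit of the slide.

```python
import numpy as np

def error_ratio(frames, preds):
    """frames: actual frames 0..T, shape (T+1, n_pix);
    preds: model predictions for frames 1..T, shape (T, n_pix).
    The reference predicts frame t simply as frame t-1."""
    pred_err = np.mean((preds - frames[1:]) ** 2, axis=1)
    ref_err = np.mean((frames[:-1] - frames[1:]) ** 2, axis=1)
    return np.median(pred_err / ref_err)   # median ratio over frames
```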

Take-home messages:
- SFA can be inverted
- SFA can be used to make specific predictions
- The prediction works very well
- The prediction can be further improved by using confidence estimates
So why is it hard, and how is it done?

Why it's hard:
High-dim input: x1, x2, x3, …, x20
Low-dim slow features: each S is a polynomial in x1…x20 (with coefficients like 0.3 and terms like x4², x5 x9, …), so going from x to S is easy.
But given S1 = 1.4, S2 = …: what are x1, x2, …, x20? That direction is HARD:
- Infinitely many possibilities for the x's
- Vastly under-determined
- No simple polynomial-inverse formula (e.g. a "quadratic formula")

Very simple, graphable example: (x1, x2) 2-dim → S1 1-dim
- x1(t), x2(t): approximately circular motion in the plane
- S1(t) = x1² + x2²: nearly constant, i.e. slow
Illustrate a series of six clue/trick pairs for learning the specific-prediction mapping.
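The same example in a few lines (radius, step size, and noise level are arbitrary choices of mine): the inputs change quickly while S1 = x1² + x2² barely changes.

```python
import numpy as np

t = np.arange(500) * 0.05
x1 = np.cos(t) + 0.01 * np.random.randn(500)      # fast-changing inputs
x2 = np.sin(t) + 0.01 * np.random.randn(500)
S1 = x1 ** 2 + x2 ** 2                            # stays near 1: the slow feature
print(np.var(np.diff(S1)), np.var(np.diff(x1)))   # S1 changes far more slowly
```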

Clue #1: The actual input data is a small subset of all possible input data (i.e. it lies on a "manifold"); actual ≠ possible.
Trick #1: Find a set of points which represent where the actual input data is → "anchor points" A_i.
(Found using k-means, k-medoids, etc. This is quantization, but only for feedback.)
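Trick #1 as a sketch, using scikit-learn's KMeans to pick the anchor points. The number of anchors and the function name are assumptions; k-medoids would work the same way.

```python
import numpy as np
from sklearn.cluster import KMeans

def find_anchors(X, n_anchors=30, seed=0):
    """Anchor points A_i summarizing where the data actually lives."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_            # shape (n_anchors, input_dim)

anchors = find_anchors(X)                 # X from the toy-data sketch above
```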

Clue #2: The actual input data is not distributed evenly about those anchor points.
Trick #2: Calculate the covariance matrix C_i of the data around each A_i.
[Figure: data scatter around an anchor point, with the eigenvectors of C_i]
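Trick #2 as a sketch: assign each data point to its nearest anchor and compute a covariance matrix per anchor; the eigenvectors of each C_i span the local manifold directions. The small regularization term is my addition, to keep thin clusters invertible.

```python
import numpy as np

def local_covariances(X, anchors, reg=1e-6):
    """Covariance matrix C_i of the data assigned to each anchor A_i."""
    labels = np.argmin(((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1), axis=1)
    d = X.shape[1]
    covs = []
    for i in range(len(anchors)):
        Xi = X[labels == i]
        C = np.cov(Xi.T) if len(Xi) > 1 else np.eye(d)
        covs.append(C + reg * np.eye(d))  # regularize thin clusters
    return labels, covs
```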

Clue #3: S(x) is locally linear about each anchor point.
Trick #3: Construct linear (affine) Taylor-series mappings SL_i approximating S(x) about each A_i.
(NB: this doesn't require polynomial SFA, just a differentiable S.)
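Trick #3 as a sketch: a finite-difference Jacobian of S around each anchor gives the affine map SL_i(x) ≈ S(A_i) + J_i (x - A_i). As the slide notes, this only needs S to be differentiable; the step size eps is an assumption.

```python
import numpy as np

def local_linear_maps(S, anchors, eps=1e-4):
    """For each anchor A_i return (S(A_i), J_i), with J_i the Jacobian of S at A_i
    estimated by forward finite differences."""
    maps = []
    for A in anchors:
        S_A = S(A[None, :])[0]
        J = np.stack([(S(A[None, :] + eps * np.eye(len(A))[j])[0] - S_A) / eps
                      for j in range(len(A))], axis=1)   # shape (n_slow, input_dim)
        maps.append((S_A, J))
    return maps
```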

Clue #4: The covariance eigenvectors tell us about the local data manifold.
Good news: the linear SL_i can be pseudo-inverted (SVD).
Bad news: we don't want any old (x1, x2), we want an (x1, x2) on the data manifold.
Trick #4:
1. Get the SVD pseudo-inverse: ΔX = SL_i^-1 (S_new - S(A_i))
2. Then stretch ΔX onto the manifold by multiplying by the chopped* C_i
[Figure: ΔS = S_new - S(A_i) mapped to ΔX, then stretched onto the manifold]
* Projection matrix, keeping only as many eigenvectors as dimensions of S.
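Trick #4 as a sketch. I read the "chopped C_i" footnote as a projection onto the top eigenvectors of the local covariance (as many as S has dimensions), so the pseudo-inverted ΔX is projected back onto the local manifold; if the original "stretch" also rescaled by the eigenvalues, this sketch would need that extra step.

```python
import numpy as np

def local_inverse(S_new, S_A, J, C, n_keep):
    """Pseudo-invert the local affine map, then project the raw delta-X onto the
    top n_keep eigenvectors of the local covariance ("chopped" C_i)."""
    dX_raw = np.linalg.pinv(J) @ (S_new - S_A)   # SVD-based pseudo-inverse
    _, eigvec = np.linalg.eigh(C)                # eigenvalues in ascending order
    V = eigvec[:, -n_keep:]                      # keep as many as dim(S)
    return V @ (V.T @ dX_raw)                    # "stretched" delta-X on the manifold
```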

Good news: given A_i and C_i, we can invert S_new → X_new.
Bad news: how do we choose which A_i and SL_i^-1 to use?
[Figure: three different candidate points that all have the same value of S_new]

Clue #5:
a) We need an anchor A_i such that S(A_i) is close to S_new
b) We need a "hint" of which anchors are close in X-space
Trick #5: Choose the anchor A_i such that
- A_i is "close to" the hint, AND
- S(A_i) is close to S_new
[Figure: close candidates in S-space, and the hint region in X-space]
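Trick #5 as a sketch: rank the anchors by distance to the hint in X-space and by distance to S_new in S-space, and take the anchor that does well on both lists. Using the rank sum as "best" is my choice; the slides only say the chosen index should be high on both candidate lists.

```python
import numpy as np

def choose_anchor(anchors, S_of_anchors, x_hint, S_new):
    """Pick the anchor that is close to the hint in X-space AND whose S(A_i)
    is close to S_new, via the sum of the two ranks."""
    d_x = np.linalg.norm(anchors - x_hint, axis=1)
    d_s = np.linalg.norm(S_of_anchors - S_new, axis=1)
    rank_x = np.argsort(np.argsort(d_x))
    rank_s = np.argsort(np.argsort(d_s))
    return int(np.argmin(rank_x + rank_s))
```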

All tricks together: map the local linear inverse about each anchor point.
[Figure: anchors plus their S(A_i) neighbors]

Clue #6: The local data scatter can decide whether a given point is probable ("on the manifold") or not.
Trick #6: Use Gaussian hyper-ellipsoid probabilities about the closest A_i (this can tell whether a prediction makes sense or not).
[Figure: probable vs. improbable points relative to the local hyper-ellipsoids]
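Trick #6 as a sketch: the confidence score is the Gaussian log-density under the local hyper-ellipsoid, i.e. a squared Mahalanobis distance plus normalization, reported here as -log P.

```python
import numpy as np

def neg_log_prob(x, A, C):
    """-log of the Gaussian density N(x; A, C): high values mean "improbable",
    i.e. far from the local hyper-ellipsoid around the anchor."""
    d = x - A
    _, logdet = np.linalg.slogdet(C)
    maha = d @ np.linalg.solve(C, d)             # squared Mahalanobis distance
    return 0.5 * (maha + logdet + len(x) * np.log(2 * np.pi))
```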

[Figure: estimated uncertainty, -log(P), increases away from the anchor points]

Summary of the SFA inverse/prediction method:
We have X(t-2), X(t-1), X(t)… we want X(t+1).
1. Calculate the slow features S(t-2), S(t-1), S(t)
2. Extrapolate that trend linearly to S_new (NB: S varies slowly/smoothly in time)
3. Find candidate S(A_i)'s close to S_new, e.g. candidate i = {1, 16, 3, 7}

Summary cont'd
4. Take X(t) as the "hint," and find candidate A_i's close to it, e.g. candidate i = {8, 3, 5, 17}
5. Find the "best" candidate A_i, whose index is high on both candidate lists (S(A_i)'s close to S_new, and A_i close to X(t))

6. Use the chosen A_i and the pseudo-inverse (i.e. ΔX = SL_i^-1 (S_new - S(A_i)), via SVD) to get ΔX
7. Stretch ΔX onto the low-dim manifold using the chopped C_i
8. Add the stretched ΔX back onto A_i to get the final prediction
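Steps 1 to 9 assembled into one hypothetical predict_next(), reusing the helper sketches above (choose_anchor, local_inverse, neg_log_prob); S_of_anchors would be S(anchors), and the linear extrapolation follows step 2 of the summary. Called in a loop over a test sequence, its outputs feed directly into the error-ratio sketch from earlier.

```python
import numpy as np

def predict_next(frames, S, anchors, S_of_anchors, lin_maps, covs, n_slow):
    """Predict the next frame from the last few frames (hypothetical assembly)."""
    S_hist = S(frames[-3:])                         # S(t-2), S(t-1), S(t)
    S_new = S_hist[-1] + (S_hist[-1] - S_hist[-2])  # step 2: linear extrapolation
    i = choose_anchor(anchors, S_of_anchors, frames[-1], S_new)   # steps 3-5
    S_A, J = lin_maps[i]
    dX = local_inverse(S_new, S_A, J, covs[i], n_keep=n_slow)     # steps 6-7
    x_pred = anchors[i] + dX                        # step 8
    confidence = -neg_log_prob(x_pred, anchors[i], covs[i])       # step 9
    return x_pred, confidence
```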

9. Use the covariance hyper-ellipsoids to estimate confidence in this prediction
This method uses virtually everything we know about the data; any improvements presumably would need further clues…
- Discrete sub-manifolds
- Discrete sequence steps
- Better nonlinear mappings

Next steps
Online learning
- Adjust anchor points and covariances as new data arrive
- Use weighted k-medoid clusters to mix old data with new
Hierarchy
- Set the output of one layer as the input to the next
- Enforce ever-slower features up the hierarchy
Test with more complex stimuli and natural movies
Let feedback from above modify the slow-feature polynomials
Find slow features in the unpredicted input (input - prediction)