Machine Learning & Data Mining CS/CNS/EE 155 Lecture 14: Embeddings 1Lecture 14: Embeddings.

Slides:



Advertisements
Similar presentations
Part 2: Unsupervised Learning
Advertisements

Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
1 Welcome to the Kernel-Class My name: Max (Welling) Book: There will be class-notes/slides. Homework: reading material, some exercises, some MATLAB implementations.
Computer vision: models, learning and inference Chapter 13 Image preprocessing and feature extraction.
Dimensionality Reduction PCA -- SVD
Supervised Learning Recap
Graph Laplacian Regularization for Large-Scale Semidefinite Programming Kilian Weinberger et al. NIPS 2006 presented by Aggeliki Tsoli.
10/11/2001Random walks and spectral segmentation1 CSE 291 Fall 2001 Marina Meila and Jianbo Shi: Learning Segmentation by Random Walks/A Random Walks View.
Visual Recognition Tutorial
Computer Vision Spring ,-685 Instructor: S. Narasimhan Wean 5403 T-R 3:00pm – 4:20pm Lecture #20.
1cs542g-term High Dimensional Data  So far we’ve considered scalar data values f i (or interpolated/approximated each component of vector values.
Principal Component Analysis CMPUT 466/551 Nilanjan Ray.
Lecture 17: Supervised Learning Recap Machine Learning April 6, 2010.
Prénom Nom Document Analysis: Parameter Estimation for Pattern Recognition Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Principal Component Analysis
Dimensional reduction, PCA
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
The Terms that You Have to Know! Basis, Linear independent, Orthogonal Column space, Row space, Rank Linear combination Linear transformation Inner product.
Unsupervised Learning
Pattern Recognition. Introduction. Definitions.. Recognition process. Recognition process relates input signal to the stored concepts about the object.
Continuous Latent Variables --Bishop
CS4670: Computer Vision Kavita Bala Lecture 7: Harris Corner Detection.
DATA MINING LECTURE 7 Dimensionality Reduction PCA – SVD
Probabilistic methods for phylogenetic trees (Part 2)
1cs542g-term Notes  Extra class next week (Oct 12, not this Friday)  To submit your assignment: me the URL of a page containing (links to)
The Multivariate Normal Distribution, Part 1 BMTRY 726 1/10/2014.
Diffusion Maps and Spectral Clustering
Nonlinear Dimensionality Reduction Approaches. Dimensionality Reduction The goal: The meaningful low-dimensional structures hidden in their high-dimensional.
Manifold learning: Locally Linear Embedding Jieping Ye Department of Computer Science and Engineering Arizona State University
Dimensionality Reduction: Principal Components Analysis Optional Reading: Smith, A Tutorial on Principal Components Analysis (linked to class webpage)
Principal Components Analysis BMTRY 726 3/27/14. Uses Goal: Explain the variability of a set of variables using a “small” set of linear combinations of.
Camera Geometry and Calibration Thanks to Martial Hebert.
CS 8751 ML & KDDSupport Vector Machines1 Support Vector Machines (SVMs) Learning mechanism based on linear programming Chooses a separating plane based.
1 Logistic Regression Adapted from: Tom Mitchell’s Machine Learning Book Evan Wei Xiang and Qiang Yang.
ECE 8443 – Pattern Recognition LECTURE 03: GAUSSIAN CLASSIFIERS Objectives: Normal Distributions Whitening Transformations Linear Discriminants Resources.
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 3: LINEAR MODELS FOR REGRESSION.
Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)
CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.
Computer Vision Lab. SNU Young Ki Baik Nonlinear Dimensionality Reduction Approach (ISOMAP, LLE)
GRASP Learning a Kernel Matrix for Nonlinear Dimensionality Reduction Kilian Q. Weinberger, Fei Sha and Lawrence K. Saul ICML’04 Department of Computer.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
CSC2535: Computation in Neural Networks Lecture 12: Non-linear dimensionality reduction Geoffrey Hinton.
EigenRank: A ranking oriented approach to collaborative filtering By Nathan N. Liu and Qiang Yang Presented by Zachary 1.
CS Statistical Machine learning Lecture 24
EE4-62 MLCV Lecture Face Recognition – Subspace/Manifold Learning Tae-Kyun Kim 1 EE4-62 MLCV.
Tony Jebara, Columbia University Advanced Machine Learning & Perception Instructor: Tony Jebara.
1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
MACHINE LEARNING 7. Dimensionality Reduction. Dimensionality of input Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
CS Statistical Machine learning Lecture 12 Yuan (Alan) Qi Purdue CS Oct
Machine learning optimization Usman Roshan. Machine learning Two components: – Modeling – Optimization Modeling – Generative: we assume a probabilistic.
CSC321: Lecture 25: Non-linear dimensionality reduction Geoffrey Hinton.
Matrix Factorization & Singular Value Decomposition Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
CSC321: Extra Lecture (not on the exam) Non-linear dimensionality reduction Geoffrey Hinton.
1 Dongheng Sun 04/26/2011 Learning with Matrix Factorizations By Nathan Srebro.
Introduction to Machine Learning Nir Ailon Lecture 12: EM, Clustering and More.
Machine Learning Supervised Learning Classification and Regression K-Nearest Neighbor Classification Fisher’s Criteria & Linear Discriminant Analysis Perceptron:
High Dimensional Probabilistic Modelling through Manifolds
Data Transformation: Normalization
Neil Lawrence Machine Learning Group Department of Computer Science
Multimodal Learning with Deep Boltzmann Machines
Machine Learning Basics
Latent Variables, Mixture Models and EM
Roberto Battiti, Mauro Brunato
Outline Nonlinear Dimension Reduction Brief introduction Isomap LLE
Step-By-Step Instructions for Miniproject 2
Probabilistic Models with Latent Variables
Feature space tansformation methods
Using Manifold Structure for Partially Labeled Classification
Recap: Naïve Bayes classifier
Goodfellow: Chapter 14 Autoencoders
Presentation transcript:

Machine Learning & Data Mining CS/CNS/EE 155 Lecture 14: Embeddings 1Lecture 14: Embeddings

Announcements
Kaggle Miniproject is closed
– Report due Thursday
Public Leaderboard
– How well you think you did
Private Leaderboard now viewable
– How well you actually did

Last Week
Dimensionality Reduction
Clustering
Latent Factor Models
– Learn a low-dimensional representation of the data

This Lecture
Embeddings
– An alternative form of dimensionality reduction
Locally Linear Embeddings
Markov Embeddings

Embedding
Learn a representation U
– Each column u corresponds to a data point
Semantics are encoded via d(u, u')
– the distance between points

Locally Linear Embedding
Unsupervised Learning
Given: unlabeled data points x_1, …, x_N
Learn U such that local linearity is preserved
– Lower dimensional than x
– “Manifold Learning”: any neighborhood looks like a linear plane
(Figure: the x’s mapped to the corresponding u’s)

Locally Linear Embedding
Create B(i)
– the B nearest neighbors of x_i
– Assumption: B(i) is approximately linear
– x_i can be written as a convex combination of the x_j in B(i)
(Figure: x_i and its neighborhood B(i))

Locally Linear Embedding
Given neighbors B(i), solve the local linear approximation W:
W = argmin_W Σ_i ||x_i − Σ_{j∈B(i)} W_ij x_j||²   s.t.   Σ_{j∈B(i)} W_ij = 1 for each i

Locally Linear Embedding
Given neighbors B(i), solve the local linear approximation W:
W = argmin_W Σ_i ||x_i − Σ_{j∈B(i)} W_ij x_j||²   s.t.   Σ_{j∈B(i)} W_ij = 1 for each i
Every x_i is approximated as a convex combination of its neighbors
– How to solve?

Lagrange Multipliers
Solutions tend to be at corners!

Solving Locally Linear Approximation
Lagrangian (for each x_i, with the constraint Σ_j W_ij = 1):
L(W_i, λ_i) = ||x_i − Σ_{j∈B(i)} W_ij x_j||² + λ_i (1 − Σ_{j∈B(i)} W_ij)
Setting the gradient to zero gives W_i ∝ C_i⁻¹ 1, where C_i is the local Gram matrix with entries C_jk = (x_i − x_j)ᵀ(x_i − x_k); rescale so the weights sum to 1.
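As a concrete illustration, here is a minimal numpy sketch of this closed-form solution for a single point; the function name, the small ridge term `reg`, and the variable names are my own, not the lecture's.

```python
import numpy as np

def local_weights(x_i, neighbors, reg=1e-3):
    """Solve min_w ||x_i - sum_j w_j x_j||^2  s.t.  sum_j w_j = 1,
    where the x_j are the rows of `neighbors` (the set B(i))."""
    Z = x_i - neighbors                                # (B, d) differences x_i - x_j
    C = Z @ Z.T                                        # (B, B) local Gram matrix C_jk
    C += reg * np.trace(C) * np.eye(len(neighbors))    # ridge for stability when B > d
    w = np.linalg.solve(C, np.ones(len(neighbors)))    # Lagrange solution: w ∝ C^{-1} 1
    return w / w.sum()                                 # rescale so the weights sum to 1
```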

Locally Linear Approximation
Invariant to:
– Rotation
– Scaling
– Translation

Story So Far: Locally Linear Embeddings
Locally Linear Approximation
Given neighbors B(i), solve the local linear approximation W:
W = argmin_W Σ_i ||x_i − Σ_{j∈B(i)} W_ij x_j||²   s.t.   Σ_{j∈B(i)} W_ij = 1 for each i
Solution via Lagrange Multipliers: W_i ∝ C_i⁻¹ 1, normalized to sum to 1

Recall: Locally Linear Embedding
Given: unlabeled data points x_1, …, x_N
Learn U such that local linearity is preserved
– Lower dimensional than x
– “Manifold Learning”
(Figure: the x’s mapped to the corresponding u’s)

Dimensionality Reduction
Find a low dimensional U
– Preserves approximate local linearity
Given the local approximation W, learn the lower dimensional representation:
U = argmin_U Σ_i ||u_i − Σ_j W_ij u_j||²
– The neighborhood of x_i is represented by the row W_{i,*}
(Figure: the x’s mapped to the corresponding u’s)

Rewrite as:
Given the local approximation W, learn the lower dimensional representation:
argmin_U Σ_i ||u_i − Σ_j W_ij u_j||² = argmin_U tr(U M Uᵀ),   where M = (I − W)ᵀ(I − W)
– M is symmetric positive semidefinite

Suppose K = 1
Given the local approximation W, learn the one-dimensional representation u:
By the min-max theorem
– u = principal eigenvector of M⁺ (the pseudoinverse of M)

Recap: Principal Component Analysis
Each column of V is an eigenvector
Each λ is an eigenvalue (λ_1 ≥ λ_2 ≥ …)

Given the local approximation W, learn the lower dimensional representation:
K = 1:
– u = principal eigenvector of M⁺
– u = smallest non-trivial eigenvector of M (corresponding to the smallest non-zero eigenvalue)
General K:
– U = top K principal eigenvectors of M⁺
– U = bottom K non-trivial eigenvectors of M (corresponding to the bottom K non-zero eigenvalues)

Recap: Locally Linear Embedding
Generate the nearest neighbors B(i) of each x_i
Compute the local linear approximation W
Compute the low dimensional embedding U (bottom K non-trivial eigenvectors of M = (I − W)ᵀ(I − W))
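A rough end-to-end numpy sketch of these three steps, assuming brute-force nearest neighbors; function and variable names are mine, not the lecture's, and the returned embedding has one row per point (the transpose of the lecture's column convention).

```python
import numpy as np

def lle(X, B=10, K=2, reg=1e-3):
    """X: (N, d) data matrix. Returns an (N, K) embedding, one row per point."""
    N = X.shape[0]
    # Step 1: B nearest neighbors of each x_i (brute force pairwise distances).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrs = np.argsort(D, axis=1)[:, 1:B + 1]           # skip self at position 0

    # Step 2: local linear approximation W (row i supported on B(i)).
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[i] - X[nbrs[i]]                          # (B, d) differences
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(B)             # regularize for stability
        w = np.linalg.solve(C, np.ones(B))             # w ∝ C^{-1} 1
        W[i, nbrs[i]] = w / w.sum()

    # Step 3: bottom K non-trivial eigenvectors of M = (I - W)^T (I - W).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)               # eigenvalues in ascending order
    return eigvecs[:, 1:K + 1]                         # drop the trivial constant eigenvector
```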

Results for Different Neighborhoods
(Figure: true distribution and 2000 samples, with embeddings for B = 3, 6, 9, 12)

Embeddings vs Latent Factor Models
Both define a low-dimensional representation
Embeddings preserve distance: d(u, u') = ||u − u'||
Latent factor models preserve the inner product: ⟨u, u'⟩ = uᵀu'
Relationship: ||u − u'||² = ||u||² + ||u'||² − 2⟨u, u'⟩

Visualization Semantics
Latent Factor Model:
– Similarity measured via dot product
– Rotational semantics
– Can interpret axes
– Can only visualize 2 axes at a time
Embedding:
– Similarity measured via distance
– Clustering/locality semantics
– Cannot interpret axes
– Can visualize many clusters simultaneously

Latent Markov Embeddings

Latent Markov Embeddings
Locally Linear Embedding is conventional unsupervised learning
– Given raw features x_i
– i.e., find a low-dimensional U that preserves approximate local linearity
Latent Markov Embedding is a feature learning problem
– e.g., learn a low-dimensional U that captures user-generated feedback

Playlist Embedding
Users generate song playlists
– Treat them as training data
Can we learn a probabilistic model of playlists?

Probabilistic Markov Modeling
Training set: a set of songs S and a set of playlists D, where each playlist is a sequence of songs p = ⟨p[1], p[2], …⟩ with every p[i] ∈ S
Goal: learn a probabilistic Markov model of playlists:
P(p) = Π_i P(p[i] | p[i−1])
What is the form of P?

First Try: Probability Tables
(Table: P(s | s') with one row per previous song s' — s_start, s_1, s_2, …, s_7, … — and one column per next song s_1, …, s_7, …; the entries are the transition probabilities.)
#Parameters = O(|S|²) !!!
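For illustration, a small sketch (names mine, not the lecture's) of what this first try amounts to: estimate the table from playlist bigram counts, with one parameter per (previous song, next song) pair.

```python
from collections import defaultdict

def transition_table(playlists, start="<start>"):
    """playlists: iterable of song-id sequences. Returns a nested dict P[s_prev][s]."""
    counts = defaultdict(lambda: defaultdict(float))
    for p in playlists:
        prev = start
        for s in p:
            counts[prev][s] += 1.0
            prev = s
    # Normalize each row into P(s | s'); up to O(|S|^2) entries overall.
    return {s_prev: {s: c / sum(row.values()) for s, c in row.items()}
            for s_prev, row in counts.items()}
```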

Second Try: Hidden Markov Models
#Parameters for the transitions between K hidden states = O(K²)
#Parameters for the emissions P(s|z) = O(|S|K)
Total = O(K²) + O(|S|K)
– e.g., with |S| = 10,000 songs and K = 50 hidden states, that is 2,500 + 500,000 parameters instead of 10⁸

Problem with Hidden Markov Models
Need to reliably estimate P(s|z)
Lots of “missing values” in this training set

Latent Markov Embedding
“Log-Radial” function
– (my own terminology)
P(s | s') ∝ exp(−||u_s − v_{s'}||²)
– u_s: entry point of song s
– v_s: exit point of song s

Log-Radial Functions
(Figure: rings centered at the exit point v_{s'}; entry points u_s and u_{s''} on the same ring have the same transition probability.)
Each ring defines an equivalence class of transition probabilities
2K parameters per song
2|S|K parameters total
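A minimal sketch of the dual point transition probability described above, assuming P(s | s') ∝ exp(−||u_s − v_{s'}||²); the array layout and function name are my own.

```python
import numpy as np

def transition_probs(U, V, prev_song):
    """U: (|S|, K) entry points u_s, V: (|S|, K) exit points v_s.
    Returns the vector P(s | prev_song) over all songs s."""
    d2 = np.sum((U - V[prev_song]) ** 2, axis=1)   # squared distance of each u_s to v_{s'}
    logits = -d2
    logits -= logits.max()                          # for numerical stability
    p = np.exp(logits)
    return p / p.sum()                              # divide by the normalization constant Z(s')
```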

Learning Problem
Training set: songs S and playlists D, with each playlist p = ⟨p[1], p[2], …⟩
Learning Goal: maximize the likelihood of the training playlists,
argmax_{U,V} Π_{p∈D} Π_i P(p[i] | p[i−1])

Minimize Neg Log Likelihood
Solve using gradient descent
– Homework question: derive the gradient formula
– Random initialization
Normalization constant hard to compute:
– Approximation heuristics (see paper)
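For concreteness, a sketch of the objective being minimized, written for the single point model with names of my own choosing (the gradient itself is left as the homework question). The exact normalizer sums over all |S| songs for every transition, which is why the paper resorts to approximation heuristics.

```python
import numpy as np

def neg_log_likelihood(U, playlists):
    """U: (|S|, K) song embeddings (single point model); playlists: lists of song ids."""
    nll = 0.0
    for p in playlists:
        for prev, cur in zip(p[:-1], p[1:]):
            d2 = np.sum((U - U[prev]) ** 2, axis=1)     # squared distances to every song
            x = -d2
            log_Z = x.max() + np.log(np.sum(np.exp(x - x.max())))   # stable log-sum-exp
            nll += d2[cur] + log_Z                      # -log P(cur | prev)
    return nll
```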

Simpler Version
Dual point model: P(s | s') ∝ exp(−||u_s − v_{s'}||²)
Single point model: P(s | s') ∝ exp(−||u_s − u_{s'}||²)
– Transitions are (almost) symmetric
– Exact same form of training problem

Visualization in 2D
Simpler version: Single Point Model
– The single point model is easier to visualize
(Figure: 2D single point embedding of songs)

Sampling New Playlists
Given a partial playlist p = ⟨p[1], …, p[j]⟩
Generate the next song p[j+1]
– Sample according to P(s | p[j])
– Dual point model: P(s | p[j]) ∝ exp(−||u_s − v_{p[j]}||²); single point model: P(s | p[j]) ∝ exp(−||u_s − u_{p[j]}||²)
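A short sketch (assumed names) of sampling the next song under the single point model; the dual point version would use the entry/exit arrays U and V as in the earlier transition_probs sketch.

```python
import numpy as np

def sample_next(U, playlist, rng=np.random.default_rng(0)):
    """U: (|S|, K) song embeddings (single point model); playlist: partial list of song ids."""
    prev = playlist[-1]
    logits = -np.sum((U - U[prev]) ** 2, axis=1)   # -||u_s - u_{p[j]}||^2 for every song s
    logits -= logits.max()                          # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return rng.choice(len(U), p=p)                  # sample p[j+1] ~ P(s | p[j])
```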

Demo

What About New Songs?
Suppose we’ve trained U:
What if we add a new song s'?
– No playlists created by users yet…
– Only options: u_{s'} = 0 or u_{s'} = random
Both are terrible!

Song & Tag Embedding Songs are usually added with tags – E.g., indie rock, country – Treat as features or attributes of songs How to leverage tags to generate a reasonable embedding of new songs? – Learn an embedding of tags as well! Lecture 14: Embeddings41

Song & Tag Embedding (continued)
Training set: songs S, playlists D, and tags T_s for each song s
Learning Objective:
– Same playlist likelihood term as before
– Song embedding ≈ average of tag embeddings: u_s ≈ (1/|T_s|) Σ_{t∈T_s} A_t
Solve using gradient descent
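One plausible way to write this combined objective down (my assumption about the exact form, not necessarily the paper's): the same playlist negative log likelihood as in the earlier sketch, plus a penalty tying each song embedding to the average of its tag embeddings, with a hypothetical weight `lam`.

```python
import numpy as np

def tag_tie_penalty(U, A, song_tags):
    """U: (|S|, K) song embeddings, A: (|T|, K) tag embeddings,
    song_tags[s]: list of tag ids attached to song s (assumed non-empty)."""
    return sum(np.sum((U[s] - np.mean(A[tags], axis=0)) ** 2)
               for s, tags in enumerate(song_tags))

def objective(U, A, playlists, song_tags, lam=1.0):
    # neg_log_likelihood is the single point sketch from the earlier slide.
    return neg_log_likelihood(U, playlists) + lam * tag_tie_penalty(U, A, song_tags)
```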

Visualization in 2D
(Figure: 2D embedding of songs and tags)

Revisited: What About New Songs?
No user has added s' to a playlist yet
– So there is no evidence from the playlist training data: s' does not appear in any training playlist
Assume the new song has been tagged with T_{s'}
– Then u_{s'} = the average of A_t for the tags t in T_{s'}
– This is the implication from the objective: with no playlist term for s', only the tag term matters
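A minimal sketch of this cold-start rule (names mine): the new song's embedding is simply the mean of its tags' learned embeddings.

```python
import numpy as np

def embed_new_song(A, tag_ids):
    """A: (|T|, K) learned tag embeddings; tag_ids: the tags T_{s'} attached to the new song."""
    return np.mean(A[tag_ids], axis=0)   # u_{s'} = average of A_t for t in T_{s'}
```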

Recap: Embeddings
Learn a low-dimensional representation of items U
Capture semantics using the distance between items u, u'
Can be easier to visualize than latent factor models

Next Lecture
Recent Applications of Latent Factor Models
– Low-rank Spatial Model for Basketball Play Prediction
– Low-rank Tensor Model for Collaborative Clustering
Miniproject 1 report due Thursday.