Machine Learning & Data Mining CS/CNS/EE 155 Lecture 14: Embeddings
Announcements
Kaggle Miniproject is closed
– Report due Thursday
Public Leaderboard
– How well you think you did
Private Leaderboard now viewable
– How well you actually did
Last Week
Dimensionality Reduction
Clustering
Latent Factor Models
– Learn low-dimensional representation of data
This Lecture
Embeddings
– Alternative form of dimensionality reduction
Locally Linear Embeddings
Markov Embeddings
Embedding
Learn a representation U
– Each column u corresponds to a data point
Semantics encoded via d(u,u')
– Distance between points
Locally Linear Embedding
Given: data points x_i (Unsupervised Learning)
Learn U such that local linearity is preserved
– Lower dimensional than x
– “Manifold Learning”
Any neighborhood looks like a linear plane
[Figure: the x's mapped to the u's]
Locally Linear Embedding
Create B(i)
– B nearest neighbors of x_i
– Assumption: B(i) is approximately linear
– x_i can be written as a convex combination of x_j in B(i)
[Figure: point x_i and its neighborhood B(i)]
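A minimal sketch of the neighbor-finding step, assuming Euclidean distance and brute-force search (the slides specify neither). The helper name `nearest_neighbors` is hypothetical, and the data is stored one point per row for convenience, whereas the slides place points in columns.

```python
import numpy as np

def nearest_neighbors(X, B):
    """Return an (N, B) array whose row i holds the indices of the
    B nearest neighbors of X[i] (Euclidean distance, excluding i itself)."""
    # Squared pairwise distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :B]  # indices of the B smallest distances

# Hypothetical toy data: 5 points on a line in 2D.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
neighbors = nearest_neighbors(X, B=2)
```

Brute force costs O(N^2) memory and time; for large N a spatial index would likely be preferable.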
Locally Linear Embedding
Given neighbors B(i), solve local linear approximation W:
Locally Linear Embedding
Every x_i is approximated as a convex combination of neighbors
– How to solve?
Given neighbors B(i), solve local linear approximation W:
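A sketch of the objective in its standard form for locally linear embedding. This form is assumed here, and N (the number of data points) is a symbol introduced for illustration.

```latex
\[
W = \operatorname*{argmin}_{W} \sum_{i=1}^{N} \Bigl\| x_i - \sum_{j \in B(i)} W_{ij}\, x_j \Bigr\|^2
\quad \text{s.t.} \quad \sum_{j \in B(i)} W_{ij} = 1 \;\; \forall i .
\]
```

With only the sum-to-one constraint the weights may be negative, so strictly speaking the combination is affine. Adding W_ij ≥ 0 would make it convex in the strict sense.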
Lagrange Multipliers
Solutions tend to be at corners!
Solving Locally Linear Approximation
Lagrangian:
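A worked version of the Lagrangian step, assuming the standard sum-to-one constrained least-squares formulation of LLE. The symbols w (weights of one point x_i over B(i)), C (local Gram matrix), and λ (multiplier) are introduced here for illustration.

```latex
\begin{align*}
C_{jk} &= (x_i - x_j)^\top (x_i - x_k), \qquad j,k \in B(i) \\
\Bigl\| x_i - \sum_{j \in B(i)} w_j x_j \Bigr\|^2 &= w^\top C\, w
  \quad \text{(using } \textstyle\sum_j w_j = 1\text{)} \\
\mathcal{L}(w, \lambda) &= w^\top C\, w - \lambda \bigl(\mathbf{1}^\top w - 1\bigr) \\
\nabla_w \mathcal{L} = 0 \;&\Rightarrow\; 2 C w = \lambda \mathbf{1}
  \;\Rightarrow\; w = \frac{C^{-1}\mathbf{1}}{\mathbf{1}^\top C^{-1}\mathbf{1}}
\end{align*}
```

The last step requires C to be invertible. When B exceeds the input dimension C is singular, and a small ridge term is commonly added before solving.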
Locally Linear Approximation
Invariant to:
– Rotation
– Scaling
– Translation
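The invariances can be checked numerically. The sketch below assumes the closed-form weights w = C^{-1}1 / (1^T C^{-1}1) from the standard LLE formulation. The helper `local_weights` and its ridge parameter `reg` are hypothetical additions for numerical stability, not part of the slides.

```python
import numpy as np

def local_weights(x, neighbors, reg=1e-9):
    """Sum-to-one reconstruction weights of x from its neighbors (one per row)."""
    Z = x - neighbors                           # (B, D) differences x - x_j
    C = Z @ Z.T                                 # local Gram matrix
    C = C + reg * np.trace(C) * np.eye(len(C))  # small ridge (an assumption)
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                          # enforce sum_j w_j = 1

rng = np.random.default_rng(0)
x = rng.normal(size=3)
nbrs = rng.normal(size=(3, 3))                  # 3 neighbors in 3 dimensions

# Random orthogonal matrix (a rotation, possibly with a reflection),
# plus a uniform scale and a shift applied to every point.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
scale, shift = 2.5, rng.normal(size=3)

w_original = local_weights(x, nbrs)
w_transformed = local_weights(scale * (x @ Q.T) + shift,
                              scale * (nbrs @ Q.T) + shift)
```

Translation cancels in the differences x - x_j, rotation leaves the Gram matrix unchanged, and scaling multiplies C by a constant that the final normalization removes, so the two weight vectors should agree up to floating-point error.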
Story So Far: Locally Linear Embeddings
Locally Linear Approximation
Given neighbors B(i), solve local linear approximation W:
Solution via Lagrange Multipliers:
Recall: Locally Linear Embedding
Given: data points x_i
Learn U such that local linearity is preserved
– Lower dimensional than x
– “Manifold Learning”
[Figure: the x's mapped to the u's]
Dimensionality Reduction
Given local approximation W, learn lower dimensional representation:
Find low dimensional U
– Preserves approximate local linearity
[Figure: the x's mapped to the u's; neighborhood represented by W_{i,*}]
Given local approximation W, learn lower dimensional representation:
Rewrite as:
M is symmetric positive semidefinite
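A sketch of the rewrite, assuming the standard LLE embedding objective with the columns u_i of U as embedded points (so U is K × N) and the constraint U U^T = I_K. The trace form and the definition of M below follow from that assumption.

```latex
\begin{align*}
\sum_{i=1}^{N} \Bigl\| u_i - \sum_{j} W_{ij}\, u_j \Bigr\|^2
  &= \bigl\| U (I_N - W^\top) \bigr\|_F^2 \\
  &= \operatorname{tr}\!\bigl( U (I_N - W^\top)(I_N - W)\, U^\top \bigr)
   = \operatorname{tr}\!\bigl( U M U^\top \bigr), \\
M &= (I_N - W)^\top (I_N - W).
\end{align*}
```

M is symmetric by construction, and it is positive semidefinite because z^T M z = ||(I_N - W) z||^2 ≥ 0 for every z.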
Suppose K=1
Given local approximation W, learn lower dimensional representation:
By min-max theorem
– u = principal eigenvector of M^+ (the pseudoinverse of M)
Recap: Principal Component Analysis
Each column of V is an Eigenvector
Each λ is an Eigenvalue (λ_1 ≥ λ_2 ≥ …)
Given local approximation W, learn lower dimensional representation:
K=1:
– u = principal eigenvector of M^+
– u = smallest non-trivial eigenvector of M
  Corresponds to smallest non-zero eigenvalue
General K
– U = top K principal eigenvectors of M^+
– U = bottom K non-trivial eigenvectors of M
  Corresponds to bottom K non-zero eigenvalues
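A minimal sketch of the eigenvector step, assuming M = (I - W)^T (I - W) as in the standard LLE formulation and exactly one trivial eigenvector (the constant vector, which has eigenvalue 0 when every row of W sums to one). The helper name and the ring-shaped toy weights are hypothetical.

```python
import numpy as np

def embed_from_weights(W, K):
    """Return a (K, N) embedding U from an (N, N) weight matrix W
    whose rows sum to one, using the bottom non-trivial eigenvectors of M."""
    N = W.shape[0]
    I = np.eye(N)
    M = (I - W).T @ (I - W)               # symmetric positive semidefinite
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    # Skip index 0: the constant vector, with eigenvalue ~0.
    U = eigvecs[:, 1:K + 1].T
    return U, eigvals

# Hypothetical weights: points on a ring, each the average of its two ring neighbors.
N = 8
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = 0.5
    W[i, (i + 1) % N] = 0.5

U, eigvals = embed_from_weights(W, K=2)
```

Working with the smallest eigenvectors of M avoids forming the pseudoinverse M^+ explicitly. If the neighborhood graph is disconnected, M has several zero eigenvalues and skipping only the first would not be enough.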
Recap: Locally Linear Embedding
Generate nearest neighbors of each x_i, B(i)
Compute Local Linear Approximation:
Compute low dimensional embedding
Results for Different Neighborhoods
[Figure: true distribution, 2000 samples, and embeddings for B=3, B=6, B=9, B=12]
Embeddings vs Latent Factor Models
Both define low-dimensional representation
Embeddings preserve distance:
Latent factor models preserve inner product:
Relationship:
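The standard identity linking squared Euclidean distance to the inner product, which is presumably the relationship this slide refers to:

```latex
\[
\| u - u' \|^2 = \| u \|^2 + \| u' \|^2 - 2\, u^\top u'
\]
```

When all vectors have the same norm, a larger inner product is therefore equivalent to a smaller distance. Otherwise the two notions of similarity can rank pairs differently.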
Visualization Semantics
Latent Factor Model
– Similarity measured via dot product
– Rotational semantics
– Can interpret axes
– Can only visualize 2 axes at a time
Embedding
– Similarity measured via distance
– Clustering/locality semantics
– Cannot interpret axes
– Can visualize many clusters simultaneously
Latent Markov Embeddings
Latent Markov Embeddings
Locally Linear Embedding is conventional unsupervised learning
– Given raw features x_i
– I.e., find low-dimensional U that preserves approximate local linearity
Latent Markov Embedding is a feature learning problem
– E.g., learn low-dimensional U that captures user-generated feedback
Playlist Embedding
Users generate song playlists
– Treat as training data
Can we learn a probabilistic model of playlists?
Probabilistic Markov Modeling
Training set: Songs, Playlists, Playlist Definition
Goal: Learn a probabilistic Markov model of playlists:
What is the form of P?
First Try: Probability Tables
[Table: P(s|s') indexed by songs s_1, …, s_7 and s_start; entries not preserved]
#Parameters = O(|S|^2) !!!
Second Try: Hidden Markov Models
#Parameters = O(K^2) for the hidden-state transitions
#Parameters = O(|S|K) for the emissions P(s|z)
Total = O(K^2) + O(|S|K)
Problem with Hidden Markov Models
Need to reliably estimate P(s|z)
Lots of “missing values” in this training set
Latent Markov Embedding
“Log-Radial” function
– (my own terminology)
u_s: entry point of song s
v_s: exit point of song s
Log-Radial Functions
Each ring defines an equivalence class of transition probabilities
2K parameters per song
2|S|K parameters total
[Figure: rings centered on the exit point v_s', with entry points u_s and u_s'' marked]
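A minimal sketch of a log-radial transition model, assuming the form P(s | s') ∝ exp(-||u_s - v_s'||^2), which matches the ring picture (equal distance from v_s' gives equal probability) but is an assumption rather than a formula taken from the slide. The function name and toy sizes are hypothetical, and songs are stored one per row.

```python
import numpy as np

def transition_probs(U, V):
    """P[s_prev, s_next] = exp(-||u_next - v_prev||^2) / Z(s_prev).

    U: (S, K) entry points, V: (S, K) exit points (one song per row)."""
    diff = V[:, None, :] - U[None, :, :]        # (S, S, K): v_prev - u_next
    sq_dist = np.sum(diff**2, axis=2)
    logits = -sq_dist
    logits -= logits.max(axis=1, keepdims=True) # stabilize the exponentials
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)     # normalize over next songs

rng = np.random.default_rng(0)
S, K = 5, 2                                     # hypothetical sizes
U = rng.normal(size=(S, K))
V = rng.normal(size=(S, K))
P = transition_probs(U, V)
num_params = U.size + V.size                    # 2|S|K parameters in total
```

Each row of P is a distribution over next songs, and the most likely next song is the one whose entry point lies closest to the current song's exit point.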
Learning Problem
Songs, Playlists, Playlist Definition
Learning Goal:
Minimize Neg Log Likelihood
Solve using gradient descent
– Homework question: derive the gradient formula
– Random initialization
Normalization constant hard to compute:
– Approximation heuristics
  See paper
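A minimal sketch of the negative log likelihood itself, assuming P(s | s') ∝ exp(-||u_s - v_s'||^2) and omitting each playlist's start transition for brevity. The function name, sizes, and playlists are hypothetical, and the gradient is deliberately not shown.

```python
import numpy as np

def neg_log_likelihood(U, V, playlists):
    """Sum of -log P(next | prev) over consecutive song pairs."""
    sq_dist = np.sum((V[:, None, :] - U[None, :, :])**2, axis=2)  # [prev, next]
    logits = -sq_dist
    m = logits.max(axis=1, keepdims=True)
    # Log of the normalization constant, one value per previous song.
    log_Z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    log_P = logits - log_Z[:, None]
    nll = 0.0
    for p in playlists:
        for prev, nxt in zip(p[:-1], p[1:]):
            nll -= log_P[prev, nxt]
    return nll

rng = np.random.default_rng(0)
S, K = 4, 2
U = 0.1 * rng.normal(size=(S, K))      # random initialization
V = 0.1 * rng.normal(size=(S, K))
playlists = [[0, 1, 2], [2, 3, 0, 1]]  # hypothetical playlists as song indices
nll = neg_log_likelihood(U, V, playlists)
```

Here the normalization constant is computed exactly, which costs O(|S|) per previous song. That is fine for a toy example, but it is the term that becomes expensive for a large song catalog.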
Simpler Version
Dual point model:
Single point model:
– Transitions are symmetric (almost)
– Exact same form of training problem
Visualization in 2D
Simpler version: Single Point Model
Single point model is easier to visualize
Sampling New Playlists
Given partial playlist:
Generate next song for playlist p_{j+1}
– Sample according to:
Dual Point Model
Single Point Model
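A minimal sketch of the sampling step, assuming transition probabilities proportional to exp(-squared distance) between the candidate's entry point and the current song's exit point. The function name and the toy embedding are hypothetical. Passing the same array for both arguments gives the single point model.

```python
import numpy as np

def sample_next_song(current, U, V, rng):
    """Sample the next song index given the current (last) song.

    Dual point model: pass distinct U (entry) and V (exit) arrays.
    Single point model: pass the same array for both."""
    sq_dist = np.sum((U - V[current])**2, axis=1)
    logits = -sq_dist
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(U), p=probs), probs

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))              # hypothetical single-point embedding
playlist = [3]
for _ in range(4):                       # extend a partial playlist by 4 songs
    nxt, probs = sample_next_song(playlist[-1], X, X, rng)
    playlist.append(int(nxt))
```

In the single point model a song is at distance zero from itself, so this sketch can repeat the current song. A practical sampler might mask out songs already in the playlist.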
Demo
What About New Songs?
Suppose we've trained U:
What if we add a new song s'?
– No playlists created by users yet…
– Only options: u_s' = 0 or u_s' = random
Both are terrible!
Song & Tag Embedding
Songs are usually added with tags
– E.g., indie rock, country
– Treat as features or attributes of songs
How to leverage tags to generate a reasonable embedding of new songs?
– Learn an embedding of tags as well!
Songs, Playlists, Playlist Definition, Tags for Each Song
Learning Objective:
– Same term as before:
– Song embedding ≈ average of tag embeddings:
Solve using gradient descent:
Visualization in 2D
Revisited: What About New Songs?
No user has yet added s' to a playlist
– So no evidence from playlist training data: s' does not appear in any training playlist
Assume new song has been tagged T_s'
– Then u_s' = average of A_t for tags t in T_s'
– Implication from objective:
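A minimal sketch of embedding a new song from its tags, using the average of tag embeddings described above. Since s' contributes nothing to the playlist likelihood, only the term tying u_s' to its tag average remains, and the average minimizes it. The function name and the tag vectors are hypothetical.

```python
import numpy as np

def embed_new_song(tags, tag_embeddings):
    """Embed a song with no playlist data as the mean of its tag embeddings."""
    return np.mean([tag_embeddings[t] for t in tags], axis=0)

# Hypothetical learned tag embeddings A_t in 2D.
A = {
    "indie rock": np.array([1.0, 0.0]),
    "country":    np.array([0.0, 2.0]),
    "folk":       np.array([-1.0, 1.0]),
}
u_new = embed_new_song(["indie rock", "country"], A)
```

Unlike u_s' = 0 or a random vector, this places the new song near existing songs that share its tags.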
Recap: Embeddings
Learn a low-dimensional representation of items U
Capture semantics using distance between items u, u'
Can be easier to visualize than latent factor models
Next Lecture
Recent Applications of Latent Factor Models
Low-rank Spatial Model for Basketball Play Prediction
Low-rank Tensor Model for Collaborative Clustering
Miniproject 1 report due Thursday.