1
Low-Rank Kernel Learning with Bregman Matrix Divergences
Brian Kulis, Matyas A. Sustik and Inderjit S. Dhillon
Journal of Machine Learning Research 10 (2009) 341-376
Presented by: Peng Zhang, 4/15/2011
2
Outline
Motivation
Major Contributions
Preliminaries
Algorithms
Discussions
Experiments
Conclusions
3
Motivation
Low-rank matrix nearness problems
–Learning low-rank positive semidefinite (kernel) matrices for machine learning applications
–Divergence (distance) measures between data objects
–Find divergence measures suited to such matrices, and efficient algorithms for them
Positive semidefinite (PSD), often low-rank, matrices are common in machine learning with kernel methods
–Current learning techniques enforce the positive semidefinite constraint explicitly, resulting in expensive computations
–Goal: bypass this explicit constraint by choosing divergences that enforce the PSD property automatically
4
Major Contributions
Goal
–Efficient algorithms that find a low-rank PSD (kernel) matrix as 'close' as possible to an input PSD matrix, subject to equality or inequality constraints
Proposals
–Use the LogDet divergence or the von Neumann divergence as the nearness measure in PSD matrix learning
–Use cyclic Bregman projections to handle the constraints
Properties of the proposed algorithms
–Range-space preserving: the rank of the output equals the rank of the input, so the PSD and rank constraints never have to be enforced explicitly
–Computationally efficient: running time per iteration is linear in the number of data points n and quadratic in the rank of the input kernel matrix
5
Preliminaries
Kernel methods
–Inner products in feature space
–The only information needed is the kernel matrix K, which is always PSD
–If K is low rank, it admits a factorization K = GG^T with G an n x r matrix (r = rank)
–Use this low-rank decomposition to improve computational efficiency (see the short sketch below)
Low-rank kernel matrix learning: learn such a K subject to constraints
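As a quick illustration of why the low-rank factorization matters, here is a minimal NumPy sketch (not from the paper; G is random stand-in data) showing that kernel entries can be read off the n x r factor without ever forming the n x n matrix K.

```python
import numpy as np

# With n points and rank r << n, storing G (n x r) needs O(nr) memory
# instead of the O(n^2) needed for the full kernel matrix K = G G^T,
# and a single kernel entry costs only O(r).
n, r = 1000, 20
G = np.random.randn(n, r)          # stand-in for a low-rank kernel factor

def kernel_entry(G, i, j):
    """K_ij = g_i^T g_j, computed directly from the factor G."""
    return G[i] @ G[j]

print(kernel_entry(G, 0, 1))
```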
6
Preliminaries
Bregman vector divergences
–For a strictly convex, differentiable function F: D_F(x, y) = F(x) - F(y) - (x - y)^T ∇F(y)
Extension to Bregman matrix divergences
–D_F(X, Y) = F(X) - F(Y) - tr(∇F(Y)^T (X - Y))
Intuitively, these can be thought of as the difference between the value of F at point x and the value of the first-order Taylor expansion of F around point y, evaluated at point x.
7
Preliminaries
Special Bregman matrix divergences
–The von Neumann divergence (D_vN), generated by F(X) = tr(X log X - X):
 D_vN(X, Y) = tr(X log X - X log Y - X + Y)
–The LogDet divergence (D_ld), generated by F(X) = -log det X:
 D_ld(X, Y) = tr(X Y^-1) - log det(X Y^-1) - n
These expressions assume full-rank (positive definite) matrices (a computational sketch follows below).
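A direct, unoptimized NumPy/SciPy sketch of the two divergences as defined above, useful as a sanity check. It assumes X and Y are symmetric positive definite so that both divergences are finite; the test matrices are arbitrary examples, not data from the paper, and this is not the paper's optimized low-rank computation.

```python
import numpy as np
from scipy.linalg import logm

def von_neumann_div(X, Y):
    """D_vN(X, Y) = tr(X log X - X log Y - X + Y)."""
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

def logdet_div(X, Y):
    """D_ld(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n."""
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logdet - n

A = np.random.randn(5, 5)
X = A @ A.T + np.eye(5)            # random positive definite test matrices
Y = X + 0.1 * np.eye(5)
print(von_neumann_div(X, Y), logdet_div(X, Y))
```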
8
Preliminaries
Important properties of D_vN and D_ld
–X ranges over positive definite matrices, so no explicit positive-definiteness constraint needs to be enforced during optimization
–Range-space preserving property
–Scale-invariance of LogDet
–Transformation invariance
–Others: allow going beyond the transductive setting, i.e., evaluating the learned kernel function on new data points
9
Preliminaries
Spectral Bregman matrix divergences
–The generating convex function is a function of the eigenvalues: F(X) = sum_i f(lambda_i), with f a convex scalar function and lambda_i the eigenvalues of X
–The Bregman matrix divergence can then be written in terms of the eigenvalues and eigenvectors of X = V Lambda V^T and Y = U Theta U^T:
 D_F(X, Y) = sum_{i,j} (v_i^T u_j)^2 (f(lambda_i) - f(theta_j) - f'(theta_j)(lambda_i - theta_j))
10
Preliminaries
Kernel matrix learning problem of this paper
–Learn a kernel matrix over all data points from side information (labels or constraints)
–The explicit rank constraint makes the problem non-convex in general
–With the LogDet or von Neumann divergence the rank constraint is enforced implicitly (the divergence is finite only when the range space of K matches, or is contained in, that of the input kernel), so it can be dropped from the formulation
–The constraints of interest bound squared Euclidean distances between points; each such constraint is linear in K, with a rank-one constraint matrix A (the generic formulation is written out below)
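For reference, the generic form of the problem described here, written in standard notation (the slides omit the formula, so the symbols below are the usual convention rather than a transcription):

```latex
\begin{aligned}
\min_{K}\quad & D_F(K, K_0) \\
\text{s.t.}\quad & \operatorname{tr}(K A_i) \le b_i, \quad i = 1,\dots,c, \\
                 & \operatorname{rank}(K) \le r, \qquad K \succeq 0.
\end{aligned}
```

For a squared-distance constraint between points i and j, A_i = (e_i - e_j)(e_i - e_j)^T, so tr(K A_i) = K_ii + K_jj - 2 K_ij; this is the rank-one form the slides refer to. When K_0 has rank r, the LogDet and von Neumann divergences keep the rank and PSD constraints satisfied automatically.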
11
Preliminaries
Bregman projections
–A method to solve the 'no rank constraint' version of the previous problem (a high-level skeleton follows below)
–Choose one constraint at a time
–Perform a Bregman projection so that the current solution satisfies that constraint
–Using the LogDet and von Neumann divergences, each projection can be computed efficiently
–Convergence is guaranteed, but may require many iterations
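A high-level skeleton of the cyclic projection scheme, with the per-constraint projection left abstract. The function names and the convergence test are illustrative choices, not the paper's; the actual algorithms additionally maintain dual correction variables for inequality constraints and work on a factored form of K.

```python
def cyclic_bregman_projections(K0, constraints, project, max_cycles=1000, tol=1e-6):
    """constraints: list of (A, b) pairs encoding tr(K A) <= b.
    project(K, A, b): returns the Bregman projection of K onto that constraint."""
    K = K0.copy()
    for cycle in range(max_cycles):
        max_change = 0.0
        for A, b in constraints:          # visit the constraints cyclically
            K_new = project(K, A, b)      # enforce one constraint at a time
            max_change = max(max_change, abs(K_new - K).max())
            K = K_new
        if max_change < tol:              # crude convergence check
            break
    return K
```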
12
Preliminaries
Bregman divergences for low-rank matrices
–Need to handle matrices with zero eigenvalues
–Infinite divergences can occur, because the generating functions involve log det X and X log X, which blow up on zero eigenvalues
–Requiring the divergence to be finite therefore restricts the range space: for LogDet the range spaces of X and Y must be equal, and for von Neumann the range of X must lie within the range of Y
–A finite divergence thus acts as an implicit rank constraint
13
Preliminaries
Rank-deficient LogDet and von Neumann divergences
–Both divergences extend to rank-deficient matrices through their spectral (eigenvalue/eigenvector) form, with the convention 0 log 0 = 0
Rank-deficient Bregman projections
–The corresponding projection updates never leave the range space of the current iterate; the LogDet projection has a closed form, while the von Neumann projection reduces to a one-dimensional root-finding problem
14
Algorithm Using LogDet
Cyclic projection algorithm using the LogDet divergence
–The projection update for a single (rank-one) constraint can be written in closed form
–It simplifies to a rank-one modification of the current iterate: the range space is unchanged and no eigendecomposition is required (a sketch of this projection follows below)
–The full-matrix update (Equation 21 of the paper) costs O(n^2) operations per iteration
Improving efficiency with a factored n x r matrix G, where X = GG^T
–The projection then reduces to a Cholesky rank-one update of an r x r factor, at O(r^3) cost
–A further refinement combines the Cholesky rank-one update with the matrix multiplication, bringing the cost down to O(r^2) per projection
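To make the rank-one structure concrete, here is a small sketch of a LogDet Bregman projection onto a single equality constraint z^T X z = b, in the unfactored O(n^2) form. The parameter formula comes from the Sherman-Morrison identity and is an illustration of the idea, not a line-by-line transcription of the paper's Equation 21.

```python
import numpy as np

def logdet_projection_equality(X, z, b):
    """Project a symmetric PD matrix X onto {X : z^T X z = b} under the
    LogDet divergence.  The projection has the rank-one form
        X_new = X + beta * (X z)(X z)^T,
    which leaves the range space of X unchanged.  Choosing
        beta = (b - p) / p^2,  with p = z^T X z,
    makes the constraint hold with equality."""
    Xz = X @ z
    p = z @ Xz
    beta = (b - p) / (p * p)
    return X + beta * np.outer(Xz, Xz)

# Tiny usage check on an arbitrary PD matrix (illustrative data only).
A = np.random.randn(4, 4)
X = A @ A.T + np.eye(4)
z = np.random.randn(4)
Xp = logdet_projection_equality(X, z, 1.0)
print(z @ Xp @ z)   # ~1.0
```

The paper's O(r^2) version performs the equivalent update on the factor G rather than on X itself.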
15
Algorithm Using LogDet G = LL T ; G = G 0 B; B is the product of all L matrices from every iteration and X 0 = G 0 G 0 T L can be determined implicitly
16
Algorithm Using LogDet
–The constraints enter only through their rank-one vectors and bounds; convergence is checked by how much the dual variables v change between cycles
–Cost per cycle: O(cr^2) for projecting onto all c constraints, plus O(nr^2) to recover the final factor G
–May require a large number of iterations
17
Algorithm Using von Neumann
Cyclic projection algorithm using the von Neumann divergence
–The projection update for a single constraint takes the matrix exponential of the matrix logarithm of the current iterate plus a multiple of the constraint matrix
–It can be modified to operate on a low-rank factorization, as in the LogDet case
–The projection parameter is found as the unique root of a one-dimensional function (a dense sketch follows below)
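A direct, dense, unoptimized sketch of a von Neumann projection onto a single equality constraint tr(X zz^T) = b, using SciPy's matrix exponential/logarithm and a 1-D root finder. The bracketing strategy and the use of brentq are illustrative choices; the paper instead works with a factored eigendecomposition for efficiency.

```python
import numpy as np
from scipy.linalg import expm, logm
from scipy.optimize import brentq

def von_neumann_projection_equality(X, z, b):
    """Project a symmetric PD matrix X onto {X : z^T X z = b} under the
    von Neumann divergence.  Since grad F(X) = log X, the projection is
        X(alpha) = expm(logm(X) + alpha * z z^T),
    with alpha the unique root of the monotone 1-D function
        g(alpha) = z^T X(alpha) z - b."""
    L = logm(X).real
    A = np.outer(z, z)

    def g(alpha):
        return z @ expm(L + alpha * A) @ z - b

    # Expand a bracket around 0 until g changes sign (illustrative strategy).
    lo, hi = -1.0, 1.0
    while g(lo) > 0:
        lo *= 2.0
    while g(hi) < 0:
        hi *= 2.0
    alpha = brentq(g, lo, hi)
    return expm(L + alpha * A)

A0 = np.random.randn(4, 4)
X = A0 @ A0.T + np.eye(4)
z = np.random.randn(4)
Xp = von_neumann_projection_equality(X, z, 1.0)
print(z @ Xp @ z)   # ~1.0
```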
18
Algorithm Using von Neumann
–Same overall structure as Algorithm 2, with O(r^2) cost per projection
–Slightly slower than Algorithm 2 in practice: the root finder for the projection parameter slows down each iteration
19
Discussions
Limitations of Algorithm 2 and Algorithm 3
–The initial kernel matrix must be low rank, so the methods are not applicable to dimensionality reduction
–The number of iterations may be large; this paper only optimizes the cost of each iteration, and reducing the total number of iterations is left as future work
Handling new data points
–Transductive setting: all data points are available up front and some of them carry labels or other supervision; when a new data point is added, the entire kernel matrix would have to be re-learned
–Workaround: view the learned factor B as a linear transformation and apply B to new points
20
Discussions
Generalizations to more constraints
–Slack variables: when the number of constraints is large there may be no feasible solution to the Bregman divergence minimization problem; introducing slack variables allows constraints to be violated at a penalty (one standard form of the objective is written out below)
–Works for both similarity constraints and distance constraints
–Cost remains O(r^2) per projection
–If arbitrary linear constraints are applied, a projection costs O(nr)
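One standard way to write the slack-variable version, following the closely related information-theoretic metric learning line of work; the slides do not give the exact objective, so treat the γ-weighted form below as an assumption rather than a transcription of this paper:

```latex
\begin{aligned}
\min_{K,\;\xi}\quad & D_F(K, K_0) \;+\; \gamma\, D_F\!\big(\operatorname{diag}(\xi), \operatorname{diag}(\xi_0)\big) \\
\text{s.t.}\quad & \operatorname{tr}(K A_i) \le \xi_i, \quad i = 1,\dots,c,
\end{aligned}
```

where ξ_0 is initialized to the original bounds b_i and γ trades off constraint satisfaction against staying close to K_0.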
21
Discussions
Special cases of the framework
–The DefiniteBoost optimization problem
–Online PCA
–The nearest correlation matrix problem
Minimizing the LogDet divergence and semidefinite programming (SDP)
–The SDP relaxation of the min-balanced-cut problem can be solved by LogDet divergence minimization
22
Experiments
Transductive learning and clustering
Data sets
–Digits: handwritten samples of digits 3, 8 and 9 from the UCI repository
–GyrB: protein data set with three bacteria species
–Spambase: 4601 email messages with 57 attributes and spam/not-spam labels
–Nursery: 12960 instances with 8 attributes and 5 class labels
Classification
–k-nearest neighbor classifier
Clustering
–Kernel k-means algorithm, evaluated with the normalized mutual information (NMI) measure
23
Experiments
Learning a kernel matrix using only constraints
–Low-rank kernels learned by the proposed algorithms attain accurate clustering and classification
–The original data are used only to form the initial kernel matrix
–The more constraints used, the more accurate the results
Convergence
–von Neumann divergence: convergence was attained in 11 cycles for 30 constraints and 105 cycles for 420 constraints
–LogDet divergence: between 17 and 354 cycles
24
Simulation Results
–Significant improvements: for comparison, DefiniteBoost needed 3220 cycles to reach convergence, with 0.948 classification accuracy
25
Simulation Results
–LogDet needs fewer constraints
–LogDet converges much more slowly (reducing this is future work), but its overall running time is often lower
–(Results shown for kernels of rank 57 and rank 8.)
26
Simulation Results
Metric learning and large-scale experiments
–Learning a low-rank kernel with the same range space is equivalent to learning a linear transformation of the input data
–The proposed algorithms are compared with metric learning algorithms: metric learning by collapsing classes (MCML), large-margin nearest neighbor metric learning (LMNN), and a squared Euclidean distance baseline
27
Conclusions
Developed LogDet- and von Neumann-divergence-based algorithms for low-rank matrix nearness problems
Running times are linear in the number of data points and quadratic in the rank of the kernel
The algorithms can be used in conjunction with a number of kernel-based learning algorithms
28
Thank you