Sparse Solutions for Large Scale Kernel Machines Taher Dameh CMPT820-Multimedia Systems Dec 2nd, 2010

Outline
Introduction
Motivation: Kernel machines applications in multimedia content analysis and search
Challenges in large scale kernel machines
Previous Work
Sub-quadratic approach to compute the sparse Gram matrix
Results
Conclusion and future work

Introduction
Given a set of points, with a notion of distance between points, group the points into some number of clusters.
We use kernel functions to compute the similarity between each pair of points to produce a similarity (Gram) matrix (O(N²) space and computation).
Examples of kernel machines:
 Support Vector Machines (SVM, formulated for 2 classes)
 Relevance Vector Machines (result in much sparser models)
 Gaussian Processes
 Fisher's Linear Discriminant Analysis (LDA)
 Kernel PCA
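To make the quadratic cost concrete, the following is a minimal sketch of building a dense RBF (Gaussian) Gram matrix; the data, the kernel width gamma, and the function name are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def rbf_gram_matrix(X, gamma=1.0):
    """Dense Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).

    Requires O(N^2) memory and O(N^2 * d) time, which is the
    bottleneck this work tries to avoid for large N.
    """
    sq_norms = np.sum(X ** 2, axis=1)                       # ||x_i||^2, shape (N,)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)                 # guard tiny negatives
    return np.exp(-gamma * sq_dists)

# Example: 1,000 points in 16 dimensions -> a 1,000 x 1,000 matrix.
X = np.random.rand(1000, 16)
K = rbf_gram_matrix(X, gamma=0.5)
print(K.shape)  # (1000, 1000)
```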

Kernel machines applications in multimedia content analysis and search
Broadcast video summarization using clustering
Document clustering
Audio content discovery
Searching one billion web images by content

Challenges and Sparse Solutions for Kernel Machines
One significant limitation of many kernel methods is that the kernel function k(x, y) must be evaluated for all possible pairs x and y of training points, which can be computationally infeasible.
Traditional algorithm analysis assumes that the data fits in main memory; it is unreasonable to make such an assumption when dealing with massive data sets such as multimedia data, web page repositories, and so on.
Observing that these kernel machines use radial basis functions, the Gram matrices have many values that are close to zero.
We are developing algorithms to approximate the Gram matrix by a sparse one (filtering out the small similarities).
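A minimal sketch of the sparsification idea, assuming a small threshold and scipy's CSR format (both illustrative choices): entries of an RBF Gram matrix below the threshold are dropped and the rest stored sparsely. Note this naive version still pays the full quadratic cost to build the matrix first, which is exactly what the LSH approach below avoids.

```python
import numpy as np
from scipy import sparse

def sparsify_gram(K, threshold=1e-3):
    """Zero out near-zero similarities and store the result sparsely."""
    K_thresholded = np.where(K >= threshold, K, 0.0)
    return sparse.csr_matrix(K_thresholded)

# Toy example: an RBF Gram matrix of spread-out points is mostly near zero.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 8))
sq = np.sum(X ** 2, axis=1)
K = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
K_sparse = sparsify_gram(K)
print(f"non-zeros kept: {K_sparse.nnz} of {K.size}")
```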

Previous Work
Approximation depending on the eigenspectrum of the Gram matrix
 The eigenspectrum decays rapidly, especially when the kernel function is a radial basis function (most information is stored in the first few eigenvectors)
Sparse Bayesian learning
 Methods that lead to much sparser models
 Relevance vector machines (RVM)
 Sparse kernel principal component analysis (sparse KPCA)
Efficient implementation of computing the kernel function
 Space filling curves
 Locality Sensitive Hashing (OUR method)

Locality Sensitive Hashing
Hash the data points so that the probability of collision is higher for close points.
A family H = {h : S → U} is called (r1, r2, p1, p2)-sensitive if, for any v, q ∈ S:
 dist(v, q) < r1 → Prob_H[h(v) = h(q)] ≥ p1
 dist(v, q) > r2 → Prob_H[h(v) = h(q)] ≤ p2
 where p1 > p2 and r1 < r2
 We need the gap between p1 and p2 to be quite large.
For a proper choice of k (shown later), we concatenate hash functions: g(v) = {h1(v), …, hk(v)}.
We compute the kernel function only between the points that reside in the same bucket.
 Using this approach, and for a hash table of size m (assuming the buckets have the same number of points), computing the Gram matrix has complexity N²/m.
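A minimal sketch of this bucketing idea, assuming a quantized random-projection hash family for Euclidean distance (the bin width w, k, and all function names are illustrative, not the slides' exact construction): points are hashed by k concatenated hash values and kernel values are computed only within buckets.

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(X, k=8, w=4.0, seed=0):
    """Assign each point to a bucket keyed by k concatenated hash values.

    Each hash is a quantized random projection: h(v) = floor((a . v + b) / w),
    a p-stable-style family for Euclidean distance.
    """
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((k, d))           # one random direction per hash
    b = rng.uniform(0, w, size=k)             # random offsets
    keys = np.floor((X @ A.T + b) / w).astype(int)
    buckets = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)
    return buckets

def sparse_gram_entries(X, buckets, gamma=1.0):
    """Kernel values only for pairs sharing a bucket (the sparse Gram matrix)."""
    entries = {}
    for idx in buckets.values():
        for pos, i in enumerate(idx):
            for j in idx[pos + 1:]:
                d2 = np.sum((X[i] - X[j]) ** 2)
                entries[(i, j)] = np.exp(-gamma * d2)
    return entries

X = np.random.default_rng(1).normal(size=(2000, 16))
buckets = lsh_buckets(X, k=8)
entries = sparse_gram_entries(X, buckets)
print(len(buckets), "buckets,", len(entries), "kernel evaluations")
```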

Sub-quadratic approach using LSH
Claim 1: The number of concatenated hash values k is logarithmic in the dataset size n and independent of the dimension d.
Proof: Given a set P of n points in d-dimensional space and an (r1, r2, p1, p2)-sensitive hash family, and given a point q, the probability that a far point (dist > r2) collides with q on all k concatenated hashes is at most p2^k. Setting p2^k = B/n, where B is the average bucket size, we can then find that:
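The final formula appeared as a slide figure that did not survive the transcript; the following is a reconstruction from the slide's own quantities (n points, average bucket size B, far-collision probability p2), not the original derivation.

```latex
\Pr\big[g(p)=g(q)\,\big|\,\mathrm{dist}(p,q)>r_2\big] \le p_2^{\,k},
\qquad
p_2^{\,k} = \frac{B}{n}
\;\Longrightarrow\;
k = \frac{\log(n/B)}{\log(1/p_2)} = O(\log n)
```

which depends only on n and B, not on the dimension d.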

Claim 2: The complexity of computing the approximated Gram matrix using locality sensitive hashing is sub-quadratic.
Proof:
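The proof itself was a slide figure and is missing from the transcript; a sketch consistent with the N²/m count given on the LSH slide, under its equal-bucket assumption, is:

```latex
\underbrace{m}_{\text{buckets}} \times \underbrace{\left(\tfrac{N}{m}\right)^{2}}_{\text{pairs per bucket}}
\;=\; \frac{N^{2}}{m}
```

and with k = O(log n) chosen as in Claim 1 so that the average bucket size B = N/m stays small (constant or polylogarithmic), the total cost N·B is sub-quadratic, plus O(Nkd) for hashing the points.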

FN ratio vs. memory reduction for different values of k

Affinity Propagation results for different values of k

Second stage of AP over the first stage weighted exemplars

Pipeline overview (flow diagram):
N×d input vectors → m segments, each of size (N/m)×d
→ Hashing: segments are hashed into L bucket files
→ Clustering: compute the Gram matrix of each bucket (each Gram matrix has size (N/L)²) and run the clustering algorithm on each bucket's Gram matrix
→ Combine clusters with weights (clusters with weights)
→ Run second phase of clustering → Final clusters
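A sketch of this two-stage pipeline, assuming a generic Gram-matrix clustering routine (e.g. Affinity Propagation) supplied as cluster_fn, in-memory buckets instead of bucket files, and cluster means standing in for exemplars; all names and signatures here are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def rbf_gram(P, gamma=1.0):
    sq = np.sum(P ** 2, axis=1)
    return np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2 * P @ P.T, 0.0))

def two_stage_pipeline(X, hash_fn, cluster_fn, gamma=1.0):
    """Two-stage clustering over LSH buckets (sketch).

    Stage 1: hash points into buckets, build each bucket's (N/L)^2 Gram
             matrix, and cluster each bucket independently (parallelizable).
    Stage 2: cluster the stage-1 exemplars, carrying cluster sizes as weights.
    cluster_fn(K) is any Gram-matrix clustering routine returning a list of
    member-index lists.
    """
    buckets = defaultdict(list)
    for i, x in enumerate(X):
        buckets[hash_fn(x)].append(i)

    exemplars, weights = [], []
    for idx in buckets.values():
        pts = X[np.array(idx)]
        for members in cluster_fn(rbf_gram(pts, gamma)):
            exemplars.append(pts[members].mean(axis=0))   # stand-in exemplar
            weights.append(len(members))                  # weight = cluster size

    E = np.array(exemplars)
    final_clusters = cluster_fn(rbf_gram(E, gamma))       # second phase
    return final_clusters, np.array(weights)
```

Because each bucket's Gram matrix is built and clustered independently, stage 1 can be distributed with the bucket as the unit of work, as stated in the conclusion.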

Conclusion and future work
Brute force kernel methods require O(N²) space and computation, where the assumption that data fits in main memory no longer holds.
Approximating the full Gram matrix by a sparse one, relying on the radial basis property of such methods, reduces this quadratic cost to sub-quadratic.
Using locality sensitive hashing we can find the close points and compute the kernel function only between them; we can also distribute the processing, since the bucket becomes the base unit of work.
Future work: control the error as k increases, so that we can run on very large scale data while maintaining sufficient accuracy.