1 Comparison of Principal Component Analysis and Random Projection in Text Mining Steve Vincent April 29, 2004 INFS 795 Dr. Domeniconi

2 Outline Introduction Previous Work Objective Background on Principal Component Analysis (PCA) and Random Projection (RP) Test Data Sets Experimental Design Experimental Results Future Work

3 Introduction “Random projection in dimensionality reduction: Applications to image and text data” from KDD 2001, by Bingham and Mannila, compared principal component analysis (PCA) to random projection (RP) for text and image data. For future work, they said: “A still more realistic application of random projection would be to use it in a data mining problem.”

4 Previous Work In 2001, Bingham and Mannila compared PCA to RP for images and text. In 2001, Torkkola discussed both Latent Semantic Indexing (LSI) and RP for classifying text at very low dimension levels; LSI is very similar to PCA for text data; used the Reuters database. In 2003, Fradkin and Madigan discussed the background of RP. In 2003, Lin and Gunopulos combined LSI with RP. None of these made a real data mining comparison between the two.

5 Objective Principal Component Analysis (PCA): find components that make the projections uncorrelated by selecting the eigenvectors of the covariance matrix with the highest eigenvalues; maximizes retained variance. Random Projection (RP): project the data onto a lower-dimensional subspace by multiplying by a random matrix; minimizes computation for a particular dimension size. Goal: determine whether RP is a viable dimensionality reduction method.

6 Principal Component Analysis Normalize the input data, then center it by subtracting the mean; the result is the n x p matrix X used below. Compute the covariance matrix of X: $C = \frac{1}{n} X^{\top} X$. Compute the eigenvalues and eigenvectors of the covariance matrix. Arrange the eigenvectors in order of decreasing eigenvalue and take the first d eigenvectors as the principal components. Put these d eigenvectors as columns in a matrix M. Determine the reduced output by multiplying X by M: $E = XM$.
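To make the procedure above concrete, here is a minimal NumPy sketch of PCA by eigendecomposition of the covariance matrix. This is not the authors' Matlab code; the function and variable names are illustrative.

```python
import numpy as np

def pca_reduce(X, d):
    """Reduce the n x p data matrix X to n x d using PCA.

    Returns the reduced data E and the eigenvalues of the covariance
    matrix (useful later for computing retained variance).
    """
    n, p = X.shape
    X = X - X.mean(axis=0)                 # center the data
    C = (X.T @ X) / n                      # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    M = eigvecs[:, :d]                     # first d principal components
    E = X @ M                              # n x d reduced output
    return E, eigvals
```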

7 Random Projection With X being an n x p matrix, calculate the reduced output $E = XP$, with projection matrix P (p x q), where q is the number of reduced dimensions. P is a matrix with elements $r_{ij}$; the basic choice is $r_{ij}$ = random Gaussian. P can also be constructed in one of the following ways: $r_{ij} = \pm 1$, each with probability 1/2; or $r_{ij} = \sqrt{3}\,(\pm 1)$ with probability 1/6 each, and 0 with probability 2/3.
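A minimal sketch of random projection under the constructions described above (again NumPy rather than the original Matlab; the function name and "kind" labels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, q, kind="gaussian"):
    """Project the n x p matrix X down to n x q with a random matrix P."""
    p = X.shape[1]
    if kind == "gaussian":
        P = rng.standard_normal((p, q))
    elif kind == "sign":
        # +1 or -1, each with probability 1/2
        P = rng.choice([-1.0, 1.0], size=(p, q))
    else:
        # sparse construction: sqrt(3)*(+1/-1) w.p. 1/6 each, 0 w.p. 2/3
        P = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(p, q),
                                    p=[1/6, 2/3, 1/6])
    return X @ P   # n x q reduced output
```

Note that many formulations also scale P by 1/sqrt(q) so that distances are approximately preserved; the slides do not mention that factor, so the sketch omits it.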

8 SPAM Database SPAM database, generated June/July 1999. Task: determine whether an e-mail is spam or not. Previous tests have produced a 7% misclassification error. Source of data: Number of instances: 4,601 (1,813 spam = 39.4%)

9 SPAM Database Number of attributes: 58. Attributes: 48 = word frequency; 6 = character frequency; 1 = average length of uninterrupted sequences of capital letters; 1 = length of the longest uninterrupted sequence of capital letters; 1 = total length of uninterrupted sequences of capital letters; 1 = class (1 = spam, 0 = not spam).

10 Yahoo News Categories Introduced in “Impact of Similarity Measures on Web-Page Clustering” by Alexander Strehl, et al. Located at: ftp://ftp.cs.umn/dept/users/boley/PDDPdata/ The data consists of 2,340 documents in 20 Yahoo news categories. After stemming, the database contains 21,839 words. Strehl was able to reduce the number of words to 2,903 by selecting only those words that appear in 1% to 10% of all articles.

11 Yahoo News Categories Number of documents in each category (2,340 total):
Business 142            E: Online 65
Entertainment (E) 9     E: People 248
E: Art 24               E: Review 158
E: Cable 44             E: Stage 18
E: Culture 74           E: Television 187
E: Film 278             E: Variety 54
E: Industry 70          Health 494
E: Media 21             Politics 114
E: Multimedia 14        Sports 141
E: Music 125            Technology 60

12 Revised Yahoo News Categories The 15 Entertainment categories were combined into one category:
Business 142
Entertainment (Total) 1,389
Health 494
Politics 114
Sports 141
Technology 60

13 Yahoo News Characteristics With the various simplifications and revisions, the Yahoo News Database has the following characteristics: 2,340 documents, 2,903 words, 6 categories. Even with these simplifications and revisions, there are still too many attributes to do effective data mining.

14 Experimental Design Perform PCA and RP on each data set for a wide range of dimension numbers. Run RP multiple times due to the random nature of the algorithm. Determine relative times for each reduction. Compare PCA and RP results in various data mining techniques, including Naïve Bayes, Nearest Neighbor and Decision Trees. Determine relative times for each technique. Compare PCA and RP on time and accuracy.

15 Retained Variance Retained variance (r) is the percentage of the original variance that the PCA-reduced data set covers. The equation for this is $r = \left(\sum_{i=1}^{d} \lambda_i \,/\, \sum_{i=1}^{m} \lambda_i\right) \times 100\%$, where the $\lambda_i$ are the eigenvalues of the covariance matrix (sorted in decreasing order), m is the original number of dimensions, and d is the reduced number of dimensions. In many applications, r should be above 90%.
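A small sketch of the retained-variance computation, reusing the eigenvalues returned by the pca_reduce sketch above (the function name is illustrative, not from the original slides):

```python
import numpy as np

def retained_variance(eigvals, d):
    """Percentage of total variance captured by the top d eigenvalues."""
    eigvals = np.sort(eigvals)[::-1]           # largest first
    return 100.0 * eigvals[:d].sum() / eigvals.sum()

# Example: pick the smallest d that keeps at least 90% of the variance
# d = next(k for k in range(1, len(eigvals) + 1)
#          if retained_variance(eigvals, k) >= 90.0)
```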

16 Retained Variance [Chart: retained variance percentage vs. number of dimensions for the SPAM Database and the Yahoo News Database]

17 PCA and RP Time Comparison: SPAM Database RP was run 5 times for each dimension; times are in seconds. Reduction performed in Matlab on a Pentium III 1 GHz computer with 256 MB RAM. [Table: time of PCA divided by time of RP at each dimension.] RP averages over 10 times faster than PCA.

18 PCA and RP Time Comparison: Yahoo News Database RP was run 5 times for each dimension; times are in seconds. Reduction performed in Matlab on a Pentium III 1 GHz computer with 256 MB RAM. [Table: time of PCA divided by time of RP at each dimension.] RP averages over 100 times faster than PCA.
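As an illustration of how such a timing comparison might be set up (this is not the original Matlab scripts; pca_fn and rp_fn stand for reduction functions such as the sketches above):

```python
import time
import numpy as np

def time_reductions(X, dims, pca_fn, rp_fn, n_rp_runs=5):
    """Wall-clock time of a PCA reduction vs. an RP reduction per dimension."""
    for d in dims:
        t0 = time.perf_counter()
        pca_fn(X, d)
        t_pca = time.perf_counter() - t0

        rp_times = []
        for _ in range(n_rp_runs):          # RP is random, so average several runs
            t0 = time.perf_counter()
            rp_fn(X, d)
            rp_times.append(time.perf_counter() - t0)
        t_rp = float(np.mean(rp_times))
        print(f"d={d}: PCA {t_pca:.3f}s, RP {t_rp:.3f}s, ratio {t_pca / t_rp:.1f}x")
```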

19 Data Mining Explored various data mining techniques using the Weka software package. The following produced the best results: IB1 (Nearest Neighbor) and J48 (Decision Trees). The following produced poor results and will not be used: Naïve Bayes (overall poor results) and SVM (SMO in Weka; too slow, with results similar to the others).

20 Data Mining Procedures For each data set imported into Weka: convert the numerical categories to nominal values, randomize the order of the entries, run J48 and IB1 on the data, and determine the percent correct and check the F-measure statistics. PCA was run once for each dimension number and RP 5 times for each dimension number. Used a 67% training / 33% testing split; tested on 1,564 instances for SPAM and 796 for Yahoo.
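The slides use Weka's J48 and IB1. As a rough stand-in, here is a scikit-learn sketch of the same evaluation procedure (67/33 split, a decision tree and a 1-nearest-neighbor classifier, percent correct and F-measure). This is not the authors' Weka setup; the classifier choices are analogues and the names are illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier        # rough analogue of J48
from sklearn.neighbors import KNeighborsClassifier     # rough analogue of IB1
from sklearn.metrics import accuracy_score, f1_score

def evaluate(X_reduced, y, seed=0):
    """67%/33% split, then percent correct and F-measure for tree and 1-NN."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_reduced, y, test_size=0.33, random_state=seed, shuffle=True)
    results = {}
    for name, clf in [("tree", DecisionTreeClassifier(random_state=seed)),
                      ("1-nn", KNeighborsClassifier(n_neighbors=1))]:
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        results[name] = (100.0 * accuracy_score(y_te, pred),
                         f1_score(y_te, pred, average="weighted"))
    return results
```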

21 Results-J48 Spam Data PCA gave uniformly good results for all dimension levels, comparable to the 91.4% correct obtained on the full data set. RP was 15% below the full data set results. [Table: percent correct vs. dimension number]

22 Results-J48 Spam Data RP gave consistent results, with a very small split between maximum and minimum values. [Chart: % correct vs. dimension number]

23 Results-IB1 Spam Data PCA gave uniformly good results for all dimension levels, comparable to the 89.5% correct obtained on the full data set. RP was 10% below the full data set results. [Table: percent correct vs. dimension number]

24 Results-IB1 Spam Data RP gave consistent results, with a very small split between maximum and minimum values. [Chart: % correct vs. dimension number]

25 Results SPAM Data PCA gave consistent results for all dimension levels; we had expected the lower dimension levels not to perform as well. RP gave consistent, but lower, results for all dimension levels; again, we had expected the lower dimension levels not to perform as well.

26 Results-J48 Yahoo Data PCA gave uniformly good results for all dimension levels. RP was over 30% below the PCA results. [Table: percent correct vs. dimension number] Note: data mining was not run on the full data set due to the large dimension number.

27 Results-J48 Yahoo Data RP gave consistent results, with a very small split between maximum and minimum values, but the RP results were much lower than PCA. [Chart: % correct vs. dimension number]

28 Results-IB1 Yahoo Data PCA percent correct decreased as the dimension number increased. RP was 20% below PCA at low dimension numbers, with the gap decreasing to 0% at high dimension numbers. [Table: percent correct vs. dimension number] Note: data mining was not run on the full data set due to the large dimension number.

29 Results-IB1 Yahoo Data RP gave consistent results, with a very small split between maximum and minimum values; the RP results were similar to PCA at high dimension levels. [Chart: % correct vs. dimension number]

30 Results Yahoo Data PCA showed consistently high results for the Decision Tree output, but showed decreasing results at higher dimensions for the Nearest Neighbor output. This could be overfitting in the Nearest Neighbor case; the Decision Tree has pruning to prevent overfitting.

31 Results Yahoo Data RP showed consistent results for both Nearest Neighbor and Decision Trees. The lower dimension numbers gave slightly lower results, approximately 10-20% lower for dimension numbers less than 100. The Nearest Neighbor results were 20% higher than the Decision Tree results.

32 Overall Results RP gives consistent results, with few inconsistencies over multiple runs. In general, RP is one to two orders of magnitude (10 to 100 times) faster than PCA, but in most cases it produced lower accuracy. The RP results are closer to PCA when using the Nearest Neighbor data mining technique. This suggests using RP when speed of processing is most important.

33 Future Work Need to examine additional data sets to determine whether the results are consistent. Both PCA and RP are linear tools: they map the original data set using a linear mapping. Examine deriving PCA using SVD for speed. A more general comparison would include non-linear dimensionality reduction methods such as Kernel PCA and SVM.
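For reference on the SVD point above, PCA can be obtained from the SVD of the centered data matrix without forming the covariance matrix explicitly; a minimal sketch (not from the original slides, names illustrative):

```python
import numpy as np

def pca_via_svd(X, d):
    """PCA of the n x p matrix X via SVD of the centered data."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    M = Vt[:d].T                    # first d principal components (p x d)
    E = Xc @ M                      # n x d reduced output, equals U[:, :d] * S[:d]
    eigvals = (S ** 2) / n          # eigenvalues of the covariance matrix
    return E, eigvals
```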

34 References E. Bingham and H. Mannila, “Random projection in dimensionality reduction: Applications to image and text data”, KDD 2001. D. Fradkin and D. Madigan, “Experiments with Random Projections for Machine Learning”, SIGKDD '03, August 2003. J. Lin and D. Gunopulos, “Dimensionality Reduction by Random Projection and Latent Semantic Indexing”, Proceedings of the Text Mining Workshop at the 3rd SIAM International Conference on Data Mining, May 2003. K. Torkkola, “Linear Discriminant Analysis in Document Classification”, IEEE Workshop on Text Mining (TextDM 2001), November 2001.

35 Questions?