Dimensionality Reduction


1 Dimensionality Reduction
Some material from Data Mining: Practical Machine Learning Tools and Techniques (Chapter 7).
Villanova University Machine Learning Project

2 Dimensionality Reduction
Clustering
We know that clustering is a way to understand or examine our data, where we do the following:
Collect examples.
Compute similarity among examples according to some metric.
Group examples together such that examples within a cluster are similar and examples in different clusters are different.
Summarize each cluster, and sometimes assign new instances to the most similar cluster.
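A minimal sketch of this workflow in Python with scikit-learn; the toy data set and the choice of two clusters are illustrative assumptions, not part of the slides.

    # Clustering workflow: collect, compare, group, summarize, assign.
    import numpy as np
    from sklearn.cluster import KMeans

    # Collect examples: a tiny data set of two-attribute instances.
    X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2],
                  [8.0, 9.0], [8.3, 8.7], [7.9, 9.1]])

    # Compute similarity and group: KMeans uses Euclidean distance.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Summarize each cluster: the centroids act as the summaries.
    print("centroids:", model.cluster_centers_)

    # Assign a new instance to the most similar cluster.
    print("new instance -> cluster", model.predict([[1.1, 2.1]])[0])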

3 Some typical Uses of Clustering
A technique demanded by many real-world tasks:
Bank/Internet security: fraud/spam pattern discovery.
Biology: taxonomy of living things such as kingdom, phylum, class, order, family, genus, and species.
City planning: identifying groups of houses according to their house type, value, and geographical location.
Climate change: understanding Earth's climate; finding patterns in atmospheric and ocean data.
Finance: stock clustering analysis to uncover correlations underlying shares.
Image compression/segmentation: grouping coherent pixels.
Information retrieval/organization: Google search, topic-based news.
Land use: identification of areas of similar land use in an earth observation database.
Marketing: helping marketers discover distinct groups in their customer bases, then using this knowledge to develop targeted marketing programs.
Social network mining: automatic discovery of special-interest groups.

4 Dimensionality Reduction
Clustering may also be used to reduce the number of attributes in a data set: dimensionality reduction.
Why? A large number of attributes is typical in, for instance:
text mining
image processing
biology
For instance, the UCI repository currently has 51 data sets with more than 100 attributes, all of which have more than 1000 instances.
(Speaker note: Bring it up and point out a couple of them.)

5 Dimensionality Reduction
Clustering can be carried out as a precursor to running, for instance, a KNN classifier.
Reducing dimensionality can:
improve performance in terms of speed and memory
improve performance in terms of accuracy, especially for an algorithm such as Naive Bayes or KNN, which weights all variables equally

6 Why Would This Help Accuracy?
Attributes are not related to the class: adding shirt color to weather.arff.
Attributes are highly correlated, not contributing independent information: temperature in Celsius and temperature in Fahrenheit.
Attribute/class relations are non-linear, or are best described as relations between two variables: don’t play if it’s cold AND rainy.
Attributes are sparse, with only a few values for any instance: any text sample!

7 Dimensionality Reduction
Attribute Selection
The obvious way of reducing the number of attributes: throw some out.
We looked at a simple method for this in the exercises for section 17.2.
Weka supports more sophisticated methods in the Select Attributes tab, described in section 7.1.
Often useful, and often used.
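A rough analogue of this kind of attribute selection in Python/scikit-learn (a sketch, not Weka's Select Attributes implementation; the data set and k=10 are assumptions for illustration):

    # Rank attributes and keep only the k best.
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_breast_cancer(return_X_y=True)         # 30 attributes
    selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
    X_reduced = selector.transform(X)                  # 10 attributes remain
    print(X.shape, "->", X_reduced.shape)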

8 Clusters as Attributes
Cluster tools such as K-Means group data by similarity, based on the attributes.
The cluster models are basically weighted combinations of attributes.
We can consider cluster membership itself as an attribute, and it captures variation in our other attributes.

9 Cluster Membership in Weka
In Weka we can use cluster membership as an attribute.
In the Preprocess tab there is an unsupervised attribute filter called "AddCluster".
Choose a cluster tool in the GenericObjectEditor window.
Apply it, and Cluster will appear as another attribute.
Examining the data (through the Edit… button) shows that cluster membership has been added to each instance.
(Speaker note: Bring this up with the big diabetes file and step through it. LOOK at your data. A lot of these attributes are not in fact numeric, although the CSV loader defaults that way.)
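Outside Weka, the effect of AddCluster can be approximated as follows (a sketch only; scikit-learn's bundled diabetes data and k=3 stand in for the UCI file):

    # Approximate Weka's AddCluster filter: fit a clusterer and append its
    # labels as one extra attribute.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_diabetes

    X, y = load_diabetes(return_X_y=True)       # stand-in data set
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # The generated attribute is added as the last column, just as the new
    # Cluster attribute appears last in Weka.
    X_with_cluster = np.column_stack([X, labels])
    print(X.shape, "->", X_with_cluster.shape)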

10 Cluster Membership in Weka
Now that you have cluster membership as an attribute, run classifiers as usual.
Be sure to set the class: Weka defaults to the last attribute, and created attributes such as cluster will be last!
You can consider removing attributes as usual.
(Speaker note: keep stepping.)

11 Dimensionality Reduction
Example
Diabetes file from UCI, 50 attributes.
Load it (as a CSV file).
Run NaiveBayes (once!). Accuracy: 56%.
Add the K-Means cluster filter and remove the other inputs.
Run NB again (reset the class!). Accuracy: 54%.
So we have replaced 49 attributes with 1, with only a slight loss in accuracy.
We can explore adding some of the others back in to see if accuracy can be improved.
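A sketch of the same experiment in Python/scikit-learn (the bundled breast cancer data and k=5 are stand-ins for the UCI diabetes file, so the accuracies will not match the 56%/54% above):

    # Compare Naive Bayes on all attributes with Naive Bayes on a single
    # attribute: K-Means cluster membership.
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import CategoricalNB, GaussianNB

    X, y = load_breast_cancer(return_X_y=True)

    # Baseline: all attributes.
    print("all attributes:", cross_val_score(GaussianNB(), X, y, cv=10).mean())

    # Replace every attribute with cluster membership alone.
    cluster = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
    print("cluster only:",
          cross_val_score(CategoricalNB(min_categories=5),
                          cluster.reshape(-1, 1), y, cv=10).mean())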

12 Dimensionality Reduction
Attribute Selection
Note that the most effective attribute selection often comes from knowledge of the domain.
For instance, the first attribute in the diabetes data file is Encounter ID, and the second is Patient ID. These can probably be discarded immediately.

13 Attribute or Feature Extraction
Suppose we want to use whatever information we can get from each attribute. Can we map the values to a smaller, equivalent set of attributes?
Sounds familiar?
In regression classifiers we have predicted the class based on a weighted combination of attributes.
In SVMs we have used a kernel to map our inputs to non-linear forms.

14 Unsupervised Dimension Reduction
Regression and SVMs are both supervised. Can we apply similar concepts to map or project our data without a class to define the output?
Clustering looks at how close instances are, based on the distance between attributes.
We can use a comparable metric for our unsupervised reduction: the criterion is a reduction in the predicted error for our existing points.

15 Consider These Data Points
Clearly we can represent these points with complete accuracy using two attributes.

16 Dimensionality Reduction
But with a single value, the position along the green line, we can capture most of the variation in the values.

17 Dimensionality Reduction
Projecting!
So we can reduce our attributes from two to one by a transformation that captures most of the variation.
The total of the yellow lines is the error; we choose the green line to minimize it.
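Numerically, the picture above amounts to the following (a sketch with made-up two-attribute points):

    # Project 2-D points onto the single direction that minimizes the total
    # squared distance from the points to the line (the "green line").
    import numpy as np

    X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]])
    Xc = X - X.mean(axis=0)                      # center the data

    # The best-fit direction is the top right singular vector.
    direction = np.linalg.svd(Xc, full_matrices=False)[2][0]

    scores = Xc @ direction                      # one value per point: the new attribute
    reconstruction = np.outer(scores, direction)
    error = ((Xc - reconstruction) ** 2).sum()   # total squared "yellow line" error
    print("projection error:", error)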

18 Unsupervised Dimension Reduction
A common use of unsupervised learning is to remap our inputs into a smaller number of variables.
The most common method is principal component analysis (PCA):
a common statistical technique
sometimes loosely called factor analysis
The goal of PCA is to project a large number of attributes or dimensions into a smaller space.

19 Principal component analysis
A method for identifying the important "directions" in the data.
We can rotate the data into a (reduced) coordinate system given by those directions.
Algorithm:
Find the direction (axis) of greatest variance.
Find the direction of greatest variance that is perpendicular to the previous direction, and repeat.
Implementation: find the eigenvectors of the covariance matrix by diagonalization. The eigenvectors (sorted by eigenvalue) are the directions.
(Speaker note: "perpendicular" gets a little strange when we are talking about multiple dimensions.)
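A minimal sketch of this algorithm with NumPy (the random data and the choice of four attributes are purely for illustration):

    # PCA by diagonalizing the covariance matrix: the eigenvectors, sorted by
    # eigenvalue, are the directions of greatest variance.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    Xc = X - X.mean(axis=0)                      # center (and often standardize)

    cov = np.cov(Xc, rowvar=False)               # covariance matrix of the attributes
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # eigh returns ascending eigenvalues; reverse so the direction of greatest
    # variance comes first, the next perpendicular direction second, and so on.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    X_rotated = Xc @ eigenvectors                # data in the new coordinate system
    print("variance along each direction:", eigenvalues)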

20 Example: 10-dimensional data
We can transform the data into the space given by the components.
Data is normally standardized for PCA.
This could also be applied recursively in a tree learner.

21 Unsupervised Attribute Selection
The Select Attributes tab in Weka lets you apply principal component analysis automatically.
Choose PrincipalComponents as the Attribute Evaluator. The Search Method must be Ranker; let it choose automatically.
The amount of variance to cover defaults to 95%.
The class can be set to No class; otherwise, whatever attribute is considered the class will be omitted.
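A rough scikit-learn analogue of this step (not Weka's implementation; the data set is a stand-in): standardize, then keep enough components to cover 95% of the variance.

    # PCA keeping enough components for 95% of the variance, after
    # standardizing the attributes.
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = load_breast_cancer(return_X_y=True)          # 30 attributes
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
    X_transformed = pipeline.fit_transform(X)
    print(X.shape, "->", X_transformed.shape)           # far fewer components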

22 Dimensionality Reduction
Results: right-click on the result list to get "Save transformed data…". This can now be fed to another algorithm.
The output window shows the results, including the proportion of variance accounted for and the transformations.
It can be very slow: the diabetes example on my (old) laptop did not finish overnight.

23 Dimensionality Reduction
Discussion
Reduces classifier time, although the preprocessing itself takes time.
In theory it should not change accuracy for methods which differentially weight attributes, such as J48 or regression methods.
In practice…

