Published by Blanche Allison. Modified over 9 years ago.
1
OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering
Lionel F. Lovett, II
Advisors: George Ostrouchov and Houssain Kettani
Computer Science and Mathematics Division, Oak Ridge National Laboratory
Summer 2005
2
Why Dimension Reduction?
- Data mining large databases
  - Number of items
  - Number of attributes (high dimensionality)
- Visualization
  - Requires low-dimensional views (2 or 3)
- Structure discovery
  - Patterns
  - Clusters
- Fast similarity searching
  - Images, video, documents, character recognition, face recognition, DNA sequences
- Data reduction
3
RobustMap Uses Distances and Mimics PCA, Like FastMap
FastMap (Faloutsos and Lin, 1995):
- Choose two very distant points as the principal axis
- Project onto the orthogonal hyperplane
- Repeat
Each axis costs O(n), given distances.
Distances are updated using the cosine law as needed.
The result is a mapping as well as the transformation to map new items.
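The FastMap steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the slides: the names `euclidean`, `choose_pivots`, and `fastmap` are hypothetical, the input is assumed to be Euclidean points, and pivots are found with the usual farthest-point heuristic from a random start.

```python
import math
import random


def euclidean(p, q):
    """Plain Euclidean distance between two coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def choose_pivots(n, dist, rng):
    """Farthest-point heuristic: two passes of n distance evaluations."""
    start = rng.randrange(n)
    a = max(range(n), key=lambda i: dist(start, i))
    b = max(range(n), key=lambda i: dist(a, i))
    return a, b


def fastmap(points, k, seed=0):
    """Map `points` to k coordinates; returns (coords, pivots)."""
    rng = random.Random(seed)
    n = len(points)
    coords = [[0.0] * k for _ in range(n)]
    pivots = []
    for axis in range(k):
        def d2(i, j, axis=axis):
            # Squared distance in the hyperplane orthogonal to all earlier
            # axes, obtained by repeated use of the cosine-law update.
            d = euclidean(points[i], points[j]) ** 2
            for t in range(axis):
                d -= (coords[i][t] - coords[j][t]) ** 2
            return max(d, 0.0)

        a, b = choose_pivots(n, d2, rng)
        dab2 = d2(a, b)
        if dab2 == 0.0:
            break  # no spread left to extract
        pivots.append((a, b))
        dab = math.sqrt(dab2)
        for i in range(n):
            # Coordinate of point i along the pivot axis (cosine law)
            coords[i][axis] = (d2(a, i) + dab2 - d2(b, i)) / (2.0 * dab)
    return coords, pivots
```

The returned `pivots` list is what makes the result a transformation as well as a mapping: a new item only needs its distances to the stored pivot pairs to be placed in the reduced space.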
4
Projection to Pivot Axis and to Orthogonal Hyperplane
Given pivot axis ab, FastMap computes coordinates along the axis and projections onto the orthogonal hyperplane.
[Figure: points y and z projected to y' and z' on the hyperplane orthogonal to pivot axis ab, with projected distance d(y',z'); triangle a, y, b showing the coordinate c_y and the distances d(a,y), d(b,y), d(a,b)]
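For reference, the two relations pictured on this slide are the standard FastMap formulas from Faloutsos and Lin (1995), written here with the slide's symbols. The first gives the coordinate of a point y along the pivot axis ab by the cosine law; the second gives the distance between the projections y' and z' on the orthogonal hyperplane.

```latex
\[
  c_y = \frac{d_{a,y}^{2} + d_{a,b}^{2} - d_{b,y}^{2}}{2\, d_{a,b}}
\]
\[
  d_{y',z'}^{2} = d_{y,z}^{2} - \left(c_y - c_z\right)^{2}
\]
```

Only distances to the two pivots enter the first formula, which is why each axis costs O(n) distance evaluations.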
5
Problems with FastMap? OUTLIERS!
- Outliers are points that lie far, on average, from the other members of their cluster.
- When selecting pivot points based on distance, FastMap considers all the points of a dataset.
- Because outliers are included in that selection, FastMap is not robust.
6
FastMap Pivot Pair: Choosing Outliers
The axis does not represent the majority of the data.
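A small made-up example shows the effect. The data below are hypothetical (a cluster spread along x plus one distant point along y), and `farthest_pair` and `axis_coordinate` are illustrative names; the pivot heuristic is the farthest-point rule described for FastMap.

```python
import math


def farthest_pair(points, start=0):
    """FastMap-style pivot choice: farthest from start, then farthest from that."""
    n = len(points)
    a = max(range(n), key=lambda i: math.dist(points[start], points[i]))
    b = max(range(n), key=lambda i: math.dist(points[a], points[i]))
    return a, b


def axis_coordinate(points, a, b, i):
    """Cosine-law coordinate of point i along the pivot axis ab."""
    dab = math.dist(points[a], points[b])
    dai = math.dist(points[a], points[i])
    dbi = math.dist(points[b], points[i])
    return (dai ** 2 + dab ** 2 - dbi ** 2) / (2.0 * dab)


# Hypothetical data: 11 cluster points spread over a width of 10 along x,
# plus a single outlier far away along y.
cluster = [(float(x), 0.0) for x in range(-5, 6)]
outlier = (0.0, 100.0)
points = cluster + [outlier]

a, b = farthest_pair(points)
cluster_coords = [axis_coordinate(points, a, b, i) for i in range(len(cluster))]
spread = max(cluster_coords) - min(cluster_coords)

print("pivots:", points[a], points[b])
print("cluster spread along the axis:", round(spread, 3))
```

The outlier is picked as a pivot, so the axis points toward it and the cluster, which is 10 units wide, is squeezed into a small fraction of that width along the first coordinate.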
7
RobustMap: Clustering and Excluding Outliers
- Compute n distances from a random object
- Take the point of largest distance
- Repeat
Clustering
- Estimate the distance distribution from the two extreme points
- Find the probability of the extreme points
- Exclude the most extreme cluster of low-probability points
- Finish the projection using the remaining points
- Diagnostic histogram and cluster plots
Ratio function
- Uses only distances from pivots (2nd and 3rd)
- Computes ratios: data fraction / probability of data
- Looks for splits according to a ratio threshold
- Discards the smaller portion
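The slides do not spell out the distribution estimate or the ratio function, so the sketch below does not attempt them. It illustrates only the simpler idea behind the "quantile of trimmed max" parameter listed with the dataset generator: take the point at a high quantile of the distances instead of the absolute maximum. The names `trimmed_farthest` and `robust_pivots`, and the 0.9 default, are assumptions for illustration and not the authors' procedure.

```python
import math


def trimmed_farthest(points, ref, quantile=0.9):
    """Index of the point whose distance from `ref` sits at `quantile`
    of the sorted distances, rather than at the maximum."""
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[ref], points[i]))
    k = min(len(order) - 1, int(quantile * (len(order) - 1)))
    return order[k]


def robust_pivots(points, start=0, quantile=0.9):
    """Pivot pair chosen with the trimmed rule in both passes."""
    a = trimmed_farthest(points, start, quantile)
    b = trimmed_farthest(points, a, quantile)
    return a, b


# Hypothetical data: a cluster along x and one far outlier along y.
cluster = [(float(x), 0.0) for x in range(-5, 6)]
outlier = (0.0, 100.0)
points = cluster + [outlier]

a, b = robust_pivots(points)
print("pivots:", points[a], points[b])
```

With this rule both pivots fall inside the cluster, so the axis follows the bulk of the data. Trimming alone does not identify or remove an outlying cluster; that is the job of the clustering and ratio steps listed on the slide.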
8
Dataset Generator
Generates clustered data from a mixture of multivariate normal densities.
There are five parameters:
- Number of dimensions
- Number of clusters
- Cluster variability
- Cluster mixing proportions
- Seed for random number generator
Other RobustMap parameters:
- Number of dimensions to extract
- Quantile of trimmed max
- Ratio threshold for outlying cluster extraction
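A generator with these five parameters might look like the sketch below. It is an assumption-laden illustration, not the generator used for the slides: `generate_mixture` is a hypothetical name, clusters are spherical with a single standard deviation `spread` standing in for "cluster variability", and cluster centers are drawn uniformly from [-10, 10] in each dimension.

```python
import random


def generate_mixture(n, dims, n_clusters, spread, proportions, seed):
    """Draw n points from a mixture of spherical normal clusters.

    Returns (data, labels), where labels[i] is the cluster of data[i].
    """
    rng = random.Random(seed)
    # Assumed choice: centers placed uniformly at random in a fixed box
    centers = [[rng.uniform(-10.0, 10.0) for _ in range(dims)]
               for _ in range(n_clusters)]
    # Mixing proportions decide which cluster each point comes from
    labels = rng.choices(range(n_clusters), weights=proportions, k=n)
    data = [[rng.gauss(centers[c][d], spread) for d in range(dims)]
            for c in labels]
    return data, labels


data, labels = generate_mixture(
    n=200, dims=5, n_clusters=3, spread=0.5,
    proportions=[0.6, 0.3, 0.1], seed=42,
)
print(len(data), "points in", len(data[0]), "dimensions")
```

Giving one cluster a small mixing proportion is a simple way to produce the kind of outlying cluster that RobustMap is meant to detect, and fixing the seed makes a run repeatable.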
9
Results
- RobustMap identifies and excludes outlying clusters
- RobustMap performs dimension reduction
- RobustMap exploits robust statistics
- RobustMap exploits fast machine learning algorithms (runtime O(nk))
10
Decomposition of Climate Model Run Data with PCA (EOF)
- 135-year CCM3 run at T42 resolution, CO2 increase to 3x
- Average monthly surface temperature, 1620 x 2500 matrix (Putman, Drake, Ostrouchov, 2000)
[Figure: each image vector written as a sum of components PC 1 + PC 2 + PC 3 + PC 4 + ... + PC 1620, with amplitude-in-time plots for PC 1 to PC 4]
- PCA: concise 4-d summary of the 135-year run; winter warming more severe than summer warming
[Figure: the same decomposition using RobustMap components RM 1 to RM 4, with amplitude-in-time plots]
- RobustMap: 1000x faster; concise 4-d summary of the 135-year run; winter warming more severe than summer warming
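To make the PCA side of this comparison concrete, the sketch below extracts the leading principal component of a small data matrix by power iteration and returns the per-row scores, which play the role of the amplitude-in-time series when rows are time steps. It is a toy illustration on made-up numbers, not the climate computation; `first_principal_component` is a hypothetical name, and a production analysis would normally use a linear algebra library.

```python
import math


def first_principal_component(rows, iterations=200):
    """Leading principal direction and per-row scores via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    v = [1.0 / math.sqrt(p)] * p  # starting direction
    for _ in range(iterations):
        # One multiplication by X^T X, done as two passes over the data
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v_new = [sum(scores[i] * centered[i][j] for i in range(n))
                 for j in range(p)]
        norm = math.sqrt(sum(x * x for x in v_new))
        if norm == 0.0:
            break  # no variance in the data
        v = [x / norm for x in v_new]

    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores


# Made-up data: four time steps, two variables lying on the line y = 2x
rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, amplitudes = first_principal_component(rows)
print("direction:", direction)
print("amplitudes:", amplitudes)
```

Each power-iteration step touches every entry of the matrix, whereas a FastMap-style axis needs only the distances from each item to two pivots.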
11
Future Plans
- Ratio
  - Compute threshold from probability theory
  - Create loop for remaining clusters
- Develop better probability theory for RobustMap
- Add application context visualization
12
Applications
- Searching
  - Images and multimedia databases
  - String databases (spelling, typing, and OCR error correction)
  - Medical databases
- Data mining and visualization
  - Medical databases (ECGs, X-rays, MRI brain scans)
  - Demographic data
  - Time series
  - Business, commerce, and financial data
  - Climate, astrophysics, chemistry, and biology data
13
Questions?