RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering


Oak Ridge National Laboratory, U.S. Department of Energy
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering
Lionel F. Lovett, II
Advisors: George Ostrouchov and Houssain Kettani
Computer Science and Mathematics Division, Oak Ridge National Laboratory
Summer 2005

2 Why Dimension Reduction?
- Data mining
  - Large databases
    - Number of items
    - Number of attributes per item (high dimensionality)
- Visualization
  - Requires low-dimensional views (2 or 3)
- Structure discovery
  - Patterns
  - Clusters
- Fast similarity searching
  - Images, video, documents, character recognition, face recognition, DNA sequences
- Data reduction

3 RobustMap Uses Distances and Mimics PCA Like FastMap
FastMap (Faloutsos and Lin, 1995)
- Choose two very distant points as principal axis
- Project onto orthogonal hyperplane
- Repeat
- Each axis O(n), given distances
- Distances updated using cosine law as needed
- Result is a mapping as well as the transformation to map new items
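The FastMap steps listed on this slide can be sketched in Python with NumPy. This is a minimal illustration, not the code behind the presentation: the function name `fastmap`, its arguments, and the early-exit tolerance are assumptions made here. For brevity it updates the full distance matrix after each axis, which costs O(n^2) per axis, whereas the slide's O(n) figure relies on updating distances only as needed.

```python
import numpy as np

def fastmap(D, k, seed=0):
    """Embed n items into k dimensions from an n x n distance matrix D.

    Illustrative sketch only; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    D2 = np.asarray(D, dtype=float) ** 2      # work with squared distances
    n = D2.shape[0]
    X = np.zeros((n, k))
    pivots = []
    for j in range(k):
        # Pivot heuristic: start at a random item, jump to the item farthest
        # from it, then to the item farthest from that one.
        r = int(rng.integers(n))
        a = int(np.argmax(D2[r]))
        b = int(np.argmax(D2[a]))
        dab2 = D2[a, b]
        if dab2 <= 1e-12:                     # no residual spread left
            break
        # Cosine law: coordinate of every item along the axis through a and b.
        x = (D2[a] + dab2 - D2[b]) / (2.0 * np.sqrt(dab2))
        X[:, j] = x
        pivots.append((a, b))
        # Squared distances inside the hyperplane orthogonal to the axis.
        D2 = np.maximum(D2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return X, pivots
```

The returned `pivots` list is what allows a new item to be mapped later: its coordinates follow from the same cosine-law expression applied to its distances to each pivot pair.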

4 Projection to Pivot Axis and to Orthogonal Hyperplane
Given pivot axis ab, FastMap computes coordinates along the axis and projections onto the orthogonal hyperplane.
[Figure: points y and z project to y' and z' on the hyperplane orthogonal to axis ab, separated by distance d_{y',z'}. A second panel shows the triangle a, y, b with sides d_{a,y}, d_{b,y}, d_{a,b} and the coordinate c_y of y along the axis.]
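Using the labels from the figure on this slide, the two quantities can be written out as below. These are the standard FastMap expressions from Faloutsos and Lin (1995); the symbol c_z, the coordinate of z along the axis, is defined in the same way as c_y.

```latex
c_y = \frac{d_{a,y}^2 + d_{a,b}^2 - d_{b,y}^2}{2\, d_{a,b}},
\qquad
d_{y',z'}^2 = d_{y,z}^2 - (c_y - c_z)^2
```

The first expression is the cosine law applied to the triangle a, y, b. The second is the Pythagorean theorem applied to the segment yz and its projection onto the axis.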

5 Problems with FastMap? OUTLIERS!
- Outliers are points that lie far, on average, from the other members of their cluster.
- When selecting pivot points by distance, FastMap considers every point in the dataset, outliers included.
- Because outliers can be chosen as pivots, FastMap is not robust.

6 FastMap Pivot Pair: Choosing Outliers
Axis does not represent majority of data

7 RobustMap: Clustering and Excluding Outliers
- Compute n distances from random object
- Take point of largest distance
- Repeat
- Clustering
  - Estimate distance distribution from two extreme points
  - Find probability of extreme points
  - Exclude most extreme cluster of low probability points
- Finish projection using remaining points
- Diagnostic histogram and cluster plots
Ratio Function
- Uses only distances from pivots (2nd and 3rd)
- Computes ratios: data fraction / probability of data
- Looks for splits according to ratio threshold
- Discards smaller portion
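The slide does not give enough detail to reproduce the ratio function, so the sketch below does not attempt it. It only illustrates the underlying idea of keeping extreme points out of the pivot choice, using a trimmed maximum as a stand-in technique. The names `farthest_pivot` and `trimmed_pivot` and the default quantile `q=0.95` are hypothetical and not taken from the presentation.

```python
import numpy as np

def farthest_pivot(D, start):
    """FastMap-style choice: the item farthest from `start`."""
    return int(np.argmax(D[start]))

def trimmed_pivot(D, start, q=0.95):
    """Hypothetical robust choice: the item whose distance from `start`
    sits at the q-quantile, so the most extreme items are skipped."""
    order = np.argsort(D[start])              # items sorted by distance from start
    idx = int(np.floor(q * (len(order) - 1))) # rank of the q-quantile item
    return int(order[idx])
```

With one far-away point appended to a compact cluster, `farthest_pivot` picks that point as the pivot, while `trimmed_pivot` picks a point on the edge of the cluster. Setting `q=1.0` recovers the untrimmed FastMap choice.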

8 Dataset Generator
- Generates clustered data from a mixture of multivariate normal densities
- There are five parameters
  - Number of dimensions
  - Number of clusters
  - Cluster variability
  - Cluster mixing proportions
  - Seed for random number generator
- Other RobustMap parameters
  - Number of dimensions to extract
  - Quantile of trimmed max
  - Ratio threshold for outlying cluster extraction
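A generator with the five parameters listed on this slide might look like the following sketch. It is not the generator used in the presentation. The function name `make_clusters`, the extra `n_items` argument, the spherical covariance within each cluster, and the way cluster centres are drawn (normal with standard deviation 10) are all assumptions made for illustration.

```python
import numpy as np

def make_clusters(n_items, n_dims, n_clusters, cluster_sd, proportions, seed):
    """Draw items from a mixture of spherical multivariate normal densities.

    Returns the data matrix (n_items x n_dims) and the cluster label of each item.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(proportions, dtype=float)
    p = p / p.sum()                           # mixing proportions must sum to 1
    # Assumption: cluster centres are themselves drawn at random.
    centers = rng.normal(scale=10.0, size=(n_clusters, n_dims))
    # Pick a cluster for each item, then add within-cluster noise.
    labels = rng.choice(n_clusters, size=n_items, p=p)
    X = centers[labels] + rng.normal(scale=cluster_sd, size=(n_items, n_dims))
    return X, labels
```

A call such as `make_clusters(200, 5, 3, 1.0, [0.5, 0.3, 0.2], seed=42)` yields the same data every time, which makes runs with different RobustMap settings comparable.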

9 Results
- RobustMap identifies and excludes outlying clusters
- RobustMap performs dimension reduction
- RobustMap exploits robust statistics
- RobustMap exploits fast machine learning algorithms (runtime O(nk))

10 Decomposition of Climate Model Run Data with PCA (EOF)
- CCM3 run at T42 resolution, CO2 increase to 3x
- Average monthly surface temperature, 1620 x 2500 matrix (Putman, Drake, Ostrouchov, 2000)
- PCA: each image vector is decomposed into PC 1 + PC 2 + PC 3 + PC 4 + ...
  - Amplitude-in-time plots for PC 1 to PC 4
  - Concise 4-d summary of 135 year run
  - Winter warming more severe than summer warming
- RobustMap: amplitude-in-time plots for RM 1 to RM 4
  - Concise 4-d summary of 135 year run
  - Winter warming more severe than summer warming
  - x FASTER! (the speedup factor is not preserved in the transcript)
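The PCA (EOF) side of this slide can be illustrated with a short NumPy sketch. It assumes the matrix has one row per month (1620 = 135 x 12) and one column per grid cell, which the slide does not state explicitly. The function name `pca_amplitudes` is hypothetical, and any real analysis of climate data would need preprocessing not shown here.

```python
import numpy as np

def pca_amplitudes(A, k):
    """PCA via SVD. Rows of A are time steps, columns are grid cells.

    Returns k spatial patterns, their amplitudes in time, and the
    fraction of variance each one explains.
    """
    A = np.asarray(A, dtype=float)
    centered = A - A.mean(axis=0)             # remove the time-mean map
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    patterns = Vt[:k]                         # k spatial patterns (EOFs)
    amplitudes = U[:, :k] * s[:k]             # amplitude of each pattern in time
    explained = s[:k] ** 2 / np.sum(s ** 2)   # variance fraction per component
    return patterns, amplitudes, explained
```

Plotting each column of `amplitudes` against time gives plots of the kind the slide calls amplitude-in-time plots.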

11 Future Plans
- Ratio
  - Compute threshold from probability theory
- Create loop for remaining clusters
- Develop better probability theory for RobustMap
- Add application context visualization

12 Applications
- Searching
  - Images and multimedia databases
  - String databases (spelling, typing and OCR error correction)
  - Medical databases
- Data Mining and Visualization
  - Medical databases (ECGs, X-rays, MRI brain scans)
  - Demographic data
  - Time series
  - Business, commerce, and financial data
  - Climate, astrophysics, chemistry, and biology data

13 Questions?