Finding Local Correlations in High Dimensional Data USTC Seminar Xiang Zhang Case Western Reserve University.

Finding Latent Patterns in High Dimensional Data
An important research problem with wide applications:
- biology (gene expression analysis, genotype-phenotype association studies)
- customer transactions, and so on
Common approaches:
- feature selection
- feature transformation
- subspace clustering

Existing Approaches
Feature selection
- find a single representative subset of features that are most relevant for the data mining task at hand
Feature transformation
- find a set of new (transformed) features that preserve as much of the information in the original data as possible
- Principal Component Analysis (PCA)
Correlation clustering
- find clusters of data points that do not exist in the axis-parallel subspaces but only in projected subspaces

Motivation Example
Question: how can we find these local linear correlations using existing methods?
(Figure: a set of linearly correlated genes)

Applying PCA — Correlated?
PCA is an effective way to determine whether a set of features is strongly correlated
- a global transformation applied to the entire dataset
- a few eigenvectors describe most of the variance in the dataset
- only a small amount of variance is represented by the remaining eigenvectors
- small residual variance indicates strong correlation
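To make the residual-variance idea concrete, here is a small Python/numpy sketch (not from the talk; the data and function names are illustrative). A global PCA over all features and points sees no strong correlation, while the same check on the right feature and point subset reveals one.

import numpy as np

def residual_variance_ratio(X, k):
    # Fraction of the total variance lying on the k eigenvectors of the
    # covariance matrix with the smallest eigenvalues.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))  # ascending order
    return eigvals[:k].sum() / eigvals.sum()

rng = np.random.default_rng(0)
D = rng.normal(size=(400, 10))          # 400 points, 10 features
half = 200
# Features 0-2 are almost linearly dependent, but only on the first half
# of the points: a local linear correlation.
D[:half, 2] = D[:half, 0] - D[:half, 1] + 0.01 * rng.normal(size=half)
print(residual_variance_ratio(D, k=1))             # full data: no strong correlation visible
print(residual_variance_ratio(D[:half, :3], k=1))  # near zero: strong local correlation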

Applying PCA – Representation?
The linear correlation is represented by the hyperplane that is orthogonal to the eigenvectors with the smallest variances
(Figure: embedded linear correlations, e.g. the hyperplane [1, -1, 1], vs. linear correlations reestablished by full-dimensional PCA)

Applying Bi-clustering or Correlation Clustering Methods
Correlation clustering
- no obvious clustering structure
Bi-clustering
- no strong pair-wise correlations
(Figure: linearly correlated genes)

Revisiting Existing Work
Feature selection
- finds only one representative subset of features
Feature transformation
- performs one and the same feature transformation for the entire dataset
- does not really eliminate the impact of any original attributes
Correlation clustering
- projected subspaces are usually found by applying a standard feature transformation method, such as PCA

Local Linear Correlations - Formalization
Idea: formalize local linear correlations as strongly correlated feature subsets
- determining whether a feature subset is correlated: small residual variance
- the correlation may not be supported by all data points (noise, domain knowledge, ...): require support by a large portion of the data points

Problem Formalization
Suppose that F (m by n) is a submatrix of the dataset D (M by N).
Let {λ_1, λ_2, ..., λ_n} be the eigenvalues of the covariance matrix of F, arranged in ascending order.
F is a strongly correlated feature subset if
(1) (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{n} λ_i) ≤ ε, i.e., the variance on the k eigenvectors having the smallest eigenvalues (the residual variance) is a small fraction of the total variance, and
(2) m / M ≥ δ, i.e., the number of supporting data points is a large fraction of the total number of data points.
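A direct encoding of conditions (1) and (2) might look like the following Python sketch (function and argument names are ours, not the paper's).

import numpy as np

def is_strongly_correlated(F, M, k, eps, delta):
    # F: m-by-n submatrix (m supporting points, n features) of an M-point dataset D.
    # (1) the variance on the k smallest-eigenvalue eigenvectors is at most an
    #     eps fraction of the total variance (small residual variance);
    # (2) the m supporting points are at least a delta fraction of all M points.
    m = F.shape[0]
    eigvals = np.linalg.eigvalsh(np.cov(F, rowvar=False))  # ascending order
    return (eigvals[:k].sum() / eigvals.sum() <= eps) and (m / M >= delta)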

Problem Formalization
Suppose that F (m by n) is a submatrix of the dataset D (M by N).
- larger k: stronger correlation
- smaller ε: stronger correlation
- k and ε together control the strength of the correlation
(Figure: eigenvalues plotted against eigenvalue id, illustrating the effect of larger k and smaller ε)

Goal
Goal: to find all strongly correlated feature subsets
Enumerate all sub-matrices?
- not feasible (2^(M×N) sub-matrices in total)
- an efficient algorithm is needed
Any property we can use?
- monotonicity of the objective function

Monotonicity
Monotonic w.r.t. the feature subset
- if a feature subset is strongly correlated, all of its supersets are also strongly correlated
- derived from the Interlacing Eigenvalue Theorem
- allows us to focus on finding the smallest feature subsets that are strongly correlated
- enables an efficient algorithm: no exhaustive enumeration needed

The CARE Algorithm
Selecting the feature subsets
- enumerate feature subsets from smaller size to larger size (DFS or BFS)
- if a feature subset is strongly correlated, its supersets are pruned (monotonicity of the objective function)
- further pruning is possible
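A rough sketch of such a level-wise (BFS) search with monotonicity pruning, assuming an is_correlated(subset) helper that wraps the test above; this only illustrates the pruning idea, not the exact CARE procedure.

def enumerate_correlated_subsets(features, is_correlated, max_size):
    results = []                                    # minimal strongly correlated subsets
    frontier = [frozenset([f]) for f in features]   # subsets that are not (yet) correlated
    for size in range(2, max_size + 1):
        # Grow every non-correlated subset by one feature.
        candidates = {s | {f} for s in frontier for f in features if f not in s}
        frontier = []
        for c in candidates:
            if any(r <= c for r in results):
                continue                            # already contains a correlated subset: prune
            if is_correlated(c):
                results.append(c)                   # report and never extend (monotonicity)
            else:
                frontier.append(c)
    return results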

Monotonicity
Non-monotonic w.r.t. the point subset
- adding (or deleting) a point can increase or decrease the correlation among the features of a subset
- exhaustive enumeration is infeasible: an effective heuristic is needed

The CARE Algorithm
Selecting the point subsets
- a feature subset may only be correlated on a subset of the data points
- if a feature subset is not strongly correlated on all data points, how do we choose the proper point subset?

The CARE Algorithm
Successive point deletion heuristic
- greedy algorithm: in each iteration, delete the point whose removal yields the largest increase in the correlation among the subset of features
- inefficient: the objective function must be evaluated for every data point
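A literal (and deliberately naive) Python sketch of this greedy heuristic, reusing the residual-variance ratio from earlier; the names are illustrative, not the paper's.

import numpy as np

def successive_point_deletion(F, k, eps, min_points):
    def residual_ratio(X):
        eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))  # ascending order
        return eigvals[:k].sum() / eigvals.sum()
    # Greedily drop the point whose removal decreases the residual variance ratio
    # the most; the objective is re-evaluated once per remaining point in every
    # iteration, which is what makes this heuristic expensive.
    while F.shape[0] > min_points and residual_ratio(F) > eps:
        ratios = [residual_ratio(np.delete(F, i, axis=0)) for i in range(F.shape[0])]
        F = np.delete(F, int(np.argmin(ratios)), axis=0)
    return F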

The CARE Algorithm
Distance-based point deletion heuristic
- let S1 be the subspace spanned by the k eigenvectors with the smallest eigenvalues
- let S2 be the subspace spanned by the remaining n-k eigenvectors
- intuition: reduce the variance in S1 as much as possible while retaining the variance in S2
- directly delete the (1-δ)M points having large variance in S1 and small variance in S2 (refer to the paper for details)
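The distance-based heuristic can be sketched roughly as follows; the exact scoring used in the paper differs (see the paper for details), and this version simply keeps the δM points whose projections concentrate in S2 rather than S1.

import numpy as np

def distance_based_point_deletion(F, k, delta):
    M = F.shape[0]
    centered = F - F.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(F, rowvar=False))  # ascending eigenvalues
    S1 = eigvecs[:, :k]          # spanned by the k smallest-eigenvalue eigenvectors
    S2 = eigvecs[:, k:]          # spanned by the remaining n-k eigenvectors
    # Score each point by its squared projection length in S1 relative to S2;
    # points with a large score hurt the correlation the most.
    var_S1 = ((centered @ S1) ** 2).sum(axis=1)
    var_S2 = ((centered @ S2) ** 2).sum(axis=1)
    score = var_S1 / (var_S2 + 1e-12)
    keep = np.argsort(score)[: int(np.ceil(delta * M))]  # delete the (1-delta)*M worst points
    return F[keep]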

The CARE Algorithm
A comparison between the two point deletion heuristics (successive vs. distance-based)

Experimental Results (Synthetic)
(Figures: embedded linear correlation and the linear correlation reestablished by full-dimensional PCA and by CARE)

Experimental Results (Synthetic)
(Figures: embedded linear correlation (hyperplane representation) and its pair-wise correlations)

Experimental Results (Synthetic)
Scalability evaluation

Experimental Results (Wage)
A comparison between the correlation clustering method and CARE on the wage dataset (534×11)
(Figures: patterns found by both the correlation clustering method and CARE, and by CARE only)

Experimental Results
Linearly correlated genes (hyperplane representations), 220 genes for 42 mouse strains:
- Nrg4: cell part; Myh7: cell part, intracellular part; Hist1h2bk: cell part, intracellular part; Arntl: cell part, intracellular part
- Nrg4: integral to membrane; Olfr281: integral to membrane; Slco1a1: integral to membrane; P196867: N/A
- Oazin: catalytic activity; Ctse: catalytic activity; Mgst3: catalytic activity
- Hspb2: cellular physiological process; L12Rik: cellular physiological process; D01Rik: cellular physiological process; P213651: N/A
- Ldb3: intracellular part; Sec61g: intracellular part; Exosc4: intracellular part; BC048403: N/A
- Mgst3: catalytic activity, intracellular part; Nr1d2: intracellular part, metal ion binding; Ctse: catalytic activity; Pgm3: metal ion binding
- Hspb2: cellular metabolism; Sec61b: cellular metabolism; Gucy2g: cellular metabolism; Sdh1: cellular metabolism
- Ptk6: membrane; Gucy2g: integral to membrane; Clec2g: integral to membrane; H2-Q2: integral to membrane

An example

An example
(Figures: result of applying PCA vs. result of applying ISOMAP)

Finding local correlations
Dimension reduction
- performs a single feature transformation for the entire dataset
To find local correlations
- first: identify the correlated feature subspaces
- then: apply dimension reduction methods to uncover the low dimensional structure
- dimension reduction addresses the second aspect; our focus is the first aspect

Finding local correlations
Challenges
- modeling subspace correlations: measurements of pair-wise correlations may not suffice
- search algorithm: exhaustive enumeration is too time consuming

Modeling correlated subspaces
Intrinsic dimensionality (ID)
- the minimum number of free variables required to define the data without any significant information loss
- the correlation dimension is used as the ID estimator
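For reference, a compact two-radius correlation-dimension estimate (Grassberger-Procaccia style) could be sketched like this; the radii are assumed to be chosen inside the scaling range of the data.

import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, r1, r2):
    # C(r): fraction of point pairs closer than r; the correlation dimension is
    # the slope of log C(r) versus log r, here estimated from two radii r1 < r2.
    d = pdist(X)
    C1, C2 = np.mean(d < r1), np.mean(d < r2)
    return (np.log(C2) - np.log(C1)) / (np.log(r2) - np.log(r1))

# Points on a line embedded in 3-D have intrinsic dimensionality about 1.
t = np.random.default_rng(0).uniform(size=2000)
X = np.column_stack([t, 2 * t, -t])
print(correlation_dimension(X, 0.05, 0.2))   # close to 1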

Modeling correlated subspaces
Strong correlation
- subspace V and feature f_a have a strong correlation if (intuitively) adding f_a to V does not significantly increase its intrinsic dimensionality
Redundancy
- feature f_vi in subspace V is redundant if it is strongly correlated with the remaining features of V

Modeling correlated subspaces
Reducible Subspace and Core Space
- subspace Y is reducible to a subspace V of Y if (1) all features in Y are strongly correlated with V, and (2) V is non-redundant
- V is the core space of Y
- the core space is the smallest non-redundant subspace of Y with which all other features in Y are strongly correlated

Modeling correlated subspaces
Maximum reducible subspace
- Y is a reducible subspace and V is its core space
- Y is maximum if it includes all features that are strongly correlated with the core space V
Goal
- to find all maximum reducible subspaces in the full dimensional space

Finding reducible subspaces
General idea
- first find the overall reducible subspace (OR), which is the union of all maximum reducible subspaces
- then identify the individual maximum reducible subspaces (IR) from OR

Finding OR
Property
- suppose Y is a maximum reducible subspace with core space V; then any subspace U of Y with |U| = |V| is also a core space of Y
- let RF_fa be the set of remaining features in the dataset after deleting f_a; whether f_a belongs to OR can then be decided by checking f_a against RF_fa
- a linear scan over all the features in the dataset can therefore find OR
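Under that reading, the linear scan could look like this minimal sketch, with strongly_correlated(subspace, feature) an assumed helper implementing the ID-based test from the talk.

def find_overall_reducible_subspace(features, strongly_correlated):
    # A feature f is kept in OR if it is strongly correlated with RF_f,
    # the set of remaining features after deleting f -- one test per feature.
    return {f for f in features if strongly_correlated(set(features) - {f}, f)}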

Finding Individual RS
Assumption
- maximum reducible subspaces are disjoint
Method
- enumerate candidate core spaces from size 1 to |OR|
- a candidate core space is a subset of OR
- find the features that are strongly correlated with the candidate core space and remove them from OR
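A sketch of this search, again treating strongly_correlated(core, feature) as an assumed helper and relying on the disjointness assumption; it is illustrative only.

from itertools import combinations

def find_individual_reducible_subspaces(OR, strongly_correlated, max_core_size):
    remaining = set(OR)
    subspaces = []
    for size in range(1, max_core_size + 1):
        for core in combinations(sorted(remaining), size):
            core = set(core)
            if not core <= remaining:
                continue                           # part of this core was already claimed
            members = {f for f in remaining - core if strongly_correlated(core, f)}
            if members:
                subspaces.append(core | members)   # one maximum reducible subspace
                remaining -= core | members        # disjointness: remove it from OR
    return subspaces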

Finding Individual RS
Determining whether a feature is strongly correlated with a candidate core space
- ID-based method: quadratic in the number of data points
- sampling-based method: sample some data points and count the number of data points distributed around them (see the paper for details)

Experimental result
A synthetic dataset consisting of 50 features with 3 reducible subspaces (RS)

Experimental result
Efficiency evaluation on finding OR

Experimental result
Sampling-based vs. ID-based method for finding individual RS

Experimental result
Reducible subspaces in the NBA dataset (from the ESPN website): 28 features for 200 players

Thank You!