Principal Component Analysis & Clustering. Prof. Rui Alves (tel. 973702406), Dept. Ciencies Mediques.


Principal Component Analysis & Clustering. Prof. Rui Alves, Dept. Ciencies Mediques Basiques, 1st Floor, Room 1.08.

Complex Datasets When studying complex biological samples there are sometimes too many variables. For example, when studying Medaka development using phospho-metabolomics you may have measurements of many different amino acids and related metabolites. Question: Can we find markers of development using these metabolites? Question: How do we analyze the data?

Problems How do you visually represent the data? –The sample has many dimensions, so simple plots are not a good solution How do you make sense of or extract information from it? –With so many variables, how do you know which ones are important for identifying signatures?

Two possible ways (out of many) to address the problems PCA Clustering

Solution 1: Try a data reduction method If we can combine the different columns in specific ways, then maybe we can find a way to reduce the number of variables that we need to represent and analyze: –Principal Component Analysis

Variation in data is what identifies signatures [Table: measurements of Metabolite 1, Metabolite 2, Metabolite 3, … across Conditions C1, C2, C3; numeric values not recoverable from the transcript]

Variation in data is what identifies signatures Virtual metabolite: Metabolite 2 + 1/Metabolite 3. Its signal is much stronger and separates conditions 1, 2, and 3. [Plot: virtual metabolite value for conditions C1, C2, C3]

Principal component analysis From k "old" variables define k "new" variables that are a linear combination of the old variables:
y1 = a11 x1 + a12 x2 + … + a1k xk
y2 = a21 x1 + a22 x2 + … + a2k xk
...
yk = ak1 x1 + ak2 x2 + … + akk xk
(the y's are the new variables, the x's the old variables)
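The linear combinations above can be computed directly by diagonalizing the covariance matrix of the data. A minimal NumPy sketch (the data values are made up purely for illustration):

```python
import numpy as np

# Toy data: 6 samples (rows) measured on 3 variables (columns).
# Hypothetical numbers, for illustration only.
X = np.array([[2.0, 1.0, 0.5],
              [2.2, 0.9, 0.6],
              [4.0, 3.1, 1.0],
              [4.1, 3.0, 1.1],
              [6.0, 5.2, 1.4],
              [6.1, 5.0, 1.6]])

# Centre each variable, then build the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigendecomposition: the columns of `vecs` hold the coefficients a_ij
# that define each new variable y_i as a linear combination of the x's.
vals, vecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(vals)[::-1]     # sort descending by variance explained
vals, vecs = vals[order], vecs[:, order]

# Project the data onto the new axes: Y = Xc @ A.
Y = Xc @ vecs
print(Y.shape)   # same samples, new uncorrelated variables
```

The columns of `Y` are the new variables y1, y2, y3; their covariance matrix is diagonal, which is exactly the "uncorrelated" property described on the next slide.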

Defining the New Variables Y –The y's are uncorrelated (orthogonal) –y1 explains as much as possible of the original variance in the data set –y2 explains as much as possible of the remaining variance –etc.

Principal Components Analysis on: Covariance Matrix: –Variables must be in same units –Emphasizes variables with most variance –Mean eigenvalue ≠1.0 Correlation Matrix: –Variables are standardized (mean 0.0, SD 1.0) –Variables can be in different units –All variables have same impact on analysis –Mean eigenvalue = 1.0
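The correlation-matrix bullets can be checked numerically: the correlation matrix is just the covariance matrix of the standardized variables, and because a k x k correlation matrix has trace k, its eigenvalues average exactly 1.0. A small sketch with random, illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three variables on wildly different scales (random, illustrative data).
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])

# Standardize: mean 0.0, SD 1.0 for every variable.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance matrix of the standardized data IS the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
print(np.allclose(np.cov(Z, rowvar=False), corr))  # True

# Trace of a k x k correlation matrix is k, so the mean eigenvalue is 1.0.
vals = np.linalg.eigvalsh(corr)
print(vals.mean())
```

This is why standardizing (or, equivalently, using the correlation matrix) gives every variable the same impact regardless of its units.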

Covariance Matrix Covariance is the measure of how much two random variables vary together. For variables X1, X2, X3, … the matrix holds the variances σ1², σ2², σ3², … on the diagonal and the covariances σ12, σ13, σ23, … off the diagonal.

Covariance Matrix Diagonalize the covariance matrix (variances σi² on the diagonal, covariances σij off the diagonal) to find the principal components.

Eigenvalues rank the principal components Each eigenvalue tells us how much of the total variance in the data its principal component explains.
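Concretely, the fraction of variance each PC explains is its eigenvalue divided by the sum of all the eigenvalues (eigenvalues below are made up for illustration):

```python
import numpy as np

# Hypothetical eigenvalues obtained by diagonalizing a covariance matrix.
eigenvalues = np.array([4.5, 1.2, 0.2, 0.1])

# Each PC's share of the total variance:
explained = eigenvalues / eigenvalues.sum()
print(explained)            # first PC alone explains 75% of the variance
print(explained[:2].sum())  # first two PCs together: 95%
```

This is the calculation behind the plots that follow: when the first two or three PCs explain most of the variance, we can represent the data with just those axes.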

Principal Components and Eigenvalues [Plot: data cloud with the 1st principal component, y1 (eigenvalue λ1), along the direction of greatest variance, and the 2nd principal component, y2 (eigenvalue λ2), orthogonal to it]

Now we have reduced the problem to two variables [Plot: Days 1–8 plotted in the plane of the first two principal components]

What if things are still a mess? Days 3, 4, 5 and 6 do not separate very well What could we do to try and improve this? Maybe add an extra PC axis to the plot!!!

Days separate well with three variables [Plot: Days 1–8 plotted in the space of the first three principal components]

Two possible ways to address the problems PCA Clustering

Complex Datasets

Solution 2: Try using all the data and representing it in a low-dimensional figure If we can cluster the different days according to some distance function over all amino acids, we can represent the data in an intuitive way.

What is data clustering? Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait The number of clusters is usually defined in advance

Types of data clustering Hierarchical –Find successive clusters using previously established clusters Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters Partitional –Find all clusters at once

First things first: distance is important Selecting a distance measure determines how data points are agglomerated
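To see why the choice matters, here are three common distance measures side by side (the two vectors are hypothetical):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: straight-line distance in variable space.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(a - b))

# Correlation distance: 1 - Pearson correlation. Scale and offset are
# ignored; only the shape of the profile matters, which is why it is
# popular for expression and metabolite profiles.
corr_dist = 1.0 - np.corrcoef(a, b)[0, 1]

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```

Two points that are close under one measure can be far apart under another, so the resulting clusters can differ substantially.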

Reducing the data and finding amino acid signatures in development Decide on the number of clusters: three clusters Do your PCA of the dataset (20 variables, 35 data points) Use Euclidean distance Use a hierarchical, divisive algorithm

Hierarchical, Divisive Clustering: Step 1 – One Cluster Consider all data points as members of a single cluster

Hierarchical, Divisive Clustering: Step 1.1 – Building the Second Cluster Find the point furthest from the centroid; it becomes the seed of a new cluster. [Diagram: centroid, furthest point from centroid, new seed cluster]

Hierarchical, Divisive Clustering: Step 1.1 – Building the Second Cluster Recalculate the centroid Add the next point that is further from the old centroid and closer to the new one Rinse and repeat until…

Hierarchical, Divisive Clustering: Step 1.2 – Finishing a Cluster Recalculate the centroids: if both centroids move closer to each other, do not add the point and stop adding to the cluster Otherwise add the next point that is further from the old centroid and closer to the new one

Hierarchical, Divisive Clustering: Step 2 – Two Clusters Use an optimization algorithm to divide the data points so that the Euclidean distance between all points within each of the two clusters is minimal
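One simple way to realize such a split is a 2-means-style refinement: seed a second centroid at the point furthest from the overall centroid, then alternate between assigning points and recomputing centroids. This is a sketch of one divisive step under that assumption (the slides do not name a specific optimizer), with no handling of degenerate empty clusters:

```python
import numpy as np

def split_cluster(points, n_iter=20):
    """Split one cluster into two (naive 2-means-style refinement)."""
    centroid = points.mean(axis=0)
    # Seed the second cluster at the point furthest from the centroid.
    far = np.argmax(np.linalg.norm(points - centroid, axis=1))
    centroids = np.array([centroid, points[far]])
    for _ in range(n_iter):
        # Assign each point to the nearest of the two centroids...
        labels = np.argmin(
            [np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        # ...then recompute both centroids from their members.
        centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

pts = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
print(split_cluster(pts))  # two tight groups of three points
```

Applying this split recursively to the resulting clusters yields the successively smaller clusters of the divisive hierarchy.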

Hierarchical, Divisive Clustering: Step 3 – Three Clusters Continue dividing the data points until all clusters have been defined

Reducing the data and finding amino acid signatures in development Decide on the number of clusters: three clusters Use Euclidean distance Use a hierarchical, agglomerative algorithm

Hierarchical, Agglomerative Clustering: Step 1 – 35 Clusters Consider each data point as a cluster

Hierarchical, Agglomerative Clustering: Step 2 – Decreasing the number of clusters Search for the two data points that are closest to each other Collapse them into a cluster Repeat until you have only three clusters
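The agglomerative procedure can be sketched in a few lines. This naive version merges the two clusters whose centroids are closest (other linkage rules, e.g. single or complete linkage, are common; this is an illustration, not an efficient implementation):

```python
import numpy as np

def agglomerative(points, n_clusters):
    """Naive agglomerative clustering: one cluster per point, then
    repeatedly collapse the two closest clusters (centroid distance)
    until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = points[clusters[i]].mean(axis=0)
                cj = points[clusters[j]].mean(axis=0)
                d = np.linalg.norm(ci - cj)   # Euclidean distance
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # collapse the two closest clusters
        del clusters[j]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
                [10.0, 0.0], [10.1, 0.1]])
print(agglomerative(pts, 3))  # three tight pairs
```

The sequence of merges is what a dendrogram records; cutting the tree at three clusters gives the grouping asked for on this slide.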

Reducing the data and finding amino acid signatures in development Decide on the number of clusters: three clusters Use Euclidean distance Use a partitional algorithm

Partitional Clustering Search for the three data points that are farthest from each other Add points to each of these, according to shortest distance Repeat until all points have been partitioned into a cluster
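A sketch of this partitional scheme, using a greedy farthest-point rule to pick the three seeds (an assumption on my part; the slide does not specify how the mutually farthest points are found):

```python
import numpy as np

def partition(points, k=3):
    """Seed k clusters with mutually distant points, then assign every
    point to its nearest seed. Greedy farthest-point seeding sketch."""
    # Start from point 0, then repeatedly take the point farthest
    # from all seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        dists = np.min(
            [np.linalg.norm(points - points[s], axis=1) for s in seeds],
            axis=0)
        seeds.append(int(np.argmax(dists)))
    # Partition: each point joins the cluster of its nearest seed.
    labels = np.argmin(
        [np.linalg.norm(points - points[s], axis=1) for s in seeds],
        axis=0)
    return seeds, labels

pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1],
                [0.0, 9.0], [0.1, 9.2]])
seeds, labels = partition(pts, 3)
print(labels)
```

Unlike the hierarchical methods, all clusters are found at once in a single pass; there is no tree of successive merges or splits.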

Clustering the days of development with amino acid signatures Get your data matrix Use Euclidean Distance Use a Clustering Algorithm

Final Notes on Clustering If more than three PCs are needed to separate the data, we could have used the principal-components matrix and clustered from there Clustering can be fuzzy Using algorithms such as genetic algorithms, neural networks, or Bayesian networks, one can extract clusters that are completely non-obvious

Summary PCA allows for data reduction and decreases the dimensions of the datasets to be analyzed Clustering allows for classification (independently of PCA) and for good visual representations