Dimension reduction: PCA and Clustering by Agnieszka S. Juncker Parts of the slides are adapted from Chris Workman
Motivation: Multidimensional data [table: expression values for probes (rows, Affymetrix "_at"/"_s_at"/"_x_at" IDs) across patients Pat1–Pat8 (columns)]
Dimension reduction methods Principal component analysis Cluster analysis Multidimensional scaling Correspondence analysis Singular value decomposition
Principal Component Analysis (PCA) –used for visualization of complex data –developed to capture as much of the variation in the data as possible
Principal components 1st principal component (PC1) –the direction along which there is the greatest variation 2nd principal component (PC2) –the direction with the maximum variation remaining in the data, orthogonal to PC1
Principal components
General properties of principal components –summary variables –linear combinations of the original variables –uncorrelated with each other –capture as much of the original variance as possible
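The properties above can be sketched with plain NumPy: centering the data and taking its SVD yields the principal directions, uncorrelated scores, and the variance captured by each component. This is a minimal illustration on synthetic data, not the course's leukemia set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 50 samples of 5 correlated variables (a stand-in for an
# expression matrix); built from 2 latent factors plus a little noise
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(50, 5))

# PCA is defined on mean-centered variables
Xc = X - X.mean(axis=0)

# SVD of the centered matrix: rows of Vt are the principal directions
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Scores: projection of each sample onto the principal components
scores = Xc @ Vt.T

# Fraction of the total variance captured by each component
explained = s**2 / np.sum(s**2)

print(explained)  # decreasing; PC1 captures the most variation
```

Because the synthetic data is essentially rank 2, `explained[:2]` accounts for nearly all the variance, mirroring why a 2-D PCA plot can summarize thousands of genes.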
PCA - example
PCA on all Genes Leukemia data, precursor B and T Plot of 34 patients, dimension of 8973 genes reduced to 2
PCA on 100 top significant genes Leukemia data, precursor B and T Plot of 34 patients, dimension of 100 genes reduced to 2
PCA of genes (Leukemia data) Plot of 8973 genes, dimension of 34 patients reduced to 2
Principal components - Variance
Clustering methods Hierarchical Partitioning –K-means clustering –Self Organizing Maps (SOM)
Hierarchical clustering of leukemia data
Hierarchical clustering Representation of all pairwise distances Parameters: none (only the choice of distance measure) Results: –all items joined into one large cluster –a hierarchical tree (dendrogram) Deterministic
Hierarchical clustering - Algorithm Assign each item to its own cluster Join the two nearest clusters Re-estimate the distances between clusters Repeat until all items are in one cluster (n−1 joins for n items)
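The agglomerative steps above are implemented in SciPy's `linkage`; a minimal sketch, assuming SciPy is available and using small synthetic groups rather than the leukemia data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated synthetic groups of 5 items with 4 measurements each
X = np.vstack([rng.normal(0, 0.3, size=(5, 4)),
               rng.normal(3, 0.3, size=(5, 4))])

# Agglomerative clustering: start with each item in its own cluster and
# repeatedly join the two nearest clusters (average linkage, Euclidean distance)
Z = linkage(X, method="average", metric="euclidean")

# Z encodes the dendrogram; cutting it at 2 clusters recovers the two groups
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`Z` has one row per join (n−1 rows for n items), which is exactly the "repeat until one cluster" loop of the slide.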
Hierarchical clustering
Data with clustering order and distances Dendrogram representation
Leukemia data - clustering of genes
Leukemia data - clustering of patients
Leukemia data - clustering of patients on top 100 significant genes
K-means clustering Partition data into K clusters Parameter: the number of clusters (K) must be chosen Randomized initialization: –may give different clusters each time
K-means - Algorithm Assign each item a class in 1 to K (randomly) For each class 1 to K –Calculate the centroid (one of the K means) –Calculate the distance from the centroid to each item Assign each item to the nearest centroid Repeat until no items are re-assigned (convergence)
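The loop above translates almost line for line into NumPy. A minimal sketch on synthetic 2-D data (random initialization, centroid update, nearest-centroid assignment, stop at convergence):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means following the slide's steps."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))      # random initial classes
    for _ in range(n_iter):
        # centroid of each class (fall back to a random item if a class empties)
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else X[rng.integers(len(X))]
                              for j in range(k)])
        # distance from every item to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)             # assign to nearest centroid
        if np.array_equal(new_labels, labels):    # no re-assignments: converged
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(2)
# two well-separated synthetic clusters
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(4, 0.2, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Note the randomized initialization: different seeds can give different clusterings, which is the non-determinism the previous slide warns about.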
K-means clustering, K=3
K-means clustering of Leukemia data
Self Organizing Maps (SOM) Partitioning method (similar to the K-means method) Clusters are organized in a two-dimensional grid The size of the grid is specified –(e.g. 2x2 or 3x3) The SOM algorithm finds the optimal organization of the data in the grid
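A bare-bones sketch of the SOM idea in NumPy (a hypothetical toy implementation, not the algorithm used in the exercises): each grid unit holds a weight vector; for every training sample the best-matching unit and its grid neighbours are pulled toward the sample, with a shrinking learning rate and neighbourhood radius.

```python
import numpy as np

def som(X, grid=(2, 2), n_iter=200, seed=0):
    """Toy self-organizing map on a small 2-D grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # grid coordinates of each unit, and random initial weight vectors
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = rng.normal(size=(rows * cols, X.shape[1]))
    for t in range(n_iter):
        lr = 0.5 * (1 - t / n_iter)             # decaying learning rate
        radius = 1.5 * (1 - t / n_iter) + 0.1   # decaying neighbourhood radius
        x = X[rng.integers(len(X))]             # pick a random sample
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))  # best-matching unit
        # Gaussian neighbourhood on the grid, centred on the BMU
        g = np.exp(-np.sum((coords - coords[bmu])**2, axis=1) / (2 * radius**2))
        W += lr * g[:, None] * (x - W)          # pull units toward the sample
    return W, coords

rng = np.random.default_rng(3)
# four synthetic groups with different mean expression levels
X = np.vstack([rng.normal(m, 0.2, (15, 3)) for m in (0, 2, 4, 6)])
W, coords = som(X, grid=(2, 2))
# map each sample to its best-matching grid unit
mapped = np.argmin(np.linalg.norm(X[:, None, :] - W[None], axis=2), axis=1)
print(mapped)
```

The grid topology is what distinguishes SOM from K-means: units that are neighbours on the grid end up with similar weight vectors, so similar clusters land next to each other in the 2x2 layout.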
SOM - example
Comparison of clustering methods Hierarchical clustering –Distances between all variables –Time-consuming with a large number of genes –Advantageous when clustering on a selected subset of genes K-means clustering –Faster algorithm –Does not show the relations between all variables (only cluster membership) SOM –More advanced algorithm
Distance measures Euclidean distance Vector angle distance Pearson correlation distance
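The three measures can be sketched in a few lines of NumPy; the example profile `y = 2x + 1` shows how they disagree, since Pearson distance ignores shifts and scaling of an expression profile while Euclidean distance does not:

```python
import numpy as np

def euclidean(x, y):
    # straight-line distance between the two profiles
    return np.linalg.norm(x - y)

def vector_angle(x, y):
    # 1 - cosine of the angle between the vectors (0 for parallel vectors)
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    # 1 - Pearson correlation (0 for perfectly linearly related profiles)
    return 1 - np.corrcoef(x, y)[0, 1]

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1   # same "shape" of profile, different offset and scale
print(euclidean(x, y), vector_angle(x, y), pearson(x, y))
```

Here the Euclidean distance is large, the vector-angle distance is small but nonzero (the offset tilts the vector slightly), and the Pearson distance is zero: the choice of measure decides which genes count as "similar".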
Comparison of distance measures
Summary Dimension reduction is important to visualize data Methods: –Principal Component Analysis –Clustering Hierarchical K-means Self-organizing maps (choice of distance measure is important)
Coffee break Next: Exercises in PCA and clustering