Microarray Clustering

Slides:



Advertisements
Similar presentations
Yinyin Yuan and Chang-Tsun Li Computer Science Department
Advertisements

BioInformatics (3).
Basic Gene Expression Data Analysis--Clustering
Cluster Analysis: Basic Concepts and Algorithms
PARTITIONAL CLUSTERING
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Cluster analysis for microarray data Anja von Heydebreck.
Microarray Normalization
Metrics, Algorithms & Follow-ups Profile Similarity Measures Cluster combination procedures Hierarchical vs. Non-hierarchical Clustering Statistical follow-up.
Introduction to Bioinformatics
UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL.
Clustering approaches for high- throughput data Sushmita Roy BMI/CS 576 Nov 12 th, 2013.
Mutual Information Mathematical Biology Seminar
SocalBSI 2008: Clustering Microarray Datasets Sagar Damle, Ph.D. Candidate, Caltech  Distance Metrics: Measuring similarity using the Euclidean and Correlation.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Dimension reduction : PCA and Clustering Agnieszka S. Juncker Slides: Christopher Workman and Agnieszka S. Juncker Center for Biological Sequence Analysis.
Dimension reduction : PCA and Clustering by Agnieszka S. Juncker
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Cluster Analysis: Basic Concepts and Algorithms
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Introduction to Bioinformatics - Tutorial no. 12
What is Cluster Analysis?
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
Ulf Schmitz, Pattern recognition - Clustering1 Bioinformatics Pattern recognition - Clustering Ulf Schmitz
Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
Evaluating Performance for Data Mining Techniques
START OF DAY 8 Reading: Chap. 14. Midterm Go over questions General issues only Specific issues: visit with me Regrading may make your grade go up OR.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
Gene expression analysis
1 Gene Ontology Javier Cabrera. 2 Outline Goal: How to identify biological processes or biochemical pathways that are changed by treatment.Goal: How to.
Clustering What is clustering? Also called “unsupervised learning”Also called “unsupervised learning”
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.
Tutorial 7 Gene expression analysis 1. Expression data –GEO –UCSC –ArrayExpress General clustering methods –Unsupervised Clustering Hierarchical clustering.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L10.1 Lecture 10: Cluster analysis l Uses of cluster analysis.
Flat clustering approaches
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
CZ5211 Topics in Computational Biology Lecture 4: Clustering Analysis for Microarray Data II Prof. Chen Yu Zong Tel:
Example Apply hierarchical clustering with d min to below data where c=3. Nearest neighbor clustering d min d max will form elongated clusters!
Hierarchical clustering approaches for high-throughput data Colin Dewey BMI/CS 576 Fall 2015.
DATA MINING: CLUSTER ANALYSIS Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Unsupervised Learning
Cluster Analysis of Gene Expression Profiles
Data Mining: Basic Cluster Analysis
Semi-Supervised Clustering
Clustering CSC 600: Data Mining Class 21.
CZ5211 Topics in Computational Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel:
Clustering Manpreet S. Katari.
Data Mining K-means Algorithm
Differential Gene Expression
Dimension reduction : PCA and Clustering by Agnieszka S. Juncker
CSE 5243 Intro. to Data Mining
AIM: Clustering the Data together
Hierarchical clustering approaches for high-throughput data
Cluster Analysis of Microarray Data
Clustering.
Gene expression analysis
Clustering.
Multivariate Statistical Methods
Dimension reduction : PCA and Clustering
Clustering Wei Wang.
Cluster Analysis.
Text Categorization Berlin Chen 2003 Reference:
The Omics Dashboard.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Clustering.
Unsupervised Learning
Presentation transcript:

Microarray Clustering Xiaole Shirley Liu STAT115/215

Profiles Heat map display Gene expression profile refers to individual rows or columns as profiles Row is profile of a gene Column is profile of a sample / condition Tongji 2009

Clustering Can cluster genes or samples Genes: have similar expression profile over different samples or conditions Samples: have similar expression profile over all the genes

Clustering Motivation for clustering: Clustering quality: Visualizing data Understand general characteristics of data Make generalizations about gene behavior Classify samples Clustering quality: “Distance”: 1 – corr Maximize inter (between)-cluster distance Minimize intra (within)-cluster distance

Cluster Stability Mask out some data (e.g. only sample a subset of genes or samples) to perform clustering Introduce a little noise to the data, then perform clustering May select only diff expr genes for clustering, or use genes on the same pathway, etc.

Missing Value Estimation Missing values affects clustering E.g. when calculating distances Naïve solution: Ignore gene/sample, fill in 0 or Avg K nearest neighbors Gene i has missing value at sample j Calculate corr between gene i and all other genes using all samples except j Find k genes (e.g. 10) with highest corr to i Calculate k genes’ avg expression at sample j

Hierarchical Clustering Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8

Hierarchical Clustering Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8

Hierarchical Clustering Gene 7 Gene 1 Gene 2 Gene 4 Gene 5 Gene 3 Gene 8 Gene 6

Hierarchical Clustering Repeatedly Merge two nodes (either a gene or a cluster) that are closest to each other Re-calculate the distance from newly formed node to all other nodes Branch length represents distance Linkage: distance from newly formed node to all other nodes Tongji 2009

Hierarchical Clustering Linkage Single Complete Average: pairwise distances

Some Brain Teasers If we have n data points How many internal nodes are in the H-cluster? How many possible ways to draw the H-cluster Break

Partitional Clustering Disjoint groups From hierarchical clustering: Cut a line from hierarchical clustering By varying the cut height, we could produce arbitrary number of clusters

K-means Algorithm Choose K centroids at random Expression in Sample1 Iteration = 0

K-means Algorithm Choose K centroids at random Assign object i to closest centroid Iteration = 1

K-means Algorithm Choose K centroids at random Assign object i to closest centroid Recalculate centroid based on current cluster assignment Iteration = 2

K-means Algorithm Choose K centroids at random Assign object i to closest centroid Recalculate centroid based on current cluster assignment Repeat until assignment stabilize Iteration = 3

K-means Algorithm Example Assign each objects to most similar center 1 2 3 4 5 6 7 8 9 10 Update the cluster means 1 2 3 4 5 6 7 8 9 10 K=2 Arbitrarily choose K object as initial cluster center reassign reassign Update the cluster means

K-means Algorithm Deterministic method Initial cluster centers important Can be trapped in local optimal How to pick initial cluster centers Run hierarchical cluster, find cut line Random start many times with different initial centers Not tolerant to outliers or noise

Problem With Outliers x1 x2 x3

Partition Around Medoids Pick one real data point (closest to all) instead of average as the centroid of the cluster More robust in the presence of noise and outliers

How to Pick K K = 2, gradually increase Improvement: reduce within cluster distance and increase between cluster distance Cost: cost with each increase in K Compare the cost with improvement, stop when not worth it

How to Pick K W(k) = total sum of squares within clusters B(k) = sum of squares between cluster means n = total number of data points Calinski & Harabasz, 1974 Hartigan, 1975 stop when H(k) < 10

How to Pick K

Practical Considerations Only cluster genes that are variable across samples The magic of 7! Break

Batch Effect Removal

Motivation Batch Effect: Non-biological variation Make samples not directly comparable Caused by differences: Array platform Lab protocol or experimenter Time of day or hybridization Atmospheric ozone level (Rhodes et al. 2004)

Example of Batch Effect Knock down expression using different cells and different siRNA in 3 batches

Batch vs Normalization Shouldn’t normalization take care of this? NO! Normalization and expression indexing procedures only adjust sample and probe effects Usually these are orthogonal to the batch effects, so batch effects are left in the data

Importance of Experimental Design Record run date, labs, personnel, environmental variables, etc. When possible, balance groups of interests, at least include some controls in each batch Avoid “perfect confounding” when batch and group are perfectly correlated. E.g. Treatments in one batch, controls in another Treatment 1 in Batch 1, Treatment 2 in Batch 2

Decomposing Data Heterogeneity Primary variables (T & C) Random variations (Error) Baseline expression Samples Genes = + +

Decomposing Data Heterogeneity Primary variables (T & C) Dependence variables (Batch) Independent variations (Error) Baseline expression Samples Genes = + + + Intuitively similar to LIMMA’s linear model, consider batch as some kind of treatment Break

Gene Ontology

Gene Annotation How to report differentially expressed genes or gene clusters? Enriched for certain pathways, certain functions, or proteins localized in the same complex, etc? Gene Ontology Consortium Ashburner et al 1998 Annotate gene function in the human genome Now extended to many model organisms Why do we care? Effectively communicate biomedical knowledge Organize and summarize annotations in structured way Allow effective and meaningful computation on gene annotations

GO Categories Molecular function Biological process Cellular component Describe gene’s jobs or abilities E.g. transporters, transcription factor Biological process Events or pathways E.g. cell differentiation, maturation, development Cellular component Describe locations (subcellular structures, macromolecular complexes) E.g. nucleus, cell membrane, protein complexes

GO

GO Relationships: Same term could be annotated at multiple branches Subclass: Is_a Membership: Part_of Topological: adjacent_to; Derivation: derives_from E.g. 5_prime_UTR is part_of a transcript, and mRNA is_a kind of transcript Same term could be annotated at multiple branches Directed acyclic graph

Evaluate Differentially Expressed Genes NetAffx mapped GO terms for all probesets Whole genome Up genes GO term X 100 80 Total 20K 200 Statistical significance? Binomial proportional test p = 100 / 20 K = 0.005 Check z table

Evaluate Differentially Expressed Genes Whole genome Up genes GO term X 100 80 Total 20K 200 Chi sq test or Fisher’s exact test: Up !Up Total GO: 80 (1) 20 (99) 100 !GO: 120 (199) 20K-120 (19701) 20K-100 Total: 200 20K-200 20K Check Chi-sq table

GO Tools for Microarray Analysis http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools Hundreds DAVID

Summary Clustering by distance Missing value estimation Hierarchical clustering, linkage K-mean clustering, how to determine K Batch effect considerations COMBAT for batch effect removal GO categories, directed acyclic GO enrichment statistics