Clustering Analysis for Microarray Data

Ho Kim, School of Public Health, Seoul National University

Microarrays: simultaneous observation of the expression of a large number of genes
Contribute to understanding gene regulation and interaction. Full Yeast Genome in a Chip (Brown Lab, Stanford Univ.)

Microarray Experiment
cDNA Microarray Experiment using Apo AI

Microarray & Statistical Method
Biological question
- differences in gene expression patterns
- classification of genes and samples
Experimental Design
Microarray Experiment
Image Processing
Normalization
Statistical problems in all the procedures
Validation
Estimation, Testing, Clustering, Classification
Biological interpretation and verification

Image Analysis-Practical Problems 1-
Comet Tails: likely caused by insufficiently rapid immersion of the slides in the succinic anhydride blocking solution.

High Background: two likely causes: insufficient blocking; precipitation of the labeled probe. Weak Signals

Spot overlap: likely cause: too much rehydration during post-processing.

Steps in Image Processing
1. Addressing: locate the spot centers.
2. Segmentation: classify pixels as either signal or background (using seeded region growing).
3. Information extraction: for each spot of the array, calculate signal intensity pairs, background, and quality measures.

Normalization: identify and remove the technical (systematic) variation that affects expression levels in microarray data

M vs. A(Apo AI KO Experiment)
M = log2(R / G)
A = log2(R * G) / 2
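The M and A definitions above can be computed per spot as in the following minimal sketch. It assumes R and G are the background-corrected red and green channel intensities; the function name `ma_values` and the sample intensities are made up for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red / green)        # log ratio
    a = math.log2(red * green) / 2    # average log intensity
    return m, a

# Hypothetical spot intensities (red, green), for illustration only.
spots = [(1024.0, 256.0), (500.0, 500.0), (64.0, 256.0)]
for r, g in spots:
    m, a = ma_values(r, g)
    print(f"R={r:7.1f}  G={g:7.1f}  M={m:+.2f}  A={a:.2f}")
```

A spot with equal red and green intensity has M = 0, so the M vs. A plot shows departures from equal expression as vertical distance from the horizontal axis.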

Lowess Regression Global lowess
Assumption: changes roughly symmetric at all intensities.

Print-tip-group Normalization
Assumption: for every print-tip group, changes are roughly symmetric at all intensities.
log2(R/G) -> log2(R/G) - c_i(A) = log2(R / (k_i(A) G))
c_i(A) is the loess fit to the M vs. A plot for the i-th grid, and k_i(A) = 2^c_i(A).
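A rough sketch of print-tip-group normalization follows. It is not the implementation used for the slides. The smoother `loess_fit` is a simplified stand-in for loess (a tricube-weighted local linear fit with no robustness iterations), and the function names, the `span` parameter, and the toy data are all assumptions made for illustration.

```python
def loess_fit(xs, ys, x0, span=0.5):
    """Tricube-weighted local linear fit evaluated at x0 (simplified loess)."""
    n = len(xs)
    k = min(n, max(3, round(span * n)))           # neighbourhood size
    dists = sorted(abs(x - x0) for x in xs)
    # Widen the bandwidth slightly so the k-th neighbour keeps a small weight.
    h = dists[k - 1] * 1.001 + 1e-12
    w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx < 1e-12:                               # degenerate: fall back to local mean
        return my
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)

def print_tip_normalize(m, a, tip, span=0.5):
    """Subtract each print-tip group's own fitted curve c_i(A) from M."""
    normalized = [0.0] * len(m)
    for group in set(tip):
        idx = [j for j, t in enumerate(tip) if t == group]
        a_g = [a[j] for j in idx]
        m_g = [m[j] for j in idx]
        for j in idx:
            normalized[j] = m[j] - loess_fit(a_g, m_g, a[j], span)
    return normalized

# Toy data: two print-tip groups with different intensity-dependent bias.
a = [6.0 + 0.5 * j for j in range(16)] * 2
tip = [0] * 16 + [1] * 16
m = [0.5 - 0.05 * x for x in a[:16]] + [-0.3 + 0.02 * x for x in a[16:]]
m_norm = print_tip_normalize(m, a, tip)
print(max(abs(v) for v in m_norm))   # close to zero once each group's trend is removed
```

Fitting one curve per print-tip group, rather than one global curve, is what lets the correction differ between grids on the same slide.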

Effect of Location Normalization
Before normalization After print-tip-group Normalization (just location)

Effect of location + scale Normalization
Just location

Cluster Analysis: intro
Genes with similar function yield similar expression patterns in microarray experiments.
Cluster analysis searches for groups (clusters) in the data, based on a measure of similarity or dissimilarity (a distance index).
Cluster: a set of entities which are alike.

Similarity
The similarity (dissimilarity) between two objects measures how alike (different) they are. In practice we use a distance function between two objects.
Distance properties:
1. d(x, x) = 0
2. d(x, y) >= 0
3. d(x, y) = d(y, x)

Kinds of Distance-1
Euclidean: d(x, y) = sqrt(sum_i (x_i - y_i)^2)
Manhattan: d(x, y) = sum_i |x_i - y_i|
Maximum, Binary, …
[Figure: Euclidean distance and Manhattan distance between points (X1,Y1) and (X2,Y2)]
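The Euclidean, Manhattan, and maximum distances can be written in a few lines each. This is a minimal sketch using their standard definitions; the function names and the two sample points are chosen here for illustration.

```python
import math

def euclidean(x, y):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def maximum(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
print(maximum(p, q))     # 4.0
```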

Kinds of Distance-2 Correlation
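The formula on this slide is not preserved in the transcript. One common convention, assumed here, is to turn the Pearson correlation r into a dissimilarity as 1 - r, so that profiles with the same shape are close regardless of their magnitude. The function names and the toy profiles are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    """Assumed convention: 1 - r, ranging from 0 (same shape) to 2 (opposite)."""
    return 1.0 - pearson(x, y)

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]   # same shape as g1, ten times the magnitude
g3 = [4.0, 3.0, 2.0, 1.0]       # opposite trend
print(correlation_distance(g1, g2))   # approximately 0
print(correlation_distance(g1, g3))   # approximately 2
```

Note that g1 and g2 are far apart in Euclidean distance but nearly identical under correlation distance, which is why the choice of distance changes the clusters found.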

K-Means Clustering: select k initial seeds. K is pre-determined.

K-Means Clustering: assign each data point to the cluster of its nearest seed.
Calculate each cluster mean. Each seed is moved to the mean of its cluster.

K-Means Clustering: repeat until the seeds no longer change.
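The K-means steps above (choose seeds, assign points, move seeds to cluster means, repeat until the seeds stop changing) can be sketched as follows. This is a minimal illustration with squared Euclidean distance, not a production implementation; the function names, the explicit `seeds` argument, and the toy points are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, seeds, max_iter=100):
    """Return (labels, centers) after iterating from the given initial seeds."""
    centers = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest seed.
        labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each seed moves to the mean of its cluster.
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(old)   # keep the seed of an empty cluster
        if new_centers == centers:        # seeds did not change: stop
            break
        centers = new_centers
    return labels, centers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, seeds=[(0.0, 0.0), (0.0, 1.0)])
print(labels)    # [0, 0, 0, 1, 1, 1]
print(centers)
```

Passing the seeds explicitly makes the run reproducible and makes it easy to try different starting seeds on the same data.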

K-Means Clustering
The result depends on the initial seeds. As the dimension goes up, the problem gets more serious.
As the number of clusters K is changed, the cluster membership can change in arbitrary ways. That is, with, say, four clusters, the clusters need not be nested within the three clusters above.
K-means does not give a linear ordering of objects within a cluster.

EXAMPLE

Hierarchical Clustering
Find relatively homogeneous clusters of cases based on measured characteristics.
Agglomerative (bottom-up): starts with each case in a separate cluster and then combines the clusters sequentially, reducing the number of clusters at each step until only one cluster is left. When there are N cases, this involves N-1 clustering steps, or fusions.
Divisive (top-down): the reverse procedure of agglomerative.

Dissimilarity measures
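– single link: similarity of the two most similar members
– complete link: similarity of the two least similar members
– average link: average similarity between members
Example distances between A, B, C, D: d(A,B) = 2, d(A,C) = 5, d(B,C) = 3, d(A,D) = 9, d(B,D) = 7, d(C,D) = 4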

Dissimilarity measures single linkage
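D((AB), C) = min{d(A,C), d(B,C)} = min{5, 3} = 3
D((AB), D) = min{d(A,D), d(B,D)} = min{9, 7} = 7
D(C, D) = 4
min{3, 7, 4} = 3, so merge (AB) and C
D((ABC), D) = min{d(A,D), d(B,D), d(C,D)} = min{9, 7, 4} = 4
Single-linkage dendrogram: A and B join at height 2, C joins (AB) at 3, D joins (ABC) at 4.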

Dissimilarity measures complete linkage
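D((AB), C) = max{d(A,C), d(B,C)} = max{5, 3} = 5
D((AB), D) = max{d(A,D), d(B,D)} = max{9, 7} = 9
D(C, D) = 4
min{5, 9, 4} = 4, so merge C and D, giving (AB), (CD)
D((AB), (CD)) = max{d(A,C), d(A,D), d(B,C), d(B,D)} = max{5, 9, 3, 7} = 9
Complete-linkage dendrogram: A and B join at height 2, C and D join at 4, (AB) and (CD) join at 9.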

Dissimilarity measures average linkage
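D((AB), C) = mean{d(A,C), d(B,C)} = mean{5, 3} = 4
D((AB), D) = mean{d(A,D), d(B,D)} = mean{9, 7} = 8
D(C, D) = 4
min{4, 8, 4} = 4 (a tie; here C and D are merged, giving (AB), (CD))
D((AB), (CD)) = mean{d(A,C), d(A,D), d(B,C), d(B,D)} = mean{5, 9, 3, 7} = 6
Average-linkage dendrogram: A and B join at height 2, C and D join at 4, (AB) and (CD) join at 6.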
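The three worked examples can be reproduced with a small agglomerative routine that differs only in how the distance between two clusters is summarized. This is a naive sketch for four items, with names (`DIST`, `LINKAGE`, `agglomerate`) chosen here. When two candidate merges tie, it takes the first pair encountered, which in this example selects (C, D) for average linkage, as in the slide.

```python
from itertools import combinations

# Pairwise distances from the example: AB=2, AC=5, BC=3, AD=9, BD=7, CD=4.
DIST = {
    frozenset("AB"): 2, frozenset("AC"): 5, frozenset("BC"): 3,
    frozenset("AD"): 9, frozenset("BD"): 7, frozenset("CD"): 4,
}

LINKAGE = {
    "single": min,                                  # closest pair of members
    "complete": max,                                # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),        # mean over all member pairs
}

def cluster_distance(c1, c2, linkage):
    pair_dists = [DIST[frozenset((a, b))] for a in c1 for b in c2]
    return LINKAGE[linkage](pair_dists)

def agglomerate(items, linkage):
    """Return the list of (merged cluster, merge height), one per fusion."""
    clusters = [(x,) for x in items]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters (first pair wins on ties).
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(pair[0], pair[1], linkage))
        height = cluster_distance(c1, c2, linkage)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append(("".join(sorted(c1 + c2)), height))
    return merges

for name in ("single", "complete", "average"):
    print(name, agglomerate("ABCD", name))
```

With N = 4 items each run performs N-1 = 3 fusions, and the merge heights are the heights at which branches join in the dendrogram.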

Dendrogram: Hierarchical Clustering (Eisen et al. 1998)
Clustered display of data from time course of serum stimulation of primary human fibroblasts

EXAMPLE

Self Organizing Map(SOM)

SOM is a neural network based on a divisive clustering approach.
Microarray data are high dimensional: p genes measured under n treatments/times → p points in n-dimensional space. SOM: find clusters on a 2 (or 3)-D surface in n-D space.

SOM: two layers, a data layer and a neuron layer
Data layer: the input, in which we want to find patterns.
Neuron layer: a collection of neurons. Neurons are connected to other neurons in the neuron layer as well as to the data in the data layer.
Through learning, the neuron layer adapts to the data layer, reducing complexity and making the data easier for humans to understand.

EXAMPLE

SOM parameters
1) Theta: the size of the neighborhood function
2) Phi: the amount a neuron within the neighborhood circumference will be moved towards the input vector
3) Momentum: the amount to reduce the learning parameters theta and phi after each iteration
4) Tolerance: the circumference around each neuron that will define the set of input vectors that belong to each neuron
5) n x n neurons: the number of neurons in the lattice
6) Phi tolerance: a limit for how long the algorithm will keep running
7) Pause: the number of milliseconds to wait at each iteration
8) Updates before repaint: a way to control how many iterations the algorithm will perform before updating the plot
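To make the roles of theta, phi, and momentum concrete, here is a minimal SOM training sketch. It is not the software these parameters come from. It assumes theta is the width of a Gaussian neighborhood on the neuron lattice, phi is the learning rate, and momentum is a multiplicative decay applied to both after every iteration; the function names and the toy expression profiles are illustrative.

```python
import math
import random

def best_matching_unit(weights, x):
    """Lattice position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda rc: sum((w - v) ** 2 for w, v in zip(weights[rc], x)))

def train_som(data, rows, cols, theta=1.5, phi=0.5, momentum=0.95, n_iter=200, seed=0):
    rng = random.Random(seed)
    # Initialise each neuron's weight vector from a randomly chosen data point.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    for _ in range(n_iter):
        x = rng.choice(data)                      # present one input vector
        bmu = best_matching_unit(weights, x)
        for rc, w in weights.items():
            grid_d2 = (rc[0] - bmu[0]) ** 2 + (rc[1] - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2 * theta ** 2))   # neighbourhood strength in [0, 1]
            for j in range(len(w)):
                w[j] += phi * h * (x[j] - w[j])   # move the neuron towards the input
        theta *= momentum                         # shrink the neighbourhood
        phi *= momentum                           # reduce the learning rate
    return weights

# Toy expression profiles over four time points: three rising, three falling.
genes = [
    [0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 2.1, 2.9], [0.2, 0.9, 1.9, 3.1],
    [3.0, 2.0, 1.0, 0.0], [2.9, 2.1, 1.1, 0.1], [3.1, 1.9, 0.9, 0.2],
]
som = train_som(genes, rows=2, cols=2)
for g in genes:
    print(g, "->", best_matching_unit(som, g))
```

Each gene is assigned to the cluster of its best-matching neuron, and the neuron's weight vector plays the role of the cluster centroid.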

Example - Tamayo et al.(1999)
6 by 5 SOM. The 828 genes that passed the variation filter were grouped into 30 clusters. Each cluster is represented by the centroid (average pattern) for genes in the cluster. Expression levels are shown on y-axis and time points on x-axis. Error bars indicate the SD of average expression. n indicates the number of genes within each cluster.

HL-60 SOM. HL-60 cells were treated with PMA for 0, 0.5, 4, or 24 hours, and expression levels of more than 6,000 genes were measured at each time point. The 567 genes passing the variation filter were grouped by a SOM.

Other methods, partitioning
K-Means (Ruspini, 1970; centroid): iterative relocation clustering
PAM (Partitioning Around Medoids): more robust than K-Means
CLARA (Clustering Large Applications)
FUZZY: fractions of membership
MCLUST: probability-based model, Bayes factor for choosing k clusters
MDS (Multidimensional Scaling: Tamayo et al., 1999)
SOM (Self-Organizing Map: Tamayo et al., 1999)

Other methods, etc.
CLICK (Sharan and Shamir, 2000)
Gene Shaving (Hastie et al., 2000)
Gene Voting (Golub et al., 1999)
SVM (Support Vector Machine; Brown et al., 2000)

Software
Commercial S/W (NAME, COMPANY: FEATURE)
ArraySuite, Affymetrix: plots, fold changes
ImaGene, Biodiscovery: quantification of gene expression value, constant-factor normalization
GeneSight: background adjustment, clustering (hierarchical, SOM)
GeneSpring, Silicon Genetics: normalization, clustering (hierarchical, SOM), fold-change
Spotfire: PCA, clustering, fold-change
Resolver, Rosetta: clustering, PCA, fold-change, plots
LifeArray, Incyte: clustering, PCA, fold-change
Expressionist, GeneData: clustering, PCA, fold-change
GeneExpress, Gene Logic
IPLab MicroArray Suite, Scanalytics: clustering, image, fold-changes

Public S/W (NAME, ORG: FEATURE)
J-Express, U Bergen: clustering, PCA
UCI/NCGR: t-test for fold-change
TreeView, Stanford: clustering, SOM, image analysis (similarity, Cluster, Lawrence Berkeley Lab, Eisen Lab)
EPCLUST, EBI: clustering
SOM, Whitehead Inst.

Software - website link
Recently updated site for software of microarray data analysis
Open lecture room, seminar materials
