Lecture 4 Microarray & Analysis Alizadeh et al. Nature 403 (2000)
Microarrays revolutionized biological and medical research. Before: one gene at a time; now: tens of thousands simultaneously - PROTEOMICS. Applications: gene expression; gene-disease relations; gene-gene interactions; finding co-regulated genes; understanding gene regulatory networks; many, many more.
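Basic idea of Microarray. Construction principle – Base sequences complementary to characteristic gene sequences, called probes, are arrayed on a microchip. Application principle – A liquid sample containing gene sequences is applied to the microchip – Using complementary base hybridization, the required information is extracted from how the sample interacts with the gene sequences on the microchip.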
Basic idea of Microarray. Construction – Place an array of probes on a microchip. A probe is (for example) an oligonucleotide ~25 bases long that characterizes a gene or genome. Each probe has many, many clones. The chip is about 2 cm by 2 cm. Application principle – Put a (liquid) sample containing genes on the microarray, allow probe and gene sequences to hybridize, and wash away the rest – Analyze the hybridization pattern.
cDNA microarray schema: cDNA chip construction principle
Microarray analysis. Operation principle: samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization). A microarray may have 60K probes.
Microarray Processing sequence From: Shin-Mu Tseng
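Gene Expression Data: gene expression data on p genes for n samples, arranged as a matrix with genes as rows and mRNA samples (sample1, sample2, sample3, sample4, sample5, …) as columns. Gene expression level of gene i in mRNA sample j = Log(Red intensity / Green intensity) for two-color arrays, or Log(Avg. PM - Avg. MM) for oligonucleotide arrays (PM = perfect match, MM = mismatch probes).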
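The two-color log-ratio on this slide can be sketched in Python. This is a minimal illustration, not the lecture's own code: the intensity values are made up, the names `log_ratio`, `red`, `green` and `expression` are hypothetical, and the base-2 logarithm is an assumption, since the slide does not state the base.

```python
import math

def log_ratio(red_intensity, green_intensity):
    """Log2 of the red/green intensity ratio for one spot (base 2 is an assumption)."""
    if red_intensity <= 0 or green_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red_intensity / green_intensity)

# Hypothetical raw intensities: rows are genes, columns are mRNA samples.
red = [[1200.0, 300.0, 800.0],
       [150.0, 600.0, 400.0]]
green = [[300.0, 300.0, 200.0],
         [600.0, 150.0, 400.0]]

# expression[i][j] = expression level of gene i in mRNA sample j
expression = [[log_ratio(r, g) for r, g in zip(red_row, green_row)]
              for red_row, green_row in zip(red, green)]
```

With these made-up numbers, a ratio of 4 gives +2, a ratio of 1 gives 0, and a ratio of 1/4 gives -2, so over- and under-expression are symmetric around zero.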
Some possible applications: a sample from a specific organ, to show which genes are expressed; comparison of samples from healthy and sick hosts, to find gene-disease connections; probes representing sets of human pathogens, for disease detection.
The amount of data from a single microarray is huge. If just two colors, then the amount of data (number of possible patterns) on an array with N probes is 2^N. Cannot analyze pixel by pixel. Analyze by pattern – cluster analysis.
Major Data Mining Techniques Link Analysis –Associations Discovery –Sequential Pattern Discovery –Similar Time Series Discovery Predictive Modeling –Classification –Clustering
Cluster Analysis: grouping similarly expressed genes, cell samples, or both. Strengthens signal when averages are taken within clusters of genes (Eisen). Useful (essential?) when seeking new subclasses of cells, tumours, etc. Leads to readily interpreted figures.
Some clustering methods and software. Partitioning: K-Means, K-Medoids, PAM, CLARA, … Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK. Density-based: CAST, DBSCAN, OPTICS, CLIQUE, … Grid-based: STING, CLIQUE, WaveCluster, … Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, … Two-way clustering. Block clustering.
A review paper assessing various methods: Algorithmic Approaches to Clustering Gene Expression Data, Ron Shamir, School of Computer Science, Tel-Aviv University, Tel-Aviv – orithmic.html Conclusion: hierarchical clustering exceptional
Density-based clustering
Hierarchical (used most often): agglomerative, divisive
Hierarchical Clustering: grouping similarly expressed genes [Figure: Gene Expression Profile Analysis, expression matrix of genes by samples A, B, C, …] From: Shin-Mu Tseng
After Clustering [Figure: Gene Expression Profile Analysis, expression matrix of genes by samples A, B, C, … after clustering] From: Shin-Mu Tseng
Eisen et al. Proc. Natl. Acad. Sci. USA 95 (1998) [Figure: clustered data compared with data randomized by row, by column, and both; axis: time]
Types of Similarity Measurements: distance measurements, correlation coefficients, association coefficients, probabilistic similarity coefficients
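Correlation Coefficients. The most popular correlation coefficient is the Pearson correlation coefficient (1892). Correlation between X = {X_1, X_2, …, X_n} and Y = {Y_1, Y_2, …, Y_n}: $s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$ – where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$; s_XY is the similarity between X & Y. From: Shin-Mu Tseng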
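The Pearson correlation between two expression profiles can be sketched as follows. This is a minimal standard-library illustration; the function name `pearson` and the example profiles are hypothetical and not taken from the lecture.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the centered values.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical expression profiles of three genes over four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite trend to gene_a

r_ab = pearson(gene_a, gene_b)  # close to +1
r_ac = pearson(gene_a, gene_c)  # close to -1
```

Because the coefficient ignores scale and offset, `gene_a` and `gene_b` count as maximally similar even though their absolute levels differ, which is usually the desired behaviour when comparing expression patterns.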
Now can use similarity for tree construction. Normalize similarity so that s_XX = 1. Then have an n×n similarity matrix S whose diagonal elements are 1. Define the distance matrix by (for example) D = 1 – S. Diagonal elements of D are 0. Now use the distance matrix to build a tree (using some tree-building software; recall the lecture on Phylogeny).
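The step from similarity matrix to tree can be sketched as below. The slide leaves the tree-building method to external software, so average linkage is an assumption made here for illustration; the matrix `S` and the names `distance_matrix` and `average_linkage` are hypothetical.

```python
def distance_matrix(S):
    """Elementwise D = 1 - S for a similarity matrix S with unit diagonal."""
    return [[1.0 - s for s in row] for row in S]

def average_linkage(D):
    """Agglomerative clustering with average linkage (an assumed choice).

    Returns one (new_cluster_id, members, merge_distance) tuple per merge.
    """
    clusters = {i: [i] for i in range(len(D))}  # cluster id -> member indices
    merges = []
    next_id = len(D)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Mean pairwise distance between members of clusters a and b.
                total = sum(D[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        members = sorted(clusters.pop(a) + clusters.pop(b))
        clusters[next_id] = members
        merges.append((next_id, members, d))
        next_id += 1
    return merges

# Hypothetical similarity matrix for 5 genes (indices 0..4).
S = [[1.0, 0.9, 0.8, 0.1, 0.2],
     [0.9, 1.0, 0.7, 0.2, 0.1],
     [0.8, 0.7, 1.0, 0.1, 0.1],
     [0.1, 0.2, 0.1, 1.0, 0.6],
     [0.2, 0.1, 0.1, 0.6, 1.0]]

D = distance_matrix(S)
merges = average_linkage(D)
merged_members = [members for _, members, _ in merges]
# merged_members == [[0, 1], [0, 1, 2], [3, 4], [0, 1, 2, 3, 4]]
```

The merge order is the tree: genes 0 and 1 join first, gene 2 joins them next, genes 3 and 4 form their own branch, and the two branches meet at the root.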
A dendrogram (tree) for clustered genes. Let p = number of genes. 1. Calculate within-class correlation. 2. Perform hierarchical clustering, which will produce (2p-1) clusters of genes. 3. Average within clusters of genes. 4. Perform testing on averages of clusters of genes as if they were single genes. E.g. p=5: Cluster 6=(1,2), Cluster 7=(1,2,3), Cluster 8=(4,5), Cluster 9=(1,2,3,4,5).
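The cluster count and the averaging step for the p=5 example can be sketched as follows. The cluster memberships are the ones on the slide (genes numbered from 1); the expression values and the names `expression`, `clusters` and `cluster_average` are made up for illustration, and the testing step is not covered.

```python
# Hypothetical expression matrix: p = 5 genes, each measured in 3 samples.
expression = {
    1: [2.0, 1.0, 0.0],
    2: [4.0, 3.0, 2.0],
    3: [3.0, 2.0, 1.0],
    4: [-1.0, 0.0, 1.0],
    5: [-3.0, -2.0, -1.0],
}
p = len(expression)

# Clusters 1..p are the single genes; 6..9 are the merges from the slide.
clusters = {gene: (gene,) for gene in expression}
clusters.update({6: (1, 2), 7: (1, 2, 3), 8: (4, 5), 9: (1, 2, 3, 4, 5)})

def cluster_average(members, expression):
    """Average the expression profiles of the member genes, sample by sample."""
    profiles = [expression[g] for g in members]
    n_samples = len(profiles[0])
    return [sum(profile[j] for profile in profiles) / len(profiles)
            for j in range(n_samples)]

# One averaged profile per cluster; each can then be treated like a single gene.
averages = {cid: cluster_average(members, expression)
            for cid, members in clusters.items()}
```

The dictionary holds 2p-1 = 9 profiles: the 5 original genes plus one averaged profile for each of the 4 internal nodes of the dendrogram.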
A real case: Nature, Feb. 2000. Paper by Alizadeh, A. et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
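Validation Techniques: Hubert's Γ Statistic. X = [X(i, j)] and Y = [Y(i, j)] are two n × n matrices – X(i, j): similarity of gene i and gene j – Y(i, j) = 1 if genes i and j are in the same cluster, 0 otherwise – Hubert's Γ statistic represents the point serial correlation between X and Y: $\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{(X(i,j) - \bar{X})(Y(i,j) - \bar{Y})}{\sigma_X \sigma_Y}$, where M = n (n - 1) / 2 and $\bar{X}$, $\bar{Y}$, $\sigma_X$, $\sigma_Y$ are the means and standard deviations of the entries of X and Y over the M pairs i < j – A higher value of Γ represents better clustering quality. From: Shin-Mu Tseng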
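A sketch of the statistic, assuming the normalized form of Hubert's Γ: the correlation, over the M = n(n-1)/2 gene pairs, between the similarity X(i, j) and a 0/1 co-membership indicator Y(i, j). The similarity matrix, the two labelings and the function name `hubert_gamma` are hypothetical.

```python
import math

def hubert_gamma(X, labels):
    """Normalized Hubert's Gamma for similarity matrix X and cluster labels."""
    n = len(X)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    M = len(pairs)  # equals n * (n - 1) / 2
    xs = [X[i][j] for i, j in pairs]
    # Y(i, j) = 1 if genes i and j share a cluster, 0 otherwise.
    ys = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    mean_x = sum(xs) / M
    mean_y = sum(ys) / M
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / M)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / M)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("Gamma is undefined when X or Y is constant")
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (M * sd_x * sd_y)

# Hypothetical similarity matrix for 5 genes.
X = [[1.0, 0.9, 0.8, 0.1, 0.2],
     [0.9, 1.0, 0.7, 0.2, 0.1],
     [0.8, 0.7, 1.0, 0.1, 0.1],
     [0.1, 0.2, 0.1, 1.0, 0.6],
     [0.2, 0.1, 0.1, 0.6, 1.0]]

gamma_good = hubert_gamma(X, [0, 0, 0, 1, 1])  # groups the similar genes
gamma_poor = hubert_gamma(X, [0, 1, 0, 1, 0])  # splits the similar genes
```

The labeling that keeps highly similar genes together scores higher than the one that separates them, which is how the statistic is used to compare candidate clusterings.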
Discovering sub-groups
Time Course Data Gene Expression is time-dependent
Sample of time course of clustered genes time
Limitations Cluster analyses : –Usually outside the normal framework of statistical inference –Less appropriate when only a few genes are likely to change –Needs lots of experiments Single gene tests : –May be too noisy in general to show much –May not reveal coordinated effects of positively correlated genes. –Hard to relate to pathways
