Statistical Analysis of DNA Microarray. An Example of HDLSS in Genetics.


The Data

Expression Matrix: Rows represent genes (the feature vectors). Columns represent different cell samples, for example cancer cells from different patients. Each element (i, j) of the matrix is the expression level of gene i in cell sample j.
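The layout described above can be sketched in a few lines of Python. The gene names, sample names, and numbers are invented purely for illustration; a real microarray matrix has thousands of gene rows and only a handful of sample columns, which is what makes it a high-dimension, low-sample-size problem.

```python
# Hypothetical toy data: 3 genes (rows) x 4 cell samples (columns).
genes = ["gene_A", "gene_B", "gene_C"]
samples = ["patient_1", "patient_2", "patient_3", "patient_4"]
X = [
    [2.1, 1.9, -0.3, -0.5],
    [0.0, 0.2, 1.5, 1.7],
    [-1.2, -1.0, 0.4, 0.6],
]


def expression(i, j):
    """Expression level of gene i in cell sample j (0-based indices)."""
    return X[i][j]


def gene_profile(i):
    """Row i: the feature vector of gene i across all cell samples."""
    return X[i]


def sample_profile(j):
    """Column j: the expression of every gene in cell sample j."""
    return [row[j] for row in X]


print(genes[1], "in", samples[2], "->", expression(1, 2))
```

Clustering genes compares rows of this matrix; clustering cell samples compares columns.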

Goal of Analysis of Expression Matrix: Statistical methods are applied to: 1. “Group” similar genes together, giving groups of functionally similar genes. 2. “Extract” a representative gene from each group. 3. “Group” similar cell samples together.

Overview DNA Microarray Technology One cell sample. Level of expression. Microarray technique.

Getting the Data... One Cell Sample at a Time

Getting the Data… Measuring the Level of Expression Gene by Gene: Each spot in this DNA microarray represents the level of expression of a single gene in the tumor cell compared to a reference cell. Standardize the level of expression of this cell to make it comparable to other cells. Spot legend: expressed in reference cell; expressed in reference and tumor cell; expressed in tumor cell; not expressed.

Level of Expression … mRNA

All cells contain the same DNA, and therefore the same genes, but in any one cell not all genes are active. What differentiates cells is which genes are active, or expressed. To measure a cell's expression we measure the messenger RNA molecule, denoted mRNA.

Measuring The Level of Expression … Complementary Strands

mRNA … DNA: mRNA is a single-stranded copy of a piece of DNA; it is highly unstable. DNA is double-stranded, with one strand complementary to the other; it is stable.

Getting One Sample … Microarray Technique

Microarray Technique (Cont.) … The Microarray: Microarrays are made from a collection of purified DNAs. A drop of each type of DNA in solution is placed onto a specially prepared glass microscope slide by an arraying machine. The arraying machine can quickly produce a regular grid of thousands of spots in a square about 2 cm on a side, small enough to fit under a standard slide coverslip. The DNA in the spots is bonded to the glass to keep it from washing off during the hybridization reaction.

Microarray Technique (Cont.) …Description of the Method Definition of Microarray from the National Human Genome Research Institute : “…The method uses a robot to precisely apply droplets containing functional DNA to glass slides. Researchers then attach fluorescent labels to DNA from the cell they are studying. The labeled probes are allowed to bind to complementary DNA strands on the slides. The slides are put into a scanning microscope that can measure the brightness of each fluorescent dot; brightness reveals how much of a specific DNA fragment is present, an indicator of how active it is.”

Microarray Technique (Cont.) … The Method Step by Step: First step: to measure the gene expression level of a cell, collect mRNA from the cell of interest, usually a cancer cell, and the same quantity of mRNA from a “reference cell”. Second step: mRNA to cDNA. mRNA is highly unstable; to stabilize it we synthesize the complementary strand, creating cDNA (complementary DNA). Third step: create cDNA probes. Label the cDNA from each cell with fluorescent dyes, using a differently colored fluor for each sample.

Microarray Technique …The Method Step by Step (Contd.) Fourth step: hybridize the cDNA probes from the two samples to the Microarray. Once the cDNA probes have been hybridized to the array and any loose probe has been washed off, the array must be scanned to determine how much of each probe is bound to each spot.

Statistical Methods Clustering. Gene shaving algorithm: use of PCA for clustering.

Clustering Overview: - K-means clustering. - Hierarchical clustering. - Validation method.

What Is Clustering? For a sample of size n described in a d-dimensional feature space, clustering is a procedure that: 1. Divides the d-dimensional feature space into k disjoint groups. 2. Ensures that data points within each group are more similar to each other than to any data point in other groups. Illustration for n = 45, d = 2 and k = 3.

Similarity Between Feature Vectors: The choice of similarity function depends on the data. For example, if the data are invariant under linear transformation or rotation, then the similarity function has to be invariant too. The similarity function could be a distance or an inner product. Examples of similarity functions: 1. Euclidean distance, used here to illustrate for d = 2. 2. Correlation, used for microarray data.
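A minimal sketch of the two similarity functions named above, assuming that “correlation” means the Pearson correlation coefficient (the slide does not say which variant is meant). The two toy profiles are invented: they have the same shape but very different magnitudes.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation; unchanged by shifting or positively rescaling x or y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Two hypothetical gene profiles with the same pattern at different scales.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]

print(euclidean_distance(g1, g2))   # large: the magnitudes differ
print(pearson_correlation(g1, g2))  # essentially 1: the patterns agree
```

The example suggests why correlation is a natural choice for expression data: two genes that rise and fall together across samples are treated as similar even if their absolute expression levels differ.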

K-means Clustering: Divides the d-dimensional feature space into k parts given by the Voronoi partition of the k mean vectors. The algorithm finds the mean vector of each cluster. Illustration for d = 2 and k = 3: red points represent the cluster means and red lines represent the Voronoi partition.

Algorithm for K-means Clustering
Algorithm:
1. Begin: initialize n, k, m_1, m_2, ..., m_k
2. Do: classify the n samples according to the nearest m_i
3. Recompute m_i
4. Until no change in m_i
5. Return m_1, m_2, ..., m_k
6. End
Computational complexity: O(ndkT), where T is the number of iterations.
For d = 2, illustration of the trajectories of the 3 means.
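The pseudocode above can be turned into a short runnable sketch. This is one possible reading, not a reference implementation: the initial means are passed in by the caller, Euclidean distance is assumed for “nearest”, and an empty cluster simply keeps its previous mean (a choice the pseudocode leaves open). The six 2-D points are invented.

```python
import math


def kmeans(points, means, max_iter=100):
    """Alternate assignment and mean update until the means stop changing."""
    means = [list(m) for m in means]
    labels = []
    for _ in range(max_iter):
        # Step 2: classify each sample according to its nearest mean.
        labels = [
            min(range(len(means)), key=lambda c: math.dist(p, means[c]))
            for p in points
        ]
        # Step 3: recompute each mean from the samples assigned to it.
        new_means = []
        for c in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_means.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_means.append(means[c])  # assumption: keep an empty cluster's mean
        # Step 4: stop when no mean has changed.
        if new_means == means:
            break
        means = new_means
    return means, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_means, labels = kmeans(points, means=[(0, 0), (10, 10)])
print(final_means, labels)
```

Each pass computes the distance from n samples to k means in d dimensions, which is where the O(ndkT) cost over T iterations comes from.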

K-means Clustering for Microarray Data (cf. picture k.mean): K-means clustering of lymphoma data. Lymphoma profiles were clustered using the expression of 148 germinal-center-specific genes and the Euclidean distance metric. (a) represents the germinal-cell subtype; and (b) represents the activated subtype. Each column represents a specific gene and each row a specific cancer profile.

Hierarchical Clustering [Figures: dendrogram; Venn diagram of clustered data]

Hierarchical Clustering (Cont.): Multilevel clustering: at level 1 we have n clusters and at level n we have one cluster. Agglomerative HC starts with singletons and merges clusters. Divisive HC starts with one cluster containing all samples and splits clusters.

Hierarchical Clustering …Nearest Neighbor Algorithm Nearest Neighbor Algorithm is an agglomerative HC (bottom-up). The algorithm starts with n nodes (n is the size of our sample). At every level the 2 most similar nodes are merged together into one node. The algorithm stops when we get the desired number of clusters.
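A minimal sketch of the procedure just described, assuming Euclidean distance and taking the distance between two nodes to be the smallest distance between any pair of their members (the usual meaning of “nearest neighbor”, also called single linkage). The five points are invented, and the brute-force pair search is written for clarity rather than speed.

```python
import math


def nearest_neighbor_clustering(points, k):
    """Agglomerative clustering: merge the two closest nodes until k remain."""
    clusters = [[i] for i in range(len(points))]  # start with n singleton nodes

    def linkage(a, b):
        # Nearest-neighbor distance between two nodes (lists of point indices).
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the pair of nodes with the smallest nearest-neighbor distance.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]  # merge node b into node a
        del clusters[b]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(nearest_neighbor_clustering(points, k=3))
```

Recording the merge made at each level, instead of stopping at k clusters, would give the information needed to draw a dendrogram.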

Nearest Neighbor, data to cluster.

Nearest Neighbor, Level 2, k = 7 clusters.

Nearest Neighbor, Level 3, k = 6 clusters.

Nearest Neighbor, Level 4, k = 5 clusters.

Nearest Neighbor, Level 5, k = 4 clusters.

Nearest Neighbor, Level 6, k = 3 clusters.

Nearest Neighbor, Level 7, k = 2 clusters.

Nearest Neighbor, Level 8, k = 1 cluster.

Results of Hierarchical Clustering on Microarray Data: Grouping functionally similar genes. Grouping similar cell samples. Cf. the picture on page 6 of the file Perou.trend.review2001.pdf.

Criterion Function for Clustering: Criterion functions depend on the grouping and on the number of clusters. Examples are: 1. Sum of squared errors: $J_e = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - m_i \|^2$, where $C_i$ is the set of samples in cluster i and $m_i$ is their mean. 2. Scatter criterion: $|S_W| / |S_T|$, where $S_T = S_W + S_B$, i.e. the total scatter matrix is decomposed into the within-cluster scatter matrix $S_W$ and the between-cluster scatter matrix $S_B$. The best clustering minimizes the criterion.
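The decomposition of total scatter into within-cluster and between-cluster parts can be checked numerically. The sketch below works with the traces of the scatter matrices rather than the determinants used in the scatter criterion above; this is a deliberate simplification, since the trace of the within-cluster scatter matrix equals the sum of squared errors. The four points and the two groupings are invented.

```python
def mean(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(x, m):
    """Squared Euclidean distance between x and m."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def scatter_traces(clusters):
    """Return (tr S_W, tr S_B, tr S_T) for a grouping given as a list of clusters."""
    all_points = [x for c in clusters for x in c]
    m = mean(all_points)
    cluster_means = [mean(c) for c in clusters]
    # tr S_W: sum of squared errors around each cluster's own mean.
    s_w = sum(sq_dist(x, mi) for c, mi in zip(clusters, cluster_means) for x in c)
    # tr S_B: size-weighted squared distances of cluster means from the overall mean.
    s_b = sum(len(c) * sq_dist(mi, m) for c, mi in zip(clusters, cluster_means))
    # tr S_T: squared distances of all points from the overall mean.
    s_t = sum(sq_dist(x, m) for x in all_points)
    return s_w, s_b, s_t


good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]  # groups nearby points together
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]   # groups distant points together
print(scatter_traces(good))
print(scatter_traces(bad))
```

Both groupings share the same total scatter, so lowering the within-cluster part is the same as raising the between-cluster part; the grouping of nearby points has the much smaller sum of squared errors.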

Gene Shaving: The “gene shaving” method is also a method of clustering genes and cell samples. But unlike classic clustering, in this method one gene can belong to more than one cluster.

Gene Shaving Iteration

Gene Shaving … Iteration:
1. Start with the entire expression matrix X, each row centered to have zero mean.
2. Compute the leading principal component (PC) of the rows of X.
3. Shave off the proportion alpha (10%) of the genes having the smallest absolute inner product with the leading PC.
4. Repeat steps 2 and 3 until only one gene remains.
5. This produces a nested sequence of gene clusters $S_n \supset \dots \supset S_k \supset \dots \supset S_1$, where $S_k$ denotes a cluster of k genes. Estimate the optimal cluster size k using the gap statistic.
6. Orthogonalize each row of X with respect to $\bar{x}_{S_k}$, the average gene in $S_k$, for the optimal k from step 5.
7. Repeat steps 1-5 with the orthogonalized data to find the second optimal cluster. This process is continued until a maximum of M clusters are found.
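Steps 1 to 4 can be sketched as follows; the gap statistic and the orthogonalization of steps 5 to 7 are not covered. Several choices here are assumptions made for illustration, not details from the slides: the leading PC is approximated by power iteration from a fixed starting vector, at least one gene is shaved per pass so that small examples terminate, and the 6 x 4 matrix is invented (three strongly co-varying genes plus three low-variance ones).

```python
import math


def center_rows(X):
    """Step 1: center each gene (row) to zero mean."""
    return [[x - sum(row) / len(row) for x in row] for row in X]


def leading_pc(rows, iters=100):
    """Step 2: leading principal direction of the rows, by power iteration on X^T X."""
    p = len(rows[0])
    v = [float(j + 1) for j in range(p)]  # not constant: centered rows are orthogonal to a constant vector
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(row, v)) for row in rows]                  # X v
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows))) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def shave_sequence(X, alpha=0.1):
    """Steps 2-4: return the nested sequence of gene index sets, largest first."""
    rows = center_rows(X)
    active = list(range(len(rows)))
    sequence = [list(active)]
    while len(active) > 1:
        pc = leading_pc([rows[i] for i in active])
        score = {i: abs(sum(a * b for a, b in zip(rows[i], pc))) for i in active}
        n_drop = max(1, int(alpha * len(active)))  # assumption: always shave at least one gene
        ranked = sorted(active, key=lambda i: score[i], reverse=True)
        active = sorted(ranked[: len(active) - n_drop])
        sequence.append(list(active))
    return sequence


X = [
    [5.0, -5.0, 5.0, -5.0],
    [4.0, -4.0, 4.0, -4.0],
    [-3.0, 3.0, -3.0, 3.0],
    [0.1, 0.2, -0.1, -0.2],
    [0.2, -0.1, -0.2, 0.1],
    [-0.1, 0.1, 0.2, -0.2],
]
sequence = shave_sequence(X)
for cluster in sequence:
    print(len(cluster), cluster)
```

Gene 2 is anti-correlated with genes 0 and 1 but survives alongside them, because shaving ranks genes by the absolute value of the inner product with the leading PC.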

To Estimate Cluster Size: Gap Estimate. For cluster $S_k$, let $D_k$ be the scatter estimate, i.e. $D_k = 100 \, S_B / S_T$. For b in {1, …, B}, let: 1. $X^{*(b)}$ be the permuted data matrix (permuting the elements within each row of X). 2. $D_k^{*(b)}$ be the scatter estimate for cluster $S_k^{*(b)}$. Let $\bar{D}_k^{*}$ be the mean of the $D_k^{*(b)}$. Then $\mathrm{Gap}(k) = D_k - \bar{D}_k^{*}$. Choose the k that produces the largest gap.
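A rough sketch of the gap computation for a single cluster size k. Two things are assumptions rather than details from the slides: the scatter estimate is computed as the percentage of variance explained by the cluster's column averages (between-variance over between-plus-within variance), and the clustering step is passed in as a hypothetical `find_cluster(X, k)` callable. In the usage lines a simple top-variance selector stands in for gene shaving, purely to keep the example self-contained; the data are invented.

```python
import random


def percent_variance(cluster):
    """Scatter estimate D_k: percent of variance explained by the cluster average."""
    k, p = len(cluster), len(cluster[0])
    col_mean = [sum(row[j] for row in cluster) / k for j in range(p)]
    grand = sum(col_mean) / p
    v_b = sum((m - grand) ** 2 for m in col_mean) / p  # between-sample variance
    v_w = sum((row[j] - col_mean[j]) ** 2 for row in cluster for j in range(p)) / (k * p)
    total = v_b + v_w
    return 100.0 * v_b / total if total > 0 else 0.0


def gap(X, k, find_cluster, B=20, seed=0):
    """Gap(k) = D_k minus the mean of D_k over B row-permuted copies of X."""
    rng = random.Random(seed)
    d_k = percent_variance(find_cluster(X, k))
    d_star = []
    for _ in range(B):
        permuted = [rng.sample(row, len(row)) for row in X]  # permute within each row
        d_star.append(percent_variance(find_cluster(permuted, k)))
    return d_k - sum(d_star) / B


def top_k_by_variance(X, k):
    """Hypothetical stand-in for the shaving step: the k highest-variance rows."""
    def variance(row):
        m = sum(row) / len(row)
        return sum((x - m) ** 2 for x in row)
    return sorted(X, key=variance, reverse=True)[:k]


X = [
    [5.0, -5.0, 4.0, -4.0],
    [4.5, -4.5, 4.0, -4.0],
    [5.0, -4.0, 4.5, -5.5],
    [0.1, 0.0, -0.1, 0.0],
]
g = gap(X, k=3, find_cluster=top_k_by_variance)
print(g)
```

Permuting within rows destroys the alignment of genes across samples while preserving each gene's own distribution, so a coherent cluster in the real data should explain far more variance than its permuted counterparts. Evaluating the gap for every k in the nested sequence and keeping the largest gives the chosen cluster size.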

Gene Shaving (Cont.) The first three gene clusters found for the DLCL data

Gene Shaving (Cont.) Percent of gene variance explained by first j gene shaving column averages (j = 1,2,... 10) (solid curve), and by first j principal components (broken curve). For the shaving results, the total number of genes in the first j clusters is also indicated.

Gene Shaving (Cont.) (a) Variance plots for real and randomized data. The percent variance explained by each cluster, both for the original data, and for an average over three randomized versions. (b) Gap estimates of cluster size. The gap curve, which highlights the difference between the pair of curves, is shown.

References
- Richard O. Duda, Peter E. Hart and David G. Stork. Pattern Classification, Chapter 10.
- T. Hastie, R. Tibshirani, M.B. Eisen, A. Alizadeh, R. Levy, L. Staudt, W.C. Chan, D. Botstein and P. Brown. ‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
- Cluster analysis and display of genome-wide expression patterns. PNAS (1998).

References
- S. Raychaudhuri, P. Sutphin, J.T. Chang and Russ B. Basic microarray analysis: grouping and feature reduction. Trends in Biotechnology.
- Charles M. Perou, Patrick O. Brown and David Botstein. Tumor classification using gene expression patterns from DNA microarrays. Trends in Molecular Medicine, December.
- Pictures and definition of microarray technology from the National Human Genome Research Institute.