Analysis of microarray data

HTS Using Hybridization
- Microarray chip
- Probe: oligos/cDNA (gene templates)
- Target: cDNA (variables to be detected)
- Samples, Hybridization, Analysis of outcome
- Pathways, Targets/Leads, Disease Class., Functional Annotation, Physiological states

Timeline for drug discovery Discovery (5 yrs) 5000 Gene expression study Pre-Clinical (1 yr) 50 Clinical (6 yrs) 5 Review (2 yrs) 1 Marketed

Microarray for Yeast Figure from DeRisi et al. (See next slide).

cDNA Microarrays
- Use a robot to spot glass slides at precise points with complete gene/EST sequences
- Able to measure qualitatively the relative expression levels of genes
- Differential expression by use of simultaneous, two-colour fluorescence hybridisation

Microarray Experiment
[Figure labels: High glucose, Low glucose, RT-PCR, DNA “Chip”, LASER]

Raw data – images
- Red (Cy5) dot: overexpressed or up-regulated
- Green (Cy3) dot: underexpressed or down-regulated
- Yellow dot: equally expressed
- Intensity: “absolute” level
- red/green: ratio of expression (2 = 2x overexpressed, 0.5 = 2x underexpressed)
- log2(red/green): “log ratio” (1 = 2x overexpressed, -1 = 2x underexpressed)
[Figure: cDNA plotted microarray]
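A minimal sketch of the ratio and log-ratio values listed on this slide. The function name `log_ratio` and the spot intensities are illustrative assumptions, not part of the slides.

```python
import math

def log_ratio(red, green):
    """Return log2(red/green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical (Cy5, Cy3) intensities for three spots
spots = {
    "gene_up": (2000.0, 1000.0),    # ratio 2   -> log ratio  1
    "gene_down": (500.0, 1000.0),   # ratio 0.5 -> log ratio -1
    "gene_same": (800.0, 800.0),    # ratio 1   -> log ratio  0
}
log_ratios = {name: log_ratio(r, g) for name, (r, g) in spots.items()}
```

The log scale makes a 2x increase and a 2x decrease symmetric around zero, which the raw ratio (2 versus 0.5) does not.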

Microarray Expression Value Representation expression value types composite spots primary spots primary measurements derived values composite images e.g., green/red ratios primary images Source: MGED

Analysing Expression Data
- Measure gene expression levels under various conditions
- The more experiments, the finer the classification
- Clustering reveals groupings of genes and/or experiments / tissues / treatments
- Hypothesize similar regulatory mechanisms and perhaps role
- Analysis of expression data needs to be integrated with other types of biological analysis and knowledge
[Figure: expression matrix with rows Gene 1, Gene 2, ..., Gene n and columns Condition 1, ..., Condition m]

Gene expression database – a conceptual view Sample annotations Samples Gene annotations Gene expression matrix Genes Gene expression levels

Gene Expression Profiles
- Measure gene expression of many genes
- Repeat under various conditions
- Which genes are behaving similarly (co-regulated, co-expressed)?

Bioinformatics in microarray data Array design Data extraction (Pixel to matrix) Background correction Data normalization Data analysis

Data normalization
$$\log \frac{\text{expression of gene } x \text{ in experiment } i}{\text{expression of gene } x \text{ in reference}}$$
Logarithm of ratio: treats induction and repression of identical magnitude as numerically equal but with opposite sign.

Levels of analysis
Level 1: Which genes are induced / repressed? Gives a good understanding of the biology.
- Methods: Factor-2 rule, t-test
Level 2: Which genes are co-regulated? Inference of function.
- Clustering algorithms
- Support Vector Machines
Level 3: Which genes regulate others? Reconstruction of networks.
- Transcription factor binding sites
- Bayesian networks
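The Level 1 factor-2 rule can be sketched as follows. This assumes the rule means "call a gene induced or repressed when its expression ratio changes by at least a factor of 2"; treating the threshold as inclusive, and the names `factor2_calls` and `ratios`, are assumptions made for illustration.

```python
import math

def factor2_calls(ratios):
    """Label each gene by the factor-2 rule applied to its expression ratio."""
    calls = {}
    for gene, ratio in ratios.items():
        lr = math.log2(ratio)
        if lr >= 1.0:        # at least 2x up (inclusive threshold is an assumption)
            calls[gene] = "induced"
        elif lr <= -1.0:     # at least 2x down
            calls[gene] = "repressed"
        else:
            calls[gene] = "unchanged"
    return calls

# Hypothetical experiment/reference ratios
ratios = {"g1": 4.0, "g2": 0.25, "g3": 1.5, "g4": 2.0}
calls = factor2_calls(ratios)
```

A fixed fold threshold ignores measurement variance, which is why the slide lists the t-test alongside it.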

Analysis of multiple experiments
The expression of gene x in m experiments can be represented by an expression vector with m elements.
1) The vector can be normalized so that μ = 0, σ² = 1. Genes with low expression and low variation can then be correlated with genes with high expression and high variation.
2) Discretization: up regulation: +1, no regulation: 0, down regulation: -1
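A short sketch of the normalization in point 1, assuming the population variance (division by m) is intended. The helper name `standardize` and both example vectors are hypothetical.

```python
import math

def standardize(vector):
    """Scale an expression vector to mean 0 and variance 1 (population variance)."""
    m = len(vector)
    mean = sum(vector) / m
    var = sum((v - mean) ** 2 for v in vector) / m
    if var == 0:
        raise ValueError("a constant vector cannot be standardized")
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in vector]

low = [1.0, 2.0, 3.0, 2.0]            # hypothetical low-expression gene
high = [100.0, 200.0, 300.0, 200.0]   # same profile shape, 100x larger values
z_low = standardize(low)
z_high = standardize(high)
```

After standardization the two vectors coincide, so a distance computed on them reflects profile shape rather than absolute expression level.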

Clustering
Hierarchical clustering:
- Transforms the n (genes) x m (experiments) matrix into a symmetric n x n similarity (or distance) matrix
- Similarity (or distance) measures: Euclidean distance, Pearson's correlation coefficient
- From this matrix a dendrogram can be built
Eisen et al. 1998 PNAS 95:14863-14868

Key Terms in Cluster Analysis Distance & Similarity measures Hierarchical & non-hierarchical Single/complete/average linkage Dendrograms & ordering

Distance Measures: Minkowski Metric

Most Common Minkowski Metrics

An Example x 3 y 4
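The formulas on these slides did not survive extraction. As a sketch, the Minkowski distance of order r between vectors x and y is usually written d(x, y) = (sum_i |x_i - y_i|^r)^(1/r), with r = 1 (Manhattan), r = 2 (Euclidean), and the limit r → ∞ (largest coordinate difference) as the common cases. The example below assumes the slide's "3" and "4" are the coordinate differences between two points; that reading is an assumption.

```python
def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float("inf"):
        return max(diffs)  # limiting case: largest coordinate difference
    return sum(d ** r for d in diffs) ** (1.0 / r)

# Two points whose coordinates differ by 3 and 4 (assumed reading of the slide)
p, q = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(p, q, 1)             # 3 + 4
euclidean = minkowski(p, q, 2)             # sqrt(3^2 + 4^2)
sup_dist = minkowski(p, q, float("inf"))   # max(3, 4)
```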

Taken from http://www. icgeb. trieste

Similarity Measures: Correlation Coefficient

Similarity Measures: Correlation Coefficient Expression Level Expression Level Gene A Gene B Gene B Gene A Time Time Expression Level Gene B Gene A Time
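A minimal sketch of the Pearson correlation coefficient as a similarity measure between two expression profiles. The profiles `gene_a`, `gene_b`, and `gene_c` are made-up time series, not data from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same trend, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend
r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

Unlike Euclidean distance, the coefficient ignores offset and scale, so genes A and B count as highly similar despite the tenfold difference in level.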

Distance-based Clustering
- Assign a distance measure between data
- Find a partition such that:
  - Distance between objects within a partition (i.e. same cluster) is minimized
  - Distance between objects from different clusters is maximised
Issues:
- Requires defining a distance (similarity) measure in situations where it is unclear how to assign it
- What relative weighting to give to one attribute vs another?
- Number of possible partitions is super-exponential

Hierarchical Clustering Techniques
At the beginning, each object (gene) is a cluster. In each of the subsequent steps, the two closest clusters are merged into one cluster, until there is only one cluster left.

Hierarchical Clustering
Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
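The four steps can be sketched as a deliberately naive routine. It recomputes cluster distances from the item-level matrix on every pass instead of updating them, which is simple but slow. The function name `agglomerate`, the `linkage` parameter, and the example matrix are assumptions made for illustration.

```python
def agglomerate(dist, linkage=min):
    """Naive agglomerative clustering on a symmetric NxN distance matrix.

    Returns the merges as (cluster_a, cluster_b, distance) tuples, where each
    cluster is a tuple of item indices. linkage=min gives single-link merging,
    linkage=max gives complete-link merging.
    """
    clusters = [(i,) for i in range(len(dist))]   # step 1: one cluster per item
    merges = []
    while len(clusters) > 1:                      # step 4: repeat until one cluster
        best = None
        for i in range(len(clusters)):            # step 2: find the closest pair
            for j in range(i + 1, len(clusters)):
                # step 3 (done lazily): cluster distance from item distances
                d = linkage(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = tuple(sorted(clusters[i] + clusters[j]))
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Four hypothetical items at positions 0, 1, 3, 7 on a line
positions = [0, 1, 3, 7]
dist = [[abs(a - b) for b in positions] for a in positions]
single = agglomerate(dist, linkage=min)
complete = agglomerate(dist, linkage=max)
```

The merge distances are the heights at which branches join in the dendrogram; the same data give different heights under the two linkage rules.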

The distance between two clusters is defined as:
- Single-Link Method / Nearest Neighbor (NN): minimum of pairwise dissimilarities
- Complete-Link / Furthest Neighbor (FN): maximum of pairwise dissimilarities
- Unweighted Pair Group Method with Arithmetic Mean (UPGMA): average of pairwise dissimilarities (average of all cross-cluster pairs)
- Centroid: the distance between their centroids

Computing Distances
Single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
Complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.
Average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
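The three definitions differ only in how the cross-cluster pairwise distances are summarized. A small sketch, with hypothetical one-dimensional clusters `left` and `right` and helper names chosen for illustration:

```python
def pairwise(cluster_a, cluster_b, dist):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [dist(a, b) for a in cluster_a for b in cluster_b]

def single_link(cluster_a, cluster_b, dist):
    return min(pairwise(cluster_a, cluster_b, dist))    # shortest cross pair

def complete_link(cluster_a, cluster_b, dist):
    return max(pairwise(cluster_a, cluster_b, dist))    # longest cross pair

def average_link(cluster_a, cluster_b, dist):
    d = pairwise(cluster_a, cluster_b, dist)
    return sum(d) / len(d)                              # mean over all cross pairs

def abs_diff(a, b):
    return abs(a - b)

left = [1.0, 2.0]
right = [4.0, 7.0]
# Cross-cluster distances: |1-4|=3, |1-7|=6, |2-4|=2, |2-7|=5
d_single = single_link(left, right, abs_diff)
d_complete = complete_link(left, right, abs_diff)
d_average = average_link(left, right, abs_diff)
```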

Single-Link Method (Euclidean distance)
[Figure: distance matrix and merge steps (1) {a,b}, c, d; (2) {a,b,c}, d; (3) {a,b,c,d}]

Complete-Link Method (Euclidean distance)
[Figure: distance matrix and merge steps (1) {a,b}, c, d; (2) {a,b}, {c,d}; (3) {a,b,c,d}]

Compare Dendrograms
[Figure: single-link and complete-link dendrograms on a common distance axis]

Ordered dendrograms
- 2^(n-1) linear orderings of n elements (n = # genes or conditions)
- Maximizing adjacent similarity is impractical, so order by: average expression level, time of max induction, or chromosome positioning
Eisen98

Serum stimulation of human fibroblasts (24h)
- Cholesterol biosynthesis
- Cell cycle
- I-E response
- Signalling / Angiogenesis
- Wound healing

k-means clustering Tavazoie et al. 1999 Nature Genet. 22:281-285

Which clustering methods do you suggest for the following two-dimensional data?

Clustering by K-means
Given a set S of N p-dimensional vectors without any prior knowledge about the set, the K-means clustering algorithm forms K disjoint nonempty subsets such that each subset minimizes some measure of dissimilarity locally. The result is a local optimum of the overall dissimilarity; a globally optimal partition is not guaranteed.
The K-means algorithm has time complexity O(RKN), where K is the number of desired clusters and R is the number of iterations needed to converge.
Using the Euclidean distance metric between the coordinates of any two genes in the space reflects ignorance of a more biologically relevant measure of distance.
K-means is an unsupervised, iterative algorithm that minimizes the within-cluster sum of squared distances from the cluster mean.
The first cluster center is chosen as the centroid of the entire data set, and subsequent centers are chosen by finding the data point farthest from the centers already chosen.
200-400 iterations.
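The initialization described above (first center at the centroid of the whole data set, later centers at the farthest data point) might be sketched as below. Reading "farthest from the centers already chosen" as "largest distance to the nearest chosen center" is an assumption, as are the name `initial_centers` and the sample points.

```python
import math

def initial_centers(points, k):
    """Pick k starting centers: the overall centroid, then farthest points."""
    dim = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    centers = [centroid]
    while len(centers) < k:
        # Assumed rule: take the point whose nearest chosen center is farthest away
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

# Hypothetical two-dimensional data
points = [(0.0, 0.0), (1.0, 0.0), (8.0, 0.0), (9.0, 0.0), (20.0, 0.0)]
centers = initial_centers(points, 3)
```

This choice is deterministic, unlike random seeding, but it tends to pick outliers as starting centers.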

K-Means Clustering Algorithm
1) Select an initial partition of k clusters
2) Assign each object to the cluster with the closest center
3) Compute the new centers of the clusters
4) Repeat steps 2 and 3 until no object changes cluster
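The steps can be sketched in plain Python as follows. This version starts from given initial centers rather than an initial partition, which is a simplifying assumption; the function name `kmeans`, the `max_iter` cap, and the sample points are also illustrative.

```python
import math

def kmeans(points, centers, max_iter=100):
    """K-means iteration starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the closest center
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Step 4: stop when no object changes cluster
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers

# Hypothetical data: two well-separated groups
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels, final_centers = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
```

Different starting centers can lead to different final partitions, so in practice the procedure is often run several times.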

[Figures: successive k-means iterations with k = 6 and numbered centroids]

Self organizing maps Tamayo et al. 1999 PNAS 96:2907-2912

[Figures: k = 6, with numbered centroids]

Partitioning vs. Hierarchical
Partitioning
- Advantage: provides clusters that satisfy some optimality criterion (approximately)
- Disadvantages: needs an initial K, long computation time
Hierarchical
- Advantage: fast computation (agglomerative)
- Disadvantages: rigid, cannot correct later for erroneous decisions made earlier

Generic Clustering Tasks Estimating number of clusters Assigning each object to a cluster Assessing strength/confidence of cluster assignments for individual objects Assessing cluster homogeneity

Clustering and promoter elements Harmer et al. 2000 Science 290:2110-2113

An Example Cluster (DeRisi et al, 1997)

Cluster of co-expressed genes, pattern discovery in regulatory regions
- Expression profiles
- Retrieve upstream regions (600 basepairs)
- Pattern over-represented in cluster
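A rough sketch of the counting behind "pattern over-represented in cluster": how many upstream regions contain a pattern inside the cluster, compared with all upstream regions. The sequences here are invented and far shorter than 600 basepairs, the helper `count_with_pattern` is hypothetical, and no probability is computed because the slides do not state which statistical model was used.

```python
def count_with_pattern(sequences, pattern):
    """Number of sequences that contain the pattern at least once."""
    return sum(1 for s in sequences if pattern in s)

# Invented upstream regions (forward strand only)
cluster_upstreams = ["TTACGCGTAA", "GGACGCGTCC", "ATATATATAT"]
other_upstreams = ["CCCCCCCCCC", "ACGCGTTTTT", "GGGGGGGGGG", "TTTTTTTTTT"]
all_upstreams = cluster_upstreams + other_upstreams

pattern = "ACGCGT"
in_cluster = count_with_pattern(cluster_upstreams, pattern)
in_total = count_with_pattern(all_upstreams, pattern)
cluster_fraction = in_cluster / len(cluster_upstreams)
background_fraction = in_total / len(all_upstreams)
```

A pattern is a candidate regulatory element when the cluster fraction clearly exceeds the background fraction; judging how clearly requires a significance test on top of these counts.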

Some Discovered Patterns (Vilo et al. 2001)
Pattern              Probability  Cluster  No.  Total
ACGCG                6.41E-39     96       75   1088
ACGCGT               5.23E-38     94       52   387
CCTCGACTAA           5.43E-38     27       18   23
GACGCG               7.89E-31     86       40   284
TTTCGAAACTTACAAAAAT  2.08E-29     26       14   18
TTCTTGTCAAAAAGC      2.08E-29     26       14   18
ACATACTATTGTTAAT     3.81E-28     22       13   18
GATGAGATG            5.60E-28     68       24   83
TGTTTATATTGATGGA     1.90E-27     24       13   18
GATGGATTTCTTGTCAAAA  5.04E-27     18       12   18
TATAAATAGAGC         1.51E-26     27       13   18
GATTTCTTGTCAAA       3.40E-26     20       12   18
GATGGATTTCTTG        3.40E-26     20       12   18
GGTGGCAA             4.18E-26     40       20   96
TTCTTGTCAAAAAGCA     5.10E-26     29       13   18

Results (Jaak Vilo)
- Over 6000 “interesting” patterns
- Many from homologous upstreams: removed
- Leaves 1500 patterns
- These patterns clustered into 62 groups
- Found alignments, consensus, and profiles
- Of 62 clusters, 48 had patterns matching the SCPD (experimentally mapped) binding site database

The "GGTGGCAA" Cluster Jaak Vilo

From Gifford 2001 Science 293:2049-2050 34 genes, 140 experiments

Two sided clustering Alizadeh et al. 2000 Nature 403:503-511

Diffuse large B-cell lymphoma

Principal Component Analysis (Singular Value Decomposition) Alter et al. 2000 PNAS 97:10101-10106

Bayesian Networks Analysis Friedman et al. 2000 J. Comp. Biol., 7:601-620

- Can only represent acyclic relationships.

Principal Component Analysis

Clustering methods
Hierarchical clustering
- Linkage: complete linkage, average linkage, single linkage
- Distance measures: Euclidean, correlation based, rank correlation, Manhattan, ...
Partition-based: K-means
- Specify K
- Randomly select “centers”
- Assign genes to centers
- Recalculate centers to “gravity center”
- Iterate until stabilizes
- Can get to local minimum
- Fast for large datasets
- Initial selection of centers