
1 Microarrays Cluster analysis

2 Data mining Given a large mass of data, where we don’t have any prior expectation of patterns or relationships, how can we start to find which patterns or relationships are suggested by the raw data? There is a wide variety of techniques: density estimation, clustering, classification, regression.

3 Microarray data clustering
Which genes are co-expressed? Regulatory networks: which genes activate systems of other genes? Which samples or treatments are similar?

4 Cluster analysis Comparative microarrays of the same set of genes with expression patterns in: Different tissues Different tumour types Different individuals Different chemical exposures Different time intervals

5 Multi-dimensional data sets
ExptID   Tissue type    Category   Microarray data
27746    Cerebellum     Brain      PPAR binding protein; Keratin; Trophoblast-derived noncoding RNA; etc
18600    Ciliary body   Eye        Axin; Keratin associated protein; Syncollin .074; etc
24276    Nasal retina              etc
20654    Optic nerve
27742    Iris
27678    Cornea
Data from the Stanford Microarray Database: Differential gene expression in anatomical compartments of the human eye. Authors: Diehn JJ, Diehn M, Marmor MF, Brown PO. Roughly 30,000 genes measured in 14 different tissues.

6 What to cluster? Usually, the analysis aims to cluster genes: finding genes that show the same pattern, that may be co-expressed, part of the same metabolic pathway, etc. We can also cluster in the other direction: similar tissue types in terms of gene expression, or similar chemicals in terms of the gene expression response. (Figure: gene-by-tissue expression matrix; rows are genes such as actin and keratin, columns are tissues such as eye and brain.)

7 Clustering genes: distances between genes
In order to cluster genes with similar expression patterns, we first need a measure of how similar two expression patterns are: a distance between two n-dimensional vectors. There is a variety of choices; I will illustrate two: Euclidean distance Pearson correlation coefficient

8 Euclidean distance Given two genes with expression levels (log R/G ratio) in n tissues, G = {g1, g2, g3, ..., gn} and H = {h1, h2, h3, ..., hn}, the Euclidean distance is
D_e = [(g1 - h1)^2 + (g2 - h2)^2 + ... + (gn - hn)^2]^(1/2)
(Figure: the two genes plotted as points; axes are the first and second log ratios.)
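The formula above can be sketched in a few lines of Python; the gene vectors here are made-up log-ratio values, not from the eye dataset:

```python
import math

def euclidean_distance(g, h):
    """D_e = sqrt(sum_i (g_i - h_i)^2) between two expression vectors."""
    return math.sqrt(sum((gi - hi) ** 2 for gi, hi in zip(g, h)))

gene_g = [1.0, 2.0, 2.0]   # hypothetical log R/G ratios for gene G in 3 tissues
gene_h = [1.0, 0.0, 2.0]   # hypothetical log R/G ratios for gene H
print(euclidean_distance(gene_g, gene_h))  # 2.0
```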

9 Pearson correlation coefficient
This is the cosine of the angle between the two expression vectors (for mean-centred data this equals the Pearson correlation). Calculated as the dot product, normalised:
PCC = (g1 h1 + g2 h2 + g3 h3 + ... + gn hn) / ([g1^2 + ... + gn^2]^(1/2) [h1^2 + ... + hn^2]^(1/2))
D_cc = 1 - PCC
(Figure: the two genes plotted as points; axes are the first and second log ratios.)
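A minimal sketch of this distance, using the normalised-dot-product form from the slide; the example vectors are invented to show the key property, that genes with the same trend at different scales get distance 0:

```python
import math

def pcc(g, h):
    """Cosine of the angle between two expression vectors (normalised dot product)."""
    dot = sum(gi * hi for gi, hi in zip(g, h))
    norm_g = math.sqrt(sum(gi * gi for gi in g))
    norm_h = math.sqrt(sum(hi * hi for hi in h))
    return dot / (norm_g * norm_h)

def d_cc(g, h):
    """Distance derived from the correlation coefficient: 1 - PCC."""
    return 1.0 - pcc(g, h)

# Same trend, doubled scale: correlation 1, distance 0.
print(d_cc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```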

10 Relative tradeoffs The Euclidean distance is probably the most intuitive. It also allows us to cluster points near the origin accurately. The Pearson correlation coefficient is good at detecting similar trends at different scales, but it may place points near the origin (little change in expression) far apart. An ideal distance might combine the strengths of both.

11 Euclidean vs Pearson (Figure: scatter plot of first vs second log ratio, comparing how the two distance measures group points.)

12 Clustering of genes Once we have defined a distance between genes we can cluster them Two basic and common clustering algorithms: Hierarchical K-means

13 Hierarchical clustering
Compare all gene distances pairwise. Cluster together the two closest genes. Then cluster the next two closest (or the gene closest to the cluster we already have). When we are clustering clusters, we can use a variety of linkage rules: mean position, minimum distance, maximum distance, etc. Mean position gives UPGMA.
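The steps above can be sketched as a small average-linkage (UPGMA-style) agglomerative clustering in Python. The points are invented 2-D expression vectors; a real analysis would use a library implementation such as scipy's:

```python
import math
from itertools import combinations

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points):
    """Repeatedly merge the two closest clusters, where cluster-to-cluster
    distance is the mean of all pairwise point distances (average linkage).
    Returns the merge order as frozensets of point indices."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for ca, cb in combinations(clusters, 2):
            d = sum(dist(points[i], points[j]) for i in ca for j in cb)
            d /= len(ca) * len(cb)
            if best is None or d < best[0]:
                best = (d, ca, cb)
        _, ca, cb = best
        clusters.remove(ca)
        clusters.remove(cb)
        clusters.append(ca | cb)
        merges.append(ca | cb)
    return merges

# Two very similar "genes" and one distant one: 0 and 1 merge first.
genes = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(average_linkage(genes))
```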

14 From http://dir.niehs.nih.gov/microarray/datasets/home-pub.htm
Unique Patterns of Gene Expression Changes in Liver After Treatment of Mice for Two Weeks With Different Known Carcinogens and Non-carcinogens. Iida M., Anna C.H., Holliday W.M., Collins J.B., Cunningham M.L., Sills R.C. and Devereux T.R.

15 K-means clustering Pre-define the number of clusters you want (= k).
Arbitrarily pick k locations for the cluster centres (e.g. at random). For each gene, determine which centre it is closest to, and assign it to that cluster. Redefine each cluster's location as the average of the genes assigned to it. Repeat the assignment and re-location steps until stable clusters are formed.
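The algorithm above is short enough to sketch directly; the four points here are toy 2-D "genes" forming two obvious clusters, and the fixed iteration count stands in for a proper convergence test:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)            # arbitrary initial locations
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each gene to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centres[c]))
            groups[j].append(p)
        # Move each centre to the average of its assigned genes.
        for j, grp in enumerate(groups):
            if grp:
                centres[j] = tuple(sum(coord) / len(grp) for coord in zip(*grp))
    return centres, groups

centres, groups = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(sorted(centres))  # the two converged cluster centres
```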

16-22 K-means clustering (Figures: successive iterations of the algorithm on example data; the final slide, 22, repeats the clustering using the Pearson distance.)

23 Relative tradeoffs Hierarchical clustering does not produce discrete clusters, just a hierarchy, and it doesn't reflect complex relationships well. K-means clustering is very good with points that fall into clusters and very bad with points that don't belong to any cluster. Both methods can be made more sophisticated in various ways.

24 Principal component analysis
Another way of looking at the data, which doesn't find clusters but may be useful for making sense of it all, is principal component analysis (PCA). In a multidimensional dataset, it finds the principal axes of variation.

25 Three examples

26 Principal components There are as many principal components as there are dimensions in the data set. Each principal component is orthogonal to all the others. Each principal component has a direction (a vector) and a size (the fraction of the total variation it explains). All the sizes sum to 1. Typically, once the size drops below 0.05, we ignore the smaller components.

27 Calculating principal components
The data need to be centred: the origin corresponds to the mean (expectation) of the data points. For the first axis, we look for a unit vector u such that the expected squared length of each data point's projection onto u is maximal:
u = arg max_{|u| = 1} E[(u . x)^2]
Then repeat on the residuals, after subtracting each x's projection along u.

28 Eigenvectors and eigenvalues
If we calculate the data covariance matrix S = E[x x^T], then the principal components of the data are the eigenvectors of S, and their eigenvalues give the sizes. "Eigengenes" represent the directions of the principal components and may correspond to an underlying shared property.
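For two dimensions the covariance matrix S and its eigen-decomposition can be worked out by hand, which makes the idea concrete; this is a minimal pure-Python sketch (a real analysis would use a library eigensolver such as numpy.linalg.eigh), and the data points are invented, spread roughly along the diagonal so the first principal component should point near (1, 1):

```python
import math

def principal_components_2d(points):
    """Eigen-decomposition of the 2x2 covariance matrix of mean-centred data.
    Returns ((eig1, eig2), v1): eigenvalues largest-first and the unit
    eigenvector for the largest one (the first principal axis)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]
    ys = [p[1] - my for p in points]
    a = sum(x * x for x in xs) / n               # S[0][0]
    b = sum(x * y for x, y in zip(xs, ys)) / n   # S[0][1] = S[1][0]
    c = sum(y * y for y in ys) / n               # S[1][1]
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b * b)
    eig1, eig2 = half_trace + root, half_trace - root
    # Eigenvector for eig1: (a - eig1) x + b y = 0, i.e. direction (b, eig1 - a);
    # handle the axis-aligned case b == 0 separately.
    v1 = (b, eig1 - a) if b else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(*v1)
    v1 = (v1[0] / norm, v1[1] / norm)
    return (eig1, eig2), v1

data = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
(eig1, eig2), v1 = principal_components_2d(data)
print(eig1 / (eig1 + eig2))  # size (fraction of variance) of the first PC
print(v1)                    # its direction, close to (1, 1) / sqrt(2)
```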

