Published by Julian McLaughlin. Modified over 9 years ago.
1
More on Microarrays
Chitta Baral
Arizona State University
2
Some case studies
Yeast cell cycle
–Goal: Find groups of yeast genes whose expression profiles are similar over a 24-hour period.
–Approach: Obtain gene expression measurements using Affymetrix S98 genome microarrays for a synchronized sample of yeast cells over a 24-hour period, sampling the total RNA population at 30-minute intervals. This gives 48 time points, each sampled twice for duplicate measurements (T_1 … T_48 and replicates T_1' … T_48').
Drug intervention study
–Goal: Characterize the effect(s) of drug X on the expression levels of liver cell genes three hours after it is introduced into normal adult mice.
–Approach: Gene expression profiles of liver cells from normal adult mice not treated with drug X are used as the control state. Call the pre-intervention (control) state A and the post-intervention state B. For replicate measurements, liver samples were obtained from M_A adult mice without drug X, and from another M_B adult mice after drug X was applied.
3
Some potential questions when trying to cluster
Which uncategorized genes have an expression pattern similar to these well-characterized genes?
How different is the pattern of expression of gene X from other genes?
Which genes closely share a pattern of expression with gene X?
What category of function might gene X belong to?
What are all the pairs of genes that closely share patterns of expression?
Are there subtypes of disease X discernible by tissue gene expression?
Which tissue is this sample tissue closest to?
4
Questions (cont.)
What are the different patterns of gene expression?
Which genes have a pattern that may have resulted from the influence of gene X?
What are all the gene-gene interactions present among these tissue samples?
Which genes best differentiate these two groups of tissues?
Which gene-gene interactions best differentiate these two groups of tissue samples?
Different algorithms are better suited to answering some of these questions than others.
5
Bioinformatics algorithms and some known uses: Unsupervised
Feature determination: Determine genes with interesting properties, without looking for a particular pattern determined a priori.
–Principal component analysis: determine the genes explaining the majority of the variance in the data set.
Cluster determination: Determine groups of genes or samples with similar patterns of gene expression.
–Nearest neighbour clustering: the number of clusters is decided first, the clusters are calculated, and each gene is assigned to a single cluster.
  Self-organizing maps
  –Tamayo et al. used it to functionally cluster genes into various patterned time courses in HL-60 cell macrophage differentiation.
  –Toronen used a hierarchical SOM to cluster yeast genes responsible for the diauxic shift.
  K-means clustering
  –Soukas et al. used it to cluster genes involved in leptin signalling.
  –Tavazoie et al. used it to cluster genes with common regulatory sequences.
6
k-means clustering: basic idea
Input: n objects (or points) and a number k
Output: A set of k clusters that minimizes the squared-error criterion $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$, where $m_i$ is the mean of cluster $C_i$.
Algorithm (complexity is O(nkt), where t = number of iterations)
–Arbitrarily choose k objects as the initial cluster centers.
–Repeat
  –(Re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster.
  –Update the cluster means (i.e., calculate the mean value of the objects in each cluster).
–Until no change.
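The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `k_means`, `squared_distance`, and `mean_point`, the `seed` and `max_iter` parameters, and the toy points are all hypothetical, and the objects are assumed to be numeric tuples compared by squared Euclidean distance.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance |p - q|^2 between two numeric tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment, sse) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # "until no change"
        assignment = new_assignment
        # Update the cluster means; an empty cluster keeps its old center.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = mean_point(members)
    # Squared-error criterion: sum over clusters of |p - m_i|^2.
    sse = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, sse


# Toy data (made up): two well-separated groups of three points each.
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
toy_centers, toy_labels, toy_sse = k_means(toy_points, k=2)
```

Each pass costs O(nk) distance evaluations, which is where the O(nkt) bound comes from. Keeping the old center for an empty cluster is one of several possible conventions and is a choice made here for brevity.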
7
Pluses and minuses of k-means
Pluses: Low complexity
Minuses
–The mean of a cluster may not be easy to define (e.g., data with categorical attributes)
–k must be specified in advance
–Not suitable for discovering clusters of non-convex shape or of very different sizes
–Sensitive to noise and outlier data points (a small number of such points can substantially influence the mean value)
Some of these objections (especially the last one) can be overcome by the k-medoid algorithm. Instead of using the mean value of the objects in a cluster as the reference point, it uses the medoid, which is the most centrally located object in the cluster.
8
K-Medoid clustering
Input: Set of objects (often points in a multi-dimensional space)
Output: These objects clustered into k clusters
Algorithm: complexity O(n^2) per iteration when k << n
–Arbitrarily select k representative objects.
–Mark these objects as selected and the remaining objects as non-selected.
–Repeat (complexity O(k(n-k)n))
  –For all selected objects $O_i$ do (complexity O(k(n-k)n))
    For all non-selected objects $O_h$, compute $C_{ih}$, the total cost of swapping i with h, i.e., $C_{ih} = \sum_j C_{jih}$, where $C_{jih} = d_{hj} - d_{ij}$ and $d_{ij} = d(O_i, O_j)$ denotes the distance between $O_i$ and $O_j$. (complexity O(k(n-k)))
  –Select $i_{min}, h_{min}$ such that $C_{i_{min},h_{min}} = \min_{i,h} C_{ih}$. (complexity O(k(n-k)))
  –If $C_{i_{min},h_{min}} < 0$, mark $O_{i_{min}}$ as non-selected and $O_{h_{min}}$ as selected.
–Until no change.
–The selected objects now define the clusters. A non-selected object $O_j$ belongs to the cluster represented by the selected object $O_i$ if $d(O_i, O_j) = \min_{O_e} d(O_j, O_e)$, where the minimum is taken over all selected objects $O_e$.
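The swap-based search above can be sketched as follows. This is a hedged illustration, not the slide's exact procedure: the names `k_medoids` and `total_cost` are hypothetical, the initial medoids are simply the first k objects, and the swap cost is computed as the change in total assignment cost (each object's distance to its nearest selected object) rather than the simplified per-object term d_hj − d_ij given above. Recomputing the full cost for every candidate swap is also slower than the bounds quoted on the slide.

```python
import math


def total_cost(points, medoids, dist):
    """Sum over all objects of the distance to the nearest selected object."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k, dist=math.dist):
    """Return (medoid indices, label per object), labels being medoid indices."""
    n = len(points)
    # Arbitrary initial selection (assumption: just take the first k objects).
    medoids = list(range(k))
    current = total_cost(points, medoids, dist)
    while True:
        best_change, best_swap = 0.0, None
        # Try swapping every selected object i with every non-selected object h.
        for pos in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                change = total_cost(points, candidate, dist) - current
                if change < best_change:
                    best_change, best_swap = change, (pos, h)
        if best_swap is None:
            break  # no swap has negative cost: "until no change"
        pos, h = best_swap
        medoids[pos] = h
        current = total_cost(points, medoids, dist)
    # Each object belongs to the cluster of its nearest selected object.
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels


# Toy data (made up): two groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_medoids, toy_labels = k_medoids(toy_points, k=2)
```

Because every medoid is an actual object, only the pairwise distance function is needed, so `dist` can be replaced by any dissimilarity measure suited to the data.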
9
Self-organizing maps
A neural network algorithm that has been used for a wide variety of applications, mostly for engineering problems but also for data analysis.
A SOM can be used both to reduce the amount of data by clustering and to project the data nonlinearly onto a lower-dimensional display.
SOM vs k-means
–In the SOM, the distance of each input from all of the reference vectors, instead of just the closest one, is taken into account, weighted by the neighborhood kernel h. Thus, the SOM functions as a conventional clustering algorithm if the width of the neighborhood kernel is zero.
–In the k-means clustering algorithm, the number k of clusters should be chosen according to the number of clusters in the data. In the SOM, the number of reference vectors can be chosen to be much larger, irrespective of the number of clusters. The cluster structures will become visible on the special displays.
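The role of the neighborhood kernel h can be illustrated with a single online update step. This is only a sketch under assumptions that are not in the slides: a one-dimensional map, a Gaussian kernel over grid distance, and the hypothetical names `som_step`, `best_matching_unit`, `learning_rate`, and `sigma` (the kernel width).

```python
import math


def best_matching_unit(weights, x):
    """Index of the reference vector closest to input x."""
    return min(
        range(len(weights)),
        key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)),
    )


def som_step(weights, x, learning_rate, sigma):
    """One online update of a 1-D SOM; weights[u] is the reference vector of unit u."""
    bmu = best_matching_unit(weights, x)
    updated = []
    for u, w in enumerate(weights):
        grid_dist = abs(u - bmu)  # distance on the map, not in data space
        if sigma > 0:
            h = math.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
        else:
            # Zero-width kernel: only the closest reference vector is updated.
            h = 1.0 if u == bmu else 0.0
        updated.append([wi + learning_rate * h * (xi - wi) for wi, xi in zip(w, x)])
    return updated


# Toy map (made up): three units with 2-D reference vectors.
toy_weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
wide = som_step(toy_weights, [0.9, 1.2], learning_rate=0.5, sigma=1.0)
narrow = som_step(toy_weights, [0.9, 1.2], learning_rate=0.5, sigma=0.0)
```

With `sigma=0.0` only the best-matching unit moves, which is the sense in which the SOM reduces to a conventional (online, k-means-like) clustering update. With a wider kernel, the neighbouring units on the map are pulled toward the input as well.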
10
Bioinformatics algorithms and some known uses: Unsupervised (cont.)
Cluster determination (cont.)
–Agglomerative clustering: bottom-up method, where each gene starts in its own cluster and the most similar clusters are successively merged into larger ones.
  Dendrograms: Groups are defined as sub-trees in a phylogenetic-type tree created by a comprehensive pair-wise dissimilarity measure.
  2-D dendrograms
–Divisive or partitional clustering: top-down method, where large clusters are successively broken down into smaller ones, until each sub-cluster contains only one object (gene).
  Dendrograms and 2-D dendrograms.
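A bottom-up merge sequence of the kind a dendrogram displays can be sketched as below. The assumptions are mine rather than the slides': every object starts as its own singleton cluster, clusters are compared by average linkage over Euclidean distances, and the names `agglomerate` and `average_linkage` are hypothetical.

```python
import math
from itertools import combinations


def average_linkage(a, b, points):
    """Mean pairwise distance between the members of clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points):
    """Return the merge history as (cluster_a, cluster_b, height) triples."""
    # Start with one singleton cluster per object (clusters are index tuples).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, average_linkage(a, b, points)))
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy one-dimensional "expression values" (made up).
toy_points = [(0.0,), (0.1,), (5.0,), (5.3,), (9.0,)]
toy_merges = agglomerate(toy_points)
```

Each triple in the result corresponds to one internal node of the dendrogram, with the height giving the dissimilarity at which the two sub-trees join. Cutting the tree at a chosen height yields the groups.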