
1 Analysis of microarray data

2 Gene expression database – a conceptual view
[Figure: gene expression matrix (Samples, Genes, Gene expression levels) with Sample annotations and Gene annotations]

3 An Example
[Figure: plot with axes x and y]

4 Distance-based Clustering
Assign a distance measure between data objects.
Find a partition such that:
  – Distance between objects within a partition (i.e. the same cluster) is minimized
  – Distance between objects from different clusters is maximized
Issues:
  – Requires defining a distance (similarity) measure in situations where it is unclear how to assign one
  – What relative weighting should one attribute get versus another?
  – The number of possible partitions is super-exponential
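The choice of distance measure can be illustrated with a minimal sketch. It compares Euclidean distance with a correlation-based distance (1 minus the Pearson correlation); the slide does not say which measure to use, so both functions and the two profiles are illustrative assumptions.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: small when two profiles rise and fall together.

    Assumes neither profile is constant (otherwise the variance is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: gene_b has the same shape as gene_a at 10x the scale.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

d_euclid = euclidean_distance(gene_a, gene_b)   # large: the magnitudes differ
d_pearson = pearson_distance(gene_a, gene_b)    # about 0: the shapes agree
```

The two measures disagree on whether these genes are "close", which is the weighting issue raised above: the answer depends on whether magnitude or shape of the profile matters for the question at hand.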

5 Hierarchical Clustering Techniques At the beginning, each object (gene) is a cluster. In each of the subsequent steps, two closest clusters will merge into one cluster until there is only one cluster left.

6 Hierarchical Clustering
Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
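The four steps can be sketched as a naive routine. This is an illustration under stated assumptions, not an efficient implementation: the function name `agglomerative`, the `linkage` parameter, and the 4x4 distance matrix are all hypothetical, and cluster distances are recomputed from the item distances on every pass.

```python
def agglomerative(dist, linkage=min):
    """Naive agglomerative clustering on an N x N distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of item indices.
    """
    # Step 1: every item starts in its own cluster.
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def cluster_distance(a, b):
        # Step 3, done lazily: derive the distance from the item distances.
        return linkage(dist[i][j] for i in a for j in b)

    # Step 4: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merged = tuple(sorted(clusters[p] + clusters[q]))
        merges.append((clusters[p], clusters[q], d))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges

# Hypothetical distances between four items 0..3.
dist = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
single_merges = agglomerative(dist, linkage=min)    # nearest neighbor
complete_merges = agglomerative(dist, linkage=max)  # furthest neighbor
```

With `min` the items chain together one at a time (merge distances 1, 2, 3), while with `max` items 2 and 3 pair up before joining the first cluster (merge distances 1, 3, 6).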

7 The distance between two clusters can be defined as:
Single-Link Method / Nearest Neighbor (NN): minimum of pairwise dissimilarities
Complete-Link / Furthest Neighbor (FN): maximum of pairwise dissimilarities
Unweighted Pair Group Method with Arithmetic Mean (UPGMA): average of pairwise dissimilarities over all cross-cluster pairs
Centroid: distance between the clusters' centroids

8 Computing Distances
single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.
average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
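A small worked example makes the three definitions concrete. The function names and the distance matrix below are hypothetical; clusters are given as lists of item indices.

```python
def single_link(dist, a, b):
    """Shortest distance from any member of a to any member of b."""
    return min(dist[i][j] for i in a for j in b)

def complete_link(dist, a, b):
    """Longest distance from any member of a to any member of b."""
    return max(dist[i][j] for i in a for j in b)

def average_link(dist, a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

# Hypothetical distances between four items 0..3.
dist = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
cluster_a = [0, 1]
cluster_b = [2, 3]

# Cross-cluster pairs: d(0,2)=4, d(0,3)=5, d(1,2)=2, d(1,3)=6
d_single = single_link(dist, cluster_a, cluster_b)      # 2
d_complete = complete_link(dist, cluster_a, cluster_b)  # 6
d_average = average_link(dist, cluster_a, cluster_b)    # 17 / 4 = 4.25
```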

9 Single-Link Method
[Figure: points a, b, c, d with their Euclidean distance matrix. Merge steps: (1) {a,b}, c, d; (2) {a,b,c}, d; (3) {a,b,c,d}]

10 Complete-Link Method
[Figure: points a, b, c, d with their Euclidean distance matrix. Merge steps: (1) {a,b}, c, d; (2) {a,b}, {c,d}; (3) {a,b,c,d}]

11 Compare Dendrograms
[Figure: Single-Link and Complete-Link dendrograms side by side; height axis from 0 to 6]

12 Ordered dendrograms
2^(n-1) linear orderings of n elements (n = # genes or conditions)
Maximizing adjacent similarity is impractical. So order by:
  – Average expression level,
  – Time of max induction, or
  – Chromosome positioning
(Eisen et al. 1998)
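The count of orderings follows from a short argument, sketched here on the assumption that the dendrogram is a binary tree: with n leaves it has n - 1 internal nodes, and the two subtrees at each internal node can be swapped independently without changing the clustering.

```latex
\#\{\text{leaf orderings consistent with the tree}\}
  \;=\; \underbrace{2 \cdot 2 \cdots 2}_{n-1 \text{ internal nodes}}
  \;=\; 2^{\,n-1},
\qquad \text{e.g. } n = 4 \;\Rightarrow\; 2^{3} = 8 .
```

The count doubles with every added gene, which is why a simple sort key such as average expression level is used instead of searching over all orderings.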

13 Self organizing maps Tamayo et al. 1999 PNAS 96:2907-2912

14

15 [Figure: six centroids, labelled 1. centroid to 6. centroid; k = 6]

16

17

18

19 Partitioning vs. Hierarchical
Partitioning
  – Advantage: Provides clusters that (approximately) satisfy some optimality criterion
  – Disadvantages: Needs an initial K; long computation time
Hierarchical
  – Advantage: Fast computation (agglomerative)
  – Disadvantages: Rigid; cannot correct later for erroneous decisions made earlier
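The slides do not name a specific partitioning method. As one common example, the sketch below assumes k-means (Lloyd's algorithm), which shows both points made above: K and the starting centroids must be supplied, and the loop iterates until the partition stops changing. The function name, points, and starting centroids are hypothetical.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its old centroid).
        new_centroids = []
        for group, old in zip(groups, centroids):
            if group:
                new_centroids.append(tuple(sum(v) / len(group) for v in zip(*group)))
            else:
                new_centroids.append(old)
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups

# Hypothetical 2-D points and K = 2 starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centroids, final_groups = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Unlike an agglomerative merge, a point assigned to one group in an early iteration can move to another group later, at the cost of repeated passes over the data.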

20 Generic Clustering Tasks
Estimating number of clusters
Assigning each object to a cluster
Assessing strength/confidence of cluster assignments for individual objects
Assessing cluster homogeneity

21 Clustering and promoter elements Harmer et al. 2000 Science 290:2110-2113

22 An Example Cluster (DeRisi et al, 1997)

23 Cluster of co-expressed genes, pattern discovery in regulatory regions
[Figure: expression profiles of a cluster; retrieve the upstream regions (600 basepairs) of its genes; find patterns over-represented in the cluster]

24 Some Discovered Patterns
Pattern              Probability  Cluster  No.  Total
ACGCG                6.41E-39     96       75   1088
ACGCGT               5.23E-38     94       52   387
CCTCGACTAA           5.43E-38     27       18   23
GACGCG               7.89E-31     86       40   284
TTTCGAAACTTACAAAAAT  2.08E-29     26       14   18
TTCTTGTCAAAAAGC      2.08E-29     26       14   18
ACATACTATTGTTAAT     3.81E-28     22       13   18
GATGAGATG            5.60E-28     68       24   83
TGTTTATATTGATGGA     1.90E-27     24       13   18
GATGGATTTCTTGTCAAAA  5.04E-27     18       12   18
TATAAATAGAGC         1.51E-26     27       13   18
GATTTCTTGTCAAA       3.40E-26     20       12   18
GATGGATTTCTTG        3.40E-26     20       12   18
GGTGGCAA             4.18E-26     40       20   96
TTCTTGTCAAAAAGCA     5.10E-26     29       13   18
Vilo et al. 2001
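The slides do not state how the Probability column was computed. As an assumption, the sketch below scores over-representation with a hypergeometric tail: the chance of seeing at least k pattern-containing sequences in a cluster of n, when K of all N sequences contain the pattern. The sequences are made-up toy data, not the upstream regions used in the source.

```python
from math import comb

def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n sequences are drawn from N, of which K contain the pattern."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def count_containing(pattern, sequences):
    """Number of sequences in which the pattern occurs at least once."""
    return sum(1 for s in sequences if pattern in s)

# Hypothetical upstream regions: 3 in the cluster, 4 outside it.
cluster_upstream = ["AAGGTGGCAATT", "CCGGTGGCAAGG", "TTTTACGTACGT"]
other_upstream = ["ACGTACGTACGT", "GGTGGCAATTTT", "CCCCCCCCCCCC", "TTTTTTTTTTTT"]
all_upstream = cluster_upstream + other_upstream

pattern = "GGTGGCAA"
k = count_containing(pattern, cluster_upstream)  # 2 hits in the cluster
K = count_containing(pattern, all_upstream)      # 3 hits overall
p = hypergeom_tail(len(all_upstream), K, len(cluster_upstream), k)  # 13/35
```

On real data the same calculation would be run for every candidate pattern, so the resulting probabilities would need a multiple-testing correction before being read as significance levels.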

25 The "GGTGGCAA" Cluster (Jaak Vilo)

26 Two sided clustering Alizadeh et al. 2000 Nature 403:503-511

27 Diffuse large B-cell lymphoma

28

29

30

31 Neighborhood analysis Golub et al. 1999

32 Acute Leukemias
Acute lymphoblastic leukemia, ALL
Acute myeloid leukemia, AML
  – Not distinguishable, but different clinical outcome

33 Neighborhood analysis Class predictor

34

35

36 Regulatory pathway reconstruction Ideker et al Science 2001

37

38

39

40

41

42 Chromatin IP Chip (ChIP-chip) Iyer et al. 2000

43

44

45 Protein Function Prediction Jensen et al 2002

46 NetOGlyc, NetPhos, PEST regions, PSIPRED, SEG filter, SignalP, PSORT, TMHMM.

47

48

49 Protein Function Prediction II Marcotte & Eisenberg 1999

50

51

52

53

54

55

56

57

58 Biochemical pathways Dandekar et al 1999

59

60 Figure 1. Pathway alignment for glycolysis, Entner–Doudoroff pathway and pyruvate processing. Enzymes for each pathway part (top; EC numbers and enzyme subunits are given below these) are compared in 17 organisms and represented as small rectangles. Filled and empty rectangles indicate the presence and absence respectively of enzyme-encoding genes in the different species listed at the left. Further details are given in the text; different isoenzymes and enzyme families are listed in Table 2.

61 Flux balance analysis Edwards et al 2000

62

63

64

65

66

67 Comparative genome analysis

68

69

70

