LSM3241: Bioinformatics and Biocomputing Lecture 8: Gene Expression Profiles and Microarray Data Analysis Prof. Chen Yu Zong Tel: 6874-6877


1 LSM3241: Bioinformatics and Biocomputing Lecture 8: Gene Expression Profiles and Microarray Data Analysis Prof. Chen Yu Zong Tel: 6874-6877 Email: yzchen@cz3.nus.edu.sg Web: http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS

2 Biology and Cells: All living organisms consist of cells (trillions of cells in a human; yeast has one cell). Cells are of many different types (blood, skin, nerve), but all arose from a single cell (the fertilized egg). Each* cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.

3 Gene Expression: Cells are different because of differential gene expression. About 40% of human genes are expressed at one time. A gene is expressed by transcribing DNA into single-stranded mRNA; the mRNA is later translated into a protein. Microarrays measure the level of mRNA expression.

4 Overview of Molecular Biology [Figure labels: Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA, single strand), Protein, cDNA]

5 Gene Expression: Genes control cell behavior by controlling which proteins are made by a cell. Housekeeping genes vs. cell/tissue-specific genes. Regulation: transcriptional (promoters and enhancers); post-transcriptional (RNA splicing, stability, localization; small non-coding RNAs).

6 Gene Expression Regulation: translational (3' UTR repressors, poly-A tail); post-transcriptional (RNA splicing, stability, localization; small non-coding RNAs); post-translational (protein modification: carbohydrates, lipids, phosphorylation, hydroxylation, methylation; precursor protein).

7 Gene Expression Measurement: mRNA expression represents dynamic aspects of the cell. mRNA expression can be measured with the latest technology. mRNA is isolated and labeled with a fluorescent tag. The labeled mRNA is hybridized to the target; the level of hybridization corresponds to the light emission, which is measured with a laser scanner.

8 Traditional Methods: Northern blotting (single RNA isolated; probed with labeled cDNA). RT-PCR (primers amplify specific cDNA transcripts).

9 Microarray Technology: Microarray: new technology (first paper: 1995); allows study of thousands of genes at the same time; glass slide of DNA molecules. Molecule: a string of bases (25 bp to 500 bp) that uniquely identifies the gene or unit to be studied.

10 Gene Expression Microarrays: The main types of gene expression microarrays: short oligonucleotide arrays (Affymetrix); cDNA or spotted arrays (Brown/Botstein); long oligonucleotide arrays (Agilent Inkjet); fiber-optic arrays; ...

11 Fabrication of Microarrays: Size of a microscope slide. Images: http://www.affymetrix.com/

12 Differing Conditions: Ultimate goal: understand the expression level of genes under different conditions. Helps to: determine genes involved in a disease; find pathways to a disease; serve as a screening tool.

13 Gene Conditions: Cell types (brain vs. liver); developmental (fetal vs. adult); response to stimulus; gene activity (wild type vs. mutant); disease states (healthy vs. diseased).

14 Expressed Genes: Genes under a given condition: mRNA is extracted from cells; the mRNA is labeled; the labeled mRNA represents the mRNA present in the given condition; labeled mRNA will hybridize (base pair) with the corresponding sequence on the slide.

15 Two Different Types of Microarrays: Custom spotted arrays (up to 20,000 sequences): cDNA; oligonucleotide. High-density (up to 100,000 sequences) synthetic oligonucleotide arrays: Affymetrix (25 bases). [Speaker note: show Affymetrix layout]

16 Custom Arrays: Mostly cDNA arrays. 2-dye (2-channel): RNA from two sources (cDNA created). Source 1: labeled with red dye. Source 2: labeled with green dye.

17 Two Channel Microarrays: Microarrays measure gene expression. Two different samples: control (green label); sample (red label). Both are washed over the microarray; hybridization occurs; each spot is one of 4 colors.

18 Microarray Technology

19 Microarray Image Analysis: Each spot shows one of 4 colors: green (high control); red (high sample); yellow (equal); black (none). The problem is to quantify the image signals.

20 Single Color Microarrays: Prefabricated: Affymetrix (25-mers). Custom: cDNA (500 bases or so); spotted oligos (70-80 bases).

21 Microarray Animations: Davidson College: http://www.bio.davidson.edu/courses/genomics/chip/chip.html Imagecyte: http://www.imagecyte.com/array2.html

22 Basic idea of Microarray: Construction: place an array of probes on a microchip. A probe (for example) is an oligonucleotide ~25 bases long that characterizes a gene or genome. Each probe has many, many clones. The chip is about 2 cm by 2 cm. Application principle: put a (liquid) sample containing genes on the microarray, allow probe and gene sequences to hybridize, and wash away the rest; then analyze the hybridization pattern.

23 Microarray analysis: Operation principle: samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization). A microarray may have 60K probes.

24 Microarray Processing sequence

25 Gene Expression Data: Gene expression data on p genes for n samples (rows: genes; columns: mRNA samples).
Gene expression level of gene i in mRNA sample j = Log(Red intensity / Green intensity) for two-color arrays, or Log(Avg. PM - Avg. MM) for Affymetrix arrays.
Gene  sample1  sample2  sample3  sample4  sample5  ...
1      0.46     0.30     0.80     1.51     0.90    ...
2     -0.10     0.49     0.24     0.06     0.46    ...
3      0.15     0.74     0.04     0.10     0.20    ...
4     -0.45    -1.03    -0.79    -0.56    -0.32    ...
5     -0.06     1.06     1.35     1.09    -1.09    ...
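As a minimal sketch of the two-color formula on this slide, the snippet below turns red and green spot intensities into log ratios. The slide does not state the base of the logarithm; base 2 is assumed here, and the intensity values are made up for illustration.

```python
import math

def log_ratio(red, green):
    """Two-channel expression value: log2(red intensity / green intensity).

    Base 2 is an assumption; the slide only says "Log".
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical background-corrected intensities for one gene in three samples.
red = [1200.0, 800.0, 400.0]
green = [600.0, 800.0, 1600.0]

ratios = [log_ratio(r, g) for r, g in zip(red, green)]
print(ratios)  # positive: higher in the red sample; negative: higher in the green control
```

With base 2, a value of +1 means the red channel is twice as bright as the green one, and -1 means half as bright.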

26 Some possible applications: Sample from a specific organ to show which genes are expressed and responsible for a functionality. Compare samples from healthy and sick hosts to find gene-disease connections. Analyze samples to differentiate sick and healthy, disease subtypes, and drug response groups. Probe samples, including human pathogens, for disease detection.

27 Huge amount of data from single microarray: If just two color, then amount of data on array with N probes is 2 N. Cannot analyze pixel by pixel. Analyze by pattern: cluster analysis.

28 Major Data Mining Techniques: Link analysis: associations discovery; sequential pattern discovery; similar time series discovery. Predictive modeling: classification (assigns genes into known classes); clustering (groups genes into unknown clusters).

29 Supervised vs. Unsupervised Learning: Supervised: there is a teacher, class labels are known (support vector machines; backpropagation neural networks). Unsupervised: no teacher, class labels are unknown (clustering; self-organizing maps).

30 Cluster Analysis: Grouping Similarly Expressed Genes, Cell Samples, or Both: Strengthens signal when averages are taken within clusters of genes (Eisen). Useful (essential?) when seeking new subclasses of cells, diseases, drug responses, etc. Leads to readily interpreted figures.

31 Some clustering methods and software: Partitioning: K-Means, K-Medoids, PAM, CLARA, ... Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK. Density-based: CAST, DBSCAN, OPTICS, CLIQUE, ... Grid-based: STING, CLIQUE, WaveCluster, ... Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, ... Two-way clustering. Block clustering.

32 Partitioning

33 Density-based clustering

34 Hierarchical (used most often)

35 Gene Expression Data: Gene expression data on p genes for n samples (the same formula and data matrix as on slide 25).

36 Expression Vectors: Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types. [Figure: one profile shown as a numeric vector (-0.8, 0.8, 1.5, 1.8, 0.5, -1.3, -0.4, 1.5), as a line graph, and as a heat map with a scale from -2 to 2]

37 Expression Vectors As Points in 'Expression Space': [Figure: table of genes G1 to G5 measured at t1, t2, t3, plotted as points on axes Experiment 1, Experiment 2, Experiment 3; points close together indicate similar expression]

38 Cluster Analysis: Group a collection of objects into subsets or "clusters" such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.

39 How can we do this? What is closely related? Distance or similarity metric. What is close? Clustering algorithm. How do we minimize distance between objects in a group while maximizing distances between groups?

40 Distance Metrics: Euclidean distance measures the straight-line distance between two points. Manhattan (city block) distance adds up the differences in each dimension. Correlation measures difference with respect to linear trends. [Figure: two points, (5.5, 6) and (3.5, 4), on axes Gene Expression 1 and Gene Expression 2]

41 Clustering Time Series Data: Measure gene expression on consecutive days. Gene measurement matrix:
G1 = [1.2 4.0 5.0 1.0]
G2 = [2.0 2.5 5.5 6.0]
G3 = [4.5 3.0 2.5 1.0]
G4 = [3.5 1.5 1.2 1.5]

42 Euclidean Distance: Distance is the square root of the sum of the squared differences between coordinates.
      G1    G2    G3    G4
G1    0     5.3   4.3   5.1
G2    5.3   0     6.4   6.5
G3    4.3   6.4   0     2.3
G4    5.1   6.5   2.3   0
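The Euclidean matrix on this slide can be recomputed from the four profiles G1 to G4 given on the time-series slide. The sketch below uses only the Python standard library; the variable and function names are illustrative, not from the lecture.

```python
import math

# Profiles G1 to G4 from the time-series slide (four consecutive days).
profiles = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

names = list(profiles)
# Pairwise distances, rounded to one decimal as on the slide.
euclid_matrix = [
    [round(euclidean(profiles[a], profiles[b]), 1) for b in names]
    for a in names
]
for name, row in zip(names, euclid_matrix):
    print(name, row)
```

G3 and G4 come out as the closest pair (2.3), so an agglomerative method using this metric would merge them first.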

43 City Block or Manhattan Distance: Distance is the sum of the absolute differences between coordinates.
G1 = [1.2 4.0 5.0 1.0]
G2 = [2.0 2.5 5.5 6.0]
G3 = [4.5 3.0 2.5 1.0]
G4 = [3.5 1.5 1.2 1.5]
      G1    G2    G3    G4
G1    0     7.8   6.8   9.1
G2    7.8   0     11    11.3
G3    6.8   11    0     4.3
G4    9.1   11.3  4.3   0
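A short sketch that recomputes the Manhattan matrix from the profiles listed on this slide. The names are illustrative, and the rounding to one decimal matches the precision shown on the slide.

```python
# Profiles G1 to G4 as listed on the slide.
profiles = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def manhattan(x, y):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

names = list(profiles)
manhattan_matrix = [
    [round(manhattan(profiles[a], profiles[b]), 1) for b in names]
    for a in names
]
for name, row in zip(names, manhattan_matrix):
    print(name, row)
```

Because it adds the per-day differences without squaring them, a single large difference dominates this metric less than it dominates the Euclidean distance.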

44 Correlation Distance: Pearson correlation measures the degree of linear relationship between variables, range [-1, 1]. Distance is 1 - (Pearson correlation), range [0, 2].
      G1    G2    G3    G4
G1    0     .91   .98   1.6
G2    .91   0     1.9   1.7
G3    .98   1.9   0     .22
G4    1.6   1.7   .22   0
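The correlation distance can be sketched the same way for the profiles G1 to G4 from the time-series slide. The function names are illustrative; the computed values should agree with the slide's matrix up to rounding in the last digit.

```python
import math

# Profiles G1 to G4 from the time-series slide.
profiles = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for identical trends, 2 for opposite trends."""
    return 1.0 - pearson(x, y)

names = list(profiles)
for a in names:
    print(a, [round(correlation_distance(profiles[a], profiles[b]), 2) for b in names])
```

G3 and G4 both decline over the four days, which is why their distance is small, while G2 rises and sits close to the maximum distance of 2 from G3.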

45 Similarity Measurements: Pearson Correlation: compares two profiles (vectors); -1 ≤ Pearson correlation ≤ +1.
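The formula on this slide did not survive extraction. For reference, the standard definition of the Pearson correlation is given below; the symbols x and y for the two profiles and n for the number of measurements are assumed here, not taken from the slide.

```latex
r(x, y) = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}
               {\sqrt{\sum_{k=1}^{n} (x_k - \bar{x})^2}\,
                \sqrt{\sum_{k=1}^{n} (y_k - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k,\quad
\bar{y} = \frac{1}{n}\sum_{k=1}^{n} y_k,\quad
-1 \le r(x, y) \le +1
```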

46 Hierarchical Clustering (HCL-1): IDEA: Iteratively combine genes into groups based on similar patterns of observed expression. By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships. Display the data as a heat map and dendrogram. Cluster genes, samples, or both.

47 Hierarchical Clustering: [Figures: Dendrogram; Venn Diagram of Clustered Data]

48 Hierarchical clustering: Merging (agglomerative): start with every measurement as a separate cluster, then combine. Splitting: make one large cluster, then split it up into smaller pieces. What is the distance between two clusters?

49 Distance between clusters: Single-link: distance is the shortest distance from any member of one cluster to any member of the other cluster. Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster. Average: distance between the averages of all points in each cluster. Ward: merge the two clusters whose union gives the smallest increase in the within-cluster sum of squares.
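The linkage rules above can be compared on a toy example. The sketch below uses two made-up 2D clusters and Euclidean distance between members. Note that the slide's "Average" entry (distance between cluster averages) is often called centroid linkage, while many software packages use "average linkage" for the mean of all pairwise distances; both are shown. Ward's criterion is left out to keep the sketch short.

```python
import math
from itertools import product

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pair_distances(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [euclidean(p, q) for p, q in product(c1, c2)]

def single_link(c1, c2):
    return min(pair_distances(c1, c2))    # shortest member-to-member distance

def complete_link(c1, c2):
    return max(pair_distances(c1, c2))    # longest member-to-member distance

def average_link(c1, c2):
    d = pair_distances(c1, c2)
    return sum(d) / len(d)                # mean of all pairwise distances

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_link(c1, c2):
    return euclidean(centroid(c1), centroid(c2))  # distance between cluster averages

# Hypothetical clusters, chosen only for illustration.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

for rule in (single_link, complete_link, average_link, centroid_link):
    print(rule.__name__, round(rule(cluster_a, cluster_b), 3))
```

Single link always gives the smallest value and complete link the largest, which is why single link tends to chain clusters together and complete link tends to produce compact ones.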

50 Hierarchical Clustering-Merging: Euclidean distance, average linking. [Figure: gene expression time series; distance between clusters when combined]

51 Manhattan Distance: Average linking. [Figure: gene expression time series; distance between clusters when combined]

52 Correlation Distance

53 Data Standardization: Data points are normalized with respect to mean and variance, "sphering" the data. After sphering, Euclidean and correlation distance are equivalent. Standardization makes sense if you are interested in the pattern of an effect rather than its size. Results are misleading for noisy data.
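The equivalence claimed on this slide can be checked numerically. The sketch assumes that each gene profile is standardized across its own measurements to mean 0 and variance 1 (population variance, dividing by n). Under that assumption the squared Euclidean distance between two standardized profiles equals 2n(1 - r), where r is their Pearson correlation, so both distances rank gene pairs identically. The two profiles are G1 and G2 from the time-series slide.

```python
import math

def standardize(x):
    """Shift to mean 0 and scale to variance 1 (population variance)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

g1 = [1.2, 4.0, 5.0, 1.0]
g2 = [2.0, 2.5, 5.5, 6.0]

z1, z2 = standardize(g1), standardize(g2)
squared_euclid = sum((a - b) ** 2 for a, b in zip(z1, z2))
from_correlation = 2 * len(g1) * (1 - pearson(g1, g2))

print(squared_euclid, from_correlation)  # the two numbers agree up to floating-point error
```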

54 Hierarchical Clustering: Initial data items: A, B, C, D. Distance matrix:
Dist  A    B    C    D
A          20   7    2
B               10   25
C                    3
D

55 Hierarchical Clustering: Initial data items A, B, C, D and the distance matrix of slide 54.

56 Single Linkage Hierarchical Clustering: Current clusters: A, B, C, D. The smallest entry in the distance matrix is 2 (A, D), so A and D are joined at height 2.

57 Single Linkage Hierarchical Clustering: Current clusters: AD, B, C. Distance matrix:
Dist  AD   B    C
AD         20   3
B               10
C

58 Single Linkage Hierarchical Clustering: Current clusters AD, B, C and the distance matrix of slide 57.

59 Single Linkage Hierarchical Clustering: The smallest entry is 3 (AD, C), so AD and C are joined at height 3.

60 Single Linkage Hierarchical Clustering: Current clusters: ADC, B. Distance matrix:
Dist  ADC  B
ADC        10
B

61 Single Linkage Hierarchical Clustering: Current clusters ADC, B and the distance matrix of slide 60.

62 Single Linkage Hierarchical Clustering: The only remaining entry is 10 (ADC, B), so ADC and B are joined at height 10.

63 Single Linkage Hierarchical Clustering: Final result: one cluster ADCB; the dendrogram joins A and D at 2, then C at 3, then B at 10.
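The worked example on slides 54 to 63 can be replayed in a few lines. This is a minimal sketch of single-linkage agglomeration on the slide's four items; the function names and the tuple representation of clusters are illustrative choices, not part of the lecture.

```python
from itertools import combinations

# Pairwise distances between items A, B, C, D from the slide's distance matrix.
pair_dist = {
    ("A", "B"): 20, ("A", "C"): 7, ("A", "D"): 2,
    ("B", "C"): 10, ("B", "D"): 25, ("C", "D"): 3,
}

def item_distance(a, b):
    return pair_dist[(a, b)] if (a, b) in pair_dist else pair_dist[(b, a)]

def single_link(c1, c2):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(item_distance(a, b) for a in c1 for b in c2)

def single_linkage(items):
    """Return the list of merges as (merged cluster, merge height)."""
    clusters = [(item,) for item in items]  # start: every item is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        height = single_link(c1, c2)
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        merges.append((merged, height))
    return merges

merges = single_linkage(["A", "B", "C", "D"])
for cluster, height in merges:
    print("".join(cluster), "joined at", height)
```

The printed heights (2, 3, 10) are the values marked on the dendrogram in the slides.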

64-71 Hierarchical Clustering [Figure sequence over eight slides; leaf labels: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6, Gene 7, Gene 8]

72 Hierarchical Clustering [Figure; labels: H, L]

73 Hierarchical Clustering: The Leaf Ordering Problem: Find the 'optimal' layout of branches for a given dendrogram architecture. There are 2^(N-1) possible orderings of the branches. For a small microarray dataset of 500 genes, there are about 1.6 × 10^150 branch configurations. [Figure axes: Samples, Genes]
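The count on this slide follows from reading the number of orderings as 2^(N-1): a binary dendrogram with N leaves has N-1 internal nodes, and each node's two branches can be swapped independently. A quick check of the figure quoted for 500 genes (the function name is illustrative):

```python
def branch_orderings(n_leaves):
    """Leaf orders consistent with one binary dendrogram: 2 ** (n_leaves - 1)."""
    return 2 ** (n_leaves - 1)

count = branch_orderings(500)
print(f"{float(count):.1e}")  # about 1.6e+150, matching the slide
```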

74 Hierarchical Clustering: The Leaf Ordering Problem [Figure]

75 Hierarchical Clustering: Pros: commonly used algorithm; simple and quick to calculate. Cons: real genes probably do not have a hierarchical organization.

76 Using Hierarchical Clustering:
1. Choose what samples and genes to use in your analysis
2. Choose similarity/distance metric
3. Choose clustering direction
4. Choose linkage method
5. Calculate the dendrogram
6. Choose height/number of clusters for interpretation
7. Assess results
8. Interpret cluster structure
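Steps 2 to 6 can be sketched end to end on the four profiles G1 to G4 from the time-series slide. The choices below (Euclidean metric, agglomerative direction, average linkage over all pairwise distances, and the cut heights) are assumptions made for illustration, and the function names are not from the lecture.

```python
import math
from itertools import combinations

profiles = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def euclidean(x, y):                      # step 2: distance metric
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_link(c1, c2):                 # step 4: linkage method
    d = [euclidean(profiles[a], profiles[b]) for a in c1 for b in c2]
    return sum(d) / len(d)

def cluster_at_height(names, cut_height):
    """Steps 3, 5, 6: merge bottom-up and stop once the next merge exceeds cut_height."""
    clusters = [(name,) for name in names]
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average_link(*pair))
        if average_link(c1, c2) > cut_height:
            break
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return sorted(clusters)

for height in (3.0, 5.0, 10.0):
    print(height, cluster_at_height(list(profiles), height))
```

Raising the cut height gives fewer, larger clusters; picking that height is the interpretive choice named in step 6.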

77 Limitations: Cluster analyses: usually outside the normal framework of statistical inference; less appropriate when only a few genes are likely to change; need lots of experiments. Single gene tests: may be too noisy in general to show much; may not reveal coordinated effects of positively correlated genes; hard to relate to pathways.

78 Useful Links:
Affymetrix: www.affymetrix.com
Michael Eisen Lab at LBL (hierarchical clustering software "Cluster" and "Tree View" (Windows)): rana.lbl.gov/
Review of Currently Available Microarray Software: www.the-scientist.com/yr2001/apr/profile1_010430.html
ArrayExpress at the EBI: http://www.ebi.ac.uk/arrayexpress/
Stanford MicroArray Database: http://genome-www5.stanford.edu/
Yale Microarray Database: http://info.med.yale.edu/microarray/
Microarray DB: www.biologie.ens.fr/en/genetiqu/puces/bddeng.html

