Published by Virgil Morris. Modified over 8 years ago.
1
Tutorial 8
Gene expression analysis
2
How to interpret an expression matrix
Expression data DBs - GEO
Clustering
– Hierarchical clustering
– K-means clustering
– Tools for clustering - EPCLUST
Functional analysis
– GO annotation
– DAVID
3
Gene expression data sources
Microarrays
RNA-seq experiments
4
How to interpret an expression data matrix
Each column represents the expression levels of all genes in a single sample.
Each row represents the expression of one gene across all experiments.

         Sample 1  Sample 2  Sample 3  Sample 4  Sample 5  Sample 6
Gene 1   -1.2      -2.1      -3        -1.5      1.8       2.9
Gene 2   2.7       0.2       -1.1      1.6       -2.2      -1.7
Gene 3   -2.5      1.5       -0.1      -1.1      0.1
Gene 4   2.9       2.6       2.5       -2.3      -0.1      -2.3
Gene 5   0.1       1.9       2.6       2.2       2.7       -2.1
Gene 6   -2.9      -1.9      -2.4      -0.1      -1.9      2.9
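The row/column reading of the matrix can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper names `gene_profile` and `sample_profile` are assumptions made for this example, and only the Gene 1 and Gene 2 values are taken from the slide.

```python
# Toy expression matrix: rows are genes, columns are samples.
# Values are copied from the Gene 1 and Gene 2 rows of the slide.
samples = ["Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5", "Sample 6"]
matrix = {
    "Gene 1": [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    "Gene 2": [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
}


def gene_profile(gene):
    """Return one row: the expression of a gene across all samples."""
    return dict(zip(samples, matrix[gene]))


def sample_profile(sample):
    """Return one column: the expression of every gene in one sample."""
    col = samples.index(sample)
    return {gene: values[col] for gene, values in matrix.items()}


print(gene_profile("Gene 1"))
print(sample_profile("Sample 3"))
```

Clustering genes compares rows with each other; clustering samples compares columns.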
5
Raw data pre-processing
Raw data are the values we get from the microarray or sequencer. "Raw values" is a general term for the raw measurements made by an instrument.
In microarrays, the raw data are probe intensities.
In sequencing, the raw data are counts per gene.
Raw data almost always need some processing before they are of adequate quality and carry biological meaning.
– For example, in high-throughput sequencing the raw data are the sequenced reads. They need to be mapped to the genome, possibly filtered, and then variant calling is done.
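As one possible illustration of "processing" for count data, the sketch below scales raw counts per gene to counts per million (CPM) and then log-transforms them. This particular normalization is an assumption chosen for the example, not a step named in the tutorial, and the gene names and counts are made up.

```python
import math


def counts_per_million(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount avoids log2(0) for unexpressed genes."""
    return {gene: math.log2(v + pseudocount) for gene, v in values.items()}


# Hypothetical raw counts for a single sample.
raw_counts = {"geneA": 500, "geneB": 1500, "geneC": 0, "geneD": 8000}

cpm = counts_per_million(raw_counts)
log_cpm = log2_transform(cpm)
print(cpm)
print(log_cpm)
```

Scaling by library size makes samples with different sequencing depth comparable; real pipelines typically use more elaborate methods.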
6
Expression profiles DBs
GEO (Gene Expression Omnibus) http://www.ncbi.nlm.nih.gov/geo/
Human genome browser http://genome.ucsc.edu/
ArrayExpress http://www.ebi.ac.uk/arrayexpress/
7
The current rate of submission and processing is over 10,000 samples per month.
In 2002, Nature journals announced a requirement that microarray data be deposited in public databases.
8
Searching for expression profiles in the GEO
http://www.ncbi.nlm.nih.gov/geo/
9
GEO accession IDs
GPL**** - platform ID
GSM**** - sample ID
GSE**** - series ID
GDS**** - dataset ID
A Series record defines a set of related samples considered to be part of a group.
A GDS record represents a collection of biologically and statistically comparable GEO samples.
Not every experiment has a GDS.
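The four prefixes can be told apart mechanically. The sketch below is a minimal example; the function name `classify_accession` is hypothetical, and the assumption that each accession is a prefix followed only by digits is made for illustration.

```python
import re

# Prefix -> record type, as listed on the slide.
ACCESSION_TYPES = {
    "GPL": "platform",
    "GSM": "sample",
    "GSE": "series",
    "GDS": "dataset",
}

# Assumption for this sketch: a known prefix followed by digits only.
_PATTERN = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = _PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return ACCESSION_TYPES[match.group(1)]


for acc in ["GDS507", "gse1234", "GSM11805", "XYZ42"]:
    print(acc, "->", classify_accession(acc))
```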
10
Download dataset
Clustering
Statistical analysis
11
Raw data (SOFT file)
Probes
Genes
Expression values per sample (GSM)
Gene annotations
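A minimal sketch of reading the data table from a dataset SOFT file is shown below. It assumes the table is tab-separated, sits between `!dataset_table_begin` and `!dataset_table_end`, starts with the columns `ID_REF` (probe) and `IDENTIFIER` (gene), and marks missing values as `null`. The embedded text, its accession numbers, and its values are made up for illustration.

```python
# Hypothetical, heavily shortened SOFT content; IDs and values are invented.
SOFT_TEXT = """\
^DATASET = GDS0000
!dataset_title = hypothetical example
!dataset_table_begin
ID_REF\tIDENTIFIER\tGSM0001\tGSM0002
probe_1\tGENE_A\t10.5\t11.2
probe_2\tGENE_B\t7.9\tnull
!dataset_table_end
"""


def parse_soft_table(text):
    """Return (header, rows) for the data table; each row is a dict keyed by column name."""
    header = None
    rows = []
    in_table = False
    for line in text.splitlines():
        if line.startswith("!dataset_table_begin"):
            in_table = True
            continue
        if line.startswith("!dataset_table_end"):
            break
        if not in_table:
            continue  # metadata lines outside the table are skipped
        fields = line.split("\t")
        if header is None:
            header = fields  # first table line holds the column names
        else:
            rows.append(dict(zip(header, fields)))
    return header, rows


def to_float(value):
    """Convert an expression value to float; non-numeric entries become None."""
    try:
        return float(value)
    except ValueError:
        return None


header, rows = parse_soft_table(SOFT_TEXT)
print(header)
for row in rows:
    print(row["ID_REF"], row["IDENTIFIER"], [to_float(row[s]) for s in header[2:]])
```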
12
Clustering analysis
Zoom in
13
Clustering analysis – zoom in
14
Clustering analysis – zoom in
16
Viewing the expression levels
17
Viewing the expression levels
19
Clustering
Grouping together genes with a similar signature
20
Hierarchical Clustering
This clustering method is based on distances between the expression profiles of different genes. Genes with similar expression patterns are grouped together.
21
Rings a bell?...
In both phylogenetic trees and clustering, we create a tree based on a distance matrix.
When computing phylogenetic trees, we compute distances between sequences:
ATCTGTCCGCTCG
ATGTGTGCGCTTG
When computing clustering dendrograms, we compute distances between expression values:
         Expr.1  Expr.2  Expr.3  Expr.4  Expr.5  Expr.6
Gene 1
Gene 2
Score
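Two common ways to turn a pair of expression profiles into a distance are sketched below: Euclidean distance and one minus the Pearson correlation. The tutorial does not say which metric it uses, so both are shown as assumptions; the profiles are the Gene 1, Gene 2 and Gene 4 rows of the example matrix.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shapes, 2 for opposite shapes."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


def distance_matrix(profiles, distance):
    """All pairwise distances, keyed by (gene, gene)."""
    genes = list(profiles)
    return {(g, h): distance(profiles[g], profiles[h]) for g in genes for h in genes}


profiles = {
    "Gene 1": [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    "Gene 2": [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
    "Gene 4": [2.9, 2.6, 2.5, -2.3, -0.1, -2.3],
}

dm = distance_matrix(profiles, pearson_distance)
for pair, d in dm.items():
    print(pair, round(d, 3))
```

Correlation-based distance groups genes by the shape of their profile, while Euclidean distance is also sensitive to the magnitude of expression.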
22
Hierarchical clustering methods produce a tree, or dendrogram.
They do not require the number of clusters to be specified in advance.
Partitions are obtained by cutting the tree at different levels.
2 clusters
4 clusters
6 clusters
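A bottom-up (agglomerative) version of this idea can be sketched as follows: start with every gene in its own cluster and repeatedly merge the two closest clusters. Stopping when a given number of clusters remains corresponds to cutting the tree at that level. Average linkage and Euclidean distance are assumptions made for this sketch, and the gene profiles are made up.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[g], profiles[h]) for g in cluster_a for h in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clusters(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[gene] for gene in profiles]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical profiles: two tight pairs and one outlier.
profiles = {
    "g1": [1.0, 1.1, 0.9],
    "g2": [1.2, 1.0, 1.0],
    "g3": [-1.0, -1.1, -0.9],
    "g4": [-1.2, -0.9, -1.0],
    "g5": [5.0, 5.0, 5.0],
}

for k in (3, 2):
    print(k, "clusters:", hierarchical_clusters(profiles, k))
```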
23
The more clusters you want, the higher the similarity within each cluster.
http://discoveryexhibition.org/pmwiki.php/Entries/Seo2009
24
Hierarchical clustering results
http://www.spandidos-publications.com/10.3892/ijo.2012.1644
You can cluster both samples and genes (separately)
25
Unsupervised Clustering – K-means clustering
An algorithm that partitions the data into K groups.
K=4
26
How does it work?
The algorithm iteratively divides the genes into K groups and calculates the center of each group. The result is a partition into K clusters in which each gene is assigned to its nearest cluster center. The solution can depend on the initial means.
1. k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2. k clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
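The four steps can be sketched directly in Python. This is a minimal illustration, not the implementation used by any particular tool: the function name `k_means`, the fixed random seed, the squared Euclidean distance, and the two-dimensional toy points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k initial means at random from the data set.
    means = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, means[c])) for p in points
        ]
        # Step 4: stop once the assignment no longer changes (convergence).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: the centroid of each cluster becomes its new mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = centroid(members)
    return assignment, means


# Hypothetical two-sample profiles forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, means = k_means(points, k=2)
print(assignment)
print(means)
```

Because the starting means are random, different seeds can give different cluster labels, and on less clear-cut data different clusters.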
27
How should we determine K?
Trial and error
Take K as the square root of the number of genes
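The square-root rule of thumb is simple arithmetic. In the sketch below, the function name `suggest_k` and the rounding to the nearest integer are assumptions; the result is only a starting point for trial and error.

```python
import math


def suggest_k(n_genes):
    """Rule-of-thumb starting value for K: the square root of the number of genes."""
    return max(1, round(math.sqrt(n_genes)))


for n in (25, 100, 2000):
    print(n, "genes -> K =", suggest_k(n))
```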
28
Tool for clustering - EPCLUST
http://www.bioinf.ebc.ee/EP/EP/EPCLUST/
30
Choose distance metric
Choose algorithm
31
Hierarchical clustering
32
Zoom in by clicking on the nodes
34
K-means clustering
35
Graphical representation of the cluster
Samples found in cluster
36
10 clusters, as requested
37
Now what?
Now that we have clusters, we want to know the function of each group.
This requires a general, shared vocabulary for describing gene functions.
38
Gene Ontology (GO)
http://www.geneontology.org/
The Gene Ontology project provides an ontology of defined terms representing gene product properties.
The ontology covers three domains:
– Biological process
– Cellular component
– Molecular function
39
Gene Ontology (GO)
Cellular Component (CC) - the parts of a cell or its extracellular environment.
Molecular Function (MF) - the elemental activities of a gene product at the molecular level, such as binding or catalysis.
Biological Process (BP) - operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.
40
The GO tree – a partial example
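GO terms form a hierarchy in which a term can have more than one parent, so a gene annotated with a specific term is implicitly annotated with all of that term's ancestors. The sketch below walks a small, simplified fragment; the term names and parent links are illustrative assumptions and are not copied from the current ontology.

```python
# Simplified, illustrative fragment: child term -> list of parent terms.
# Not an authoritative copy of GO; "toy term X" is invented.
PARENTS = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
    "toy term X": ["cellular metabolic process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("toy term X")))
```

Enrichment tools rely on this propagation when they count how many genes in a list fall under a broad term.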
42
DAVID
Functional Annotation Bioinformatics Microarray Analysis
Identify enriched biological themes, particularly GO terms
Discover enriched functional-related gene/protein groups
http://david.abcc.ncifcrf.gov/
43
ID conversion annotation
44
Functional annotation - upload
Gene list you want to explore (for example, all the genes in a certain cluster)
What is the identifier? (probes/gene names/gene IDs)
You can supply a background list as well
45
Functional annotation - results
Different kinds of enrichments are calculated
46
Functional annotation - results
Genes from your list involved in this category
Charts for each category
47
Minimum number of genes for corresponding term
Maximum EASE score/E-value
Genes from your list involved in this category
P-Value
Enriched terms associated with your genes
Source of term
Adjusted P-Value
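To give a feeling for where a P-Value and an Adjusted P-Value can come from, the sketch below computes a hypergeometric (one-sided Fisher exact) enrichment p-value and a Benjamini-Hochberg adjustment. It is not DAVID's exact computation: DAVID's EASE score is a modified, more conservative Fisher exact p-value. The function names and all the numbers are assumptions made for this example.

```python
from math import comb


def hypergeom_upper_tail(list_hits, list_size, pop_hits, pop_size):
    """P(X >= list_hits) when list_size genes are drawn from a background of
    pop_size genes, pop_hits of which carry the annotation."""
    total = comb(pop_size, list_size)
    max_hits = min(list_size, pop_hits)
    favourable = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(max(list_hits, 0), max_hits + 1)
    )
    return favourable / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical numbers: a cluster of 40 genes, 8 of them annotated with a term
# that covers 200 of the 10,000 genes in the background list.
p = hypergeom_upper_tail(list_hits=8, list_size=40, pop_hits=200, pop_size=10000)
print("enrichment p-value:", p)
print("adjusted:", benjamini_hochberg([p, 0.04, 0.03]))
```

The adjustment matters because many terms are tested at once, so some small p-values are expected by chance alone.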
48
Gene expression analysis
How to interpret an expression matrix
Expression data DBs - GEO
Clustering
– Hierarchical clustering
– K-means clustering
– Tools for clustering - EPCLUST
Functional analysis
– GO annotation
– DAVID