Tutorial 6 : RNA - Sequencing Analysis and GO enrichment


1 Tutorial 6 : RNA - Sequencing Analysis and GO enrichment

2 Agenda
GEO (Gene Expression Omnibus)
RNA-seq pipeline: normalization, replicates, differential expression
GO (Gene Ontology) and GO enrichment

3 GEO – Gene Expression Omnibus

4 Gene expression data sources
Microarrays
RNA-seq experiments

5 How to interpret an expression data matrix
Columns: Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6
Gene 1: -1.2, -2.1, -3, -1.5, 1.8, 2.9
Gene 2: 2.7, 0.2, -1.1, 1.6, -2.2, -1.7
Gene 3: -2.5, 1.5, -0.1, -1, 0.1
Gene 4: 2.6, 2.5, -2.3
Gene 5: 1.9, 2.2
Gene 6: -2.9, -1.9, -2.4
Each column represents all the gene expression levels from a single sample.
Each row represents the expression of a gene across all experiments.

6 The current rate of submission and processing is over 10,000 samples per month.
In 2002, the Nature journals announced a requirement that microarray data be deposited in public databases.

7 Searching for expression profiles in the GEO
* Further curated = statistically comparable datasets

8 Clustering; Statistical analysis; Download dataset

9 Raw data (SOFT file)
Figure labels: Expression values per sample (GSM); Probes; Genes; Gene annotations

10

11 Viewing the expression levels

12 Viewing the expression levels

13

14 RNA-Seq pipeline and analysis

15 Insert: the cDNA fragment that is used for sequencing
Read: the part of the insert that is sequenced
Single end: sequencing each fragment from one side
Paired end: sequencing each fragment from both sides

16 Mapping + Counting
After sequencing we receive a large number of reads. Each read represents a transcript from our initial RNA sample. The purpose of the first step, alignment, is to find the genomic coordinates from which the read originated. After detecting the read's position of origin in the genome, we check whether this genomic region contains a known gene; if so, the read is counted as one expression unit for that gene. After mapping and counting, we receive a matrix in which each row represents a gene and each column a single sample; the value of each cell is the number of reads mapped to that gene in that sample.
Figure labels: TopHat; splice junction
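The counting step can be sketched in a few lines of Python. This is only a toy illustration, not what an aligner or a counting tool does internally: the gene coordinates, the read positions, and the function names `count_reads` and `count_matrix` are hypothetical, each aligned read is reduced to a single (chromosome, position) pair, and splice junctions and overlapping genes are ignored.

```python
# Hypothetical gene annotation: gene -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "gene_A": ("chr1", 100, 500),
    "gene_B": ("chr1", 800, 1200),
}


def count_reads(aligned_reads, genes):
    """Count reads per gene for one sample.

    aligned_reads: iterable of (chromosome, position) pairs from the alignment step.
    Reads that fall outside every known gene are not counted.
    """
    counts = {gene: 0 for gene in genes}
    for chrom, pos in aligned_reads:
        for gene, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[gene] += 1
                break  # toy assumption: genes do not overlap
    return counts


def count_matrix(samples, genes):
    """Build the gene x sample matrix: one row per gene, one column per sample."""
    per_sample = {name: count_reads(reads, genes) for name, reads in samples.items()}
    return {gene: [per_sample[name][gene] for name in samples] for gene in genes}


# Hypothetical aligned reads for two samples.
samples = {
    "sample_1": [("chr1", 150), ("chr1", 450), ("chr1", 900), ("chr1", 5000)],
    "sample_2": [("chr1", 300), ("chr1", 1000), ("chr1", 1100)],
}
matrix = count_matrix(samples, GENES)
print(matrix)
```

The read at position 5000 maps to no known gene, so it does not contribute to any row of the matrix.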

17 Normalization
In order to bring all samples to a common scale and make the expression levels of genes comparable between samples, raw counts are normalized. A normalization factor (size factor) is assigned to each sample using the DESeq2 package, which also performs the differential expression analysis. The raw counts of each sample are divided by its calculated size factor.
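To make the size-factor idea concrete, here is a plain Python sketch of a median-of-ratios estimate, the approach DESeq2 uses by default for its size factors. It is an illustration of the idea under simplifying assumptions, not the package's code; the function names and the small count matrix are made up for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (one per gene), each a list of raw counts per sample.
    """
    n_samples = len(counts[0])
    # Use only genes with a non-zero count in every sample,
    # so that the geometric mean across samples is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_ratios = [[] for _ in range(n_samples)]
    for row in usable:
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts, factors):
    """Divide the raw counts of each sample by that sample's size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical raw counts: sample 2 was sequenced twice as deeply as sample 1.
raw = [
    [10, 20],
    [100, 200],
    [0, 4],
    [50, 100],
]
factors = size_factors(raw)
normalized = normalize(raw, factors)
print(factors)
print(normalized[0])
```

In this toy input the second sample gets a size factor twice that of the first, so after division the genes that differ only by sequencing depth have equal normalized counts.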

18 Replicates evaluation
The significance of a gene’s change between conditions is affected not only by the fold change between the two tested groups, but also by the within-group variability. In order to evaluate the similarity between samples and test for batch effects in the data, several diagnostic plots are generated using the normalized data.

19 Replicates evaluation
Heatmap of Euclidean distances, clustered using hierarchical clustering
Darker color: higher similarity. Lighter color: lower similarity.
Figure labels: Strain 1; Low temperature; High temperature; Control; Treated; Strain 2 control
High similarity in each group of replicates.
The differences between strains are more significant than the treatment effect.
The more replicates we have, the easier it is to detect outliers and remove them if necessary.
Outlier? Did the outlier sample show technical differences from its replicates? Lower mapping or counting percentages?
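The distances behind such a heatmap can be sketched as follows. This is a minimal example under stated assumptions: the sample names and normalized counts are invented, a simple log2(x + 1) transform stands in for whatever transformation the real pipeline applies, and the hierarchical clustering step is left out.

```python
import math


def log_transform(values):
    """log2(x + 1), so that highly expressed genes do not dominate the distance."""
    return [math.log2(x + 1) for x in values]


def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(samples):
    """Pairwise sample-to-sample distances as a nested dict."""
    names = list(samples)
    logged = {name: log_transform(samples[name]) for name in names}
    return {a: {b: euclidean(logged[a], logged[b]) for b in names} for a in names}


# Hypothetical normalized counts for three genes in three samples.
samples = {
    "control_1": [100.0, 20.0, 5.0],
    "control_2": [110.0, 18.0, 6.0],
    "treated_1": [10.0, 200.0, 5.0],
}
dist = distance_matrix(samples)
for name, row in dist.items():
    print(name, [round(d, 2) for d in row.values()])
```

Replicates of the same group should show small distances to each other; a sample that is far from its own replicates is a candidate outlier.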

20 Replicates evaluation
PCA plots: visualizing experimental covariates and batch effects
Groups shown: Time 0, Time 1, Time 2, Time 3
The replicates show relatively high similarity with respect to the first two principal components.
Good separation of groups of replicates according to the first principal component.
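As a rough illustration of what the first principal component captures, the sketch below computes it with power iteration on the gene-by-gene covariance matrix. The data are invented log-scale values, the function name is hypothetical, and a real analysis would use an established PCA routine instead of this hand-rolled version.

```python
import math


def first_principal_component(rows, iterations=200):
    """rows: samples x genes (log-scale values). Returns (loadings, scores) for PC1."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Covariance matrix between genes.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the direction of largest variance.
    v = [float(j + 1) for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the component.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Hypothetical samples: two replicates at Time 0, two replicates at Time 3.
data = [
    [1.0, 5.0, 3.0],  # Time 0, replicate 1
    [1.2, 5.1, 2.9],  # Time 0, replicate 2
    [4.0, 2.0, 3.1],  # Time 3, replicate 1
    [4.1, 1.8, 3.0],  # Time 3, replicate 2
]
loadings, scores = first_principal_component(data)
print([round(s, 2) for s in scores])
```

Replicates of the same time point end up close together along the component, and the two time points fall on opposite sides of zero. The overall sign of a principal component is arbitrary.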

21 Differential expression analysis
The differential gene expression analysis is conducted using the DESeq2 R package. DESeq2 generates a matrix of normalized counts and performs statistical tests to determine whether genes are differentially expressed between groups of samples or combinations of factors.
Aim: Detect changes between experimental conditions of interest that are significantly larger than the technical and biological variability among replicates.

22 Differential expression analysis
Columns: Gene ID | Gene position | baseMean | log2FoldChange | FoldChange | P-value | adjusted p-value | Flag | Normalized counts Treated | Normalized counts control | Significantly DE
Rows (cells that survive in the transcript, in order):
EHI_089420 | DS571553: | NA | Low_Counts | 1.17;0 | 0;2.31;1.12 | no
EHI_045110 | DS571147: | Tested | 226.08;140.55 | 674.53;642.51;468.79 | yes
EHI_ | DS571312: | E-05 | ;2027.7 | 244.18;60.86;385.6
Annotations: Filtered before multiple testing correction due to low expression levels; Threshold of 0.05 on adjusted p-value; Down regulated; Up regulated
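The adjusted p-value column comes from a multiple testing correction. DESeq2 reports Benjamini-Hochberg adjusted p-values by default; the sketch below shows that procedure in plain Python, with genes that were filtered out before the correction represented as `None`. The p-values are invented and the function is an illustration, not the package's implementation.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the input order.

    Entries that are None (genes filtered out before testing) stay None
    and do not count toward the number of tests.
    """
    tested = [(p, i) for i, p in enumerate(p_values) if p is not None]
    m = len(tested)
    adjusted = [None] * len(p_values)
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping the running
    # minimum so adjusted values never increase as p-values decrease.
    for rank, (p, i) in reversed(list(enumerate(sorted(tested), start=1))):
        running_min = min(running_min, p * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values; the third gene was flagged as low-count.
p_values = [0.001, 0.04, None, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)
significant = [a is not None and a < 0.05 for a in adjusted]
print(adjusted)
print(significant)
```

Filtering low-count genes before the correction reduces the number of tests, which makes the adjustment less severe for the genes that remain.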

23 Differential expression analysis
Possible filters to generate lists of genes for further analysis:
Set a more strict or more permissive threshold on the adjusted p-value
Set a threshold on the fold change to separate up and down regulated genes or to ignore relatively mild changes
Set a threshold on the average expression levels to ignore lowly expressed genes
The thresholds and filtering criteria should be set specifically for each project to match the expected effects and the question of interest.
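The three filters can be combined in a small function like the one below. The gene names, values, default thresholds, and the function name `filter_genes` are all hypothetical; the field names mirror the baseMean, log2FoldChange, and adjusted p-value columns of the results table.

```python
# Hypothetical differential expression results, one dict per gene.
results = [
    {"gene": "g1", "baseMean": 500.0, "log2FoldChange": -1.7, "padj": 0.001},
    {"gene": "g2", "baseMean": 2.0, "log2FoldChange": 3.0, "padj": None},
    {"gene": "g3", "baseMean": 900.0, "log2FoldChange": 2.4, "padj": 0.00002},
    {"gene": "g4", "baseMean": 300.0, "log2FoldChange": 0.2, "padj": 0.01},
]


def filter_genes(rows, max_padj=0.05, min_abs_log2fc=1.0, min_base_mean=10.0):
    """Split genes passing all three thresholds into up and down regulated lists."""
    up, down = [], []
    for row in rows:
        padj = row["padj"]
        if padj is None or padj >= max_padj:
            continue  # not tested, or not significant
        if row["baseMean"] < min_base_mean:
            continue  # lowly expressed
        if abs(row["log2FoldChange"]) < min_abs_log2fc:
            continue  # change too mild
        (up if row["log2FoldChange"] > 0 else down).append(row["gene"])
    return up, down


up, down = filter_genes(results)
strict_up, strict_down = filter_genes(results, max_padj=0.0001)
print(up, down)
print(strict_up, strict_down)
```

Keeping the thresholds as parameters makes it easy to tighten or relax them per project, as the slide recommends.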

24 Now what?
We measured thousands of genes using RNA-Seq, and found the genes that change the most between two conditions. How can you make sense of a list of hundreds or thousands of genes? There is a need for some kind of generalization of gene functions.

25 GO (gene ontology) annotations

26 Gene Ontology (GO)
The Gene Ontology project defines concepts/classes used to describe gene function, and relationships between these concepts. The ontology covers three domains: biological process, cellular component, and molecular function.

27

28 Gene Ontology (GO)
Cellular Component (CC): the parts of a cell or its extracellular environment.
Molecular Function (MF): the elemental activities of a gene product at the molecular level, such as binding or catalysis.
Biological Process (BP): operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.
Note: A single gene may be active in more than one area of the cell and/or perform several different functions; thus it may be assigned to many distinct or related terms.

29 The GO tree – a partial example
Figure labels: More general; More specific
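The general-to-specific structure can be modeled as a graph in which each term points to its more general parent terms, and a term may have more than one parent. A gene annotated to a specific term is then also considered annotated to every more general term above it. The sketch below uses a toy hierarchy written for illustration; it is not real GO data, and the names `PARENTS` and `ancestors` are hypothetical.

```python
# Toy hierarchy (illustrative only): term -> list of more general parent terms.
PARENTS = {
    "response to heat": ["response to temperature stimulus", "response to stress"],
    "response to temperature stimulus": ["response to abiotic stimulus"],
    "response to abiotic stimulus": ["response to stimulus"],
    "response to stress": ["response to stimulus"],
    "response to stimulus": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents):
    """Return every more general term reachable from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_annotations(gene_terms, parents):
    """Add all more general terms to each gene's set of annotated terms."""
    expanded = {}
    for gene, terms in gene_terms.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded


# Hypothetical annotation: one gene annotated to the most specific term.
annotations = expand_annotations({"gene_1": ["response to heat"]}, PARENTS)
print(sorted(annotations["gene_1"]))
```

Expanding annotations this way is what lets an enrichment analysis report a general term even when the individual genes are annotated only to its more specific children.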

30 GO enrichment

31 Our question: is the fraction of related genes in the input list greater than expected by random chance, compared with the background?
Background (all genes): 20,000 genes, of which 200 are involved in “response to heat shock”
Our list of genes: 2,000 genes, of which 20 are involved in “response to heat shock”
Fraction of related genes in the background: 200 / 20,000 = 1/100
Fraction of related genes in our list: 20 / 2,000 = 1/100
Answer: No. The two fractions are equal, so the list is not enriched for this term.
In general: the greater the fraction of related genes in the input list relative to the background fraction, the lower the probability that this is due to random chance, and the smaller (more significant) the p-value.
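One common way to turn this comparison into a p-value is a one-sided hypergeometric test: the probability of drawing at least the observed number of related genes when the list is sampled at random from the background. The sketch below applies it to the numbers from the slide, plus a second case with 40 related genes in the list, which is a hypothetical value added for contrast. It is not necessarily the exact statistic a given enrichment tool reports.

```python
from math import comb


def enrichment_p_value(background_size, background_hits, list_size, list_hits):
    """P(X >= list_hits) when list_size genes are drawn at random, without
    replacement, from a background in which background_hits genes carry the term."""
    total = comb(background_size, list_size)
    upper = min(background_hits, list_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    # Exact integer arithmetic up to this final division.
    return tail / total


# Numbers from the slide: 200 of 20,000 background genes, 20 of 2,000 list genes.
p_equal = enrichment_p_value(20000, 200, 2000, 20)
# Hypothetical contrast: 40 of the 2,000 list genes carry the term.
p_more = enrichment_p_value(20000, 200, 2000, 40)
print(p_equal)
print(p_more)
```

With 20 related genes the list matches the background fraction and the p-value is large, so there is no evidence of enrichment. Doubling the related genes in the list to 40 gives a far smaller p-value.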

32 GO enrichment
We would now want to check which gene functions are enriched in our list of interest.
Figure labels: Gene 1 to Gene 12, grouped under Immune processes, Metabolic process, Signal transduction, and Cell cycle

33 DAVID http://david.abcc.ncifcrf.gov/
DAVID: Functional Annotation Bioinformatics Microarray Analysis
Identify enriched biological themes, particularly GO terms
Discover enriched functional-related gene/protein groups

34 annotation ID conversion

35 Functional annotation - upload
Gene list you want to explore (for example, all the genes in a certain cluster or all differentially expressed genes)
What is the identifier? (probes / gene names / gene IDs)
You can supply a background list as well

36

37 Functional annotation - results
Different kinds of enrichments are calculated

38 Functional annotation - results
Charts for each category
Genes from your list involved in this category

39 Among the genes that responded to the treatment in this experiment, the enriched functions include, for example: inflammatory response, wound healing, immune response.
Figure labels: Minimum number of genes for corresponding term; Maximum EASE score / E-value; Source of term; Enriched terms associated with your genes; Genes from your list involved in this category; P-Value; Adjusted P-Value

40 Functional annotation - results
Charts for each category

41 These genes are active mainly in the spaces between the cells.

42 Summary
GEO (Gene Expression Omnibus)
RNA-seq pipeline: normalization, replicates, differential expression
GO (Gene Ontology) and GO enrichment

