Tutorial 6 : RNA - Sequencing Analysis and GO enrichment

Agenda
GEO – Gene Expression Omnibus
RNA-seq pipeline – Normalization, Replicates, Differential expression
GO (Gene Ontology) and GO enrichment

GEO – Gene Expression Omnibus

Gene expression data sources Microarrays RNA-seq experiments

How to interpret an expression data matrix
Each column represents all the gene expression levels from a single sample. Each row represents the expression of a gene across all experiments.
Columns: Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6
Gene 1: -1.2, -2.1, -3, -1.5, 1.8, 2.9
Gene 2: 2.7, 0.2, -1.1, 1.6, -2.2, -1.7
Gene 3: -2.5, 1.5, -0.1, -1, 0.1
Gene 4: 2.6, 2.5, -2.3
Gene 5: 1.9, 2.2
Gene 6: -2.9, -1.9, -2.4
(The rows for Genes 3 to 6 are incomplete, so their values cannot be assigned to specific samples.)
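To make the row and column reading concrete, here is a minimal Python sketch. It uses only the two rows that are complete on the slide (Gene 1 and Gene 2); the function names `gene_profile` and `sample_profile` are illustrative and not part of any library.

```python
# Expression matrix from the slide; only the two complete rows are used.
samples = ["Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5", "Sample 6"]
matrix = {
    "Gene 1": [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    "Gene 2": [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
}


def gene_profile(gene):
    """Row view: expression of one gene across all samples."""
    return dict(zip(samples, matrix[gene]))


def sample_profile(sample):
    """Column view: expression of every gene in one sample."""
    j = samples.index(sample)
    return {gene: values[j] for gene, values in matrix.items()}


print(gene_profile("Gene 1"))
print(sample_profile("Sample 5"))
```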

The current rate of submission and processing is over 10,000 samples per month. In 2002, the Nature journals announced a requirement that microarray data be deposited in public databases.

Searching for expression profiles in GEO: http://www.ncbi.nlm.nih.gov/geo/
*Further curated = statistically comparable datasets

Options shown for a dataset: clustering, statistical analysis, download dataset.

Raw data (SOFT file): expression values per sample (GSM), one column per sample. The rows are probes, which are linked to genes and gene annotations.

Viewing the expression levels

RNA-Seq pipeline and analysis

Insert: the cDNA fragment that is used for sequencing.
Read: the part of the insert that is sequenced.
Single end: each fragment is sequenced from one side.
Paired end: each fragment is sequenced from both sides.

Mapping + Counting
After sequencing we receive a large number of reads. Each read represents a fragment of a transcript from our initial RNA sample. The purpose of the first step, alignment, is to find the genomic coordinates from which the read originated (the slide shows TopHat, an aligner that handles reads spanning splice junctions). After detecting the position of origin of the read in the genome, we check whether this genomic region contains a known gene. If so, the read is counted as an expression unit for that gene. After mapping and counting, we receive a matrix where each row represents a gene and each column a single sample. The value of each cell is the number of reads mapped to that specific gene in that sample.
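The counting step can be sketched in a few lines of Python. This is a simplified illustration, not how a real counting tool works: the gene coordinates, the aligned reads and the rule "count a read only if it overlaps exactly one gene" are all assumptions made for the example.

```python
# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Hypothetical aligned reads per sample: (chromosome, start, end).
aligned = {
    "sample1": [("chr1", 120, 170), ("chr1", 450, 500),
                ("chr1", 900, 950), ("chr1", 600, 650)],
    "sample2": [("chr1", 810, 860), ("chr1", 1000, 1050), ("chr2", 120, 170)],
}


def overlaps(read, gene):
    """True if the read and the gene share at least one base on the same chromosome."""
    r_chrom, r_start, r_end = read
    g_chrom, g_start, g_end = gene
    return r_chrom == g_chrom and r_start <= g_end and r_end >= g_start


def count_reads(reads):
    """Count reads per gene for one sample; intergenic or ambiguous reads are skipped."""
    counts = {name: 0 for name in genes}
    for read in reads:
        hits = [name for name, gene in genes.items() if overlaps(read, gene)]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


per_sample = {sample: count_reads(reads) for sample, reads in aligned.items()}

# Count matrix: one row per gene, one column per sample (in the order of `aligned`).
count_matrix = {gene: [per_sample[s][gene] for s in aligned] for gene in genes}
print(count_matrix)
```

The resulting dictionary has the shape described above: genes as rows and samples as columns.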

Normalization
In order to bring all samples to a common scale and make the expression levels of genes comparable between samples, raw counts are normalized. The normalization is done by assigning a normalization factor (size factor) to each sample, using the DESeq2 package, which also performs the differential expression analysis. The raw counts of each sample are divided by its calculated size factor.
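The slide does not say how the size factors are computed. DESeq2's default is the median-of-ratios method, and the sketch below reimplements that idea in plain Python for illustration only. The counts are made up, and the real package should be used for any actual analysis.

```python
import math
from statistics import median

# Hypothetical raw counts: gene -> counts in [sample1, sample2, sample3].
raw = {
    "gene1": [10, 20, 40],
    "gene2": [100, 200, 400],
    "gene3": [50, 100, 200],
    "gene4": [0, 3, 5],  # contains a zero, so it is skipped when estimating factors
}


def size_factors(counts):
    """Median-of-ratios size factors, one per sample."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if min(values) == 0:
            continue  # the geometric mean is zero, so the ratio is undefined
        log_geo_mean = sum(math.log(v) for v in values) / n_samples
        for j, v in enumerate(values):
            log_ratios[j].append(math.log(v) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


factors = size_factors(raw)

# Normalized counts: each raw count divided by the size factor of its sample.
normalized = {
    gene: [v / f for v, f in zip(values, factors)]
    for gene, values in raw.items()
}
print(factors)
print(normalized["gene1"])
```

In this toy data the three samples differ only in sequencing depth, so after division the counts of each gene become nearly identical across samples.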

Replicates evaluation
The significance of a gene's change between conditions is affected not only by the fold change between the two tested groups, but also by the within-group variability. In order to evaluate the similarity between samples and test for batch effects in the data, several diagnostic plots are generated from the normalized data.

Replicates evaluation
The more replicates we have, the easier it is to detect outliers and remove them if necessary.
[Figure: heatmap of Euclidean distances between samples, clustered using hierarchical clustering. Darker color: higher similarity. Lighter color: lower similarity. Sample labels: Strain 1, Strain 2, control, treated, low temperature, high temperature.]
There is high similarity within each group of replicates. The differences between strains are larger than the treatment effect.
Outlier? Did the outlier sample show technical differences from its replicates, such as lower mapping or counting percentages?
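The distances behind such a heatmap can be sketched as follows. The sample names and expression values are invented for illustration, and the helper `nearest` is a hypothetical convenience function rather than part of any package.

```python
import math

# Hypothetical normalized (log-scale) expression: sample -> values for the same genes.
samples = {
    "control_1": [5.0, 8.1, 2.0, 6.5],
    "control_2": [5.2, 8.0, 2.1, 6.4],
    "treated_1": [7.9, 4.1, 2.2, 9.0],
    "treated_2": [8.1, 4.0, 2.0, 9.2],
}


def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


names = list(samples)

# Full sample-to-sample distance matrix, stored as a dictionary keyed by name pairs.
dist = {(a, b): euclidean(samples[a], samples[b]) for a in names for b in names}


def nearest(name):
    """Closest other sample; for good replicates this should be a replicate."""
    return min((n for n in names if n != name), key=lambda n: dist[(name, n)])


for name in names:
    print(name, "is closest to", nearest(name))
```

A sample whose nearest neighbor is not one of its own replicates is a candidate outlier worth checking against mapping and counting statistics.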

Replicates evaluation
PCA plots: visualizing experimental covariates and batch effects.
[Figure: PCA plot of the samples from Time 0, Time 1, Time 2 and Time 3.]
The replicates show relatively high similarity with respect to the first two principal components. The groups of replicates are well separated along the first principal component.

Differential expression analysis
The differential gene expression analysis is conducted using the DESeq2 R package. DESeq2 generates a matrix of normalized counts and performs statistical tests to determine whether genes are differentially expressed between groups of samples or combinations of factors.
Aim: detect changes between experimental conditions of interest that are significantly larger than the technical and biological variability among replicates.

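Differential expression analysis: example results
Columns: Gene ID, Gene position, baseMean, log2FoldChange, FoldChange, P-value, adjusted p-value, Flag, Normalized counts Treated, Normalized counts control, Significantly DE. A threshold of 0.05 on the adjusted p-value is used for the "Significantly DE" column.

EHI_089420 (DS571553:1893-3385)
  baseMean: 0.921356674
  log2FoldChange: -0.113760015
  FoldChange: 0.924176292
  P-value: 0.798135014
  adjusted p-value: NA (filtered before multiple testing correction due to low expression levels)
  Flag: Low_Counts
  Normalized counts Treated: 1.17;0
  Normalized counts control: 0;2.31;1.12
  Significantly DE: no

EHI_045110 (DS571147:94858-96333), down regulated
  baseMean: 430.4921647
  log2FoldChange: -1.494064536
  FoldChange: 0.355010959
  P-value: 0.000400676
  adjusted p-value: 0.01861908
  Flag: Tested
  Normalized counts Treated: 226.08;140.55
  Normalized counts control: 674.53;642.51;468.79
  Significantly DE: yes

EHI_033240-1 (DS571312:16339-16734), up regulated
  baseMean: 759.2026371
  log2FoldChange: 2.222073167
  FoldChange: 4.665634092
  P-value: 1.93908E-05
  adjusted p-value: 0.003359081
  Normalized counts Treated: 1077.67;2027.7
  Normalized counts control: 244.18;60.86;385.6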
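The slides do not name the multiple testing correction behind the adjusted p-value. DESeq2 uses the Benjamini-Hochberg procedure by default, so the sketch below assumes that method. The input p-values are made up, and the function is an illustration rather than the package's own code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four tested genes.
pvals = [0.001, 0.02, 0.03, 0.5]
print(benjamini_hochberg(pvals))
```

Genes filtered out for low counts, such as the one with an adjusted p-value of NA in the table, would be removed from the list before calling the function, which lowers the number of tests being corrected for.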

Differential expression analysis
Possible filters to generate lists of genes for further analysis:
Set a stricter or more permissive threshold on the adjusted p-value.
Set a threshold on the fold change to separate up and down regulated genes, or to ignore relatively mild changes.
Set a threshold on the average expression level to ignore lowly expressed genes.
The thresholds and filtering criteria should be set specifically for each project to match the expected effects and the question of interest.
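The three filters can be combined in a short function. The rows below are the three genes from the example results table. The default thresholds (0.05, a two-fold change, a baseMean of 10) and the function name `classify` are assumptions chosen for the example, not recommendations.

```python
# Rows from the example results table: (gene, baseMean, log2FoldChange, adjusted p-value).
results = [
    ("EHI_089420", 0.921356674, -0.113760015, None),  # adjusted p-value is NA
    ("EHI_045110", 430.4921647, -1.494064536, 0.01861908),
    ("EHI_033240-1", 759.2026371, 2.222073167, 0.003359081),
]


def classify(rows, max_padj=0.05, min_abs_log2fc=1.0, min_base_mean=10.0):
    """Split genes that pass all three filters into up and down regulated lists."""
    up, down = [], []
    for gene, base_mean, log2fc, padj in rows:
        if padj is None or padj >= max_padj:
            continue  # not tested, or not significant
        if base_mean < min_base_mean:
            continue  # lowly expressed
        if abs(log2fc) < min_abs_log2fc:
            continue  # change too mild
        (up if log2fc > 0 else down).append(gene)
    return up, down


up, down = classify(results)
print("up regulated:", up)
print("down regulated:", down)
```

Raising `min_abs_log2fc` to 2.0 would drop EHI_045110 from the down regulated list, which shows how strongly the final gene lists depend on the chosen thresholds.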

Now what?
We measured thousands of genes using RNA-Seq, and found the genes that change the most between two conditions. How can you make sense of a list of hundreds or thousands of genes? There is a need for some kind of generalization of gene functions.

GO (gene ontology) annotations

Gene Ontology (GO) http://www.geneontology.org/
The Gene Ontology project defines concepts/classes used to describe gene function, and the relationships between these concepts. The ontology covers three domains: biological process, cellular component, and molecular function.

Gene Ontology (GO)
Cellular Component (CC): the parts of a cell or its extracellular environment.
Molecular Function (MF): the elemental activities of a gene product at the molecular level, such as binding or catalysis.
Biological Process (BP): operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.
Note: A single gene may be active in more than one area of the cell and/or perform several different functions. Thus it may be assigned to many distinct or related terms.

The GO tree – a partial example, running from more general terms to more specific terms.

GO enrichment

Our question: is the fraction of related genes in the input list greater than expected by random chance, compared to the background?
Background (all genes): 20,000 genes, of which 200 are involved in "response to heat shock".
Our list of genes: 2,000 genes, of which 20 are involved in "response to heat shock".
Fraction of related genes in the background: 200 / 20,000 = 1/100
Fraction of related genes in our list: 20 / 2,000 = 1/100
Answer: No. The two fractions are equal, so the term is not enriched in our list.
In general: the more the fraction of related genes in the input list exceeds the background fraction, the lower the probability that this is due to random chance, and the smaller (more significant) the p-value.
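The slides do not state which statistical test produces the p-value. A common choice for this kind of question is a one-sided hypergeometric test, and the sketch below assumes it. The function name `enrichment_pvalue` is illustrative, and the second call (60 related genes instead of 20) is a made-up contrast case.

```python
from math import comb


def enrichment_pvalue(background, background_hits, list_size, list_hits):
    """P(X >= list_hits) when list_size genes are drawn at random from the background.

    background:      number of genes in the background
    background_hits: background genes annotated with the term
    list_size:       number of genes in our list
    list_hits:       genes in our list annotated with the term
    """
    total = comb(background, list_size)
    upper = min(background_hits, list_size)
    tail = sum(
        comb(background_hits, i) * comb(background - background_hits, list_size - i)
        for i in range(list_hits, upper + 1)
    )
    return tail / total


# Numbers from the slide: 200 of 20,000 background genes, 20 of 2,000 list genes.
p_same = enrichment_pvalue(20_000, 200, 2_000, 20)

# Hypothetical contrast: the same list with 60 related genes instead of 20.
p_more = enrichment_pvalue(20_000, 200, 2_000, 60)

print(p_same, p_more)
```

With equal fractions the p-value is large, so there is no evidence of enrichment. Tripling the number of related genes in the list makes the p-value far smaller.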

GO enrichment
We would now want to check which gene functions are enriched in our list of interest.
[Figure: Genes 1 to 12 from the list, grouped under the categories immune processes, metabolic process, signal transduction and cell cycle.]

DAVID http://david.abcc.ncifcrf.gov/
Functional Annotation Bioinformatics Microarray Analysis
Identify enriched biological themes, particularly GO terms.
Discover enriched functional-related gene/protein groups.

annotation ID conversion

Functional annotation - upload
Gene list: the genes you want to explore (for example, all the genes in a certain cluster, or all differentially expressed genes).
Identifier: what type of identifier is used (probes, gene names, or gene IDs)?
Background: you can supply a background list as well.

Functional annotation - results Different kinds of enrichments are calculated

Functional annotation - results Charts for each category Genes from your list involved in this category

Among the genes that responded to the treatment in the experiment above, the enriched functions include, for example: inflammatory response, wound healing, immune response.
Chart annotations: minimum number of genes for the corresponding term; maximum EASE score / E-value; source of term; enriched terms associated with your genes; genes from your list involved in this category; P-value; adjusted P-value.

Functional annotation - results Charts for each category

These genes are active mainly in the spaces between cells.

Summary
GEO – Gene Expression Omnibus
RNA-seq pipeline – Normalization, Replicates, Differential expression
GO (Gene Ontology) and GO enrichment