Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2017-03 RNA-Seq Analysis Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2017-03.

Slides:



Advertisements
Similar presentations
RNA-Seq as a Discovery Tool
Advertisements

RNA-seq library prep introduction
IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy
RNAseq.
12/04/2017 RNA seq (I) Edouard Severing.
Visualising and Exploring BS-Seq Data
Simon v2.3 RNA-Seq Analysis Simon v2.3.
Peter Tsai Bioinformatics Institute, University of Auckland
DEG Mi-kyoung Seo.
RNAseq analysis Bioinformatics Analysis Team
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Transcriptomics Jim Noonan GENE 760.
Ribosomal Profiling Data Handling and Analysis
RNA-seq Analysis in Galaxy
RNA-Seq and RNA Structure Prediction
Brief workflow RNA is isolated from cells, fragmented at random positions, and copied into complementary DNA (cDNA). Fragments meeting a certain size specification.
Li and Dewey BMC Bioinformatics 2011, 12:323
Expression Analysis of RNA-seq Data
RNAseq analyses -- methods
RNA-Seq Analysis Simon V4.1.
Transcriptome Analysis
Eran Yanowski, Eran Hornstein’s: Monitor drug impact on the transcriptome of mouse beta cells (primary and cell-line) using Transeq/RNA-Seq Report.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
Introduction to RNAseq
The iPlant Collaborative
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
Misleading bioinformatics: Mistakes, Biases, Mis-interpretations and how to avoid them Festival of Genomics 2017 Course Exercise Material:
RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Differential Methylation Analysis
Statistics Behind Differential Gene Expression
SeqMonk tools for methylation analysis
Short Read Sequencing Analysis Workshop
RNA Quantitation from RNAseq Data
Placental Bioinformatics
RNA-Seq for the Next Generation RNA-Seq Intro Slides
Simon v1.0 Motif Searching Simon v1.0.
Amos Tanay Nir Yosef 1st HCA Jamboree, 8/2017
WS9: RNA-Seq Analysis with Galaxy (non-model organism )
Moderní metody analýzy genomu
Biases and their Effect on Biological Interpretation
Gene expression from RNA-Seq
RNA-Seq analysis in R (Bioconductor)
Understanding and Validating Experimental Expectations
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
The FASTQ format and quality control
Kallisto: near-optimal RNA seq quantification tool
Differential Expression from RNA-seq
Analysing ChIP-Seq Data
Sequence Analysis 2- RNA-Seq
Artefacts and Biases in Gene Set Analysis
Transcriptome analysis
ChIP-Seq Data Processing and QC
Exploring and Understanding ChIP-Seq data
Visualising and Exploring BS-Seq Data
Learning to count: quantifying signal
Simon V Motif Searching Simon V
Assessing changes in data – Part 2, Differential Expression with DESeq2
Artefacts and Biases in Gene Set Analysis
Volume 16, Issue 8, Pages (August 2016)
Additional file 2: RNA-Seq data analysis pipeline
Quantitative analyses using RNA-seq data
Sequence Analysis - RNA-Seq 2
BF528 - Sequence Analysis Fundamentals
Introduction to RNA-Seq & Transcriptome Analysis
Sequence Analysis - RNA-Seq 1
Differential Expression of RNA-Seq Data
RNA-Seq Data Analysis UND Genomics Core.
Presentation transcript:

Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2017-03 RNA-Seq Analysis Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2017-03

RNA-Seq Libraries rRNA depleted mRNA Fragment Random prime + RT 2nd strand synthesis (+ U) A-tailing Adapter Ligation (U strand degradation) Sequencing NNNN u u u u u A T u A T A T

Reference based RNA-Seq Analysis QC (Trimming) Mapping Statistical Analysis Quantitation Mapped QC

Quality Control

QC: Raw Data Sequence call quality

QC: Raw Data Sequence bias

QC: Raw Data Duplication level

Mapping to a reference

Mapping Exon 1 Exon 2 Exon 3 Genome Simple mapping within exons Mapping between exons Spliced mapping

RNA-Seq Mapping Software HiSat2 (https://ccb.jhu.edu/software/hisat2/) Star (http://code.google.com/p/rna-star/) Tophat (http://tophat.cbcb.umd.edu/)

HiSat2 pipeline Reference FastA files Indexed Genome Reference GTF Models Reads (fastq) Yes Maps with known junctions Report Pool of known splice junctions Yes Add Maps convincingly with novel junction? Report No Discard

Quality Control (mapped data)

Mapping Statistics 2221186 reads; of these: 2221186 (100.00%) were paired; of these: 710157 (31.97%) aligned concordantly 0 times 794711 (35.78%) aligned concordantly exactly 1 time 716318 (32.25%) aligned concordantly >1 times ---- 710157 pairs aligned concordantly 0 times; of these: 3924 (0.55%) aligned discordantly 1 time ---- 706233 pairs aligned 0 times concordantly or discordantly; of these: 1412466 mates make up the pairs; of these: 1280507 (90.66%) aligned 0 times 52740 (3.73%) aligned exactly 1 time 79219 (5.61%) aligned >1 times 71.18% overall alignment rate

Viewing Mapped Data Reads over exons Reads over introns Reads in intergenic Strand specificity

SeqMonk RNA-Seq QC (good)

SeqMonk RNA-Seq QC (bad)

SeqMonk RNA-Seq QC (bad)

Duplication (good)

Duplication (bad)

Look at poor QC samples

Quantitation

Quantitation Splice form 1 Exon 1 Exon 2 Exon 3 Splice form 2 Exon 1 Definitely splice form 1 Definitely splice form 2 Ambiguous

Simple Quantitation Count read overlaps with exons of each gene Consider library directionality Simple Gene level quantitation Many programs Seqmonk (graphical) Feature Counts (subread) BEDTools HTSeq

Analysing Splicing Estimate transcript level expression Cufflinks, RSEM, bitSeq etc. Compare splicing decision ratios rMATS, logistic regression Compare exon expression to gene expression DEXSeq Analyse junction usage directly Counts + DESeq/EdgeR

RPKM / FPKM / TPM RPKM (Reads per kilobase of transcript per million reads of library) Corrects for total library coverage Corrects for gene length Comparable between different genes within the same dataset FPKM (Fragments per kilobase of transcript per million reads of library) Only relevant for paired end libraries Pairs are not independent observation Effectively halves raw counts TPM (transcripts per million) Normalises to transcript copies instead of reads Corrects for cases where the average transcript length differs between samples

Normalisation – Coverage Outliers

Normalisation – DNA Contamination

Filtering Genes Remove things which are uninteresting or shouldn’t be measured Reduces noise – easier to achieve significance Non-coding (miRNA, snoRNA etc) in RNA-Seq Known mis-spliced forms (exon skipping etc) Mitochondrial genes X/Y chr genes in mixed sex populations Unknown/Unannotated genes

Visualising Expression Linear Log2 CD74 Eef1a1 Actb Lars2 Eef2

Differential Expression

Differential Expression Mapped data Normalised expression (+ confidence) Raw counts Continuous stats Binomial stats

DE-Seq binomial Stats Are the counts we see for gene X in condition 1 consistent with those for gene X in condition 2? Size factors Estimator of library sampling depth More stable measure than total coverage Based on median ratio between conditions Variance – required for NB distribution Insufficient observations to allow direct measure Custom variance distribution fitted to real data Smooth distribution assumed to allow fitting

Dispersion shrinkage Plot observed per gene dispersion Calculate average dispersion for genes with similar observation Individual dispersions regressed towards the mean. Weighted by Distance from mean Number of observations Points more than 2SD above the mean are not regressed

Other filters Cook’s Distance – Identifies high variance Effect on mean of removal of one replicate For n<3 test not performed For n = 3-6 failures are removed For n>6 outliers removed to make trimmed mean Disable with cooksCutoff=FALSE Hit count optimisation Low intensity reads are removed Limits multiple-testing to give max significant hits Disable with independentFiltering=FALSE

Visualising Differential Expression Results 5x5 Replicates 5,000 out of 22,000 genes (23%) identified as DE using DESeq (p<0.05) Need further filtering!

Magnitude of effect filtering Intensity difference test Fold change biases to low expression Need a test which Has a statistical basis Doesn’t bias by expression level Returns sensible numbers of hits

Assumptions Noise is related to observation level Similar to DESeq Differences between conditions are either A direct response to stimulus Noise, either technical or biological Find points whose differences aren’t explained by general disruption

Method

Results

Experimental Design

Practical Experiment Design What type of library? What type of sequencing? How many reads? How many replicates?

What type of library? Directional libraries if possible Easier to spot contamination No mixed signals from antisense transcription May be difficult for low input samples mRNA vs total vs depletion etc. Down to experimental questions Remember LINC RNA may not have polyA tail Active transcription vs standing mRNA pool

What type of sequencing Depends on your interest Expression quantitation of known genes 50bp single end is fine Expression plus splice junction usage 100bp (or longer if possible) single end Novel transcript discovery 100bp paired end

How many reads Typically aim for 20 million reads for human / mouse sized genome More reads: De-novo discovery Low expressed transcripts More replicates more useful than more reads

Replicates Compared to arrays, RNA-Seq is a very clean technical measure of expression Generally don’t run technical replicates Must run biological replicates For clean systems (eg cell lines) 3x3 or 4x4 is common Higher numbers required as the system gets more variable Always plan for at least one sample to fail Randomise across sample groups

Exercises Look at raw QC Mapping with HiSat2 Small test data Quantitation and visualisation Use SeqMonk graphical program Larger replicated data Differential expression DESeq2 Intensity Difference Review in SeqMonk

Useful links FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ HiSat2 https://ccb.jhu.edu/software/hisat2/ SeqMonk http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/ Cufflinks http://cufflinks.cbcb.umd.edu/ DESeq2 https://bioconductor.org/packages/release/bioc/html/DESeq2.html Bioconductor http://www.bioconductor.org/ DupRadar http://sourceforge.net/projects/dupradar/