RNA-Seq analysis in R (Bioconductor)

Soumya Luthra

Index
- What is RNA-Seq?
- Analysis Pipeline
  - Mapping
  - Quantification
  - Normalization
  - Differential Expression
  - Visualization
- Demo
  - EdgeR
  - DESeq2

- Design Experiment: Carefully design the experiment
- Purify RNA: Isolate and purify input RNA
- Prepare Library: Convert the RNA to cDNA and add sequencing adapters
- Sequence: Sequence cDNAs using one of the available NGS platforms
- Analysis: Analyse the resulting short-read sequences

What is RNA-Seq?
RNA-Seq is the process of sequencing the transcriptome, which includes both protein-coding and non-coding transcripts.
Applications:
- Gene (exon, isoform) expression estimation
- Differential gene (exon, isoform) expression analysis
- Transcriptome assembly: map exon and intron boundaries and splice junctions
- Discovery of novel transcribed regions
- Analysis of alternative splicing

Analysis Pipeline
[Figure: overview of the analysis pipeline. Source: Vall d’Hebron, slide 19]

Step 1: Mapping
Align reads against a reference genome or transcriptome.
Challenges:
- Computationally intensive: large number of reads
- Mapping reads across splice junctions
TopHat2 (splice-aware aligner):
- Extracts the transcript sequences and uses Bowtie to align reads to this virtual transcriptome first
- Reads that do not fully map to the transcriptome are then mapped to the genome, and a spliced alignment is attempted

Step 2: Quantification
Identify the reads that map uniquely to a gene. How to resolve overlaps?
Tools:
- htseq-count: from the HTSeq Python package
- featureCounts(): from the Rsubread R package
- summarizeOverlaps(): from the GenomicAlignments R package
Counting rules:
- Count mapped reads, not base pairs
- Count each read at most once
- Discard a read if:
  - it cannot be assigned to any feature
  - it cannot be uniquely mapped
  - it can be assigned to more than one gene (ambiguous)
  - its mates do not map to the same gene
- Do not discard reads just because they are duplicates
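The counting rules above can be sketched in a few lines of Python. This is only an illustration of the rules, not how htseq-count, featureCounts() or summarizeOverlaps() are implemented; the read records, their keys (`n_alignments`, `genes`, `mate_genes`) and the gene names are hypothetical and made up for the example.

```python
from collections import Counter


def count_reads(reads):
    """Tally reads per gene, applying the discard rules listed above.

    Each read is a dict with hypothetical keys:
      n_alignments: number of genomic locations the read aligns to
      genes:        set of gene IDs the read overlaps
      mate_genes:   set of gene IDs the mate overlaps (paired-end only)
    """
    counts = Counter()
    discarded = Counter()
    for read in reads:
        genes = set(read["genes"])
        mate_genes = read.get("mate_genes")
        if read["n_alignments"] != 1:
            discarded["not_unique"] += 1        # cannot be uniquely mapped
        elif mate_genes is not None and set(mate_genes) != genes:
            discarded["mates_disagree"] += 1    # mates map to different genes
        elif not genes:
            discarded["no_feature"] += 1        # overlaps no feature
        elif len(genes) > 1:
            discarded["ambiguous"] += 1         # overlaps more than one gene
        else:
            counts[next(iter(genes))] += 1      # counted exactly once
    return counts, discarded


# Made-up reads; the first two are duplicates and are both kept.
reads = [
    {"n_alignments": 1, "genes": {"geneA"}},
    {"n_alignments": 1, "genes": {"geneA"}},
    {"n_alignments": 3, "genes": {"geneB"}},
    {"n_alignments": 1, "genes": set()},
    {"n_alignments": 1, "genes": {"geneA", "geneB"}},
    {"n_alignments": 1, "genes": {"geneB"}, "mate_genes": {"geneC"}},
    {"n_alignments": 1, "genes": {"geneB"}, "mate_genes": {"geneB"}},
]
counts, discarded = count_reads(reads)
print(dict(counts))
print(dict(discarded))
```

Each read contributes one count or none, so the unit is reads rather than base pairs, and duplicate reads are not filtered out.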

Step 3: Normalization
Read counts need to be properly normalized to extract meaningful expression estimates. The main biases that must be accounted for in the normalization and/or differential expression calculations are:
- Sequencing depth: the higher the sequencing depth, the higher the counts
- Gene length: longer transcripts are expected to generate more reads
- Count distribution
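As a rough illustration of correcting for the first two biases, the sketch below scales raw counts by library size (counts per million, CPM) and additionally by gene length (reads per kilobase per million, RPKM). The slides do not name these measures, so treating them as the normalization is an assumption; the counts and gene lengths are made up.

```python
def counts_per_million(counts):
    """Scale counts by sequencing depth (total reads in the sample)."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def rpkm(counts, lengths_bp):
    """Scale counts by sequencing depth and by gene length in base pairs."""
    total = sum(counts.values())
    return {gene: c * 1e9 / (total * lengths_bp[gene]) for gene, c in counts.items()}


# Made-up counts for one sample and made-up gene lengths.
sample = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

cpm = counts_per_million(sample)
rpkm_values = rpkm(sample, lengths)
for gene in sample:
    print(gene, round(cpm[gene], 1), round(rpkm_values[gene], 1))
```

In this toy data geneB has three times the raw count of geneA but is also three times as long, so the two genes end up with the same length-corrected value.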

Step 4: Differential Expression Analysis
How do the expression levels differ across several conditions?
Challenges:
- Count data are discrete and not normally distributed, so a t-test is not appropriate
- The number of replicates is small, so permutation methods cannot be used
- Variability in measurements across biological replicates of an experiment must be accounted for

Poisson Distribution?
For a Poisson distribution, mean = variance.
Is read count data Poisson distributed?
Over-dispersion: the variance in RNA-Seq measurements of gene expression is larger than the value predicted by the Poisson model.
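A quick way to see over-dispersion is to compare the mean and the sample variance of a gene's counts across replicates. The sketch below uses made-up replicate counts and assumes equal sequencing depth in every replicate, so it is a rough diagnostic only and not a formal test.

```python
from statistics import mean, variance


def mean_and_variance(counts):
    """Return the mean and the sample variance of one gene's replicate counts."""
    return mean(counts), variance(counts)


# Made-up counts for two genes across four biological replicates.
replicate_counts = {
    "geneA": [10, 12, 9, 11],
    "geneB": [100, 160, 70, 130],
}

stats = {gene: mean_and_variance(c) for gene, c in replicate_counts.items()}
for gene, (m, v) in stats.items():
    # Under a Poisson model the variance would be close to the mean.
    label = "over-dispersed" if v > m else "no excess variance"
    print(gene, m, round(v, 2), label)
```

Here geneB has a variance far above its mean, which a Poisson model cannot accommodate.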

Negative Binomial Distribution
- The NB distribution has been shown to be a good fit to RNA-Seq data
- It is flexible enough to account for biological variability
Model: an observation $Y_{gj}$ (the observed number of reads for gene $g$ in sample $j$) is assumed to have mean $\mu_{gj}$ and variance $\mu_{gj} + \Phi_g \mu_{gj}^2$, where $\Phi_g$ represents the over-dispersion relative to the Poisson distribution.
- The mean parameter depends on the sequencing depth as well as on the amount of RNA from the gene in the sample
- Obtaining good estimates of each gene’s dispersion is critical for statistical testing
Tools: edgeR and DESeq model the count data using a negative binomial distribution and perform statistical tests for differential expression.
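The variance relation of the model, mean plus dispersion times squared mean, can be written out directly, and inverting it gives a simple method-of-moments estimate of the dispersion. This estimator is used here only for illustration; it is not the estimation procedure of edgeR or DESeq, it ignores differences in sequencing depth, and the replicate counts are made up.

```python
from statistics import mean, variance


def nb_variance(mu, phi):
    """Variance implied by the NB model: mu + phi * mu^2."""
    return mu + phi * mu ** 2


def moment_dispersion(counts):
    """Method-of-moments dispersion: solve variance = mu + phi * mu^2 for phi.

    Returns 0.0 when the sample variance does not exceed the mean.
    """
    m = mean(counts)
    v = variance(counts)
    return max((v - m) / m ** 2, 0.0)


# Made-up replicate counts for two genes.
gene_a = [10, 12, 9, 11]
gene_b = [100, 160, 70, 130]

phi_a = moment_dispersion(gene_a)
phi_b = moment_dispersion(gene_b)
print(phi_a, round(phi_b, 4))
print(nb_variance(mean(gene_b), phi_b))  # reproduces the sample variance of gene_b
```

With a dispersion of zero the formula reduces to the Poisson case, where the variance equals the mean.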

edgeR
edgeR treats the Poisson variance as simple sampling variance, and refers to the square root of the dispersion as the "biological coefficient of variation".
Estimating dispersion:
- edgeR shares information across genes to determine a common dispersion
- It then calculates a dispersion estimate per gene and shrinks it towards the common dispersion
- The gene-specific (referred to in edgeR as tagwise) dispersion estimates are used in the test for differential expression
Statistical test:
- Simple design: an exact test analogous to Fisher’s exact test
- Complex design: generalized linear model framework
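The idea of shrinking gene-wise dispersions towards a common value can be illustrated with a toy linear shrinkage. This is not edgeR's actual algorithm: the median used as the "common" dispersion, the fixed `weight` parameter and the input values are all assumptions made for this sketch.

```python
from statistics import median


def shrink_dispersions(genewise, weight=0.5):
    """Pull each gene-wise dispersion towards a common value.

    weight = 0 keeps the gene-wise estimates unchanged;
    weight = 1 replaces them all with the common value.
    The median stands in for the common dispersion (an assumption).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    common = median(genewise.values())
    shrunk = {
        gene: weight * common + (1.0 - weight) * phi
        for gene, phi in genewise.items()
    }
    return common, shrunk


# Made-up gene-wise dispersion estimates.
genewise = {"geneA": 0.02, "geneB": 0.10, "geneC": 0.60}
common, shrunk = shrink_dispersions(genewise, weight=0.5)
print(common)
print({gene: round(phi, 3) for gene, phi in shrunk.items()})
```

Extreme estimates such as the one for geneC move the furthest, which is the practical benefit of sharing information across genes when there are few replicates.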

DESeq2
Differential gene expression analysis of count data based on the negative binomial distribution.
Offers two transformations for stabilizing the variance of count data:
- Variance stabilizing transformation (VST)
- Regularized log (rlog)

Thank You
