Differential Expression from RNA-seq X. Shirley Liu STAT115/215, BIO/BST282
Sequencing Read Distribution: Analogy: the number of patients arriving in an emergency room between 10 and 11 pm; here, the number of reads mapped to a gene 1 kb long. Poisson distribution: λ = average number of events per interval; K = number of events in an interval; Var = mean = λ.
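For reference, the Poisson model named on this slide can be written out explicitly, with K the read count for a gene and λ its expected count (the same symbols as on the slide):

```latex
P(K = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots
\qquad
\mathrm{E}[K] = \mathrm{Var}(K) = \lambda
```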
Sequencing Read Distribution: In reality, sequencing data are over-dispersed (mean < variance). Negative binomial NB(r, p): the number of successes before the r-th failure, when P(success) = p.
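Written out under the slide's definition (K successes before the r-th failure, success probability p), the negative binomial makes the over-dispersion explicit. The reparametrization in terms of the mean μ and a dispersion α = 1/r is added here for illustration; it is the form commonly used by count-based differential expression tools, not something stated on the slide.

```latex
P(K = k) = \binom{k + r - 1}{k} \, p^{k} (1 - p)^{r}, \quad k = 0, 1, 2, \dots

\mu = \mathrm{E}[K] = \frac{p\,r}{1 - p},
\qquad
\mathrm{Var}(K) = \frac{p\,r}{(1 - p)^{2}} = \frac{\mu}{1 - p} > \mu

\mathrm{Var}(K) = \mu + \frac{\mu^{2}}{r} = \mu + \alpha \mu^{2},
\qquad \alpha = \frac{1}{r}
```

As r grows (α goes to 0) with μ held fixed, the variance approaches the mean and the Poisson case is recovered.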
Modeling Read Over-Dispersion: Variance is estimated by borrowing information from all the genes (hierarchical models). Test whether the expression of gene i follows the same NB() in the 2 conditions. FDR?
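The slide leaves the FDR question open. One common answer, once each gene has a p-value from the NB test, is the Benjamini-Hochberg adjustment; the sketch below assumes that choice, and the function name and p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the genes sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a two-condition test.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in qvals]
```

Here genes 3 and 4 have raw p-values below 0.05 but adjusted values just above it, so only the first two genes are called at a 5% FDR.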
Fold Change with Var Shrinkage: Shrinkage is not equal across genes. Strong moderation for low-information genes (low counts); almost no shrinkage elsewhere. Noisy estimates due to low counts: large FDR from the statistical model, but we shouldn't trust the estimate itself.
Splicing Transcripts Assign reads to splice isoforms (TopHat)
Transcript Assembly: Reference-based assembly: Cufflinks. De novo assembly: Trinity.
Isoform Inference: Given a known set of isoforms, estimate the isoform abundances x that maximize the likelihood of observing the read counts n.
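The slide does not say how the likelihood is maximized. A common choice is expectation-maximization over reads that are compatible with more than one isoform; the sketch below assumes that approach, with reads grouped by the set of isoforms they are compatible with. The function name, effective lengths, and read counts are hypothetical, and every read is assumed compatible with at least one isoform.

```python
def em_isoform_fractions(classes, eff_len, n_iter=200):
    """Estimate the fraction of reads from each isoform by EM.

    classes: list of (isoform_indices, read_count) pairs, one per
             group of reads compatible with the same isoforms.
    eff_len: effective length of each isoform.
    """
    k = len(eff_len)
    theta = [1.0 / k] * k  # start from equal read fractions
    total = sum(count for _, count in classes)
    for _ in range(n_iter):
        expected = [0.0] * k
        # E-step: split each group's reads among its compatible isoforms,
        # in proportion to read fraction / effective length.
        for isoforms, count in classes:
            weights = [theta[j] / eff_len[j] for j in isoforms]
            norm = sum(weights)
            for j, w in zip(isoforms, weights):
                expected[j] += count * w / norm
        # M-step: new read fractions are the expected read shares.
        theta = [e / total for e in expected]
    return theta


# Toy gene: isoform 0 = exons A+B, isoform 1 = exon A only.
# 60 reads fall on exon A (ambiguous), 40 on exon B (isoform 0 only).
classes = [((0, 1), 60), ((0,), 40)]
eff_len = [2000.0, 1000.0]
theta = em_isoform_fractions(classes, eff_len)

# Convert read fractions to relative transcript abundances x.
per_base = [t / length for t, length in zip(theta, eff_len)]
x = [a / sum(per_base) for a in per_base]
```

In this toy case the 40 reads on exon B imply about 40 of the exon A reads also come from isoform 0, so the read fractions settle near 0.8 and 0.2 and the relative abundances near 2/3 and 1/3.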
Known Isoform Abundance Inference
Identification of Differential Splicing Between RNA-seq Samples: Most differential splicing detection algorithms call differentially expressed exons, not whole transcripts, especially for novel splicing.
Splicing Isoform Inference: With a known isoform set, gene-level expression inference is sometimes very good, although isoform abundances may be uncertain (e.g. if the known set is incomplete). De novo methods are usually better at detecting differential exon splicing, but not whole transcripts. De novo isoform inference is a non-identifiable problem if RNA-seq reads are short and the gene is long with too many exons. Experimental validation of quantitative differential splicing is still quite hard.
Active Field: HISAT2 for fast alignment: hierarchical index. https://ccb.jhu.edu/software/hisat2/index.shtml Kallisto and Sleuth: Kallisto for TPM, Sleuth for differential expression; known genes and transcripts. https://scilifelab.github.io/courses/rnaseq/labs/kallisto
Summary: RNA-seq design considerations. Read mapping: BWA, STAR. Quality control: RSeQC. Expression index: RPKM/FPKM and TPM. Differential expression: limma-voom and DESeq. Transcriptome assembly: Cufflinks, Trinity. Alternative splicing: rMATS. New developments: HISAT2, Kallisto and Sleuth.
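The two expression indices named in the summary use the same ingredients (read counts and gene lengths) but normalize in a different order. A minimal sketch, assuming raw per-gene counts and lengths in base pairs; the function names and numbers are illustrative only.

```python
def fpkm(counts, lengths_bp):
    """Reads (or fragments) per kilobase of gene per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]


# Three hypothetical genes whose counts are proportional to their lengths.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 7000]
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

TPM values always sum to one million within a sample, which FPKM values do not; that fixed total is the usual argument for TPM when comparing samples.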
Single Cell RNA-seq
Why Single-Cell RNA-seq? Heterogeneous cell populations Kolodziejczyk et al, Mol Cell 2015
Why Single-Cell RNA-seq?
Two General Approaches From Ziegenhain et al. 2017
Drop-Seq: Drop-seq overview: cells mix with reagents in a droplet; RNA attaches to a particle carrying a specific barcode, etc. From Macosko et al. 2015
Variations: cDNA conversion rate: 2-25%. Droplet size. Reagent concentration. Cell count and dilution. PCR efficiency. UMIs control for over-amplification of one transcript.
Sequencing Results: Paired-end sequencing ($$$): one read carries the cell barcode, UMI and polyA. Collapse all transcripts with the same barcode and the same UMI into 1. From Macosko et al. 2015
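The collapsing step can be sketched as follows. This assumes each aligned read has already been reduced to a (cell barcode, UMI, gene) triple; including the gene in the key is an assumption beyond the slide, which only mentions barcode and UMI. The function name and sequences are made up, and no correction for sequencing errors in barcodes or UMIs is attempted.

```python
from collections import defaultdict


def collapse_umis(reads):
    """Count unique molecules per cell and gene.

    reads: iterable of (cell_barcode, umi, gene) tuples, one per read.
    Returns a nested dict: counts[cell_barcode][gene] -> molecule count.
    """
    # PCR duplicates share barcode, UMI and gene, so a set keeps one copy.
    molecules = set(reads)
    counts = defaultdict(lambda: defaultdict(int))
    for cell, _umi, gene in molecules:
        counts[cell][gene] += 1
    return counts


# Hypothetical reads: the first three are PCR copies of one molecule.
reads = [
    ("AAAC", "TTGCA", "Rho"),
    ("AAAC", "TTGCA", "Rho"),
    ("AAAC", "TTGCA", "Rho"),
    ("AAAC", "GGATC", "Rho"),
    ("CCGT", "TTGCA", "Rho"),
    ("CCGT", "ACGTA", "Opn1sw"),
]
counts = collapse_umis(reads)
```

Six reads collapse to four molecules: two Rho molecules in cell AAAC, and one each of Rho and Opn1sw in cell CCGT.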
SMART-based vs Droplet-based
SMART-based: Fresh cells. One cell at a time. Small cell population. Lower dropout. Cell barcode. Full length. Transcripts / cell higher. Per cell transcription more accurate. $$$
Droplet-based: Fresh cells. All droplets together. Higher dropout. Cell barcode. UMI for PCR bias correction. 3’ bias. Transcripts / cell lower. Per cluster transcription more accurate. $$$
Potential Applications: Understand stem cell differentiation or state transition. Map heterogeneity in complex tissue types (tumor / brain / blood, etc). Identify new cell types with new functions. Stochastic and dynamic responses to perturbation. …
Quality Control
Dropouts Kharchenko, et al, Nat Meth 2014; Zheng et al, Nat Comm 2017
From Kolodziejczyk et al. 2015 In each single cell, we observe variations in the gene counts. A good proportion of the variation doesn’t help us discover biology. UMIs are discussed more on another slide; transcription kinetics are largely unknown. Multiple methods have been proposed for cell cycle adjustment but have had limited success.
Visualizing scRNA-seq data: t-distributed stochastic neighbor embedding (tSNE). New dimension reduction method. Preserves pair-wise distances, but focuses on points close by. Distances between far-away clusters don’t matter. Colors are manually labeled. Density should be labeled. Non-deterministic.
Reconstruction of Retinal Cell Types: PCA on ~14K high-quality cells out of 44K sequenced cells. tSNE on the 32 statistically significant PCs. Density-based clustering.
Checking Batch Effect Single cells from different days
Seurat is an R package designed for QC, analysis, and exploration of single cell RNA-seq data Satija et al., Nat. Biotech. 2015
Summary: SMART-based vs Droplet-based single-cell sequencing. Barcode and UMI. Dropout modeling. tSNE for visualization.
Acknowledgement Wei Li Michael Love Alisha Holloway Simon Andrews Radhika Khetani Chengzhong Zhang Etai Jacob Caleb Lareau Luca Pinello Assieh Saadatpour