Computational Methods for Analysis of Single Cell RNA-Seq Data


1 Computational Methods for Analysis of Single Cell RNA-Seq Data
Ion Măndoiu Computer Science & Engineering Department University of Connecticut

2 Outline
Intro to scRNA-Seq: Motivation; scRNA-Seq protocols; Analysis challenges for single cell data
Typical analysis pipeline for scRNA-Seq
Conclusions

3 Why Single Cell RNA-Seq?
Macaulay and Voet, PLOS Genetics, 2014

4 Applications: Cell Differentiation

5 Applications: Tumor Heterogeneity

6 Applications: Cell Type Identification

7 Single Cell RNA-Seq Growth

8 Recent Technology Breakthroughs
DIY

9 Fluidigm Workflow
Microfluidic chips (IFCs) with 96 and 800 capture chambers

10 Full-Transcript Sequencing

11 3’-end Sequencing (Fluidigm HT)

12 5’-end Sequencing w/ UMIs (STRT-C1)
Islam et al. 2013,

13 3’-end Sequencing w/ UMIs (10X)
Encapsulates up to 48,000 cells in 10 minutes

14 Challenges
Low RT efficiency & sequencing depth: results in “zero-inflated” data (Hicks et al. 2015)

15 Challenges
Low RT efficiency & sequencing depth
PCR amplification bias: UMIs help (Ziegenhain et al. 2017, Mol. Cell 65(4), pp. 631–643.e4)

16 Challenges
Low RT efficiency & sequencing depth
PCR amplification bias
Cell quality: live/dead, missing cells, multiple cells

17 Challenges
Low RT efficiency & sequencing depth
PCR amplification bias
Cell quality
Stochastic effects: cells are captured in different cell cycle phases; transcriptional bursting is hard to distinguish from technical artifacts

18 Challenges
Low RT efficiency & sequencing depth
PCR amplification bias
Cell quality
Stochastic effects
Cell capture bias: capture rates may not be representative of population frequencies

19 Challenges
Low RT efficiency & sequencing depth
PCR amplification bias
Cell quality
Stochastic effects
Cell capture bias
Analysis tools lagging behind: protocols are still evolving rapidly; standard methods do not scale well with the number of cells

20 Outline
Intro to scRNA-Seq
Typical analysis pipeline for scRNA-Seq
  Primary analysis: reads QC, mapping, and quantification
  Secondary analysis: cells QC, normalization, clustering, and differential expression
  Tertiary analysis: functional annotation
Conclusions

21 Reads QC
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis
Reads are typically processed to:
  Remove adapters
  Collapse identical sequences
  Filter and/or trim reads based on quality scores
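
The read-processing steps listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the procedure used by any particular tool: the function names, the Phred+33 encoding, the quality cutoff of 20, and the minimum length of 4 are all assumptions made for the example.

```python
from collections import Counter

def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim low-quality bases from the 3' end of one read.

    qual is a FASTQ quality string; offset=33 assumes Phred+33 encoding.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def collapse_identical(seqs):
    """Collapse identical sequences, keeping a copy count for each."""
    return Counter(seqs)

# Toy reads: 'I' encodes quality 40 and '#' encodes quality 2 under Phred+33.
reads = [("ACGTACGT", "IIIIII##"), ("ACGTAC", "IIIIII"), ("TTTT", "####")]
trimmed = [trim_3prime(s, q)[0] for s, q in reads]
kept = [s for s in trimmed if len(s) >= 4]  # hypothetical minimum length filter
collapsed = collapse_identical(kept)
print(collapsed)
```

Changing `min_q` or the minimum length changes which reads survive, which is one way to see how trimming parameters can propagate into downstream quantification.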

22 Reads QC
Sounds like a good idea!
  Improved N50 in genome assembly
  Decreased FPs in variant calling
  Higher overall read mapping rate
Not all good for RNA-Seq quantification
  Fewer reads
  Shorter reads → fewer unique alignments

23-25 Reads QC
Trimming parameters can significantly change results! (Williams et al. 2016)

26 Read mapping
RNA-Seq read mapping strategies:
  Ungapped mapping (with mismatches) to genome: cannot align reads spanning exon junctions
  Local alignment (Smith-Waterman) to genome: very slow
  Spliced alignment to genome: computationally harder than ungapped alignment, but much faster than local alignment
  Mapping on transcript libraries: fastest, but cannot align reads from un-annotated transcripts
  Mapping on exon-exon junction libraries: cannot align reads overlapping un-annotated exons
  Hybrid approaches
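
To see why local alignment to the genome is described as very slow, here is a minimal sketch of Smith-Waterman scoring. It fills a dynamic programming table with one cell per (read position, reference position) pair, so the cost grows with the product of the two lengths. The scoring values (match 2, mismatch -1, linear gap -2) are illustrative assumptions, not parameters of any specific aligner.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACGT", "TTACGTTT"))
```

Spliced aligners avoid this quadratic cost per read by using indexes of the reference, which is consistent with the slide's remark that they are much faster than local alignment.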

27 Read mapping
Comparison of spliced read mapping tools (Kim et al.)

28 Quantification
Cannot use raw read counts (why not?) (Islam et al.)

29 Quantification
CPM = counts per million
  Ignores multireads → underestimates expression of genes in large families
  Does not normalize for gene length → cannot compare CPMs between genes
  Comparing CPMs between samples assumes similar transcriptome size
RPKM/FPKM = reads/fragments per kilobase per million
  Length for multi-isoform genes?
  Comparing FPKM between samples assumes similar (weighted) transcriptome size
TPM = transcripts per million
  Still a relative measure of expression, but comparable between samples
  Most accurate estimation methods use multireads and isoform-level estimation
UMI counts
  Absolute measure of expression?
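
The three relative measures above can be written out as a small worked example. This is a sketch using the standard definitions, with toy counts and gene lengths invented for illustration. It works on simple per-gene read counts and so ignores multireads and isoform-level estimation.

```python
def cpm(counts):
    """Counts per million: scale by library size only."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: scale by library size and gene length."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r * 1e6 / rate_sum for r in rates]

counts = [10, 20, 70]           # toy read counts for three genes
lengths = [1000, 2000, 1000]    # toy gene lengths in bp

print(cpm(counts))
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

In this toy case genes 1 and 2 have different CPMs (gene 2 is twice as long and collects twice the reads) but identical RPKM and TPM values. TPM values always sum to one million, whereas the sum of RPKM values depends on the sample.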

30 Quantification
Gene ambiguous reads
Isoform ambiguous reads
[Figure labels: A, B, C, D, E]

31 Quantification
Expectation-maximization approach (IsoEM, RSEM)
[Figure labels: A, B, C; i, j; Fa(i), Fa(j)]

32 Quantification
EM Algorithm
1. Start with random transcript frequencies
2. Fractionally allocate reads to transcripts
3. Compute expected #reads for each transcript
[Figure values: fractional allocations 0.2, 0.5, 1; expected reads 0.5, 2.5, 1, 1.5]

33 Quantification
EM Algorithm
1. Start with random transcript frequencies
2. Fractionally allocate reads to transcripts
3. Compute expected #reads for each transcript
4. Update transcript frequencies using maximum likelihood estimates
[Figure values: expected reads 0.5, 2.5, 1, 1.5; updated frequencies 0.5/6, 2.5/6, 1/6, 1.5/6]

34 Quantification
EM Algorithm
1. Start with random transcript frequencies
2. Fractionally allocate reads to transcripts
3. Compute expected #reads for each transcript
4. Update transcript frequencies using maximum likelihood estimates
5. Repeat steps 2-4 until convergence
[Figure values: updated frequencies 0.5/6, 2.5/6, 1/6, 1.5/6]
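
The five steps can be sketched as follows. This is a simplified illustration, not the IsoEM or RSEM implementation. It ignores transcript lengths, fragment length distributions, and alignment quality. It also starts from uniform rather than random frequencies so that the result is reproducible. The input format (for each read, the list of transcript indices it is compatible with) is an assumption made for the example.

```python
def em_transcript_frequencies(read_compat, n_transcripts, max_iter=1000, tol=1e-9):
    """Estimate transcript frequencies from read-to-transcript compatibilities."""
    # Step 1: initial frequencies (uniform here, random on the slide).
    freqs = [1.0 / n_transcripts] * n_transcripts
    for _ in range(max_iter):
        expected = [0.0] * n_transcripts
        for compat in read_compat:
            if not compat:
                continue  # read maps nowhere; it carries no information
            denom = sum(freqs[t] for t in compat)
            # Step 2: split the read among its transcripts in proportion
            # to the current frequency estimates.
            for t in compat:
                # Step 3: accumulate expected read counts per transcript.
                expected[t] += freqs[t] / denom
        # Step 4: maximum likelihood update given the expected counts.
        total = sum(expected)
        new_freqs = [e / total for e in expected]
        # Step 5: stop once the estimates no longer change.
        converged = max(abs(a - b) for a, b in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs

# Toy input: read 0 maps only to transcript 0, read 1 maps to both transcripts.
print(em_transcript_frequencies([[0], [0, 1]], n_transcripts=2))
```

In the toy input the ambiguous read is gradually reassigned to transcript 0, because the uniquely mapped read is the only evidence that distinguishes the two transcripts.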

35 Accuracy comparison (Kanitz et al. 2015)
30M reads

36 Quantification
Runtime comparison (Kanitz et al. 2015): 1 core vs. 16 cores

37 Quantification
Why does speed matter? Quantifying estimate uncertainty by bootstrapping
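
Bootstrapping repeats the whole quantification on many resampled read sets, which is why a fast quantifier matters. The sketch below shows the idea under simplifying assumptions: reads are represented only by the label of the gene they map to, and the "estimator" is just the fraction of reads assigned to one gene. The function name and the toy data are hypothetical.

```python
import random
import statistics

def bootstrap_estimates(reads, estimator, n_boot=200, seed=0):
    """Re-run `estimator` on n_boot resamples (with replacement) of the reads."""
    rng = random.Random(seed)
    n = len(reads)
    return [
        estimator([reads[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]

# Toy data: 70 reads from gene A and 30 reads from gene B.
reads = ["A"] * 70 + ["B"] * 30

def fraction_a(rs):
    return rs.count("A") / len(rs)

estimates = bootstrap_estimates(reads, fraction_a)
print(statistics.mean(estimates), statistics.stdev(estimates))
```

The spread of the bootstrap estimates gives a rough measure of the uncertainty of the point estimate. With a real quantifier in place of `fraction_a`, the cost is `n_boot` full quantification runs.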

38 Detected genes/cell
[Figure panels: main population; bi-modal distribution; minor population]

39 Cells QC
Technical failures or biologically interesting cells?
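
A minimal sketch of the "detected genes per cell" statistic follows, together with one simple way to flag cells with unusually few detected genes. The median-minus-MAD cutoff is an assumption chosen for illustration, not a rule taken from the slides. As the question above suggests, flagged cells may be technical failures or a genuine minor population, so the flag is a prompt for inspection rather than an automatic filter.

```python
import statistics

def detected_genes_per_cell(matrix):
    """matrix[g][c] is the count of gene g in cell c; returns one value per cell."""
    n_cells = len(matrix[0])
    return [sum(1 for row in matrix if row[c] > 0) for c in range(n_cells)]

def flag_low_detection(detected, n_mads=3):
    """Flag cells whose detected-gene count is far below the median."""
    med = statistics.median(detected)
    mad = statistics.median(abs(d - med) for d in detected)
    return [d < med - n_mads * mad for d in detected]

# Toy count matrix: 4 genes (rows) by 3 cells (columns).
matrix = [
    [1, 0, 3],
    [0, 0, 2],
    [5, 0, 1],
    [2, 1, 0],
]
detected = detected_genes_per_cell(matrix)
print(detected, flag_low_detection(detected))
```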

40-41 Cells QC
Cellity cell-QC pipeline (Ilicic et al. 2016)

42 Normalization
Methods developed for bulk samples are commonly used, but are a poor fit for scRNA-Seq data… (Vallejos et al. 2017)

43 Normalization
… and have a major impact on downstream analyses [Figure: Highly Variable Genes] (Vallejos et al. 2017)

44 Normalization
Alternative methods:
  External controls (ERCC)
  Pool-based deconvolution

45 Typical first step is dimensionality reduction by PCA
1st component = direction of max. variance
2nd component = orthogonal to 1st, max. residual variance
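
The two definitions above translate directly into a small procedure: find the direction of maximum variance, remove it from the data, and repeat on what is left. The sketch below does this with power iteration and deflation using only the standard library. It is meant to illustrate the definitions, not to replace a numerical library, and the fixed starting vector and iteration count are assumptions that suit small toy inputs.

```python
import math

def principal_components(rows, k=2, n_iter=200):
    """Return the first k principal directions of `rows` (one row per cell)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]  # center the data
    components = []
    for _comp in range(k):
        v = [1.0 / math.sqrt(d)] * d  # fixed starting direction (assumption)
        for _it in range(n_iter):
            # Multiply v by X^T X, which is proportional to the covariance matrix.
            proj = [sum(x[j] * v[j] for j in range(d)) for x in X]
            w = [sum(proj[i] * X[i][j] for i in range(n)) for j in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break  # no variance left in the data
            v = [c / norm for c in w]
        components.append(v)
        # Deflation: remove the variance explained by v, leaving the residual.
        proj = [sum(x[j] * v[j] for j in range(d)) for x in X]
        X = [[X[i][j] - proj[i] * v[j] for j in range(d)] for i in range(n)]
    return components

# Toy data with most variance along the first axis.
data = [(3, 0), (-3, 0), (0, 1), (0, -1)]
pcs = principal_components(data)
print(pcs)
```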

46 Clustering
ZIFA: dimensionality reduction method for zero-inflated data (Pierson and Yau 2015)

47 Clustering
Many clustering algorithms:
  K-means
  Hierarchical clustering
  Expectation-Maximization
  Graph based
scRNA-Seq clustering is an active research area:
  Reducing effect of confounders such as detection rate & cell cycle phase
  Gene selection
  Distance metric learning
  Scalability
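
As a concrete instance of the first algorithm in the list, here is a minimal K-means (Lloyd's algorithm) sketch. It uses squared Euclidean distance and takes the first k points as initial centers. Both choices are simplifying assumptions, and practical tools use better initialization and typically run on PCA-reduced data.

```python
def kmeans(points, k, n_iter=100):
    """Cluster points into k groups; returns (labels, centers)."""
    centers = [list(p) for p in points[:k]]  # simplistic initialization
    labels = [None] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy data: two well separated groups of three cells each.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(cells, k=2)
print(labels)
```

Each iteration compares every cell to every center, so the cost grows with the number of cells times k, which relates to the scalability concern listed above.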

48 Clustering
Accuracy comparison for simulated mixtures of CD19+ B cells, CD8+ cytotoxic T cells, CD4+/CD45RO+ memory T cells, CD4+/CD25+ regulatory T cells, and CD4+ helper T cells.

49 Differential expression
Single-cell specific methods:
  SCDE: Single-Cell Differential Expression (Kharchenko et al. 2014)
  scDD: single-cell Differential Distributions (Korthauer et al. 2015)
  MAST: Model based Analysis of Single-cell Transcriptomics (Finak et al. 2015)
  D3E: Discrete Distributional Differential Expression (Delmans and Hemberg 2016)

50 Differential expression

51 Differential expression

52 Marker-based annotation (A. Jackson, unpublished)

53 Matching clusters to known cell types and organism parts
Guo M, Wang H, Potter SS, Whitsett JA, Xu Y (2015) SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis. PLoS Comput Biol 11(11):e doi: /journal.pcbi

54 Lineage inference: SCUBA

55 Lineage inference: ECLAIR

56 Conclusions
The range of single-cell applications continues to expand, fueled by advances in technology
Primary analysis is compute intensive
  Requires server/cluster/cloud + Linux + scripting
  Existing frameworks (e.g., Galaxy) lack the latest SC tools and do not handle large numbers of files well

57 Conclusions
Secondary/tertiary analyses can be done on PC/Mac using R
Several comprehensive packages and pipelines are available: Seurat, SINCERA, Granatum
Pipelines provide extensive parametrization, but lack support for sensitivity analysis
Robust single-cell specific methods are still needed

58 Joint analysis of bulk and scRNA-Seq
Needed to get unbiased population frequencies of cell types
Can also identify cell types missed by current capture protocols

59 Linear model
The heterogeneous mixture is a linear combination of the canonical cell-type gene expression signatures at some set of concentrations.
[Figure: heterogeneous mixture x = (x_1, ..., x_6) over genes 1-6 = cell signatures S (entries s_11 ... s_63, one column per cell type 1-3) × cell concentrations c = (c_1, c_2, c_3)]

60 Estimation of mixture proportions
$$\min_{c} \; \lVert Sc - x \rVert^2 \quad \text{s.t.} \quad \sum_{l=0}^{k} c_l = 1, \quad c_l \ge 0 \;\; \forall l = 0 \dots k$$
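
One simple way to solve this constrained least squares problem is projected gradient descent: take a gradient step on the squared error, then project the concentrations back onto the set of non-negative vectors that sum to 1. The sketch below is an illustration under that choice of solver, since the slides do not say which optimizer was used. The function names, step size rule, and iteration count are assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {c : c_l >= 0, sum of c_l = 1}."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]

def estimate_proportions(S, x, n_iter=2000):
    """Minimize ||S c - x||^2 over the simplex; S[i][l] is gene i in cell type l."""
    m, k = len(S), len(S[0])
    c = [1.0 / k] * k  # start from uniform concentrations
    # Conservative step size: the squared Frobenius norm of S bounds the
    # largest eigenvalue of S^T S, and the gradient carries a factor of 2.
    step = 1.0 / (2.0 * sum(s * s for row in S for s in row))
    for _ in range(n_iter):
        residual = [sum(S[i][l] * c[l] for l in range(k)) - x[i] for i in range(m)]
        grad = [2.0 * sum(S[i][l] * residual[i] for i in range(m)) for l in range(k)]
        c = project_to_simplex([c[l] - step * grad[l] for l in range(k)])
    return c

# Toy example: 3 genes, 2 cell types, true concentrations (0.3, 0.7).
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [0.3, 0.7, 1.0]
c_hat = estimate_proportions(S, x)
print(c_hat)
```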

61 Simultaneous Estimation of Mixture Proportions and Missing Signature
$$\min_{C} \; \lVert SC - X \rVert^2 \quad \text{s.t.} \quad \sum_{l=0}^{k} c_{lj} = 1 \;\; \forall j = 0 \dots n, \quad c_l \ge 0 \;\; \forall l = 0 \dots k, \quad s_i \ge 0 \;\; \forall i = 0 \dots m$$
This non-negative, non-linear least squares objective is a simple model that works well in practice and is similar to NMF approaches. The initial guess for the missing signature is the average of the known signatures, and the concentrations are initialized to be uniform.

62 Acknowledgements

