Computational Methods for Analysis of Single Cell RNA-Seq Data

Ion Măndoiu, Computer Science & Engineering Department, University of Connecticut

Outline
Motivation and challenges
Locality sensitive imputation
TF-IDF based feature selection & clustering
SC1 pipeline
Ongoing work

Recent Technology Breakthroughs
DIY

3’-end Sequencing w/ UMIs (10X)
Encapsulates up to 48,000 cells in 10 minutes

Primary Analysis: produces UMI counts (gene expression matrix)

Challenges: Allelic dropouts
Szulwach et al.

Challenges: Low RT efficiency & sequencing depth
Hicks et al. 2015

Challenges: PCR amplification bias
Ziegenhain et al. 2017, Mol. Cell 65(4), pp. 631–643.e4

Challenges: Cell “quality”
Live/dead
Stress response
Multiplets

Challenges: Many more
Stochastic effects: cells captured in different cell cycle phases; transcriptional bursting hard to distinguish from technical artifacts
Cell capture bias: capture rates may not be representative of population frequencies
Scalability: million cell datasets…


Imputation for scRNA-Seq Data
Figure labels: CD45 UMI count = 0; CD45+
Can drop-outs be recovered by imputation?

Existing Imputation Methods
BISCUIT (Azizi et al., GCB 2017)
CIDR (Lin, Troup, & Ho, Genome Biol. 2017)
DrImpute (Kwak et al., bioRxiv 2017)
LSImpute (Moussa & Mandoiu, ISBRA 2018)
MAGIC (van Dijk et al., bioRxiv 2017)
netSmooth (Ronen & Akalin, F1000Res. 2018)
scImpute (Li & Li, Nat. Comm. 2018)

LSImpute
Step 1: Select a small number (m) of cell pairs with highest similarity (O(n) using Locality Sensitive Hashing)
Step 2: Group the selected cells into m clusters using spherical k-means
Step 3: For each cluster, replace zeros with the median/mean expression of the gene within the cluster
Step 4: Collapse the selected cells into cluster centroids and repeat until the highest pair similarity drops below a given threshold
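To make Steps 1 and 3 concrete, here is a minimal sketch in plain Python. It is not the LSImpute implementation: the pair search is a brute-force O(n^2) stand-in for Locality Sensitive Hashing, the clustering and centroid-collapse steps are omitted, and the function names (`cosine`, `top_similar_pairs`, `impute_cluster`) are hypothetical. It also assumes that the median/mean is taken over all cells of the cluster, zeros included, which the slides do not specify.

```python
import math
from statistics import median

def cosine(u, v):
    """Cosine similarity between two expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_similar_pairs(cells, m):
    """Step 1 stand-in: brute-force search for the m most similar cell pairs.
    (The slides use Locality Sensitive Hashing to avoid the O(n^2) scan.)"""
    pairs = [(cosine(cells[i], cells[j]), i, j)
             for i in range(len(cells)) for j in range(i + 1, len(cells))]
    pairs.sort(key=lambda p: -p[0])
    return pairs[:m]

def impute_cluster(cluster, stat=median):
    """Step 3: within one cluster, replace each zero with the cluster-wide
    statistic (median or mean) of that gene.
    Assumption: the statistic is computed over all cells, zeros included."""
    n_genes = len(cluster[0])
    fill = [stat([cell[g] for cell in cluster]) for g in range(n_genes)]
    return [[fill[g] if cell[g] == 0 else cell[g] for g in range(n_genes)]
            for cell in cluster]

# Toy data: four cells (rows) by three genes (columns).
cells = [[5, 0, 2], [4, 3, 0], [6, 3, 2], [0, 0, 9]]
best = top_similar_pairs(cells, m=1)   # most similar pair of cells
imputed = impute_cluster(cells[:3])    # treat the first three cells as one cluster
```

With the median as the statistic, a zero is only overwritten when most cells in the cluster express the gene, which makes the median variant the more conservative of the two.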

Imputation Experimental Setup
209 somatosensory neurons isolated from the mouse dorsal root ganglion (Li et al., Cell Research 2016)
≈31.5M reads/cell
≈10,950 ± 1,218 genes/cell
Read subsampling: 50k-20M reads
Ground truth: TPM values determined by running IsoEM2 (Mandric et al., Bioinformatics 2017) on the full set of reads

Evaluation metrics
Gene detection fraction: number of cells in which the gene is detected divided by the total number of cells; compared to the `true' detection ratio in scatter plots
Median percent error (MPE): median of the set of relative errors for the gene detection fraction
Gene detection accuracy: (TP+TN) / (N*M), where N and M are the numbers of genes and cells, respectively
Clustering micro-accuracy
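The first three metrics can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code behind the slides: matrices are lists of rows with one row per gene, a gene counts as detected when its value is above zero, and genes with a true detection fraction of zero are skipped when computing relative errors. All function names are hypothetical.

```python
from statistics import median

def detection_fraction(matrix):
    """Per-gene fraction of cells in which the gene is detected (rows = genes)."""
    return [sum(1 for v in row if v > 0) / len(row) for row in matrix]

def median_percent_error(true_frac, est_frac):
    """Median over genes of |estimated - true| / true, in percent.
    Assumption: genes with a true fraction of 0 are skipped."""
    errors = [abs(e - t) / t * 100
              for t, e in zip(true_frac, est_frac) if t > 0]
    return median(errors)

def detection_accuracy(truth, estimate):
    """(TP + TN) / (N * M): share of gene/cell entries whose detected or
    undetected status agrees between ground truth and estimate."""
    agree = sum((t > 0) == (e > 0)
                for t_row, e_row in zip(truth, estimate)
                for t, e in zip(t_row, e_row))
    return agree / (len(truth) * len(truth[0]))

# Toy example: 2 genes by 4 cells.
truth = [[1, 2, 0, 3], [0, 0, 5, 1]]
estimate = [[1, 0, 0, 3], [0, 2, 5, 1]]
mpe = median_percent_error(detection_fraction(truth), detection_fraction(estimate))
acc = detection_accuracy(truth, estimate)
```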

Gene Detection Fraction
Figure panels, by subsampling depth (100k, …M, …M reads) and method: Raw Data, DrImpute, scImpute, KNNImpute, LSImputeMed, LSImputeMean

MPE Plots
Panels: Raw, DrImpute, scImpute, KNNImpute, LSImputeMed, LSImputeMean

Gene Detection Accuracy

Clustering Accuracy sKmeans, top TF-IDF

Accuracy on 10x Data
638 MethA cells: 500k reads/cell, down-sampled to 50k reads/cell
Detection accuracy: Raw 0.97; DrImpute 0.95; LSImpute 0.974
Figure panels: True vs. down-sampled, DrImpute, LSImpute

TF-IDF Transformation
Borrowed from information retrieval
Product of two factors:
Term frequency: how frequently does a term occur in a document?
Inverse document frequency: how uncommon is the term in the document collection?
For scRNA-Seq data:
For gene i in cell j with count $f_{ij}$: $TF_{ij} = f_{ij} / \max_k f_{kj}$
If gene i is detected in $n_i$ out of N cells: $IDF_i = \log_2(N / n_i)$
TF-IDF score: $TF_{ij} \times IDF_i$
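The two formulas translate directly into code. The sketch below is a plain-Python illustration, not the SC1 implementation; `tf_idf` is a hypothetical name, the count matrix is stored with one row per gene, and two edge cases the slide does not cover are handled by assumption: a gene detected in no cell gets an IDF of 0, and a cell whose maximum count is 0 gets a TF of 0.

```python
import math

def tf_idf(counts):
    """counts[i][j] is the UMI count of gene i in cell j.
    Returns a matrix of the same shape holding TF_ij * IDF_i."""
    n_genes, n_cells = len(counts), len(counts[0])
    # Largest count in each cell: the denominator of TF_ij.
    cell_max = [max(counts[i][j] for i in range(n_genes)) for j in range(n_cells)]
    scores = []
    for i in range(n_genes):
        n_i = sum(1 for f in counts[i] if f > 0)       # cells detecting gene i
        # Assumption: undetected genes (n_i = 0) get IDF 0 to avoid dividing by zero.
        idf = math.log2(n_cells / n_i) if n_i else 0.0
        scores.append([(counts[i][j] / cell_max[j] if cell_max[j] else 0.0) * idf
                       for j in range(n_cells)])
    return scores

# Toy example: 3 genes by 3 cells.
counts = [[2, 0, 0],
          [4, 1, 3],
          [0, 1, 0]]
scores = tf_idf(counts)
```

In the toy matrix the second gene is detected in every cell, so its IDF is log2(3/3) = 0 and all of its scores vanish, while genes detected in a single cell are weighted by log2(3).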

TF-IDF Based Feature Selection

TF-IDF Based Clustering
Cells QC, Genes QC, Gap-Statistics Analysis
Data Transformation: Log2(x+1) or none
Feature Selection: PCA, tSNE, highly variable genes* or none
Clustering: Seurat (K-means)*, Seurat (SNN)*, GMM, K-means, Sph. K-means, HC (E/P), Louvain (E)
Data Transformation: TF-IDF
Feature Selection: High avg. TF-IDF score (Top) or Highly variable TF-IDF (Var)
Clustering: HC (E/P/C)
Data Binarization: cutoff threshold per cell based on cell avg. TF-IDF (Bin)
Clustering: HC (E/P/C/J), Greedy (E/P/C/J), Louvain (E/P/C/J)
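The "Top" feature selection and the "Bin" binarization can be illustrated with a short sketch. It assumes a TF-IDF score matrix stored with one row per gene, and it assumes that the per-cell cutoff is exactly the cell's average TF-IDF score; the slide only says the cutoff is based on that average, so the precise rule and both function names are hypothetical.

```python
def top_genes_by_avg(scores, k):
    """'Top' selection: indices of the k genes with the highest average TF-IDF."""
    avg = [sum(row) / len(row) for row in scores]
    return sorted(range(len(scores)), key=lambda i: -avg[i])[:k]

def binarize_per_cell(scores):
    """'Bin' transformation: 1 where a gene's score exceeds the cell's cutoff.
    Assumption: the cutoff is the mean TF-IDF score of that cell."""
    n_genes, n_cells = len(scores), len(scores[0])
    cutoffs = [sum(scores[i][j] for i in range(n_genes)) / n_genes
               for j in range(n_cells)]
    return [[1 if scores[i][j] > cutoffs[j] else 0 for j in range(n_cells)]
            for i in range(n_genes)]

# Toy TF-IDF scores: 3 genes by 2 cells.
scores = [[0.9, 0.1],
          [0.2, 0.2],
          [0.0, 0.6]]
selected = top_genes_by_avg(scores, k=2)
binary = binarize_per_cell(scores)
```

Binary profiles of this kind are what make a set-based distance such as Jaccard applicable between cells.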

Experimental Setup: 10x PBMC
FACS sorted blood cells of 7 types [Zheng et al., Nat. Comm. 2017]
7:1, 3:1, 1:1, 1:3, and 1:7 simulated mixtures of cell type pairs of varying dissimilarity (1000 cells/pair)
7-way mixture, equal proportions (7000 cells/mix)
All datasets available at
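A pairwise mixture at a given ratio can be simulated by drawing cells from two sorted populations. The sketch below shows one plausible way to do this, not the procedure actually used for these datasets: `simulate_mixture` is a hypothetical helper, cells are drawn without replacement, and each population is assumed to hold at least as many cells as requested.

```python
import random

def simulate_mixture(cells_a, cells_b, ratio_a, ratio_b, total=1000, seed=0):
    """Draw `total` cells from two populations at ratio_a:ratio_b.
    Assumption: sampling is without replacement from each population."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    n_a = round(total * ratio_a / (ratio_a + ratio_b))
    n_b = total - n_a
    return rng.sample(cells_a, n_a) + rng.sample(cells_b, n_b)

# Placeholder cell identifiers standing in for two sorted cell types.
type_a = [("A", i) for i in range(2000)]
type_b = [("B", i) for i in range(2000)]
mix_7_1 = simulate_mixture(type_a, type_b, 7, 1)   # 875 A cells and 125 B cells
mix_1_3 = simulate_mixture(type_a, type_b, 1, 3)   # 250 A cells and 750 B cells
```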


Experimental Setup: Pancreatic Cells
2045 pancreatic cells of 7 types [Segerstolpe et al. 2016]
Annotated based on known markers (removed for clustering)
Capture proportions: 185 acinar cells, 886 alpha cells, 270 beta cells, 197 gamma cells, 114 delta cells, 386 ductal cells, and 7 epsilon cells

Pairs: 1:1 mixtures

Pairs: 1:3/3:1 mixtures

Pairs: 1:7/7:1 mixtures

7-Way PBMC Mixture

Pancreatic Cells

SC1 Analysis Workflow
Data Exploration & QC, Imputation, Normalization, Data Transformation, Feature Selection, Dim. Reduction, Clustering, DE Analysis, Enrichment Analysis, Visualization, Cell cycle analysis


Ongoing Work
Additional methods for normalization, clustering, DE, etc.
Comprehensive validation of LSImpute on 10x data and integration in SC1
Integration of cell cycle analysis
Additional pipeline components (cell type matching, lineage inference, RNA velocity,…)
Joint analysis of bulk and single cell RNA-Seq data

Acknowledgments: Marmar Moussa

Joint analysis of bulk and scRNA-Seq
Needed to get unbiased population frequencies of cell types
Potential to identify cell types missed by capture protocols

Linear model: $x = S c$
$x = (x_1, \dots, x_6)$: expression of genes 1-6 in the heterogeneous mixture
$S = (s_{il})$, with entries $s_{11}, \dots, s_{63}$: cell type signatures of genes 1-6 in cell types 1-3
$c = (c_1, c_2, c_3)$: cell concentrations

Estimation of mixture proportions
$\min_{c} \|Sc - x\|^2$, s.t. $\sum_{l=0}^{k} c_l = 1$, $c_l \ge 0 \;\; \forall l = 0 \dots k$
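One standard way to solve a least-squares problem constrained to the probability simplex is projected gradient descent. The plain-Python sketch below illustrates that technique; the slides do not say which solver is used, so this is a swapped-in method, and the names `project_to_simplex` and `estimate_proportions` as well as the toy signature matrix are hypothetical. The step size 1 / (2 ||S||_F^2) is a conservative choice that guarantees convergence.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {c : sum(c) = 1, c >= 0} (sort-based)."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for i, u_i in enumerate(u, start=1):
        running += u_i
        t = (running - 1.0) / i
        if u_i - t > 0:          # holds for a prefix of the sorted values
            theta = t            # keep the shift from the last index that qualifies
    return [max(v_l - theta, 0.0) for v_l in v]

def estimate_proportions(S, x, iters=2000):
    """Minimize ||S c - x||^2 over the simplex by projected gradient descent.
    S[i][l]: signature of gene i in cell type l; x[i]: mixture expression."""
    n_genes, n_types = len(S), len(S[0])
    # Conservative step: 1 / (2 * squared Frobenius norm) <= 1 / Lipschitz constant.
    step = 1.0 / (2.0 * sum(s * s for row in S for s in row))
    c = [1.0 / n_types] * n_types                  # start from uniform proportions
    for _ in range(iters):
        resid = [sum(S[i][l] * c[l] for l in range(n_types)) - x[i]
                 for i in range(n_genes)]
        grad = [2.0 * sum(S[i][l] * resid[i] for i in range(n_genes))
                for l in range(n_types)]
        c = project_to_simplex([c[l] - step * grad[l] for l in range(n_types)])
    return c

# Toy example: 3 genes, 2 cell types, true proportions (0.25, 0.75).
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [0.25, 0.75, 1.0]
c_hat = estimate_proportions(S, x)
```

For larger problems a dedicated quadratic programming or non-negative least-squares solver would usually be preferable to this fixed-step loop.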

Simultaneous Estimation of Mixture Proportions and Missing Signature
$\min_{C} \|SC - X\|^2$, s.t. $\sum_{l=0}^{k} c_{lj} = 1 \;\; \forall j = 0 \dots n$, $c_{lj} \ge 0$ for $l = 0 \dots k$, $s_i \ge 0$ for $i = 0 \dots m$

Intron Retention & Cell Cycle
IR measured for T cells sorted at different stages of the cell cycle
~1K differentially retained introns with distinct patterns of retention for each stage of the cell cycle
These introns were retained from genes enriched for cell cycle (p = 8E-6)
Middleton, Robert, et al. "IRFinder: assessing the impact of intron retention on mammalian gene expression." Genome Biology 18.1 (2017): 51.
