Single cell RNAseq Kathie Mihindukulasuriya, PhD

Single cell RNAseq Kathie Mihindukulasuriya, PhD Senior Scientist, Cruchaga Lab Department of Psychiatry Washington University in St. Louis

Plan:
- Single cell RNA-seq vs bulk RNA-seq
- Current single cell protocols and platforms
- Processing single cell RNA-seq data
- Biology-based analysis
- Current challenges in single cell RNA-seq processing and analysis

Bulk RNAseq vs single cell RNASeq

Bulk RNAseq
- Advantages:
  - More economical
  - Deeper sequencing
- Challenges:
  - Gene expression is averaged across thousands of cells (may lose key signals)
  - Deconvolution may restore these signals (CIBERSORT, xCell)

Single cell RNAseq
- Advantages:
  - Can identify rare cell types
  - Can identify transcriptional differences in different cell types
- Challenges:
  - Dropout problem
  - Preserving the initial relative abundance of mRNA in a cell
  - Data are noisier and more complex than bulk RNAseq
  - Technical noise and biological variation make analysis more challenging

Hybrid
- Use scRNAseq to estimate cell type proportions in bulk RNAseq (BSEQ-sc, Cell Population Mapping (CPM))
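The deconvolution idea can be illustrated with a toy model in which a bulk profile is a weighted mix of two cell-type signatures. This is a minimal sketch under that assumption; the signatures, the mixing fraction and the function name `estimate_fraction` are hypothetical, and it uses plain least squares rather than the algorithms behind CIBERSORT, xCell, BSEQ-sc or CPM.

```python
# Hypothetical mean expression signatures for two cell types over three genes.
neuron_sig = [10.0, 1.0, 0.5]
astro_sig = [1.0, 8.0, 0.5]

# Simulated bulk sample: 70% neurons, 30% astrocytes (made-up proportions).
bulk = [0.7 * n + 0.3 * a for n, a in zip(neuron_sig, astro_sig)]

def estimate_fraction(bulk, sig_a, sig_b):
    """Least-squares estimate of p in bulk = p * sig_a + (1 - p) * sig_b."""
    diff = [a - b for a, b in zip(sig_a, sig_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("signatures are identical; fraction is not identifiable")
    p = sum((x - b) * d for x, b, d in zip(bulk, sig_b, diff)) / denom
    # Clip to the valid range for a proportion.
    return min(1.0, max(0.0, p))

neuron_fraction = estimate_fraction(bulk, neuron_sig, astro_sig)
print(round(neuron_fraction, 3))
```

In the hybrid approach the signatures would come from scRNAseq clusters, and real data would include many more genes and cell types plus noise.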

What are some types of questions that can be answered by scRNAseq?

scRNA-seq protocols

Full-length transcript sequencing approaches
- Examples: Smart-seq2, Fluidigm C1 (96 cell)
- Pluses:
  - Increased number of mappable reads
  - Suitable for: cell-type discovery, assessing tissue composition, allelic gene expression analysis, isoform discovery
- Minuses:
  - Cannot be multiplexed via sample pooling into a single tube for library generation (increased cost and labor)
  - No UMIs (no digital quantification of transcripts)

3’ end or 5’ end approaches
- Examples:
  - 3’: Drop-seq, Seq-Well, Chromium, DroNC-seq, Fluidigm C1 (800 cell)
  - 5’: STRT-seq
- Pluses:
  - UMIs (multiplexing of samples, improved gene expression quantification and throughput)
  - Lower cost
- Minuses:
  - Not suitable for alternative splicing (AS) detection, allelic expression exploration and RNA-editing identification
  - Less sensitivity than full-length

scRNA-seq platforms

Fluidigm C1 (microfluidics) / Fluidigm C1 mRNA Seq HT
- Pros:
  - Allows visual inspection of captured cells (can exclude empty wells and doublets)
  - Customizable
  - Lower false positives than tube-based technologies
  - Less bias than tube-based technologies
- Cons:
  - Only 2 inlets for samples
  - Low throughput (up to 96 cells / 800 cells)
  - >10,000 / 1,000 cells required for capture
  - Relatively long prep time (2 runs per day)
  - Capture efficiency depends on uniformity of cell size and shape
  - High cost of cartridges
  - Cells must be fresh and processed immediately
- C1 = full-length transcript / HT = 3’
- 300–7,000 genes per cell

Droplet-based (10X Genomics Chromium)
- Pros:
  - Very high throughput
  - Up to 8 samples per run
  - System cost relatively low
- Cons:
  - Limited customizability (little control over cell input; susceptible to selection biases)
- 500–1,500 genes per cell

Plate methods (SMART-seq2)
- Pros:
  - Can simultaneously measure genomic DNA and transcriptome
  - Not restricted by cell size, shape, homogeneity, or total numbers (suitable for very rare cell populations)
  - Economical (uses off-the-shelf reagents)
- Cons:
  - No UMIs and barcodes (no gene-level quantification or multiplexing of samples)
- ~4,000–7,000 genes per cell

Fluidigm C1

Methods of single-cell isolation:
- Limiting dilution: not very efficient
- Micromanipulation: time consuming; low throughput
- FACS: yields highly purified single cells, but only if the cells express a cell surface marker

Methods of single-cell isolation:
- Laser capture microdissection: isolates cells from solid samples
- Microfluidic technology:
  - Low sample consumption
  - Low analysis cost
  - Precise fluid control
  - Decreased risk of external contamination
- CellSearch: antibody conjugated to magnetic particles to isolate desired cells; good for rare cell types

Droplet-based
- Workflow: cell lysis -> reverse transcription into first-strand cDNA -> second-strand synthesis -> cDNA amplification
- UMIs (unique molecular identifiers):
  - 4–10 random nucleotides introduced with the primer used for cDNA generation, before amplification
  - Multiple reads with the same UMI sequence that map to the same gene = one molecule
- Cell barcodes: labeling of cDNA with a cell-specific DNA sequence, which allows multiplexing at an early stage
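The UMI rule above (reads with the same UMI mapping to the same gene count as one molecule) can be sketched as a counting step. This is a minimal illustration with made-up barcodes and UMIs; the function name `count_molecules` is hypothetical, and production pipelines typically also correct sequencing errors in UMIs, which is not shown here.

```python
from collections import defaultdict

# Hypothetical aligned reads as (cell_barcode, gene, umi); all values are made up.
reads = [
    ("AAAC", "APOE", "GTCA"),
    ("AAAC", "APOE", "GTCA"),  # PCR duplicate of the read above
    ("AAAC", "APOE", "TTGC"),
    ("AAAC", "GFAP", "GTCA"),  # same UMI, different gene: a separate molecule
    ("CCGT", "APOE", "GTCA"),  # same UMI, different cell: a separate molecule
]

def count_molecules(reads):
    """Collapse reads sharing cell barcode, gene and UMI into one molecule."""
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}

counts = count_molecules(reads)
for (cell, gene), n in sorted(counts.items()):
    print(cell, gene, n)
```

The cell barcode does the demultiplexing: grouping by barcode first is what turns one pooled library into per-cell counts.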

Plate-based Template Switching Oligonucleotide

Processing scRNA-seq data
- Map reads to the genome, not the transcriptome
  - Decreases multi-mapping reads
  - Critical for snRNA-seq
  - Splice-aware aligners (STAR)
  - Pseudoaligners (faster)
- Associate reads with genes or transcripts
  - featureCounts
  - HTSeq
- Remove PCR noise using UMIs
- Demultiplex to identify cells
- Remove barcodes from cell-free mRNA (much lower average read count than barcodes derived from intact cells)
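Because barcodes from cell-free mRNA have far lower counts than barcodes from intact cells, the simplest way to separate them is a count threshold. The sketch below assumes a cutoff at a fixed fraction of the largest barcode total; that rule, the `fraction` parameter and the data are illustrative assumptions, not the method of any particular pipeline.

```python
# Hypothetical total UMI counts per barcode (made-up values).
barcode_totals = {
    "AAAC": 5200,
    "CCGT": 4100,
    "GGTA": 3900,
    "TTAG": 35,
    "ACGT": 12,
    "TGCA": 8,
}

def call_cells(totals, fraction=0.1):
    """Keep barcodes whose total is at least `fraction` of the largest total."""
    cutoff = fraction * max(totals.values())
    return {barcode for barcode, n in totals.items() if n >= cutoff}

cell_barcodes = call_cells(barcode_totals)
print(sorted(cell_barcodes))
```

On real data the gap between the two groups is less clean, so the cutoff is usually chosen by inspecting the ranked barcode counts rather than fixed in advance.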

Processing scRNA-seq data
- Remove low-quality ‘cells’ based on mapping statistics: overrepresentation of mitochondrial RNAs, ribosomal RNAs (>40%), spike-ins, adapters and/or reads that map outside of exons
- Normalize to correct for unwanted variation among cells caused by technical variation
- Remove batch effects
- Biology-based analysis (like differential expression)
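The filtering and normalization steps can be sketched on a tiny count table. This is a hedged example: the 40% cutoff echoes the figure on the slide but is applied here to mitochondrial reads as an assumption, the "MT-" gene prefix and the scale factor of 10,000 are common conventions rather than anything stated in the talk, and all counts are made up.

```python
import math

# Hypothetical raw counts per cell; genes prefixed "MT-" are treated as mitochondrial.
cells = {
    "cell_1": {"APOE": 30, "GFAP": 10, "MT-CO1": 10},
    "cell_2": {"APOE": 5, "GFAP": 5, "MT-CO1": 40},  # mostly mitochondrial reads
    "cell_3": {"APOE": 0, "GFAP": 80, "MT-CO1": 20},
}

def mito_fraction(counts):
    """Fraction of a cell's counts that come from mitochondrial genes."""
    total = sum(counts.values())
    mito = sum(n for gene, n in counts.items() if gene.startswith("MT-"))
    return mito / total if total else 0.0

def filter_cells(cells, max_mito=0.4):
    """Drop cells whose mitochondrial fraction exceeds `max_mito`."""
    return {cid: c for cid, c in cells.items() if mito_fraction(c) <= max_mito}

def log_normalize(counts, scale=10_000):
    """Scale each cell to the same total, then apply log(1 + x)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: math.log1p(n / total * scale) for gene, n in counts.items()}

kept = filter_cells(cells)
normalized = {cid: log_normalize(c) for cid, c in kept.items()}
print(sorted(kept))
```

Dividing by each cell's total removes differences in sequencing depth between cells; batch effects need a separate correction step that is not shown.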

Some examples of biology-based analysis
Purpose: to directly investigate AD brain changes in cell proportion and gene expression at single-cell resolution
Del-Aguila, J.L. et al. A single-nuclei RNA sequencing study of Mendelian and sporadic AD in the human brain. bioRxiv. Mar. 30, doi:

To identify different cell types in brain samples: cells are grouped by a CGS approach (unsupervised graph-based clustering) and the clusters are then annotated by cell type using marker genes
t-distributed Stochastic Neighbor Embedding (tSNE) is a dimensionality reduction technique, usually shown as a plot
Differences with PCA:
1. tSNE is typically used to produce a 2D separation
2. tSNE is non-deterministic (you won't get exactly the same output each time you run it)
3. tSNE tends to cope better with non-linear signals in your data (less impact of outliers; visible separation between relevant groups is improved)
4. After tSNE the input features are no longer identifiable, and you cannot make any inference based only on the output of tSNE
NOTE: very computationally intensive (may need to apply another dimensionality reduction technique like PCA first)

To identify different cell types in brain samples:
Classic Gene Set (CGS) from pooled subjects:
- Seurat FindVariableGenes -> 2,360 genes -> calculate 100 PCs -> identify the optimal number of PCs (65)
- Result: 6 cell types, 25 clusters

To identify different cell types in brain samples:
Consensus Gene Set (ConGen) from each subject:
- Seurat FindVariableGenes -> 2,447 (S1); 2,354 (S2); 1,972 (S3) genes -> R set intersection to identify common genes (1,434) -> calculate 100 PCs -> identify the optimal number of PCs (25)
- Result: 14 cell types; better resolution
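The consensus step is a set intersection over the per-subject variable gene lists. The study used R for this; the Python sketch below only illustrates the operation, with short made-up gene lists standing in for the FindVariableGenes output and a hypothetical helper name `consensus_gene_set`.

```python
from functools import reduce

# Hypothetical per-subject variable gene lists (stand-ins for real output).
variable_genes = {
    "S1": ["APOE", "GFAP", "SNAP25", "MBP", "CX3CR1"],
    "S2": ["APOE", "GFAP", "SNAP25", "PDGFRA"],
    "S3": ["GFAP", "SNAP25", "APOE", "CLDN5"],
}

def consensus_gene_set(gene_lists):
    """Return the genes that are variable in every subject, sorted by name."""
    sets = [set(genes) for genes in gene_lists.values()]
    return sorted(reduce(set.intersection, sets))

consensus = consensus_gene_set(variable_genes)
print(consensus)
```

Restricting to shared genes discards genes that vary in only one subject, which is one plausible reason the consensus set separates cell types rather than individuals.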

Cluster annotation: evaluate the expression of marker genes (from the literature) for neurons, astrocytes, oligodendrocytes, microglia, oligodendrocyte precursor cells, endothelial cells, and excitatory and inhibitory neurons -> use Seurat DotPlot to visualize the average expression of the marker genes in each cluster
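The numbers behind a dot plot of this kind are, for each cluster and marker gene, the average expression and the fraction of cells expressing the gene. The sketch below computes both on made-up data and then labels each cluster by its strongest marker; the two-marker panel, the cluster labels and the "highest mean wins" rule are simplifying assumptions, not the study's procedure.

```python
from collections import defaultdict

# Hypothetical normalized expression values and cluster assignments.
expression = {
    "cell_1": {"SNAP25": 4.0, "GFAP": 0.0},
    "cell_2": {"SNAP25": 2.0, "GFAP": 0.0},
    "cell_3": {"SNAP25": 0.0, "GFAP": 3.0},
    "cell_4": {"SNAP25": 1.0, "GFAP": 5.0},
}
clusters = {"cell_1": 0, "cell_2": 0, "cell_3": 1, "cell_4": 1}
markers = {"SNAP25": "neuron", "GFAP": "astrocyte"}

def cluster_marker_summary(expression, clusters, markers):
    """Return {cluster: {gene: (mean expression, fraction of cells expressing)}}."""
    members = defaultdict(list)
    for cell, cluster in clusters.items():
        members[cluster].append(cell)
    summary = {}
    for cluster, cell_ids in members.items():
        summary[cluster] = {}
        for gene in markers:
            values = [expression[c].get(gene, 0.0) for c in cell_ids]
            mean = sum(values) / len(values)
            fraction = sum(v > 0 for v in values) / len(values)
            summary[cluster][gene] = (mean, fraction)
    return summary

def annotate(summary, markers):
    """Label each cluster by the marker gene with the highest mean expression."""
    return {
        cluster: markers[max(stats, key=lambda gene: stats[gene][0])]
        for cluster, stats in summary.items()
    }

summary = cluster_marker_summary(expression, clusters, markers)
labels = annotate(summary, markers)
print(labels)
```

In practice several markers per cell type are inspected together, and ambiguous clusters are resolved by eye rather than by a single rule.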

Workflow Analysis Plan

Single cell analysis: current challenges
- Biggest challenge: missing data (excess zeros), known as “dropout”
  - Technical (not captured)
  - Biological (really no expression)
  - Sampling (sequencing just not deep enough)
  - These causes can’t be distinguished from one another
  - Dropout = largest source of variation
- How to deal with missing data?
  - Increase read depth
  - Impute the missing data based on clustered cells (DrImpute, CIDR, MAGIC, scImpute)
  - Impute the missing data based on bulk RNAseq data (SCRABBLE)
  - Use biological knowledge, e.g. gene-gene coexpression (netNMF-sc)
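The idea of imputing from clustered cells can be shown with a deliberately simple rule: within one cluster, a zero is filled with the mean of the non-zero values, but only when enough cells in the cluster express the gene. This is a toy sketch, not the algorithm of DrImpute, CIDR, MAGIC or scImpute; the `min_expressing` threshold, the function name and the data are assumptions.

```python
# Hypothetical normalized expression for cells already assigned to one cluster.
cluster_cells = {
    "cell_1": {"APOE": 3.0, "TREM2": 0.0},
    "cell_2": {"APOE": 0.0, "TREM2": 0.0},  # the APOE zero is a candidate dropout
    "cell_3": {"APOE": 5.0, "TREM2": 0.0},
    "cell_4": {"APOE": 4.0, "TREM2": 1.0},
}

def impute_cluster(cells, min_expressing=0.5):
    """Fill zeros with the cluster mean of non-zero values for widely expressed genes."""
    n_cells = len(cells)
    genes = sorted({gene for counts in cells.values() for gene in counts})
    imputed = {cid: dict(counts) for cid, counts in cells.items()}
    for gene in genes:
        nonzero = [c.get(gene, 0.0) for c in cells.values() if c.get(gene, 0.0) > 0]
        if len(nonzero) / n_cells < min_expressing:
            continue  # too few expressing cells: treat the zeros as real
        fill = sum(nonzero) / len(nonzero)
        for cid in imputed:
            if imputed[cid].get(gene, 0.0) == 0:
                imputed[cid][gene] = fill
    return imputed

imputed = impute_cluster(cluster_cells)
print(imputed["cell_2"])
```

The threshold is where the ambiguity noted above shows up: the data alone cannot say whether a zero is technical or biological, so any rule of this kind risks inventing expression that was never there.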

Explosion of methods and software, but best practices are not yet clear
Doublet identification:
- demuxlet [shell]: Multiplexed droplet single-cell RNA-sequencing using natural genetic variation
- DoubletFinder [R]: Doublet detection in single-cell RNA sequencing data using artificial nearest neighbors (bioRxiv)
- DoubletDecon [R]: Cell-State Aware Removal of Single-Cell RNA-Seq Doublets (bioRxiv)
- DoubletDetection [R, Python]: A Python3 package to detect doublets (technical errors) in single-cell RNA-seq count matrices. An R implementation is in development.
- Scrublet [Python]: Computational identification of cell doublets in single-cell transcriptomic data (bioRxiv)

Assigning cell types to clusters of cells:
- Dimensionality reduction (tSNE, PCA, UMAP) -> unsupervised clustering -> annotation of clusters
- Use of marker genes
  - Requires known marker genes
  - Expression must be high enough to be measured (not always true for known cell surface markers)
  - Subjective (different researchers choose different markers)
  - Novel cell types?
- Use of annotated training data (e.g. reference atlas)
  - Comparisons with annotated reference data using automatically chosen genes that optimally discriminate between cell types (scmap, SingleR)
  - Allow the assignment of cells to an intermediate or unassigned type (CHETAH)
- Challenge: human data often clusters by individual, rather than cell type
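Reference-based annotation can be sketched as "assign each cell to the most similar reference profile, or leave it unassigned if nothing is similar enough". The example below uses Pearson correlation with a fixed cutoff; that similarity measure, the `min_corr` value, the gene panel and all numbers are assumptions for illustration, and scmap, SingleR and CHETAH each use their own, more elaborate procedures.

```python
import math

# Hypothetical reference profiles: mean expression per cell type over
# the genes SNAP25, GFAP, MBP, CX3CR1 (in that order).
reference = {
    "neuron": [9.0, 0.5, 0.2, 0.1],
    "astrocyte": [0.3, 8.0, 0.4, 0.2],
    "oligodendrocyte": [0.2, 0.6, 9.5, 0.1],
}

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def assign_cell_type(profile, reference, min_corr=0.7):
    """Return the best-correlated reference type, or 'unassigned' below the cutoff."""
    best_type, best_corr = max(
        ((cell_type, pearson(profile, ref)) for cell_type, ref in reference.items()),
        key=lambda pair: pair[1],
    )
    return best_type if best_corr >= min_corr else "unassigned"

query_cells = {
    "cell_a": [7.5, 0.2, 0.0, 0.3],  # neuron-like
    "cell_b": [0.1, 0.4, 0.2, 6.0],  # microglia-like; no such type in the reference
}
assignments = {cid: assign_cell_type(p, reference) for cid, p in query_cells.items()}
print(assignments)
```

The "unassigned" outcome is what lets this style of annotation flag cell types missing from the reference instead of forcing them into the nearest known label.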

How to combine datasets for analysis:
- scmap: projection of single-cell RNA-seq data across data sets
- scMerge: uses genes that do not change across all samples and a robust algorithm to infer pseudoreplicates between datasets

Look to advances in single cell RNA-seq cancer research for solutions to these problems
