Amos Tanay Nir Yosef 1st HCA Jamboree, 8/2017


1 Task 1: Identifying cells from droplet data and removing erroneous molecules
Amos Tanay, Nir Yosef. 1st HCA Jamboree, 8/2017

2 Important sources of inaccuracy in barcoded scRNA-seq
Barcode sequences may represent empty droplets/wells.
Ambient mRNA present in the cell suspension contaminates the dataset with non-specific UMIs.
Barcode (or UMI) errors are introduced during synthesis, amplification, library prep or sequencing.

3 Barcodes and UMIs are attached to beads, which are then used for RNA capture
Drop-seq: base-by-base random synthesis (A, C, T, G mixed at each step, base 1 through base 12).
10x: whitelist of ~750k sequences.
inDrop: random pairing of known barcodes (barcode pool* #1 x barcode pool* #2).
*In Klein 2015, n=384.
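When barcodes come from a known whitelist, an observed barcode that is not on the list can be matched to a listed sequence one substitution away. The sketch below is one possible way to do this, not the method used by any particular pipeline; the function name and the rule of rejecting ambiguous matches are assumptions made for illustration.

```python
def correct_to_whitelist(barcode, whitelist):
    """Return the whitelisted barcode matching `barcode`, or None.

    Assumption: only single-base substitutions are corrected, and a
    barcode with more than one whitelisted neighbor is rejected.
    """
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i, base in enumerate(barcode):
        for alt in "ACGT":
            if alt == base:
                continue
            neighbor = barcode[:i] + alt + barcode[i + 1:]
            if neighbor in whitelist:
                candidates.add(neighbor)
    # Keep the correction only when it is unambiguous.
    return candidates.pop() if len(candidates) == 1 else None
```

Rejecting ambiguous matches trades some sensitivity for fewer wrongly merged barcodes.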

4 Read structure
10x (v2 default): Genomic: 98bp; Barcode (cell): 16bp; UMI (transcript): 10bp
DropSeq (Macosko '15): Genomic: 60bp; Barcode (cell): 12bp; UMI (transcript): 8bp
inDrop (Klein '15): Genomic: 60bp; Barcode (cell): 8-11bp; Adaptor (22bp, fixed); Barcode (cell): 8bp; UMI (transcript): 8bp; poly-T capture primer (remaining)
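The fixed-length layouts can be sketched as a simple slicing step. This is a minimal sketch assuming the cell barcode comes first on the barcode read and is followed directly by the UMI; the dictionary keys and the function name are hypothetical, and the lengths are the ones listed on this slide. inDrop is left out because its first barcode has variable length.

```python
# (barcode length, UMI length) per technology, as listed on the slide.
# Assumption: the barcode precedes the UMI on the barcode read.
LAYOUTS = {
    "10x_v2": (16, 10),
    "dropseq": (12, 8),
}

def split_barcode_umi(read, technology):
    """Split a barcode read into (cell_barcode, umi), or None if too short."""
    barcode_len, umi_len = LAYOUTS[technology]
    if len(read) < barcode_len + umi_len:
        return None
    barcode = read[:barcode_len]
    umi = read[barcode_len:barcode_len + umi_len]
    return barcode, umi
```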

5 Causes of barcode/UMI inaccuracy
Deletions during cell barcode synthesis:
  DropSeq: inefficiencies during serial extension lead to a short barcode. On the read this looks like a random ending to the barcode, with slippage of T into all reads (prior to additional errors).
  inDrop: only one barcode, no adaptor sequence (easily filtered).
  10x: not an issue (primers are purified).
Errors during amplification, library prep and sequencing:
  Relevant to UMI and cell barcode in all technologies.
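One way to look for the DropSeq deletion signature is to check, per barcode, how often the last UMI base is T: if the barcode is one base short, the UMI window shifts into the poly-T stretch. The following is a minimal sketch under that assumption; the function names and both thresholds are hypothetical and would need tuning on real data.

```python
from collections import defaultdict

def last_base_t_fraction(records):
    """records: iterable of (barcode, umi) pairs.

    Returns {barcode: (n_distinct_umis, fraction of UMIs ending in 'T')}.
    """
    umis = defaultdict(set)
    for barcode, umi in records:
        if umi:
            umis[barcode].add(umi)
    return {
        bc: (len(s), sum(u.endswith("T") for u in s) / len(s))
        for bc, s in umis.items()
    }

def flag_suspect_barcodes(records, min_umis=20, min_t_fraction=0.8):
    """Barcodes whose UMIs almost always end in T (thresholds are assumptions)."""
    stats = last_base_t_fraction(records)
    return {
        bc for bc, (n, frac) in stats.items()
        if n >= min_umis and frac >= min_t_fraction
    }
```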

6 The Task: find real barcodes and real UMIs
Input: unfiltered triplets per dataset: Gene*, Barcode, UMI, [# of PCR duplicates]
*For the Macosko et al. data (human/mouse mix), we also indicate the organism.
Output:
  Set of real barcodes
  Set of real UMIs per barcode
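A first step with this input is to count distinct molecules per barcode. The sketch below assumes each record is a tuple of (gene, barcode, UMI), optionally followed by the PCR duplicate count; the function name is hypothetical and the on-disk layout of the input files is not specified here.

```python
from collections import defaultdict

def umis_per_barcode(triplets):
    """Count distinct (gene, UMI) pairs per barcode.

    Extra trailing fields (e.g. the PCR duplicate count) are ignored.
    """
    molecules = defaultdict(set)
    for gene, barcode, umi, *_ in triplets:
        molecules[barcode].add((gene, umi))
    return {bc: len(pairs) for bc, pairs in molecules.items()}
```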

7 Key considerations: finding real cells/barcodes
The number of UMIs per barcode can be used as a first approximation, by finding a threshold value.
  The UMI-count distribution can be analyzed using several approaches (e.g., inflection point, 99th percentile divided by a constant).
  This can become problematic if the cell population is a mixture of cells with different sizes. It may make sense to study the distribution of specific genes.
A deletion in the barcode is likely to be associated with "T" slippage at the end of the read.
Detect and use biases in substitutions and in the location of deletions.
When merging two barcodes, look at their genes/UMIs.
For 10x: use the whitelist (provided to you).
See also: Tom Smith's blog.
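The "99th percentile divided by a constant" rule could look roughly like the sketch below. It assumes the percentile is taken over the top `expected_cells` barcodes by UMI count (the slide does not say over which barcodes), and the divisor of 10 is a placeholder; both names and defaults are assumptions.

```python
def umi_count_threshold(umi_counts, expected_cells, percentile=0.99, divisor=10.0):
    """Threshold = percentile of the top `expected_cells` counts / divisor."""
    top = sorted(umi_counts, reverse=True)[:expected_cells]
    if not top:
        raise ValueError("no barcodes to threshold")
    top.sort()
    index = min(len(top) - 1, int(percentile * len(top)))
    return top[index] / divisor

def call_cells(counts_by_barcode, expected_cells, percentile=0.99, divisor=10.0):
    """Return the barcodes whose UMI count reaches the threshold."""
    threshold = umi_count_threshold(
        counts_by_barcode.values(), expected_cells, percentile, divisor
    )
    return {bc for bc, n in counts_by_barcode.items() if n >= threshold}
```

A single global threshold like this is exactly what becomes problematic for mixtures of large and small cells.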

8 Key considerations: finding bad UMIs
Assuming that ambient UMIs are distributed homogeneously across barcodes can be useful for detecting them.
  Can the background UMI distribution be inferred from non-cell barcodes?
Sequence/synthesis errors can be inferred using genes known to be specific to a subpopulation (e.g. human UMIs in mouse cells, globin genes in non-erythrocytes).
The ambient UMI distribution may assist in finding real cell barcodes, and vice versa.
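Inferring the background distribution from non-cell barcodes can be sketched as pooling their UMIs and normalizing per gene. This assumes a nested mapping of barcode to per-gene UMI counts and a set of already called cells; both the data layout and the function name are hypothetical.

```python
from collections import Counter

def ambient_profile(gene_counts_by_barcode, cell_barcodes):
    """Pool UMIs from non-cell barcodes into a gene frequency profile.

    gene_counts_by_barcode: {barcode: {gene: umi_count}}
    cell_barcodes: set of barcodes already called as cells
    """
    pooled = Counter()
    for barcode, gene_counts in gene_counts_by_barcode.items():
        if barcode not in cell_barcodes:
            pooled.update(gene_counts)
    total = sum(pooled.values())
    if total == 0:
        return {}
    return {gene: n / total for gene, n in pooled.items()}
```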

9 Datasets for improving your barcode/UMI filters
DS_1: 16 simulated datasets, each a mixture of real data and simulated noisy barcodes, combining two cell states in different ratios. [location: task1_simu]
DS_2: Droplet data from Macosko et al., human/mouse mix. ~570 cells [location: Macosko_2015/counts/*raw]
DS_3: Droplet data from 10x on human PBMC. ~8.3k cells [location: 10x_pbmc/counts/*raw]
Multimapped reads were discarded.

10 Output file formats
fname: groupX_task1_DS_X_cell_barcodes_i (you can have up to five solutions)
Format: tab-delimited text
  Field 1: Cell barcode
  Field 2: Source barcode (a barcode in the data that was mapped to the cell barcode)
  (Barcodes not in the file are considered noise.)
fname: groupX_task1_DS_X_cell_umis_i (you can have up to five solutions)
  Field 1: Gene
  Field 2: Barcode
  Field 3: good_umi_count
  Field 4: filtered_umi_count
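Writing the two tab-delimited files could be sketched as follows. The reading of the name pattern (group number, dataset number, solution index) and all function names are assumptions; the field order follows the slide.

```python
import csv
import io

def output_name(group, dataset, kind, solution):
    """Build a file name following the pattern on the slide (assumed reading)."""
    return f"group{group}_task1_DS_{dataset}_cell_{kind}_{solution}"

def write_cell_barcodes(handle, source_to_cell):
    """One row per source barcode: cell barcode, then source barcode."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for source, cell in sorted(source_to_cell.items()):
        writer.writerow([cell, source])

def write_cell_umis(handle, rows):
    """rows: iterable of (gene, barcode, good_umi_count, filtered_umi_count)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene, barcode, good, filtered in rows:
        writer.writerow([gene, barcode, good, filtered])

# Usage with an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
write_cell_barcodes(buffer, {"AAAT": "AAAA", "AAAA": "AAAA"})
```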

11 Estimating performance on simulated data
Simulated data, focused only on cell detection (assuming all barcodes are real, but only some represent real cells):
  Specificity and sensitivity of calling cells
  Specificity/sensitivity stratified by number of UMIs
  Breakdown of sensitivity between the two cell states/types
Simulated data, calling barcodes:
  Specificity and sensitivity of calling barcodes, plus the above metrics for calling cells
Note that there are no UMIs to correct in the simulation; UMI correction should be considered on the real data.
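The cell-calling metrics can be computed from three sets of barcodes. This is a sketch only; the function names and the half-open UMI bins are assumptions and do not describe the organizers' scoring code.

```python
def sensitivity_specificity(called, truth, all_barcodes):
    """Return (sensitivity, specificity) of `called` against `truth`."""
    called, truth, all_barcodes = set(called), set(truth), set(all_barcodes)
    negatives = all_barcodes - truth
    tp = len(called & truth)
    fn = len(truth - called)
    fp = len(called & negatives)
    tn = len(negatives - called)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

def stratified_by_umis(called, truth, umi_counts, bins):
    """Same metrics within each UMI-count bin [lo, hi)."""
    called, truth = set(called), set(truth)
    result = {}
    for lo, hi in bins:
        in_bin = {bc for bc, n in umi_counts.items() if lo <= n < hi}
        result[(lo, hi)] = sensitivity_specificity(
            called & in_bin, truth & in_bin, in_bin
        )
    return result
```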

12 Estimating performance on real data
Simple metrics to begin with: number of called cells, fraction of UMIs explained by cells, number of mixed human-mouse cells.
Then, tests on the information content of cells and non-cells (hoping to see low information content in filtered cells and high in the retained ones). Here are some ideas:
  Detect k called cells that have the lowest UMI count (Group 1).
  Detect 100 called non-cells that have the highest UMI count (Group 2).
  Sample an additional 100 called cells, with any UMI count (Group 3).
  Generate 100 randomized ambient profiles by sampling each profile from the UMIs in non-cells with replacement (Group 4).
  Down-sample cells in groups 1-3 to generate profiles with the same number of UMIs.
  Cluster the 400 cells into four clusters using e.g. hclust on cell-to-cell correlations (normalize the genes).
  We want good mixing of groups 2/4, and perhaps of 1/3 (if cell size is not strongly correlated with expression profile).
  OR compute any other indication of the information content in the four clusters, for example the distribution of the 1-nearest-neighbor distance.
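The down-sampling and randomized ambient profiles can be sketched with the standard library. This assumes each profile is a mapping of gene to UMI count; the function names are hypothetical, and clustering (e.g. hclust) is left to the analysis environment.

```python
import random
from collections import Counter

def downsample(profile, n_umis, rng):
    """Draw n_umis UMIs without replacement from one cell's profile."""
    pool = [gene for gene, count in profile.items() for _ in range(count)]
    if n_umis > len(pool):
        raise ValueError("cell has fewer UMIs than requested")
    return Counter(rng.sample(pool, n_umis))

def random_ambient_profile(non_cell_profiles, n_umis, rng):
    """Draw n_umis UMIs with replacement from the pooled non-cell UMIs."""
    pool = [
        gene
        for profile in non_cell_profiles
        for gene, count in profile.items()
        for _ in range(count)
    ]
    return Counter(rng.choices(pool, k=n_umis))
```

Passing an explicit `random.Random(seed)` keeps the generated profiles reproducible across runs.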

13 Evaluating UMI correction
Look at the profile of UMIs that were removed per called cell, and compute their:
  cell-to-cell correlations (should be low after normalizing each gene)
  correlation to the distribution of UMIs in non-cells (should be high)
Look at statistics for the variability of the UMI distribution before and after UMI correction: normalized variance should increase after removing the noise.
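Both checks reduce to two small statistics. The sketch below takes "normalized variance" to mean variance divided by the mean, which is an assumption since the slide does not define it; the function names are hypothetical.

```python
import math
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

def normalized_variance(values):
    """Variance over mean (assumed definition of 'normalized variance')."""
    m = mean(values)
    return pvariance(values) / m if m else float("nan")
```

For example, `pearson` can compare a cell's removed-UMI profile with the non-cell profile over a shared gene order, and `normalized_variance` can be applied to one gene's counts across cells before and after correction.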

14 Additional observations to make with real data
Collect error stats:
  Base-pair substitutions
  Position of deletions in DropSeq
  Distribution of the number of UMIs associated with corrected barcodes
  When collapsing two barcodes: stats of shared UMIs/genes
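Substitution statistics can be tallied from pairs of observed and corrected barcodes. This is a minimal sketch assuming the correction step has already produced such pairs; pairs of unequal length (deletions) are skipped here and would need separate handling. The function name is hypothetical.

```python
from collections import Counter

def substitution_stats(pairs):
    """Count substitutions as (position, corrected_base, observed_base).

    pairs: iterable of (observed_barcode, corrected_barcode).
    """
    stats = Counter()
    for observed, corrected in pairs:
        if len(observed) != len(corrected):
            continue  # indels are not handled in this sketch
        for position, (obs, cor) in enumerate(zip(observed, corrected)):
            if obs != cor:
                stats[(position, cor, obs)] += 1
    return stats
```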

15 References
See the discussion in the following blog post.
Correcting UMIs (Sudbery lab): UMI-tools package
Correcting cell barcodes (Pachter lab): Sircel package
A technology feature in Nature Methods (Apr '17) provides additional good reading on issues with UMI errors.

