Amos Tanay Nir Yosef 1st HCA Jamboree, 8/2017


Task 1: Identifying cells from droplet data and removing erroneous molecules

Important sources of inaccuracy in barcoded scRNA-seq
- Barcode sequences may represent empty droplets/wells.
- Ambient mRNA that is present in the cell suspension contaminates the dataset with non-specific UMIs.
- Barcode (or UMI) errors are introduced during synthesis, amplification, library prep or sequencing.

Barcodes and UMIs are attached to beads, which are then used for RNA capture
- Drop-seq: base-by-base random synthesis (A, C, T and G are mixed at Base 1, Base 2, … Base 12).
- 10x, inDrop: random pairing of known barcodes (Barcode pool* #1 x Barcode pool* #2 = whitelist of ~750k sequences).
* In Klein 2015; n=384
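When barcodes come from a known whitelist, an observed barcode can be checked against it directly. The sketch below shows one possible way to do this, assuming that only single-base substitutions are rescued and that ambiguous matches are dropped. The function name `correct_to_whitelist` and the toy whitelist are hypothetical, not part of any provided tool.

```python
def correct_to_whitelist(barcode, whitelist):
    """Return the whitelist barcode matching `barcode`, allowing one substitution.

    Returns None when nothing matches or when more than one whitelist
    entry is a single substitution away (ambiguous).
    """
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i, base in enumerate(barcode):
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = barcode[:i] + alt + barcode[i + 1:]
            if candidate in whitelist:
                candidates.add(candidate)
    return candidates.pop() if len(candidates) == 1 else None


# Toy whitelist for illustration; a real one would hold the known sequences.
toy_whitelist = {"AAAA", "CCCC", "AAAT"}
```

This approach does not apply to Drop-seq, where barcodes are synthesized randomly and no whitelist exists.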

Read structure
- 10x (v2 default): Genomic: 98bp; Barcode (Cell): 16bp; UMI (transcript): 10bp
- DropSeq (Macosko '15): Genomic: 60bp; Barcode (Cell): 12bp; UMI (transcript): 8bp
- inDrop (Klein '15): Genomic: 60bp; Barcode (Cell): 8-11bp; Adaptor (22bp), fixed; Barcode (Cell): 8bp; UMI (transcript): 8bp; Poly-T capture primer (remaining)
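The fixed-length layouts for 10x and DropSeq can be turned into a small parser. This is a minimal sketch that takes the barcode and UMI lengths from the slide and assumes the cell barcode comes first on the barcode read, followed directly by the UMI. inDrop is left out because its first barcode has variable length. The names `LAYOUTS` and `split_barcode_read` are illustrative.

```python
# (cell barcode length, UMI length) per technology, as listed on the slide.
# Assumption: the barcode precedes the UMI on the barcode read.
LAYOUTS = {
    "10x_v2": (16, 10),
    "dropseq": (12, 8),
}


def split_barcode_read(seq, technology):
    """Split a barcode read into (cell barcode, UMI)."""
    bc_len, umi_len = LAYOUTS[technology]
    if len(seq) < bc_len + umi_len:
        raise ValueError("read is shorter than barcode + UMI")
    return seq[:bc_len], seq[bc_len:bc_len + umi_len]


# Made-up read: 16bp barcode, 10bp UMI, then the start of the poly-T stretch.
example_read = "ACGTACGTACGTACGT" + "TTGCATTGCA" + "TTTTTT"
barcode, umi = split_barcode_read(example_read, "10x_v2")
```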

Causes of barcode/UMI inaccuracy
- Deletions during cell barcode synthesis
  - DropSeq: inefficiencies during serial extension lead to a short barcode. On the read this will look like a random ending to the barcode, with slippage of a T into all reads (prior to additional errors).
  - inDrop: only one barcode, no adaptor sequence (easily filtered).
  - 10x: not an issue (primers are purified).
- Errors during amplification, library prep and sequencing
  - Relevant to UMI and cell barcode in all technologies.
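One way to look for the DropSeq deletion signature is to ask, per barcode, how many of its UMIs end in T. The sketch below assumes that a shortened barcode shifts the read by one base, so that the last UMI position is read from the poly-T stretch. The cutoffs `min_umis` and `min_fraction` are hypothetical defaults, not recommended values.

```python
from collections import defaultdict


def fraction_umis_ending_in_t(records):
    """records: iterable of (barcode, umi) pairs.

    Returns {barcode: fraction of its distinct UMIs whose last base is T}.
    """
    umis = defaultdict(set)
    for barcode, umi in records:
        umis[barcode].add(umi)
    return {
        bc: sum(u.endswith("T") for u in us) / len(us)
        for bc, us in umis.items()
    }


def flag_suspect_barcodes(records, min_umis=20, min_fraction=0.8):
    """Return barcodes with enough UMIs, nearly all of which end in T."""
    records = list(records)
    n_umis = defaultdict(set)
    for barcode, umi in records:
        n_umis[barcode].add(umi)
    fractions = fraction_umis_ending_in_t(records)
    return sorted(
        bc for bc, frac in fractions.items()
        if len(n_umis[bc]) >= min_umis and frac >= min_fraction
    )


toy_records = [
    ("AAAC", "ACGT"), ("AAAC", "GGCT"), ("AAAC", "TTAT"),
    ("GGGA", "ACGA"), ("GGGA", "CCTG"), ("GGGA", "TTAC"),
]
suspects = flag_suspect_barcodes(toy_records, min_umis=3)
```

With random UMIs, about a quarter are expected to end in T, so a fraction near 1 over many UMIs is unlikely by chance.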

The task: find real barcodes and real UMIs
Input: unfiltered triplets per dataset
- Gene*, Barcode, UMI, [# of PCR duplicates]
- * For the Macosko et al. data (human/mouse mix), we also indicate the organism.
Output:
- Set of real barcodes
- Set of real UMIs per barcode
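A first step for any approach is to load the triplets and count UMIs per barcode. The sketch below assumes a tab-delimited file with the columns in the order gene, barcode, UMI and an optional duplicate count. The actual layout of the provided files may differ, so the column indices should be checked against the data.

```python
import csv
import io
from collections import defaultdict


def read_triplets(handle):
    """Return {barcode: set of (gene, umi)} from tab-delimited rows.

    Assumed column order: gene, barcode, umi, [PCR duplicates].
    """
    per_barcode = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3 or row[0].startswith("#"):
            continue  # skip blank, short or comment lines
        gene, barcode, umi = row[0], row[1], row[2]
        per_barcode[barcode].add((gene, umi))
    return per_barcode


def umi_counts(per_barcode):
    """Number of distinct (gene, UMI) pairs per barcode."""
    return {bc: len(pairs) for bc, pairs in per_barcode.items()}


toy_input = io.StringIO(
    "ACTB\tAAAC\tGGTT\t3\n"
    "ACTB\tAAAC\tGGTA\t1\n"
    "GAPDH\tAAAC\tGGTT\t2\n"
    "ACTB\tCCCA\tTTTT\t1\n"
)
toy_counts = umi_counts(read_triplets(toy_input))
```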

Key considerations: finding real cells/barcodes
- The number of UMIs per barcode can be used as a first approximation, by finding a threshold value.
  - The UMI-count distribution can be analyzed using several approaches (e.g., inflection point, 99th percentile divided by a constant).
  - This can become problematic if the cell population is a mixture of cells of different sizes.
  - It may make sense to study the distribution of specific genes.
- A deletion in the barcode is likely to be associated with "T" slippage at the end of the read.
- Detect and use biases in substitutions and in the location of deletions.
- When merging two barcodes, look at their genes/UMIs.
- For 10x: use the whitelist (provided to you).
- Tom Smith's blog
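The "99th percentile divided by a constant" idea can be sketched as follows. This version assumes the percentile is taken over the top `expected_cells` barcodes and that the constant is 10. Both are illustrative choices rather than values given in the task, and the nearest-rank percentile is just one of several definitions.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list (q in 0..100)."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def call_cells(counts, expected_cells, q=99, divisor=10):
    """counts: {barcode: UMI count}. Returns (threshold, set of called barcodes)."""
    top = sorted(counts.values(), reverse=True)[:expected_cells]
    threshold = percentile(sorted(top), q) / divisor
    called = {bc for bc, n in counts.items() if n >= threshold}
    return threshold, called


# Toy data: five large barcodes and twenty small ones (5 to 24 UMIs).
toy_counts = {"cell0": 1000, "cell1": 900, "cell2": 800, "cell3": 1100, "cell4": 950}
toy_counts.update({f"empty{i}": 5 + i for i in range(20)})
threshold, called = call_cells(toy_counts, expected_cells=5)
```

A single threshold like this is exactly what can fail when the population mixes large and small cells, so it is best treated as a first pass.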

Key considerations: finding bad UMIs
- Assumptions on the homogeneous distribution of ambient UMIs can be useful for detecting them.
  - Inference of background UMI distributions from non-cell barcodes?
- Inference of sequence/synthesis errors using genes known to be specific for a subpopulation (e.g. human UMIs in mouse cells, globin genes in non-erythrocytes).
- The ambient UMI distribution may assist in finding real cell barcodes, and vice versa.
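A background distribution can be estimated by pooling the UMIs of non-cell barcodes, and any barcode can then be compared with it. The sketch below assumes per-barcode gene counts are already available as dictionaries. It uses a plain Pearson correlation as the comparison, which is only one possible choice, and the gene names in the toy data are illustrative.

```python
import math
from collections import Counter


def ambient_profile(gene_counts_by_barcode, cell_barcodes):
    """Pool gene counts over barcodes not called as cells; return gene fractions."""
    total = Counter()
    for bc, genes in gene_counts_by_barcode.items():
        if bc not in cell_barcodes:
            total.update(genes)
    n = sum(total.values())
    if n == 0:
        raise ValueError("no UMIs found in non-cell barcodes")
    return {gene: count / n for gene, count in total.items()}


def pearson(x, y):
    """Pearson correlation; NaN if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def correlation_with_ambient(profile, ambient):
    """Correlate one barcode's gene counts with the ambient fractions."""
    genes = sorted(set(profile) | set(ambient))
    return pearson([profile.get(g, 0) for g in genes],
                   [ambient.get(g, 0.0) for g in genes])


toy = {
    "cell1": {"CD3E": 50, "HBB": 1, "ACTB": 20},
    "empty1": {"HBB": 6, "ACTB": 3, "CD3E": 1},
    "empty2": {"HBB": 4, "ACTB": 2, "CD3E": 1},
}
ambient = ambient_profile(toy, cell_barcodes={"cell1"})
r_cell = correlation_with_ambient(toy["cell1"], ambient)
r_empty = correlation_with_ambient(toy["empty1"], ambient)
```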

Datasets for improving your barcode/UMI filters
- DS_1: 16 simulated datasets, each including a mixture of real data and simulated noisy barcodes, combining two cell states at different ratios. [location: task1_simu]
- DS_2: Droplet data from Macosko et al., human/mouse mix. ~570 cells [location: Macosko_2015/counts/*raw]
- DS_3: Droplet data from 10x on human PBMC. ~8.3k cells [location: 10x_pbmc/counts/*raw]
Multimapped reads discarded.

Output file formats
fname: groupX_task1_DS_X_cell_barcodes_i (you can have up to five solutions)
Format: tab-delimited text:
- Field 1: Cell barcode
- Field 2: Source barcode (a barcode in the data that was mapped to the cell barcode)
(Barcodes not in the file are considered noise.)
fname: groupX_task1_DS_X_cell_umis_i (you can have up to five solutions)
- Field 1: Gene
- Field 2: Barcode
- Field 3: good_umi_count
- Field 4: filtered_umi_count
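Writing the cell-barcode file could look like the sketch below. It assumes the result is held as a dictionary from each source barcode to the cell barcode it was mapped to, and that no header line is expected. The function name `write_barcode_map` is hypothetical.

```python
import csv
import io


def write_barcode_map(handle, source_to_cell):
    """Write one tab-delimited row per source barcode: cell barcode, source barcode."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for source, cell in sorted(source_to_cell.items()):
        writer.writerow([cell, source])


# Toy mapping: "AAAC" was judged to be an erroneous copy of "AAAA".
buffer = io.StringIO()
write_barcode_map(buffer, {"AAAC": "AAAA", "AAAA": "AAAA"})
```

Barcodes judged to be noise are simply left out of the mapping.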

Estimating performance on simulated data
- Simulated data, focused only on cell detection, assuming all barcodes are real but only some represent real cells:
  - Specificity and sensitivity of calling cells
  - Specificity/sensitivity stratified by number of UMIs
  - Breakdown of sensitivity among the two cell states/types
- Simulated data, calling barcodes:
  - Specificity and sensitivity of calling barcodes, plus the above metrics for calling cells
- Note that there are no UMIs to correct in the simulation; considerations for UMI correction should be applied to the real data.
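For self-checking on the simulated data, sensitivity and specificity can be computed from sets of barcodes, overall and within UMI-count bins. This is a minimal sketch that assumes the ground truth is available as a set of true cell barcodes. The bin edges are arbitrary examples.

```python
def sensitivity_specificity(all_barcodes, true_cells, called_cells):
    """Return (sensitivity, specificity); NaN where the denominator is zero."""
    tp = len(true_cells & called_cells)
    fn = len(true_cells - called_cells)
    fp = len(called_cells - true_cells)
    tn = len(all_barcodes - true_cells - called_cells)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def stratified(counts, true_cells, called_cells, edges):
    """Sensitivity/specificity within UMI-count bins [low, high)."""
    results = {}
    for low, high in zip(edges, edges[1:]):
        in_bin = {bc for bc, n in counts.items() if low <= n < high}
        results[(low, high)] = sensitivity_specificity(
            in_bin, true_cells & in_bin, called_cells & in_bin
        )
    return results


toy_counts = {"a": 1000, "b": 500, "c": 50, "d": 60, "e": 5, "f": 3}
truth = {"a", "b", "c"}
calls = {"a", "b", "d"}
overall = sensitivity_specificity(set(toy_counts), truth, calls)
by_bin = stratified(toy_counts, truth, calls, [0, 100, float("inf")])
```

In the toy data all errors fall in the low-count bin, which is the kind of pattern the stratified view is meant to expose.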

Estimating performance on real data
- Simple metrics to begin with: number of called cells, fraction of UMIs explained by cells, number of mixed human-mouse cells.
- Then, tests on the information content of cells and non-cells (hoping to see low information content in filtered cells and high in re). Here are some ideas:
  - Detect k called cells that have the lowest UMI count: Group 1
  - Detect 100 called non-cells that have the highest UMI count: Group 2
  - Sample an additional 100 called cells (with any UMI count): Group 3
  - Generate 100 randomized ambient profiles, by sampling each profile from the UMIs in non-cells with replacement: Group 4
  - Down-sample the cells in Groups 1-3 to generate profiles with the same number of UMIs
  - Cluster the 400 cells into four clusters using e.g. hclust on cell-to-cell correlations (normalize the genes). We want good mixing of Groups 2/4, and perhaps of 1/3 (if cell size is not strongly correlated with expression profile)
  - OR compute any other indication of the information content in the four clusters, for example the distribution of the 1-nearest-neighbor distance
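The two sampling steps, down-sampling profiles to a common depth and drawing randomized ambient profiles with replacement, can be sketched with the standard library. The sketch assumes each profile is a dictionary of gene counts. The clustering step is not shown, and the helper names are illustrative.

```python
import random
from collections import Counter


def expand(profile):
    """Turn {gene: count} into a list with one entry per UMI."""
    return [gene for gene, n in profile.items() for _ in range(n)]


def downsample(profile, n_umis, rng):
    """Sample n_umis UMIs without replacement (raises ValueError if too few)."""
    return Counter(rng.sample(expand(profile), n_umis))


def random_ambient_profile(non_cell_profiles, n_umis, rng):
    """Sample n_umis UMIs with replacement from the pooled non-cell UMIs."""
    pool = [gene for p in non_cell_profiles for gene in expand(p)]
    return Counter(rng.choices(pool, k=n_umis))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_cell = {"CD3E": 5, "ACTB": 3, "HBB": 1}
toy_non_cells = [{"HBB": 2, "ACTB": 1}, {"HBB": 1}]
down = downsample(toy_cell, 4, rng)
ambient = random_ambient_profile(toy_non_cells, 4, rng)
```

Bringing all four groups to the same number of UMIs keeps depth from driving the clustering.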

Evaluating UMI correction
- Look at the profiles of the UMIs that were removed per called cell, and compute their:
  - cell-to-cell correlations (should be low after normalizing each gene)
  - correlation to the distribution of UMIs in non-cells (should be high)
- Look at statistics for the variability of the UMI distribution before and after UMI correction: normalized variance should increase after removing the noise.
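The normalized-variance check can be illustrated on one gene. The sketch below takes "normalized variance" to mean variance divided by mean, which is an assumption since the slide does not define it. The toy numbers describe a gene expressed in three of six cells, with two ambient UMIs added to every cell before correction.

```python
from statistics import mean, pvariance


def normalized_variance(values):
    """Variance-to-mean ratio of one gene's counts across cells."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else float("nan")


# Per-cell counts of one gene, before and after removing ambient UMIs.
before = [12, 12, 12, 2, 2, 2]
after = [10, 10, 10, 0, 0, 0]
nv_before = normalized_variance(before)
nv_after = normalized_variance(after)
```

Removing a uniform background leaves the variance unchanged but lowers the mean, so the ratio goes up.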

Additional observations to make with real data
- Collect error stats:
  - BP substitution
  - Position of deletion in DropSeq
- Distribution of #UMI associated with corrected barcodes
- When collapsing 2 barcodes: stats of shared UMIs/genes
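Two of these statistics are simple to collect when a pair of barcodes is collapsed: where the substitution occurred, and how much the two barcodes share. The sketch below assumes barcodes of equal length that differ by exactly one base, and uses the Jaccard index as the overlap measure, which is one possible choice.

```python
def single_substitution(a, b):
    """Return (position, base_in_a, base_in_b) if a and b differ by one base, else None."""
    if len(a) != len(b):
        return None
    diffs = [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def shared_fraction(items_a, items_b):
    """Jaccard index of two sets (e.g. UMIs or genes seen with each barcode)."""
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else float("nan")


sub = single_substitution("ACGT", "ACGA")
overlap = shared_fraction({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
```

Tallying the results of `single_substitution` over all collapsed pairs gives the position and base-change biases mentioned above.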

References
- See the discussion in the following blog post.
- Correcting UMIs (Sudbery lab): UMI-tools package
- Correcting cell barcodes (Pachter lab): SirCel package
- A tech feature in Nature Methods (Apr '17) provides an additional good read on issues with UMI errors.