James Lindsay, Ion Mandoiu, Craig Nelson: Towards Whole-Transcriptome Deconvolution with Single-cell Data (University of Connecticut)

Presentation transcript:

Towards Whole-Transcriptome Deconvolution with Single-cell Data
James Lindsay (1), Ion Mandoiu (1), Craig Nelson (2)
University of Connecticut: (1) Department of Computer Science and Engineering, (2) Department of Molecular and Cell Biology

Mouse Embryo
[Figure: mouse embryo with the posterior/tail and anterior/head ends marked and the somites, node, neural tube, and primitive streak labeled]

Unknown Mesoderm Progenitor
What is the expression profile of the progenitor cell type?
Abbreviations: NSB = node-streak border; PSM = presomitic mesoderm; S = somite; NT = neural tube/neurectoderm; EN = endoderm

Characterizing Cell-types
Goal: whole-transcriptome expression profiles of individual cell types.
Challenge: measuring whole-transcriptome expression from single cells is technically difficult.
Approach: computational deconvolution of cell mixtures, assisted by single-cell qPCR expression data for a small number of genes.

Modeling Cell Mixtures
Mixtures (X) are a linear combination of the signature matrix (S) and the concentration matrix (C): X = S · C.
Dimensions: X is genes × mixtures, S is genes × cell types, C is cell types × mixtures.
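
As a rough illustration of the linear mixture model on this slide, the following NumPy sketch builds a toy X from S and C. The matrix sizes and value ranges are invented for the example and are not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_types, n_mixtures = 5, 3, 4  # toy sizes, chosen only for illustration

# S: signature matrix (genes x cell types), non-negative expression levels.
S = rng.uniform(0.0, 10.0, size=(n_genes, n_types))

# C: concentration matrix (cell types x mixtures); each column sums to 1.
C = rng.uniform(0.0, 1.0, size=(n_types, n_mixtures))
C /= C.sum(axis=0, keepdims=True)

# X: observed mixtures (genes x mixtures), a linear combination of signatures.
X = S @ C
```

Each column of X is the expression of one mixture, obtained by weighting the cell-type signatures with that mixture's column of C.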

Previous Work
1. Coupled Deconvolution. Given: X. Infer: S, C.
   NMF: Repsilber, BMC Bioinformatics, 2010
   Minimum polytope: Schwartz, BMC Bioinformatics
2. Estimation of Mixing Proportions. Given: X, S. Infer: C.
   Quadratic programming: Gong, PLoS One, 2012
   LDA: Qiao, PLoS Comp Bio, 2012
3. Estimation of Expression Signatures. Given: X, C. Infer: S.
   csSAM: Shen-Orr, Nature Brief Com, 2010

Single-cell Assisted Deconvolution
Given: X and single-cell qPCR data. Infer: S, C.
Approach:
1. Identify cell types and estimate the reduced signature matrix using single-cell qPCR data: outlier removal, then k-means clustering followed by averaging.
2. Estimate mixing proportions C using quadratic programming, one mixture at a time.
3. Estimate the full expression signature matrix S using C, by quadratic programming, one gene at a time.

Step 1: Outlier Removal + Clustering
[Figure panels: unfiltered, filtered]
Remove cells whose maximum Pearson correlation to any other cell is below 0.95.
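
The outlier filter on this slide can be sketched in a few lines of NumPy. The function name `remove_outliers` and the cells × genes layout of `profiles` are assumptions made for this example; only the 0.95 threshold comes from the slide.

```python
import numpy as np

def remove_outliers(profiles, threshold=0.95):
    """Keep cells whose best Pearson correlation to any other cell is >= threshold.

    profiles: array of shape (cells, genes) (layout assumed for this sketch).
    Returns the filtered profiles and the boolean keep-mask.
    """
    corr = np.corrcoef(profiles)      # cell-by-cell Pearson correlation
    np.fill_diagonal(corr, -np.inf)   # ignore each cell's correlation with itself
    keep = corr.max(axis=1) >= threshold
    return profiles[keep], keep
```

A cell is dropped only when no other cell resembles it, so tight groups of similar cells survive the filter intact.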

Step 1: PCA of Clustering
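
A minimal sketch of the clustering half of Step 1 and of a 2-D PCA projection for viewing it, assuming plain Lloyd's k-means and PCA via SVD. The talk does not say which implementations were used, and the helper names `kmeans` and `pca_2d` are invented here.

```python
import numpy as np

def kmeans(profiles, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on a (cells, genes) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(profiles), size=k, replace=False)
    centroids = profiles[start].astype(float)

    def assign(cents):
        # Squared Euclidean distance from every cell to every centroid.
        d2 = ((profiles[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Averaging step: each centroid becomes the mean of its member cells.
        updated = np.array([
            profiles[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(centroids), centroids

def pca_2d(profiles):
    """Project cells onto their first two principal components."""
    centered = profiles - profiles.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

Under this layout the reduced signature matrix would be `centroids.T` (genes × cell types), and `pca_2d` gives the coordinates for a scatter plot colored by `labels`.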

Step 2: Estimate Mixture Proportions
For a given mixture i: [equation not captured in the transcript]
The reduced signature matrix consists of the centroids of the k-means clusters.
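
The equation for this step did not survive in the transcript, so the following is only one plausible reading. It assumes the quadratic program is a least-squares fit of mixture i to the reduced signature matrix, with proportions constrained to be non-negative and to sum to 1. A projected-gradient loop stands in for a dedicated QP solver, and all names are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {c : c >= 0, sum(c) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def estimate_proportions(S, x, n_iter=5000):
    """Minimize ||x - S c||^2 over the simplex (assumed constraints).

    S: reduced signature matrix (genes x cell types).
    x: one mixture's expression over the same genes.
    """
    k = S.shape[1]
    c = np.full(k, 1.0 / k)                     # start from equal proportions
    step = 1.0 / np.linalg.norm(S, 2) ** 2      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = S.T @ (S @ c - x)
        c = project_to_simplex(c - step * grad)
    return c
```

Calling `estimate_proportions` once per column of X matches the "one mixture at a time" description in the approach.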

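Step 3: Estimating Full Expression Signatures
s: signatures of a new gene to estimate.
Now solve: [equation not captured in the transcript]
C: known from step 2.
x: observed signals from the new gene.
[Figure: matrix dimension labels: X is genes × mixtures, S is genes × cell types, C is cell types × mixtures]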
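
Since the equation is missing from the transcript, this sketch assumes the per-gene problem is a least-squares fit of the gene's observed signals x to the known concentrations C, with signatures constrained to be non-negative. Projected gradient descent is used here in place of a QP solver, and the function name is hypothetical.

```python
import numpy as np

def estimate_signature(C, x, n_iter=5000):
    """Minimize ||x - C.T s||^2 subject to s >= 0 (assumed constraint).

    C: concentration matrix (cell types x mixtures), known from step 2.
    x: observed signal of one new gene in every mixture.
    Returns s, the gene's estimated expression in each cell type.
    """
    A = C.T                                    # mixtures x cell types
    s = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = A.T @ (A @ s - x)
        s = np.maximum(s - step * grad, 0.0)   # project onto the non-negative orthant
    return s
```

Repeating this for every gene fills in the full signature matrix S one row at a time.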

Experimental Design
Single-cell profiles: 92 profiles, 31 genes.
Simulated concentrations: sample uniformly at random from [0, 1], then scale each column to sum to 1.
Simulated mixtures: choose single cells randomly with replacement from each cluster and sum them to generate a mixture.
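
One way the simulation could look in NumPy. The slide does not say how many cells are drawn per cluster or how the concentrations enter the sum, so this sketch assumes one randomly drawn cell per cluster per mixture, weighted by that cluster's concentration. The function name and array layouts are also assumptions.

```python
import numpy as np

def simulate_mixtures(profiles, labels, n_mixtures, seed=0):
    """Simulate mixtures from clustered single-cell profiles.

    profiles: (cells, genes) array; labels: cluster id per cell.
    Returns X (genes x mixtures) and C (cell types x mixtures).
    """
    rng = np.random.default_rng(seed)
    types = np.unique(labels)

    # Concentrations: uniform on [0, 1], each column scaled to sum to 1.
    C = rng.uniform(0.0, 1.0, size=(len(types), n_mixtures))
    C /= C.sum(axis=0, keepdims=True)

    X = np.zeros((profiles.shape[1], n_mixtures))
    for i in range(n_mixtures):
        for t, cell_type in enumerate(types):
            members = np.flatnonzero(labels == cell_type)
            cell = profiles[rng.choice(members)]  # sampled with replacement across mixtures
            X[:, i] += C[t, i] * cell             # assumed weighting by concentration
    return X, C
```

Because the sampled cells differ from the cluster centroids, the simulated X deviates from S · C by within-cluster variability, which is what makes the recovery task non-trivial.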

Data: RT-qPCR
CT values are the cycle in which a gene was detected.
Relative normalization to housekeeping genes (Vandesompele, 2002), using the geometric mean of the housekeeping genes gapdh and bactin1:
dCT(x) = geometric mean - CT(x)
expression(x) = 2^dCT(x)
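
The normalization on this slide can be written as a short function. The dictionary layout and the gene names `geneA` and `geneB` in the usage lines are made up for illustration; the housekeeping genes and the two formulas are the ones on the slide.

```python
import math

def relative_expression(ct, housekeeping=("gapdh", "bactin1")):
    """Convert CT values to relative expression, 2 ** dCT.

    ct: dict mapping gene name -> CT value for one sample.
    dCT(x) = geometric mean of housekeeping CTs - CT(x).
    """
    log_sum = sum(math.log(ct[g]) for g in housekeeping)
    reference = math.exp(log_sum / len(housekeeping))  # geometric mean
    return {
        gene: 2.0 ** (reference - value)
        for gene, value in ct.items()
        if gene not in housekeeping
    }

# Hypothetical CT values for one cell.
sample = {"gapdh": 16.0, "bactin1": 25.0, "geneA": 18.0, "geneB": 22.0}
expression = relative_expression(sample)
```

A gene detected earlier than the housekeeping reference (lower CT) gets expression above 1, and a gene detected later gets expression below 1.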

Accuracy of Inferred Mixing Proportions

Concentration Matrix: Concordance

Concentration by # Genes: Random

Concentration by # Genes: Ranked

Leave-one-out: Concentration: 50 mix [Plot axes: RMSE 2^dCT vs. missing gene]

Leave-one-out: Signature: 10 mix [Plot axes: RMSE 2^dCT vs. missing gene]

Leave-one-out: Signature: 50 mix [Plot axes: RMSE 2^dCT vs. missing gene]

Future Work
Bootstrapping to report a confidence interval for each estimated concentration and signature; show the correlation between large CIs and poor accuracy.
Mixing of heterogeneous technologies: qPCR for single cells, RNA-seq for mixtures; normalization (needs to be linear).
Whole-genome scale: number of genes to estimate is 10,000+ signatures.
Data!
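
The bootstrapping idea is only listed as future work, so the following is a speculative sketch of one way it might be done: resample genes with replacement, re-estimate the proportions each time, and report percentile intervals. The resampling unit, the percentile method, and the unconstrained `least_squares` stand-in estimator are all assumptions.

```python
import numpy as np

def bootstrap_ci(S, x, estimator, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for each estimated proportion.

    S: signature matrix (genes x cell types); x: one mixture (genes,).
    estimator: callable (S, x) -> proportions, e.g. a QP-based solver.
    """
    rng = np.random.default_rng(seed)
    n_genes = S.shape[0]
    estimates = np.empty((n_boot, S.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n_genes, size=n_genes)  # resample genes with replacement
        estimates[b] = estimator(S[idx], x[idx])
    lower = np.percentile(estimates, 100 * alpha / 2, axis=0)
    upper = np.percentile(estimates, 100 * (1 - alpha / 2), axis=0)
    return lower, upper

def least_squares(S, x):
    """Unconstrained least squares, used here only as a stand-in estimator."""
    return np.linalg.lstsq(S, x, rcond=None)[0]
```

Wide intervals from `bootstrap_ci` could then be compared against the known error in simulated data to check whether a large CI tracks poor accuracy.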

Conclusion
Special thanks to: Ion Mandoiu, Craig Nelson, Caroline Jakuba, Mathew Gajdosik