Cell Cycle Analysis & Effect on scRNA-Seq Analysis Workflow


Cell Cycle Analysis & Effect on scRNA-Seq Analysis Workflow Marmar Moussa Computer Science & Engineering Department University of Connecticut

Cell Cycle Analysis G0?

Motivation: Cell Type Effect vs. Cell Cycle Effect
Variation in the gene expression profiles of single cells in different phases of the cell cycle can interfere with the functional analysis of the transcriptomic data, for example when the objective is identifying functional cell types.

Existing Methods:
- Cyclone (classifier, scoring for G1, S, and G2/M cell cycle phases)
- ccRemover (cell cycle effect remover)
  Test on Jurkat & 293 cell lines
- Oscope (identifies oscillatory genes in unsynchronized single-cell RNA-seq)
- reCAT (reconstructing cell cycle pseudo time-series)

Oscope/reCAT

WIP: PCA-tSNE-based Approach
The first few PCs of a set of annotated cell cycle marker genes are sufficient for constructing a cell-to-cell covariance matrix that reflects the cell-cycle-induced correlation among cells [1,2,3,4].
Examine the idea of ordering the cells based on:
- the first few PCs of the cell cycle marker genes as features
- a 3-component t-SNE transformation (capturing nearest-neighbor relations)
- cells clustered/ordered using an average- or Ward-linkage algorithm
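The three steps above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function name `order_cells`, its parameters, the use of scikit-learn and SciPy, and the synthetic sinusoidal toy data are all assumptions made for the example. Reading the cell order off the dendrogram leaves is likewise one possible interpretation of "clustered/ordered".

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import linkage, leaves_list, fcluster


def order_cells(expr, n_pcs=5, method="ward", n_clusters=3, seed=0):
    """expr: cells x marker-genes matrix, assumed already normalized.

    Returns (dendrogram leaf order of the cells, cluster label per cell).
    """
    # Step 1: first few PCs of the cell cycle marker genes as features.
    pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(expr)

    # Step 2: 3-component t-SNE on the PC scores.
    # Perplexity must stay below the number of cells.
    perplexity = min(30.0, (expr.shape[0] - 1) / 3.0)
    emb = TSNE(n_components=3, perplexity=perplexity,
               init="pca", random_state=seed).fit_transform(pcs)

    # Step 3: hierarchical clustering ("ward" or "average" linkage).
    tree = linkage(emb, method=method)
    order = leaves_list(tree)                       # one candidate cell order
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return order, labels


# Hypothetical toy data: 60 cells, 40 marker genes, each gene peaking at
# its own position along a hidden cycle, plus noise.
rng = np.random.default_rng(0)
angle = rng.uniform(0, 2 * np.pi, size=60)
peaks = rng.uniform(0, 2 * np.pi, size=40)
expr = np.cos(angle[:, None] - peaks[None, :]) + 0.3 * rng.normal(size=(60, 40))

order, labels = order_cells(expr)
print(order[:10], np.bincount(labels)[1:])
```

Swapping `method="ward"` for `method="average"` gives the other linkage named on the slide. The number of PCs is left as a parameter because choosing it is one of the challenges listed next.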

Challenges
- Deciding on PCs to use: CC-genes-based PCA vs. PC loadings analysis
- Normalization: centering (mean-based) & scaling (sd-based) of cells (genes)
- Gene lists: genes correlation filter
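The normalization item can be illustrated with a small sketch. It assumes plain z-scoring (subtract the mean, divide by the sample standard deviation) and a matrix with cells as rows and genes as columns. The function names and the `by` switch are hypothetical, included because the slide leaves open whether cells or genes are normalized.

```python
from statistics import mean, stdev


def center_and_scale(values):
    """Z-score one vector: subtract its mean, divide by its sample sd."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        # A constant vector carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def normalize(matrix, by="genes"):
    """matrix: list of rows, where rows are cells and columns are genes."""
    if by == "cells":
        return [center_and_scale(row) for row in matrix]
    # by == "genes": z-score each column, then transpose back.
    columns = [center_and_scale(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


# Toy matrix: 3 cells x 3 genes (the third gene is constant).
expr = [[1.0, 10.0, 0.0],
        [2.0, 20.0, 0.0],
        [3.0, 30.0, 0.0]]

by_gene = normalize(expr, by="genes")
print(by_gene)
```

After gene-wise scaling the first two genes become identical, which shows why the choice matters: scaling removes differences in magnitude between genes before PCA is run.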

Dataset(s) - Labeled
- H1-Fucci hESC cell line: fluorescent ubiquitination-based cell cycle indicator (Fucci) H1 hESCs, isolated as single cells by fluorescence-activated cell sorting (FACS). Sorting yielded 91, 80 and 76 cells in the G1, S and G2/M cell-cycle phases, respectively. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64016
- Rex1-GFP-expressing mESC (182), stained with Hoechst 33342 and sorted by flow cytometry into the G1, S and G2M stages of the cell cycle. Sequenced with the Fluidigm C1 system and the Nextera XT (Illumina) kit. https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2805/

Dataset – Not Labeled
T-cells (CD3+ cells), 10x Genomics.
Additional challenges:
- Sparser data than the C1 platform
- Cycling vs. non-cycling cells

Gene Lists Effect - hESC

                    Cyclone, etc.   CycleBase*     CC GO Term
Number of Genes     1180            324            640
G1                  0.857142857     1              1
G2                  0.828947368     0.881578947    0.868421053
S                   1               1              1
Micro Accuracy      0.894736842     0.963562753    0.95951417
Macro Accuracy      0.895363409     0.960526316    0.956140351

Correlation Filter: 0.25
*CycleBase DB genes (human) are annotated by their peak phase (6 phases: G1, G2, S, G1/S, G2/M, M)
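A short sketch of how the two summary rows can be computed. It assumes that "micro accuracy" means the fraction of all cells labeled correctly and "macro accuracy" means the unweighted mean of the per-phase accuracies. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def phase_accuracies(true_labels, predicted_labels):
    """Return (per-phase accuracy dict, micro accuracy, macro accuracy)."""
    total = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    # Per-phase accuracy: correctly labeled cells / cells truly in that phase.
    per_phase = {phase: correct[phase] / n for phase, n in total.items()}
    # Micro: pooled over all cells, so larger phases weigh more.
    micro = sum(correct.values()) / len(true_labels)
    # Macro: every phase weighs the same regardless of its size.
    macro = sum(per_phase.values()) / len(per_phase)
    return per_phase, micro, macro


# Toy example: 6 G1 cells, 2 S cells, 2 G2 cells; one G2 cell is mislabeled.
true_labels = ["G1"] * 6 + ["S"] * 2 + ["G2"] * 2
predicted = ["G1"] * 6 + ["S"] * 2 + ["G2", "G1"]

per_phase, micro, macro = phase_accuracies(true_labels, predicted)
print(per_phase, micro, macro)
```

With unequal phase sizes the two numbers diverge: an error in a small phase lowers the macro accuracy more than the micro accuracy.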

True Labels

                    G1-Genes     G1S-Genes    S-Genes      G2-Genes     G2M-Genes    M-Genes      All Genes Avg
G1 Phase cells      0.1160004    0.001255134  -0.01616238  -0.02431784  0.01816772   0.1118672    3.45E-02
G2 Phase cells      -0.0768833   -0.1291946   -0.10372     0.1114958    0.1180861    0.04258149   -6.27E-03
S Phase cells       -0.05891127  0.1213072    0.1169187    -0.07825951  -0.1328476   -0.1677013   -3.32E-02
All cells average   -6.60E-03    -2.21E-03    -9.88E-04    2.97E-03     1.14E-03     -4.42E-03

C5: Clusters

Cluster   G1-Genes   G1S-Genes   S-Genes   G2-Genes   G2M-Genes   M-Genes
1         -0.0908    -0.1212     -0.1185   -0.1024    -0.1304     -0.1134
2         -0.238     -0.18       -0.2014   -0.1783    -0.1754     -0.1879
3         -0.1383    -0.1671     -0.1568   -0.1464    -0.1466     -0.1359
4         0.1029     -0.0577     -0.049    -0.026     -0.0432     -0.0279
5         0.7584     1.082       1.0557    0.9449     1.0182      0.9695
6         -0.1853    -0.1664     -0.171    -0.1674    -0.1622     -0.165
7         -0.0087    -0.07       -0.0672   -0.0626    -0.0385     -0.0562
8         0.2174     0.0414      0.0717    0.0729     0.0615      0.0787
9         -0.1758    -0.1666     -0.1538   -0.1295    -0.1482     -0.1465
10        -0.1993    -0.1637     -0.1859   -0.1688    -0.1793     -0.1846

Cluster 11 (five of six values present): -0.1618 -0.162 -0.1523 -0.1614 -0.1448
Cluster 12 (five of six values present): -0.2236 -0.189 -0.1954 -0.1757 -0.1901

Dividing T-cells

Open Question: Can we distinguish cycling vs. non-cycling cells, and/or assess cell order?
If we could directly assess the order
- without having to know the labels,
- without assuming a certain model for the genes (binary, bimodal, sinusoidal, etc.), and
- without a 'perfect' list of the cell cycle genes as a whole or per phase,
then we could use this to select the best order from the potential orders.

Assessing cell order
Defining Gene-Smoothness:
GeneSmoothness(x) = sd(diff(x)) / abs(mean(diff(x)))
where x is a gene as a vector of its expression values in the ordered cells.
Score interpretation: lower scores mean less variance within the order, i.e. a smoother signal.
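The Gene-Smoothness definition translates almost directly into code. The sketch below assumes `sd` is the sample standard deviation (as in R) and `diff` is the first difference between consecutive cells. The function name and the toy vectors are illustrative only.

```python
from statistics import mean, stdev


def gene_smoothness(x):
    """sd(diff(x)) / abs(mean(diff(x))) for one gene across ordered cells.

    x needs at least three values so that two differences exist.
    """
    diffs = [b - a for a, b in zip(x, x[1:])]
    m = mean(diffs)
    if m == 0:
        # Zero mean difference (first and last value equal): score is unbounded.
        return float("inf")
    return stdev(diffs) / abs(m)


# The same five expression values in a smooth order and in a scrambled order.
smooth = [0.0, 1.0, 2.0, 3.0, 4.5]
rough = [3.0, 0.0, 4.5, 1.0, 2.0]

print(gene_smoothness(smooth))  # small: consecutive steps are similar
print(gene_smoothness(rough))   # large: consecutive steps swing widely
```

Because the differences telescope, `mean(diff(x))` equals `(x[-1] - x[0]) / (len(x) - 1)`. The score is therefore sensitive to the first and last cell of an order, and grows without bound for a gene that returns to its starting level.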

Assessing cell order
Autocorrelation (serial correlation): the correlation of a signal with a delayed copy of itself. Informally, it is the similarity between observations as a function of the time lag between them.
Score interpretation:
- Scores near 1 imply a smoothly varying series.
- Scores near 0 imply that there is no overall relationship between a data point and the following one.
- Scores near -1 suggest that the series is jagged/rough in a particular way: if one point is above the mean, the next is likely to be below the mean by about the same amount, and vice versa.
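A minimal sketch of the autocorrelation score, assuming the standard sample estimator at lag 1 (lagged cross-products around the overall mean, divided by the total sum of squares). The function name and the toy series are illustrative and not taken from the slides.

```python
def autocorrelation(x, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    if denom == 0:
        # A constant series has no variation to correlate.
        return 0.0
    num = sum((x[i] - m) * (x[i + lag] - m) for i in range(n - lag))
    return num / denom


ramp = [float(i) for i in range(20)]   # smoothly varying series
zigzag = [1.0, -1.0] * 10              # alternates around its mean

print(autocorrelation(ramp))    # close to 1
print(autocorrelation(zigzag))  # close to -1
```

For a gene's expression vector taken in a candidate cell order, a higher lag-1 value indicates that neighboring cells in that order have similar expression.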

References
1. Leng, N., Chu, L.F., Barry, C., Li, Y., Choi, J., Li, X., Jiang, P., Stewart, R.M., Thomson, J.A., Kendziorski, C.: Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nature Methods 12(10), 947 (2015)
2. Scialdone, A., Natarajan, K.N., Saraiva, L.R., Proserpio, V., Teichmann, S.A., Stegle, O., Marioni, J.C., Buettner, F.: Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85, 54-61 (2015)
3. Liu, Z., et al.: Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nature Communications 8(1), 22 (2017)
4. Barron, M., Li, J.: Identifying and removing the cell-cycle effect from single-cell RNA-sequencing data. Scientific Reports 6, 33892 (2016)

Thank You! Cell Cycle in SC1 tool : https://sc1.engr.uconn.edu/ Questions?

Future Work: Intron Retention & Cell Cycle
Intron retention (IR) measured for T cells sorted at different stages of the cell cycle: ~1K differentially retained introns, with distinct patterns of retention for each stage of the cell cycle. These introns were retained from genes enriched for cell cycle (p = 8E-6).
Reference: Middleton, R., et al.: IRFinder: assessing the impact of intron retention on mammalian gene expression. Genome Biology 18(1), 51 (2017)

Correlation Filter – CycleBase Gene List
Thresholds: 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5**
G1: 1 0.989011
G2: 0.868421 0.855263 0.88158 0.894737 0.947368
S: 0.9875 0.975 0.9375 0.925
Micro Accuracy: 0.959514 0.955466 0.9636 0.95951
Macro Accuracy: 0.95614 0.951754 0.960526 0.95636 0.956579 0.95796 0.957456