Computer Science & Engineering Department University of Connecticut

SC1: Computational Methods for the Analysis of Single Cell RNA-Seq Data
Marmar Moussa, Ion Măndoiu
Computer Science & Engineering Department, University of Connecticut

Outline
Motivation
TF-IDF based Gene Selection & Clustering
Workflow overview: QC, Gene Selection, Clustering, Differential Expression Analysis, Enrichment Analysis
Visualizing cells, gene expression, gene pairs, DE analysis, clustering heatmaps

Motivation
No widely adopted analysis workflow exists yet.
Only a few tools allow researchers to analyze scRNA-seq data without considerable coding effort.
SC1 is an interactive web-based pipeline that lets researchers analyze scRNA-seq data in an efficient workflow.
It supports more detailed analysis and new methods: several quality control (QC) options; TF-IDF based gene selection & clustering; LSImpute; cell cycle analysis; interactive choices, 3D visualization, informative plots …

scRNA-Seq Analysis Workflow
Data Exploration & QC
Imputation
Normalization / Batch Effect
Data Transformation: Log Transform, TF-IDF
Feature Selection: Marker Genes, TF-IDF Genes
Dimensionality Reduction: PCA, t-SNE, …
Clustering: optimal number of clusters, within-cluster homogeneity, similarity, …
DE Analysis
Enrichment Analysis, Annotation
Visualization
Cell Cycle Analysis
Clustering: many methods available (K-means, hierarchical clustering, graph-based, …). Active area of research: reducing the effect of confounders such as detection rate & cell cycle phase; discriminative similarity metrics; scalability to millions of cells …

QC Dashboard

Gene Selection & Clustering Single Cell RNA-Seq Data using TF-IDF based Methods
Goal: clustering cells into cell types …
Moussa, M., and Mandoiu, I.: Clustering scRNA-Seq Data using TF-IDF. Bioinformatics Research and Applications, 13th International Symposium, ISBRA 2017, Honolulu, HI, USA. Lecture Notes in Computer Science (extended abstract in LNCS, volume 10330).
Moussa, M., and Mandoiu, I.: Single Cell RNA-seq Data Clustering using TF-IDF based Methods. BMC Genomics, 2018.

TF-IDF Transformation
Term Frequency × Inverse Document Frequency
Successfully employed in information retrieval. Example application: topic classification.
Two parts:
How many times does a term occur in a document? (Considers term frequency.)
How 'important' is the term? (Considers document/collection frequency.)
Intuition: rare terms in a collection are more informative than frequent terms; think stop-words!

TF-IDF Transformation
Term Frequency × Inverse Document Frequency for scRNA-Seq data:
For gene $i$ in cell $j$ with count $f_{ij}$: $TF_{ij} = f_{ij} / \max_k f_{kj}$
If gene $i$ is detected in $n_i$ out of $N$ cells: $IDF_i = \log_2(N / n_i)$
TF-IDF score: $TF_{ij} \cdot IDF_i$
Analogy to scRNA-seq data using UMIs:
Library/collection/corpus of documents: sample of cells
Document: cell
Term: gene
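The three formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the SC1 implementation: the function name `tf_idf`, the genes-by-cells list-of-lists layout, and the choice to give undetected genes an IDF of 0 are all assumptions made for the example.

```python
import math

def tf_idf(counts):
    """counts[i][j] is the UMI count f_ij of gene i in cell j (layout is an assumption).

    Returns a matrix of the same shape holding TF_ij * IDF_i.
    """
    n_genes = len(counts)
    n_cells = len(counts[0]) if n_genes else 0

    # max_k f_kj: the largest count observed in each cell (column)
    cell_max = [max(counts[i][j] for i in range(n_genes)) for j in range(n_cells)]

    scores = []
    for row in counts:
        # n_i: number of cells in which this gene is detected
        n_detect = sum(1 for c in row if c > 0)
        # IDF_i = log2(N / n_i); undetected genes get 0 here (an assumption)
        idf = math.log2(n_cells / n_detect) if n_detect > 0 else 0.0
        scores.append([
            (row[j] / cell_max[j] if cell_max[j] > 0 else 0.0) * idf
            for j in range(n_cells)
        ])
    return scores

# Toy data: 3 genes x 4 cells
example_counts = [
    [4, 0, 0, 0],  # detected in 1 of 4 cells -> IDF = 2
    [2, 2, 2, 2],  # detected in every cell   -> IDF = 0
    [0, 1, 0, 3],  # detected in 2 of 4 cells -> IDF = 1
]
example_scores = tf_idf(example_counts)
```

A gene detected in every cell scores 0 everywhere, which is the stop-word intuition from the previous slide: ubiquitous genes carry no discriminating information.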

TF-IDF based Feature selection
Genes with highest avg. TF-IDF (highest mean GMM in red)

scRNA-Seq Clustering Methods
Cells QC, Genes QC, Gap-Statistics Analysis
Data Transformation: Log2(x+1) or none
Feature Selection: PCA, tSNE, highly variable genes* or none
Clustering: Seurat (K-means)*, Seurat (SNN)*, GMM, K-means, Sph. K-means, HC (E/P), Louvain (E)

Data Transformation: TF-IDF
Feature Selection: high avg. TF-IDF score (Top) or highly variable TF-IDF (Var)
Clustering: HC (E/P/C)
Data Binarization: cutoff threshold per cell based on cell avg. TF-IDF (Bin)
Clustering: HC (E/P/C/J), Greedy (E/P/C/J), Louvain (E/P/C/J)
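The binarization step (Bin) can be sketched as follows. The slide only says the cutoff is set per cell from the cell's average TF-IDF; the exact rule used here (strictly above the cell's mean over all genes) is an assumption, as is reading the "J" similarity option as Jaccard. The function names are hypothetical.

```python
def binarize_by_cell_average(scores):
    """scores[i][j] is the TF-IDF score of gene i in cell j.

    Returns a 0/1 matrix: a gene is 1 in a cell when its score exceeds
    that cell's mean score over all genes (assumed cutoff rule).
    """
    n_genes = len(scores)
    n_cells = len(scores[0]) if n_genes else 0
    cell_avg = [
        sum(scores[i][j] for i in range(n_genes)) / n_genes
        for j in range(n_cells)
    ]
    return [
        [1 if scores[i][j] > cell_avg[j] else 0 for j in range(n_cells)]
        for i in range(n_genes)
    ]

def jaccard(binary, a, b):
    """Jaccard similarity between the binarized gene sets of cells a and b."""
    both = sum(1 for row in binary if row[a] and row[b])
    either = sum(1 for row in binary if row[a] or row[b])
    return both / either if either else 0.0

# Toy TF-IDF scores: 4 genes x 3 cells
toy_scores = [
    [2.0, 0.0, 0.0],
    [0.0, 0.5, 1.0],
    [1.0, 0.5, 0.2],
    [0.0, 0.0, 0.0],
]
toy_binary = binarize_by_cell_average(toy_scores)
similarity_cells_1_2 = jaccard(toy_binary, 1, 2)
```

The resulting pairwise similarities could then feed hierarchical, greedy, or Louvain clustering; those steps are not shown here.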

Pairs by 'difficulty': dissimilar → intermediate → similar
Highly dissimilar: (B cells and CD14 monocytes) and (B cells and CD56 NK)
Intermediate similarity: (memory T and naive T) and (regulatory T and naive cytotoxic)
Highly similar: (memory T and naive cytotoxic) and (regulatory T and naive T)

PBMC, 10x. Cluster labels: 1: helper; 2: regulatory; 4: B cells; 5: monocytes; 6: natural killer; 7, 8: naïve cytotoxic, cytotoxic, activated cytotoxic

DRG neurons, C1: 42 of the 47 genes in a hand-curated marker list were picked among the top TF-IDF genes.

T-cells, 10x

Future Work
Support for additional clustering & DE analysis methods, etc.
Integrate imputation: Moussa, M., Mandoiu, I.: Locality sensitive imputation for single-cell RNA-seq data. To appear in ISBRA 2018 proceedings; bioRxiv preprint available.
Cell Cycle Analysis

Thank You. Questions?

TF-IDF based gene selection
Genes with highest avg. TF-IDF, ranked (Top).
Density: an n-component GMM is fitted to the distribution of TF-IDF gene averages, ranked by number of detecting cells; the distribution has a heavy tail.
Figure: genes with highest avg. TF-IDF (highest-mean GMM component in red).
Figure: avg. gene TF-IDF score for a mix of regulatory and memory cells.
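The ranking part of the Top selection can be illustrated with a short sketch. Note the substitution: the slide selects genes via the highest-mean component of a fitted GMM, whereas this example uses a plain top-k cutoff on the average TF-IDF score to stay dependency-free. The function name, the gene names, and the parameter `k` are hypothetical.

```python
def top_genes_by_avg_tfidf(scores, gene_names, k):
    """Rank genes by their average TF-IDF score across cells and keep the top k.

    scores[i][j] is the TF-IDF score of gene i in cell j.
    A fixed top-k cutoff stands in for the GMM-based selection (an assumption).
    """
    averages = [sum(row) / len(row) for row in scores]
    # Sort gene indices by average score, highest first
    order = sorted(range(len(scores)), key=lambda i: averages[i], reverse=True)
    return [(gene_names[i], averages[i]) for i in order[:k]]

# Toy TF-IDF scores: 3 genes x 4 cells, with placeholder gene names
toy_scores = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
]
toy_names = ["geneA", "geneB", "geneC"]
selected = top_genes_by_avg_tfidf(toy_scores, toy_names, k=2)
```

Replacing the top-k cutoff with a mixture-model fit would let the data decide how many genes fall in the heavy tail instead of fixing `k` in advance.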
