Flow cytometry data analysis: SPADE for cell population identification and sample clustering 2013.11.17 Narahara.



Flow cytometry (FCM) data
Signals of multiple cell-surface markers are measured for each cell.
  Single-cell measurement.
  Multi-dimensional: up to 12 markers in standard flow cytometry, more than 30 in next-generation mass cytometry.
Analysis 1: Cell population identification
  We want to identify a particular cell population (e.g. CD8+ T-cells).
Analysis 2: Sample clustering
  We want to predict a phenotype from the FCM pattern.
  We want to measure the similarity between two different samples.
Figure source: http://en.wikipedia.org/wiki/Flow_cytometry

Spanning-tree Progression Analysis of Density-normalized Events (SPADE)
First described for cell population identification (Qiu et al., Nature Biotechnology, 2011).
  Unsupervised approach to identify either known or unexpected cell types.
  No prior knowledge is needed.
  Effective 2D visualization of multi-dimensional FCM data as a tree structure.
Extended to sample classification (Aghaeepour et al., Nature Methods, 2013).
  Meta-SPADE tree and Earth Mover's Distance (EMD).

SPADE for cell population identification
For each sample.

Traditional methods (gating)
Manual gating
  Subject to the user’s knowledge.
  Unsuitable for high-throughput data analysis.
Automated gating methods
  Clustering-based.
  Often miss continuity (the progression of cellular differentiation) and populations of rare cell types.

SPADE outline
FCM data & manual gating (illustrated with simulated 2-marker FCM data).
Equal density: rare cell types contribute to clustering as much as abundant types.
Clustering.
Minimum spanning tree: connects all clusters.
Map all cells to clusters.

Three or more markers
The output is still a 2D tree structure.

1. Down-sampling
Cells form a high-dimensional point cloud.
  #points = #cells
  #dimensions = #markers
Many other clustering methods tend to capture the most abundant cell populations, whereas rare cell types are either excluded as outliers or absorbed by larger clusters.
Equalizing the density of the cloud increases the chance of identifying rare cell types.
LDi: local density of cell i (#cells within its neighborhood).
  Distances between cells are L1 (Manhattan) distances.
User-defined parameters
  OD: outlier density (such as the 1st percentile of all LDs). Cells with a local density lower than OD are discarded as noise.
  TD: target density (such as the 5th percentile of all LDs).
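
A minimal sketch of density-dependent down-sampling, written in plain Python for illustration only. The slide does not give the retention rule, so the one used here is an assumption: cells below OD are dropped, cells between OD and TD are always kept, and denser cells are kept with probability TD/LDi. The function names, the fixed neighborhood `radius`, and the nearest-rank percentile are also hypothetical choices, not the SPADE implementation.

```python
import math
import random


def l1(a, b):
    """L1 (Manhattan) distance between two cells given as marker tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def local_densities(cells, radius):
    """LD_i = number of cells within L1 distance `radius` of cell i (itself included)."""
    return [sum(1 for other in cells if l1(c, other) <= radius) for c in cells]


def percentile(values, pct):
    """Nearest-rank percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def downsample(cells, radius, od_pct=1.0, td_pct=5.0, seed=0):
    """Return (kept_indices, od, td) after density-dependent down-sampling."""
    rng = random.Random(seed)
    ld = local_densities(cells, radius)
    od = percentile(ld, od_pct)  # outlier density
    td = percentile(ld, td_pct)  # target density
    kept = []
    for i, d in enumerate(ld):
        if d < od:
            continue  # discarded as noise
        # Assumed rule: keep sparse cells, thin dense regions toward TD.
        keep_prob = 1.0 if d <= td else td / d
        if rng.random() < keep_prob:
            kept.append(i)
    return kept, od, td


# Toy data: 200 cells in one dense population, 20 cells in a rare one.
dense = [(0.01 * (i % 20), 0.01 * (i // 20)) for i in range(200)]
rare = [(10.0 + 0.01 * j, 10.0) for j in range(20)]
kept, od, td = downsample(dense + rare, radius=1.0)
print(len([i for i in kept if i < 200]), "dense cells kept,",
      len([i for i in kept if i >= 200]), "rare cells kept")
```

After this step the rare population makes up a much larger share of the retained cells than of the original cloud, which is the point of equalizing density.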

2. Clustering
Agglomerative (“bottom-up”) method
  Each cell starts as its own cluster.
  Each cluster is iteratively merged with its nearest cluster.
  Single linkage with L1 distance.
  Grouping is repeated until the number of clusters is reduced to the user-defined target number (such as 50).
#clusters is not the number of cell types you expect to differentiate; it defines how much you want to simplify the point cloud.
Note: single linkage is the minimum distance between two points in two different clusters.
Note: example #clusters: 50 for 8 markers, 300 for 13 markers.
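
The agglomerative step can be sketched as below. This is a naive single-linkage implementation with L1 distance, intended only to make the merge rule concrete; it is far too slow for real FCM data, and the function name `single_linkage` is hypothetical rather than part of any SPADE package.

```python
def l1(a, b):
    """L1 (Manhattan) distance between two cells given as marker tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def single_linkage(cells, n_clusters):
    """Merge clusters bottom-up until only `n_clusters` remain.

    Returns a list of clusters, each a list of indices into `cells`.
    """
    clusters = [[i] for i in range(len(cells))]  # every cell starts alone
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: minimum distance over all cross-cluster pairs.
                d = min(l1(cells[i], cells[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge b into a
        del clusters[b]
    return clusters


cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
print(single_linkage(cells, n_clusters=3))
```

Here `n_clusters` plays the role of the user-defined target number on the slide: it controls how coarsely the cloud is summarized, not how many cell types exist.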

3. Minimum spanning tree
Construction of the MST
  An MST is a tree that links all nodes with the minimal total edge length.
  Each cell cluster is a node; the median marker expression represents the cluster.
  Edges are weighted by the distance between nodes.
  SPADE uses Borůvka’s algorithm.
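
A small sketch of this step: each cluster is reduced to its per-marker median, and Borůvka's algorithm links the medians into an MST. The slide does not say which distance weights the edges, so L1 is assumed here to match the earlier steps; the function names and the tie-breaking rule (comparing `(weight, i, j)` tuples) are illustrative choices.

```python
from statistics import median


def l1(a, b):
    """L1 (Manhattan) distance between two nodes given as marker tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_median(cells):
    """Per-marker median of the cells in one cluster."""
    return tuple(median(column) for column in zip(*cells))


def boruvka_mst(nodes):
    """Return MST edges as (i, j, weight) over a complete graph of `nodes`."""
    n = len(nodes)
    parent = list(range(n))  # union-find forest over components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = []
    components = n
    while components > 1:
        cheapest = {}  # component root -> cheapest outgoing (weight, i, j)
        for i in range(n):
            for j in range(i + 1, n):
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                cand = (l1(nodes[i], nodes[j]), i, j)
                for root in (ri, rj):
                    if root not in cheapest or cand < cheapest[root]:
                        cheapest[root] = cand
        # Add every component's cheapest edge, skipping duplicates.
        for w, i, j in sorted(set(cheapest.values())):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
                edges.append((i, j, w))
                components -= 1
    return edges


nodes = [cluster_median(c) for c in (
    [(0, 0), (0, 2), (1, 1)],
    [(5, 5), (6, 5), (7, 5)],
    [(20, 0), (22, 0), (21, 3)],
)]
print(boruvka_mst(nodes))
```

Comparing whole `(weight, i, j)` tuples gives every edge a unique rank, which keeps Borůvka's rounds from forming a cycle when two edges have equal weight.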

4. Up-sampling
Mapping each cell to one cluster (node)
  Assign each cell (cell A) to its nearest down-sampled cell (cell B).
  Assign cell A to the cluster that cell B belongs to.
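
The up-sampling rule is a nearest-neighbor lookup, sketched below with a brute-force search. The slide does not name the distance used to find the nearest down-sampled cell, so L1 is assumed for consistency with the other steps; `upsample` is a hypothetical helper name.

```python
def l1(a, b):
    """L1 (Manhattan) distance between two cells given as marker tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def upsample(all_cells, sampled_cells, sampled_labels):
    """Give every original cell the cluster label of its nearest down-sampled cell."""
    labels = []
    for cell in all_cells:
        nearest = min(range(len(sampled_cells)),
                      key=lambda k: l1(cell, sampled_cells[k]))
        labels.append(sampled_labels[nearest])
    return labels


sampled_cells = [(0, 0), (10, 10)]   # cells that survived down-sampling
sampled_labels = [0, 1]              # their cluster (node) ids
all_cells = [(1, 1), (9, 9), (0, 0), (10, 12)]
print(upsample(all_cells, sampled_cells, sampled_labels))
```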

5. Visualization & identification of cell types
SPADE visualizes the resulting tree as a 2D structure.
  A modified Fruchterman-Reingold algorithm is used for the layout.
Nodes are colored by the intensity of each marker.
  One colored tree per marker.
Identification of cell types is manual.

SPADE for sample clustering
For comparing multiple data sets.

Procedure
Down-sampling is done separately for each sample.
  Adjust TD such that each data set contributes the same number of cells.
Pool the down-sampled data into a meta-down-sampled data set, which forms a meta-cloud.
Clustering and MST construction as described for single-sample SPADE.
Feature extraction for each data set
  For each data set, calculate the percentage of cells in each cluster.
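
The feature-extraction step reduces each data set to one vector with one entry per cluster of the meta-tree. A minimal sketch, assuming each cell already carries a cluster id from up-sampling (the function name is hypothetical):

```python
from collections import Counter


def cluster_percentages(labels, n_clusters):
    """Percentage of one sample's cells that fall in each cluster 0..n_clusters-1."""
    if not labels:
        raise ValueError("sample has no cells")
    counts = Counter(labels)
    total = len(labels)
    return [100.0 * counts.get(k, 0) / total for k in range(n_clusters)]


# Cluster ids of the cells of two samples mapped onto the same 4-node meta-tree.
sample_a = [0, 0, 1, 2]
sample_b = [3, 3, 3, 0]
print(cluster_percentages(sample_a, 4))
print(cluster_percentages(sample_b, 4))
```

Because every sample is described over the same clusters, the resulting vectors are directly comparable across samples.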

Classifying samples
PCA using the cellular distribution.
Distance (dissimilarity) between a pair of samples
  Cellular distribution + tree structure → Earth Mover’s Distance (EMD)

Earth mover’s distance
A measure of the distance between two probability distributions over a region.
Intuitively, EMD is the minimal work (cost) needed to transform a mass of earth spread over a region into another shape.
Source: http://homepages.inf.ed.ac.uk/rbf/CVDICT/cve.htm
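
For reference, one standard way to write EMD for two discrete distributions of equal total mass is as a transportation problem. The symbols are introduced here for illustration and do not appear on the slides: p_i and q_j are the masses at locations i and j, d_ij is the ground distance between them, and f_ij is the amount of mass moved from i to j.

```latex
\mathrm{EMD}(p, q) \;=\; \min_{f_{ij} \ge 0} \; \sum_{i}\sum_{j} f_{ij}\, d_{ij}
\quad \text{subject to} \quad
\sum_{j} f_{ij} = p_i \;\; \forall i,
\qquad
\sum_{i} f_{ij} = q_j \;\; \forall j .
```

The objective is the total work (mass times distance) and the constraints say that all mass leaves p and arrives as q.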

EMD for clusters
Transportation problem
  Nodes are connected by edges weighted by distance.
Figure source: http://csie-data.com/transportation_problem.AxCMS?channel=print
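
When the ground distance between two clusters is taken to be the path length along the tree, the transportation problem has a simple closed form: each edge carries exactly the surplus mass of the subtree hanging below it. The sketch below uses that shortcut. Treating the tree path length as the ground distance is an assumption made here for illustration; the slides do not specify how the published method defines it, and `tree_emd` is a hypothetical name.

```python
def tree_emd(n_nodes, edges, p, q):
    """EMD between distributions p and q on the nodes of a weighted tree.

    edges: list of (i, j, weight) forming a tree over nodes 0..n_nodes-1.
    p, q:  per-node masses with the same total (e.g. fractions summing to 1).
    """
    adjacency = {k: [] for k in range(n_nodes)}
    for i, j, w in edges:
        adjacency[i].append((j, w))
        adjacency[j].append((i, w))

    # Depth-first traversal from node 0; parents are visited before children.
    parent = {0: (None, 0.0)}
    order = []
    stack = [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, w in adjacency[u]:
            if v not in parent:
                parent[v] = (u, w)
                stack.append(v)

    surplus = [p[k] - q[k] for k in range(n_nodes)]
    cost = 0.0
    for u in reversed(order):  # children before parents
        par, w = parent[u]
        if par is not None:
            cost += w * abs(surplus[u])  # mass that must cross edge (par, u)
            surplus[par] += surplus[u]
    return cost


# Path 0 - 1 - 2 with edge lengths 1 and 2; all mass moves from node 0 to node 2.
edges = [(0, 1, 1.0), (1, 2, 2.0)]
print(tree_emd(3, edges, p=[1.0, 0.0, 0.0], q=[0.0, 0.0, 1.0]))
```

Two samples with the same cluster percentages get distance 0, and moving mass between distant branches of the tree costs more than moving it between neighboring nodes.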

Software
Matlab: http://odin.mdacc.tmc.edu/~pqiu/software/SPADE2/index.html
R/Bioconductor: http://cytospade.org/