Flow cytometry data analysis: SPADE for cell population identification and sample clustering 2013.11.17 Narahara
Flow cytometry (FCM) data
  Signals of multiple cell-surface markers are measured for each cell
    Single-cell measurement
    Multi-dimensional: up to 12 in standard flow cytometry, >30 in next-generation mass cytometry
  Analysis 1: Cell population identification
    We want to identify a particular cell population (e.g. CD8+ T-cells).
  Analysis 2: Sample clustering
    We want to predict a phenotype from the FCM pattern.
    We want to measure the similarity between two different samples.
  http://en.wikipedia.org/wiki/Flow_cytometry
Spanning-tree progression analysis of density-normalized events (SPADE)
  First described for cell population identification (Qiu et al., Nature Biotechnology, 2011)
    Unsupervised approach to identify either known or unexpected cell types
    No prior knowledge needed
    Effective 2D visualization of multi-dimensional FCM data as a tree
  Extended to sample classification (Aghaeepour et al., Nature Methods, 2013)
    Meta-SPADE tree and Earth Mover's Distance (EMD)
SPADE for cell population identification
  For each sample
Traditional methods: gating
  Manual gating
    Subject to the user’s knowledge
    Unsuitable for high-throughput data analysis
  Automated gating methods
    Clustering-based
    Often miss continuity (the progression of cellular differentiation) and populations of rare cell types
SPADE outline
  FCM data & manual gating: simulated 2-marker FCM data
  Equal density: rare cell types contribute to clustering equally with abundant types
  Clustering
  Minimum spanning tree: connects all clusters
  Map all cells to clusters
Three or more markers
  The output is a 2D tree structure
1. Down-sampling
  Cells form a high-dimensional point cloud
    #points = #cells
    #dimensions = #markers
  Many other clustering methods tend to capture the most abundant cell populations, whereas rare cell types are either excluded as outliers or absorbed by larger clusters.
  Equalizing the density of the cloud increases the chance of identifying rare cell types.
  LDi: local density of cell i (#cells within its neighborhood)
    Distances between cells are L1 (Manhattan) distances
  User-defined parameters
    OD: outlier density (such as the 1st percentile of all LDs); cells with a local density lower than OD are discarded as noise.
    TD: target density (such as the 5th percentile of all LDs)
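A minimal sketch of density-dependent down-sampling, using only the Python standard library. The slide does not say how cells denser than TD are thinned; the sketch assumes they are kept with probability TD/LDi. The neighborhood size `radius`, the nearest-rank percentile rule, and all function names are illustrative assumptions, not part of SPADE's interface.

```python
import math
import random


def l1(a, b):
    """L1 (Manhattan) distance between two cells given as marker tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def local_densities(cells, radius):
    """LD_i: number of cells within L1 distance `radius` of cell i (itself included)."""
    return [sum(1 for other in cells if l1(c, other) <= radius) for c in cells]


def percentile(values, pct):
    """Nearest-rank percentile (an illustrative choice of percentile rule)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def downsample(cells, radius, od_pct=1, td_pct=5, seed=0):
    """Return (indices of kept cells, local densities of all cells)."""
    rng = random.Random(seed)
    ld = local_densities(cells, radius)
    od = percentile(ld, od_pct)  # outlier density
    td = percentile(ld, td_pct)  # target density
    kept = []
    for i, d in enumerate(ld):
        if d < od:
            continue  # lower than OD: discarded as noise
        # Assumed rule: sparse cells are always kept,
        # dense cells are kept with probability TD / LD_i.
        if d <= td or rng.random() < td / d:
            kept.append(i)
    return kept, ld
```

With a dense population and a small, sparse one, the sparse cells all survive while the dense population is thinned, which is the equalizing effect described above. The pairwise density computation is quadratic in #cells, so this is only suitable for toy data.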
2. Clustering
  Agglomerative (“bottom-up”) method
    Each cell starts as its own cluster
    Clusters are iteratively merged with their nearest cluster
    Single linkage, L1 distance
  Repeat the iterative grouping until the number of clusters is reduced to the user-defined target number (such as 50).
    #clusters is not the expected number of cell types you want to differentiate.
    #clusters defines how much you want to simplify the point cloud.
  Note: single linkage = the minimum distance between two points in two different clusters
  Note: example #clusters: 50 for 8 markers, 300 for 13 markers
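The merging loop can be sketched as below. This is a naive single-linkage implementation for toy data, written only to illustrate the steps on this slide; SPADE's own implementation may differ in its details, and the function names are illustrative.

```python
def l1(a, b):
    """L1 (Manhattan) distance between two cells given as marker tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def single_linkage(cells, n_clusters):
    """Merge clusters bottom-up until only `n_clusters` remain.

    Returns a list of clusters, each a list of indices into `cells`.
    """
    clusters = [[i] for i in range(len(cells))]  # each cell is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: minimum distance between any two members.
                d = min(l1(cells[i], cells[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair
        del clusters[b]
    return clusters
```

For example, `single_linkage([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)], 3)` groups the first three cells, the next two, and leaves the last one alone.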
3. Minimum spanning tree
  Construction of the MST
    An MST is a tree that connects all nodes with the minimal total edge length.
    Each cell cluster = one node; the median marker expression represents the cluster
    Edges are weighted by the distance between nodes
    SPADE uses Boruvka’s algorithm
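A compact sketch of Boruvka's algorithm on the complete graph of cluster medians. The L1 edge weight is an assumption carried over from the earlier steps, and the function names are illustrative rather than SPADE's own.

```python
import statistics


def l1(a, b):
    """L1 (Manhattan) distance between two nodes given as marker tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_median(cells):
    """Per-marker median of the cells in one cluster."""
    return tuple(statistics.median(col) for col in zip(*cells))


def boruvka_mst(nodes):
    """Return MST edges as (u, v, weight) with u < v."""
    n = len(nodes)
    parent = list(range(n))  # union-find forest over components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    edges = []
    n_components = n
    while n_components > 1:
        cheapest = {}  # component root -> cheapest outgoing edge (w, u, v)
        for u in range(n):
            for v in range(u + 1, n):
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue
                candidate = (l1(nodes[u], nodes[v]), u, v)
                for r in (ru, rv):
                    # Comparing (w, u, v) tuples breaks ties consistently.
                    if r not in cheapest or candidate < cheapest[r]:
                        cheapest[r] = candidate
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:  # skip edges whose endpoints were merged this round
                parent[ru] = rv
                edges.append((u, v, w))
                n_components -= 1
    return edges
```

Each round, every component picks its cheapest outgoing edge and all picked edges are added at once, so the number of components at least halves per round.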
4. Up-sampling
  Map each cell to one cluster (node)
    For each cell (cell A), find its nearest down-sampled cell (cell B)
    Assign cell A to the cluster that cell B belongs to.
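The two steps above amount to a nearest-neighbor lookup. A minimal sketch, assuming the same L1 distance as in the earlier steps (the slide does not name the distance used here); the function names are illustrative.

```python
def l1(a, b):
    """L1 (Manhattan) distance between two cells given as marker tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def upsample(all_cells, sampled_cells, sampled_labels):
    """Give every cell the cluster label of its nearest down-sampled cell."""
    labels = []
    for cell_a in all_cells:
        # Index of cell B: the down-sampled cell closest to cell A.
        b = min(range(len(sampled_cells)),
                key=lambda j: l1(cell_a, sampled_cells[j]))
        labels.append(sampled_labels[b])
    return labels
```

For example, with down-sampled cells `[(0, 0), (10, 10)]` labeled `[0, 1]`, the cell `(6, 6)` is assigned to cluster 1 because its L1 distance to `(10, 10)` is 8 versus 12 to `(0, 0)`.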
5. Visualization & identification of cell types
  SPADE visualizes the resulting tree in 2D
    Layout by a modified Fruchterman-Reingold algorithm
  Nodes are colored by the intensity of each marker
    One colored tree per marker
  Identification of cell types is manual.
SPADE for sample clustering
  For comparing multiple data sets
Procedure
  Down-sample each sample separately
    Adjust TD such that each data set contributes the same number of cells
  Pool the down-sampled data into a meta-down-sampled data set, which forms a meta-cloud
  Clustering and MST construction as described for single-sample SPADE
  Feature extraction for each data set
    For each data set, calculate the percentage of cells in each cluster.
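The feature-extraction step can be sketched as follows, assuming each cell of a data set already carries a cluster label from up-sampling onto the shared tree. The function name and the integer label encoding are illustrative.

```python
from collections import Counter


def cluster_percentages(labels, n_clusters):
    """Percentage of one data set's cells falling in each cluster.

    `labels` holds one cluster index (0 .. n_clusters - 1) per cell.
    """
    if not labels:
        raise ValueError("data set contains no cells")
    counts = Counter(labels)
    return [100.0 * counts.get(k, 0) / len(labels) for k in range(n_clusters)]
```

For example, `cluster_percentages([0, 0, 1, 2], 4)` returns `[50.0, 25.0, 25.0, 0.0]`. Each data set yields one such vector, and these vectors are the features compared between samples.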
Classifying samples
  PCA on the cellular distribution
  Distance (dissimilarity) between a pair of samples
    Uses the cellular distribution + the tree structure
    Earth Mover’s Distance (EMD)
Earth mover’s distance
  A measure of the distance between two probability distributions over a region
  Intuitively, EMD is the minimal work (cost) needed to transform one mass of earth spread over a region into another shape
  http://homepages.inf.ed.ac.uk/rbf/CVDICT/cve.htm
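One standard way to state the "minimal work" idea formally is as a flow problem. The notation is introduced here for illustration and is not from the slides: p and q are the two distributions over n locations (each summing to 1), d_ij is the ground distance between locations i and j, and f_ij is the mass moved from i to j.

```latex
\mathrm{EMD}(p, q) \;=\; \min_{f}\; \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij}\, d_{ij}
\quad \text{subject to} \quad
\sum_{j=1}^{n} f_{ij} = p_i, \qquad
\sum_{i=1}^{n} f_{ij} = q_j, \qquad
f_{ij} \ge 0 .
```

The constraints say that all mass at each source location is shipped out and each target location receives exactly its required mass; the objective is the total mass moved times the distance it travels.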
EMD for clusters
  Formulated as a transportation problem
    Nodes: cell clusters
    Edges: weighted by the distance between nodes
  http://csie-data.com/transportation_problem.AxCMS?channel=print
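When the ground distance between two clusters is taken to be the path length along the tree, the transportation problem has a simple closed form: every edge must carry exactly the surplus mass on one side of it, so EMD is the sum over edges of edge weight times that absolute surplus. The sketch below uses this shortcut. Whether Meta-SPADE computes EMD this way is not stated on the slides, so treat the ground distance and the function name as assumptions.

```python
def tree_emd(p, q, edges):
    """EMD between distributions p and q over the nodes of a tree.

    p, q  : per-node mass (e.g. fraction of cells per cluster), equal totals.
    edges : list of (u, v, weight) forming a tree over nodes 0 .. len(p) - 1.
    """
    n = len(p)
    adj = {i: [] for i in range(n)}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # Depth-first traversal from node 0, recording each node's parent edge.
    parent = {0: (None, 0.0)}
    order = []
    stack = [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, w in adj[u]:
            if v not in parent:
                parent[v] = (u, w)
                stack.append(v)

    surplus = [pi - qi for pi, qi in zip(p, q)]
    cost = 0.0
    # Visit children before parents, pushing each subtree's surplus upward.
    for u in reversed(order):
        par, w = parent[u]
        if par is not None:
            cost += w * abs(surplus[u])  # mass that must cross this edge
            surplus[par] += surplus[u]
    return cost
```

For a path 0 - 1 - 2 with edge weights 1 and 2, moving all mass from node 0 to node 2 costs 3. Per-cluster percentages can be passed directly as long as both samples use the same scale.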
Software
  Matlab: http://odin.mdacc.tmc.edu/~pqiu/software/SPADE2/index.html
  R/Bioconductor: http://cytospade.org/