Faithful Sampling for Spectral Clustering to Analyze High Throughput Flow Cytometry Data. Parisa Shooshtari. School of Computing Science, Simon Fraser University, Burnaby; Brinkman’s Lab, Terry Fox Laboratory, BC Cancer Agency, Vancouver
Outline: Flow Cytometry (FCM) Data; Clustering of FCM Data; Spectral Clustering; Faithful Sampling for Spectral Clustering; Results; Summary
Basics of Flow Cytometry Technique. Figure: fluorescence intensity of a sample plotted against wavelength, with intensities Int-1 and Int-2 marked for the MHC-II and CD-11c markers.
Cell Population Identification in Flow Cytometry (FCM). Figure: scatter plots with axes labelled Parameter 1 to Parameter 4 and a gate labelled X%. Now consider that this cell is just one of thousands of cells flowing through a tube one at a time. These cells can be differentiated by their fluorescence intensity, which indicates, for example, the presence or absence of a particular cell surface protein. Here each dot represents an individual cell. Axes indicate intensity at different wavelengths. A gate can then be drawn to select a particular subset of the cell population with common intensities. Further sub-setting can be done based on 1-D and 2-D projections of the data. Adapted from the Science Creative Quarterly (2)
Importance of FCM Data Clustering. Manual gating is subjective, error-prone and time-consuming, and it ignores the multivariate nature of the data. Analyzing large FCM data sets (with up to 19 dimensions and 1,000,000 points) is impractical without the aid of automated techniques.
Which Clustering Algorithm Is Suitable? Model-based algorithms such as flowClust, flowMerge and FLAME are not suitable for clusters with non-elliptical shapes. Figure panels: a good clustering; flowMerge. Axis label: GFP.
Our Motivation for Using Spectral Clustering. Spectral clustering does not require any a priori assumption about cluster size, shape or distribution. It is not sensitive to outliers, noise or the shape of clusters.
Spectral Clustering in One Slide. Represent the data set by a similarity graph. Construct the graph: the vertices are the data points p1, p2, …, pn; the edge weights are the similarity values S_{i,j}. Clustering: find a cut through the graph by defining a cut objective function and solving it.
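The graph construction and cut described on this slide can be sketched in a few lines. This is a minimal illustration, not the SamSPECTRAL implementation: the Gaussian kernel exp(-d^2 / (2σ^2)), the normalized Laplacian, and the two-way split by the sign of the second eigenvector are assumptions chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def gaussian_similarity(points, sigma):
    """Dense similarity matrix S[i, j] = exp(-||p_i - p_j||^2 / (2 sigma^2)).

    The Gaussian kernel is an assumption made for this sketch.
    """
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def spectral_bipartition(similarity):
    """Two-way cut from the second eigenvector of the normalized graph Laplacian."""
    degree = similarity.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^(-1/2) S D^(-1/2)
    laplacian = np.eye(len(degree)) - inv_sqrt[:, None] * similarity * inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    fiedler = eigenvectors[:, 1] * inv_sqrt      # rescale to the random-walk eigenvector
    return (fiedler > 0).astype(int)

# Two well-separated synthetic blobs of 20 points each (illustrative data only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=3.0, scale=0.3, size=(20, 2))
points = np.vstack([blob_a, blob_b])
labels = spectral_bipartition(gaussian_similarity(points, sigma=1.0))
```

Both the similarity matrix and the eigendecomposition here are dense, which is exactly what becomes infeasible for hundreds of thousands of cells.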
The Bottleneck of Spectral Clustering. Serious empirical barriers arise when applying this algorithm to large data sets. Time complexity: O(n^3) → 2 years for 300,000 data points (cells). Required memory: O(n^2) → 5 terabytes for 300,000 data points (cells).
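The scaling behind these figures can be checked with simple arithmetic. The sketch below only counts one dense similarity matrix at 8 bytes per entry and scales running time cubically from a smaller problem; both helper names and the 8-byte assumption are illustrative. It does not reproduce the slide's 5 terabyte or 2 year figures, which depend on implementation details the slide does not state.

```python
def dense_matrix_bytes(n_points, bytes_per_entry=8):
    """Bytes needed for one dense n x n matrix (8 bytes per entry assumes float64)."""
    return n_points * n_points * bytes_per_entry

def cubic_time_ratio(n_small, n_large):
    """Factor by which an O(n^3) step slows down when n grows from n_small to n_large."""
    return (n_large / n_small) ** 3

n_cells = 300_000
matrix_terabytes = dense_matrix_bytes(n_cells) / 1e12  # one matrix only, about 0.72 TB
slowdown = cubic_time_ratio(3_000, n_cells)            # 100x more points costs 10^6x more time
```

Even this single-matrix lower bound is far beyond typical workstation memory, and a hundredfold increase in the number of cells multiplies the cubic step by a million.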
Faithful Sampling: Our Solution for Applying Spectral Clustering to Large Data. Uniform sampling: low-density populations close to dense ones may not remain distinguishable. Faithful sampling: tends to choose more samples from the low-density parts of the data than uniform sampling does.
How Does Our Faithful Sampling Preserve Information? Space-uniform sampling: it preserves the low-density parts of the data by selecting more samples from them than uniform sampling would. Keeping the list of points in the neighbourhood of each sample: this list is used to define similarities between communities.
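One way to realize a space-uniform sample that also keeps each sample's neighbourhood is sketched below. It is an assumption-laden illustration rather than the authors' code: a random unregistered point becomes a representative, every unregistered point within a hypothetical `radius` joins its community, and this repeats until no point is left. Defining community similarity as the sum of pairwise Gaussian similarities is likewise an assumption for the example.

```python
import numpy as np

def faithful_sampling(points, radius, seed=0):
    """Pick representatives roughly uniformly in space and record their communities."""
    rng = np.random.default_rng(seed)
    unregistered = np.ones(len(points), dtype=bool)
    representatives, communities = [], []
    while unregistered.any():
        # Choose a random point that no earlier representative has claimed.
        rep = int(rng.choice(np.flatnonzero(unregistered)))
        dists = np.linalg.norm(points - points[rep], axis=1)
        # Its community is every still-unclaimed point within the radius (itself included).
        members = np.flatnonzero(unregistered & (dists <= radius))
        unregistered[members] = False
        representatives.append(rep)
        communities.append(members)
    return representatives, communities

def community_similarity(points, members_a, members_b, sigma):
    """Sum of pairwise Gaussian similarities between two communities (assumed definition)."""
    diffs = points[members_a][:, None, :] - points[members_b][None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return float(np.exp(-sq_dists / (2.0 * sigma ** 2)).sum())

# Illustrative data: a dense population of 500 points next to a sparse one of 25.
rng = np.random.default_rng(1)
dense = rng.normal(0.0, 0.5, size=(500, 2))
sparse = rng.normal(4.0, 0.5, size=(25, 2))
points = np.vstack([dense, sparse])
representatives, communities = faithful_sampling(points, radius=0.4)
```

Because no two representatives can lie within `radius` of each other, a dense region contributes far fewer representatives per point than a sparse one, while the community lists retain how many points each representative stands for.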
Clustering Result Low density populations surrounded by dense ones
Clustering Result. Populations with non-elliptical shapes; subpopulations of a major population. Figure panels: SamSPECTRAL, flowMerge, FLAME.
Dependency of SamSPECTRAL Results on Scaling Factor (σ). Figure panels: σ = 100, σ = 200, σ = 300, σ = 400. Labelled populations: Monocytes, Dendritic Cells, B Cells.
Block Diagram of Clustering Ensemble Method. Run SamSPECTRAL with scaling factors σ1, σ2, …, σr → build new feature vectors → compute similarities between categorical feature vectors → SamSPECTRAL for categorical data → final results.
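The middle two boxes of this diagram can be sketched as follows. The slide does not say how the similarity between categorical feature vectors is computed, so the fraction of runs in which two items share a label is an assumption made for illustration; the function names and the three toy labelings are hypothetical.

```python
import numpy as np

def ensemble_feature_vectors(labelings):
    """Stack r labelings of the same n items into an n x r categorical feature matrix."""
    return np.column_stack(labelings)

def categorical_similarity(features):
    """S[i, j] = fraction of runs in which items i and j received the same label."""
    same_label = features[:, None, :] == features[None, :, :]
    return same_label.mean(axis=2)

# Toy cluster labels for five items from three runs (standing in for σ1, σ2, σ3).
run_1 = np.array([0, 0, 1, 1, 2])
run_2 = np.array([1, 1, 0, 0, 0])
run_3 = np.array([0, 0, 0, 1, 1])
features = ensemble_feature_vectors([run_1, run_2, run_3])
similarity = categorical_similarity(features)
```

Only agreement within a run is compared, so label numbers do not need to be matched across runs. The resulting matrix can then be handed to a graph-based clustering step in place of a distance-based similarity.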
Results After Applying Clustering Ensemble Method. Figure panels (axes CD14 and MHC-II): final result after applying the clustering ensemble method; manual gating. Labelled populations in both panels: Monocytes, B Cells, Dendritic Cells.
Advantages of Using Clustering Ensemble Method. No need for manual setting of initial parameters. Higher quality and stability of clustering results. The F-measure between manual gating and original SamSPECTRAL is on average 0.77 (sd = 0.07). The F-measure between manual gating and our clustering ensemble method is 0.91.
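The slide does not spell out which F-measure variant was used. A common choice when comparing a clustering against manual gates, assumed here purely for illustration, matches each reference population to its best-scoring cluster and averages the resulting F scores weighted by population size.

```python
import numpy as np

def f_measure(reference, predicted):
    """Size-weighted best-match F-measure between reference gates and predicted clusters."""
    reference = np.asarray(reference)
    predicted = np.asarray(predicted)
    total = 0.0
    for ref_label in np.unique(reference):
        in_ref = reference == ref_label
        best = 0.0
        for pred_label in np.unique(predicted):
            in_pred = predicted == pred_label
            overlap = np.sum(in_ref & in_pred)
            if overlap == 0:
                continue
            precision = overlap / in_pred.sum()
            recall = overlap / in_ref.sum()
            best = max(best, 2.0 * precision * recall / (precision + recall))
        # Weight each reference population by its number of cells.
        total += in_ref.sum() * best
    return float(total / len(reference))

manual_gates = [0, 0, 0, 0, 1, 1]   # toy reference labels
clusters = [0, 0, 0, 1, 1, 1]       # toy predicted labels
score = f_measure(manual_gates, clusters)
```

A score of 1.0 means every gated population is recovered exactly by some cluster, regardless of how the cluster labels are numbered.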
Summary. Spectral clustering can now be applied to large data sets through our proposed Faithful (Information Preserving) sampling. This sampling method can be combined with other graph-based clustering algorithms that use different objective functions, to reduce the size of the data. We have shown that SamSPECTRAL has an advantage over model-based clustering methods in identifying: cell populations with non-elliptical shapes; low-density populations surrounded by dense ones; sub-populations of a major population.
Acknowledgement. Committee: Dr. Arvind Gupta, Dr. Ryan Brinkman, Dr. Tobias Kollman. Co-authors on SamSPECTRAL: Habil Zare. Data providers: Connie Eaves, Peter Landsdrop, Keith Humphries.
Thanks for Your Attention!
SamSPECTRAL Algorithm