Faithful Sampling for Spectral Clustering to Analyze High Throughput Flow Cytometry Data Parisa Shooshtari School of Computing Science, Simon Fraser University,

Presentation transcript:

Faithful Sampling for Spectral Clustering to Analyze High Throughput Flow Cytometry Data. Parisa Shooshtari, School of Computing Science, Simon Fraser University, Burnaby; Brinkman’s Lab, Terry Fox Laboratory, BC Cancer Agency, Vancouver

Outline: Flow Cytometry (FCM) Data; Clustering of FCM Data; Spectral Clustering; Faithful Sampling for Spectral Clustering; Result; Summary

Basics of Flow Cytometry Technique: [Figure: a sample's fluorescence intensities (Int-1, Int-2) measured at the MHC-II and CD-11c wavelengths and plotted as MHC-II intensity against CD-11c intensity.]

Cell Population Identification in Flow Cytometry (FCM): [Figure: scatter plots with axes Parameter 1 to Parameter 4 and a gate labeled X%.] Now think of this cell as just one of thousands of cells flowing through a tube one at a time. These cells can be differentiated using the fluorescence intensity, which indicates, for example, the presence or absence of a particular cell surface protein. Here each dot represents an individual cell, and the axes indicate intensity at different wavelengths. A gate can then be drawn to select a particular subset of the cell population with common intensities. Further sub-setting can be done based on 1-D and 2-D projections of the data. Adapted from the Science Creative Quarterly (2)

Importance of FCM Data Clustering: Manual gating is subjective, error-prone, and time-consuming, and it ignores the multivariate nature of the data. Analyzing large FCM data sets (with up to 19 dimensions and 1,000,000 points) is impractical without the aid of automated techniques.

Which Clustering Algorithm Is Suitable? Model-based algorithms such as flowClust, flowMerge, and FLAME are not suitable for clusters with non-elliptical shapes. [Figure panels: A Good Clustering; flowMerge; axis: GFP.]

Our Motivation for Using Spectral Clustering: Spectral clustering does not require any a priori assumption about cluster size, shape, or distribution. It is not sensitive to outliers, noise, or the shape of clusters.

Spectral Clustering in One Slide: Represent the data set by a similarity graph. Construct the graph: the vertices are the data points p1, p2, …, pn, and the edge weights are the similarity values s_{i,j}. Clustering: find a cut through the graph by defining a cut objective function and solving it.
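The graph-then-cut recipe on this slide can be sketched in a few lines. The following is a minimal, dependency-free illustration, not the SamSPECTRAL implementation. The Gaussian kernel with scale `sigma` is an assumed, commonly used choice of similarity, the two-way split via the second eigenvector is the standard relaxation of the normalized cut, and the function names are hypothetical. A practical implementation would use a linear-algebra library instead of hand-written power iteration.

```python
import math

def gaussian_similarity(points, sigma):
    """Dense similarity matrix s[i][j] = exp(-d(p_i, p_j)^2 / (2 sigma^2)).

    The Gaussian kernel is an assumption made for this sketch.
    """
    n = len(points)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            s[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return s

def spectral_bipartition(s, iterations=200):
    """Two-way cut from the second eigenvector of D^-1/2 S D^-1/2."""
    n = len(s)
    degree = [sum(row) for row in s]
    root = [math.sqrt(d) for d in degree]
    # Normalized similarity matrix, shifted by the identity so that all
    # eigenvalues are non-negative and power iteration finds the largest.
    m = [[s[i][j] / (root[i] * root[j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # The leading eigenvector is known in closed form: sqrt(degree), normalized.
    total = math.sqrt(sum(degree))
    top = [r / total for r in root]
    v = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        # Project out the leading eigenvector, multiply, renormalize.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        v = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # Undo the D^1/2 scaling and split the vertices by sign.
    return [1 if v[i] / root[i] > 0 else 0 for i in range(n)]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = spectral_bipartition(gaussian_similarity(points, sigma=1.0))
```

Both the similarity matrix and the matrix-vector products are quadratic in the number of points, and a full eigendecomposition is cubic, which is why the dense formulation does not scale.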

The Bottleneck of Spectral Clustering: Serious empirical barriers arise when applying this algorithm to large data sets. Time complexity: O(n^3) → 2 years for 300,000 data points (cells). Required memory: O(n^2) → 5 terabytes for 300,000 data points (cells).
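To see why reducing the number of graph vertices helps so much, the scaling can be worked out directly. The sketch below only counts matrix entries and the order of eigendecomposition operations, ignoring constant factors, so it does not reproduce the absolute time and memory figures on the slide. The sample size of 3,000 is a hypothetical value chosen for illustration.

```python
def dense_spectral_cost(n):
    """Entries of the n x n similarity matrix and the order of operations
    of a full eigendecomposition (constant factors ignored)."""
    return {"matrix_entries": n * n, "eigen_operations": n ** 3}

full = dense_spectral_cost(300_000)
sampled = dense_spectral_cost(3_000)  # hypothetical number of samples

# Sampling by a factor of 100 shrinks memory by 100^2 and time by 100^3.
memory_reduction = full["matrix_entries"] // sampled["matrix_entries"]
time_reduction = full["eigen_operations"] // sampled["eigen_operations"]
```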

Faithful Sampling: Our Solution for Applying Spectral Clustering to Large Data. Uniform sampling: low-density populations close to dense ones may not remain distinguishable. Faithful sampling: tends to choose more samples from non-dense parts of the data.

How Does Our Faithful Sampling Preserve Information? Space-uniform sampling: it preserves low-density parts of the data by selecting more samples from them than uniform sampling does. Keeping the list of points in the neighbourhood of each sample: these lists will be used to define similarities between communities.
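One simple way to obtain samples that are uniform in space rather than in the data, while recording each sample's neighbourhood, is sketched below. It is an illustration under stated assumptions and not necessarily the exact procedure used in the talk: the neighbourhood radius `h`, the random visiting order, and the function name `faithful_sample` are all hypothetical.

```python
import math
import random

def faithful_sample(points, h, seed=0):
    """Pick samples that are pairwise more than h apart.

    Returns a dict mapping each sample's index to the indices of the
    points registered to its neighbourhood (including the sample itself).
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    registered = [False] * len(points)
    neighbourhoods = {}
    for i in order:
        if registered[i]:
            continue  # already covered by an earlier sample
        # Claim every still-unregistered point within distance h of point i.
        members = [j for j in range(len(points))
                   if not registered[j] and math.dist(points[i], points[j]) <= h]
        for j in members:
            registered[j] = True
        neighbourhoods[i] = members
    return neighbourhoods

# 100 tightly packed points (a dense population) and 5 spread-out points.
dense = [(0.001 * k, 0.0) for k in range(100)]
sparse = [(10.0 + 2.0 * k, 10.0) for k in range(5)]
samples = faithful_sample(dense + sparse, h=0.5)
```

In this toy data set the dense population collapses to a single sample with 100 registered points, while each of the five sparse points becomes its own sample. Drawing six points uniformly at random would, in contrast, mostly return dense points.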

Clustering Result: Low-density populations surrounded by dense ones

Clustering Result: Populations with non-elliptical shapes; subpopulations of a major population. [Figure panels: SamSPECTRAL, flowMerge, FLAME.]

Dependence of SamSPECTRAL Results on Scaling Factor (σ): [Figure panels: σ = 100, σ = 200, σ = 300, σ = 400; labeled populations: Monocytes, Dendritic Cells, B Cells.]

Block Diagram of Clustering Ensemble Method: Run SamSPECTRAL once for each scaling factor σ1, σ2, …, σr → Build New Feature Vectors → Compute Similarities Between Categorical Feature Vectors → SamSPECTRAL for Categorical Data → Final Results
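The middle two boxes of the diagram can be illustrated as follows. This is a sketch under assumptions: each run is taken to produce one cluster label per point, the new feature vector of a point is taken to be the tuple of its labels across the r runs, and the similarity between two categorical vectors is assumed to be the fraction of runs in which they agree. The slide does not specify the similarity measure, and both function names are hypothetical.

```python
def categorical_features(label_runs):
    """Turn r label lists (one per scaling factor) into one
    r-component categorical feature vector per data point."""
    return list(zip(*label_runs))

def agreement_similarity(u, v):
    """Fraction of runs in which two points share a cluster label
    (an assumed similarity measure for categorical vectors)."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

# Labels of four points from three hypothetical runs.
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
features = categorical_features(runs)
similar = agreement_similarity(features[0], features[1])     # same label in every run
dissimilar = agreement_similarity(features[0], features[3])  # never share a label
```

Because only label agreement is compared, the arbitrary numbering of clusters within a run does not matter, but labels from different runs are never compared with each other.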

Results After Applying Clustering Ensemble Method: [Figure: CD14 vs. MHC-II plots comparing manual gating with the final result after applying the clustering ensemble method; labeled populations: Monocytes, B Cells, Dendritic Cells.]

Advantages of Using Clustering Ensemble Method: No need for manual setting of initial parameters. Higher quality and stability of clustering results. The F-measure between manual gating and the original SamSPECTRAL is on average 0.77 (sd = 0.07). The F-measure between manual gating and our clustering ensemble method is 0.91.
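For readers unfamiliar with the F-measure as a way to compare a clustering with manual gates, the sketch below shows one common definition: each manual gate is matched to the cluster that gives it the highest harmonic mean of precision and recall, and the per-gate scores are averaged with weights proportional to gate size. Whether the reported values were computed with exactly this variant is an assumption.

```python
from collections import Counter

def f_measure(reference, predicted):
    """Size-weighted best-match F-measure of a clustering against
    reference labels (for example, manual gates)."""
    n = len(reference)
    ref_sizes = Counter(reference)
    pred_sizes = Counter(predicted)
    overlap = Counter(zip(reference, predicted))
    total = 0.0
    for r, r_size in ref_sizes.items():
        best = 0.0
        for p, p_size in pred_sizes.items():
            common = overlap[(r, p)]
            if common:
                precision = common / p_size
                recall = common / r_size
                best = max(best, 2 * precision * recall / (precision + recall))
        total += (r_size / n) * best
    return total

gates = ["B", "B", "M", "M"]
perfect = f_measure(gates, [1, 1, 0, 0])            # same partition, other names
imperfect = f_measure(["a", "a", "a", "b"], [0, 0, 1, 1])
```

A value of 1.0 means every gate is reproduced exactly by some cluster, regardless of how the clusters are named.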

Summary: Spectral clustering can now be applied to large data sets using our proposed faithful (information-preserving) sampling. This sampling method can be combined with other graph-based clustering algorithms that use different objective functions in order to reduce the size of the data. We have shown that SamSPECTRAL has an advantage over model-based clustering methods in identifying cell populations with non-elliptical shapes, low-density populations surrounded by dense ones, and sub-populations of a major population.

Acknowledgement. Committee: Dr. Arvind Gupta, Dr. Ryan Brinkman, Dr. Tobias Kollman. Co-authors on SamSPECTRAL: Habil Zare. Data providers: Connie Eaves, Peter Landsdrop, Keith Humphries.

Thanks for Your Attention!

SamSPECTRAL Algorithm
