Mining and Visualization of Flow Cytometry Data ANGELA CHIN UNIVERSITY OF HOUSTON RESEARCH EXPERIENCE FOR UNDERGRADUATES JULY 3,
Contents
1. Introduction to Flow Cytometry
2. The Problem
3. Current Approaches & Results
4. Future Work
Flow Cytometry MEDICAL TECHNIQUE USED FOR CELL COUNTING AND CELL SORTING
How it Works Picture from: Abcam
Flow Cytometry Application
- Determine whether a person has B-cell lymphoma
- Based on the number of clusters that result from flow cytometry
  - Two clusters: cancer patient
  - Three clusters: healthy individual
Example: Flow Cytometry Results (panels: Cancer Patient, Healthy Patient)
Problems with Current Methods
- The process for determining whether there are two or three clusters is manual
- Doctors’ time could be better spent on other tasks
The Problem CREATING AN AUTOMATED METHOD TO DETERMINE THE NUMBER OF CLUSTERS
Past Approaches
- Many clustering methods exist
- Most need to know the number of clusters ahead of time
- Most popular is k-means, but there are some problems
  - Need to give the algorithm the number of clusters beforehand
  - Has difficulty when clusters are close together, differ in size, etc.
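The size-imbalance difficulty can be illustrated with a minimal sketch, assuming a plain 1D Lloyd's-algorithm k-means written in NumPy. This is not the implementation used in the project; the helper name `kmeans_1d`, the sample sizes, and the 3-standard-deviation gap are illustrative assumptions.

```python
import numpy as np

def kmeans_1d(x, k=2, max_iter=200):
    """Plain Lloyd's algorithm on 1D data; k must be supplied up front."""
    # Hypothetical initialization: spread the starting centers across the data range.
    centers = np.linspace(x.min(), x.max(), k)
    labels = np.zeros(x.size, dtype=int)
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(0)
large = rng.normal(0.0, 1.0, 10_000)   # dominant cluster
small = rng.normal(3.0, 1.0, 10)       # 1000:1 size ratio, 3 SD away
x = np.concatenate([large, small])

labels, centers = kmeans_1d(x, k=2)
smaller_share = np.bincount(labels, minlength=2).min() / x.size
print(f"true minority share:           {small.size / x.size:.4f}")
print(f"k-means smaller-cluster share: {smaller_share:.4f}")
```

With this kind of imbalance, the two-means boundary tends to fall inside the dominant cluster, so the smaller k-means group ends up far larger than the true minority cluster.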
Further Defining the Problem
We want to be able to determine the number of clusters when:
- The distance between clusters is very small
- The ratio of cluster sizes is large (100:1 to 1000:1)
We decided to further constrain the problem to distinguishing 1 cluster from 2 clusters when the size ratio is up to 1000:1
Current Approaches & Results
Two Approaches
Approach #1: Transformation
- Find the center of the data
- For each point, find its angle from the horizontal line through the center (new x-value) and its distance from the center (new y-value)
- Use the transformed data to determine the number of clusters
Approach #2: Testing Normal Fit
- Project the 2D data onto a line to create 1D data
- Fit a normal distribution
- Compare the Bayesian Information Criterion (BIC) of the fit to a cut-off limit
- If the BIC is above the limit, there are two clusters; otherwise, there is one
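The two data-preparation steps can be sketched as follows. The slides do not say how the center or the projection line is chosen, so this sketch assumes the center is the coordinate-wise mean and the line is the first principal axis; both function names are hypothetical.

```python
import numpy as np

def to_polar_about_center(points):
    """Map (x, y) points to (angle, distance) about the data center.

    Assumption: the "center" is the coordinate-wise mean of the points.
    """
    center = points.mean(axis=0)
    dx = points[:, 0] - center[0]
    dy = points[:, 1] - center[1]
    angle = np.arctan2(dy, dx)    # radians in (-pi, pi], the new x-value
    distance = np.hypot(dx, dy)   # distance from the center, the new y-value
    return np.column_stack([angle, distance])

def project_onto_principal_axis(points):
    """Project 2D points onto a line to obtain 1D data.

    Assumption: the line is the first principal axis
    (the direction of greatest variance).
    """
    centered = points - points.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

# Approach #1 on four points placed symmetrically around the origin.
square = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
polar = to_polar_about_center(square)

# Approach #2, first step, on three points lying on the line y = x.
diagonal = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
projected = project_onto_principal_axis(diagonal)
```

The sign of the projected values depends on the orientation the SVD picks for the axis, which does not matter for a later normal fit.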
Approach #1: Transformation
Approach #1: Transformation Process
Approach #2: Testing Normal Fit
Example: clusters 3 standard deviations apart, ratio 1:99 (panels: one-cluster best fit, two-cluster best fit)
Approach #2: Testing Normal Fit
Comparing the BIC of the one-cluster fit versus the two-cluster fit
- All data was generated using points and the same standard deviations
- The ratio between cluster sizes and the distance between the two clusters (if applicable) were varied
  - Ratios: 199:1 to 63:1
  - Distance: 1.5 to 5 standard deviations apart
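The BIC comparison can be sketched on synthetic 1D data as below. This is a minimal illustration under stated assumptions, not the project's code: the sample sizes (6300 and 100 points, a 63:1 ratio at 5 standard deviations), the hand-written EM fit, its initialization, and the function names are all hypothetical. It uses the convention BIC = k ln(n) - 2 ln(L), where a lower value indicates a better fit; some tools report the opposite sign. No cut-off is applied, since the slides do not give one.

```python
import numpy as np

def normal_logpdf(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def bic_one_normal(x):
    """BIC of a single normal fit (2 parameters: mean and variance)."""
    log_lik = normal_logpdf(x, x.mean(), x.var()).sum()
    return 2 * np.log(x.size) - 2.0 * log_lik

def bic_two_normals(x, n_iter=200, var_floor=1e-6):
    """BIC of a two-component normal mixture fit by EM (5 parameters)."""
    # Hypothetical initialization: one component at each end of the data range.
    means = np.array([x.min(), x.max()])
    variances = np.array([x.var(), x.var()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_p = np.log(weights) + normal_logpdf(x[:, None], means, variances)
        log_norm = np.logaddexp(log_p[:, 0], log_p[:, 1])
        resp = np.exp(log_p - log_norm[:, None])
        # M-step: update weights, means, and variances.
        totals = np.maximum(resp.sum(axis=0), 1e-12)  # guard against empty components
        weights = totals / x.size
        means = (resp * x[:, None]).sum(axis=0) / totals
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / totals
        variances = np.maximum(variances, var_floor)  # guard against collapse
    log_p = np.log(weights) + normal_logpdf(x[:, None], means, variances)
    log_lik = np.logaddexp(log_p[:, 0], log_p[:, 1]).sum()
    return 5 * np.log(x.size) - 2.0 * log_lik

rng = np.random.default_rng(0)
major = rng.normal(0.0, 1.0, 6300)   # large cluster
minor = rng.normal(5.0, 1.0, 100)    # small cluster, 5 SD away (63:1)
x_two = np.concatenate([major, minor])

bic_one = bic_one_normal(x_two)
bic_two = bic_two_normals(x_two)
print(f"BIC, one-cluster fit: {bic_one:.1f}")
print(f"BIC, two-cluster fit: {bic_two:.1f}")
```

Varying the sizes and the offset of `minor` reproduces the kind of ratio and distance sweep described above. At smaller separations the gap between the two BIC values is expected to shrink, which is where a cut-off would become hard to choose.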
Future Work
Future Work
- Approach #1: Determine whether there is a way to detect the second cluster in the transformation
- Approach #2: Use real data to see whether a cut-off can be determined
- Overall: After figuring out how to distinguish one cluster from two, extend the method to two versus three clusters
Limitations
- Assumes the data follow a Gaussian distribution
- Number of clusters limited to two or three
Acknowledgements
I would like to thank my research advisor, Dr. Stephen Huang, and Mitch Shih for their guidance on this project. I would also like to thank the University of Houston Computer Science Department and the National Science Foundation for providing me with the opportunity to participate in the REU.