Clustering Features in High-Throughput Proteomic Data Richard Pelikan (or what’s left of him) BIOINF 2054 April
Outline Brief overview of clinical proteomics What do we intend to achieve with machine learning? Modelling profiles through mixture models Evaluation Conclusions
What is proteomics anyway? Proteomics – The study of Proteins and how they affect one’s state of health. Think Genomics, but with proteins instead of genes. It may be much more difficult to map the human proteome than it was to map the human genome. A relatively new field of research. Lots of techniques, lots of ideas, only 25 hours in a day.
Why is proteomics useful? Primary reason: Efficient, early detection and diagnosis of disease. Invasive techniques such as biopsies are relatively high-risk. Not to mention, expensive! Proteomic profiling allows for a non-invasive or minimally invasive way of detecting a malady in a patient. More affordable (for now), allowing for more opportunities for screening. Alternative reason: Prediction of response or non-response to a treatment. Often, getting the treatment is worse than simply living with the disease. Allows for a screening process to determine which treatment is best for a particular patient.
OK, I’m interested. How does proteomics work? It’s spectrometry, my dear Watson. [Figure: spectrometer schematic; labels: Vacuum Tube, Laser, Detector, Lens, Spectral View, Chip Spots, Crocodile]
Proteomic Profiles Some examples from pancreatic cancer patients. In this dataset: 57 healthy patients (controls), 59 cancerous patients (cases). Dataset is from UPCI. [Figure: example profiles; panels: Control, Case]
Feature Reduction Proteomic profiles can have anywhere from 15,000 to 370,000 intensities reported. The pancreatic dataset has 60,270 m/z values. That is too many for a statistical ML model to parameterize each intensity. The goal of feature reduction is to select the parts of profiles which are the most informative about class membership. Feature = an individual intensity measurement. Some features may be redundant. Some features may be noise.
Feature Construction As opposed to the feature filtering approaches above, a new set of features can be constructed to represent the profiles. Techniques such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA) are suited to this task. PCA finds projections of the high-dimensional proteomic data into a low-dimensional subspace. The variance retained in the projection is maximal; since PCA ignores class labels this does not guarantee class separation, but in practice it often leaves enough dispersion between classes for a decision boundary to be drawn. An additional benefit of PCA is that it identifies orthogonal sets of correlated features and constructs new features (components) that are uncorrelated, yet explain most of the variance in the data.
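A minimal sketch of the projection idea, written in plain Python so that it stays self-contained. It uses power iteration on the centered data to recover only the leading principal direction; this is a stand-in chosen for brevity, not necessarily how the components were computed for these slides. The function names (`center`, `first_component`, `project`) and the toy profiles are illustrative assumptions.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every profile (row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    return [[r[k] - means[k] for k in range(d)] for r in rows]

def first_component(rows, iterations=200):
    """Leading principal direction via power iteration on X^T X."""
    X = center(rows)
    d = len(X[0])
    # Non-uniform start vector; an unlucky start orthogonal to the
    # leading direction would make the iteration fail.
    v = [1.0 / (k + 1) for k in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        scores = [sum(x[k] * v[k] for k in range(d)) for x in X]      # X v
        w = [sum(s * x[k] for s, x in zip(scores, X)) for k in range(d)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # degenerate data or unlucky start; keep the last v
        v = [c / norm for c in w]
    return v

def project(rows, v):
    """One new feature per profile: its coordinate along direction v."""
    return [sum(x[k] * v[k] for k in range(len(v))) for x in center(rows)]

# Toy "profiles" whose two intensities are perfectly correlated.
profiles = [[0.0, 0.0], [2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
direction = first_component(profiles)
new_feature = project(profiles, direction)
```

Here the two correlated intensities collapse into a single constructed feature, which is the sense in which PCA removes redundancy.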
Creating clustered relations
Mixture Models Let $X = \{x_1, \dots, x_n\}$ be a set of n datapoints. Assume each x is generated from a mixture of m components $M = \{c_1, \dots, c_m\}$, so that $P(x) = \sum_{j=1}^{m} P(c_j)\, P(x \mid c_j)$. This is a mixture model with m components.
Determining component responsibility Using Bayes’ theorem, $P(c_j \mid x) = \frac{P(c_j)\, P(x \mid c_j)}{\sum_{k=1}^{m} P(c_k)\, P(x \mid c_k)}$. Interpret $P(c_j)$ as the prior probability of component j being “turned on”. Interpret $P(x \mid c_j)$ as a basis function which describes the behavior of x given the component $c_j$.
Component Responsibility = Clustering Idea: Use the component responsibilities as features
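As a small illustration of using responsibilities as features, the sketch below applies Bayes' theorem to one datapoint. The prior and likelihood values are made-up numbers for illustration only, and `responsibilities` is a hypothetical helper name.

```python
def responsibilities(priors, likelihoods):
    """Return P(c_j | x) for one datapoint x.

    priors[j]      = P(c_j)
    likelihoods[j] = P(x | c_j), evaluated at this particular x
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(x), the mixture density at x
    return [j / evidence for j in joint]

# Illustrative values for a 3-component mixture.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.10, 0.40, 0.05]

# The new feature vector for x: one soft cluster membership per component.
features = responsibilities(priors, likelihoods)
```

The resulting vector has m entries that sum to one, so a profile with thousands of intensities is summarized by m soft memberships.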
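Changing the basis functions Easy thing to do: say x is generated from a mixture of m Gaussians, $P(x \mid c_j) = \mathcal{N}(x; \mu_j, \Sigma_j)$. Plug it back into the mixture model equation: $P(x) = \sum_{j=1}^{m} P(c_j)\, \mathcal{N}(x; \mu_j, \Sigma_j)$. This is the “Mixture of Gaussians” model.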
Mixture of Gaussians Computation of the posterior $P(c_j \mid x)$ is dependent on $\mu_j$ and $\Sigma_j$. It may not assign proper “credit” to the jth component. Solution: Incorporate a hidden indicator variable $z_j$, which indicates whether or not x was generated by component $c_j$. Interpretation:
Mixture of Gaussians & EM Algorithm Since z is unknown, we can use the EM algorithm to maximize the ODL. In the E-step, we compute the expected values of z given the current parameters. In the M-step, we calculate the most likely values for the parameters of the m components.
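Mixture of Gaussians: M-Step Mean: $\mu_j = \frac{\sum_{i=1}^{n} P(c_j \mid x_i)\, x_i}{\sum_{i=1}^{n} P(c_j \mid x_i)}$ (Co)Variance: $\Sigma_j = \frac{\sum_{i=1}^{n} P(c_j \mid x_i)\, (x_i - \mu_j)(x_i - \mu_j)^{\top}}{\sum_{i=1}^{n} P(c_j \mid x_i)}$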
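The E-step and M-step can be sketched for one-dimensional data as follows. This is a simplified illustration under stated assumptions: scalar variances instead of full covariance matrices, a fixed number of iterations instead of a convergence test, and a small variance floor added for numerical safety. The names `gaussian` and `em_step` and the toy data are hypothetical.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(xs, priors, means, variances):
    """One EM iteration for a 1-D mixture of Gaussians."""
    m = len(priors)
    # E-step: expected value of each indicator, i.e. P(c_j | x_i).
    resp = []
    for x in xs:
        joint = [priors[j] * gaussian(x, means[j], variances[j]) for j in range(m)]
        total = sum(joint)
        resp.append([v / total for v in joint])
    # M-step: responsibility-weighted prior, mean and variance per component.
    new_priors, new_means, new_vars = [], [], []
    for j in range(m):
        weight = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, xs)) / weight
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, xs)) / weight
        new_priors.append(weight / len(xs))
        new_means.append(mean)
        new_vars.append(max(var, 1e-6))  # floor: an assumption, avoids collapse
    return new_priors, new_means, new_vars, resp

# Toy intensities drawn by hand around 1.0 and 5.0.
xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
params = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])  # initial priors, means, variances
for _ in range(50):
    *params, resp = em_step(xs, *params)
priors, means, variances = params
```

After the loop, each row of `resp` holds the component responsibilities for one datapoint, which is the clustered feature representation discussed above.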
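Slight modification… Assume that the Gaussian components are all hyperspherical, that is, $\Sigma_j = \sigma^2 I$ for every component, and let $z_c = 1$ if $c = \arg\min_j \lVert x - \mu_j \rVert^2$ and $z_c = 0$ otherwise. The result? The K-means algorithm. The features? The values where $I(c \mid x) = 1$.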
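A minimal sketch of the resulting hard-assignment procedure, assuming squared Euclidean distance and hand-picked initial centroids. The helper names (`nearest`, `kmeans`, `one_hot`) and the toy points are illustrative, not taken from the original analysis.

```python
def nearest(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, centroids, iterations=20):
    """Alternate hard assignments and centroid updates for a fixed number of rounds."""
    for _ in range(iterations):
        assign = [nearest(p, centroids) for p in points]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                # New centroid = per-dimension mean of the assigned points.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]

def one_hot(index, m):
    """Feature vector with a 1 at the assigned cluster, i.e. where I(c|x) = 1."""
    return [1 if j == index else 0 for j in range(m)]

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
centroids, assign = kmeans(points, [[0.0, 0.0], [5.0, 5.0]])
features = [one_hot(a, len(centroids)) for a in assign]
```

Compared with the soft responsibilities of the Gaussian mixture, each datapoint here contributes to exactly one cluster.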
ML Factor Analysis Now, let x be a linear combination of j factors $z = \{z_1, \dots, z_j\}$ plus some noise u: $x = \Lambda z + u$. Columns of $\Lambda$ represent sources from which x is generated. This is “normal” factor analysis.
Mixture of Factor Analyzers Let x be generated from the z factors, but allow the factors to spread across m loading matrices $\Lambda_1, \dots, \Lambda_m$. Here, the component $c_j$ is something of an indicator variable, so we search for $E_{c,z}(c_j, z \mid x)$. The features are then computed as the weighted posteriors of $c_j$ conditioned on x, with $P(z \mid x)$ as the weight.
Evaluation Step 1: Divide the data into training and testing sets. Step 2: Compute clustered features on the training set. Step 3: Reduce the samples in the testing set to the appropriate clusters. Step 4: Classify the samples using an SVM.
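The four steps can be sketched as a generic protocol. The point it tries to show is that the clusters are fit on the training set only and the test samples are merely mapped onto them. Everything here is an assumption for illustration: `evaluate` and its callable arguments are hypothetical names, the 30% test fraction is arbitrary, and the toy stand-ins at the bottom (a threshold "clusterer" and a lookup "classifier") only exercise the protocol; in practice an SVM from a library would be passed as `fit_classifier`.

```python
import random

def evaluate(samples, labels, fit_clusters, to_features, fit_classifier,
             test_fraction=0.3, seed=0):
    """Return test accuracy for a cluster-then-classify pipeline."""
    # Step 1: divide the data into training and testing sets.
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    train_x = [samples[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    test_x = [samples[i] for i in test_idx]
    test_y = [labels[i] for i in test_idx]
    # Step 2: fit the clustering model on the training set only.
    model = fit_clusters(train_x)
    # Step 3: reduce both sets to cluster features using that same model.
    train_f = [to_features(model, x) for x in train_x]
    test_f = [to_features(model, x) for x in test_x]
    # Step 4: train the classifier on training features, score on test features.
    predict = fit_classifier(train_f, train_y)
    correct = sum(predict(f) == y for f, y in zip(test_f, test_y))
    return correct / len(test_y)

# Toy stand-ins (hypothetical): a one-threshold "clusterer" and a lookup "classifier".
seen_sizes = []

def fit_threshold(train_x):
    seen_sizes.append(len(train_x))       # records how much data the clusterer saw
    return sum(train_x) / len(train_x)    # threshold = mean training value

def threshold_feature(threshold, x):
    return [1 if x > threshold else 0]

def fit_lookup(features, labels):
    table = {}
    for f, y in zip(features, labels):
        table.setdefault(tuple(f), []).append(y)
    majority = {f: max(set(ys), key=ys.count) for f, ys in table.items()}
    return lambda f: majority[tuple(f)]

samples = [0.0, 0.2, 0.4, 0.6, 0.8, 10.0, 10.2, 10.4, 10.6, 10.8]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
accuracy = evaluate(samples, labels, fit_threshold, threshold_feature, fit_lookup)
```

Keeping the clustering step inside the training fold avoids leaking information from the test samples into the constructed features.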
Mixture of Gaussians
K-means
Mixture of Factor Analyzers
Summary & Comparison PCA is given as a baseline for “good performance”. Mixture of Gaussians does well, but its behavior after adding features is still unclear. K-means is somewhat competitive. MFA is likely too complicated for this task.
Conclusions There are many ways you can cluster features in order to discover regulated sources. Sources can be examined for domain-specific importance. Choosing the number of sources is an open problem. Still, the performance of these techniques was not substantially better than simple PCA. Save yourself time and effort: go with a simple model.