
Parallel Deterministic Annealing Clustering and its Application to LC-MS Data Analysis
October 7, 2013
IEEE International Conference on Big Data 2013 (IEEE BigData 2013), Santa Clara CA
Geoffrey Fox, D. R. Mani, Saumyadipta Pyne
School of Informatics and Computing, Indiana University Bloomington

Challenge
~1990: Deterministic Annealing introduced for clustering, but no broadly available implementations, although most tests rate it well
2000: Extended to non-metric spaces (Hofmann and Buhmann); applied to dimension reduction, PLSI, Gaussian mixtures etc. (Fox et al.)
2011: Applied to the model-dependent, single-cluster-at-a-time "peak matching" problem of precisely identifying the common LC-MS peaks across a cohort of multiple biological samples in proteomic biomarker discovery, for data from a human tuberculosis cohort (Frühwirth, Mani and Pyne)
Here we apply "multi-cluster", "annealed models", "Continuous Clustering" and "parallelism" to the proteomics case, giving a high-performance, automatic, robust method to quickly analyze proteomics samples such as those taken in rapidly spreading epidemics

Deterministic Annealing Algorithms

Some Motivation
Big Data requires high performance – achieve it with parallel computing
Big Data sometimes requires robust algorithms, as there is more opportunity to make mistakes
Deterministic annealing (DA) is one of the better approaches to robust optimization
– Started as the "Elastic Net" by Durbin for the Travelling Salesman Problem (TSP)
– Tends to remove local optima
– Addresses overfitting
– Much faster than simulated annealing
Physics systems find the true lowest-energy state if you anneal, i.e. you equilibrate at each temperature as you cool
Uses the mean field approximation, which is also used in "Variational Bayes" and "Variational inference"

(Deterministic) Annealing
Find the minimum at high temperature, when it is trivial
Make small changes as the temperature is lowered, avoiding local minima
Typically gets better answers than standard libraries – R and Mahout
And can be parallelized and put on GPUs etc.

Basic Deterministic Annealing
H(ξ) is the objective function to be minimized as a function of parameters ξ
Gibbs Distribution at Temperature T: P(ξ) = exp(−H(ξ)/T) / ∫ dξ exp(−H(ξ)/T)
Or P(ξ) = exp(−H(ξ)/T + F/T)
Minimize the Free Energy, combining Objective Function and Entropy: F = <H> − T S(P) = ∫ dξ {P(ξ) H(ξ) + T P(ξ) ln P(ξ)}
Simulated annealing performs these integrals by Monte Carlo
Deterministic annealing corresponds to doing the integrals analytically (by mean field approximation) and is much, much faster
In each case the temperature is lowered slowly – say by a factor such as 0.95 at each iteration
Start with one cluster (all there is at T = ∞); others emerge automatically as T decreases
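As a concrete illustration of the slide above, here is a minimal, self-contained Python sketch (a toy one-dimensional objective of our own choosing, not the authors' code) showing how the Gibbs distribution is essentially flat at high T and concentrates on the global minimum as T is lowered, and how F = <H> + T<ln P> is evaluated:

```python
import numpy as np

# Toy parameter grid and a double-well objective H (both purely illustrative).
xi = np.linspace(-2.0, 2.0, 401)
H = (xi**2 - 1.0)**2 + 0.3 * xi

for T in [10.0, 1.0, 0.1, 0.01]:
    w = np.exp(-(H - H.min()) / T)           # unnormalized Gibbs weights exp(-H/T)
    P = w / w.sum()                           # discrete stand-in for the normalizing integral
    mean_xi = (P * xi).sum()                  # <xi> under P: drifts toward the global minimum as T drops
    F = (P * H).sum() + T * (P * np.log(P + 1e-300)).sum()   # F = <H> + T <ln P>
    print(f"T={T:5.2f}  <xi>={mean_xi:+.3f}  F={F:.3f}")
```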

Implementation of DA Central Clustering
Here the points to be clustered are in a metric space
The clustering variables are M_i(k), the probability that point i belongs to cluster k, with Σ_{k=1}^K M_i(k) = 1
In Central or PW Clustering, take H_0 = Σ_{i=1}^N Σ_{k=1}^K M_i(k) ε_i(k)
– The linear form allows the DA integrals to be done analytically
Central clustering has ε_i(k) = (X(i) − Y(k))², so H_Central = Σ_{i=1}^N Σ_{k=1}^K M_i(k) (X(i) − Y(k))²
<M_i(k)> is determined in the Expectation (E) step: <M_i(k)> = exp(−ε_i(k)/T) / Σ_{k'=1}^K exp(−ε_i(k')/T)
Centers Y(k) are determined in the M step of the EM method
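A minimal Python sketch of this E/M sweep (an illustration of the equations on the slide, not the actual parallel implementation; the function name and array shapes are our own):

```python
import numpy as np

def da_central_step(X, Y, T):
    """One EM sweep of DA central clustering at temperature T (illustrative sketch).

    X: (N, d) points, Y: (K, d) current cluster centers.
    Returns updated centers and the association probabilities M[i, k].
    """
    eps = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # eps_i(k) = |X(i) - Y(k)|^2
    eps -= eps.min(axis=1, keepdims=True)                       # stabilize the exponentials
    M = np.exp(-eps / T)
    M /= M.sum(axis=1, keepdims=True)                           # E step: <M_i(k)>
    C = M.sum(axis=0)                                           # C(k): effective cluster sizes
    Y_new = (M.T @ X) / C[:, None]                              # M step: new centers Y(k)
    return Y_new, M
```

Repeating `Y, M = da_central_step(X, Y, T)` a few times at each temperature, then lowering T, gives the basic annealing loop described later in the "Basic Equations" slide.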

Trimmed Clustering
Clustering with position-specific constraints on variance: Applying redescending M-estimators to label-free LC-MS data analysis (Rudolf Frühwirth, D. R. Mani and Saumyadipta Pyne), BMC Bioinformatics 2011, 12:358
H_TCC = Σ_{k=0}^K Σ_{i=1}^N M_i(k) f(i,k)
– f(i,k) = (X(i) − Y(k))² / 2σ(k)² for k > 0
– f(i,0) = c² / 2 for k = 0
The 0'th cluster captures (at zero temperature) all points outside clusters (background)
Clusters are trimmed: (X(i) − Y(k))² / 2σ(k)² < c² / 2
Relevant when errors are well defined
[Figure: cluster cut-off versus distance from cluster center at T ~ 0, T = 1 and T = 5]
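A short Python sketch of the trimmed cost under the definitions above (the function name and array layout are our own assumptions):

```python
import numpy as np

def trimmed_cost(X, Y, sigma, c):
    """Cost matrix f[i, k] for trimmed (DAVS-style) clustering; column 0 is the sponge.

    Sketch following the slide: f(i,k) = |X(i)-Y(k)|^2 / (2 sigma(k)^2) for k > 0,
    and f(i,0) = c^2 / 2 for the background cluster.
    """
    N, K = X.shape[0], Y.shape[0]
    f = np.empty((N, K + 1))
    f[:, 0] = 0.5 * c**2                                        # sponge cost, independent of position
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    f[:, 1:] = d2 / (2.0 * sigma[None, :] ** 2)
    return f

# At T -> 0 each point takes its cheapest column, so a point stays in a real cluster
# only if |X(i)-Y(k)|^2 / (2 sigma(k)^2) < c^2 / 2 -- the trimming condition above.
```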

Key Features of Proteomics
2-dimensional data arising from a list of peaks specified by points (m/Z, RT), where m/Z is the mass-to-charge ratio and RT the retention time for the peak represented by a peptide
Measurement errors: σ(m/Z) is proportional to m/Z, and σ(RT) = 2.35
– The ratio of errors is drastically different from the ratio of the dimensions of the (m/Z, RT) space, which could distort the high-temperature limit – solved by annealing σ(m/Z)
Δ2D(x) = ((m/Z|cluster center − m/Z|x) / σ(m/Z))² + ((RT|cluster center − RT|x) / σ(RT))² is the model
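The scaled distance Δ2D(x) translates directly into a one-line function; a minimal sketch (the parameter names are our own, and the error values passed in are up to the caller – the slide gives σ(RT) = 2.35 and σ(m/Z) proportional to m/Z):

```python
def delta_2d(mz, rt, mz_center, rt_center, sigma_mz, sigma_rt):
    """Scaled squared distance Delta2D(x) between a peak (mz, rt) and a cluster center."""
    return ((mz_center - mz) / sigma_mz) ** 2 + ((rt_center - rt) / sigma_rt) ** 2
```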

General Features of DA
In many problems, decreasing the temperature is classic multiscale – finer resolution (√T is "just" a distance scale)
– We have factors like (X(i) − Y(k))² / T
In clustering, one then looks at the second-derivative matrix (which can be derived analytically) of the Free Energy with respect to each cluster position; as the temperature is lowered this develops a negative eigenvalue
– Or have multiple clusters at each center and perturb them
– The trade-off depends on the problem – in high dimension it takes time to find the eigenvector; we use eigenvectors here as the data are 2D
This is a phase transition: one splits the cluster into two and continues the EM iteration until the desired resolution is reached
One can start with just one cluster
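A sketch of that instability check for a single cluster. The threshold used here, T < 2·λ_max of the responsibility-weighted covariance, is the standard result from Rose's analysis for squared-distance central clustering; the slide only states the negative-eigenvalue criterion, so treat the exact form below as our reconstruction:

```python
import numpy as np

def cluster_split_check(X, M_k, Y_k, T):
    """Heuristic split check for one cluster (illustrative sketch).

    X: (N, d) points, M_k: (N,) responsibilities <M_i(k)> for this cluster, Y_k: (d,) its center.
    Returns True when the free energy develops a negative eigenvalue direction, i.e.
    roughly when T drops below twice the largest eigenvalue of the weighted covariance.
    """
    w = M_k / M_k.sum()                        # normalized responsibilities for cluster k
    D = X - Y_k                                # deviations from the center
    cov = (w[:, None] * D).T @ D               # weighted covariance matrix
    lam_max = np.linalg.eigvalsh(cov)[-1]      # largest eigenvalue (eigvalsh sorts ascending)
    return T < 2.0 * lam_max                   # True -> split this cluster into two
```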

The system becomes unstable as the temperature is lowered; there is a phase transition, one splits the cluster into two, and the EM iteration continues
One can start with just one cluster and need NOT specify the desired number of clusters; rather, one specifies the cluster resolution
Rose, K., Gurewitz, E., and Fox, G. C., "Statistical mechanics and phase transitions in clustering," Physical Review Letters, 65(8):945–948, August 1990
My #6 most cited article (456 cites including 16 in 2013)

Proteomics 2D DA Clustering at T = … with 60 Clusters (will be 30,000 at T = 0.025) [figure]

Fragment of the 30,000 Clusters / Points [figure]
The brownish triangles are sponge peaks outside any cluster. The colored hexagons are peaks inside clusters, with the white hexagons being the determined cluster centers.

Continuous Clustering
This is a very useful subtlety introduced by Ken Rose, but not widely known, although it greatly improves the algorithm
Take a cluster k to be split into 2, with centers Y(k)_A and Y(k)_B and initial values Y(k)_A = Y(k)_B at the original center Y(k)
Then typically, if you make this change and perturb Y(k)_A and Y(k)_B, they will return to the starting position, as F is at a stable minimum (positive eigenvalue)
But an instability (the negative eigenvalue) can develop, and one finds the split centers separate
Implement by adding an arbitrary number p(k) of centers for each cluster: Z_i = Σ_{k=1}^K p(k) exp(−ε_i(k)/T), and the M step gives p(k) = C(k)/N
Halve p(k) at splits; one can't split easily in the standard case p(k) = 1
The weighting in sums like Z_i is now equipoint, not equicluster, as p(k) is proportional to the number of points C(k) in the cluster
[Figure: Free Energy F versus Y(k)_A + Y(k)_B and versus Y(k)_A − Y(k)_B]
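A minimal sketch of the split bookkeeping this implies (the function name and perturbation scale are illustrative, not the authors' code):

```python
import numpy as np

def split_cluster(Y, p, k, scale=1e-3, rng=None):
    """Continuous-clustering split (sketch): duplicate center k with a tiny perturbation
    and halve its weight p(k), so sums like Z_i are unchanged at the moment of the split."""
    rng = np.random.default_rng() if rng is None else rng
    delta = scale * rng.standard_normal(Y.shape[1])
    Y_new = np.vstack([Y, Y[k] + delta])       # append the new center Y(k)_B near Y(k)
    Y_new[k] = Y[k] - delta                    # nudge Y(k)_A the other way
    p_new = np.append(p, p[k] / 2.0)
    p_new[k] = p[k] / 2.0                      # halve p(k) across the two children
    return Y_new, p_new
```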

Deterministic Annealing for Proteomics

Proteomics Clustering Methods
DAVS(c) is Parallel Trimmed DA clustering with clusters satisfying Δ2D(x) ≤ c²; c is annealed from a large value down to the given value at T ~ 2
DA2D scales m/Z and RT so clusters are "circular", but does NOT trim them; there are no "sponge points"; there are clusters with 1 point
All start with 1 cluster center and use
– "Continuous Clustering"
– Annealing of σ(m/Z) from ~1 at T = ∞ to its measured value (proportional to m/Z) at T ~ 10
– Annealing of c from T = 40 to 2 (trimming is only introduced at low temperatures)
Mclust uses standard model-based (non deterministic annealing) clustering
Landmarks are a collection of reference peaks (obtained by identifying a subset of peaks using MS/MS peptide sequencing)

Cluster Count vs. Temperature for 2 Runs [figure]
All start with one cluster at the far left
T = 1 is special, as the measurement errors are divided out
DA2D counts clusters with 1 member as clusters; DAVS(2) does not

Histograms of the number of peaks in clusters for 4 clustering methods and the landmark set [figure]. Note the lowest bin is clusters with one member peak, i.e. unclustered singletons. For DAVS these are Sponge peaks.

Basic Statistics
Error is the mean squared deviation of points from the center in each dimension – sensitive to the cut in Δ2D(x)
DAVS(3) produces the most large clusters; Mclust the fewest
Mclust has many more clusters with just one member
DA2D is similar to DAVS(2), except it has some (probably false) associations of points far from the center to a cluster
[Table: for each charge and method – DAVS(1), DAVS(2), DAVS(3), DA2D, Mclust – the number of clusters; the number of clusters with occupation count 1 (Sponge), 2, >2 and >30; and the scaled errors in m/z and RT for Count > 1 and Count > 30]

DAVS(2) and DA2D discover 1020 of the 1023 Landmark peaks with modest error

Histograms of Δ2D(x) for 4 different clustering methods and the landmark set, plus the expectation for a Gaussian distribution with standard deviations given as σ(m/z)/3 and σ(RT)/3 in the two directions [figure]. The "Landmark" distribution corresponds to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cut-offs at Δ2D(x) = 1 and 4 respectively. Only clusters with more than 50 members are plotted.

Basic Equations
N is the number of points and K the number of clusters
The NK unknowns <M_i(k)> – the probability that point i is in cluster k – are determined by:
ε_i(k) = (X_i − Y(k))²
<M_i(k)> = p(k) exp(−ε_i(k)/T) / Σ_{k'=1}^K p(k') exp(−ε_i(k')/T)
C(k) = Σ_{i=1}^N <M_i(k)> is the number of points in Cluster k
Y(k) = Σ_{i=1}^N <M_i(k)> X_i / C(k)
p(k) = C(k) / N
Iterate from T = "∞" down to the final temperature (0.025 here)
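Putting these equations together, a compact single-threaded Python sketch of the annealing loop (cluster splitting at instabilities, discussed earlier, is omitted for brevity; the starting temperature, cooling factor and inner-iteration count are illustrative choices, not the paper's settings):

```python
import numpy as np

def da_cluster(X, T_start=10.0, T_final=0.025, cool=0.95, n_inner=20):
    """Sketch of the basic DA iteration from the 'Basic Equations' slide."""
    N, d = X.shape
    Y = X.mean(axis=0, keepdims=True)          # one cluster: all there is at "T = infinity"
    p = np.ones(1)                             # p(k) = C(k)/N starts at 1
    T = T_start
    while T > T_final:
        for _ in range(n_inner):               # equilibrate (EM) at this temperature
            eps = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
            M = p[None, :] * np.exp(-(eps - eps.min(axis=1, keepdims=True)) / T)
            M /= M.sum(axis=1, keepdims=True)  # <M_i(k)>
            C = M.sum(axis=0)                  # C(k)
            Y = (M.T @ X) / C[:, None]         # Y(k)
            p = C / N                          # p(k)
        T *= cool                              # lower the temperature slowly
    return Y, p
```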

Simple Parallelism as in k-means
Decompose the points i over processors
The equations are either pleasingly parallel "maps" over i
Or "All-Reductions" summing over i for each cluster
Parallel Algorithm:
– Each process holds all clusters and calculates the contributions to clusters from the points in its node, e.g. Y(k) = Σ_{i=1}^N <M_i(k)> X_i / C(k)
Runs well in MPI or MapReduce – see all the MapReduce k-means papers
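A sketch of this point-decomposed step using mpi4py (our choice of binding – the authors' implementation is C# with MPI, not Python). Each rank holds a slice X_local of the points plus all K current centers, and the per-cluster partial sums are combined with All-Reduce:

```python
import numpy as np
from mpi4py import MPI

def parallel_m_step(X_local, Y, p, T, comm=MPI.COMM_WORLD):
    """One distributed M step: local partial sums followed by global All-Reduce (sketch)."""
    eps = ((X_local[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    M = p[None, :] * np.exp(-(eps - eps.min(axis=1, keepdims=True)) / T)
    M /= M.sum(axis=1, keepdims=True)
    local_C = M.sum(axis=0)                    # contributions to C(k) from this rank's points
    local_S = M.T @ X_local                    # contributions to sum_i <M_i(k)> X_i
    C = np.empty_like(local_C)
    S = np.empty_like(local_S)
    comm.Allreduce(local_C, C, op=MPI.SUM)     # global C(k)
    comm.Allreduce(local_S, S, op=MPI.SUM)     # global weighted sums
    return S / C[:, None], C                   # new Y(k), identical on every rank
```

Because the All-Reduce leaves identical sums on every rank, no separate broadcast of the new centers is needed; this is the same pattern the MapReduce k-means papers use.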

Better Parallelism
The previous model is correct at the start, but each point does not really contribute to each cluster, as contributions are damped exponentially by exp(−(X_i − Y(k))²/T)
For the Proteomics problem, on average only 6.45 clusters are needed per point if we require (X_i − Y(k))²/T ≤ ~40 (as exp(−40) is small)
So one only needs to keep the nearby clusters for each point
As the average number of clusters is ~20,000, this gives a factor of ~3000 improvement
Further, communication is no longer all global; it has nearest-neighbor components and is calculated by parallelism over clusters
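The pruning idea reduces to a simple test per point; a minimal sketch (function name and cutoff default are our own, following the slide's ~40 threshold):

```python
import numpy as np

def nearby_clusters(x_i, Y, T, cutoff=40.0):
    """Indices of clusters kept for point x_i: those with (X_i - Y(k))^2 / T <= cutoff.

    Since exp(-40) is negligible, dropping the rest changes nothing numerically;
    on the proteomics data this leaves roughly 6-7 clusters per point out of ~20,000.
    """
    d2 = ((Y - x_i) ** 2).sum(axis=1)
    return np.flatnonzero(d2 / T <= cutoff)
```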

Speedups for several runs on Tempest from 8-way through 384-way MPI parallelism with one thread per process [figure]. We look at different choices for MPI processes, which are either inside nodes or on separate nodes.

Online/Realtime Clustering
Given an existing clustering, one can add new data in two ways
The simplest is of course to interpolate new points to the nearest existing cluster
Better is to add the new points and rerun the full algorithm starting at T ~ 1, where "convergence" is in the range T = 0.1 to 0.01
– This takes 20% to 30% of the original execution time

Summary
Deterministic Annealing provides quality results (keeping us healthy), running in the trimmed model DAVS(c) or the unconstrained fashion DA2D
– The user can choose the trade-offs given by the cut-off c
The parallel version gives a fast automatic initial analysis of LC-MS peaks with no user input needed, including no input on the final number of clusters
The little-known "Continuous Clustering" is useful
Current open source code is available, but best to wait until we finish the conversion from C# to Java
The parallel approach is subtle: as in particle-in-cell codes, there is parallelism over clusters (cells) and/or points (particles)
– Perhaps a useful, different benchmark for compilers etc.
Similar ideas are relevant for other clustering and deterministic annealing fields such as non-metric spaces and MDS

Extras

Start at T = "∞" with 1 Cluster; decrease T, and clusters emerge at instabilities [figure sequence]


Clusters v. Regions
In Lymphocytes, clusters are distinct
In Pathology, clusters divide the space into regions, and sophisticated methods like deterministic annealing are probably unnecessary
[Figures: Pathology (54D) and Lymphocytes (4D)]

Protein Universe Browser for COG Sequences, with a few illustrative biologically identified clusters [figure]

Proteomics 2D DA Clustering at T = 0.1: a small sample of the ~30,000 Clusters with Count >= 2 [figure]
Orange: sponge points – outliers not in any cluster
Yellow triangles: centers

Remarks on Clustering and MDS
The standard data libraries (R, Matlab, Mahout) do not have the best algorithms/software, in either functionality or scalable parallelism
A lot of algorithms are built around "classic full matrix" kernels
Clustering, Gaussian Mixture Models, PLSI (probabilistic latent semantic indexing), and LDA (Latent Dirichlet Allocation) are similar
Multi-Dimensional Scaling (MDS) is the classic information-visualization algorithm for high-dimension spaces (a map preserving distances)
Vector O(N) and non-vector semimetric O(N²) space cases for N points; "all" apps are points in spaces – though not all are "proper linear spaces"
Trying to release the ~most powerful (in features/performance) available Clustering and MDS library, although unfortunately in C#
Supported features: vector, non-vector, deterministic annealing, hierarchical, sharp (trimmed) or general cluster sizes, fixed points and general weights for MDS (generalized Elkan's algorithm)

General Deterministic Annealing
For some cases, such as vector clustering and Mixture Models, one can do the integrals by hand, but usually that will be impossible
So introduce a Hamiltonian H_0(ε, θ) which by choice of ε and θ can be made similar to the real Hamiltonian H_R(ε) and which has tractable integrals
P_0(ε) = exp(−H_0(ε)/T + F_0/T) is the approximate Gibbs distribution for H_R
F_R(P_0) = <H_R − T S(P_0)>|_0 = <H_R − H_0>|_0 + F_0(P_0), where <…>|_0 denotes ∫ dε P_0(ε) …
It is easy to show for the real Free Energy (the Gibbs inequality) that F_R(P_R) ≤ F_R(P_0) (related to the Kullback-Leibler divergence)
The Expectation step E is: find the θ minimizing F_R(P_0), and
Follow with the M step (of EM), setting ε = <ε>|_0 = ∫ dε ε P_0(ε) (mean field), and one follows with a traditional minimization of the remaining parameters
Note 3 types of variables:
– θ: used to approximate the real Hamiltonian
– ε: subject to annealing
– The rest: optimized by traditional methods

Implementation of DA-PWC
The clustering variables are again M_i(k) (these are θ in the general approach), the probability that point i belongs to cluster k
The Pairwise Clustering Hamiltonian is given by the nonlinear form H_PWC = 0.5 Σ_{i=1}^N Σ_{j=1}^N Δ(i, j) Σ_{k=1}^K M_i(k) M_j(k) / C(k)
Δ(i, j) is the pairwise distance between points i and j, with C(k) = Σ_{i=1}^N M_i(k) the number of points in Cluster k
Take the same form H_0 = Σ_{i=1}^N Σ_{k=1}^K M_i(k) ε_i(k) as for central clustering
ε_i(k) is determined to minimize F_PWC(P_0) = <H_PWC − H_0>|_0 + F_0(P_0), where the integrals can be easily done
And now the linear (in M_i(k)) H_0 and the quadratic H_PWC are different
Again <M_i(k)> = exp(−ε_i(k)/T) / Σ_{k'=1}^K exp(−ε_i(k')/T)
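A sketch of one mean-field sweep of DA-PWC. The explicit expression for ε_i(k) used below is the standard mean-field form obtained by minimizing F_PWC(P_0) (Hofmann–Buhmann style); the slide states the procedure but not this formula, so treat it as our reconstruction rather than the authors' code:

```python
import numpy as np

def dapwc_step(D, M, T):
    """One mean-field sweep of pairwise DA clustering (illustrative sketch).

    D: (N, N) pairwise distances Delta(i, j); M: (N, K) current <M_i(k)>.
    """
    C = M.sum(axis=0)                                            # C(k)
    A = D @ M / C[None, :]                                       # (1/C(k)) sum_j Delta(i,j) <M_j(k)>
    B = 0.5 * np.einsum('jk,jl,lk->k', M, D, M) / C**2           # self-interaction correction per cluster
    eps = A - B[None, :]                                         # effective eps_i(k)
    W = np.exp(-(eps - eps.min(axis=1, keepdims=True)) / T)
    return W / W.sum(axis=1, keepdims=True)                      # new <M_i(k)>
```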

Some Ideas
Deterministic annealing performs better than many well-used optimization approaches
– It started as the "Elastic Net" by Durbin for the Travelling Salesman Problem (TSP)
The basic idea behind deterministic annealing is the mean field approximation, which is also used in "Variational Bayes" and "Variational inference"
Markov chain Monte Carlo (MCMC) methods are roughly single-temperature simulated annealing
Deterministic annealing is less sensitive to initial conditions, avoids local optima, and is not equivalent to trying random initial starts

Some Uses of Deterministic Annealing
Clustering
– Vectors: Rose (Gurewitz and Fox)
– Clusters with fixed sizes and no tails (Proteomics team at Broad)
– No vectors: Hofmann and Buhmann (just use pairwise distances)
Dimension reduction for visualization and analysis
– Vectors: GTM (Generative Topographic Mapping)
– No vectors: SMACOF (Multidimensional Scaling, MDS) (just use pairwise distances)
Can apply to HMMs and general mixture models (less studied)
– Gaussian Mixture Models
– Probabilistic Latent Semantic Analysis with Deterministic Annealing (DA-PLSA), as an alternative to Latent Dirichlet Allocation for finding "hidden factors"

Histograms of Δ2D(x) for 4 different clustering methods and the landmark set, plus the expectation for a Gaussian distribution with standard deviations given as 1/3 in the two directions [figure]. The "Landmark" distribution corresponds to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cut-offs at Δ2D(x) = 1 and 4 respectively. Only clusters with more than 5 peaks are plotted.

Some Problems
Analysis of Mass Spectrometry data to find peptides by clustering peaks (Broad Institute)
– ~0.5 million points in 2 dimensions (one experiment) – ~50,000 clusters summed over charges
Metagenomics – 0.5 million (increasing rapidly) points NOT in a vector space – hundreds of clusters per sample
Pathology images: >50 dimensions
Social image analysis is in a high-ish dimension vector space – millions of images; 1000 features per image; millions of clusters
Finding communities from network graphs coming from social media contacts etc.
– No vector space; can be huge in all ways

Speedups for several runs on Madrid from sequential through 128-way parallelism, defined as the product of the number of threads per process and the number of MPI processes [figure]. We look at different choices for MPI processes, which are either inside nodes or on separate nodes. For example, 16-way parallelism shows 3 choices with thread count 1: 16 processes on one node (the fastest), 2 processes on 8 nodes, and 8 processes on 2 nodes.

Parallelism within a Single Node of the Madrid Cluster [figure]. A set of runs on peak data on a single node with 16 cores, with either threads or MPI giving the parallelism. Parallelism (the x-axis) is either the number of threads or the number of MPI processes.

Proteomics 2D DA Clustering at T = 0.1: a small sample of the ~30,000 Clusters with Count >= 2 [figure]
Legend: Sponge Peaks; Centers