Deterministic Annealing and Robust Scalable Data mining for the Data Deluge. Petascale Data Analytics: Challenges, and Opportunities (PDAC-11) Workshop.

Presentation transcript:

Deterministic Annealing and Robust Scalable Data mining for the Data Deluge. Petascale Data Analytics: Challenges, and Opportunities (PDAC-11) Workshop at SC11, Seattle, November 2011. Geoffrey Fox: Director, Digital Science Center, Pervasive Technology Institute; Associate Dean for Research and Graduate Studies, School of Informatics and Computing; Indiana University Bloomington.

Goal
We are building a library of parallel data mining tools with the best known (to me) robustness and performance characteristics. Big data needs super algorithms? Many statistics tools (e.g. in R) do not use the best algorithms and are not well parallelized.
Deterministic annealing (DA) is one of the better approaches to global optimization:
– Removes local minima
– Addresses overfitting
– Faster than simulated annealing
Return to my heritage (physics) in an approach I called Physical Computation 23 years ago: methods based on analogies to nature. Physics systems find their true lowest energy state if you anneal, i.e. you equilibrate at each temperature as you cool.

Five Ideas
– Deterministic annealing (mean field) is better than many widely used global optimization methods
– No-vector problems are O(N²); for the no-vector case one can develop new O(N) or O(N log N) methods, as in "Fast Multipole and OctTree methods"
– Map high dimensional data to 3D and use classic methods developed to speed up O(N²) 3D particle dynamics problems
– Low dimension mappings of data both visualize it and allow geometric hashing
– Expectation maximization can run on clouds and HPC

Uses of Deterministic Annealing
Clustering
– Vectors: Rose (Gurewitz and Fox)
– Clusters with fixed sizes and no tails (Proteomics team at Broad)
– No vectors: Hofmann and Buhmann (just use pairwise distances)
Dimension reduction for visualization and analysis
– Vectors: GTM
– No vectors: MDS (just use pairwise distances)
Can apply to (but little practical work so far)
– Gaussian Mixture Models
– Latent Dirichlet Allocation (typical information retrieval/global inference), done as Probabilistic Latent Semantic Analysis with Deterministic Annealing

Deterministic Annealing I
Gibbs distribution at temperature T:
P(ε) = exp( − H(ε)/T ) / ∫ dε exp( − H(ε)/T )
or P(ε) = exp( − H(ε)/T + F/T )
Minimize the Free Energy, combining Objective Function and Entropy:
F = ⟨H − T S(P)⟩ = ∫ dε { P(ε) H + T P(ε) ln P(ε) }
where ε are (a subset of) the parameters to be minimized.
Simulated annealing corresponds to doing these integrals by Monte Carlo.
Deterministic annealing corresponds to doing the integrals analytically (by a mean-field approximation) and is naturally much faster than Monte Carlo.
In each case the temperature is lowered slowly, say by a factor of 0.99 at each iteration.
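
A minimal numerical sketch (not from the talk) of the idea above: Gibbs weights P(ε) ∝ exp(−H(ε)/T) over a handful of hypothetical configuration energies, cooled slowly so that the probability concentrates on the global minimum.

```python
import numpy as np

# Toy Gibbs weights over a tiny set of assumed configuration energies H(eps)
H = np.array([1.0, 1.2, 3.0, 0.9])

def gibbs(H, T):
    w = np.exp(-(H - H.min()) / T)   # subtract the minimum for numerical stability
    return w / w.sum()

T = 10.0
while T > 0.01:
    P = gibbs(H, T)
    T *= 0.99                        # lower T slowly, e.g. a factor 0.99 per step
print(np.round(P, 3))                # nearly all weight sits on argmin H as T -> 0
```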

Deterministic Annealing: Free Energy F({y}, T) versus configuration {y}
– The minimum evolves as the temperature decreases
– Movement at a fixed temperature goes to local minima if not initialized "correctly"
– Solve linear equations for each temperature
– Nonlinear effects are mitigated by initializing with the solution at the previous, higher temperature

Deterministic Annealing II
For some cases, such as vector clustering and Gaussian Mixture Models, one can do the integrals by hand, but usually this is impossible.
So introduce a Hamiltonian H₀(ε, θ) which by choice of θ can be made similar to the real Hamiltonian H_R(ε) and which has tractable integrals.
P₀(ε) = exp( − H₀(ε)/T + F₀/T ) approximates the Gibbs distribution for H_R.
F_R(P₀) = ⟨H_R − T S₀(P₀)⟩|₀ = ⟨H_R − H₀⟩|₀ + F₀(P₀)
where ⟨…⟩|₀ denotes ∫ dε P₀(ε) (…).
It is easy to show for the real Free Energy (the Gibbs inequality) that F_R(P_R) ≤ F_R(P₀) (Kullback-Leibler divergence).
The Expectation step E finds θ minimizing F_R(P₀); follow with the M step (of EM) setting ε = ⟨ε⟩|₀ = ∫ dε ε P₀(ε) (mean field), and one follows with a traditional minimization of the remaining parameters.
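
The E/M structure above can be summarized as an annealed-EM driver. The sketch below is a hypothetical skeleton, with `e_step` and `m_step` standing in for the problem-specific updates defined on the following slides.

```python
# Sketch of an annealed-EM loop: at each temperature iterate E and M steps,
# then cool. `e_step` and `m_step` are placeholder callbacks (clustering, MDS, ...).
def annealed_em(params, e_step, m_step, T0=100.0, T_min=0.01, cool=0.99, sweeps=20):
    T = T0
    while T > T_min:
        for _ in range(sweeps):
            expectations = e_step(params, T)   # E: mean-field averages under P0
            params = m_step(expectations)      # M: re-minimize remaining parameters
        T *= cool                              # slow cooling schedule
    return params
```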

Implementation of DA-PWC
Clustering variables are M_i(k) (these are the annealed variables of the general approach), where M_i(k) is the probability that point i belongs to cluster k.
In Central or Pairwise (PW) Clustering, take H₀ = Σ_{i=1}^N Σ_{k=1}^K M_i(k) ε_i(k).
Central clustering has ε_i(k) = (X(i) − Y(k))² and M_i(k) determined by the Expectation step:
– H_Central = Σ_{i=1}^N Σ_{k=1}^K M_i(k) (X(i) − Y(k))²
– H_Central and H₀ are identical
– Centers Y(k) are determined in the M step
⟨M_i(k)⟩ = exp( − ε_i(k)/T ) / Σ_{k=1}^K exp( − ε_i(k)/T )
The Pairwise Clustering Hamiltonian is given by the nonlinear form
H_PC = 0.5 Σ_{i=1}^N Σ_{j=1}^N Δ(i, j) Σ_{k=1}^K M_i(k) M_j(k) / C(k), with C(k) = Σ_{i=1}^N M_i(k) the number of points in cluster k,
where Δ(i, j) is the pairwise distance between points i and j.
Now the linear (in M_i(k)) H₀ and the quadratic H_PC are different.
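
For the central (vector) case the updates above are easy to write down concretely. The following is a sketch, assuming X is an (N, d) array of points, Y a (K, d) array of centers, and equal cluster priors; it is not the SALSA group's parallel code.

```python
import numpy as np

def da_central_step(X, Y, T):
    # E step: <M_i(k)> = exp(-(X(i) - Y(k))^2 / T), normalized over k
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)          # (N, K)
    M = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
    M /= M.sum(axis=1, keepdims=True)
    # M step: centers Y(k) are the probability-weighted means of the points
    Y_new = (M[:, :, None] * X[:, None, :]).sum(axis=0) / M.sum(axis=0)[:, None]
    return M, Y_new
```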

General Features of DA
Deterministic Annealing (DA) is related to Variational Inference or Variational Bayes methods; Markov Chain Monte Carlo is simulated annealing.
In many problems, decreasing the temperature is a classic multiscale process giving finer resolution (√T is "just" a distance scale); we have factors like (X(i) − Y(k))² / T.
In clustering, one then looks at the second derivative matrix of F_R(P₀) with respect to the cluster parameters; as the temperature is lowered this develops a negative eigenvalue corresponding to an instability.
– Or have multiple centers at each cluster and perturb them.
This is a phase transition: one splits the cluster into two and continues the EM iteration.
One can start with just one cluster.

Start at T = ∞ with 1 cluster; decrease T, and clusters emerge at instabilities.


Rose, K., Gurewitz, E., and Fox, G. C., "Statistical mechanics and phase transitions in clustering," Physical Review Letters, 65(8):945–948, August 1990. My #5 most cited article (387 citations).

DA-PWC EM Steps (E is red, M black); k runs over clusters, i and j over points:
1) A(k) = − 0.5 Σ_{i=1}^N Σ_{j=1}^N Δ(i, j) ⟨M_i(k)⟩ ⟨M_j(k)⟩ / C(k)²
2) B_i(k) = Σ_{j=1}^N Δ(i, j) ⟨M_j(k)⟩ / C(k)
3) ε_i(k) = B_i(k) + A(k)
4) ⟨M_i(k)⟩ = p(k) exp( − ε_i(k)/T ) / Σ_{k'=1}^K p(k') exp( − ε_i(k')/T )
5) C(k) = Σ_{i=1}^N ⟨M_i(k)⟩
6) p(k) = C(k) / N
Loop to converge the variables; decrease T from ∞; split centers by halving p(k).
Step 1 is a global sum (reduction); Steps 1, 2, and 5 become local sums if the ⟨M_i(k)⟩ are broadcast.
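
A serial sketch of one E/M sweep of these six steps, assuming Delta is the (N, N) pairwise dissimilarity matrix, M holds the current ⟨M_i(k)⟩ as an (N, K) array, and p the cluster probabilities p(k); the parallel code instead distributes the sums in steps 1, 2, and 5.

```python
import numpy as np

def da_pwc_step(Delta, M, p, T):
    C = M.sum(axis=0)                                         # step 5 from previous sweep
    B = Delta @ M / C                                         # step 2: B_i(k)
    A = -0.5 * np.einsum('ik,ij,jk->k', M, Delta, M) / C**2   # step 1: A(k)
    eps = B + A                                               # step 3: eps_i(k)
    W = p * np.exp(-(eps - eps.min(axis=1, keepdims=True)) / T)
    M_new = W / W.sum(axis=1, keepdims=True)                  # step 4: new <M_i(k)>
    p_new = M_new.sum(axis=0) / len(Delta)                    # steps 5-6: C(k), p(k)
    return M_new, p_new
```

Each sweep is dominated by the Delta-times-M products, i.e. O(N²K) work, which is the O(N²) cost discussed later for the no-vector problems.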

Multidimensional Scaling (MDS)
Map points in high dimension to lower dimensions.
There are many such dimension reduction algorithms (PCA, Principal Component Analysis, is the easiest); the simplest but perhaps best is MDS.
Minimize the Stress σ(X) = Σ_{i<j=1}^n weight(i,j) (δ(i, j) − d(X_i, X_j))²
where δ(i, j) are the input dissimilarities and d(X_i, X_j) is the Euclidean distance in the embedding space (3D usually).
SMACOF, or Scaling by MAjorizing a COmplicated Function, is a clever steepest descent (expectation maximization, EM) algorithm.
Computational complexity goes like N² × (reduced dimension).
There is a deterministic annealed version of it which is much better.
One could also just view it as a nonlinear χ² problem (Tapia et al., Rice); slower but more general.
All parallelize with high efficiency.
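
For reference, a sketch of one standard SMACOF (Guttman transform) iteration with unit weights, plus the stress it reduces; this is the textbook serial form, not the deterministic annealed or parallel version discussed in the talk. D is the (n, n) dissimilarity matrix and X the current (n, 3) embedding.

```python
import numpy as np

def smacof_step(D, X):
    n = len(D)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))  # embedded distances
    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = np.where(d > 0, D / d, 0.0)
    B = -ratio
    np.fill_diagonal(B, 0.0)
    np.fill_diagonal(B, -B.sum(axis=1))
    return B @ X / n              # Guttman transform: new configuration with lower stress

def stress(D, X):
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    return ((D - d) ** 2)[np.triu_indices(len(D), 1)].sum()
```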

Implementation of MDS
One tractable form was the linear Hamiltonian; another is Gaussian:
H₀ = Σ_{i=1}^n (X(i) − μ(i))² / 2
where the X(i) are the vectors to be determined, as in the formula for multidimensional scaling
H_MDS = Σ_{i<j=1}^n weight(i,j) (δ(i, j) − d(X(i), X(j)))²
where δ(i, j) are observed dissimilarities that we want to represent as Euclidean distances between points X(i) and X(j).
H_MDS is quartic or involves square roots, so we need the idea of an approximate Hamiltonian H₀.
The E step minimizes Σ_{i<j=1}^n weight(i,j) (δ(i, j) − constant·T − (μ(i) − μ(j))²)², with solution μ(i) = 0 at large T.
Points pop out from the origin as the temperature is lowered.

Trimmed Clustering
Clustering with position-specific constraints on variance: Applying redescending M-estimators to label-free LC-MS data analysis (Rudolf Frühwirth, D. R. Mani and Saumyadipta Pyne), BMC Bioinformatics 2011, 12:358.
H_TCC = Σ_{k=0}^K Σ_{i=1}^N M_i(k) f(i,k)
– f(i,k) = (X(i) − Y(k))² / 2σ(k)²  for k > 0
– f(i,0) = c² / 2  for k = 0
The 0'th cluster captures (at zero temperature) all points outside clusters (the background).
Clusters are trimmed: (X(i) − Y(k))² / 2σ(k)² < c² / 2.
This is another case where H₀ is the same as the target Hamiltonian.
(Figure: proteomics mass spectrometry example, showing distance from cluster center at T = 0, 1, and 5.)
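
A sketch of the zero-temperature trimming rule above: a point joins its nearest cluster only if (X(i) − Y(k))²/2σ(k)² < c²/2, otherwise it falls into the background cluster 0. The array shapes are assumptions for illustration.

```python
import numpy as np

def trimmed_assign(X, Y, sigma, c):
    # f(i,k) = (X(i) - Y(k))^2 / (2 sigma(k)^2);  X: (N, d), Y: (K, d), sigma: (K,)
    f = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2) / (2 * sigma**2)
    k_best = f.argmin(axis=1)
    inside = f[np.arange(len(X)), k_best] < c**2 / 2
    return np.where(inside, k_best + 1, 0)      # labels 1..K; 0 = background cluster
```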

High Performance Dimension Reduction and Visualization
The need is pervasive:
– Large and high dimensional data are everywhere: biology, physics, Internet, …
– Visualization can help data analysis
Visualization of large datasets with high performance:
– Map high-dimensional data into low dimensions (2D or 3D)
– Need parallel programming to process large data sets
– Developing high performance dimension reduction algorithms: MDS (Multidimensional Scaling), used earlier in a DNA sequencing application; GTM (Generative Topographic Mapping); DA-MDS (Deterministic Annealing MDS); DA-GTM (Deterministic Annealing GTM)
– Interactive visualization tool PlotViz
We are supporting drug discovery by browsing the 60 million compounds in the PubChem database, with 166 features each.

Pairwise Clustering and MDS are O(N²) Problems
100,000 sequences take a few days on 768 cores (32 nodes of the Windows cluster Tempest).
We could just run the 440K set on a larger machine, but let's try to be "cleverer" and use hierarchical methods:
– Start with a 100K sample run fully
– Divide into "megaregions" using the 3D projection
– Interpolate the full sample into the megaregions and analyze the latter separately

Use the Barnes-Hut OctTree, originally developed to make O(N²) astrophysics simulations O(N log N).

OctTree for a 100K sample of Fungi. We use the OctTree for logarithmic interpolation.

440K Interpolated

A large cluster in Region 0

26 Clusters in Region 4

13 Clusters in Region 6

Understanding the Octopi

The octopi are globular clusters distorted by the length dependence of the dissimilarity measure. Sequences are 200 to 500 base pairs long. We restarted the project using local (SWG, Smith-Waterman-Gotoh) rather than global (NW, Needleman-Wunsch) alignment.

Note that the mapped dissimilarity (Euclidean 3D, shown in red) and the abstract dissimilarity (blue) are very similar.

What was/can be done where?
Dissimilarity computation (largest time)
– Done using Twister on HPC
– Have it running on Azure and Dryad
– Used Tempest (24 cores per node, 32 nodes) with MPI as well (MPI.NET failed(!); Twister didn't)
Full MDS
– Done using MPI on Tempest
– Have it running well using Twister on HPC clusters and Azure
Pairwise Clustering
– Done using MPI on Tempest
– Probably need to change the algorithm to get good efficiency on the cloud, but HPC parallel efficiency is high
Interpolation (smallest time)
– Done using Twister on HPC
– Running on Azure

Twister
– Streaming-based communication
– Intermediate results are transferred directly from the map tasks to the reduce tasks, eliminating local files
– Cacheable map/reduce tasks: static data remains in memory
– Combine phase to combine reductions
– The user program is the composer of MapReduce computations
– Extends the MapReduce model to iterative computations
(Architecture diagram: the user program and MR driver communicate with worker nodes over a pub/sub broker network; workers run cacheable Map and Reduce tasks via an MRDaemon, reading and writing data splits from the file system; the cycle is Configure() with static data, then iterate Map(Key, Value) → Reduce(Key, List) → Combine(Key, List) with a δ flow of updated data, until Close(). Different synchronization and intercommunication mechanisms are used by the parallel runtimes.)
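
An illustrative sketch of the iterative MapReduce pattern described above (pseudocode-style Python, not Twister's actual Java API): static data is supplied once and reused across iterations, and each iteration runs map, reduce, and combine before the driver decides whether to iterate again.

```python
def iterative_mapreduce(static_data, splits, map_fn, reduce_fn, combine_fn,
                        state, converged, max_iter=100):
    for _ in range(max_iter):
        # map tasks: cached static data plus the current (delta) state
        mapped = [map_fn(split, static_data, state) for split in splits]
        # group intermediate (key, value) pairs by key (streamed, no local files)
        groups = {}
        for kv_pairs in mapped:
            for key, value in kv_pairs:
                groups.setdefault(key, []).append(value)
        # reduce tasks, then a combine phase that merges all reductions
        reduced = {key: reduce_fn(key, values) for key, values in groups.items()}
        state = combine_fn(reduced)
        if converged(state):
            break
    return state
```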

Expectation Maximization and Iterative MapReduce
Clustering and Multidimensional Scaling are both EM (expectation maximization) algorithms using deterministic annealing for improved performance.
EM tends to be good for clouds and iterative MapReduce:
– Quite complicated computations (so compute is largish compared to communication)
– Communication consists of reduction operations (global sums in our case)
– See also Latent Dirichlet Allocation and related information retrieval algorithms with a similar EM structure

May Need New Algorithms
DA-PWC (Deterministically Annealed Pairwise Clustering) splits clusters automatically as the temperature lowers and reveals clusters of size O(√T).
Two approaches to splitting:
1. Look at the correlation matrix and see when it becomes singular, which is a separate parallel step
2. Formulate the problem with multiple centers for each cluster and perturb them every so often, splitting centers into two groups; unstable clusters separate
The current MPI code uses the first method, which will run on Twister since the matrix singularity analysis is the usual "power eigenvalue method" (as is PageRank; see the sketch below).
– However, it does not have a super good compute/communicate ratio.
Experiment with the second method, which is "just" EM with a better compute/communicate ratio (and simpler code as well).
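
A sketch of the power method referenced in approach 1: repeated matrix-vector products estimate the dominant eigenvalue; applied to a shifted matrix σI − A it can expose the most negative eigenvalue of the second-derivative matrix, signalling the instability that triggers a split. The function name and loop limits are illustrative.

```python
import numpy as np

def power_iteration(A, iters=200, tol=1e-9):
    v = np.random.default_rng(0).standard_normal(len(A))
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)
        lam_new = float(v @ A @ v)      # Rayleigh quotient: signed eigenvalue estimate
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam, v
```

Each iteration costs one matrix-vector product plus a global reduction for the norm, which is exactly the compute/communicate concern noted above.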

Next Steps
Finalize MPI and Twister versions of Deterministically Annealed Expectation Maximization for
– Vector clustering
– Vector clustering with trimmed clusters
– Pairwise (non-vector) clustering
– MDS SMACOF
Extend the O(N log N) Barnes-Hut methods to all codes.
Allow missing distances in MDS (BLAST gives this) and allow arbitrary weightings (Sammon's method)
– Have done this for the χ² approach to MDS
Explore DA-PLSI as an alternative to LDA.
Exploit the improved Twister and Twister4Azure runtimes.