SALSA: Service Aggregated Linked Sequential Activities

GOALS:
  Increasing number of cores accompanied by continued data deluge.
  Develop scalable parallel data mining algorithms with good multicore and cluster performance; understand software runtime and parallelization method.
  Use managed code (C#) and package algorithms as services to encourage broad use, assuming experts parallelize core algorithms.

CURRENT RESULTS:
  Microsoft CCR supports MPI, dynamic threading and, via DSS, a service model of computing; detailed performance measurements.
  Speedups of 7.5 or above on 8-core systems for "large problems" with deterministic annealed (avoid local minima) algorithms for clustering, Gaussian Mixtures, GTM (dimensional reduction); extending to new algorithms/applications.

SALSA Team (Indiana University): Geoffrey Fox, Xiaohong Qiu, Huapeng Yuan, Seung-Hee Bae
Technology Collaboration (Microsoft): George Chrysanthakopoulos, Henrik Frystyk Nielsen
Application Collaboration (Indiana University and IUPUI):
  Cheminformatics: Rajarshi Guha, David Wild
  Bioinformatics: Haixu Tang
  Demographics (GIS): Neil Devadasan
Speedup = Number of cores/(1+f)
f = (Sum of Overheads)/(Computation per core)
Computation ∝ Grain Size n · # Clusters K

Overheads are:
  Synchronization: small with CCR
  Load Balance: good
  Memory Bandwidth Limit: → 0 as K → ∞
  Cache Use/Interference: Important
  Runtime Fluctuations: Dominant for large n, K

All our "real" problems have f ≤ 0.05 and speedups on 8-core systems greater than 7.6.

MPI Exchange Latency in µs (20-30 µs computation between messaging)

Machine                                      OS      Runtime      Grains   Parallelism  MPI Latency (µs)
Intel8c:gf12 (8 core 2.33 GHz, in 2 chips)   Redhat  MPJE (Java)  Process  8            181
                                                     MPICH2 (C)   Process  8            40.0
                                                     MPICH2:Fast  Process  8            39.3
                                                     Nemesis      Process  8            4.21
Intel8c:gf20 (8 core 2.33 GHz)               Fedora  MPJE         Process  8            157
                                                     mpiJava      Process  8            111
                                                     MPICH2       Process  8            64.2
Intel8b (8 core 2.66 GHz)                    Vista   MPJE         Process  8            170
                                             Fedora  MPJE         Process  8            142
                                             Fedora  mpiJava      Process  8            100
                                             Vista   CCR (C#)     Thread   8            20.2
AMD4 (4 core 2.19 GHz)                       XP      MPJE         Process  4            185
                                             Redhat  MPJE         Process  4            152
                                                     mpiJava      Process  4            99.4
                                                     MPICH2       Process  4            39.3
                                             XP      CCR          Thread   4            16.3
Intel (4 core)                               XP      CCR          Thread   4            25.8

DA Clustering Performance
[Figure: fractional overhead f versus 10000/Grain Size for K = 10, 20 and 30 clusters. Runtime fluctuations give 2% to 5% overhead.]

Parallel Programming Strategy
[Figure: "Main Thread" with memory M, and subsidiary threads t = 0 to 7, each with its own memory m_t.]
Use data decomposition as in classic distributed memory, but use shared memory for read variables. Each thread uses a "local" array for written variables to get good cache performance.
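The parallel programming strategy above can be sketched in a few lines. This is a minimal illustration in Python, not the poster's C#/CCR implementation: the function names `partial_sums` and `update_centers` are hypothetical, the data are one-dimensional, and a hard nearest-center assignment stands in for the annealed algorithms. Each thread reads the shared `points` and `centers` but writes only to its own local arrays, which are combined by the main thread afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(points, centers, start, stop):
    """Work for one thread on the block points[start:stop].

    Shared data (points, centers) is only read. All writes go to
    arrays local to this call, so threads never touch the same memory.
    """
    k = len(centers)
    local_sum = [0.0] * k    # thread-local "written" variables
    local_count = [0] * k
    for i in range(start, stop):
        x = points[i]
        nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
        local_sum[nearest] += x
        local_count[nearest] += 1
    return local_sum, local_count


def update_centers(points, centers, n_threads=4):
    """One center update using block data decomposition over threads."""
    n = len(points)
    k = len(centers)
    # Contiguous blocks, one per thread (some may be empty if n < n_threads).
    bounds = [(t * n // n_threads, (t + 1) * n // n_threads)
              for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(
            lambda b: partial_sums(points, centers, b[0], b[1]), bounds))
    # The main thread reduces the per-thread local arrays.
    total_sum = [sum(r[0][j] for r in results) for j in range(k)]
    total_count = [sum(r[1][j] for r in results) for j in range(k)]
    return [total_sum[j] / total_count[j] if total_count[j] else centers[j]
            for j in range(k)]
```

Python threads are used here only to show the memory layout; they do not give the multicore speedups reported above, which depend on the CCR runtime and on keeping each thread's written data in its own cache lines.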
Deterministic Annealing Clustering of Indiana Census Data
Decrease temperature (distance scale) to discover more clusters.
[Figure: two cluster maps, each labelled with its Resolution $T^{0.5}$. Legend: r: Renters, a: Asian, h: Hispanic, p: Total.]

GTM Projection of 2 clusters of 335 compounds in 155 dimensions

Stop Press: GTM Projection of PubChem: 10,926,94 compounds in a 166-dimension binary property space takes 4 days on 8 cores. A 64×64 mesh of GTM clusters interpolates PubChem. Could usefully use 1024 cores! David Wild will use this for a GIS-style 2D browsing interface to chemistry.

Bioinformatics: Annealed clustering and Euclidean embedding for repetitive sequences, gene/protein families. Use GTM to replace PCA in structure analysis.

[Figure: panels PCA and GTM.] Linear PCA v. nonlinear GTM on 6 Gaussians in 3D
Parallel Dimensional Scaling and Metric Embedding; Generalized Cluster Analysis

N data points $E(x)$ in D dim. space; minimize F by EM.

Deterministic Annealing Clustering (DAC)
  $a(x) = 1/N$, or generally $p(x)$ with $\sum_x p(x) = 1$
  $g(k) = 1$ and $s(k) = 0.5$
  T is the annealing temperature, varied down from $\infty$ with final value of 1
  Vary cluster center $Y(k)$, but can calculate $P_k$ and $\sigma(k)$ (even for matrix $\sigma(k)$) using IDENTICAL formulae for Gaussian mixtures
  K starts at 1 and is incremented by algorithm

Deterministic Annealing Gaussian mixture models (DAGM)
  $a(x) = 1$
  $g(k) = \{P_k / (2\pi\sigma(k)^2)^{D/2}\}^{1/T}$
  $s(k) = \sigma(k)^2$ (taking case of spherical Gaussian)
  T is the annealing temperature, varied down from $\infty$ with final value of 1
  Vary $Y(k)$, $P_k$ and $\sigma(k)$
  K starts at 1 and is incremented by algorithm

Traditional Gaussian mixture models (GM)
  As DAGM, but set T = 1 and fix K

Generative Topographic Mapping (GTM)
  $a(x) = 1$ and $g(k) = (1/K)(\beta/2\pi)^{D/2}$
  $s(k) = 1/\beta$ and $T = 1$
  $Y(k) = \sum_{m=1}^{M} W_m \phi_m(X(k))$
  Choose fixed $\phi_m(X) = \exp(-(X - \mu_m)^2 / \sigma^2)$
  Vary $W_m$ and $\beta$, but fix values of M and K a priori
  $Y(k)$, $E(x)$ and $W_m$ are vectors in the original high D dimension space
  $X(k)$ and $\mu_m$ are vectors in the 2 dim. mapped space

DAGTM: GTM has several natural annealing versions based on either DAC or DAGM: under investigation

Principal Component Analysis (PCA)

We need:
  Large Windows Cluster
  Link of CCR and MPI (or cross cluster CCR)
  Linear Algebra for C#: (Multiplication, SVD, Equation Solve)
  High Performance C# Math Libraries

Near Term Future Work: Parallel Algorithms for
  Random Projection Metric Embedding (Bourgain)
  MDS Dimensional Scaling (EM like SMACOF)
  Marquardt Algorithm for Newton's Method
Later: HMM and SVM, Other embedding
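The DAC recipe above (anneal T downwards, update cluster centers by EM at each temperature) can be illustrated with a small sketch. It rests on several assumptions that are not in the poster: the free energy is taken to have the form $F = -T \sum_x a(x) \ln \sum_k g(k) \exp[-0.5 (E(x) - Y(k))^2 / (T s(k))]$, which with $g(k) = 1$ and $s(k) = 0.5$ gives soft assignments proportional to $\exp[-(E(x) - Y(k))^2 / T]$; the data are one-dimensional; K is fixed in advance rather than incremented by the algorithm; and the name `da_cluster`, the cooling factor, the starting temperature and the small per-step perturbation are all illustrative choices.

```python
import math


def da_cluster(points, k=2, t_start=1000.0, t_final=1.0, cooling=0.9,
               em_iters=20, eps=1e-3):
    """Deterministic annealing clustering sketch for 1D data with fixed K.

    Uses a(x) = 1/N (uniform weights cancel in the center update),
    g(k) = 1 and s(k) = 0.5, so p(k|x) is proportional to exp(-(x - Y(k))^2 / T).
    """
    n = len(points)
    mean = sum(points) / n
    centers = [mean] * k          # at high T all centers sit at the data mean
    t = t_start
    while True:
        # Nudge centers apart so they can split once T drops far enough
        # (assumption: a fixed, deterministic perturbation).
        centers = [y + eps * (j - (k - 1) / 2.0)
                   for j, y in enumerate(centers)]
        for _ in range(em_iters):
            sums = [0.0] * k
            weights = [0.0] * k
            for x in points:
                d2 = [(x - y) ** 2 for y in centers]
                d_min = min(d2)   # subtract the minimum to avoid underflow
                w = [math.exp(-(d - d_min) / t) for d in d2]
                z = sum(w)
                for j in range(k):
                    p = w[j] / z  # E step: soft assignment p(k|x)
                    sums[j] += p * x
                    weights[j] += p
            # M step: each center becomes the weighted mean of the data.
            centers = [sums[j] / weights[j] if weights[j] > 0 else centers[j]
                       for j in range(k)]
        if t <= t_final:
            break
        t = max(t * cooling, t_final)  # lower the temperature, ending at t_final
    return centers
```

For example, `da_cluster([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])` keeps both centers near the overall mean while T is large and separates them as T falls, ending close to the two group means. Starting from the mean and letting the temperature decide when clusters separate is what lets annealing avoid the local minima that a randomly initialized hard clustering can fall into.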