SCALABLE HIGH PERFORMANCE DIMENSION REDUCTION
Seung-Hee Bae
OUTLINE
o Motivation
o Background and Related Work
o Research Issues
o Experimental Analysis
o Contributions
o References
MOTIVATION
o Data deluge era
  o Biological sequence and chemical compound data
  o Data-intensive computing: data analysis and mining are becoming increasingly important.
o High-dimensional data
  o Dimension reduction algorithms help people investigate the distribution of the original data.
  o Some datasets are hard to represent with feature vectors but do provide proximity information.
o Multidimensional Scaling (MDS)
  o Finds a mapping in the target dimension w.r.t. the proximity (dissimilarity) information.
  o Non-linear optimization problem.
  o Requires O(N^2) memory and computation.
MOTIVATION (2)
o How do we deal with large high-dimensional scientific data?
o Parallelization
  o MPI-based parallelization to utilize distributed-memory systems, e.g. cluster systems, in order to deal with large amounts of data.
  o Hybrid (MPI-threading) parallelism for multicore cluster systems.
o Interpolation (out-of-sample approach)
  o Reduces the required memory and computation based on the pre-mapping result of a subset of the data (called in-sample data).
OUTLINE
o Motivation
o Background and Related Work
o Research Issues
o Experimental Analysis
o Contributions
o References
BACKGROUND AND RELATED WORK
o Multidimensional Scaling (MDS) [1]
  o Given the proximity information among points.
  o An optimization problem that finds a mapping of the given data in the target dimension based on the pairwise proximity information while minimizing an objective function.
  o Objective functions: STRESS (1) or SSTRESS (2)
  o Needs only the pairwise dissimilarities δ_ij between the original points (not necessarily Euclidean distances).
  o d_ij(X) is the Euclidean distance between mapped (3D) points.
[1] I. Borg and P. J. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, New York, NY, U.S.A., 2005.
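The standard forms of these two objectives from [1], with w_ij a per-pair weight, are:

```latex
% STRESS (1) and SSTRESS (2)
\sigma(X)   = \sum_{i<j \le N} w_{ij}\,\bigl(d_{ij}(X)   - \delta_{ij}  \bigr)^2 \qquad (1)
\sigma^2(X) = \sum_{i<j \le N} w_{ij}\,\bigl(d_{ij}^2(X) - \delta_{ij}^2\bigr)^2 \qquad (2)
```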
BACKGROUND (2)
o Scaling by MAjorizing a COmplicated Function (SMACOF)
  o Iterative majorization algorithm to solve the MDS problem.
  o EM-like hill-climbing approach.
  o Decreases the STRESS value monotonically.
  o Tends to be trapped in local optima.
  o Computational complexity and memory requirement are O(N^2).
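To make the iteration concrete, below is a minimal NumPy sketch of SMACOF with uniform weights, where the Guttman transform simplifies to X <- B(X) X / N. It is only an illustration of the algorithm described above, not the thesis implementation, and the function and variable names are mine.

```python
import numpy as np

def smacof(delta, dim=3, n_iter=100, eps=1e-6, seed=0):
    """Plain SMACOF with uniform weights: a minimal sketch, not the thesis code.

    delta : (N, N) symmetric matrix of original dissimilarities delta_ij
    dim   : target dimension L (e.g. 3 for visualization)
    Returns (X, stress), where X is the (N, dim) embedding.
    """
    rng = np.random.default_rng(seed)
    N = delta.shape[0]
    X = rng.standard_normal((N, dim))          # random initial configuration
    old_stress = np.inf
    stress = old_stress
    for _ in range(n_iter):
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # d_ij(X)
        stress = np.sum((D - delta)[np.triu_indices(N, 1)] ** 2)    # STRESS with w_ij = 1
        if old_stress - stress < eps:          # monotone decrease; stop when it stalls
            break
        old_stress = stress
        # Guttman transform (uniform weights): X <- (1/N) * B(X) X
        B = np.zeros_like(delta)
        mask = D > 0
        B[mask] = -delta[mask] / D[mask]
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))
        X = B @ X / N
    return X, stress
```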
BACKGROUND (3)
o Deterministic Annealing (DA)
  o Simulated Annealing (SA) applies the Metropolis algorithm to find a solution by random walk.
  o Gibbs distribution at the computational temperature T.
  o Minimize the free energy F.
  o DA tries to avoid local optima without random walking.
  o DA finds the expected solution which minimizes F by calculating it exactly or approximately.
  o As T decreases, more of the structure of the problem space is revealed.
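The standard definitions behind these bullets, with H the energy (objective) and T the computational temperature, are:

```latex
P(X) = \frac{1}{Z(T)}\exp\!\left(-\frac{H(X)}{T}\right), \qquad
Z(T) = \int \exp\!\left(-\frac{H(X)}{T}\right) dX, \qquad
F(T) = -T\,\log Z(T)
```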
OUTLINE
o Motivation
o Background and Related Work
o Research Issues
  o DA-SMACOF
  o Interpolation (MI-MDS)
  o Parallelization
  o Non-uniform Weight
o Experimental Analysis
o Contributions
o References
DA-SMACOF
o If we use the STRESS objective function as the expected energy function of the MDS problem, and define P_0 and F_0 accordingly, then we minimize
  F_MDS(P_0) = <H_MDS − H_0>|_0 + F_0(P_0)  w.r.t. μ_i.
o The −<H_0>|_0 + F_0(P_0) part is independent of μ_i, so only the <H_MDS>|_0 part needs to be minimized w.r.t. μ_i.
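In the standard mean-field DA treatment, P_0 is a factorized trial distribution with means μ_i, H_0 its reference energy, and F_0 its free energy; the quantity above is then the Gibbs-Bogoliubov upper bound on the true free energy (my hedged reconstruction of the notation; the original slide equations are not reproduced here):

```latex
F_{\mathrm{MDS}}(P_0) \;=\; \bigl\langle H_{\mathrm{MDS}} - H_0 \bigr\rangle\big|_0 \;+\; F_0(P_0)
\;\;\ge\;\; F_{\mathrm{MDS}},
\qquad \langle\,\cdot\,\rangle\big|_0 \text{ denoting expectation w.r.t. } P_0
```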
DA-SMACOF (2)
o Apply the second moment under P_0, <x_i · x_i>|_0 = μ_i · μ_i + TL, to <H_MDS>|_0.
o Use the EM-like SMACOF algorithm to calculate the expected mapping which minimizes <H_MDS>|_0 at temperature T.
o The new STRESS at temperature T is the following:
DA-SMACOF (3)
o The MDS problem space is smoother at a higher T than at a lower T.
o T represents the portion of entropy contributed to the free energy F.
o Generally the DA approach starts with a very high T, but if T_0 is too high, all points are mapped to the origin.
o We need to find an appropriate T_0 at which at least one of the points is not mapped to the origin.
MI-MDS
o Why do we need interpolation?
  o MDS requires O(N^2) memory and computation.
  o For SMACOF, six N x N matrices are necessary.
    o N = 100,000: 480 GB of main memory required
    o N = 200,000: 1.92 TB of main memory required
  o Data deluge era
    o The PubChem database contains about 60 million chemical compounds.
    o Biological sequence data are also produced very fast.
o How can we construct a mapping in a target dimension for millions of points by MDS?
  o Interpolation (out-of-sample approach)
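The memory figures follow directly from storing six double-precision N x N matrices:

```latex
6 \times N^2 \times 8~\text{bytes}:\qquad
N = 10^5 \;\Rightarrow\; 4.8\times 10^{11}~\text{B} = 480~\text{GB},
\qquad
N = 2\times 10^5 \;\Rightarrow\; 1.92~\text{TB}
```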
MI-MDS (2)
o Majorizing Interpolation MDS (MI-MDS)
  o Based on the pre-mapped MDS result of n sample data.
  o Finds a new mapping of a new point based on the positions of its k nearest neighbors (k-NN) among the n sample data.
  o The iterative majorization method is used.
o Given:
  o N data points in high-dimensional space (D-dimensional)
  o Proximity information of the N data points (Δ = [δ_ij])
  o Pre-mapped result of the n sample data in the target-dimensional space (L-dimensional): x_1, …, x_n in R^L
MI-MDS (3)
o MI-MDS procedure
  o Select the k-NN w.r.t. δ_ix, where x represents the newly added point: p_1, …, p_k in P.
  o Let S = P ∪ {x}.
  o Finding the new mapping position for the new point x is treated as a minimization problem of the STRESS equation, similar to the normal MDS problem with k+1 points.
  o However, ONLY the one point x is movable among the k+1 points.
  o We can simplify the STRESS equation as below:
MI-MDS (4)
o We can expand the 2nd term of the above STRESS equation.
o In order to establish the majorizing inequality, apply the Cauchy-Schwarz inequality to −d_ix in the 3rd term.
o Applying these to the STRESS equation for MI-MDS, then:
MI-MDS (5)
o By setting the derivative of τ(x, z) equal to 0, we can obtain its minimum; that is:
o If we substitute z with x^[t-1], we obtain an iterative majorization update like the following:
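Below is a minimal NumPy sketch of this single-point update under the derivation above: with z = x^[t-1], the minimizer of τ(x, z) works out to the neighbor centroid plus a correction weighted by δ_ix / d_iz. The function name and defaults are illustrative, not the thesis code.

```python
import numpy as np

def mi_mds_point(P, delta, n_iter=50, eps=1e-8):
    """Interpolate one out-of-sample point from its k nearest in-sample neighbors.

    P     : (k, L) mapped positions p_1..p_k of the k nearest sample points
    delta : (k,)   original-space dissimilarities delta_ix to those neighbors
    Returns the L-dimensional position of the new point x.
    """
    k = P.shape[0]
    p_bar = P.mean(axis=0)                   # centroid of the neighbors
    x = p_bar.copy()                         # initial guess x^[0]
    for _ in range(n_iter):
        d = np.linalg.norm(x - P, axis=1)    # d_iz, distances from the current guess to neighbors
        d = np.maximum(d, eps)               # guard against division by zero
        # majorization update: x^[t] = p_bar + (1/k) * sum_i (delta_ix / d_iz) * (x^[t-1] - p_i)
        x_new = p_bar + ((delta / d)[:, None] * (x - P)).sum(axis=0) / k
        if np.linalg.norm(x_new - x) < eps:  # converged
            return x_new
        x = x_new
    return x
```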
PARALLELIZATION
o Why do we need to parallelize the MDS algorithm?
  o For a large data set, like 100k points, a data mining algorithm is no longer CPU-bound but memory-bound.
  o For instance, the SMACOF algorithm requires six N x N matrices, so at least 480 GB of memory is required to run SMACOF with 100k data points.
  o So, we have to utilize distributed memory to overcome the memory shortage.
o The main issues of parallelization are load balance and efficiency.
  o How do we decompose a matrix into blocks?
  o m-by-n block decomposition, where m * n = p, as sketched below.
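A small sketch of how an m-by-n block decomposition (m * n = p processes) can assign each rank a contiguous block of the N x N matrices; the function name and layout choices are illustrative assumptions, not the thesis implementation.

```python
def block_of_rank(rank, N, m, n):
    """Return the (row range, column range) of the N x N matrix owned by `rank`
    under an m-by-n block decomposition with p = m * n processes."""
    br, bc = divmod(rank, n)                       # block-row and block-column indices
    row_sizes = [N // m + (1 if i < N % m else 0) for i in range(m)]
    col_sizes = [N // n + (1 if j < N % n else 0) for j in range(n)]
    r0 = sum(row_sizes[:br])
    c0 = sum(col_sizes[:bc])
    return (r0, r0 + row_sizes[br]), (c0, c0 + col_sizes[bc])

# example: 16 processes on a 4 x 4 grid for N = 100,000
# block_of_rank(5, 100000, 4, 4) -> ((25000, 50000), (25000, 50000))
```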
PARALLELIZATION (2)
o Parallelize the following: computing STRESS, updating B(X), and the matrix multiplication.
PARALLELIZATION (3)
o Hybrid Parallel Model
  o Since the multicore chip was invented, most cluster systems are now multicore cluster systems, in which each compute node contains multiple computing units.
  o The MPI-only model is still widely used and shows good performance in some cases, but not all cases, on multicore cluster systems.
  o The hybrid model (combining MPI and threading) has been used for other data mining algorithms and outperforms the MPI-only model in some cases.
  o I will apply the hybrid parallel model to parallel SMACOF and investigate which parallel model is better for it.
PARALLELIZATION (4)
o Parallel MI-MDS
  o Though MI-MDS (O(Mn)) is much faster than the SMACOF algorithm (O(N^2)), it still needs to be parallelized since it deals with millions of points.
  o MI-MDS is pleasingly parallel, since the interpolated points (out-of-sample points) are totally independent of each other; see the sketch below.
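An illustrative mpi4py sketch of this pleasingly parallel structure: each rank interpolates its own slice of the out-of-sample points and the results are gathered at the root. It assumes mi_mds_point from the earlier MI-MDS sketch and hypothetical input loaders; it is not the thesis implementation.

```python
import numpy as np
from mpi4py import MPI

def interpolate_slice(delta_rows, sample_X, k, map_one):
    """Map a slice of out-of-sample points; each point is handled independently."""
    mapped = []
    for row in delta_rows:                    # row: dissimilarities of one new point to all n samples
        nn = np.argsort(row)[:k]              # indices of the k nearest in-sample neighbors
        mapped.append(map_one(sample_X[nn], row[nn]))
    return np.array(mapped)

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
# hypothetical inputs: each rank loads only its own slice of the M new points
# my_rows  = load_dissimilarity_rows(rank, size)        # (M/size, n) slice of new-point rows
# sample_X = load_sample_mapping()                      # (n, L) pre-mapped in-sample result
# my_X  = interpolate_slice(my_rows, sample_X, k=10, map_one=mi_mds_point)
# all_X = comm.gather(my_X, root=0)                     # collect mapped points at the root
```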
NON-UNIFORM WEIGHT
o The STRESS equation can be considered as a sum of weighted squared errors.
o Due to its simplicity and easy computation, a uniform weight is assumed in many cases.
o However, the mapping can differ w.r.t. the weight function.
  o A uniform weight focuses more on larger-distance entries than on small distances, whereas the reciprocal weight does the opposite.
o I will investigate various weight functions, e.g. 1/δ^2 and 1/√δ, as well as uniform and 1/δ; a few candidates are sketched below.
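A small sketch of these candidate weight functions and the weighted STRESS they plug into; the dictionary keys and function names are mine.

```python
import numpy as np

# candidate weight functions w(delta_ij) for the weighted STRESS
WEIGHTS = {
    "uniform":       lambda d: np.ones_like(d),
    "1/delta":       lambda d: 1.0 / d,
    "1/delta^2":     lambda d: 1.0 / d ** 2,
    "1/sqrt(delta)": lambda d: 1.0 / np.sqrt(d),
}

def weighted_stress(D, delta, w):
    """sum_{i<j} w(delta_ij) * (d_ij(X) - delta_ij)^2 over the upper triangle."""
    iu = np.triu_indices_from(delta, 1)
    return np.sum(w(delta[iu]) * (D[iu] - delta[iu]) ** 2)
```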
OUTLINE
o Motivation
o Background and Related Work
o Research Issues
o Experimental Analysis
  o SMACOF vs. DA-SMACOF
  o Parallel Performance of MPI-SMACOF
  o Interpolation Performance
o Contributions
o References
EXPERIMENTAL ANALYSIS
o Data
  o Biological sequence data
    o ALU sequences, metagenomics sequences
    o It is hard to find a feature vector for sequences, but it is possible to measure the pairwise dissimilarity of two sequences.
  o Chemical compound data
    o About 60 million compounds are in the PubChem database.
    o The 166-bit binary fingerprint of each compound is used as its feature vector.
    o The Euclidean distance between fingerprints is used as their dissimilarity information.
SMACOF VS. DA-SMACOF
SMACOF VS. DA-SMACOF (2)
PARALLEL PERFORMANCE
o Experimental Environments
PARALLEL PERFORMANCE (2)
o Performance comparison w.r.t. the block decomposition scheme
PARALLEL PERFORMANCE (3)
o Scalability Analysis
MI-MDS PERFORMANCE
o N = 100k points
MI-MDS PERFORMANCE (2)
o STRESS values of the 100k in-sample mapping and the M+100k interpolated mappings
MI-MDS PERFORMANCE (3)
OUTLINE
o Motivation
o Background and Related Work
o Research Issues
o Experimental Analysis
o Contributions
o References
CONTRIBUTIONS
o Main goal: construct a low-dimensional mapping of the given large, high-dimensional data with as high quality and for as many points as possible.
o Apply the DA approach to the MDS problem to avoid being trapped in local optima.
  o The proposed DA-SMACOF outperforms SMACOF in mapping quality and shows consistent results.
o Parallelize both SMACOF and DA-SMACOF via the MPI-only model and the hybrid parallel model.
  o This makes it possible to run faster and to deal with larger data.
o Propose an interpolation algorithm based on the iterative majorization method, called MI-MDS.
  o It deals with even more points, e.g. millions of data points, for which the normal MDS algorithm cannot run even on cluster systems.
o Also, I will investigate and analyze how weight values affect the MDS mapping.
REFERENCES
o Seung-Hee Bae. Parallel multidimensional scaling performance on multicore systems. In Proceedings of the Advances in High-Performance E-Science Middleware and Applications Workshop (AHEMA) of the Fourth IEEE International Conference on eScience, pages 695–702, Indianapolis, Indiana, Dec. 2008. IEEE Computer Society.
o G. Fox, S. Bae, J. Ekanayake, X. Qiu, and H. Yuan. Parallel data mining from multicore to cloudy grids. In Proceedings of the HPC 2008 High Performance Computing and Grids Workshop, Cetraro, Italy, July 2008.
o Jong Youl Choi, Seung-Hee Bae, Xiaohong Qiu, and Geoffrey Fox. High Performance Dimension Reduction and Visualization for Large High-dimensional Data Analysis. To appear in Proceedings of the 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010), Melbourne, Australia, May 2010.
o Seung-Hee Bae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox. Dimension Reduction and Visualization of Large High-dimensional Data via Interpolation. To appear in Proceedings of the ACM International Symposium on High Performance Distributed Computing (HPDC), Chicago, IL, June 2010.
THANKS! QUESTIONS?