Adaptive Interpolation of Multidimensional Scaling

Slides:

Advertisements

Similar presentations

Indiana University School of David Wild – CICC Quarterly Meeting, Jan Page 1 Projects 1-4 update David Wild CICC Quarterly Meeting January 27.

Advertisements

Scalable High Performance Dimension Reduction

DIMENSIONALITY REDUCTION: FEATURE EXTRACTION & FEATURE SELECTION Principle Component Analysis.

Analyzing large-scale cheminformatics and chemogenomics datasets through dimension reduction David J. Wild Assistant Professor & Director, Cheminformatics.

Multimedia DBs. Multimedia dbs A multimedia database stores text, strings and images Similarity queries (content based retrieval) Given an image find.

Hybrid MapReduce Workflow Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US.

High Performance Dimension Reduction and Visualization for Large High-dimensional Data Analysis Jong Youl Choi, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox.

Two Technique Papers on High Dimensionality Allan Rempel December 5, 2005.

Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets April 27, 2011.

Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.

Localization from Mere Connectivity Yi Shang (University of Missouri - Columbia); Wheeler Ruml (Palo Alto Research Center ); Ying Zhang; Markus Fromherz.

A Global Geometric Framework for Nonlinear Dimensionality Reduction Joshua B. Tenenbaum, Vin de Silva, John C. Langford Presented by Napat Triroj.

Dimensionality Reduction

Dimensionality Reduction. Multimedia DBs Many multimedia applications require efficient indexing in high-dimensions (time-series, images and videos, etc)

Parallel Data Analysis from Multicore to Cloudy Grids Indiana University Geoffrey Fox, Xiaohong Qiu, Scott Beason, Seung-Hee.

Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation Seung-Hee Bae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox School.

SALSASALSASALSASALSA Digital Science Center June 25, 2010, IIT Geoffrey Fox Judy Qiu School.

Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.

Generative Topographic Mapping in Life Science Jong Youl Choi School of Informatics and Computing Pervasive Technology Institute Indiana University

General Tensor Discriminant Analysis and Gabor Features for Gait Recognition by D. Tao, X. Li, and J. Maybank, TPAMI 2007 Presented by Iulian Pruteanu.

Presenter: Yang Ruan Indiana University Bloomington

Generative Topographic Mapping by Deterministic Annealing Jong Youl Choi, Judy Qiu, Marlon Pierce, and Geoffrey Fox School of Informatics and Computing.

Parallel Applications And Tools For Cloud Computing Environments Azure MapReduce Large-scale PageRank with Twister Twister BLAST Thilina Gunarathne, Stephen.

S CALABLE H IGH P ERFORMANCE D IMENSION R EDUCTION Seung-Hee Bae.

Community Grids Lab. Indiana University, Bloomington Seung-Hee Bae.

Multidimensional Scaling by Deterministic Annealing with Iterative Majorization Algorithm Seung-Hee Bae, Judy Qiu, and Geoffrey Fox SALSA group in Pervasive.

Service Aggregated Linked Sequential Activities: GOALS: Increasing number of cores accompanied by continued data deluge Develop scalable parallel data.

ISOMAP TRACKING WITH PARTICLE FILTER Presented by Nikhil Rane.

Paired Sampling in Density-Sensitive Active Learning Pinar Donmez joint work with Jaime G. Carbonell Language Technologies Institute School of Computer.

Deterministic Annealing Dimension Reduction and Biology Indiana University Environmental Genomics April Geoffrey.

An Efficient Linear Time Triple Patterning Solver Haitong Tian Hongbo Zhang Zigang Xiao Martin D.F. Wong ASP-DAC’15.

Parallel Applications And Tools For Cloud Computing Environments SC 10 New Orleans, USA Nov 17, 2010.

SALSA HPC Group School of Informatics and Computing Indiana University.

Optimal Dimensionality of Metric Space for kNN Classification Wei Zhang, Xiangyang Xue, Zichen Sun Yuefei Guo, and Hong Lu Dept. of Computer Science &

SCALABLE AND ROBUST DIMENSION REDUCTION AND CLUSTERING

Looking at Use Case 19, 20 Genomics 1st JTC 1 SGBD Meeting SDSC San Diego March Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox

Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics,

SALSASALSASALSASALSA Digital Science Center February 12, 2010, Bloomington Geoffrey Fox Judy Qiu

Parallel Applications And Tools For Cloud Computing Environments CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.

Intelligent Database Systems Lab Presenter : WU, MIN-CONG Authors : STEPHEN T. O’ROURKE, RAFAEL A. CALVO and Danielle S. McNamara 2011, EST Visualizing.

FastMap : Algorithm for Indexing, Data- Mining and Visualization of Traditional and Multimedia Datasets.

ViSOM － A Novel Method for Multivariate Data Projection and Structure Visualization Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Hujun Yin.

Similarity Measurement and Detection of Video Sequences Chu-Hong HOI Supervisor: Prof. Michael R. LYU Marker: Prof. Yiu Sang MOON 25 April, 2003 Dept.

Yang Ruan PhD Candidate Salsahpc Group Community Grid Lab Indiana University.

A Sampling-based Estimator for Top-k Selection Query Chung-Min ChenYibei Ling ICDE 2002 Presented by Kan Kin Fai.

Out of sample extension of PCA, Kernel PCA, and MDS WILSON A. FLORERO-SALINAS DAN LI MATH 285, FALL

Localization for Anisotropic Sensor Networks

Digital Science Center II

Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox

Visual exploratory data analysis:

Service Aggregated Linked Sequential Activities

Visual exploratory data analysis:

Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions Introduction.

Our Objectives Explore the applicability of Microsoft technologies to real world scientific domains with a focus on data intensive applications Expect.

Computer-Generated Force Acceleration using GPUs: Next Steps

Geoffrey Fox, Huapeng Yuan, Seung-Hee Bae Xiaohong Qiu

Applying Twister to Scientific Applications

Biology MDS and Clustering Results

DACIDR for Gene Analysis

Outline Nonlinear Dimension Reduction Brief introduction Isomap LLE

Overview Identify similarities present in biological sequences and present them in a comprehensible manner to the biologists Objective Capturing Similarity.

A weight-incorporated similarity-based clustering ensemble method based on swarm intelligence Yue Ming NJIT#:

Lecture 22 Clustering (3).

CSc4730/6730 Scientific Visualization

Segmentation of Images By Color

Design of Hierarchical Classifiers for Efficient and Accurate Pattern Classification M N S S K Pavan Kumar Advisor : Dr. C. V. Jawahar.

Towards High Performance Data Analytics with Java

INTRODUCTION TO Machine Learning

Automatic and Efficient Data Virtualization System on Scientific Datasets Li Weng.

MDS and Visualization September Geoffrey Fox

Presentation transcript:

Adaptive Interpolation of Multidimensional Scaling Seung-Hee Bae, Judy Qiu, and Geoffrey C. Fox School of Informatics and Computing Pervasive Technology Institute Indiana University

Outline Data Visualization Multidimensional Scaling (MDS) Interpolation of MDS Adaptive Interpolation of MDS Experimental Analysis Conclusion

Data Visualization Visualize high-dimensional data as points in 2D or 3D by dimension reduction. Distances in target dimension approximate to the distances in the original HD space. Interactively browse data Easy to recognize clusters or groups An example of Biological Sequence data MDS Visualization of 73885 biological sequence data colored by clustering results. The number of cluster centers is 26.

Multidimensional Scaling Given the proximity information [Δ] among points. Optimization problem to find mapping in target dimension. Objective functions: STRESS (1) or SSTRESS (2) Only needs pairwise dissimilarities ij between original points (not necessary to be Euclidean distance) dij(X) is Euclidean distance between mapped (3D) points Various MDS algorithms have been proposed: Classical MDS, SMACOF, force-based algorithms, …

Interpolation of MDS Why do we need interpolation? MDS requires O(N2) memory and computation. For SMACOF, six N * N matrices are necessary. N = 100,000  480 GB of main memory required N = 200,000  1.92 TB ( > 1.536 TB) of memory required Data deluge era PubChem database contains millions chemical compounds Biology sequence data are also produced very fast. How to construct a mapping in a target dimension with millions of points by MDS?

Interpolation Approach Two-step procedure A dimension reduction alg. constructs a mapping of n sample data (among total N data) in target dimension. Remaining (N-n) out-of-samples are mapped in target dimension w.r.t. the constructed mapping of the n sample data w/o moving sample mappings. Prior Mapping n In-sample N-n Out-of-sample Total N data Training Interpolation Interpolated map

Interpolation of MDS Merits Cost Reduce time complexity O(N2)  O(n(N – n)) Reduce memory requirement Pleasingly parallel application Cost Quality degradation of the mapping due to the approximation. How to reduce the quality gap between full MDS and Interpolation of MDS?

Adaptive Interpolation of MDS Distance ratio : avg. of distances : avg. of dissimilarities 1/r 1/r > 1.0 : 96% 1.0 < 1/r < 5.0: 75%

Adaptive Interpolation of MDS Adaptive Interpolation of MDS (AI-MDS) Interpolate points based on prior mappings of the sample data in terms of the adaptive dissimilarities between interpolated points and k-NNs. Adaptive dissimilarity:

AI-MDS Algorithm

Experimental Environment

AI-MDS Performance N = 100k points

AI-MDS Performance

PubChem data visualization by using AI-MDS and MI-MDS (2M+100k). MDS Interpolation Map PubChem data visualization by using AI-MDS and MI-MDS (2M+100k).

Conclusion MDS is computation and memory intensive algorithm. MI-MDS was proposed for reducing time complexity with minor quality loss. This paper proposes an adaptive interpolation of MDS (AI-MDS) to reduce the quality loss by adapting the dissimilarity based on distance ratio. AI-MDS configures millions of points with more than 40% improvement. The proposed AI-MDS generates better mappings of the tested data during faster running time than MI-MDS.

Acknowledgement NIH Grant No. 5 RC2 HG 005806- 02 Microsoft for supporting experimental environment. Prof. Wild and Dr. Zhu at Indiana University for providing pubchem data.

Email me at sebae@cs.indiana.edu Thanks! Questions? Email me at sebae@cs.indiana.edu

Data Visualization Visualize high-dimensional data as points in 2D or 3D by dimension reduction. Distances in target dimension approximate to the distances in the original HD space. Interactively browse data Easy to recognize clusters or groups An example of Solvent data MDS Visualization of 215 solvent data (colored) with 100k PubChem dataset (gray) to navigate chemical space.