Visual exploratory data analysis:

Presentation transcript:

Visual exploratory data analysis: data embedding (DE) & graph visualization (GV)
Witold Dzwinel
Big Data Analysis and Data Mining, Paris, 7-8 September 2017

Visual data mining (VDM) [Felizardo et al. 2012]
1. Hypotheses verification
2. ML algorithm adaptation and tuning
3. Matching the best data representation

The problem
Visualization of two data representations in 2-D (3-D) Euclidean space:
1. high-dimensional data (HD) ↔ M N-dimensional feature vectors Y (data embedding, DE)
2. complex networks G(V,E,W) (graph visualization, GV)
where N, M, #V, #E are huge.
How to preserve in 2-D the main topological features of these data representations?
1. the neighborhood (fine-grained)
2. the cluster structure (coarse-grained)

HD embedding Y → X: the dissimilarity-matrix representation of the data
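The dissimilarity-matrix representation mentioned above can be sketched in a few lines of NumPy. This is a generic illustration, not the talk's implementation: it computes the full M × M Euclidean distance matrix for the feature vectors Y, which is exactly the O(M²) object the later slides set out to avoid.

```python
import numpy as np

def dissimilarity_matrix(Y):
    """Full M x M Euclidean distance matrix for M row vectors Y.

    Uses the identity ||y_i - y_j||^2 = ||y_i||^2 + ||y_j||^2 - 2 y_i.y_j
    to avoid an explicit double loop.
    """
    sq = np.sum(Y ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T)
    np.maximum(D2, 0.0, out=D2)   # clip tiny negatives from rounding
    return np.sqrt(D2)
```

Storing this matrix is O(M²) in memory, which motivates the sparsified neighbor sets proposed later in the talk.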

Bottlenecks
1. Storage and computational complexity: O(M²) and O(M log M), e.g. for methods based on stochastic neighbor embedding (bh-SNE, q-SNE, w-SNE, LargeVis, etc.) and ForceAtlas-based GV algorithms
2. Manifold problem
3. Curse of dimensionality

Computational complexity
Existing VE and GV methods based on distances are strongly overdetermined: in 2-D, as few as ~2M distances can define a stable solution (rigid graphs).
Which distances?

Manifold problem

HDD ↔ graph representation
k-nearest-neighbor graphs ↔ DE
A k-NN graph is not rigid! Other distances are necessary for k-NN graph visualization.
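The k-NN graph that links HDD to the graph representation can be built, for illustration, by brute force. A minimal NumPy sketch (the talk's actual software uses fast CUDA-based k-NN search, mentioned in the closing slides):

```python
import numpy as np

def knn_graph(Y, k):
    """Brute-force k-nearest-neighbor graph: for each row vector of Y,
    return the indices of its k closest other vectors."""
    sq = np.sum(Y ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T)
    np.fill_diagonal(D2, np.inf)           # exclude self-edges
    return np.argsort(D2, axis=1)[:, :k]   # k nearest per vertex
```

With only these k edges per vertex the graph is underdetermined (not rigid), which is why additional distances are needed for its visualization.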

Computational complexity
We propose a drastic simplification of the distance matrix. For each i (a data vector or graph vertex), find two small sets:
1. for DE: NN(i), the k nearest neighbors, and RN(i), r random neighbors
2. for GV: NN(i), the connected vertices, and RN(i), r disconnected vertices
We assume that k + r ~ N (the dimensionality of Y). This gives O(M) linear time and memory complexity for both DE and GV algorithms.
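The NN(i)/RN(i) construction above can be sketched as follows. This is an assumption-laden illustration: the nearest-neighbor step here is brute force (still O(M²)), whereas the claimed linear complexity relies on approximate k-NN search; the function name and `seed` parameter are my own.

```python
import numpy as np

def neighbor_sets(Y, k, r, seed=0):
    """For each data vector i, return NN(i) (its k nearest neighbors)
    and RN(i) (r random neighbors drawn outside NN(i) and i itself).
    Only (k + r) * M distances are kept instead of all M^2."""
    rng = np.random.default_rng(seed)
    M = len(Y)
    sq = np.sum(Y ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T)
    np.fill_diagonal(D2, np.inf)
    NN = np.argsort(D2, axis=1)[:, :k]
    RN = np.empty((M, r), dtype=int)
    for i in range(M):
        excluded = set(NN[i]) | {i}
        pool = np.array([j for j in range(M) if j not in excluded])
        RN[i] = rng.choice(pool, size=r, replace=False)
    return NN, RN
```

For GV the same structure applies, with NN(i) read off the edge list instead of computed from distances.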

Curse of dimensionality
1. Increase the contrast between the nearest (connected) neighbors and the random neighbors (vertices).
2. Use a force-directed method to minimize the stress function.
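A single force-directed update over the NN/RN sets might look like the sketch below. This is a generic attraction/repulsion scheme under stated assumptions, not the talk's stress function: the parameter names `lr` and `repulsion` are hypothetical.

```python
import numpy as np

def stress_step(X, NN, RN, lr=0.05, repulsion=0.01):
    """One force-directed update of the 2-D layout X: each point is
    attracted toward its nearest (connected) neighbors NN[i] and
    repelled from its random neighbors RN[i]."""
    G = np.zeros_like(X)
    for i in range(len(X)):
        for j in NN[i]:
            G[i] += X[j] - X[i]                      # attraction
        for j in RN[i]:
            d = X[i] - X[j]
            G[i] += repulsion * d / (d @ d + 1e-9)   # repulsion
    return X + lr * G
```

Because each point interacts with only k + r others, one sweep costs O((k + r)M), matching the linear complexity claimed on the previous slide.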

Examples: MNIST (N=784, M=70000, C=10)
A well-balanced set of gray-scale handwritten digit images (http://yann.lecun.com/exdb/mnist/).
(Timings shown on the slide: T=11 min, T=30 min, T<1 min.)

Examples: NORB (small) (M=43600, N=2048)
The NORB dataset (NYU Object Recognition Benchmark) contains stereo image pairs of 50 uniform-colored toys under 18 azimuths, 9 elevations, and 6 lighting conditions.

DBN autoencoder: 30 min [Snoek et al., 2012]

Autoencoder [Snoek et al., 2012]; NORB: 1 min [van der Maaten, 2014]

Examples: Reuters (N=2000, M=266931, C=8)
t-SNE on a subset (M~58000, N=2000): 5 h.
A strongly imbalanced text corpus known as RCV1. We used a subset of this repository consisting of 8 clusters (http://about.reuters.com/researchandstandards/corpus/).

Examples: Reuters 5 min

Complex network visualization
Historic articles from Wikipedia and the links between them.

Fine structure of historic graph

Big graphs (social networks)

State of the art: 2167.88 s vs. 250 s (http://yifanhu.net/index.html, AT&T Labs Research)

Internet topology

Internet topology

Patent database

Patents database

Conclusions
1. Low memory complexity: O(nM)
2. Low computational complexity: O((n+r)M)
3. High level of parallelization (PM)
4. Easy implementation on Big Data platforms (Hadoop, Apache Spark)
5. Near neighbors (NeN) instead of NN!
6. Big-graph visualization

We have ...
1. Desktop versions with GUI for interactive visualization of large HD data (IVTA) and GV (IVGA).
2. Ultrafast methods for k-NN search implemented in CUDA.
3. Parallel GV software (CUDA, MPI) employing B-matrices and algebraic graph representations.
4. Feature-extraction software (CUDA) based on DBNs.

Future work
1. Developing VE and GV systems for distributed data visualization involving big-data architectures (Hadoop, Spark, ...).
2. Employing algebraic descriptors for data analytics and new data-manipulation techniques.
3. Using our DBN software for data preprocessing, i.e., feature extraction for big distributed data repositories.

Acknowledgments
This research was supported by the Polish National Center of Science (NCN), project #DEC-2013/09/B/ST6/01549.