1
Interactive Visualization and Tuning of Multi-Dimensional Clusters for Indexing
Dasari Pavan Kumar (MS by Research thesis)
Centre for Visual Information Technology, IIIT Hyderabad
2
Overview
Provide a framework to generate better clusters for high-dimensional data points
Provide a fast cluster analysis/generation tool
3
Data, Data, Data!
Digital data is being created at an unprecedented rate
Data is collected to extract/search "valuable" information – a difficult task, however!
Data generated in the previous decade consisted mostly of textual information – inverted indexes, suffix trees, N-grams, etc.
4
More data!
Flickr, YouTube, etc. changed the game
– Non-textual information (images)
– Huge amounts of data!
New methods: Content-Based Image Retrieval
– The underlying processes remain similar
Why image search?
– Copyright infringement, offensive content, education, etc.
5
Multi-dimensional, multi-variate data
Stock markets, weather/climate, business
Huge datasets with multiple dimensions
Finding "insights" cannot be fully automated
6
Data Visualization
Human intelligence/cognition is still unmatched by computers
Cluster analysis – descriptive modeling
Information visualization to support analysis
– Identify important features/patterns
7
Past attempts: what if you have millions of high-dimensional data points?
XMDV tool (M. Ward)
– Scatter-plot matrix
– Parallel coordinate plot
Cluster tree (Stuetzle)
Cone trees (Robertson et al.)
8
Indexing images/videos
Extract feature vectors from images
Apply clustering to compute a bag of visual words
Generate feature histograms and apply machine-learning methods (a minimal sketch of this pipeline follows)
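To make the pipeline concrete, here is a minimal bag-of-visual-words sketch. It assumes 128-dimensional SIFT descriptors have already been extracted per image and uses scikit-learn's KMeans purely for illustration (the thesis itself uses a GPU k-means); the function names `build_vocabulary` and `image_histogram` are hypothetical.

```python
# Minimal bag-of-visual-words sketch (illustrative, not the thesis implementation).
# Assumes `descriptors_per_image` is a list of (n_i, 128) SIFT descriptor arrays.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors_per_image, n_words=1000, seed=0):
    """Cluster all descriptors into a visual vocabulary (the 'bag of words')."""
    all_desc = np.vstack(descriptors_per_image)           # (N, 128)
    kmeans = KMeans(n_clusters=n_words, random_state=seed, n_init=4)
    kmeans.fit(all_desc)
    return kmeans

def image_histogram(kmeans, descriptors):
    """Quantize one image's descriptors and build a normalized word histogram."""
    words = kmeans.predict(descriptors)                   # visual-word id per descriptor
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

The resulting per-image histograms are what a downstream classifier (for example an SVM) would be trained on.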
10
Using SIFT features
The fundamental problem – sheer volume of data
– Number of dimensions: 128
– Number of data points: in the millions
Other low-level image features exist – GLOH, steerable filters, spin images
11
Clusters + visualization
The problem – choosing the right bag of words (clusters)
Better visual words lead to better classification
12
Cluster analysis
Provide a framework for the user to
– Identify better subspaces
– Efficiently/quickly compute clusters
– Compare clustering schemas
13
Framework
Extracted low-level image descriptors are reduced by statistical sampling to a manageable (still high-dimensional) set. Priorities/weights are assigned to the features, either from automatic weight recommendations (1…N) or by user-defined weight re-assignment. The weighted data is clustered into visual words and shown in the visualization system for verification: if the schema looks bad, the weights are re-adjusted; if it looks good, the entire set is clustered and the schema is output.
14
Tool
15
Framework (overview diagram repeated)
16
Why prioritize dimensions?
Dimensionality reduction!
– Feature transformation
– Feature selection
17
Why not feature transformation?
Dimensions can be redundant/irrelevant
– Hence PCA cannot be trivially applied
Clusters could be lost in a cloud of dimensions (curse of dimensionality)
Transformed combinations of dimensions are difficult to interpret
18
Feature selection
Wrapper model
– "Wraps" the selection process around the mining algorithm
– The two go hand in hand, giving little control
Filter model
– Examines intrinsic properties of the data
19
"Interesting" dimensions
Without any rank
– Analyze the density distribution based on grids
– Difficult to compare since it is highly dependent on the density parameter
Rank dimensions based on the distribution of the data
– Uniformity (entropy)
– Number of outliers: a value d is an outlier if d > Q3 + 1.5*IQR or d < Q1 - 1.5*IQR
– Number of unique values
(a sketch of these per-dimension measures follows)
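A plain NumPy sketch of the three per-dimension measures listed above: entropy of a 1D histogram, outlier count via the IQR rule, and unique-value count. The bin count and the function name are assumptions, not the thesis settings.

```python
# Per-dimension "interestingness" measures: histogram entropy, IQR outliers, unique values.
import numpy as np

def dimension_stats(X, bins=32):
    """X: (n_points, n_dims) array. Returns per-dimension entropy, outlier count, unique count."""
    n, d = X.shape
    entropy = np.empty(d)
    outliers = np.empty(d, dtype=int)
    uniques = np.empty(d, dtype=int)
    for j in range(d):
        col = X[:, j]
        hist, _ = np.histogram(col, bins=bins)
        p = hist / n
        p = p[p > 0]
        entropy[j] = -np.sum(p * np.log2(p))              # uniformity of the distribution
        q1, q3 = np.percentile(col, [25, 75])
        iqr = q3 - q1
        outliers[j] = np.sum((col > q3 + 1.5 * iqr) | (col < q1 - 1.5 * iqr))
        uniques[j] = np.unique(col).size
    return entropy, outliers, uniques
```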
20
Ranked dimensions
Assign weights based on the amount of "interestingness"
– 1D histograms of the distribution
– 2D correlations – parallel coordinate plot (PCP)
How do we assign weights? Manually, or from automatic suggestions (one possible mapping from measures to weights is sketched below)
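One possible way to turn the per-dimension measures into suggested weights is to normalize each measure to [0, 1] and combine them. The particular combination below is an illustrative assumption, and the suggestion is exactly the kind of output the user is expected to override interactively.

```python
# Illustrative mapping from per-dimension measures to suggested weights (an assumption,
# not the thesis's recommendation scheme). Uses the outputs of dimension_stats() above.
import numpy as np

def suggest_weights(entropy, outliers, uniques):
    def norm(v):
        v = np.asarray(v, dtype=float)
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.ones_like(v)
    # Here, lower entropy (more structured) and fewer outliers are treated as more
    # interesting, and more unique values as more informative; this is only one choice.
    score = (1.0 - norm(entropy)) + (1.0 - norm(outliers)) + norm(uniques)
    return score / score.max()                             # suggested weights in (0, 1]
```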
21
Glyph view
Standard SIFT glyph
Bar chart
– Length encodes rank
– Color encodes weight (via a colormap)
(a plotting sketch of the bar-chart glyph follows)
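A rough matplotlib sketch of the bar-chart glyph only: bar length encodes a dimension's rank and bar color encodes its current weight through a colormap. The colormap choice and figure layout are assumptions, not the tool's actual rendering.

```python
# Illustrative bar-chart glyph: length = rank, color = weight (via a colormap).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import Normalize

def plot_dimension_glyph(ranks, weights):
    """ranks, weights: length-d arrays; weights are assumed to lie in [0, 1]."""
    ranks = np.asarray(ranks, dtype=float)
    weights = np.asarray(weights, dtype=float)
    colors = cm.viridis(weights)                           # weight -> color
    fig, ax = plt.subplots(figsize=(10, 2))
    ax.bar(np.arange(len(ranks)), ranks, color=colors, width=1.0)
    ax.set_xlabel("dimension")
    ax.set_ylabel("rank")
    fig.colorbar(cm.ScalarMappable(norm=Normalize(0, 1), cmap="viridis"),
                 ax=ax, label="weight")
    plt.show()
```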
22
Framework (overview diagram repeated)
23
Data clustering
Sample data set – 1.3 million points with 128 dimensions
Clustering such data on a commodity PC – almost impossible
24
Data clustering
Any clustering technique can be plugged in – currently k-means on the GPU
Currently 200 iterations for 1.3 million SIFT vectors
– About 12 seconds per iteration for 1000 clusters
(a CPU sketch of a weighted k-means iteration follows)
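For reference, a plain NumPy sketch of a single Lloyd (k-means) iteration with the per-dimension weights applied by scaling the feature space. The thesis uses a GPU implementation; treating the weights as a per-dimension scaling before the distance computation is an assumption about how the weighted clustering is realized.

```python
# One Lloyd iteration of k-means with per-dimension weights (CPU sketch only).
import numpy as np

def kmeans_iteration(X, centers, weights):
    """X: (n, d), centers: (k, d), weights: (d,). Returns updated centers and labels."""
    Xw = X * weights                                       # scale dimensions by their weights
    Cw = centers * weights
    # squared distances via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
    d2 = (Xw * Xw).sum(1)[:, None] - 2.0 * Xw @ Cw.T + (Cw * Cw).sum(1)[None, :]
    labels = d2.argmin(axis=1)
    new_centers = centers.copy()
    for k in range(centers.shape[0]):
        members = X[labels == k]
        if len(members):                                   # keep the old center if a cluster empties
            new_centers[k] = members.mean(axis=0)
    return new_centers, labels
```

Repeating this step (200 iterations in the reported setup) converges to the visual-word centers.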
25
Framework (overview diagram repeated)
26
Cluster visualization
Visualizing clusters over 128 dimensions directly is not feasible
Re-project into 2D space – some sort of layout is necessary
Any graph-drawing method can be plugged in – currently a 2D force-based layout
27
Graph representation
Compute a cluster tree of nearest-neighbour density
– Similar nodes must be close
– Can be estimated using an MST
Generate the minimum spanning tree (MST) of the cluster centers
– Equivalent to a single-linkage dendrogram
– Built with Prim's method (a sketch follows)
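A NumPy sketch of Prim's method over the complete graph of cluster centers; the Euclidean metric and the edge-list output format are assumptions for illustration.

```python
# Prim's algorithm on the complete Euclidean graph of cluster centers.
import numpy as np

def prim_mst(centers):
    """centers: (k, d). Returns a list of (i, j, distance) edges forming the MST."""
    k = centers.shape[0]
    in_tree = np.zeros(k, dtype=bool)
    in_tree[0] = True
    best_dist = np.linalg.norm(centers - centers[0], axis=1)  # cheapest link to the tree
    best_dist[0] = np.inf
    best_from = np.zeros(k, dtype=int)
    edges = []
    for _ in range(k - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best_dist)))
        edges.append((int(best_from[j]), j, float(best_dist[j])))
        in_tree[j] = True
        dj = np.linalg.norm(centers - centers[j], axis=1)
        closer = (~in_tree) & (dj < best_dist)
        best_dist[closer] = dj[closer]
        best_from[closer] = j
    return edges
```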
28
Graph drawing
Use a GPU implementation of a force-based graph layout
– Takes 0.2 seconds for 1000 nodes
Drill down into a "visual word" to see the SIFT interest points it contains and understand the similarity
(Figures: MST without layout, MST with layout; a CPU sketch of the force-based layout follows)
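A simple CPU sketch of a Fruchterman-Reingold-style force-based layout applied to the MST nodes. The thesis uses a GPU implementation (about 0.2 s for 1000 nodes); the iteration count, cooling schedule, and O(n²) repulsion below are illustrative choices, not the tool's parameters.

```python
# Force-based 2D layout of the MST (Fruchterman-Reingold style, CPU sketch).
import numpy as np

def force_layout(n_nodes, edges, iters=200, area=1.0, seed=0):
    """edges: list of (i, j, w) from prim_mst(). Returns (n_nodes, 2) positions."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-0.5, 0.5, size=(n_nodes, 2))
    k = np.sqrt(area / n_nodes)                            # ideal edge length
    temp = 0.1
    for _ in range(iters):
        delta = pos[:, None, :] - pos[None, :, :]          # pairwise displacements (n, n, 2)
        dist = np.linalg.norm(delta, axis=-1) + 1e-9
        disp = (delta / dist[..., None] * (k * k / dist)[..., None]).sum(axis=1)  # repulsion
        for i, j, _w in edges:                             # attraction along MST edges
            d = pos[i] - pos[j]
            dn = np.linalg.norm(d) + 1e-9
            f = (d / dn) * (dn * dn / k)
            disp[i] -= f
            disp[j] += f
        step = np.linalg.norm(disp, axis=1, keepdims=True) + 1e-9
        pos += disp / step * np.minimum(step, temp)        # limit movement by temperature
        temp *= 0.97                                       # cool down
    return pos
```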
29
Similar-looking regions are clustered into the same id
30
Cluster validation
Two clustering schemas are visually not feasible to compare
Three basic strategies
– Internal: compare schema C with the proximity matrix
– External: build an independent partition according to our intuition and compare it with schema C or the proximity matrix
– Relative: choose the schema that best fits
Internal and external validation are computationally not feasible at this scale
31
Relative validity
Some indices
– RS value
– Davies-Bouldin index
– SD index
Around 1 minute per schema C on the CPU; the GPU implementation takes about 1 second
(a CPU sketch of the Davies-Bouldin index follows)
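A plain NumPy sketch of the Davies-Bouldin index (lower is better); the GPU version in the thesis computes the same quantity far faster. scikit-learn's `davies_bouldin_score` can serve as a cross-check.

```python
# Davies-Bouldin index (lower is better), CPU sketch.
import numpy as np

def davies_bouldin(X, labels, centers):
    """X: (n, d), labels: (n,), centers: (k, d)."""
    k = centers.shape[0]
    # average distance of each cluster's points to its center (cluster scatter S_i)
    scatter = np.array([
        np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
        if np.any(labels == i) else 0.0
        for i in range(k)
    ])
    center_dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    db = 0.0
    for i in range(k):
        ratios = [(scatter[i] + scatter[j]) / center_dist[i, j] for j in range(k) if j != i]
        db += max(ratios)
    return db / k
```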
32
Validity indices
The indices are plotted as a line graph against the number of clusters
– The minimum/maximum of the graph gives the optimal number of clusters Nc
(Figure: index value vs. Nc/iteration; a sketch of this selection follows)
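An illustrative sweep over candidate numbers of clusters: score each clustering with a relative validity index and pick the extremum (the minimum for Davies-Bouldin). The candidate values and the use of scikit-learn here are assumptions made for the sketch.

```python
# Sweep candidate vocabulary sizes and pick the Nc with the best (lowest) index value.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def pick_num_clusters(X, candidates=(200, 400, 600, 800, 1000)):
    scores = []
    for nc in candidates:
        km = KMeans(n_clusters=nc, n_init=2, random_state=0).fit(X)
        scores.append(davies_bouldin_score(X, km.labels_))
    best = candidates[int(np.argmin(scores))]              # lower Davies-Bouldin is better
    return best, scores
```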
33
Framework (overview diagram repeated)
34
Automatic weight recommendation
Only a suggestive process
Final decision left to the user
35
Results on the UIUC image collection
A total of 4485 images in 15 categories
Mean classification accuracy of 57.6% for SIFT with DoG
36
Interesting observation
Orientations 135°, 215°, 270° are assigned lower weights by the automatic schemas
The same holds for the corner cells
Ds = {4, 12, 22, 43, 44, 54, 55, 71, 78, 79, 83, 84, 110, 116}
(Figure: 1D histograms corresponding to dimensions (a) 84, (b) 110, (c) 124)
37
Results on the UIUC image collection
More clusters does not necessarily mean better classification
Fei-Fei et al. report a mean accuracy of 52.5%
(Table legend: VW = number of visual words, EW = k-means using uniform weights, IW = k-means with weights adjusted interactively, IW-Ds = k-means with the Ds dimensions given a weight of zero and the weights of the other dimensions adjusted interactively)
39
Summary
A framework for generating better clusters
A fast cluster analysis/generation tool for a commodity PC equipped with a GPU
Able to analyze distributions across dimensions
– Identified redundant dimensions
Able to achieve higher classification accuracy with relative ease
40
Publications
Interactive Visualization and Tuning of SIFT Indexing, Dasari Pavan Kumar and P. J. Narayanan, Vision, Modelling and Visualization, 2010, Siegen, Germany
41
Limitations
Limited by GPU and CPU memory
The user needs to become familiar with the tool
Visual decoding of the data is sometimes difficult
Cluster generation still depends on parameters such as K (the number of clusters)
42
Future Work
Provide a brush for the PCP view
Incorporate support for subspace clustering
Conduct experiments based on wrapper clustering methods
43
Thank you