Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map
Daniel X. Pape
Community Architectures for Network Information Systems
CSNA’98, 6/18/98
Overview
Self-Organizing Map (SOM) Algorithm
U-Matrix Algorithm for SOM Visualization
SOM Navigation Application
Document Representation and Collection Examples
Problems and Optimizations
Future Work
Basic SOM Algorithm: Input
A number (n) of feature vectors (x), in the format:
vector name: a, b, c, d
Examples:
1: 0.1, 0.2, 0.3, 0.4
2: 0.2, 0.3, 0.3, 0.2
Basic SOM Algorithm: Output
A neural network map of (M) nodes.
Each node has an associated weight vector (m) of the same dimensionality as the input feature vectors. Examples:
m1: 0.1, 0.2, 0.3, 0.4
m2: 0.2, 0.3, 0.3, 0.2
Basic SOM Algorithm: Output (cont.)
Nodes laid out in a grid:
Basic SOM Algorithm: Other Parameters
Number of timesteps (T)
Learning rate (eta)
Basic SOM Algorithm

    SOM() {
      foreach timestep t {
        foreach feature vector fv {
          wnode = find_winning_node(fv)
          update_local_neighborhood(wnode, fv)
        }
      }
    }

    find_winning_node(fv) {
      foreach node n {
        compute distance of n's weight vector m to fv
      }
      return node with the smallest distance
    }

    update_local_neighborhood(wnode, x) {
      foreach node n in the neighborhood of wnode {
        m = m + eta * (x - m)
      }
    }
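A minimal runnable sketch of this loop in Python with NumPy (the grid shape, parameter defaults, and the Gaussian neighborhood decay are illustrative assumptions; the pseudocode above specifies only a flat eta update over a local neighborhood):

    import numpy as np

    def som(data, rows=10, cols=10, T=100, eta=0.1, radius=2.0):
        # data: (n, d) array of feature vectors; returns (rows, cols, d) weights.
        n, d = data.shape
        weights = np.random.rand(rows, cols, d)
        # Grid coordinates of every node, shape (rows, cols, 2).
        grid = np.dstack(np.mgrid[0:rows, 0:cols]).astype(float)
        for t in range(T):
            for x in data:
                # find_winning_node: node whose weight vector is closest to x.
                dists = np.linalg.norm(weights - x, axis=2)
                wr, wc = np.unravel_index(np.argmin(dists), dists.shape)
                # update_local_neighborhood: pull nodes near the winner toward x,
                # scaled by a Gaussian of grid distance to the winner (assumed).
                gdist = np.linalg.norm(grid - np.array([wr, wc], float), axis=2)
                h = np.exp(-gdist**2 / (2 * radius**2))
                weights += eta * h[:, :, None] * (x - weights)
        return weights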
U-Matrix Visualization
Provides a simple way to visualize cluster boundaries on the map.
Simple algorithm: for each node in the map, compute the average of the distances between its weight vector and those of its immediate neighbors.
This average distance is a measure of how similar a node is to its neighbors.
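A sketch of that computation in Python, assuming the (rows, cols, d) weights array from the SOM sketch above and 4-connected immediate neighbors:

    import numpy as np

    def u_matrix(weights):
        # For each node, average the distance from its weight vector
        # to those of its immediate (4-connected) grid neighbors.
        rows, cols, _ = weights.shape
        u = np.zeros((rows, cols))
        for r in range(rows):
            for c in range(cols):
                dists = [np.linalg.norm(weights[r, c] - weights[nr, nc])
                         for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                         if 0 <= nr < rows and 0 <= nc < cols]
                u[r, c] = sum(dists) / len(dists)
        return u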
U-Matrix Visualization
Interpretation: one can encode the U-Matrix measurements as greyscale values in an image, or as altitudes on a terrain landscape that represents the document space. The valleys (dark areas) are the clusters of data, and the mountains (light areas) are the boundaries between the clusters.
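The greyscale encoding can be as simple as rescaling the U-Matrix values to 0–255 (a sketch using Pillow, which is an assumption; any image library would do):

    import numpy as np
    from PIL import Image

    def u_matrix_to_image(u):
        # Dark pixels = cluster valleys, light pixels = boundary ridges.
        g = 255 * (u - u.min()) / (np.ptp(u) or 1.0)
        return Image.fromarray(g.astype(np.uint8), mode="L")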
U-Matrix Visualization
Example: a dataset of random three-dimensional points, arranged in four obvious clusters
U-Matrix Visualization
Four (color-coded) clusters of three-dimensional points
U-Matrix Visualization
Oblique projection of a terrain derived from the U-Matrix
U-Matrix Visualization
Terrain for a real document collection
Current Labeling Procedure
Feature vectors are encoded as 0’s and 1’s; weight vectors have real values from 0 to 1.
Sort the weight vector dimensions by element value: the dimension with the greatest value is the “best” noun phrase for that node.
Aggregate nodes with the same “best” noun phrase into groups.
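A sketch of this procedure in Python, assuming a flat list of node weight vectors and a parallel list of noun phrases, one per dimension (names are hypothetical):

    from collections import defaultdict

    def label_nodes(weights, noun_phrases):
        # Map each node to the noun phrase of its largest weight element,
        # then aggregate nodes sharing the same "best" phrase.
        groups = defaultdict(list)
        for node_id, m in enumerate(weights):
            best_dim = max(range(len(m)), key=lambda i: m[i])
            groups[noun_phrases[best_dim]].append(node_id)
        return groups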
U-Matrix Navigation
3D Space-Flight
Hierarchical Navigation
Document Data
Noun phrases are extracted from each document.
The set of unique noun phrases is computed; each noun phrase becomes a dimension of the data set.
Each document is represented by a binary vector, with a 1 or a 0 denoting the presence or absence of each noun phrase.
Document Data
Example: 10 total noun phrases:
alexander, king, macedonians, darius, philip, horse, soldiers, battle, army, death
Each element of the feature vector will be a 1 or a 0:
1: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0
2: 0, 1, 0, 1, 0, 0, 1, 1, 1, 1
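A sketch of building these vectors in pure Python (phrase extraction itself is assumed done elsewhere; the vocabulary here comes out sorted alphabetically rather than in the slide’s order):

    def binary_vectors(docs_phrases):
        # docs_phrases: one set of noun phrases per document.
        vocab = sorted(set().union(*docs_phrases))
        vectors = [[1 if phrase in doc else 0 for phrase in vocab]
                   for doc in docs_phrases]
        return vocab, vectors

    # Usage with phrases like those above:
    docs = [{"alexander", "king", "philip", "horse"},
            {"king", "darius", "soldiers", "battle", "army", "death"}]
    vocab, vecs = binary_vectors(docs)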
Document Collection Examples
Problems
As document sets get larger, the feature vectors get longer and use more memory.
Execution time grows to unrealistic lengths.
Solutions?
Need algorithm refinements for sparse feature vectors
Need a faster way to do the find_winning_node() computation
Need a better way to do the update_local_neighborhood() computation
Sparse Vector Optimization
Intelligent support for sparse feature vectors:
saves on memory usage
greatly improves the speed of the weight vector update computation
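One way to exploit sparsity in the update, sketched under the assumption that a binary feature vector is stored as the set of its nonzero dimensions (the talk does not specify the data structure):

    def sparse_update(m, active_dims, eta):
        # m = m + eta*(x - m) for binary x, using only x's 1-positions:
        # every element decays by (1 - eta) ...
        for i in range(len(m)):
            m[i] *= 1.0 - eta
        # ... and active dimensions then gain +eta, which together gives
        # m_i + eta*(1 - m_i) where x_i = 1 and m_i + eta*(0 - m_i) elsewhere.
        for i in active_dims:
            m[i] += eta

The uniform (1 - eta) decay can also be accumulated as a single scalar factor and applied lazily, so each update touches only the active dimensions.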
Faster find_winning_node()
SOM weight vectors become partially ordered very quickly
Faster find_winning_node()
U-Matrix Visualization of an Initial, Unordered SOM
Faster find_winning_node()
Partially Ordered SOM after 5 timesteps
Faster find_winning_node()
Don’t do a global search for the winner; start the search from the last known winner position.
Pro: usually finds a new winner very quickly
Con: this new search for a winner can sometimes get stuck in a local minimum
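A sketch of such a localized search, assuming a greedy hill-climb over the grid from the last winner (which is exactly where the local-minimum risk comes from):

    import numpy as np

    def find_winner_local(weights, x, start):
        # Walk from the last known winner to any neighbor closer to x;
        # stop when no immediate neighbor improves.
        rows, cols, _ = weights.shape
        r, c = start
        best = np.linalg.norm(weights[r, c] - x)
        improved = True
        while improved:
            improved = False
            for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
                if 0 <= nr < rows and 0 <= nc < cols:
                    d = np.linalg.norm(weights[nr, nc] - x)
                    if d < best:
                        best, r, c, improved = d, nr, nc, True
                        break
        return r, c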
Better Neighborhood Update
Nodes get told to “update” quite often, but a node’s weight vector only needs to be read during a find_winning_node() search.
With the local find_winning_node() search, a lazy neighborhood weight vector update can be performed.
Better Neighborhood Update
Cache update requests: each node stores the winning node and feature vector for each update request.
The node performs the update computations called for by the stored requests only when asked for its weight vector.
The number of stored requests can possibly be reduced by averaging the feature vectors in the cache.
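A sketch of one node with such a lazy update cache (the class shape and the per-request eta are assumptions):

    import numpy as np

    class LazyNode:
        # Caches update requests; applies them only when the weight
        # vector is actually read (i.e., during a winner search).
        def __init__(self, m):
            self._m = np.asarray(m, dtype=float)
            self._pending = []  # stored (feature vector, eta) requests

        def request_update(self, x, eta):
            self._pending.append((np.asarray(x, dtype=float), eta))

        def weight_vector(self):
            # Flush the cache: apply each stored m += eta*(x - m) in order.
            for x, eta in self._pending:
                self._m += eta * (x - self._m)
            self._pending.clear()
            return self._m

As the slide suggests, the flush could be shortened further by averaging the cached feature vectors into a single combined update.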
New Execution Times
Future Work
Parallelization
Label Problem
Label Problem
The current procedure is not very good. Two open issues:
Cluster boundaries
Term selection
Cluster Boundaries
Image processing approaches
Geometric approaches
Cluster Boundaries
Image processing example:
Term Selection
Too many unique noun phrases, and hence too many dimensions in the feature vector data.
A possible cutoff: the “knee” of the phrase frequency curve.