Community Architectures for Network Information Systems


1 Community Architectures for Network Information Systems
Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map Daniel X. Pape Community Architectures for Network Information Systems CSNA’98 6/18/98

2 Overview Self-Organizing Map (SOM) Algorithm
U-Matrix Algorithm for SOM Visualization SOM Navigation Application Document Representation and Collection Examples Problems and Optimizations Future Work

3 Basic SOM Algorithm Input Number (n) of Feature Vectors (x) format:
vector name: a, b, c, d
examples:
1: 0.1, 0.2, 0.3, 0.4
2: 0.2, 0.3, 0.3, 0.2

4 Basic SOM Algorithm Output Neural network Map of (M) Nodes
Each node has an associated Weight Vector (m) of the same dimensionality as the input feature vectors. Examples:
m1: 0.1, 0.2, 0.3, 0.4
m2: 0.2, 0.3, 0.3, 0.2

5 Basic SOM Algorithm Output (cont.) Nodes laid out in a grid:

6 Basic SOM Algorithm Other Parameters Number of timesteps (T)
Learning Rate (eta)

7 Basic SOM Algorithm Pseudocode:
SOM() {
  foreach timestep t {
    foreach feature vector fv {
      wnode = find_winning_node(fv)
      update_local_neighborhood(wnode, fv)
    }
  }
}

find_winning_node(fv) {
  foreach node n {
    compute distance of weight vector m to fv
  }
  return node with the smallest distance
}

update_local_neighborhood(wnode, fv) {
  foreach node n in the neighborhood of wnode {
    m = m + eta * [x - m]
  }
}
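The loop above can be sketched in Python. This is a minimal illustration, not the original implementation: the Gaussian neighborhood function and the linear decay of eta and the neighborhood radius are assumptions, since the slides do not specify them.

```python
import numpy as np

def train_som(data, grid_h, grid_w, T, eta0=0.5, sigma0=None):
    """Basic SOM loop: for each timestep, find the winning node for each
    feature vector and pull its local neighborhood toward that vector."""
    rng = np.random.default_rng(0)
    dim = data.shape[1]
    weights = rng.random((grid_h, grid_w, dim))  # one weight vector m per node
    # grid coordinates of each node, used to measure neighborhood distance
    coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                                  indexing="ij"), axis=-1)
    if sigma0 is None:
        sigma0 = max(grid_h, grid_w) / 2.0
    for t in range(T):
        frac = t / T
        eta = eta0 * (1.0 - frac)              # decaying learning rate (assumed)
        sigma = sigma0 * (1.0 - frac) + 1e-9   # shrinking neighborhood (assumed)
        for x in data:
            # find_winning_node: node whose weight vector is closest to x
            d = np.linalg.norm(weights - x, axis=-1)
            wi = np.unravel_index(np.argmin(d), d.shape)
            # update_local_neighborhood: m = m + eta * h * (x - m)
            grid_d2 = ((coords - np.array(wi)) ** 2).sum(axis=-1)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))[..., None]
            weights += eta * h * (x - weights)
    return weights
```

Because each update is a convex step from m toward x, the weight vectors stay inside the range spanned by the initial weights and the input data.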

8 U-Matrix Visualization
Provides a simple way to visualize cluster boundaries on the map Simple algorithm: for each node in the map, compute the average of the distances between its weight vector and those of its immediate neighbors This average distance is a measure of how similar a node is to its neighbors
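The algorithm on this slide can be written directly as a short Python function; a sketch assuming a 4-connected grid neighborhood and Euclidean distance:

```python
import numpy as np

def u_matrix(weights):
    """For each node, average the distances between its weight vector
    and those of its immediate (4-connected) grid neighbors."""
    h, w, _ = weights.shape
    u = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    dists.append(np.linalg.norm(weights[i, j] - weights[ni, nj]))
            u[i, j] = np.mean(dists)  # similarity of this node to its neighbors
    return u
```

A map whose nodes all share the same weight vector yields a U-Matrix of zeros; large values mark boundaries between clusters.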

9 U-Matrix Visualization
Interpretation: one can encode the U-Matrix measurements as greyscale values in an image, or as altitudes on a terrain landscape that represents the document space. The valleys (dark areas) are the clusters of data; the mountains (light areas) are the boundaries between the clusters.

10 U-Matrix Visualization
Example: a dataset of random three-dimensional points, arranged in four obvious clusters

11 U-Matrix Visualization
Four (color-coded) clusters of three-dimensional points

12 U-Matrix Visualization
Oblique projection of a terrain derived from the U-Matrix

13 U-Matrix Visualization
Terrain for a real document collection

14 Current Labeling Procedure
Feature vectors are encoded as 0s and 1s; weight vectors have real values from 0 to 1. Sort each node's weight-vector dimensions by element value: the dimension with the greatest value gives the "best" noun phrase for that node. Aggregate nodes with the same "best" noun phrase into groups.
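The labeling procedure above amounts to an argmax over each weight vector followed by a group-by. A minimal Python sketch (function name is illustrative, not from the original):

```python
import numpy as np

def label_nodes(weights, phrases):
    """Assign each node the noun phrase whose dimension holds the
    greatest weight value, then group nodes by that label."""
    h, w, _ = weights.shape
    groups = {}
    for i in range(h):
        for j in range(w):
            best = phrases[int(np.argmax(weights[i, j]))]
            groups.setdefault(best, []).append((i, j))
    return groups
```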

15 U-Matrix Navigation 3D Space-Flight Hierarchical Navigation

16 Document Data Noun phrases extracted
The set of unique noun phrases is computed; each noun phrase becomes a dimension of the data set. Each document is represented by a binary vector, with a 1 or a 0 denoting the presence or absence of each noun phrase.

17 Document Data Example: 10 total noun phrases:
alexander, king, macedonians, darius, philip, horse, soldiers, battle, army, death
each element of the feature vector will be a 1 or a 0:
1: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0
2: 0, 1, 0, 1, 0, 0, 1, 1, 1, 1
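The encoding described on this slide can be sketched in a few lines of Python, with documents given as sets of extracted noun phrases:

```python
def encode_documents(docs, phrases=None):
    """Build binary feature vectors: element k is 1 if the document
    contains the k-th noun phrase, 0 otherwise."""
    if phrases is None:
        # derive the dimension list from the documents themselves
        phrases = sorted({p for d in docs for p in d})
    vectors = [[1 if p in d else 0 for p in phrases] for d in docs]
    return phrases, vectors
```

Using the slide's ten noun phrases in the order listed, the two example documents reproduce the vectors shown above.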

18 Document Collection Examples

19 Problems As document sets get larger, the feature vectors get longer and use more memory, and execution time grows to unrealistic lengths

20 Solutions? Need algorithm refinements for sparse feature vectors
Need a faster way to do the find_winning_node() computation Need a better way to do the update_local_neighborhood() computation

21 Sparse Vector Optimization
Intelligent support for sparse feature vectors saves on memory usage and greatly improves the speed of the weight vector update computation
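One way the sparse update can work, sketched under the assumption of binary feature vectors stored as index lists (the slides do not give the exact scheme): the update m = m + eta * (x - m) splits into a uniform shrink of every element plus additions only at the 1-positions.

```python
def sparse_update(m, one_indices, eta):
    """Apply m <- m + eta*(x - m) where x is binary, with 1s only at
    one_indices. Rewritten as m <- (1-eta)*m everywhere, then
    m[k] += eta at each 1-index, so the sparse part costs O(nnz)."""
    m = [(1.0 - eta) * v for v in m]
    for k in one_indices:
        m[k] += eta
    return m
```

In a full implementation the uniform (1 - eta) shrink could itself be deferred by keeping a running scale factor per node, so each update touches only the nonzero indices.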

22 Faster find_winning_node()
SOM weight vectors become partially ordered very quickly

23 Faster find_winning_node()
U-Matrix Visualization of an Initial, Unordered SOM

24 Faster find_winning_node()
Partially Ordered SOM after 5 timesteps

25 Faster find_winning_node()
Don’t do a global search for the winner Start the search from the last known winner position Pro: usually finds the new winner very quickly Con: this local search can sometimes get stuck in a local minimum
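The local search described above can be sketched as a hill-climb on the grid, starting from the last known winner; a minimal Python version (the stopping rule, moving until no neighbor is closer, is an assumption):

```python
def find_winner_local(weights, x, start, dist):
    """Hill-climb from the last known winner: repeatedly move to a
    neighboring grid node whose weight vector is closer to x; stop
    when no neighbor improves (possibly a local minimum)."""
    h, w = len(weights), len(weights[0])
    cur = start
    while True:
        best, best_d = cur, dist(weights[cur[0]][cur[1]], x)
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = cur[0] + di, cur[1] + dj
            if 0 <= ni < h and 0 <= nj < w:
                d = dist(weights[ni][nj], x)
                if d < best_d:
                    best, best_d = (ni, nj), d
        if best == cur:
            return cur  # no neighbor is closer; may be a local minimum
        cur = best
```

On a well-ordered map this reaches the true winner in a few steps, which is why it pays off once the SOM becomes partially ordered.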

26 Better Neighborhood Update
Nodes get told to “update” quite often A node’s weight vector is made public only during a find_winning_node() search With the local find_winning_node() search, a lazy neighborhood weight vector update can be performed

27 Better Neighborhood Update
Cache update requests: each node stores the winning node and feature vector for each update request The node performs the update computations called for by the stored requests only when asked for its weight vector The number of requests can possibly be reduced by averaging the feature vectors in the cache
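The lazy scheme above can be sketched as a node class that queues update requests and replays them only when its weight vector is read (class and method names are illustrative, not from the original):

```python
class LazyNode:
    """SOM node that caches update requests and applies them only when
    its weight vector is actually read, e.g. during a winner search."""
    def __init__(self, m, eta):
        self.m = list(m)
        self.eta = eta
        self.pending = []  # cached feature vectors from update requests

    def request_update(self, x):
        # cheap: just remember the request, do no arithmetic yet
        self.pending.append(list(x))

    @property
    def weight(self):
        # replay all cached updates, oldest first, then expose m
        for x in self.pending:
            self.m = [v + self.eta * (xi - v)
                      for v, xi in zip(self.m, x)]
        self.pending.clear()
        return self.m
```

Averaging the cached feature vectors before replaying, as the slide suggests, would collapse many queued updates into one, at the cost of a slightly different (approximate) result.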

28 New Execution Times

29 Future Work Parallelization Label Problem

30 Label Problem The current procedure is not very good Cluster boundaries
Term selection

31 Cluster Boundaries Image processing Geometric

32 Cluster Boundaries Image processing example:

33 Term Selection Too many unique noun phrases, so too many dimensions in the feature vector data
Cut off at the “knee” of the frequency curve

