Published by Juliana Pierce. Modified over 9 years ago.
1
Mapping document collections in non-standard geometries
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
2
SAWM 2004: Mining Document Maps
Agenda
- Motivation
- Our approach
- Architecture
- User interface
- Visualization
- Map creation
- Clustering
- Experimental results
- Future directions
3
Motivation
- The Web and intranets are becoming increasingly content-rich: simple ranked lists, or even hierarchies of results, no longer seem adequate
- A good way of presenting massive document sets in an understandable form will be crucial in the near future
- The BEATCA project aims to create a full-fledged search engine for moderate-size document collections (millions of documents), capable of presenting online replies to queries in a user-friendly graphical form on a document map (based on the WebSOM approach)
4
Our approach
- The presentation method is based on WebSOM's map idea and is enriched with novel methods of document analysis, clustering and visualization
- A special architecture has been developed to enable experiments with various kinds of map creation, visualization, clustering and labelling algorithms
- BEATCA: Bayesian Evolutionary Approach to Text Connectivity Analysis
5
BEATCA architecture
- Documents are prepared by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation
- The indexer also identifies frequent phrases in the document set for clustering and labelling purposes
- Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded
- The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation
- 'The best' map (with respect to some similarity measure) is used by the query processor in response to the user's query
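The dictionary optimization step can be illustrated with a minimal sketch. The slides do not give the actual criteria or thresholds, so the normalized-entropy measure, the function names `term_entropy` and `optimize_dictionary`, and the cut-off values `min_entropy`, `max_entropy` and `max_df` are all assumptions made for illustration.

```python
import math
from collections import Counter

def term_entropy(counts):
    """Normalized entropy (0..1) of a term's occurrences over documents."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))

def optimize_dictionary(docs, min_entropy=0.1, max_entropy=0.95, max_df=0.8):
    """docs: list of token lists. Returns the set of retained terms.

    Thresholds are hypothetical; a term is dropped when it occurs in too
    many documents or when its entropy is extremely low or extremely high.
    """
    n = len(docs)
    per_doc = [Counter(d) for d in docs]
    vocab = set().union(*docs) if docs else set()
    kept = set()
    for term in vocab:
        counts = [c[term] for c in per_doc]
        df = sum(1 for x in counts if x > 0) / n  # document-frequency ratio
        h = term_entropy(counts)
        if df <= max_df and min_entropy <= h <= max_entropy:
            kept.add(term)
    return kept
```

With these assumed thresholds, a term present in every document and a term present in only one document are both removed, while terms spread over a few documents are kept.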
6
BEATCA architecture
7
User interface
- Search results are presented on a document map
- Compact (fuzzy) topical areas are extracted
- Query-related summaries are generated online
- Maps can have one of the following topologies:
  - the traditional flat map (square or hexagonal cells)
  - rotating 3D map (torus, sphere, cylinder)
  - hyperbolic map (Poincaré or Klein projections)
  - growing map (Growing Neural Gas)
8
User interface
9
Map visualizations in 3D
10
Kohonen learning overview
- Unsupervised-learning neural network model
- Each neuron is represented by a reference vector in the document space
- Each vector element (term dimension) equals TFxIDF
- Iterative regression of reference vectors onto the document vector space
- Similarity is computed as the cosine of the angle between the corresponding vectors
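A single learning step of this kind can be sketched as follows. This is a generic textbook-style Kohonen update, not the BEATCA code: the dense list representation, the Gaussian neighbourhood, and the names `cosine`, `find_winner` and `som_step` are assumptions chosen for brevity.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def find_winner(doc, grid):
    """grid: dict mapping (row, col) -> reference vector."""
    return max(grid, key=lambda cell: cosine(doc, grid[cell]))

def som_step(doc, grid, alpha, radius):
    """Pull the winner and its map neighbours toward the document vector."""
    winner = find_winner(doc, grid)
    for cell, ref in grid.items():
        d = math.dist(cell, winner)  # distance on the map lattice
        if d <= radius:
            # Gaussian neighbourhood (an assumed choice); radius 0 updates the winner only
            h = alpha * math.exp(-(d * d) / (2 * radius * radius)) if radius > 0 else alpha
            grid[cell] = [r + h * (x - r) for r, x in zip(ref, doc)]
    return winner
```

Calling `som_step` repeatedly over the document vectors, while decreasing `alpha` and `radius`, gives the iterative regression described on the slide.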
11
How are the maps created?
- A modified WebSOM method is used:
  - compact representation of reference vectors
  - broad-topic initialization method
  - joint winner search method
  - multi-level (hierarchical) maps
- Three-phase document clustering:
  - initial grouping via PLSA/PHITS
  - WebSOM on document groups
  - extraction and labelling of fuzzy cell clusters
12
Reference vector representation
- Vectors are sparse by nature
- During the learning process they become even sparser
- Represented as balanced red-black trees
- A tolerance threshold is imposed
- Terms (dimensions) below the threshold are removed
- Significant complexity reduction without negative impact on quality
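The thresholding idea can be sketched with a plain Python dictionary standing in for the red-black tree used in the original system. The function name `update_sparse` and the default `tolerance` value are hypothetical.

```python
def update_sparse(ref, doc, rate, tolerance=1e-3):
    """Move sparse reference vector `ref` toward sparse `doc`, then prune.

    Both vectors are dicts mapping term -> weight (a stand-in for the
    red-black tree mentioned on the slide). Terms whose updated weight
    falls below `tolerance` are dropped, which keeps the vector sparse.
    """
    updated = {}
    for term in ref.keys() | doc.keys():
        old = ref.get(term, 0.0)
        value = old + rate * (doc.get(term, 0.0) - old)
        if abs(value) >= tolerance:
            updated[term] = value
    return updated
```

Only the terms present in either vector are visited, so the cost of one update depends on the number of non-zero terms rather than on the dictionary size.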
13
Topic-sensitive initialization
- Inter-topic similarities are important both for map learning and for visualization and cluster extraction
- Simple approach:
  - Use LSI to select K main broad topics
  - Select K map cells (evenly spread over the map) as the fixpoints for individual topics
  - Initialize the selected fixpoints with the broad topics
  - Initialize the remaining cells with "in-between values"
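One possible reading of "in-between values" is an interpolation between the fixpoint vectors. The sketch below assumes inverse-distance weighting on the map lattice; the slides do not say which interpolation was used. The topic vectors are taken as given (the LSI step is not shown), and `init_map` is a hypothetical name.

```python
import math

def init_map(rows, cols, fixpoints):
    """fixpoints: dict (row, col) -> topic vector. Returns all cells of the map.

    Fixpoint cells receive their topic vector unchanged; every other cell
    receives an inverse-distance-weighted blend of all topic vectors
    (an assumed interpretation of "in-between values").
    """
    dim = len(next(iter(fixpoints.values())))
    grid = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) in fixpoints:
                grid[(r, c)] = list(fixpoints[(r, c)])
                continue
            weights = {p: 1.0 / math.dist((r, c), p) for p in fixpoints}
            total = sum(weights.values())
            grid[(r, c)] = [
                sum(w * fixpoints[p][i] for p, w in weights.items()) / total
                for i in range(dim)
            ]
    return grid
```

A cell halfway between two fixpoints then starts as the average of the two topic vectors, so neighbouring map regions begin with related content.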
14
Joint winner search
- Global winner search: accurate but slow
- Local winner search: faster, but can be inaccurate during rapid changes
- Start with a single phase of global search
- Document movements become smoother during the learning process: usually local search is enough
- Use global search when occasional sudden moves occur (e.g. outliers, decrease of the neighbourhood width)
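The combination of the two search modes can be sketched as below. How the original system detects a "sudden move" is not stated, so the sketch leaves that decision to the caller through a hypothetical `force_global` flag; `joint_winner`, `previous` and `radius` are likewise illustrative names.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def joint_winner(doc, grid, previous=None, radius=1, force_global=False):
    """Return the best-matching cell for `doc`.

    grid: dict (row, col) -> reference vector.
    previous: the document's winner from the last iteration, if any.
    Global search is used on the first pass or when the caller requests it;
    otherwise only cells within `radius` of the previous winner are scanned.
    """
    if previous is None or force_global:
        candidates = list(grid)
    else:
        candidates = [
            cell for cell in grid
            if max(abs(cell[0] - previous[0]), abs(cell[1] - previous[1])) <= radius
        ]
    return max(candidates, key=lambda cell: cosine(doc, grid[cell]))
```

Local search scans a constant number of cells per document, while global search scans the whole map, which is why the local mode pays off once document positions have stabilized.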
15
Hierarchical maps
- Bottom-up approach
  - Feasible (with the joint winner search method)
  - Start with the most detailed map
  - Compute weighted centroids of map areas: [formula not preserved in the transcript]
  - Use them as seeds for a coarser map
- A top-down approach is possible but requires fixpoints
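Since the centroid formula is missing from the transcript, the following is only a plausible sketch of the bottom-up step. It assumes that map areas are square blocks of cells and that each cell is weighted by the number of documents assigned to it; both choices, and the name `coarsen`, are assumptions rather than the authors' definition.

```python
def coarsen(grid, doc_counts, block=2):
    """Merge block x block areas of a detailed map into cells of a coarser map.

    grid: dict (row, col) -> reference vector of the detailed map.
    doc_counts: dict (row, col) -> number of documents in that cell
                (used as the centroid weight, an assumed choice).
    Returns dict (row, col) -> seed vector for the coarser map.
    """
    sums, totals = {}, {}
    for (r, c), ref in grid.items():
        key = (r // block, c // block)
        w = doc_counts.get((r, c), 0)
        acc = sums.setdefault(key, [0.0] * len(ref))
        for i, x in enumerate(ref):
            acc[i] += w * x
        totals[key] = totals.get(key, 0) + w
    # Areas without any documents keep a zero vector.
    return {k: [x / totals[k] for x in v] if totals[k] else v for k, v in sums.items()}
```

The returned vectors can then be used as initial reference vectors when training the coarser map.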
16
Clustering document groups
- Numerous methods exist, but none of them is directly applicable:
  - Extremely fuzzy structure of topical groups in SOM cells
  - Necessity of taking into account similarity measures both in the original document space and in the map space
  - Outlier-handling problem during cluster formation
  - No a priori estimate of the number of topical groups
- Fuzzy C-Means applied on the lattice of map cells
- Graph-theoretical approach (density- and distance-based MST) combined with fuzzy clustering
- Clustered documents are labelled by weighted centroids of cell reference vectors, scaled with between-group entropy
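For reference, one iteration of the standard Fuzzy C-Means algorithm can be sketched as follows. This is the textbook formulation with Euclidean distance, not the lattice-aware variant described on the slide; `memberships`, `update_centers` and the fuzzifier `m` follow common usage and are not taken from the presentation.

```python
import math

def memberships(points, centers, m=2.0):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    power = 2.0 / (m - 1.0)
    result = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # A point lying exactly on a center belongs fully to it.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            result.append([h / sum(hits) for h in hits])
            continue
        result.append([1.0 / sum((di / dj) ** power for dj in dists) for di in dists])
    return result

def update_centers(points, u, m=2.0):
    """Recompute cluster centers as membership-weighted means of the points."""
    k, dim = len(u[0]), len(points[0])
    centers = []
    for j in range(k):
        weights = [row[j] ** m for row in u]
        total = sum(weights)
        centers.append(
            [sum(w * p[i] for w, p in zip(weights, points)) / total for i in range(dim)]
        )
    return centers
```

Alternating `memberships` and `update_centers` until the centers stop moving gives the usual algorithm. In the setting of the slide, the points would be map cells, and the distance would have to combine document-space and map-space similarity.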
17
Experiments with map convergence
- We examined the convergence of the maps to a stable state depending on:
  - type of alpha function (search radius reduction)
  - type of winner search method
  - type of initialization method
18
Convergence – alpha functions
19
Convergence – winner search
20
Experiments with execution time
- The impact of the following factors on the speed of map creation was investigated:
  - Map size (total number of cells)
  - Optimization methods:
    - dictionary optimization
    - reference vector representation
- Map quality assessment:
  - Compare with an 'ideal' map (e.g. one created without optimizations)
  - Use identical initialization and learning parameters
  - Compute the sum of squared distances between the locations of each document on both maps
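The quality measure from the last item can be written down directly. The sketch assumes that each map is available as a dictionary from a document identifier to its (row, column) cell and that distances are Euclidean on the lattice; `map_discrepancy` is a hypothetical name.

```python
def map_discrepancy(ideal, optimized):
    """Sum of squared lattice distances between document positions on two maps.

    ideal, optimized: dict doc_id -> (row, col). Both maps must place the
    same documents; 0 means every document landed in the same cell.
    """
    return sum(
        (ideal[d][0] - optimized[d][0]) ** 2 + (ideal[d][1] - optimized[d][1]) ** 2
        for d in ideal
    )
```

Because both maps start from identical initialization and learning parameters, a small value suggests that the optimization changed the result very little.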
21
Execution time - map size
22
Execution time - optimizations
23
Future research
- Maps for a joint term-citation model, taking into account between-group link flow direction
- Fully distributed map creation
- Adaptive document retrieval and clustering:
  - Bayesian-network-based relevance measure
  - Survival models for document update rate estimation
  - Dead link propagation methods for page freshness estimation
- We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects
24
Future research
- Bayesian networks will be applied in particular to:
  - measure relevance and classify documents
  - accelerate document clustering processes
  - construct a thesaurus supporting query enrichment
  - keyword extraction
  - between-topic dependencies estimation
25
Thank you! Any questions?