Published by Kelly Marsh. Modified over 9 years ago.
1
Mining document maps
Mieczyslaw Klopotek, Slawomir Wierzchon, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
2
SAWM 2004 Mining Document Maps
Agenda
- Motivation
- Our approach
- Architecture
- User interface
- Visualization
- Map creation
- Clustering
- Experimental results
- Future directions
3
Motivation
- The Web and intranets are becoming increasingly content-rich: simple ranked lists, or even hierarchies of results, no longer seem adequate
- A good way of presenting massive document sets in an understandable form will be crucial in the near future
- The BEATCA project aims to create a full-fledged search engine for moderate-size document collections (millions of documents), capable of presenting on-line replies to queries in a user-friendly graphical form on a document map (based on the WebSOM approach)
4
Our approach
- The presentation method is based on WebSOM's map idea and is enriched with novel methods of document analysis, clustering and visualization
- A special architecture has been developed to enable experiments with various kinds of map creation, visualization, clustering and labelling algorithms
- BEATCA: Bayesian Evolutionary Approach to Text Connectivity Analysis
5
BEATCA architecture
- Documents are prepared by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation
- The indexer also identifies frequent phrases in the document set for clustering and labelling purposes
- Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded
- The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation
- ‘The best’ map (with respect to some similarity measure) is used by the query processor in response to the user’s query
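The dictionary-optimization step can be illustrated with a minimal sketch. The slide does not define "extreme entropy" or "extremely frequent", so everything below is an assumption: entropy is taken over the distribution of a term's occurrences across documents (normalized to [0, 1]), and the function names and the cutoffs `max_df`, `min_entropy` and `max_entropy` are hypothetical.

```python
import math
from collections import Counter

def term_stats(docs):
    """Map each term to (document-frequency ratio, normalized entropy).

    docs is a list of token lists. Entropy is computed over the share of a
    term's occurrences falling in each document (an assumed definition).
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*counts)
    stats = {}
    for term in vocab:
        tfs = [c[term] for c in counts if term in c]
        total = sum(tfs)
        entropy = -sum((tf / total) * math.log(tf / total) for tf in tfs)
        norm = entropy / math.log(n_docs) if n_docs > 1 else 0.0
        stats[term] = (len(tfs) / n_docs, norm)
    return stats

def optimize_dictionary(docs, max_df=0.9, min_entropy=0.05, max_entropy=0.95):
    """Keep terms that are neither too frequent nor of extreme entropy.

    All three thresholds are illustrative values, not taken from the slides.
    """
    stats = term_stats(docs)
    return {
        term
        for term, (df, h) in stats.items()
        if df <= max_df and min_entropy <= h <= max_entropy
    }
```

With these assumed cutoffs, a term that appears in every document is dropped for being too frequent, and a term confined to a single document is dropped for having zero entropy.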
6
BEATCA architecture
7
Example: summaries of documents
8
Example: S&W frequent phrases
sheep fiends dairy goats black sheep special thanks sheep and goats university medical center public health medical informatics information departments pharmacy related drugs information health care
9
User interface
- Search results are presented on a document map
- Compact (fuzzy) topical areas are extracted
- Query-related summaries are generated on-line
- Maps can have one of the following topologies:
  - the traditional flat map (square or hexagonal cells)
  - rotating 3D map (torus, sphere, cylinder)
  - hyperbolic map (Poincaré or Klein projections)
  - growing map (Growing Neural Gas)
10
User interface
11
Map visualizations in 3D
12
Hyperbolic map visualizations: triangular tessellation, hexagonal tessellation
13
Kohonen learning overview
- Unsupervised-learning neural network model
- Each neuron is represented by a reference vector in document space
- Each vector element (term dimension) equals the term's TFxIDF weight
- Iterative regression of reference vectors onto the document vector space: [formula]
- Similarity is computed as the cosine of the angle between the corresponding vectors
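The slide's own update formula did not survive in this transcript. As a rough illustration only, the sketch below uses the textbook Kohonen rule, m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * (x(t) - m_i(t)), with a cosine-based winner search as described above. The Gaussian neighbourhood function, the dense list representation and all function names are assumptions, not the BEATCA implementation.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def find_winner(cells, doc):
    """Return the grid position whose reference vector is most similar to doc."""
    return max(cells, key=lambda pos: cosine(cells[pos], doc))

def som_step(cells, doc, alpha, radius):
    """One learning step: pull every reference vector towards the document.

    cells maps (row, col) -> reference vector (list of floats) and is updated
    in place. alpha is the learning rate and radius the neighbourhood width;
    the Gaussian neighbourhood h is an assumed choice.
    """
    wr, wc = find_winner(cells, doc)
    for (r, c), ref in list(cells.items()):
        grid_dist_sq = (r - wr) ** 2 + (c - wc) ** 2
        h = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
        cells[(r, c)] = [m + alpha * h * (x - m) for m, x in zip(ref, doc)]
    return (wr, wc)
```

Repeating `som_step` over all documents while gradually decreasing `alpha` and `radius` gives the usual training loop.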
14
How are the maps created
- A modified WebSOM method is used:
  - compact reference vector representation
  - broad-topic initialization method
  - joint winner search method
  - multi-level (hierarchical) maps
  - three-phase document clustering:
    - initial grouping via PLSA/PHITS
    - WebSOM on document groups
    - fuzzy cell cluster extraction and labelling
15
Reference vector representation
- Vectors are sparse by nature
- During the learning process they become even sparser
- They are represented as balanced red-black trees
- A tolerance threshold is imposed
- Terms (dimensions) below the threshold are removed
- Significant complexity reduction without negative impact on quality
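The thresholding idea can be sketched as follows. Note that this swaps in a plain Python dict for the balanced red-black trees named on the slide, and that the function name `sparse_update`, the simple interpolation update and the parameter names are assumptions made for illustration.

```python
def sparse_update(ref, doc, rate, tolerance):
    """Move a sparse reference vector towards a sparse document vector.

    ref and doc map term -> weight; absent terms count as 0.0. After the
    update, any term whose absolute weight falls below tolerance is dropped,
    which keeps the reference vector sparse.
    """
    updated = {}
    for term in ref.keys() | doc.keys():
        m = ref.get(term, 0.0)
        x = doc.get(term, 0.0)
        w = m + rate * (x - m)
        if abs(w) >= tolerance:
            updated[term] = w
    return updated
```

A tree-based map, as on the slide, would additionally keep terms in sorted order, which helps when two sparse vectors are merged or compared term by term.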
16
Topic-sensitive initialization
- Inter-topic similarities are important both for map learning and for visualization and cluster extraction
- Simple approach:
  - Use LSI to select K main broad topics
  - Select K map cells (evenly spread over the map) as the fixpoints for individual topics
  - Initialize the selected fixpoints with the broad topics
  - Initialize the remaining cells with the following rule: [formula]
17
Joint winner search
- Global winner search: accurate but slow
- Local winner search: faster, but can be inaccurate during rapid changes
- Start with a single phase of global search
- Document movements become smoother during the learning process, so local search is usually enough
- Use global search when occasional sudden moves occur (e.g. outliers, neighbourhood width decrease)
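A minimal sketch of how local and global search might be combined is given below. The slide does not say how a "sudden move" is detected, so the fallback criterion used here (the best local similarity dropping below a hypothetical `min_similarity`) is an assumption, as are the function names and the square search window controlled by `reach`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_cell(cells, doc, candidates):
    """Most similar cell among the candidate grid positions."""
    return max(candidates, key=lambda pos: cosine(cells[pos], doc))

def joint_winner(cells, doc, previous=None, reach=1, min_similarity=0.0):
    """Search locally around the previous winner, falling back to global search.

    cells maps (row, col) -> reference vector. previous is the document's
    winner from the last iteration, or None during the initial global phase.
    """
    if previous is None:
        return best_cell(cells, doc, cells)  # initial phase: global search
    r0, c0 = previous
    nearby = [
        p for p in cells
        if abs(p[0] - r0) <= reach and abs(p[1] - c0) <= reach
    ]
    local = best_cell(cells, doc, nearby)
    if cosine(cells[local], doc) < min_similarity:
        return best_cell(cells, doc, cells)  # assumed trigger for global search
    return local
```

The local branch inspects at most (2 * reach + 1)^2 cells instead of the whole map, which is where the speed-up comes from.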
18
Hierarchical maps
- Bottom-up approach
  - Feasible (with the joint winner search method)
  - Start with the most detailed map
  - Compute weighted centroids of map areas: [formula]
  - Use them as seeds for a coarser map
- A top-down approach is possible but requires fixpoints
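The centroid formula is missing from this transcript, so the following is only one plausible reading of the bottom-up step: each square block of detailed cells is collapsed into one coarse cell whose seed is the centroid of the block's reference vectors, weighted by the number of documents assigned to each cell. The weighting scheme, the block shape and the name `coarsen` are all assumptions.

```python
def coarsen(cells, doc_counts, block=2):
    """Build seed vectors for a coarser map from a detailed one.

    cells maps (row, col) -> reference vector; doc_counts maps (row, col) ->
    number of documents assigned to that cell (missing means 0). Each
    block x block area becomes one coarse cell. Areas with no documents get
    a zero vector.
    """
    sums, weights = {}, {}
    for (r, c), ref in cells.items():
        key = (r // block, c // block)
        w = doc_counts.get((r, c), 0)
        acc = sums.setdefault(key, [0.0] * len(ref))
        for i, value in enumerate(ref):
            acc[i] += w * value
        weights[key] = weights.get(key, 0) + w
    return {
        key: [s / weights[key] for s in acc] if weights[key] else acc
        for key, acc in sums.items()
    }
```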
19
Clustering document groups
- Numerous methods exist, but none of them is directly applicable:
  - Extremely fuzzy structure of topical groups in SOM cells
  - Necessity of taking into account similarity measures both in the original document space and in the map space
  - Outlier-handling problem during cluster formation
  - No a priori estimate of the number of topical groups
- Fuzzy C-Means is applied on the lattice of map cells
- A graph-theoretical approach (density- and distance-based MST) is combined with fuzzy clustering
- Clustered documents are labelled with weighted centroids of cell reference vectors, scaled by between-group entropy
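For orientation, the membership computation at the core of standard Fuzzy C-Means can be sketched as below. This is the textbook formula for one point (here, one map cell) given its distances to the cluster centres; how BEATCA defines those distances on the cell lattice is not stated on the slide, so the inputs are left abstract and the function name is hypothetical.

```python
def fcm_memberships(distances, m=2.0):
    """Fuzzy C-Means membership degrees of one point in each cluster.

    distances holds the point's distance to every cluster centre; m > 1 is
    the fuzzifier. Memberships are proportional to d ** (-2 / (m - 1)) and
    sum to 1. A point lying exactly on one or more centres is shared equally
    among those centres.
    """
    if any(d == 0 for d in distances):
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)
    inverse = [d ** -power for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]
```

A full clustering run alternates this step with recomputing each centre as the membership-weighted mean of the points until the memberships stabilize.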
20
Example: biomedical documents
21
Label candidates (5 newsgroups)
Term rank | Cluster #1: sci.math | Cluster #2: sci.med / sci.math | Cluster #3: talk.religion.misc | Cluster #4: soc.culture.israel | Cluster #5: comp.windows.x | Cluster #6: talk.politics.misc
1 | Die | Cipher | Men | Israel | Boot | Funding
2 | Probable | Block | Women | Palestinian | Windows | Study
3 | Theory | Stream | Raped | Gun | Files | Taxes
4 | Registers | Key | Children | Aziz | Menus | Stock
5 | Mathematics | Algorithms | Child | Iraqis | Lib | Health
6 | Equation | Combinations | Sex | Koppel | Icon | Market
7 | Cos | Distinction | Soc | Israeli | Label | Social
8 | Sequence | Encryption | Father | Jews | Folder | Mercer
9 | Tex | Epimethius | Paternity | Resolution | Msvcrtd | Governing
10 | Space | Randomness | Feminist | Oliver | Shortcut | Vaccinations
11 | Gravitational | Smartcard | Trolling | Utah | Netzero | Measurement
12 | Wave | Entropy | White | Firearms | Tab | Bushes
13 | Latex | Yahoo | England | Settlements | Kernel | Computer
14 | Files | Model | Support | Palestine | Installed | Companies
15 | Unsigned | Lottery | Black | Permitted | Backup | Diabetes
22
Experiments with execution time
- The impact of the following factors on the speed of map creation was investigated:
  - Map size (total number of cells)
  - Optimization methods:
    - dictionary optimization
    - reference vector representation
- Map quality assessment:
  - Compare with an ‘ideal’ map (e.g. one created without optimizations)
  - Use identical initialization and learning parameters
  - Compute the sum of squared distances between the locations of each document on both maps
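The quality measure described above can be written down in a few lines. This sketch assumes that a document's location is the (row, column) position of its winning cell and that distances are plain Euclidean distances on a flat grid (no wrap-around for toroidal maps); the function name is hypothetical.

```python
def map_distance(positions_a, positions_b):
    """Sum of squared grid distances between document locations on two maps.

    positions_a and positions_b map a document id to the (row, col) of the
    cell the document is assigned to on each map. Both maps must contain
    the same documents.
    """
    total = 0.0
    for doc_id, (ra, ca) in positions_a.items():
        rb, cb = positions_b[doc_id]
        total += (ra - rb) ** 2 + (ca - cb) ** 2
    return total
```

A value of zero means the optimized map places every document in the same cell as the reference map; larger values indicate a larger deviation.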
23
Execution time - map size
24
Execution time - optimizations
25
Experiments with map convergence
- We examined the convergence of the maps to a stable state depending on:
  - type of alpha function (search radius reduction)
  - type of winner search method
  - type of initialization method
26
Convergence – alpha functions
27
Convergence – winner search
28
Future research
- Maps for a joint term-citation model, taking into account between-group link flow direction
- Fully distributed map creation
- Adaptive document retrieval and clustering:
  - Bayesian-network-based relevance measure
  - Survival models for document update rate estimation
  - Dead-link propagation methods for page freshness estimation
- We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects
29
Future research
- Bayesian networks will be applied in particular to:
  - measure relevance and classify documents
  - accelerate document clustering processes
  - construct a thesaurus supporting query enrichment
  - extract keywords
  - estimate between-topic dependencies
- Immuno-genetic systems will be used for:
  - adaptive document clustering, by referring to the mechanism of so-called metadynamics
  - extraction of compact characteristics of document groups, by exploiting the mechanism of construction of universal and specialized antibodies
  - visualization and resolution adjustment of document maps
30
Thank you! Any questions?