Mining Document Maps
Mieczyslaw Klopotek, Slawomir Wierzchon, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw

SAWM 2004: Mining Document Maps

Agenda
- Motivation
- Our approach
- Architecture
- User interface
- Visualization
- Map creation
- Clustering
- Experimental results
- Future directions

Motivation
- The Web and intranets are becoming increasingly content-rich: simple ranked lists, or even hierarchies of results, no longer seem adequate.
- A good way of presenting massive document sets in an understandable form will be crucial in the near future.
- The BEATCA project aims to create a full-fledged search engine for moderate-size document collections (millions of documents), capable of presenting on-line replies to queries in a user-friendly graphical form on a document map (based on the WebSOM approach).

Our approach
- BEATCA: Bayesian Evolutionary Approach to Text Connectivity Analysis.
- The presentation method is based on the WebSOM map idea and is enriched with novel methods of document analysis, clustering and visualization.
- A special architecture has been developed to enable experiments with various kinds of map creation, visualization, clustering and labelling algorithms.

BEATCA architecture
- Documents are prepared by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation.
- The indexer also identifies frequent phrases in the document set for clustering and labelling purposes.
- Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded.
- The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation.
- The "best" map (with respect to some similarity measure) is used by the query processor in response to the user's query.
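The dictionary optimization step can be illustrated with a minimal sketch. It assumes documents are already tokenized into lists of terms; the function names and all three thresholds (`max_doc_fraction`, `low_entropy`, `high_entropy`) are hypothetical and are not taken from BEATCA.

```python
import math

def normalized_entropy(term, docs):
    """Entropy of a term's occurrence distribution over documents, scaled to [0, 1]."""
    counts = [doc.count(term) for doc in docs]
    total = sum(counts)
    if total == 0 or len(docs) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(len(docs))

def optimize_dictionary(docs, max_doc_fraction=0.9, low_entropy=0.1, high_entropy=0.95):
    """Keep terms that are neither too frequent nor of extreme entropy (assumed thresholds)."""
    vocabulary = {term for doc in docs for term in doc}
    kept = []
    for term in vocabulary:
        # Fraction of documents that contain the term at least once.
        doc_fraction = sum(term in doc for doc in docs) / len(docs)
        h = normalized_entropy(term, docs)
        if doc_fraction <= max_doc_fraction and low_entropy <= h <= high_entropy:
            kept.append(term)
    return sorted(kept)

docs = [
    ["sheep", "goats", "the"],
    ["sheep", "dairy", "the"],
    ["drugs", "health", "the"],
    ["drugs", "pharmacy", "the"],
]
print(optimize_dictionary(docs))
```

In this toy collection, "the" is dropped because it occurs in every document, and terms that occur in a single document are dropped for having zero entropy.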

[Figure: BEATCA architecture]

[Figure: Example: summaries of documents]

Example: S&W frequent phrases
sheep fiends dairy goats black sheep special thanks sheep and goats university medical center public health medical informatics information departments pharmacy related drugs information health care

User interface
- Search results are presented on a document map.
- Compact (fuzzy) topical areas are extracted.
- Query-related summaries are generated on-line.
- Maps can have one of the following topologies:
  - traditional flat map (square or hexagonal cells)
  - rotating 3D map (torus, sphere, cylinder)
  - hyperbolic map (Poincaré or Klein projections)
  - growing map (Growing Neural Gas)

[Figure: User interface]

[Figure: Map visualizations in 3D]

[Figure: Hyperbolic map visualizations (triangular tessellation, hexagonal tessellation)]

Kohonen learning overview
- Unsupervised-learning neural network model.
- Each neuron is represented by a reference vector in the document space.
- Each vector element (term dimension) equals TFxIDF.
- Iterative regression of reference vectors onto the document vector space: [formula]
- Similarity is computed as the cosine of the angle between the corresponding vectors.
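As an illustration of the learning scheme outlined above, here is a minimal sketch of a single Kohonen update step. It uses the textbook rule m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * (x(t) - m_i(t)) with cosine similarity for the winner search. The Gaussian neighbourhood function, the dense list representation and all names are assumptions for the sake of the example, not details taken from the slides.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_winner(cells, doc):
    """cells maps a grid position (row, col) to its reference vector."""
    return max(cells, key=lambda pos: cosine(cells[pos], doc))

def som_step(cells, doc, alpha, sigma):
    """Move every reference vector toward doc, weighted by grid distance to the winner."""
    winner = find_winner(cells, doc)
    for pos, ref in list(cells.items()):
        grid_dist_sq = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
        h = math.exp(-grid_dist_sq / (2 * sigma ** 2))  # assumed Gaussian neighbourhood
        cells[pos] = [m + alpha * h * (x - m) for m, x in zip(ref, doc)]
    return winner

cells = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
winner = som_step(cells, [0.0, 2.0], alpha=0.5, sigma=1.0)
print(winner, cells)
```

In a full training loop, `alpha` and `sigma` would be decreased over time, so that early steps reorganize the whole map and later steps only fine-tune the winner's vicinity.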

How are the maps created?
A modified WebSOM method is used:
- compact reference vector representation
- broad-topic initialization method
- joint winner search method
- multi-level (hierarchical) maps
- three-phase document clustering:
  - initial grouping via PLSA/PHITS
  - WebSOM on document groups
  - extraction and labelling of fuzzy cell clusters

Reference vector representation
- Vectors are sparse by nature.
- During the learning process they become even sparser.
- They are represented as balanced red-black trees.
- A tolerance threshold is imposed: terms (dimensions) below the threshold are removed.
- This gives a significant complexity reduction without a negative impact on quality.
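The thresholding idea can be sketched as follows. Note that this example stores a sparse vector in a plain Python dict rather than in a red-black tree as the slide describes; the function names and the tolerance value are hypothetical.

```python
def prune(ref, tolerance):
    """Drop dimensions whose absolute weight falls below the tolerance threshold."""
    return {term: w for term, w in ref.items() if abs(w) >= tolerance}

def update_sparse(ref, doc, rate, tolerance):
    """Move sparse reference vector ref toward sparse document vector doc, then prune."""
    terms = set(ref) | set(doc)
    moved = {
        t: ref.get(t, 0.0) + rate * (doc.get(t, 0.0) - ref.get(t, 0.0))
        for t in terms
    }
    return prune(moved, tolerance)

ref = {"sheep": 0.8, "goat": 0.04}
doc = {"sheep": 1.0}
updated = update_sparse(ref, doc, rate=0.5, tolerance=0.05)
print(updated)
```

Here the weight of "goat" decays below the tolerance and the dimension is removed, so the vector shrinks as learning proceeds.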

Topic-sensitive initialization
- Inter-topic similarities are important both for map learning and for visualization and cluster extraction.
- Simple approach:
  - Use LSI to select K main broad topics.
  - Select K map cells (evenly spread over the map) as the fixpoints for the individual topics.
  - Initialize the selected fixpoints with the broad topics.
  - Initialize the remaining cells with the following rule: [formula]

Joint winner search
- Global winner search: accurate but slow.
- Local winner search: faster, but can be inaccurate during rapid changes.
- Start with a single phase of global search.
- Document movements become smoother during the learning process, so local search is usually enough.
- Use global search when occasional sudden moves occur (e.g. outliers, a decrease in neighbourhood width).
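The combination of local and global search can be sketched roughly as below. The slides do not say how a "sudden move" is detected, so the `min_similarity` fallback criterion, the search `radius` and the function names are assumptions made for this example only.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def global_search(cells, doc):
    """Compare the document against every cell on the map."""
    return max(cells, key=lambda pos: cosine(cells[pos], doc))

def local_search(cells, doc, prev, radius=1):
    """Compare the document only against cells near its previous winner."""
    nearby = [
        pos for pos in cells
        if max(abs(pos[0] - prev[0]), abs(pos[1] - prev[1])) <= radius
    ]
    return max(nearby, key=lambda pos: cosine(cells[pos], doc))

def joint_winner(cells, doc, prev=None, min_similarity=0.5):
    """Local search first; fall back to global search if the local match is poor."""
    if prev is None:
        return global_search(cells, doc)
    candidate = local_search(cells, doc, prev)
    if cosine(cells[candidate], doc) < min_similarity:  # assumed fallback criterion
        return global_search(cells, doc)
    return candidate

cells = {(0, 0): [1.0, 0.0], (0, 1): [0.9, 0.1], (0, 2): [0.1, 0.9], (0, 3): [0.0, 1.0]}
print(joint_winner(cells, [0.0, 1.0], prev=(0, 0)))
```

With `prev=(0, 0)` the local neighbourhood contains no good match for the document, so the sketch falls back to the global scan.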

Hierarchical maps
- Bottom-up approach:
  - Feasible (with the joint winner search method).
  - Start with the most detailed map.
  - Compute weighted centroids of map areas: [formula]
  - Use them as seeds for a coarser map.
- A top-down approach is possible but requires fixpoints.
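A rough sketch of the bottom-up step follows. The weighting formula is not preserved in the slides, so this example assumes that each cell is weighted by the number of documents assigned to it and that a map area is a square block of `factor` x `factor` cells; both choices, and the function name, are hypothetical.

```python
def coarsen(cells, doc_counts, factor=2):
    """Seed a coarser map with document-count-weighted centroids of cell blocks."""
    sums, weights = {}, {}
    for (row, col), ref in cells.items():
        key = (row // factor, col // factor)  # coarse cell covering this fine cell
        w = doc_counts.get((row, col), 0)
        acc = sums.setdefault(key, [0.0] * len(ref))
        for i, value in enumerate(ref):
            acc[i] += w * value
        weights[key] = weights.get(key, 0) + w
    # Blocks with no documents keep an all-zero seed vector.
    return {
        key: [v / weights[key] for v in acc] if weights[key] else acc
        for key, acc in sums.items()
    }

fine = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
counts = {(0, 0): 3, (0, 1): 1}
print(coarsen(fine, counts))
```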

Clustering document groups
- Numerous methods exist, but none of them is directly applicable:
  - extremely fuzzy structure of topical groups in SOM cells
  - necessity of taking into account similarity measures both in the original document space and in the map space
  - outlier-handling problem during cluster formation
  - no a priori estimate of the number of topical groups
- Fuzzy C-Means is applied on the lattice of map cells.
- A graph-theoretical approach (density- and distance-based MST) is combined with fuzzy clustering.
- Clustered documents are labelled by weighted centroids of cell reference vectors, scaled with between-group entropy.
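For reference, the membership update at the core of standard Fuzzy C-Means can be sketched as below. This is the textbook formula with Euclidean distance, not the lattice-aware variant the slide alludes to; treating points as generic coordinate tuples is a simplification, and the function name is hypothetical.

```python
import math

def memberships(points, centers, m=2.0):
    """Textbook fuzzy c-means update: degree to which each point belongs to each center."""
    exponent = 2.0 / (m - 1.0)
    result = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # A point lying exactly on a center belongs fully to it (shared on ties).
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(row)
            row = [x / total for x in row]
        else:
            row = [1.0 / sum((dk / dj) ** exponent for dj in dists) for dk in dists]
        result.append(row)
    return result

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
u = memberships(points, centers)
print(u)
```

Each row sums to 1, so a cell located between two topical areas receives partial membership in both rather than being forced into one cluster.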

[Figure: Example: biomedical documents]

Label candidates (5 newsgroups)

Rank  Cluster #1     Cluster #2          Cluster #3          Cluster #4          Cluster #5      Cluster #6
      sci.math       sci.med / sci.math  talk.religion.misc  soc.culture.israel  comp.windows.x  talk.politics.misc
1     Die            Cipher              Men                 Israel              Boot            Funding
2     Probable       Block               Women               Palestinian         Windows         Study
3     Theory         Stream              Raped               Gun                 Files           Taxes
4     Registers      Key                 Children            Aziz                Menus           Stock
5     Mathematics    Algorithms          Child               Iraqis              Lib             Health
6     Equation       Combinations        Sex                 Koppel              Icon            Market
7     Cos            Distinction         Soc                 Israeli             Label           Social
8     Sequence       Encryption          Father              Jews                Folder          Mercer
9     Tex            Epimethius          Paternity           Resolution          Msvcrtd         Governing
10    Space          Randomness          Feminist            Oliver              Shortcut        Vaccinations
11    Gravitational  Smartcard           Trolling            Utah                Netzero         Measurement
12    Wave           Entropy             White               Firearms            Tab             Bushes
13    Latex          Yahoo               England             Settlements         Kernel          Computer
14    Files          Model               Support             Palestine           Installed       Companies
15    Unsigned       Lottery             Black               Permitted           Backup          Diabetes

Experiments with execution time
- The impact of the following factors on the speed of map creation was investigated:
  - map size (total number of cells)
  - optimization methods: dictionary optimization, reference vector representation
- Map quality assessment:
  - Compare with an "ideal" map (e.g. one created without optimizations).
  - Use identical initialization and learning parameters.
  - Compute the sum of squared distances between the locations of each document on both maps.
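The quality measure described above can be written down directly. In this minimal sketch, each map is assumed to be given as a dict from a document id to the (row, column) of the cell it lands in; the function name and the data layout are hypothetical.

```python
def map_disagreement(ideal_positions, test_positions):
    """Sum of squared grid distances between each document's cell on the two maps."""
    total = 0
    for doc_id, (row_a, col_a) in ideal_positions.items():
        row_b, col_b = test_positions[doc_id]
        total += (row_a - row_b) ** 2 + (col_a - col_b) ** 2
    return total

ideal = {"d1": (0, 0), "d2": (2, 3)}
optimized = {"d1": (0, 1), "d2": (0, 3)}
print(map_disagreement(ideal, optimized))
```

A value of 0 means every document occupies the same cell on both maps; larger values indicate that the optimizations moved documents further from their "ideal" locations.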

[Figure: Execution time - map size]

[Figure: Execution time - optimizations]

Experiments with map convergence
We examined the convergence of the maps to a stable state depending on:
- type of alpha function (search radius reduction)
- type of winner search method
- type of initialization method

[Figure: Convergence - alpha functions]

[Figure: Convergence - winner search]

Future research
- Maps for a joint term-citation model, taking into account the between-group link flow direction.
- Fully distributed map creation.
- Adaptive document retrieval and clustering:
  - Bayesian-network-based relevance measure
  - survival models for document update rate estimation
  - dead link propagation methods for page freshness estimation
- We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects.

Future research (continued)
- Bayesian networks will be applied in particular to:
  - measure relevance and classify documents
  - accelerate document clustering processes
  - construct a thesaurus supporting query enrichment
  - extract keywords
  - estimate between-topic dependencies
- Immuno-genetic systems will be used for:
  - adaptive document clustering, by referring to the mechanism of so-called metadynamics
  - extraction of compact characteristics of document groups, by exploiting the mechanism of construction of universal and specialized antibodies
  - visualization and resolution adjustment of document maps

Thank you! Any questions?
