Clustering medical and biomedical texts – document map based approach


1 Clustering medical and biomedical texts – document map based approach
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, PAS, Warsaw
University of Podlasie, Siedlce
Białystok University of Technology

2 Agenda
Motivation
Architecture
User interface
Visualization
Hierarchical maps
Clustering
Experimental results
Future directions

3 Motivation
The Web and intranets are becoming increasingly content-rich: simple ranked lists, or even hierarchies of results, no longer seem adequate
A good way of presenting massive document sets in an understandable form will be crucial in the near future
The BEATCA project aims to create a full-fledged search engine for moderate-size document collections (millions of documents), capable of presenting on-line replies to queries in a user-friendly graphical form on a document map (based on the WebSOM approach)

4 BEATCA architecture
Documents are prepared by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation
The indexer also identifies frequent phrases in the document set for clustering and labelling purposes
Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded
The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation
‘The best’ map (with respect to some similarity measure) is used by the query processor in response to the user’s query
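
The indexing and dictionary-optimization steps can be sketched roughly as follows. This is only an illustration, not the BEATCA implementation: the function names, the whitespace tokenizer, the TF-IDF weighting, the normalized-entropy definition and all thresholds (`min_entropy`, `max_entropy`, `max_doc_ratio`) are assumptions introduced here.

```python
import math
from collections import Counter


def term_entropy(term, doc_counts):
    """Normalized entropy (0..1) of a term's occurrences spread over documents."""
    freqs = [c[term] for c in doc_counts if c[term] > 0]
    total = sum(freqs)
    if len(doc_counts) < 2 or total == 0:
        return 0.0
    h = -sum((f / total) * math.log(f / total) for f in freqs)
    return h / math.log(len(doc_counts))


def build_vectors(docs, min_entropy=0.0, max_entropy=0.95, max_doc_ratio=0.9):
    """Turn raw texts into TF-IDF vectors over a pruned dictionary.

    Thresholds are hypothetical; the presentation does not state them.
    """
    # Naive tokenizer (assumption): lower-case and split on whitespace.
    doc_counts = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for c in doc_counts for t in c)
    # Dictionary optimization: drop extremely frequent terms and
    # terms whose entropy falls outside the accepted range.
    vocab = sorted(
        t for t in df
        if df[t] / n <= max_doc_ratio
        and min_entropy <= term_entropy(t, doc_counts) <= max_entropy
    )
    # Vector-space model: one TF-IDF weight per dictionary term.
    vectors = [[c[t] * math.log(n / df[t]) for t in vocab] for c in doc_counts]
    return vocab, vectors


docs = [
    "the eye and the retina",
    "the heart and the valve",
    "the eye and the cataract",
]
vocab, vectors = build_vectors(docs)
```

In this toy collection, words that occur in every document exceed `max_doc_ratio` and are removed from the dictionary before the vectors are built.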

5 BEATCA architecture

6 User interface
Search results are presented on a document map
Compact (fuzzy) topical areas are extracted
Query-related summaries are generated on-line
Maps can have one of the following topologies:
  the traditional flat map (quadratic or hexagonal cells)
  rotating 3D map (torus, sphere, cylinder)
  hyperbolic map (Poincaré or Klein projections)
  growing map (Growing Neural Gas)
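
To illustrate how the choice of topology changes cell neighbourhoods, here is a minimal sketch comparing cell distance on a flat map and on a torus map. The function name `grid_distance`, the (column, row) cell coordinates and the use of Manhattan distance on quadratic cells are assumptions made for illustration, not details from the presentation.

```python
def grid_distance(a, b, width, height, torus=False):
    """Manhattan distance between cells a and b on a width x height map.

    Cells are (column, row) pairs. With torus=True the map wraps around
    at its edges, so opposite borders are adjacent.
    """
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    if torus:
        # On a torus the shorter way round may cross the map border.
        dx = min(dx, width - dx)
        dy = min(dy, height - dy)
    return dx + dy


flat = grid_distance((0, 0), (9, 0), width=10, height=10)
wrapped = grid_distance((0, 0), (9, 0), width=10, height=10, torus=True)
```

On the flat map the two corner cells are far apart, while on the torus they are direct neighbours, which is why a wrapped topology avoids border effects during map training.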

7 User interface

8 The traditional flat map - hexagonal

9 The traditional flat map - quadratic

10 Map visualizations in 3D

11 Hierarchical maps
Bottom-up approach
  Feasible (with joint winner search method)
  Start with most detailed map
  Compute weighted centroids of map areas
  Use them as seeds for coarser map
Top-down approach is possible but requires fixpoints
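
The bottom-up step (weighted centroids of map areas used as seeds for a coarser map) might look like the sketch below. It assumes that a "map area" is a square block of neighbouring cells and that each cell is weighted by the number of documents assigned to it; both choices, and the name `coarser_seeds`, are hypothetical and not taken from the presentation.

```python
def coarser_seeds(ref_vectors, doc_counts, block=2):
    """Build seed vectors for a coarser map from a detailed map.

    ref_vectors[r][c] is the reference vector (list of floats) of a cell,
    doc_counts[r][c] the number of documents assigned to that cell.
    Each block x block area is collapsed into one weighted centroid.
    """
    rows, cols = len(ref_vectors), len(ref_vectors[0])
    dim = len(ref_vectors[0][0])
    seeds = []
    for r0 in range(0, rows, block):
        seed_row = []
        for c0 in range(0, cols, block):
            cells = [
                (r, c)
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ]
            total = sum(doc_counts[r][c] for r, c in cells)
            if total > 0:
                weights = [doc_counts[r][c] / total for r, c in cells]
            else:
                # Empty area: fall back to a plain (unweighted) mean.
                weights = [1.0 / len(cells)] * len(cells)
            centroid = [
                sum(w * ref_vectors[r][c][k] for w, (r, c) in zip(weights, cells))
                for k in range(dim)
            ]
            seed_row.append(centroid)
        seeds.append(seed_row)
    return seeds


detailed = [[[0.0], [2.0]], [[4.0], [6.0]]]
counts = [[1, 1], [0, 2]]
seeds = coarser_seeds(detailed, counts)
```

The resulting seed grid would then initialize the reference vectors of the next, coarser map level before it is trained.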

12 Hierarchical maps
[Figure: hierarchical maps at Level 3, Level 2 and Level 1]

13 Clustering document groups
Numerous methods exist, but none of them is directly applicable:
  Extremely fuzzy structure of topical groups in SOM cells
  Necessity of taking into account similarity measures both in the original document space and in the map space
  Outlier-handling problem during cluster formation
  No a priori estimation of the number of topical groups
Fuzzy C-Means on the lattice of map cells is applied
Graph-theoretical approach (density- and distance-based MST) combined with fuzzy clustering
Clustered documents are labeled by weighted centroids of cell reference vectors scaled with between-group entropy
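
For readers unfamiliar with Fuzzy C-Means, the following is a minimal sketch of the standard algorithm applied to plain vectors with Euclidean distance. It does not reproduce the lattice-aware variant described on the slide (which combines similarity in document space and in map space); the farthest-first initialization, the fuzzifier `m=2.0` and the stopping tolerance are assumptions chosen to keep the example deterministic.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6):
    """Standard Fuzzy C-Means; returns (centers, memberships).

    memberships[j][i] is the degree to which point j belongs to cluster i,
    computed against the centers of the last-but-one update.
    """
    # Farthest-first initialization (an assumption, keeps runs deterministic).
    centers = [list(points[0])]
    while len(centers) < c:
        far = max(points, key=lambda p: min(dist(p, q) for q in centers))
        centers.append(list(far))
    exp = 2.0 / (m - 1.0)
    dim = len(points[0])
    u = []
    for _ in range(iters):
        # Membership update: closer centers get higher membership.
        u = []
        for p in points:
            d = [dist(p, q) for q in centers]
            if any(x == 0.0 for x in d):
                # Point coincides with one or more centers.
                hits = sum(1 for x in d if x == 0.0)
                u.append([1.0 / hits if x == 0.0 else 0.0 for x in d])
            else:
                u.append([
                    1.0 / sum((d[i] / d[k]) ** exp for k in range(c))
                    for i in range(c)
                ])
        # Center update: mean of points weighted by membership ** m.
        new_centers = []
        for i in range(c):
            w = [row[i] ** m for row in u]
            s = sum(w)
            if s == 0.0:
                new_centers.append(centers[i])  # keep an unused center as is
                continue
            new_centers.append([
                sum(wj * p[k] for wj, p in zip(w, points)) / s
                for k in range(dim)
            ])
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, memberships = fuzzy_c_means(points, c=2)
```

Unlike hard clustering, every point keeps a graded membership in every cluster, which suits the fuzzy topical structure of SOM cells noted above; the number of clusters `c` still has to be supplied, which is one of the difficulties the slide lists.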

14 Experimental results
Query: defect of sight
The answer to this query lies in the group with the following labels: glaucoma, contrast, eyes, cataracts, retina detachment

15 Future research
Maps for joint term-citation model, taking into account between-group link flow direction
Fully distributed map creation
Adaptive document retrieval and clustering:
  Bayesian network based relevance measure
  Survival models for document update rate estimation
  Dead link propagation methods for page freshness estimation
We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects

16 Future research
Bayesian networks will be applied in particular to:
  measure relevance and classify documents
  accelerate document clustering processes
  construct a thesaurus supporting query enrichment
  extract keywords
  estimate between-topic dependencies

17 Thank you! Any questions?

