Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE Bruno Pinheiro Renato Correa


1 Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE Bruno Pinheiro bfp@cin.ufpe.br Renato Correa renato.correa@ufpe.br

2 Guide
– Information Retrieval Systems (IRS)
– IRS + SOM
– Related Works
– Document Collection
– System Architecture
– Methodology
– Results

3 Information Retrieval Systems (IRS)
– Indexing, searching, and classifying textual documents.
– User’s information needs.
– Matching user’s queries and system’s vocabulary.

4 IRS + SOM
– Self-Organizing Maps
– Information Retrieval System

5 IRS + SOM
– Navigation interface built through document maps
– Document map: a Self-Organizing Map trained with document vectors

6 Related Works
– First works (1991-1995): Lin, Merkl
– Major projects (1996-2000): Arizona Digital Library, WEBSOM, SOMLib
– Diversification (2001-2005): LiGHtSOM, GHSOM, H2SOM
– Convergence (2006)

7 Document Collection
UFPE Digital Library of Theses and Dissertations (BDTD-UFPE)
– Offers the full text of all theses and dissertations produced in the university’s graduate programs.
– Approximately 6000 documents.
– Linked to the Brazilian BDTD and to the NDLTD (Networked Digital Library of Theses and Dissertations).

8 System Architecture
[Diagram] Processing stages: Document Acquisition, Document Indexing, Document Representation, Dimensionality Reduction, Volume Reduction, Construction of Document Map, Construction of User Interface.
Data shown in the diagram: Documents’ content, Inverted Index, Document Vectors, Reduced Vectors, Prototype Vectors, Document Map.

9 Methodology
Document acquisition
– Harvesting through the OAI-PMH protocol
– XML files containing the documents’ metadata
– Data extraction with the Java library JColtrane
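The extraction step above can be sketched as follows. This is only an illustration: it uses Python's standard `xml.etree.ElementTree` rather than the JColtrane library named on the slide, it assumes the harvested records use the `oai_dc` metadata format (the slide does not say which format was requested), and the sample record and the function name `extract_records` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response with a single record (not real BDTD-UFPE data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example thesis title</dc:title>
          <dc:description>Example abstract text.</dc:description>
          <dc:subject>Example subject</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return one dict per harvested record with its identifier and text fields."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "description": record.findtext(
                ".//dc:description", default="", namespaces=NS),
            "subjects": [s.text for s in record.iterfind(".//dc:subject", NS)],
        })
    return records
```

Calling `extract_records(SAMPLE_RESPONSE)` yields a list with one dictionary whose text fields can then be passed to the indexing step.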

10 Methodology
Indexing
– Java library Lucene
– Stemming, plus elimination of digits and stopwords
– Inverted index built according to the vector space model
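The indexing steps above can be illustrated with a small pure-Python sketch. It does not use Lucene: the stopword list and the suffix-stripping `toy_stem` function are toy stand-ins introduced here for illustration, and a real analyzer would be considerably more careful.

```python
import re
from collections import defaultdict

# Toy stopword list; a real system would use a full list for the collection's language.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "for", "on"}


def toy_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, keep runs of letters only (digits are dropped), remove stopwords, stem."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index
```

The per-document term frequencies stored in the postings are what a vector space model needs to build document vectors later.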

11 Methodology
Document representation
– Documents are represented by vectors whose indexes are terms and whose values are functions of each term’s frequency of occurrence in the document.
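As a rough illustration of such a representation, the sketch below builds tf-idf vectors. The slide only says the values are functions of term frequency; tf-idf is one common choice and is an assumption here, as is the function name `tfidf_vectors`.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors): one tf-idf vector per document."""
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Weight = raw term frequency times log inverse document frequency.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors
```

Each position in a vector corresponds to one vocabulary term, so a term that occurs in every document receives weight zero under this weighting.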

12 Methodology
Dimensionality reduction
– Feature selection based on word frequency
– Stopword elimination
– Final dimensionality: 13095 terms
Volume reduction
– Not used
– Volume: 4781 documents
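Frequency-based feature selection can be sketched as below. The slide does not give the criterion actually used, so the document-frequency thresholds `min_df` and `max_df_ratio` and the function name `select_terms` are purely illustrative assumptions.

```python
from collections import Counter


def select_terms(tokenized_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms that appear in at least min_df documents and in at most
    max_df_ratio of all documents (both thresholds are illustrative)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    return sorted(
        t for t, n in df.items() if n >= min_df and n / n_docs <= max_df_ratio
    )
```

Very rare terms carry little shared structure and very common terms discriminate poorly, which is the usual motivation for cutting at both ends.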

13 Methodology
Document map construction
– Single stage
– somtoolbox functions for MATLAB
– Document vectors normalized before training
– SOM with rectangular structure (10 x 12) and hexagonal neighborhood

14 Methodology
Document map construction
– Weights initialized linearly along the two eigenvectors with the largest eigenvalues
– Batch SOM algorithm with dot-product metric
– Gaussian neighborhood function
– Neighborhood size decreasing linearly with the number of epochs
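One epoch of a batch SOM with dot-product matching and a Gaussian neighborhood might look like the sketch below. It is written in plain Python rather than with the MATLAB toolbox the authors used, it omits the linear initialization, and two details are assumptions: the hexagonal coordinates (odd rows shifted by half a unit) and the renormalization of weights to unit length after each update.

```python
import math


def hex_coords(rows, cols):
    """Unit positions on a hexagonal lattice: odd rows shifted by half a unit."""
    return [(c + 0.5 * (r % 2), r * math.sqrt(3) / 2)
            for r in range(rows) for c in range(cols)]


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def batch_epoch(weights, coords, docs, radius):
    """One batch SOM epoch with dot-product matching and a Gaussian neighborhood."""
    # 1. Best-matching unit per document: the unit with the largest dot product.
    bmus = [max(range(len(weights)), key=lambda i: dot(weights[i], d)) for d in docs]
    new_weights = []
    for i, (xi, yi) in enumerate(coords):
        acc = [0.0] * len(weights[i])
        for d, b in zip(docs, bmus):
            xb, yb = coords[b]
            dist2 = (xi - xb) ** 2 + (yi - yb) ** 2
            h = math.exp(-dist2 / (2 * radius ** 2))  # Gaussian neighborhood weight
            acc = [a + h * x for a, x in zip(acc, d)]
        # 2. Renormalize; dividing by the sum of h first would not change the direction.
        new_weights.append(normalize(acc))
    return new_weights, bmus
```

Training would call `batch_epoch` once per epoch with a shrinking `radius`, feeding the returned weights into the next call.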

15 Methodology
Document map construction: parameters
– Number of epochs
» Rough phase: 10 epochs
» Fine-tuning phase: 10 epochs
– Neighborhood size
» Rough phase: initial [(number of units along the largest map dimension) / 2] + 1, final 2
» Fine-tuning phase: initial 2, final 0.8
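For the 10 x 12 map described earlier, the largest dimension has 12 units, so the rough-phase initial neighborhood size works out to 12 / 2 + 1 = 7. The sketch below computes both schedules; it assumes the square brackets mean the integer part and that the radius reaches its final value exactly at the last epoch, neither of which the slides confirm. The function name `linear_radius` is hypothetical.

```python
def linear_radius(initial, final, epochs):
    """Radius for each epoch, decreasing linearly from initial to final."""
    if epochs == 1:
        return [initial]
    step = (initial - final) / (epochs - 1)
    return [initial - step * e for e in range(epochs)]


rows, cols = 10, 12
rough_initial = max(rows, cols) // 2 + 1  # [12 / 2] + 1 = 7
rough = linear_radius(rough_initial, 2, 10)  # rough phase: 7 down to 2
fine = linear_radius(2, 0.8, 10)             # fine-tuning phase: 2 down to 0.8
```

The rough phase starts with a neighborhood spanning more than half the map, while the fine-tuning phase ends with a radius below one unit, so late updates affect mostly the winning node.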

16 Methodology
User interface construction
– Documents are mapped to the node with the closest model vector in terms of cosine distance
– Each map node is labeled according to category:
» Knowledge areas (CHLA, CBS, TCEN)
» Graduate programs
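The mapping and labeling steps can be sketched as follows. The slide does not say how a node's label is chosen from the documents mapped to it; majority vote is an assumption made here, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def nearest_node(doc, model_vectors):
    """Index of the node with the highest cosine similarity to the document
    (equivalently, the smallest cosine distance 1 - similarity)."""
    return max(range(len(model_vectors)),
               key=lambda i: cosine_similarity(doc, model_vectors[i]))


def label_nodes(docs, categories, model_vectors):
    """Label each node with the most frequent category among its documents."""
    votes = defaultdict(Counter)
    for doc, category in zip(docs, categories):
        votes[nearest_node(doc, model_vectors)][category] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}
```

Nodes that receive no documents do not appear in the returned dictionary and would need a separate rule, for example leaving them unlabeled.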

17 Results
Columns: Categories, Accuracy, F1 micro, F1 macro, Topographic error
– 3 categories: 0.96, 0.01
– 61 categories: 0.66, 0.44, 0.01
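Accuracy, micro-averaged F1, and macro-averaged F1 follow standard definitions, sketched below for single-label classification. How the authors derived predictions from the map is not stated on the slide, so the inputs here are simply lists of true and predicted labels, and the function name `f1_scores` is hypothetical.

```python
def f1_scores(true_labels, predicted_labels):
    """Return (accuracy, micro-F1, macro-F1) for single-label classification."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(true_labels, predicted_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(true_labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return accuracy, micro, macro
```

Macro-F1 weights every class equally, so it drops faster than accuracy when small classes are poorly recognized.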

18 Results
[Figure panels] Knowledge Areas; Graduate Programs

19 Acknowledgement

20 Bruno Pinheiro bfp@cin.ufpe.br Renato Correa renato.correa@ufpe.br

