1
Self Organization of a Massive Document Collection
Advisor: Dr. Hsu
Graduate student: Sheng-Hsuan Wang
Authors: Teuvo Kohonen et al.
2
Outline
- Motivation
- Objective
- Introduction
- Self-Organizing Map
- Statistical Models of Documents
- Rapid Construction of Large Document Maps
- The Document Map of All Electronic Patent Abstracts
- Conclusion
- Personal Opinion
3
Motivation
To improve the WEBSOM and to organize vast document collections according to their textual similarities.
4
Objective
The main goal has been to scale up the SOM algorithm so that it can deal with large amounts of high-dimensional data.
5
Introduction
- From simple searches to browsing of self-organized data collections.
- Scope of this work: the WEBSOM.
- Reducing the dimensionality:
  - Latent semantic indexing (LSI).
  - Clustering of words into semantic categories.
  - A random projection method.
6
Self-Organizing Map
The original SOM algorithm.
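For reference, the original incremental SOM update in its standard textbook form (a reconstruction; the slide shows the equation only as an image) is

  m_i(t+1) = m_i(t) + \alpha(t)\, h_{c(x),i}(t)\, [x(t) - m_i(t)]

where x(t) is the input sample at step t, c(x) = \arg\min_i \|x(t) - m_i(t)\| indexes the winning model vector, h_{c(x),i} is the neighborhood function centered on the winner, and \alpha(t) is a decreasing learning rate.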
7
Self-Organizing Map
The batch-map SOM: a variant that accelerates the computation of the SOM.
8
Self-Organizing Map
Let V_i be the set of all x(t) that have m_i as their closest model; V_i is called the Voronoi set of unit i. The number of samples x(t) falling into V_i is denoted n_i.
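With these definitions, the batch-map update takes its usual form (a reconstruction from the definitions above, not verbatim from the slide):

  m_i = \frac{\sum_j n_j h_{ji} \bar{x}_j}{\sum_j n_j h_{ji}}

where \bar{x}_j is the mean of the x(t) in V_j and h_{ji} is the neighborhood function. No learning rate is needed; the update is iterated until the m_i stabilize.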
9
Statistical Models of Documents
The histograms are formed over word clusters produced by self-organizing semantic maps; this system was called the WEBSOM. (Figure: overview of the WEBSOM2 system.)
11
Statistical Models of Documents
A. The Primitive Vector-Space Model
- Inverse document frequency (IDF).
- Shannon entropy.
B. Latent Semantic Indexing (LSI)
- Singular-value decomposition (SVD).
12
Statistical Models of Documents
C. Randomly Projected Histograms
- The original document vector (the word histogram) n is multiplied by a rectangular random matrix R, giving the projection x = Rn, with the projected dimensionality much smaller than the vocabulary size.
D. Histograms on the Word Category Map
- The word category map was the original version of the WEBSOM.
- The new method is random projection of the word histograms.
13
Statistical Models of Documents
E. Validation of the Random Projection Method by Small-Scale Preliminary Experiments
- 13742 patents from the whole corpus of 6840568 abstracts; an equal number of patents from each of the 21 subsections.
- A vocabulary of 1814 words or word forms.
- The full 1344-dimensional histograms served as the baseline document vectors.
14
Statistical Models of Documents
F. Construction of Random Projections of Word Histograms by Pointers
- Thresholding the entries of R to +1 or -1.
- Sparse random matrices (entries 1 and 0).
15
Statistical Models of Documents
- Each word is associated, via a hash table, with a set of pointers into the projected vector.
- The computing time was about 20% of that of the usual matrix-product method.
- The computational complexity of the random projection with pointers is only O(NL) + O(n), whereas that of the LSI is O(Nnd). (N: documents; L: average number of distinct words per document; n: vocabulary size; d: projected dimensionality.)
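A minimal Python sketch of the pointer idea, assuming k random pointers per word with +/-1 signs mimicking the thresholded variant (all names and defaults are illustrative, not from the paper):

import numpy as np

def make_pointers(vocab, d=500, k=5, seed=0):
    # For each word, draw k random target positions ("pointers") in the
    # d-dimensional projected space, with random +/-1 signs.
    rng = np.random.default_rng(seed)
    return {w: (rng.integers(0, d, size=k),
                rng.choice([-1.0, 1.0], size=k))
            for w in vocab}

def project_document(word_counts, pointers, d=500, weights=None):
    # Accumulate the projected histogram x = Rn without ever forming
    # the dense random matrix R: each word adds its (weighted) count
    # at its pointer positions.
    x = np.zeros(d)
    for w, count in word_counts.items():
        if w not in pointers:
            continue
        idx, sign = pointers[w]
        value = count * (weights[w] if weights else 1.0)
        np.add.at(x, idx, value * sign)  # handles repeated positions
    return x

Each word is touched k times instead of d times, which is consistent with the roughly fivefold speedup over the matrix-product method reported above.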
16
Rapid Construction of Large Document Maps
A. Fast Distance Computation
- Tabulate the indexes of the nonzero components of each input vector.
- Euclidean distances between a sparse input vector and the models then need only those nonzero components.
- The model vectors themselves cannot be kept sparse, so low-dimensional models must be used.
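A sketch of the shortcut (illustrative names; models stored as a dense M x d array): because ||x||^2 is the same for every unit, the winner is argmin_i (||m_i||^2 - 2 x.m_i), and the inner products need only the nonzero components of x.

import numpy as np

def winner_sparse(nz_idx, nz_val, models, model_sq_norms):
    # nz_idx, nz_val: indexes and values of the nonzero components of x.
    # model_sq_norms: precomputed ||m_i||^2 for all units, reused for
    # every input during one batch-map iteration.
    dots = nz_val @ models[:, nz_idx].T      # x . m_i from nonzeros only
    return int(np.argmin(model_sq_norms - 2.0 * dots))

# precompute once per iteration:
# model_sq_norms = np.einsum('ij,ij->i', models, models)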
17
Rapid Construction of Large Document Maps
B. Estimation of Larger Maps Based on Carefully Constructed Smaller Ones
- The number of nodes of the SOM is increased stepwise during its construction.
- The new idea is to estimate good initial values for the model vectors of a very large map on the basis of the asymptotic values of the model vectors of a much smaller map.
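One simple way to realize such an estimation is to upsample the small map's model vectors onto the larger grid by bilinear interpolation; this is a stand-in sketch, not the paper's exact interpolation formula:

import numpy as np

def enlarge_map(small, big_shape):
    # small: (h, w, d) array of model vectors of the converged small map.
    # Returns (H, W, d) initial model vectors for the enlarged map.
    h, w, d = small.shape
    H, W = big_shape
    ys = np.linspace(0, h - 1, H)            # map big rows into [0, h-1]
    xs = np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    fy = (ys - y0)[:, None, None]            # fractional offsets
    fx = (xs - x0)[None, :, None]
    top = (1 - fx) * small[y0][:, x0] + fx * small[y0][:, x1]
    bot = (1 - fx) * small[y1][:, x0] + fx * small[y1][:, x1]
    return (1 - fy) * top + fy * bot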
18
Rapid Construction of Large Document Maps
(Figure: the "dense" and "sparse" cases.)
19
Rapid Construction of Large Document Maps
C. Rapid Fine-Tuning of the Large Maps
1) Addressing Old Winners: the winner search for each input is restricted to the vicinity of the winner found on the previous round (the same idea as in LAB!); see the sketch after this slide.
2) Initialization of the Pointers: the size of the maps is increased stepwise during learning, using formula (10). The winner is the map unit for which the inner product with the data vector is the largest.
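A minimal sketch of the old-winner shortcut (the radius and the 2-D grid layout are assumptions for illustration):

import numpy as np

def winner_local(x, models, grid_pos, old_winner, radius=2):
    # Search only units whose 2-D grid position lies within `radius`
    # of the winner stored from the previous iteration.
    near = np.where(np.abs(grid_pos - grid_pos[old_winner])
                    .max(axis=1) <= radius)[0]
    d2 = ((models[near] - x) ** 2).sum(axis=1)
    return int(near[np.argmin(d2)])

Because successive maps resemble each other, the true winner is almost always found inside this small window, turning the O(M) search into roughly constant time per input.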
20
Rapid Construction of Large Document Maps
21
3) Parallelized Batch Map Algorithm: the winner search can be implemented as a parallel process.
4) Saving Memory by Reducing Representation Accuracy: sufficient accuracy can be maintained during the computation even with a reduced-precision representation. A sketch of both ideas follows below.
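A sketch of both ideas (thread count and dtypes are illustrative assumptions):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def winners_parallel(data, models, workers=4):
    # Batch winner search split across threads; NumPy's matmul releases
    # the GIL, so the shards really do run in parallel.
    sq = np.einsum('ij,ij->i', models, models)       # ||m_i||^2, once
    def chunk_winners(chunk):
        return np.argmin(sq - 2.0 * chunk @ models.T, axis=1)
    chunks = np.array_split(data, workers)
    with ThreadPoolExecutor(workers) as pool:
        return np.concatenate(list(pool.map(chunk_winners, chunks)))

# Saving memory: the model matrix can be stored in reduced precision,
# e.g. models.astype(np.float16), and promoted to float32 only inside
# the arithmetic.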
22
Rapid Construction of Large Document Maps
D. Performance Evaluation of the New Methods
1) Numerical Comparison with the Traditional SOM Algorithm:
- Two performance indexes measure the quality of the maps: average quantization error and classification accuracy.
- Experiments: two sets of maps.
23
Rapid Construction of Large Document Maps
2) Comparison of the Computational Complexity: the total complexity is roughly O(dM^2) + O(dN) + O(M^2), where O(dM^2) stems from the computation of the small map, O(dN) results from the VQ step (6) of the batch map algorithm, and O(M^2) refers to the estimation of the pointers. (N: data samples; M: map units; d: dimensionality.)
24
The Document Map of All Electronic Patent Abstracts
A. Preprocessing
- We first extracted the titles and the texts for further processing and removed nontextual information.
- Mathematical symbols and numbers were converted into special dummy symbols.
- The corpus contained 733179 different words; after a set of common words was removed, the remaining vocabulary consisted of 43222 words.
- Finally, we omitted the 122524 abstracts in which fewer than five words remained.
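A hedged Python sketch of this pipeline (the stopword list and the frequency threshold are assumptions; the slide gives only the resulting sizes):

import re
from collections import Counter

DUMMY = "<num>"  # dummy symbol standing in for numbers/math symbols

def tokenize(text):
    # Lower-case and replace numbers with the dummy symbol.
    text = re.sub(r"[0-9]+(\.[0-9]+)?", DUMMY, text.lower())
    return re.findall(r"[a-z<>]+", text)

def build_vocab(abstracts, stopwords, min_freq=50):
    # Keep words that are frequent enough and not in the common-word
    # list (min_freq=50 is an assumed threshold).
    counts = Counter(w for a in abstracts for w in tokenize(a))
    return {w for w, c in counts.items()
            if c >= min_freq and w not in stopwords}

def filter_abstracts(abstracts, vocab):
    # Omit abstracts in which fewer than five vocabulary words remain.
    kept = [[w for w in tokenize(a) if w in vocab] for a in abstracts]
    return [ws for ws in kept if len(ws) >= 5]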
25
The Document Map of All Electronic Patent Abstracts
B. Formation of Statistical Models
- As the final dimensionality we selected 500, and five random pointers were used for each word.
- The words were weighted using the Shannon entropy of their distribution of occurrence among the subsections of the patent classification system.
26
The Document Map of All Electronic Patent Abstracts
The weight is a measure of the unevenness of the distribution of the word over the subsections. The weights were calculated as follows. Let P_g(w) be the probability of a randomly chosen instance of the word w occurring in subsection g, and N_g the number of subsections. The Shannon entropy is H(w) = -\sum_g P_g(w) \log P_g(w), and the weight is W(w) = H_max - H(w), where H_max = \log N_g.
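A small sketch of the weight computation, matching the formula above (the function name is illustrative):

import numpy as np

def entropy_weight(counts_per_subsection):
    # counts_per_subsection[g]: occurrences of the word in subsection g
    # (Ng = 21 subsections in this corpus).
    c = np.asarray(counts_per_subsection, dtype=float)
    p = c / c.sum()                      # P_g(w)
    p = p[p > 0]                         # convention: 0 log 0 = 0
    H = -(p * np.log(p)).sum()           # Shannon entropy H(w)
    return np.log(len(c)) - H            # weight = H_max - H(w)

An evenly spread word gets a weight near zero; a word concentrated in a few subsections gets a large weight.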
27
The Document Map of All Electronic Patent Abstracts
C. Formation of the Document Map
- 500-dimensional document vectors.
- The map was enlarged twice sixteenfold and once ninefold.
- Each of the enlarged, estimated maps (cf. Section IV-B) was then fine-tuned by five batch-map iteration cycles.
28
The Document Map of All Electronic Patent Abstracts
D. Results
When each map node was labeled according to the majority subsection among the abstracts mapped to it, the resulting classification accuracy was 64%.
31
Conclusion
In this paper the emphasis has been on scaling the methods up to very large text collections. Contributions:
- A document map far larger than our previous one.
- A new method of forming statistical models of documents.
- Several new fast computing methods.
32
Personal Opinion
Could the SOM be applied within a specific knowledge domain, e.g., IR or …?