Published by Charles Parker. Modified over 9 years ago.
1
Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
A Comparison of SOM Based Document Categorization Systems
Advisor: Dr. Hsu
Graduate: Kuo-min Wang
Authors: X. Luo, A. Nur Zincir-Heywood
2003 IEEE.
2
Outline
1) Motivation
2) Objective
3) Architecture Overview
4) Performance Evaluation
5) Conclusions
6) Personal Opinion
3
Motivation
Document categorization systems can address two problems.
Information overload: the constant influx of new information overwhelms users with the subject and system knowledge required to access it.
Vocabulary differences: the automatic selection and weighting of keywords in text documents may bias the nature of the clusters found at a later stage.
4
Objective
This paper describes the development and evaluation of two unsupervised learning mechanisms for solving the automatic document categorization problem.
1) Vector space model
2) Code-book model
[Figure: a document vector with components word 1, word 2, ..., word n]
5
Introduction
A common approach among existing systems is to cluster documents based upon their word distributions, while word clustering is determined by document co-occurrence.
Vector Space Model (VSM):
1) The frequency of occurrence of each word in each document is recorded.
2) The frequencies are generally weighted using the Term Frequency (TF) multiplied by the Inverse Document Frequency (IDF).
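The TF multiplied by IDF weighting can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say which IDF variant is used, so `log(N / df)` is an assumption, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, with weight = TF * IDF.

    Assumption: IDF = log(N / df), where N is the number of documents and
    df is the number of documents containing the word.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))
    weighted = []
    for words in documents:
        term_freq = Counter(words)  # raw count of each word in this document
        weighted.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weighted

# Toy, already-tokenized documents (hypothetical data).
docs = [["som", "cluster", "som"], ["cluster", "document"], ["document", "vector"]]
weights = tf_idf(docs)
```

A word that appears in every document receives a weight of zero under this variant, which is why common words contribute little to the document vectors.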
6
Introduction (cont.)
First clustering system: built on the VSM, it makes use of the topological ordering property of SOMs.
Second clustering system:
1) Makes use of the SOM based architecture as an encoder for data representation.
2) Finds a smaller set of prototypes from a large input space, without using the typical information retrieval pre-processing.
3) Considers the relationships between characters, then words, and finally word co-occurrences.
7
Document Clustering With Self-Organizing Maps (cont.)
The first step is the identification of an encoding of the original information such that pertinent features may be decoded most efficiently.
The SOM acts as an encoder to represent a large input space by finding a smaller set of prototypes.
[Figure: input vector x -> encoder c(x) -> Σ (noise v added) -> decoder x'(x) -> reconstruction vector x'; labels: BMU, weight vector w_j]
8
Architecture Overview
There are two main parts to the vector space model.
Parsing:
1) Converts documents into a succession of words.
2) Filters the words using a basic stop-list of common English words.
3) Applies a stemming algorithm to the remaining words, e.g. "story" and "stories", or "First" and "first".
Indexing:
1) Each document is represented in the Vector Space Model.
2) The frequency of occurrence of each word in each document is recorded.
3) TF multiplied by IDF is used to generate the weighted values.
[Figure: data collection -> data preprocessing -> data reduction -> pattern discovery]
9
Architecture-1: Emphasizing Density Matching Property
Data pre-processing:
1) Start from the original document.
2) Tags and non-textual data are removed.
3) Stop words are removed.
4) Words are stemmed.
5) Words occurring less than 5 times are removed.
6) TF/IDF is used to weight each word.
Data reduction: a random mapping is used to reduce the data dimension, $x' = Rx$, where
$x \in \mathbb{R}^N$ is the original data vector,
$R$ is a $d \times N$ matrix of random values in which the Euclidean length of each column has been normalized to unity,
$x' \in \mathbb{R}^d$ is the reduced-dimensional or quantized vector.
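The random mapping x' = Rx can be sketched in plain Python. The slide only says that R holds random values with unit-length columns; drawing the entries from a Gaussian distribution is an assumption here, and the names `random_projection_matrix` and `project` and the sample vector are hypothetical.

```python
import math
import random

def random_projection_matrix(d, n, seed=0):
    """Build a d x n matrix of random values whose columns have unit Euclidean length.

    Assumption: entries are drawn from a standard Gaussian before normalization.
    """
    rng = random.Random(seed)
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]
    for col in range(n):
        # Normalize each column to unit length.
        norm = math.sqrt(sum(matrix[row][col] ** 2 for row in range(d)))
        for row in range(d):
            matrix[row][col] /= norm
    return matrix

def project(matrix, x):
    """Compute the reduced vector x' = R x."""
    return [sum(r_ij * x_j for r_ij, x_j in zip(row, x)) for row in matrix]

# Reduce a toy 8-dimensional document vector to 3 dimensions.
R = random_projection_matrix(d=3, n=8)
x = [1.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0]
x_reduced = project(R, x)
```

Because R is fixed once and reused for every document, the cost of the reduction is a single matrix-vector product per document.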
10
Architecture-1: Emphasizing Density Matching Property (cont.)
The problem of crowded neurons may be solved with a divide-and-conquer method.
[Figure labels: Data Preprocess, Data Reduction]
11
Architecture-2: Emphasizing Encoding-Decoding Property
The core of the approach is to automate the identification of typical category characteristics.
A document is summarized by its words and their frequencies (TF) in descending order.
An SOM is used to identify a suitable character encoding, then a word encoding, and finally a word co-occurrence encoding.
Data pre-processing:
1) Start from the original document.
2) Tags and non-textual data are removed.
3) Stop words are removed.
4) Words are stemmed.
5) Word frequencies are formed.
6) Words occurring less than 5 times are removed.
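The pre-processing steps above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the tiny `STOP_WORDS` set, the crude `naive_stem` suffix stripper (a stand-in for whichever stemming algorithm the authors used), the tag-removal regular expression, and applying the 5-occurrence threshold to the text passed in.

```python
import re
from collections import Counter

# Tiny illustrative stop-list; a real system would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def naive_stem(word):
    """Very rough suffix stripping; only a placeholder for a real stemmer."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + ("y" if suffix == "ies" else "")
    return word

def summarize(raw_text, min_count=5):
    """Return (stem, frequency) pairs sorted by descending frequency."""
    text = re.sub(r"<[^>]+>", " ", raw_text)        # remove markup tags
    tokens = re.findall(r"[a-z]+", text.lower())     # keep alphabetic tokens only
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(stems)                          # word frequencies (TF)
    frequent = {w: c for w, c in counts.items() if c >= min_count}
    return sorted(frequent.items(), key=lambda item: (-item[1], item[0]))

raw = "<p>Stories</p> " + "story " * 4 + "map maps map map map map " + "the " * 6
summary = summarize(raw)
print(summary)  # [('map', 6), ('story', 5)]
```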
12
Architecture-2: Emphasizing Encoding-Decoding Property (cont.)
Input for the first-level SOMs
The system employs a three-level hierarchical SOM architecture: characters, words, and word co-occurrences.
Characters are represented by their ASCII codes.
The relationships between characters are represented by a character's position, or time index, in a word.
Example, for the word "news":
Time indices: n -> 1, e -> 2, w -> 3, s -> 4
Character codes: n -> 14, e -> 5, w -> 23, s -> 19 (the values shown on the slide are the letters' positions in the alphabet, a = 1, rather than raw ASCII codes)
Pre-processing process:
1) Convert the word's characters to numerical values.
2) Find the time indices of the characters.
3) Linearly normalize the indices, so that the first character is one, the second is two, and so on.
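The "news" example can be reproduced with a few lines of Python. This sketch assumes lowercase words made only of the letters a to z and uses the 1-based alphabet position as the character code, because that is what matches the numbers on the slide; the function name `encode_word` is hypothetical.

```python
def encode_word(word):
    """Return (character code, time index) pairs for one word.

    Assumption: the character code is the 1-based alphabet position
    (a = 1, ..., z = 26), which matches the slide's example values.
    """
    pairs = []
    for position, char in enumerate(word.lower(), start=1):
        code = ord(char) - ord("a") + 1
        pairs.append((code, position))
    return pairs

print(encode_word("news"))  # [(14, 1), (5, 2), (23, 3), (19, 4)]
```

Each pair is one two-dimensional input sample for the first-level SOM.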
13
Architecture-2: Emphasizing Encoding-Decoding Property (cont.)
Input for the second-level SOMs
For each word, k, that is input to the first-level SOM of each document:
1) Form a vector of size equal to the number of neurons (r) in the first-level SOM.
2) For each character of k, observe which of the neurons n_1, n_2, ..., n_r are affected the most (the first 3 BMUs).
3) Increment the entries in the vector corresponding to the first 3 BMUs by 1/j, 1 <= j <= 3.
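The construction of a second-level input vector can be sketched as below. The sketch assumes that j is the rank of the BMU (1 for the best match, 3 for the third best) and that "affected the most" means smallest Euclidean distance; the helper names and the toy first-level weights are hypothetical.

```python
import math

def top_bmus(weights, sample, k=3):
    """Indices of the k neurons closest to the sample (Euclidean distance), best first."""
    distances = [math.dist(w, sample) for w in weights]
    return sorted(range(len(weights)), key=lambda j: distances[j])[:k]

def word_vector(weights, char_inputs):
    """Accumulate 1/j for the j-th best matching unit of every character in a word."""
    vector = [0.0] * len(weights)  # one entry per first-level neuron
    for sample in char_inputs:
        for rank, neuron in enumerate(top_bmus(weights, sample), start=1):
            vector[neuron] += 1.0 / rank
    return vector

# Toy first-level SOM with r = 4 neurons and a two-character word.
first_level_weights = [(0, 0), (1, 0), (5, 5), (10, 10)]
chars = [(0, 0), (1, 0)]
vec = word_vector(first_level_weights, chars)
print(vec)  # approximately [1.5, 1.5, 0.667, 0.0]
```

The resulting vector has a fixed length r regardless of how many characters the word contains, which is what lets words of different lengths share one second-level SOM.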
14
Architecture-2: Emphasizing Encoding-Decoding Property (cont.)
Input for the third-level SOMs
The third-level input vectors are built using BMUs resulting from word vectors passed through the second-level SOMs.
15
Architecture-2: Emphasizing Encoding-Decoding Property (cont.)
Training the SOMs
1) Initialization: choose random values for the initial weight vectors w_j(0), j = 1, 2, ..., l, where l is the number of neurons in the map.
2) Sampling: draw a sample x from the input space with a uniform probability.
3) Similarity matching: find the best matching neuron i(x) using the Euclidean criterion.
4) Updating: adjust the weight vectors of all neurons by using the update formula.
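The four training steps can be sketched as a small SOM trainer. The update formula itself is not preserved in this transcript, so the sketch assumes the standard Kohonen rule w_j(n+1) = w_j(n) + η(n) h_{j,i(x)}(n) (x − w_j(n)) with a Gaussian neighborhood h and exponentially decaying learning rate and radius; the grid size, step count, `lr0`, `sigma0`, and the function name `train_som` are all hypothetical choices.

```python
import math
import random

def train_som(samples, rows, cols, steps=500, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM and return its list of weight vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # 1) Initialization: random weight vectors w_j(0).
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    for n in range(steps):
        # 2) Sampling: draw a sample uniformly from the input set.
        x = rng.choice(samples)
        # 3) Similarity matching: best matching neuron i(x) by Euclidean distance.
        winner = min(range(len(grid)), key=lambda j: math.dist(weights[j], x))
        # Assumed schedules: learning rate and neighborhood radius decay over time.
        decay = math.exp(-n / steps)
        lr, sigma = lr0 * decay, sigma0 * decay
        # 4) Updating: move every neuron toward x, scaled by its grid distance to the winner.
        for j, (r, c) in enumerate(grid):
            grid_dist_sq = (r - grid[winner][0]) ** 2 + (c - grid[winner][1]) ** 2
            h = math.exp(-grid_dist_sq / (2 * sigma ** 2))
            weights[j] = [w + lr * h * (xi - w) for w, xi in zip(weights[j], x)]
    return weights

# Two loose groups of toy 2-D samples.
samples = [[0.1, 0.1], [0.15, 0.2], [0.9, 0.9], [0.85, 0.8]]
trained = train_som(samples, rows=2, cols=2)
```

The same routine could be reused at each level of the hierarchy, with character pairs, word vectors, or word co-occurrence vectors as the samples.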
16
Performance Evaluation
The performance measurement used is based on:
A: the set of correct class labels (answer key)
B: the baseline clusters, where each document is one cluster
C: the set of clusters ("winning" clustering results)
dist(C, A): the number of operations required to transform C into A
dist(B, A): the number of operations required to transform B into A
17
Conclusion
The first architecture emphasizes a layered approach to lower the computational cost of training the map, and employs a random mapping to decrease the dimension of the input space.
The second architecture is based on a new idea in which the SOM acts as an encoder to represent a large input space by finding a smaller set of prototypes.
Future work:
1) Develop a classifier that will work in conjunction with these clustering systems.
2) Apply the technique to a wider cross-section of benchmark data sets.
18
Personal Opinions
Advantages: uses a random mapping method to reduce dimensions.
Drawback: the architecture description is not clear.
Limit …