Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)
Extracting meaningful labels for WEBSOM text archives
Advisor: Dr. Hsu   Presenter: Chih-Ling Wang   Authors: Arnulfo P. Azcarraga, Teddy N. Yap Jr.
Outline
Motivation
Objective
WEBSOM
Random projection method
Extracting meaningful labels
WEBSOM text archive for CNN
Reuters WEBSOM text archive
Conclusion
My opinion
Motivation
Self-Organizing Maps, being used mainly with data that are not pre-labeled, need automatic procedures for extracting keywords as labels for each map unit. The WEBSOM methodology for building very large text archives has a very slow method for extracting such unit labels.
Objective
This paper describes how meaningful labels for each map unit can be deduced by analyzing the relative weight distribution of the SOM weight vectors and by taking advantage of some characteristics of the random projection method used for dimensionality reduction.
WEBSOM
The WEBSOM methodology does include an automatic keyword extraction procedure, but it is very slow: it computes the relative frequencies of all the words of all the documents associated with each unit, and then compares these to the relative frequencies of all the words of the other units of the map.
WEBSOM (cont.)
To reduce the computational load, words that occurred only a few times in the whole database were neglected. The word category map is a "self-organizing semantic map" that describes relations between words based on their averaged short contexts. Documents are encoded by mapping their text, word by word, onto the word category map, whereby a histogram of the "hits" on it is formed. The document map is then formed with the SOM algorithm, using the histograms as "fingerprints" of the documents.
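The histogram "fingerprint" encoding can be sketched as follows. This is a minimal illustration, not the WEBSOM implementation: the word-to-unit lookup table `wcm_unit` stands in for a trained word category map, and all names here are assumptions.

```python
from collections import Counter

def document_fingerprint(words, wcm_unit, n_units):
    """Encode a document as a normalized histogram of 'hits'
    on the word category map.

    words    -- tokenized document text
    wcm_unit -- dict mapping each word to its best-matching unit index
                on a trained word category map (assumed given)
    n_units  -- number of units in the word category map
    """
    hits = Counter(wcm_unit[w] for w in words if w in wcm_unit)
    total = sum(hits.values()) or 1
    # The normalized histogram serves as the document's "fingerprint"
    return [hits.get(u, 0) / total for u in range(n_units)]

# Toy example: a 4-unit word category map
wcm = {"stock": 0, "market": 0, "game": 2, "team": 2, "rain": 3}
fp = document_fingerprint(["stock", "market", "game"], wcm, 4)
```

These fingerprints would then be fed to the SOM algorithm to train the document map.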
Random projection method
One of the most critical aspects of WEBSOM-based text archiving is compressing the initial text dataset into a size that is manageable for SOM training, labeling, and archiving, without losing too much of the original information content needed for effective text classification and archiving. As reported in Kohonen [7,8], a random projection method can radically reduce the dimensionality of the document encodings. Given a document vector x, where the elements of the vector are normalized term frequencies, and given a random m×n matrix R, one can compute the projection x' = Rx of the original document vector onto a much lower dimensional space, i.e., m << n.
Random projection method (cont.)
Let r denote the number of 1s per column in the m×n random projection matrix, m the number of dimensions in the compressed input vector, and n the original number of keywords prior to random projection. Each term is randomly mapped to r dimensions; each dimension, in turn, is associated with approximately rn/m terms.
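The projection described above can be sketched as follows. This is a minimal sketch under the slide's definitions (r ones per column of an m×n matrix); the sizes and the use of a dense NumPy matrix are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection_matrix(m, n, r):
    """Build an m x n projection matrix with exactly r ones per column.

    Each of the n terms is thereby randomly mapped to r of the m
    dimensions, so each dimension receives roughly r*n/m terms.
    """
    R = np.zeros((m, n))
    for col in range(n):
        rows = rng.choice(m, size=r, replace=False)
        R[rows, col] = 1.0
    return R

# Project a normalized term-frequency document vector x from n to m dims
n, m, r = 1000, 50, 3          # illustrative sizes, m << n
R = random_projection_matrix(m, n, r)
x = rng.random(n)
x_proj = R @ x                 # compressed m-dimensional encoding x' = Rx
```

In practice a sparse representation of R would be used, since each column has only r nonzero entries.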
Extracting meaningful labels
A term w is a meaningful label for a document cluster C in a trained map if
─ w is prominent in C compared to other words in C; and
─ w is prominent in C compared to the other occurrences of w in the whole map.
Terms mapped to high weight values are the potential keywords. Since a random projection matrix is used, each weight component has numerous terms mapped to it, so there is no straightforward way to determine which keywords contribute significantly to a map unit.
Extracting meaningful labels (cont.)
We can deduce the set of truly significant keywords as follows:
1. For every dimension, compute the mean weight and standard deviation among all the map units. Weight values that exceed the mean by more than a standard-deviation threshold are significantly high for the given dimension.
2. Every time a certain dimension d is found to be significantly high, it is likely that only one of the rn/m terms mapped to it has truly contributed significantly to the high weight of that unit.
3. Since the random projection method randomly assigns each keyword to r different dimensions, the truly significant keywords will consistently contribute high weights to all r of their dimensions; for each unit, terms can therefore be sorted by how many of their r dimensions are significantly high.
4. The truly significant keywords will be at the top of the sorted lists.
5. If we want the k most important keywords per unit, we take the top k terms in the sorted list.
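The five steps above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the weight matrix `W`, the term-to-dimension table `term_dims`, and the one-standard-deviation threshold `z` are all illustrative assumptions.

```python
import numpy as np

def extract_labels(W, term_dims, k=5, z=1.0):
    """Deduce the k most significant keywords per map unit.

    W         -- (n_units, m) SOM weight matrix
    term_dims -- dict: term -> the r dimensions it was randomly mapped to
    z         -- a weight is "significantly high" if it exceeds the
                 dimension's mean by more than z standard deviations
                 (the exact threshold is an assumption here)
    """
    mean = W.mean(axis=0)
    std = W.std(axis=0)
    significant = W > mean + z * std           # (n_units, m) boolean
    labels = []
    for u in range(W.shape[0]):
        # Count, per term, how many of its r dimensions are significant
        scores = {t: int(significant[u, dims].sum())
                  for t, dims in term_dims.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        labels.append(ranked[:k])
    return labels

# Toy example: 3 units, m = 4 dimensions, two terms mapped to r = 2 dims
W = np.array([[5.0, 0.0, 5.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
term_dims = {"alpha": [0, 2], "beta": [1, 3]}
labels = extract_labels(W, term_dims, k=1)
```

Only "alpha" has both of its dimensions significantly high in the first unit, so it surfaces as that unit's label.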
Extracting meaningful labels (cont.)
The keyword extraction technique by Lagus directly computes the relative frequencies of occurrence of all words in all the documents assigned to a given unit in the map. A goodness measure G is used to rank the words by how meaningfully they represent a given unit; when comparing a word's frequency in a unit against its frequencies elsewhere on the map, units in a "neutral zone" around the unit under consideration are excluded.
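One common form of such a goodness measure can be sketched as follows. The exact formula used by Lagus is not reproduced on this slide, so the form below (the word's relative frequency in the unit, weighted by its share of the word's frequency across non-neutral-zone units) is an assumption, as are all names.

```python
def goodness(word, unit, rel_freq, neutral_zone):
    """Goodness G of `word` as a label for `unit`.

    rel_freq     -- dict: (unit, word) -> relative frequency of the word
                    among documents assigned to that unit
    neutral_zone -- set of units adjacent to `unit`, excluded from the
                    denominator so that near-identical neighboring units
                    do not penalize the word
    Assumed form: G = F_u(w) * F_u(w) / sum over non-zone units of F(w).
    """
    f_here = rel_freq.get((unit, word), 0.0)
    f_total = sum(f for (u, w), f in rel_freq.items()
                  if w == word and u not in neutral_zone)
    return f_here * f_here / f_total if f_total else 0.0

# Toy example: the word "oil" across three units, unit 1 in the zone
rel_freq = {(0, "oil"): 0.5, (1, "oil"): 0.25, (2, "oil"): 0.25}
g = goodness("oil", 0, rel_freq, neutral_zone={1})
```

Ranking all words of a unit by G and taking the top few yields that unit's labels under this scheme.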
WEBSOM text archive for CNN
The dataset has a total of 15,194 documents. There are 33,795 unique keywords that appear in at least one of the 7,597 training documents, and 43,314 unique keywords appearing across both the training and test datasets. Each news document is assigned a class depending on where the news document originated.
WEBSOM text archive for CNN (cont.)
Automatically extracted keywords are based on the relative frequencies of the words used in the documents, while manually assigned labels may be based on something else (in this case, the origin of the news).
Reuters WEBSOM text archive
The Reuters-21578 collection contains 21,578 news documents. This text collection is significantly different from the CNN collection in that each document can be assigned multiple labels. The labels are also much more descriptive of the news stories than the news-origin labels of the CNN collection. Rather than applying the WEBSOM-based archiving methodology to the entire Reuters-21578 collection, only a more statistically meaningful subset of the original collection was used, in which the classes are more evenly distributed and each class has a reasonably large number of documents: only classes with at least 100 and at most 1,000 documents were retained.
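The class-filtering step can be sketched as follows; the data layout (a list of document-label pairs, one pair per label for multi-label documents) is an assumption for illustration.

```python
from collections import Counter

def select_classes(doc_labels, lo=100, hi=1000):
    """Keep only classes with at least `lo` and at most `hi` documents.

    doc_labels -- list of (doc_id, class) pairs; a multi-label document
                  contributes one pair per label
    """
    counts = Counter(c for _, c in doc_labels)
    kept = {c for c, n in counts.items() if lo <= n <= hi}
    return [(d, c) for d, c in doc_labels if c in kept]

# Toy example with a lower threshold: class "b" is too rare and is dropped
pairs = [("d1", "a"), ("d2", "a"), ("d3", "a"), ("d4", "b")]
subset = select_classes(pairs, lo=2, hi=5)
```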
Reuters WEBSOM text archive (cont.)
The G measure of the WEBSOM methodology was implemented and the top keywords were extracted based on this measure. Our method extracts essentially the same keywords, just by inspecting the weights and exploiting the structure of the random projection matrix, as the Lagus method extracts by examining all the words of all the documents of each unit.
Conclusion
This paper describes how the most important keywords of each unit in WEBSOM text archives can be deduced. A high percentage of the keywords we extract match the top keywords extracted for the same units using the WEBSOM method.
My opinion
Advantage: the random projection method reduces the dimensionality and speeds up computation.
Disadvantage: …
Application: the random projection method.