A Neural Network Approach to Topic Spotting
Presented by: Loulwah AlSumait
INFS 795 Special Topics in Data Mining
Article Information
- Published in Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995
- Authors: Wiener, E., Pedersen, J.O., and Weigend, A.S.
- 54 citations
Summary
- Introduction
- Related Work
- The Corpus
- Representation
  - Term Selection
  - Latent Semantic Indexing: Generic LSI; Local LSI (Cluster-Directed LSI, Topic-Directed LSI)
  - Relevancy Weighting LSI
- Neural Network Classifier
  - Neural Networks for Topic Spotting
  - Linear vs. Non-Linear Networks
  - Flat Architecture vs. Modular Architecture
- Experiment Results
  - Evaluating Performance
  - Results & Discussion
Introduction
- Topic Spotting = Text Categorization = Text Classification: the problem of identifying which of a set of predefined topics are present in a natural language document.
[Diagram: a document mapped to the topics it contains (Topic 1, Topic 2, ..., Topic n).]
Introduction – Classification Approaches
- Expert system approach: manually construct a system of inference rules on top of a large body of linguistic and domain knowledge
  - can be extremely accurate
  - very time consuming
  - brittle to changes in the data environment
- Data-driven approach: induce a set of rules from a corpus of labeled training documents
  - works better in practice
Introduction – Related Work
Major observations regarding the related work:
- A separate classifier was constructed for each topic.
- A different set of terms was used to train each classifier.
Introduction – The Corpus
- Reuters corpus of Reuters newswire stories from 1987
  - 21,450 stories: 9,610 for training, 3,662 for testing
  - mean length: 90.6 words (SD 91.6)
- 92 topics appear at least once in the training set; the mean is 1.24 topics/doc (up to 14 topics for some documents)
- 11,161 unique terms after preprocessing: inflectional stemming, stop word removal, conversion to lower case, and elimination of words appearing in fewer than three documents (a sketch of this pipeline follows)
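A minimal sketch of the preprocessing pipeline described above. NLTK's Porter stemmer stands in for the paper's inflectional stemmer, and `raw_docs` is a hypothetical list of story strings; this is an illustration of the steps, not the authors' exact code.

    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def preprocess(raw_docs, min_df=3):
        stemmer = PorterStemmer()
        stops = set(stopwords.words("english"))
        docs = []
        for text in raw_docs:
            # Lower-case, drop stop words, stem the rest.
            tokens = [stemmer.stem(w) for w in text.lower().split()
                      if w.isalpha() and w not in stops]
            docs.append(tokens)
        # Eliminate terms that appear in fewer than min_df documents.
        df = Counter(t for toks in docs for t in set(toks))
        vocab = {t for t, n in df.items() if n >= min_df}
        return [[t for t in toks if t in vocab] for toks in docs], sorted(vocab)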
Representation
Starting point – Document Profile: a term-by-document matrix containing word frequency entries.
[Example: a document represented as normalized term frequencies (e.g. 3/33, 1/33, 2/33), after Thorsten Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features".]
Representation – Term Selection
- Goal: the subset of the original terms that is most useful for the classification task.
- It is difficult to select terms that discriminate between 92 classes while remaining few enough to serve as the feature set for a neural network.
- Solution: divide the problem into 92 independent classification tasks, and search for the terms that best discriminate between documents with the topic and those without.
Representation – Term Selection
- The Relevancy Score measures how unbalanced a term is across documents with and without the topic:
    RS(k, t) = log( (no. of docs w/ topic t that contain term k / total no. of docs w/ topic t)
                  / (no. of docs w/o topic t that contain term k / total no. of docs w/o topic t) )
- Highly positive and highly negative scores indicate terms that are useful for discrimination.
- Using about 20 terms yielded the best classification performance.
Representation – Term Selection (TERMS)
- Advantages:
  - little computation is required
  - the resulting features are directly interpretable
- Drawbacks:
  - many of the best individual predictors contain redundant information
  - a term that appears to be a very poor predictor on its own may turn out to have great discriminative power in combination with other terms, and vice versa (e.g. "Apple" vs. "Apple Computers")
- Selected Term Representation (TERMS) with 20 features; a selection sketch follows.
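A sketch of the per-topic term selection under the relevancy score above. The smoothing constant `d` (to avoid log of zero) and the representation of documents as collections of terms are assumptions, not the paper's exact recipe.

    import math

    def relevancy_score(term, topic_docs, other_docs, d=0.01):
        # Fraction of documents containing the term, with and without the topic.
        p_with = sum(term in doc for doc in topic_docs) / len(topic_docs)
        p_without = sum(term in doc for doc in other_docs) / len(other_docs)
        return math.log((p_with + d) / (p_without + d))

    def select_terms(vocab, topic_docs, other_docs, n_terms=20):
        # Highly positive and highly negative scores both discriminate,
        # so rank terms by the absolute value of the score.
        scored = sorted(vocab,
                        key=lambda t: abs(relevancy_score(t, topic_docs, other_docs)),
                        reverse=True)
        return scored[:n_terms]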
Representation – LSI
- Transform the original documents to a lower-dimensional space by analyzing the correlational structure of terms in the document collection:
  - Training set: apply a singular value decomposition (SVD) to the original term-by-document matrix to obtain U, Σ, V.
  - Test set: transform document vectors by projecting them into the LSI space.
- Property of LSI: higher dimensions capture less of the variance of the original data, so they can be dropped with minimal loss.
- Found: performance continues to improve up to at least 250 dimensions, but the improvement slows rapidly after about 100 dimensions.
- Generic LSI Representation (LSI) with 200 features; a minimal sketch follows the diagram below.
[Diagram: a single SVD over the full Reuters corpus (topics such as wool, barley, wheat, money-supply, zinc, gold) yields the Generic LSI representation with 200 features.]
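A minimal generic-LSI sketch with NumPy, assuming a terms × documents matrix `X_train`; test documents are folded into the LSI space via the standard projection.

    import numpy as np

    def fit_lsi(X_train, k=200):
        # SVD of the training term-by-document matrix.
        U, s, Vt = np.linalg.svd(X_train, full_matrices=False)
        return U[:, :k], s[:k]              # keep the top-k dimensions

    def project(X, U_k, s_k):
        # Project document vectors (columns of X) into the k-dim LSI space.
        return np.diag(1.0 / s_k) @ U_k.T @ X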
Representation – Local LSI
- Global LSI performs worse as topic frequency decreases: infrequent topics are usually indicated by infrequent terms, and infrequent terms may be projected out of the LSI space as mere noise.
- The authors propose two task-directed methods that make use of prior knowledge of the classification task.
Representation – Local LSI
What is Local LSI?
- Model only the local portion of the corpus related to the topics of interest: documents that use terminology related to the topics (they do not necessarily have any of the topics assigned).
- Perform the SVD over only this local set of documents:
  - the representation is more sensitive to small, localized effects of infrequent terms
  - the representation is more effective for classifying topics related to that local structure.
Representation – Local LSI
Types of Local LSI: Cluster-Directed representation
- 5 meta-topics (clusters): Agriculture, Energy, Foreign Exchange, Government, and Metals.
- How to construct the local regions? Break the corpus into 5 clusters, each containing all documents on the corresponding meta-topic, and perform an SVD for each meta-topic region.
- Cluster-Directed LSI Representation (CD/LSI) with 200 features.
[Diagram: the Reuters corpus is partitioned into the five meta-topic regions (Agriculture, Energy, Foreign Exchange, Government, Metals) and a separate SVD is performed on each, yielding the Cluster-Directed LSI representation (CD/LSI) with 200 features; a per-cluster sketch follows.]
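Cluster-directed LSI as sketched above: one SVD per meta-topic region, reusing `fit_lsi` from the generic-LSI sketch. The mapping `meta_topic_of` (document index to meta-topic label) is a hypothetical input, and a cluster with fewer than k documents would need a smaller k.

    def fit_cd_lsi(X, meta_topic_of, k=200):
        models = {}
        for meta in set(meta_topic_of.values()):
            # Columns of X belonging to this meta-topic region.
            cols = [j for j, m in meta_topic_of.items() if m == meta]
            models[meta] = fit_lsi(X[:, cols], k)   # one SVD per region
        return models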
Representation – Local LSI
Types of Local LSI: Topic-Directed representation
- A more fine-grained approach to local LSI: a separate representation for each topic.
- How to construct the local region?
  - Use the 100 most predictive terms for the topic.
  - Pick the N most similar documents, where N = 5 × (no. of documents containing the topic), bounded so that 110 ≤ N ≤ 350.
  - Final documents in the topic region = these N documents plus a set of random documents.
- Topic-Directed LSI Representation (TD/LSI) with 200 features.
[Diagram: for each topic, an SVD is performed over its local region of the Reuters corpus, yielding the Topic-Directed LSI representation (TD/LSI) with 200 features; a region-construction sketch follows.]
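A sketch of building a topic-directed local region. The similarity measure (total mass of the predictive terms in a document) and the size of the random padding are assumptions for illustration; the slide only gives the term count, the formula for N, and its bounds.

    import numpy as np

    def topic_region(X, vocab, top_terms, n_topic_docs, rng, lo=110, hi=350):
        n = int(np.clip(5 * n_topic_docs, lo, hi))       # 110 <= N <= 350
        rows = [vocab.index(t) for t in top_terms]       # rows of the 100 predictive terms
        sims = np.asarray(X[rows, :].sum(axis=0)).ravel()  # crude similarity score per doc
        best = np.argsort(sims)[::-1][:n]                # the N most similar documents
        extra = rng.choice(X.shape[1], size=n // 5, replace=False)  # random docs (size assumed)
        return np.unique(np.concatenate([best, extra]))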
Representation – Local LSI
Drawbacks of Local LSI:
- The narrower the region, the lower the flexibility of the representation for modeling the classification of multiple topics.
- High computational overhead.
Representation – Relevancy Weighting LSI
- Use term weights to emphasize the importance of particular terms before applying the SVD.
- IDF weighting raises the importance of low-frequency terms and lowers the importance of high-frequency terms: it assumes low-frequency terms are better discriminators than high-frequency terms.
Representation – Relevancy Weighting LSI
- Relevancy weighting tunes the IDF assumption: emphasize terms in proportion to their estimated topic discrimination power, the Global Relevancy Weighting of term k (GRW_k).
- Final weighting of term k = IDF_k² × GRW_k:
  - all low-frequency terms are pulled up by IDF
  - poor predictors are pushed down by GRW, leaving only the relevant low-frequency terms with high weights.
- Relevancy Weighted LSI Representation (REL/LSI) with 200 features; a weighting sketch follows.
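A hedged sketch of applying the relevancy weighting before the SVD. The per-term vectors `idf` and `grw` are hypothetical inputs (the paper's exact GRW formula is not reproduced on the slide), and the IDF² form follows the slide's reading.

    def relevancy_weights(idf, grw):
        # Final weighting of term k = IDF_k^2 * GRW_k, as on the slide.
        return idf ** 2 * grw

    def weight_matrix(X, w):
        # Scale each term row of the term-by-document matrix before the SVD.
        return X * w[:, None]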
Neural Network Classifier (NN)
A NN consists of:
- processing units (neurons)
- weighted links connecting the neurons
Neural Network Classifier (NN)
Major components of a NN model:
- architecture: defines the functional form relating input to output
  - network topology
  - unit connectivity
  - activation functions, e.g. the logistic regression function
Logistic regression function
    p = 1 / (1 + e^(-z)), where z is a linear combination of the input features
- p ∈ (0, 1), so the unit can be converted into a binary classification method by thresholding the output probability.
Neural Network Classifier (NN)
Major components of a NN model (cont.):
- search algorithm: the search in weight space for a set of weights that minimizes the error between the actual output and the expected output (the TRAINING PROCESS)
  - backpropagation method
  - error functions: mean squared error, or the cross-entropy performance function
        C = - sum over all cases and outputs of ( d*log(y) + (1-d)*log(1-y) )
    where d is the desired output and y the actual output (a training sketch follows).
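A minimal sketch of training one linear logistic output unit by gradient descent on the cross-entropy error above (the flat/linear case); the learning rate and epoch count are arbitrary illustrative choices.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, d, lr=0.1, epochs=500):
        # X: documents x features; d: 0/1 topic labels.
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            y = sigmoid(X @ w + b)
            # Gradient of C = -sum(d*log y + (1-d)*log(1-y)) is X^T (y - d).
            w -= lr * X.T @ (y - d) / len(d)
            b -= lr * np.mean(y - d)
        return w, b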
NN for Topic Spotting
- Network outputs are estimates of the probability of topic presence given the feature vector of a document.
- Generic LSI representation: each network uses the same representation.
- Local LSI representation: a different representation for each network.
NN for Topic Spotting – Linear NN
- Output units with logistic activation and no hidden layer.
[Diagram: input features 1, 2, ..., n feeding directly into the logistic output units.]
NN for Topic Spotting – Non-Linear NN
- Simple networks with a single hidden layer of 6 to 15 logistic sigmoid units.
NN for Topic Spotting – Flat Architecture
- A separate network for each topic; the entire training set is used to train each topic's network.
- Overfitting is avoided by adding a penalty term to the cross-entropy cost function, to encourage elimination of small weights (see the sketch below), and by early stopping based on cross-validation.
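A sketch of one penalty of this kind: the weight-elimination form associated with Weigend. The slide does not give the paper's exact penalty, and the constants `lam` and `w0` are assumptions.

    import numpy as np

    def weight_elimination_penalty(w, lam=1e-3, w0=1.0):
        r = (w / w0) ** 2
        # Saturating penalty: large weights pay a near-constant cost, while
        # small weights are driven toward zero and effectively eliminated.
        return lam * np.sum(r / (1.0 + r))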
NN for Topic Spotting – Modular Architecture
- Decompose the learning problem into smaller problems.
- Meta-Topic Network:
  - trained on the full training set
  - estimates the presence probability of each of the five meta-topics in a document
  - uses 15 hidden units
NN for Topic Spotting – Modular Architecture
- Five groups of local topic networks: each group consists of the local topic networks for the topics in one meta-topic, and each network is trained only on its meta-topic region.
NN for Topic Spotting – Modular Architecture
- Five groups of local topic networks (cont.). Example: the wheat network is trained on the Agriculture meta-topic region, so it can
  - focus on finer distinctions, e.g. wheat vs. grain
  - avoid wasting time on easier distinctions, e.g. wheat vs. gold.
- Each local topic network uses 6 hidden units.
NN for Topic Spotting – Modular Architecture
To compute topic predictions for a given document (see the sketch below):
- Present the document to the meta-topic network.
- Present the document to each of the topic networks.
- Output of the meta-topic network × estimate of the topic network = final topic estimate.
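A small sketch of this prediction step, reading the slide's combination of the two outputs as multiplication (the meta-topic probability gates the local estimate). The callables `meta_net` and the entries of `topic_nets`, plus the mapping `topic_to_meta`, are hypothetical.

    def modular_predict(doc, meta_net, topic_nets, topic_to_meta):
        # meta_net(doc) -> {meta_topic: probability};
        # topic_nets maps each topic to its local network.
        meta_p = meta_net(doc)
        return {t: meta_p[topic_to_meta[t]] * net(doc)
                for t, net in topic_nets.items()}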
Experimental Results – Evaluating Performance
- The mean squared error between actual and predicted values is a poor measure of topic-spotting performance.
- Instead, compute precision and recall from contingency tables constructed over a range of decision thresholds.
- How are the decision thresholds chosen?
Experimental Results – Evaluating Performance
How to get the decision thresholds? 1) Proportional assignment (topic = 'wool' vs. topic ≠ 'wool'):
- Predict topic = 'wool' iff the output probability ≥ the output probability of the (k·p)'th highest-ranked document, where k is an integer and p is the prior probability of the 'wool' topic.
- Predict topic ≠ 'wool' iff the output probability is below that threshold.
Experimental Results – Evaluating Performance
How to get the decision thresholds? 2) Fixed recall level approach:
- Determine a set of recall levels, then analyze the ranked documents to determine which decision thresholds lead to the desired recall levels.
- For topic = 'wool': predict topic = 'wool' iff the output probability ≥ the output probability of the document at which the number of higher-ranked 'wool' documents reaches the target recall; predict topic ≠ 'wool' iff the output probability is below that threshold (see the sketch below).
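A sketch of this fixed-recall thresholding: rank the documents by output probability and pick the threshold at which the desired fraction of true 'wool' documents has been recovered. The array names are illustrative.

    import numpy as np

    def threshold_for_recall(probs, labels, target_recall):
        order = np.argsort(probs)[::-1]        # rank documents by output probability
        hits = np.cumsum(labels[order])        # true topic documents recovered so far
        recall = hits / labels.sum()
        # First rank at which the running recall reaches the target.
        cut = min(np.searchsorted(recall, target_recall), len(probs) - 1)
        return probs[order][cut]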
Experimental Results – Performance by Microaveraging
- Add all contingency tables together across topics at a given threshold, then compute precision and recall.
- Uses proportional assignment for picking the decision thresholds.
- Does not weight the topics evenly.
- Used for comparisons to previously reported results; the breakeven point is used as a summary value.
Experimental Results – Performance by Macroaveraging
- Compute precision and recall for each topic, then take the average across topics.
- Uses the fixed set of recall levels.
- Summary values are obtained for particular topics by averaging precision over the 19 evenly spaced recall levels between 0.05 and 0.95 (both averaging styles are sketched below).
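A sketch contrasting the two averaging styles, assuming each per-topic contingency table is reduced to a (true_positive, false_positive, false_negative) triple.

    def micro_average(tables):
        # Pool the counts across topics, then compute precision and recall once.
        tp = sum(t[0] for t in tables)
        fp = sum(t[1] for t in tables)
        fn = sum(t[2] for t in tables)
        return tp / (tp + fp), tp / (tp + fn)

    def macro_average(tables):
        # Compute precision and recall per topic, then average across topics.
        precisions = [tp / (tp + fp) for tp, fp, fn in tables if tp + fp]
        recalls    = [tp / (tp + fn) for tp, fp, fn in tables if tp + fn]
        return sum(precisions) / len(precisions), sum(recalls) / len(recalls)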
Experimental Results – Microaveraged performance
- Breakeven points are compared to the best previous algorithm: a rule induction method based on heuristic search, with a breakeven point of 0.789.
Experimental Results – Macroaveraged performance
- TERMS appears much closer to the other three representations.
- The relative effectiveness of the representations at low recall levels is reversed at high recall levels.
- Nonlinear networks give a slight improvement.
- LSI performance degrades relative to TERMS as topic frequency f_t decreases.
- Performance of the six techniques on the 54 most frequent topics: considerable variation of performance across topics, with the relative ups and downs mirrored in both plots.
Experimental Results – Performance of Combinations of Techniques and Their Improvement
Each experiment pairs a document representation with a NN architecture (in the modular case, the meta-topic network is trained using the LSI representation):

    Representation              | Flat: Linear | Flat: Non-Linear | Modular: Linear | Modular: Non-Linear
    TERMS                       |
    LSI                         |
    CD-LSI                      |
    TD-LSI                      |
    REL-LSI                     |
    Hybrid (CD-LSI + TERMS)     |
Experimental Results – Flat Networks
[Plot: average precision of the flat networks across representations.]
Experimental Results – Modular Networks
- Only 4 of the clusters were used.
- Average precision for the flat networks was recomputed for comparison.
- Non-linear networks seem to perform better than the linear models, but the difference is very slight.
- The LSI representation is able to equal or exceed TERMS performance for high-frequency topics, but performs poorly for low-frequency topics.
- Task-directed LSI representations improve performance in the low-frequency domain.
  - TD/LSI trade-off: high computational cost.
  - REL/LSI trade-off: lower performance on medium- and high-frequency topics.
- Modular CD/LSI improves performance further for low-frequency topics, because the individual networks are trained only on the domain over which the LSI was performed.
- TERMS proves to be competitive with the more sophisticated LSI techniques: most topics are predictable from a small set of terms.
Discussion
- A rich solution: many representations and many models.
- A fully supervised approach.
- Results are lower than expected. Is the dataset responsible?
- High computational overhead.
- Does the NN deserve a place in DM tool boxes?
Questions?