A Neural Network Approach to Topic Spotting Presented by: Loulwah AlSumait INFS 795 Spec. Topics in Data Mining 4.14.2005.


Article Information. Published in: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995. Authors: Wiener, E., Pedersen, J.O., Weigend, A.S. 54 citations.

Summary: Introduction; Related Work; The Corpus; Representation (Term Selection; Latent Semantic Indexing: Generic LSI, Local LSI (Cluster-Directed LSI, Topic-Directed LSI); Relevancy Weighting LSI).

Summary (cont.): Neural Network Classifier; Neural Networks for Topic Spotting (Linear vs. Non-Linear Networks; Flat Architecture vs. Modular Architecture); Experimental Results (Evaluating Performance; Results & Discussion).

Introduction. Topic Spotting = Text Categorization = Text Classification: the problem of identifying which of a set of predefined topics are present in a natural language document. (Diagram: a document mapped to Topic 1, Topic 2, ..., Topic n.)

Introduction. Classification Approaches. Expert system approach: manually construct a system of inference rules on top of a large body of linguistic and domain knowledge; can be extremely accurate, but is very time consuming and brittle to changes in the data environment. Data-driven approach: induce a set of rules from a corpus of labeled training documents; works better in practice.

Introduction – Related Work. The major observations from the related work: a separate classifier was constructed for each topic, and a different set of terms was used to train each classifier.

Introduction – The Corpus. Reuters corpus of Reuters newswire stories from 1987: 21,450 stories; 9,610 for training; 3,662 for testing; mean length 90.6 words (SD 91.6). 92 topics appear at least once in the training set; the mean is 1.24 topics per document (up to 14 topics for some documents). 11,161 unique terms remain after preprocessing: inflectional stemming, stop-word removal, conversion to lower case, and elimination of words that appear in fewer than three documents.

Representation starting point: the Document Profile, a term-by-document matrix containing word-frequency entries.
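A minimal sketch of building such a term-by-document frequency matrix with scikit-learn; the documents, preprocessing choices, and variable names here are illustrative, not taken from the paper:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["wheat prices rose sharply", "gold and zinc futures eased"]   # placeholder documents
vectorizer = CountVectorizer(lowercase=True, stop_words="english", min_df=1)
X = vectorizer.fit_transform(docs)           # documents x terms, raw word counts
terms = vectorizer.get_feature_names_out()   # the extracted vocabulary
profile = X.T                                # term-by-document matrix, as on the slide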

Representation. (Figure: an example document represented as a vector of term frequencies, e.g. 3/33, 1/33, 2/33; example taken from Thorsten Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features".)

Representation - Term Selection. Select the subset of the original terms that is most useful for the classification task. It is difficult to select terms that discriminate between 92 classes while remaining few enough to serve as the feature set for a neural network, so: divide the problem into 92 independent classification tasks, and for each topic search for the terms that best discriminate between documents with the topic and those without.

Representation - Term Selection. Relevancy Score: measures how unbalanced a term is across documents with and without the topic, comparing the fraction of on-topic documents that contain term k (number of documents with topic t that contain term k / total number of documents with topic t) against the corresponding fraction for off-topic documents. Highly positive and highly negative scores indicate terms that are useful for discrimination. Using about 20 terms yielded the best classification performance.
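A hedged sketch of one plausible relevancy score consistent with the slide's description: the log-ratio of the on-topic to off-topic document fractions for term k. The exact formula and smoothing constant used in the paper may differ; eps is an assumption here.

import numpy as np

def relevancy_score(term_in_doc, topic_labels, eps=1e-4):
    """term_in_doc: boolean array, doc contains term k; topic_labels: boolean array, doc has topic t."""
    p_on  = (term_in_doc[topic_labels].sum() + eps) / (topic_labels.sum() + eps)      # P(term k | topic t)
    p_off = (term_in_doc[~topic_labels].sum() + eps) / ((~topic_labels).sum() + eps)  # P(term k | not topic t)
    return np.log(p_on / p_off)   # highly positive or highly negative => useful discriminator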

Representation - Term Selection

Representation - Term Selection. Advantages: little computation is required, and the resulting features have direct interpretability. Drawbacks: many of the best individual predictors contain redundant information, and a term that may appear to be a very poor predictor on its own may turn out to have great discriminative power in combination with other terms, and vice versa (e.g. Apple vs. Apple Computers). Result: the Selected Term Representation (TERMS) with 20 features.

Representation – LSI. Transform the original documents to a lower-dimensional space by analyzing the correlational structure of terms in the document collection. Training set: apply a singular-value decomposition (SVD) to the original term-by-document matrix to obtain U, Σ, V. Test set: transform document vectors by projecting them into the LSI space. Property of LSI: higher dimensions capture less of the variance of the original data, so they can be dropped with minimal loss. Found: performance continues to improve up to at least 250 dimensions, but the improvement slows rapidly after about 100 dimensions. Result: the Generic LSI Representation (LSI) with 200 features.
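A minimal LSI sketch with numpy: SVD of the training term-by-document matrix, keep the top 200 dimensions, and fold test documents into the same space. The matrix here is a small random placeholder and the fold-in formula is the standard one, not necessarily the paper's exact implementation:

import numpy as np

A = np.random.rand(500, 200)                # placeholder term-by-document matrix (real size: 11,161 x 9,610)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 200                                     # number of LSI dimensions kept (200 in the paper's representation)
U_k, s_k = U[:, :k], s[:k]

train_lsi = (np.diag(s_k) @ Vt[:k, :]).T    # training documents in the k-dimensional LSI space
def project(doc_vec):                       # fold a test document vector (length = number of terms) into LSI space
    return doc_vec @ U_k / s_k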

Representation – LSI. (Diagram: SVD applied to the whole Reuters corpus (example topics: Wool, Barley, Wheat, Money-supply, Zinc, Gold), yielding the Generic LSI Representation with 200 features.)

Representation – Local LSI. Global LSI performs worse as topic frequency decreases: infrequent topics are usually indicated by infrequent terms, and infrequent terms may be projected out by LSI and treated as mere noise. Two task-directed methods are proposed that make use of prior knowledge of the classification task.

Representation – Local LSI. What is Local LSI? Model only the local portion of the corpus related to the topics of interest: include documents that use terminology related to those topics (not necessarily having any of the topics assigned), and perform SVD over only this local set of documents. The representation is more sensitive to small, localized effects of infrequent terms, and more effective for classification of topics related to that local structure.

Representation – Local LSI. Type of Local LSI: the Cluster-Directed representation. Five meta-topics (clusters): Agriculture, Energy, Foreign Exchange, Government, and Metals. How to construct the local regions: break the corpus into 5 clusters, each containing all documents on the corresponding meta-topic, and perform SVD for each meta-topic region. Result: the Cluster-Directed LSI Representation (CD/LSI) with 200 features.

Representation – Local LSI. (Diagram: the Reuters corpus (example topics: Wool, Barley, Wheat, Money-supply, Zinc, Gold) is broken into the five meta-topic clusters Agriculture, Energy, Foreign Exchange, Government, and Metal, and SVD is performed separately on each cluster, yielding the Cluster-Directed LSI Representation (CD/LSI) with 200 features.)

Representation – Local LSI. Types of Local LSI: the Topic-Directed representation, a more fine-grained approach to local LSI with a separate representation for each topic. How to construct the local region: use the 100 most predictive terms for the topic; pick the N most similar documents, where N = 5 * (number of documents containing the topic), bounded by 110 <= N <= 350; the final documents in the topic region are these N documents plus a set of random documents. Result: the Topic-Directed LSI Representation (TD/LSI) with 200 features.
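A sketch of how the topic-directed local region might be assembled. The similarity measure, the number of extra random documents, and all names here are assumptions for illustration, not the paper's exact procedure:

import numpy as np

def topic_region(X, topic_labels, term_scores, n_terms=100, rng=np.random.default_rng(0)):
    """X: docs x terms count matrix; topic_labels: boolean per doc; term_scores: relevancy score per term."""
    top_terms = np.argsort(np.abs(term_scores))[-n_terms:]        # the 100 most predictive terms
    sim = np.asarray(X[:, top_terms].sum(axis=1)).ravel()         # crude similarity: mass on those terms
    N = int(np.clip(5 * topic_labels.sum(), 110, 350))            # N = 5 * #docs with topic, kept in [110, 350]
    local = np.argsort(sim)[-N:]                                  # the N most similar documents
    extra = rng.choice(X.shape[0], size=N // 2, replace=False)    # plus some random documents (count is a guess)
    return np.union1d(local, extra)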

Representation – Local LSI. (Diagram: for each topic, SVD is performed over only its local region of the Reuters corpus (example topics: Wool, Barley, Wheat, Money-supply, Zinc, Gold), yielding the Topic-Directed LSI Representation (TD/LSI) with 200 features.)

Representation – Local LSI. Drawbacks of Local LSI: the narrower the region, the lower the flexibility of the representation for modeling the classification of multiple topics; and the computational overhead is high.

Representation - Relevancy Weighting LSI. Use term weights to emphasize the importance of particular terms before applying SVD. IDF weighting increases the importance of low-frequency terms and decreases the importance of high-frequency terms; it assumes low-frequency terms to be better discriminators than high-frequency terms.

Representation - Relevancy Weighting LSI. Relevancy Weighting tunes the IDF assumption: it emphasizes terms in proportion to their estimated topic-discrimination power via a Global Relevancy Weighting of term k (GRW_k). Final weight of term k = IDF_k^2 * GRW_k. All low-frequency terms are pulled up by IDF, and poor predictors are pushed down, leaving only the relevant low-frequency terms with high weights. Result: the Relevancy Weighted LSI Representation (REL/LSI) with 200 features.
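A sketch of relevancy weighting applied to the term-by-document matrix before SVD. The slide does not give the exact definition of GRW_k, so the aggregation of per-topic relevancy scores below is an assumption, as are the function and variable names:

import numpy as np

def relevancy_weighted_matrix(A, rel_scores):
    """A: term-by-document count matrix; rel_scores: terms x topics matrix of relevancy scores."""
    n_docs = A.shape[1]
    df = (A > 0).sum(axis=1)                        # document frequency of each term
    idf = np.log(n_docs / np.maximum(df, 1))        # IDF boosts low-frequency terms
    grw = np.abs(rel_scores).max(axis=1)            # assumed GRW_k: best discrimination power across topics
    return A * (idf ** 2 * grw)[:, None]            # weight of term k = IDF_k^2 * GRW_k, applied before SVD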

Neural Network Classifier (NN). A NN consists of processing units (neurons) and weighted links connecting the neurons.

Neural Network Classifier (NN). Major components of a NN model: the architecture, which defines the functional form relating input to output (network topology, unit connectivity, and activation functions, e.g. the logistic regression function).

Logistic regression function: p = 1 / (1 + e^(-z)), where z is a linear combination of the input features and p lies in (0,1). This can be converted into a binary classification method by thresholding the output probability.

Neural Network Classifier (NN). Major components of a NN model (cont.): the search algorithm, i.e. the search in weight space for a set of weights that minimizes the error between the actual and expected outputs (the training process), e.g. the backpropagation method. Error functions: mean squared error, or the cross-entropy error performance function C = - sum over all cases and outputs of [ d*log(y) + (1-d)*log(1-y) ], where d is the desired output and y the actual output.
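A minimal numpy version of the cross-entropy error above; this is a sketch, and the variable names and the clipping constant are illustrative:

import numpy as np

def cross_entropy(d, y, eps=1e-12):
    """d: desired outputs in {0,1}; y: network outputs in (0,1); summed over all cases and outputs."""
    y = np.clip(y, eps, 1 - eps)                    # guard against log(0)
    return -np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))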

NN for Topic Spotting. Network outputs are estimates of the probability of topic presence given the feature vector of a document. With the generic LSI representation, each network uses the same representation; with local LSI representations, there is a different representation for each network.

NN for Topic Spotting. Linear NN: output units with logistic activation and no hidden layer. (Diagram: output units 1, 2, ..., n.)

Non-Linear NN: simple networks with a single hidden layer of 6 to 15 logistic sigmoid units.
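A minimal per-topic classifier sketch with scikit-learn, mirroring the slide's single hidden layer of logistic units. The hidden-layer size, solver defaults, early-stopping flag, and placeholder data are illustrative choices, not the paper's training setup:

import numpy as np
from sklearn.neural_network import MLPClassifier

X_lsi = np.random.rand(200, 200)                    # placeholder: 200 documents x 200 LSI features
y_topic = np.random.randint(0, 2, size=200)         # placeholder binary labels for one topic

clf = MLPClassifier(hidden_layer_sizes=(10,),       # one hidden layer; the paper uses 6-15 logistic units
                    activation="logistic",
                    early_stopping=True,            # stand-in for the paper's cross-validation early stopping
                    max_iter=500)
clf.fit(X_lsi, y_topic)
p_topic = clf.predict_proba(X_lsi)[:, 1]            # estimated probability of topic presence per document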

NN for Topic Spotting  Flat Architecture Separate network for each topic use entire training set to train for each topic Avoiding overfitting problem by  adding penalty term to the cross-entropy cost function to encourage elimination of small weights.  Early stopping based on cross-validation

NN for Topic Spotting  Modular Architecture decompose learning problem into smaller problems Meta-Topic Network trained on full training set  estimate the presence probability of the five topics in doc.  use 15 hidden units

NN for Topic Spotting  Modular Architecture five groups of local topic networks  consists of local topic networks for each topic in meta-topic  each network trained only on the meta-topic region

NN for Topic Spotting  Modular Architecture five groups of local topic networks (cont.)  Example: wheat network trained Agriculture meta-topic.  Focus on finer distinctions, e.g. wheat and grain  Don’t waste time on easier distinctions, e.g. wheat and gold.  Each local topic networks uses 6 hidden units.

NN for Topic Spotting  Modular Architecture To compute topic predictions for a given document  Present document to meta-topic network  Present document to each of the topic networks  Outputs of meta-topic network  estimate of topic networks = final topic estimates

Experimental Results Evaluating Performance  Mean squared error between actual and predicted values is inefficient  Compute precision and recall based on contingency table constructed over range of decision thresholds  How to get the decision Thresholds?

Experimental Results Evaluating Performance  How to get the decision Thresholds? Proportional assignment Topic = ‘wool’ Topic  ‘wool’ Predicted Topic = ‘wool’ iff Output probability    = output probability of kp’th highest rank doc. K integer, P prior probability of “wool” topic Predicted Topic  ‘wool’, iff output probability < 

Experimental Results Evaluating Performance  How to get the decision Thresholds? fixed recall level approach  determine set of recall levels  analyze ranked documents to determine what decision thresholds lead to the desired set of recall levels. Topic = ‘wool’ Topic  ‘wool’ Predicted Topic = ‘wool’ iff Output probability    = output probability of doc. where # of doc. with higher output probability Leads to desired recall level Predicted Topic  ‘wool’, iff output probability <  Target Recall

Experimental Results. Performance by Microaveraging: add all the contingency tables together across topics at a given threshold, then compute precision and recall. Proportional assignment is used for picking the decision thresholds. Microaveraging does not weight the topics evenly; it is used for comparisons to previously reported results, and the breakeven point is used as a summary value.

Experimental Results. Performance by Macroaveraging: compute precision and recall for each topic, then take the average across topics, using a fixed set of recall levels. Summary values are obtained for particular topics by averaging precision over the 19 evenly spaced recall levels between 0.05 and 0.95.
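A sketch contrasting the two averaging schemes described on the last two slides, given per-topic contingency counts at one decision threshold; names are illustrative:

import numpy as np

def micro_average(tp, fp, fn):
    """tp, fp, fn: per-topic count arrays; tables are summed across topics before averaging."""
    return tp.sum() / (tp.sum() + fp.sum()), tp.sum() / (tp.sum() + fn.sum())

def macro_average(tp, fp, fn):
    """Precision and recall are computed per topic, then averaged, so every topic is weighted evenly."""
    return np.mean(tp / (tp + fp)), np.mean(tp / (tp + fn))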

Experimental Results. Microaveraged performance: breakeven points are compared to the best previously reported algorithm, a rule-induction method based on heuristic search with a breakeven point of 0.789.

Experimental Results. Macroaveraged performance: TERMS appears much closer to the other three representations, and the relative effectiveness of the representations at low recall levels is reversed at high recall levels.

Nonlinear networks give a slight improvement. LSI performance degrades compared to TERMS as the topic frequency f_t decreases. Performance of the six techniques on the 54 most frequent topics: there is considerable variation of performance across topics, and the relative ups and downs are mirrored in both plots.

Experimental Results. Performance of combinations of techniques and their improvement. Each experiment pairs a document representation (TERMS, LSI, CD-LSI, TD-LSI, REL-LSI, or Hybrid = CD-LSI + TERMS) with a NN architecture (Flat or Modular, each Linear or Non-Linear; the modular meta-topic network is trained using the LSI representation). (In the original slide, matching colors and shapes in the table identify which combinations were run.)

Experimental Results Flat Networks

Experimental Results. Modular Networks: only 4 clusters were used, and the average precision for the flat networks was recomputed for comparison.

Non-linear networks seem to perform better than the linear models, but the difference is very slight.

The LSI representation is able to equal or exceed TERMS performance for high-frequency topics, but performs poorly for low-frequency topics.

Task-directed LSI representations improve performance in the low-frequency domain. The TD/LSI trade-off is computational cost; the REL/LSI trade-off is lower performance on medium- and high-frequency topics.

Modular CD/LSI improves performance further for low-frequency topics, because the individual networks are trained only on the domain over which LSI was performed.

TERMS proves to be competitive with the more sophisticated LSI techniques: most topics are predictable from a small set of terms.

Discussion. A rich solution: many representations and many models. A fully supervised approach. The results are lower than expected; is the dataset responsible? High computational overhead. Do neural networks deserve a place in data-mining toolboxes? Questions?