Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006

2 Text categorization, continued
– problem setting
– machine learning approach
– example of a learning method: Rocchio

3 Text categorization: problem setting
let
– D: a collection of documents
– C = {c_1, ..., c_|C|}: a set of predefined categories
– T = true, F = false
the task is to approximate the unknown target function Φ': D × C → {T, F} by means of a function Φ: D × C → {T, F}, such that the two functions "coincide as much as possible"
– function Φ': how documents should be classified
– function Φ: the classifier (hypothesis, model, ...)
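A minimal sketch of this setting in Python (the rule inside phi is an invented placeholder, not a real classifier):

# The classifier Phi approximates the unknown target function D x C -> {T, F}.
def phi(document: str, category: str) -> bool:
    """True if `document` should be filed under `category`."""
    # Placeholder decision rule; a real Phi is learned from training examples.
    return category.lower() in document.lower()

print(phi("Nuclear medicine uses radiation to treat disease.", "medicine"))  # True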

4 Some assumptions
categories are just symbolic labels
– no additional knowledge of their meaning is available
no knowledge outside of the documents is available
– all decisions have to be made on the basis of the knowledge extracted from the documents
– metadata, e.g., publication date, document type, source etc., is not used

5 Some assumptions
methods do not depend on any application-dependent knowledge
– but: in operational ("real life") applications all kinds of knowledge can be used (e.g. in spam filtering)
note: content-based decisions are necessarily subjective
– it is often difficult to measure the effectiveness of the classifiers
– even human classifiers do not always agree

6 Variations of problem setting: single-label, multi-label text categorization
single-label text categorization
– exactly 1 category must be assigned to each d_j ∈ D
multi-label text categorization
– any number of categories may be assigned to the same d_j ∈ D

7 Variations of problem setting: single-label, multi-label text categorization
special case of single-label: binary
– each d_j must be assigned either to category c_i or to its complement ¬c_i
the binary case (and, hence, the single-label case) is more general than the multi-label case
– an algorithm for binary classification can also be used for multi-label classification
– the converse is not true

8 Variations of problem setting: single-label, multi-label text categorization
in the following, we use the binary case only:
– classification under a set of categories C = a set of |C| independent problems of classifying the documents in D under a given category c_i, for i = 1, ..., |C|
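A hedged sketch of this decomposition (the keyword rules are toy classifiers, invented for illustration): multi-label classification under C reduces to one binary decision per category.

def classify_multilabel(document, binary_classifiers):
    """Return every category whose binary classifier accepts the document."""
    return [c for c, phi_i in binary_classifiers.items() if phi_i(document)]

classifiers = {
    "medicine": lambda d: "patient" in d,
    "energy": lambda d: "nuclear power" in d or "solar" in d,
}
print(classify_multilabel("nuclear power plants and solar farms", classifiers))
# -> ['energy']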

9 Machine learning approach to text categorization
a general program (the learner) automatically builds a classifier for a category c_i by observing the characteristics of a set of documents manually classified under c_i or ¬c_i by a domain expert
from these characteristics, the learner extracts the characteristics that a new, unseen document should have in order to be classified under c_i
use of the classifier: the classifier observes the characteristics of a new document and decides whether it should be classified under c_i or ¬c_i

10 Classification process: classifier construction
[Diagram: labeled example documents (Doc 1: yes, Doc 2: no, ..., Doc n: yes) are fed to the Learner, which outputs the Classifier.]

11 Classification process: use of the classifier
[Diagram: a new, unseen document is fed to the Classifier, which outputs TRUE / FALSE.]

12 Supervised learning from examples
initial corpus of manually classified documents
– let d_j belong to the initial corpus
– for each pair (d_j, c_i) it is known whether d_j is a member of c_i
positive and negative examples of each category
– in practice: for each document, all its categories are listed
– if a document d_j has category c_i in its list, document d_j is a positive example of c_i
– negative examples for c_i: the documents that do not have c_i in their list

13 Training set and test set
the initial corpus is divided into two sets
– a training set
– a test set
the training set is used for building the classifier
the test set is used for testing the effectiveness of the classifier
– each document is fed to the classifier and the decision is compared to the manual category
– the documents in the test set are not used in the construction of the classifier
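A minimal sketch of the split (assuming the corpus is a list of (document, label) pairs; the 30% test share and the fixed seed are arbitrary choices for illustration):

import random

def train_test_split(corpus, test_fraction=0.3, seed=0):
    """Split a labeled corpus into disjoint training and test sets."""
    docs = list(corpus)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * (1 - test_fraction))
    return docs[:cut], docs[cut:]  # (training set, test set)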

14 Training set and test set
the classification process may have several implementation choices: the best combination is chosen by testing the classifier
alternative: k-fold cross-validation
– k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach on pairs where k−1 sets form the training set and the remaining set is used as the test set
– the individual results are then averaged
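A sketch of k-fold cross-validation under the same assumptions; train_and_evaluate is a hypothetical stand-in for building one classifier on a training set and measuring it on a test set:

def k_fold_cross_validation(corpus, k, train_and_evaluate):
    """Average the results of k classifiers, each tested on one held-out fold."""
    folds = [corpus[i::k] for i in range(k)]  # k disjoint subsets
    results = []
    for i in range(k):
        test_set = folds[i]
        training_set = [doc for j in range(k) if j != i for doc in folds[j]]
        results.append(train_and_evaluate(training_set, test_set))
    return sum(results) / k  # the individual results, averaged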

15 Classification process: classifier construction
[Diagram: the training set (Doc 1: yes, Doc 2: no, ..., Doc n: yes) is fed to the Learner, which outputs the Classifier.]

16 Classification process: testing the classifier
[Diagram: the test set is fed to the Classifier, whose decisions are then compared to the manual categories.]

17 Strengths of the machine learning approach
learners are domain independent
– usually available 'off-the-shelf'
the learning process is easily repeated if the set of categories changes
– only the training set has to be replaced
manually classified documents are often already available
– a manual process may exist
– if not, it is still easier to manually classify a set of documents than to build and tune a set of rules

18 Examples of learners
– Rocchio method
– probabilistic classifiers (Naïve Bayes)
– decision tree classifiers
– decision rule classifiers
– regression methods
– on-line methods
– neural networks
– example-based classifiers (k-NN)
– boosting methods
– support vector machines

19 Rocchio method
a learning method adapted from Rocchio's relevance feedback method
for each category, an explicit profile (or prototypical document) is constructed from the documents in the training set
– the same representation as for the documents
– benefit: the profile is understandable even for humans
profile = classifier for the category

20 Rocchio method
a profile of a category is a vector of the same dimension as the documents
– in our example: 118 terms; the categories medicine, energy, and environment are each represented by a vector of 118 elements
– the weight of each element represents the importance of the respective term for the category

21 Rocchio method
weight of the k-th term of category i:

$$ w_{ki} = \frac{\beta}{|POS_i|} \sum_{d_j \in POS_i} w_{kj} \;-\; \frac{\gamma}{|NEG_i|} \sum_{d_j \in NEG_i} w_{kj} $$

– w_{kj}: weight of the k-th term of document j
– POS_i: set of positive examples (documents that are of category i)
– NEG_i: set of negative examples

22 Rocchio method
in the formula, β and γ are control parameters that are used to set the relative importance of positive and negative examples
for instance, if β = 2 and γ = 1, we do not want the negative examples to have as strong an influence as the positive examples
if β = 1 and γ = 0, the category vector is the centroid (average) vector of the positive sample documents
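The formula translates directly into code; a sketch assuming each document is a plain list of term weights (the defaults β = 2, γ = 1 match the worked example on the following slides):

def rocchio_weight(k, pos_docs, neg_docs, beta=2.0, gamma=1.0):
    """Weight of the k-th term in the profile of a category.

    pos_docs, neg_docs: document vectors (lists of term weights) of the
    positive and negative training examples of the category.
    """
    positive_part = beta * sum(d[k] for d in pos_docs) / len(pos_docs)
    negative_part = gamma * sum(d[k] for d in neg_docs) / len(neg_docs)
    return positive_part - negative_part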

23 [Diagram: positive (+) and negative (_) training examples in the document vector space; * marks the category profile built by the Rocchio method.]

24 Rocchio method
in our sample dataset: what is the weight of the term 'nuclear' in the category 'medicine'?
– POS_medicine contains the documents Doc1-Doc4
– NEG_medicine contains the documents Doc5-Doc10
– |POS_medicine| = 4 and |NEG_medicine| = 6

25 Rocchio method
the weights of the term 'nuclear' in the documents in POS_medicine:
– w_nuclear_doc1 = 0.5
– w_nuclear_doc2 = 0
– w_nuclear_doc3 = 0
– w_nuclear_doc4 = 0.5
and in the documents in NEG_medicine:
– w_nuclear_doc6 = 0.5 (0 in the other negative documents)

26 Rocchio method
let β = 2 and γ = 1
weight of 'nuclear' in the category 'medicine':
– w_nuclear_medicine = 2 · (0.5 + 0 + 0 + 0.5)/4 − 1 · 0.5/6 = 0.5 − 0.083 ≈ 0.42
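Plugging the weights from the previous slide into the rocchio_weight sketch above reproduces this result:

# One-term vectors holding the weight of 'nuclear' in each document.
pos = [[0.5], [0.0], [0.0], [0.5]]                # Doc1-Doc4 (medicine)
neg = [[0.0], [0.5], [0.0], [0.0], [0.0], [0.0]]  # Doc5-Doc10
print(round(rocchio_weight(0, pos, neg, beta=2, gamma=1), 2))  # 0.42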

27 Rocchio method
using the classifier: the cosine similarity of the category vector c_i and the document vector d_j is computed:

$$ S(c_i, d_j) = \frac{\sum_{k=1}^{|T|} w_{ki} \, w_{kj}}{\sqrt{\sum_{k=1}^{|T|} w_{ki}^2} \; \sqrt{\sum_{k=1}^{|T|} w_{kj}^2}} $$

– |T| is the number of terms

28 Rocchio method
the cosine similarity function returns a value between 0 and 1
a threshold is given
– if the value is higher than the threshold -> true (the document belongs to the category)
– otherwise -> false (the document does not belong to the category)
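A sketch of this decision step (the threshold value is an assumed tuning parameter, not from the slides):

import math

def cosine_similarity(c, d):
    """Cosine similarity between a category vector and a document vector."""
    dot = sum(ck * dk for ck, dk in zip(c, d))
    norms = math.sqrt(sum(ck * ck for ck in c)) * math.sqrt(sum(dk * dk for dk in d))
    return dot / norms if norms else 0.0

def classify(category_vector, document_vector, threshold=0.25):
    """True if the document is similar enough to the category profile."""
    return cosine_similarity(category_vector, document_vector) > threshold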

29 Rocchio method
a classifier built by means of the Rocchio method rewards
– closeness of a (new) document to the centroid of the positive training examples
– distance of a (new) document from the centroid of the negative training examples

30 Strengths and weaknesses of the Rocchio method
strengths
– simple to implement
– fast to train
weakness
– if the documents in a category occur in disjoint clusters, a classifier may miss most of them
– e.g. two types of Sports news: boxing and rock-climbing
– the centroid of these clusters may fall outside all of the clusters

31 [Diagram: the disjoint-clusters weakness; the profile * computed from the training examples falls outside the clusters of positive documents.]

32 Enhancement to the Rocchio method
instead of considering the set of negative examples in its entirety, a smaller sample can be used
– for instance, the set of near-positive examples
near-positives (NPOS_i): the most positive amongst the negative training examples

33 Enhancement to the Rocchio method
the new formula replaces the full set of negative examples with the near-positives:

$$ w_{ki} = \frac{\beta}{|POS_i|} \sum_{d_j \in POS_i} w_{kj} \;-\; \frac{\gamma}{|NPOS_i|} \sum_{d_j \in NPOS_i} w_{kj} $$

34 Enhancement to the Rocchio method
the use of near-positives is motivated, as they are the most difficult documents to distinguish from the positive documents
near-positives can be found, e.g., by querying the set of negative examples with the centroid of the positive examples
– the top documents retrieved are the most similar to this centroid, and therefore near-positives
with this and other enhancements, the performance of Rocchio is comparable to the best methods
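A sketch of this selection, reusing cosine_similarity from the earlier sketch (the sample size n is an assumed parameter):

def positive_centroid(pos_docs):
    """Component-wise average of the positive example vectors."""
    return [sum(weights) / len(pos_docs) for weights in zip(*pos_docs)]

def near_positives(pos_docs, neg_docs, n=10):
    """The n negative examples most similar to the positive centroid."""
    centroid = positive_centroid(pos_docs)
    return sorted(neg_docs,
                  key=lambda d: cosine_similarity(centroid, d),
                  reverse=True)[:n]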

35 Other learners
we discuss later:
– Boosting (AdaBoost)
– Naive Bayes (in text summarization)