
1 IFT6255: Information Retrieval Text classification

2 Overview Definition of text classification Important processes in classification Classification algorithms Advantages and disadvantages of algorithms Performance comparison of algorithms Conclusion

3 Text Classification Text classification (text categorization): assign documents to one or more predefined categories (classes) [Diagram: Documents → class1, class2, …, classn]

4 Illustration of Text Classification [Illustration: documents grouped into the classes Science, Sport, Art]

5 Applications of Text Classification Organize web pages into hierarchies Domain-specific information extraction Sort into different folders Find interests of users Etc.

6 Text Classification Framework Documents → Preprocessing → Indexing → Feature selection → Applying classification algorithms → Performance measure

7 Preprocessing Preprocessing: transform documents into a suitable representation for the classification task –Remove HTML or other tags –Remove stopwords –Perform word stemming (remove suffixes)
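A minimal Python sketch of these three steps, assuming NLTK's English stopword list and Porter stemmer are available (any stopword list and stemmer would do):

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(raw_text):
    # Remove HTML or other tags
    text = re.sub(r"<[^>]+>", " ", raw_text)
    # Lowercase and split into word tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stopwords and stem the remaining words
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("<p>The classifiers were trained on labelled documents.</p>"))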

8 Indexing Indexing by different weighting schemes: –Boolean weighting –Word frequency weighting –tf*idf weighting –ltc weighting –Entropy weighting
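As one example, a rough sketch of tf*idf weighting over a toy corpus (the plain tf * log(N/df) form; real implementations differ in normalization and smoothing details):

import math
from collections import Counter

docs = [["text", "classification", "text"], ["sport", "news"], ["text", "mining"]]
N = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # Weight of term t in document d: tf(t, d) * log(N / df(t))
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))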

9 Feature Selection Feature selection: remove non-informative terms from documents => improve classification effectiveness => reduce computational complexity

10 Different Feature Selection Methods Document Frequency Thresholding (DF) –tf > threshold –tf*idf Information Gain (IG)
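Information gain for a term t over classes c_1, …, c_K is commonly defined in terms of term presence/absence (one standard formulation):

IG(t) = -\sum_{i=1}^{K} P(c_i)\log P(c_i)
        + P(t)\sum_{i=1}^{K} P(c_i \mid t)\log P(c_i \mid t)
        + P(\bar{t})\sum_{i=1}^{K} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})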

11 Different Feature Selection Methods χ²-statistic (CHI), based on a two-way contingency table of term w and category Cj –A: # documents containing w and belonging to Cj  B: # documents containing w and not in Cj –C: # documents not containing w and belonging to Cj  D: # documents not containing w and not in Cj Mutual Information (MI)
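With A, B, C, D the document counts above and N = A + B + C + D the total number of documents, the usual estimates of these two statistics are:

\chi^2(w, C_j) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}
\qquad
MI(w, C_j) \approx \log \frac{A \cdot N}{(A + B)(A + C)}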

12 Classification Algorithms Rocchio’s algorithm K-Nearest-Neighbor algorithm (KNN) Decision Tree algorithm (DT) Naive Bayes algorithm (NB) Artificial Neural Network (ANN) Support Vector Machine (SVM) Voting algorithms

13 Rocchio's Algorithm Build a prototype vector for each class c_i: the average vector over all training document vectors that belong to class c_i Calculate the similarity between the test document and each of the prototype vectors Assign the test document to the class with maximum similarity
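A minimal sketch of this centroid-based scheme, assuming documents are already represented as tf*idf vectors in numpy arrays and similarity is cosine (both assumptions made for illustration):

import numpy as np

def train_rocchio(X, y):
    # Prototype of each class = average of its training document vectors
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def classify(prototypes, doc_vec):
    # Assign the document to the class whose prototype is most similar
    return max(prototypes, key=lambda c: cosine(prototypes[c], doc_vec))

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array(["sport", "sport", "science", "science"])
prototypes = train_rocchio(X, y)
print(classify(prototypes, np.array([0.2, 0.8])))  # -> "science"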

14 Analysis of Rocchio's Algorithm Advantages: –Easy to implement –Very fast learner –Relevance feedback mechanism Disadvantages: –Low classification accuracy –Linear combination too simple for classification –Constants α and β are set empirically

15 K-Nearest-Neighbor Algorithm Principle: points (documents) that are close in the space belong to the same class

16 K-Nearest-Neighbor Algorithm Calculate the similarity between the test document and each training document Select the k nearest neighbors of the test document among the training examples Assign the test document to the class which contains most of the neighbors
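A small sketch of this procedure, again assuming tf*idf-style numpy vectors and cosine similarity (k and the similarity measure are choices for illustration, not fixed by the slide):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, doc_vec, k=3):
    # Cosine similarity between the test document and every training document
    sims = (X_train @ doc_vec) / (
        np.linalg.norm(X_train, axis=1) * np.linalg.norm(doc_vec) + 1e-12)
    # Indices of the k most similar training documents
    nearest = np.argsort(sims)[-k:]
    # Majority vote among the neighbors' classes
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]])
y = np.array(["sport", "sport", "science", "science"])
print(knn_classify(X, y, np.array([0.3, 0.7])))  # -> "science"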

17 Analysis of KNN Algorithm Advantages: –Effective –Non-parametric –More local characteristics of the document are considered compared with Rocchio Disadvantages: –Classification time is long –Difficult to find the optimal value of k

18 Decision Tree Algorithm Decision tree associated with the documents: –Root node contains all documents –Each internal node is a subset of documents separated according to one attribute –Each arc is labeled with a predicate that can be applied to the attribute at the parent node –Each leaf node is labeled with a class

19 Decision Tree Algorithm Recursive partition procedure from the root node The set of documents is separated into subsets according to an attribute Use the most discriminative attribute first (highest IG) Pruning to deal with overfitting
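A hedged sketch using scikit-learn (one possible library choice, not the slide's own implementation): criterion="entropy" splits on the attribute with the highest information gain, and max_depth stands in for pruning:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

docs = ["the team won the match", "league season goal",
        "atoms and molecules", "quantum theory experiment"]
labels = ["sport", "sport", "science", "science"]

# Tf*idf features feeding an information-gain-based tree of limited depth
clf = make_pipeline(TfidfVectorizer(),
                    DecisionTreeClassifier(criterion="entropy", max_depth=3))
clf.fit(docs, labels)
print(clf.predict(["the goal decided the match"]))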

20 Analysis of Decision Tree Algorithm Advantages: –Easy to understand –Easy to generate rules –Reduces problem complexity Disadvantages: –Training time is relatively expensive –A document is only connected with one branch –Once a mistake is made at a higher level, the whole subtree below it is wrong –Does not handle continuous variables well –May suffer from overfitting

21 Naïve Bayes Algorithm Estimate the probability of each class for a document: –Compute the posterior probability (Bayes rule) –Assumption of word independence

22 Naïve Bayes Algorithm –P(ci): prior probability of class ci –P(dj|ci): probability of document dj given class ci
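Under the usual multinomial model with Laplace smoothing (a standard choice of estimates, not dictated by the slide), these quantities are computed as:

P(c_i) = \frac{N_i}{N}, \qquad
P(d_j \mid c_i) = \prod_{k} P(w_k \mid c_i), \qquad
P(w_k \mid c_i) = \frac{1 + n(w_k, c_i)}{|V| + \sum_{w \in V} n(w, c_i)}

where N_i is the number of training documents in class c_i, N the total number of training documents, n(w, c_i) the number of occurrences of word w in documents of class c_i, and V the vocabulary.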

23 Analysis of Naïve Bayes Algorithm Advantages: –Works well on numeric and textual data –Easy to implement and computationally cheap compared with other algorithms Disadvantages: –The conditional independence assumption is violated by real-world data; performance can be very poor when features are highly correlated

24 Basic Neuron Model In A Feedforward Network Inputs x_i arrive through pre-synaptic connections Synaptic efficacy is modeled using real weights w_i The response of the neuron is a nonlinear function f of its weighted inputs

25 Inputs To Neurons Arise from other neurons or from outside the network Nodes whose inputs arise outside the network are called input nodes and simply copy values An input may excite or inhibit the response of the neuron to which it is applied, depending upon the weight of the connection

26 Weights Represent synaptic efficacy and may be excitatory or inhibitory Normally, positive weights are considered as excitatory while negative weights are thought of as inhibitory Learning is the process of modifying the weights in order to produce a network that performs some function

27 Output The response function is normally nonlinear Samples include –Sigmoid –Piecewise linear
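For reference, the sigmoid response (the form implied by the O_pj(1 − O_pj) derivative used in the backpropagation slides below) is:

f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr)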

28 Backpropagation Preparation Training Set A collection of input-output patterns that are used to train the network Testing Set A collection of input-output patterns that are used to assess network performance Learning Rate-η A scalar parameter, analogous to step size in numerical integration, used to set the rate of adjustments

29 Network Error Total-Sum-Squared-Error (TSSE) Root-Mean-Squared-Error (RMSE)
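One standard convention for these two measures (some texts include a factor of 1/2 in TSSE) is:

TSSE = \sum_{p}\sum_{j} (T_{pj} - O_{pj})^2, \qquad
RMSE = \sqrt{\frac{TSSE}{N_{patterns} \cdot N_{outputs}}}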

30 A Pseudo-Code Algorithm Randomly choose the initial weights While error is too large –For each training pattern Apply the inputs to the network Calculate the output for every neuron from the input layer, through the hidden layer(s), to the output layer Calculate the error at the outputs Use the output error to compute error signals for pre-output layers Use the error signals to compute weight adjustments Apply the weight adjustments –Periodically evaluate the network performance

31 Apply Inputs From A Pattern Apply the value of each input parameter to each input node Input nodes compute only the identity function

32 Calculate Outputs For Each Neuron Based On The Pattern The output from neuron j for pattern p is O_pj = f(net_pj) with net_pj = Σ_k W_jk O_pk, where k ranges over the input indices and W_jk is the weight on the connection from input k to neuron j

33 Calculate The Error Signal For Each Output Neuron The output neuron error signal δ_pj is given by δ_pj = (T_pj − O_pj) O_pj (1 − O_pj) T_pj is the target value of output neuron j for pattern p O_pj is the actual output value of output neuron j for pattern p

34 Calculate The Error Signal For Each Hidden Neuron The hidden neuron error signal δ_pj is given by δ_pj = O_pj (1 − O_pj) Σ_k δ_pk W_kj, where δ_pk is the error signal of a post-synaptic neuron k and W_kj is the weight of the connection from hidden neuron j to the post-synaptic neuron k

35 Calculate And Apply Weight Adjustments Compute weight adjustments ΔW_ji by ΔW_ji = η δ_pj O_pi Apply weight adjustments according to W_ji = W_ji + ΔW_ji
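Putting the update rules above together, a compact numpy sketch of per-pattern backpropagation for a single hidden layer (XOR is used purely as a toy illustration; bias terms are folded in as weights on constant-1 inputs, a detail the slides do not cover):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy patterns (XOR), with a constant 1 appended so bias terms are ordinary weights
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])

W1 = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(5, 1))   # hidden (+ constant unit) -> output weights
eta = 0.5                                 # learning rate

for epoch in range(5000):
    for x, t in zip(X, T):
        # Forward pass: O_pj = f(sum_k W_jk O_pk)
        h = np.append(sigmoid(x @ W1), 1.0)   # constant hidden unit acts as output bias
        o = sigmoid(h @ W2)
        # Output error signal: delta = (T - O) O (1 - O)
        delta_o = (t - o) * o * (1 - o)
        # Hidden error signal: delta_j = O_j (1 - O_j) sum_k delta_k W_kj
        delta_h = h[:4] * (1 - h[:4]) * (W2[:4] @ delta_o)
        # Weight adjustments: Delta_W_ji = eta * delta_pj * O_pi
        W2 += eta * np.outer(h, delta_o)
        W1 += eta * np.outer(x, delta_h)

# Network outputs after training (should approach the XOR targets)
print(sigmoid(np.append(sigmoid(X @ W1), np.ones((4, 1)), axis=1) @ W2))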

36 Analysis of ANN Algorithm Advantages: –Produces good results in complex domains –Suitable for both discrete and continuous data (especially better for continuous domains) –Testing is very fast Disadvantages: –Training is relatively slow –Learned results are more difficult for users to interpret than learned rules (compared with DT) –Empirical Risk Minimization (ERM) makes the ANN minimize the training error, which may lead to overfitting

37 Support Vector Machines Main idea of SVMs: find the linear separating hyperplane which maximizes the margin, i.e., the optimal separating hyperplane (OSH) Nonlinearly separable case: use a kernel function to map the input space X into a (Hilbert) feature space F via x → f(x), where the classes become linearly separable

38 SVM classification Maximizing the margin is equivalent to: minimize (1/2)||w||² subject to y_i (w · x_i + b) ≥ 1 for all training examples (x_i, y_i) Introducing Lagrange multipliers α_i ≥ 0, the Lagrangian is: L(w, b, α) = (1/2)||w||² − Σ_i α_i [y_i (w · x_i + b) − 1] Dual problem: maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j) subject to: α_i ≥ 0 and Σ_i α_i y_i = 0 The solution is given by: w = Σ_i α_i y_i x_i The problem of classifying a new data point x is now simply solved by looking at the sign of f(x) = w · x + b = Σ_i α_i y_i (x_i · x) + b
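In practice this optimization is rarely solved by hand; a hedged sketch using scikit-learn's linear SVM on tf*idf features (one common library choice, not the slide's own implementation):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["the team won the cup final", "the player scored a late goal",
        "electrons orbit the nucleus", "a new particle was observed"]
labels = ["sport", "sport", "science", "science"]

# LinearSVC fits a (soft-margin) maximum-margin hyperplane; C controls the margin/error trade-off
clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
clf.fit(docs, labels)
print(clf.predict(["the goal decided the final"]))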

39 Analysis of SVM Algorithm Advantages: –Compared with ANN, SVMs capture the inherent characteristics of the data better –Embed the Structural Risk Minimization (SRM) principle, which minimizes an upper bound on the generalization error (better than the Empirical Risk Minimization principle) –The ability to learn can be independent of the dimensionality of the feature space –Global minimum vs. local minima Disadvantages: –Parameter tuning –Kernel selection

40 Voting Algorithm Principle: use multiple sources of evidence (multiple poor classifiers => a single good classifier) Generate some base classifiers Combine them to make the final decision

41 Bagging Algorithm Use multiple versions of the training set D of size N, each created by bootstrap resampling of N examples from D Each of these data sets is used to train a base classifier; the final classification decision is made by majority voting of these classifiers
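A short scikit-learn sketch of this scheme (a library choice for illustration): each base classifier, a decision tree by default, is trained on a bootstrap resample, and predictions are combined by voting/averaging over the trees:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline

docs = ["the team won the cup", "a late goal in the match",
        "electrons and protons", "the experiment confirmed the theory"]
labels = ["sport", "sport", "science", "science"]

# Default base estimator is a decision tree; each tree sees a bootstrap resample of the data
bag = make_pipeline(TfidfVectorizer(),
                    BaggingClassifier(n_estimators=10, random_state=0))
bag.fit(docs, labels)
print(bag.predict(["a goal won the match"]))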

42 AdaBoost Main idea: - Maintain a distribution (set of weights) over the training set. Initially, all weights are set equally, but in each iteration the weights of incorrectly classified examples are increased so that the base classifier is forced to focus on the 'hard' examples in the training set. The weights of correctly classified examples are decreased so that they are less important in the next iteration. Why ensembles can improve performance: - Uncorrelated errors made by the individual classifiers can be removed by voting. - Our hypothesis space H may not contain the true function f. Instead, H may include several equally good approximations to f. By taking weighted combinations of these approximations, we may be able to represent classifiers that lie outside of H.

43 AdaBoost algorithm Given: m examples (x_1, y_1), …, (x_m, y_m) where y_i ∈ {−1, +1} Initialize D_1(i) = 1/m for all i = 1…m For t = 1,…,T:  Train base classifier using distribution D_t  Get a hypothesis h_t with error ε_t = Σ_i D_t(i) [h_t(x_i) ≠ y_i]  Choose α_t = (1/2) ln((1 − ε_t) / ε_t)  Update: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution). Output the final hypothesis: H(x) = sign(Σ_t α_t h_t(x))
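A corresponding scikit-learn sketch (shallow decision trees are the default base classifiers; a library choice for illustration, not the slide's own code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline

docs = ["the team won the cup", "a late goal in the match",
        "electrons and protons", "the experiment confirmed the theory"]
labels = ["sport", "sport", "science", "science"]

# Each boosting round reweights the training examples and fits a new base classifier
boost = make_pipeline(TfidfVectorizer(),
                      AdaBoostClassifier(n_estimators=50, random_state=0))
boost.fit(docs, labels)
print(boost.predict(["a goal won the match"]))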

44 Analysis of Voting Algorithms Advantages: –Surprisingly effective –Robust to noise –Decrease the overfitting effect Disadvantages: –Require more computation and memory

45 Performance Measure Performance of an algorithm: –Training time –Testing time –Classification accuracy Precision, Recall Micro-average / Macro-average Breakeven: precision = recall Goal: high classification quality and computational efficiency
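For a single category, precision and recall are computed from the contingency counts in the usual way (TP = documents correctly assigned to the category, FP = documents incorrectly assigned, FN = documents incorrectly rejected); micro-averaging pools these counts over all categories before computing the measures, while macro-averaging averages the per-category scores:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}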

46 Comparison Based on Six Classifiers Classification accuracy of six classifiers on the Reuters collection, as reported in four studies (1: Dumais, 2: Joachims, 3: Weiss, 4: Yang). The studies differ in training/test split and number of topics, in indexing scheme (Boolean, tfc, frequency, ltc), in feature selection method (MI, IG, or none), and in evaluation measure (breakeven, micro-average, breakeven). Accuracies are reported for Rocchio, NB, KNN, DT, SVM and Voting, with several entries not available (e.g., Voting is reported as 87.8 in only one study).

47 Analysis of Results SVM, Voting and KNN showed good performance DT, NB and Rocchio showed relatively poor performance

48 Comparison Based on Feature Selection Classification accuracy (mean ± standard deviation) of NB vs. KNN vs. SVM on the Reuters collection for increasing numbers of selected features

49 Analysis of Results Accuracy improves as the number of features increases, up to a certain level Around that level, accuracy reaches its peak and then begins to decline SVM obtains the best performance

50 Conclusion Different algorithms perform differently depending on the data collection Some algorithms (e.g. Rocchio) do not perform well None of them appears to be globally superior to the others; however, SVM and Voting are good choices when all factors are considered