1 Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
A Multistrategy Approach for Digital Text Categorization from Imbalanced Documents
Advisor: Dr. Hsu
Presenter: Jing-Wei Lin

2 Intelligent Database Systems Lab, N.Y.U.S.T. I. M.
Outline: Motivation, Objective, Introduction, System Architecture, Empirical Evaluation, Conclusions, Future Work

3 Motivation
Certain learning algorithms are more suitable for some thematic categories than for others, showing different classification results due to the different types of information present in each domain. The performance of an algorithm depends on the features or attributes chosen to represent the information.

4 Objective
The main goal of the HYCLA (HYbrid CLAssifier) system presented here is to maximize classification performance by considering all the types of information contained in documents, regardless of their thematic domain.

5 Introduction
The documents can be imbalanced for two reasons:
1) Some thematic categories have many preclassified documents, while others do not.
2) Some thematic categories contain only one or two types of information.

6 Introduction (cont.)
The HYCLA system relies on a hybrid architecture that integrates the results of several classifiers. "Hybrid" has a double meaning:
1) It symbolizes the multistrategy nature of the empirical learning approach to text categorization.
2) It refers to the genetic search carried out to find the vocabulary of the problem and to integrate the individual predictions of the learners.
The proposed genetic feature selection treats all categories the same because it considers several statistical measurements, which makes the kind of imbalance in the documents irrelevant.

7 System Architecture
[Figure: HYCLA architecture. Four learners each follow the steps "choose features", "training", "classify"; the diagram also shows a G.A. with crossover and mutation, and a combination step.]

8 System Architecture
HYCLA operates in two stages, learning and integration:
1) Learning stage: each learner obtains its own feature set and is then trained to obtain its classification model.
2) Integration stage: the individual learned models are evaluated on a test set, and their predictions are combined to achieve the best classification of the test documents.
When HYCLA classifies a document, the final category is obtained by the genetic combination of the decisions made by all the models.

9 System Architecture: Preprocessing Step
The system receives a sample of documents from different thematic categories, which is divided into two sets:
1) Training set
2) Test set
The task in this step is to scan the text of the sample and produce the list of words (the vocabulary) contained in the documents.

10 System Architecture: Preprocessing Step
Several statistical measurements are calculated for each preprocessed vocabulary:
1) information gain
2) mutual information
3) document frequency
4) chi square
5) crossover entropy
6) odds-ratio
The words of each vocabulary are sorted by each of the six measurements, and only the k_v highest-ranked words of each vocabulary are retained.

11 System Architecture: Preprocessing Step
The k_v words of each vocabulary ranked by each measurement form a view. The set of views of a vocabulary will be the initial feature subsets of a learner.

12 System Architecture: Chromosome Representation
Each view computed from an original vocabulary in the preprocessing step is a chromosome. Chromosome length is fixed at k_v. Each gene is a word of the vocabulary.
Example, with k_v = 3:
input vocabulary: {bye, see_you, hello, good_morning, good_afternoon}
chi-square view: {see_you, bye, good_afternoon}
crossover entropy view: {see_you, good_afternoon, hello}
The two views are two chromosomes.
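
The way views become chromosomes can be sketched in a few lines of Python. This is only an illustration, not HYCLA's implementation: the function name `build_views` and the numeric scores are assumptions, and the scores are made up so that the result reproduces the two example views on the slide.

```python
from typing import Dict, Tuple


def build_views(scores: Dict[str, Dict[str, float]], k_v: int) -> Dict[str, Tuple[str, ...]]:
    """For each measurement, keep the k_v highest-scoring words as one view (chromosome)."""
    views = {}
    for measure, word_scores in scores.items():
        # Sort words from highest to lowest score under this measurement.
        ranked = sorted(word_scores, key=word_scores.get, reverse=True)
        views[measure] = tuple(ranked[:k_v])
    return views


# Made-up scores (assumption), chosen only to match the slide's example views.
toy_scores = {
    "chi_square": {
        "bye": 0.8, "see_you": 0.9, "hello": 0.2,
        "good_morning": 0.1, "good_afternoon": 0.7,
    },
    "crossover_entropy": {
        "bye": 0.3, "see_you": 0.9, "hello": 0.6,
        "good_morning": 0.2, "good_afternoon": 0.8,
    },
}

views = build_views(toy_scores, k_v=3)
print(views["chi_square"])         # ('see_you', 'bye', 'good_afternoon')
print(views["crossover_entropy"])  # ('see_you', 'good_afternoon', 'hello')
```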

13 System Architecture: Learner Structure and Dynamics
When a learner receives a feature set, it carries out the following tasks:
1) Empirical learning
   Feature selection: each learner uses a genetic algorithm to find the best feature set.
   Classification model: the learner trains on the documents using that feature set and induces a classification model.
2) Testing
   The learner applies the inferred model to a test set and calculates several measures of classification performance.

14 System Architecture: Genetic Feature Selection
The application of genetic algorithms to text feature selection involves defining crossover and mutation operators fitted to the chromosome representation, and defining the fitness function used to determine the best chromosomes of a population.

15 System Architecture: Operators
Crossover operator
Chromosome 1: (I, you, he)
Chromosome 2: (we, you, they)
New Chromosome 1: (I, you, they)
New Chromosome 2: (we, you, he)
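
The slide's example is consistent with a one-point crossover in which the two parents swap their tails. The sketch below assumes that reading. The function name `one_point_crossover` and the cut position `point=2` are illustrative choices, since the slide does not state where the cut is made.

```python
from typing import Sequence, Tuple


def one_point_crossover(
    parent_a: Sequence[str], parent_b: Sequence[str], point: int
) -> Tuple[Tuple[str, ...], Tuple[str, ...]]:
    """Swap the tails of two equal-length chromosomes after position `point`."""
    if len(parent_a) != len(parent_b):
        raise ValueError("chromosomes must have the same length")
    if not 0 < point < len(parent_a):
        raise ValueError("cut point must fall strictly inside the chromosome")
    child_a = tuple(parent_a[:point]) + tuple(parent_b[point:])
    child_b = tuple(parent_b[:point]) + tuple(parent_a[point:])
    return child_a, child_b


new_1, new_2 = one_point_crossover(("I", "you", "he"), ("we", "you", "they"), point=2)
print(new_1)  # ('I', 'you', 'they')
print(new_2)  # ('we', 'you', 'he')
```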

16 System Architecture: Operators
Mutation operator
Chromosome 1: (I, you, he)
p = 2
Vocabulary = {I, you, he, she, we, you, they}
Size of chromosome: 3
Number of genes to replace: 10% * 3, taken as 1 (1 by default)
New Chromosome 1: (I, you, we)
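
A minimal mutation sketch follows, under several assumptions the slide does not spell out: the number of genes to replace is 10% of the chromosome length with a floor of 1, the positions are chosen at random, and each replacement word is drawn uniformly from the vocabulary. The function name `mutate` and the `rate` and `rng` parameters are hypothetical.

```python
import random
from typing import Optional, Sequence, Tuple


def mutate(
    chromosome: Sequence[str],
    vocabulary: Sequence[str],
    rate: float = 0.10,
    rng: Optional[random.Random] = None,
) -> Tuple[str, ...]:
    """Replace about `rate` of the genes (at least one) with random vocabulary words."""
    if rng is None:
        rng = random.Random()
    # Assumption: 10% of 3 genes rounds to 0, so fall back to 1 gene.
    n_genes = max(1, round(rate * len(chromosome)))
    positions = rng.sample(range(len(chromosome)), n_genes)
    mutated = list(chromosome)
    for pos in positions:
        mutated[pos] = rng.choice(vocabulary)
    return tuple(mutated)


vocabulary = ["I", "you", "he", "she", "we", "they"]
original = ("I", "you", "he")
mutant = mutate(original, vocabulary, rng=random.Random(0))
print(mutant)
```

Because the replacement is drawn from the whole vocabulary, a mutation can reintroduce a word already present in the chromosome.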

17 System Architecture: Operators
Genetic operators can produce new chromosomes that contain repeated words. Since only the first occurrence of each word within a chromosome is considered, genetic search can yield not only an optimal feature set but also a smaller number of features.
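
Keeping only the first occurrence of each word can be sketched as an order-preserving deduplication. The function name `effective_features` and the example chromosome are illustrative, not taken from the slides.

```python
from typing import Sequence, Tuple


def effective_features(chromosome: Sequence[str]) -> Tuple[str, ...]:
    """Keep only the first occurrence of each word, preserving gene order."""
    # dict preserves insertion order, so later duplicates are dropped.
    return tuple(dict.fromkeys(chromosome))


print(effective_features(("I", "you", "I")))  # ('I', 'you')
```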

18 System Architecture: Fitness Function
The fitness function of a chromosome measures the performance of the model on a test sample represented with the features of that chromosome. Its value is the F-measure achieved by the chromosome on the full test set of documents.

19 System Architecture: Learning Kernels
Once a learner obtains a feature set, it applies its inductive algorithm to learn a classification model. Since there could be four kinds of redundant information in documents, the system can run four learners: abstract/meta, reference/link, contents/plain, and title/url.

20 System Architecture: Testing
Recall: percentage of the documents of a category that are correctly classified (10/100).
Precision: percentage of the documents predicted for a category that are correctly classified (10/50).
F-measure: a function of the recall and precision measurements, F = (2 * precision * recall) / (precision + recall).
The value of the F-measure is the fitness value used by the genetic algorithm for chromosomes representing tentative feature sets.
[Figure: category C, labeled with the values 100 and 50.]
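
The slide's numbers can be worked through in code. The sketch assumes the figures mean that category C has 100 documents, 50 documents were predicted as C, and 10 of those predictions are correct. The function and parameter names are illustrative.

```python
from typing import Tuple


def precision_recall_f(correct: int, actual: int, predicted: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one category."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Assumed reading of the slide: 10 correct out of 100 actual and 50 predicted.
p, r, f = precision_recall_f(correct=10, actual=100, predicted=50)
print(f"{p:.2f} {r:.2f} {f:.3f}")  # 0.20 0.10 0.133
```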

21 System Architecture: Integrated Prediction
There are two options for obtaining the final classification prediction of a document:
1) Take the model with the best performance results (i.e., F-measure) as the optimal final solution.
2) Take a combination of the models as the final solution. The combination can be determined as an average or a weighted sum of the individual predictions.
HYCLA performs a weighted integration of the individual predictions, and it determines the weight of each learner together with that of the other learners by using a genetic algorithm.

22 System Architecture: Genetic Integration
The genes of a chromosome represent the weights, each between 0 and 1. Chromosome length matches the number of learners involved in the problem.

23 System Architecture: Genetic Integration
Chromosome: (0.7, 0.8, 0.85, 0.5, 0.97), where Chromosome[i] = Weight[Learner i]
If the predictions of the learners were:
Learners 1, 3, 5: Prediction = Category 1; Average Weight = 0.84
Learner 2: Prediction = Category 3; Weight = 0.8
Learner 4: Prediction = Category 2; Weight = 0.5
The highest weight is 0.84, and so the resulting prediction assigns the document to Category 1.
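
The example above can be reproduced with a short sketch. It assumes, as the slide's numbers suggest, that learners predicting the same category have their weights averaged and that the category with the highest average wins. The function name `combine_predictions` is hypothetical, and the genetic search that finds the weights is not shown.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def combine_predictions(
    weights: Sequence[float], predictions: Sequence[int]
) -> Tuple[int, Dict[int, float]]:
    """Average the weights of learners that agree and pick the top category."""
    by_category = defaultdict(list)
    for weight, category in zip(weights, predictions):
        by_category[category].append(weight)
    averages = {cat: sum(ws) / len(ws) for cat, ws in by_category.items()}
    best = max(averages, key=averages.get)
    return best, averages


weights = (0.7, 0.8, 0.85, 0.5, 0.97)  # chromosome from the slide
predictions = (1, 3, 1, 2, 1)          # category predicted by learners 1..5
best, averages = combine_predictions(weights, predictions)
print(best, round(averages[best], 2))  # 1 0.84
```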

24 Empirical Evaluation
HYCLA has been evaluated on three text collections.

25 Empirical Evaluation
The "NOISE" category is composed of error pages and randomly downloaded pages.

26 Empirical Evaluation
The third collection is composed of 2,442 documents belonging to five domains defined from the Yahoo Directory.

27 Empirical Evaluation: Feature Selection Methods
The numerical values Pr, Rc, and F represent precision, recall, and F-measure, F = (2 * precision * recall) / (precision + recall), each normalized to the range 0 to 1.

28 Empirical Evaluation: Feature Selection Methods

29 Empirical Evaluation: Integration of Predictions
The second kind of experiment was set up to compare the performance of the predictions of every individual learner with the genetic combination of these predictions. A comparison between the genetic combination of predictions and a voting combination is also reported. There are four learners, one for each of the four types of information taken into account in HTML documents: URL text, meta-text, plain text, and hyperlink text.

30 System Architecture: Genetic Integration (compared with voting combination)
Chromosome: (0.7, 0.8, 0.85, 0.5, 0.97), where Chromosome[i] = Weight[Learner i]
If the predictions of the learners were:
Learners 1, 3, 5: Prediction = Category 1; Average Weight = 0.84
Learner 2: Prediction = Category 3; Weight = 0.8
Learner 4: Prediction = Category 2; Weight = 0.5
The highest weight is 0.84, and so the resulting prediction assigns the document to Category 1.

31 Empirical Evaluation: Integration of Predictions

32 Empirical Evaluation: Integration of Predictions

33 Conclusions
HYCLA combines a variable number of learners.
The genetic feature selection method takes advantage of each statistical selection method used.
The application of different learners to each type of information allows the system to be independent of text domain without loss of accuracy.

34 Opinion
In future work, feature selection for classification could pair or integrate the selection methods, which might yield better classification performance.

