Arabic Text Categorization Based on Arabic Wikipedia. Presenter: CHANG, SHIH-JIE. Authors: ADNAN YAHYA and ALI SALHI. 2014. ACM TALIP.
Outline: Motivation, Objectives, Methodology, Experiments, Conclusions, Comments
Motivation: Categorization is a challenge due to the correlation between certain subcategories and the overlap between main categories.
Objectives: To solve this, we use an algorithm and further adopt two approaches.
CATEGORIZATION CORPORA - Training Data Related Tags Approach
Testing Data 10 categories with 40 documents in each category
Methodology - PREPROCESSING TECHNIQUES Root Extraction (RE) Light Stemming (LS) Special Expressions Extraction
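As a rough illustration of the light stemming step listed above, the sketch below strips one common prefix and one common suffix from an Arabic word. The affix lists, the minimum-length threshold, and the function name `light_stem` are assumptions chosen for illustration (loosely in the spirit of Light10-style stemmers); they are not taken from the paper.

```python
# Simplified light-stemming sketch. The affix lists and the "keep at least
# two letters" rule are illustrative assumptions, not the authors' tool.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")
MIN_STEM_LENGTH = 2  # hypothetical threshold


def light_stem(word: str) -> str:
    """Strip at most one prefix and at most one suffix from the word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word


stem_book = light_stem("الكتاب")
stem_teachers = light_stem("المعلمون")
stem_school = light_stem("والمدرسة")
```

Unlike root extraction, this kind of stemming only removes affixes and leaves the internal pattern of the word untouched, which is why it is usually called "light".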
Methodology - CATEGORIZATION PROCESS: Categorize the input text in two phases. Phase one: we categorize the text into one of the main categories. Phase two: we further categorize the input text based on subcategories.
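The two-phase process can be sketched as follows. This is a minimal illustration only: the scoring function (a count of tokens that occur in a category's word set), the assumption that phase two looks only at subcategories of the main category chosen in phase one, and the toy English profiles are all hypothetical stand-ins, not the authors' training data or algorithm.

```python
from collections import Counter


def score(tokens, profile):
    # Hypothetical score: number of token occurrences found in the profile.
    counts = Counter(tokens)
    return sum(counts[word] for word in profile)


def best_category(tokens, profiles):
    # Pick the category whose profile gets the highest score.
    return max(profiles, key=lambda name: score(tokens, profiles[name]))


def categorize(tokens, main_profiles, sub_profiles):
    # Phase one: choose a main category.
    main = best_category(tokens, main_profiles)
    # Phase two: choose among the subcategories of that main category.
    sub = best_category(tokens, sub_profiles[main])
    return main, sub


# Toy profiles, invented for the example.
main_profiles = {
    "sports": {"match", "team", "goal"},
    "science": {"cell", "energy", "atom"},
}
sub_profiles = {
    "sports": {"football": {"goal", "penalty"}, "tennis": {"serve", "racket"}},
    "science": {"physics": {"energy", "atom"}, "biology": {"cell", "gene"}},
}

tokens = "the team scored a goal after a penalty".split()
result = categorize(tokens, main_profiles, sub_profiles)
```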
Methodology - Basic Categorization Algorithm (BCA)
Methodology - Percentage and Difference Categorization (PDC) Algorithm: ... has frequency 7 in the 300-word ...
Methodology - Percentage and Difference Categorization (PDC) Algorithm: The category with the highest sum of flag values is considered to be the best match for the input text.
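The final selection step, picking the category with the highest sum of flag values, might look like the sketch below. How each flag is derived from percentages and differences is not shown here; the flag lists, category names, and the function name `best_match` are made up for illustration.

```python
def best_match(flags_by_category):
    """Return the category with the highest sum of flag values, plus all sums."""
    totals = {category: sum(flags) for category, flags in flags_by_category.items()}
    winner = max(totals, key=totals.get)
    return winner, totals


# Hypothetical flag values: one flag per input word, per category.
flags = {
    "politics": [1, 0, 1, 0],
    "economy": [1, 1, 1, 0],
    "sports": [0, 0, 1, 0],
}

winner, totals = best_match(flags)
```

With ties, `max` returns the first category encountered with the top sum; a real implementation would need an explicit tie-breaking rule.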
Methodology – PDC Algorithm vs. BCA Algorithm
Methodology – Enhancing Main/Subcategories Grouping. Problem: the possible high correlation between subcategories of different main categories. (1) Overlapping Main Categories for Phase Two
Methodology – Enhancing Main/Subcategories Grouping. (2) Replacing Main Categories by Groups of Related Categories
Methodology – Enhancing Main/Subcategories Grouping
Methodology - Word Filtration Techniques within Categories
Methodology - The result of applying the three techniques
Modified PDC with N Scales: Define a scaling of N values, e.g. (1, 0.5) for two scales or (1, 0.75, 0.5, 0.25) for four scales.
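Both scale sequences listed on this slide (1, 0.5 and 1, 0.75, 0.5, 0.25) are consistent with N evenly spaced values from 1 down to 1/N. The sketch below generates such a sequence; the even spacing and the function name `n_scales` are assumptions inferred from those two examples, not a definition given by the authors.

```python
def n_scales(n: int) -> list[float]:
    """Return n evenly spaced scale values: 1, (n-1)/n, ..., 1/n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [(n - k) / n for k in range(n)]


two_scales = n_scales(2)
four_scales = n_scales(4)
```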
Further Testing on the PDC Algorithm. Tools: Root Extraction; Light Stemming & Light10; Double Words; Expressions Extraction
Using Testing Data from the Reference Categories
Training Data Characteristics
COMPARISON WITH RELATED WORK
Using Testing Data from the Reference Categories
Conclusions: The first method is to use training and testing data from the same source by splitting the corpus into training and test components; this consistently gives better results. However, we believe that the second method (testing data from a different source) makes more sense, as the tests will be more credible and more indicative of performance in real-life environments.
Comments: Advantages: To. Applications: Arabic Text Categorization.