Arabic Text Categorization Based on Arabic Wikipedia

Presenter: CHANG, SHIH-JIE
Authors: ADNAN YAHYA and ALI SALHI
ACM TALIP

Outline: Motivation, Objectives, Methodology, Experiments, Conclusions, Comments

Motivation  A challenge due to the correlation between certain subcategories and overlap between main categories. EX:

Objectives: To address this, we use a categorization algorithm and further adopt two approaches.

CATEGORIZATION CORPORA - Training Data
Related Tags Approach

Testing Data: 10 categories with 40 documents in each category (400 documents in total)

Methodology - PREPROCESSING TECHNIQUES
Root Extraction (RE), Light Stemming (LS), Special Expressions Extraction
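To make the light stemming step concrete, here is a minimal sketch. It is not the tool used by the authors: the affix lists are an assumption modelled on the commonly cited Light10 lists, and the function name `light_stem` and the minimum remaining length are hypothetical choices for illustration.

```python
# Simplified light stemmer: strip at most one prefix and one suffix.
# The affix lists are an assumption (modelled on the Light10 lists),
# not taken from the slides.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")


def light_stem(word: str, min_len: int = 2) -> str:
    """Remove one common prefix and one common suffix from an Arabic word."""
    # Prefixes are listed longest first so that "وال" wins over "و".
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word
```

Unlike root extraction, this kind of stemmer only trims affixes and never tries to recover the underlying root, which is why it is called "light".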

Methodology- CATEGORIZATION PROCESS
Categorize the input text in two phases.
Phase one: We categorize the text into one of the main categories.
Phase two: We further categorize the input text based on subcategories.
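The two-phase flow can be sketched as follows. The scoring used here (counting keyword hits) is only a stand-in and is not the BCA or PDC algorithm from the slides; the names `best_category`, `categorize`, `main_profiles` and `sub_profiles` are hypothetical.

```python
from collections import Counter


def best_category(tokens, profiles):
    """Return the profile name whose keyword set is hit most often by the tokens."""
    counts = Counter(tokens)
    scores = {name: sum(counts[w] for w in words) for name, words in profiles.items()}
    return max(scores, key=scores.get)


def categorize(tokens, main_profiles, sub_profiles):
    """Phase one picks a main category; phase two picks one of its subcategories."""
    main = best_category(tokens, main_profiles)
    sub = best_category(tokens, sub_profiles[main])
    return main, sub


# Toy data, invented for illustration only.
main_profiles = {"sports": {"ball", "team"}, "science": {"atom", "cell"}}
sub_profiles = {
    "sports": {"football": {"goal"}, "tennis": {"racket"}},
    "science": {"physics": {"atom"}, "biology": {"cell"}},
}
result = categorize(["team", "ball", "goal", "team"], main_profiles, sub_profiles)
```

Restricting phase two to the subcategories of the chosen main category keeps the second comparison small, at the cost of propagating any phase-one mistake.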

Methodology - Basic Categorization Algorithm (BCA)

Methodology - Percentage and Difference Categorization (PDC) Algorithm
has frequency 7 in the 300-word
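As a worked illustration of the numbers on this slide, and assuming that the "percentage" in PDC refers to a word's share of all words in a text (the slide does not spell this out), the value could be computed like this. The function name `word_percentage` is hypothetical.

```python
def word_percentage(frequency: int, total_words: int) -> float:
    """Share of a text, in percent, taken up by one word."""
    return 100.0 * frequency / total_words


# A word with frequency 7 in a 300-word text.
pct = round(word_percentage(7, 300), 2)  # 2.33
```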

Methodology - Percentage and Difference Categorization (PDC) Algorithm
The category with the highest sum of flag values is considered to be the best match for the input text.
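The selection step described above can be sketched as below. How the individual flag values are assigned is not reproduced here; the sketch assumes they are already available per category, and the name `best_match` and the sample values are hypothetical.

```python
def best_match(flags_by_category):
    """Sum the flag values of each category and return the top category with all totals."""
    totals = {category: sum(flags) for category, flags in flags_by_category.items()}
    # max() returns the first category encountered when totals are tied.
    return max(totals, key=totals.get), totals


# Invented flag values for two categories.
winner, totals = best_match({"Sports": [1, 0.5, 1], "Politics": [0.5, 0.5]})
```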

Methodology – PDC Algorithm vs. BCA Algorithm

Methodology – Enhancing Main/Subcategories Grouping
Problem: The possible high correlation between subcategories of different main categories.
(1) Overlapping Main Categories for Phase Two

(2) Replacing Main Categories by Groups of Related Categories

Methodology - Word Filtration Techniques within Categories
  

Methodology - The result of applying the three techniques

Modified PDC with N Scales
1, 0.5
Define a scaling of 1, 0.75, 0.5, 0.25
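Both value lists on this slide are evenly spaced steps down from 1, so an N-scale list can be generated as in the sketch below. The even spacing is an assumption inferred from the two examples (N = 2 and N = 4), and the name `scale_values` is hypothetical.

```python
def scale_values(n: int) -> list:
    """Return n evenly spaced flag values from 1 down to 1/n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [(n - i) / n for i in range(n)]


two_scales = scale_values(2)   # [1.0, 0.5]
four_scales = scale_values(4)  # [1.0, 0.75, 0.5, 0.25]
```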

Further Testing on the PDC Algorithm
Tools: Root Extraction, Light Stemming & Light10, Double Words, Expressions Extraction

Using Testing Data from the Reference Categories

Training Data Characteristics

COMPARISON WITH RELATED WORK

Using Testing Data from the Reference Categories

Conclusions: The first method is to take training and testing data from the same source by splitting the corpus into test and training components; this consistently gives better results. However, we believe that the second method (a different source for the testing data) makes more sense, as the tests will be more credible and more indicative of performance in real-life environments.
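The same-source setup amounts to a plain corpus split, sketched below. The 20% test fraction, the fixed seed and the name `split_corpus` are assumptions for illustration; the slides do not state how the split was made.

```python
import random


def split_corpus(documents, test_fraction=0.2, seed=0):
    """Shuffle the documents reproducibly and split them into (train, test)."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_test = int(len(docs) * test_fraction)
    return docs[n_test:], docs[:n_test]


train, test = split_corpus(range(100))
```

Because both parts come from one corpus, they share vocabulary and style, which is a plausible reason this setup scores higher than testing on documents from a different source.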

Comments
Advantages: To.
Applications: Arabic Text Categorization.
