Arabic Text Categorization Based on Arabic Wikipedia

Presentation transcript:

1 Arabic Text Categorization Based on Arabic Wikipedia
Presenter: CHANG, SHIH-JIE. Authors: ADNAN YAHYA and ALI SALHI. Source: ACM TALIP.

2 Outline: Motivation, Objectives, Methodology, Experiments, Conclusions, Comments

3 Motivation: Categorization is challenging because certain subcategories are correlated and the main categories overlap.

4 Objectives: To address this, the authors use a categorization algorithm and further adopt two approaches.

5 CATEGORIZATION CORPORA - Training Data
Related Tags Approach

6

7 Testing Data: 10 categories with 40 documents in each category (400 documents in total)

8 Methodology - PREPROCESSING TECHNIQUES
Root Extraction (RE), Light Stemming (LS), Special Expressions Extraction
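To make the light stemming step concrete, here is a minimal sketch that strips one common prefix and one common suffix from an Arabic word. The affix lists and the minimum stem length are illustrative assumptions, not the rules used by the authors or by the Light10 stemmer; `light_stem` is a hypothetical name.

```python
# Simplified light stemmer. The affix lists below are illustrative
# assumptions, not the exact rules from the paper or from Light10.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]  # longer prefixes first
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]  # longer first
MIN_STEM_LEN = 3  # assumed lower bound on what must remain after stripping


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, keeping a minimum stem length."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word
```

Unlike root extraction, this kind of stemmer leaves the internal pattern of the word untouched, so different words from the same root can still produce different stems.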

9 Methodology- CATEGORIZATION PROCESS
Categorize the input text in two phases.
Phase one: categorize the text into one of the main categories.
Phase two: further categorize the input text based on subcategories.
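The two phases can be sketched as follows. This is a minimal illustration, assuming each category is represented by a plain word list and using a simple overlap count as the score; the function names, the scoring rule, and the English placeholder data are all hypothetical and stand in for the algorithms on the later slides.

```python
def score(text_words, category_words):
    """Assumed score: number of input tokens found in the category word list."""
    vocab = set(category_words)
    return sum(1 for w in text_words if w in vocab)


def best_category(text_words, profiles):
    """Return the category name whose word list scores highest."""
    return max(profiles, key=lambda name: score(text_words, profiles[name]))


def categorize_two_phase(text, main_profiles, sub_profiles):
    words = text.split()
    # Phase one: pick a main category.
    main = best_category(words, main_profiles)
    # Phase two: pick a subcategory among those of the chosen main category.
    sub = best_category(words, sub_profiles[main])
    return main, sub


# Placeholder data for illustration only.
main_profiles = {
    "sports": ["match", "goal", "player", "team", "racket"],
    "science": ["physics", "cell", "energy", "atom"],
}
sub_profiles = {
    "sports": {
        "football": ["goal", "striker", "penalty"],
        "tennis": ["racket", "serve", "set"],
    },
    "science": {
        "physics": ["atom", "energy"],
        "biology": ["cell", "gene"],
    },
}

result = categorize_two_phase(
    "the striker scored a goal in the match", main_profiles, sub_profiles
)
```

Restricting phase two to the subcategories of the phase-one winner keeps the search small, but a phase-one mistake cannot be corrected later, which is the problem the grouping enhancements on later slides address.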

10

11 Methodology - Basic Categorization Algorithm (BCA)

12 Methodology - Percentage and Difference Categorization (PDC) Algorithm
has frequency 7 in the 300-word

13 Methodology - Percentage and Difference Categorization (PDC) Algorithm
The category with the highest sum of flag values is considered to be the best match for the input text.
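The selection rule can be sketched as below. The transcript does not preserve how a flag is computed, so this sketch assumes a flag of 1 when a word's percentage in the input text is within a threshold of its percentage in the category profile; `max_diff`, the function names, and the sample data are hypothetical.

```python
from collections import Counter


def percentages(words):
    """Percentage of the text taken up by each word, e.g. 7 of 300 -> ~2.33."""
    total = len(words)
    return {w: 100.0 * c / total for w, c in Counter(words).items()}


def flag_sum(input_pct, category_pct, max_diff=1.0):
    """Assumed flag rule: 1 per word whose percentages differ by at most max_diff."""
    return sum(
        1
        for w, p in input_pct.items()
        if w in category_pct and abs(p - category_pct[w]) <= max_diff
    )


def pdc_best(input_words, category_profiles, max_diff=1.0):
    """Pick the category with the highest sum of flag values."""
    input_pct = percentages(input_words)
    sums = {
        name: flag_sum(input_pct, profile, max_diff)
        for name, profile in category_profiles.items()
    }
    return max(sums, key=sums.get), sums


# Placeholder input and category profiles (word -> percentage).
input_words = ["goal", "goal", "match", "atom"]
category_profiles = {
    "sports": {"goal": 50.0, "match": 24.5},
    "science": {"atom": 10.0, "goal": 1.0},
}
best, sums = pdc_best(input_words, category_profiles)
```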

14 Methodology – PDC Algorithm vs. BCA Algorithm

15 Methodology – Enhancing Main/Subcategories Grouping
Problem: the possibly high correlation between subcategories of different main categories.
(1) Overlapping Main Categories for Phase Two

16 Methodology – Enhancing Main/Subcategories Grouping
(2) Replacing Main Categories by Groups of Related Categories

17 Methodology – Enhancing Main/Subcategories Grouping

18 Methodology - Word Filtration Techniques within Categories

19 Methodology - The result of applying the three techniques

20 Modified PDC with N Scales
Define a scaling of, for example, 1, 0.5 or 1, 0.75, 0.5, 0.25
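Both sequences of values on this slide are consistent with N evenly spaced levels of the form (N - i) / N. The sketch below generates such a scale under that assumption; `n_scale` is a hypothetical helper, and how the levels are assigned to words is not shown in the transcript.

```python
def n_scale(n: int) -> list[float]:
    """Return n evenly spaced scale values from 1 down to 1/n (assumed formula)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [(n - i) / n for i in range(n)]


two_levels = n_scale(2)   # [1.0, 0.5]
four_levels = n_scale(4)  # [1.0, 0.75, 0.5, 0.25]
```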

21 Further Testing on the PDC Algorithm
Tools: Root Extraction; Light Stemming & Light10; Double Words; Expressions Extraction

22 Using Testing Data from the Reference Categories

23 Training Data Characteristics

24 COMPARISON WITH RELATED WORK

25 Using Testing Data from the Reference Categories

26 Conclusions: One method is to take training and testing data from the same source by splitting the corpus into training and test components; this consistently gives better results. However, we believe that the second method (testing data from a different source) makes more sense, as the tests are more credible and more indicative of performance in real-life environments.

27 Comments: Advantages: To. Applications: Arabic text categorization.

