Published by Barrie Allen. Modified over 6 years ago.
1
Arabic Text Categorization Based on Arabic Wikipedia
Presenter: CHANG, SHIH-JIE
Authors: ADNAN YAHYA and ALI SALHI
ACM TALIP
2
Outline
Motivation
Objectives
Methodology
Experiments
Conclusions
Comments
3
Motivation
Arabic text categorization is challenging because certain subcategories are correlated and the main categories overlap. EX:
4
Objectives
To solve this, we use a categorization algorithm and further adopt two approaches.
5
CATEGORIZATION CORPORA - Training Data
Related Tags Approach
7
Testing Data
10 categories with 40 documents in each category
8
Methodology - PREPROCESSING TECHNIQUES
Root Extraction (RE)
Light Stemming (LS)
Special Expressions Extraction
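The slides do not show how light stemming is carried out. A minimal sketch, assuming a Light10-style affix list and a minimum remaining length of 3 characters (both are assumptions for illustration, not the authors' tool; `light_stem`, `PREFIXES` and `SUFFIXES` are hypothetical names):

```python
# Simplified light stemmer: strip one common prefix, then trailing suffixes.
# Affix lists follow the commonly cited Light10 set; the minimum-length rule
# is simplified here and is an assumption.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word, min_len=3):
    """Return a light stem of an Arabic word (illustrative only)."""
    # Remove at most one prefix, longest candidates listed first.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    # Remove suffixes repeatedly while the remainder stays long enough.
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[:-len(suffix)]
                changed = True
                break
    return word


print(light_stem("الكتاب"))
print(light_stem("المكتبات"))
```

Root extraction goes further than this and maps a word to its root letters, which needs morphological patterns that this sketch does not attempt.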
9
Methodology - CATEGORIZATION PROCESS
Categorize the input text in two phases.
Phase one: We categorize the text into one of the main categories.
Phase two: We further categorize the input text based on subcategories.
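The two-phase process can be sketched as follows. The scoring function here (a plain count of tokens found in a category's word set) is a stand-in chosen for illustration, not the scoring used in the paper, and the category names and word sets are made up:

```python
from collections import Counter


def overlap_score(tokens, profile_words):
    """Count tokens of the text that occur in a category's word set."""
    counts = Counter(tokens)
    return sum(n for word, n in counts.items() if word in profile_words)


def best_category(tokens, profiles):
    """Return the category name whose word set scores highest."""
    return max(profiles, key=lambda name: overlap_score(tokens, profiles[name]))


def categorize_two_phase(tokens, main_profiles, sub_profiles):
    # Phase one: pick one main category.
    main = best_category(tokens, main_profiles)
    # Phase two: pick a subcategory among those of the chosen main category.
    sub = best_category(tokens, sub_profiles[main])
    return main, sub


# Hypothetical toy profiles (English placeholders for readability).
main_profiles = {
    "sports": {"match", "team", "goal"},
    "economy": {"bank", "market", "price"},
}
sub_profiles = {
    "sports": {"football": {"goal", "penalty"}, "tennis": {"racket", "serve"}},
    "economy": {"banking": {"bank", "loan"}, "trade": {"market", "export"}},
}

print(categorize_two_phase(["team", "goal", "penalty", "match"],
                           main_profiles, sub_profiles))
```

A mistake in phase one cannot be corrected in phase two in this form, since only the chosen main category's subcategories are searched.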
11
Methodology - Basic Categorization Algorithm (BCA)
12
Methodology - Percentage and Difference Categorization (PDC) Algorithm
has frequency 7 in the 300-word
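As a worked example of the numbers in this fragment, and assuming the "percentage" in PDC means a word's share of all words in the text (an assumption, since the rest of the slide is not preserved), a word with frequency 7 in a 300-word text has a percentage of about 2.33:

```python
def relative_frequency_percent(word_count, total_words):
    """Share of the text taken up by one word, as a percentage."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * word_count / total_words


pct = relative_frequency_percent(7, 300)
print(f"{pct:.2f}%")  # 2.33%
```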
13
Methodology - Percentage and Difference Categorization (PDC) Algorithm
The category with the highest sum of flag values is considered to be the best match for the input text.
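The selection rule above can be sketched as a sum followed by an argmax. How the individual flag values are derived is not shown on the slide, so they are taken as given inputs here; the function name and the sample values are hypothetical:

```python
def best_match(flags_by_category):
    """Return the category with the highest sum of flag values, plus all sums."""
    totals = {cat: sum(flags) for cat, flags in flags_by_category.items()}
    winner = max(totals, key=totals.get)
    return winner, totals


# Hypothetical flag values, one per matched word in each category.
flags = {
    "sports": [1, 0.5, 1],
    "economy": [0.5, 0, 0.5],
}
winner, totals = best_match(flags)
print(winner, totals)
```

Ties are resolved by dictionary order in this sketch; the slides do not say how ties are handled.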
14
Methodology – PDC Algorithm vs. BCA Algorithm
15
Methodology – Enhancing Main/Subcategories Grouping
Problem: The possible high correlation between subcategories of different main categories
(1) Overlapping Main Categories for Phase Two
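One possible reading of approach (1), stated here as an assumption because the slide gives only the title, is that phase two considers the subcategories of several top-scoring main categories instead of a single one. A sketch under that reading (`candidate_subcategories` and the parameter `k` are hypothetical):

```python
def candidate_subcategories(main_scores, subcategories, k=2):
    """Pool the subcategories of the k best-scoring main categories."""
    top_mains = sorted(main_scores, key=main_scores.get, reverse=True)[:k]
    return [(main, sub) for main in top_mains for sub in subcategories[main]]


# Hypothetical phase-one scores and subcategory lists.
main_scores = {"sports": 5.0, "health": 4.5, "economy": 1.0}
subcategories = {
    "sports": ["football", "fitness"],
    "health": ["nutrition", "fitness"],
    "economy": ["banking"],
}
print(candidate_subcategories(main_scores, subcategories))
```

With this pooling, a text about fitness can still reach the right subcategory when phase one narrowly prefers the wrong main category.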
16
Methodology – Enhancing Main/Subcategories Grouping
(2) Replacing Main Categories by Groups of Related Categories
17
Methodology – Enhancing Main/Subcategories Grouping
18
Methodology - Word Filtration Techniques within Categories
19
Methodology - The result of applying the three techniques
20
Modified PDC with N Scales
1, 0.5
Define a scaling of 1, 0.75, 0.5, 0.25
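Both value lists on this slide are evenly spaced steps down from 1. Assuming that an N-scale means N such values (an assumption that matches the two lists shown but is not stated on the slide), they can be generated as:

```python
def scale_values(n):
    """Return n evenly spaced values from 1 down to 1/n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [k / n for k in range(n, 0, -1)]


print(scale_values(2))  # [1.0, 0.5]
print(scale_values(4))  # [1.0, 0.75, 0.5, 0.25]
```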
21
Further Testing on the PDC Algorithm
Tool: Root Extraction
Tool: Light Stemming & Light10
Tool: Double Words
Tool: Expressions Extraction
22
Using Testing Data from the Reference Categories
23
Training Data Characteristics
24
COMPARISON WITH RELATED WORK
25
Using Testing Data from the Reference Categories
26
Conclusions
The first method is to use training and testing data from the same source by splitting the corpus into test and training components. This consistently gives better results. However, we believe that the second method (a different source) makes more sense, as the tests will be more credible and more indicative of performance in real-life environments.
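The first method, splitting one corpus into training and test parts, might look like the sketch below. The test fraction, the seed and the per-category split are assumptions for illustration; the slides do not give the split used:

```python
import random


def split_corpus(docs_by_category, test_fraction=0.25, seed=0):
    """Split each category's documents into disjoint train and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = {}, {}
    for category, docs in docs_by_category.items():
        shuffled = list(docs)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test[category] = shuffled[:n_test]
        train[category] = shuffled[n_test:]
    return train, test


# Hypothetical corpus: 2 categories with 40 document ids each.
corpus = {
    "sports": [f"sports_{i}" for i in range(40)],
    "economy": [f"economy_{i}" for i in range(40)],
}
train, test = split_corpus(corpus)
print(len(train["sports"]), len(test["sports"]))
```

For the second method no split is needed: the whole corpus is used for training and the test documents are collected from a separate source.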
27
Comments
Advantages: To.
Applications: Arabic Text Categorization.