1 Automatic Chinese Text Categorization Feature Engineering and Comparison of Classification Approaches Yi-An Lin and Yu-Te Lin

2 Motivation
Text categorization (TC) has been studied extensively for English, but far less for Chinese.
How much does feature engineering help in Chinese?
Should Chinese text be segmented into words?
Which machine learning method works best for TC?
– Naïve Bayes, SVM, Decision Tree, k-Nearest Neighbor, MaxEnt, or language-model methods?

3 Outline
Data Preparation
Feature Selection
Feature Vector Encoding
Comparison of Classifiers
Feature Engineering
Comparison after Feature Engineering
Conclusion

4 Data Preparation
Tool: Yahoo News Crawler
Categories: Entertainment, Politics, Business, Sports

5 Feature Selection: statistics
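The statistic shown on this slide did not survive the transcript. A common choice for scoring terms against categories in TC is the χ² statistic; below is a minimal sketch assuming χ², using scikit-learn. The `docs` and `labels` values are illustrative placeholders, not the authors' data.

```python
# Minimal sketch: chi-square feature selection over a bag-of-terms matrix.
# Assumes chi-square is the statistic (the slide's formula is not in the
# transcript); `docs` and `labels` are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["職棒 比賽 開打", "總統 大選 辯論"]   # pre-tokenized placeholder documents
labels = ["Sports", "Politics"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep the terms that score highest against the category labels.
selector = SelectKBest(chi2, k=min(1000, X.shape[1]))
X_selected = selector.fit_transform(X, labels)

# The surviving terms resemble those on the next slide (職棒, 總統, ...).
top_terms = [t for t, keep in
             zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
```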

6 Top Features
Term      Meaning
職棒      pro baseball
總統      president
美元      USD
網賽      tennis tournament
馬英九    name (Ma Ying-jeou)
公開賽    open tournament
民進黨    DPP (Democratic Progressive Party)
中華      Chinese nation
油價      oil price
…         stocks
…         cup
經濟      economics
…         question mark
電影      movies
投資      investment
歌手      singer

7 Feature Vector Encoding
Binary: whether the document contains the word.
Count: number of occurrences.
TF: term frequency, the word's share of all occurrences in the document.
TF-IDF: TF weighted by inverse document frequency.
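A minimal sketch of the four encodings on a toy term-count matrix. The matrix values are illustrative, and the unsmoothed IDF formula is an assumption (the slides do not specify one).

```python
# Minimal sketch of the four encodings on a small term-count matrix.
# Values are illustrative; real experiments build counts from the corpus.
import numpy as np

counts = np.array([[3, 0, 1],      # term counts per document (rows = docs)
                   [0, 2, 2]], dtype=float)

binary = (counts > 0).astype(float)               # Binary: term present or not
count = counts                                    # Count: raw occurrences
tf = counts / counts.sum(axis=1, keepdims=True)   # TF: occurrence ratio per doc

# IDF: log of (number of docs / docs containing the term); the exact
# smoothing choice is an assumption here.
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log(n_docs / df)
tf_idf = tf * idf                                 # TF-IDF: TF weighted by IDF
```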

8 Comparison of different encodings
classifier   binary    count     TF        TF-IDF
NB           82.48%    80.90%    82.35%    82.13%
DT           80.28%    80.00%    79.43%    80.58%
kNN          72.85%    77.73%    77.18%    81.78%
SVM          83.55%    79.90%    80.60%    84.48%

9 Classifier Comparison Ⅰ

10 Classifier Comparison Ⅱ

11 Feature Engineering
Stop Terms: analogous to stop words in English (see the sketch after this slide).
Group Terms: terms sharing common substrings, grouped together.
Key Terms: distinctive terms that point to a category.
Key term intuition → possible category:
Related to ball games? → Sports
Related to politics? → Politics
Related to business trades? → Business
Related to drama? → Entertainment
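The slides name the three transformations but not the actual term lists, so the stop list below is assumed purely for illustration. A minimal sketch of the stop-term step:

```python
# Hedged sketch of the stop-term step: drop high-frequency, low-information
# terms before encoding. This stop list is illustrative, not the authors'.
STOP_TERMS = {"的", "是", "在", "了", "和"}   # assumed Chinese stop terms

def remove_stop_terms(tokens):
    """Filter out stop terms, analogous to English stop-word removal."""
    return [t for t in tokens if t not in STOP_TERMS]

print(remove_stop_terms(["總統", "的", "演說", "在", "台北"]))
# -> ['總統', '演說', '台北']
```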

12 Comparison of feature engineering methods
Legend: S = stop terms, G = group terms, K = key terms

13 Comparison after FE
Method        Accuracy
SVM seg       93.66%
SVM non-seg   91.65%
7-gram        93.70%
9-gram        94.20%
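A hedged sketch of how an n-gram language-model classifier over unsegmented Chinese can work: train one character n-gram model per category, then label a document with the category whose model assigns it the highest likelihood. The add-one smoothing and the `vocab_size` constant are assumptions; the slides report results for 7-grams and 9-grams but give no implementation details.

```python
# Hedged sketch of n-gram language-model classification over unsegmented
# Chinese text. Smoothing here is simple add-one; the slides' exact
# smoothing scheme is not given.
import math
from collections import Counter

N = 3  # the slides report 7-gram and 9-gram; 3 keeps this toy example small

def ngrams(text, n=N):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(docs_by_category):
    """Build one character n-gram count model per category."""
    models = {}
    for cat, docs in docs_by_category.items():
        counts = Counter(g for d in docs for g in ngrams(d))
        models[cat] = (counts, sum(counts.values()))
    return models

def classify(text, models, vocab_size=5000):  # vocab_size is assumed
    """Return the category whose model gives the highest log-likelihood."""
    best_cat, best_score = None, -math.inf
    for cat, (counts, total) in models.items():
        score = sum(math.log((counts[g] + 1) / (total + vocab_size))
                    for g in ngrams(text))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Usage (toy data):
# models = train({"Sports": ["職棒開幕戰今天開打"], "Politics": ["總統大選辯論會"]})
# classify("開幕戰開打", models)  # -> "Sports"
```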

14 Conclusion
The n-gram language model outperforms the other methods:
Language models consider all features, avoiding error-prone feature selection.
They make no restrictive independence assumption (unlike Naïve Bayes).
They have better smoothing.
Feature engineering also helps reduce sparsity, but may introduce ambiguity.
Semantic understanding could be the next direction for future research.

