Externally Enhanced Classifiers and Application in Web Page Classification Joint work with Chi-Feng Chang and Hsuan-Yu Chen Jyh-Jong Tsay National Chung.


1 Externally Enhanced Classifiers and Application in Web Page Classification Joint work with Chi-Feng Chang and Hsuan-Yu Chen Jyh-Jong Tsay National Chung Cheng University This research is supported in part by National Science Council, Taiwan, under.

2 Outline Introduction Externally Enhanced Classifiers Enhanced NB Topic Restriction Conclusion

3 Classification: Definition assignment of objects into a set of predefined categories (classes) classification of applicants into risk levels classification of web pages into topics classification of protein sequences into families topic-specific retrieval, information filter, recommendation, …

4 Classification: Task Input: a training set of examples, each labeled with one class label Output: a model (classifier) that assigns a class label to each instance based on the other attributes The model can be used to predict the class of new instances, for which the class label is missing or unknown

5 Train and Test example =instance + class label Examples are divided into training set + test set Classification model is built in two steps: training - build the model from the training set test - check the accuracy of the model using test set

6 Train and Test Kind of models: if - then rules decision trees joint probabilities decision surfaces Accuracy of models: the known class of test samples is matched against the class predicted by the model accuracy rate = % of test set samples correctly classified by the model
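The accuracy rate described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code; the function name `accuracy_rate` and the toy labels are assumptions made for the example.

```python
def accuracy_rate(true_labels, predicted_labels):
    """Percentage of test samples whose predicted class matches the known class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("test set is empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Toy test set (hypothetical): three of four predictions are correct.
known = ["High", "Low", "High", "Low"]
predicted = ["High", "Low", "Low", "Low"]
print(accuracy_rate(known, predicted))  # 75.0
```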

7 Training step training data Classification algorithm Classifier (model) if age < 31 or Car Type =Sports then Risk = High class label
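The if-then rule on this slide can be written directly as a tiny classifier. The slide only states when Risk is High; the "Low" fallback and the function name `classify_risk` are assumptions added so the sketch returns a label for every input.

```python
def classify_risk(age, car_type):
    """Apply the slide's rule: age < 31 or Car Type = Sports implies Risk = High."""
    if age < 31 or car_type == "Sports":
        return "High"
    # Assumption: the slide does not say what happens otherwise.
    return "Low"


print(classify_risk(25, "Family"))  # High (age below 31)
print(classify_risk(45, "Sports"))  # High (sports car)
print(classify_risk(45, "Family"))  # Low  (assumed fallback)
```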

8 Test step test data Classifier (model)

9 Classification (prediction) new data Classifier (model)

10 Classification Techniques Decision Tree Classification, Bayesian Classifiers, Hidden Markov Models (HMM), Neural Networks, Support Vector Machines (SVM), k-nearest neighbor classifiers (KNN), Genetic Algorithms, Rough Set Approach

11 Web Page Classification Automatically assign a document to a predefined category (topic). Topic-Specific Retrieval, Filter, Recommendation, …

12 External Annotations Use external annotations to enhance the classification of documents categorized in one topic hierarchy (source) into another one (target). Users browse to find information of interest; filtering and categorization are performed according to the users' interests, and information from other related categories is used to help the categorization. [Figure: source hierarchy S (www.yam.com) with topics S1 to S6 and target hierarchy T (www.openfind.com.tw) with topics T1 to T9; legend: topic (class), document]

13 Examples web directories Google, Yahoo, ProFusion, … domain-specific channels music, sports, … product catalogs expert annotations

14 Learning Approaches Internal learning: produces traditional classifiers from internal information; a large amount of internal information is available. External learning: produces external enhancers or reducers from external information, which is heterogeneous, sparse, and dynamic.

15 External Learning Probabilistic Enhancement: use a probabilistic enhancer to improve probabilistic classifiers (Naïve Bayes, Hidden Markov Models, …). Topic Restriction: cascade a reducer to reduce the set of candidates (KNN, SVM, Neural Nets, …).

16 Externally Enhanced Classifiers [Figure: an annotated instance is passed to the externally enhanced classifiers, which output a predicted class. Reducers (Topic Restriction): KNN, SVM. Enhancers (Probabilistic Enhancement): NB, HMM.]

17 Summary Traditional Classifiers (Yam.BusinessAndEconomics → Openfind.BusinessAndEconomics): Naïve Bayes: 55%, SVM: 57%. Enhanced Classifiers: Enhanced Naïve Bayes: 66%, Topic Restricted SVM: 67%

18 Proposed Approaches Probabilistic Enhancement, which uses class information to enhance probabilistic classifiers such as Naïve Bayes and HMM. Topic Restriction, which uses class information to restrict the set of candidate classes and can be used to extend any classifier, such as SVM and kNN.

19 Probabilistic Methods Probabilistic Classifier When external information is available, Probabilistic Enhancement

20 Estimation of P(v_t | s) straightforward estimation more robust estimation when
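The formulas for these two slides did not survive in the transcript, so the following is only a sketch under stated assumptions, not the authors' method. It assumes that v_t denotes a target class and s a source class, that the straightforward estimate is a relative frequency over documents labeled in both hierarchies, that a more robust estimate can be obtained with add-alpha smoothing, and that enhancement multiplies a classifier's posterior by the estimated P(v_t | s). All function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_target_given_source(pairs, target_classes, alpha=0.0):
    """Estimate P(target | source) from (source_class, target_class) pairs.

    alpha = 0 gives the relative-frequency estimate; alpha > 0 applies
    add-alpha smoothing (an assumed stand-in for a more robust estimate).
    """
    counts = defaultdict(Counter)
    for s, t in pairs:
        counts[s][t] += 1
    table = {}
    for s, c in counts.items():
        total = sum(c.values()) + alpha * len(target_classes)
        table[s] = {t: (c[t] + alpha) / total for t in target_classes}
    return table


def enhance(posterior, target_given_source, weight=1.0):
    """Rescale a posterior P(t | d) by P(t | s) ** weight and renormalize."""
    scores = {t: p * target_given_source.get(t, 0.0) ** weight
              for t, p in posterior.items()}
    z = sum(scores.values())
    if z == 0:
        return dict(posterior)  # no usable external evidence: keep the posterior
    return {t: v / z for t, v in scores.items()}


# Toy data (hypothetical): documents labeled in both hierarchies.
pairs = [("S1", "T1"), ("S1", "T1"), ("S1", "T2"), ("S2", "T2")]
table = estimate_target_given_source(pairs, ["T1", "T2"], alpha=1.0)

# A document from source class S1 that the base classifier slightly assigns to T2.
posterior = {"T1": 0.45, "T2": 0.55}
enhanced = enhance(posterior, table["S1"])
print(max(enhanced, key=enhanced.get))  # T1
```

In this toy case the source annotation flips the decision from T2 to T1, which is the kind of effect the enhancement is meant to have; the `weight` exponent is an extra knob introduced here for illustration only.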

21 NB-Based Methods (Agrawal and Srikant, 2001)

22 Data Sets Data set I source hierarchy: Yam target hierarchy: Openfind Data set II source hierarchy: Yam.BusinessAndEconomics target hierarchy: Openfind.BusinessAndEconomics Data set III source hierarchy: Google.Business target hierarchy: Yahoo.BusinessAndEconomics

23 Comparison of NB-Based Method

24 Class-Level Comparison

25 Topic Restriction (TR) TR uses class information to reduce the set of candidate classes, and can be used with any traditional classifier such as SVM and kNN. Static Topic Restriction: most source classes are related to a small number of target classes, so consider only those target classes that intersect the source class. Dynamic Topic Restriction: simple classifiers achieve a very high top-k measure for small k, so consider only the top k classes ranked by a simple classifier.
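Both forms of topic restriction can be sketched as a candidate filter placed in front of a stronger classifier. This is a minimal illustration assuming that classifier outputs are available as per-class score dictionaries; the names `static_candidates`, `dynamic_candidates`, `restricted_predict`, and the `overlap` mapping are hypothetical, and the fallback to the unrestricted scores is an added assumption.

```python
def static_candidates(source_class, overlap):
    """Target classes that share at least one document with the source class."""
    return set(overlap.get(source_class, ()))


def dynamic_candidates(simple_scores, k):
    """Top-k target classes as ranked by a simple (cheap) classifier."""
    ranked = sorted(simple_scores, key=simple_scores.get, reverse=True)
    return set(ranked[:k])


def restricted_predict(strong_scores, candidates):
    """Best class of the stronger classifier, considering candidates only."""
    allowed = {t: s for t, s in strong_scores.items() if t in candidates}
    if not allowed:
        allowed = strong_scores  # assumption: fall back to the unrestricted set
    return max(allowed, key=allowed.get)


# Toy scores (hypothetical) from a stronger classifier such as an SVM.
strong = {"T1": 0.2, "T2": 0.5, "T3": 0.3}

# Static restriction: source class S1 intersects only T1 and T3.
overlap = {"S1": {"T1", "T3"}}
print(restricted_predict(strong, static_candidates("S1", overlap)))  # T3

# Dynamic restriction: keep the top 2 classes of a simple classifier.
simple = {"T1": 0.6, "T2": 0.1, "T3": 0.3}
print(restricted_predict(strong, dynamic_candidates(simple, 2)))  # T3
```

Unrestricted, the stronger classifier would pick T2; with either restriction T2 is never considered, so the stronger classifier only has to separate the remaining candidates.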

26 Static Topic Restriction

27 Dynamic Topic Restriction Data Set II

28 Conclusion We propose probabilistic enhancement to enhance Naïve Bayes. We propose a topic restriction method to extend SVM. We carry out extensive experiments on text collections from Google and Yahoo, and from Openfind and Yam. The experiments show that our approaches significantly improve on traditional approaches.

29 Further Remarks Topic restriction is a general idea for cascading simpler classifiers, such as NB and linear classifiers, with more complicated classifiers, such as SVM and kNN. Cascading improves both the running time and the classification accuracy of SVM and kNN, especially when the number of topic classes is large. Further study on topic restriction is ongoing.

30 Cascaded SVM Web Directory Data (Openfind) SVM: 61.5%; Widrow-Hoff + SVM: 65.8%; Rocchio + SVM: 64.8%; Naïve Bayes + SVM: 60.7.3%; kNN: 63.8%; Rocchio + SVM: 64.6%; Naïve Bayes + SVM: 62.5%. Running time: SVM 05:54; Naïve Bayes + SVM 00:19; Naïve Bayes 00:05

31 Cascaded SVM CNA news collection SVM: 71.8%; Widrow-Hoff + SVM: 78.6%; Rocchio + SVM: 77.8%; Naïve Bayes + SVM: 76.3%. Running time: SVM 52:26; Naïve Bayes + SVM 06:55; Naïve Bayes 05:54

