
1 Using Support Vector Machine for Integrating Catalogs. Ping-Tsun Chang, Intelligent Systems Laboratory, NTU/CSIE

2 Integrating Catalogs
Key Insight
– Many data sources have their own categorization, and classification accuracy can be improved by factoring in the implicit information in these source categorizations.
A straightforward approach
– Formulate the catalog integration problem as a classification problem.

3 Related Research
Why Naïve Bayes?
– Naïve Bayes classifiers are competitive with other techniques in accuracy
– Fast: a single pass over the training data, and new documents are classified quickly
– ATHENA (EDBT 2000)
– On Integrating Catalogs (WWW10, 2001/5)
Classification
– Mail Agent
– SONIA (ACM Digital Library '98)

4 Problem Statement
A master catalog M with categories C_1, …, C_n and a set of documents in each category.
A source catalog N with categories S_1, …, S_m and another set of documents.
We need to find the category in M for each document in N.
[Figure: documents d_1, …, d_k filed under source categories S_1, …, S_m, to be mapped onto master categories C_1, …, C_n]
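
A minimal sketch of how the two catalogs and the integration task can be represented, assuming simple in-memory dictionaries; the names master_catalog, source_catalog, and integrate are illustrative, not from the paper:

```python
# Illustrative data structures for the catalog integration problem.
# Each catalog maps a category name to the documents filed under it.
master_catalog = {            # M with categories C_1 ... C_n
    "Computers": ["doc about SVMs", "doc about databases"],
    "Science": ["doc about physics"],
}
source_catalog = {            # N with categories S_1 ... S_m
    "Programming": ["doc about kernels in C"],
    "Popular Science": ["doc about quantum physics"],
}

def integrate(classify, source_catalog):
    """Assign every document of the source catalog to a master category.

    `classify` is any function mapping (document, source_category) to a
    category of M; the source category is passed along so the classifier
    can exploit the implicit information it carries.
    """
    assignment = {}
    for s_cat, docs in source_catalog.items():
        for doc in docs:
            assignment[doc] = classify(doc, s_cat)
    return assignment
```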

5 An Overview of Integrating Catalogs
Documents → Text Representation → Feature Extraction → Prediction Model → Document Classification Rule
Implicit information from the source catalogs serves as an additional input to the prediction model.

6 Naïve Bayes Classification
Basic algorithm: a document may be assigned to more than one category when both P(C_i|d) and P(C_j|d) have high values.
If all values P(C_i|d) for a document d are low, the document is kept aside for manual classification.
If, for some category S in N, a large fraction of its documents satisfies the previous condition, S may be a new category for M.
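
A minimal sketch of the multinomial Naïve Bayes decision rule described above; the thresholds high and low are illustrative assumptions, not values from the paper:

```python
import math
from collections import Counter

def train_nb(catalog):
    """catalog: dict mapping category name -> list of tokenised documents."""
    total = sum(len(docs) for docs in catalog.values())
    prior = {c: len(docs) / total for c, docs in catalog.items()}
    counts = {c: Counter(t for d in docs for t in d) for c, docs in catalog.items()}
    vocab = {t for c in counts for t in counts[c]}
    return prior, counts, vocab

def nb_posteriors(doc, prior, counts, vocab):
    """P(C|d) for every category C, with Laplace smoothing."""
    log_p = {}
    for c in prior:
        n = sum(counts[c].values())
        log_p[c] = math.log(prior[c]) + sum(
            math.log((counts[c][t] + 1) / (n + len(vocab))) for t in doc)
    m = max(log_p.values())
    z = sum(math.exp(v - m) for v in log_p.values())
    return {c: math.exp(v - m) / z for c, v in log_p.items()}

def classify(doc, prior, counts, vocab, high=0.4, low=0.2):
    """Assign every category with a high posterior; flag the document for
    manual classification when all posteriors are low."""
    p = nb_posteriors(doc, prior, counts, vocab)
    if max(p.values()) < low:
        return None                     # keep aside for manual classification
    return [c for c, v in p.items() if v >= high] or [max(p, key=p.get)]
```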

7 Google vs. Yahoo! Classification Accuracy

8 Support Vector Machine
Minimize \(\tfrac{1}{2}\|w\|^2 + C\sum_{i} \xi_i\)
Subject to \(y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, l\)
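
For context, a brief sketch of the standard SVM dual (not shown on the slide): the training data enter only through inner products, which is what allows the kernel substitutions used on the following slides.

```latex
% Standard SVM dual; K(x_i, x_j) = x_i \cdot x_j in the linear case and may
% be replaced by any valid kernel, e.g. the text and catalog kernels below.
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{l} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
    \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j) \\
\text{subject to}\quad & \sum_{i=1}^{l} \alpha_i y_i = 0,
  \qquad 0 \le \alpha_i \le C,\ \ i = 1, \dots, l.
\end{aligned}
```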

9 Multi-Class Classification
SVM is a binary classification technique; extending it to multi-class problems is ongoing research.
In experiments, one-against-one is better than other approaches (cjlin, 2001); a voting sketch is given below.
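
A minimal sketch of one-against-one voting, assuming a generic binary trainer train_binary(X, y) that returns a function producing +1/−1 predictions; the helper names are illustrative:

```python
from itertools import combinations
from collections import Counter

def train_one_vs_one(X, labels, train_binary):
    """Train one binary SVM per pair of classes (one-against-one)."""
    classes = sorted(set(labels))
    models = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        Xp = [X[i] for i in idx]
        yp = [+1 if labels[i] == a else -1 for i in idx]
        models[(a, b)] = train_binary(Xp, yp)
    return models

def predict_one_vs_one(x, models):
    """Each pairwise model votes for one class; the majority wins."""
    votes = Counter()
    for (a, b), clf in models.items():
        votes[a if clf(x) > 0 else b] += 1
    return votes.most_common(1)[0][0]
```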

10 Using SVM for Text Classification
TF·IDF (Term Frequency × Inverse Document Frequency): tfidf(t, d) = tf(t, d) · log(|D| / df(t))
Each document is mapped to φ(d) = k · (tfidf(t_1, d), …, tfidf(t_V, d)), where k is a normalization constant ensuring that ||φ(d)|| = 1.
The function K(d_1, d_2) = φ(d_1) · φ(d_2) is clearly a valid kernel, since it is the inner product in an explicitly constructed feature space.
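
A minimal sketch of the TF·IDF feature map and the resulting document kernel, assuming a plain bag-of-words tokenisation; the function names are illustrative:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one L2-normalised TF.IDF dict per document."""
    df = Counter(t for doc in docs for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        v = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({t: w / norm for t, w in v.items()})   # ||phi(d)|| = 1
    return vectors

def text_kernel(v1, v2):
    """K(d1, d2) = phi(d1) . phi(d2): inner product of sparse TF.IDF vectors."""
    if len(v2) < len(v1):
        v1, v2 = v2, v1
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())
```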

11 Using SVM for Integrating Catalogs
New Kernel Function
Increasing β gives the source catalog S more influence on classification into M.
More widely separated examples let the SVM find the maximum-margin hyperplane more easily.

12 New Kernel Function for Integrating Catalogs
Orthogonal: we can treat the feature spaces of the two catalogs as orthogonal. Under this treatment, when β = 0 the kernel function is the same as a standard classifier without source catalog information.
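
One way to realise such a kernel, shown purely as an illustrative sketch based on the orthogonality description above and not as the paper's exact formula (β handling, the indicator term, and all names are assumptions): append a β-scaled indicator of the source category in a subspace orthogonal to the text features, so the combined kernel is K_text plus a bonus when two documents share a source category, and β = 0 recovers the standard text kernel.

```python
def catalog_kernel(x1, x2, beta, text_kernel):
    """Illustrative combined kernel for catalog integration.

    x1, x2 are (document_vector, source_category) pairs, where
    source_category is None for documents that carry no source information.
    The source-category indicator lives in a subspace orthogonal to the
    text features, so beta = 0 recovers the standard text kernel exactly.
    """
    (d1, s1), (d2, s2) = x1, x2
    same_source = s1 is not None and s1 == s2
    return text_kernel(d1, d2) + (beta ** 2 if same_source else 0.0)
```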

13 EXPERIMENTAL RESULTS
Train: Books.com.tw, Test: Commonwealth. Accuracy (%)

Dataset            Naïve Bayes   SVM     Our     Improve
Finance&Business   53.45         56.12   64.11   14.24 %
Computers          57.68         57.80   65.50   13.32 %
Science            51.12         57.78   62.00    7.30 %
Literature         37.39         40.13   53.78   34.01 %
Psychology         47.66         54.44   59.96   10.14 %
Average            49.46         53.25   61.07   15.80 %

14 EXPERIMENTAL RESULTS
Train: Commonwealth, Test: Books.com.tw. Accuracy (%)

Dataset            Naïve Bayes   SVM     Our     Improve
Finance&Business   42.77         49.74   53.12    6.80 %
Computers          45.60         48.25   55.54   15.11 %
Science            41.14         44.60   47.72    6.99 %
Literature         30.99         37.21   42.21   12.44 %
Psychology         40.05         40.11   43.35    8.08 %
Average            40.11         43.98   48.39   10.08 %

15 Conclusions
SVM is very useful for the problem of integrating catalogs of text documents. Traditionally, SVM is a classification tool; in this paper, we use SVM with a novel kernel function suited to this problem. The experiments here serve as a promising start for the use of SVM on this problem.
Future Work: performance could also be improved by incorporating another kernel function and proving it valid, or by combining structural information of the text documents.

