Boosting the Feature Space: Text Classification for Unstructured Data on the Web
Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)
Advisor: Dr. Hsu
Presenter: Yu-San Hsieh
Authors: Yang Song, Ding Zhou, Jian Huang, Isaac G. Councill, Hongyuan Zha, C. Lee Giles. ICDM 2006, pp. 1-5.
Outline
─ Motivation
─ Objective
─ Method
─ Experiments
─ Conclusions
Motivation
Encoding documents with the bag-of-words model usually produces sparse feature spaces of large dimensionality, which hurts classification performance.

Document set:
D1 = "This paper describes our attempt to unify entity extraction and collaborative filtering to boost the feature space."
D2 = "Text classification for unstructured data on the web"

Term set:
T = {this, paper, describes, our, attempt, to, unify, entity, extraction, and, collaborative, filtering, boost, the, feature, space, text, classification, for, unstructured, data, on, web}

Bag-of-words vectors (23 dimensions, one per term in T):
D1 = (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0)
D2 = (0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,1)
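To make the encoding concrete, here is a minimal sketch of the binary bag-of-words step in Python. The tokenizer (a lowercase split with simple punctuation stripping) and the vocabulary ordering are illustrative assumptions, not the paper's implementation.

```python
def build_vocabulary(documents):
    """Collect unique terms across all documents, in order of first appearance."""
    vocab, seen = [], set()
    for doc in documents:
        for term in doc.lower().split():
            term = term.strip(".,")
            if term not in seen:
                seen.add(term)
                vocab.append(term)
    return vocab

def encode(document, vocab):
    """Binary bag-of-words vector: 1 if the term occurs in the document, else 0."""
    terms = {t.strip(".,") for t in document.lower().split()}
    return [1 if term in terms else 0 for term in vocab]

d1 = ("This paper describes our attempt to unify entity extraction "
      "and collaborative filtering to boost the feature space.")
d2 = "Text classification for unstructured data on the web"

vocab = build_vocabulary([d1, d2])
print(len(vocab))         # 23 terms, matching T above
print(encode(d1, vocab))  # sixteen 1s followed by seven 0s
print(encode(d2, vocab))  # mostly 0s: only "the" and the seven terms unique to D2
```

Even in this two-document toy example, D2 uses only 8 of the 23 dimensions; across a real corpus the vectors become far sparser and longer, which is the problem the paper addresses.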
Objective
We propose a classification approach that combines traditional feature-reduction techniques with a collaborative filtering method to augment document feature spaces.
Method: Entity Extraction
Entity extraction uses an SVM decision tree. The Brill tagger first assigns POS tags; an SVM-based NP-chunker then assigns chunk labels (e.g., BI = 0.5, OB = -0.6) to identify noun phrases as candidate entity features.

Example input:
D1 = "This paper describes our attempt to unify entity extraction and collaborative filtering to boost the feature space."
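A minimal sketch of the POS-tagging plus NP-chunking step is shown below. The paper uses the Brill tagger and an SVM-based chunker; as stand-ins, this uses NLTK's default POS tagger and a regular-expression chunker, so the grammar here is an illustrative assumption rather than the paper's model.

```python
import nltk

# Resources for NLTK's default tokenizer and tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Simple NP grammar: optional determiner, any adjectives, one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

sentence = ("This paper describes our attempt to unify entity extraction "
            "and collaborative filtering to boost the feature space.")

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # Penn Treebank POS tags
tree = chunker.parse(tagged)    # group tagged tokens into NP chunks

# Each extracted noun phrase becomes a candidate entity feature.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```

On D1 this prints multi-word phrases such as "entity extraction" and "collaborative filtering", which are exactly the kind of compound features that a plain bag-of-words split loses.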
Method: Feature Space Refinement
─ Normalize the feature space
─ Compute the weight matrix
─ Select candidates

Example weight computation for document 2, feature 3, over four documents:
W(2,3) = H(F_23, ε_2)·H(F_13, ε_1)·sim(a,2,1)
       + H(F_23, ε_2)·H(F_23, ε_2)·sim(a,2,2)
       + H(F_23, ε_2)·H(F_33, ε_3)·sim(a,2,3)
       + H(F_23, ε_2)·H(F_43, ε_4)·sim(a,2,4)
That is, W(i,j) = Σ_k H(F_ij, ε_i)·H(F_kj, ε_k)·sim(a,i,k), summed over all documents k.
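A minimal sketch of this weight-matrix step follows, under these assumptions: F is the normalized documents-by-features matrix, H(·, ε) is a step function that keeps a feature only when its value exceeds the document's threshold, and sim(a,i,k) is cosine similarity between documents i and k. These names are hypothetical stand-ins for the paper's formal definitions.

```python
import numpy as np

def H(value, eps):
    """Step function: 1 if the feature value exceeds the threshold, else 0."""
    return 1.0 if value > eps else 0.0

def sim(F, i, k):
    """Cosine similarity between documents i and k (a stand-in for sim(a,i,k))."""
    denom = np.linalg.norm(F[i]) * np.linalg.norm(F[k])
    return float(F[i] @ F[k]) / denom if denom else 0.0

def weight(F, eps, i, j):
    """W(i,j): evidence from every document k that feature j belongs to document i."""
    return sum(H(F[i, j], eps[i]) * H(F[k, j], eps[k]) * sim(F, i, k)
               for k in range(F.shape[0]))

# Toy 4-documents x 5-features matrix with one threshold per document.
F = np.random.rand(4, 5)
eps = np.full(4, 0.5)
print(weight(F, eps, 1, 2))  # W(2,3) in the slide's 1-based notation
```

The collaborative-filtering intuition is that a feature absent from document i still receives weight if similar documents contain it, which is how the feature space gets "boosted".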
Experiments
Data sets:
─ CiteSeer Digital Library
─ WebKB benchmark corpus

Comparisons:
─ Entity extraction techniques
─ Distribution of features
─ SVM and AdaBoost.MH classifiers

Number of features extracted on CiteSeer:

Examples    Bag-of-words features    SVM-decision-tree features
10,000      25,031                   7,000
50,000      118,058                  20,000

Information Gain reduces the feature space to about half the dimensionality of bag-of-words, even at 118,058 features.
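A sketch of the classifier comparison on bag-of-words features is shown below. The paper compares SVM against AdaBoost.MH; scikit-learn does not ship AdaBoost.MH, so a standard AdaBoostClassifier stands in here, and the 20 Newsgroups corpus stands in for CiteSeer/WebKB. All of these substitutions are assumptions for illustration only.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train",
                          categories=["sci.med", "comp.graphics"])
X = CountVectorizer(binary=True).fit_transform(data.data)  # binary bag-of-words
y = data.target

for name, clf in [("SVM", LinearSVC()),
                  ("AdaBoost", AdaBoostClassifier(n_estimators=100))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```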
Experiments
[Figure: number of extracted features per feature category for I-CF, A-CF, and B-CF; I-CF yields more than four times as many features as A-CF.]
Experiments: WebKB
WebKB (World Wide Knowledge Base)
─ Web pages collected from the CS departments of many universities
─ Measures the number of features with respect to the training data size
─ Our approach generally outperforms IG: when the number of training pages is small the gap is modest, and the advantage grows as the number of pages increases.
Conclusions
Experiments show that our approach improves classification accuracy over the traditional method for both SVM and AdaBoost classifiers.

The major contributions of this work:
─ An evaluation of the use of collaborative filtering for refining minimal, noisy feature spaces
─ A comparison of the performance of SVM versus AdaBoost classifiers
My Opinion
Advantage
─ ……
Drawback
─ ……
Applications
─ Clustering
─ Classification