Boosting the Feature Space: Text Classification for Unstructured Data on the Web
Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)
Advisor: Dr. Hsu
Presenter: Yu-San Hsieh
Authors: Yang Song, Ding Zhou, Jian Huang, Isaac G. Councill, Hongyuan Zha, C. Lee Giles. ICDM 2006, pp. 1-5.
Outline
─ Motivation
─ Objective
─ Method
─ Experiments
─ Conclusions
Motivation
Encoding documents with the bag-of-words model usually produces sparse feature spaces of large dimensionality, which hurts classification performance.

Document set:
D1 = "This paper describes our attempt to unify entity extraction and collaborative filtering to boost the feature space."
D2 = "Text classification for unstructured data on the web"

Term set:
T = {this, paper, describes, our, attempt, to, unify, entity, extraction, and, collaborative, filtering, boost, the, feature, space, text, classification, for, unstructured, data, on, web}

Bag-of-words vectors (23 dimensions, one per term in T):
D1 = (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0)
D2 = (0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,1)
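To make the encoding concrete, here is a minimal sketch of the binary bag-of-words step in Python. The tokenizer (a lowercase split with simple punctuation stripping) and the vocabulary ordering are illustrative assumptions, not the paper's implementation.

```python
def build_vocabulary(documents):
    """Collect unique terms across all documents, in order of first appearance."""
    vocab, seen = [], set()
    for doc in documents:
        for term in doc.lower().split():
            term = term.strip(".,")
            if term not in seen:
                seen.add(term)
                vocab.append(term)
    return vocab

def encode(document, vocab):
    """Binary bag-of-words vector: 1 if the term occurs in the document, else 0."""
    terms = {t.strip(".,") for t in document.lower().split()}
    return [1 if term in terms else 0 for term in vocab]

d1 = ("This paper describes our attempt to unify entity extraction "
      "and collaborative filtering to boost the feature space.")
d2 = "Text classification for unstructured data on the web"

vocab = build_vocabulary([d1, d2])
print(len(vocab))         # 23 terms, matching T above
print(encode(d1, vocab))  # sixteen 1s followed by seven 0s
print(encode(d2, vocab))  # mostly 0s: only "the" and the seven terms unique to D2
```

Even in this two-document toy example, D2 uses only 8 of the 23 dimensions; across a real corpus the vectors become far sparser and longer, which is the problem the paper addresses.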
Objective
We propose a classification approach that combines traditional feature-reduction techniques with a collaborative filtering method to augment document feature spaces.
Method: Entity Extraction
Entity extraction uses an SVM decision tree. The Brill tagger first assigns POS tags; an SVM-based NP-chunker then assigns chunk labels (e.g., BI = 0.5, OB = -0.6) to identify noun phrases as candidate entity features.

Example input:
D1 = "This paper describes our attempt to unify entity extraction and collaborative filtering to boost the feature space."
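A minimal sketch of the POS-tagging plus NP-chunking step is shown below. The paper uses the Brill tagger and an SVM-based chunker; as stand-ins, this uses NLTK's default POS tagger and a regular-expression chunker, so the grammar here is an illustrative assumption rather than the paper's model.

```python
import nltk

# Resources for NLTK's default tokenizer and tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Simple NP grammar: optional determiner, any adjectives, one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

sentence = ("This paper describes our attempt to unify entity extraction "
            "and collaborative filtering to boost the feature space.")

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # Penn Treebank POS tags
tree = chunker.parse(tagged)    # group tagged tokens into NP chunks

# Each extracted noun phrase becomes a candidate entity feature.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```

On D1 this prints multi-word phrases such as "entity extraction" and "collaborative filtering", which are exactly the kind of compound features that a plain bag-of-words split loses.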
Method: Feature Space Refinement
─ Normalize the feature space
─ Compute the weight matrix
─ Select candidates

Example weight computation for document 2, feature 3, over four documents:
W(2,3) = H(F_23, ε_2)·H(F_13, ε_1)·sim(a,2,1)
       + H(F_23, ε_2)·H(F_23, ε_2)·sim(a,2,2)
       + H(F_23, ε_2)·H(F_33, ε_3)·sim(a,2,3)
       + H(F_23, ε_2)·H(F_43, ε_4)·sim(a,2,4)
That is, W(i,j) = Σ_k H(F_ij, ε_i)·H(F_kj, ε_k)·sim(a,i,k), summed over all documents k.
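A minimal sketch of this weight-matrix step follows, under these assumptions: F is the normalized documents-by-features matrix, H(·, ε) is a step function that keeps a feature only when its value exceeds the document's threshold, and sim(a,i,k) is cosine similarity between documents i and k. These names are hypothetical stand-ins for the paper's formal definitions.

```python
import numpy as np

def H(value, eps):
    """Step function: 1 if the feature value exceeds the threshold, else 0."""
    return 1.0 if value > eps else 0.0

def sim(F, i, k):
    """Cosine similarity between documents i and k (a stand-in for sim(a,i,k))."""
    denom = np.linalg.norm(F[i]) * np.linalg.norm(F[k])
    return float(F[i] @ F[k]) / denom if denom else 0.0

def weight(F, eps, i, j):
    """W(i,j): evidence from every document k that feature j belongs to document i."""
    return sum(H(F[i, j], eps[i]) * H(F[k, j], eps[k]) * sim(F, i, k)
               for k in range(F.shape[0]))

# Toy 4-documents x 5-features matrix with one threshold per document.
F = np.random.rand(4, 5)
eps = np.full(4, 0.5)
print(weight(F, eps, 1, 2))  # W(2,3) in the slide's 1-based notation
```

The collaborative-filtering intuition is that a feature absent from document i still receives weight if similar documents contain it, which is how the feature space gets "boosted".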
Experiments
Data sets:
─ CiteSeer Digital Library
─ WebKB benchmark corpus

Comparisons:
─ Entity extraction techniques
─ Distribution of features
─ SVM and AdaBoost.MH classifiers

Number of features extracted on CiteSeer:

Examples    Bag-of-words features    SVM-decision-tree features
10,000      25,031                   7,000
50,000      118,058                  20,000

Information Gain reduces the feature space to about half the dimensionality of bag-of-words, even at 118,058 features.
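A sketch of the classifier comparison on bag-of-words features is shown below. The paper compares SVM against AdaBoost.MH; scikit-learn does not ship AdaBoost.MH, so a standard AdaBoostClassifier stands in here, and the 20 Newsgroups corpus stands in for CiteSeer/WebKB. All of these substitutions are assumptions for illustration only.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train",
                          categories=["sci.med", "comp.graphics"])
X = CountVectorizer(binary=True).fit_transform(data.data)  # binary bag-of-words
y = data.target

for name, clf in [("SVM", LinearSVC()),
                  ("AdaBoost", AdaBoostClassifier(n_estimators=100))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```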
Experiments
[Figure: number of extracted features per feature category for I-CF, A-CF, and B-CF; I-CF yields more than four times as many features as A-CF.]
Experiments: WebKB
WebKB (World Wide Knowledge Base)
─ Web pages collected from the CS departments of many universities
─ Measures the number of features with respect to the training data size
─ Our approach generally outperforms IG: when the number of training pages is small the gap is modest, and the advantage grows as the number of pages increases.
Conclusions
Experiments show that our approach improves classification accuracy over the traditional method for both SVM and AdaBoost classifiers.

The major contributions of this work:
─ An evaluation of the use of collaborative filtering for refining minimal, noisy feature spaces
─ A comparison of the performance of SVM versus AdaBoost classifiers
My Opinion
Advantage
─ ……
Drawback
─ ……
Applications
─ Clustering
─ Classification