Combining Unsupervised Feature Selection Strategy for Automatic Text Categorization
Ping-Tsun Chang
Intelligent Systems Laboratory, Computer Science and Information Engineering, National Taiwan University
Introduction
In recent research:
– Statistical and computational approaches to natural language understanding are limited
– The development of machine learning techniques has almost reached its bound
– Natural language is infinite and nonlinear!
Unsupervised Feature Selection
Text Categorization: Background Knowledge
Problem definition: text categorization is the problem of assigning an unknown label to each of a large number of documents, based on a large amount of text data.
Sensing → Segmentation → Feature Extraction → Classification → Post-Processing → Decision
Background Knowledge: Machine Learning
Using computers to help us perform induction over complex, large amounts of pattern data
– Bayesian Learning
– Instance-Based Learning
  – K-Nearest Neighbors
– Neural Networks
– Support Vector Machine
Background Knowledge: Feature Selection
– Information Gain
– Mutual Information
– Chi-Square
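The three feature selection measures listed above can be sketched from a 2x2 term/category contingency table. This is a minimal illustration using the common textbook definitions, not necessarily the exact formulas used in this work. The function names and the table layout (a = documents in the category containing the term, b = documents outside the category containing it, c = in the category without it, d = neither) are assumptions made for the example.

```python
import math

def contingency(docs, labels, term, category):
    """Count the 2x2 table (a, b, c, d) for one term and one category.

    docs is a list of token sets; labels is the parallel list of categories.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d

def chi_square(a, b, c, d):
    """Chi-square statistic of term/category dependence."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def mutual_information(a, b, c, d):
    """Pointwise mutual information between term presence and the category."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))

def entropy(counts):
    """Entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k)

def information_gain(a, b, c, d):
    """Reduction in category entropy from knowing whether the term occurs."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    return (entropy([a + c, b + d])
            - (a + b) / n * entropy([a, b])
            - (c + d) / n * entropy([c, d]))
```

A term that occurs in every document of one category and nowhere else gets the highest score under all three measures. A term spread evenly across categories scores zero.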
Bayesian Classifier
Recent research:
– Naïve Bayes classifiers are competitive with other techniques in accuracy
– Fast: training takes a single pass, and new documents are classified quickly
– ATHENA: EDBT 2000
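The "single pass" property can be illustrated with a small multinomial Naïve Bayes sketch. It assumes add-one (Laplace) smoothing and tokenized documents. The function names are hypothetical and this is not the ATHENA implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect class and term counts in one pass over the training documents."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify_naive_bayes(model, tokens):
    """Return the category with the highest log posterior for a new document."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, count in class_counts.items():
        total = sum(term_counts[label].values())
        score = math.log(count / n_docs)  # log prior
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                # add-one smoothing (an assumption of this sketch)
                score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

Training only increments counters, so the cost is linear in the size of the corpus. Classifying a new document costs one table lookup per token and category.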
Machine Learning Approaches: kNN Classifier
[Figure: a new document d whose category is unknown (?)]
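A kNN text classifier of the kind named on this slide might look like the sketch below. It assumes bag-of-words vectors, cosine similarity, and a similarity-weighted vote among the k nearest training documents. These choices and the function names are illustrative assumptions, not details taken from the slides.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def knn_classify(train, d, k=3):
    """Classify token list d by a similarity-weighted vote of its k nearest neighbors.

    train is a list of (tokens, label) pairs.
    """
    query = Counter(d)
    scored = sorted(
        ((cosine(query, Counter(tokens)), label) for tokens, label in train),
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]
```

Because no model is built in advance, all of the cost falls at classification time: every new document is compared against the full training set.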
Machine Learning Approaches: Support Vector Machine
– Basic hypotheses: consistent hypotheses of the version space
– Project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel operator K
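The projection from X to F through a kernel can be made concrete with a small example. It uses the degree-2 polynomial kernel on 2-dimensional inputs, chosen only for illustration, since the slides do not say which kernel was used. The kernel value computed in X equals an ordinary dot product in the 3-dimensional feature space F, so the mapping never has to be computed explicitly.

```python
import math

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated directly in the input space X."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map into F for the kernel above (2-D input only)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, 4.0)
# Both routes give the same inner product, up to floating-point rounding.
same = math.isclose(poly_kernel(x, z), dot(phi(x), phi(z)))
```

For kernels whose feature space is very high-dimensional, evaluating K directly is far cheaper than building phi(x), which is what makes the projection practical.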
What is Certainty?
– Rule for SVM
– Rule for kNN
Algorithm for Two-Stage Automatic Text Categorization
ALGORITHM Two-Stage-Text-Categorization (input: document d) returns category C
  Static:
    Trained classifier: Traditional-Classifier
    The feature set: F
    The new feature set from user feedback: U_i for the related category C_i
  For new document d
    C ← Traditional-Classifier(d)
    If NOT satisfying the rule of uncertainty
      Return C
    Else
      For all categories C_i
        If d has a feature in F
          C ← C_i
          Return C
        End If
      C_j ← User-Input
      U_j ← U_j + User-Selected
      C ← C_j
    End If
    Return C
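The algorithm above can be sketched in Python as follows. Several details are assumptions made for the example: the classifier returns a (category, score) pair, the uncertainty rule is passed in as a predicate because the slides do not spell it out here, and the feature test in the second stage is read as "d shares a feature with the user-feedback set U_i of category C_i". All parameter names are hypothetical.

```python
def two_stage_categorize(d, classifier, is_uncertain, user_features, ask_user):
    """Two-stage categorization of one document.

    d             -- set of tokens in the new document
    classifier    -- callable: d -> (category, score); the first-stage classifier
    is_uncertain  -- callable: score -> bool; stands in for the uncertainty rule
    user_features -- dict: category -> set of features gathered from user feedback
    ask_user      -- callable: d -> (category, selected_features)
    """
    # Stage 1: trust the traditional classifier when it is certain.
    category, score = classifier(d)
    if not is_uncertain(score):
        return category

    # Stage 2: look for a category whose feedback features appear in d.
    for c_i, u_i in user_features.items():
        if u_i & d:
            return c_i

    # No match: ask the user and remember the features they selected.
    c_j, selected = ask_user(d)
    user_features.setdefault(c_j, set()).update(selected)
    return c_j
```

Each user answer enlarges the feedback sets, so later uncertain documents with the same features can be resolved without asking again.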
Determining the Threshold of the Rule
Experiments
