Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
Support Vector Machines Classification with A Very Large-scale Taxonomy
Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma
SIGKDD, 2004
Advisor: Dr. Hsu
Presenter: Chien-Shing Chen
Outline
- Motivation
- Objective
- Introduction
- Dataset characteristics
- Complexity Analysis
- Effectiveness Analysis
- Experimental Settings
- Conclusions
- Personal Opinion
Motivation
- Very large-scale classification taxonomies: hundreds of thousands of categories, deep hierarchies, and a skewed category distribution over documents
- It is an open question whether state-of-the-art text-categorization technologies scale to such taxonomies
- This work evaluates SVMs for web-page classification over the full taxonomy of the Yahoo! categories
Objective
Study scalability and effectiveness through:
1. A data analysis of the Yahoo! taxonomy
2. Development of a scalable system for large-scale text categorization
3. Theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification
4. Threshold-tuning algorithms, examined with respect to the time complexity and accuracy of SVMs
Introduction
- Text categorization (TC) methods: SVMs, kNN, Naive Bayes, ...
- In recent years, TC problems have become larger and larger in scale
- This work asks whether current methods can cope, answering from the viewpoints of scalability and effectiveness
SVM
- Two settings are compared: flat SVMs and hierarchical SVMs
- Hierarchical SVMs exploit the structure of the taxonomy tree
SVM
- Find the optimal separating hyperplane between the two classes by maximizing the margin between the classes' closest points
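The margin idea above can be made concrete with a small numeric sketch. The function name, hyperplane, and points below are illustrative, not from the paper: for a fixed hyperplane (w, b), the geometric margin is the smallest distance |w·x + b| / ||w|| over the training points.

```python
import math

def geometric_margin(w, b, points):
    """Smallest distance from any point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x in points) / norm

# Toy 2-D data: the hyperplane x1 + x2 - 3 = 0 separates the two classes.
pos = [(3.0, 2.0), (4.0, 4.0)]   # class +1
neg = [(0.0, 0.0), (1.0, 1.0)]   # class -1
margin = geometric_margin((1.0, 1.0), -3.0, pos + neg)
# margin == 1/sqrt(2): the closest point is (1, 1), at distance |1+1-3|/sqrt(2)
```

The SVM training problem searches over all (w, b) for the hyperplane that makes this quantity as large as possible while still separating the classes.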
SVM
- Multi-class classification: basically, SVMs can only solve binary classification problems, so a multi-class problem is decomposed into binary sub-classifiers
- one-against-all: N two-class classifiers (true class vs. false class)
- one-against-one: N(N-1)/2 classifiers, one per pair of classes
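The classifier counts for the two decomposition strategies follow directly from the slide; a minimal sketch (the function names are mine):

```python
def one_vs_rest_count(n_classes):
    # one binary SVM per class: that class vs. all the others
    return n_classes

def one_vs_one_count(n_classes):
    # one binary SVM per unordered pair of classes
    return n_classes * (n_classes - 1) // 2

# For N = 4 classes: 4 one-vs-rest models, but 6 one-vs-one models.
```

The quadratic growth of one-against-one is why, at the scale of the Yahoo! Directory, only the one-against-rest strategy is practical.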
SVM
- Let {(x_1, y_1), ..., (x_n, y_n)} with y_i ∈ {-1, +1} be a set of n labeled training documents
- Linear discriminant function: f(x) = w · x + b
- Corresponding classification function: h(x) = sign(f(x))
- Margin of a weight vector w: the minimum distance from the training documents to the hyperplane f(x) = 0
SVM
- Optimal separation via the soft-margin formulation: minimize (1/2)||w||² + C Σ_i ξ_i subject to y_i(w · x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0
- The binary formulation is then extended to the multiclass case
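The standard binary soft-margin objective can be evaluated directly: (1/2)||w||² + C Σ_i ξ_i, with slack ξ_i = max(0, 1 − y_i(w·x_i + b)). A minimal sketch with made-up data and names:

```python
def soft_margin_objective(w, b, X, y, C):
    """(1/2)*||w||^2 + C * sum of hinge slacks xi_i = max(0, 1 - y_i*(w.x_i + b))."""
    reg = 0.5 * sum(wi * wi for wi in w)
    slack = sum(max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                for x, yi in zip(X, y))
    return reg + C * slack

X = [(3.0, 2.0), (4.0, 4.0), (0.0, 0.0), (1.0, 1.0)]
y = [1, 1, -1, -1]
# Every point lies outside the margin, so all slacks are 0 and only the
# regularizer contributes: 0.5 * ||(1, 1)||^2 = 1.0
obj = soft_margin_objective((1.0, 1.0), -3.0, X, y, C=1.0)
```

Training chooses (w, b) to minimize this value; C controls the trade-off between a wide margin and few margin violations.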
Dataset: first characteristic
- The full domain of the Yahoo! Directory: 292,216 categories and 792,601 documents
Dataset: second characteristic
- Over 76% of the Yahoo! categories have fewer than 5 labeled documents
- The share of such "rare categories" increases at deeper hierarchy levels; 36% are rare categories at deep levels
Dataset: third characteristic
- Many documents have multiple labels
- The average document has 2.23 labels; the largest number of labels for a single document is 31
Dataset
- The Yahoo! Directory is split into a training set and a testing set with a ratio of 7:3
- Categories containing only one labeled document are removed
- This leaves 132,199 categories, 492,617 training documents, and 275,364 testing documents
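The 7:3 split above can be sketched as a simple shuffle-and-cut; the function name and the fixed seed are my own choices, and the slide does not say exactly how the authors sampled:

```python
import random

def split_documents(docs, train_ratio=0.7, seed=0):
    """Shuffle the documents and cut them into training/testing sets (7:3 by default)."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    docs = list(docs)
    rng.shuffle(docs)
    cut = int(len(docs) * train_ratio)
    return docs[:cut], docs[cut:]

train, test = split_documents(range(10))   # 7 training docs, 3 testing docs
```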
Complexity and Effectiveness
- Flat SVMs, with the one-against-rest strategy, train one binary SVM per category
- N is the number of training documents; M is the number of categories
- If the average training time per SVM model is denoted t_avg, the total training cost is roughly M · t_avg
Complexity and Effectiveness: hierarchical SVMs
- m_i: the number of categories defined at the i-th level
- j: the size-based rank of the categories
- n_ij: the number of training documents for the j-th category at the i-th level
- n_i1: the number of training documents for the most common category at the i-th level
- θ_i: a level-specific parameter
Complexity and Effectiveness
- A fitted approximating function was used to estimate the number of categories at the i-th level
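The notation above (size-based rank j, most-common count n_i1, level-specific parameter) suggests a Zipf-like rank-size model, n_ij ≈ n_i1 / j^θ; the exact formula from the paper is not preserved on these slides, so the sketch below assumes that form:

```python
def zipf_level_docs(n_top, m, theta):
    """Training-document counts for the m categories at one hierarchy level,
    assuming the rank-size model n_j = n_1 / j**theta (theta is level-specific)."""
    return [n_top / (j ** theta) for j in range(1, m + 1)]

counts = zipf_level_docs(1000, 5, 1.0)   # [1000.0, 500.0, 333.3..., 250.0, 200.0]
total = sum(counts)   # training cost at this level scales with this sum
```

Under such a skewed distribution, most categories contribute very few documents, which is exactly the "rare categories" phenomenon noted in the dataset analysis.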
Complexity and Effectiveness
- Testing phase of hierarchical SVMs
- Pachinko-machine search: starting from the root, at each step open the most likely child of the current category, and repeat until a leaf is reached
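The pachinko-machine search can be sketched as a greedy descent over a toy taxonomy; the tree, names, and scores below are invented for illustration, with score(node) standing in for the SVM decision value of that category:

```python
def pachinko_search(root, score):
    """From the root, repeatedly descend into the highest-scoring child
    until a leaf is reached; return the path of visited categories."""
    node = root
    path = [node["name"]]
    while node["children"]:
        node = max(node["children"], key=score)
        path.append(node["name"])
    return path

taxonomy = {"name": "root", "children": [
    {"name": "Science", "children": [
        {"name": "Physics", "children": []},
        {"name": "Biology", "children": []}]},
    {"name": "Arts", "children": []}]}
scores = {"Science": 0.9, "Arts": 0.2, "Physics": 0.1, "Biology": 0.7}

path = pachinko_search(taxonomy, lambda n: scores[n["name"]])
# path == ["root", "Science", "Biology"]
```

Only the children along one root-to-leaf path are ever scored, which is what makes the hierarchical testing phase much cheaper than scoring all M categories; the price is that a wrong turn near the root cannot be recovered at deeper levels.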
Complexity of SVM Classification with Threshold Tuning
Complexity of SVM Classification with Threshold Tuning
- SCut: tune a threshold per category so that optimal performance of the classifier is obtained for that category; the per-category thresholds are then fixed when applying the classifier to new documents in the test set
- RCut: sort categories by score and assign YES to each of the t top-ranking categories
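The two thresholding strategies can be sketched as follows (the function names and toy scores are mine; in practice the SCut thresholds are tuned per category on validation data and then held fixed):

```python
def rcut(scores, t):
    """RCut: assign YES to the t top-ranking categories by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:t])

def scut(scores, thresholds):
    """SCut: assign YES to every category whose score clears its own threshold."""
    return {c for c, s in scores.items() if s >= thresholds[c]}

doc_scores = {"a": 0.9, "b": 0.4, "c": 0.1}
top2 = rcut(doc_scores, 2)                                    # {"a", "b"}
accepted = scut(doc_scores, {"a": 0.5, "b": 0.5, "c": 0.05})  # {"a", "c"}
```

Note the design difference: RCut assigns exactly t labels to every document, while SCut allows a variable number per document, which matters here since documents carry 2.23 labels on average but up to 31.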
Effectiveness Analysis
- Compared to the scalability analysis, classification effectiveness is not as clear and predictable; it can be affected by many other factors
- Potential problems for SVMs: noisy and imbalanced training data
- The performance of hierarchical SVMs therefore cannot be expected to be very good
Experimental Results
- 10 machines, each with four 3 GHz CPUs and 4 GB of memory
Conclusions
- This work examines how text categorization algorithms scale to very large problems, especially large-scale Web taxonomies
Conclusions
- Drawback: lower performance at deep hierarchy levels
- Applications: combining SVMs with a concept-hierarchy tree; applying the approach to text or other domains; Pachinko-machine search, ...
- Future work: learning SVM kernels for this task?