1
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor: Dr. Hsu Presenter: Chien-Shing Chen Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma Support Vector Machines Classification with A Very Large-scale Taxonomy SIGKDD, 2004
2
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Outline Motivation Objective Introduction Dataset characteristic Complexity Analysis Effectiveness Analysis Experimental Settings Conclusions Personal Opinion
3
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Motivation Very large-scale classification taxonomies: hundreds of thousands of categories, deep hierarchies, and a skewed category distribution over documents. Open question: can the state-of-the-art technologies in text categorization handle taxonomies of this scale? Evaluation of SVMs for web-page classification over the full taxonomy of the Yahoo! categories.
4
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Objective Scalability and effectiveness: 1. a data analysis on the Yahoo! taxonomy; 2. development of a scalable system for large-scale text categorization; 3. theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification; 4. threshold tuning algorithms with respect to time complexity and accuracy of SVMs.
5
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Introduction TC (text categorization) methods: SVMs, kNN, NB, … In recent years, the scale of TC problems has become larger and larger. This work answers whether current methods can keep up, from the views of scalability and effectiveness.
6
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM Flat SVMs vs. hierarchical SVMs, which exploit the structure of the taxonomy tree.
7
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM
8
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM
9
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM Optimal separating hyperplane between the two classes, obtained by maximizing the margin between the classes’ closest points.
10
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM
11
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM Multi-class classification Basically, SVMs can only solve binary classification problems, so multi-class problems are decomposed into binary sub-classifiers (see the sketch below): one-against-all fits N two-class classifiers (true class vs. false class); one-against-one fits N(N-1)/2 classifiers.
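A minimal sketch of the two decomposition strategies, using scikit-learn as a stand-in (an assumption; it is not the SVM implementation used in the paper) on a toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

# Toy 4-class problem standing in for category classification.
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=10, n_classes=4, random_state=0)

ova = OneVsRestClassifier(LinearSVC()).fit(X, y)   # one-against-all: N binary SVMs
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # one-against-one: N(N-1)/2 binary SVMs

print(len(ova.estimators_))  # 4
print(len(ovo.estimators_))  # 4 * 3 / 2 = 6
```

With hundreds of thousands of categories, one-against-one is clearly impractical, which is why the paper considers the one-against-rest (flat) and hierarchical strategies.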
12
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM Let {(x_i, y_i)}, i = 1..n, be a set of n labeled training documents; define a linear discriminant function, a corresponding classification function, and the margin of a weight vector (sketched below).
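A small numeric illustration of these quantities, using standard linear-SVM notation of my own choosing, namely f(x) = w·x + b, h(x) = sign(f(x)), and the margin min_i y_i f(x_i) / ||w||:

```python
import numpy as np

# Tiny illustration of the slide's quantities (notation is mine, not the paper's):
# training set {(x_i, y_i)} with y_i in {-1, +1}, weight vector w, bias b.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

f = X @ w + b                 # linear discriminant function f(x) = w.x + b
h = np.sign(f)                # classification function h(x) = sign(f(x))

margins = y * f               # functional margin of each training document
geometric_margin = margins.min() / np.linalg.norm(w)   # margin of the weight vector (w, b)
print(h, geometric_margin)
```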
13
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM Optimal separation: the soft-margin formulation and its multi-class extension (standard form below).
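For reference, the standard binary soft-margin objective; the slide's multi-class formulation generalizes this with constraints over competing labels, and its exact form is not reproduced here:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{i=1}^{n} \xi_{i}
\quad \text{s.t.} \quad y_{i}\left(\mathbf{w}\cdot\mathbf{x}_{i} + b\right) \ge 1 - \xi_{i},
\qquad \xi_{i} \ge 0, \quad i = 1, \dots, n .
```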
14
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM
15
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM
16
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Dataset (first characteristic) The full domain of the Yahoo! Directory: 292,216 categories, 792,601 documents.
17
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Dataset (second characteristic) Over 76% of the Yahoo! categories have fewer than 5 labeled documents. The proportion of “rare categories” increases at deeper hierarchy levels; 36% are rare categories at deep levels.
18
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Dataset (third characteristic) Many documents have multiple labels: a document has 2.23 labels on average, and the largest number of labels for a single document is 31.
19
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Dataset Split the Yahoo! Directory into a training set and a testing set with a ratio of 7:3, and remove those categories containing only one labeled document (see the sketch below). Result: 132,199 categories, 492,617 training documents, 275,364 testing documents.
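A sketch of this preparation step on hypothetical data; the input format (a mapping from document id to its set of category ids) is my assumption, since the Yahoo! crawl itself is not available here:

```python
import random
from collections import defaultdict

def prepare(doc_labels, train_ratio=0.7, seed=0):
    """doc_labels: dict mapping doc_id -> set of category ids (hypothetical format)."""
    # Count labeled documents per category, then drop categories with only one.
    per_cat = defaultdict(int)
    for cats in doc_labels.values():
        for c in cats:
            per_cat[c] += 1
    kept = {c for c, n in per_cat.items() if n > 1}

    # Keep only labels of retained categories; drop documents left with no label.
    docs = [(d, cats & kept) for d, cats in doc_labels.items() if cats & kept]

    # Split the remaining documents 7:3 into training and testing sets.
    random.Random(seed).shuffle(docs)
    cut = int(train_ratio * len(docs))
    return docs[:cut], docs[cut:], kept
```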
20
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness Flat SVMs, with the one-against-rest strategy: N is the number of training documents, M is the number of categories, and the analysis also uses the average training time per SVM model.
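One way to write the resulting training cost (the symbol \bar{t} and the superlinear exponent γ are my notation and assumptions, not quoted from the paper):

```latex
T_{\text{flat}} \;\approx\; M \cdot \bar{t},
\qquad \bar{t} = O\!\left(N^{\gamma}\right), \;\; 1 < \gamma \le 2,
```

since every one of the M one-against-rest models is trained against all N documents, and SVM training time grows superlinearly in the number of training examples.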
21
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness Hierarchical SVMs: m_i is the number of categories defined at the i-th level; j is the size-based rank of the categories; n_ij is the number of training documents for the j-th category at the i-th level; n_i1 is the number of training documents for the most common category at the i-th level; a level-specific parameter is also used (see the sketch below).
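Reading these definitions, one natural reconstruction (my inference from the named quantities, not the paper's exact expressions) is a rank-size power law per level together with a per-level sum of training costs, where only the documents routed to a node are used to train that node's SVMs:

```latex
n_{ij} \;\approx\; n_{i1} \cdot j^{-\theta_{i}},
\qquad
T_{\text{hier}} \;\approx\; \sum_{i} \sum_{j=1}^{m_{i}} O\!\left(n_{ij}^{\gamma}\right),
```

with θ_i the level-specific parameter. Because the n_ij at deeper levels are much smaller than N, this is the source of the hierarchical approach's scalability advantage over flat SVMs.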
22
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness A fitted function was used to approximate the number of categories at the i-th level.
23
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness
24
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness
25
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness
26
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness For the testing phase of hierarchical SVMs, Pachinko-machine search: starting from the root, at each step open the most likely subcategory of the current category, until a leaf is reached (see the sketch below).
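A minimal sketch of this top-down greedy search; the tree-node interface (`children`, `score`) is hypothetical, not the paper's API:

```python
def pachinko_search(root, doc):
    """Pachinko-machine search: starting from the root, repeatedly descend
    into the highest-scoring child of the current category until a leaf
    (a category with no subcategories) is reached."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: node.score(child, doc))
    return node
```

Only the classifiers attached to children of the visited nodes are evaluated, rather than all M flat classifiers, which is what keeps hierarchical testing cheap.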
27
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity of SVM Classification with Threshold Tuning
28
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity of SVM Classification with Threshold Tuning SCut: tune a threshold per category so that optimal performance of the classifier is obtained for that category, then fix the per-category thresholds when applying the classifier to new documents in the test set. RCut: sort categories by score and assign YES to each of the t top-ranking categories.
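Minimal sketches of the two strategies on a score matrix (rows are documents, columns are categories); the function names and the candidate-threshold grid are my own choices:

```python
import numpy as np

def rcut(scores, t):
    """RCut: for each document, assign YES to its t top-ranking categories."""
    top = np.argsort(-scores, axis=1)[:, :t]
    assign = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(assign, top, True, axis=1)
    return assign

def f1(pred, gold):
    """Per-category F1 between boolean prediction and gold-label vectors."""
    tp = np.sum(pred & gold); fp = np.sum(pred & ~gold); fn = np.sum(~pred & gold)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def scut(scores, gold, candidates=np.linspace(-2.0, 2.0, 81)):
    """SCut: on held-out data, pick one threshold per category that maximizes
    that category's F1; the thresholds are then fixed for the test set."""
    return np.array([
        max(candidates, key=lambda th: f1(scores[:, c] >= th, gold[:, c]))
        for c in range(scores.shape[1])
    ])
```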
29
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Effectiveness Analysis Compared with the scalability analysis, classification effectiveness is not as clear and predictable: it is affected by many other factors. Potential problems for SVMs: noisy and imbalanced training data. One cannot expect the performance of hierarchical SVMs to be very good.
30
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results 10 machines, each with four 3GHz CPUs and 4 GB of memory
31
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results
32
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results
33
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results
34
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results
35
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Conclusions Scaling text categorization algorithms to very large problems, especially large-scale Web taxonomies.
36
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Conclusions Drawback: lower performance at deep levels. Application: combine SVMs with a concept hierarchy tree; apply to text or other domains; Pachinko-machine search… Future work: learning SVM kernels to implement?