Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Chien-Shing Chen Author: Tie-Yan.

Presentation transcript:

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology
Support Vector Machines Classification with a Very Large-scale Taxonomy
Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma
SIGKDD, 2004
Advisor: Dr. Hsu
Presenter: Chien-Shing Chen

Outline
Motivation
Objective
Introduction
Dataset characteristics
Complexity Analysis
Effectiveness Analysis
Experimental Settings
Conclusions
Personal Opinion

Motivation
Very large-scale classification taxonomies have hundreds of thousands of categories, deep hierarchies, and a skewed category distribution over documents.
It is an open question whether the state-of-the-art technologies in text categorization can scale to such taxonomies.
The paper evaluates SVMs in web-page classification over the full taxonomy of the Yahoo! categories.

Objective
Study scalability and effectiveness through:
1. a data analysis of the Yahoo! taxonomy;
2. development of a scalable system for large-scale text categorization;
3. theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification;
4. an examination of threshold tuning algorithms with respect to the time complexity and accuracy of SVMs.

Introduction
Text categorization (TC) methods include SVMs, KNN, NB, ...
In recent years, the scale of TC problems has become larger and larger.
The paper asks whether these methods still hold up at that scale, from the viewpoints of scalability and effectiveness.

SVM
Flat SVMs and hierarchical SVMs
Structure of the taxonomy tree

SVM
An SVM finds the optimal separating hyperplane between the two classes by maximizing the margin between the classes' closest points.

SVM: multi-class classification
Basically, SVMs can only solve binary classification problems, so a multi-class problem is handled by fitting a set of binary sub-classifiers.
One-against-all: N two-class classifiers (true class versus false class), where N is the number of classes.
One-against-one: N(N-1)/2 classifiers, one per pair of classes.
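
To make the two counts concrete, here is a minimal Python sketch. The function names are hypothetical and chosen only for illustration. The example plugs in the 132,199 categories reported on the dataset slide of this deck to show why one-against-one is impractical at this scale.

```python
def one_against_all_count(num_classes: int) -> int:
    """One binary classifier per class (true class versus the rest)."""
    return num_classes


def one_against_one_count(num_classes: int) -> int:
    """One binary classifier per unordered pair of classes: N(N-1)/2."""
    return num_classes * (num_classes - 1) // 2


if __name__ == "__main__":
    # 132,199 is the category count after preprocessing, taken from the deck.
    num_categories = 132_199
    print("one-against-all:", one_against_all_count(num_categories))
    print("one-against-one:", one_against_one_count(num_categories))
```

The pairwise count grows quadratically, which is one reason a one-against-rest strategy is the natural baseline for a taxonomy of this size.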

SVM
Given a set of n labeled training documents, the slide defines a linear discriminant function, a corresponding classification function, and the margin of a weight vector.
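
The formulas for these quantities do not survive in the transcript. As a reference point only, they can be written in standard textbook notation as below. The symbols $\mathbf{x}_i$, $y_i$, $\mathbf{w}$, $b$, $h$ and $\gamma$ are assumptions chosen here and may differ from the notation used on the original slide.

```latex
\begin{align*}
  D &= \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}, \qquad y_i \in \{-1, +1\}
      && \text{($n$ labeled training documents)} \\
  f(\mathbf{x}) &= \mathbf{w} \cdot \mathbf{x} + b
      && \text{(linear discriminant function)} \\
  h(\mathbf{x}) &= \operatorname{sign}\bigl(f(\mathbf{x})\bigr)
      && \text{(classification function)} \\
  \gamma(\mathbf{w}) &= \min_{1 \le i \le n}
      \frac{y_i \, f(\mathbf{x}_i)}{\lVert \mathbf{w} \rVert}
      && \text{(geometric margin of the weight vector)}
\end{align*}
```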

SVM
Optimal separation
Soft-margin multiclass formulation

Dataset: first characteristic
The full domain of the Yahoo! Directory:
292,216 categories
792,601 documents

Dataset: second characteristic
Over 76% of the Yahoo! categories have fewer than 5 labeled documents.
The number of such "rare categories" increases at deeper hierarchy levels.
36% are rare categories at deep levels.

Dataset: third characteristic
Many documents have multiple labels.
On average, a document has 2.23 labels.
The largest number of labels for a single document is 31.

Dataset
The Yahoo! Directory is split into a training set and a testing set with a ratio of 7:3.
Categories containing only one labeled document are removed.
This leaves 132,199 categories, 492,617 training documents, and 275,364 testing documents.

Complexity and Effectiveness
Flat SVMs, with the one-against-rest strategy:
N is the number of training documents.
M is the number of categories.
[symbol missing] denotes the average training time per SVM model.

Complexity and Effectiveness
Hierarchical SVMs:
m_i is the number of categories defined at the i-th level.
j is the size-based rank of the categories.
n_ij is the number of training documents for the j-th category at the i-th level.
n_i1 is the number of training documents for the most common category at the i-th level.
[symbol missing] is a level-specific parameter.
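
The formula that ties these symbols together is missing from the transcript. The rank-based definitions suggest a Zipf-like power law, so the block below is a sketch under that assumption and not a reproduction of the slide. The name $\beta_i$ for the level-specific parameter and the total $N_i$ are introduced here for illustration.

```latex
\begin{align*}
  n_{ij} &= n_{i1} \, j^{-\beta_i}, \qquad j = 1, \dots, m_i
      && \text{(assumed power-law decay by size rank)} \\
  N_i &= \sum_{j=1}^{m_i} n_{ij}
       = n_{i1} \sum_{j=1}^{m_i} j^{-\beta_i}
      && \text{(total category assignments at level $i$)}
\end{align*}
```

Under this assumption, a larger $\beta_i$ means a more skewed level: the most common category holds $n_{i1}$ documents while low-ranked categories hold very few, which matches the rare-category pattern described on the dataset slides.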

Complexity and Effectiveness
[expression missing] was used to approximate the number of categories at the i-th level.

Complexity and Effectiveness
For the testing phase of hierarchical SVMs:
Pachinko-machine search: start from the root; at each step, choose the most likely child of the current category and descend into it, until a leaf is reached.
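
The greedy descent described above can be sketched in a few lines of Python. This is an illustration only: the toy taxonomy, the keyword sets, and the `keyword_score` function are hypothetical stand-ins for per-category SVM scores, and `pachinko_search` is a name chosen here.

```python
from typing import Callable, Dict, List, Set


def pachinko_search(
    root: str,
    children: Dict[str, List[str]],
    score: Callable[[str, List[str]], float],
    doc: List[str],
) -> List[str]:
    """Greedy top-down search: follow the best-scoring child until a leaf."""
    path = [root]
    node = root
    # A node with no entry (or an empty list) in `children` is a leaf.
    while children.get(node):
        node = max(children[node], key=lambda child: score(child, doc))
        path.append(node)
    return path


# Hypothetical toy taxonomy (not the Yahoo! Directory).
CHILDREN: Dict[str, List[str]] = {
    "Root": ["Arts", "Science"],
    "Arts": ["Music"],
    "Science": ["Biology", "Physics"],
}

# Hypothetical keyword sets standing in for trained per-category classifiers.
KEYWORDS: Dict[str, Set[str]] = {
    "Arts": {"music", "painting"},
    "Music": {"music"},
    "Science": {"quantum", "cell", "energy"},
    "Biology": {"cell"},
    "Physics": {"quantum", "energy"},
}


def keyword_score(category: str, doc: List[str]) -> float:
    """Score a category by keyword overlap (placeholder for an SVM score)."""
    return float(len(KEYWORDS[category] & set(doc)))


if __name__ == "__main__":
    document = ["quantum", "energy", "experiment"]
    print(pachinko_search("Root", CHILDREN, keyword_score, document))
```

Only one branch is explored per level, so the number of classifiers evaluated per document is bounded by the sum of the branching factors along a single root-to-leaf path. The trade-off is that a wrong choice near the root cannot be corrected further down.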

Complexity of SVM Classification with Threshold Tuning
SCut: tune the threshold for each category until the optimal performance of the classifier is obtained for that category, then fix the per-category thresholds when applying the classifier to new documents in the test set.
RCut: sort categories by score and assign YES to each of the t top-ranking categories.
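
A minimal Python sketch of the two strategies follows. It assumes F1 as the performance measure for SCut and uses the observed validation scores as candidate thresholds; the slide specifies neither, so both are assumptions, as are the function names.

```python
from typing import Dict, List, Set


def rcut(scores: Dict[str, float], t: int) -> Set[str]:
    """RCut: assign YES to the t top-ranking categories for one document."""
    ranked = sorted(scores, key=lambda category: scores[category], reverse=True)
    return set(ranked[:t])


def f1_score(predicted: List[bool], actual: List[bool]) -> float:
    """F1 = 2TP / (2TP + FP + FN), defined as 0.0 when the denominator is 0."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def scut_threshold(val_scores: List[float], val_labels: List[bool]) -> float:
    """SCut: pick the threshold for ONE category that maximizes validation F1.

    Candidates are the distinct validation scores (an assumption made here);
    ties are broken in favor of the lowest threshold.
    """
    candidates = sorted(set(val_scores))
    return max(
        candidates,
        key=lambda th: f1_score([s >= th for s in val_scores], val_labels),
    )


if __name__ == "__main__":
    # RCut on one document's category scores.
    print(rcut({"Arts": 0.1, "Science": 0.7, "Sports": 0.5}, t=2))
    # SCut on one category's validation scores and true labels.
    threshold = scut_threshold([0.9, 0.8, 0.4, 0.2], [True, True, False, False])
    print(threshold)  # this fixed threshold is then reused on test documents
```

RCut needs no tuning pass but forces the same number of labels on every document, while SCut requires one tuning pass per category, which adds to the overall cost when there are more than a hundred thousand categories.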

Effectiveness Analysis
Compared to scalability, classification effectiveness is not as clear or predictable, because it can be affected by many other factors.
Potential problems for SVMs: noisy and imbalanced data.
The performance of hierarchical SVMs cannot be expected to be very good.

Experimental Results
10 machines, each with four 3 GHz CPUs and 4 GB of memory

Conclusions
The paper addresses the application of text categorization algorithms to very large problems, especially large-scale Web taxonomies.

Conclusions
Drawback: lower performance at deep levels.
Application: combine SVMs with a concept hierarchy tree; apply to text or other domains; Pachinko-machine search…
Future Work: learn SVM kernels to implement?