Published by Kelly Pearson. Modified over 8 years ago.
1
Speaker: Shau-Shiang Hung (洪紹祥)
Adviser: Shu-Chen Cheng (鄭淑真)
Date: 99/05/04
Qirui Zhang, Jinghua Tan, Huaying Zhou, Weiye Tao, Kejing He, "Machine Learning Methods for Medical Text Categorization," 2009 Pacific-Asia Conference on Circuits, Communications and Systems (PACCS), pp. 494-497, 2009.
2
Outline
Introduction
Document indexing
Classification Algorithm
Experiments
Conclusion
3
Introduction
Text categorization (TC) is the process of automatically assigning one or more predefined category labels to text documents. The volume of digital medical information is increasing rapidly as networks develop. How to handle and organize this information effectively is an open problem in medical informatics.
4
Document indexing
Because classifiers cannot interpret documents directly, the documents must first be transformed into a form that classifiers can process. The vector space model (VSM) is a widely used statistical model for this purpose.
5
Document indexing
A. Standard Term Frequency - Inverse Document Frequency (TFIDF)
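The TFIDF formula from this slide is not preserved in the transcript. As a rough illustration only, the sketch below uses the common textbook form tf(t, d) * log(N / df(t)); the paper may use a different variant. The function name `tfidf_weights` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes the textbook form tf(t, d) * log(N / df(t)); the exact
    variant used on the original slide is not known.
    """
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus, already tokenized
docs = [
    ["heart", "disease", "heart"],
    ["lung", "disease"],
    ["heart", "surgery"],
]
weights = tfidf_weights(docs)
```

In this form a term that occurs in every document gets weight 0, because log(N / N) = 0.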
6
Document indexing
So that the weights fall in the [0,1] interval and the documents are represented by vectors of equal length, the weights produced by TFIDF are usually normalized by cosine normalization.
Slide annotation: the sum of the squared TFIDF values of all keywords in document 1.
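Cosine normalization divides each weight in a document by the Euclidean length of that document's weight vector, that is, the square root of the sum of squared weights. A minimal sketch, with `cosine_normalize` and the sample weights as illustrative assumptions:

```python
import math

def cosine_normalize(weights):
    """Scale a {term: weight} dict to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)  # all-zero vector: nothing to scale
    return {t: w / norm for t, w in weights.items()}

# Hypothetical unnormalized weights for one document
doc_weights = {"heart": 3.0, "disease": 4.0}
normalized = cosine_normalize(doc_weights)
```

With non-negative input weights, every normalized weight lies in [0,1] and the squared weights of a document sum to 1.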
7
Document indexing
B. Improved weighting: Term Frequency, Inverted Document Frequency and Inverted Entropy (TFIDFIE)
In text classification, the importance of a term depends not only on its term frequency but also on its contribution to classification.
For example, under TFIDF, Term 1 "guest room" (客房) and Term 2 "scenery" (風景) receive the same weight.
8
Document indexing
To bring out the relation between terms and categories, the weighting also takes into account how the documents containing a term are distributed across the categories. This distribution can be measured by the information entropy H.
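The entropy idea can be sketched as follows. The transcript does not preserve how H is combined with TFIDF, so this example only computes the Shannon entropy of the category labels of the documents that contain a term. The function name, the base-2 logarithm, and the labels "A" and "B" are assumptions.

```python
import math
from collections import Counter

def category_entropy(labels_of_docs_with_term):
    """Shannon entropy (base 2) of the category labels of the
    documents in which a term occurs."""
    counts = Counter(labels_of_docs_with_term)
    total = sum(counts.values())
    # sum of p * log2(1 / p) over the categories that actually occur
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical category labels of the documents containing each term
concentrated = ["A"] * 8             # term occurs in one category only
spread = ["A"] * 4 + ["B"] * 4       # term split evenly over two categories

h_concentrated = category_entropy(concentrated)
h_spread = category_entropy(spread)
```

A term concentrated in one category has low entropy and is more useful for telling categories apart than a term spread evenly across them, even when both have the same TFIDF weight.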
9
Classification Algorithm
A. K-Nearest Neighbor (KNN)
B. Support Vector Machine (SVM)
C. Naïve Bayes (NB)
D. Clonal Selection Algorithm Based on Antibody Density (CSABAD)
Because an immune algorithm is by nature designed to distinguish self from non-self, it can be applied to text categorization.
10
Classification Algorithm
CSABAD
In text categorization, the immune-system terms map as follows:
Antigen: a training text.
B cell: an individual classifier.
Antibody: the affinity between the individual and the training documents.
The final classifier is composed of many memory B cells.
The cosine of two vectors is used to measure the affinity f(x_i, d_j) between B cell x_i and antigen d_j.
The affinity f(x_i) of B cell x_i with N antigens is defined as the average of all N affinities.
The antibody selection probability P(x_i) is defined as follows: [formula missing from the transcript]
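The two affinity definitions above can be sketched as below. This covers only the cosine affinity and its average over N antigens; it does not attempt the selection probability P(x_i), whose formula is not available here. The function names and the toy vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no affinity
    return dot / (norm_u * norm_v)

def mean_affinity(b_cell, antigens):
    """Average cosine affinity of one B cell over all antigens."""
    return sum(cosine(b_cell, d) for d in antigens) / len(antigens)

# Hypothetical 2-term vectors: one B cell and two training documents
b_cell = [1.0, 0.0]
antigens = [[1.0, 0.0], [0.0, 1.0]]
f = mean_affinity(b_cell, antigens)
```

Here the B cell matches the first antigen exactly and is orthogonal to the second, so its average affinity is halfway between 1 and 0.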
11
Experiments
A. Data collection
OHSUMED is a bibliographical document collection. The experiments use OHSCAL, a single-label subset of OHSUMED that consists of 11,162 documents in 10 categories.
12
Experiments
B. Experiment results and analysis
The OHSCAL dataset was randomly divided into a training set and a test set in a 2:1 ratio. To reduce the influence of chance on the results, ten independent experiments were run on OHSCAL.
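A repeated random 2:1 split of this kind might look like the sketch below. The slide does not say whether the split was stratified by category or how the randomness was seeded, so plain shuffling and the seeds 0 to 9 are assumptions, as is the helper name `split_two_to_one`.

```python
import random

def split_two_to_one(items, seed):
    """Shuffle items and return (train, test) in roughly a 2:1 ratio."""
    rng = random.Random(seed)   # separate generator per run, reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

# Stand-in document ids for the 11,162 OHSCAL documents
doc_ids = list(range(11162))

# Ten independent runs, one per (hypothetical) seed
runs = [split_two_to_one(doc_ids, seed) for seed in range(10)]
```

Each run would then train and evaluate every classifier on its own split, and the per-run scores would be aggregated, for example by averaging.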
13
Conclusion
In this paper, we propose an improved term-weighting approach called TFIDFIE. It takes into account how the training documents in which a term occurs are distributed across categories. The experiments show that SVM and CSABAD significantly outperform kNN and Naive Bayes, and that TFIDFIE is more effective than TFIDF. Considering the characteristics of professional medical vocabulary, we will study feature selection for medical text classification in future work.