Download presentation
Presentation is loading. Please wait.
Published byRudolph Mason Modified over 8 years ago
1
No. 1 Classification Methods for Documents with both Fixed and Free Formats by PLSI Model* 2004International Conference in Management Sciences and Decision Making, Tamkang University, May 29, 2004 Shigeich Hirasawa Department of Industrial and Management Systems Engineering School of Science and Engineering Waseda University, Japan Hirasawa@hirasa.mgmt.waseda.ac.jp Wesley W. Chu Computer science Department School of Engineering and Applied Science University of California, Los Angels, U.S.A. wwc.@cs.ucla.edu * A part of the work leading to this paper was done at UCLA during a sabbatical year of S.H. as a visiting faculty in 2002.
2
No. 2 FormatExample in paper archives matrix Fixed format Items - The name of authors - The name of journals - The year of publication - The name of publishers - The name of countries - The year of publication - The citation link Free format Text The text of a paper - Introduction - Preliminaries ……. - Conclusion 1. Introduction Document G = [ g mj ] : An item-document matrix H = [ h ij ] : A term-document matrix d j : The j -th document t i : The i -th term i m : The m -th item g mj : The selected result of the m -th item ( i m ) in the j -th document ( d j ) h ij : The frequency of the i -th term ( t i ) in the j -th document ( d j ) 1. Introduction
3
No. 3 2. Information Retrieval Model Text Mining: - Information Retrieval including - Clustering - Classification Information Retrieval Model BaseModel Set theory (Classical) Boolean Model Fuzzy Extended Boolian Model Algebraic (Classical) Vector Space Model (VSM) [BYRN99] Generalized VSM Latent Semantic Indexing (LSI) Model [BYRN99] Probabilistic LSI (PLSI) Model [Hofmann99] Neural Network Model Probabilistic (Classical) Probabilistic Model Extended Probabilistic Model Inference Network Model Bayesian Network Model 2. Information Retrieval Model
4
No. 4 tf ( i,j ) = f ij : The number of the i -th term ( t i ) in the j -th document ( d j ) (Local weight) Weight w ij is given by The Vector Space Model (VSM) 2. Information Retrieval Model idf ( i,j ) = log (D/df(i)) : General weight df( i ) : The number of documents in D for which the term t i appears (1)
5
No. 5 2. Information Retrieval Model (term vector) t i = (a i1, a i2, …, a iD ) : The i -th row (document vector) d j = (a 1j, a 2j, …, a Tj ) : The j -th column (query vector) q = (q 1, q 2, …, q T ) T The similarity s ( q, d j ) between q and d j : (2)
6
No. 6 The Latent Semantic Indexing (LSI) Model 2. Information Retrieval Model (1) be
7
No. 7 where : the j -th canonical vector (2) 2. Information Retrieval Model be
8
No. 8 The Probabilistic LSI (PLSI) Model 2. Information Retrieval Model (1) Preliminary A) A=[a ij ], a ij = f ij :the number of a term t i B) reduction of dimension similar to LSI C) latent class (state model based on factor analysis) D) (i) an independence between pairs ( t i, d j ) (ii) a conditional independence between t i and d j : a set of states ztd
9
No. 9 (2) 2. Information Retrieval Model be
10
No. 10 (3) 2. Information Retrieval Model
11
No. 11 3. Proposed Method Classification method categories: C 1, C 2, …, C K (1)Choose a subset of documents D * ( D ) which are already categorized and compute representative document vectors d * 1, d * 2, …, d * K : where n k is the number of selected documents to compute the representative document vector from C k. (2)Compute the probabilities Pr( z k ), Pr( d j | z k ) and Pr( t i | z k ) which maximizes eq.(13) by the TEM algorithm, where ∥ Z ∥ = K. (3)Decide the state for d j as (18) (19)
12
No. 12 (1)Choose a proper K (≥ S ) and compute the probabilities Pr( z k ), Pr( d j | z k ), and Pr( t i | z k ) which maximizes eq.(13) by the TEM algorithm, where ∥ Z ∥ = K. (2)Decide the state for d j as (20) If S = K, then 3. Proposed Method Clustering method || Z || = K : The number of latent states S : The number of clusters
13
No. 13 3. Proposed Method (3)If S < K, then compute a similarity measure s ( z k, z k' ): (21) (22) and use the group average distance method with the similarity measure s ( z k, z k' ) for agglomeratively clustering the states z k 's until the number of clusters becomes S. Then we have S clusters, and the members of each cluster are those of a cluster of states.
14
No. 14 4. Experimental Results 4.1 Document sets Table 1: Document sets contentsformatamountcategorize (a) articles of Mainichi news paper in ’94 [Sakai99] Free (texts only) 101,058 (see Table 2) Yes (9+1 ategories) (b) Questionnaire (see Table 3 in detail) fixed and free (see Table 3) 135+35 Yes (2 categories) (c)135no
15
No. 15 4. Experimental Results 4.2 Classification problem: (a) Experimental data: Mainichi Newspaper in ‘94 (in Japanese) 300 article, 3 categories (free format only) Conditions of (a) categorycontents# articles # used for training # used for test C1C1 business10050 C2C2 local10050 C3C3 sports10050 total300150 Table 2: Selected categories of newspaper LSI : K = 81 PLSI: K = 10
16
No. 16 4. Experimental Results Results of (a) methodfrom C k to C k C1C1 C 2 C3C3 VS method C1C1 17429 C2C2 8384 C3C3 15431 LSI method C1C1 16628 C2C2 6431 C3C3 12533 PLSI method C1C1 4109 C2C2 0473 C3C3 13631 Proposed method C1C1 4703 C2C2 0500 C3C3 4244 Table 5: Classified number form C k to for each method
17
No. 17 4. Experimental Results MethodClassification error VSM 42.7% LSI 38.7% PLSI 20.7% Proposed method 6.0% Table 6: Classification error rate
18
No. 18 4. Experimental Results Clustering process by EM algorithm sports local business sports local
19
No. 19 4. Experimental Results 4.3 Clustering Problem: (b) Format Number of questions Examples Fixed (item) 7 major questions 2 - For how many years have you used computers? - Do you have a plan to study abroad? - Can you assemble a PC? - Do you have any license in information technology? - Write 10 terms in information technology which you know 4. Free (text) 5 questions 3 - Write about your knowledge and experience on computers. - What kind of job will you have after graduation? - What do you imagine from the name of the subject? Table 3: Contents of initial questionnaire 2 Each question has 4-21 minor questions. 3 Each text is written within 250-300 Chinese and Japanese characters. 4 There is a possibility to improve the performance of the proposed method by elimination of these items. Student Questionnaire
20
No. 20 4. Experimental Results 4.3 Clustering Problem: (b) Table 4: Object classes Name of subjectCourse Number of students Introduction to Computer Science (Class CS) Science Course135 Introduction to Information Society (Class IS) Literary Course35 Object classes
21
No. 21 4. Experimental Results Condition of (b) I) First, the documents of the students in Class CS and those in Class IS are merged. II) Then, the merged documents are divided into two class ( S =2) by the proposed method. Class CS Class IS True class Merge Clustering by the proposed method Clustering error C(e) Fig.3 Class partition problem by clustering method
22
No. 22 students Results of (b) Fig.4 Clustering process by EM algorithm, K=2
23
No. 23 similarity
24
No. 24 4. Experimental Results S=K=2C(e)=0.411 K-means method Fig.6 Clustering process for EM algorithm, K =3
25
No. 25 (Item only) C(e) : the ratio of the number of students in the difference set between divided two classes and the original classes to the number of the total students. (Text only) 4. Experimental Results Fig.4 Clustering error rate C(e) vs. λ
26
No. 26 4. Experimental Results Fig.5 Clustering error rate C(e) vs. λ
27
No. 27 4. Experimental Results Results of (b) Statistical analysis by discriminant analysis
28
No. 28 4. Experimental Results Another Experiment Class CS Cass G Class S Clustering by the proposed method Clustering error C(e) Clustering for class partition problem Only form IQ Fig. Another Class partition problem by clustering method S: Specialist G: Generalist
29
No. 29 4. Experimental Results (1) Member of students in each class classCharacteristics of students student's selection S - Having a good knowledge of technical terms - Hoping the evaluation by exam G - Having much interest in use of a computer Clustering S - Having much interest in theory - Having higher motivation for a graduate school G - Having much interest in use of a computer - Having a good knowledge of system using the computer
30
No. 30 (2) Member of students in each class By discriminant analysis, two classes are evaluated for each partition which are interpreted in table 5. The most convenient case for characteristics of students should be chosen. 4. Experimental Results
31
No. 31 5. Concluding Remarks (1)We have proposed a classification method for a set of documents and extend it to a clustering method. (2)The classification method exhibits its better performance for a document set with comparatively small size by using the idea of the PLSI model. (3)The clustering method also has good performance. We show that it is applicable to documents with both fixed and free formats by introducing a weighting parameter λ. (4)As an important related study, it is necessary to develop a method for abstracting the characteristics of each cluster [HIIGS03][IIGSH03-b]. (5)An extension to a method for a set of documents with comparatively large size also remains as a further investigation.
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.