

No. 1 Knowledge Acquisition from Documents with both Fixed and Free Formats*
Shigeichi Hirasawa, Department of Industrial and Management Systems Engineering, School of Science and Engineering, Waseda University, Japan (Hirasawa@hirasa.mgmt.waseda.ac.jp)
Wesley W. Chu, Computer Science Department, School of Engineering and Applied Science, University of California, Los Angeles, U.S.A. (wwc@cs.ucla.edu)
* A part of the work leading to this paper was done at UCLA during a sabbatical year of S.H. as a visiting faculty member in 2002.
2003 IEEE International Conference on Systems, Man and Cybernetics, Oct. 5-8, 2003, Washington D.C.

No. 2 1. Introduction

Formats, with examples in paper archives:
- Fixed format (items): the names of authors, the names of journals, the year of publication, the names of publishers, the names of countries, the citation links
- Free format (text): the text of a paper (Introduction, Preliminaries, ..., Conclusion)

Document matrices:
- G = [g_mj]: an item-document matrix
- H = [h_ij]: a term-document matrix
- d_j: the j-th document
- t_i: the i-th term
- i_m: the m-th item
- g_mj: the selected result of the m-th item (i_m) in the j-th document (d_j)
- h_ij: the frequency of the i-th term (t_i) in the j-th document (d_j)

No. 3 2. Information Retrieval Model

Text mining builds on information retrieval, including clustering and classification.

Information retrieval models, by base:
- Set theoretic (classical): Boolean Model; Fuzzy Model; Extended Boolean Model
- Algebraic (classical): Vector Space Model (VSM) [7]; Generalized VSM; Latent Semantic Indexing (LSI) Model [2]; Probabilistic LSI (PLSI) Model [4]; Neural Network Model
- Probabilistic (classical): Probabilistic Model; Extended Probabilistic Model; Inference Network Model; Bayesian Network Model

No. 4 2. Information Retrieval Model: The Vector Space Model (VSM)

tf(i,j) = f_ij: the frequency of the i-th term (t_i) in the j-th document (d_j) (local weight)
idf(i) = log(D / df(i)): the inverse document frequency (global weight), where df(i) is the number of documents in the collection D in which the term t_i appears

The weight w_ij is given by

w_ij = tf(i,j) * idf(i) = f_ij log(D / df(i))    (1)
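Eq. (1) can be sketched in plain Python for a small tokenized corpus (the documents below are hypothetical toy data, not from the paper):

```python
import math

def tfidf_weights(docs):
    """Compute w_ij = tf(i,j) * log(D / df(i)) for a list of tokenized documents."""
    D = len(docs)
    vocab = {t for d in docs for t in d}
    # df(i): number of documents in which term t_i appears
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    # One dict per document, mapping each term to its weight w_ij
    return [{t: d.count(t) * math.log(D / df[t]) for t in set(d)} for d in docs]

docs = [["text", "mining", "text"], ["mining", "model"], ["text", "model", "model"]]
w = tfidf_weights(docs)
```

Here "text" occurs twice in the first document and appears in 2 of 3 documents, so its weight there is 2 * log(3/2).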

No. 5 2. Information Retrieval Model

(term vector) t_i = (a_i1, a_i2, ..., a_iD): the i-th row
(document vector) d_j = (a_1j, a_2j, ..., a_Tj): the j-th column
(query vector) q = (q_1, q_2, ..., q_T)^T

The similarity s(q, d_j) between q and d_j is the cosine of the angle between them:

s(q, d_j) = (q . d_j) / (||q|| ||d_j||)    (2)
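The similarity s(q, d_j), taken here as the standard cosine measure of the VSM, can be sketched as follows (the vectors are hypothetical toy examples):

```python
import math

def cosine_similarity(q, d):
    """s(q, d_j) = (q . d_j) / (||q|| ||d_j||); 0.0 if either vector is zero."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

q = [1.0, 0.0, 1.0]
d1 = [2.0, 0.0, 2.0]   # same direction as q
d2 = [0.0, 3.0, 0.0]   # orthogonal to q
```

Documents are then ranked against the query by decreasing s(q, d_j).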

No. 6 2. Information Retrieval Model: The Latent Semantic Indexing (LSI) Model

(1) SVD: Singular Value Decomposition. The term-document matrix A is factored as A = U S V^T and approximated by its rank-K truncation A_K = U_K S_K V_K^T.
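A minimal NumPy sketch of the rank-K truncation used by LSI (the term-document matrix A is a hypothetical toy example):

```python
import numpy as np

# Term-document matrix A (T x D): rows = terms, columns = documents.
A = np.array([[2., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 2., 1.]])

# Rank-K truncated SVD: A ~= A_K = U_K S_K V_K^T (here K = 2).
K = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_K = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
```

A_K is the best rank-K approximation of A in the Frobenius norm; the discarded singular values measure the approximation error.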

No. 7 2. Information Retrieval Model

(2) The j-th document vector is d_j = A e_j, where e_j is the j-th canonical vector (7); in the reduced K-dimensional space it is represented as S_K^{-1} U_K^T d_j (8).
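The standard LSI fold-in projection, S_K^{-1} U_K^T d_j, can be sketched with NumPy (the matrix A and the canonical vector e_0 are hypothetical toy examples):

```python
import numpy as np

A = np.array([[2., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 2., 1.]])
K = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def fold_in(vec):
    """Project a T-dimensional term vector into the K-dim latent space:
    v_hat = S_K^{-1} U_K^T v."""
    return np.diag(1.0 / s[:K]) @ U[:, :K].T @ vec

# Folding in d_0 = A e_0 (e_0 the first canonical vector) recovers the
# latent coordinates of document 0, i.e. the first column of V_K^T.
e0 = np.eye(3)[0]
d0_hat = fold_in(A @ e0)
```

Queries are projected the same way, so similarities can be computed in the K-dimensional space.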

No. 8 2. Information Retrieval Model: The Probabilistic LSI (PLSI) Model

(1) Preliminaries
A) A = [a_ij], a_ij = f_ij: the frequency of the term t_i in the document d_j
B) Reduction of dimension, similar to LSI
C) Latent classes (a state model based on factor analysis): Z = {z_1, z_2, ..., z_K}, a set of latent states
D) (i) independence between the pairs (t_i, d_j); (ii) conditional independence between t_i and d_j given the state z:

P(t_i, d_j) = Σ_k P(z_k) P(t_i | z_k) P(d_j | z_k)    (12)

No. 9 2. Information Retrieval Model

(2) The parameters are estimated by maximizing the log-likelihood L = Σ_i Σ_j f_ij log P(t_i, d_j) with the EM algorithm. E-step:

P(z_k | t_i, d_j) = P(z_k) P(t_i | z_k) P(d_j | z_k) / Σ_l P(z_l) P(t_i | z_l) P(d_j | z_l)

No. 10 2. Information Retrieval Model

(3) M-step:

P(t_i | z_k) ∝ Σ_j f_ij P(z_k | t_i, d_j),  P(d_j | z_k) ∝ Σ_i f_ij P(z_k | t_i, d_j),  P(z_k) ∝ Σ_i Σ_j f_ij P(z_k | t_i, d_j)
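The EM iteration for the PLSI aspect model can be sketched as follows; the count matrix F, K = 2, and the random initialization are hypothetical, and the updates are the standard PLSI ones:

```python
import numpy as np

rng = np.random.default_rng(0)
F = np.array([[3., 0., 1.],
              [0., 2., 1.],
              [1., 1., 0.]])            # f_ij: term-document counts (toy data)
T, D, K = F.shape[0], F.shape[1], 2

Pz = np.full(K, 1.0 / K)                # P(z_k)
Pt_z = rng.dirichlet(np.ones(T), K).T   # P(t_i | z_k), shape (T, K)
Pd_z = rng.dirichlet(np.ones(D), K).T   # P(d_j | z_k), shape (D, K)

for _ in range(50):
    # E-step: posterior P(z_k | t_i, d_j), shape (T, D, K)
    joint = Pz[None, None, :] * Pt_z[:, None, :] * Pd_z[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate the parameters from expected counts
    n = F[:, :, None] * post
    Pt_z = n.sum(axis=1) / n.sum(axis=(0, 1))
    Pd_z = n.sum(axis=0) / n.sum(axis=(0, 1))
    Pz = n.sum(axis=(0, 1)) / F.sum()
```

Each iteration increases (or leaves unchanged) the log-likelihood, and the resulting P(z_k | d_j) can serve as a reduced representation of the documents.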

No. 11 3. Formats of Documents

No. 12 4. Proposed Methods

Clustering method:
- K: the number of latent states
- S: the number of clusters

No. 13 4. Proposed Methods

No. 14 4. Proposed Methods

No. 15 5. Experimental Results: Preliminary Experiment [5]

A supervised classification problem.

(1) Experimental data: Mainichi Newspaper, 1994 (in Japanese); 300 articles, 3 categories (free format only)
(2) Conditions: LSI: K = 81; PLSI: K = 10
(3) Results, classification error:
- VSM: 42.7%
- LSI: 38.7%
- PLSI: 20.7%
- Proposed method: 6.0%

No. 16 5. Experimental Results

(4) Clustering process of the EM algorithm (figure: documents grouped into the categories sports, local, and business)

No. 17 5. Experimental Results

Class data:
- Class CS: Initial Questionnaires (IQ), Final Questionnaires (FQ), Mid-term Test (MT), Final Test (FT), Technical Report (TR)
- Class IS: Initial Questionnaires (IQ), Final Questionnaires (FQ), First Report (R1), Second Report (R2), Third Report (R3), Fourth Report (R4)

No. 18 5. Experimental Results

No. 19 5. Experimental Results

Experiment 1 (E1): treated as a supervised learning problem.
I) First, the documents of the students in Class CS and those in Class IS are merged.
II) Then, the merged documents are divided into two classes (S = 2) by the proposed method.
The clustering error C(e) is measured against the true classes (Class CS and Class IS).

20 No. 20 students Results of E1 (3) Clustering process for EM algorithm

No. 21 5. Experimental Results: Results of E1

(3) Clustering process of the EM algorithm (S = K = 2): C(e) = 0.411
(4) K-means method
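The K-means baseline referred to above can be sketched in plain Python; the toy 2-D points are hypothetical, whereas the experiment clusters document vectors:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
cents, cls = kmeans(pts, 2)
```

On these four points the two tight groups are recovered regardless of which two points seed the centroids.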

No. 22 5. Experimental Results: Results of E1

(1) C(e): the ratio of the number of students in the difference set between the two divided classes and the original classes to the total number of students. (Results are shown for the text-only and item-only cases.)
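The error measure above can be sketched as follows; minimizing over the two possible label matchings is an assumption, since the slide does not spell out how the divided classes are matched to the original ones:

```python
def clustering_error(true_labels, pred_labels):
    """C(e): fraction of students placed in a different class than the original,
    taking the better of the two matchings between cluster and class labels (0/1)."""
    n = len(true_labels)
    mismatch = sum(t != p for t, p in zip(true_labels, pred_labels))
    return min(mismatch, n - mismatch) / n

true_cls = [0, 0, 0, 1, 1]   # hypothetical original classes
pred_cls = [0, 0, 1, 1, 1]   # hypothetical clustering result
```

With one of five students misplaced, C(e) = 0.2, and relabeling the clusters does not change the value.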

No. 23 5. Experimental Results: Results of E1

(4) Statistical analysis by discriminant analysis

No. 24 5. Experimental Results

Experiment 2 (E2): treated as an unsupervised learning problem (clustering for the class partition problem).
Class CS is divided into Class S (Specialist) and Class G (Generalist) by the proposed method, using only the IQ documents, and the clustering error C(e) is evaluated.

No. 25 5. Experimental Results: Results of E2

(1) Membership of students in each class

By the student's own selection:
- S: having a good knowledge of technical terms; preferring evaluation by examination
- G: having much interest in the use of a computer

By clustering:
- S: having much interest in theory; having higher motivation for graduate school
- G: having much interest in the use of a computer; having a good knowledge of systems using the computer

No. 26 5. Experimental Results

(2) Membership of students in each class: by discriminant analysis, the two classes are evaluated for each partition, as interpreted in Table 5. The partition most convenient for the characteristics of the students should be chosen.

No. 27 5. Experimental Results: Discussion of the Experiments

(1) The present contents of the Initial Questionnaires (IQ) are adequate for E1; they should, however, be improved for E2.
(2) The performance of the proposed method depends on the structure of the characteristics of the students.
(3) If we derive multiple solutions for the partition of the students into two classes, it is possible to choose the better partition from the viewpoint of class management.
(4) It is impossible to predict a student's score from the IQ alone; it is, however, possible to do so with a cumulative proportion of 67.5% from both the IQ and the FQ.

Further results have been reported in [*].

[*] S. Hirasawa, T. Ishida, J. Itoh, M. Goto, and T. Sakai, "Analyses on student questionnaires with both fixed and free formats," (in Japanese) Proc. of Promotion of Information Society in University, pp. 144-145, Tokyo, Sep. 2003.

No. 28 6. Conclusion and Remarks

