No. 1 Classification and clustering methods by probabilistic latent semantic indexing model. A Short Course at Tamkang University, Taipei, Taiwan, R.O.C., March 7-9, 2006. Shigeichi Hirasawa, Department of Industrial and Management Systems Engineering, School of Science and Engineering, Waseda University, Japan

No. 2 1. Introduction

Format | Example in paper archives | Matrix
Fixed format | Items: the names of authors, the names of journals, the year of publication, the names of publishers, the names of countries, the citation links | G = [g_mj]: an item-document matrix
Free format | Text: the text of a paper (1. Introduction, 2. Preliminaries, ..., Conclusion) | H = [h_ij]: a term-document matrix

d_j: the j-th document
t_i: the i-th term
i_m: the m-th item
g_mj: the selected result of the m-th item (i_m) in the j-th document (d_j)
h_ij: the frequency of the i-th term (t_i) in the j-th document (d_j)

No. 3 2. Information Retrieval Model

Text Mining: Information Retrieval, including Clustering and Classification.

Base | Model
Set theoretic | (Classical) Boolean Model; Fuzzy Model; Extended Boolean Model
Algebraic | (Classical) Vector Space Model (VSM) [BYRN99]; Generalized VSM; Latent Semantic Indexing (LSI) Model [BYRN99]; Neural Network Model
Probabilistic | (Classical) Probabilistic Model; Extended Probabilistic Model; Probabilistic LSI (PLSI) Model [Hofmann99]; Inference Network Model; Bayesian Network Model

No. 4 2. Information Retrieval Model

The Vector Space Model (VSM)

tf(i,j) = f_ij: the number of occurrences of the i-th term (t_i) in the j-th document (d_j) (local weight)
idf(i) = log(D / df(i)): global weight, where df(i) is the number of documents in D in which the term t_i appears

The weight w_ij is given by

    w_ij = tf(i,j) · idf(i)    (1)
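As an illustration, here is a minimal numpy sketch of this weighting; the function name and the assumption that `freq` holds the raw frequencies f_ij are ours, not the course's code:

```python
import numpy as np

def tfidf_weights(freq):
    """w_ij = tf(i,j) * idf(i) for a T x D term-document frequency
    matrix `freq` (rows: terms t_i, columns: documents d_j)."""
    D = freq.shape[1]
    # df(i): number of documents in which term t_i appears
    df = np.count_nonzero(freq > 0, axis=1)
    # idf(i) = log(D / df(i)); guard terms that appear nowhere
    idf = np.log(D / np.maximum(df, 1))
    return freq * idf[:, np.newaxis]   # local weight x global weight
```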

No. 5 2. Information Retrieval Model

t_i = (a_i1, a_i2, ..., a_iD): the i-th row (term vector)
d_j = (a_1j, a_2j, ..., a_Tj): the j-th column (document vector)
q = (q_1, q_2, ..., q_T)^T: the query vector

The similarity s(q, d_j) between q and d_j:

    s(q, d_j) = (q · d_j) / (‖q‖ ‖d_j‖)    (2)
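Assuming eq. (2) is the usual cosine measure of the VSM, a small sketch:

```python
import numpy as np

def similarity(q, d):
    """Cosine similarity s(q, d_j) between a query vector q and a
    document vector d_j (both length T, TF-IDF weighted)."""
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d / denom) if denom > 0 else 0.0
```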

No. 6 2. Information Retrieval Model

The Latent Semantic Indexing (LSI) Model

(1) The T × D matrix A = [a_ij] of eq. (2.1) is decomposed by the singular value decomposition A = U Σ V^T and replaced by its rank-K approximation

    A_K = U_K Σ_K V_K^T,  K ≤ min(T, D),    (2.6)

which minimizes ‖A − A_K‖ among all matrices of rank K.
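A sketch of this decomposition with numpy's SVD; the names and the folding-in helper are illustrative assumptions, not the course's code:

```python
import numpy as np

def lsi_decompose(A, K):
    """Rank-K SVD approximation A_K = U_K Sigma_K V_K^T, which
    minimizes ||A - A_K||_F over all rank-K matrices."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_K, s_K, Vt_K = U[:, :K], s[:K], Vt[:K, :]
    # columns of (s_K * Vt_K) are the K-dim document representations
    doc_vectors = (s_K[:, np.newaxis] * Vt_K).T   # shape (D, K)
    return U_K, s_K, Vt_K, doc_vectors

def fold_in_query(q, U_K, s_K):
    """Project a query into the latent space: Sigma_K^{-1} U_K^T q."""
    return (U_K.T @ q) / s_K
```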

No. 7 2. Information Retrieval Model

(2) The j-th document is mapped into the K-dimensional latent space as

    d̂_j = Σ_K^{-1} U_K^T A e_j,    (2.4)

where e_j is the j-th canonical vector; a query q is mapped likewise before the similarity of eq. (2) is computed.

No. 8 2. Information Retrieval Model

The Probabilistic LSI (PLSI) Model

(1) Preliminaries
A) A = [a_ij], a_ij = f_ij: the frequency of the term t_i in the document d_j
B) Reduction of dimension, similar to LSI
C) Latent classes (a state model based on factor analysis): Z = {z_1, z_2, ..., z_K}, a set of latent states
D) Assumptions: (i) independence between the pairs (t_i, d_j); (ii) conditional independence between t_i and d_j given a state z:

    Pr(t_i, d_j) = Σ_{z∈Z} Pr(z) Pr(t_i | z) Pr(d_j | z)    (2.10)
    Pr(t_i | d_j) = Σ_{z∈Z} Pr(t_i | z) Pr(z | d_j)    (2.11)

No. 9 2. Information Retrieval Model

(2) The parameters Pr(z), Pr(t_i | z), and Pr(d_j | z) are estimated by maximizing the log-likelihood

    L = Σ_i Σ_j f_ij log Pr(t_i, d_j),    (2.13)

where f_ij is the term frequency of eq. (2.1).

No. 10 2. Information Retrieval Model

(3) Maximization of eq. (2.13) under eq. (2.11) by the (tempered) EM algorithm:

E-step:
    Pr(z | t_i, d_j) = Pr(z) Pr(t_i | z) Pr(d_j | z) / Σ_{z'} Pr(z') Pr(t_i | z') Pr(d_j | z')    (2.14)

M-step:
    Pr(t_i | z) ∝ Σ_j f_ij Pr(z | t_i, d_j)    (2.15)
    Pr(d_j | z) ∝ Σ_i f_ij Pr(z | t_i, d_j)    (2.16)
    Pr(z) ∝ Σ_i Σ_j f_ij Pr(z | t_i, d_j)    (2.17)
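A compact, untempered EM sketch of these updates (β = 1; a full TEM run would additionally anneal a tempering exponent β in the E-step); `F` is the T × D frequency matrix and all names are illustrative:

```python
import numpy as np

def plsi_em(F, K, n_iter=100, seed=0):
    """EM for the PLSI aspect model: estimates Pr(z), Pr(t_i|z),
    Pr(d_j|z) by maximizing sum_ij f_ij log Pr(t_i, d_j)."""
    rng = np.random.default_rng(seed)
    T, D = F.shape
    p_z = np.full(K, 1.0 / K)
    p_t_z = rng.random((T, K)); p_t_z /= p_t_z.sum(axis=0)
    p_d_z = rng.random((D, K)); p_d_z /= p_d_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: Pr(z | t_i, d_j) for every pair, shape (T, D, K)
        joint = p_z * p_t_z[:, None, :] * p_d_z[None, :, :]
        post = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step: reweight the posteriors by the frequencies f_ij
        resp = F[:, :, None] * post
        p_t_z = resp.sum(axis=1); p_t_z /= np.maximum(p_t_z.sum(axis=0), 1e-12)
        p_d_z = resp.sum(axis=0); p_d_z /= np.maximum(p_d_z.sum(axis=0), 1e-12)
        p_z = resp.sum(axis=(0, 1)); p_z /= p_z.sum()
    return p_z, p_t_z, p_d_z
```

This dense version keeps the full (T, D, K) posterior in memory, which is fine for the small collections used here; a production version would iterate over nonzero f_ij only.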

No. 11 3. Proposed Method

3.1 Classification method

Categories: C_1, C_2, ..., C_K.

(1) Choose a subset of documents D* (⊆ D) which are already categorized, and compute the representative document vectors d*_1, d*_2, ..., d*_K:

    d*_k = (1/n_k) Σ_{d_j ∈ C_k ∩ D*} d_j,    (3.1)

where n_k is the number of documents selected from C_k to compute its representative vector.

(2) Compute the probabilities Pr(z_k), Pr(d_j | z_k), and Pr(t_i | z_k) which maximize eq. (2.13) by the TEM algorithm, where ‖Z‖ = K.

(3) Decide the state for d_j as

    z(d_j) = argmax_{z_k} Pr(z_k | d_j).    (3.2)
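A sketch of steps (1) and (3) under the reconstruction above (step (2) is the EM routine sketched earlier); the helper names and the exact form of eq. (3.2) are our assumptions:

```python
import numpy as np

def representative_vectors(doc_vectors, labels, K):
    """Step (1): representative vector d*_k = mean of the n_k
    categorized documents of C_k (labels take values 0..K-1)."""
    labels = np.asarray(labels)
    return np.stack([doc_vectors[labels == k].mean(axis=0)
                     for k in range(K)])

def classify(p_z, p_d_z):
    """Step (3): state z(d_j) = argmax_k Pr(z_k) Pr(d_j | z_k),
    which is proportional to Pr(z_k | d_j); p_d_z has shape (D, K)."""
    return np.argmax(p_z * p_d_z, axis=1)
```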

No. 12 3. Proposed Method

3.2 Clustering method

‖Z‖ = K: the number of latent states; S: the number of clusters.

(1) Choose a proper K (≥ S) and compute the probabilities Pr(z_k), Pr(d_j | z_k), and Pr(t_i | z_k) which maximize eq. (2.13) by the TEM algorithm, where ‖Z‖ = K.

(2) Decide the state for d_j as

    z(d_j) = argmax_{z_k} Pr(z_k | d_j).    (3.3)

If S = K, then each state z_k directly gives one cluster.

No. 13 3. Proposed Method

(3) If S < K, then compute a similarity measure s(z_k, z_k') between states (eqs. (3.4) and (3.5)), and agglomeratively cluster the states z_k by the group average distance method under s(z_k, z_k') until the number of clusters becomes S. This yields S clusters, and the documents of each cluster are those assigned to its member states.
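A sketch of step (3) with SciPy's group-average linkage; since eqs. (3.4)-(3.5) are not reproduced here, we assume a cosine-type similarity between the term distributions Pr(t_i | z_k):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_states(p_t_z, S):
    """Agglomerate the K latent states into S clusters by group-
    average linkage on the columns Pr(. | z_k) of p_t_z (T x K)."""
    # pdist expects one observation per row: one row per state z_k
    dist = pdist(p_t_z.T, metric='cosine')   # 1 - cosine similarity
    Z = linkage(dist, method='average')      # group average distance
    return fcluster(Z, t=S, criterion='maxclust')  # labels in 1..S
```

Each document then inherits the cluster of its state z(d_j) from eq. (3.3).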

No. 14 4. Experimental Results

4.1 Document sets

Table 4.1: Document sets

Set | Contents | Format | # words T / amount | Categorized | # selected documents D_L + D_T
(a) | Articles of the Mainichi Newspaper in '94 [Sakai99] | Free (texts only) | 107, ,058 (see Table 4.2) | Yes (9+1 categories) | 300 (S = 3)
(b) | (same as (a)) | Free (texts only) | (see Table 4.2) | Yes (9+1 categories) | 200 ~ 300 (S = 2 ~ 8)
(c) | Questionnaire (see Table 4.3 in detail) | Fixed and free (see Table 4.3) | 3, | Yes (2 categories) |

No. 15 4. Experimental Results

4.2 Classification problem: (a)

Experimental data: Mainichi Newspaper in '94 (in Japanese); 300 articles, 3 categories (free format only).

Conditions of (a)

Table 4.2: Selected categories of newspaper

Category | Contents | # articles D_L + D_T | # used for training D_L | # used for test D_T
C_1 | business | 100 | 50 | 50
C_2 | local | 100 | 50 | 50
C_3 | sports | 100 | 50 | 50
total | | 300 | 150 | 150

LSI: K = 81; PLSI: K = 10

No. 16 4. Experimental Results

Results of (a)

Table 4.5: Number of documents classified from C_k to C_k' for each method (confusion matrices over C_1, C_2, C_3 for the VS, LSI, PLSI, and proposed methods).

No. 17 4. Experimental Results

Table 4.6: Classification error rate

Method | Classification error
VSM | 42.7%
LSI | 38.7%
PLSI | 20.7%
Proposed method | 6.0%

No. 18 4. Experimental Results

[Figure] Clustering process by the EM algorithm (cluster labels: business, local, sports)

No. 19 4. Experimental Results

4.3 Classification problem: (b)

We choose S = 2, 3, ..., 8 categories, each of which contains 100 ~ 450 randomly chosen articles. Half of them, D_L, is used for training, and the rest, D_T, for test.

Condition of (b)

No. 20 4. Experimental Results

Results of (b)

Fig. 4.2: Classification error rate for D_L

No. 21 4. Experimental Results

Results of (b)

Fig. 4.3: Classification error rate for S

No. 22 4. Experimental Results

4.4 Clustering problem: (c)

Student Questionnaire

Table 4.3: Contents of the initial questionnaire

Format | Number of questions | Examples
Fixed (item) | 7 major questions *2 | For how many years have you used computers? / Do you have a plan to study abroad? / Can you assemble a PC? / Do you have any license in information technology? / Write 10 terms in information technology which you know. *4
Free (text) | 5 questions *3 | Write about your knowledge and experience on computers. / What kind of job will you have after graduation? / What do you imagine from the name of the subject?

*2 Each major question has 4-21 minor questions.
*3 Each text is written within Chinese and Japanese characters.
*4 There is a possibility to improve the performance of the proposed method by eliminating these items.

No. 23 4. Experimental Results

4.4 Clustering problem: (c)

Object classes

Table 4.4: Object classes

Name of subject | Course | Number of students
Introduction to Computer Science (Class CS) | Science Course | 135
Introduction to Information Society (Class IS) | Literary Course | 35

No. 24 4. Experimental Results

Condition of (c)

I) First, the documents of the students in Class CS and those in Class IS are merged.
II) Then, the merged documents are divided into two classes (S = 2) by the proposed method, and the result is compared with the true classes (Class CS / Class IS) to obtain the clustering error C(e).

Fig. 4.4: Class partition problem by the clustering method

No. 25 Results of (c)

Fig. 4.7: Clustering process by the EM algorithm, K = 2 (axis: students)

No. 26 [Figure] (axis: similarity)

No. 27 4. Experimental Results

S = K = 2: C(e) = 0.411 (K-means method)

Fig. 4.7: Clustering process by the EM algorithm, K = 3

No. 28 4. Experimental Results

C(e): the ratio of the number of students in the difference set between the two divided classes and the original classes to the total number of students.

Fig. 4.5: Clustering error rate C(e) vs. λ (from item only to text only)
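A small sketch of C(e) under this definition, for the two-class case; the relabelling step reflects that the two classes produced by clustering carry no inherent order:

```python
import numpy as np

def clustering_error(true_labels, pred_labels):
    """C(e): fraction of students whose divided class differs from
    the original class, taking the better of the two possible label
    matchings (valid for S = 2)."""
    true = np.asarray(true_labels)
    pred = np.asarray(pred_labels)
    err = np.mean(true != pred)
    return min(err, 1.0 - err)   # relabel-invariant for two classes
```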

No. 29 4. Experimental Results

Fig. 4.6: Clustering error rate C(e) vs. K

No. 30 4. Experimental Results

Results of (c): statistical analysis by discriminant analysis

No. 31 4. Experimental Results

Another experiment: clustering for the class partition problem, using only the initial questionnaire (IQ). Class CS is partitioned into Class S (Specialist) and Class G (Generalist) by the proposed method, and the clustering error C(e) is measured.

Fig.: Another class partition problem by the clustering method

No. 32 4. Experimental Results

(1) Members of students in each class

Partition, class | Characteristics of students
Student's selection, S | Having a good knowledge of technical terms; hoping for evaluation by exam
Student's selection, G | Having much interest in the use of a computer
Clustering, S | Having much interest in theory; having higher motivation for graduate school
Clustering, G | Having much interest in the use of a computer; having a good knowledge of systems using the computer

No. 33 4. Experimental Results

(2) Members of students in each class

By discriminant analysis, the two classes of each partition are evaluated and interpreted as in Table 5. The partition most suitable for characterizing the students should be chosen.

No. 34 5. Concluding Remarks

(1) We have proposed a classification method for a set of documents and extended it to a clustering method.
(2) Using the idea of the PLSI model, the classification method achieves better performance on document sets of comparatively small size.
(3) The clustering method also performs well. We show that it is applicable to documents with both fixed and free formats by introducing a weighting parameter λ.
(4) As an important related study, it is necessary to develop a method for abstracting the characteristics of each cluster [HIIGS03][IIGSH03-b].
(5) An extension to sets of documents of comparatively large size also remains for further investigation.