No. 1 Knowledge Acquisition from Documents with both Fixed and Free Formats* Shigeichi Hirasawa, Department of Industrial and Management Systems Engineering.

Similar presentations
Text mining Gergely Kótyuk Laboratory of Cryptography and System Security (CrySyS) Budapest University of Technology and Economics

Text Categorization.
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Document Clustering Content: 1.Document Clustering Essentials. 2.Text Clustering Architecture 3.Preprocessing 4.Different Document Models 1.Probabilistic.
INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
Information retrieval – LSI, pLSI and LDA
Basic IR: Modeling Basic IR Task: Slightly more complex:
Experiments on Query Expansion for Internet Yellow Page Services Using Log Mining Summarized by Dongmin Shin Presented by Dongmin Shin User Log Analysis.
Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI) Jasminka Dobša Faculty of organization and informatics,
Knowledge Discovery from Questionnaires Shigeichi Hirasawa A Short Course at Tamkang University, May 2004 Department of Industrial and Management Systems.
What is missing? Reasons that ideal effectiveness hard to achieve: 1. Users’ inability to describe queries precisely. 2. Document representation loses.
Latent Semantic Analysis
IR Models: Overview, Boolean, and Vector
Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Principal Component Analysis
Expectation Maximization Method Effective Image Retrieval Based on Hidden Concept Discovery in Image Database By Sanket Korgaonkar Masters Computer Science.
1 Latent Semantic Indexing Jieping Ye Department of Computer Science & Engineering Arizona State University
Vector Space Information Retrieval Using Concept Projection Presented by Zhiguo Li
Latent Dirichlet Allocation a generative model for text
IR Models: Latent Semantic Analysis. IR Model Taxonomy Non-Overlapping Lists Proximal Nodes Structured Models User Task Set Theoretic Fuzzy Extended.
Vector Space Model CS 652 Information Extraction and Integration.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Information retrieval: overview. Information Retrieval and Text Processing Huge literature dating back to the 1950’s! SIGIR/TREC - home for much of this.
Dimension of Meaning Author: Hinrich Schutze Presenter: Marian Olteanu.
Student Questionnaire Analyses for Class Management based on Document Clustering and Classification Algorithms Shigeichi Hirasawa The 2009 International.
Other IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing User Task Classic Models boolean vector.
Chapter 5: Information Retrieval and Web Search
A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes Bilal Hawashin, Farshad Fotouhi Traian Marius Truta Department.
1 Vector Space Model Rong Jin. 2 Basic Issues in A Retrieval Model How to represent text objects What similarity function should be used? How to refine.
Text mining.
Information Retrieval by means of Vector Space Model of Document Representation and Cascade Neural Networks Igor Mokriš, Lenka Skovajsová Institute of.
Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.
Document retrieval Similarity –Vector space model –Multi dimension Search –Range query –KNN query Query processing example.
No. 1 Classification and clustering methods by probabilistic latent semantic indexing model A Short Course at Tamkang University Taipei, Taiwan, R.O.C.,
Feature selection LING 572 Fei Xia Week 4: 1/29/08 1.
On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems Chunqiang Tang, Sandhya Dwarkadas, Zhichen Xu University of Rochester; Yahoo! Inc. ACM.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.
Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning Jinghe Zhang 10/28/2014 CS 6501 Information Retrieval.
Pseudo-supervised Clustering for Text Documents Marco Maggini, Leonardo Rigutini, Marco Turchi Dipartimento di Ingegneria dell’Informazione Università.
Project 1: Machine Learning Using Neural Networks Ver 1.1.
Chapter 6: Information Retrieval and Web Search
Latent Semantic Indexing: A probabilistic Analysis Christos Papadimitriou Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala.
June 5, 2006University of Trento1 Latent Semantic Indexing for the Routing Problem Doctorate course “Web Information Retrieval” PhD Student Irina Veredina.
Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.
Badrul M. Sarwar, George Karypis, Joseph A. Konstan, and John T. Riedl
SINGULAR VALUE DECOMPOSITION (SVD)
Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry University of Tennessee.
1 A Compact Feature Representation and Image Indexing in Content- Based Image Retrieval A presentation by Gita Das PhD Candidate 29 Nov 2005 Supervisor:
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
Vector Space Models.
1 Latent Concepts and the Number Orthogonal Factors in Latent Semantic Analysis Georges Dupret
Latent Dirichlet Allocation
Intelligent Database Systems Lab Presenter : JIAN-REN CHEN Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang 2011.ESWA A comparative study of TF*IDF,
C.Watterscsci64031 Classical IR Models. C.Watterscsci64032 Goal Hit set of relevant documents Ranked set Best match Answer.
Term Weighting approaches in automatic text retrieval. Presented by Ehsan.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Wei Xu,
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.

No. 1 Classification Methods for Documents with both Fixed and Free Formats by PLSI Model* 2004International Conference in Management Sciences and Decision.
Document Clustering Based on Non-negative Matrix Factorization
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Special Topics in Data Mining Applications Focus on: Text Mining
Topic Models in Text Processing
CS 430: Information Discovery
Restructuring Sparse High Dimensional Data for Effective Retrieval
Latent Semantic Analysis
Presentation transcript:

No. 1 Knowledge Acquisition from Documents with both Fixed and Free Formats* Shigeichi Hirasawa, Department of Industrial and Management Systems Engineering, School of Science and Engineering, Waseda University, Japan; Wesley W. Chu, Computer Science Department, School of Engineering and Applied Science, University of California, Los Angeles, U.S.A. * A part of the work leading to this paper was done at UCLA during a sabbatical year of S.H. as a visiting faculty. IEEE International Conference on Systems, Man and Cybernetics, Oct. 5-8, 2003, Washington D.C.

No. 2  1. Introduction

Format / example in paper archives / matrix:
- Fixed format: Items (the names of authors, the names of journals, the year of publication, the names of publishers, the names of countries, the citation link) -> G = [g_mj], an item-document matrix
- Free format: Text (the text of a paper: Introduction, Preliminaries, ..., Conclusion) -> H = [h_ij], a term-document matrix

Notation:
d_j : the j-th document
t_i : the i-th term
i_m : the m-th item
g_mj : the selected result of the m-th item (i_m) in the j-th document (d_j)
h_ij : the frequency of the i-th term (t_i) in the j-th document (d_j)
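As a minimal illustration of these two matrices, the sketch below builds H from toy free-format text and G from toy fixed-format items; the documents, items, and terms are hypothetical, not the paper's data.

```python
# Minimal sketch (hypothetical data, not the paper's): build the
# term-document matrix H = [h_ij] and item-document matrix G = [g_mj].
from collections import Counter

docs = [
    {"items": {"journal": "IEEE SMC", "year": "2003"},   # fixed format
     "text": "knowledge acquisition from documents"},     # free format
    {"items": {"journal": "IPSJ", "year": "2004"},
     "text": "clustering documents and classification of documents"},
]

# H = [h_ij]: frequency of the i-th term t_i in the j-th document d_j
terms = sorted({t for d in docs for t in d["text"].split()})
counts = [Counter(d["text"].split()) for d in docs]
H = [[c[t] for c in counts] for t in terms]

# G = [g_mj]: the selected result of the m-th item i_m in document d_j
items = sorted({m for d in docs for m in d["items"]})
G = [[d["items"].get(m) for d in docs] for m in items]

print(H)  # e.g. the row for "documents" is [1, 2]
print(G)
```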

No. 3  2. Information Retrieval Model

Text mining: information retrieval, including
- Clustering
- Classification

Information retrieval models (base / model):
- Set theoretic (classical): Boolean Model; Fuzzy Model; Extended Boolean Model
- Algebraic (classical): Vector Space Model (VSM) [7]; Generalized VSM; Latent Semantic Indexing (LSI) Model [2]; Probabilistic LSI (PLSI) Model [4]; Neural Network Model
- Probabilistic (classical): Probabilistic Model; Extended Probabilistic Model; Inference Network Model; Bayesian Network Model

No. 4  2. Information Retrieval Model: The Vector Space Model (VSM)

tf(i,j) = f_ij : the frequency of the i-th term (t_i) in the j-th document (d_j) (local weight)
idf(i) = log(D / df(i)) : inverse document frequency (general weight)
df(i) : the number of documents in D for which the term t_i appears

The weight w_ij is given by

    w_ij = tf(i,j) * idf(i)     (1)
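A minimal numeric sketch of Eq. (1) on hypothetical counts (the weighting itself, tf times idf, is as defined above):

```python
# Minimal sketch of Eq. (1): w_ij = tf(i,j) * idf(i), with
# tf(i,j) = f_ij and idf(i) = log(D / df(i)); counts are hypothetical.
import math

f = [[2, 0, 1],    # f_ij: frequency of term t_i in document d_j
     [1, 1, 0],
     [0, 3, 1]]
D = len(f[0])                                      # number of documents
df = [sum(1 for x in row if x > 0) for row in f]   # df(i)
w = [[f[i][j] * math.log(D / df[i]) for j in range(D)]
     for i in range(len(f))]
print(w)
```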

No. 5  2. Information Retrieval Model

t_i = (a_i1, a_i2, ..., a_iD) : the i-th row of A (term vector)
d_j = (a_1j, a_2j, ..., a_Tj)^T : the j-th column of A (document vector)
q = (q_1, q_2, ..., q_T)^T : the query vector

The similarity s(q, d_j) between q and d_j is the cosine measure:

    s(q, d_j) = (q . d_j) / (|q| |d_j|)     (2)-(5)
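A minimal sketch of this similarity computation, reading s(q, d_j) as the cosine measure given above; the vectors are hypothetical:

```python
# Minimal sketch of the VSM similarity s(q, d_j) as the cosine measure.
import math

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

q = [1, 0, 1]      # query vector over T terms (hypothetical)
d1 = [2, 1, 0]     # document vectors (columns of A)
d2 = [0, 1, 3]
print(cosine(q, d1), cosine(q, d2))
```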

No. 6  2. Information Retrieval Model: The Latent Semantic Indexing (LSI) Model (1)

SVD: Singular Value Decomposition. The term-document matrix A is decomposed as

    A = U S V^T     (6)

and approximated by the rank-K matrix A_K = U_K S_K V_K^T, where U_K and V_K keep the first K singular vectors and S_K the K largest singular values.

No. 7  2. Information Retrieval Model: The LSI Model (2)

A document d_j and a query q are projected into the K-dimensional latent space by

    d'_j = S_K^{-1} U_K^T A e_j     (7)
    q'   = S_K^{-1} U_K^T q         (8)

where e_j is the j-th canonical vector.
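The following sketch mirrors this pipeline with numpy: a rank-K SVD of a hypothetical A, then the projections of Eqs. (7) and (8) as reconstructed above.

```python
# Minimal LSI sketch: rank-K SVD of the term-document matrix A, then
# folding documents and a query into the K-dimensional latent space.
# The matrix A and the query q are hypothetical.
import numpy as np

A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [1., 0., 2.]])              # T x D term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
K = 2
U_K, S_K, Vt_K = U[:, :K], np.diag(s[:K]), Vt[:K, :]
A_K = U_K @ S_K @ Vt_K                    # best rank-K approximation

q = np.array([1., 0., 1., 0.])            # query over the T terms
q_hat = np.linalg.inv(S_K) @ U_K.T @ q    # Eq. (8): query in latent space
d_hat = np.linalg.inv(S_K) @ U_K.T @ A    # Eq. (7): documents as columns
print(q_hat, d_hat[:, 0])
```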

No. 8  2. Information Retrieval Model: The Probabilistic LSI (PLSI) Model

(1) Preliminary
A) A = [a_ij], a_ij = f_ij : the frequency of the term t_i in the document d_j
B) Reduction of dimension, similar to LSI
C) Latent class (state) model based on factor analysis, with a set of latent states Z = {z_1, z_2, ..., z_K}
D) Assumptions:
   (i) independence between the pairs (t_i, d_j)
   (ii) conditional independence between t_i and d_j given the state z_k:

    P(t_i, d_j) = sum_k P(z_k) P(t_i | z_k) P(d_j | z_k)     (12)

No. 9  2. Information Retrieval Model: The PLSI Model (2)

No. 10  2. Information Retrieval Model: The PLSI Model (3)
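The update equations on the two slides above are not preserved in this transcript; as a sketch, the standard EM iteration for the aspect model of Eq. (12), following the PLSI of [4], on hypothetical counts:

```python
# Sketch of PLSI parameter estimation by EM for
# P(t_i, d_j) = sum_k P(z_k) P(t_i|z_k) P(d_j|z_k); data hypothetical.
import numpy as np

rng = np.random.default_rng(0)
F = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])              # f_ij (hypothetical counts)
T, D, K = F.shape[0], F.shape[1], 2

Pz = np.full(K, 1.0 / K)                  # P(z_k)
Pt = rng.random((T, K)); Pt /= Pt.sum(0)  # P(t_i | z_k)
Pd = rng.random((D, K)); Pd /= Pd.sum(0)  # P(d_j | z_k)

for _ in range(50):
    # E-step: posterior P(z_k | t_i, d_j), shape (T, D, K)
    joint = Pz[None, None, :] * Pt[:, None, :] * Pd[None, :, :]
    post = joint / joint.sum(2, keepdims=True)
    # M-step: re-estimate from expected counts n_ijk = f_ij * post
    n = F[:, :, None] * post
    Pt = n.sum(1) / n.sum((0, 1))         # sum over documents
    Pd = n.sum(0) / n.sum((0, 1))         # sum over terms
    Pz = n.sum((0, 1)) / F.sum()
print(Pz)
```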

No. 11  3. Formats of Documents

No. 12  4. Proposed Methods

Clustering method:
K : the number of latent states
S : the number of clusters
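The slides detailing the method itself are not preserved in this transcript. Purely as an illustration of how fitted PLSI latent states can drive a clustering (here the case S = K), each document can be assigned to its most probable state:

```python
# Illustration only (not the paper's method): assign each document d_j
# to the latent state maximizing P(z_k | d_j), for the case S = K.
import numpy as np

Pz = np.array([0.5, 0.5])                 # P(z_k), e.g. from the EM sketch
Pd = np.array([[0.7, 0.1],                # P(d_j | z_k), one row per d_j
               [0.2, 0.1],
               [0.1, 0.8]])
post = Pz[None, :] * Pd                   # proportional to P(z_k | d_j)
clusters = post.argmax(axis=1)            # cluster label per document
print(clusters)
```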

No. 13  4. Proposed Methods (2)

No. 14  4. Proposed Methods (3)

No. 15  5. Experimental Results

Preliminary experiment [5]: a supervised classification problem.
(1) Experimental data: Mainichi Newspaper in '94 (in Japanese), 300 articles, 3 categories (free format only)
(2) Conditions: LSI: K = 81; PLSI: K = 10
(3) Results (classification error):
    VSM             42.7%
    LSI             38.7%
    PLSI            20.7%
    Proposed method  6.0%

No. 16  5. Experimental Results. (4) Clustering process of the EM algorithm (cluster labels: sports, local, business)

No. 17  5. Experimental Results

Class data:
Class CS
- Initial Questionnaires (IQ)
- Final Questionnaires (FQ)
- Mid-term Test (MT)
- Final Test (FT)
- Technical Report (TR)
Class IS
- Initial Questionnaires (IQ)
- Final Questionnaires (FQ)
- First Report (R1)
- Second Report (R2)
- Third Report (R3)
- Fourth Report (R4)

No. 18  5. Experimental Results

No. 19  5. Experimental Results

Experiment 1 (E1): as a supervised learning problem.
I) First, the documents of the students in Class CS and those in Class IS are merged.
II) Then, the merged documents are divided into two classes (S = 2) by the proposed method.
The division is compared against the true classes (Class CS and Class IS) to measure the clustering error C(e).

No. 20  Results of E1: (3) Clustering process of the EM algorithm (axis: students)

No. 21  5. Experimental Results. Results of E1: (3) Clustering process of the EM algorithm, S = K = 2, C(e) = 0.411; (4) K-means method

No. 22  5. Experimental Results

Results of E1:
(1) C(e): the ratio of the number of students in the difference set between the two divided classes and the original classes to the total number of students. (Text only) (Item only)
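A sketch of this error measure under one natural reading of the definition: count the students whose cluster label disagrees with the original class, matching cluster labels to classes in the more favorable of the two ways; the labels below are hypothetical.

```python
# Sketch of the clustering error C(e): fraction of students whose
# cluster label differs from the original class, minimized over the
# two possible matchings of cluster labels to classes.
def clustering_error(true_labels, cluster_labels):
    n = len(true_labels)
    direct = sum(t != c for t, c in zip(true_labels, cluster_labels))
    flipped = sum(t != 1 - c for t, c in zip(true_labels, cluster_labels))
    return min(direct, flipped) / n

true = [0, 0, 0, 1, 1, 1, 1, 1]       # CS = 0, IS = 1 (hypothetical)
pred = [0, 1, 0, 1, 1, 0, 1, 1]
print(clustering_error(true, pred))   # 0.25 for this toy case
```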

No. 23  5. Experimental Results. Results of E1: (4) Statistical analysis by discriminant analysis
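The slide's content beyond its title is not preserved; as a hedged sketch of such an analysis, Fisher's linear discriminant via scikit-learn on hypothetical per-student features (the paper's actual variables and settings are not given here):

```python
# Hedged sketch: linear discriminant analysis of two clusters of
# students; the feature vectors and labels are hypothetical.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[0.9, 0.2], [0.8, 0.1], [0.3, 0.7],
              [0.2, 0.9], [0.1, 0.8], [0.7, 0.4]])
y = np.array([0, 0, 1, 1, 1, 0])      # the two clusters from E1
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y))                # separability of the two classes
print(lda.coef_)                      # discriminant weights
```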

No. 24  5. Experimental Results

Experiment 2 (E2): as an unsupervised learning problem; clustering for the class-partition problem.
Class CS is divided into Class S and Class G by the proposed method, using only the Initial Questionnaires (IQ), and the clustering error C(e) is measured.
S: Specialist; G: Generalist

No. 25  5. Experimental Results

Results of E2:
(1) Members of students in each class

By student's selection:
- Class S: having a good knowledge of technical terms; hoping for evaluation by exam
- Class G: having much interest in the use of a computer

By clustering:
- Class S: having much interest in theory; having higher motivation for a graduate school
- Class G: having much interest in the use of a computer; having a good knowledge of systems using the computer

No. 26  5. Experimental Results

(2) Members of students in each class
By discriminant analysis, the two classes are evaluated for each partition, as interpreted in Table 5. The partition most convenient for the characteristics of the students should be chosen.

No. 27  5. Experimental Results

Discussions on the experiments:
(1) The present contents of the Initial Questionnaires (IQ) are appropriate for E1; they should, however, be improved for E2.
(2) The performance of the proposed method depends on the structure of the characteristics of the students.
(3) If we derive multiple solutions for the partition of students into two classes, it is possible to choose the better partition from the viewpoint of class management.
(4) It is impossible to predict a student's score from the IQ alone; it is, however, possible to do so from both the IQ and FQ with 67.5% in cumulative proportion.

Further results have been reported in [*].
[*] S. Hirasawa, T. Ishida, J. Itoh, M. Goto, and T. Sakai, "Analyses on student questionnaires with both fixed and free formats" (in Japanese), Proc. of Promotion of Information Society in University, pp. , Tokyo, Sep.

No. 28  Conclusion and Remarks (1) (2) (3) (4) (5)