
Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI)
Jasminka Dobša, Faculty of Organization and Informatics, Varaždin

Outline
- Information retrieval in the vector space model (VSM), or bag-of-words representation
- Techniques for conceptual indexing
  - Latent semantic indexing
  - Concept indexing
- Comparison: academic example
- Experiment
- Further work

Information retrieval in VSM 1/3
- Task of information retrieval: to extract the documents that are relevant for a user query from a document collection
- In the VSM, documents are represented in a high-dimensional space
- The dimension of the space depends on the number of indexing terms chosen as relevant for the collection (several thousand in my experiments)
- The VSM is implemented by forming the term-document matrix

Information retrieval in VSM 2/3
- The term-document matrix is an m × n matrix, where m is the number of terms and n is the number of documents
- A row of the term-document matrix corresponds to a term; a column corresponds to a document
- Figure 1: term-document matrix

Information retrieval in VSM 3/3
- A query has the same shape as a document (an m-dimensional vector)
- The measure of similarity between a query q and a document a_j is the cosine of the angle between the two vectors: cos(q, a_j) = (q^T a_j) / (||q|| ||a_j||)
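A minimal numpy sketch of VSM retrieval: score every document column of a toy term-document matrix by the cosine of its angle with the query. The matrix values and the query are illustrative only, not taken from the experiments.

```python
import numpy as np

# Toy term-document matrix A (m terms x n documents) and a query vector q.
# Values are illustrative raw term frequencies.
A = np.array([[2, 0, 1, 0],
              [0, 1, 0, 3],
              [1, 1, 0, 0],
              [0, 2, 1, 1]], dtype=float)
q = np.array([1, 0, 1, 0], dtype=float)

# Cosine of the angle between the query and every document column.
doc_norms = np.linalg.norm(A, axis=0)
scores = (q @ A) / (np.linalg.norm(q) * doc_norms)

# Rank documents by decreasing similarity to the query.
ranking = np.argsort(-scores)
print(scores, ranking)
```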

Retrieval performance evaluation
- Measures for evaluation: recall, precision, average precision
- Recall at rank i: r_i / r_n; precision at rank i: r_i / i
- r_i is the number of relevant documents among the i highest-ranked documents; r_n is the total number of relevant documents in the collection
- Average precision: the average of the precision values at distinct levels of recall
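A plain-Python sketch of these measures, assuming a ranked list of document indices and the set of relevant documents; average precision is computed here as the average of the precision values at the ranks where relevant documents appear, which corresponds to averaging precision over the distinct recall levels reached.

```python
from typing import Sequence, Set

def precision_recall_at(ranking: Sequence[int], relevant: Set[int], i: int):
    """Precision and recall after the i highest-ranked documents."""
    retrieved = ranking[:i]
    r_i = sum(1 for d in retrieved if d in relevant)   # relevant among top i
    precision = r_i / i
    recall = r_i / len(relevant)                       # r_n = total relevant
    return precision, recall

def average_precision(ranking: Sequence[int], relevant: Set[int]) -> float:
    """Average of the precision values at the ranks where recall increases."""
    precisions = [precision_recall_at(ranking, relevant, i + 1)[0]
                  for i, d in enumerate(ranking) if d in relevant]
    return sum(precisions) / len(relevant)

# Example: documents 0 and 3 are relevant for the query.
print(average_precision([3, 1, 0, 2], {0, 3}))  # (1/1 + 2/3) / 2
```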

Techniques for conceptual indexing
- In the term-matching method, similarity between a query and a document is tested lexically
- Polysemy (words having multiple meanings) and synonymy (multiple words having the same meaning) are two fundamental obstacles to effective information retrieval
- Here we compare two techniques for conceptual indexing, based on projecting the document vectors (in the least-squares sense) onto a lower-dimensional vector space:
  - Latent semantic indexing (LSI)
  - Concept indexing (CI)

Latent semantic indexing
- Introduced in 1990; improved in 1995
  - S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, Journal of the American Society for Information Science, 41, 1990
  - M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995
- Based on spectral analysis of the term-document matrix

Latent semantic indexing
- Every m × n matrix A has a singular value decomposition (SVD): A = U Σ V^T
- U is an orthogonal m × m matrix whose columns are the left singular vectors of A
- Σ is a diagonal matrix whose diagonal holds the singular values of A in descending order
- V is an orthogonal n × n matrix whose columns are the right singular vectors of A

Latent semantic indexing
- For LSI, the truncated SVD is used: A_k = U_k Σ_k V_k^T
- U_k is an m × k matrix whose columns are the first k left singular vectors of A
- Σ_k is a k × k diagonal matrix whose diagonal is formed by the k leading singular values of A
- V_k is an n × k matrix whose columns are the first k right singular vectors of A
- Rows of U_k correspond to terms; rows of V_k correspond to documents
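A minimal numpy sketch of the truncated SVD on a toy matrix; the variable names mirror the notation above (U_k, S_k for Σ_k, V_k), and the matrix itself is random rather than a real term-document matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 15))          # toy m x n term-document matrix

# Economy-size SVD; singular values come back in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k = U[:, :k]                    # m x k: first k left singular vectors (terms)
S_k = np.diag(s[:k])              # k x k: leading singular values
V_k = Vt[:k, :].T                 # n x k: first k right singular vectors (documents)

A_k = U_k @ S_k @ V_k.T           # rank-k approximation of A
print(np.linalg.norm(A - A_k, "fro"))
```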

(Truncated) SVD

Latent semantic indexing
- Using the truncated SVD we keep only the first k linearly independent components of A (singular vectors and values)
- Documents are projected, in the least-squares sense, onto the space spanned by the first k singular vectors of A (the LSI space)
- The first k components capture the major associational structure in the term-document matrix and throw out the noise
- Minor differences in the terminology used in documents are ignored
- Closeness of objects (queries and documents) is determined by the overall pattern of term usage, so it is context based
- Documents which contain synonyms are closer in the LSI space than in the original space; documents which contain the same polysemous word used in different contexts are farther apart in the LSI space than in the original space
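A common way to place a query into the LSI space is to fold it in as q_hat = S_k^{-1} U_k^T q and compare it with the document rows of V_k by cosine similarity; this is the usual folding-in formula from the LSI literature, not necessarily the exact procedure used in the experiments, and the data below are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((50, 15))                         # toy term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Fold the query into the k-dimensional LSI space: q_hat = S_k^{-1} U_k^T q.
q = rng.random(A.shape[0])                       # toy m-dimensional query
q_hat = np.linalg.solve(S_k, U_k.T @ q)

# Documents live in the rows of V_k; rank them by cosine similarity to q_hat.
sims = (V_k @ q_hat) / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-sims))
```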

Concept indexing (CI)
- Indexing using the concept decomposition (CD) instead of the SVD used in LSI
- Concept decomposition was introduced in 2001
  - I. S. Dhillon, D. S. Modha: Concept decomposition for large sparse text data using clustering, Machine Learning, 42:1, 2001

Concept decomposition
- First step: clustering of the documents of the term-document matrix A into k groups
- Clustering algorithms:
  - Spherical k-means algorithm
  - Fuzzy k-means algorithm
- The spherical k-means algorithm is a variant of the k-means algorithm which exploits the fact that the document vectors have unit norm
- Centroids of the groups = concept vectors
- The concept matrix is the matrix whose columns are the centroids of the groups: C_k = [c_1 c_2 … c_k], where c_j is the centroid of the j-th group
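A compact sketch of spherical k-means, assuming the document columns have already been normalized to unit length; it follows the usual formulation (assign each document to the nearest concept by cosine, then recompute and renormalize the centroids) and is not tuned for large collections.

```python
import numpy as np

def spherical_kmeans(A: np.ndarray, k: int, n_iter: int = 20, seed: int = 0):
    """A: m x n matrix with unit-norm document columns. Returns the m x k concept matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Initialize concept vectors with k randomly chosen documents.
    C = A[:, rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment: each document goes to the closest concept (cosine similarity
        # reduces to a dot product because all columns have unit norm).
        labels = np.argmax(C.T @ A, axis=0)
        # Update: new concept vector = normalized mean of the documents in the group.
        for j in range(k):
            members = A[:, labels == j]
            if members.size:
                c = members.sum(axis=1)
                C[:, j] = c / np.linalg.norm(c)
    return C

# Usage: normalize the document columns, then cluster.
A = np.random.default_rng(2).random((50, 15))
A = A / np.linalg.norm(A, axis=0)
C_k = spherical_kmeans(A, k=3)
```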

Concept decomposition
- Second step: calculating the concept decomposition
- The concept decomposition D_k of the term-document matrix A is the least-squares approximation of A on the space of the concept vectors: D_k = C_k Z, where Z is the solution of the least-squares problem min_Z ||A - C_k Z||_F
- Rows of C_k correspond to terms; columns of Z correspond to documents
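Given a concept matrix C_k, the concept decomposition reduces to an ordinary least-squares problem, which numpy's lstsq solves column by column; the concept matrix below is a random stand-in for actual clustering output, so only the mechanics are illustrated.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((50, 15))                      # toy m x n term-document matrix
A = A / np.linalg.norm(A, axis=0)             # unit-norm document columns
C_k = rng.random((50, 3))                     # stand-in for the m x k concept matrix

# Z solves the least-squares problem  min_Z || A - C_k Z ||_F.
Z, *_ = np.linalg.lstsq(C_k, A, rcond=None)   # Z is k x n

D_k = C_k @ Z                                 # concept decomposition of A
print(np.linalg.norm(A - D_k, "fro"))         # approximation error

# Columns of Z are the k-dimensional representations of the documents.
```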

Comparison: Academic example
- Collection of 15 documents (titles of books)
  - 9 from the field of data mining
  - 5 from the field of linear algebra
  - 1 combining these fields (application of linear algebra to data mining)
- The list of terms was formed as follows (see the sketch after this list):
  1) words contained in at least two documents were kept;
  2) words on a stop list were removed;
  3) stemming was performed
- To the term-document matrix we apply
  - Truncated SVD (k = 2)
  - Concept decomposition (k = 2)
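A hedged sketch of that indexing step in plain Python: keep words that occur in at least two documents and are not on a stop list; the stop list here is illustrative only, and stemming is omitted (it would require a stemmer such as Porter's).

```python
import numpy as np
from collections import Counter

docs = ["Survey of text mining: clustering, classification, and retrieval",
        "Classification, clustering and data analysis",
        "Clustering of large data sets"]
stop_words = {"of", "and", "a", "the", "for", "its"}   # illustrative stop list

def tokenize(text: str):
    return [w.strip(".,:&") for w in text.lower().split() if w.strip(".,:&")]

# Keep words that appear in at least two documents and are not stop words.
doc_tokens = [set(tokenize(d)) for d in docs]
df = Counter(w for toks in doc_tokens for w in toks)
terms = sorted(w for w, c in df.items() if c >= 2 and w not in stop_words)

# Term-document matrix of raw term frequencies (stemming omitted in this sketch).
A = np.array([[tokenize(d).count(t) for d in docs] for t in terms], dtype=float)
print(terms)
print(A)
```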

Documents 1/2
- D1: Survey of text mining: clustering, classification, and retrieval
- D2: Automatic text processing: the transformation, analysis and retrieval of information by computer
- D3: Elementary linear algebra: A matrix approach
- D4: Matrix algebra & its applications to statistics and econometrics
- D5: Effective databases for text & document management
- D6: Matrices, vector spaces, and information retrieval
- D7: Matrix analysis and applied linear algebra
- D8: Topological vector spaces and algebras

Documents 2/2
- D9: Information retrieval: data structures & algorithms
- D10: Vector spaces and algebras for chemistry and physics
- D11: Classification, clustering and data analysis
- D12: Clustering of large data sets
- D13: Clustering algorithms
- D14: Document warehousing and text mining: techniques for improving business operations, marketing and sales
- D15: Data mining and knowledge discovery

Terms
- Data mining terms: text, mining, clustering, classification, retrieval, information, document, data
- Linear algebra terms: linear, algebra, matrix, vector, space
- Neutral terms: analysis, application, algorithm

Projection of terms by SVD

Projection of terms by CD

Queries
- Q1: "Data mining"
  - Relevant documents: all data mining documents
- Q2: "Using linear algebra for data mining"
  - Relevant document: D6

Projection of documents by SVD

Projection of documents by CD

Results of information retrieval (Q1)

Results of information retrieval (Q2)

Collections
- MEDLINE
  - 1033 documents
  - 30 queries
  - Relevance judgements
- CRANFIELD
  - 1400 documents
  - 225 queries
  - Relevance judgements

Test A
- Comparison of the errors of approximating the term-document matrix by
  1) the rank-k SVD
  2) the rank-k CD
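Test A can be reproduced in a few lines of numpy by comparing the Frobenius-norm errors of the two rank-k approximations. This is a minimal sketch on a random toy matrix, and the concept matrix here is just a random selection of document columns rather than actual clustering output.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((100, 40))                     # toy term-document matrix
A = A / np.linalg.norm(A, axis=0)
k = 5

# Rank-k SVD approximation.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Rank-k concept decomposition (C_k would normally come from clustering).
C_k = A[:, rng.choice(A.shape[1], size=k, replace=False)]
Z, *_ = np.linalg.lstsq(C_k, A, rcond=None)
A_cd = C_k @ Z

print("SVD error:", np.linalg.norm(A - A_svd, "fro"))
print("CD  error:", np.linalg.norm(A - A_cd, "fro"))   # never smaller than the SVD error
```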

MEDLINE - errors of approximation

CRANFIELD - errors of approximation

Test B
- Average inner product between the concept vectors c_j, j = 1, 2, …, k
- Comparison of the average inner product for
  - concept vectors obtained by the spherical k-means algorithm
  - concept vectors obtained by the fuzzy k-means algorithm
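Test B reduces to the mean of the pairwise inner products of the concept vectors. A small helper, assuming the concept vectors are the unit-norm columns of a concept matrix C; the orthonormal test matrix below is only an illustration.

```python
import numpy as np

def average_inner_product(C: np.ndarray) -> float:
    """Mean inner product over all distinct pairs of columns of C."""
    k = C.shape[1]
    G = C.T @ C                              # k x k Gram matrix of the concept vectors
    off_diagonal_sum = G.sum() - np.trace(G)
    return off_diagonal_sum / (k * (k - 1))  # number of ordered distinct pairs

# Values close to 0 mean the concept vectors are close to orthogonal.
C = np.linalg.qr(np.random.default_rng(5).random((20, 4)))[0]  # orthonormal columns
print(average_inner_product(C))              # ~0 for an orthonormal concept matrix
```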

MEDLINE – average inner product

CRANFIELD – average inner product

Test C
- Comparison of the mean average precision of information retrieval and of precision-recall plots
- Mean average precision for the term-matching method:
  - MEDLINE: 43.54
  - CRANFIELD: 20.89

MEDLINE – mean average precision

CRANFIELD – mean average precision

MEDLINE – precision-recall plot

CRANFIELD – precision-recall plot

Test D
- Correlation between mean average precision (MAP) and clustering quality
- Measure of cluster quality: the generalized within-groups sum of squared errors function J_fuzz = Σ_{i=1..k} Σ_{j=1..n} μ_ij^b ||a_j - c_i||²
- a_j, j = 1, 2, …, n are the document vectors; c_i, i = 1, 2, …, k are the concept vectors
- μ_ij is the fuzzy membership degree of document a_j in the group whose concept is c_i
- b ∈ (1, ∞) is the weight exponent
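A direct numpy translation of J_fuzz as reconstructed above; the membership degrees and vectors below are random stand-ins, whereas in the experiments they come from the fuzzy k-means run.

```python
import numpy as np

def j_fuzz(A: np.ndarray, C: np.ndarray, mu: np.ndarray, b: float = 2.0) -> float:
    """Generalized within-groups sum of squared errors.

    A  : m x n matrix of document vectors (columns a_j)
    C  : m x k matrix of concept vectors (columns c_i)
    mu : k x n matrix of fuzzy membership degrees mu_ij
    b  : weight exponent, b > 1
    """
    # Squared Euclidean distances between every concept vector and every document.
    d2 = ((A[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)   # k x n
    return float((mu ** b * d2).sum())

rng = np.random.default_rng(6)
A, C = rng.random((20, 10)), rng.random((20, 3))
mu = rng.random((3, 10))
mu = mu / mu.sum(axis=0)                   # memberships of each document sum to 1
print(j_fuzz(A, C, mu))
```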

MEDLINE - Correlation (clustering quality and MAP)
- 46 observations for ranks of approximation k ∈ [1, 100]
- Correlation between mean average precision and J_fuzz: r = -0.… (p << 0.01)
- Correlation between rank of approximation and mean average precision: r = 0.70247 (p << 0.01)
- Correlation between rank of approximation and J_fuzz: r = -0.… (p << 0.01)

CRANFIELD - Correlation (clustering quality and MAP)
- 46 observations for ranks of approximation k ∈ [1, 100]
- Correlation between mean average precision and J_fuzz: r = -0.… (p << 0.01)
- Correlation between rank of approximation and mean average precision: r = 0.… (p << 0.01)
- Correlation between rank of approximation and J_fuzz: r = -0.… (p << 0.01)

Regression line: clustering quality and MAP (MEDLINE)

Regression line: clustering quality and MAP (CRANFIELD)

Conclusion 1/3
- In the SVD approximation, the term-document matrix is projected onto the first k left singular vectors, which form an orthogonal basis for the LSI space
- In the CD approximation, the term-document matrix is projected onto the k group centroids (concept vectors)
- The concept vectors form the basis for the CI space; they tend towards orthogonality as k increases
- Concept vectors obtained by the fuzzy k-means algorithm tend towards orthogonality faster than those obtained by the spherical k-means algorithm
- CI using CD by the fuzzy k-means algorithm gives a higher MAP of information retrieval than LSI on both collections we have used

Conclusion 2/3
- CI using CD by the spherical k-means algorithm gives a lower (but comparable) MAP of information retrieval than LSI on both collections we have used
- According to the MAP results, k = 75 for the MEDLINE collection and k = 200 for the CRANFIELD collection are good choices for the rank of approximation
- With LSI and CI, documents are stored in smaller matrices:
  - For the MEDLINE collection the term-document matrix is a 5940 × 1033 matrix; the approximations of the documents are stored in a 75 × 1033 matrix
  - For the CRANFIELD collection the term-document matrix is a 4758 × 1400 matrix; the approximations of the documents are stored in a 200 × 1400 matrix

Conclusion 3/3
- LSI and CI work better on the MEDLINE collection
- When evaluated over different ranks of approximation, MAP is more stable for LSI than for CI
- There is a high correlation between MAP and clustering quality

Further work
1) To apply CI to the problem of classification in a supervised setting
2) To propose solutions to the problem of adding new documents to the collection for the CI method
- Adding new documents to the collection requires recomputation of the SVD or CD, which is computationally inefficient
- Two approximation methods have been developed for adding new documents to a collection for the LSI method (one of them is sketched below)
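One well-known approximation for adding documents in LSI is folding-in, which projects a new document onto the existing singular factors instead of recomputing the SVD; a minimal numpy sketch under that assumption, with a toy matrix and an illustrative new-document vector.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.random((50, 15))                           # existing term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Fold a new document d (an m-dimensional term vector) into the existing LSI space
# without recomputing the SVD:  d_hat = S_k^{-1} U_k^T d.
d = rng.random(A.shape[0])
d_hat = np.linalg.solve(S_k, U_k.T @ d)            # k-dimensional representation

# Append it to the reduced document representations (rows of V_k).
V_k_extended = np.vstack([V_k, d_hat])
print(V_k_extended.shape)                          # (n + 1, k)
```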