Published by Dustin Walker
1
Latent Semantic Indexing for the Routing Problem
Doctorate course “Web Information Retrieval”
PhD Student Irina Veredina
University of Trento, June 5, 2006
2
Index
1. The Problem
2. The Concept
3. Advantages and Drawbacks
4. LSI and VSM: comparison
5. LSI and Routing Problem
6. Conclusions
3
1. The Problem
The Vector Space Model (VSM), long a standard in IR, has a flaw: it ignores both the order of terms and the associations between them.
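The flaw can be made concrete with a minimal sketch, assuming raw term counts and cosine similarity as the VSM weighting and matching function (the slide does not specify either). The example sentences and function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to raw term counts; word order is discarded here."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Order is ignored: both sentences map to the same vector.
order_sim = cosine(bag_of_words("dog bites man"), bag_of_words("man bites dog"))

# Association is ignored: synonyms occupy unrelated coordinates.
synonym_sim = cosine(bag_of_words("car repair"),
                     bag_of_words("automobile maintenance"))
```

Here `order_sim` is 1 up to rounding although the two sentences mean different things, and `synonym_sim` is 0 although the two phrases are close in meaning.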
4
1. The Problem (cont.)
The document-by-term matrix is sufficient to represent the collection. But some of the information it contains could actually hinder document retrieval!
6
2. The Concept
The solution: a smaller, more tractable representation of terms and documents that retains only the most important information from the original matrix may improve both the quality and the speed of the retrieval system.
7
2. The Concept (cont.)
Latent Semantic Indexing (LSI) is a technique that projects queries and documents into a space with “latent” semantic dimensions.
8
2. The Concept (cont.)
LSI is a method for dimensionality reduction: a high-dimensional space is represented in a low-dimensional one (often two- or three-dimensional when the goal is visualization).
9
2. The Concept (cont.)
LSI is the application of a particular mathematical technique, Singular Value Decomposition (SVD), to a word-by-document matrix. SVD (and hence LSI) is a least-squares method.
10
2. The Concept (cont.)
How does SVD work? SVD takes the matrix A and represents it as A´ in a lower-dimensional space such that the “distance” between the two matrices is minimized: Δ = ||A − A´||₂
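The approximation step can be sketched as follows. This is an illustrative pure-Python version that finds the leading singular triples by power iteration with deflation; in practice a library routine such as `numpy.linalg.svd` would be used instead. The toy term-by-document matrix, the function names, and the use of the Frobenius norm as the distance are assumptions (the truncated SVD is optimal in both the Frobenius and the spectral norm).

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def top_singular_triple(M, iters=1000, seed=0):
    """Approximate the largest singular triple by power iteration on M^T M."""
    Mt = transpose(M)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in Mt]  # random start vector
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # M is numerically zero: nothing left to extract
            return 0.0, [0.0] * len(M), [0.0] * len(Mt)
        v = [x / norm for x in w]
    Mv = matvec(M, v)
    sigma = math.sqrt(sum(x * x for x in Mv))
    u = [x / sigma for x in Mv]
    return sigma, u, v


def rank_k_approximation(A, k):
    """Sum the k leading terms sigma * u * v^T, found one at a time by deflation."""
    residual = [[float(x) for x in row] for row in A]
    approx = [[0.0] * len(A[0]) for _ in A]
    sigmas = []
    for _ in range(k):
        sigma, u, v = top_singular_triple(residual)
        sigmas.append(sigma)
        for i in range(len(A)):
            for j in range(len(A[0])):
                term = sigma * u[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term
    return approx, sigmas


def frobenius_distance(A, B):
    return math.sqrt(sum((a - b) ** 2
                         for ra, rb in zip(A, B) for a, b in zip(ra, rb)))


# Hypothetical term-by-document matrix.
# Rows: car, automobile, engine, flower, petal. Columns: d1..d4.
A = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]

A1, _ = rank_k_approximation(A, 1)
A2, sigmas2 = rank_k_approximation(A, 2)
A4, _ = rank_k_approximation(A, 4)
errors = [frobenius_distance(A, X) for X in (A1, A2, A4)]
```

In the rank-2 matrix `A2`, document d1 ("car engine") receives a weight of about 0.5 for "automobile", a term it never contained: the association between the two synonyms is recovered from their shared context "engine". The distance Δ shrinks as more dimensions are kept and vanishes at full rank.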
12
3. Advantages and Drawbacks
Advantages of LSI:
Synonymy (the same underlying concept can be described using different terms)
Polysemy (a word can have more than one meaning)
Dependence (performance can be improved by adding common phrases as search items)
13
3. Advantages and Drawbacks (cont.)
Drawbacks of LSI:
Storage (the SVD representation has fewer dimensions, but its vectors are dense rather than sparse, so it can require more storage than the original sparse matrix)
Efficiency (with LSI the query must be compared to every document in the collection)
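The efficiency drawback can be illustrated with a small sketch. It assumes the usual inverted-index implementation of VSM retrieval, which the slide does not mention; the toy documents and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy collection.
docs = {
    "d1": "car engine repair",
    "d2": "automobile engine",
    "d3": "flower petal",
    "d4": "flower garden",
}


def build_inverted_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def vsm_candidates(index, query):
    """With sparse term vectors, only documents sharing a query term are scored."""
    candidates = set()
    for term in query.split():
        candidates |= index.get(term, set())
    return candidates


def lsi_candidates(docs):
    """Dense latent vectors rarely have zero coordinates, so every document is scored."""
    return set(docs)


index = build_inverted_index(docs)
vsm_set = vsm_candidates(index, "engine")
lsi_set = lsi_candidates(docs)
```

For the query "engine" the term-based model scores only d1 and d2, while the latent-space model has to score all four documents.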
15
4. LSI and VSM: comparison
Two collections of data: MED and CISI.
MED: LSI improves average precision from 0.45 to 0.51, with the largest benefits found at high recall.
CISI: no significant difference between LSI and VSM was found.
17
5. LSI and Routing Problem
The routing problem is just a special case of the classification problem, since there are only two groups of documents: relevant and nonrelevant.
18
5. LSI and Routing Problem (cont.)
To test the performance of LSI on the routing task, cross-validation is used.
19
5. LSI and Routing Problem (cont.)
Cross-validation: the strategy is to remove one document at a time from the collection and then use the remaining documents to try to predict the relevance of the missing document. Precision and recall are used to evaluate the performance.
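The leave-one-out procedure can be sketched as below. The nearest-centroid rule used to predict relevance and the two-dimensional document vectors are assumptions chosen for illustration; the slides do not say which prediction rule was used in the experiments.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def predict_relevant(train, doc):
    """Assumed rule: relevant if closer to the relevant centroid than to the other."""
    rel = centroid([v for v, label in train if label])
    non = centroid([v for v, label in train if not label])
    return cosine(doc, rel) > cosine(doc, non)


def leave_one_out(collection):
    """Hold out each document in turn, predict its relevance from the rest."""
    tp = fp = fn = 0
    for i, (doc, label) in enumerate(collection):
        train = collection[:i] + collection[i + 1:]
        predicted = predict_relevant(train, doc)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical document vectors (e.g. in a reduced space) with relevance labels.
collection = [
    ([1.0, 0.1], True),
    ([0.9, 0.2], True),
    ([0.8, 0.3], True),
    ([0.1, 1.0], False),
    ([0.2, 0.9], False),
    ([0.3, 0.8], False),
    ([0.9, 0.3], False),  # nonrelevant, but lies among the relevant documents
]

precision, recall = leave_one_out(collection)
```

On this toy data all three relevant documents are recovered (recall 1.0), while the last nonrelevant document is misclassified as relevant (precision 0.75).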
20
5. LSI and Routing Problem (cont.)
Results: LSI does not greatly improve performance over the vector space model for the routing problem, although the difference is measurable:
Evaluation method   VSM     LSI
Avg. precision      0.405   0.451
Avg. recall         0.758   0.811
21
5. LSI and Routing Problem (cont.)
To obtain a significant improvement in retrieval performance, LSI can be used in conjunction with statistical classification.
22
5. LSI and Routing Problem (cont.)
The general statistical classification problem: a population consists of two or more groups, and there exist a training sample for which the class of each element is known and a test sample for which the class is unknown. The goal is to produce a classification rule that will predict the class of the unknown elements.
23
5. LSI and Routing Problem (cont.)
Results: the performance is significantly improved:
Evaluation method   VSM     LSI     TDA
Avg. precision      0.405   0.451   0.604
Avg. recall         0.758   0.811   0.830
TDA: a method for text-based discriminant analysis.
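To put the reported figures in proportion, the relative gain of each method over the VSM baseline can be computed directly from the values on the slide. The helper name `relative_gain` is hypothetical.

```python
# Average precision and recall as reported on the slide.
scores = {
    "VSM": {"precision": 0.405, "recall": 0.758},
    "LSI": {"precision": 0.451, "recall": 0.811},
    "TDA": {"precision": 0.604, "recall": 0.830},
}


def relative_gain(method, measure, baseline="VSM"):
    """Improvement of `method` over `baseline`, as a fraction of the baseline."""
    base = scores[baseline][measure]
    return (scores[method][measure] - base) / base


for method in ("LSI", "TDA"):
    for measure in ("precision", "recall"):
        print(f"{method} {measure}: {relative_gain(method, measure):+.1%}")
```

LSI alone raises average precision by roughly 11% relative to VSM, whereas TDA raises it by roughly 49%; the recall gains are about 7% and 9.5%.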
25
6. Conclusions
LSI addresses the problem of term independence by re-expressing the term-document matrix in a new coordinate system that captures the most significant components of the term association structure.
26
Thank You!