Adding Semantics to Information Retrieval, by Kedar Bellare, 20th April 2003



Motivation
- Current IR techniques are term-based
- Semantics of document and query are not considered
- This causes problems like polysemy and synonymy
- Many advances in NLP and statistical modeling of semantics
- Is semantic IR really required?

Organization
- Traditional IR
- Statistics for semantics: Latent Semantic Indexing
- Semantic resources for semantics: use of Semantic Nets, Conceptual Graphs, WordNet etc. in IR
- Conclusion

Information Retrieval
An information retrieval system does not inform the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.

A Typical IR System

Current IR
- Preprocessing of documents
  - Inverted index
  - Removing stopwords and stemming
- Representation of documents
  - Vector Space Model: TF and IDF
  - Document clustering
- Improvements to the above
  - Better weighting of document vectors
  - Link analysis: PageRank and anchor text
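The TF and IDF weighting named on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: the three toy documents, the raw-count TF, and the `log(N / df)` form of IDF are all assumptions chosen for brevity.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts (assumed TF variant)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy collection: two synonymous documents and one unrelated one.
docs = [
    "the car is fast".split(),
    "the automobile is fast".split(),
    "the jaguar is an animal".split(),
]
vocab, vectors = tf_idf_vectors(docs)
sim_car_auto = cosine(vectors[0], vectors[1])
sim_car_jaguar = cosine(vectors[0], vectors[2])
```

Terms that occur in every document ("the", "is") get an IDF of zero, so they drop out without an explicit stopword list. The "car" and "automobile" documents are matched only through the shared word "fast", which is the synonymy problem the later slides address.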

Latent Semantic Indexing
- Problems with traditional approaches
  - Synonymy: "automobile" and "car"
  - Polysemy: "jaguar" means both a car and an animal
- LSI: linear algebra for capturing the “latent semantics” of documents
- A method of dimensionality reduction

LSI
- Compares document vectors in the latent semantic space
- Two documents can have a high similarity value even if they share no terms
- Attempts to remove minor differences in terminology during indexing
- Truncated SVD is used to construct the latent semantic space

Singular Value Decomposition
- Given a term-document matrix $A_{t \times d}$, SVD factors it into a product of three matrices $T_{t \times r}$, $S_{r \times r}$ and $D_{d \times r}$ such that $A = T S D^T$
- $T$ and $D$ have orthonormal columns ($T^T T = D^T D = I$), $S$ is diagonal, and $r$ is the rank of $A$
- The reduced space corresponds to the axes of greatest variation
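The factorization can be checked numerically. The sketch below uses NumPy's `numpy.linalg.svd` on a small made-up term-document matrix (the matrix and its term labels are assumptions for illustration) and trims the factors to the rank r so that the shapes match the slide.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows: terms, columns: documents).
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [1, 1, 1, 0],   # engine
    [0, 0, 0, 1],   # animal
    [0, 0, 0, 1],   # jungle
], dtype=float)

# Thin SVD: singular values come back as a 1-D array in descending order.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Keep only the r nonzero singular values: T is t x r, S is r x r, D is d x r.
r = int(np.linalg.matrix_rank(A))
T, S, D = T[:, :r], np.diag(s[:r]), Dt[:r, :].T

# A = T S D^T should hold up to floating-point error.
reconstruction_error = np.linalg.norm(A - T @ S @ D.T)
```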

What does LSI do?
- Uses truncated SVD
- Instead of the full r-dimensional space, keeps only the k largest singular values, with k < r
- $\bar{A}_{t \times d} = T_{t \times k} S_{k \times k} D_{d \times k}^T$
- Truncated SVD captures the underlying structure in the association of terms and documents
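A small experiment illustrates the truncation and the claim that documents sharing no terms can still end up similar. The matrix is a made-up example: documents 2 and 3 contain only "car" and only "automobile" respectively, and are linked only through other documents that pair each word with "engine". Representing documents by the rows of $D S$ is one common convention and is assumed here.

```python
import numpy as np

# Hypothetical term-document matrix (rows: terms, columns: documents 0..5).
A = np.array([
    [1, 0, 1, 0, 0, 0],  # car
    [0, 1, 0, 1, 0, 0],  # automobile
    [1, 1, 0, 0, 0, 0],  # engine
    [0, 0, 0, 0, 1, 1],  # animal
    [0, 0, 0, 0, 1, 0],  # jungle
], dtype=float)

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

k = 2  # number of latent dimensions kept (assumed value)
T, s, Dt = np.linalg.svd(A, full_matrices=False)
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

A_k = T_k @ S_k @ D_k.T   # rank-k approximation of A
doc_coords = D_k @ S_k    # one row per document in the latent space

raw_sim = cosine(A[:, 2], A[:, 3])                        # term space
latent_sim = cosine(doc_coords[2], doc_coords[3])         # latent space
cross_topic_sim = cosine(doc_coords[2], doc_coords[5])    # car vs. animal document
```

In term space the "car" and "automobile" documents are orthogonal. With k = 2 they collapse onto the same latent direction, while the "animal" document stays on a separate one.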

Using the SVD model
- Comparison of terms: entries of the matrix $T S^2 T^T$
- Comparison of documents: entries of the matrix $D S^2 D^T$
- Comparison of a term and a document: entries of the matrix $T S D^T$
- Query in the SVD model: $q' = q^T T S^{-1}$
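The query formula $q' = q^T T S^{-1}$ maps a term-space query to the same coordinates as a row of $D$. The sketch below folds a query in and ranks documents by cosine similarity against the rows of $D_k$. The toy matrix, the helper names `fold_in_query` and `rank_documents`, and the choice to compare against unscaled rows of $D_k$ are assumptions for illustration.

```python
import numpy as np

terms = ["car", "automobile", "engine", "animal", "jungle"]
# Hypothetical term-document matrix (rows follow `terms`, columns: documents 0..5).
A = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
], dtype=float)

k = 2
T, s, Dt = np.linalg.svd(A, full_matrices=False)
T_k, s_k, D_k = T[:, :k], s[:k], Dt[:k, :].T

def fold_in_query(query_terms):
    """Compute q' = q^T T_k S_k^{-1} for a set of query terms."""
    q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    return (q @ T_k) / s_k  # dividing by s_k applies the diagonal S_k^{-1}

def rank_documents(query_terms):
    """Cosine similarity between the folded-in query and every document."""
    q_hat = fold_in_query(query_terms)
    scores = []
    for doc in D_k:
        denom = np.linalg.norm(q_hat) * np.linalg.norm(doc)
        scores.append(float(q_hat @ doc / denom) if denom else 0.0)
    return scores

scores = rank_documents({"automobile"})
# Plain term matching for comparison: count of query terms in each document.
term_overlap = np.array([0.0, 1.0, 0.0, 0.0, 0.0]) @ A
```

Document 2 contains only "car", so plain term matching gives it zero overlap with the query "automobile", yet its latent-space score is close to 1. Document 5 ("animal") scores close to 0.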

Example of LSI

Why does LSI work?
- There is a lot of empirical evidence, but no concrete proof of why LSI works
- No major degradation: the theorem of Eckart and Young states that, among all matrices of rank k, the truncated SVD is the one closest to A (in the Frobenius norm)
- This still does not explain the improvements in recall and precision
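The Eckart and Young result also gives the approximation error in closed form: the Frobenius distance between A and its rank-k truncation equals the square root of the sum of the squared discarded singular values. A quick numerical check, using a random stand-in matrix (an assumption, not data from the talk) and one alternative rank-k matrix for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # stand-in term-document matrix
k = 2

T, s, Dt = np.linalg.svd(A, full_matrices=False)

def rank_k_from(indices):
    """Rebuild a matrix from the singular triples at the given indices."""
    idx = list(indices)
    return T[:, idx] @ np.diag(s[idx]) @ Dt[idx, :]

A_k = rank_k_from(range(k))                        # truncated SVD: k largest
A_other = rank_k_from(range(len(s) - k, len(s)))   # another rank-k matrix: k smallest

err_truncated = np.linalg.norm(A - A_k, "fro")
err_other = np.linalg.norm(A - A_other, "fro")
# Error predicted by the theorem: sqrt of the sum of discarded sigma_i^2.
err_predicted = np.sqrt(np.sum(s[k:] ** 2))
```

Comparing against a single alternative does not prove optimality. It only shows the predicted error formula holding and one competing rank-k matrix doing worse.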

Why does LSI work? (contd.)
- Papadimitriou et al.
  - Assume documents are generated from a set of topics with disjoint vocabularies
  - Prove that if the term-document matrix A is perturbed, LSI recovers the topic information and removes the noise
- Kontostathis and Pottenger
  - Essentially claim that LSI's ability to trace term co-occurrences is what helps improve recall

Advantages & Disadvantages
- Advantages
  - Synonymy
  - Term dependence
- Disadvantages
  - Storage
  - Efficiency

Semantic Resources
- Semantic Nets, e.g. "John gave Mary the book"
- Applied in UNL, e.g. "Only a few farmers could use information technology in early 1990s"

Semantic Resources (contd.)
- Conceptual Graphs, e.g. "A bird is singing in a Sycamore tree"
- Conceptual Dependency, e.g. "I gave the man a book"
- Lexical resources: WordNet

Applications of Semantic Resources in IR
- UNL: used in improving document vectors
- Conceptual Graphs: graph matching of query and document
- CDs: FERRET, comparison of CD patterns
- WordNet: query expansion using WordNet
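Query expansion with a lexical resource can be sketched as follows. The `SYNONYMS` table is a hypothetical hand-written stand-in for WordNet synsets, and the down-weighting of added terms (`synonym_weight`) is an assumed design choice, not something specified in the talk.

```python
# Hypothetical synonym table standing in for WordNet synsets.
SYNONYMS = {
    "car": ["automobile", "auto"],
    "automobile": ["car", "auto"],
}

def expand_query(terms, synonyms, synonym_weight=0.5):
    """Return {term: weight}: original terms at 1.0, added synonyms at a lower weight."""
    weights = {}
    for term in terms:
        weights[term] = max(weights.get(term, 0.0), 1.0)
        for syn in synonyms.get(term, []):
            # Never lower the weight of a term the user actually typed.
            weights[syn] = max(weights.get(syn, 0.0), synonym_weight)
    return weights

expanded = expand_query(["fast", "car"], SYNONYMS)
```

The expanded query now matches documents that say "automobile" instead of "car", at the cost of possibly pulling in unrelated senses for polysemous words.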

Conclusion
- Various things need to be considered before applying these methods to the Web
  - Storage
  - Efficiency
  - Knowledge content of the query
- Clearly, a semantic method is needed to eliminate the problems of synonymy and polysemy
- Currently, traditional models with minor hacks serve the purpose
- However, in conclusion: a statistical or conceptual model of document semantics, or a combination of both, is definitely required

References
[1] M. W. Berry, S. T. Dumais, and G. W. O’Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), pages 573–595.
[2] S. Chakrabarti. Mining the Web - Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, San Francisco.
[3] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by Latent Semantic Analysis. Journal of the Society for Information Science 41 (6), pages 391–407.
[4] A. Kontostathis and W. M. Pottenger. A mathematical view of Latent Semantic Indexing: Tracing Term Co-occurrences. Technical report, Lehigh University.
[5] R. Mandala, T. Takenobu, and T. Hozumi. The use of WordNet in Information Retrieval. In COLING/ACL Workshop on the Usage of WordNet in Natural Language Processing Systems, pages 31–37, 1998.

References (contd.)
[6] M. L. Mauldin. Retrieval performance in FERRET: a conceptual information retrieval system. In Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pages 347–355. ACM Press.
[7] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography 3 (4), pages 235–244.
[8] M. Montes-y-Gomez, A. Lopez, and A. F. Gelbukh. Information retrieval with Conceptual Graph matching. In Database and Expert Systems Applications, pages 312–321.
[9] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent Semantic Indexing: A probabilistic analysis. Pages 159–168.
[10] E. Rich and K. Knight. Artificial Intelligence. Tata McGraw-Hill Publishers, New Delhi, 2002.

References (contd.)
[11] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.
[12] C. Shah, B. Chowdhary, and P. Bhattacharyya. Constructing better Document Vectors using Universal Networking Language (UNL). In Proceedings of International Conference on Knowledge-Based Computer Systems (KBCS), NCST, Navi Mumbai, India.
[13] H. Uchida, M. Zhu, and S. T. Della. UNL: A gift for a millenium. Technical report, The United Nations University.
[14] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.