Alternative IR Models. Dr. Yeni Herdiyeni, M.Kom, STMIK ERESHA.



Technical Memo Example: Titles
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

Technical Memo Example: Terms and Documents
Each entry counts how often the term occurs in the document's title:

Terms        c1  c2  c3  c4  c5  m1  m2  m3  m4
human         1   0   0   1   0   0   0   0   0
interface     1   0   1   0   0   0   0   0   0
computer      1   1   0   0   0   0   0   0   0
user          0   1   1   0   1   0   0   0   0
system        0   1   1   2   0   0   0   0   0
response      0   1   0   0   1   0   0   0   0
time          0   1   0   0   1   0   0   0   0
EPS           0   0   1   1   0   0   0   0   0
survey        0   1   0   0   0   0   0   0   1
trees         0   0   0   0   0   1   1   1   0
graph         0   0   0   0   0   0   1   1   1
minors        0   0   0   0   0   0   0   1   1

Alternative IR models Probabilistic relevance

Probabilistic model
Asks the question "what is the probability that the user will see relevant information if they read this document?"
 P(rel|d_i): probability of relevance after reading d_i
 How likely is the user to get relevant information from reading this document?
 A high probability means the user is more likely to get relevant information.

Probabilistic model
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance... the overall effectiveness of the system to its user will be the best that is obtainable..."
Probability ranking principle:
 Rank documents by decreasing probability of relevance to the user
 Calculate P(rel|d_i) for each document and rank
Suggests a mathematical approach to IR and matching:
 Can predict (somewhat) good retrieval models based on their estimates of P(rel|d_i)
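The probability ranking principle reduces to a simple sort once P(rel|d_i) has been estimated. A minimal sketch, where the probability estimates are made-up illustrative values (any retrieval model supplies its own):

```python
# Probability ranking principle: given an estimate of P(rel | d) for
# each document, present documents in decreasing order of relevance.

def rank_by_relevance(doc_probs):
    """doc_probs: dict mapping document id -> estimated P(rel | d)."""
    return sorted(doc_probs, key=doc_probs.get, reverse=True)

# Hypothetical estimates for three documents.
estimates = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
ranking = rank_by_relevance(estimates)
print(ranking)  # d2 first: highest estimated probability of relevance
```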

Example Relevance Feedback

What these systems are doing is using examples of documents the user likes:
 To detect which words are useful  new query words  query expansion
 To detect how useful these words are  change weights of query words  term reweighting
 Use the new query for retrieval

Query Expansion
Add useful terms to the query:
 Terms that appear often in relevant documents
 Try to compensate for poor queries (usually short, can be ambiguous, use too many common words)  add better terms to the user's query
 Try to emphasize recall
Example: "Rada Mihalcea"  "Rada Mihalcea, LIT, UNT, NLP, Computer Science"
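A minimal sketch of frequency-based expansion: count terms in the user-marked relevant documents and append the most frequent ones to the query. The document texts and the top-2 cutoff are assumptions for illustration only.

```python
# Query expansion sketch: add the terms that occur most often in
# user-marked relevant documents to the original query.
from collections import Counter

def expand_query(query_terms, relevant_docs, top_n=2):
    """Append the top_n most frequent non-query terms from relevant docs."""
    counts = Counter()
    for doc in relevant_docs:
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(top_n)]

# Hypothetical relevant documents for the query "rada mihalcea".
relevant = ["nlp research at unt", "nlp lab unt computer science"]
expanded = expand_query(["rada", "mihalcea"], relevant)
print(expanded)  # the frequent terms "nlp" and "unt" are added
```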

Term Weighting
Reweight query terms:
 Assign new weights according to importance in relevant documents
 Personalized searching
 Try to improve precision
Example: "Rada Mihalcea, 1.0"  "Rada Mihalcea, 2.0; LIT, 1.5; UNT, 1.5; NLP, 1.5; Computer Science, 1.5; Download, 0.5"

Probabilistic Models
Most probabilistic models are based on combining the probabilities of relevance and non-relevance of individual terms:
 Probability that a term will appear in a relevant document
 Probability that the term will not appear in a non-relevant document
These probabilities are estimated by counting term appearances in document descriptions.

Example
Assume we have a collection of 100 documents  N = 100
 n = 20 of the documents contain the term sausage
 The searcher has read and marked 10 documents as relevant  R = 10
 Of these relevant documents, r = 5 contain the term sausage
How important is the word sausage to the searcher?

Probability of Relevance
From these four numbers (N, n, R, r) we can estimate the probability of sausage given relevance information, i.e. how important the term sausage is to relevant documents:

  r / (R - r)    (I)

where r is the number of relevant documents that contain sausage (5) and R - r is the number of relevant documents that do not contain sausage (5).
Eq. (I) is:
 higher if most relevant documents contain sausage
 lower if most relevant documents do not contain sausage
A high value means sausage is an important term to the user. In our example: 5/5 = 1.

Probability of Non-relevance
We can also estimate the probability of sausage given non-relevance information, i.e. how important the term sausage is to non-relevant documents:

  (n - r) / (N - n - R + r)    (II)

where n - r is the number of non-relevant documents that contain sausage (20 - 5 = 15) and N - n - R + r is the number of non-relevant documents that do not contain sausage (100 - 20 - 10 + 5 = 75).
Eq. (II) is:
 higher if more documents containing sausage are non-relevant
 lower if more documents that do not contain sausage are non-relevant
A low value means sausage is an important term to the user. In our example: 15/75 = 0.2.

F4 reweighting formula

  weight = [r / (R - r)] / [(n - r) / (N - n - R + r)]

The numerator measures how important sausage being present in relevant documents is; the denominator, how important sausage being absent from non-relevant documents is. From the example above, the weight of sausage is 1 / 0.2 = 5.
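The sausage example can be checked directly. This sketch implements the ratio form of the formula exactly as shown above (the published F4 formula additionally applies a log and smoothing constants, which the slides omit):

```python
# F4-style weight from the sausage example: ratio of the term's odds in
# relevant documents to its odds in non-relevant documents.

def f4_weight(r, R, n, N):
    """r: relevant docs containing the term, R: relevant docs,
    n: docs containing the term, N: collection size."""
    relevance_ratio = r / (R - r)                    # Eq. (I):  5/5  = 1
    nonrelevance_ratio = (n - r) / (N - n - R + r)   # Eq. (II): 15/75 = 0.2
    return relevance_ratio / nonrelevance_ratio

weight = f4_weight(r=5, R=10, n=20, N=100)
print(weight)  # 5.0, matching the worked example
```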

F4 reweighting formula
F4 gives new weights to all terms in the collection (or just the query):
 High weights to important terms
 Low weights to unimportant terms
 Replaces idf, tf, or any other weights
 A document's score is the sum of the weights of the query terms it contains

Probabilistic model
Can also be used to rank terms for addition to the query:
 Rank terms in relevant documents by the term reweighting formula, i.e. by how good the terms are at retrieving relevant documents
 Add all terms, or add some (e.g. the top 5)
 In the probabilistic formula, query expansion and term reweighting are done separately
Example: "Rada Mihalcea, 1.0"  "Rada Mihalcea, 2.0; LIT, 1.5; UNT, 1.5; NLP, 1.5; Computer Science, 1.5; Download, 0.5; Teaching, 0.25"

Probabilistic model
Advantages over vector-space:
 Strong theoretical basis
 Based on probability theory (very well understood)
 Easy to extend
Disadvantages:
 Models are often complicated
 No term frequency weighting
Which is better, vector-space or probabilistic?
 Both are approximately as good as each other
 Depends on the collection, the query, and other factors

Explicit/Latent Semantic Analysis
 BOW (bag of words): "American politics"  {Democrats, Republicans, abortion, taxes, homosexuality, guns, etc.}
 Explicit Semantic Analysis: "Car"  {Wikipedia:Car, Wikipedia:Automobile, Wikipedia:BMW, Wikipedia:Railway, etc.}
 Latent Semantic Analysis: "Car"  {car, truck, vehicle}, {tradeshows}, {engine}

Explicit/Latent Semantic Analysis
Objective:
 Replace indexes that use sets of index terms/docs with indexes that use concepts.
Approach:
 Map the term vector space into a lower-dimensional space using singular value decomposition.
 Each dimension in the new space corresponds to an explicit/latent concept in the original data.

Deficiencies with Conventional Automatic Indexing
 Synonymy: various words and phrases refer to the same concept (lowers recall).
 Polysemy: individual words have more than one meaning (lowers precision).
 Independence: no significance is given to two terms that frequently appear together.
Explicit/latent semantic indexing addresses the first of these (synonymy) and the third (dependence).

Technical Memo Example: Titles
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

Technical Memo Example: Terms and Documents
Each entry counts how often the term occurs in the document's title:

Terms        c1  c2  c3  c4  c5  m1  m2  m3  m4
human         1   0   0   1   0   0   0   0   0
interface     1   0   1   0   0   0   0   0   0
computer      1   1   0   0   0   0   0   0   0
user          0   1   1   0   1   0   0   0   0
system        0   1   1   2   0   0   0   0   0
response      0   1   0   0   1   0   0   0   0
time          0   1   0   0   1   0   0   0   0
EPS           0   0   1   1   0   0   0   0   0
survey        0   1   0   0   0   0   0   0   1
trees         0   0   0   0   0   1   1   1   0
graph         0   0   0   0   0   0   1   1   1
minors        0   0   0   0   0   0   0   1   1

Technical Memo Example: Query
Query: Find documents relevant to "human computer interaction"
Simple term matching:
 Matches c1, c2, and c4
 Misses c3 and c5

Mathematical concepts
Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents).
Singular Value Decomposition:
 For any matrix X with t rows and d columns, there exist matrices T0, S0, and D0' such that X = T0 S0 D0'
 T0 and D0 are the matrices of left and right singular vectors
 S0 is the diagonal matrix of singular values

Dimensions of matrices

  X = T0 S0 D0'
  (t x d) = (t x m)(m x m)(m x d)

where m is the rank of X, m <= min(t, d).

Reduced Rank
S0 can be chosen so that its diagonal elements are positive and decreasing in magnitude. Keep the first k and set the others to zero. Delete the zero rows and columns of S0 and the corresponding columns of T0 and D0. This gives:

  X^ = T S D'

Interpretation:
 If the value of k is selected well, the expectation is that X^ retains the semantic information of X but eliminates the noise caused by synonymy and recognizes term dependence.

Dimensionality Reduction

  X ≈ X^ = T S D'
  (t x d) ≈ (t x k)(k x k)(k x d)

where k is the number of latent concepts (typically 300 ~ 500).
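The truncation above can be sketched with NumPy on the technical-memo matrix, whose entries are reconstructed from the titles listed earlier. With k = 2, X^ = T S D' is the rank-2 approximation used in the classic LSI example:

```python
# Rank-k approximation of the term-document matrix via SVD: X ≈ X^ = T S D'.
import numpy as np

# Technical-memo term-document matrix (rows: human, interface, computer,
# user, system, response, time, EPS, survey, trees, graph, minors).
X = np.array([
    # c1 c2 c3 c4 c5 m1 m2 m3 m4
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

# Full decomposition X = T0 S0 D0' (singular values come out decreasing).
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

k = 2                                  # keep the two largest singular values
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]
X_hat = T @ S @ Dt                     # rank-2 approximation of X
```

For this small example k = 2 suffices; the 300-500 range quoted above applies to realistic collections.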

Recombination after Dimensionality Reduction

Mathematical Revision
 A is a p x q matrix; B is an r x q matrix
 a_i is the vector represented by row i of A; b_j is the vector represented by row j of B
 The inner product a_i . b_j is element (i, j) of AB'
 AB' is a p x r matrix

Comparing a Query and a Document
A query can be expressed as a vector x_q in the term-document vector space: x_qi = 1 if term i is in the query, and 0 otherwise. (Ignore query terms that are not in the term vector space.) Let p_qj be the inner product of the query x_q with document d_j in the term-document vector space.

Comparing a Query and a Document

  [p_q1 ... p_qj ... p_qd] = [x_q1 x_q2 ... x_qt] X^

This gives the inner product of query q with every document: document d_j is column j of X^. The cosine of the angle between query and document is the inner product divided by the lengths of the two vectors.
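The query-document comparison can be sketched as below. The 4-term, 3-document matrix is a made-up toy standing in for a reduced matrix X^, not the memo example:

```python
# Cosine comparison of a Boolean query vector with the document columns
# of a (possibly rank-reduced) term-document matrix X^.
import numpy as np

def query_doc_cosines(x_q, X_hat):
    """x_q: length-t query vector; X_hat: t x d matrix.
    Returns the cosine of the angle between the query and each document."""
    inner = x_q @ X_hat                                   # p_qj for each doc j
    norms = np.linalg.norm(x_q) * np.linalg.norm(X_hat, axis=0)
    return inner / norms                                  # normalize lengths

# Toy example: 4 terms, 3 documents (columns).
X_hat = np.array([[1., 0., 1.],
                  [1., 1., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.]])
x_q = np.array([1., 1., 0., 0.])       # query contains terms 1 and 2
sims = query_doc_cosines(x_q, X_hat)
print(sims)                            # document 1 matches the query exactly
```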