Information Retrieval and Web Search

Information Retrieval and Web Search Vasile Rus, PhD vrus@memphis.edu www.cs.memphis.edu/~vrus/teaching/ir-websearch/

Outline Advanced IR Models Probabilistic Model Latent Semantic Analysis

Probabilistic Model An initial set of documents is retrieved somehow. The user inspects these documents looking for the relevant ones (in practice, only the top 10-20 need to be inspected). The IR system uses this relevance information to refine its description of the ideal answer set. By repeating this process, the description of the ideal answer set is expected to improve. Keep in mind that this description must be guessed at the very beginning. The description of the ideal answer set is modeled in probabilistic terms.

Probabilistic Ranking Principle Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and document representations. The ideal answer set is referred to as R and should maximize the probability of relevance; documents in the set R are predicted to be relevant. But how do we compute these probabilities? What is the sample space?

The Ranking Probabilistic ranking is computed as: sim(q, dj) = P(dj relevant to q) / P(dj non-relevant to q). How to read this? "Maximize the number of relevant documents, minimize the number of irrelevant documents." This is the odds of the document dj being relevant; taking the odds minimizes the probability of an erroneous judgement. Definitions: weights wij ∈ {0,1}; P(R | vec(dj)) is the probability that the document is relevant; P(¬R | vec(dj)) is the probability that the document is not relevant. Bayes' Rule: P(A|B) P(B) = P(B|A) P(A).
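
The transcript does not reproduce the slide's ranking formula, so as an illustration only, here is a minimal Python sketch of odds-based scoring with binary term weights in the spirit of the binary independence model; bim_score, p_rel, and p_nonrel are hypothetical names, not from the slides.

import math

def bim_score(doc_terms, query_terms, p_rel, p_nonrel):
    # Odds-style score: sum of log-odds contributions of query terms present in the document.
    # p_rel[t]    ~ P(term t present | relevant)      (estimated or guessed)
    # p_nonrel[t] ~ P(term t present | non-relevant)  (estimated or guessed)
    score = 0.0
    for t in query_terms & doc_terms:      # binary weights: a term is either in the document or not
        score += math.log((p_rel[t] * (1 - p_nonrel[t])) /
                          (p_nonrel[t] * (1 - p_rel[t])))
    return score

# Before any relevance feedback, a common initial guess is P(t | R) = 0.5 and
# P(t | not R) = df_t / N; the feedback loop described above re-estimates both
# from the documents the user marks as relevant.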

Probabilistic Model PROS: ranking is based on the probability of being relevant. CONS: need to guess the initial relevant set; binary weights instead of term weighting; independence assumption of index terms.

Latent Semantic Analysis (LSA) Objective: replace indexes that use sets of index terms with indexes that use concepts. Approach: map the term vector space into a lower-dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.

LSA The relationship between concepts and words is many-to-many. Solve the problems of synonymy and ambiguity by representing documents as vectors of ideas or concepts, not terms. For retrieval, analyze queries the same way, and compute the cosine similarity of the vectors of ideas.

Deficiencies with Conventional Automatic Indexing Synonymy: various words and phrases refer to the same concept (lowers recall). Polysemy: individual words have more than one meaning (lowers precision). Independence: no significance is given to two terms that frequently appear together. Latent semantic indexing addresses the first of these (synonymy) and the third (dependence).

LSA Find the latent semantic space that underlies the documents: find the basic (coarse-grained) ideas, regardless of the words used to express them. A kind of co-occurrence analysis, with co-occurring words acting as "bridges" between words that do not co-occur. The latent semantic space has many fewer dimensions than the term space. The space depends on the documents from which it is derived. Its components have no names and cannot be interpreted.

Technical Memo Example: Titles
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

Technical Memo Example: Terms and Documents

Terms       c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
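
For readers who want to follow along, here is the same term-document matrix transcribed as a NumPy array (rows are terms, columns are the documents c1-c5, m1-m4); this is only a convenience for the later sketches, not part of the original slides.

import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-document matrix X (t x d): X[i, j] = count of terms[i] in docs[j]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minors
], dtype=float)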

Technical Memo Example: Query Find documents relevant to "human computer interaction". Simple term matching: matches c1, c2, and c4; misses c3 and c5.

Mathematical concepts Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents). Singular Value Decomposition: for any matrix X with t rows and d columns, there exist matrices T0, S0 and D0 such that X = T0 S0 D0'. T0 and D0 are the matrices of left and right singular vectors, and S0 is the diagonal matrix of singular values.
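
Continuing the NumPy transcription above, a minimal sketch of the full SVD; np.linalg.svd returns the singular values as a vector, so they are expanded into the diagonal matrix S0.

# Full SVD of the t x d matrix X defined above: X = T0 S0 D0'
T0, s, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s)                        # diagonal matrix of singular values
assert np.allclose(X, T0 @ S0 @ D0t)   # reconstruction check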

More Linear Algebra A non-negative real number σ is a singular value for X if and only if there exist unit-length vectors u in K^t and v in K^d such that Xv = σu and X'u = σv. The u are the left singular vectors and the v are the right singular vectors. K is a field, such as the field of real numbers.

Eigenvectors vs. Singular vectors Eigenvector: Mv = λv, where v is an eigenvector and λ is a scalar (real number) called the eigenvalue. MV = VD, where D is the diagonal matrix of eigenvalues and V is the matrix whose columns are the eigenvectors. M = VDV^-1 if V is invertible (which is guaranteed, for example, when all eigenvalues are distinct).
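
A quick numerical check of this identity, continuing from the matrix X above and using the symmetric matrix XX' as M (an illustrative choice, not from the slides).

# Eigendecomposition sketch: M V = V D, i.e. M = V D V^{-1}
M = X @ X.T                          # a symmetric t x t matrix built from X above
eigvals, V = np.linalg.eigh(M)       # eigh handles symmetric matrices; V has orthonormal columns
D = np.diag(eigvals)
assert np.allclose(M, V @ D @ V.T)   # here V^{-1} = V' because V is orthonormal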

Eigenvectors vs. Singular vectors M = VDV' (V' = V^-1 if the eigenvectors are orthonormal, as they are for a symmetric M). XX' = (TSD')(TSD')' = TSD'DS'T' = TSS'T', so XX' has eigenvector matrix V = T and eigenvalue matrix D = SS'.

Linear Algebra X = T0 S0 D0'. T0 and D0 are column-orthonormal: their columns are orthogonal unit vectors that can form a basis for a space. They are unitary, which means T0' and D0' are also orthonormal.

More Linear Algebra Unitary matrices have the following properties: UU' = U'U = I_n. If U has all real entries it is orthogonal. Orthogonal matrices preserve the inner product of two real vectors: <Ux, Uy> = <x, y>. U is an isometry, i.e., it preserves distances.
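
A small numerical check of the inner-product preservation property, reusing the column-orthonormal matrix T0 from the SVD sketch above (the same property holds for its columns even though T0 is not square).

# Column-orthonormal T0 preserves inner products: <T0 x, T0 y> = <x, y>
rng = np.random.default_rng(0)
x = rng.standard_normal(T0.shape[1])
y = rng.standard_normal(T0.shape[1])
assert np.isclose((T0 @ x) @ (T0 @ y), x @ y)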

LSA Properties The projection into the latent concept space preserves topological properties of the original space: close vectors will stay close. The reduced latent concept space is the best approximation of the original space in terms of distance preservation, compared to other choices of the same dimensionality. Both terms and documents are mapped into a new space where they can be compared with each other.

Dimensions of matrices X (t x d) = T0 (t x m) S0 (m x m) D0' (m x d), where m is the rank of X, m <= min(t, d).

Reduced Rank S0 can be chosen so that the diagonal elements are positive and decreasing in magnitude. Keep the first k and set the others to zero. Delete the zero rows and columns of S0 and the corresponding rows and columns of T0 and D0. This gives X ≈ X̂ = TSD'. Interpretation: if the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence.
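
Continuing the NumPy sketch, the rank-k truncation; k = 2 is the value used in the original Deerwester et al. example, while real collections typically use a few hundred dimensions.

k = 2
T = T0[:, :k]              # t x k
S = np.diag(s[:k])         # k x k
Dt = D0t[:k, :]            # k x d
X_hat = T @ S @ Dt         # X-hat, the rank-k approximation of X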

Dimensionality Reduction X̂ (t x d) = T (t x k) S (k x k) D' (k x d), where k is the number of latent concepts (typically 300-500). X ≈ X̂ = TSD'.

Animation of SVD (animation from Wikipedia, shown in the slideshow version): M is just an m×m square matrix with positive determinant whose entries are plain real numbers.

Projected Terms XX' = (TSD')(TSD')' = TSD'DS'T' = TSS'T' = (TS)(TS)'. The rows of TS therefore give the coordinates of the terms in the latent concept space.
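
Continuing the sketch, the projected coordinates (rows of TS for terms and, analogously, rows of DS for documents).

term_coords = T @ S        # t x k: row i is term i in the latent concept space
doc_coords = Dt.T @ S      # d x k: row j is document j in the latent concept space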

Mathematical Revision A is a p x q matrix and B is an r x q matrix. Let a_i be the vector represented by row i of A and b_j the vector represented by row j of B. The inner product a_i . b_j is element (i, j) of AB'.

Comparing a Query and a Document A query can be expressed as a vector xq in the term-document vector space, with xqi = 1 if term i is in the query and 0 otherwise. (Ignore query terms that are not in the term vector space.) Let pqj be the inner product of the query xq with document dj in the term-document vector space; pqj is the jth element of the product xq'X̂.

Comparing a Query and a Document pq' = [pq1 ... pqj ... pqd] = xq'X̂ = xq'TSD' = (xq'T)(DS)', where xq' = [xq1 xq2 ... xqt] is the query vector and document dj is column j of X̂; pqj is the inner product of query q with document dj. similarity(q, dj) = cosine of the angle = pqj / (|xq| |dj|), i.e., the inner product divided by the lengths of the vectors.
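
Putting the pieces together, a sketch of this query comparison on the running example; it follows the slide's inner-product formulation, computing xq'X̂ and then cosine similarities.

# The slide's "human computer interaction" query; "interaction" is not in the
# vocabulary, so it is ignored.
query = {"human", "computer", "interaction"}
xq = np.array([1.0 if t in query else 0.0 for t in terms])

pq = xq @ X_hat                              # pq[j] = inner product of the query with document dj
doc_norms = np.linalg.norm(X_hat, axis=0)    # |dj| for each column of X_hat
sims = pq / (np.linalg.norm(xq) * doc_norms) # cosine similarities
for d, s_ in sorted(zip(docs, sims), key=lambda kv: -kv[1]):
    print(d, round(float(s_), 3))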

LSA Summary Strong formal framework. Completely automatic; no stemming required; allows misspellings. "Conceptual IR" recall improvement: one can retrieve relevant documents that do not contain any of the search terms. Often improving precision is more important: this needs query and word sense disambiguation. Computation is expensive.

Summary Probabilistic Model LSA

Next: Project Presentations, Review, Exam, Winter Break!