
1 Web Search & Information Retrieval

2 Boolean queries: Examples
- Simple queries involving relationships between terms and documents
  - Documents containing the word Java
  - Documents containing the word Java but not the word coffee
- Proximity queries
  - Documents containing the phrase Java beans or the term API
  - Documents where Java and island occur in the same sentence
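Queries like these reduce to set operations over per-term posting lists. A minimal sketch in Python; the three-term index below is hypothetical:

```python
# Hypothetical posting lists: term -> set of document ids containing it.
postings = {
    "java":   {1, 2, 4},
    "coffee": {2, 5},
    "api":    {3, 4},
}

def docs(term):
    return postings.get(term, set())

# "java AND NOT coffee": documents with java but not coffee.
java_not_coffee = docs("java") - docs("coffee")

# "java OR api" (ignoring the phrase aspect here): set union.
java_or_api = docs("java") | docs("api")
```

Phrase and same-sentence queries additionally need the positional (offset) information discussed under document preprocessing.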

3 Document preprocessing
- Tokenization
  - Filtering away tags
  - Tokens regarded as nonempty sequences of characters, excluding spaces and punctuation
  - Each token represented by a suitable integer, tid, typically 32 bits
  - Optional: stemming/conflation of words
- Result: document (did) transformed into a sequence of (tid, pos) pairs
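The pipeline above can be sketched as follows; the tag-stripping regex and the in-memory lexicon are illustrative simplifications:

```python
import re

def tokenize(doc_text):
    """Filter away tags, lowercase, and split on spaces/punctuation."""
    text = re.sub(r"<[^>]+>", " ", doc_text)
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

lexicon = {}  # token -> tid (a 32-bit integer in a real system)

def to_tid_pos(tokens):
    """Represent a token stream as a sequence of (tid, pos) pairs."""
    out = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))
        out.append((tid, pos))
    return out

pairs = to_tid_pos(tokenize("<p>Java beans and Java APIs</p>"))
```

Note that both occurrences of "java" map to the same tid but different positions.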

4 Storing tokens
- Straightforward implementation: a relational table over (tid, did, pos)
- Space blows up to almost 10 times the original corpus
- Accesses to the table show a common pattern, so storage can be reduced by mapping each tid to a lexicographically sorted buffer of (did, pos) tuples
- Indexing = transposing the document-term matrix

5 Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash table.
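In memory, the offset-storing variant can be sketched as a dictionary from each term to a sorted list of (did, pos) entries; a real system keeps this on disk behind a B-tree or hash table, but a dict shows the shape:

```python
from collections import defaultdict

index = defaultdict(list)  # term -> [(did, pos), ...]

def add_document(did, tokens):
    # Adding documents in increasing did order keeps each list sorted.
    for pos, term in enumerate(tokens):
        index[term].append((did, pos))

add_document(1, ["java", "island", "coffee"])
add_document(2, ["java", "beans"])
```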

6 Stopwords
- Function words and connectives
  - Appear in a large number of documents and are of little use in pinpointing documents
- Indexing stopwords
  - Stopwords not indexed, to reduce index space and improve performance
  - Or replaced with a placeholder (to remember the offset)
- Issues
  - Queries containing only stopwords are ruled out
  - Polysemous words that are stopwords in one sense but not in others
    - E.g., can as a verb vs. can as a noun
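The placeholder variant can be sketched as below; the stopword list is a tiny illustrative one. Replacing rather than deleting keeps the offsets of the surviving terms intact for proximity queries:

```python
STOPWORDS = {"the", "a", "of", "and", "to"}  # tiny illustrative list

def filter_stopwords(tokens, placeholder=None):
    return [placeholder if t in STOPWORDS else t for t in tokens]

out = filter_stopwords(["history", "of", "the", "island", "of", "java"])
# "island" is still at offset 3 and "java" at offset 5.
```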

7 Stemming
- Conflating words to help match a query term with a morphological variant in the corpus
  - Removes inflections that convey part of speech, tense, and number
  - E.g., university and universal both stem to universe
- Techniques
  - Morphological analysis (e.g., Porter's algorithm)
  - Dictionary lookup (e.g., WordNet)
- Stemming may increase recall, but at the price of precision
  - Abbreviations, polysemy, and names coined in the technical and commercial sectors are problematic
  - E.g., stemming “ides” to “IDE”, “SOCKS” to “sock”, or “gated” to “gate” may be bad!
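A toy suffix-stripper makes both the idea and the overstemming risk concrete; it is far cruder than Porter's algorithm, and the suffix list is an assumption chosen purely for illustration:

```python
SUFFIXES = ["ing", "ies", "ed", "es", "s"]  # toy list, longest first

def stem(word):
    """Strip the first matching suffix, keeping a stem of >= 3 letters."""
    w = word.lower()
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[: len(w) - len(suf)]
    return w

# stem("matches") -> "match", but stem("socks") -> "sock" and
# stem("gated") -> "gat": exactly the overstemming the slide warns about.
```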

8 Maintaining indices over dynamic collections.

9 Relevance ranking
- Keyword queries
  - In natural language
  - Not precise, unlike SQL
  - A Boolean yes/no decision for each response is unacceptable
- Solution
  - Rate each document for how likely it is to satisfy the user's information need
  - Sort in decreasing order of the score
  - Present results in a ranked list
- No algorithmic way of ensuring that the ranking strategy always favors the information need
  - The query is only a part of the user's information need

10 Responding to queries
- Set-valued response
  - The response set may be very large
  - (E.g., by recent estimates, over 12 million Web pages contain the word java.)
- Demanding a more selective query from the user
- Guessing the user's information need and ranking responses
- Evaluating rankings

11 Evaluation procedure
- Given benchmark
  - A corpus of n documents D
  - A set of queries Q
  - For each query q, an exhaustive set D_q of relevant documents, identified manually
- Query submitted to the system
  - Ranked list of documents (d_1, d_2, ..., d_n) retrieved
  - Compute a 0/1 relevance list (r_1, r_2, ..., r_n):
    r_i = 1 iff d_i ∈ D_q, and r_i = 0 otherwise.

12 Recall and precision
- Recall at rank k: fraction of all relevant documents included in the top k responses:
  recall(k) = (1 / |D_q|) * Σ_{i ≤ k} r_i
- Precision at rank k: fraction of the top k responses that are actually relevant:
  precision(k) = (1 / k) * Σ_{i ≤ k} r_i
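Both measures are short functions of the 0/1 relevance vector; the vector below is hypothetical:

```python
def precision_at(relevance, k):
    return sum(relevance[:k]) / k

def recall_at(relevance, num_relevant, k):
    return sum(relevance[:k]) / num_relevant

r = [1, 0, 1, 1, 0]          # hypothetical 0/1 relevance list
p3 = precision_at(r, 3)      # 2 of the top 3 are relevant
rec3 = recall_at(r, 4, 3)    # 2 of the 4 relevant documents found
```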

13 Other measures
- Average precision
  - Sum of precision at each relevant hit position in the response list, divided by the total number of relevant documents:
    avg.precision = (1 / |D_q|) * Σ_{k : r_k = 1} precision(k)
  - avg.precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
- Interpolated precision
  - To combine precision values from multiple queries
  - Gives the precision-vs.-recall curve for the benchmark
  - For each query, take the maximum precision obtained at any recall greater than or equal to ρ
  - Average these together over all queries
- Others, like measures of authority, prestige, etc.
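Both measures follow directly from the definitions above; a sketch over a hypothetical relevance vector:

```python
def precision_at(relevance, k):
    return sum(relevance[:k]) / k

def average_precision(relevance, num_relevant):
    score = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:                                   # precision at each relevant hit
            score += precision_at(relevance, k)
    return score / num_relevant

def interpolated_precision(relevance, num_relevant, rho):
    """Maximum precision at any rank whose recall is >= rho."""
    best, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if hits / num_relevant >= rho:
            best = max(best, precision_at(relevance, k))
    return best

r = [1, 0, 1, 1, 0]
ap = average_precision(r, num_relevant=3)    # (1/1 + 2/3 + 3/4) / 3
ip = interpolated_precision(r, 3, rho=1.0)   # best precision at full recall
```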

14 Precision-recall tradeoff
- Interpolated precision cannot increase with recall
- Interpolated precision at recall level 0 may be less than 1
- At rank k = 0, precision (by convention) = 1 and recall = 0
- Inspecting more documents
  - Can increase recall
  - Precision may decrease, since we start encountering more and more irrelevant documents
- A search engine with a good ranking function will generally show a negative relation between recall and precision
- The higher the curve, the better the engine

15 Precision and interpolated precision plotted against recall for the given relevance vector. Missing values are zeroes.

16 The vector space model
- Documents represented as vectors in a multi-dimensional Euclidean space
  - Each axis corresponds to a term (token)
- Coordinate of document d in the direction of term t determined by:
  - Term frequency TF(d,t): number of times term t occurs in document d, scaled in a variety of ways to normalize document length
  - Inverse document frequency IDF(t): scales down the coordinates of terms that occur in many documents

17 Term frequency
- Simplest form: TF(d,t) = n(d,t), the raw count of term t in document d, possibly normalized by document length
- The Cornell SMART system uses a smoothed version:
  TF(d,t) = 0 if n(d,t) = 0
  TF(d,t) = 1 + log(1 + log(n(d,t))) otherwise

18 Inverse document frequency
- Given: D is the document collection and D_t is the set of documents containing t
- Formulae: mostly dampened functions of |D| / |D_t|
- SMART uses:
  IDF(t) = log( (1 + |D|) / |D_t| )

19 Vector space model
- Coordinate of document d on axis t:
  d_t = TF(d,t) * IDF(t)
- Document d is thus transformed to a vector in the TFIDF-space
- Query q
  - Interpreted as a document
  - Transformed to a vector in the same TFIDF-space as d
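TF, IDF, and the document coordinates can be combined as below; the three-document corpus and the particular dampened forms are illustrative choices, not the only ones in use:

```python
import math
from collections import Counter

docs = {1: ["java", "island", "coffee"],
        2: ["java", "beans", "api"],
        3: ["coffee", "beans"]}

def idf(term):
    df = sum(term in d for d in docs.values())   # |D_t|
    return math.log((1 + len(docs)) / df)

def tfidf(did):
    counts = Counter(docs[did])
    # dampened TF times IDF gives the coordinate on each axis
    return {t: (1 + math.log(n)) * idf(t) for t, n in counts.items()}

vec = tfidf(1)
# "island" occurs in only one document, so it gets a larger
# coordinate than "java", which occurs in two.
```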

20 Measures of proximity
- Distance measure
  - Magnitude of the vector difference, |d - q|
  - Document vectors must be normalized to unit (L1 or L2) length
  - Else shorter documents dominate (since queries are short)
- Cosine similarity
  - Cosine of the angle between d and q: (d · q) / (|d| |q|)
  - Shorter documents are penalized
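Cosine similarity over sparse term-weight vectors (dicts) takes a few lines; the weights below are hypothetical:

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

d = {"java": 0.8, "island": 0.6}   # hypothetical TFIDF weights
q = {"java": 1.0}
sim = cosine(d, q)
```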

21 Relevance feedback
- Users learn how to modify queries
  - The response list must have at least some relevant documents
- Relevance feedback
  - 'Correcting' the ranks to the user's taste
  - Automates the query refinement process
- Rocchio's method: folding user feedback into the query vector
  - Add a weighted sum of the vectors for the relevant documents D+
  - Subtract a weighted sum of those for the irrelevant documents D-:
    q' = α q + β Σ_{d ∈ D+} d - γ Σ_{d ∈ D-} d
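Rocchio's update can be sketched over sparse vectors; the values of α, β, γ and the tiny feedback sets below are hypothetical, and negative weights are clipped to zero as is common:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*mean(D+) - gamma*mean(D-), clipped at 0."""
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)
    q2 = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if irrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in irrelevant) / len(irrelevant)
        q2[t] = max(w, 0.0)
    return q2

q2 = rocchio({"java": 1.0},
             relevant=[{"java": 0.5, "island": 0.5}],
             irrelevant=[{"coffee": 1.0}])
# "island" enters the query via D+; "coffee" is pushed to zero via D-.
```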

22 Relevance feedback (contd.)
- Pseudo-relevance feedback
  - D+ and D- generated automatically
  - E.g., in the Cornell SMART system, the top 10 documents reported by the first round of query execution are included in D+
  - γ typically set to 0; D- not used
- Not a commonly available feature
  - Web users want instant gratification
  - System complexity: executing the second-round query is slower and expensive for major search engines

23 Ranking by odds ratio
- R: Boolean random variable which represents the relevance of document d w.r.t. query q
- Rank documents by their odds ratio for relevance:
  Pr(R = 1 | q, d) / Pr(R = 0 | q, d)
- Approximate the probability of d by the product of the probabilities of the individual terms in d (assuming term independence)
- Approximately, the score then reduces to a sum of per-term log-odds weights over the terms shared by q and d
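Under the independence approximation, each term contributes an additive log-odds weight. A sketch; the per-term probabilities below (of the term appearing in relevant vs. irrelevant documents) are hypothetical:

```python
import math

# t -> (p_t, q_t): probability of t in relevant vs. irrelevant documents.
weights = {"java": (0.8, 0.3), "island": (0.6, 0.1)}

def log_odds_score(doc_terms):
    score = 0.0
    for t in doc_terms:
        if t in weights:
            p, q = weights[t]
            # per-term contribution to the log odds ratio
            score += math.log(p * (1 - q) / (q * (1 - p)))
    return score
```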

24 Meta-search systems
- Take the search engine to the document
  - Forward queries to many geographically distributed repositories, each with its own search service
  - Consolidate their responses
- Advantages
  - Perform non-trivial query rewriting, to suit a single user query to many search engines with different query syntax
  - Surprisingly small overlap between crawls
- Consolidating responses
  - Goes beyond just eliminating duplicates
  - Search services do not provide standard ranks which can be combined meaningfully

25 Similarity search (from Mining the Web, Chakrabarti and Ramakrishnan)
- Cluster hypothesis: documents similar to relevant documents are also likely to be relevant
- Handling “find similar” queries
  - Replication or duplication of pages
  - Mirroring of sites

26 Document similarity
- Jaccard coefficient of similarity between documents d1 and d2, where T(d) is the set of tokens in document d:
  r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
- Symmetric, reflexive, but not a metric
- Forgives any number of occurrences and any permutations of the terms
- 1 - r'(d1, d2) is a metric
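The Jaccard coefficient is a one-liner over token sets:

```python
def jaccard(tokens1, tokens2):
    t1, t2 = set(tokens1), set(tokens2)
    union = t1 | t2
    # counts and orderings are ignored: only the token sets matter
    return len(t1 & t2) / len(union) if union else 1.0

sim = jaccard(["java", "island", "coffee"], ["java", "coffee", "beans"])
dist = 1.0 - sim   # this complement is a metric
```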

