T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.


2 IR Basic Concepts
In the classic models:
– Each document is described (summarized) by a set of representative keywords called index terms.
– Index terms are mainly nouns, but could be all the distinct terms in a document.
– Distinct index terms have varying relevance.
– Index term (numerical) weights are usually assumed to be mutually independent.

3 Common Weights for Keywords
Binary: 1 if the keyword is present in the document, 0 otherwise.
Term Frequency (TF): the number of occurrences of the keyword in the document.
Inverse Document Frequency (IDF): based on the inverse of the number of documents in the whole collection that contain the keyword, so rarer keywords weigh more.
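The binary and TF weights, and the document count that IDF is built from, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the three-document collection and the function names are invented for the example, and tokenization is a plain whitespace split.

```python
# Toy collection, invented for illustration.
docs = {
    "d1": "information retrieval is about finding information",
    "d2": "search engines retrieve web pages",
    "d3": "information about search engines",
}
tokens = {name: text.split() for name, text in docs.items()}

def binary_weight(term, doc):
    """1 if the term is present in the document, 0 otherwise."""
    return 1 if term in tokens[doc] else 0

def term_frequency(term, doc):
    """Number of occurrences of the term in the document."""
    return tokens[doc].count(term)

def document_frequency(term):
    """Number of documents in the collection that contain the term."""
    return sum(1 for words in tokens.values() if term in words)
```

Here `term_frequency("information", "d1")` is 2 while `binary_weight("information", "d1")` is 1, and `document_frequency("information")` is 2 because the word occurs in d1 and d3. An IDF weight grows as that last count shrinks.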

4 Boolean Model
Simple retrieval model
– Based on set theory and Boolean algebra.
Queries are specified as Boolean expressions.
Advantages:
– Precise semantics, neat formalism, inherent simplicity.
Disadvantages:
– It is difficult to translate an information need into a Boolean expression.
– Binary decision criterion: a document is either relevant or not, with no grading scale.
– It is a data (not information) retrieval model.
– Exact matching may lead to retrieval of too few or too many documents.
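Because the Boolean model rests on set theory, a query can be evaluated with plain set operations over an inverted index. The sketch below assumes a tiny invented collection and whitespace tokenization; the names `index`, `containing`, and the result variables are illustrative only.

```python
# Toy collection, invented for illustration.
docs = {
    1: "prime minister visits parliament",
    2: "prime numbers and number theory",
    3: "the minister answers questions in parliament",
}

# Inverted index: term -> set of ids of the documents that contain it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

def containing(term):
    """Ids of documents containing the term (empty set if none)."""
    return index.get(term, set())

both = containing("prime") & containing("minister")     # prime AND minister
either = containing("prime") | containing("minister")   # prime OR minister
not_prime = all_ids - containing("prime")               # NOT prime
```

Each result is an unordered set: a document is either in it or not, which is the binary decision criterion listed among the disadvantages.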

5 Statistical Queries
Purpose:
– Increase flexibility by setting the number of documents retrieved.
– Reduce the complexity of query formulation.

6 Statistical Queries Overall Scheme
Query:
– Word list
– Word combinations (like “prime minister”)
– How many times does a word appear in a document?
Giving a matching score to each document:
– A relevance score for each document
What happens to the measures when documents with lower scores are also taken?

7 Additional Query Parameters
Location of the word in the document:
– Title
– First paragraph
– Body
Distance between words (proximity search)
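Both parameters can be sketched briefly. The field weights below are hypothetical (the slides give no values), and the function names are invented for this illustration; distance is measured in word positions.

```python
# Hypothetical field weights: a hit in the title counts more than one in the body.
FIELD_WEIGHT = {"title": 3.0, "first_paragraph": 2.0, "body": 1.0}

def location_score(term, fields):
    """Sum the field weight over every occurrence of term in each field."""
    return sum(
        FIELD_WEIGHT[name] * text.lower().split().count(term)
        for name, text in fields.items()
    )

def min_distance(words, a, b):
    """Smallest gap in word positions between a and b, or None if either is absent."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

def within(words, a, b, k):
    """True if a and b occur at most k word positions apart."""
    d = min_distance(words, a, b)
    return d is not None and d <= k

page = {
    "title": "Prime Minister",
    "first_paragraph": "the minister spoke",
    "body": "minister minister",
}
words = "the prime minister met the finance minister".split()
```

With these inputs, `location_score("minister", page)` is 7.0 (3 for the title hit, 2 for the first paragraph, 1 for each of the two body hits), and `within(words, "prime", "minister", 1)` is true because the two words are adjacent.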

8 Matching Score Factors
Frequency: number of appearances of a query keyword in the document.
Count: number of query keywords present in the document.
Importance: weight of each word in the query.
The vector space model is usually used.

9 Vector Space Model
Documents and queries are converted into vectors.
Vector features are the index terms in the document or query, after stemming and removing stop-words.
Index terms are assumed to be mutually independent.
Vectors carry non-binary weights to emphasize the important index terms.
The query vector is compared to each document vector to compute the degree of similarity.
The documents closest to the query are considered similar and are returned.
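The pipeline described here (convert to vectors, compare, return the closest) can be sketched as follows. Two assumptions are made that the slide does not state: similarity is measured by the cosine of the angle between vectors, which is one common choice, and stemming is skipped for brevity. The stop-word list, documents, and function names are invented.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}  # tiny illustrative list

def to_vector(text):
    """Sparse term -> count vector, with stop-words removed (no stemming here)."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Names of documents with non-zero similarity, most similar first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), name) for name, text in docs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

docs = {
    "a": "information retrieval and the web",
    "b": "cooking with garlic",
    "c": "retrieval of information is information retrieval",
}
ranking = rank("information retrieval", docs)
```

Document "c" ranks ahead of "a" because its vector points in the same direction as the query, while "a" also has a "web" component; "b" shares no term with the query and is not returned.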

10 Vector Space Implementation
V(word, weight)
– In the document: weight = number of appearances of the word in the document
– In the query: weight = as defined by the user

11 Query/Documents Matching Score
Symbols
– t = term
– d = document
– q = query
– w = weight
Equations
– w(t,d) = weight of the term in the document (how many times the word appears in the document)
– w(t,q) = weight of the term in the query
\[ \mathrm{Score}(d,q) = \sum_{t} w(t,q) \cdot w(t,d) \]
Here the product of the two weights is scalar multiplication, and the sum runs over the terms t.
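The score formula translates directly into code. In this sketch w(t,d) is taken as the raw count of the word in the document and w(t,q) as a user-assigned number, following the preceding slide; the sample text and the query weights are hypothetical.

```python
from collections import Counter

def doc_weights(text):
    """w(t, d): number of appearances of each word in the document."""
    return Counter(text.lower().split())

def score(d, q):
    """Score(d, q) = sum over query terms t of w(t, q) * w(t, d)."""
    return sum(w_tq * d.get(t, 0) for t, w_tq in q.items())

d = doc_weights("web search engines search the web")
q = {"search": 10, "web": 5, "crawler": 1}  # hypothetical user-assigned weights
result = score(d, q)  # 10*2 + 5*2 + 1*0 = 30
```

Terms missing from the document contribute zero, so only terms shared by the query and the document affect the score.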

12 Example of Computing Scores
Document (d):
“Information retrieval abstract. Meant to show how results are evaluated for all kinds of queries. The two measures are recall and precision, and they change if the evaluation method changes. Information retrieval is important! It is used a lot for search engines that store and retrieve a lot of information, to help us search the World Wide Web.”
[Figure labels: Document (d); Document Related Part w(t,d)]

13 Example of Computing Scores
[Figure labels: Query Related Part; operators * and =]
Score = 300 + 300 + 20 = 620
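The per-term weights behind this total are not preserved in the transcript. The sketch below uses one set of weights that gives the same total, and every number in it is an assumption rather than something recovered from the slide: three query terms weighted 100, 100, and 10, matched against the example document by word prefix as a crude stand-in for stemming.

```python
import re

# Text of the example document from the previous slide.
document = (
    "Information retrieval abstract. Meant to show how results are evaluated "
    "for all kinds of queries. The two measures are recall and precision, and "
    "they change if the evaluation method changes. Information retrieval is "
    "important! It is used a lot for search engines that store and retrieve a "
    "lot of information, to help us search the World Wide Web."
)

def stem_count(prefix, text):
    """Count words starting with prefix: a crude stand-in for stemming."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for w in words if w.startswith(prefix))

# ASSUMED query weights, chosen for illustration only.
query_weights = {"informat": 100, "retriev": 100, "search": 10}

doc_weights = {p: stem_count(p, document) for p in query_weights}
products = {p: query_weights[p] * doc_weights[p] for p in query_weights}
total = sum(products.values())
```

Under these assumptions the document weights are 3, 3, and 2, the products are 300, 300, and 20, and `total` is 620.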

14 Problem with Scalar Multiplication
Problem:
– Longer documents have more words, so raw counts favor them.
Normalization is needed.
Solutions:
– Use normalized word frequency.
– Consider the overall number of words in the document.
– Set the significance of each word (called IDF).
Effective measure of similarity: TF * IDF
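The length problem and one of the listed remedies can be shown on a small case. This sketch assumes the simplest normalization, dividing the count by the document length in words; the sample text and function names are invented.

```python
def raw_tf(term, words):
    """Raw count of the term in the document."""
    return words.count(term)

def normalized_tf(term, words):
    """Count of the term divided by the document length in words."""
    return words.count(term) / len(words) if words else 0.0

short_doc = "web search engines index the web".split()
long_doc = short_doc * 3  # same content repeated, three times as long

raw_short, raw_long = raw_tf("web", short_doc), raw_tf("web", long_doc)
norm_short = normalized_tf("web", short_doc)
norm_long = normalized_tf("web", long_doc)
```

The raw count triples (2 versus 6) even though the longer document says nothing new, while the normalized frequency is one third in both cases.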

15 Inverse Document Frequency (IDF)
n_i = number of documents in which the term appears
N = number of documents in the repository
max_n = maximal frequency of a word in the repository
Example of two variations:
\[ \mathrm{IDF} = \log\left(\frac{N}{n_i}\right) \]
\[ \mathrm{IDF} = \log\left(\frac{\mathit{max}_n}{n_i}\right) + 1 \]
The effect of the frequency of the word in the whole repository: [figure not preserved]
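Tabulating the two variations shows how the weight falls as a word becomes more widespread. The sketch makes several assumptions: the logarithm is natural (the slide does not give a base), max_n is read as the largest document count of any word, and the repository size and counts are hypothetical.

```python
import math

def idf_log_ratio(N, n_i):
    """First variation: log(N / n_i)."""
    return math.log(N / n_i)

def idf_max_ratio(max_n, n_i):
    """Second variation: log(max_n / n_i) + 1."""
    return math.log(max_n / n_i) + 1

N = 1000      # hypothetical repository size
max_n = 1000  # hypothetical: the most widespread word appears in every document

# One row per document count: (n_i, first variation, second variation).
table = [
    (n_i, idf_log_ratio(N, n_i), idf_max_ratio(max_n, n_i))
    for n_i in (1, 10, 100, 1000)
]
```

Both columns decrease as n_i grows. A word that appears in every document gets weight 0 under the first variation and weight 1 under the second, so the second never discards a term entirely.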

16 Vector Model Advantages
The term-weighting scheme improves retrieval performance.
The partial matching strategy allows retrieval of documents that approximate the query conditions.
Documents are sorted (ranked) according to their degree of similarity to the query.
It is simple and fast, and turns out to be superior to many other IR models, so it is very popular.
