Ranking

Boolean vs. Non-Boolean Queries

Until now, we assumed that satisfaction is a Boolean function of a query:
– it is easy to determine whether a document satisfies a query
– all satisfying documents should be returned
– there is no significance to the order of results
– similar to query processing in a database

This model is often inappropriate. We now consider a non-Boolean setting: query results are ranked and returned in ranking order.

Relevant and Irrelevant

– The user has a task, which he formulates as a query.
– A given document may contain all the query words and yet not be relevant.
– A given document may not contain all the query words and yet be relevant.
– Relevance is subjective and can only be determined by the user.

Evaluating the Quality of Search Results

Goals:
– return all relevant documents
– return no non-relevant documents
– return relevant documents "earlier"

Suppose that a search engine runs a query Q and returns the result set R. Then each document falls into one of four categories:
– relevant and retrieved
– relevant and not retrieved
– not relevant and retrieved
– not relevant and not retrieved

Quality of Search Results

There are many measures. We focus on three:
– Precision: the percentage of retrieved documents that are relevant
– Recall: the percentage of relevant documents that are retrieved
– Precision at k: the percentage of relevant results within the top k

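All three measures are simple set computations. Here is a minimal Python sketch (an illustration, not part of the slides); the document ids are hypothetical, and the numbers match the Questions slide below:

```python
# Sketch: precision, recall, and precision at k over a ranked result list
# and a set of relevance judgments (hypothetical document ids).
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Precision restricted to the top k ranked results."""
    return precision(ranked[:k], relevant)

relevant = set(range(50))                            # 50 relevant documents
retrieved = list(range(20)) + list(range(100, 110))  # 30 results, 20 relevant
print(precision(retrieved, relevant))           # 20/30 ≈ 0.67
print(recall(retrieved, relevant))              # 20/50 = 0.40
print(precision_at_k(retrieved, relevant, 10))  # top 10 all relevant: 1.0
```
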
Questions

Suppose that:
– there are 1000 documents
– 50 documents are relevant to the query
– 30 query results are returned, including 20 relevant documents

What is the precision? The recall? How can perfect precision be achieved? How can perfect recall be achieved? Using these scores, how can search-engine quality be automatically assessed?

What is Ranking?

– Ranking is the problem of returning query answers in the "best" order.
– Note that "best" is subjective.
– Ranking is NOT based on money!
– To rank well, we want to know which documents best satisfy a query. How can we determine this?

Types of Ranking

Query-dependent versus query-independent ranking:
– advantages/disadvantages of each

Ranking can be based on:
– plain text
– HTML markup
– link analysis

Query-Dependent Ranking: Ranking of Plain Text

TF-IDF and the Vector Space Model

Goal

Given a query Q and a document P, we want to find a rank r that indicates how relevant P is for Q:
– answers will be returned in decreasing order of r

To simplify, assume that queries are given as free text (without logical operators) and that a document may be relevant if it contains at least one word of the query.

Goal (cont.)

Intuition:
– give each term t in a document d a weight w_{t,d}
– give each term t in the query a weight w_t
– the score of a document for a query will be a function of the values of the query-term weights in the document

Two questions:
– How should we define the weight of a term in a document?
– How should we combine the term weights?

Weights of Terms in Documents

From now on, t is a term and d is a document. The goal is to define w_{t,d}, the weight of t in d. All our weighting schemes will have in common:
– w_{t,d} = 0 if t does not appear in d

Simplest definition of weight:
– w_{t,d} = 1 if t appears in d

Advantages? Disadvantages?

Term Frequency

Intuitively, a document that has many occurrences of a term t is more "about" t. This leads to the weight:
– w_{t,d} = f_{t,d}
– f_{t,d} is the number of times that t appears in d, also called the term frequency of t in d

Advantages? Disadvantages? Are 20 occurrences of t 20 times better than 1 occurrence of t?

Normalized Term Frequency

The term frequency can be normalized in many ways. One normalization:
– w_{t,d} = 1 + log2(f_{t,d}) if t appears in d
– w_{t,d} = 0 otherwise

Are all terms created equal?

Are All Words Equal?

The occurrence of a very rare term is usually more interesting than the occurrence of a common word. Example query: Winnie the Pooh. Suppose that:
– document 1 has 300 occurrences of "the"
– document 2 has 1 occurrence of "Pooh"

Which do you prefer?

Inverse Document Frequency

The document frequency of a term t is:
– f_t = the number of documents containing t

We define a term weight:
– w_t = log2(1 + N/f_t)
– N is the total number of documents
– again, log is used for normalization

w_t reflects the inverse document frequency of t.

Summary

We will use the following weighting scheme:
– w_{t,d} = 1 + log2(f_{t,d}) if t appears in d
– w_{t,d} = 0 otherwise
– w_t = log2(1 + N/f_t)

This is called TF-IDF ranking, since it takes into consideration both the term frequency and the inverse document frequency.

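For concreteness, a minimal Python sketch (an illustration, not part of the slides) of this weighting scheme:

```python
import math

def tf_weight(f_td):
    """w_{t,d} = 1 + log2(f_{t,d}) if t appears in d, else 0."""
    return 1 + math.log2(f_td) if f_td > 0 else 0.0

def idf_weight(N, f_t):
    """w_t = log2(1 + N/f_t), where N is the total number of documents."""
    return math.log2(1 + N / f_t)

print(tf_weight(2))         # 2.0: two occurrences count a bit more than one
print(idf_weight(210, 30))  # 3.0: a rare term gets a high weight
```
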
Other TF-IDF Variants

Many different options have been suggested for computing TF and IDF values:
– Example: Okapi BM25

All methods comply with two monotonicity constraints:
– a term that appears in many documents should not be more important than a term that appears in only a few
– a document with many occurrences of a term should not be less important than a document with only a few

Combining Weights

Suppose a query has several terms t_1, ..., t_k. We have defined weights for terms in the query and in the document. How can/should these be combined? Ideas?

The Vector Space Model

We model documents and queries as vectors with n dimensions, where n is the number of words in our lexicon. If t is the k-th term in the lexicon, then:
– the k-th coordinate of the vector of a document d is w_{t,d}
– the k-th coordinate of the vector of the query is w_t if t appears in the query, and 0 otherwise

Similarity Between Vectors

The similarity between two vectors is measured by the angle between them. If X and Y are vectors, then the angle θ between them satisfies:

cos θ = (X · Y) / (|X| · |Y|)

where X · Y is the inner product of X and Y, and |X| is the length of X (i.e., the square root of X · X).

Cosine Distance for Ranking

Since cos θ increases as θ decreases, a higher cosine value indicates greater similarity.
– We will compute the cosine between the query vector and each document vector.
– The greater the cosine, the higher the document will rank.

Ranking

Since |Q| appears in the score of every document, we can remove it without affecting the relative scores of the documents, i.e., use the following for ranking:

score(Q, D) = (Q · D) / |D|

Example

Suppose:
– N = 210
– f_apple = 30
– f_banana = 210
– f_cherry = 70

Doc 1: apple apple banana banana pear
Doc 2: banana cherry cherry cherry cherry

Which will be ranked higher for the query: apple banana cherry? (A worked computation follows below.)

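Here is a Python sketch (not part of the slides) that applies the weighting scheme and the ranking formula above to this example:

```python
import math
from collections import Counter

N = 210
f = {"apple": 30, "banana": 210, "cherry": 70}  # document frequencies
docs = {"Doc 1": "apple apple banana banana pear".split(),
        "Doc 2": "banana cherry cherry cherry cherry".split()}
query = "apple banana cherry".split()

def w_td(t, d):                     # document weight: 1 + log2(f_{t,d})
    tf = Counter(d)[t]
    return 1 + math.log2(tf) if tf else 0.0

def w_t(t):                         # query weight: log2(1 + N/f_t)
    return math.log2(1 + N / f[t])

for name, d in docs.items():
    dot = sum(w_t(t) * w_td(t, d) for t in query)
    length = math.sqrt(sum(w_td(t, d) ** 2 for t in set(d)))
    print(name, round(dot / length, 3))
# Doc 1: 8/3 ≈ 2.667, Doc 2: 7/sqrt(10) ≈ 2.214, so Doc 1 ranks higher
```
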
Markup-Based Ranking

Give a higher score to words appearing in "important" HTML elements, e.g.:
– title
– h1, h2
– bold or italics
– links

How can this be implemented using the Vector Space Model?

Using Anchor Text

Consider a page P1 that points to another page P2; the link from P1 to P2 has anchor text. Most search engines use this text to understand the content of page P2:
– often pages do not contain words describing their content
– IBM does not have "computer" on its homepage
– Google does not have "search engine" on its homepage

Using Anchor Text (2)

Using the current model, how can anchor text help with ranking? What problems can arise?

Google bombs!
– Try Googling for כשלון ("failure") or כשלון חרוץ ("utter failure")

Query-Dependent Ranking: Language Models

Language Models: Overview

A statistical language model assigns a probability to a sequence of m words by means of a probability distribution.
– A language model is associated with each document in a collection.
– Given a query Q as input, retrieved documents are ranked by the probability that the document's language model would generate the terms of the query.

Language Models: Overview (2)

We usually use unigram language models:
– the likelihood of a word is independent of the previous words
– a "bag of words" model
– given by a multinomial distribution over words
– smoothing is needed to deal with words not appearing in a document!

Sometimes bigram or trigram models are used instead.

What Is a Language Model?

We can view a finite-state automaton as a deterministic language model.
– Can generate: "I wish I wish I wish I wish ..."
– Cannot generate: "wish I wish" or "I wish I"

Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.

A Probabilistic Language Model

This is a one-state probabilistic finite-state automaton, called a unigram language model. STOP is not a word, but a special symbol indicating that the automaton stops. (The automaton's emission probabilities appear in a figure omitted here; the ones used below are P(frog) = 0.01, P(said) = 0.03, P(that) = 0.04, P(toad) = 0.01, P(likes) = 0.02, P(STOP) = 0.2.)

Example: string = "frog said that toad likes frog STOP"
P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12

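A small Python sketch (not from the slides) of this computation; the per-word probabilities are the ones listed above, and a unigram model is just a dictionary:

```python
M_d = {"frog": 0.01, "said": 0.03, "that": 0.04,
       "toad": 0.01, "likes": 0.02, "STOP": 0.2}

def string_prob(model, tokens):
    """Unigram model: multiply the per-token probabilities."""
    p = 1.0
    for tok in tokens:
        p *= model[tok]
    return p

s = "frog said that toad likes frog STOP".split()
print(string_prob(M_d, s))   # 4.8e-12
```
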
A Different Language Model for Each Document

Example: string = "frog said that toad likes frog STOP"
P(string|M_d1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12
P(string|M_d2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 12 · 10^-12

P(string|M_d1) < P(string|M_d2): d2 is "more relevant" to this string.

Using Language Models in IR

Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) · P(d) / P(q)

– P(q) is the same for all documents, so we can ignore it.
– P(d) is the prior, often treated as the same for all d. (We could give a higher prior to "high-quality" documents, e.g., those with high PageRank; not done here.)
– P(q|d) is the probability of q given d.

So, to rank documents according to relevance to q, ranking by P(q|d) and by P(d|q) is equivalent.

What Next?

In the LM approach to IR, we attempt to model the query generation process: we rank documents by the probability that a query would be observed as a random sample from the respective document model. That is, we rank according to P(q|d).

Next: how do we compute P(q|d)?

How to Compute P(q|d)

We make a conditional independence assumption (|q| is the length of q; t_k is the token occurring at position k in q):

P(q|M_d) = ∏ over 1 ≤ k ≤ |q| of P(t_k|M_d)

This is equivalent to (tf_{t,q} is the term frequency, i.e., # of occurrences, of t in q):

P(q|M_d) = ∏ over distinct terms t in q of P(t|M_d)^{tf_{t,q}}

This is a multinomial model (omitting the constant factor).

Parameter Estimation

Missing piece: where do the parameters P(t|M_d) come from?

Start with maximum likelihood estimates (|d| is the length of d; tf_{t,d} is the # of occurrences of t in d):

P(t|M_d) = tf_{t,d} / |d|

What happens if a document does not contain one of the query words? What will P(q|d) be? Is this good?

Parameter Estimation (2)

A single term t with P(t|M_d) = 0 will make P(q|d) zero. This gives a single term "veto power" and yields conjunctive semantics for the query. For example, for the query "Michael Jackson top hits", a document about "top songs" (but not using the word "hits") would have P(q|d) = 0. Bad!

We need to smooth the estimates to avoid zeros.

Smoothing

Key intuition: a non-occurring term is possible (even though it didn't occur), but no more likely than would be expected by chance in the collection.

Notation: M_c is the collection model; cf_t is the number of occurrences of t in the collection; T is the total number of tokens in the collection. We will use

P(t|M_c) = cf_t / T

to "smooth" P(t|d) away from zero. Smoothing is also good for other reasons. Why do you think it is helpful?

Mixture Model

How do we combine P(t|M_d) and P(t|M_c)? One simple solution is a mixture model:

P(t|d) = λ·P(t|M_d) + (1 − λ)·P(t|M_c)

This mixes the probability from the document with the general collection frequency of the word. How does our choice of λ affect the results?
– High value of λ: "conjunctive-like" search; tends to retrieve documents containing all query words.
– Low value of λ: more disjunctive; suitable for long queries.

Correctly setting λ is very important for good performance.

Mixture Model: Summary

P(q|d) ∝ ∏ over t in q of [λ·P(t|M_d) + (1 − λ)·P(t|M_c)]

What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document the user had in mind was in fact this one.

Example

Collection: documents d1 and d2
– d1: "Jackson was one of the most talented entertainers of all time"
– d2: "Michael Jackson anointed himself King of Pop"

Query q: "Michael Jackson"

Use the mixture model with λ = 1/2. Calculate P(q|d1) and P(q|d2). Which ranks higher? (A worked computation follows below.)

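Below is a Python sketch (not part of the slides) of the mixture model applied to this example. Tokenization is naive whitespace splitting, and the collection model is estimated from the two documents together:

```python
from collections import Counter

d1 = "Jackson was one of the most talented entertainers of all time".split()
d2 = "Michael Jackson anointed himself King of Pop".split()
collection = d1 + d2
query = "Michael Jackson".split()

def p_query(q, d, lam=0.5):
    """P(q|d) = prod over t in q of [lam*P(t|M_d) + (1-lam)*P(t|M_c)]."""
    tf_d, tf_c = Counter(d), Counter(collection)
    p = 1.0
    for t in q:
        p *= lam * tf_d[t] / len(d) + (1 - lam) * tf_c[t] / len(collection)
    return p

print(p_query(query, d1))  # ≈ 0.0028
print(p_query(query, d2))  # ≈ 0.0126, so d2 ranks higher
```
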
LMs vs. the Vector Space Model

LMs have some things in common with vector space models.
– How are they the same? How are they different?
– How do term frequency and inverse document frequency come into play in language models?
– How would you define a bigram language model?

Query-Independent Ranking: Link Analysis

Intuition

Pages are given an a priori ranking of importance; this ranking is irrespective of any query. Pages satisfying the query are returned in ranking order.

Question: Is it possible for D1 to score higher than D2 on one query and lower on another?

A Naive Approach to Link-Based Ranking

We can represent the Web as a graph:
– pages are nodes
– there is an edge from P1 to P2 if P1 links to P2

Intuitively, a link from P1 to P2 is a vote of confidence by P1 for P2:
– Directed popularity: the grade of P is the number of incoming links to P.
– Undirected popularity: the grade of P is the sum of incoming and outgoing links of P.

What problems can you find with these ranking schemes? (A small sketch of both grades follows below.)

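For concreteness, a tiny Python sketch (the edge list is hypothetical, not from the slides) of the two grades:

```python
edges = [("P1", "P2"), ("P3", "P2"), ("P2", "P1")]  # (source, destination)

def directed_popularity(page):
    """Number of incoming links."""
    return sum(1 for _, dst in edges if dst == page)

def undirected_popularity(page):
    """Incoming plus outgoing links."""
    return sum(1 for src, dst in edges if page in (src, dst))

print(directed_popularity("P2"))    # 2
print(undirected_popularity("P2"))  # 3
```
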
Random Surfer

Imagine a surfer doing a random walk on web pages:
– start at a random page
– at each step, go out of the current page along one of its links, with equal probability

The long-run probability of being at a page can be used for ranking.

Transition Matrix

Let out_k be the out-degree of node k. Model the surfer's walk using a transition matrix M:
– m_ik = 1/out_k if k links to i
– m_ik = 0 otherwise

(Figure omitted: an example graph over three pages, Yahoo, Amazon, and Microsoft, with its transition matrix.)

Dead Ends and Spider Traps

The Web is full of dead ends and spider traps:
– a random walk can get stuck in these
– it then makes no sense to talk about long-term visit rates

(Figure omitted: an example graph over pages 1–6.) Which pages are dead ends? Which are spider traps?

Solution: Teleporting

At any moment in time:
– with probability d, go out on a random link
– with probability 1 − d, jump to a random web page (which may be the same page!)
– d is a parameter called the damping factor, often taken to be 0.85

Teleporting ensures that the transitions represent an ergodic Markov chain, so there is a long-term probability of being at any given state.

PageRank (PR)

The PageRank formula:

PR(P) = (1 − d)/N + d·(PR(P1)/O1 + ... + PR(Pn)/On)

– P1, ..., Pn are the pages with links to P
– O_i is the number of outgoing links of page P_i
– N is the total number of pages

NOTE: the formula has also been presented with 1 − d used instead of (1 − d)/N.

Matrix Formulation of PageRank

For the three-page example (Yahoo, Amazon, Microsoft) with rank vector r = (y, a, m), PageRank is the solution of

r = d·M·r + ((1 − d)/N)·1

where M is the transition matrix, 1 is the all-ones vector, and d is a chosen constant.

Computing PR

One could compute PageRank using standard methods from linear algebra, e.g., Gaussian elimination. But the Web has billions of pages; we can't solve equations with so many variables.

Power method: choose a starting value for each of the PRs (e.g., 1/N), then iteratively recompute all PR values based on the formulas. This process is guaranteed to converge. (A small sketch follows below.)

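A minimal Python sketch of the power method with teleporting (an illustration, not from the slides), on a hypothetical three-page graph with no dead ends:

```python
N, d = 3, 0.85
links = {1: [2, 3], 2: [3], 3: [1]}   # page -> pages it links to

pr = {p: 1 / N for p in links}        # start uniformly
for _ in range(50):                   # iterate until (practically) converged
    pr = {p: (1 - d) / N
             + d * sum(pr[q] / len(links[q]) for q in links if p in links[q])
          for p in links}
print(pr)   # page 3 gets the highest PR; the values sum to 1
```
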
Very Small Example

Damping factor: 0.8. (Figure omitted: a small example graph over the Amazon and Microsoft pages.)

Intuitive Examples

(Figures omitted: small example graphs over pages A–F.) Which page will have the highest PageRank? In which of the graphs will E have a higher PR?

Careful with Your References!

Consider the following URLs:
– http://www.cs.huji.ac.il/~webdata
– http://www.cs.huji.ac.il/~webdata/
– http://moodle.cs.huji.ac.il/cs09/course/view.php?name=webdata

All three URLs point to the same page. If a page is pointed to by different names, its PR may be spread among several "pages".

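One common safeguard (an assumption on our part, not something the slides prescribe) is to canonicalize URLs before building the link graph. Note that purely string-level normalization cannot detect that the moodle URL names the same page:

```python
from urllib.parse import urlsplit

def canonical(url):
    """Normalize case and trailing slashes so one page keeps one node."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{parts.netloc.lower()}{path}{query}"

print(canonical("http://www.cs.huji.ac.il/~webdata/"))
# http://www.cs.huji.ac.il/~webdata  (same node as the slash-less form)
```
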
Topic-Specific PageRank

Suppose that we are building a topic-specific search engine, e.g., one that should help find nature information. Consider the query jaguar:
– Which pages should get a higher preference?
– What will PageRank give?

Simple Solution

The set of pages the surfer may jump to when teleporting is called the teleport set:
– so far, the entire Web has been the teleport set

For Topic-Specific PageRank, choose as the teleport set a set of pages known to be related to the topic of interest. What effect will this have?

Spam

Wikipedia: "Spamdexing (also known as search spam or search engine spam) involves a number of methods, such as repeating unrelated phrases, to manipulate the relevancy or prominence of resources indexed by a search engine, in a manner inconsistent with the purpose of the indexing system."

How can one spamdex if TF-IDF is used? If PageRank is used?

The Google Dance

Nickname for the periods in which Google updates its index and its different servers are inconsistent.

The Florida Dance (Nov. 16, 2003): huge changes in rankings over a period of a few months.
– Most likely theories: Google incorporated one of Topic-Sensitive PageRank, TrustRank, or Hilltop.

Hilltop Ranking

Krishna Bharat and George Mihaila

Problems with PageRank

A website that is authoritative in general may contain a page discussing a topic on which it is not an authority:
– e.g., a page on Sun (Java's site) that discusses XML; Sun is an authority on Java, but not on XML

PageRank cannot differentiate between pages that are authoritative in general and pages that are authoritative on a particular topic. Another problem: PageRank is (relatively) easily tricked by spamming.

Hilltop Intuition

Given a query:
– Compute a set of expert pages on the topic: pages that were created to direct people towards relevant pages (i.e., pages with many links).
– Identify relevant links within the experts and follow them to find target pages.
– Rank the targets according to the number and relevance of the experts that point to them.

Note: an answer is returned only if there are enough experts!

Basic Idea: Finding Expert Pages (1)

An expert page should point to numerous unaffiliated pages on the subject. Two hosts are affiliated if either:
– they share the first three bytes of their IP address, or
– the rightmost non-generic token in their hostnames is the same

Example: are www.ibm.com and ibm.co.il affiliated? ("ibm" is the rightmost non-generic token in both.)

Affiliation is treated as transitive: if A is affiliated with B, and B with C, then A is with C.

Basic Idea: Finding Expert Pages (2)

From all the pages in the index, find pages that have at least k outgoing links pointing to unaffiliated hosts (e.g., k = 5).

(Figure omitted: an example link graph.) If k = 2, which pages are experts?

Indexing the Experts

For each expert page, store the terms within its key phrases:
– a key phrase is one that qualifies one or more URLs
– examples: the title (qualifies all URLs), a heading (qualifies the URLs until the next heading), anchor text (qualifies the associated URL)

Also store the list of URLs in each page, together with the identifiers of the key phrases that qualify them.

Key Phrases Example

(Figure omitted: an example page annotated with its key phrases and the URLs each one qualifies.)

Query Processing

Find the N (e.g., N = 200) expert pages that are most relevant for the query. How?
– Consider expert pages that contain at least one URL qualified by all terms in the query.
– The score of an expert page for a query q with k terms is:

ExpertScore = 2^32 · S_0 + 2^16 · S_1 + S_2

where S_i = Σ over key phrases p containing k − i query terms of LevelScore(p) · FullnessFactor(p, q)

Query Processing (cont.)

LevelScore(p) is determined by the type of phrase p is, e.g.:
– title gets 16
– headings get 6
– anchor text gets 1

Query Processing (cont.)

FullnessFactor(p, q) decreases with the number of terms in p that are not in q (i.e., it measures how fully q covers p):
– let len be the length of p
– let m be the number of terms in p that are not in q
– if m ≤ 2, then FullnessFactor(p, q) = 1
– if m > 2, then FullnessFactor(p, q) = 1 − (m − 2)/len

(A small sketch follows below.)

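A small Python sketch (not the Hilltop paper's actual code; the query and key phrase are hypothetical) of LevelScore and FullnessFactor:

```python
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}  # from the slide above

def fullness_factor(phrase_terms, query_terms):
    """1 if at most 2 phrase terms fall outside the query, else penalized."""
    m = sum(1 for t in phrase_terms if t not in query_terms)
    return 1.0 if m <= 2 else 1.0 - (m - 2) / len(phrase_terms)

q = {"jaguar", "cats"}
p = ["wild", "jaguar", "cats", "in", "south", "america"]  # a title phrase
print(fullness_factor(p, q))                         # m=4, len=6: 1 - 2/6 ≈ 0.67
print(LEVEL_SCORE["title"] * fullness_factor(p, q))  # the phrase's contribution
```
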
Computing the Target Score

So far, we have found the top N experts. Now we want to rank the pages that these experts point to:
– those are the pages actually returned, called targets
– only consider targets pointed to by at least 2 unaffiliated experts (that are also unaffiliated with the target)

Computing the Target Score (cont.)

For each expert E (among the N chosen), draw an edge to each target T that it points to. For each keyword w, let occ(w, T) be the number of key phrases in E that contain w and have T in their scope:
– if occ(w, T) = 0 for any w in the query, then the edge (E, T) has score 0
– otherwise, the score of the edge is ExpertScore(E) · Σ over query keywords w of occ(w, T)

Computing the Target Score (cont.)

If there are affiliations between expert pages that point to T, remove all but the highest-scoring edge. The score of a target is the sum of the scores of its incoming edges.

Think About It

Many claim that Google uses Hilltop as one of its ranking factors. How can Hilltop be computed efficiently?
