Web Information Retrieval (Web IR)
Handout #9: Connectivity Ranking
Ali Mohammad Zareh Bidoki, ECE Department, Yazd University, Autumn 2011
Outline
- PageRank
- HITS
- Personalized PageRank
- TrustRank
- HostRank
- DistanceRank
Ranking: Definition
Ranking is the process of estimating the quality of the set of results retrieved by a search engine. It is the most important part of a search engine.
Ranking Types
- Content-based (classical IR)
- Connectivity-based (web): query-independent or query-dependent
- User-behavior based
Web Information Retrieval
- Queries are short: 2.35 terms on average
- Huge variety in documents: language, quality, duplication
- Huge vocabulary: hundreds of millions of terms
- Deliberate misinformation and spamming: in content-based ranking, a page's rank is completely under the control of its author
Ranking in Web IR
Ranking is a function of the query terms and of the hyperlink structure (the web graph). Using the content and links of other pages to rank the current page puts the rank outside the control of the page's author, so spamming is hard.
Connectivity-based Ranking
- Query-independent: PageRank
- Query-dependent: HITS
Google's PageRank Algorithm
Idea: mine the structure of the web graph. Each web page is a node, and each hyperlink is a directed edge.
PageRank
- Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
- Quality of a page is related to its in-degree
- Recursion: quality of a page is related to its in-degree and to the quality of the pages linking to it
Definition of PageRank
Consider the following infinite random walk (surf): initially the surfer is at a random page; at each step, the surfer proceeds to a randomly chosen successor of the current page (each with probability 1/outdegree). The PageRank of a page p is the fraction of steps the surfer spends at p in the limit. This is the random surfer model.
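The random surfer model above can be simulated directly. A minimal sketch (the `links` dictionary shape and the restart-on-sink rule are illustrative assumptions, not part of the slide):

```python
import random

def simulate_surfer(links, steps=200_000, seed=0):
    """Estimate PageRank as the fraction of steps a random surfer
    spends at each page. `links` maps a page to its successor list."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)  # start at a random page
    for _ in range(steps):
        visits[page] += 1
        succ = links[page]
        # follow a random out-link; restart at a random page on a sink
        page = rng.choice(succ) if succ else rng.choice(pages)
    return {p: v / steps for p, v in visits.items()}
```

On a 3-cycle a → b → c → a, every page should receive roughly one third of the visits.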
PageRank (cont.)
PageRank is the stationary probability distribution of this Markov chain, i.e. r(p) = Σ over q→p of r(q)/O(q), where O(q) is the out-degree of page q.
PageRank (cont.)
Example: if pages A (out-degree 4) and B (out-degree 3) are the only pages linking to P, then r(P) = r(A)/4 + r(B)/3.
Damping Factor (d)
The web graph is not strongly connected, so convergence of PageRank is not guaranteed. Sinking pages (pages without out-links) and trapping pages absorb rank. With the damping factor d, the surfer proceeds to a randomly chosen successor of the current page with probability d, or jumps to a randomly chosen web page with probability (1 - d):
r(p) = (1 - d)/n + d · Σ over q→p of r(q)/O(q)
where n is the total number of nodes in the graph.
PageRank Vector (Linear Algebra)
- R is the rank vector (an eigenvector); r_i is the rank value of page i
- P is the transition matrix with p_ij = 1/O(i) if i points to j, else p_ij = 0
- Goal: find the eigenvector of P^T with eigenvalue one, i.e. R = P^T R
- Computed iteratively until convergence (the power method)
- With the damping factor: R = (1 - d)·E + d·P^T·R, where e_i = 1/n
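The power method with damping can be sketched as follows. This is a minimal illustration, assuming a small in-memory graph given as a successor dictionary; sink pages are treated as linking to every page:

```python
import numpy as np

def pagerank(links, d=0.85, tol=1e-12, max_iter=200):
    """Power-method PageRank with damping factor d.
    `links` maps each page to the list of its successors."""
    pages = sorted(links)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    # Column-stochastic matrix: P[j, i] = 1/O(i) when i points to j.
    P = np.zeros((n, n))
    for i, outs in links.items():
        if outs:
            for j in outs:
                P[idx[j], idx[i]] += 1.0 / len(outs)
        else:
            P[:, idx[i]] = 1.0 / n  # sink page: jump anywhere
    r = np.full(n, 1.0 / n)         # start from the uniform vector E
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (P @ r)
        if np.abs(r_new - r).sum() < tol:
            r = r_new
            break
        r = r_new
    return dict(zip(pages, r))
```

For the graph a → {b, c}, b → c, c → a, page c collects links from both a and b and ends up with the highest rank.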
PageRank Properties
Advantages:
- Finds popularity
- It is offline
Disadvantages:
- It is query-independent
- All pages compete together
- Unfairness
HITS (an Online, Query-dependent Algorithm)
Hypertext Induced Topic Search, by Kleinberg.
HITS (Hypertext Induced Topic Search)
The algorithm produces two scores per page:
- Authority: a page is highly authoritative if it receives many citations; citations from important pages weigh more than citations from less important pages
- Hub: hubness shows the importance of a page as a directory; a good hub is a page that links to many authoritative sites
For each vertex v ∈ V in the graph of interest, a(v) is the authority of v and h(v) is the hubness of v.
HITS Example
With pages 2, 3, 4 linking to page 1, and page 1 linking to pages 5, 6, 7:
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
Authority and Hubness Convergence
Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs.
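The mutually reinforcing update can be sketched as an iteration over an adjacency matrix; normalizing after each step keeps the scores bounded (the matrix input format and iteration count are illustrative choices):

```python
import numpy as np

def hits(adj, iters=100):
    """HITS on an adjacency matrix: adj[i][j] = 1 if page i links to j."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    a = np.ones(n)              # authority scores
    h = np.ones(n)              # hub scores
    for _ in range(iters):
        a = A.T @ h             # a(v) = sum of h(u) over u -> v
        h = A @ a               # h(v) = sum of a(w) over v -> w
        a /= np.linalg.norm(a)  # normalize so scores stay bounded
        h /= np.linalg.norm(h)
    return a, h
```

In a graph where page 0 links to pages 1 and 2, and page 1 links to page 2, page 2 becomes the top authority and page 0 the top hub.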
HITS Example: Finding the Base Subgraph
- Start with a root set R, e.g. {1, 2, 3, 4}: nodes relevant to the topic
- Expand R to include all the children and a fixed number (d) of parents of the nodes in R
- The result is a new set S, the base subgraph
- The production version of HITS is based on relations between sites
Topic-Sensitive PageRank (TSPR)
TSPR precomputes the importance scores offline, as with ordinary PageRank. However, it computes multiple importance scores for each page: a set of scores of the importance of a page with respect to various topics. At query time, these importance scores are combined, based on the topics of the query, into a composite PageRank score for the pages matching the query.
TSPR (cont.)
- Assume there are n topics on the web; the rank of page v is computed per topic t
- The difference from original PageRank is in the E vector: it is not uniform, and there are n different E vectors
- Consequently there are n ranking values for each page
- Problem: finding the topic of a page and of a query (we do not know the user's interest)
TSPR (cont.)
Let c_j denote category j. Given a query q, let q' be the context of q (here q' = q). P(q'|c_j) is computed from the class term vector D_j: the term counts of the documents below each of the 16 top-level categories, where D_jt gives the total number of occurrences of term t in documents listed below class c_j.
TSPR (cont.)
The prior P(c_j) is not as straightforward. It is taken to be uniform, although we could personalize the query results for different users by varying this distribution: for some user k, we can use a prior distribution P_k(c_j) that reflects the interests of user k. This provides an alternative framework for user-based personalization, rather than directly varying the damping vector E.
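The query-time combination described above can be sketched with a naive-Bayes posterior over topics and a weighted sum of the per-topic ranks. The smoothing scheme and function names are assumptions for illustration, not the paper's exact estimator:

```python
import math

def topic_posteriors(query_terms, class_term_counts, priors):
    """P(c_j | q') proportional to P(c_j) * product over t of P(t | c_j),
    estimated from class term counts D_jt with add-one smoothing."""
    vocab = set().union(*class_term_counts)
    log_posts = []
    for counts, prior in zip(class_term_counts, priors):
        total = sum(counts.values())
        lp = math.log(prior)
        for t in query_terms:
            lp += math.log((counts.get(t, 0) + 1) / (total + len(vocab)))
        log_posts.append(lp)
    m = max(log_posts)                       # log-sum-exp normalization
    exps = [math.exp(lp - m) for lp in log_posts]
    z = sum(exps)
    return [e / z for e in exps]

def composite_score(topic_scores, posteriors):
    """Composite TSPR score: per-topic ranks weighted by P(c_j | q')."""
    return sum(p * s for p, s in zip(posteriors, topic_scores))
```

A query whose terms are frequent in one class's documents pushes the posterior, and hence the composite score, toward that class's topic-specific PageRank.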
TrustRank
The web contains both good and bad (spam) pages. TrustRank is used to overcome spamming: it proposes techniques to semi-automatically separate reputable, good pages from spam. It first selects a small set of seed pages to be evaluated by an expert; once the reputable seed pages are manually identified, it uses the link structure of the web to discover other pages that are likely to be good.
TrustRank (cont.)
Idea: good pages link to other good pages, and bad pages link to other bad pages.
TrustRank (cont.)
It formalizes the notion of a human checking a page for spam with a binary oracle function O over all pages p: O(p) = 1 if p is good, O(p) = 0 if p is spam.
Trust Damping and Trust Splitting
Trust attenuates as it propagates further from the seed set (damping), and a page's trust is divided among its out-links (splitting).
Computing Trustiness of Each Page
Goal: trust propagation. E(i) is computed from the normalized oracle vector; for example, if O(5) = O(10) = O(15) = 1, then E(5) = E(10) = E(15) = 1/3.
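Trust propagation can be sketched as biased PageRank, with the jump vector E set to the normalized oracle vector over the expert-approved seeds. A minimal sketch (graph format and iteration count are illustrative; sink handling is omitted for brevity):

```python
import numpy as np

def trustrank(links, good_seeds, d=0.85, iters=100):
    """Biased PageRank: the jump vector E is uniform over good seeds."""
    pages = sorted(links)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    P = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:               # trust splitting over out-links
            P[idx[j], idx[i]] = 1.0 / len(outs)
    e = np.zeros(n)
    for s in good_seeds:             # e.g. O(5)=O(10)=O(15)=1 -> E = 1/3 each
        e[idx[s]] = 1.0 / len(good_seeds)
    t = e.copy()
    for _ in range(iters):           # trust damping via the factor d
        t = d * (P @ t) + (1 - d) * e
    return dict(zip(pages, t))
```

Pages reachable from a seed accumulate trust, while a spam component with no path from the seeds stays at zero.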
HostRank
Previous link-analysis algorithms generally work on a flat link graph, ignoring the hierarchical structure of the Web. They suffer from two problems: the sparsity of the link graph and biased ranking of newly emerging pages. HostRank considers both the hierarchical structure and the link structure of the Web.
Example of Domain, Host, Directory
Supernode & Hierarchical Structure of a Web Graph
The upper layer is an aggregated link graph consisting of supernodes (such as domains, hosts, and directories). The lower-layer graph is the hierarchical tree structure, in which each node is an individual web page within a supernode and the edges are the hierarchical links between the pages.
Hierarchical Random Walk Model
1. At the beginning of each browsing session, the user randomly selects a supernode.
2. After the user finishes reading a page in a supernode, he may take one of three actions, each with a certain probability:
- Going to another page within the current supernode
- Jumping to another supernode that is linked by the current supernode
- Ending the browsing session
Two Stages in HostRank
1. Compute the score of each supernode by a random walk (PageRank) on the supernode graph
2. Propagate the score among the pages inside each supernode, using the Dissipative Heat Conductance (DHC) model
Ranking & Crawling Challenges
- Rich-get-richer problem
- Unfairness
- Low precision
- Spamming phenomenon
Popularity and Quality
Definition 1: the popularity of page p at time t, P(p, t), is the fraction of Web users who like the page. We can interpret the PageRank of a page as its popularity on the web.
Definition 2: the quality of a page p, Q(p), is the probability that an average user will like the page when seeing it for the first time.
Rich-get-richer Problem
Young, high-quality pages receive less popularity than they deserve. The cause is search-engine bias: the entrenchment effect.
Entrenchment Effect
Search engines show entrenched (already popular) pages at the top. Users discover pages via search engines and tend to focus on the top results, so user attention flows to entrenched pages while new, unpopular pages are overlooked.
Popularity as a Surrogate for Quality
Search engines want to measure the "quality" of pages, but quality is hard to define and measure. Various "popularity" measures are used in ranking instead, e.g. in-links, PageRank, user traffic.
Measuring Search-Engine Bias
- Random-surfer model: users follow links randomly and never use search engines
- Search-dominant model: users always start with a search engine and only visit pages returned by it
It has been found that it takes 60 times longer for a new page to become popular under the search-dominant model than under the random-surfer model.
Popularity Evaluation
(Figures: popularity growth under the random-surfer and the search-dominant models.)
Relation between Popularity & Visit Rate in the Random-Surfer Model
The visit rate r_1 is constant. We can consider PageRank as popularity: the current PageRank of a page represents the probability that a person arrives at the page by following links on the Web randomly.
Popularity Evolution
Popularity Evolution (Q(p) = 1)
Relation in the Search-Dominant Model
Random Surfer vs. Search Dominant
Search-Dominant Formula Detail (derived from an AltaVista log; follows a power law)
Rank Promotion (by Pandey)
Relationship between Popularity and Quality
Popularity depends on the number of users who "like" a page, so it relies on both the quality and the awareness of the page: P(p, t) = A(p, t) · Q(p). Popularity is different from quality, but strongly correlated with it when awareness is large.
Exploitation vs. Exploration
Exploitation: show the popular, high-quality pages. Exploration: also give not-yet-popular pages an opportunity.
DistanceRank: An Intelligent Ranking Algorithm for Web Pages
DistanceRank Goals
A connectivity-based ranking algorithm that is less sensitive to the rich-get-richer problem, in addition to producing a better ranking.
Definitions
Definition 1: if page i points to page j, then the weight of the link between i and j is log10 O(i), where O(i) is the out-degree of i.
Definition 2: the distance between two pages i and j is the weight of the shortest path from i to j. We call this the logarithmic distance and denote it d_ij.
An Example
Suppose O(p) = 2, O(r) = 3, O(q) = 4. The distance between p and t is log(2) + log(3), assuming the path p–r–t is the shortest path from p to t. The distance between p and s is log(2) + log(4), via q.
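The logarithmic distance is an ordinary shortest-path problem where every edge leaving page i weighs log10(O(i)), so Dijkstra's algorithm applies. A minimal sketch (the graph below is reconstructed to match the slide's out-degrees; the function name is illustrative):

```python
import heapq
import math

def log_distances(links, source):
    """Logarithmic distances from `source`: each edge out of page i
    weighs log10(O(i)); d_ij is the weight of the shortest path."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue                   # stale heap entry
        if not links[u]:
            continue                   # sink: no outgoing edges
        w = math.log10(len(links[u]))  # weight of every edge out of u
        for v in links[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

On a graph with p → {r, q}, r → {t, x, y}, q → {s, a, b, c}, this reproduces the slide's values: d(p, t) = log 2 + log 3 and d(p, s) = log 2 + log 4.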
Distance as a Random-Surfer Model
If the distance between i and j is less than the distance between i and k (d_ij < d_ik), then the probability that a random surfer starting from i reaches j is greater than the probability of reaching k.
Definition 3
If d_ij is the (logarithmic) distance between pages i and j as in Definition 2, then d_j denotes the average distance of page j, defined as d_j = (Σ over i of d_ij) / N, where N is the number of web pages.
Distance as a Ranking Criterion
- A page with a smaller average distance from the others has a higher rank
- A page with many input links should have a low average distance
- If the pages pointing to a page have low distances, then that page should also have a low average distance
Disadvantage of Average Distance
The main problem with the average distance is its complexity, O(|V|·|E|), so computing it directly is not practical on the real web with 25 billion pages.
The DistanceRank
In general, suppose O(i) denotes the number of outgoing links from page i and B(j) denotes the set of pages that point to page j. The DistanceRank of page j, denoted d_j, is given by:
d_j = min over i ∈ B(j) of (d_i + log O(i))
This recurrence has low complexity.
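The recurrence can be iterated to a fixed point over the back-link sets. A minimal sketch, with a discount factor gamma added as a parameter (gamma = 1 gives the plain recurrence above; the deck's later worked example uses gamma = 0.5):

```python
import math

def distance_rank(links, gamma=0.5, iters=60):
    """Iterate d_j = min over i in B(j) of (gamma*d_i + log10 O(i)).
    Pages with the smallest distance rank highest."""
    back = {p: [] for p in links}   # B(j): pages pointing to j
    for i, outs in links.items():
        for j in outs:
            back[j].append(i)
    d = {p: 0.0 for p in links}
    for _ in range(iters):
        d = {
            j: min((gamma * d[i] + math.log10(len(links[i]))
                    for i in back[j]), default=d[j])
            for j in links
        }
    return d
```

With gamma < 1 the update is a contraction, so the iteration converges regardless of the starting vector; pages without back-links simply keep their initial distance in this sketch.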
Intelligent Surfer
Intuitively, when a user starts browsing from a random page, s/he does not have any background about the web. Then, by surfing and visiting web pages, s/he clicks links based on both his/her previous experiences and the current status (content) of the pages. Continuously, s/he accumulates knowledge, which lets him/her reach the goal and suitable pages faster.
DistanceRank Based on Reinforcement Learning
- State: web pages (V)
- Action: move from i to j (click on j)
- Punishment: log O(i)
- Goal: minimize the punishments (distance)
First, a definition of RL: in reinforcement learning, learning takes place through interaction; the learner or decision maker is called the agent. Everything the agent interacts with, i.e. everything outside its control, is called the environment. The agent interacts with the environment through a series of actions, and in response to each action the environment gives the agent a reward or punishment. The agent and the environment interact over successive time steps t = 0, 1, 2, …; at each time step the agent receives a representation of the environment's state (with S the set of all possible states) and selects an action to reach a new state. The agent's main goal is to maximize (minimize) the rewards (punishments) received over time.
DistanceRank Is Similar to Q-Learning
d_j^(t+1) = (1 - α)·d_j^t + α·min over i ∈ B(j) of (γ·d_i^t + log O(i))
- α is the learning rate; log O(i) is the instantaneous punishment received in the transition from state i to j
- d_j^t and d_i^t are the average distance values of pages j and i at time t; d_j^(t+1) is the average distance of page j at time t+1
- The update maps onto the intelligent surfer: α starts at one and then decreases over time; γ is the discount factor
- The algorithm is iterative; finally, the distances are sorted in increasing order
DistanceRank Example
A five-node graph (γ = 0.5); values after a few iterations:

Node | Distance formula       | Value
1    | d1 = 0.5·d4 + log 2    | log 2
2    | d2 = 0.5·min(d3, d1)   | 0.25·log 2
3    | d3 = 0.5·d5            | 0.5·log 2
4    | d4 = 0.5·d2            | 0.12·log 2
5    | d5 = 0.5·d4 + log 2    |
Experimental Results
We used the University of California, Berkeley web site, which includes 5 million web pages, to evaluate DistanceRank. Three scenarios were used:
- Crawling scheduling: the goal was to find more important pages faster during the crawling process
- Rank ordering: we compared the ordering of DistanceRank with PageRank and measured their similarity
- Comparison with Google
Rank Ordering
We used Kendall's tau metric for the correlation between two ranked lists; the ideal (reference) ordering is PageRank's.

Algorithm      Kendall's tau
Breadth-first  0.11
Back Link      0.40
OPIC           0.62
DistanceRank   0.75
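Kendall's tau counts concordant versus discordant pairs between two orderings. A minimal sketch over rankings given as item-to-position maps (the input format is an illustrative choice):

```python
from itertools import combinations

def kendall_tau(pos_a, pos_b):
    """Kendall's tau between two rankings (item -> position dicts)."""
    items = list(pos_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        s = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if s > 0:
            concordant += 1    # pair ordered the same way in both lists
        elif s < 0:
            discordant += 1    # pair ordered oppositely
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Identical rankings give tau = 1, a fully reversed ranking gives tau = -1, so DistanceRank's 0.75 indicates a strong agreement with PageRank's ordering.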
Rich-get-richer Problem
It can be proved that the popularity of each page evolves as follows, where P_p and P_c are the previous and current popularity, respectively. There is a relation between popularity and quality (here Q = 0.80).
Popularity Growth in DistanceRank (Q = 0.8)
DistanceRank's popularity curve is closer to the page quality.
Experimental Result
DistanceRank is less sensitive to the "rich get richer" problem in comparison with PageRank. The distance is updated as d_j = α·(difference between the current and previous distance) + previous distance, and DistanceRank stays closer to the page quality.
DistanceRank Convergence
The ordering obtained with only 5 iterations agrees closely with the ordering after 20 iterations for 5 million pages, so the time complexity of the algorithm reduces to O(p·|E|), where p << N and p is the number of iterations needed for convergence.
Different Learning Rates