Web Information retrieval (Web IR)


1 Web Information retrieval (Web IR)
Handout #9: Connectivity Ranking. Ali Mohammad Zareh Bidoki, ECE Department, Yazd University. Autumn 2011

2 Outline
PageRank
HITS
Personalized PageRank
HostRank
DistanceRank

3 Ranking: Definition
Ranking is the process that estimates the quality of the set of results retrieved by a search engine
Ranking is the most important part of a search engine

4 Ranking Types
Content-based: classical IR
Connectivity-based (web): query independent, query dependent
User-behavior based

5 Web information retrieval
Queries are short: 2.35 terms on average
Huge variety in documents: language, quality, duplication
Huge vocabulary: hundreds of millions of terms
Deliberate misinformation and spamming! With content-based ranking, a page's rank is completely under the control of the page's author

6 Ranking in Web IR
(Figure: a term-document matrix and the Web graph.)
Ranking is a function of the query terms and of the hyperlink structure
The content of other pages is used to rank the current page
This is out of the control of the page's author, so spamming is hard

7 Connectivity-based Ranking
Query independent: PageRank
Query dependent: HITS

8 Google’s PageRank Algorithm
Idea: mine the structure of the web graph
Each web page is a node
Each hyperlink is a directed edge

9 PageRank
Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
Quality of a page is related to its in-degree
Recursion: quality of a page is related to its in-degree and to the quality of the pages linking to it

10 Definition of PageRank
Consider the following infinite random walk (surf), the "random surfer" model:
Initially the surfer is at a random page
At each step, the surfer proceeds to a randomly chosen successor of the current page (each with probability 1/out-degree)
The PageRank of a page p is the fraction of steps the surfer spends at p in the limit
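
To make the random-surfer definition concrete, here is a small Monte Carlo sketch (not from the slides): it walks a tiny hypothetical graph and reports the fraction of steps spent at each page. It assumes every page has at least one out-link.

import random

# Toy web graph (hypothetical): page -> list of successors.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def simulate_pagerank(graph, steps=100_000):
    """Estimate PageRank as the fraction of steps the surfer spends at each page."""
    visits = {page: 0 for page in graph}
    page = random.choice(list(graph))       # initially at a random page
    for _ in range(steps):
        visits[page] += 1
        page = random.choice(graph[page])   # move to a random successor
    return {p: count / steps for p, count in visits.items()}

print(simulate_pagerank(graph))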

11 PageRank (cont.)
By the previous theorem, PageRank equals the stationary probability of this Markov chain, i.e. (reconstructing the slide's formula) $\pi(p) = \sum_{q \to p} \pi(q)/O(q)$, where O(q) is the out-degree of q

12 PageRank (cont.)
(Figure: pages A and B both link to page P; A has 4 out-links and B has 3.)
The PageRank of P is P(A)/4 + P(B)/3

13 PageRank (cont.)

14 Damping Factor (d)
The Web graph is not strongly connected, so convergence of PageRank is not guaranteed
Sinking web pages cause problems: pages without out-links, and trapping pages
Damping factor (d): the surfer proceeds to a randomly chosen successor of the current page with probability d, or to a randomly chosen web page with probability (1-d)
With damping, the slide's formula (reconstructed) is $PR(p) = d \sum_{q \to p} PR(q)/O(q) + (1-d)/n$, where n is the total number of nodes in the graph

15 PageRank Vector (Linear Algebra)
R is the rank vector (an eigenvector); r_i is the rank value of page i
P is the matrix with p_ij = 1/O(i) if i points to j, and p_ij = 0 otherwise
The goal is to find the eigenvector of P with eigenvalue one
The computation iterates until convergence (the power method)
With the damping factor this becomes (reconstructed) $R = d\,P^{T} R + (1-d)\,E$, with $e_i = 1/n$
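
A minimal power-method sketch of the iteration just described, assuming a small hypothetical adjacency list; the damping value, tolerance, and example graph are illustrative, and dangling pages (no out-links) are assumed away.

import numpy as np

def pagerank(links, d=0.85, tol=1e-10):
    """Power method: iterate R = d * P^T R + (1 - d) * E until convergence.

    links: dict mapping each page index to the list of pages it points to.
    """
    n = len(links)
    # p_ij = 1/O(i) if i points to j; stored transposed so PT @ r sums over in-links.
    PT = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            PT[j, i] = 1.0 / len(outs)
    r = np.full(n, 1.0 / n)                 # start from the uniform vector E
    while True:
        r_next = d * PT @ r + (1 - d) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Hypothetical 3-page graph: 0 -> 1, 2 ; 1 -> 2 ; 2 -> 0
print(pagerank({0: [1, 2], 1: [2], 2: [0]}))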

16 PageRank Properties
Advantages: finds popularity; it is computed offline
Disadvantages: it is query independent; all pages compete together; unfairness

17 HITS (an online, query-dependent algorithm)
Hypertext Induced Topic Search, by Kleinberg

18 HITS (Hypertext Induced Topic Search)
The algorithm produces two types of pages:
Authority: a page is very authoritative if it receives many citations; citations from important pages weigh more than citations from less-important pages
Hub: hubness shows the importance of a page as a link collection; a good hub is a page that links to many authoritative sites
For each vertex v Є V in a graph of interest:
a(v) - the authority of v
h(v) - the hubness of v

19 HITS
(Figure: node 1 is pointed to by nodes 2, 3, and 4, and points to nodes 5, 6, and 7.)
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)

20 Authority and Hubness Convergence
Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs
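
A compact sketch of the mutually reinforcing updates, normalizing after each round; the adjacency list, iteration count, and normalization choice are assumptions, not the slides' exact procedure.

import math

def hits(links, iterations=50):
    """Iterate authority and hub scores on a directed graph.

    links: dict mapping each node to the list of nodes it points to.
    """
    nodes = set(links) | {v for outs in links.values() for v in outs}
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # a(v): sum of h(u) over nodes u that link to v
        auth = {v: sum(hub[u] for u in nodes if v in links.get(u, ())) for v in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {v: a / norm for v, a in auth.items()}
        # h(v): sum of a(w) over nodes w that v links to
        hub = {v: sum(auth[w] for w in links.get(v, ())) for v in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values()))
        hub = {v: h / norm for v, h in hub.items()}
    return auth, hub

# Hypothetical graph mirroring slide 19: 2, 3, 4 -> 1 and 1 -> 5, 6, 7
auth, hub = hits({2: [1], 3: [1], 4: [1], 1: [5, 6, 7]})
print(auth[1], hub[1])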

21 HITS Example
Find a base subgraph:
Start with a root set R, e.g. {1, 2, 3, 4}: nodes relevant to the topic
Expand R to include all the children and a fixed number (d) of parents of the nodes in R
This gives a new set S, the base subgraph
The real version of HITS is based on site relations

22 Topic Sensitive PageRank (TSPR)
It precomputes the importance scores offline, as with ordinary PageRank. However, it computes multiple importance scores for each page: a set of scores of the importance of a page with respect to various topics. At query time, these importance scores are combined, based on the topics of the query, to form a composite PageRank score for the pages matching the query.

23 TSPR (Cont.)
We have n topics on the web; the rank of page v is computed with respect to each topic t
The difference from original PageRank is in the E vector (it is not uniform, and there are n different E vectors)
There are n ranking values for each page
Problem: finding the topic of a page and of a query (we do not know the user's interest)

24 TSPR (Cont.)
c_j = category j. Given a query q, let q' be the context of q (here q' = q)
P(q'|c_j) is computed from the class term-vector D_j (the term counts of the documents below each of the 16 top-level categories: D_jt simply gives the total number of occurrences of term t in documents listed below class c_j)
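
To make the query-time combination concrete, here is a small sketch that blends precomputed per-topic PageRank vectors by the estimated topic probabilities P(c_j | q'); the two-topic example and all numbers are hypothetical.

def tspr_score(topic_ranks, topic_probs, page):
    """Composite score: sum over topics j of P(c_j | q') * rank_j(page)."""
    return sum(topic_probs[t] * topic_ranks[t].get(page, 0.0) for t in topic_probs)

# Hypothetical precomputed topic-sensitive ranks and query topic distribution.
topic_ranks = {"sports": {"p1": 0.04, "p2": 0.01},
               "science": {"p1": 0.01, "p2": 0.05}}
topic_probs = {"sports": 0.8, "science": 0.2}       # P(c_j | q') for this query
print(tspr_score(topic_ranks, topic_probs, "p1"))   # 0.8*0.04 + 0.2*0.01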

25 TSPR (Cont.)
The quantity P(c_j) is not as straightforward. Here it is used uniformly, although we could personalize the query results for different users by varying this distribution. In other words, for some user k, we can use a prior distribution P_k(c_j) that reflects the interests of user k. This provides an alternative framework for user-based personalization, rather than directly varying the damping vector E.

26 TrustRank
Spamming on the Web: there are good pages and bad pages
TrustRank is used to counter spamming
It proposes techniques to semi-automatically separate reputable, good pages from spam
It first selects a small set of seed pages to be evaluated by an expert; once the reputable seed pages have been identified manually, it uses the link structure of the web to discover other pages that are likely to be good

27 TrustRank (cont.)
Idea: good pages link to other good pages, and bad pages link to other bad pages

28 TrustRank (cont.)
It formalizes the notion of a human checking a page for spam by a binary oracle function O over all pages p: O(p) = 1 if p is good and O(p) = 0 if p is bad (the slide's definition, reconstructed)

29 Trust Damping and Trust Splitting
(Trust is damped as it propagates farther from the seed set, and a page's trust is split among the pages it links to.)

30 Computing the Trust Score of Each Page
Goal: trust propagation. E(i) is computed from the normalized oracle vector; for example, if O(5) = O(10) = O(15) = 1, then E(5) = E(10) = E(15) = 1/3. A sketch follows below.
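
The propagation amounts to PageRank biased toward the seed vector E; a minimal sketch under that reading, with a hypothetical graph, seed set, damping value, and iteration count.

import numpy as np

def trustrank(links, good_seeds, d=0.85, iterations=100):
    """Propagate trust: t = d * P^T t + (1 - d) * E, with E concentrated on seeds.

    links: dict page -> list of pages it points to.
    good_seeds: pages the oracle marked good (O(p) = 1).
    """
    n = len(links)
    PT = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            PT[j, i] = 1.0 / len(outs)
    e = np.zeros(n)
    e[list(good_seeds)] = 1.0 / len(good_seeds)     # normalized oracle vector
    t = e.copy()
    for _ in range(iterations):
        t = d * PT @ t + (1 - d) * e
    return t

# Hypothetical graph: seed page 0 links into a small cluster.
print(trustrank({0: [1, 2], 1: [2], 2: [3], 3: [0]}, good_seeds=[0]))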

31 HostRank
Previous link-analysis algorithms generally work on a flat link graph, ignoring the hierarchical structure of the Web. They suffer from two problems: the sparsity of the link graph and biased ranking of newly-emerging pages. HostRank considers both the hierarchical structure and the link structure of the Web.

32 Example of Domain, Host, Directory

33 Supernode & Hierarchical Structure of a Web Graph
The upper layer is an aggregated link graph consisting of supernodes (such as domains, hosts, and directories). The lower layer is the hierarchical tree structure, in which each node is an individual Web page within a supernode and the edges are the hierarchical links between the pages.

34 Hierarchical Random Walk Model
1. At the beginning of each browsing session, a user randomly selects a supernode.
2. After the user finishes reading a page in a supernode, he takes one of the following three actions, each with a certain probability:
going to another page within the current supernode
jumping to another supernode that is linked by the current supernode
ending the browsing

35 Two Stages in HostRank
First, compute the score of each supernode by a random walk (PageRank) on the aggregated graph
Second, propagate the score among the pages inside each supernode using the Dissipative Heat Conductance (DHC) model
A sketch of the first stage follows below
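
A rough sketch of the first stage only: pages are aggregated into host-level supernodes, and the resulting graph can be fed to a PageRank routine such as the earlier sketch. The URL-to-host rule and the example graph are assumptions, and the DHC second stage is not shown.

from urllib.parse import urlparse

def host_graph(page_links):
    """Aggregate a page-level link graph into a host-level supernode graph.

    page_links: dict URL -> list of URLs it points to.
    """
    urls = set(page_links) | {u for outs in page_links.values() for u in outs}
    hosts = sorted({urlparse(u).netloc for u in urls})
    index = {h: i for i, h in enumerate(hosts)}
    agg = {i: set() for i in range(len(hosts))}
    for src, outs in page_links.items():
        for dst in outs:
            a, b = index[urlparse(src).netloc], index[urlparse(dst).netloc]
            if a != b:                              # keep only cross-host links
                agg[a].add(b)
    return {i: sorted(outs) for i, outs in agg.items()}, hosts

# Hypothetical page-level graph spanning two hosts.
pages = {
    "http://a.edu/x": ["http://a.edu/y", "http://b.org/z"],
    "http://a.edu/y": ["http://b.org/z"],
    "http://b.org/z": ["http://a.edu/x"],
}
print(host_graph(pages))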

36 Ranking & Crawling Challenges
The rich-get-richer problem
Unfairness
Low precision
The spamming phenomenon

37 Popularity and Quality
Definition 1: we define the popularity of page p at time t, P(p, t), as the fraction of Web users who like the page. We can interpret the PageRank of a page as its popularity on the web
Definition 2: we define the quality of a page p, Q(p), as the probability that an average user will like the page when seeing it for the first time

38 Rich-get-richer Problem
It causes young high-quality pages to receive less popularity than they deserve
It stems from search-engine bias: the entrenchment effect

39 Entrenchment Effect
Search engines show entrenched (already-popular) pages at the top
Users discover pages via search engines and tend to focus on the top results, so user attention flows to entrenched pages while new, unpopular pages are overlooked

40 Popularity as a Surrogate for Quality
Search engines want to measure the "quality" of pages
Quality is hard to define and measure directly
So various "popularity" measures are used in ranking instead, e.g., in-links, PageRank, user traffic

41 Measuring Search-Engine Bias
Random-surfer model: users follow links randomly and never use search engines
Search-dominant model: users always start with a search engine and visit only pages returned by it
It has been found that it takes 60 times longer for a new page to become popular under the search-dominant model than under the random-surfer model

42 Popularity Evaluation
(Plots: popularity evolution under the random-surfer and search-dominant models.)

43 Some Definitions

44 Relation between Popularity & Visit rate in Random Surfer Model
The visit rate is proportional to the popularity, with a constant factor r1
We can consider PageRank as popularity: the current PageRank of a page represents the probability that a person arrives at the page when following links on the Web at random

45 Popularity evolution

46 Popularity evolution (Q(p)=1)

47 Relation in Search Dominant Model

48 Random Surfer vs. Search Dominant

49 Search Dominant Formula Detail (derived from an AltaVista log, which follows a power law)

50 Rank Promotion (by Pandey)

51 Relationship Between Popularity and Quality
(Figure: the set of users aware of page p vs. the set of users who like page p.)
Popularity depends on the number of users who "like" a page; it relies on both the quality and the awareness of the page
Popularity is different from quality, but strongly correlated with it when awareness is large: P(p, t) = A(p, t) * Q(p)
For example, if 30% of users are aware of p and Q(p) = 0.5, then P(p, t) = 0.15

52 Exploitation vs. Exploration
Exploitation: keep showing popular (mostly high-quality) pages
Exploration: mix in other pages, giving non-popular pages an opportunity

53 DistanceRank: An Intelligent Ranking Algorithm for Web Pages

54 DistanceRank Goals
A connectivity-based ranking algorithm that is less sensitive to the rich-get-richer problem and, in addition, gives better ranking

55 Definitions
Definition 1: if page i points to page j, then the weight of the link between i and j is log10 O(i)
Definition 2: the distance between two pages i and j is the weight of the shortest path from i to j. We call this the logarithmic distance and denote it d_ij

56 An Example
(Figure: a small graph over pages p, q, r, s, t with edge weights log 2, log 3, and log 4.)
The distance between p and t is log(2) + log(3), assuming the path p-r-t is the shortest path between p and t
The distance between p and s is log(2) + log(4)

57 Distance as a Random Surfer Model
If the distance between i and j is less than the distance between i and k (d_ij < d_ik), then the probability that a random surfer starting from i reaches j is higher than the probability of reaching k

58 Definition 3
If d_ij is the (logarithmic) distance between pages i and j as in Definition 2, then d_j denotes the average distance of page j and is defined as follows, where N is the number of web pages (the slide's formula, reconstructed): $d_j = \frac{1}{N} \sum_{i=1}^{N} d_{ij}$

59 Distance as Ranking Criterion
A page with a smaller average distance from the others has a higher rank
A page with many in-links should have a low average distance
If the pages pointing to a page have low distances, that page should also have a low average distance

60 Disadvantage of Average Distance
The main problem of the average distance is its complexity, O(|V|*|E|)
So computing the average distance directly is not practical on the real web, with some 25 billion pages

61 The DistanceRank
In general, suppose O(i) denotes the number of forward (outgoing) links of page i and B(j) denotes the set of pages that point to page j. The DistanceRank of page j, denoted d_j, is given by (reconstructed from the figure labels) $d_j = \min_{i \in B(j)} \left( d_i + \log O(i) \right)$. Low complexity! A sketch of the iteration follows below.
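
A minimal iterative sketch of the recurrence above, using the damped update described on the reinforcement-learning slides that follow; the initial distances, learning-rate schedule, and example graph are assumptions.

import math

def distance_rank(links, iterations=20):
    """Iterate d_j = (1 - a) * d_j + a * min over i in B(j) of (d_i + log10 O(i)).

    links: dict page -> list of pages it points to. Lower distance = higher rank.
    """
    nodes = set(links) | {v for outs in links.values() for v in outs}
    d = {v: 1.0 for v in nodes}                 # assumed initial distances
    for t in range(1, iterations + 1):
        a = 1.0 / t                             # assumed decreasing learning rate
        best = {}
        for i, outs in links.items():
            w = math.log10(len(outs))           # weight of every link leaving i
            for j in outs:
                best[j] = min(best.get(j, d[i] + w), d[i] + w)
        d = {v: (1 - a) * d[v] + a * best.get(v, d[v]) for v in nodes}
    return d

# Hypothetical 5-page graph.
links = {1: [2], 2: [4], 4: [5, 2], 5: [3, 1], 3: [2]}
print(sorted(distance_rank(links).items(), key=lambda kv: kv[1]))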

62 Intelligent Surfer
Intuitively, when a user starts browsing from a random page, s/he has no background about the web
Then, by surfing and visiting web pages, s/he clicks links based on both his/her previous experiences and the current content of the pages
Continuously, s/he accumulates knowledge that helps him/her reach the goal and suitable pages faster

63 DistanceRank based on Reinforcement Learning
State: web pages (V)
Action: move from i to j (click j)
Punishment: log O(i)
Goal: minimize punishments (distance)
First, a definition of RL: in reinforcement learning, learning happens through interaction; the learner or decision-maker is called the agent. Everything the agent interacts with, comprising all conditions outside the agent's control, is called the environment. The agent interacts with the environment by performing a series of actions, and in response to each action the environment gives the agent a reward or punishment. The agent and the environment interact over successive time steps t = 0, 1, 2, …. At each time step the agent receives a representation of the environment's state (S is the set of all possible states) and selects an action to reach a new state. The agent's main goal is to maximize the rewards (or minimize the punishments) received over time.

64 DistanceRank is similar to Q-Learning
Alpha is the learning rate; log(O(i)) is the instantaneous punishment received in the transition from state i to state j
d_j^t and d_i^t are the average distance values of pages j and i at time t; d_j^{t+1} is the average distance of page j at time t+1
This maps onto the intelligent surfer model
Alpha starts at one and then decreases over time; gamma is the discount factor
The computation is iterative; finally the distances are sorted in increasing order (the update rule is reconstructed below)
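
The update formula itself appears only as an image on the slide; combining the quantities just listed with the recurrence on slide 61, a plausible reconstruction (an assumption, not a verbatim copy of the slide) is:

$d_j^{t+1} = (1 - \alpha)\, d_j^{t} + \alpha \min_{i \in B(j)} \left( \gamma\, d_i^{t} + \log O(i) \right)$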

65 DistanceRank Example
(Figure: a 5-node example graph. The original table listed values per iteration; only one column of values survived extraction.)
Node  Formula                 Distance
1     d1 = 0.5*d4 + log2      log2
2     d2 = 0.5*Min(d3, d1)    0.25 log2
3     d3 = 0.5*d5             0.5 log2
4     d4 = 0.5*d2             0.12 log2
5     d5 = 0.5*d4 + log2      (not recovered)

66 Experimental Results
We used the University of California, Berkeley web site, which includes 5 million web pages, to evaluate DistanceRank
Three scenarios were used:
Crawl scheduling: the goal was to find more important pages faster in the crawling process
Rank ordering: we compared the ordering produced by DistanceRank with PageRank's and measured their similarity
Comparison with Google

67 Rank Ordering
We used Kendall's tau to measure the correlation between two ranked lists; the PageRank ordering is taken as the ideal
Algorithm       Kendall's Tau
Breadth-first   0.11
Back Link       0.40
OPIC            0.62
DistanceRank    0.75

68 Rich-get-richer Problem
It is proved that the popularity of each page evolves as shown in the slide's formula, where P_p and P_c are the previous and current popularity respectively
There is a relation between popularity and quality (here Q = 0.80)

69 Popularity growth in DistanceRank (Q=0.8)
DistanceRank stays closer to the page quality

70 Experimental Result
DistanceRank is less sensitive to the "rich get richer" problem in comparison with PageRank
d_j = alpha * (difference between current and previous distance) + previous distance
DistanceRank stays closer to the page quality

71 DistanceRank Convergence
The ordering obtained after only 5 iterations agrees closely with the ordering after 20 iterations for 5 million pages
The time complexity of the algorithm thus reduces to O(p*|E|), where p << N is the number of iterations needed for convergence

72 Different learning rates

