Web Information retrieval (Web IR)


1 Web Information retrieval (Web IR)
Handout #9: Connectivity Ranking. Ali Mohammad Zareh Bidoki, ECE Department, Yazd University. Autumn 2011

2 Outline
PageRank
HITS
Personalized PageRank
HostRank
DistanceRank

3 Ranking: Definition
Ranking is the process that estimates the quality of the set of results retrieved by a search engine
Ranking is the most important part of a search engine

4 Ranking Types
Content-based: classical IR
Connectivity-based (web): query independent, query dependent
User-behavior based

5 Web information retrieval
Queries are short: 2.35 terms on average
Huge variety in documents: language, quality, duplication
Huge vocabulary: hundreds of millions of terms
Deliberate misinformation and spamming! With content-based ranking, a page's rank is completely under the control of the page's author

6 Ranking in Web IR
(Figure: a term-document matrix and the Web graph.)
Ranking is a function of the query terms and of the hyperlink structure
The content of other pages is used to rank the current page
This is out of the control of the page's author, so spamming is hard

7 Connectivity-based Ranking
Query independent: PageRank
Query dependent: HITS

8 Google’s PageRank Algorithm
Idea: mine the structure of the web graph
Each web page is a node
Each hyperlink is a directed edge

9 PageRank
Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
Quality of a page is related to its in-degree
Recursion: quality of a page is related to its in-degree and to the quality of the pages linking to it

10 Definition of PageRank
Consider the following infinite random walk (surf), the "random surfer" model:
Initially the surfer is at a random page
At each step, the surfer proceeds to a randomly chosen successor of the current page (each with probability 1/out-degree)
The PageRank of a page p is the fraction of steps the surfer spends at p in the limit
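
To make the random-surfer definition concrete, here is a small Monte Carlo sketch (not from the slides): it walks a tiny hypothetical graph and reports the fraction of steps spent at each page. It assumes every page has at least one out-link.

import random

# Toy web graph (hypothetical): page -> list of successors.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def simulate_pagerank(graph, steps=100_000):
    """Estimate PageRank as the fraction of steps the surfer spends at each page."""
    visits = {page: 0 for page in graph}
    page = random.choice(list(graph))       # initially at a random page
    for _ in range(steps):
        visits[page] += 1
        page = random.choice(graph[page])   # move to a random successor
    return {p: count / steps for p, count in visits.items()}

print(simulate_pagerank(graph))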

11 PageRank (cont.)
By the previous theorem, PageRank equals the stationary probability of this Markov chain, i.e. (reconstructing the slide's formula) $\pi(p) = \sum_{q \to p} \pi(q)/O(q)$, where O(q) is the out-degree of q

12 PageRank (cont.)
(Figure: pages A and B both link to page P; A has 4 out-links and B has 3.)
The PageRank of P is P(A)/4 + P(B)/3

13 PageRank (cont.)

14 Damping Factor (d)
The Web graph is not strongly connected, so convergence of PageRank is not guaranteed
Sinking web pages cause problems: pages without out-links, and trapping pages
Damping factor (d): the surfer proceeds to a randomly chosen successor of the current page with probability d, or to a randomly chosen web page with probability (1-d)
With damping, the slide's formula (reconstructed) is $PR(p) = d \sum_{q \to p} PR(q)/O(q) + (1-d)/n$, where n is the total number of nodes in the graph

15 PageRank Vector (Linear Algebra)
R is the rank vector (an eigenvector); r_i is the rank value of page i
P is the matrix with p_ij = 1/O(i) if i points to j, and p_ij = 0 otherwise
The goal is to find the eigenvector of P with eigenvalue one
The computation iterates until convergence (the power method)
With the damping factor this becomes (reconstructed) $R = d\,P^{T} R + (1-d)\,E$, with $e_i = 1/n$
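
A minimal power-method sketch of the iteration just described, assuming a small hypothetical adjacency list; the damping value, tolerance, and example graph are illustrative, and dangling pages (no out-links) are assumed away.

import numpy as np

def pagerank(links, d=0.85, tol=1e-10):
    """Power method: iterate R = d * P^T R + (1 - d) * E until convergence.

    links: dict mapping each page index to the list of pages it points to.
    """
    n = len(links)
    # p_ij = 1/O(i) if i points to j; stored transposed so PT @ r sums over in-links.
    PT = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            PT[j, i] = 1.0 / len(outs)
    r = np.full(n, 1.0 / n)                 # start from the uniform vector E
    while True:
        r_next = d * PT @ r + (1 - d) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Hypothetical 3-page graph: 0 -> 1, 2 ; 1 -> 2 ; 2 -> 0
print(pagerank({0: [1, 2], 1: [2], 2: [0]}))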

16 PageRank Properties
Advantages: finds popularity; it is computed offline
Disadvantages: it is query independent; all pages compete together; unfairness

17 HITS (an online, query-dependent algorithm)
Hypertext Induced Topic Search, by Kleinberg

18 HITS (Hypertext Induced Topic Search)
The algorithm produces two types of pages:
Authority: a page is very authoritative if it receives many citations; citations from important pages weigh more than citations from less-important pages
Hub: hubness shows the importance of a page as a link collection; a good hub is a page that links to many authoritative sites
For each vertex v Є V in a graph of interest:
a(v) - the authority of v
h(v) - the hubness of v

19 HITS
(Figure: node 1 is pointed to by nodes 2, 3, and 4, and points to nodes 5, 6, and 7.)
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)

20 Authority and Hubness Convergence
Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs
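
A compact sketch of the mutually reinforcing updates, normalizing after each round; the adjacency list, iteration count, and normalization choice are assumptions, not the slides' exact procedure.

import math

def hits(links, iterations=50):
    """Iterate authority and hub scores on a directed graph.

    links: dict mapping each node to the list of nodes it points to.
    """
    nodes = set(links) | {v for outs in links.values() for v in outs}
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # a(v): sum of h(u) over nodes u that link to v
        auth = {v: sum(hub[u] for u in nodes if v in links.get(u, ())) for v in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {v: a / norm for v, a in auth.items()}
        # h(v): sum of a(w) over nodes w that v links to
        hub = {v: sum(auth[w] for w in links.get(v, ())) for v in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values()))
        hub = {v: h / norm for v, h in hub.items()}
    return auth, hub

# Hypothetical graph mirroring slide 19: 2, 3, 4 -> 1 and 1 -> 5, 6, 7
auth, hub = hits({2: [1], 3: [1], 4: [1], 1: [5, 6, 7]})
print(auth[1], hub[1])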

21 HITS Example
Find a base subgraph:
Start with a root set R, e.g. {1, 2, 3, 4}: nodes relevant to the topic
Expand R to include all the children and a fixed number (d) of parents of the nodes in R
This gives a new set S, the base subgraph
The real version of HITS is based on site relations

22 Topic Sensitive PageRank (TSPR)
It precomputes the importance scores offline, as with ordinary PageRank. However, it computes multiple importance scores for each page: a set of scores of the importance of a page with respect to various topics. At query time, these importance scores are combined, based on the topics of the query, to form a composite PageRank score for the pages matching the query.

23 TSPR (Cont.)
We have n topics on the web; the rank of page v is computed with respect to each topic t
The difference from original PageRank is in the E vector (it is not uniform, and there are n different E vectors)
There are n ranking values for each page
Problem: finding the topic of a page and of a query (we do not know the user's interest)

24 TSPR (Cont.)
c_j = category j. Given a query q, let q' be the context of q (here q' = q)
P(q'|c_j) is computed from the class term-vector D_j (the term counts of the documents below each of the 16 top-level categories: D_jt simply gives the total number of occurrences of term t in documents listed below class c_j)
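
To make the query-time combination concrete, here is a small sketch that blends precomputed per-topic PageRank vectors by the estimated topic probabilities P(c_j | q'); the two-topic example and all numbers are hypothetical.

def tspr_score(topic_ranks, topic_probs, page):
    """Composite score: sum over topics j of P(c_j | q') * rank_j(page)."""
    return sum(topic_probs[t] * topic_ranks[t].get(page, 0.0) for t in topic_probs)

# Hypothetical precomputed topic-sensitive ranks and query topic distribution.
topic_ranks = {"sports": {"p1": 0.04, "p2": 0.01},
               "science": {"p1": 0.01, "p2": 0.05}}
topic_probs = {"sports": 0.8, "science": 0.2}       # P(c_j | q') for this query
print(tspr_score(topic_ranks, topic_probs, "p1"))   # 0.8*0.04 + 0.2*0.01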

25 TSPR (Cont.)
The quantity P(c_j) is not as straightforward. Here it is used uniformly, although we could personalize the query results for different users by varying this distribution. In other words, for some user k, we can use a prior distribution P_k(c_j) that reflects the interests of user k. This provides an alternative framework for user-based personalization, rather than directly varying the damping vector E.

26 TrustRank
Spamming on the Web: there are good pages and bad pages
TrustRank is used to counter spamming
It proposes techniques to semi-automatically separate reputable, good pages from spam
It first selects a small set of seed pages to be evaluated by an expert; once the reputable seed pages have been identified manually, it uses the link structure of the web to discover other pages that are likely to be good

27 TrustRank (cont.)
Idea: good pages link to other good pages, and bad pages link to other bad pages

28 TrustRank (cont.)
It formalizes the notion of a human checking a page for spam by a binary oracle function O over all pages p: O(p) = 1 if p is good and O(p) = 0 if p is bad (the slide's definition, reconstructed)

29 Trust Damping and Trust Splitting
(Trust is damped as it propagates farther from the seed set, and a page's trust is split among the pages it links to.)

30 Computing the Trust Score of Each Page
Goal: trust propagation. E(i) is computed from the normalized oracle vector; for example, if O(5) = O(10) = O(15) = 1, then E(5) = E(10) = E(15) = 1/3. A sketch follows below.
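
The propagation amounts to PageRank biased toward the seed vector E; a minimal sketch under that reading, with a hypothetical graph, seed set, damping value, and iteration count.

import numpy as np

def trustrank(links, good_seeds, d=0.85, iterations=100):
    """Propagate trust: t = d * P^T t + (1 - d) * E, with E concentrated on seeds.

    links: dict page -> list of pages it points to.
    good_seeds: pages the oracle marked good (O(p) = 1).
    """
    n = len(links)
    PT = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            PT[j, i] = 1.0 / len(outs)
    e = np.zeros(n)
    e[list(good_seeds)] = 1.0 / len(good_seeds)     # normalized oracle vector
    t = e.copy()
    for _ in range(iterations):
        t = d * PT @ t + (1 - d) * e
    return t

# Hypothetical graph: seed page 0 links into a small cluster.
print(trustrank({0: [1, 2], 1: [2], 2: [3], 3: [0]}, good_seeds=[0]))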

31 HostRank
Previous link-analysis algorithms generally work on a flat link graph, ignoring the hierarchical structure of the Web. They suffer from two problems: the sparsity of the link graph and biased ranking of newly-emerging pages. HostRank considers both the hierarchical structure and the link structure of the Web.

32 Example of Domain, Host, Directory

33 Supernode & Hierarchical Structure of a Web Graph
The upper layer is an aggregated link graph consisting of supernodes (such as domains, hosts, and directories). The lower layer is the hierarchical tree structure, in which each node is an individual Web page within a supernode and the edges are the hierarchical links between the pages.

34 Hierarchical Random Walk Model
1. At the beginning of each browsing session, a user randomly selects a supernode.
2. After the user finishes reading a page in a supernode, he takes one of the following three actions, each with a certain probability:
going to another page within the current supernode
jumping to another supernode that is linked by the current supernode
ending the browsing

35 Two Stages in HostRank
First, compute the score of each supernode by a random walk (PageRank) on the aggregated graph
Second, propagate the score among the pages inside each supernode using the Dissipative Heat Conductance (DHC) model
A sketch of the first stage follows below
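
A rough sketch of the first stage only: pages are aggregated into host-level supernodes, and the resulting graph can be fed to a PageRank routine such as the earlier sketch. The URL-to-host rule and the example graph are assumptions, and the DHC second stage is not shown.

from urllib.parse import urlparse

def host_graph(page_links):
    """Aggregate a page-level link graph into a host-level supernode graph.

    page_links: dict URL -> list of URLs it points to.
    """
    urls = set(page_links) | {u for outs in page_links.values() for u in outs}
    hosts = sorted({urlparse(u).netloc for u in urls})
    index = {h: i for i, h in enumerate(hosts)}
    agg = {i: set() for i in range(len(hosts))}
    for src, outs in page_links.items():
        for dst in outs:
            a, b = index[urlparse(src).netloc], index[urlparse(dst).netloc]
            if a != b:                              # keep only cross-host links
                agg[a].add(b)
    return {i: sorted(outs) for i, outs in agg.items()}, hosts

# Hypothetical page-level graph spanning two hosts.
pages = {
    "http://a.edu/x": ["http://a.edu/y", "http://b.org/z"],
    "http://a.edu/y": ["http://b.org/z"],
    "http://b.org/z": ["http://a.edu/x"],
}
print(host_graph(pages))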

36 Ranking & Crawling Challenges
The rich-get-richer problem
Unfairness
Low precision
The spamming phenomenon

37 Popularity and Quality
Definition 1: we define the popularity of page p at time t, P(p, t), as the fraction of Web users who like the page. We can interpret the PageRank of a page as its popularity on the web
Definition 2: we define the quality of a page p, Q(p), as the probability that an average user will like the page when seeing it for the first time

38 Rich-get-richer Problem
It causes young high-quality pages to receive less popularity than they deserve
It stems from search-engine bias: the entrenchment effect

39 Entrenchment Effect
Search engines show entrenched (already-popular) pages at the top
Users discover pages via search engines and tend to focus on the top results, so user attention flows to entrenched pages while new, unpopular pages are overlooked

40 Popularity as a Surrogate for Quality
Search engines want to measure the "quality" of pages
Quality is hard to define and measure directly
So various "popularity" measures are used in ranking instead, e.g., in-links, PageRank, user traffic

41 Measuring Search-Engine Bias
Random-surfer model: users follow links randomly and never use search engines
Search-dominant model: users always start with a search engine and visit only pages returned by it
It has been found that it takes 60 times longer for a new page to become popular under the search-dominant model than under the random-surfer model

42 Popularity Evaluation
(Plots: popularity evolution under the random-surfer and search-dominant models.)

43 Some Definitions

44 Relation between Popularity & Visit rate in Random Surfer Model
The visit rate is proportional to the popularity, with a constant factor r1
We can consider PageRank as popularity: the current PageRank of a page represents the probability that a person arrives at the page when following links on the Web at random

45 Popularity evolution

46 Popularity evolution (Q(p)=1)

47 Relation in Search Dominant Model

48 Random Surfer vs. Search Dominant

49 Search Dominant Formula Detail (derived from an AltaVista log, which follows a power law)

50 Rank Promotion (by Pandey)

51 Relationship Between Popularity and Quality
(Figure: the set of users aware of page p vs. the set of users who like page p.)
Popularity depends on the number of users who "like" a page; it relies on both the quality and the awareness of the page
Popularity is different from quality, but strongly correlated with it when awareness is large: P(p, t) = A(p, t) * Q(p)
For example, if 30% of users are aware of p and Q(p) = 0.5, then P(p, t) = 0.15

52 Exploitation vs. Exploration
Exploitation: keep showing popular (mostly high-quality) pages
Exploration: mix in other pages, giving non-popular pages an opportunity

53 DistanceRank: An Intelligent Ranking Algorithm for Web Pages

54 DistanceRank Goals
A connectivity-based ranking algorithm that is less sensitive to the rich-get-richer problem and, in addition, gives better ranking

55 Definitions
Definition 1: if page i points to page j, then the weight of the link between i and j is log10 O(i)
Definition 2: the distance between two pages i and j is the weight of the shortest path from i to j. We call this the logarithmic distance and denote it d_ij

56 An Example
(Figure: a small graph over pages p, q, r, s, t with edge weights log 2, log 3, and log 4.)
The distance between p and t is log(2) + log(3), assuming the path p-r-t is the shortest path between p and t
The distance between p and s is log(2) + log(4)

57 Distance as a Random Surfer Model
If the distance between i and j is less than the distance between i and k (d_ij < d_ik), then the probability that a random surfer starting from i reaches j is higher than the probability of reaching k

58 Definition 3
If d_ij is the (logarithmic) distance between pages i and j as in Definition 2, then d_j denotes the average distance of page j and is defined as follows, where N is the number of web pages (the slide's formula, reconstructed): $d_j = \frac{1}{N} \sum_{i=1}^{N} d_{ij}$

59 Distance as Ranking Criterion
A page with a smaller average distance from the others has a higher rank
A page with many in-links should have a low average distance
If the pages pointing to a page have low distances, that page should also have a low average distance

60 Disadvantage of Average Distance
The main problem of the average distance is its complexity, O(|V|*|E|)
So computing the average distance directly is not practical on the real web, with some 25 billion pages

61 The DistanceRank
In general, suppose O(i) denotes the number of forward (outgoing) links of page i and B(j) denotes the set of pages that point to page j. The DistanceRank of page j, denoted d_j, is given by (reconstructed from the figure labels) $d_j = \min_{i \in B(j)} \left( d_i + \log O(i) \right)$. Low complexity! A sketch of the iteration follows below.
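
A minimal iterative sketch of the recurrence above, using the damped update described on the reinforcement-learning slides that follow; the initial distances, learning-rate schedule, and example graph are assumptions.

import math

def distance_rank(links, iterations=20):
    """Iterate d_j = (1 - a) * d_j + a * min over i in B(j) of (d_i + log10 O(i)).

    links: dict page -> list of pages it points to. Lower distance = higher rank.
    """
    nodes = set(links) | {v for outs in links.values() for v in outs}
    d = {v: 1.0 for v in nodes}                 # assumed initial distances
    for t in range(1, iterations + 1):
        a = 1.0 / t                             # assumed decreasing learning rate
        best = {}
        for i, outs in links.items():
            w = math.log10(len(outs))           # weight of every link leaving i
            for j in outs:
                best[j] = min(best.get(j, d[i] + w), d[i] + w)
        d = {v: (1 - a) * d[v] + a * best.get(v, d[v]) for v in nodes}
    return d

# Hypothetical 5-page graph.
links = {1: [2], 2: [4], 4: [5, 2], 5: [3, 1], 3: [2]}
print(sorted(distance_rank(links).items(), key=lambda kv: kv[1]))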

62 Intelligent Surfer
Intuitively, when a user starts browsing from a random page, s/he has no background about the web
Then, by surfing and visiting web pages, s/he clicks links based on both his/her previous experiences and the current content of the pages
Continuously, s/he accumulates knowledge that helps him/her reach the goal and suitable pages faster

63 DistanceRank based on Reinforcement Learning
State: web pages (V)
Action: move from i to j (click j)
Punishment: log O(i)
Goal: minimize punishments (distance)
First, a definition of RL: in reinforcement learning, learning happens through interaction; the learner or decision-maker is called the agent. Everything the agent interacts with, comprising all conditions outside the agent's control, is called the environment. The agent interacts with the environment by performing a series of actions, and in response to each action the environment gives the agent a reward or punishment. The agent and the environment interact over successive time steps t = 0, 1, 2, …. At each time step the agent receives a representation of the environment's state (S is the set of all possible states) and selects an action to reach a new state. The agent's main goal is to maximize the rewards (or minimize the punishments) received over time.

64 DistanceRank is similar to Q-Learning
Alpha is the learning rate; log(O(i)) is the instantaneous punishment received in the transition from state i to state j
d_j^t and d_i^t are the average distance values of pages j and i at time t; d_j^{t+1} is the average distance of page j at time t+1
This maps onto the intelligent surfer model
Alpha starts at one and then decreases over time; gamma is the discount factor
The computation is iterative; finally the distances are sorted in increasing order (the update rule is reconstructed below)
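
The update formula itself appears only as an image on the slide; combining the quantities just listed with the recurrence on slide 61, a plausible reconstruction (an assumption, not a verbatim copy of the slide) is:

$d_j^{t+1} = (1 - \alpha)\, d_j^{t} + \alpha \min_{i \in B(j)} \left( \gamma\, d_i^{t} + \log O(i) \right)$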

65 DistanceRank Example
(Figure: a 5-node example graph. The original table listed values per iteration; only one column of values survived extraction.)
Node  Formula                 Distance
1     d1 = 0.5*d4 + log2      log2
2     d2 = 0.5*Min(d3, d1)    0.25 log2
3     d3 = 0.5*d5             0.5 log2
4     d4 = 0.5*d2             0.12 log2
5     d5 = 0.5*d4 + log2      (not recovered)

66 Experimental Results
We used the University of California, Berkeley web site, which includes 5 million web pages, to evaluate DistanceRank
Three scenarios were used:
Crawl scheduling: the goal was to find more important pages faster in the crawling process
Rank ordering: we compared the ordering produced by DistanceRank with PageRank's and measured their similarity
Comparison with Google

67 Rank Ordering
We used Kendall's tau to measure the correlation between two ranked lists; the PageRank ordering is taken as the ideal
Algorithm       Kendall's Tau
Breadth-first   0.11
Back Link       0.40
OPIC            0.62
DistanceRank    0.75

68 Rich-get-richer Problem
It is proved that the popularity of each page evolves as shown in the slide's formula, where P_p and P_c are the previous and current popularity respectively
There is a relation between popularity and quality (here Q = 0.80)

69 Popularity growth in DistanceRank (Q=0.8)
DistanceRank stays closer to the page quality

70 Experimental Result
DistanceRank is less sensitive to the "rich get richer" problem in comparison with PageRank
d_j = alpha * (difference between current and previous distance) + previous distance
DistanceRank stays closer to the page quality

71 DistanceRank Convergence
The ordering obtained after only 5 iterations agrees closely with the ordering after 20 iterations for 5 million pages
The time complexity of the algorithm thus reduces to O(p*|E|), where p << N is the number of iterations needed for convergence

72 Different learning rates

