CS246 Link-Based Ranking
Problems of TFIDF Vector. Works well on a small controlled corpus, but not on the Web. Top result for the “American Airlines” query: an accident report on American Airlines flights. Do users really care how many times “American Airlines” is mentioned? Easy to spam: ranking is based purely on page content, so authors can manipulate page content to get a high ranking. Any ideas?
Link-based Ranking. People “expect” to get the AA home page for the query “American Airlines”. Many pages point to the AA home page, but not to the accident report. Use link count!
Simple Link Count. Still easy to spam: create many pages and add links from them to the target page. How to avoid spam?
PageRank. A page is important if it is pointed to by many important pages. PR(p) = PR(p_1)/n_1 + … + PR(p_k)/n_k, where p_i is a page pointing to p and n_i is the number of links in p_i. The PageRank of p is the sum of the PageRanks of its parents, each divided by that parent's number of links. One equation for every page: N equations, N unknown variables.
Example: Web of 1842. Pages: Netscape (n, Ne), Microsoft (m, MS), and Amazon (a, Am). PR(n) = PR(n)/2 + PR(a)/2; PR(m) = PR(a)/2; PR(a) = PR(n)/2 + PR(m).
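As a quick check of the three equations for the Web of 1842, the sketch below tests one candidate solution with exact fractions. The normalization (total importance of 3, one unit per page) is an assumption added here; the equations alone determine the PageRanks only up to a constant factor.

```python
from fractions import Fraction

# Candidate solution, assuming the total importance is normalized to 3
# (one unit per page). This normalization is an assumption of the sketch.
pr = {"n": Fraction(6, 5), "m": Fraction(3, 5), "a": Fraction(6, 5)}

# Right-hand sides of the three equations from the example.
rhs = {
    "n": pr["n"] / 2 + pr["a"] / 2,
    "m": pr["a"] / 2,
    "a": pr["n"] / 2 + pr["m"],
}

for page in pr:
    # Each line shows the page, its PageRank, the right-hand side, and whether they agree.
    print(page, pr[page], rhs[page], pr[page] == rhs[page])
```

Any multiple of this solution also satisfies the equations, which is why a normalization has to be chosen.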
PageRank: Matrix Notation. Web graph matrix M = {m_ij}. Each page i corresponds to row i and column i of the matrix M. m_ij = 1/n if page i is one of the n children of page j; m_ij = 0 otherwise. PageRank vector: r = [PR(p_1), …, PR(p_N)]^T. PageRank equation: r = M r.
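The matrix definition can be turned into code fairly directly. The following is a minimal sketch; the function name `build_matrix` and the dictionary encoding of the links are hypothetical choices made for illustration, and the link structure is the one implied by the example equations for Netscape (n), Microsoft (m), and Amazon (a).

```python
from fractions import Fraction

def build_matrix(links, order):
    """Return M as a list of rows, with M[i][j] = 1/n if page i is one of
    the n children of page j, and 0 otherwise."""
    index = {page: k for k, page in enumerate(order)}
    size = len(order)
    M = [[Fraction(0)] * size for _ in range(size)]
    for parent, children in links.items():
        for child in children:
            M[index[child]][index[parent]] = Fraction(1, len(children))
    return M

# Hypothetical encoding: each page maps to the list of pages it links to.
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
M = build_matrix(links, ["n", "m", "a"])
for row in M:
    print([str(x) for x in row])
```

Column j of M describes how page j splits its importance, so each column sums to 1 when page j has at least one link.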
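PageRank: Iterative Computation. Initially every page has one unit of importance. At each round, each page shares its importance among its children and receives new importance from its parents. Eventually the importance of each page reaches a limit. M is a stochastic matrix: each column sums to 1 as long as every page has at least one link, so a round redistributes importance without changing the total.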
Example: Web of 1842. [Figure: link graph of Netscape (Ne), Amazon (Am), and Microsoft (MS)]
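The iterative computation on this example can be sketched in a few lines. The matrix below is written out by hand from the example equations, in the order [n, m, a]; the function name `pagerank_iterate` is hypothetical.

```python
def pagerank_iterate(M, rounds):
    """Start with one unit of importance per page and apply r <- M r."""
    r = [1.0] * len(M)
    for _ in range(rounds):
        r = [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]
    return r

# Matrix for the example, rows and columns in the order [n, m, a].
M = [
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 1.0, 0.0],
]

for rounds in (0, 1, 2, 50):
    print(rounds, [round(x, 4) for x in pagerank_iterate(M, rounds)])
```

With enough rounds the values approach n = a = 6/5 and m = 3/5, which satisfy the three equations with a total importance of 3.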
PageRank: Eigenvector. The solution of the PageRank equation is the principal eigenvector of M (the eigenvector with eigenvalue 1).
PageRank: Random Surfer Model. The PageRank of a page is the probability that a Web surfer is at that page after many clicks, following a random link at each click.
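The random surfer can be simulated directly. This is a rough sketch on the Netscape/Microsoft/Amazon graph; the function name, the fixed seed, and the number of clicks are arbitrary choices made here for illustration.

```python
import random

def simulate_surfer(links, start, clicks, seed=0):
    """Follow one random link per click and return the fraction of
    clicks that landed on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in links}
    page = start
    for _ in range(clicks):
        page = rng.choice(links[page])
        visits[page] += 1
    return {p: count / clicks for p, count in visits.items()}

# Hypothetical encoding of the example graph: page -> pages it links to.
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
freq = simulate_surfer(links, "n", 200_000)
print(freq)
```

The visit frequencies should come out close to 0.4, 0.2, and 0.4 for n, m, and a, which is the PageRank vector scaled to sum to 1.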
Problems on the Real Web. Dead end: a page with no links to send its importance to; all importance “leaks out of” the Web. Crawler trap: a group of one or more pages that have no links out of the group; the group accumulates all the importance of the Web.
Example: Dead End. No link from Microsoft, so Microsoft is a dead end. [Figure: Ne, Am, MS link graph]
Example: Dead End. [Figure: Ne, Am, MS link graph]
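The leak caused by a dead end can be traced numerically. In this sketch the Microsoft column of the matrix is all zeros because Microsoft has no links; the matrix is written by hand in the order [n, m, a], and the helper name `step` is hypothetical.

```python
def step(M, r):
    """One round of importance sharing: r <- M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]

# Microsoft (middle column) has no links, so its column is all zeros.
M_dead = [
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0],
]

r = [1.0, 1.0, 1.0]
totals = []
for _ in range(100):
    r = step(M_dead, r)
    totals.append(sum(r))  # total importance left in the Web after each round

print(totals[0], totals[9], totals[-1])
```

Whatever importance reaches Microsoft is lost in the next round, so the total shrinks toward zero.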
Solution to Dead End. Assume the surfer jumps to a random page when at a dead end. [Figure: Ne, Am, MS link graph]
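One way to express the random jump at a dead end is to replace the dead end's all-zero column with a uniform column. This is a sketch under that reading of the slide; the function names are hypothetical.

```python
def patch_dead_ends(M):
    """Replace every all-zero column (a dead end) with a uniform column,
    so a surfer at a dead end jumps to a page chosen at random."""
    size = len(M)
    patched = [row[:] for row in M]
    for j in range(size):
        if all(M[i][j] == 0 for i in range(size)):
            for i in range(size):
                patched[i][j] = 1.0 / size
    return patched

def iterate(M, rounds):
    """Start with one unit per page and apply r <- M r."""
    r = [1.0] * len(M)
    for _ in range(rounds):
        r = [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]
    return r

# Dead-end example in the order [n, m, a]: Microsoft has no links.
M_dead = [
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0],
]

r = iterate(patch_dead_ends(M_dead), 100)
print([round(x, 4) for x in r], round(sum(r), 4))
```

After patching, every column sums to 1 again, so the total importance stays at 3 instead of leaking away.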
Example: Crawler Trap. Microsoft has only a self-link, so Microsoft is a crawler trap. [Figure: Ne, Am, MS link graph]
Example: Crawler Trap. [Figure: Ne, Am, MS link graph]
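The crawler trap can be traced the same way as the other examples. In this sketch Microsoft links only to itself, so its column sends all of its importance back to Microsoft; the matrix is written by hand in the order [n, m, a].

```python
def iterate(M, rounds):
    """Start with one unit per page and apply r <- M r."""
    r = [1.0] * len(M)
    for _ in range(rounds):
        r = [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]
    return r

# Microsoft (middle column) has only a self-link.
M_trap = [
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]

for rounds in (1, 5, 100):
    print(rounds, [round(x, 4) for x in iterate(M_trap, rounds)])
```

The total stays at 3, but Microsoft ends up holding essentially all of it while Netscape and Amazon go to zero.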
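Crawler Trap: Damping Factor. “Tax” each page some fraction of its importance and distribute the collected tax equally among all pages. The tax rate is the probability that the surfer jumps to a random page instead of following a link. Assuming a 20% tax, each page keeps 80% of the importance it receives over links and gets an equal share of the tax.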
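The 20% tax can be sketched as a modified iteration. The example below assumes the tax is applied to the crawler-trap graph in which Microsoft has only a self-link, and that the total importance is one unit per page, so each page receives exactly the tax rate back in each round; the function name `taxed_pagerank` is hypothetical.

```python
def taxed_pagerank(M, tax=0.2, rounds=200):
    """Iterate r <- (1 - tax) * M r + tax, starting from one unit per page.

    Assumes M has no dead ends, so the total importance stays at one
    unit per page and the equal share of the collected tax equals `tax`.
    """
    size = len(M)
    r = [1.0] * size
    for _ in range(rounds):
        r = [
            (1 - tax) * sum(M[i][j] * r[j] for j in range(size)) + tax
            for i in range(size)
        ]
    return r

# Crawler-trap example in the order [n, m, a]: Microsoft links only to itself.
M_trap = [
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]

r = taxed_pagerank(M_trap)
print([round(x, 4) for x in r])
```

Microsoft still ranks highest (n = 7/11, m = 21/11, a = 5/11), but it no longer absorbs all the importance.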
Link Spam Problem. Q: What if a spammer creates a lot of pages and links them all to a single spam page? PageRank is better than simple link count, but it is still vulnerable to link spam. Q: Any way to avoid link spam?
TrustRank [Gyongyi et al. 2004]. Good pages don’t point to spam pages. Trust a page only if it is linked to by pages you trust. Same as PageRank except for the random jump probability term: the random jump goes only to trusted pages.
TrustRank: Theory [Bianchini et al. 2005]. Consider a set of pages S. [Figure: S and the related sets IN(S), OUT(S), and DP(S)]
TrustRank: Theory [Bianchini et al. 2005]
What Does It Mean? P_S = 0 if B_S = 0 and P_IN = 0. You cannot improve your TrustRank simply by creating more pages and linking among them. To get a non-zero TrustRank, you need to either be trusted or get links from outside.
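The claim can be illustrated with a small sketch of PageRank in which the random jump goes only to trusted pages. The page names, the 0.8 damping value, and the function name `trustrank` are all hypothetical; this is not the authors' implementation, only a toy version of the biased random jump.

```python
def trustrank(links, trusted, damping=0.8, rounds=100):
    """PageRank-style iteration where the random jump lands only on
    trusted pages. Scores are probabilities, not units per page."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # All trust starts on the trusted seed pages.
    score = {p: (1.0 / len(trusted) if p in trusted else 0.0) for p in pages}
    for _ in range(rounds):
        # Random-jump share: only trusted pages receive it.
        new = {p: ((1 - damping) / len(trusted) if p in trusted else 0.0)
               for p in pages}
        # Link-following share: each page splits its score among its children.
        for parent, children in links.items():
            for child in children:
                new[child] += damping * score[parent] / len(children)
        score = new
    return score

# A spam farm that only links within itself, plus two good pages.
farm = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "target"],
    "spam2": ["spam1", "target"],
    "target": ["spam1"],
}
isolated = trustrank(farm, trusted={"good1"})

# Same farm, but now a good page links into it.
linked = dict(farm, good2=["good1", "target"])
reached = trustrank(linked, trusted={"good1"})

print(isolated["target"], reached["target"])
```

With no trust assigned to the farm and no links from outside, every page in the farm stays at zero no matter how many pages it contains; a single link from a page that carries trust is enough to change that.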
Is TrustRank the Ultimate Solution? Not really… Honeypot: a page with good content and hidden links to spam pages; good users link to the honeypot because of its quality content. Blogs, forums, wikis, mailing lists: easy to add spam links. Link exchange: a set of sites exchanging links to boost ranking. A never-ending rat race…
Anti-Spamming at Search Engines. Anchor text: consider what others think about your page; give higher weights to anchors from high-PageRank pages; more difficult to spam. TrustRank: to gain importance, you need to convince many pages under others’ control, or convince the search engines; more difficult to spam. Give inter-site links a higher weight.
Hub and Authority. More detailed evaluation of importance. A page is useful if it has good content, or if it has links to useful pages (a good bookmark). Authority: a page with good content. Hub: a page pointing to pages with good content.
Hub/Authority: Definition. Recursive definition similar to PageRank. Authority pages are linked to by many hub pages. Hub pages link to many authority pages. H(p) = A(p_1) + … + A(p_k), where p_1, …, p_k are the pages that p points to. A(p) = H(p_1) + … + H(p_m), where p_1, …, p_m are the pages pointing to p.
Hub/Authority: Matrix Notation. Web graph matrix A = {a_ij}. Each page i corresponds to row i and column i of the matrix A. a_ij = 1 if page i points to page j; a_ij = 0 otherwise. A is not a stochastic matrix. A^T is similar to the PageRank matrix M, but without the stochastic restriction.
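A small sketch may help connect the matrix A to the hub and authority sums. The function names and the dictionary encoding of links are hypothetical, and the graph is the Netscape (n), Microsoft (m), Amazon (a) example used for PageRank.

```python
def adjacency(links, order):
    """Return A with A[i][j] = 1 if page i points to page j, else 0."""
    index = {page: k for k, page in enumerate(order)}
    size = len(order)
    A = [[0] * size for _ in range(size)]
    for src, targets in links.items():
        for dst in targets:
            A[index[src]][index[dst]] = 1
    return A

def hub_update(A, authority):
    """Hub score of page i: sum of authority scores of the pages i points to."""
    size = len(A)
    return [sum(A[i][j] * authority[j] for j in range(size)) for i in range(size)]

def authority_update(A, hub):
    """Authority score of page j: sum of hub scores of the pages pointing to j."""
    size = len(A)
    return [sum(A[i][j] * hub[i] for i in range(size)) for j in range(size)]

links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
A = adjacency(links, ["n", "m", "a"])
print(A)
print(hub_update(A, [1, 1, 1]), authority_update(A, [1, 1, 1]))
```

The hub update multiplies by A and the authority update multiplies by A^T, and neither matrix divides by the number of links, which is why the scores need scaling when the updates are repeated.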
Example: Web of 1842. [Figure: Ne, Am, MS link graph] Vectors are written in the order [n, m, a].
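Hub/Authority: Iterative Computation. Hub vector h = [H(p_1), …, H(p_N)]^T and authority vector a = [A(p_1), …, A(p_N)]^T: h = λ A a, a = μ A^T h. λ, μ: scaling factors that prevent divergence. Compute h and a iteratively with scaling.
Hub/Authority: Eigenvector. Hub vector h: eigenvector of A A^T. Authority vector a: eigenvector of A^T A.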
Example: Web of 1842. [Figure: Ne, Am, MS link graph]
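The iterative Hub/Authority computation on this example can be sketched as follows. The slides only say that a scaling factor is used; dividing by the largest entry after each update is one possible choice and is an assumption of this sketch, as are the function name `hits` and the hand-written matrix (order [n, m, a], taken from the PageRank example's link structure).

```python
def hits(A, rounds=50):
    """Alternate authority and hub updates, rescaling after each update
    so that the largest score is 1 (an assumed choice of scaling)."""
    size = len(A)
    hub = [1.0] * size
    auth = [1.0] * size
    for _ in range(rounds):
        # Authority of j: sum of hub scores of the pages pointing to j.
        auth = [sum(A[i][j] * hub[i] for i in range(size)) for j in range(size)]
        top = max(auth)
        auth = [x / top for x in auth]
        # Hub of i: sum of authority scores of the pages i points to.
        hub = [sum(A[i][j] * auth[j] for j in range(size)) for i in range(size)]
        top = max(hub)
        hub = [x / top for x in hub]
    return hub, auth

# A[i][j] = 1 if page i points to page j, in the order [n, m, a].
A = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]

hub, auth = hits(A)
print([round(x, 3) for x in hub])
print([round(x, 3) for x in auth])
```

For this particular graph A happens to be symmetric, so the hub and authority vectors come out the same; on a general graph they differ.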
Hub/Authority and Root Set. Apply the equations to a small neighborhood graph (the base set). Start with, say, 100 pages on “bicycling” (the root set). Add the pages pointing to the 100 pages. Add the pages that the 100 pages point to. The pages identified are good “Hubs” and “Authorities” on “bicycling”.
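The expansion from the starting pages to the base set can be sketched as a single pass over the link table. The page names below are made up for illustration, and the function name `base_set` is hypothetical.

```python
def base_set(root, links):
    """Expand a root set with the pages it points to and the pages
    that point into it."""
    base = set(root)
    for page, targets in links.items():
        if page in root:
            base.update(targets)        # pages the root set points to
        if any(t in root for t in targets):
            base.add(page)              # pages pointing into the root set
    return base

# Hypothetical link table: page -> pages it links to.
links = {
    "root1": ["x"],
    "root2": ["root1"],
    "hub": ["root1", "root2"],
    "other": ["y"],
    "x": [],
    "y": [],
}
root = {"root1", "root2"}
print(sorted(base_set(root, links)))
```

The Hub/Authority iteration would then be run only on the links among the pages in the returned set.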
Hub/Authority and Web Community. Hub/Authority is often used to identify Web communities. It gives a nice notion of the “Hubs” and “Authorities” of a community. Hubs and Authorities are often tightly linked to each other.
Any Questions?
Questions Can we apply Hub/Authority to the entire Web like PageRank?
Hub/Authority on the Entire Web? Hub/Authority works well on a topic-specific subset, but works poorly for the whole Web. Easy to spam: 1. Create a page pointing to many authority pages (e.g., Yahoo, Google, etc.); the page becomes a good hub page. 2. On that page, add a link to your home page, which then gains authority from the hub.
Questions Can we apply PageRank to a small base set?
PageRank on a Small Subset. In general, PageRank works better for larger datasets. We may be able to compute a “topic-specific” PageRank. Any other way to get a “topic-specific” PageRank?
Summary: Link-Based Ranking. PageRank (and its TrustRank variation). Hub/Authority.