How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike 3.0 Unported LicenseCreative Commons Attribution-NonCommercial- ShareAlike 3.0 Unported License
Link Analysis Content analysis is useful, but combining with link analysis allow us to rank pages much more successfully 2 popular methods – Sergey Brin and Larry Page’s PageRank – Jon Kleinberg’s Hyperlink-Induced Topic Search (HITS)
What Does a Link Mean? A A B B A recommends B A specifically does not recommend B B is an authoritative reference for something in A A & B are about the same thing (topic locality)
PageRank Developed by Brin and Page (Google) while Ph.D. students at Stanford Links are a recommendation system – The more links that point to you, the more important you are – Inlinks from important pages are weightier than inlinks from unimportant pages – The more outlinks you have, the less weight your links carry Page et al., The PageRank citation ranking: Bringing order to the web, 1998
Random Surfer Model Model helpful for understanding PageRank The Random Surfer starts at a randomly chosen page and selects a link at random to follow PageRank of a page reflects the probability that the surfer lands on that page
Example of Random Surfer Start at: B ¼ probability of going to A ¼ probability of going to C ¼ probability of going to D ¼ probability of going to E Choose: E ½ probability of going to C ½ probability of going to D
Problem 1: Dangling Node What if we go to A? We’re stuck at a dead-end! Solution: Teleport to any other page at random A A C C D D E E B B
Problem 2: Infinite Loop A A X X What if we get stuck in an a cycle? Solution: Teleport to any other page at random Y Y
Rank Sinks Dangling nodes and cycles are called rank sinks Solution is to add a teleportation probability α to every decision α% chance of getting bored and jumping somewhere else, (1- α)% chance of choosing one of the available links α =.15 is typical
PageRank Definition PageRank of page P i B Pi is set of all pages pointing to P i Sum of PR of all pages pointing to P i Number of outlinks from P j Teleportation probability Total num of pages
PageRank Example PR(C) =.15/ × (PR(B)/4 + PR(E)/2) Problem: What is PR(B) and PR(E)? Solution: Give all pages same PR to start (1/|P|) & iteratively calculate new PR
PageRank Example PR(A) = × PR(B)/4 =.0725 PR(B) = × PR(D)/1 =.2 PR(C) = (PR(B)/4 + PR(E)/2) = (.2/4 +.2/2) =.1575 PR(D) = (PR(B)/4 + PR(E)/2) =.1575 PR(E) = (PR(B)/4 + PR(C)/1) =.2425
Calculating PageRank PageRank is computed over and over until it converges, around 20 times Can also be calculated efficiently using matrix multiplication
PageRank Definition as Matrix Stated as a matrix equation where R is the vector of PageRank values and T the matrix for transition probabilities: where T ij is the probability of going from page j to i: if a link from page j to i exists, otherwise Total num of pagesTotal outlinks from page j
PageRank Matrix Example PR A PR B PR C PR D PR E = = 0.15/5 + (1 – 0.15)/4 = Init PR values 1/|P| No link from C to A so value = 0.15/5
PageRank Issues Richer-get-richer phenomenon – May be difficult for new pages with few inlinks to compete with older, highly linked pages with high PageRank – Could promote small fraction of new pages at random 1 or add decay factor to links Study 2 showed just counting number of inlinks gives similar ranking as PageRank – Study was on small scale and pages were not necessarily “typical” – Counting inlinks more susceptible to spamming 1 Pandey et al., Shuffling a stacked deck, VLDB Amento et al., Does “authority” mean quality?
HITS Hyperlink-induced topic search (HITS) by Kleinberg 1 Hub: page with outlinks to informative web pages Authority: informative/authoritative page with many inlinks Recursive definition: – Good hubs point to good authorities – Good authorities are pointed to by good hubs 1 Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM, 1999
Root and Base Sets D D E E F F H H I I Base set G G Resulting subgraph between 1K – 5K pages Root set A A B B C C
Calculate H and A E is set of all directed edges in subgraph e qp is edge from page q to p inlinks outlinks
Calculating H and A H and A scores computed repetitively until they converge, about iterations Can also be calculated efficiently using matrix multiplication
Example Subgraph A A B B C C D D Adjacency matrix A B C D ABCDABCD
Auth Scores A AAABACADAAABACAD = = Authority scores Transposed adjacency matrix Init hub scores Resulting auth scores Best authority x
Hub Scores HAHBHCHDHAHBHCHD = = Hub scores Adjacency matrix Init auth scores Resulting hub scores Best hub x
Problems with HITS Has not been widely used – IBM holds patent Query dependence – Later implementations have made query independent Topic drift – Pages in expanded base set may not be on same topic as root pages – Solution is to examine link text when expanding
Link Spam There is a strong economic incentive to rank highly in a SERP White hat SEO firms follow published guidelines to improve customer rankings 1 To boost PageRank, black hat SEO practices include: – Building elaborate link farms – Exchanging reciprocal links – Posting links on blogs and forums 1 Google’s Webmaster Guidelines
Combating Link Spam Sites like Wikipedia can discourage links than only promote PageRank by using “nofollow” Go here! Davison 1 identified 75 features for comparing source and destination pages – Overlap, identical page titles, same links, etc. TrustRank 2 – Bias teleportation in PageRank to set of trusted web pages 1 Davison, Recognizing nepotistic links on the web, Gyöngyi et al., Combating web spam with TrustRank, VLDB 2004
Combating Link Spam cont. SpamRank 1 – PageRank for whole Web has power-law distribution – Penalize pages whose supporting pages do not approximate power-law dist Anti-TrustRank 2 – Give high weight to known spam pages and propagate values using PageRank – New pages can be classified spam if large contribution of PageRank from known spam pages or if high Anti- TrustRank 1 Benczúr et al., SpamRank – fully automated link spam detection, AIRWeb Krishnan & Raj, Web spam detection with Anti-TrustRank, AIRWeb 2006