Link Analysis 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 7: Link Analysis Mining Massive Datasets.

2. Link Analysis Algorithms
- PageRank
- Hubs and Authorities
- Topic-Sensitive PageRank
- Spam Detection Algorithms
- Other interesting topics we won't cover
  - Detecting duplicates and mirrors
  - Mining for communities (community detection); refer to Chapter 10 of the textbook

3. Outline
- PageRank
- Topic-Sensitive PageRank
- Hubs and Authorities
- Spam Detection

4. Ranking web pages
- Web pages are not equally “important”
- Inlinks as votes
  - One example page has 23,400 inlinks
  - Another has 1 inlink
- Are all inlinks equal?
  - Recursive question!

5. Simple recursive formulation
- Each link's vote is proportional to the importance of its source page
- If page P with importance x has n outlinks, each link gets x/n votes
- Page P's own importance is the sum of the votes on its inlinks

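6. Simple “flow” model
The web in 1839: three pages, Yahoo (y), Amazon (a), and M'soft (m). Yahoo links to itself and to Amazon (y/2 on each link), Amazon links to Yahoo and to M'soft (a/2 on each link), and M'soft links to Amazon (m).
$$y = y/2 + a/2, \qquad a = y/2 + m, \qquad m = a/2$$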

7. Solving the flow equations
- 3 equations, 3 unknowns, no constants
  - No unique solution
  - All solutions equivalent modulo scale factor
- Additional constraint forces uniqueness
  - $y + a + m = 1$
  - $y = 2/5$, $a = 2/5$, $m = 1/5$
- Gaussian elimination works for small examples, but we need a better method for large graphs
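The small system above can be checked by hand or with a few lines of code. The following is a minimal sketch, not part of the original slides: the helper name `solve` and the choice to replace the redundant equation m = a/2 with the constraint y + a + m = 1 are assumptions made for illustration.

```python
from fractions import Fraction as F

def solve(aug):
    """Gauss-Jordan elimination on an augmented matrix of Fractions (hypothetical helper)."""
    n = len(aug)
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Unknowns are (y, a, m). The flow equation m = a/2 is redundant,
# so it is replaced by the normalization constraint.
system = [
    [F(1, 2), F(-1, 2), F(0), F(0)],   # y - (y/2 + a/2) = 0
    [F(-1, 2), F(1), F(-1), F(0)],     # a - (y/2 + m) = 0
    [F(1), F(1), F(1), F(1)],          # y + a + m = 1
]
y, a, m = solve(system)
print(y, a, m)
```

Elimination costs on the order of N^3 operations, which is why it is only practical for toy graphs like this one.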

8. Matrix formulation
- Matrix M has one row and one column for each web page
- Suppose page j has n outlinks
  - If j → i, then $M_{ij} = 1/n$
  - Else $M_{ij} = 0$
- M is a column-stochastic matrix
  - Columns sum to 1
- Suppose r is a vector with one entry per web page
  - $r_i$ is the importance score of page i
  - Call it the rank vector
  - $|r| = 1$

9. Example
Suppose page j links to 3 pages, including i. Then column j of M has $M_{ij} = 1/3$, so in $r = Mr$ page j contributes $r_j/3$ to $r_i$.

10. Eigenvector formulation
- The flow equations can be written $r = Mr$
- So the rank vector is an eigenvector of the stochastic web matrix
  - In fact, its first or principal eigenvector, with corresponding eigenvalue 1

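11. Example
Yahoo (y), Amazon (a), M'soft (m), with columns and rows of M ordered y, a, m:
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$
The flow equations $y = y/2 + a/2$, $a = y/2 + m$, $m = a/2$ are $r = Mr$:
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}$$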

12. Power Iteration method
- Simple iterative scheme (aka relaxation)
- Suppose there are N web pages
- Initialize: $r_0 = [1/N, \ldots, 1/N]^T$
- Iterate: $r_{k+1} = M r_k$
- Stop when $|r_{k+1} - r_k|_1 < \epsilon$
  - $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
  - Can use any other vector norm, e.g., Euclidean

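13. Power Iteration Example
Yahoo (y), Amazon (a), M'soft (m):
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix},\ \begin{pmatrix} 1/3 \\ 1/2 \\ 1/6 \end{pmatrix},\ \begin{pmatrix} 5/12 \\ 1/3 \\ 1/4 \end{pmatrix},\ \begin{pmatrix} 3/8 \\ 11/24 \\ 1/6 \end{pmatrix},\ \ldots,\ \begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}$$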
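The power iteration scheme can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `mat_vec` and `power_iterate`, the tolerance, and the iteration cap are assumptions, not part of the slides. Exact fractions are used for the first three steps so they can be compared with the example, and floats for the run to convergence.

```python
from fractions import Fraction as F

def mat_vec(M, r):
    """Return the matrix-vector product M r for a list-of-rows matrix."""
    return [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in M]

def power_iterate(M, eps=1e-12, max_iter=10_000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = mat_vec(M, r)
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic matrix for Yahoo (y), Amazon (a), M'soft (m).
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

# Exact first three iterates starting from [1/3, 1/3, 1/3].
r = [F(1, 3)] * 3
steps = []
for _ in range(3):
    r = mat_vec(M, r)
    steps.append(r)

# Floating-point run to convergence.
limit = power_iterate([[float(v) for v in row] for row in M])
print(steps, limit)
```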

14. Random Walk Interpretation
- Imagine a random web surfer
  - At any time t, the surfer is on some page P
  - At time t+1, the surfer follows an outlink from P uniformly at random
  - Ends up on some page Q linked from P
  - Process repeats indefinitely
- Let p(t) be a vector whose i-th component is the probability that the surfer is at page i at time t
  - p(t) is a probability distribution on pages

15. The stationary distribution
- Where is the surfer at time t+1?
  - Follows a link uniformly at random
  - $p(t+1) = M p(t)$
- Suppose the random walk reaches a state such that $p(t+1) = M p(t) = p(t)$
  - Then p(t) is called a stationary distribution for the random walk
- Our rank vector r satisfies $r = Mr$
  - So it is a stationary distribution for the random surfer

16. Existence and Uniqueness
A central result from the theory of random walks (aka Markov processes): for graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0.

17. Spider traps
- A group of pages is a spider trap if there are no links from within the group to outside the group
  - Random surfer gets trapped
- Spider traps violate the conditions needed for the random walk theorem

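18. Microsoft becomes a spider trap
M'soft now links only to itself:
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}$$
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\ \begin{pmatrix} 1 \\ 1/2 \\ 3/2 \end{pmatrix},\ \begin{pmatrix} 3/4 \\ 1/2 \\ 7/4 \end{pmatrix},\ \begin{pmatrix} 5/8 \\ 3/8 \\ 2 \end{pmatrix},\ \ldots,\ \begin{pmatrix} 0 \\ 0 \\ 3 \end{pmatrix}$$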

19. Random teleports
- The Google solution for spider traps
- At each time step, the random surfer has two options:
  - With probability β, follow a link at random
  - With probability 1 - β, jump to some page uniformly at random
  - Common values for β are in the range 0.8 to 0.9
- Surfer will teleport out of a spider trap within a few time steps

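20. Random teleports (β = 0.8)
Each outlink that carried weight 1/2 now carries 0.8 · 1/2, and every page receives a teleport link of weight 0.2 · 1/3. For Yahoo's column: $0.8 \cdot (1/2, 1/2, 0)^T + 0.2 \cdot (1/3, 1/3, 1/3)^T$. For the whole matrix (rows and columns ordered y, a, m):
$$0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$$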

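21. Random teleports (β = 0.8)
$$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$$
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\ \begin{pmatrix} 1.00 \\ 0.60 \\ 1.40 \end{pmatrix},\ \begin{pmatrix} 0.84 \\ 0.60 \\ 1.56 \end{pmatrix},\ \begin{pmatrix} 0.776 \\ 0.536 \\ 1.688 \end{pmatrix},\ \ldots,\ \begin{pmatrix} 7/11 \\ 5/11 \\ 21/11 \end{pmatrix}$$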
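The teleport construction and the iterates in this example can be reproduced exactly with rational arithmetic. This is a minimal sketch under stated assumptions: `teleport_matrix` is a hypothetical helper name, and β = 4/5 matches the 0.8/0.2 weights in the example.

```python
from fractions import Fraction as F

def teleport_matrix(M, beta):
    """Return A with A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    n = len(M)
    return [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

# M'soft (third column) is a spider trap: it links only to itself.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(1)]]

A = teleport_matrix(M, F(4, 5))

# A should be column-stochastic.
column_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]

# First three iterates of r <- A r starting from (1, 1, 1).
r = [F(1)] * 3
iterates = []
for _ in range(3):
    r = [sum(a * x for a, x in zip(row, r)) for row in A]
    iterates.append(r)

# The limit vector should be a fixed point of A.
fixed = [F(7, 11), F(5, 11), F(21, 11)]
image = [sum(a * x for a, x in zip(row, fixed)) for row in A]
print(A, iterates, image == fixed)
```

M'soft still ends up with the largest score, but Yahoo and Amazon no longer drain to zero.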

22. Matrix formulation
- Suppose there are N pages
  - Consider a page j, with set of outlinks O(j)
  - We have $M_{ij} = 1/|O(j)|$ when j → i and $M_{ij} = 0$ otherwise
  - The random teleport is equivalent to
    - adding a teleport link from j to every other page with probability $(1-\beta)/N$
    - reducing the probability of following each outlink from $1/|O(j)|$ to $\beta/|O(j)|$
    - Equivalent: tax each page a fraction $(1-\beta)$ of its score and redistribute evenly

23. PageRank
- Construct the N × N matrix A as follows
  - $A_{ij} = \beta M_{ij} + (1-\beta)/N$
- Verify that A is a stochastic matrix
- The PageRank vector r is the principal eigenvector of this matrix
  - satisfying $r = Ar$
- Equivalently, r is the stationary distribution of the random walk with teleports

24. Dead ends
- Pages with no outlinks are “dead ends” for the random surfer
  - Nowhere to go on next step

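25. Microsoft becomes a dead end
M'soft now has no outlinks, so its column of M is all zeros:
$$0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 1/15 \end{pmatrix}$$
Non-stochastic! The third column sums to 3/15, so rank leaks out at every step:
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\ \begin{pmatrix} 1 \\ 0.6 \\ 0.6 \end{pmatrix},\ \begin{pmatrix} 0.787 \\ 0.547 \\ 0.387 \end{pmatrix},\ \begin{pmatrix} 0.648 \\ 0.430 \\ 0.333 \end{pmatrix},\ \ldots,\ \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$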

26. Dealing with dead ends
- Teleport
  - Follow random teleport links with probability 1.0 from dead ends
  - Adjust matrix accordingly
- Prune and propagate
  - Preprocess the graph to eliminate dead ends
  - Might require multiple passes
  - Compute PageRank on reduced graph
  - Approximate values for dead ends by propagating values from reduced graph
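The pruning half of "prune and propagate" might look like the sketch below. It covers only the preprocessing passes, not the later propagation of scores back to the removed pages. The function name `prune_dead_ends`, the adjacency-dict representation, and the extra page `d` in the toy graph are all assumptions made for illustration.

```python
def prune_dead_ends(out_links):
    """Repeatedly remove pages with no outlinks.

    out_links maps each page to an iterable of the pages it links to;
    every page is assumed to appear as a key. Returns the reduced graph
    and the pages removed, in removal order.
    """
    graph = {u: set(vs) for u, vs in out_links.items()}
    removed = []
    while True:
        dead = [u for u, vs in graph.items() if not vs]
        if not dead:
            return graph, removed
        for u in dead:
            del graph[u]
            removed.append(u)
        # Removing a dead end may turn its predecessors into dead ends,
        # which is why multiple passes can be needed.
        for vs in graph.values():
            vs.difference_update(dead)

# Hypothetical graph: m links only to d, and d has no outlinks.
toy = {"y": ["y", "a"], "a": ["y", "m"], "m": ["d"], "d": []}
reduced, removed = prune_dead_ends(toy)
print(reduced, removed)
```

Here the first pass removes `d`, which makes `m` a dead end that the second pass removes.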

27. Computing PageRank
- Key step is matrix-vector multiplication
  - $r^{new} = A r^{old}$
- Easy if we have enough main memory to hold A, $r^{old}$, $r^{new}$
- Say N = 1 billion pages
  - We need 4 bytes for each entry (say)
  - 2 billion entries for the two vectors, approx 8 GB
  - Matrix A has $N^2$ entries
    - $10^{18}$ is a large number!

28. Rearranging the equation
$r = Ar$, where $A_{ij} = \beta M_{ij} + (1-\beta)/N$
$$\begin{aligned}
r_i &= \sum_{1 \le j \le N} A_{ij} r_j \\
    &= \sum_{1 \le j \le N} \left[ \beta M_{ij} + (1-\beta)/N \right] r_j \\
    &= \beta \sum_{1 \le j \le N} M_{ij} r_j + \frac{1-\beta}{N} \sum_{1 \le j \le N} r_j \\
    &= \beta \sum_{1 \le j \le N} M_{ij} r_j + \frac{1-\beta}{N}, \quad \text{since } |r| = 1
\end{aligned}$$
$$r = \beta M r + \left[ (1-\beta)/N \right]_N$$
where $[x]_N$ is an N-vector with all entries x

29. Sparse matrix formulation
- We can rearrange the PageRank equation:
  - $r = \beta M r + [(1-\beta)/N]_N$
  - $[(1-\beta)/N]_N$ is an N-vector with all entries $(1-\beta)/N$
- M is a sparse matrix!
  - 10 links per node, approx 10N entries
- So in each iteration, we need to:
  - Compute $r^{new} = \beta M r^{old}$
  - Add a constant value $(1-\beta)/N$ to each entry in $r^{new}$

30. Sparse matrix encoding
- Encode sparse matrix using only nonzero entries
  - Space proportional roughly to number of links
  - say 10N, or 4 * 10 * 1 billion = 40 GB
  - still won't fit in memory, but will fit on disk

source node | degree | destination nodes
0           | 3      | 1, 5, 7
1           | 5      | 17, 64, 113, 117, 245
2           | 2      | 13, 23

31. Basic Algorithm
- Assume we have enough RAM to fit $r^{new}$, plus some working memory
  - Store $r^{old}$ and matrix M on disk
Basic Algorithm:
- Initialize: $r^{old} = [1/N]_N$
- Iterate:
  - Update: perform a sequential scan of M and $r^{old}$ to update $r^{new}$
  - Write out $r^{new}$ to disk as $r^{old}$ for the next iteration
  - Every few iterations, compute $|r^{new} - r^{old}|$ and stop if it is below threshold
    - Need to read both vectors into memory

32. Update step

src | degree | destination
0   | 3      | 1, 5, 6
1   | 4      | 17, 64, 113, 117
2   | 2      | 13, 23

[Figure: vectors r_new and r_old, each with entries 0 to 6]

Initialize all entries of r_new to (1-β)/N
For each page p (out-degree n):
    Read into memory: p, n, dest_1, …, dest_n, r_old(p)
    for j = 1..n:
        r_new(dest_j) += β * r_old(p) / n
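The update step translates almost line for line into code. The sketch below keeps everything in memory, whereas the slides assume M and r_old are streamed from disk. The function name `update_step` and the three-page test graph (the spider-trap example, with pages numbered y = 0, a = 1, m = 2) are assumptions made for illustration.

```python
def update_step(links, r_old, beta):
    """One PageRank iteration over a sparse encoding of M.

    links is an iterable of (source, out_degree, destinations) records,
    mirroring the src / degree / destination layout of the encoding.
    """
    n = len(r_old)
    # Every page starts with its share of the teleport mass.
    r_new = [(1 - beta) / n] * n
    for src, degree, dests in links:
        share = beta * r_old[src] / degree
        for d in dests:
            r_new[d] += share
    return r_new

# Spider-trap example: y -> {y, a}, a -> {y, m}, m -> {m}.
links = [(0, 2, [0, 1]), (1, 2, [0, 2]), (2, 1, [2])]
r_new = update_step(links, [1 / 3, 1 / 3, 1 / 3], beta=0.8)
print(r_new)
```

Only r_new needs random access here. The link records and r_old are read once, in order, which is what makes a sequential disk scan workable.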

33. Analysis
- In each iteration, we have to:
  - Read $r^{old}$ and M
  - Write $r^{new}$ back to disk
  - IO cost = 2|r| + |M|
- What if we had enough memory to fit both $r^{new}$ and $r^{old}$?
- What if we could not even fit $r^{new}$ in memory?
  - 10 billion pages

34. Strip-based update
Problem: thrashing

35. Block Update algorithm
[Figure not recoverable]

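36. Block Update algorithm
r_new and r_old each have entries 0 to 3. M is split into one stripe per block of r_new. Each record keeps the source's full out-degree but lists only the destinations in that block.

Stripe for r_new entries 0 and 1:
src | degree | destination
0   | 3      | 0, 1
1   | 2      | 0
2   | 1      | 0
3   | 2      | 1

Stripe for r_new entries 2 and 3:
src | degree | destination
0   | 3      | 3
1   | 2      | 2
3   | 2      | 2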

37. Block Update algorithm
- Some additional overhead
  - But usually worth it
- Cost per iteration
  - $|M|(1+\epsilon) + (k+1)|r|$
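A block-based update can be sketched as follows. This is an in-memory illustration of the idea, not the disk-based algorithm itself. The names `make_stripes` and `block_update` are hypothetical, and the four-page link list is an assumption reconstructed from the stripe tables in the example.

```python
def make_stripes(links, n, k):
    """Split (src, degree, dests) records into k stripes by destination block."""
    block = -(-n // k)  # ceil(n / k): number of r_new entries per block
    stripes = [[] for _ in range(k)]
    for src, degree, dests in links:
        by_block = {}
        for d in dests:
            by_block.setdefault(d // block, []).append(d)
        for b, ds in by_block.items():
            # Keep the full out-degree so each vote is still r_old[src] / degree.
            stripes[b].append((src, degree, ds))
    return stripes, block

def block_update(stripes, block, r_old, beta):
    """Compute r_new one block at a time, scanning only that block's stripe."""
    n = len(r_old)
    r_new = []
    for b, stripe in enumerate(stripes):
        lo = b * block
        size = min(block, n - lo)
        part = [(1 - beta) / n] * size   # only this block of r_new is "in memory"
        for src, degree, dests in stripe:
            share = beta * r_old[src] / degree
            for d in dests:
                part[d - lo] += share
        r_new.extend(part)               # stands in for writing the block to disk
    return r_new

# Assumed four-page graph: 0 -> {0, 1, 3}, 1 -> {0, 2}, 2 -> {0}, 3 -> {1, 2}.
links = [(0, 3, [0, 1, 3]), (1, 2, [0, 2]), (2, 1, [0]), (3, 2, [1, 2])]
stripes, block = make_stripes(links, n=4, k=2)
r_new = block_update(stripes, block, [0.25] * 4, beta=0.8)
print(stripes, r_new)
```

Each stripe repeats the source and degree fields, which is the extra overhead on |M|. In return, r_old is read once per block rather than once per iteration.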

38. Outline
- PageRank
- Topic-Sensitive PageRank
- Hubs and Authorities
- Spam Detection

39. Some problems with PageRank
- Measures generic popularity of a page
  - Biased against topic-specific authorities
  - Ambiguous queries, e.g., jaguar
- Uses a single measure of importance
  - Other models, e.g., hubs-and-authorities
- Susceptible to link spam
  - Artificial link topographies created in order to boost page rank

40. Topic-Sensitive PageRank
- Instead of generic popularity, can we measure popularity within a topic?
  - E.g., computer science, health
- Bias the random walk
  - When the random walker teleports, he picks a page from a set S of web pages
  - S contains only pages that are relevant to the topic
  - E.g., Open Directory (DMOZ) pages for a given topic
- For each teleport set S, we get a different rank vector $r_S$

41. Matrix formulation
- $A_{ij} = \beta M_{ij} + (1-\beta)/|S|$ if i is in S
- $A_{ij} = \beta M_{ij}$ otherwise
- Show that A is stochastic
- We have weighted all pages in the teleport set S equally
  - Could also assign different weights to them

42. Example
Suppose S = {1}, β = 0.8.
[Figure: four-node graph (nodes 1 to 4) with weighted edges. Only the rows for nodes 3 and 4 of the iteration table are recoverable.]

Node | Iteration 0 | 1   | 2    | … | stable
3    | 0           | 0.4 | 0.08 | … | 0.327
4    | 0           | 0   | 0.32 | … | 0.261

Note how we initialize the PageRank vector differently from the unbiased PageRank case.
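The biased walk in this example can be sketched as below. The edge list is an assumption: the figure did not survive, so the graph 1 → {2, 3}, 2 → {1}, 3 → {4}, 4 → {3} was chosen because it reproduces the surviving table rows for nodes 3 and 4. The function name `topic_sensitive_step` is likewise hypothetical.

```python
def topic_sensitive_step(out_links, r, teleport_set, beta):
    """One iteration of PageRank with teleports restricted to teleport_set."""
    r_new = {u: 0.0 for u in out_links}
    # All teleport mass goes to the pages in S, split equally.
    for u in teleport_set:
        r_new[u] += (1 - beta) / len(teleport_set)
    # Each page passes a beta-discounted, equal share to its outlinks.
    for u, dests in out_links.items():
        for v in dests:
            r_new[v] += beta * r[u] / len(dests)
    return r_new

# Assumed graph, consistent with the surviving rows of the table.
out_links = {1: [2, 3], 2: [1], 3: [4], 4: [3]}
S = {1}

# The initial vector puts all mass on the teleport set.
r = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0}
history = []
for _ in range(200):
    r = topic_sensitive_step(out_links, r, S, beta=0.8)
    history.append(r)

print(history[0], history[1], history[-1])
```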

43. How well does TSPR work?
- Experimental results [Haveliwala 2000]
  - Picked 16 topics
    - Teleport sets determined using DMOZ
    - E.g., arts, business, sports, …
  - “Blind study” using volunteers
    - 35 test queries
    - Results ranked using PageRank and TSPR of most closely related topic
    - E.g., bicycling using Sports ranking
  - In most cases volunteers preferred TSPR ranking

44. Which topic ranking to use?
- User can pick from a menu
- Use Bayesian classification schemes to classify query into a topic
- Can use the context of the query
  - E.g., query is launched from a web page talking about a known topic
  - History of queries, e.g., “basketball” followed by “jordan”
- User context, e.g., user's My Yahoo settings, bookmarks, …

45. Outline
- PageRank
- Topic-Sensitive PageRank
- Hubs and Authorities
- Spam Detection

46. Hubs and Authorities
- Suppose we are given a collection of documents on some broad topic
  - e.g., stanford, evolution, iraq
  - perhaps obtained through a text search
- Can we organize these documents in some manner?
  - PageRank offers one solution
  - HITS (Hypertext-Induced Topic Selection) is another
    - proposed at approximately the same time (1998)

47. HITS Model
- Interesting documents fall into two classes
- Authorities are pages containing useful information
  - course home pages
  - home pages of auto manufacturers
- Hubs are pages that link to authorities
  - course bulletin
  - list of US auto manufacturers

48. Idealized view
[Figure: hubs on one side linking to authorities on the other]

49. Mutually recursive definition
- A good hub links to many good authorities
- A good authority is linked from many good hubs
- Model using two scores for each node
  - Hub score and Authority score
  - Represented as vectors h and a

50. Transition Matrix A
- HITS uses a matrix A[i, j] = 1 if page i links to page j, 0 if not
- $A^T$, the transpose of A, is similar to the PageRank matrix M, but $A^T$ has 1's where M has fractions

51. Example
Yahoo (y), Amazon (a), M'soft (m), with rows and columns ordered y, a, m:
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

52. Hub and Authority Equations
- The hub score of page P is proportional to the sum of the authority scores of the pages it links to
  - $h = \lambda A a$
  - Constant λ is a scale factor
- The authority score of page P is proportional to the sum of the hub scores of the pages it is linked from
  - $a = \mu A^T h$
  - Constant μ is a scale factor

53. Iterative algorithm
- Initialize h, a to all 1's
- $h = Aa$
  - Scale h so that its max entry is 1.0
- $a = A^T h$
  - Scale a so that its max entry is 1.0
- Continue until h, a converge

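54. Example
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
$$\begin{pmatrix} a(\text{yahoo}) \\ a(\text{amazon}) \\ a(\text{m'soft}) \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\ \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\ \begin{pmatrix} 1 \\ 4/5 \\ 1 \end{pmatrix},\ \begin{pmatrix} 1 \\ 0.75 \\ 1 \end{pmatrix},\ \ldots,\ \begin{pmatrix} 1 \\ 0.732 \\ 1 \end{pmatrix}$$
$$\begin{pmatrix} h(\text{yahoo}) \\ h(\text{amazon}) \\ h(\text{m'soft}) \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\ \begin{pmatrix} 1 \\ 2/3 \\ 1/3 \end{pmatrix},\ \begin{pmatrix} 1 \\ 0.71 \\ 0.29 \end{pmatrix},\ \begin{pmatrix} 1 \\ 0.73 \\ 0.27 \end{pmatrix},\ \ldots,\ \begin{pmatrix} 1.000 \\ 0.732 \\ 0.268 \end{pmatrix}$$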
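The dual iteration can be sketched directly from the algorithm's steps. This is a minimal illustration rather than a tuned implementation: the function name `hits` and the fixed iteration count (used instead of a convergence test) are assumptions, and the graph is assumed to contain at least one link so the scaling never divides by zero.

```python
def hits(A, iterations=50):
    """Return (h, a) after alternating h = A a and a = A^T h, each scaled to max 1."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # Hub score: sum of authority scores of the pages this page links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        top = max(h)
        h = [x / top for x in h]
        # Authority score: sum of hub scores of the pages linking to this page.
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        top = max(a)
        a = [x / top for x in a]
    return h, a

# Rows and columns ordered Yahoo, Amazon, M'soft.
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a = hits(A)
print([round(x, 3) for x in h], [round(x, 3) for x in a])
```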

55. Existence and Uniqueness
$$h = \lambda A a, \qquad a = \mu A^T h$$
$$h = \lambda \mu A A^T h, \qquad a = \lambda \mu A^T A a$$
Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
- h* is the principal eigenvector of the matrix $A A^T$
- a* is the principal eigenvector of the matrix $A^T A$

56. Bipartite cores
[Figure: hubs and authorities forming two bipartite cores: the most densely connected core (primary core) and a less densely connected core (secondary core)]

57. Secondary cores
- A single topic can have many bipartite cores
  - corresponding to different meanings, or points of view
  - abortion: pro-choice, pro-life
  - evolution: darwinian, intelligent design
  - jaguar: auto, Mac, NFL team, panthera onca
- How to find such secondary cores?

58. Non-primary eigenvectors
- $A A^T$ and $A^T A$ have the same set of eigenvalues
  - An eigenpair is the pair of eigenvectors with the same eigenvalue
  - The primary eigenpair (largest eigenvalue) is what we get from the iterative algorithm
- Non-primary eigenpairs correspond to other bipartite cores
  - The eigenvalue is a measure of the density of links in the core

59. Finding secondary cores
- Once we find the primary core, we can remove its links from the graph
- Repeat HITS algorithm on residual graph to find the next bipartite core
- Technically, not exactly equivalent to non-primary eigenpair model

60. Creating the graph for HITS
- We need a well-connected graph of pages for HITS to work well

61. PageRank and HITS
- PageRank and HITS are two solutions to the same problem
  - What is the value of an inlink from S to D?
  - In the PageRank model, the value of the link depends on the links into S
  - In the HITS model, it depends on the value of the other links out of S
- The destinies of PageRank and HITS post-1998 were very different
  - Why?

62. Outline
- PageRank
- Topic-Sensitive PageRank
- Hubs and Authorities
- Spam Detection

63. Web Spam
- Search has become the default gateway to the web
- Very high premium to appear on the first page of search results
  - e.g., e-commerce sites
  - advertising-driven sites

64. What is web spam?
- Spamming = any deliberate action solely in order to boost a web page's position in search engine results, incommensurate with the page's real value
- Spam = web pages that are the result of spamming
- This is a very broad definition
  - SEO industry might disagree!
  - SEO = search engine optimization
- Approximately 10-15% of web pages are spam

65. Web Spam Taxonomy
- We follow the treatment by Gyongyi and Garcia-Molina [2004]
- Boosting techniques
  - Techniques for achieving high relevance/importance for a web page
- Hiding techniques
  - Techniques to hide the use of boosting
    - From humans and web crawlers

66. Boosting techniques
- Term spamming
  - Manipulating the text of web pages in order to appear relevant to queries
- Link spamming
  - Creating link structures that boost page rank or hubs and authorities scores

67. Term Spamming
- Repetition
  - of one or a few specific terms, e.g., free, cheap, viagra
  - Goal is to subvert TF.IDF ranking schemes
- Dumping
  - of a large number of unrelated terms
  - e.g., copy entire dictionaries
- Weaving
  - Copy legitimate pages and insert spam terms at random positions
- Phrase Stitching
  - Glue together sentences and phrases from different sources

68. Link spam
- Three kinds of web pages from a spammer's point of view
  - Inaccessible pages
  - Accessible pages
    - e.g., web log comments pages
    - spammer can post links to his pages
  - Own pages
    - Completely controlled by spammer
    - May span multiple domain names

69. Link Farms
- Spammer's goal
  - Maximize the page rank of target page t
- Technique
  - Get as many links from accessible pages as possible to target page t
  - Construct “link farm” to get page rank multiplier effect

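70. Link Farms
[Figure: accessible pages link to target page t; t links to each of the spammer's own pages 1, 2, …, M, and each of them links back to t; inaccessible pages are unaffected]
One of the most common and effective organizations for a link farm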

71. Analysis
Suppose rank contributed by accessible pages = x
Let page rank of target page = y
Rank of each “farm” page = $\beta y/M + (1-\beta)/N$
$$\begin{aligned}
y &= x + \beta M \left[ \beta y/M + (1-\beta)/N \right] + (1-\beta)/N \\
  &= x + \beta^2 y + \beta(1-\beta) M/N + (1-\beta)/N
\end{aligned}$$
The last term, $(1-\beta)/N$, is very small; ignore it. Then
$$y = \frac{x}{1-\beta^2} + c\,\frac{M}{N}, \qquad \text{where } c = \frac{\beta}{1+\beta}$$

72. Analysis
- $y = x/(1-\beta^2) + cM/N$ where $c = \beta/(1+\beta)$
- For β = 0.85, $1/(1-\beta^2) = 3.6$
  - Multiplier effect for “acquired” page rank
  - By making M large, we can make y as large as we want
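The closed form above is easy to evaluate numerically. The sketch below is illustrative only: the function name `farm_target_rank` and the sample values of x, M, and N are assumptions. It also checks the closed form against the recurrence it was derived from, with the small $(1-\beta)/N$ term dropped.

```python
def farm_target_rank(x, M, N, beta=0.85):
    """Approximate rank y of the target page of a link farm with M own pages."""
    multiplier = 1 / (1 - beta ** 2)   # amplification of externally acquired rank x
    c = beta / (1 + beta)              # coefficient on the farm's share M/N
    return multiplier * x + c * M / N

beta = 0.85
multiplier = 1 / (1 - beta ** 2)

# Hypothetical numbers: rank x from accessible pages, M farm pages, N pages on the web.
x, M, N = 1e-6, 10_000, 1_000_000_000
y = farm_target_rank(x, M, N, beta)

# Residual of the recurrence y = x + beta^2 * y + beta * (1 - beta) * M / N.
residual = y - (x + beta ** 2 * y + beta * (1 - beta) * M / N)
print(round(multiplier, 1), y, residual)
```

Doubling M adds another cM/N to y, so the spammer's cost is roughly linear in the rank gained.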

73. Detecting Spam
- Term spamming
  - Analyze text using statistical methods, e.g., Naïve Bayes classifiers
  - Similar to email spam filtering
  - Also useful: detecting approximate duplicate pages
- Link spamming
  - Open research area
  - One approach: TrustRank

74. TrustRank idea
- Basic principle: approximate isolation
  - It is rare for a “good” page to point to a “bad” (spam) page
- Sample a set of “seed pages” from the web
- Have an oracle (human) identify the good pages and the spam pages in the seed set
  - Expensive task, so must make seed set as small as possible

75. Trust Propagation
- Call the subset of seed pages that are identified as “good” the “trusted pages”
- Set trust of each trusted page to 1
- Propagate trust through links
  - Each page gets a trust value between 0 and 1
  - Use a threshold value and mark all pages below the trust threshold as spam

76. Example
[Figure: example graph with nodes 1 to 7, each marked good or bad]

77. Rules for trust propagation
- Trust attenuation
  - The degree of trust conferred by a trusted page decreases with distance
- Trust splitting
  - The larger the number of outlinks from a page, the less scrutiny the page author gives each outlink
  - Trust is “split” across outlinks

78. Simple model
- Suppose trust of page p is t(p)
  - Set of outlinks O(p)
- For each q in O(p), p confers the trust
  - $\beta\, t(p)/|O(p)|$ for $0 < \beta < 1$
- Trust is additive
  - Trust of p is the sum of the trust conferred on p by all its inlinked pages
- Note similarity to Topic-Sensitive PageRank
  - Within a scaling factor, trust rank = biased page rank with trusted pages as teleport set

79. Picking the seed set
- Two conflicting considerations
  - Human has to inspect each seed page, so seed set must be as small as possible
  - Must ensure every “good page” gets adequate trust rank, so we need to make all good pages reachable from the seed set by short paths

80. Approaches to picking seed set
- Suppose we want to pick a seed set of k pages
- PageRank
  - Pick the top k pages by page rank
  - Assume high page rank pages are close to other highly ranked pages
  - We care more about high page rank “good” pages

81. Inverse page rank
- Pick the pages with the maximum number of outlinks
- Can make it recursive
  - Pick pages that link to pages with many outlinks
- Formalize as “inverse page rank”
  - Construct graph G' by reversing each edge in web graph G
  - Page rank in G' is inverse page rank in G
- Pick top k pages by inverse page rank

82. Spam Mass
- In the TrustRank model, we start with good pages and propagate trust
- Complementary view: what fraction of a page's page rank comes from “spam” pages?
- In practice, we don't know all the spam pages, so we need to estimate

83. Spam mass estimation
- $r(p)$ = page rank of page p
- $r^+(p)$ = page rank of p with teleport into “good” pages only
- $r^-(p) = r(p) - r^+(p)$
- Spam mass of p = $r^-(p)/r(p)$
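Spam mass can be estimated with two PageRank runs that differ only in the teleport vector. The following is a sketch under several assumptions, none of which come from the slides: the function name `pagerank`, the five-page toy graph, and the choice to give each good page teleport weight 1/N in the biased run (so that $r^+(p) \le r(p)$ rather than renormalizing over the good set). Dead ends are assumed absent.

```python
def pagerank(out_links, teleport, beta=0.85, iterations=200):
    """Iterate r = beta * M r + (1 - beta) * teleport.

    teleport maps each page to its teleport weight; the weights need not sum to 1.
    Every page is assumed to have at least one outlink.
    """
    r = dict(teleport)
    for _ in range(iterations):
        r_new = {u: (1 - beta) * teleport[u] for u in out_links}
        for u, dests in out_links.items():
            share = beta * r[u] / len(dests)
            for v in dests:
                r_new[v] += share
        r = r_new
    return r

# Hypothetical graph: good pages g1, g2; target t supported by farm pages s1, s2.
out_links = {
    "g1": ["g2"],
    "g2": ["g1", "t"],
    "t": ["s1", "s2"],
    "s1": ["t"],
    "s2": ["t"],
}
good = {"g1", "g2"}
n = len(out_links)

r = pagerank(out_links, {u: 1 / n for u in out_links})
r_plus = pagerank(out_links, {u: (1 / n if u in good else 0.0) for u in out_links})
spam_mass = {u: (r[u] - r_plus[u]) / r[u] for u in out_links}
print({u: round(v, 3) for u, v in spam_mass.items()})
```

In this toy graph the good pages receive no rank from the farm, so their spam mass is zero, while most of the target's rank traces back to teleports into spammer-controlled pages.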

84. Good pages
- For spam mass, we need a large set of “good” pages
  - Need not be as careful about quality of individual pages as with TrustRank
- One reasonable approach
  - .edu sites
  - .gov sites
  - .mil sites

85. Acknowledgement
- Slides are from
  - Prof. Jeffrey D. Ullman
  - Dr. Anand Rajaraman

