HEINZ NIXDORF INSTITUTE, University of Paderborn, Algorithms and Complexity
Christian Schindelhauer
Search Algorithms, Winter Semester 2004/05
8th Lecture, 6 Dec 2004

Chapter III
Organization: Mid Term Exam
06 Dec 2004

Mid Term Exam
Wednesday, 8 Dec 2004, 1pm-1.45pm, F1.110
4 parts
–1. Short questions, testing general understanding
–2.-4. Show that you understand
  Text search algorithms
  Searching in compressed text
  The Pagerank algorithm
If you have successfully presented an exercise:
–Only the best 3 of 4 parts count
If you fail, or if you receive a bad grade
–then the oral exam at the end of the semester will cover the complete lecture
If you are happy with your grade
–this grade counts for half of the grade for the complete lecture
–if you pass the oral exam (which then covers the rest of the lecture)

Chapter III
Searching the Web
06 Dec 2004

Searching the Web
Introduction
The Anatomy of a Search Engine
Google’s Pagerank algorithm
–The Simple Algorithm
–Periodicity and convergence
Kleinberg’s HITS algorithm
–The algorithm
–Convergence
The Structure of the Web
–Pareto distributions
–Search in Pareto-distributed graphs

Simplified PageRank Algorithm
–Rank of a web page: $R(u) \in [0,1]$
–Important pages hand their rank down to the pages they link to.
–Update rule: $R(u) = c \sum_{v \in B_u} R(v) / |F_v|$
–$c$ is a normalisation factor such that $\|R\|_1 = 1$, i.e. all page ranks sum to 1
–Predecessor nodes $B_u$
–Successor nodes $F_u$

The Simplified Pagerank Algorithm and an example

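The slide's own example is not preserved in this transcript. As a stand-in, here is a minimal sketch of the simplified iteration, in which every page hands an equal share of its rank to each page it links to and the result is rescaled so that the ranks sum to 1. The three-page graph and the name `simplified_pagerank` are hypothetical and chosen only for illustration.

```python
def simplified_pagerank(links, rounds=50):
    """links maps each page to the list of pages it links to (its successors)."""
    pages = sorted(links)
    rank = {u: 1.0 / len(pages) for u in pages}  # uniform start vector
    for _ in range(rounds):
        new_rank = {u: 0.0 for u in pages}
        for v in pages:
            for u in links[v]:
                # page v hands an equal share of its rank to every successor
                new_rank[u] += rank[v] / len(links[v])
        total = sum(new_rank.values())
        # the factor c = 1 / total rescales the ranks so that they sum to 1
        rank = {u: r / total for u, r in new_rank.items()}
    return rank


# Hypothetical example graph: a -> b, a -> c, b -> c, c -> a
example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simplified_pagerank(example)
```

On this small graph the ranks settle near 0.4, 0.2 and 0.4 for a, b and c: page b receives only half of a's rank, while c collects from both a and b.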
Matrix representation
$R \leftarrow c\, M R$, where $R$ is the vector $(R(1), R(2), \ldots, R(n))$ and $M$ denotes the $n \times n$ matrix with entries $M_{ij} = 1/|F_j|$ if page $j$ links to page $i$, and $M_{ij} = 0$ otherwise.

Stochastic Matrices
Consider n discrete states and a sequence of random variables $X_1, X_2, \ldots$ over this set of states.
The sequence $X_1, X_2, \ldots$ is a Markov chain if
$P[X_{t+1} = x \mid X_1 = x_1, \ldots, X_t = x_t] = P[X_{t+1} = x \mid X_t = x_t]$
A stochastic matrix M is the transition matrix of a finite Markov chain, also called a Markov matrix:
–The elements of M must be real numbers in [0, 1].
–The entries of every column of M sum to 1.
Observation for the matrix M of the simplified Pagerank algorithm
–M is stochastic if all nodes have at least one outgoing link

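The observation about outgoing links can be checked mechanically. The sketch below assumes the usual link matrix in which column j holds 1/(number of links on page j) in every row that page j links to; the helper names and the two tiny graphs are hypothetical. Exact fractions avoid rounding issues in the column sums.

```python
from fractions import Fraction


def link_matrix(successors, n):
    """Column j holds 1/|F_j| in every row i that page j links to."""
    M = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in successors.items():
        for i in targets:
            M[i][j] = Fraction(1, len(targets))
    return M


def is_column_stochastic(M):
    """All entries lie in [0, 1] and every column sums to 1."""
    n = len(M)
    entries_ok = all(0 <= M[i][j] <= 1 for i in range(n) for j in range(n))
    columns_ok = all(sum(M[i][j] for i in range(n)) == 1 for j in range(n))
    return entries_ok and columns_ok


# Hypothetical graphs on pages 0..2; in the second one page 2 is a sink.
no_sink = {0: [1, 2], 1: [2], 2: [0]}
with_sink = {0: [1, 2], 1: [2], 2: []}
```

With these inputs, `is_column_stochastic(link_matrix(no_sink, 3))` holds, while the graph with a sink fails the test because the sink's column sums to 0.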
The Random Surfer for Simplified Pagerank
Consider the following algorithm
–Start in a random web page according to a probability distribution
–Repeat the following for t rounds
  If there is no link on this page, exit and produce no output
  Choose a link of the web page uniformly at random
  Follow that link and go to this web page
–Output the web page
Lemma
The probability that a web page i is output by the random surfer after t rounds, started with probability distribution $x_1, \ldots, x_n$, is described by the i-th entry of the output of the simplified Pagerank algorithm iterated for t rounds without normalization.
The proof follows by applying the definition of Markov chains.

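The lemma can be illustrated numerically. The sketch below computes the exact output distribution by applying t unnormalized steps of the simplified iteration, and compares it with a seeded Monte Carlo simulation of the surfer. The graph, the start distribution, the number of trials, and all function names are assumptions made for this illustration.

```python
import random


def surfer_distribution(links, start, t):
    """Exact output distribution: t steps of the simplified iteration, no normalization."""
    dist = dict(start)
    for _ in range(t):
        nxt = {u: 0.0 for u in links}
        for v, p in dist.items():
            for u in links[v]:  # a page without links contributes nothing
                nxt[u] += p / len(links[v])
        dist = nxt
    return dist


def simulate_surfer(links, start, t, trials, rng):
    """Run the random surfer `trials` times and return the output frequencies."""
    pages = list(start)
    weights = [start[u] for u in pages]
    counts = {u: 0 for u in links}
    for _ in range(trials):
        page = rng.choices(pages, weights=weights)[0]
        for _ in range(t):
            if not links[page]:
                page = None  # no link on this page: exit without output
                break
            page = rng.choice(links[page])
        if page is not None:
            counts[page] += 1
    return {u: c / trials for u, c in counts.items()}


# Hypothetical graph and start distribution (the surfer starts on page a)
example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
start = {"a": 1.0, "b": 0.0, "c": 0.0}
exact = surfer_distribution(example, start, 3)
estimate = simulate_surfer(example, start, 3, 20000, random.Random(0))
```

For this graph the exact distribution after three rounds is 0.5, 0.25, 0.25 for a, b, c, and the simulated frequencies should lie close to these values.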
Eigenvalues of Stochastic Matrices
Notations
–The L1-norm of a vector x is defined as $\|x\|_1 = \sum_i |x_i|$
–$x \ge 0$, if for all i: $x_i \ge 0$
–$x \le 0$, if for all i: $x_i \le 0$
Lemma
For every stochastic matrix M and every vector x we have
–$\|M x\|_1 \le \|x\|_1$
–$\|M x\|_1 = \|x\|_1$, if $x \ge 0$ or $x \le 0$
Eigenvalues of M
–$|\lambda_i| \le 1$
Theorem
For every stochastic matrix M there is an eigenvector x with eigenvalue 1 such that $x \ge 0$ and $\|x\|_1 = 1$

Periodicity - Example

Necessary and Sufficient Conditions for Periodicity
Theorem (necessary condition)
–If the stochastic matrix M is periodic with periodicity $t \ge 2$, then the graph G of M contains a strongly connected subgraph S with at least two nodes such that every directed cycle within S has a length of the form $i \cdot t$ for a natural number i.
Theorem (sufficient condition)
–Let the graph consist of one strongly connected subgraph and
–let $L_1, L_2, \ldots, L_m$ be the lengths of all directed cycles of length at most n.
–Then M is non-periodic if and only if $\gcd(L_1, L_2, \ldots, L_m) = 1$
Notation:
–$\gcd(L_1, L_2, \ldots, L_m)$ = greatest common divisor of the numbers $L_1, L_2, \ldots, L_m$
Corollary
–If the graph is strongly connected and there exists a cycle of length 1 (i.e. a loop), then M is non-periodic.

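The gcd criterion can be tried out on small graphs. Enumerating all cycles is expensive, so the sketch below uses a substitute technique: it takes the gcd of all lengths k ≤ n for which some closed walk of length k exists. Every cycle is a closed walk and every closed walk splits into cycles, so both gcds agree. The function name and the 0/1 adjacency-matrix input format are assumptions of this sketch, and it presumes a strongly connected graph as in the theorem.

```python
from math import gcd


def cycle_length_gcd(adj):
    """adj[i][j] == 1 iff there is an edge i -> j; returns gcd of closed-walk lengths <= n."""
    n = len(adj)
    power = [row[:] for row in adj]  # power[i][j] == 1 iff a walk of length k leads from i to j
    g = 0
    for k in range(1, n + 1):
        if any(power[i][i] for i in range(n)):
            g = gcd(g, k)  # some node lies on a closed walk of length k
        # boolean matrix product: extend all walks by one more edge
        power = [[int(any(power[i][m] and adj[m][j] for m in range(n)))
                  for j in range(n)] for i in range(n)]
    return g


triangle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]            # directed 3-cycle
triangle_with_loop = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]  # same, plus a loop at node 0
```

The directed 3-cycle yields 3 (periodic), while adding a single loop drops the gcd to 1, matching the corollary.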
Disadvantages of the Simplified Pagerank Algorithm
The Web graph has sinks, i.e. pages without links
–M is not a stochastic matrix
The Web graph is periodic
–Convergence is uncertain
The Web graph is not strongly connected
–Several convergence vectors are possible
Rank sinks
–Strongly connected subgraphs absorb all the weight of their predecessors
–All predecessors pointing to such a web page lose their weight.

The (non-simplified) Pagerank Algorithm
Add links from every sink to all web pages
Uniformly and randomly choose a web page
–With some probability q < 1 perform a step of the simplified Pagerank algorithm
–With probability 1-q start again with the first step (and choose a random web page)
Note: M is stochastic

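A minimal sketch of how the modified matrix could be assembled and iterated, assuming the standard reading of the two rules above: a sink is treated as linking to every page, and each entry is a mix of the link step (weight q) and a uniform restart (weight 1-q). The four-page graph and the function names are hypothetical; q = 0.96 is only an example value.

```python
def pagerank_matrix(successors, n, q):
    """M = q * (link matrix with sinks filled up) + (1 - q) / n in every entry."""
    M = [[(1.0 - q) / n] * n for _ in range(n)]
    for j in range(n):
        targets = successors.get(j) or list(range(n))  # a sink links to all pages
        for i in targets:
            M[i][j] += q / len(targets)
    return M


def pagerank(M, rounds=1000):
    """Repeatedly apply M to the uniform distribution."""
    n = len(M)
    x = [1.0 / n] * n
    for _ in range(rounds):
        x = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


# Hypothetical 4-page graph in which page 3 is a sink
graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
M = pagerank_matrix(graph, 4, 0.96)
ranks = pagerank(M)
```

Every column of `M` sums to 1 even though the graph has a sink, and every entry is at least (1-q)/n = 0.01, so no normalization factor is needed during the iteration.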
The Effect of Filling Up the Matrix
Example: q = 0.96, n = 4, start vector $x = (1,0,0,0)^T$
[Figure: example graph on the nodes 1, 2, 3, 4 and the product M x]
Observation:
–All entries of M are at least (1-q)/n (by definition; here 0.01)
–All entries of M x are at least $\|x\|_1 (1-q)/n$ (here 0.01)
Fact: For all vectors $x \ge 0$: $(M x)_i \ge \|x\|_1 (1-q)/n$
–sum over all rows
Fact: For all vectors $x \le 0$: $(M x)_i \le -\|x\|_1 (1-q)/n$

What Happens to Mixed Vectors
[Figure: worked matrix example for a mixed vector z = p + m with $p \ge 0$ and $m \le 0$]
$\|z\|_1 = 2$, $\|p\|_1 = 1$, $\|m\|_1 = 1$
For all i: $|(M p)_i| \ge 0.01$
For all i: $|(M m)_i| \ge 0.01$
$\|M z\|_1 \le 2 - 4 \cdot 0.01 = 1.96$

Converging to the first Eigenvector
Observation:
–Entries of M are at least (1-q)/n
–Entries of M x are at least $\|x\|_1 (1-q)/n$
Fact:
–For all vectors $z \ge 0$: $(M z)_i \ge \|z\|_1 (1-q)/n$
–For all vectors $z \le 0$: $(M z)_i \le -\|z\|_1 (1-q)/n$
Lemma
–For $x \ge 0$ or $x \le 0$: $\|M x\|_1 = \|x\|_1$
Let x be an eigenvector with eigenvalue 1
For an arbitrary vector $y \ge 0$ let z = x-y
–Decompose z = m + p,
–where $p \ge 0$ and $m \le 0$ and $\|p\|_1 + \|m\|_1 = \|z\|_1$
$\|x - M y\|_1 = \|M(x-y)\|_1 = \|M z\|_1 = \|M(p+m)\|_1 = \|M p + M m\|_1$
$= \sum_i |(M p)_i + (M m)_i|$
$\le \sum_i \left( \max\{|(M p)_i|, |(M m)_i|\} - \|z\|_1 (1-q)/n \right)$
$\le \left|\sum_i (M p)_i\right| + \left|\sum_i (M m)_i\right| - \|z\|_1 (1-q)$
$= \|p\|_1 + \|m\|_1 - \|z\|_1 (1-q)$
$= \|z\|_1 - \|z\|_1 (1-q)$
$= q\, \|x-y\|_1$
After each iteration the distance between y and x decreases by a factor of q

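The contraction can be observed numerically. The sketch below is only a spot check under explicit assumptions, not part of the proof: it uses a hypothetical 3-page cycle with q = 0.9, and both x and y are probability distributions, so the positive and negative parts of x - y have the same L1 norm. All names are illustrative.

```python
def l1(v):
    """L1 norm of a vector."""
    return sum(abs(a) for a in v)


def apply(M, v):
    """Matrix-vector product M v."""
    n = len(M)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]


def filled_matrix(link_cols, q):
    """link_cols[j] is column j of a column-stochastic link matrix."""
    n = len(link_cols)
    return [[q * link_cols[j][i] + (1 - q) / n for j in range(n)] for i in range(n)]


# Hypothetical 3-page cycle 0 -> 1 -> 2 -> 0 with q = 0.9
q = 0.9
cols = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
M = filled_matrix(cols, q)
x = [1 / 3, 1 / 3, 1 / 3]  # eigenvector with eigenvalue 1 (uniform, by symmetry)
y = [1.0, 0.0, 0.0]        # some other probability distribution
before = l1([a - b for a, b in zip(x, y)])
after = l1([a - b for a, b in zip(x, apply(M, y))])
```

In this instance the distance shrinks from 4/3 to 1.2, which is exactly the factor q = 0.9.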
Properties of the Pagerank Algorithm
Lemma
There is a unique (real) eigenvector with eigenvalue 1 for the matrix of the Pagerank algorithm.
Proof: Let x be an eigenvector with eigenvalue 1. Then the lemma follows from: for all $y \ne x$: $\|x - M y\|_1 \le q\, \|x-y\|_1 < \|x-y\|_1$
Lemma
Let x be the (unique real) eigenvector, M the matrix of the Pagerank algorithm, and q the probability parameter of Pagerank. Then for all real vectors y: $\|M(x-y)\|_1 \le q\, \|x-y\|_1$
Theorem
PageRank converges to a $(1+\epsilon)$-approximation of the unique eigenvector in at most $(\ln \epsilon - \ln n) / \ln q$ iterations.

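To get a feeling for the bound, the following sketch evaluates (ln ε - ln n) / ln q and rounds up to a whole number of iterations. The function name and the sample values (q = 0.96, ε = 0.01, n = 4) are chosen for illustration only.

```python
import math


def iteration_bound(q, eps, n):
    """Smallest integer t with t >= (ln eps - ln n) / ln q, i.e. q**t <= eps / n."""
    return math.ceil((math.log(eps) - math.log(n)) / math.log(q))


bound = iteration_bound(0.96, 0.01, 4)  # 147
```

Because ln q is close to 0 when q is close to 1, the bound grows quickly as q approaches 1, while the number of pages n enters only logarithmically.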
Discussion
q = probability to use the simplified Pagerank step
If q is small
–Pagerank converges faster
–Smaller hops are considered
–Less structural information is available
–The page ranks become more uniform
If q is large
–Pagerank (possibly) converges slower
–Longer hops play a role
–Web sinks collect more weight
  Therefore Google deletes web sinks from the Web graph
Problem:
–How to choose q
–Is it reasonable to give every web page a page rank independently of the search pattern?

Kleinberg’s HITS Algorithm (Hyperlink-Induced Topic Search)
Jon Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM 46(5), 1999
Idea of the algorithm
–Pages can serve as authorities (like in Pagerank) or as hubs
–Hub pages collect links to authorities, i.e. to relevant pages
  E.g. railway fans collect links to railway companies
–Authorities are the targets of hub pages
Mutually reinforcing relationship between hubs and authorities

Constructing a Focused Subgraph
For a search pattern choose a set S of pages such that
1. S is relatively small.
2. S is rich in relevant pages.
3. S contains most (or many) of the strongest authorities.
Start with the output of a standard text-based search engine
Enhance the set of pages by the predecessors and the successors of these pages (w.r.t. links)

Edge Selection
Offset the effect of links that serve a purely navigational function
Types of links
–transverse if it is between pages with different domain names
–intrinsic if it is between pages with the same domain name
Intrinsic links very often exist purely for navigation
–they give much less information than transverse links about the authority of the pages they point to
–therefore delete all intrinsic links from the focused subgraph
Other simple heuristics
–Suppose a large number of pages from a single domain all point to a single page p.
–This often corresponds to a mass advertisement
  for example, the phrase "This site designed by..." and a corresponding link at the bottom of each page in a given domain.
–To eliminate this phenomenon, allow only a maximum number of links from one domain pointing to a page

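Both heuristics amount to a single pass over the edge list. The sketch below is one possible reading: it approximates the "domain name" by the URL's hostname, and the cap `max_per_domain` as well as the function name are hypothetical choices, since the slide does not fix a value.

```python
from urllib.parse import urlparse


def select_edges(edges, max_per_domain=4):
    """Drop intrinsic links and cap the links from one domain to one page."""
    kept, used = [], {}
    for src, dst in edges:
        src_dom, dst_dom = urlparse(src).hostname, urlparse(dst).hostname
        if src_dom == dst_dom:
            continue  # intrinsic link: same domain name
        key = (src_dom, dst)
        if used.get(key, 0) >= max_per_domain:
            continue  # too many links from this domain to the page dst
        used[key] = used.get(key, 0) + 1
        kept.append((src, dst))
    return kept


# Hypothetical edge list: one intrinsic link, three transverse links to the same page
edges = [
    ("http://a.example/1", "http://a.example/2"),
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://b.example/"),
    ("http://a.example/3", "http://b.example/"),
]
kept = select_edges(edges, max_per_domain=2)
```

With a cap of 2, the intrinsic link and the third link from a.example to the page on b.example are discarded, and the first two transverse links are kept.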
Mutually Reinforcing Relationship
Weights
–Authority weight of a web page i: $x_i$
–Hub weight of a web page i: $y_i$
Authority indicated by hub pages (I-Operation)
–$x_i \leftarrow c_1 \sum_{j:\, j \to i} y_j$
Hub pages indicated by authority pages (O-Operation)
–$y_i \leftarrow c_2 \sum_{j:\, i \to j} x_j$
–$c_1, c_2$ are normalization factors w.r.t. the L2-norm

The HITS Algorithm

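The slide's listing is not preserved in this transcript. The following is a minimal sketch of the iteration as it is usually stated: alternate the I-operation (authority weight = sum of the hub weights of the pages pointing to a page) and the O-operation (hub weight = sum of the authority weights of the pages a page points to), then normalize both vectors in the L2 norm. The graph, the fixed number of rounds and the function name are assumptions.

```python
import math


def hits(links, rounds=50):
    """links maps each page to the pages it points to; returns (authority, hub) weights."""
    pages = sorted(set(links) | {v for targets in links.values() for v in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(rounds):
        # I-operation: collect hub weights along incoming links
        auth = {p: 0.0 for p in pages}
        for u, targets in links.items():
            for v in targets:
                auth[v] += hub[u]
        # O-operation: collect authority weights along outgoing links
        hub = {u: sum(auth[v] for v in links.get(u, [])) for u in pages}
        # normalize both weight vectors w.r.t. the L2 norm
        for w in (auth, hub):
            norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
            for p in w:
                w[p] /= norm
    return auth, hub


# Hypothetical focused subgraph: hub h1 points to a1 and a2, hub h2 points to a1
example = {"h1": ["a1", "a2"], "h2": ["a1"]}
auth, hub = hits(example)
```

In this example a1 ends up as the strongest authority because both hubs point to it, and h1 is the stronger hub because it points to both authorities.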
Computing the Output
Does the algorithm converge?
How good is the output?

Matrix Representation
[Formulas not recoverable: adjacency matrix A, authority weights, hub weights, and the weights after t iterations]

When does HITS converge?
$M = A A^T$ is a symmetric matrix
For all symmetric matrices
–all eigenvalues are real
–eigenvectors belonging to different eigenvalues are orthogonal
There exists a representation $M = S D S^T$, with $D$ diagonal and $S$ orthogonal, such that the columns $S_i$ of $S$ satisfy $M S_i = \lambda_i S_i$
If the largest eigenvalue $\lambda_1$ is strictly larger than the second eigenvalue $\lambda_2$, then HITS converges

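The reason an eigenvalue gap suffices can be written out in one line. The following is a sketch under the assumptions that $S_1, \ldots, S_n$ are orthonormal eigenvectors of $M$ with eigenvalues $\lambda_1 > \lambda_2 \ge \ldots \ge \lambda_n \ge 0$ (they are nonnegative because $M = A A^T$), and that the start vector $v$ is not orthogonal to $S_1$:

```latex
M = \sum_{i=1}^{n} \lambda_i \, S_i S_i^{T}
\quad\Longrightarrow\quad
M^{t} v = \sum_{i=1}^{n} \lambda_i^{t} \, (S_i^{T} v) \, S_i
        = \lambda_1^{t} \Bigl( (S_1^{T} v) \, S_1
          + \sum_{i \ge 2} \bigl( \tfrac{\lambda_i}{\lambda_1} \bigr)^{t} (S_i^{T} v) \, S_i \Bigr)
```

Since $\lambda_i / \lambda_1 < 1$ for all $i \ge 2$, the sum in the bracket vanishes as $t$ grows, so the normalized vector $M^t v / \|M^t v\|_2$ tends to $\pm S_1$. The convergence speed is governed by the ratio $\lambda_2 / \lambda_1$.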
The Web-Graph (1999)
(next time)

Thanks for your attention
End of 8th lecture
Next lecture: Mo 13 Dec 2004, am, FU 116
Midterm exam: We 8 Dec 2004, 1pm, F1.110
Next exercise class: Mo 13 Dec 2004, 1.15 pm, F0.530 or We 16 Dec 2004, 1.00 pm, E2.316