1 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms Winter Semester 2004/2005 20 Dec.


1 1 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms Winter Semester 2004/2005 20 Dec 2004 10th Lecture Christian Schindelhauer schindel@upb.de

2 Search Algorithms, WS 2004/05 2 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Chapter III Searching the Web 20 Dec 2004

3 Searching the Web
• Introduction
• The Anatomy of a Search Engine
• Google’s PageRank algorithm
  – The Simple Algorithm
  – Periodicity and convergence
• Kleinberg’s HITS algorithm
  – The algorithm
  – Convergence
• The Structure of the Web
  – Pareto distributions
  – Search in Pareto-distributed graphs

4 The Webgraph
• $G_{WWW}$:
  – Static HTML pages are nodes
  – Links are directed edges
• Outdegree of a node: number of links on a web page
• Indegree of a node: number of links to a web page
• Directed path from node u to node v
  – a sequence of web pages obtained by following links from page u to page v
• Undirected path $(u = w_0, w_1, \ldots, w_{m-1}, w_m = v)$ from page u to page v
  – For all i: there is a link from $w_i$ to $w_{i+1}$ or from $w_{i+1}$ to $w_i$
• Strongly (weakly) connected component
  – the set of all nodes that have a directed (undirected) path from and to a reference node
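The definitions on this slide can be tried out on a toy graph. The following is only a minimal sketch: the edge list `toy_links` and the helper names `degrees` and `weakly_connected` are made up for illustration and do not come from the lecture.

```python
from collections import defaultdict

def degrees(links):
    """Return (outdegree, indegree) dictionaries for a list of directed links (u, v)."""
    outdeg = defaultdict(int)
    indeg = defaultdict(int)
    for u, v in links:
        outdeg[u] += 1   # page u contains one more link
        indeg[v] += 1    # page v is linked from one more page
    return dict(outdeg), dict(indeg)

def weakly_connected(links, u, v):
    """True if an undirected path joins u and v (edge direction is ignored)."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, stack = {u}, [u]
    while stack:
        node = stack.pop()
        if node == v:
            return True
        for w in neighbors[node]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return False

# Hypothetical toy web: a -> b, a -> c, c -> b, d -> c
toy_links = [("a", "b"), ("a", "c"), ("c", "b"), ("d", "c")]
```

In this toy graph there is no directed path from b to d, yet the two pages are joined by the undirected path b, c, d.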

5 The Web-Graph (1999)

6 Distributions of indegree/outdegree
• In- and outdegree obey a power law
  – i.e. degree i occurs with probability proportional to $1/i^{\alpha}$
• According to experiments by
  – Kumar et al 97: 40 million web pages
  – Barabasi et al 99: domain *.nd.edu plus web pages at distance 3
  – Broder et al 00: 204 million web pages (scans in May and Oct 1999)

7 Is the Web-Graph a Random Graph? No!
• Random graph $G_{n,p}$:
  – n nodes
  – Every directed edge occurs with probability p
• Is the web graph a random graph $G_{n,p}$?
  – In a random graph the degrees are (approximately) Poisson distributed
  – Hence the probability of high degrees decreases exponentially
• Therefore: the degree distribution of a random graph does not obey a power law
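To get a feeling for the difference between the two tails, one can compare the probabilities numerically. This is a sketch under assumptions: the Poisson mean 5.0, the exponent 2.1 and the truncation point `k_max` are arbitrary illustration values, and the function names are invented here.

```python
import math

def poisson_pmf(k, lam):
    """P[X = k] for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def power_law_pmf(k, alpha, k_max=100_000):
    """P[X = k] proportional to k^-alpha on {1, ..., k_max} (truncated normalization)."""
    norm = sum(i ** -alpha for i in range(1, k_max + 1))
    return k ** -alpha / norm

# Print both probabilities for small, medium and large degrees.
for k in (1, 10, 100):
    print(k, poisson_pmf(k, lam=5.0), power_law_pmf(k, alpha=2.1))
```

For degree 100 the Poisson probability is astronomically small, while the power law still assigns a probability that would be visible in a crawl of millions of pages.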

8 Pareto Distribution
• Discrete Pareto (power law) distribution for $x \in \{1, 2, 3, \ldots\}$:
  $P[X = x] = \frac{1}{\zeta(\alpha)} \cdot \frac{1}{x^{\alpha}}$
  with constant factor $\zeta(\alpha) = \sum_{i=1}^{\infty} i^{-\alpha}$ (also known as the Riemann zeta function)
• Heavy tail property
  – not all moments $E[X^k]$ are defined
  – the expected value exists if and only if $\alpha > 2$
  – the variance and $E[X^2]$ exist if and only if $\alpha > 3$
  – $E[X^k]$ is defined if and only if $\alpha > k + 1$
• Density function of the continuous distribution for $x > x_0$: [formula missing]
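The heavy tail property can be observed numerically by truncating the sums. The sketch below assumes the discrete distribution with probabilities proportional to x^-alpha; the helper names and the choice alpha = 2.5 are illustrative only. Both helpers normalize by the same partial zeta sum, so they describe the distribution truncated at `terms`.

```python
def partial_zeta(alpha, terms):
    """Partial sum of the zeta series: sum_{i=1}^{terms} i^-alpha."""
    return sum(i ** -alpha for i in range(1, terms + 1))

def truncated_moment(alpha, k, terms):
    """Partial sum of E[X^k]: sum_{x=1}^{terms} x^k * x^-alpha, divided by the partial zeta sum."""
    zeta = partial_zeta(alpha, terms)
    return sum(x ** (k - alpha) for x in range(1, terms + 1)) / zeta

for terms in (10**3, 10**4, 10**5):
    # alpha = 2.5: the mean settles (alpha > 2), the second moment keeps growing (alpha < 3)
    print(terms, truncated_moment(2.5, 1, terms), truncated_moment(2.5, 2, terms))
```

With alpha = 2.5 the first column of moments stabilizes as more terms are added, whereas the second moment grows without bound, in line with the conditions listed on the slide.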

9 Special Case: Zipf Distribution
• George Kingsley Zipf claimed that the n-th most frequent word occurs with frequency f(n) such that $f(n) \cdot n = c$
• Zipf probability distribution for $x \in \{1, 2, 3, \ldots, n\}$:
  $P[X = x] = \frac{c}{x}$
  with constant factor $c = 1 / \sum_{i=1}^{n} \frac{1}{i}$, only defined for finite sets, since $\sum_{i=1}^{n} \frac{1}{i}$ tends to infinity for growing n
• Zipf distributions refer to ranks
  – The Zipf exponent $\alpha$ can be larger than 1, i.e. $f(n) = c / n^{\alpha}$
• Pareto distributions refer to absolute size
  – e.g. number of inhabitants
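A small sketch shows why the Zipf distribution with exponent 1 needs a finite set: the normalization constant is the reciprocal of a harmonic number, which shrinks as the set grows. The function name `zipf_pmf` is an assumption made for this illustration.

```python
def zipf_pmf(n_items):
    """Zipf probabilities c/x for ranks x = 1..n_items, with c = 1/H_n (H_n = harmonic number)."""
    harmonic = sum(1.0 / i for i in range(1, n_items + 1))
    c = 1.0 / harmonic
    return [c / x for x in range(1, n_items + 1)]

# The probability of rank 1 equals c and decreases as the set grows.
for n_items in (10, 1000, 100_000):
    print(n_items, zipf_pmf(n_items)[0])
```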

10 Pareto Distribution (I)
• Examples of power laws (= Pareto distributions)
  – Pareto 1897: wealth/income in a population
  – Yule 1944: word frequency in languages
  – Zipf 1949: size of towns
  – Length of molecule chains
  – File length of UNIX files
  – …
  – Access density of web pages
  – Access density of a web surfer at a particular web page
  – …

11 City Size Distribution
Scaling Laws and Urban Distributions, Denise Pumain, 2003
Zipf distribution

12 Zipf’s Law and the Internet, Lada A. Adamic, Bernardo A. Huberman, 2002
Pareto distribution

13 Zipf’s Law and the Internet, Lada A. Adamic, Bernardo A. Huberman, 2002

14 Zipf’s Law and the Internet, Lada A. Adamic, Bernardo A. Huberman, 2002

15 Heavy-Tailed Probability Distributions in the World Wide Web, Mark Crovella, Murad Taqqu, Azer Bestavros, 1996

16 Size of connected components
• Strongly and weakly connected components obey a power law
  – A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. “Graph Structure in the Web: Experiments and Models.” In Proc. of the 9th World Wide Web Conference, pp. 309-320. Amsterdam: Elsevier Science, 2000.
• Large weakly connected component with 91% of all web pages
• Largest strongly connected component contains 28% of all web pages
  – Diameter ≥ 28

17 Searching in Power Law Networks
• Task:
  – Given a network with undirected edges
  – whose degrees follow a power law
  – start from a source node
  – and find a target node
• Features
  – Keep it simple: no markers
  – Visit one node at a time
  – Every node knows its neighbors (and their degrees)
• Three approaches
  – Neighbors of random nodes
  – Neighbors of a random walk: start with a random neighbor and continue
  – Neighbors of high degree seeking: start with a random node, prefer neighbors with larger degree
From Adamic, Lukose, Puniyani, Huberman, “Search in power-law networks”, Physical Review E, Vol. 64, 046135

18 Power Law Networks
• Undirected graph of n nodes
  – The probability that a node has k neighbors is $p_k$
  – where $p_k = c \cdot k^{-\tau}$ for a normalization factor c
• For search in power law networks
  – consider the largest connected component and
  – an exponent $\tau$ with $2 < \tau < 3$
• Theorem
  – For large enough power law graphs with exponent $\tau$:
    For $\tau < 1$ the graph is almost surely connected
    For $1 < \tau < 2$ there is a giant connected component of size $\Theta(n)$
    For $2 < \tau < 3.4785$ there is a giant component and all smaller components have size $O(\log n)$
    For $\tau > 3.4785$ the graph has almost surely no giant component, i.e. all components have size $o(n)$
    For $\tau > 4$ the sizes of the connected components follow a power law
  – by William Aiello, Fan Chung, Linyuan Lu, A Random Graph Model for Massive Graphs, Symposium on Theory of Computing (STOC) 2000

19 Random Walk
• Random Walk:
    Start with a random node as node u
    while the target is not a neighbor of u do
      u ← random neighbor of u
    od
• Theorem
  – In undirected connected graphs, in the long run every node is visited by a random walk with probability proportional to its degree.
• Conclusion:
  – High degree nodes are preferred
• Possible improvements
  – Avoid going back
  – Avoid visiting already visited nodes
  – Also scan second degree neighbors for the target node
• RW: random walk in a power law graph with exponent 2.1
  – avoiding going back
  – second degree scanning
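The random walk search and the theorem about visiting frequencies can be sketched in a few lines. This is an illustration under assumptions, not the lecture's implementation: the graph is given as a dictionary `adj` of neighbor lists, every node has at least one neighbor, and the `max_steps` guard and both function names are additions made here.

```python
import random

def random_walk_search(adj, start, target, max_steps=10_000, rng=None):
    """Move to uniformly random neighbors until the target is adjacent to the current node.

    adj maps each node to a list of its neighbors (undirected graph).
    Returns the number of steps taken, or None if max_steps is exhausted.
    """
    rng = rng or random.Random()
    u = start
    for steps in range(max_steps + 1):
        if u == target or target in adj[u]:
            return steps
        u = rng.choice(adj[u])
    return None

def visit_frequencies(adj, start, steps, rng=None):
    """Fraction of steps a plain random walk spends at each node."""
    rng = rng or random.Random()
    counts = {node: 0 for node in adj}
    u = start
    for _ in range(steps):
        u = rng.choice(adj[u])
        counts[u] += 1
    return {node: c / steps for node, c in counts.items()}

# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to node 0.
example_adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
```

In the example graph the degrees are 3, 2, 2 and 1, so by the theorem a long walk should spend about 3/8 of its steps at node 0.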

20 Degree Seeking
• Degree Seeking:
    Start with a random node as node u
    while the target is not a neighbor of u do
      u ← neighbor of u with the highest degree that has not been visited so far
    od
• Improvement:
  – Also scan second degree neighbors for the target
• Observation:
  – The search in power law networks is considerably faster
• Why?
• RW: random walk in a power law graph with exponent 2.1
• DS: degree seeking in the same graph
  – avoiding already visited neighbors
  – second degree scanning
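A possible rendering of the degree seeking loop is sketched below. The slide does not say what happens when all neighbors have already been visited; returning None in that case is an assumption of this sketch, as are the `max_steps` guard, the function name and the toy graph.

```python
def degree_seeking_search(adj, start, target, max_steps=10_000):
    """Greedy search: always move to the unvisited neighbor of highest degree.

    adj maps each node to a list of its neighbors (undirected graph).
    Returns the number of steps taken, or None if the search gets stuck
    (no unvisited neighbor left) or max_steps is exhausted.
    """
    visited = {start}
    u = start
    for steps in range(max_steps + 1):
        if u == target or target in adj[u]:
            return steps
        candidates = [v for v in adj[u] if v not in visited]
        if not candidates:
            return None  # assumption: give up instead of backtracking
        u = max(candidates, key=lambda v: len(adj[v]))
        visited.add(u)
    return None

# Hypothetical toy graph with one hub "h".
hub_adj = {
    "s": ["x", "h"],
    "x": ["s"],
    "h": ["s", "a", "b", "t"],
    "a": ["h"],
    "b": ["h"],
    "t": ["h"],
}
```

Starting at s, the search prefers the hub h over the leaf x and therefore sees the target t after a single step.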

21 Comparison of Random Walk and Degree Seeking

22 Probability Generating Functions
• For a discrete probability distribution X over {0, 1, 2, 3, …} let $p_k$ be the probability of the event $k \in \{0, 1, 2, 3, \ldots\}$
• Then the generating function for the probability distribution is
  $G(x) = \sum_{k=0}^{\infty} p_k x^k$
• Probability values
  $p_k = \frac{G^{(k)}(0)}{k!}$
  – where $G^{(k)}$ is the k-th derivative of G
• For probability distributions X and Y and their generating functions $G_X$, $G_Y$ we have [formula missing]

23 Probability Generating Functions: Properties
• Sum of probabilities
  $G(1) = \sum_{k=0}^{\infty} p_k = 1$
• Expectation
  $E[X] = \sum_{k=0}^{\infty} k \, p_k = G'(1)$
• If $X_1, \ldots, X_n$ are independent discrete random variables with generating functions $G_{X_i}$, then for $S = \sum_{i=1}^{n} X_i$ the generating function is
  $G_S(x) = \prod_{i=1}^{n} G_{X_i}(x)$
• This implies for $S = X_1 + X_2$, where $X_1$ and $X_2$ are independent,
  $G_S(x) = G_{X_1}(x) \cdot G_{X_2}(x)$
• Let N be an independent random variable and let $X_1, X_2, \ldots$ be independent and identically distributed random variables with generating function $G_X$. Then for the random sum $S_N = X_1 + \cdots + X_N$ the generating function is given by
  $G_{S_N}(x) = G_N(G_X(x))$
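For distributions with finitely many values the standard properties of generating functions (G(1) = 1, G'(1) = E[X], product rule for sums of independent variables) can be checked directly. The sketch below represents a distribution as a list `p` with `p[k]` the probability of the value k; this representation, the function names and the die/coin example are assumptions for illustration.

```python
def pgf(p, x):
    """G(x) = sum_k p[k] * x^k for a finite distribution p on {0, 1, ..., len(p) - 1}."""
    return sum(pk * x**k for k, pk in enumerate(p))

def pgf_derivative_at_one(p):
    """G'(1) = sum_k k * p[k], which equals the expectation E[X]."""
    return sum(k * pk for k, pk in enumerate(p))

def convolve(p, q):
    """Distribution of X + Y for independent X with distribution p and Y with distribution q."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

die = [0.0] + [1 / 6] * 6   # fair die on {1, ..., 6}
coin = [0.5, 0.5]           # fair coin on {0, 1}
die_plus_coin = convolve(die, coin)
```

Evaluating `pgf(die_plus_coin, x)` at any point x gives the same value as `pgf(die, x) * pgf(coin, x)`, up to floating-point rounding.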

24 Probability Generating Functions: Examples
• Remember that [formula missing]
• Example:
  – Consider the random variable [definition missing]
  – then the generating function is [formula missing]
• Poisson probability distribution with $p_k = e^{-\lambda} \frac{\lambda^k}{k!}$
  – Generating function: $G(x) = e^{\lambda (x - 1)}$
• Pareto (power law) probability distribution with $p_k = c \cdot k^{-\tau}$ for $k \geq 1$
  – Generating function: $G(x) = c \sum_{k=1}^{\infty} k^{-\tau} x^k$
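For the Poisson example, the generating function follows from the exponential series. The derivation below assumes the usual parametrization with mean $\lambda$; the symbol $\lambda$ is an assumption, since the transcript does not preserve the slide's formula.

```latex
\begin{aligned}
G(x) &= \sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} \, x^k
      = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda x)^k}{k!}
      = e^{-\lambda} e^{\lambda x}
      = e^{\lambda (x - 1)}, \\
G(1) &= 1, \qquad G'(1) = \lambda \, e^{\lambda (1 - 1)} = \lambda .
\end{aligned}
```

The second line recovers the two general properties for this case: the probabilities sum to 1 and the expectation is $\lambda$.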

25 Analyzing Power Law Graphs
• Consider the generating function for the degree
  $G_0(x) = \sum_{k=0}^{\infty} p_k x^k$
• Let $p_k = 0$ for all $k > m = n^{1/\tau}$ and for $k = 0$
  – If $m > n^{1/\tau}$ then $p_m < n^{-1}$
  – This means that fewer than one node of degree m exists in expectation
• Hence, the generating function is
  $G_0(x) = c \sum_{k=1}^{m} k^{-\tau} x^k$
• Choose the normalization factor c such that
  $G_0(1) = c \sum_{k=1}^{m} k^{-\tau} = 1$
• Then, the average degree is given by
  $G_0'(1) = c \sum_{k=1}^{m} k^{1-\tau}$
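The normalization factor and the average degree of the truncated power law can be computed directly from the sums. This is a sketch assuming $p_k = c \cdot k^{-\tau}$ for $1 \le k \le m$ with the cut-off rounded down to an integer; the function name and the example values are illustrative.

```python
def power_law_degree_stats(n, tau):
    """Return (m, c, mean_degree) for p_k = c * k^-tau on 1 <= k <= m.

    The cut-off is m = n^(1/tau), rounded down (up to floating-point rounding).
    """
    m = max(1, int(n ** (1.0 / tau)))
    c = 1.0 / sum(k ** -tau for k in range(1, m + 1))               # makes the p_k sum to 1
    mean_degree = c * sum(k ** (1 - tau) for k in range(1, m + 1))  # sum of k * p_k
    return m, c, mean_degree

# Example with assumed values: one million nodes, exponent 2.1.
print(power_law_degree_stats(10**6, 2.1))
```

Even for a million nodes the cut-off stays in the hundreds and the average degree remains a small constant, which is typical for exponents between 2 and 3.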

26 The Average Degree
• Average degree of a node
  $\langle k \rangle = \sum_{k} k \, p_k = G_0'(1)$
• A random edge chooses high degree nodes with higher probability
  – if a node has k edges then the probability increases (for large networks) by a factor of k
  – i.e. probability $p'(k) \sim k \, p_k$
  – the corresponding normalized generating function is
    $\frac{\sum_k k \, p_k x^k}{\sum_k k \, p_k} = x \, \frac{G_0'(x)}{G_0'(1)}$
• The degree distribution of a node reached after one step of a random walk, not counting the edge used to arrive, is given by this function shifted by one place, i.e.
  $G_1(x) = \frac{G_0'(x)}{G_0'(1)}$

27 The Neighbor’s Degree
• Assume that
  – a node “knows” the degree of all its neighbors
  – the probability that any second neighbor is connected to more than one first neighbor can be neglected
  Then the degrees of the first neighbors and of the second neighbors are independent. Second neighbors are the neighbors in the next step.
• Let $z_{2a}$ denote the average number of second neighbors starting from a random node
  – Choose N according to $G_0$
  – Choose $X_i$ according to $G_1$
  – Consider the random sum $X_1 + \cdots + X_N$ and its generating function $G_0(G_1(x))$
  – Then $z_{2a} = \left[ \frac{d}{dx} G_0(G_1(x)) \right]_{x=1} = G_0'(1) \cdot G_1'(1)$
• Let $z_{2b}$ denote the average number of second neighbors starting from a node chosen by a random edge
  – Choose N according to $G_1$
  – Choose $X_i$ according to $G_1$
  – Consider the random sum $X_1 + \cdots + X_N$ and its generating function $G_1(G_1(x))$
  – Then $z_{2b} = \left[ \frac{d}{dx} G_1(G_1(x)) \right]_{x=1} = \left( G_1'(1) \right)^2$
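The two averages can be evaluated numerically for a truncated power law. The sketch assumes $p_k = c \cdot k^{-\tau}$ for $1 \le k \le m$ with $m = n^{1/\tau}$ rounded down, and uses $G_1(x) = G_0'(x)/G_0'(1)$, so that $G_1'(1) = G_0''(1)/G_0'(1)$; the function name and the example values are assumptions.

```python
def second_neighbor_counts(n, tau):
    """Return (z1, z2a, z2b) for p_k = c * k^-tau on 1 <= k <= m = floor(n^(1/tau)).

    z1  = G0'(1)            average degree of a random node
    z2a = G0'(1) * G1'(1)   second neighbors when starting at a random node
    z2b = G1'(1) ** 2       second neighbors when starting at the end of a random edge
    """
    m = max(1, int(n ** (1.0 / tau)))
    ks = range(1, m + 1)
    c = 1.0 / sum(k ** -tau for k in ks)
    g0_prime = c * sum(k * k ** -tau for k in ks)             # G0'(1)  = sum k p_k
    g0_second = c * sum(k * (k - 1) * k ** -tau for k in ks)  # G0''(1) = sum k (k-1) p_k
    g1_prime = g0_second / g0_prime                           # G1'(1)  = G0''(1) / G0'(1)
    return g0_prime, g0_prime * g1_prime, g1_prime ** 2

# Example with assumed values: one million nodes, exponent 2.1.
print(second_neighbor_counts(10**6, 2.1))
```

By construction z2b equals (z2a / z1) squared, so for a bounded average degree z1 the count along a random edge is of the order of the square of the count from a random node.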

28 Random Walks Outperform Random Nodes
• Let $z_{2a}$ denote the average number of second neighbors starting from a random node
  – The degree depends on the cut-off value $m = \Theta(n^{1/\tau})$
  – For $2 < \tau < 3$ one can obtain $G_0'(1) = \Theta(1)$ and $G_1'(1) = \Theta(m^{3-\tau})$
  – Hence, $z_{2a} = G_0'(1) \cdot G_1'(1) = \Theta(n^{3/\tau - 1})$
• Let $z_{2b}$ denote the average number of second neighbors starting from a node chosen by a random edge
  – The degree depends on the cut-off value $m = \Theta(n^{1/\tau})$
  – For $2 < \tau < 3$ one can obtain $G_1'(1) = \Theta(m^{3-\tau})$
  – Hence, $z_{2b} = \left( G_1'(1) \right)^2 = \Theta(n^{2(3/\tau - 1)})$
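The growth rates for $2 < \tau < 3$ can be worked out from the sums. The following derivation is a reconstruction under assumptions rather than the slide's own formulas: it takes $p_k = c \cdot k^{-\tau}$ for $1 \le k \le m$ with $m = \Theta(n^{1/\tau})$, the shifted generating function $G_1(x) = G_0'(x)/G_0'(1)$, and uses that $c = \Theta(1)$.

```latex
\begin{aligned}
G_0'(1) &= c \sum_{k=1}^{m} k^{1-\tau} = \Theta(1)
  && \text{(the sum converges for } \tau > 2\text{)} \\
G_0''(1) &= c \sum_{k=1}^{m} k (k-1) \, k^{-\tau} = \Theta\!\left(m^{3-\tau}\right)
  && \text{(the sum grows like } m^{3-\tau} \text{ for } \tau < 3\text{)} \\
G_1'(1) &= \frac{G_0''(1)}{G_0'(1)} = \Theta\!\left(m^{3-\tau}\right)
         = \Theta\!\left(n^{3/\tau - 1}\right) \\
z_{2a} &= G_0'(1) \, G_1'(1) = \Theta\!\left(n^{3/\tau - 1}\right) \\
z_{2b} &= \left(G_1'(1)\right)^2 = \Theta\!\left(n^{2 (3/\tau - 1)}\right)
\end{aligned}
```

Under these assumptions $z_{2b}$ grows like the square of $z_{2a}$.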

29 Conclusions
• The number of nodes in the neighborhood of the nodes of a random walk is approximately the square of the number of nodes neighboring random points of the network
• This effect can be increased if we prefer the neighbor with the highest degree
• This improves the search in power law networks
  – because more neighbors are in reach
• In random graphs (Poisson graphs) this technique does not help as much
  – since the degree distribution is sharply concentrated around the expectation

30 Thanks for your attention
End of 10th lecture
Happy X-mas and a happy new year
Next lecture: Mo 10 Jan 2005, 11.15 am, FU 116
Next exercise class: Mo 20 Dec 2004, 1.15 pm, F0.530 or We 22 Dec 2004, 1.00 pm, E2.316

