HEINZ NIXDORF INSTITUTE, University of Paderborn, Algorithms and Complexity. Christian Schindelhauer: Search Algorithms, Winter Semester 2004/2005, 13 Dec.


Search Algorithms, Winter Semester 2004/05, 13 Dec 2004, 9th Lecture. Christian Schindelhauer

Chapter III: Searching the Web (13 Dec 2004)

Searching the Web
- Introduction
- The Anatomy of a Search Engine
- Google's PageRank algorithm
  - The simple algorithm
  - Periodicity and convergence
- Kleinberg's HITS algorithm
  - The algorithm
  - Convergence
- The Structure of the Web
  - Pareto distributions
  - Search in Pareto-distributed graphs

Kleinberg's HITS Algorithm (Hyperlink-Induced Topic Search)
- Jon Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM 46(5), 1999
- Idea of the algorithm
  - Pages can serve as authorities (as in PageRank) or as hubs
  - Hub pages collect links to authorities, i.e. relevant pages (e.g. railway fans collect links to railway companies)
  - Authorities are the targets of hub pages
- Mutually reinforcing relationship between hubs and authorities

Constructing a Focused Subgraph
- For a search pattern, choose a set of pages S such that:
  1. S is relatively small.
  2. S is rich in relevant pages.
  3. S contains most (or many) of the strongest authorities.
- Start with the output of a standard text-based search engine
- Extend this set by the predecessors and the successors of these pages (w.r.t. links)
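The expansion step can be sketched in a few lines of Python. This is only an illustration: the names `focused_subgraph`, `out_links`, `in_links` and the cap `max_in_per_page` are assumptions and do not appear on the slide. The cap keeps a single very popular page from inflating the set, which helps keep S "relatively small".

```python
def focused_subgraph(root_set, out_links, in_links, max_in_per_page=50):
    """Expand a root set by the successors and predecessors of its pages.

    root_set: pages returned by a text-based search engine (assumed input)
    out_links: dict mapping a page to the pages it links to (hypothetical index)
    in_links: dict mapping a page to the pages linking to it (hypothetical index)
    max_in_per_page: assumed cap on predecessors added per root page
    """
    base = set(root_set)
    for page in root_set:
        # Successors: every page that the root page links to.
        base.update(out_links.get(page, ()))
        # Predecessors: pages linking to the root page, capped for size.
        predecessors = sorted(in_links.get(page, ()))
        base.update(predecessors[:max_in_per_page])
    return base
```

The focused subgraph is then the graph induced by the links between the pages in the returned set.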

Edge Selection
- Offset the effect of links that serve a purely navigational function
- Types of links
  - transverse: between pages with different domain names
  - intrinsic: between pages with the same domain name
- Intrinsic links very often exist purely for navigation
  - they give much less information than transverse links about the authority of the pages they point to
  - therefore delete all intrinsic links from the focused subgraph
- Another simple heuristic
  - Suppose a large number of pages from a single domain all point to a single page p
  - This often corresponds to mass advertisement, for example the phrase "This site designed by..." with a corresponding link at the bottom of each page in a given domain
  - To eliminate this phenomenon, allow only a maximum number of links from one domain to point to any single page
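Both heuristics can be sketched as a single filter over the edge list. The function names, the use of the URL host as "domain name", and the default cap of 4 are assumptions made for illustration; the slide does not give a concrete value for the maximum.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain(url):
    """Host part of a URL, used here as the 'domain name' (a simplification)."""
    return urlparse(url).netloc.lower()


def select_edges(edges, max_per_domain=4):
    """Drop intrinsic links and cap the links from one domain to one page.

    edges: iterable of (source_url, target_url) pairs
    max_per_domain: hypothetical cap, not taken from the slide
    """
    kept = []
    count = defaultdict(int)  # (source domain, target page) -> links kept so far
    for src, dst in edges:
        if domain(src) == domain(dst):
            continue  # intrinsic link: same domain, probably navigational
        key = (domain(src), dst)
        if count[key] < max_per_domain:
            count[key] += 1
            kept.append((src, dst))
    return kept
```

Only transverse links survive, and no domain contributes more than `max_per_domain` links to the same target page.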

Mutually Reinforcing Relationship
- Weights
  - Authority weight of a web page i: x_i
  - Hub weight of a web page i: y_i
- Authority indicated by hub pages (I-operation)
- Hub pages indicated by authority pages (O-operation)
  - c_1, c_2 are normalization factors w.r.t. the L2 norm
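A sketch of the two operations, following the definitions in Kleinberg's paper. The symbol E for the edge set of the focused subgraph is an assumption; it is not used on the slide.

```latex
\begin{align*}
x_i &\leftarrow c_1 \sum_{j \,:\, (j,i) \in E} y_j && \text{(I-operation)} \\
y_i &\leftarrow c_2 \sum_{j \,:\, (i,j) \in E} x_j && \text{(O-operation)}
\end{align*}
```

Here c_1 and c_2 are chosen such that the squares of the x_i sum to 1 and the squares of the y_i sum to 1.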

The HITS Algorithm
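A minimal Python sketch of the iteration, assuming the standard formulation from Kleinberg's paper: start with all weights equal to 1, then alternate the I-operation and the O-operation, normalizing in the L2 norm after each step. The function names and the fixed iteration count are assumptions, not the author's pseudocode.

```python
import math


def normalize(weights):
    """Scale a weight dict so that the sum of squares is 1 (L2 norm)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(nodes, edges, iterations=20):
    """Return (authority, hub) weights for a directed graph.

    nodes: iterable of page identifiers
    edges: iterable of (source, target) links of the focused subgraph
    iterations: assumed fixed number of rounds instead of a convergence test
    """
    nodes = list(nodes)
    edges = list(edges)
    x = {v: 1.0 for v in nodes}  # authority weights
    y = {v: 1.0 for v in nodes}  # hub weights
    for _ in range(iterations):
        # I-operation: authority weight = sum of hub weights of predecessors.
        new_x = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_x[dst] += y[src]
        x = normalize(new_x)
        # O-operation: hub weight = sum of authority weights of successors.
        new_y = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_y[src] += x[dst]
        y = normalize(new_y)
    return x, y
```

For example, `hits(["h1", "h2", "a", "b"], [("h1", "a"), ("h2", "a"), ("h1", "b")])` ranks a above b as an authority and h1 above h2 as a hub.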

Computing the Output
- Does the algorithm converge?
- How good is the output?

Matrix Representation
- Adjacency matrix A
- Authority weights
- Hub weights
- Weights after t iterations
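The matrix form can be sketched as follows, assuming that A_ij = 1 marks a link from page i to page j and that each round applies the I-operation before the O-operation. The constants c and c' stand for the accumulated normalization factors and are introduced here only for illustration.

```latex
\begin{align*}
A_{ij} &= \begin{cases} 1 & \text{if page } i \text{ links to page } j \\ 0 & \text{otherwise} \end{cases} \\
x &= c_1 A^T y, \qquad y = c_2 A x \\
y^{(t)} &= c \, (A A^T)^t \, y^{(0)}, \qquad x^{(t)} = c' \, (A^T A)^{t-1} A^T y^{(0)}
\end{align*}
```

Under these assumptions the hub weights are obtained by repeated multiplication with A A^T and the authority weights by repeated multiplication with A^T A.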

When Does HITS Converge?
- M = A A^T is a symmetric matrix
- For all symmetric matrices
  - all eigenvalues are real
  - the eigenvectors can be chosen to be orthogonal
- There exists a representation (an eigendecomposition of M) such that the columns S_i are eigenvectors of M
- If the largest eigenvalue λ_1 is larger than the second eigenvalue λ_2, then HITS converges
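A sketch of why the eigenvalue gap implies convergence, under the assumption that the start vector y^(0) has a nonzero component a_1 in the direction of S_1. The coefficients a_i are notation introduced here. All eigenvalues are nonnegative because M = A A^T is positive semidefinite.

```latex
\begin{align*}
M S_i &= \lambda_i S_i, \qquad \lambda_1 > \lambda_2 \ge \dots \ge \lambda_n \ge 0 \\
y^{(0)} &= \sum_{i=1}^{n} a_i S_i \\
M^t y^{(0)} &= \sum_{i=1}^{n} a_i \lambda_i^t S_i
  = \lambda_1^t \Bigl( a_1 S_1 + \sum_{i=2}^{n} a_i \bigl( \tfrac{\lambda_i}{\lambda_1} \bigr)^t S_i \Bigr)
\end{align*}
```

Since every ratio λ_i/λ_1 with i ≥ 2 is smaller than 1, the second term vanishes for growing t, and the normalized vector converges to the principal eigenvector S_1.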

The Web Graph
- G_WWW
  - Static HTML pages are nodes
  - Links are directed edges
- Outdegree of a node: number of links leaving a web page
- Indegree of a node: number of links pointing to a web page
- Directed path from node u to node v
  - a series of web pages in which one follows links from page u to page v
- Undirected path (u = w_0, w_1, …, w_{m-1}, v = w_m) from page u to page v
  - for all i: there is a link from w_i to w_{i+1} or from w_{i+1} to w_i
- Strongly (weakly) connected subgraph
  - minimal node set including all nodes that have a directed (undirected) path from and to a reference node
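The last definition translates directly into code: the strongly connected set of a reference node is the intersection of what it can reach and what can reach it, and the weakly connected set is what it can reach when link directions are ignored. The sketch below uses hypothetical helper names and a plain edge list as input.

```python
def reachable(start, adjacency):
    """All nodes reachable from start by following the given adjacency."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def components_of(reference, edges):
    """Return (strong, weak) connected node sets of a reference node."""
    succ, pred, undirected = {}, {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)
        undirected.setdefault(u, set()).add(v)
        undirected.setdefault(v, set()).add(u)
    # Directed path from AND to the reference node.
    strong = reachable(reference, succ) & reachable(reference, pred)
    # Undirected path to the reference node.
    weak = reachable(reference, undirected)
    return strong, weak
```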

The Web-Graph (1999)

Distributions of Indegree and Outdegree
- In- and out-degree obey a power law
  - i.e. an in- or out-degree of i appears with probability ∝ 1/i^α
- According to experiments of
  - Kumar et al. 97: 40 million web pages
  - Barabasi et al. 99: domain *.nd.edu plus web pages within distance 3
  - Broder et al. 00: 204 million web pages (scans in May and Oct 1999)

Is the Web Graph a Random Graph?
- Random graph G_{n,p}
  - n nodes
  - Every directed edge occurs with probability p
- Is the web graph a random graph G_{n,p}?
- Expected in/out-degree of G_{n,p}: (n−1)p
  - The average degree of G_WWW is constant, so choose p such that (n−1)p is constant
  - Consider a web page w
    - Let X be the number of links pointing from w
    - Let X_i = 1 if the link (w,i) exists, and X_i = 0 otherwise
    - Then P[X_i = 1] = p and P[X_i = 0] = 1−p

The In/Out-Degree Distribution of the Random Graph
- What is the probability that at least k links appear?
1. Markov's inequality
  - This yields an upper bound on P[X ≥ k]
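A sketch of the bound, assuming the constant expected degree is written as c = (n−1)p (the symbol c is introduced here for illustration). Markov's inequality applies because X is nonnegative.

```latex
\begin{align*}
P[X \ge k] &\le \frac{E[X]}{k} = \frac{(n-1)\,p}{k} \\
P[X \ge k] &\le \frac{c}{k} \qquad \text{for } p = \frac{c}{n-1}
\end{align*}
```

This bound decreases only like 1/k, so on its own it does not rule out a power law.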

The In/Out-Degree Distribution of the Random Graph
- What is the probability that at least k links appear?
2. Chebyshev's inequality
  - Since the X_i are independent, the variance of X is the sum of the variances of the X_i
  - This yields an upper bound on the tail probability of X
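A sketch of the resulting bound, again assuming the notation c = (n−1)p for the constant expected degree, which is not taken from the slide.

```latex
\begin{align*}
V[X] &= \sum_{i} V[X_i] = (n-1)\,p\,(1-p) \\
P\bigl[\, |X - E[X]| \ge k \,\bigr] &\le \frac{V[X]}{k^2} \\
P[X \ge c + k] &\le \frac{c\,(1-p)}{k^2} \le \frac{c}{k^2} \qquad \text{for } p = \frac{c}{n-1}
\end{align*}
```

The bound now decreases quadratically in k, which is still compatible with a polynomial tail.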

The In/Out-Degree Distribution of the Random Graph
3. Chernoff bound
  - Applies to the sum X of independent Bernoulli variables X_i
  - This yields an upper bound on P[X ≥ k] for large k
  - So the probability decreases exponentially
  - Therefore: the degree distribution of a random graph does not obey a power law
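The exponential decay can be checked numerically. The sketch below computes the exact binomial tail and compares it with one common form of the Chernoff bound, P[X ≥ k] ≤ exp(−(k − μ)/3) for k ≥ 2μ with μ = E[X], and with a power law k^(−α). The chosen bound, the values n = 1001, c = 7 and α = 2.1, and all function names are assumptions for illustration; the slide's own formula may have a different form.

```python
import math


def binomial_tail(trials, p, k):
    """Exact P[X >= k] for X ~ Binomial(trials, p), summed in log space."""
    total = 0.0
    for j in range(k, trials + 1):
        log_pmf = (math.lgamma(trials + 1) - math.lgamma(j + 1)
                   - math.lgamma(trials - j + 1)
                   + j * math.log(p) + (trials - j) * math.log(1 - p))
        total += math.exp(log_pmf)
    return total


def chernoff_bound(mean, k):
    """exp(-(k - mean) / 3), an upper bound on P[X >= k] when k >= 2 * mean."""
    return math.exp(-(k - mean) / 3)


def power_law(k, alpha):
    """Unnormalized power-law probability k^(-alpha)."""
    return k ** -alpha


if __name__ == "__main__":
    n, c, alpha = 1001, 7.0, 2.1   # assumed example values
    p = c / (n - 1)                # keeps the expected degree constant
    for k in (14, 20, 40, 80):
        print(k, binomial_tail(n - 1, p, k), chernoff_bound(c, k), power_law(k, alpha))
```

For growing k the binomial tail falls far below the power-law value, which illustrates why G_{n,p} cannot reproduce the observed degree distribution.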

Pareto Distribution
- Discrete Pareto (power law) distribution for x ∈ {1,2,3,…}, with a constant normalization factor (given by the Riemann zeta function)
- Heavy-tail property
  - not all moments E[X^k] are defined
  - the expected value exists if and only if α > 2
  - the variance and E[X^2] exist if and only if α > 3
  - E[X^k] is defined if and only if α > k+1
- Density function of the continuous distribution, defined for x > x_0
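A small Python sketch, assuming the discrete distribution has the form P[X = x] = x^(−α)/ζ(α), which is consistent with the moment conditions listed above. The zeta function is approximated by a truncated sum, and the helper names are hypothetical. The partial sums show the heavy-tail property: for α = 2 the unnormalized first moment keeps growing, for α = 3 it settles.

```python
def zeta(alpha, terms=100_000):
    """Truncated sum approximating the Riemann zeta function for alpha > 1."""
    return sum(i ** -alpha for i in range(1, terms + 1))


def pareto_pmf(x, alpha, terms=100_000):
    """P[X = x] for the discrete Pareto distribution (assumed form)."""
    return x ** -alpha / zeta(alpha, terms)


def partial_moment(k, alpha, upto):
    """Unnormalized partial sum of x^k * x^(-alpha) for x = 1..upto."""
    return sum(x ** (k - alpha) for x in range(1, upto + 1))


if __name__ == "__main__":
    for alpha in (2.0, 3.0):
        # First moment (k = 1): diverges for alpha = 2, converges for alpha = 3.
        print(alpha, [partial_moment(1, alpha, m) for m in (100, 10_000)])
```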

Special Case: Zipf Distribution
- George Kingsley Zipf claimed that the n-th most frequent word occurs with frequency f(n) such that f(n)·n = c
- Zipf probability distribution for x ∈ {1,2,3,…} with constant factor c: only defined for finite sets, since the normalizing sum of 1/i for i = 1,…,n tends to infinity for growing n
- Zipf distributions refer to ranks
  - The Zipf exponent can be larger than 1, i.e. f(n) = c/n^α with exponent α
- Pareto distributions refer to absolute sizes
  - e.g. number of inhabitants
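For a finite set of n ranks the constant c is simply the reciprocal of the normalizing sum. A minimal sketch with a hypothetical helper name and an optional exponent parameter:

```python
def zipf_pmf(n_items, exponent=1.0):
    """Zipf probabilities for ranks 1..n_items, proportional to 1/rank^exponent."""
    weights = [rank ** -exponent for rank in range(1, n_items + 1)]
    total = sum(weights)  # harmonic number H_n when exponent == 1
    return [w / total for w in weights]
```

With exponent 1 the total is the harmonic number H_n, which grows without bound as n increases; this is why the distribution is only defined for finite sets in that case.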

Pareto Distribution (I)
- Examples of power laws (= Pareto distributions)
  - Pareto 1897: wealth/income in a population
  - Yule 1944: word frequency in languages
  - Zipf 1949: size of towns
  - Length of molecule chains
  - File length of UNIX files
  - …
  - Access density of web pages
  - Access density of a web surfer at a particular web page
  - …

City Size Distribution
[Figure: Zipf distribution. Source: "Scaling Laws and Urban Distributions", Denise Pumain, 2003]

Zipf's Law and the Internet (Lada A. Adamic, Bernardo A. Huberman, 2002)
[Figure: Pareto distribution]

Zipf's Law and the Internet (Lada A. Adamic, Bernardo A. Huberman, 2002)

Zipf's Law and the Internet (Lada A. Adamic, Bernardo A. Huberman, 2002)

Heavy-Tailed Probability Distributions in the World Wide Web (Mark Crovella, Murad Taqqu, Azer Bestavros, 1996)

Size of Connected Components
- The sizes of strongly and weakly connected components obey a power law
- A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. "Graph Structure in the Web: Experiments and Models." In Proc. of the 9th World Wide Web Conference, pp. 309-320. Amsterdam: Elsevier Science.
- Large weakly connected component with 91% of all web pages
- The largest strongly connected component contains 28% of all web pages
  - Diameter ≥ 28

The Web-Graph (1999)

Thanks for your attention
End of 9th lecture
Next lecture: Mo 20 Dec 2004, am, FU 116
Results and solutions of exam: Mo 13 Dec 2004, 1.15 pm, F0.530 or We 16 Dec 2004, 1.00 pm, E2.316