HEINZ NIXDORF INSTITUTE, University of Paderborn, Algorithms and Complexity
Christian Schindelhauer: Search Algorithms, Winter Semester 2004/2005, 06 Dec 2004


Search Algorithms, Winter Semester 2004/05: 8th Lecture, 06 Dec 2004 (Christian Schindelhauer)

Organization: Mid Term Exam (06 Dec 2004)

Mid Term Exam
• Wednesday, 8 Dec 2004, 1pm-1.45pm, F1.110
• 4 parts
  – 1. Short questions, testing general understanding
  – 2.-4. Show that you understand
      Text search algorithms
      Searching in compressed text
      The Pagerank algorithm
• If you have successfully presented an exercise:
  – Only the best 3 of 4 parts count
• If you fail, or if you receive a bad grade
  – then the oral exam at the end of the semester will cover the complete lecture
• If you are happy with your grade
  – this grade counts for half of the grade for the complete lecture
  – if you succeed in the oral exam (over the rest of the lecture)

Chapter III: Searching the Web (06 Dec 2004)

Searching the Web
• Introduction
• The Anatomy of a Search Engine
• Google's Pagerank algorithm
  – The Simple Algorithm
  – Periodicity and convergence
• Kleinberg's HITS algorithm
  – The algorithm
  – Convergence
• The Structure of the Web
  – Pareto distributions
  – Search in Pareto-distributed graphs

Simplified PageRank Algorithm
• Rank of a web page u: $R(u) \in [0,1]$
• Important pages hand their rank down to the pages they link to:
  $R(u) = c \sum_{v \in B_u} \frac{R(v)}{|F_v|}$
• c is a normalisation factor such that $\|R\|_1 = 1$, i.e. all page ranks sum to 1
• Predecessor nodes of u: $B_u$
• Successor nodes of u: $F_u$

The Simplified Pagerank Algorithm and an Example

Matrix Representation
• $R \leftarrow c\,M R$, where R is the vector $(R(1), R(2), \dots, R(n))^T$ and M denotes the following $n \times n$ matrix:
  $M_{uv} = 1/|F_v|$ if page v links to page u, and $M_{uv} = 0$ otherwise
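The iteration R ← c M R can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's code: it assumes the usual matrix entries (1/|F_v| in row u, column v when v links to u), and the three-page graph and the function names `build_matrix` and `simplified_pagerank` are made up for the example.

```python
def build_matrix(links, n):
    """Column v holds 1/|F_v| in every row u that page v links to (assumed definition)."""
    M = [[0.0] * n for _ in range(n)]
    for v, successors in links.items():
        for u in successors:
            M[u][v] = 1.0 / len(successors)
    return M


def simplified_pagerank(M, rounds):
    """Iterate R <- c * M * R, where c rescales R so that its entries sum to 1."""
    n = len(M)
    R = [1.0 / n] * n
    for _ in range(rounds):
        R = [sum(M[u][v] * R[v] for v in range(n)) for u in range(n)]
        total = sum(R)
        R = [r / total for r in R]  # normalisation factor c = 1 / total
    return R


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (every page has an outgoing link).
links = {0: [1, 2], 1: [2], 2: [0]}
M = build_matrix(links, 3)
ranks = simplified_pagerank(M, 100)
```

For this graph the ranks settle at 0.4, 0.2 and 0.4: page 1 receives only half of page 0's rank, while page 2 collects the other half plus all of page 1's.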

Stochastic Matrices
• Consider n discrete states and a sequence of random variables $X_1, X_2, \dots$ over this set of states
• The sequence $X_1, X_2, \dots$ is a Markov chain if
  $P[X_{t+1} = j \mid X_1 = i_1, \dots, X_t = i_t] = P[X_{t+1} = j \mid X_t = i_t]$
• A stochastic matrix M (also called a Markov matrix) is the transition matrix of a finite Markov chain:
  – The elements of M are real numbers in [0, 1]
  – Every column of M sums to 1
• Observation for the matrix M of the simplified Pagerank algorithm
  – M is stochastic if all nodes have at least one outgoing link
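The two conditions on a stochastic matrix can be checked mechanically. The sketch below uses the column-sum convention stated on the slide; the helper name `is_column_stochastic` and both example matrices are hypothetical.

```python
def is_column_stochastic(M, tol=1e-12):
    """True if all entries lie in [0, 1] and every column sums to 1."""
    n = len(M)
    entries_ok = all(0.0 <= M[i][j] <= 1.0 for i in range(n) for j in range(n))
    columns_ok = all(abs(sum(M[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return entries_ok and columns_ok


# Simplified Pagerank matrix of a graph in which every page has an outgoing link.
no_sink = [[0.0, 0.0, 1.0],
           [0.5, 0.0, 0.0],
           [0.5, 1.0, 0.0]]

# Same graph, but page 2 has lost its only link: its column is all zeros.
with_sink = [[0.0, 0.0, 0.0],
             [0.5, 0.0, 0.0],
             [0.5, 1.0, 0.0]]
```

`is_column_stochastic(no_sink)` holds, while `with_sink` fails the column test, which is the slide's observation read in the other direction: a page without outgoing links breaks stochasticity.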

The Random Surfer for Simplified Pagerank
• Consider the following algorithm
  – Start in a random web page according to a probability distribution
  – Repeat the following for t rounds
      If there is no link on this page, exit and produce no output
      Choose a link of the web page uniformly at random
      Follow that link and go to the linked web page
  – Output the web page
• Lemma: The probability that web page i is output by the random surfer after t rounds, started with probability distribution $x_1, \dots, x_n$, is given by the i-th entry of the output of the simplified Pagerank algorithm iterated for t rounds without normalization.
• The proof follows by applying the definition of Markov chains
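The lemma can be checked on a small example by computing the surfer's output probabilities in two ways: by following every possible path, and by iterating the unnormalised matrix step. This is only a sketch; the four-page graph (page 3 is a sink) and both function names are assumptions made for the illustration.

```python
def surfer_distribution(links, start, t):
    """Exact output probabilities of the random surfer, obtained by following every path."""
    out = [0.0] * len(start)

    def walk(page, prob, rounds_left):
        if rounds_left == 0:
            out[page] += prob  # the surfer outputs this page
            return
        successors = links.get(page, [])
        if not successors:
            return  # no link on this page: exit and produce no output
        for nxt in successors:
            walk(nxt, prob / len(successors), rounds_left - 1)

    for page, p in enumerate(start):
        if p > 0:
            walk(page, p, t)
    return out


def iterate_unnormalised(links, start, t):
    """t rounds of x <- M x for the simplified Pagerank matrix, without normalisation."""
    x = list(start)
    for _ in range(t):
        y = [0.0] * len(x)
        for v, successors in links.items():
            for u in successors:
                y[u] += x[v] / len(successors)
        x = y
    return x


links = {0: [1, 2], 1: [2, 3], 2: [0], 3: []}  # hypothetical graph, page 3 is a sink
start = [0.25, 0.25, 0.25, 0.25]
by_paths = surfer_distribution(links, start, 5)
by_matrix = iterate_unnormalised(links, start, 5)
```

Both vectors agree entry by entry, and their entries sum to less than 1 because every walk that reaches the sink produces no output.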

Eigenvalues of Stochastic Matrices
• Notation
  – The L1-norm of a vector x is defined as $\|x\|_1 = \sum_i |x_i|$
  – $x \ge 0$ if for all i: $x_i \ge 0$
  – $x \le 0$ if for all i: $x_i \le 0$
• Lemma: For every stochastic matrix M and every vector x we have
  – $\|M x\|_1 \le \|x\|_1$
  – $\|M x\|_1 = \|x\|_1$ if $x \ge 0$ or $x \le 0$
• Eigenvalues of M: $|\lambda_i| \le 1$
• Theorem: For every stochastic matrix M there is an eigenvector x with eigenvalue 1 such that $x \ge 0$ and $\|x\|_1 = 1$

Periodicity: Example

Necessary and Sufficient Conditions for Periodicity
• Theorem (necessary condition)
  – If the stochastic matrix M is periodic with periodicity $t \ge 2$, then for the graph G of M there exists a strongly connected subgraph S of at least two nodes such that every directed cycle within S has a length of the form $i \cdot t$ for a natural number i.
• Theorem (sufficient condition)
  – Let the graph consist of one strongly connected subgraph and
  – let $L_1, L_2, \dots, L_m$ be the lengths of all directed cycles of length at most n
  – Then M is non-periodic if and only if $\gcd(L_1, L_2, \dots, L_m) = 1$
• Notation:
  – $\gcd(L_1, L_2, \dots, L_m)$ = greatest common divisor of the numbers $L_1, L_2, \dots, L_m$
• Corollary
  – If the graph is strongly connected and there exists a cycle of length 1 (i.e. a loop), then M is non-periodic.
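The gcd condition can be evaluated on small graphs with boolean matrix powers. The sketch below takes the gcd over the lengths (at most n) of all closed walks; since every closed walk splits into directed cycles and every cycle is itself a closed walk, this equals the gcd of the cycle lengths. The function name `period` and the two ring graphs are hypothetical, and the graph is assumed to be strongly connected.

```python
from math import gcd


def period(adj):
    """gcd of the lengths (at most n) of all closed walks; adj[i][j] = 1 means i links to j."""
    n = len(adj)
    one_step = [[bool(adj[i][j]) for j in range(n)] for i in range(n)]
    walks = one_step  # walks[i][j]: is there a walk of the current length from i to j?
    g = 0
    for length in range(1, n + 1):
        if any(walks[i][i] for i in range(n)):
            g = gcd(g, length)
        # Extend every walk by one more edge (boolean matrix product).
        walks = [[any(walks[i][k] and one_step[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return g


# A directed ring 0 -> 1 -> 2 -> 3 -> 0: every cycle length is a multiple of 4.
ring = [[0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 0]]

# The same ring with a loop at node 0, as in the corollary.
ring_with_loop = [[1, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [1, 0, 0, 0]]
```

`period(ring)` is 4, so the ring is periodic, whereas the added loop brings the gcd down to 1 and makes the matrix non-periodic.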

Disadvantages of the Simplified Pagerank Algorithm
• The Web graph has sinks, i.e. pages without links
  – M is not a stochastic matrix
• The Web graph is periodic
  – Convergence is uncertain
• The Web graph is not strongly connected
  – Several convergence vectors are possible
• Rank sinks
  – Strongly connected subgraphs absorb all the weight of their predecessors
  – All predecessors pointing to such a web page lose their weight.

The (Non-Simplified) Pagerank Algorithm
• Add to every sink links to all web pages
• Choose a web page uniformly at random
  – With some probability q < 1 perform a step of the simplified Pagerank algorithm
  – With probability 1-q start again with the first step (and choose a random web page)
• Note: M is stochastic
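One way to write down the resulting matrix is sketched below: every entry starts at (1-q)/n for the random restart, a sink is treated as linking to all n pages, and each link of page v adds q/|F_v|. The function name `pagerank_matrix` and the four-page graph are assumptions for illustration; only q = 0.96 and n = 4 are taken from the lecture's example.

```python
def pagerank_matrix(links, n, q):
    """Column-stochastic Pagerank matrix: restart with probability 1-q, follow a link with q."""
    M = [[(1.0 - q) / n] * n for _ in range(n)]
    for v in range(n):
        # A sink (no outgoing links) is given links to all web pages.
        successors = links.get(v) or list(range(n))
        for u in successors:
            M[u][v] += q / len(successors)
    return M


q, n = 0.96, 4
links = {0: [1, 2], 1: [2], 2: [0], 3: []}  # hypothetical graph, page 3 is a sink
M = pagerank_matrix(links, n, q)
column_sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
smallest_entry = min(min(row) for row in M)
```

Every column sums to 1, so M is stochastic, and no entry falls below (1-q)/n, which is 0.01 for these parameters.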

The Effect of Filling Up the Matrix
Example: q = 0.96, n = 4, start vector $x = (1,0,0,0)^T$
[Figure: example graph on the nodes 1 to 4 and the product M x; the matrix entries are not preserved]
• Observation:
  – All entries of M are at least (1-q)/n (by definition; here 0.01)
  – All entries of M x are at least $\|x\|_1 (1-q)/n$ (here 0.01)
• Fact: For all vectors $x \ge 0$: $(M x)_i \ge \|x\|_1 (1-q)/n$
  – sum over all rows
• Fact: For all vectors $x \le 0$: $(M x)_i \le -\|x\|_1 (1-q)/n$

What Happens to Mixed Vectors
[Figure: worked matrix-vector example for a vector z with positive part p and negative part m; the matrix entries are not preserved]
• $\|z\|_1 = 2$, $\|p\|_1 = 1$, $\|m\|_1 = 1$
• For all i: $|(M p)_i| \ge 0.01$
• For all i: $|(M m)_i| \ge 0.01$
• $\|M z\|_1 \le 2 - 4 \cdot 0.01 = 1.96$

Converging to the First Eigenvector
• Observation:
  – Entries of M are at least (1-q)/n
  – Entries of M x are at least $\|x\|_1 (1-q)/n$ for $x \ge 0$
• Fact:
  – For all vectors $z \ge 0$: $(M z)_i \ge \|z\|_1 (1-q)/n$
  – For all vectors $z \le 0$: $(M z)_i \le -\|z\|_1 (1-q)/n$
• Lemma
  – For $x \ge 0$ or $x \le 0$: $\|M x\|_1 = \|x\|_1$
• Let $x \ge 0$ be an eigenvector with eigenvalue 1
• For an arbitrary vector $y \ge 0$ with $\|y\|_1 = \|x\|_1$ let $z = x - y$
  – Decompose $z = m + p$,
  – where $p \ge 0$ and $m \le 0$ and $\|p\|_1 + \|m\|_1 = \|z\|_1$
  – Since the entries of z sum to 0, $\|p\|_1 = \|m\|_1 = \|z\|_1 / 2$
• Then
  $\|x - M y\|_1 = \|M(x-y)\|_1 = \|M z\|_1 = \|M(p+m)\|_1 = \|M p + M m\|_1$
  $= \sum_i |(Mp)_i + (Mm)_i|$
  $\le \sum_i \left( |(Mp)_i| + |(Mm)_i| - \|z\|_1 (1-q)/n \right)$
  $= \|M p\|_1 + \|M m\|_1 - \|z\|_1 (1-q)$
  $= \|p\|_1 + \|m\|_1 - \|z\|_1 (1-q)$
  $= \|z\|_1 - \|z\|_1 (1-q) = q \, \|x-y\|_1$
• After each iteration the distance between y and x shrinks to at most q times its previous value
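The contraction can be observed numerically. The sketch below applies the Pagerank matrix to two probability vectors and records their L1 distance before and after each step; the graph, the value q = 0.85, the two start vectors and all helper names are assumptions chosen for the illustration. Both vectors have L1 norm 1, so their difference has entries summing to 0, which is the situation of the derivation above.

```python
def pagerank_matrix(links, n, q):
    """Pagerank matrix: (1-q)/n everywhere, plus q/|F_v| for each link; sinks link to all pages."""
    M = [[(1.0 - q) / n] * n for _ in range(n)]
    for v in range(n):
        successors = links.get(v) or list(range(n))
        for u in successors:
            M[u][v] += q / len(successors)
    return M


def apply(M, x):
    """Matrix-vector product M x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def l1_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


q = 0.85
M = pagerank_matrix({0: [1], 1: [2], 2: [0, 1], 3: []}, 4, q)
x = [1.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 0.5, 0.5]
steps = []  # pairs (distance before the step, distance after the step)
for _ in range(10):
    before = l1_distance(x, y)
    x, y = apply(M, x), apply(M, y)
    steps.append((before, l1_distance(x, y)))
```

In every recorded pair the distance after the step is at most q times the distance before it, so after ten steps the initial distance of 2 has shrunk to at most 2·q^10.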

Properties of the Pagerank Algorithm
• Lemma: There is a unique (real) eigenvector with eigenvalue 1 and $\|x\|_1 = 1$ for the matrix of the Pagerank algorithm.
  – Proof: Let x be an eigenvector with eigenvalue 1. The lemma follows because for all $y \ne x$ with $y \ge 0$ and $\|y\|_1 = \|x\|_1$: $\|x - M y\|_1 \le q \, \|x-y\|_1 < \|x-y\|_1$
• Lemma: Let x be the (unique real) eigenvector, M the matrix of the Pagerank algorithm, and q the probability parameter of Pagerank. Then for all real vectors y with $\sum_i y_i = \sum_i x_i$: $\|M (x-y)\|_1 \le q \, \|x-y\|_1$
• Theorem: PageRank converges to a $(1+\epsilon)$-approximation of the unique eigenvector in at most $(\ln \epsilon - \ln n) / \ln q$ iterations.
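The iteration bound of the theorem is easy to evaluate. Since ln q is negative, k ≥ (ln ε - ln n) / ln q is the same condition as q^k · n ≤ ε. The sketch below rounds the bound up to a whole number of iterations; the function name `iteration_bound` and the parameter values are hypothetical.

```python
from math import ceil, log


def iteration_bound(eps, n, q):
    """Smallest integer k with k >= (ln eps - ln n) / ln q, for 0 < q < 1."""
    return ceil((log(eps) - log(n)) / log(q))


eps, n, q = 1e-6, 4, 0.85  # illustrative values only
k = iteration_bound(eps, n, q)
```

A smaller q gives a smaller bound, which matches the remark that Pagerank converges faster when q is small.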

Discussion
• q = probability of performing a step of the simplified Pagerank algorithm
• If q is small
  – Pagerank converges faster
  – Only short paths are considered
  – Less structural information is available
  – The page ranks become more uniform
• If q is large
  – Pagerank (possibly) converges more slowly
  – Longer paths play a role
  – Web sinks collect more weight
      Therefore Google deletes web sinks from the Web graph
• Problem:
  – How to choose q?
  – Is it reasonable to give every web page a page rank that is independent of the search pattern?

Kleinberg's HITS Algorithm (HyperText Induced Search)
• Jon Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM 46(5) (1999)
• Idea of the algorithm
  – Pages can serve as authorities (as in Pagerank) or as hubs
  – Hub pages collect interesting links to authorities (= relevant pages)
      E.g. railway fans collect links to railway companies
  – Authorities are the targets of hub pages
• Mutually reinforcing relationship
[Figure: hubs pointing to authorities]

Constructing a Focused Subgraph
• For a search pattern $\sigma$ choose a set of pages $S_\sigma$ such that
  1. $S_\sigma$ is relatively small.
  2. $S_\sigma$ is rich in relevant pages.
  3. $S_\sigma$ contains most (or many) of the strongest authorities.
• Start with the output of a standard text-based search engine
• Enlarge this set of pages by the predecessors and the successors of these pages (w.r.t. links)
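The construction can be sketched as a single expansion step over a link table. Everything here is illustrative: the root set would come from a text-based search engine, which is replaced by a hand-written set, and the function name `focused_subgraph` and the page names are made up.

```python
def focused_subgraph(root, links):
    """Expand the root set by its successors and predecessors and keep the links inside.

    root:  set of pages returned by the text search (assumed given)
    links: dict mapping each page to the set of pages it links to
    """
    root = set(root)
    pages = set(root)
    for page in root:
        pages |= links.get(page, set())  # successors of root pages
    for page, targets in links.items():
        if targets & root:               # page links into the root set: a predecessor
            pages.add(page)
    edges = {(u, v) for u in pages for v in links.get(u, ()) if v in pages}
    return pages, edges


links = {"a": {"b"}, "b": {"c"}, "c": set(), "d": {"a"}, "e": {"f"}, "f": set()}
pages, edges = focused_subgraph({"a"}, links)
```

Starting from the single root page `a`, the subgraph contains `a`, its successor `b` and its predecessor `d`; page `c` is two links away and stays outside.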

Edge Selection
• Offset the effect of links that serve a purely navigational function
• Types of links
  – transverse: between pages with different domain names
  – intrinsic: between pages with the same domain name
• Intrinsic links very often exist purely for navigation
  – they give much less information than transverse links about the authority of the pages they point to
  – therefore delete all intrinsic links from the focused subgraph
• Other simple heuristics
  – Suppose a large number of pages from a single domain all point to a single page p.
  – This often corresponds to a mass advertisement, for example the phrase "This site designed by..." and a corresponding link at the bottom of each page in a given domain.
  – To eliminate this phenomenon, allow only a maximum number of links from one domain pointing to the same page
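Both heuristics fit into one pass over the edge list. In this sketch the "domain name" of a page is taken to be the host name of its URL, and the cap `max_per_domain` is a free parameter whose value is not given on the slide; the function name `select_edges` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def select_edges(edges, max_per_domain=4):
    """Drop intrinsic links and cap the number of links from one domain to the same page."""
    kept = []
    counts = {}  # (source domain, target page) -> number of links kept so far
    for src, dst in edges:
        src_domain = urlparse(src).hostname
        dst_domain = urlparse(dst).hostname
        if src_domain == dst_domain:
            continue  # intrinsic link: same domain name, assumed navigational
        key = (src_domain, dst)
        if counts.get(key, 0) < max_per_domain:
            kept.append((src, dst))
            counts[key] = counts.get(key, 0) + 1
    return kept


edges = [
    ("http://a.example/1", "http://a.example/2"),  # intrinsic
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://b.example/"),
    ("http://a.example/3", "http://b.example/"),   # third link from a.example to the same page
    ("http://c.example/", "http://b.example/"),
]
kept = select_edges(edges, max_per_domain=2)
```

With a cap of 2, the intrinsic link and the third link from `a.example` are dropped and three edges remain.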

Mutually Reinforcing Relationship
• Weights
  – Authority weight of a web page i: $x_i$
  – Hub weight of a web page i: $y_i$
• Authority indicated by hub pages (I-operation): $x_i \leftarrow c_1 \sum_{j:\, j \to i} y_j$
• Hub pages indicated by authority pages (O-operation): $y_i \leftarrow c_2 \sum_{j:\, i \to j} x_j$
  – $c_1, c_2$ are normalization factors w.r.t. the L2-norm

The HITS Algorithm
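A minimal sketch of the iteration, assuming the usual form of the algorithm: start with all weights equal to 1, apply the I-operation, then the O-operation, and normalise both weight vectors in the L2-norm. The slide's own pseudocode is not preserved, so the function names, the fixed number of rounds and the tiny example graph are assumptions.

```python
from math import sqrt


def normalise(weights):
    """Scale a weight dictionary to L2-norm 1 (left unchanged if all weights are 0)."""
    norm = sqrt(sum(w * w for w in weights.values()))
    return {k: w / norm for k, w in weights.items()} if norm > 0 else dict(weights)


def hits(nodes, edges, rounds=50):
    """Return (authority, hub) weights after the given number of I/O rounds."""
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(rounds):
        # I-operation: authority weight = sum of hub weights of the pages linking in.
        new_auth = {v: 0.0 for v in nodes}
        for u, v in edges:
            new_auth[v] += hub[u]
        # O-operation: hub weight = sum of authority weights of the pages linked to.
        new_hub = {v: 0.0 for v in nodes}
        for u, v in edges:
            new_hub[u] += new_auth[v]
        auth, hub = normalise(new_auth), normalise(new_hub)
    return auth, hub


# Hypothetical graph: h1 links to a1 and a2, h2 links only to a1.
nodes = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
auth, hub = hits(nodes, edges)
```

Here `a1` ends up as the stronger authority because both hubs point to it, and `h1` as the stronger hub because it points to both authorities; the ratio of the two authority weights approaches (√5 - 1)/2.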

Computing the Output
• Does the algorithm converge?
• How good is the output?

Matrix Representation
• Adjacency matrix A
• Authorities
• Hub weights
• After t iterations
[Formulas for these four items are not preserved]

When does HITS converge?
• $M = A A^T$ is a symmetric matrix
• For all symmetric matrices
  – all eigenvalues are real
  – the eigenvectors can be chosen pairwise orthogonal
• There exists a representation $M = S \Lambda S^T$, with $\Lambda$ the diagonal matrix of the eigenvalues, such that for the columns $S_i$ of S: $M S_i = \lambda_i S_i$
• If the largest eigenvalue $\lambda_1$ is strictly larger than the second eigenvalue $\lambda_2$, then HITS converges
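The convergence claim follows from a standard power-iteration argument, sketched here as a worked derivation. It assumes that the eigenvectors $S_1, \dots, S_n$ are orthonormal, that the eigenvalues are ordered $\lambda_1 > \lambda_2 \ge \dots \ge \lambda_n \ge 0$ (they are non-negative because $M = A A^T$ is positive semidefinite), and that the start vector $v_0$ is not orthogonal to $S_1$; the symbols $v_0$ and $\alpha_i$ are introduced only for this sketch.

```latex
% Assumptions: S_1, ..., S_n orthonormal eigenvectors of M = A A^T,
% eigenvalues \lambda_1 > \lambda_2 \ge ... \ge \lambda_n \ge 0, and \alpha_1 \ne 0.
\begin{align*}
v_0 &= \sum_{i=1}^{n} \alpha_i S_i, \qquad \alpha_i = S_i^T v_0 \\
M^t v_0 &= \sum_{i=1}^{n} \alpha_i \lambda_i^t S_i
         = \lambda_1^t \Big( \alpha_1 S_1
           + \sum_{i=2}^{n} \alpha_i \Big( \frac{\lambda_i}{\lambda_1} \Big)^{t} S_i \Big) \\
\frac{M^t v_0}{\| M^t v_0 \|_2} &\xrightarrow[t \to \infty]{} \frac{\alpha_1}{|\alpha_1|} \, S_1
\end{align*}
```

Every ratio $\lambda_i / \lambda_1$ with $i \ge 2$ is strictly below 1, so all terms except the first vanish and the normalised weights converge to the principal eigenvector; the gap between $\lambda_1$ and $\lambda_2$ determines how fast.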

The Web-Graph (1999) (next time)

Thanks for your attention
End of 8th lecture
Next lecture: Mo 13 Dec 2004, am, FU 116
Midterm exam: We 8 Dec 2004, 1pm, F1.110
Next exercise class: Mo 13 Dec 2004, 1.15 pm, F0.530 or We 16 Dec 2004, 1.00 pm, E2.316