Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 

Slides:



Advertisements
Similar presentations
Lecture 18: Link analysis
Advertisements

Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Graphs, Node importance, Link Analysis Ranking, Random walks
Link Analysis: PageRank
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
The PageRank Citation Ranking “Bringing Order to the Web”
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
15-853Page :Algorithms in the Real World Indexing and Searching III (well actually II) – Link Analysis – Near duplicate removal.
Link Analysis, PageRank and Search Engines on the Web
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Link Structure and Web Mining Shuying Wang
The Web as Network Networked Life CSE 112 Spring 2006 Prof. Michael Kearns.
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
PageRank Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 27, 2014.
Information Retrieval Link-based Ranking. Ranking is crucial… “.. From our experimental data, we could observe that the top 20% of the pages with the.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
HITS – Hubs and Authorities - Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Presented By: - Chandrika B N
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Google’s Billion Dollar Eigenvector Gerald Kruse, PhD. John ‘54 and Irene ‘58 Dale Professor of MA, CS and I T Interim Assistant Provost Juniata.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan
Using Hyperlink structure information for web search.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Overview of Web Ranking Algorithms: HITS and PageRank
CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS.
Ch 14. Link Analysis Padmini Srinivasan Computer Science Department
Ranking Link-based Ranking (2° generation) Reading 21.
COMP4210 Information Retrieval and Search Engines Lecture 9: Link Analysis.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IX: 2014/05/05.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
1 CS 430: Information Discovery Lecture 5 Ranking.
Ljiljana Rajačić. Page Rank Web as a directed graph  Nodes: Web pages  Edges: Hyperlinks 2 / 25 Ljiljana Rajačić.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
The PageRank Citation Ranking: Bringing Order to the Web
The PageRank Citation Ranking: Bringing Order to the Web
Quality of a search engine
HITS Hypertext-Induced Topic Selection
Search Engines and Link Analysis on the Web
Link analysis and Page Rank Algorithm
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
Link-Based Ranking Seminar Social Media Mining University UC3M
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Lecture 22 SVD, Eigenvector, and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CS 440 Database Management Systems
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

Link Analysis Rong Jin

Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge  What does it look likes?  Web is a small world  The diameter of the web is 19 e.g. the average number of clicks from one web site to another is 19

Bowtie Structure Strongly Connected Component Broder et al., 2001

Bowtie Structure Sites that link towards the ‘center’ of the web Broder et al., 2001

Bowtie Structure Sites that link from the ‘center’ of the web Broder et al., 2001

Inlinks and Outlinks  Both degrees of incoming and outgoing links follow power law Broder et al., 2001

Connected Components  Power law for weakly connected component (i.e., wcc) and strongly connected component (i.e., scc) Broder et al., 2001

Why Link Analysis Useful for IR ?  Assumption 1: A hyperlink is a quality signal. A hyperlink reflects the author perceived relevance.  Assumption 2: The anchor text describes the target page d2.

Anchor Text for IR  Example: Query IBM  You could match IBM’s copyright page, IBM’s wikipedia, and spam pages But not IBM home page if it is mostly graphical

Anchor Text for IR  Example: Query IBM  You could match IBM’s copyright page, IBM’s wikipedia, and spam pages But not IBM home page if it is mostly graphical  Searching on anchor text is better for the query IBM. Represent each page by all the anchor text pointing to it. In this representation, the page with the most occurrences of IBM is

Example Anchor Text for IBM  Pointing to “IBM acquires Webify” “New IBM optical chip” “IBM faculty award recipients”  Anchor text is often a better description of a page’s content than the page itself.  weight anchor texts more than the page text

Google Bomb  Example: “who is a failure”  A Google bomb is a search with “bad” results due to maliciously manipulated anchor text. Because Anchor text is weighted more than page text.  Google introduced a new weighting function in January 2007 that fixed many google bombs.

PageRank: Citation Analysis  Citation analysis: analysis of citations in the scientific literature  Example “Miller (2001) has shown that physical activity alters the metabolism of cells.” “Miller (2001)” is a hyperlink link two scientific articles.  Application of these “hyperlinks” Measure the similarity of two articles  co-citation Measure the impact of scientific articles

PageRank: Citation Analysis  Citation frequency can be used to measure the impact of an article. Each article gets one vote. Not a very accurate measure dada dbdb Impact(d a ) = 4 Impact(d b ) = 3 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6

PageRank: Citation Analysis  Citation frequency can be used to measure the impact of an article. Each article gets one vote. Not a very accurate measure  Better measure: weighted citation frequency / citation rank An article’s vote is weighted according to its citation impact. dada dbdb Impact(d a ) = 6 Impact(d b ) = 8 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 Impact(d 1 ) = 1 Impact(d 2 ) = 1 Impact(d 3 ) = 1 Impact(d 4 ) = 4 Impact(d 5 ) = 1 Impact(d 6 ) = 3

PageRank: Citation Analysis  Citation frequency can be used to measure the impact of an article. Each article gets one vote. Not a very accurate measure  Better measure: weighted citation frequency / citation rank An article’s vote is weighted according to its citation impact. dada dbdb Impact(d a ) = 6 Impact(d b ) = 8 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 Impact(d 1 ) = 1 Impact(d 2 ) = 1 Impact(d 3 ) = 1 Impact(d 4 ) = 4 Impact(d 5 ) = 1 Impact(d 6 ) = 3 Circular Definition

PageRank: Citation Analysis  Citation frequency can be used to measure the impact of an article. Each article gets one vote. Not a very accurate measure  Better measure: weighted citation frequency / citation rank An article’s vote is weighted according to its citation impact.  Circular definition No: can be formalized in a well-defined way. This is basically PageRank.

Link Analysis for IR  A simple approach of using links for ranking web pages for a given query Assumption: more inlink  higher popularity First, retrieve all pages for a given query Select the top K (e.g., 100) web pages Reorder the top K web pages by their in-links  But, this is prone to spam.

Random Walk Model Consider a random walk through the Web graph ?? ? ?? Start at a random page At each step, go out of the current page along one of the links on that page, with equal probability

Random Walk Model Consider a random walk through the Web graph Start at a random page At each step, go out of the current page along one of the links on that page, with equal probability

Random Walk Model Consider a random walk through the Web graph ?? ? ? Start at a random page At each step, go out of the current page along one of the links on that page, with equal probability

Random Walk Model Consider a random walk through the Web graph Impact = Long-term rate: What is portion of time that the surfer will spend on each site as time goes to infinity? Start at a random page At each step, go out of the current page along one of the links on that page, with equal probability

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 r i : long term rate for the ith web site

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 r i : long term rate for the ith web site

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 r i : long term rate for the ith web site

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 r i : long term rate for the ith web site

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 r i : long term rate for the ith web site

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 r i : long term rate for the ith web site

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 B r =

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 B r =

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 This is an eigenvector problem: r is the principal eigenvector of B (i.e. eigenvector with the large eigenvalue) B r =

Page Rank r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 How to compute this transition matrix?

PageRank adjancy matrix row normalization transition matrix

Page Rank

Observation # of inlinks from high ranked page

Adding Self Loop  Allow surfer to decide to stay on the same place B =

Dead Ends  The web is full of dead ends.  Random walk can get stuck in dead ends.  If there are dead ends, long-term visit rates are not well-defined (or non-sensical). What happens to the long-term rate if the graph has dead ends?

PageRank: Teleporting  At a dead end, jump to a random web page  At any non-dead end, with probability 10%, jump to a random web page With remaining probability (90%), go out on a random hyperlink 10% is a parameter.

PageRank: Implementation  Compute PageRank Given graph of links, build matrix B Apply self-loop and teleportation Compute the page rank scores by the iterative procedure

Page Rank: Implementation r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7  Iterative procedure  Initialize  Update  Repeat till it converges Does not scale to the size of web

 Iterative procedure  Initialize  Update  Repeat till it converges Page Rank: Implementation r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 0 0

 Iterative procedure  Initialize  Update  Repeat till it converges Page Rank: Implementation r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 r7r7 0

 How to utilize PageRank for information retrieval Apply the standard IR approach to identify the top K (i.e., K = 100) ranked we pages Re-rank the top ranked pages by their PageRank Query Document Collection Lucene D1 D2 D PR D3 D2 D1

How Important is PageRank?  Frequent claim: PageRank is the most important component of web ranking.  The reality: There are several components that are at least as important (e.g., anchor text, phrases, proximity, tiered indexes...) Rumor has it that PageRank in its original form (as presented here) has a negligible impact on ranking!  However, variants of a page’s PageRank are still an essential part of ranking.  Adressing link spam is difficult and crucial.

HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v  V in a subgraph of interest: A site is very authoritative if it receives many citations. Citation from important sites weight more than citations from less-important sites Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites a(v) - the authority of v h(v) - the hubness of v

Authority and Hubness a(1) = h(2) + h(3) + h(4)

Authority and Hubness a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)

Authority and Hubness: Version 1 HubsAuthorities(G) 1 1  [1,…,1] Є R 2 a  h  1 3 t  1 4 repeat 5 for each v in V 6 do a (v)  Σ h (w) 7 h (v)  Σ a (w) 8t  t until || a – a || + || h – h || < ε 10 return (a, h ) 00 t t t t tt t -1 w Є pa[v] |V|

Authority and Hubness: Version 1 HubsAuthorities(G) 1 1  [1,…,1]  R 2 a  h  1 3 t  1 4 repeat 5 for each v in V 6 do a (v)  Σ h (w) 7 h (v)  Σ a (w) 8t  t until || a – a || + || h – h || < ε 10 return (a, h ) 00 t t t t tt t -1 w  pa[v] w  ch[v] |V|

Authority and Hubness: Version 1 HubsAuthorities(G) 1 1  [1,…,1]  R 2 a  h  1 3 t  1 4 repeat 5 for each v in V 6 do a (v)  Σ h (w) 7 h (v)  Σ a (w) 8t  t until || a – a || + || h – h || < ε 10 return (a, h ) 00 t t t t tt t -1 w  pa[v] w  ch[v] |V| Problems ?

Authority and Hubness: Version 1 HubsAuthorities(G) 1 1  [1,…,1]  R 2 a  h  1 3 t  1 4 repeat 5 for each v in V 6 do a (v)  Σ h (w) 7 h (v)  Σ a (w) 8t  t until || a – a || + || h – h || < ε 10 return (a, h ) 00 t t t t tt t -1 w  pa[v] w  ch[v] |V| Problems ?

Authority and Hubness: Version 2 + Normalization HubsAuthorities(G) 1 1  [1,…,1] Є R 2 a  h  1 3 t  1 4 repeat 5 for each v in V 6 do a (v)  Σ h (w) 7 h (v)  Σ a (w) 8 a  a / || a || 9 h  h / || h || 10 t  t until || a – a || + || h – h || < ε 12 return (a, h ) 00 t t t t t t t t t t tt t -1 w Є pa[v] |V|

HITS Example Results Authority Hubness Authority and hubness weights

HITS Example Results Authority Hubness Authority and hubness weights h(1) = a(5) + a(6) + a(7)

HITS Example Results Authority Hubness Authority and hubness weights h(1) = a(5) + a(6) + a(7)

HITS Example Results Authority Hubness Authority and hubness weights h(1) = a(5) + a(6) + a(7)

HITS Example Results Authority Hubness Authority and hubness weights a(1) = h(2) + h(3) + h(4)

HITS Example Results Authority Hubness Authority and hubness weights a(1) = h(2) + h(3) + h(4)

HITS Example Results Authority Hubness Authority and hubness weights a(1) = h(2) + h(3) + h(4)

Authority and Hubness  Authority score Not only depends on the number of incoming links But also the ‘quality’ (e.g., hubness) of the incoming links  Hubness score Not only depends on the number of outgoing links But also the ‘quality’ (e.g., hubness) of the outgoing links

Authority and Hubness Convergence  Question Will the above iteration converge? If we start with a random assignment to both hub and authority values, will the above iteration converge? Will the converged authority and hub values depend on the initial assignments?

Convergence  Column vector a: a i is the authority score for the i-th site  Column vector h: h i is the hub score for the i-th site  Matrix M: M =

Authority and Hub  Vector a: a i is the authority score for the i-th site  Vector h: h i is the hub score for the i-th site  Matrix M: Recursive dependency : a(v)  Σ h(w) h(v)  Σ a(w) w Є pa[v] w Є ch[v]

Authority and Hub  Column vector a: a i is the authority score for the i-th site  Column vector h: h i is the hub score for the i-th site  Matrix M: Recursive dependency : a(v)  Σ h(w) h(v)  Σ a(w) w Є pa[v] w Є ch[v]

Authority and Hub  Column vector a: a i is the authority score for the i-th site  Column vector h: h i is the hub score for the i-th site  Matrix M: Recursive dependency : a(v)  Σ h(w) h(v)  Σ a(w) w Є pa[v] w Є ch[v] Normalization Procedure

Authority and Hub  Apply SVD to matrix M  Authority scores  left principal singular vector of M  Hub scores  right principal singular vector of M