Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine Link Analysis.

Slides:



Advertisements
Similar presentations
Link Analysis and Anti-Spam Tie-Yan Liu Microsoft Research Asia.
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Information Networks Link Analysis Ranking Lecture 8.
Link Analysis. 2 Objectives To review common approaches to link analysis To calculate the popularity of a site based on link analysis To model human judgments.
Graphs, Node importance, Link Analysis Ranking, Random walks
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Link Analysis Ranking. How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
ICS 278: Data Mining Lecture 15: Mining Web Link Structure
Authoritative Sources in a Hyperlinked Environment By: Jon M. Kleinberg Presented by: Yemin Shi CS-572 June
1 Conducting a Web Search: Problems & Algorithms Anna Rumshisky.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
15-853Page :Algorithms in the Real World Indexing and Searching III (well actually II) – Link Analysis – Near duplicate removal.
Link Analysis, PageRank and Search Engines on the Web
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Link Structure and Web Mining Shuying Wang
Affinity Rank Yi Liu, Benyu Zhang, Zheng Chen MSRA.
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Presented by Zheng Zhao Originally designed by Soumya Sanyal
Computer Science 1 Web as a graph Anna Karpovsky.
CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure.
Information Retrieval Link-based Ranking. Ranking is crucial… “.. From our experimental data, we could observe that the top 20% of the pages with the.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
PRESENTED BY ASHISH CHAWLA AND VINIT ASHER The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page and Sergey Brin, Stanford University.
The PageRank Citation Ranking: Bringing Order to the Web Larry Page etc. Stanford University, Technical Report 1998 Presented by: Ratiya Komalarachun.
Stochastic Approach for Link Structure Analysis (SALSA) Presented by Adam Simkins.
Random Walks on Graphs Lecture 18 Monojit Choudhury
R OBERTO B ATTITI, M AURO B RUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Feb 2014.
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn.
Piyush Kumar (Lecture 2: PageRank) Welcome to COT5405.
Using Hyperlink structure information for web search.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
CS 533 Information Retrieval Systems.  Introduction  Connectivity Analysis  Kleinberg’s Algorithm  Problems Encountered  Improved Connectivity Analysis.
Overview of Web Ranking Algorithms: HITS and PageRank
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
1 CS 430: Information Discovery Lecture 5 Ranking.
Block-level Link Analysis Presented by Lan Nie 11/08/2005, Lehigh University.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
IR Theory: Web Information Retrieval. Web IRFusion IR Search Engine 2.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Roberto Battiti, Mauro Brunato
The PageRank Citation Ranking: Bringing Order to the Web
HITS Hypertext-Induced Topic Selection
Search Engines and Link Analysis on the Web
Link-Based Ranking Seminar Social Media Mining University UC3M
CSE 454 Advanced Internet Systems University of Washington
Lecture 22 SVD, Eigenvector, and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Link Structure Analysis
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine Link Analysis

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 2 Objectives To review common approaches to link analysis To calculate the popularity of a site based on link analysis To model human judgments indirectly

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 3 Outline 1. Early Approaches to Link Analysis 2. Hubs and Authorities: HITS 3. Page Rank 4. Stability 5. Probabilistic Link Analysis 6. Limitation of Link Analysis

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 4 Early Approaches Hyperlinks contain information about the human judgment of a site The more incoming links to a site, the more it is judged important Basic Assumptions Bray 1996 The visibility of a site is measured by the number of other sites pointing to it The luminosity of a site is measured by the number of other sites to which it points  Limitation: failure to capture the relative importance of different parents (children) sites

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 5 Early Approaches To calculate the score S of a document at vertex v S(v) = s(v) + 1 | ch[v] | Σ S(w) w Є |ch(v)| v: a vertex in the hypertext graph G = (V, E) S(v): the global score s(v): the score if the document is isolated ch(v): children of the document at vertex v Limitation: - Require G to be a directed acyclic graph (DAG) - If v has a single link to w, S(v) > S(w) - If v has a long path to w and s(v) S (w)  unreasonable Mark 1988

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 6 Early Approaches Marchiori (1997) S(v) = s(v) + h(v) - S(v): overall information - s(v): textual information - h(v): hyper information Hyper information should complement textual information to obtain the overall information h(v) = Σ F S(w) w Є |ch[v]| r(v, w) - F: a fading constant, F Є (0, 1) - r(v, w): the rank of w after sorting the children of v by S(w)  a remedy of the previous approach (Mark 1988)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 7 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very authoritative if it receives many citations. Citation from important sites weight more than citations from less-important sites Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites a(v) - the authority of v h(v) - the hubness of v

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 8 Authority and Hubness a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 9 Authority and Hubness Convergence Recursive dependency : a(v)  Σ h(w) h(v)  Σ a(w) Using Linear Algebra, we can prove: w Є pa[v] w Є ch[v] a(v) and h(v) converge

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 10 HITS Example {1, 2, 3, 4} - nodes relevant to the topic Expand the root set R to include all the children and a fixed number of parents of nodes in R  A new set S (base subgraph)  Start with a root set R {1, 2, 3, 4} Find a base subgraph:

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 11 HITS Example BaseSubgraph( R, d) 1.S  r 2.for each v in R 3.do S  S U ch[v] 4. P  pa[v] 5. if |P| > d 6. then P  arbitrary subset of P having size d 7. S  S U P 8.return S

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 12 HITS Example HubsAuthorities(G) 1 1  [1,…,1] Є R 2 a  h  1 3 t  1 4 repeat 5 for each v in V 6 do a (v)  Σ h (w) 7 h (v)  Σ a (w) 8 a  a / || a || 9 h  h / || h || 10 t  t until || a – a || + || h – h || < ε 12 return (a, h ) Hubs and authorities: two n-dimensional a and h 00 t t t t t t t t t t tt t -1 w Є pa[v] |V|

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 13 HITS Example Results Authority Hubness Authority and hubness weights

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 14 HITS Improvements Brarat and Henzinger (1998) HITS problems 1)The document can contain many identical links to the same document in another host 2)Links are generated automatically (e.g. messages posted on newsgroups) Solutions 1)Assign weight to identical multiple edges, which are inversely proportional to their multiplicity 2)Prune irrelevant nodes or regulating the influence of a node with a relevance weight

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 15 PageRank Introduced by Page et al (1998) –The weight is assigned by the rank of parents Difference with HITS –HITS takes Hubness & Authority weights –The page rank is proportional to its parents’ rank, but inversely proportional to its parents’ outdegree

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 16 Matrix Notation Adjacent Matrix A = *

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 17 Matrix Notation r = α B r = M r α : eigenvalue r : eigenvector of B A x = λ x | A - λI | x = 0 B = Finding Pagerank  to find eigenvector of B with an associated eigenvalue α

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 18 Matrix Notation PageRank: eigenvector of P relative to max eigenvalue B = P D P -1 D: diagonal matrix of eigenvalues {λ 1, … λ n } P: regular matrix that consists of eigenvectors PageRank r 1 = normalized

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 19 Matrix Notation Confirm the result # of inlinks from high ranked page hard to explain about 5&2, 6&7 Interesting Topic How do you create your homepage highly ranked?

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 20 Markov Chain Notation Random surfer model –Description of a random walk through the Web graph –Interpreted as a transition matrix with asymptotic probability that a surfer is currently browsing that page Does it converge to some sensible solution (as t  oo) regardless of the initial ranks ? r t = M r t-1 M: transition matrix for a first-order Markov chain (stochastic)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 21 Problem “Rank Sink” Problem –In general, many Web pages have no inlinks/outlinks –It results in dangling edges in the graph E.g. no parent  rank 0 M T converges to a matrix whose last column is all zero no children  no solution M T converges to zero matrix

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 22 Modification Surfer will restart browsing by picking a new Web page at random M = ( B + E ) E : escape matrix M : stochastic matrix Still problem? –It is not guaranteed that M is primitive –If M is stochastic and primitive, PageRank converges to corresponding stationary distribution of M

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 23 PageRank Algorithm * Page et al, 1998

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 24 Distribution of the Mixture Model The probability distribution that results from combining the Markovian random walk distribution & the static rank source distribution r = εe + (1- ε)x ε: probability of selecting non-linked page PageRank Now, transition matrix [εH + (1- ε)M] is primitive and stochastic r t converges to the dominant eigenvector

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 25 Stability Whether the link analysis algorithms based on eigenvectors are stable in the sense that results don’t change significantly? The connectivity of a portion of the graph is changed arbitrary –How will it affect the results of algorithms?

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 26 Stability of HITS δ : eigengap λ 1 – λ 2 d: maximum outdegree of G Ng et al (2001) A bound on the number of hyperlinks k that can added or deleted from one page without affecting the authority or hubness weights It is possible to perturb a symmetric matrix by a quantity that grows as δ that produces a constant perturbation of the dominant eigenvector

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 27 Stability of PageRank The parameter ε of the mixture model has a stabilization role If the set of pages affected by the perturbation have a small rank, the overall change will also be small tighter bound by Bianchini et al (2001) δ(j) >= 2 depends on the edges incident on j Ng et al (2001) V: the set of vertices touched by the perturbation

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 28 SALSA SALSA (Lempel, Moran 2001) –Probabilistic extension of the HITS algorithm –Random walk is carried out by following hyperlinks both in the forward and in the backward direction Two separate random walks –Hub walk –Authority walk

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 29 Forming a Bipartite Graph in SALSA

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 30 Random Walks Hub walk –Follow a Web link from a page u h to a page w a (a forward link) and then –Immediately traverse a backlink going from w a to v h, where (u,w) Є E and (v,w) Є E Authority Walk –Follow a Web link from a page w(a) to a page u(h) (a backward link) and then –Immediately traverse a forward link going back from v h to w a where (u,w) Є E and (v,w) Є E

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 31 Computing Weights Hub weight computed from the sum of the product of the inverse degree of the in- links and the out-links

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 32 Why We Care Lempel and Moran (2001) showed theoretically that SALSA weights are more robust that HITS weights in the presence of the Tightly Knit Community (TKC) Effect. –This effect occurs when a small collection of pages (related to a given topic) is connected so that every hub links to every authority and includes as a special case the mutual reinforcement effect The pages in a community connected in this way can be ranked highly by HITS, higher than pages in a much larger collection where only some hubs link to some authorities TKC could be exploited by spammers hoping to increase their page weight (e.g. link farms)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 33 A Similar Approach Rafiei and Mendelzon (2000) and Ng et al. (2001) propose similar approaches using reset as in PageRank Unlike PageRank, in this model the surfer will follow a forward link on odd steps but a backward link on even steps The stability properties of these ranking distributions are similar to those of PageRank (Ng et al. 2001)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 34 Overcoming TKC Similarity downweight sequencing and sequential clustering (Roberts and Rosenthal 2003) –Consider the underlying structure of clusters –Suggest downweight sequencing to avoid the Tight Knit Community problem –Results indicate approach is effective for few tested queries, but still untested on a large scale

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 35 PHITS and More PHITS: Cohn and Chang (2000) –Only the principal eigenvector is extracted using SALSA, so the authority along the remaining eigenvectors is completely neglected Account for more eigenvectors of the co-citation matrix See also Lempel, Moran (2003)

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 36 Limits of Link Analysis META tags/ invisible text –Search engines relying on meta tags in documents are often misled (intentionally) by web developers Pay-for-place –Search engine bias : organizations pay search engines and page rank –Advertisements: organizations pay high ranking pages for advertising space With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to high ranking page

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 37 Limits of Link Analysis Stability –Adding even a small number of nodes/edges to the graph has a significant impact Topic drift – similar to TKC –A top authority may be a hub of pages on a different topic resulting in increased rank of the authority page Content evolution –Adding/removing links/content can affect the intuitive authority rank of a page requiring recalculation of page ranks

Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine 38 Further Reading R. Lempel and S. Moran, Rank Stability and Rank Similarity of Link-Based Web Ranking Algorithms in Authority Connected Graphs, Submitted to Information Retrieval, special issue on Advances in Mathematics/Formal Methods in Information Retrieval, 2003.Rank Stability and Rank Similarity of Link-Based Web Ranking Algorithms in Authority Connected Graphs M. Henzinger, Link Analysis in Web Information Retreival, Bulletin of the IEEE computer Society Technical Committee on Data Engineering, 2000.Link Analysis in Web Information Retreival L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001 Relational Data Mining