Ch 14. Link Analysis Padmini Srinivasan Computer Science Department



Web Search
Hard problem
– Hats off to ‘information retrieval’
– Complex information needs
Keywords
Synonyms, polysemy (multiple meanings)
– True homonyms: row (oar), row (argue); delta (Greek letter), delta (of a river)
– Polysemous homonyms: mouth (of a river), mouth (of an animal); right ‘hand’ person, ‘hand’ it to me
– The age of intermediaries (BRS After Dark)
– Diversity in writing + diversity in queries + diversity in indexing + diversity in motivations
– Controlled vocabularies vs. free text
– Majority rule? ‘Cornell’

Web Search Peculiarities
Compared to the good old days
Needle in a haystack problem; many needles in many haystacks! Which ones to look for?
– How distinct is this from the “traditional” methods for IR? Libraries etc.
– Can we do without libraries?
Quality
– A serious question?
– Does redundancy promote quality?
– Does collaboration promote quality?
Scale
– Retrieve and FILTER/ORGANIZE
– Satisfying versus satisficing

Link Analysis
In-links and out-links; in-degree and out-degree
– A matter of endorsement! (directional)
– Akin to citations
– What are the differences? Must one out-link?
– Power laws all the way through!

Some studies
Kumar et al. 1999: Alexa web crawl from 1997, over 40 million nodes. Trawling the Web for cyber communities, Proc. 8th WWW, Apr 1999
– Probability that a page has in-degree $k$ is approximately proportional to $1/k^2$
– Probability that a page has in-degree at least $k$ is approximately proportional to $1/k$
– Actual exponent slightly larger than 2.
Barabasi and Albert 1999 – studied the U. Notre Dame web site with some extensions
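
The step from the first probability to the second can be checked numerically. The sketch below assumes an idealized distribution in which P(in-degree = k) is exactly proportional to 1/k^2 (the slide notes the real exponent is slightly larger than 2); the function names and the truncation length are illustrative choices, not part of the cited study.

```python
def tail_sum(k, terms=200_000):
    """Approximate sum_{j >= k} 1/j^2 by truncating after `terms` terms.

    `terms` is an arbitrary cutoff chosen for illustration.
    """
    return sum(1.0 / (j * j) for j in range(k, k + terms))


def tail_ratio(k):
    """Ratio of the tail sum to 1/k; it approaches 1 as k grows."""
    return tail_sum(k) * k


if __name__ == "__main__":
    for k in (10, 100, 1000):
        print(k, tail_ratio(k))
```

The ratio staying close to 1 is the discrete counterpart of the integral of 1/x^2 from k to infinity being 1/k.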

Broder et al., Graph Structure in the Web
Note that the exponent is different. Note also the deviation at the low end of the out-degree.

Fractals?
Broder et al.: “almost fractal like quality for the power law in-degree and out-degree distributions, in that it appears both as a macroscopic phenomenon on the entire web, as a microscopic phenomenon at the level of a single university website, and at intermediate levels between these two.” (Graph structure in the web)

Similar Studies
Donato et al. ACM TOIT, The Web as a Graph: How Far We Are
– In-degree: power law; exponent 2.1 (Fig. 4)
– Out-degree: not so good (Fig. 5)
– Check out Fig. 8: SCC distribution (number of SCCs versus size of SCC). Power law; exponent 2.09
Webbase, 200 million, Stanford crawl (2001)
– 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million); next SCC: 10 thousand!

Hubs & Authorities
In-links: votes
HITS algorithm: hyperlink-induced topic search.
– A good hub is one that points to good authorities [lists; directories]
– A good authority is one that is pointed to by good hubs
– A good hub need not be an authority and vice versa.
– Those who have knowledge; those who know well about those who have knowledge
– Dynamic estimation; repeated application of update rules. Converges!

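Algorithm
First conduct retrieval. Compute hubs and authorities on the relevant set
– Rank the retrieved set by a list of hubs and a list of authorities
Initialize hub and authority scores (say all to 1, or some other positive number)
– Apply authority score update rule: a page's authority score becomes the sum of the hub scores of the pages that point to it
– Apply hub score update rule: a page's hub score becomes the sum of the authority scores of the pages it points to
Example: fig and (problem 3)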
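
A minimal sketch of the two update rules follows. The edge-list representation, the function name `hits`, the fixed iteration count, and the per-round normalization (dividing by the sum so scores stay bounded) are assumptions made for illustration; the slides do not specify them. The retrieval step is also assumed to have happened already, so `edges` holds only links among the retrieved pages.

```python
def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed graph.

    `edges` is a list of (source, target) pairs. All names here are
    illustrative, not taken from the slides.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}    # initialize all scores to 1
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that point to the node.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]

        # Hub update: sum the (new) authority scores of pages the node points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]

        # Normalization (an assumption): keep each score vector summing to 1.
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {n: v / a_total for n, v in new_auth.items()}
        hub = {n: v / h_total for n, v in new_hub.items()}

    return hub, auth


if __name__ == "__main__":
    # Hypothetical toy graph: three hub-like pages, two authority-like pages.
    toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
    hub_scores, auth_scores = hits(toy_edges)
    print(sorted(auth_scores.items(), key=lambda kv: -kv[1]))
    print(sorted(hub_scores.items(), key=lambda kv: -kv[1]))
```

In the toy graph, `a1` should come out as the strongest authority because every hub points to it, and `h1` as the strongest hub because it points to both authorities.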

It's all about convergence
First show how the update rule works with matrices $M$ and $M^T$
Then show the same using eigenvectors
Then show that the initialization of hub scores really does not matter, as long as it is a positive vector, i.e., all hub scores are initialized to a positive number
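
The outline above can be written out as a short derivation. The notation is an assumption, since the slides do not fix one: $M$ is the adjacency matrix with $M_{ij} = 1$ when page $i$ links to page $j$, $h^{(k)}$ and $a^{(k)}$ are the hub and authority vectors after $k$ rounds, and $z_i$, $c_i$, $q_i$ are illustrative names for eigenvectors, eigenvalues, and coefficients.

```latex
% Update rules in matrix form
a^{(k)} = M^{T} h^{(k-1)}, \qquad h^{(k)} = M a^{(k)}

% Substituting one rule into the other
h^{(k)} = (M M^{T})^{k} \, h^{(0)}, \qquad
a^{(k)} = (M^{T} M)^{k-1} M^{T} h^{(0)}

% M M^T is symmetric, so it has orthonormal eigenvectors z_1, ..., z_n
% with real eigenvalues c_1 >= c_2 >= ... >= c_n >= 0.
h^{(0)} = \sum_{i=1}^{n} q_i z_i
\quad\Longrightarrow\quad
h^{(k)} = \sum_{i=1}^{n} c_i^{k} q_i z_i

% Assuming c_1 > c_2 and q_1 \neq 0, the normalized hub vector converges:
\frac{h^{(k)}}{c_1^{k}} = q_1 z_1 + \sum_{i=2}^{n} \left(\frac{c_i}{c_1}\right)^{k} q_i z_i
\;\longrightarrow\; q_1 z_1 \quad (k \to \infty)
```

Under these assumptions the limit direction $z_1$ does not depend on $h^{(0)}$; the role of a positive starting vector, as the slide states, is to make sure the coefficient $q_1$ is nonzero.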

PageRank
Endorsements repeatedly move through out-links. A → B
Principle of repeated improvement:
– Weight of ‘current’ endorsement depends on ‘current’ estimate of A’s PageRank.
– More important nodes convey higher endorsements.
– Scores stabilize (until the network changes)

Calculation
Initialize: each node has PageRank = 1/n, where n is the number of nodes
Basic PageRank Update Rule:
– A node divides its PageRank equally over its out-links. If it has no out-links, it keeps its PageRank.
– The PageRank of a node = sum of the PageRanks it receives in that iteration.
– Total PageRank stays constant, so no need for normalizing.
Iterate till convergence OR for a fixed number of iterations.
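
The basic update rule can be sketched as below. The adjacency-dict representation, the function name `basic_pagerank`, and the fixed iteration count are assumptions for illustration; every link target is assumed to appear as a key in `out_links`.

```python
def basic_pagerank(out_links, iterations=100):
    """Basic (unscaled) PageRank on a dict {node: [nodes it links to]}.

    Illustrative sketch: every target must also be a key of `out_links`.
    """
    nodes = list(out_links)
    rank = {node: 1.0 / len(nodes) for node in nodes}  # initialize to 1/n

    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node, targets in out_links.items():
            if targets:
                # Divide the node's PageRank equally over its out-links.
                share = rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no out-links keeps its PageRank.
                new_rank[node] += rank[node]
        rank = new_rank  # total PageRank is unchanged, so no normalizing

    return rank


if __name__ == "__main__":
    # Hypothetical three-page network.
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(basic_pagerank(toy_graph))
```

For this toy network the equilibrium can be solved by hand: C passes everything to A, so A and C end up equal, and B receives half of A's value, giving 0.4, 0.2, 0.4 for A, B, C.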

Equilibrium
No further changes in PageRanks
Degenerate cases exist (Scaled PageRank Updates)
Values need not be unique except where the network is strongly connected.

Slow leaks?

Scaled PageRank Update Rule
Scaling factor s: between 0 and 1, generally between 0.8 and 0.9
– Apply basic PageRank update rule.
For each page:
– Scale down all PageRank values by s (say 0.9), so each page gets 0.9 * its PageRank.
– Total PageRank is now s.
– Divide the remaining PageRank (1 - s) equally over all nodes.
Get a unique set of values for each setting of s. [shown later in proofs]
Random walk model [browsing, not searching]: probability of reaching a page is equal to prob(coming across an in-link) + prob(getting there at random)
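
A minimal sketch of the scaled rule follows, under the same assumptions as any adjacency-dict implementation: the name `scaled_pagerank`, the default s = 0.85, the iteration count, and the toy graph are illustrative and not taken from the slides.

```python
def scaled_pagerank(out_links, s=0.85, iterations=100):
    """Scaled PageRank on a dict {node: [nodes it links to]}.

    Illustrative sketch: every target must also be a key of `out_links`.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Step 1: basic PageRank update rule.
        basic = {node: 0.0 for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = rank[node] / len(targets)
                for target in targets:
                    basic[target] += share
            else:
                basic[node] += rank[node]

        # Step 2: scale every value by s, then spread the remaining
        # (1 - s) equally over all n nodes.
        rank = {node: s * basic[node] + (1.0 - s) / n for node in nodes}

    return rank


if __name__ == "__main__":
    # Hypothetical network where C and D link only to each other, so under
    # the basic rule all PageRank would slowly leak into that pair.
    leaky = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}
    print(scaled_pagerank(leaky))
```

With scaling, A and B keep a positive share because each node receives at least (1 - s)/n per round, although C and D still rank higher.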

Summary
Link-based analysis
– Power laws: in-links, out-links etc.
Hubs and Authorities – convergence
PageRank – convergence