1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 6 April 13, 2005

2 Web Structure II: Bipartite Cores and Bow-Tie Structure

3 Outline

- Bipartite cores
- The copying model
- Bow-tie structure of the web

4 Web as a Social Network

Small-world network:
- Low (average) diameter
- High clustering coefficient: many of a node v's neighbors are neighbors of each other

Reason: the web is built of communities.

5 Cyber Communities

Cyber community:
- A group of people sharing a common interest
- Web pages authored/cited by these people

Examples:
- Israeli student organizations in the United States
- Large automobile manufacturers
- Oil spills off the coast of Japan
- Britney Spears fans

6 Structure of Cyber Communities [Kumar et al, 1999]

Hubs: resource pages about the community's shared interest
- Examples: Directory of Israeli Student Organizations in the US; Yahoo! Autos; "Oil spills near Japan" bookmarks; Donna's Britney Spears links

Authorities: central pages on the community's shared interest
- Examples: ISO (Stanford's Israeli Student Organization); Mazda.com; Britney Spears: The official site

7 Dense Bipartite Subgraphs

Hubs:
- Cite many authorities
- Have overlapping citations

Authorities:
- Cited by many hubs
- Frequently co-cited

Therefore: a cyber community is characterized by a dense directed bipartite subgraph.

8 Bipartite Cores

(i,j)-bipartite core: a pair (H',A') where
- H' is a subset of H of size i
- A' is a subset of A of size j
- The subgraph induced on (H',A') is a complete bipartite graph (every page in H' links to every page in A')

Hypothesis: "most" dense bipartite subgraphs of the web have cores.
Therefore: bipartite cores are footprints of cyber communities.

9 Finding Cyber Communities

Bipartite cores can be found efficiently from a crawl (a pruning sketch follows below):
- A few one-pass scans of the data
- A few sorts

The web is rife with cyber communities:
- About 200K disjoint (3,*)-cores in a 1996 crawl of ~200M pages
- A random graph of this size is not likely to have even a single (3,3)-core!
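
The "scans and sorts" refer to the disk-based trawling procedure of Kumar et al.; below is a minimal in-memory sketch of its core pruning step, assuming the edge list fits in memory (the function name and input format are illustrative, not from the paper):

```python
from collections import defaultdict

def prune_core_candidates(edges, i, j):
    """Iteratively discard pages that cannot take part in any (i,j)-core:
    a hub must have out-degree >= j, an authority in-degree >= i.
    Returns the surviving (hub, authority) candidate sets."""
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for u, v in edges:
        out_nbrs[u].add(v)
        in_nbrs[v].add(u)

    changed = True
    while changed:
        changed = False
        # Pages with too few remaining out-links cannot be hubs.
        for u in [u for u, nbrs in out_nbrs.items() if len(nbrs) < j]:
            for v in out_nbrs.pop(u):
                if v in in_nbrs:
                    in_nbrs[v].discard(u)
            changed = True
        # Pages with too few remaining in-links cannot be authorities.
        for v in [v for v, nbrs in in_nbrs.items() if len(nbrs) < i]:
            for u in in_nbrs.pop(v):
                if u in out_nbrs:
                    out_nbrs[u].discard(v)
            changed = True
    return set(out_nbrs), set(in_nbrs)
```

After pruning, the surviving candidate sets are typically small enough that the remaining cores can be enumerated directly.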

10 The Copying Model [Kleinberg et al 1999] [Kumar et al 2000]

Initialization: a single node.
Evolution: at every step, a new node v is added to the graph, and v connects to d out-neighbors:
- Prototype selection: v chooses a random node u from the graph.
- Bernoulli copying: for each i = 1,...,d, v tosses a coin with heads probability α:
  - If the coin is heads, v connects to a random node.
  - If the coin is tails, v connects to the i-th out-neighbor of u.
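
A compact simulation of the model as described above; the self-loops at node 0 are an arbitrary bootstrapping choice (not specified on the slide) so that the first prototype has d out-links to copy:

```python
import random
from collections import Counter

def copying_model(n, d, alpha, seed=0):
    """Simulate n steps of the copying model; returns out_nbrs, where
    out_nbrs[v] is the list of v's d out-neighbors."""
    rng = random.Random(seed)
    out_nbrs = [[0] * d]                   # bootstrap: node 0 points to itself d times
    for v in range(1, n):
        u = rng.randrange(v)               # prototype: uniform over existing nodes
        links = []
        for i in range(d):
            if rng.random() < alpha:       # heads: a uniformly random node
                links.append(rng.randrange(v))
            else:                          # tails: copy u's i-th out-link
                links.append(out_nbrs[u][i])
        out_nbrs.append(links)
    return out_nbrs

# Example: tabulate the in-degree distribution to eyeball the power law.
g = copying_model(n=50_000, d=7, alpha=0.5)
indegree = Counter(w for links in g for w in links)
```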

11 The Copying Model: Motivation

- When a new page is created, its author has some "topic" in mind.
- The author chooses links from a "prototype" page u about the topic.
- The author introduces their own spin on the topic by linking to new "random" pages.

12 The Copying Model: Degree Distribution

If α = 0, the i-th neighbor of v is u with probability indeg(u) / Σ_w indeg(w):
- Identical to the preferential attachment model
- In the limit, the fraction of pages with in-degree k is proportional to 1/k^2

For arbitrary α:
- The fraction of pages with in-degree k is proportional to 1/k^((2-α)/(1-α))
- Similar analysis
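
A quick check of how the exponent moves with α:

```python
# Predicted power-law exponent (2 - alpha) / (1 - alpha) of the in-degree
# distribution: more copying (smaller alpha) gives a heavier tail.
for alpha in (0.0, 0.25, 0.5, 0.75):
    print(alpha, (2 - alpha) / (1 - alpha))   # 2.0, ~2.33, 3.0, 5.0
```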

13 Erdős-Rényi Random Graph: Bipartite Cores

G_{n,p} with p = d/n:
- Fix any A,B ⊆ G_{n,p} with |A| = i, |B| = j
- Probability that A,B form a complete bipartite graph: p^(ij)
- # of such pairs A,B: at most C(n,i) · C(n,j) ≤ n^(i+j)
- Expected # of (i,j)-bipartite cores is at most n^(i+j) · (d/n)^(ij)
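
Plugging crawl-scale numbers into this first-moment bound shows why a comparable random graph should contain no (3,3)-core at all (d = 10 average out-links per page is my assumption, not a figure from the slide):

```python
# Upper bound n^(i+j) * (d/n)^(i*j) on the expected number of
# (i,j)-cores in G_{n,p} with p = d/n, at the scale of the 1996 crawl.
n, d, i, j = 2e8, 10, 3, 3
print(n ** (i + j) * (d / n) ** (i * j))      # ~1.2e-16: essentially zero
```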

14 The Copying Model: Bipartite Cores

Consider the graph after n steps.
Theorem: For any i < log n, the expected # of (i,d)-bipartite cores is Ω(n/c^i).
Definition: v is a duplicator of u if it copies all of its neighbors from the prototype u.
Observation: If v_1,...,v_i are duplicators of u, then v_1,...,v_i and their neighbors form an (i,d)-bipartite core.

15 The Copying Model: Bipartite Cores (cont.)

Lemma: w.h.p., almost all of the first O(n/c^i) nodes added to the graph have at least i duplicators.
- Probability that a new node v is a duplicator: (1-α)^d
- Define: c = 2^(1/(1-α)^d)
- Let u be any node born at some step t < O(n/c^i)
- Probability that a node v born at step t' > t chooses u as a prototype: 1/(t'-1)

16 The Copying Model: Bipartite Cores (cont.)

Split steps t+1,...,n into O(log(n/t)) "epochs": (t,2t], (2t,4t], (4t,8t], ..., (n/2,n]
- Probability that at least one node in the first epoch chooses u as a prototype: 1 - Π_{t'=t+1}^{2t} (1 - 1/(t'-1)) ≈ 1/2
- Same for the other epochs
- Expected # of duplicators of u is therefore at least Ω((1-α)^d · log(n/t)), which is ≥ i when t ≤ n/c^i
- The # of duplicators is sharply concentrated about the mean

17 Bow-Tie Structure of the Web [Broder et al 2000]

[Figure: the bow-tie decomposition of the web into a central strongly connected core (SCC), an IN component that can reach the core, an OUT component reachable from the core, plus tendrils, tubes, and disconnected components.]

18 Random Sampling of Web Pages

19 Outline

- Problem definition
- Random sampling of web pages according to their PageRank
- Uniform sampling of web pages (Henzinger et al)
- Uniform sampling of web pages (Bar-Yossef et al)

20 Random Sampling of Web Pages

W = a snapshot of the "indexable web"
- Consider only "static" HTML web pages
π = a probability distribution over W

Goal: design an efficient algorithm for generating samples from W distributed according to π.

Our focus:
- π = PageRank
- π = Uniform

21 Random Sampling of Web Pages: Motivation

Compute statistics about the web:
- Ex: What fraction of web pages belong to .il?
- Ex: What fraction of web pages are written in Chinese?
- Ex: What fraction of hyperlinks are advertisements?

Compare coverage of search engines:
- Ex: Is Google larger than MSN?
- Ex: What is the overlap between Google and Yahoo?

Data mining of the web:
- Ex: How frequently do computer science pages cite biology pages?
- Ex: How are pages distributed by topic?

22 Random Sampling of Web Pages: Challenges

Naïve solution: crawl, index, sample. Problems:
- Crawls cannot get complete coverage
- The web is constantly changing
- Crawling is slow and expensive

Our goals:
- Accuracy: generate samples from a snapshot of the entire indexable web
- Speed: samples should be generated quickly
- Low cost: the sampling procedure should run on a desktop PC

23 A Random Walk Approach

Design a random walk on W whose stationary distribution is π:
- P = the random walk's probability transition matrix
- πP = π

Run the random walk for sufficiently many steps:
- Recall: for any initial distribution q, qP^t → π as t → ∞
- Mixing time: # of steps required to get close to the limit

Use the reached node as a sample; repeat for as many samples as needed.

24 A Random Walk Approach: Advantages & Issues

Advantages:
- Accuracy: the random walk can potentially visit every page on the web
- Speed: no need to scan the whole web
- Low cost: no need for large storage or multiple processors

Issues:
- How to design the random walk so it converges to π?
- How to analyze the mixing time of the random walk?

25 PageRank Sampling [Henzinger et al 1999]

Use the "random surfer" random walk (sketched below):
- Start at some initial node v_0
- When visiting a page v, toss a coin with heads probability ε:
  - If the coin is heads, go to a uniformly chosen page
  - If the coin is tails, go to a random out-neighbor of v

Limit distribution: PageRank
Mixing time: fast (we will see this later)
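
A sketch of the walk (ε = 0.15 is the customary reset probability, not a value from the slide; get_outlinks and all_urls are hypothetical helpers):

```python
import random

def random_surfer(get_outlinks, all_urls, start_url, steps, eps=0.15, seed=0):
    """PageRank's 'random surfer' walk: with probability eps jump to a
    uniformly chosen page, otherwise follow a random out-link.
    get_outlinks(url) would fetch the page and parse its links; all_urls,
    the uniform-jump pool, is exactly what is unavailable in practice
    (see the next slide)."""
    rng = random.Random(seed)
    v = start_url
    for _ in range(steps):
        nbrs = get_outlinks(v)
        if rng.random() < eps or not nbrs:     # reset coin (or a dead end)
            v = rng.choice(all_urls)           # the problematic uniform jump
        else:
            v = rng.choice(nbrs)
    return v
```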

26 PageRank Sampling: Reality

Problem: how do we pick a page uniformly at random?

Solutions:
- Jump to a random page from the history of the walk
  - Creates a bias towards dense web sites
- Pick a random host from the hosts in the walk's history, and jump to a random page from the pages visited on that host
  - No longer converges to PageRank
  - Experiments indicate it is still fine

27 Uniform Sampling via PageRank Sampling [Henzinger et al 2000]

Sampling algorithm (a rejection-sampling sketch follows below):
1. Use the previous random walk to generate a sample w according to the PageRank distribution
2. Toss a coin with heads probability C / (|W| · PR(w))
3. If the coin is heads, output w as a sample
4. If the coin is tails, go to step 1

Analysis:
- For the coin bias to be a valid probability, take C ≤ |W| · min_w PR(w)
- Pr[a given iteration outputs w] = PR(w) · C / (|W| · PR(w)) = C/|W|, the same for every page, i.e., uniform
- Expected # of iterations until getting a single sample: 1/C
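
A sketch of the accept/reject loop under this reading of the analysis (the helper names and the exact form of the acceptance probability are my assumptions):

```python
import random

def uniform_sample(pagerank_walk_sample, pagerank_estimate, C, W_size, seed=0):
    """Rejection sampling: draw w ~ PageRank, accept with probability
    C / (W_size * PR(w)); every page is then output with probability
    C / W_size per iteration, i.e., uniformly. Requires
    C <= W_size * min_w PR(w) so the coin bias is a valid probability."""
    rng = random.Random(seed)
    while True:
        w = pagerank_walk_sample()                    # step 1: w ~ PageRank
        if rng.random() < C / (W_size * pagerank_estimate(w)):
            return w                                  # heads: accept
        # tails: reject and try again
```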

28 Uniform Sampling via PageRank Sampling: Reality

How to estimate PR(w)?
- Use the random walk itself: VR(w) = visit ratio of w (# of times w was visited by the walk, divided by the length of the walk)
  - The approximation is very crude
- Use the subgraph spanned by the visited nodes to compute PageRank
  - Biased towards the neighborhood of the initial page
- Use Google

29 Uniform Sampling by RW on Regular Graphs [Bar-Yossef et al 2000]

Fact: A random walk on an undirected, connected, non-bipartite, and regular graph converges to the uniform distribution.

Proof:
- P: the random walk's probability transition matrix
  - P is stochastic, so 1 is a right eigenvector with eigenvalue 1: P1 = 1
- The graph is connected ⇒ the RW is irreducible
- The graph is non-bipartite ⇒ the RW is aperiodic
- Hence, the RW is ergodic, and thus has a stationary distribution π: π is a left eigenvector of P with eigenvalue 1: πP = π

30 Random Walks on Regular Graphs

Proof (cont.):
- d: the graph's degree; A: the graph's adjacency matrix
  - A is symmetric, because the graph is undirected
- P = (1/d) · A, hence P is also symmetric
- Its left eigenvectors and right eigenvectors therefore coincide, so π = (1/n) · 1
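
A four-node sanity check of the fact, using the complete graph K4:

```python
import numpy as np

# K4 is undirected, connected, non-bipartite, and 3-regular, so the
# walk's stationary distribution should be uniform.
A = np.ones((4, 4)) - np.eye(4)        # adjacency matrix of K4
P = A / 3.0                            # P = (1/d) A: symmetric and stochastic
pi = np.linalg.matrix_power(P, 50)[0]  # a row of P^t approximates the limit
print(pi)                              # -> [0.25 0.25 0.25 0.25]
```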

31 Web as a Regular Graph

Problems:
- The web is not connected
- The web is directed
- The web is non-regular

Solutions (see the sketch below):
- Focus on the indexable web, which is connected
- Ignore directions of links
- Add a weighted self loop to each node: weight(w) = deg_max - deg(w)
  - All pages then have degree deg_max
  - An overestimate of deg_max doesn't hurt
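
One step of the resulting walk, as a sketch (neighbors stands for whatever undirected neighbor list is actually available, which is the issue slide 33 returns to):

```python
import random

def regularized_step(v, neighbors, deg_max, rng=random):
    """One step of the walk on the self-loop-regularized graph: node v
    carries a self loop of weight deg_max - deg(v), so it stays put with
    probability (deg_max - deg(v)) / deg_max and otherwise moves to a
    uniformly chosen (undirected) neighbor."""
    if rng.random() < 1 - len(neighbors) / deg_max:
        return v                        # self-loop step: free, no fetch needed
    return rng.choice(neighbors)        # real step: move to a uniform neighbor
```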

32 Mixing Time Analysis

Theorem: The mixing time of the random walk is O(log(|W|) / (1 - λ_2)), where λ_2 is the second largest eigenvalue of P and 1 - λ_2 is its spectral gap.

Experiment (over a large web crawl):
- 1 - λ_2 ≈ 1/100,000
- log(|W|) ≈ 34
- Hence: mixing time ≈ 3.4 million steps

But self-loop steps are free:
- About 1 in 30,000 steps is not a self-loop step (deg_max ≈ 300,000, deg_avg ≈ 10)
- Actual mixing time: ~115 steps!
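
The back-of-the-envelope computation, spelled out (it reproduces the slide's ~115 up to rounding):

```python
# The slide's arithmetic with its own measured numbers.
spectral_gap = 1e-5                     # 1 - lambda_2 ~ 1/100,000
log_W = 34                              # log |W|
steps = log_W / spectral_gap            # ~3.4 million walk steps in total
deg_max, deg_avg = 300_000, 10
real = steps * deg_avg / deg_max        # non-self-loop (i.e., costly) steps only
print(steps, real)                      # ~3.4e6 total, ~113 real steps
```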

33 Random Walks on Regular Graphs: Reality

How to get incoming links?
- Search engines
  - Potential bias towards the search engine's index
  - Do not provide the full list of in-links
  - Costly communication
- The random walk's history
  - Important for avoiding dead ends
  - Requires storage

How to estimate deg(w)?

Solution: run the random walk on the subgraph of W spanned by the available links.
- The subgraph may no longer have the good mixing-time properties

34 Top 20 Internet Domains (Summer 2003)

35 Search Engine Coverage (Summer 2000)

36 End of Lecture 6