Uniform Sampling from the Web via Random Walks


Uniform Sampling from the Web via Random Walks
Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, Dror Weitz
University of California at Berkeley

Motivation: Web Measurements
- Main goal: develop a cheap method to sample uniformly from the Web.
- Use a random sample of web pages to approximate:
  - search engine coverage
  - domain name distribution (.com, .org, .edu)
  - percentage of porn pages
  - average number of links in a page
  - average page length
- Note: here a web page means a static HTML page.

The Structure of the Web (Broder et al., 2000)
[Figure: the Web drawn as four regions of about 1/4 each: a large strongly connected component, a left side, a right side, and tendrils & isolated regions, with the indexable web marked.]

Why is Web Sampling Hard?
- Obvious solution: sample from an index of all pages.
- Maintaining an index of Web pages is difficult:
  - it requires extensive resources (storage, bandwidth)
  - it is hard to implement
- There is no consistent index of all Web pages:
  - complete coverage is difficult to get
  - it takes a month to crawl/index most of the Web
  - the Web is changing every minute

Our Approach: Random Walks for Random Sampling
- A random walk on a graph provides a sample of nodes.
- If the graph is undirected and regular, the sample is uniform.
- Problem: the Web is neither undirected nor regular.
- Our solution:
  - incrementally create an undirected regular graph with the same nodes as the Web
  - perform the walk on this graph

Related Work
- Monika Henzinger et al. (2000): a random walk produces pages distributed according to Google's PageRank; these pages are then weighted to produce a nearly uniform sample.
- Krishna Bharat & Andrei Broder (1998): measured the relative size and overlap of search engines using random queries.
- Steve Lawrence & Lee Giles (1998, 1999): estimated the size of the Web by probing IP addresses and crawling servers, and measured search engine coverage in response to certain queries.

Random Walks: Definitions
- Probability distribution q_t: q_t(v) = probability that v is visited at step t.
- Step: from node v, pick any outgoing edge with equal probability and go to its endpoint u.
- Transition matrix A: q_{t+1} = q_t A.
- Stationary distribution: the limit of q_t as t grows, if it exists and is independent of q_0.
- Markov process: the probability of a transition depends only on the current state.
- Mixing time: the number of steps required to approach the stationary distribution.
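To make the definitions concrete, here is a minimal Python sketch of the update q_{t+1} = q_t A. The three-node graph and the names `transition_matrix`, `step` and `distribution_after` are illustrative assumptions, not part of the talk.

```python
def transition_matrix(out_links):
    """Row v spreads probability equally over v's outgoing edges."""
    n = len(out_links)
    A = [[0.0] * n for _ in range(n)]
    for v, targets in enumerate(out_links):
        for u in targets:
            A[v][u] += 1.0 / len(targets)
    return A


def step(q, A):
    """One step of the walk: q_{t+1} = q_t A (row vector times matrix)."""
    n = len(q)
    return [sum(q[v] * A[v][u] for v in range(n)) for u in range(n)]


def distribution_after(q0, A, t):
    """Distribution q_t obtained from q_0 after t steps."""
    q = list(q0)
    for _ in range(t):
        q = step(q, A)
    return q


# Hypothetical directed graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = transition_matrix([[1, 2], [2], [0]])

# Two different starting distributions q_0
from_node_0 = distribution_after([1.0, 0.0, 0.0], A, 200)
from_node_1 = distribution_after([0.0, 1.0, 0.0], A, 200)
print(from_node_0)
print(from_node_1)
```

On this toy graph both starting points approach the same distribution (0.4, 0.2, 0.4), which illustrates a stationary distribution that is independent of q_0 but not uniform.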

Straightforward Random Walk on the Web
- Follow a random out-link at each step.
- Gets stuck in sinks and in dense Web communities.
- Biased towards popular pages.
- Converges slowly, if at all.
[Figure: a walk with numbered steps over pages including amazon.com, netscape.com and www.cs.berkeley.edu/~zivi.]

WebWalker: Undirected Regular Random Walk on the Web
- Follow a random out-link or a random in-link at each step.
- Use weighted self loops to even out pages' degrees: w(v) = deg_max - deg(v).
- Fact: a random walk on a connected undirected regular graph converges to a uniform stationary distribution.
[Figure: example graph over pages including amazon.com, netscape.com and www.cs.berkeley.edu/~zivi, annotated with numeric weights.]
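The effect of the self loops can be checked on a toy graph. The sketch below is an illustration under assumptions (a hypothetical five-node graph, hypothetical function names), not the authors' implementation: every link is treated as undirected, each node v receives a self loop of weight w(v) = deg_max - deg(v), and the resulting walk is iterated.

```python
def regularized_transition_matrix(edges, n):
    """Transition matrix of the walk after adding weighted self loops."""
    neighbors = [set() for _ in range(n)]
    for u, v in edges:  # treat every link as undirected
        neighbors[u].add(v)
        neighbors[v].add(u)
    deg = [len(nb) for nb in neighbors]
    deg_max = max(deg)
    P = [[0.0] * n for _ in range(n)]
    for v in range(n):
        for u in neighbors[v]:
            P[v][u] = 1.0 / deg_max
        # Self loop of weight w(v) = deg_max - deg(v) makes every row sum to 1
        P[v][v] = (deg_max - deg[v]) / deg_max
    return P


def walk_distribution(P, start, steps):
    """Distribution of the walk after `steps` steps when started at `start`."""
    n = len(P)
    q = [0.0] * n
    q[start] = 1.0
    for _ in range(steps):
        q = [sum(q[v] * P[v][u] for v in range(n)) for u in range(n)]
    return q


# Hypothetical connected graph with degrees 3, 1, 1, 2, 1
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
P = regularized_transition_matrix(edges, 5)
q = walk_distribution(P, start=1, steps=2000)
print(q)
```

Although the degrees differ, every entry of q approaches 1/5, as the fact above predicts for a connected undirected regular graph.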

WebWalker: Mixing Time
- Theorem [Markov chain folklore]: a random walk's mixing time is at most log(N) / (1 - λ_2), where N = size of the graph and 1 - λ_2 = eigenvalue gap of the transition matrix.
- Experiment (using an extensive Alexa crawl of the Web from 1996): WebWalker's eigenvalue gap is 1 - λ_2 ≈ 10^-5.
- Result: WebWalker's mixing time is 3.1 million steps.
- Self loop steps are free, and only 1 in 30,000 steps is not a self loop step (deg_max ≈ 3×10^5, deg_avg = 10).
- Result: WebWalker's actual mixing time is only 100 steps!
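The step from 3.1 million steps to about 100 is simple arithmetic on the figures quoted above. The snippet below works it through; the variable names are illustrative.

```python
# Figures quoted on the slide
deg_max = 3e5             # maximum degree, about 3 x 10^5
deg_avg = 10              # average degree
mixing_steps = 3.1e6      # mixing time counting self loop steps

# At a node of average degree, only deg_avg out of deg_max steps follow a real link
real_step_fraction = deg_avg / deg_max        # 1 in 30,000
real_steps = mixing_steps * real_step_fraction

print(round(1 / real_step_fraction))  # 30000
print(round(real_steps))              # 103
```

The result of roughly 103 real steps is consistent with the "only 100 steps" quoted on the slide.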

WebWalker: Mixing Time (cont.)
- Mixing time on the current Web may be similar.
- There is some evidence that the structure of the Web today is similar to its structure in 1996 (Kumar et al., 1999; Broder et al., 2000).

WebWalker: Realization (1)
WebWalker(v):
  1. Spend expected deg_max/deg(v) steps at v.
  2. Pick a random link incident to v (either v → u or u → v).
  3. Continue with WebWalker(u).
Problems:
- The in-links of v are not available.
- deg(v) is not available.
Partial sources of in-links:
- previously visited nodes
- reverse link services of search engines
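One way to realize "spend expected deg_max/deg(v) steps at v" without simulating every self loop is to draw the waiting time from a geometric distribution with success probability deg(v)/deg_max. The Python sketch below assumes the incident links of v are already known; `steps_at_node`, `next_node` and the toy `incident` map are hypothetical names used only for illustration.

```python
import math
import random


def steps_at_node(deg_v, deg_max, rng):
    """Sample how many steps the walk spends at a node of degree deg_v.

    Each step leaves the node with probability p = deg_v / deg_max, so the
    count is geometric with mean deg_max / deg_v.
    """
    p = deg_v / deg_max
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.floor(math.log(u) / math.log(1.0 - p))) + 1


def next_node(v, incident, deg_max, rng):
    """One WebWalker move: (steps spent at v, next node u)."""
    links = incident[v]
    return steps_at_node(len(links), deg_max, rng), rng.choice(links)


# Hypothetical incident-link lists (in-links and out-links merged)
incident = {"v": ["a", "b"], "a": ["v"], "b": ["v"]}
rng = random.Random(0)
waited, u = next_node("v", incident, deg_max=10, rng=rng)
print(waited, u)
```

Because the self loop steps are only counted and never executed, the cost of a move does not grow with deg_max, which matches the remark that self loop steps are free.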

WebWalker: Realization (2)
- WebWalker uses only available links:
  - out-links
  - in-links from previously visited pages
  - the first r in-links returned from the search engines
- WebWalker walks on a sub-graph of the Web: the sub-graph induced by the available links.
- To ensure consistency, as soon as a page is visited its incident edge list is fixed for the rest of the walk.
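The consistency rule can be sketched as a small bookkeeping function. This is an illustration under assumptions, not the authors' code: `out_links` and `engine_in_links` are hypothetical lookup tables standing in for page fetches and search engine queries, and dropping links to already-visited pages whose frozen list lacks the new page is an assumption made here to keep the sub-graph undirected.

```python
def fix_edges(v, out_links, engine_in_links, fixed, r):
    """Freeze v's incident edge list the first time v is visited."""
    if v in fixed:
        return fixed[v]  # already visited: the list never changes
    candidates = set(out_links.get(v, []))               # out-links of v
    candidates.update(engine_in_links.get(v, [])[:r])    # first r in-links from the engine
    # In-links from previously visited pages whose frozen list points at v
    candidates.update(u for u, nbrs in fixed.items() if v in nbrs)
    candidates.discard(v)
    # Assumption: skip visited pages whose frozen list does not contain v,
    # so every kept edge is usable in both directions
    fixed[v] = sorted(u for u in candidates if u not in fixed or v in fixed[u])
    return fixed[v]


# Hypothetical data: a -> b, b -> a, b -> c, c -> a
out_links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
engine_in_links = {"a": ["c", "b"]}

fixed = {}
for page in ["a", "c", "b"]:  # order in which the walk first visits the pages
    fix_edges(page, out_links, engine_in_links, fixed, r=1)
print(fixed)
```

In this run the link b -> c is not used, because c was visited before b and its list was already frozen without b. Every edge that remains appears in the lists of both of its endpoints.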

WebWalker: Example
[Figure: the Web graph next to WebWalker's induced sub-graph, over nodes v1 to v6 and w. Legend: covered by search engines / not covered by search engines; available link / non-available link.]

WebWalker: Bad News
- WebWalker becomes a true random walk only after its induced sub-graph "stabilizes".
- The induced sub-graph is random.
- The induced sub-graph misses some of the nodes.
- The eigenvalue gap analysis does not hold anymore.

WebWalker: Good News
- WebWalker eventually converges to a uniform distribution on the nodes of its induced sub-graph.
- WebWalker is a "close approximation" of a random walk long before the sub-graph stabilizes.
- Theorem: WebWalker's induced sub-graph is guaranteed to eventually cover the whole indexable Web.
- Corollary: WebWalker can produce uniform samples from the indexable Web.

Evaluation of WebWalker's Performance
Questions to address in experiments:
- Structure of induced sub-graphs
- Mixing time
- Potential bias in early stages of the walk:
  - towards high degree pages
  - towards the search engines
  - towards the starting page's neighborhood

WebWalker: Evaluation Experiments
- Run WebWalker on the 1996 copy of the Web:
  - 37.5 million pages
  - 15 million indexable pages
  - deg_avg = 7.15
  - deg_max = 300,000
- Designate a fraction p of the pages as the search engine index.
- Use WebWalker to generate a sample of 100,000 pages.
- Check the resulting sample against the actual values.

Evaluation: Bias towards High Degree Nodes
[Figure: percent of nodes from the walk, by deciles of nodes ordered by degree, from high degree to low degree.]

Evaluation: Bias towards the Search Engines
[Figure: estimate of search engine size against actual search engine size, for sizes of 30% and 50%.]

Evaluation: Bias towards the Starting Node's Neighborhood
[Figure: percent of nodes from the walk, by deciles of nodes ordered by distance from the starting node, from close to far.]

WebWalker: Experiments on the Web
- Run WebWalker on the actual Web.
- Two runs of 34,000 pages each.
- Dates: July 8, 2000 to July 15, 2000.
- Used four search engines for reverse links: AltaVista, HotBot, Lycos, Go.

Domain Name Distribution

Search Engine Coverage

Web Page Parameters
- Average page size: 8,390 bytes
- Average number of images on a page: 9.3
- Average number of hyperlinks on a page: 15.6

Conclusions
Uniform sampling of Web pages by random walks.
- Good news:
  - the walk provably converges to a uniform distribution
  - easy to implement and run with few resources
  - encouraging experimental results
- Bad news:
  - no theoretical guarantees on the walk's mixing time
  - some biases towards high degree nodes and the search engines
- Future work:
  - obtain a better theoretical analysis
  - eliminate biases
  - deal with dynamic content

Thank You!