Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State.

Studying the Web
Goal: study the characteristics of the Web (statistics, topology, behavior, …).
Why: scientific curiosity, and practical value, e.g., search engine coverage (Nature, 1999).

Web as Platform
The Web has become a new computation platform, which poses new challenges: scale, efficiency, heterogeneity, and impact on people's lives.

E.g., How Big is the Web?
Q1: How many web sites?
Q2: How many web pages?
Q3: How many surface/deep web pages?
Research method: mostly the experimental method, used to validate novel solutions.

Q1: How Many Web Sites?
Source: DNS registrars, which hold lists of domain names.
Issues:
Not every domain is a web site.
A domain may contain more than one web site.
Registrars are under no obligation to keep their records correct.
There are so many of them …

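How Many Web Sites?
Brute force: poll every IP address.
IPv4: 2^32 ≈ 4 billion addresses; IPv6: 2^128 addresses.
At 10 sec/IP with 1,000 simultaneous connections: 2^32 * 10 / (1000 * 24 * 60 * 60) ≈ 500 days (about 460 days if 2^32 is rounded down to 4 billion).
Not going to work!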
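The brute-force arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `polling_days` is made up for illustration, and the 10-second probe time and 1,000 parallel connections are the values implied by the slide's formula.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def polling_days(num_addresses: int, seconds_per_ip: float, parallel: int) -> float:
    """Days needed to probe every address once, given a per-probe time
    and a fixed number of simultaneous connections."""
    return num_addresses * seconds_per_ip / (parallel * SECONDS_PER_DAY)

# Using 2^32 exactly, and using the rounded figure of 4 billion addresses.
exact = polling_days(2**32, 10, 1000)
rounded = polling_days(4_000_000_000, 10, 1000)
print(f"{exact:.0f} days with 2^32, {rounded:.0f} days with 4 billion")
```

Either way the answer is well over a year for a single pass, which is why the deck moves on to sampling.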

How Many Web Sites?
2nd attempt: sampling.
T: all 4 billion IPs
S: sampled IPs
V: valid replies

How Many Web Sites?
1. Select |S| random IPs.
2. Send HTTP requests to port 80 at the selected IPs.
3. Count valid replies ("HTTP 200 OK"): this is |V|.
4. With |T| = 2^32, the estimated number of web sites is |T| × |V| / |S|.
Q: What are the issues here?
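The sampling procedure above boils down to scaling the hit rate in the sample up to the whole address space. The sketch below shows that estimator on a synthetic address space instead of the real Internet, so no network access is involved. The names `estimate_sites` and `simulate_survey`, the toy space of one million addresses, and the 2% response rate are all assumptions made for illustration.

```python
import random

def estimate_sites(total: int, sampled: int, valid: int) -> float:
    """Scale the fraction of valid replies in the sample up to the whole space."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return total * valid / sampled

def simulate_survey(total: int, live: set, sample_size: int, seed: int = 0) -> float:
    """Draw random addresses, count those that would answer, and extrapolate."""
    rng = random.Random(seed)
    sample = rng.sample(range(total), sample_size)
    valid = sum(1 for address in sample if address in live)
    return estimate_sites(total, sample_size, valid)

# Toy address space: 1,000,000 addresses, of which 20,000 host a site.
TOTAL = 1_000_000
live_addresses = set(random.Random(42).sample(range(TOTAL), 20_000))
estimate = simulate_survey(TOTAL, live_addresses, sample_size=50_000)
print(f"true: 20000, estimated: {estimate:.0f}")
```

The estimate only counts one site per answering address on port 80, so it inherits every issue listed on the next slide.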

Issues
Virtual hosting
Ports other than 80
Temporarily unavailable sites
…

OCLC Survey (2002)
OCLC: Online Computer Library Center
Results: still room for growth (at least for Web sites)?

Netcraft Web Server Survey (2010)
Goal: measure web server market share. The survey also records the # of sites its crawlers visited.
August 2010: 213,458,815 distinct sites

Netcraft Web Server Survey (2013)
Goal: measure web server market share. The survey also records the # of sites its crawlers visited.
August 2013: 716,822,317 distinct sites

Netcraft Web Server Survey (2014)
Goal: measure web server market share. The survey also records the # of sites its crawlers visited.
August 2014: 992,177,228 distinct sites

Q2: How Many Web Pages?
Sampling based? What is the issue here?
T: all URLs
S: sampled URLs
V: valid replies

How Many Web Pages?
Method #1:
For each site with a valid reply, download all pages.
Measure the average # of pages per site.
Estimate: (avg # of pages per site) × (total # of sites).
Result [Lawrence & Giles, 1999]: 289 pages per site, 2.8M sites; 289 × 2.8M ≈ 800M web pages.

Further Issues
A small # of sites have TONS of pages; sampling could miss these sites.
The majority of sites (99.99% of the sites) have a small # of pages; lots of samples are necessary.
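A small simulation makes the problem concrete. The population below is invented purely for illustration (100,000 sites, ten of which hold ten million pages each while the rest hold ten pages), and `estimate_total_pages` is a hypothetical helper that applies Method #1 to a random sample of sites.

```python
import random

def estimate_total_pages(page_counts, num_sites, sample_size, rng):
    """Method #1: average pages per sampled site times the total number of sites."""
    sample = rng.sample(page_counts, sample_size)
    return sum(sample) / sample_size * num_sites

# Invented heavy-tailed population: 99.99% small sites, 0.01% giant sites.
NUM_SITES = 100_000
page_counts = [10] * (NUM_SITES - 10) + [10_000_000] * 10
true_total = sum(page_counts)

rng = random.Random(1)
estimates = [estimate_total_pages(page_counts, NUM_SITES, 1_000, rng)
             for _ in range(200)]

# A run that samples no giant site reports exactly 10 pages per site.
missed = sum(1 for e in estimates if e == 10 * NUM_SITES)
print(f"true total: {true_total:,}")
print(f"runs that missed every giant site: {missed} of {len(estimates)}")
```

In this toy setup a sample of 1,000 sites either misses every giant site and underestimates the total by about a factor of 100, or catches one and overestimates it by about a factor of 10, so far more samples would be needed for a stable answer.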

How Many Web Pages?
Method #2: random sampling.
Assume:
T: all pages
B: base set
S: random samples
If S is a uniform sample of T, then |B ∩ S| / |S| ≈ |B| / |T|, so |T| ≈ |B| × |S| / |B ∩ S|.

Random Page?
Idea: random walk.
Start from a portal home page (e.g., Yahoo).
Estimate the size of the portal: B.
Follow random links, say 10,000 times, and select the pages visited.
At the end, a set S of random web pages has been gathered.

Straightforward Random Walk
(Figure: example walk over google.com, amazon.com, pike.psu.edu.)
Follow a random out-link at each step.
Issues?
1. Gets stuck in sinks and in dense Web communities.
2. Biased towards popular pages.
3. Converges slowly, if at all.

Going to Converge?
A random walk on a regular, undirected graph yields a uniformly distributed sample.
Theorem [Markov chain folklore]: after O((1/ε) log N) steps, a random walk reaches the stationary distribution.
ε: depends on the graph structure
N: number of nodes
Idea: transform the Web graph into a regular, undirected graph, then perform a random walk.
Problem: the Web is neither regular nor undirected.

Intuition
A random walk on the undirected Web graph (not regular) has a high chance of being at a "popular" node at any particular time.
Increase the chance of being at an "unpopular" node by staying there longer through self-loops.
(Figure: unpopular nodes and a popular node.)

WebWalker: Undirected Regular Random Walk on the Web
Fact: a random walk on a connected, undirected, regular graph converges to a uniform stationary distribution after a certain # of steps.
Follow a random out-link or a random in-link at each step.
Use weighted self-loops to even out pages' degrees: w(v) = deg_max − deg(v).
(Figure: example walk over google.com, pike.psu.edu, amazon.com.)

Ideal Random Walk
Generate the regular, undirected graph:
Make edges undirected.
Decide d, the maximum # of edges per page: say, 300,000.
If edge(n) < 300,000, add self-loops until page n has 300,000 edges.
Perform random walks on the graph.
For the 1996 Web: ε ≈ …, N ≈ 10^9.
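The effect of the self-loops can be seen on a tiny graph. The sketch below is not the WebWalker implementation. It is a minimal illustration on an invented star-shaped graph (one "portal" linking to five pages) and compares a plain walk on the undirected graph with a walk where each node v behaves as if it had deg_max − deg(v) self-loops. The names `make_undirected` and `visit_shares` are hypothetical.

```python
import random
from collections import Counter

def make_undirected(links):
    """Treat every directed link as an undirected edge."""
    graph = {}
    for src, dst in links:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set()).add(src)
    # Sorted neighbour lists keep the walk reproducible for a fixed seed.
    return {node: sorted(nbrs) for node, nbrs in graph.items()}

def visit_shares(graph, start, steps, regular, seed=0):
    """Fraction of steps spent at each node during a random walk."""
    rng = random.Random(seed)
    deg_max = max(len(nbrs) for nbrs in graph.values())
    visits = Counter()
    node = start
    for _ in range(steps):
        nbrs = graph[node]
        if regular:
            # Choose one of deg_max slots; slots past the real neighbours
            # act as self-loops, so the walk stays where it is.
            slot = rng.randrange(deg_max)
            if slot < len(nbrs):
                node = nbrs[slot]
        else:
            node = rng.choice(nbrs)
        visits[node] += 1
    return {v: visits[v] / steps for v in graph}

links = [("portal", page) for page in "abcde"]
graph = make_undirected(links)
plain = visit_shares(graph, "portal", 200_000, regular=False)
regular = visit_shares(graph, "portal", 200_000, regular=True)
print("plain walk, share of time at portal:", round(plain["portal"], 3))
print("regularized walk shares:", {v: round(s, 3) for v, s in regular.items()})
```

The plain walk spends half its time at the popular node, while the regularized walk spends close to one sixth of its time at each of the six nodes, which is the uniform behaviour the slide is after.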

WebWalker Results (2000)
Size of the Web, in pages:
AltaVista: |B| = 250M
|B ∩ S| / |S| = 35%
Estimated |T| ≈ 720M
Avg page size: 12K
Avg # of out-links: 10
Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz. Approximating Aggregate Queries about Web Pages via Random Walks. VLDB, 2000.
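The size estimate follows from the overlap ratio: if the sample S is roughly uniform over all pages T, then |B ∩ S| / |S| ≈ |B| / |T|. Below is a minimal sketch of that calculation. The function name `estimate_total` and the toy population of 10,000 pages are assumptions for illustration; only the 250M and 35% figures come from the slide.

```python
import random

def estimate_total(base_size: float, sample_size: int, overlap: int) -> float:
    """|T| is approximately |B| * |S| / |B intersect S| when S is a uniform sample of T."""
    if overlap <= 0:
        raise ValueError("the sample shares no pages with the base set")
    return base_size * sample_size / overlap

# Figures from the slide: |B| = 250M and an overlap ratio of 35%.
slide_estimate = estimate_total(250_000_000, 100, 35)
print(f"slide figures give about {slide_estimate / 1e6:.0f}M pages")

# Toy check: 10,000 pages, of which the first 2,500 form the base set.
all_pages = range(10_000)
base = set(range(2_500))
sample = random.Random(7).sample(all_pages, 1_000)
overlap = sum(1 for page in sample if page in base)
toy_estimate = estimate_total(len(base), len(sample), overlap)
print(f"toy estimate: {toy_estimate:.0f} (true size 10000)")
```

With the slide's figures the formula gives roughly 714M pages, which is in line with the reported estimate of about 720M.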

How large is SE's Index?
Prepare a representative corpus (e.g., DMOZ).
Draw a word W with a known frequency percentage F. E.g., "the" is present in 60% of all documents within the corpus.
Submit W to a search engine E.
If E reports that X documents contain W, one can extrapolate the total size of E's index as ≈ X / F.
Repeat multiple times and compute the average.
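The extrapolation above can be sketched in a few lines. Everything concrete here is invented for illustration: the five-document corpus stands in for a representative corpus such as DMOZ, the reported hit counts are made-up numbers rather than real search engine output, and `document_frequency` and `estimate_index_size` are hypothetical helper names.

```python
from statistics import mean

def document_frequency(corpus, word):
    """Fraction of corpus documents that contain the word (whitespace tokens)."""
    word = word.lower()
    hits = sum(1 for doc in corpus if word in doc.lower().split())
    return hits / len(corpus)

def estimate_index_size(corpus, reported_counts):
    """Average of X / F over all probe words that occur in the corpus."""
    estimates = []
    for word, reported in reported_counts.items():
        frequency = document_frequency(corpus, word)
        if frequency > 0:
            estimates.append(reported / frequency)
    return mean(estimates)

corpus = [
    "the web is big",
    "the crawler follows links",
    "search engines index the web",
    "random walks sample pages",
    "the index is large",
]
# Hypothetical hit counts reported by a search engine for each probe word.
reported_counts = {"the": 8_000, "web": 4_400, "index": 3_600}
index_size = estimate_index_size(corpus, reported_counts)
print(f"estimated index size: {index_size:.0f} documents")
```

Averaging over several words matters because any single word's frequency in the corpus may differ from its frequency in the engine's index.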

(Figures: estimated index sizes in billions of pages, for 2010, 2011, and 2013.)

Google Reveals Itself (2008)
1998: 26 million URLs
2000: 1 billion URLs
2008: 1 trillion URLs
Not all of them are indexed: duplicates, auto-generated pages (e.g., calendars), spam.
Experts suspect (2010) that the Google index holds at least 40 billion pages.

Deep Web (aka Hidden Web)
(Figure: a query is submitted through an HTML FORM interface and answers are returned.)

Q3: Size of Deep Web?
Deep Web: information reachable only through a query interface (e.g., HTML FORM), often backed by a DBMS.
How to estimate? By sampling:
(Avg size of a record) × (Avg # of records per site) × (Total # of Deep Web sites)

Size of Deep Web?
Total # of Deep Web sites: estimate by sampling, using the overlap ratio |B ∩ S| / |S|.
Avg size of a record: issue random queries and estimate the reply size.
Avg # of records per site: permute all possible queries for the FORM, issue all of them, and count the valid returns.

Size of Deep Web (2005)
BrightPlanet report estimates:
Avg size of a record: 14KB
Avg # of records per site: 5M
Total # of Deep Web sites: 200,000
Size of the Deep Web: about 10^16 bytes (10 petabytes), 1,000 times larger than the "Surface Web".
How to access it? Wrapper/Mediator (aka Web scraping): obsolete now.
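The three factors multiply out as follows. This is a back-of-the-envelope check rather than part of the report: it assumes decimal kilobytes and reads the per-site figure as roughly 5 million records, which is the reading that makes the product land on the order of 10^16 bytes.

```python
KB = 1_000  # decimal kilobyte (assumption; the slide does not say which convention)

avg_record_bytes = 14 * KB
avg_records_per_site = 5_000_000  # assumed reading: about 5 million records per site
num_deep_web_sites = 200_000

total_bytes = avg_record_bytes * avg_records_per_site * num_deep_web_sites
petabytes = total_bytes / 10**15
print(f"total: {total_bytes:.1e} bytes, about {petabytes:.0f} petabytes")
```

The product is 1.4 × 10^16 bytes, the same order of magnitude as the 10 petabytes quoted on the slide.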