Measuring the Web

What?
Use, size
– Of entire Web, of sites (popularity), of pages
– Growth thereof
Technologies in use (servers, media types)
Properties
– Traffic (periodicity, self-similarity, timeouts, ..)
– Page-change properties (frequency, amount, ..)
– Links (self-similarity,

Why?
Improve Web technologies
Improve sites
Improve search
Justify prices
Science

How?
Surveys
Instrumentation
– Proxy/router logs, server logs,
Sampling and statistical inference

A few surveys (services)
Nielsen/NetRatings
Pew Internet Project
DLF/CLIR study

Some survey results
NetRatings (Dec 2002)
– 168M US “Home” Internet users
– Use Web 7 hours/week to view 17 sites
Pew Study (July 2002)
– 111M US Internet users
– 33M of them use a search engine once/day

Simple sampling
Netcraft server survey
– Crawling and URL submission
– 35M sites in 2002 (Archive has 50M)
OCLC Host survey
– Generate random IP addrs and look for hosts
– 9,040,000 IP addrs with web servers in ,712,000 Unique Web sites

OCLC technique
Generate 0.1% * 2^32 random IP numbers
Screen out “bad ones”
– Private addresses, IANA lists
HTTP to port 80 of remainder
Multiply number of responses by 1000
Use heuristics to eliminate “duplicates”
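The sampling steps above can be sketched roughly as follows. This is a minimal illustration, not OCLC's actual code: the function names are made up for the example, Python's built-in address classes stand in for the private-address and IANA lists, and `probe` is a hypothetical placeholder for the HTTP request to port 80. Duplicate elimination is left out.

```python
import ipaddress
import random

ADDRESS_SPACE = 2**32  # number of IPv4 addresses


def draw_candidates(sample_size, rng):
    """Draw random IPv4 addresses and keep only those worth probing."""
    candidates = []
    for _ in range(sample_size):
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        # Assumption: these built-in checks approximate the "bad ones"
        # screen (private addresses, IANA special-purpose ranges).
        if addr.is_global and not addr.is_multicast and not addr.is_reserved:
            candidates.append(addr)
    return candidates


def estimate_total(responses, sample_size):
    """Scale the number of responding addresses up to the full IPv4 space."""
    return responses * ADDRESS_SPACE / sample_size


def survey(sample_size, probe, rng):
    """Run the survey; probe(addr) -> bool is a hypothetical port-80 check."""
    responding = [a for a in draw_candidates(sample_size, rng) if probe(a)]
    return estimate_total(len(responding), sample_size)


if __name__ == "__main__":
    # Fake probe for demonstration only: no network traffic is generated.
    fake_probe = lambda addr: int(addr) % 500 == 0
    print(survey(10_000, fake_probe, random.Random(0)))
```

The scale factor is simply the address space divided by the sample size, so the multiplier used in the extrapolation step has to match the sampling fraction.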

IP sampling and virtual hosting
Netcraft says 1/2 of domain names virtually hosted on 100K IP numbers
In 2000, OCLC said 3M IP addrs serving data, versus 3.4M IP addrs found by Netcraft

Interlude: “Size of Web”
Size in (virtual) hosts, probably 40-60M
– Based on Netcraft, OCLC, and Archive data
Size in pages: infinite
– People are obsessed with providing page estimates, but this is a silly thing to do!

Heavy-tailed distributions
Zipf, Pareto, power laws, lognormal
Chic to find such things (Web, physics, bio)
– …and then postulate “generative models”
Statistics are squirrelly
– For example, averages can be misleading

Heavy-tails on the Web
Host and page:
– Links (in and out)
– Sizes
– Popularity
– In page case, both inter- and intrasite
Page-size-to-popularity (Zipfian)
Page and user reading times

Tripping on heavy tails
How not to compute size of Web:
– Use OCLC approach to find random hosts
– Crawl each of these to measure average size
– Multiply average size by host count
Problem: heavy-tailed distribution of host size means that host sample is biased towards smaller hosts
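A small simulation can make the problem concrete. The sketch below assumes a synthetic Zipf-like population in which the host of rank r holds about `largest / r` pages. The population, the sample size and the function names are illustrative assumptions, not measured Web data. It repeats the naive estimate many times so the typical result can be compared with the true total.

```python
import random
import statistics


def make_host_sizes(num_hosts, largest):
    """Synthetic Zipf-like host sizes: rank r holds about largest / r pages."""
    return [max(1, largest // rank) for rank in range(1, num_hosts + 1)]


def naive_web_size(host_sizes, sample_n, rng):
    """Average size of a random host sample, multiplied by the host count."""
    sample = rng.sample(host_sizes, sample_n)
    return statistics.mean(sample) * len(host_sizes)


def run_trials(host_sizes, sample_n, trials, seed=0):
    """Repeat the naive estimate with independent random samples."""
    rng = random.Random(seed)
    return [naive_web_size(host_sizes, sample_n, rng) for _ in range(trials)]


if __name__ == "__main__":
    sizes = make_host_sizes(100_000, 1_000_000)
    estimates = run_trials(sizes, sample_n=100, trials=200)
    print("true total:     ", sum(sizes))
    print("median estimate:", statistics.median(estimates))
    print("largest estimate:", max(estimates))
```

In this toy population most samples contain only small hosts, so the typical estimate falls well short of the true total, while the occasional sample that happens to include a very large host overshoots.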

Advanced inference
Determine relative size of search engines A, B
Pr[A&B|A] = |A&B|/|A|
Pr[A&B|B] = |A&B|/|B|
=> |A|/|B| = Pr[A&B|B] / Pr[A&B|A]

Advanced inference
Sample URLs from A
– Issue random conjunctive query with <200 results, select a random result
Test if present in B
– Query with 8 rarest words and look for result
Estimate Pr[A&B|A] as the fraction of URLs sampled from A that are also found in B
URL sampling biased to long documents
Biased by ranking and details of engine
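The overlap estimate can be illustrated with an idealized sketch. It assumes each index is available as a Python set and can be sampled uniformly, which real engines do not allow (hence the query-based sampling and its biases noted above). The function names and the toy indexes are hypothetical.

```python
import random


def overlap_fraction(sample, other_index):
    """Fraction of sampled URLs that are also present in the other index."""
    return sum(1 for url in sample if url in other_index) / len(sample)


def size_ratio(p_ab_given_a, p_ab_given_b):
    """|A|/|B| = Pr[A&B|B] / Pr[A&B|A]."""
    if p_ab_given_a == 0:
        raise ValueError("no overlap observed in the sample from A")
    return p_ab_given_b / p_ab_given_a


def relative_size(index_a, index_b, sample_n, rng):
    """Estimate |A|/|B| from uniform samples of each index (idealized)."""
    sample_a = rng.sample(sorted(index_a), sample_n)
    sample_b = rng.sample(sorted(index_b), sample_n)
    return size_ratio(
        overlap_fraction(sample_a, index_b),
        overlap_fraction(sample_b, index_a),
    )


if __name__ == "__main__":
    # Toy indexes: A holds 60,000 URL ids, B holds 30,000, 20,000 are shared.
    index_a = set(range(0, 60_000))
    index_b = set(range(40_000, 70_000))
    print(relative_size(index_a, index_b, 2_000, random.Random(0)))
```

Because only the ratio of the two conditional probabilities is needed, the size of the overlap itself never has to be known.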

Conclusions
Measuring Web is hard because it cannot be enumerated or even reliably sampled
Statistical methods impacted by biases that cannot be quantified
Validation is not possible
The problem is getting harder (e.g., link spam)
Quantitative studies are fascinating and a good research problem