CS246 Web Characteristics

Junghoo "John" Cho (UCLA Computer Science) Web Characteristics What is the Web like? Any questions on some of the characteristics and/or properties of the Web? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Web Characteristics Size of the Web Search engine coverage Link structure of the Web Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How Many Web Sites? Polling every IP 2^32 = 4B sites, 10 sec/IP, 1000 simultaneous connection: 2^32*10/(1000*24*60*60) = 460 days Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How Many Web Sites? Sampling based T: All IPs S: Sampled IPs V: Valid reply Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How Many Web Sites? Select |S| random IPs Send HTTP requests to port 80 at the selected IPs Count valid replies: “HTTP 200 OK” = |V| |T| = 2^32 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How Many Web Sites? OCLC (Online Computer Library) results http://wcp.oclc.org Total number of available IPs: 2^32 = 4.2 Billion 1998 1999 2000 2001 2002 Sites 2,636,000 4,662,000 7,128,000 8,443,000 8,712,000 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Issues Multi-hosted servers cnn.com: 207.25.71.5, 207.25.71.20, … Select the lowest IP address For each sampled IP: Look up domain name Resolve the name to IP Is our sampled IP the lowest? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Issues Virtual hosting Multiple sites on the same IP Find the average number of hosted sites per IP 7.4M sites on 3.4M IPs by polling all available site names [Netcraft, 2000] Other ports? Temporarily unavailable sites? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Questions? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How Many Web Pages? Sampling based? T: All URLs S: Sampled URLs V: Valid reply Infinite number of URLs Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How Many Web Pages? Solution 1: Estimate the average number of pages per site: (average no of pages) * (total no of sites) Algorithm: For each site with valid reply, download all pages Take average Result [LG99]: 289 pages per site, 2.8M sites 800M pages Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Issues 99.99% of the sites A small number of sites with TONS of pages Very likely to miss these sites Lots of samples necessary Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How Many Pages? Solution 2: Sampling-based T: All pages B: Base set S: Random samples Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Related Question How many deer in Yosemite National Park? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Random Page? Idea: Random walk Start from the Yahoo home page Follow random links, say 10,000 times Select the page Problem: Biased to “popular” pages. e.g., Microsoft, Google Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Random Page? Random walks on regular, undirected graph  uniform random sample Regular graph: an equal number of edges for all nodes After 1/𝜺 log(N) steps 𝜺 : depends on the graph structure N: number of nodes Idea: Transform the Web graph to a regular, undirected graph Perform a random walk Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Ideal Random Walk Generate the regular, undirected graph: Make edges undirected Decide d the maximum # of edges per page: say, 300,000 If edge(n) < 300,000, then add self-loop Perform random walks on the graph 𝜺 ~10-5 for the 1996 Web, N ~109 3,000,000 steps, but mostly self-loops 100 actual walk Junghoo "John" Cho (UCLA Computer Science)

Different Interpretation
A random walk on the irregular Web graph has a high chance of being at a "popular" node at any particular time. Self-loops increase the chance of being at an "unpopular" node by making the walk stay there longer. (Diagram: one popular node linked to several unpopular nodes.)

Junghoo "John" Cho (UCLA Computer Science) Issues How to get edges to/from node n? Edges discovered so far From search engines, like Altavista, HotBot Still limited incoming links Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) WebWalker [BBCF00] Our graph does not have to be the same as the real Web Construct regular undirected graphs while performing the random walk Add new node n when it visits n Find edges for node n at that time Edges discovered so far From search engines Add self-loops as necessary Ignore any more edges to n later Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) WebWalker d = 5 1 2 2 3 1 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) WebWalker Why ignore “new incoming” edges? Make the graph regular. “Discovered parts” of the graph do not change “Uniformity theorem” still holds Can we arrive at “all reachable” pages? We ignore only the edges to “visited nodes” Can we use the same 𝜺? No Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) WebWalker results Size of the Web Altavista: |B| = 250M |B⋂S|/|S| = 35% |T| = 720M Avg page size: 12K Avg no of out-links: 10 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) WebWalker results Pages by domain .com: 49% .edu: 8% .org: 7% .net: 6% .de: 4% .jp: 3% .uk: 3% .gov: 2% Junghoo "John" Cho (UCLA Computer Science)

What About Other Web Pages?
Pages that are:
- Available only within a corporate intranet
- Protected by authentication
- Not reachable by following links, e.g., pages within e-commerce sites
Deep Web vs. Hidden Web: information reachable only through a search interface. What if a page is reachable both through links and through a search interface?

Junghoo "John" Cho (UCLA Computer Science) Size of Deep Web? Estimation: (Avg no of records per site) * (Total no of Deep Web sites) How to estimate? By sampling Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Size of Deep Web? Total # of Deep Web sites: |B⋂S|/|S| Avg no of records per site: Contact the site directly Use “Not zzxxyyxx,” if the site reports no of matches Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Size of Deep Web BrightPlanet report Avg no of records per site: 5 million Total no of Deep Web sites: 200,000 Avg size of a record: 14KB Size of the Deep Web: 10^16 (10 petabytes) 1000 larger than the “Surface Web” How to access it? Junghoo "John" Cho (UCLA Computer Science)

How Stable Are the Sites?
Monitor a set of random sites. Percentage of the 1998 Web servers still available in later years (similar results for other starting years):

Year       1998  1999  2000  2001  2002
Available  100%  56%   35%   25%   13%

Junghoo "John" Cho (UCLA Computer Science) Web Characteristics Size of the Web Search engines Link structure of the Web Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Search Engines Coverage Overlap Dead links Indexing delay Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Coverage? Q: How to estimate coverage? A: Create a random sample and measure how many of them are indexed by a search engine In 1999 Estimated Web size: 800M, 1999 Reported indexed pages: 128M (Northern light)  16% No reliable Web size estimate at this point Search engines often claim ~20B index Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Overlap? How many pages are commonly indexed? Method 1 Create a random sample and measure how many are indexed only by A or B and commonly by A and B Method 2 Send common queries, compare returned pages, and measure overlap Result from method 2: Little overlap E.g., Infoseek and AltaVista: 20% overlap [Bharat and Broder 1997] Is it still true? Results seem to converge Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Dead Links? Q: How can we measure what fraction of pages in search engines are dead? A: Issue random queries and check and see whether returned pages are dead? Result in Feb 2000 AltaVista: 13.7% Excite: 8.7% Google: 4.3% Search engines have got much better due to better recrawling algorithms A topic for later study Junghoo "John" Cho (UCLA Computer Science)

How Early Are Pages Indexed?
- Method 1: create pages at random locations and check when they become available in search engines. Con: it is difficult to create pages at random locations.
- Method 2: repeatedly issue the same queries over time; when a new page appears in the results, record its "last modified date". Con: the last modified date is only a "lower bound" on how long indexing took.

How Early Are Pages Indexed?
Mean time to indexing [Lawrence and Giles 2000]:
- Northern Light: 141 days
- AltaVista: 166 days
- HotBot: 192 days

Junghoo "John" Cho (UCLA Computer Science) Web Characteristics Size of the Web Search engines Link structure of the Web Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Web As A Graph Page: Node Link: Edge Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Link Degree How many links? In-degree Power law Why consistently 2.1? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Link Degree Out-degree Junghoo "John" Cho (UCLA Computer Science)

Large-Scale Structure?
Study by AltaVista & IBM, 1999, based on 200M pages downloaded by the AltaVista crawler. The "bow-tie" result is based on two experiments.

Experiment 1: Strongly Connected Components
C is a strongly connected component (SCC) if ∀a, b in C, there are paths from a to b and from b to a. (Examples: a three-node graph on a, b, c may or may not be an SCC, depending on its edge directions.)
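A minimal sketch of Experiment 1 at toy scale, using networkx (the 200M-page study of course used custom large-scale code):

    import networkx as nx

    g = nx.DiGraph()
    g.add_edges_from([("a", "b"), ("b", "a"), ("b", "c")])   # a <-> b, b -> c

    # a and b reach each other, so they form an SCC; c cannot reach back.
    sccs = sorted(nx.strongly_connected_components(g), key=len, reverse=True)
    print(sccs)   # [{'a', 'b'}, {'c'}]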

Junghoo "John" Cho (UCLA Computer Science) Result 1: SCC Identified all SCCs from 200M pages Biggest SCC: 50M (25%) Other SCCs are small Second largest: 150K Mostly fewer than 1000 nodes Junghoo "John" Cho (UCLA Computer Science)

Experiment 2: Reachability
How many pages can we reach starting from a random page? Experiment (sketched below):
- Pick 500 random pages
- Follow links breadth-first until no more links
- Repeat the same experiment following links in the "reverse direction"
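A sketch of the forward-direction measurement on a toy adjacency list; the reverse direction is the same BFS on the transposed graph.

    from collections import deque

    def reachable_count(graph: dict, start) -> int:
        """BFS from start; returns the number of reachable pages."""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in graph.get(node, ()):
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        return len(seen)

    web = {"p1": ["p2"], "p2": ["p3"], "p3": []}
    print(reachable_count(web, "p1"))   # 3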

Junghoo "John" Cho (UCLA Computer Science) Result 2: Reachability Out-links (forward direction) 50% reaches 100M 50% reaches fewer than 1000 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Result 2: Reachability In-links (reverse direction) 50% reaches 100M 50% reaches fewer than 1000 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) What Can We Conclude? 50M (25%) SCC SCC (50M, 25%) Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) What Can We Conclude? How many nodes would we reach from SCC? Clearly not 1000, then 100M 50M more pages reachable from SCC (no way back, though) Out (50M, 25%) SCC (50M, 25%) Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) What Can We Conclude? Similar result for “in-links” when we followed links backwards 50M more pages reachable by following in-links SCC (50M, 25%) In (50M, 25%) Out (50M, 25%) Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) What Can We Conclude? 25% Miscellaneous (50M, 25%) SCC (50M, 25%) In (50M, 25%) Out (50M, 25%) Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Questions How did they “crawl” 50M In and 50M Misc nodes in the first place? There may be much more In and Misc nodes that were not crawled (25% is lower bounds) Only 25% SCC surprising (will be explained) Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) SCC If there are only two links, A  B and B  A, then A and B becomes one SCC. A B Junghoo "John" Cho (UCLA Computer Science)

Links Between In, SCC, and Out
- Not a single link from SCC back to In
- Not a single link from Out back to SCC
- At least 50% of the Web is "unknown" to the core SCC

Junghoo "John" Cho (UCLA Computer Science) Diameter of SCC On average, 16 links between two nodes in SCC The “maximum distance” (diameter) is at least 28 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Questions? Junghoo "John" Cho (UCLA Computer Science)

More Sources For Web Characteristics
- OCLC (Online Computer Library Center): http://wcp.oclc.org
- Netcraft Survey: http://www.netcraft.com/survey/
- NEC Web Analysis: http://www.webmetrics.com

Junghoo "John" Cho (UCLA Computer Science) How To Sample? Method 1: Take the last page and repeat Many “wasted” visits Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How To Sample? Method 2: Take last k pages Are they random samples? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How To Sample? Theorem: If k is large enough, they are approximately random pages Intuition: If we visit many pages, we visit all different pages Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How To Sample? Goal: Estimate A/N by m/k. Make A/N ~ m/k, i.e., if N A k m Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How To Sample? Assuming A is 20% of the Web  = 0.1: less than 10% error  = 0.01: 99% confidence  = 10^-5: the value from 1996 Web crawl k = 350,000,000 12,000 non-self-loop Junghoo "John" Cho (UCLA Computer Science)