
CS345 Data Mining Crawling the Web

Web Crawling Basics
- Start with a "seed set" of to-visit URLs.
- [Diagram: the crawl loop: get next URL, get page from the Web, extract URLs; the crawler maintains to-visit URLs, visited URLs, and fetched web pages.]

Crawling Issues  Load on web servers  Insufficient resources to crawl entire web Which subset of pages to crawl?  How to keep crawled pages “fresh”?  Detecting replicated content e.g., mirrors  Can’t crawl the web from one machine Parallelizing the crawl

Polite Crawling  Minimize load on web servers by spacing out requests to each server E.g., no more than 1 request to the same server every 10 seconds  Robot Exclusion Protocol Protocol for giving spiders (“robots”) limited access to a website

Crawl Ordering  Not enough storage or bandwidth to crawl entire web  Visit “important” pages first  Importance metrics In-degree  More important pages will have more inlinks Page Rank  To be discussed later  For now, assume it is a metric we can compute

Crawl Order  Problem: we don’t know the actual in- degree or page rank of a page until we have the entire web!  Ordering heuristics Partial in-degree Partial page rank Breadth-first search (BFS) Random Walk -- baseline

stanford.edu experiment
- 179K pages
- [Plot: overlap with the best x% of pages by in-degree vs. x% crawled under ordering metric O(u).]
- Source: Cho et al (1998)

Larger study (328M pages)
- BFS crawling brings in high-quality pages early in the crawl
- Source: Najork and Wiener (2001)

Maintaining freshness
- How often do web pages change?
- What do we mean by freshness?
- What strategy should we use to refresh pages?

How often do pages change?
- Cho et al (2000) experiment
- 270 sites visited (with permission)
  - Identified 400 sites with highest "PageRank", contacted administrators
- 720,000 pages collected
  - 3,000 pages from each site daily
  - Start at root, visit breadth-first (get new & old pages)
  - Ran only 9pm - 6am, 10 seconds between site requests

Average change interval
- [Figure: distribution of average change intervals; details not reproduced in this transcript.]
- Source: Cho et al (2000)

Modeling change  Assume changes to a web page are a sequence of random events that happen independently at a fixed average rate  Poisson process with parameter  Let X(t) be a random variable denoting the number of changes in any time interval t Pr[X(t)=k] = e -t (t) k /k! for k = 0,1,…  “Memory-less” distribution

Poisson processes  Let us compute the expected number of changes in unit time E[X(1)] =  k ke k /k! =  is therefore the average number of changes in unit time  Called the rate parameter

Time to next event  Let T be a random variable denoting the time to the next event  Verify that Pr[T>t] = e -t (t>0)  The corresponding density function is f(t) = e -t  The distribution of change intervals should follow an exponential distribution

Change intervals of pages
- [Plot: fraction of changes with a given interval vs. interval in days, for pages that change every 10 days on average, compared against the Poisson model.]
- Source: Cho et al (2000)

Change Metrics (1) - Freshness
- Freshness of page p_i at time t is
  F(p_i; t) = 1 if p_i is up-to-date at time t, 0 otherwise
- Freshness of the database S at time t is
  F(S; t) = (1/N) Σ_{i=1..N} F(p_i; t)
- (Assume "equal importance" of pages)
- [Figure: pages p_i on the web and their copies in the crawler's database.]

Change Metrics (2) - Age
- Age of page p_i at time t is
  A(p_i; t) = 0 if p_i is up-to-date at time t, t - (modification time of p_i) otherwise
- Age of the database S at time t is
  A(S; t) = (1/N) Σ_{i=1..N} A(p_i; t)
- (Assume "equal importance" of pages)
- [Figure: pages p_i on the web and their copies in the crawler's database.]

Change Metrics (3) - Delay
- Suppose the crawler visits page p at times τ_0, τ_1, ...
- Suppose page p is updated at times t_1, t_2, ..., t_k between times τ_0 and τ_1
- The delay associated with update t_i is D(t_i) = τ_1 - t_i
- The total delay associated with the changes to page p is D(p) = Σ_i D(t_i)
- At time τ_1 the delay drops back to 0
- Total crawler delay is the sum of individual page delays

Comparing the change metrics
- [Figure: F(p), A(p), and D(p) plotted over time, showing the effect of page updates and crawler refreshes.]

Resource optimization problem
- Crawler can fetch M pages in time T
- Suppose there are N pages p_1, ..., p_N with change rates λ_1, ..., λ_N
- How often should the crawler visit each page to minimize delay?
- Assume a uniform interval between visits to each page

A useful lemma
- Lemma. For a page p with update rate λ, if the interval between refreshes is τ, the expected delay during this interval is λτ^2/2.
- Proof. The expected number of changes generated between times t and t+dt is λ dt. The delay for these changes is τ - t. The total expected delay is ∫_0^τ λ(τ - t) dt = λτ^2/2.

Optimum resource allocation
- Total number of accesses in time T = M
- Suppose we allocate m_i fetches to page p_i
- Then the interval between fetches of page p_i = T/m_i
- Delay for page p_i between fetches = λ_i (T/m_i)^2 / 2
- Total delay for page p_i = m_i · λ_i (T/m_i)^2 / 2 = λ_i T^2 / (2 m_i)
- Minimize Σ_i λ_i T^2 / (2 m_i) subject to Σ_i m_i = M

Method of Lagrange Multipliers
- To maximize or minimize a function f(x_1, ..., x_n) subject to the constraint g(x_1, ..., x_n) = 0
- Introduce a new variable μ and define h = f - μg
- Solve the system of equations:
  ∂h/∂x_i = 0 for i = 1, ..., n
  g(x_1, ..., x_n) = 0
- n+1 equations in n+1 variables

Optimum refresh policy
- Applying the Lagrange multiplier method to our problem, we get m_i ∝ √λ_i; specifically m_i = M √λ_i / Σ_j √λ_j
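A worked version of this step, spelled out for reference; the intermediate algebra and the multiplier symbol μ are reconstructions rather than text from the original slides.

    % Minimize total delay subject to the fetch budget.
    h(m_1,\dots,m_N,\mu) = \sum_{i=1}^{N} \frac{\lambda_i T^2}{2 m_i}
        - \mu\Big(\sum_{i=1}^{N} m_i - M\Big)

    % Setting the partial derivatives to zero:
    \frac{\partial h}{\partial m_i} = -\frac{\lambda_i T^2}{2 m_i^2} - \mu = 0
    \quad\Longrightarrow\quad m_i = T\sqrt{\frac{\lambda_i}{-2\mu}} \propto \sqrt{\lambda_i}

    % Enforcing the constraint \sum_i m_i = M gives
    m_i = M\,\frac{\sqrt{\lambda_i}}{\sum_{j=1}^{N}\sqrt{\lambda_j}}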

Optimum refresh policy  To minimize delay, we must allocate to each page a number of visits proportional to the square root of its average rate of change  Very different answer to minimze the freshness and age metrics; see references.

Estimating the rate parameter
- Simple estimator
  - Visit the page N times in a time interval T
  - Suppose we see that the page has changed X times
  - Estimate λ = X/T
- What is the problem with this estimator?

A better estimator  The page may have actually changed more than once between visits We therefore tend to understimate  A better estimator is For N=10, X=3, we get:  = 0.3 with the simple estimator  = 0.34 with the improved estimator.  Details in Cho and Garcia-Molina (2000)

Detecting duplicate pages
- Duplicates waste crawler resources
- We can quickly test for duplicates by computing a fingerprint for every page (e.g., MD5 hash)
  - Make the fingerprint long enough to make collisions very unlikely
- Problem: pages may differ only in ads or other formatting changes
  - Also an issue in counting changes for rate estimation
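A minimal sketch of exact-duplicate detection by fingerprinting; hashlib.md5 is used to match the slide's MD5 suggestion, and the function names are illustrative.

    import hashlib

    seen_fingerprints = set()

    def is_exact_duplicate(page_bytes):
        """Fingerprint the page; identical content always maps to the same digest."""
        fp = hashlib.md5(page_bytes).hexdigest()   # 128-bit fingerprint
        if fp in seen_fingerprints:
            return True
        seen_fingerprints.add(fp)
        return False

    print(is_exact_duplicate(b"<html>hello</html>"))  # False (first time seen)
    print(is_exact_duplicate(b"<html>hello</html>"))  # True  (exact duplicate)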

Detecting approximate duplicates
- Can compare pages using known distance metrics, e.g., edit distance
  - Takes far too long when we have millions of pages!
- One solution: create a sketch for each page that is much smaller than the page
- Assume we have converted each page into a sequence of tokens
  - Eliminate punctuation, HTML markup, etc.

Shingling  Given document D, a w-shingle is a contiguous subsequence of w tokens  The w-shingling S(D,w) of D, is the set of all w-shingles in D  e.g., D=(a,rose,is,a,rose,is,a,rose)  S(D,W) = {(a,rose,is,a),(rose,is,a,rose), (is,a,rose,is)}  Can also define S(D,w) to be the bag of all shingles in D We’ll use sets in the lecture to keep it simple

Resemblance  Let us define the resemblance of docs A and B as:  In general, 0  r w (A,B)  1 Note r w (A,A) = 1 But r w (A,B)=1 does not mean A and B are identical!  What is a good value for w?

Sketches  Set of all shingles is large Bigger than the original document  Can we create a document sketch by sampling only a few shingles?  Requirement Sketch resemblance should be a good estimate of document resemblance  Notation: Omit w from r w and S(A,w) r(A,B)=r w (A,B) and S(A) = S(A,w)

Choosing a sketch  Random sampling does not work! E.g., suppose we have identical documents A & B each with n shingles M(A) = set of s shingles from A, chosen uniformly at random; similarly M(B)  For s=1: E[|M(A)  M(B)|] = 1/n But r(A,B) = 1 So the sketch overlap is an underestimate  Verify that this is true for any value of s

Choosing a sketch  Better method: M(A) is “smallest” shingle in A Assume set of shingles is totally ordered Also define M(AB) = min(M(A)  M(B)) = smallest shingle across A and B Define r’(A,B) = |M(AB)  M(A)  M(B)|  For identical documents A & B r’(A,B) = 1 = r(A,B)

Example  |S(A)|=|S(B)|=100, |S(A)  S(B)|= 80  |S(A)  S(B)| = = 120  r(A,B) = 80/120 = 2/3  Suppose M(AB) = x x could be any one of 120 shingles  Pr[x  M(A)  M(B)] = Pr[x  S(A)  S(B)] = 80/120 = 2/3  E[r’(A,B)] = E[M(AB)  M(A)  M(B)] = 1(2/3) + 0(1/3) = 2/3  In general, for any documents A and B, it’s easy to generalize: E[r’(A,B)] = r(A,B)

Larger sketches  With a sketch of size 1, there is still a high probability of error We can improve the estimate by picking a larger sketch of size s  For set W of shingles, let MIN s (W) = set of s smallest shingles in W Assume documents have at least s shingles  Define M(A) = MIN s (S(A)) M(AB) = MIN s (M(A)  M(B)) r’(A,B) = |M(AB)  M(A)  M(B)| / s

Larger sketches  Easy to show that E[r’(A,B)] = r(A,B) Similar to the case s = 1 Repeat argument for each element of M(AB)  By increasing sample size (s) we can make it very unlikely r’(A,B) is significantly different from r(A,B) shingles is sufficient in practice

Corner case  Some docs have fewer than s shingles MIN s (W) = set of smallest s elements of W, if |W| s W, otherwise  M(A) = MIN s (S(A))  M(AB) = MIN s (M(A)  M(B))  r’(A,B) = |M(AB)  M(A)  M(B)| / |M(AB)|

Random permutations  One problem Test is biased towards “smaller” shingles May skew results  Solution Let  be the set of all shingles Pick a permutation : uniformly at random Define M(A) = MIN s ((S(A)))

Random permutation
- [Figure illustrating the permutation π applied to the shingle space Ω.]

Implementation  Size of each shingle is still large e.g., each shingle = 7 English words = bytes 100 shingles = 4K-5K  Compute a fingerprint f for each shingle (e.g., Rabin fingerprint) 40 bits is usually enough to keep estimates reasonably accurate Fingerprint also eliminates need for random permutation

Finding all near-duplicates  Naïve implementation makes O(N^2) sketch comparisons Suppose N=100 million  Divide-Compute-Merge (DCM) Divide data into batches that fit in memory Operate on individual batch and write out partial results in sorted order Merge partial results  Generalization of external sorting

DCM Steps
- Start: doc1: s11, s12, ..., s1k; doc2: s21, s22, ..., s2k; ...
- Invert: emit (shingle, docid) pairs: (s11, doc1), (s12, doc1), ..., (s1k, doc1), (s21, doc2), ...
- Sort on shingle_id: (t1, doc11), (t1, doc12), ..., (t2, doc21), (t2, doc22), ...
- Invert and pair: for docs sharing a shingle, emit (doc11, doc12, 1), (doc11, doc13, 1), ..., (doc21, doc22, 1), ...
- Sort on <docid1, docid2>: (doc11, doc12, 1), ..., (doc11, doc13, 1), ...
- Merge: sum the counts per pair: (doc11, doc12, 2), (doc11, doc13, 10), ...

Finding all near-duplicates
1. Calculate a sketch for each document
2. For each document, write out the pairs <shingle_id, docid>
3. Sort by shingle_id (DCM)
4. In a sequential scan, generate triplets of the form <docid1, docid2, 1> for pairs of docs that share a shingle (DCM)
5. Sort on <docid1, docid2> (DCM)
6. Merge the triplets with common docids to generate triplets of the form <docid1, docid2, count> (DCM)
7. Output document pairs whose resemblance exceeds the threshold
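An in-memory sketch of steps 2-7, for illustration only: a crawl-scale run would use the DCM/external-sort pipeline above rather than Python dicts. The input is assumed to map each docid to its set of shingle fingerprints, the 0.9 threshold is an arbitrary example, and resemblance is estimated here as the Jaccard coefficient of the two sketches, a simplification of the r' estimator from the earlier slides.

    from collections import defaultdict
    from itertools import combinations

    def near_duplicate_pairs(sketches, threshold=0.9):
        """sketches: dict docid -> set of shingle fingerprints (the document sketch).
        Returns pairs of docids whose estimated resemblance exceeds the threshold."""
        # Steps 2-3: invert to shingle -> list of docids containing it.
        postings = defaultdict(list)
        for docid, shingle_ids in sketches.items():
            for sid in shingle_ids:
                postings[sid].append(docid)

        # Steps 4-6: count shared shingles per document pair.
        shared = defaultdict(int)
        for docids in postings.values():
            for d1, d2 in combinations(sorted(docids), 2):
                shared[(d1, d2)] += 1

        # Step 7: estimate resemblance from sketch overlap and apply the threshold.
        results = []
        for (d1, d2), count in shared.items():
            union = len(sketches[d1] | sketches[d2])
            if union and count / union > threshold:
                results.append((d1, d2))
        return results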

Some optimizations
- "Invert and Pair" is the most expensive step
- We can speed it up by eliminating very common shingles
  - Common headers, footers, etc.
  - Do it as a preprocessing step
- Also, eliminate exact duplicates up front

Mirrors  Replication at a higher level Entire collections of documents Java APIs, perl manuals,…  Use document comparison as a building block