Download presentation
Presentation is loading. Please wait.
Published byKelly White Modified over 8 years ago
1
1 Crawler (AKA Spider) (AKA Robot) (AKA Bot)
2
What is a Web Crawler? A system for bulk downloading of Web pages Used for: –Creating corpus of search engine –Web archiving –Data mining of Web for statistical properties (e.g., Attributor monitors for copyright infringement) 2
3
Push and Pull Models Push Model: –Web content providers push content of interest to aggregators Pull Model: –Aggregators scour Web for new / updated information Crawlers use the pull model –Low overhead for content providers –Easier for new aggregators 3
4
4 Basic Crawler Start with a seed set of URLs Download all pages with these URLs Extract URLs from downloaded pages Repeat
5
5 Basic Crawler Frontier removeURL( ) getPage( ) getLinksFromPage( ) Will this terminate?
6
Reminder: Downloading Web Pages 6 Suppose our robot wants to download: http://www.undergraduate. technion.ac.il/ catalog/facs009.html
7
7 DNS www.undergraduate.technion.ac.il 132.68.238.21 Web Server File System HTTP Request HTTP Response 132.68.238.21 catalog/facs009.html Got it! Now, lets parse it and continue on…
8
Challenges Scale –huge Web, changing constantly Content selection tradeoffs –Cannot download entire Web and be constantly up-to-date –Balance coverage, freshness with per-site limitations 8
9
Challenges Social obligations –Do not over burden web sites being crawled –(Avoid mistakenly performing denial of service attacks) Adversaries –Content provides way try to misrepresent their content (e.g., cloaking) 9
10
10 Goals of a Crawler Download a large set of pages Refresh downloaded pages Find new pages Make sure to have as many “good” pages as possible
11
Structure of the Web http://www9.org/w9cdrom/160/160.html 11 What does this mean for seed choice?
12
Topics Crawler architecture Traversal order Duplicate Detection 12
13
13 Crawler Architecture and Issues
14
14 Async DNS Lookup Text Indexing, Analysis Hyperlink Extractor, Normalizer Is Page Known? URL Approval guard Is URL Visited? Pool of work URLs Load Monitor Queue DNS Cache Page Index Crawl Metadata Wait For DNS Wait Http Socket Http Send, Receive Queue
15
15 Delays Crawler should download fetch many pages quickly. Delays follow from: –Resolving IP address from URL using a DNS –Connecting to the server and sending request –Waiting for response
16
16 Reducing Time for DNS lookup Cache DNS addresses Perform pre-fetching –Look up address immediately when URL is inserted into pool, so it will be ready when URL will be downloaded
17
17 Async DNS Lookup Text Indexing, Analysis Hyperlink Extractor, Normalizer Is Page Known? URL Approval guard Is URL Visited? Pool of work URLs Load Monitor Queue DNS Cache Page Index Crawl Metadata Wait For DNS Wait Http Socket Http Send, Receive Queue
18
18 How and Where should the Crawler Crawl? When a crawler crawls a site, it uses the site’s resources: –The web server needs to find the file in file system –The web server needs to send the file in the network If a crawler asks for many of the pages and at a high speed it may –crash the sites web server or –be banned from the site (how?) Do not ask for too many pages from the same site without waiting enough time in between!
19
19 Directing Crawlers Sometimes people want to direct automatic crawling over their resources “Do not visit my files!” “Do not index my files!” “Only my crawler may visit my files!” “Please, follow my useful links…” “Please update your data after X time…” Solution: publish instructions in some known format Crawlers are expected to follow these instructions
20
20 Robots Exclusion Protocol A method that allows Web servers to indicate which of their resources should not be visited by crawlers Put the file robots.txt at the root directory of the server –http://www.cnn.com/robots.txt –http://www.w3.org/robots.txt –http://www.ynet.co.il/robots.txt –http://www.google.com/robots.txt
21
21 robots.txt Format A robots.txt file consists of several records Each record consists of a set of some crawler id’s and a set of URLs these crawlers are not allowed to visit –User-agent lines: Names of crawlers –Disallowed lines: Which URLs are not to be visited by these crawlers (agents)?
22
Examples 22 User-agent: * Disallow: / User-agent: * Disallow: User-agent: * Disallow: /cgi-bin/ Disallow: /tmp/ Disallow: /junk/ User-agent: BadBot Disallow: / User-agent: Google Disallow: User-agent: * Disallow: /
23
23 Robots Meta Tag A Web-page author can also publish directions for crawlers These are expressed by the meta tag with name robots, inside the HTML file Format: Options: –index or noindex: index or do not index this file –follow or nofollow: follow or do not follow the links of this file
24
24 Robots Meta Tag... … An Example: How should a crawler act when it visits this page?
25
25 Revisit Meta Tag Web page authors may want Web applications to have an up-to-date copy of their page Using the revisit meta tag, page authors can give crawlers some idea of how often the page is being updated For example:
26
26 Stronger Restrictions It is possible for a (non-polite) crawler to ignore the restrictions imposed by robots.txt and robots meta directions Therefore, if one wants to ensure that automatic robots do not visit her resources, she has to use other mechanisms –For example, password protections
27
Interesting Questions Can you be sued for copyright infringement if you follow the robots.txt protocol? Can you be sued for copyright infringement if you don’t follow the robots.txt protocol? Can the site owner determine who is actually crawling their site? Can the site owner protect against unauthorized access by specific robots? 27
28
28 Async DNS Lookup Text Indexing, Analysis Hyperlink Extractor, Normalizer Is Page Known? URL Approval guard Is URL Visited? Pool of work URLs Load Monitor Queue DNS Cache Page Index Crawl Metadata Wait For DNS Wait Http Socket Http Send, Receive Queue
29
29 Duplication There are many “near duplicate” pages on the web, e.g., mirror sites Different host names can resolve to the same IP address Different URLs can resolve to the same physical address Challenge: Avoid repeat downloads
30
30 Async DNS Lookup Text Indexing, Analysis Hyperlink Extractor, Normalizer Is Page Known? URL Approval guard Is URL Visited? Pool of work URLs Load Monitor Queue DNS Cache Page Index Crawl Metadata Wait For DNS Wait Http Socket Http Send, Receive Queue
31
31 Multiple Crawlers Normally, multiple crawlers must work together to download pages URLs are divided among crawlers –if there are k crawlers, compute a hash of host name (with value 1,…,k) and assign pages appropriately
32
32 Ordering the URL Queues Politeness: do not hit a web server too frequently Freshness: crawl some pages more often than others –E.g., pages (such as News sites) whose content changes often These goals may conflict each other.
33
33 Async DNS Lookup Text Indexing, Analysis Hyperlink Extractor, Normalizer Is Page Known? URL Approval guard Is URL Visited? Pool of work URLs Load Monitor Queue DNS Cache Page Index Crawl Metadata Wait For DNS Wait Http Socket Http Send, Receive Queue
34
34 Traversal Order
35
35 Choosing Traversal Order We are constantly inserting links into the list (frontier) and removing links from the list In which order should this be done? Why does it matter?
36
Importance Metrics Efficient Crawling through Url Ordering, Cho, Garcia-Molina, Page, 1998. Similarity to driving query Backlink count Pagerank Location metric Each of these imply an importance Imp(p) for pages, and we want to crawl best pages 36
37
Measuring Success: Crawler Models (1) Crawl and Stop: Starts at page P 0, crawls k pages p 1,…,p k and stops. Suppose the “best” k pages on the Web are r 1,…,r k Suppose that | {p i s.t. Imp(p i ) >= Imp(r k ) } | = M Performance: M / k What is ideal performance? How would a random crawler perform? 37
38
Measuring Success: Crawler Models (2) Crawl and Stop with Threshold: Crawler visits k pages p 1,…,p k and stops. We are given an importance target G Suppose that total number of pages in Web with threshold >= G is H Suppose that | {p i s.t. Imp(p i ) >= G } | = M Performance: M / H What is ideal performance? How would a random crawler perform? 38
39
Measuring Success: Crawler Models (3) Limited Buffer Crawl: Crawler can keep B pages in its buffer. After buffer fills up, must flush some out. –Can visit T pages, where T is the number of pages on the web Performance: Measured by percentage of pages in the buffer at the end which are at least as good as G 39
40
Ordering Metrics The crawler stores a queue of URLs, and chooses the best from this queue to download and traverse An ordering metric O(p) is used to define the ordering: we will traverse page with highest O(p) Should we choose O(p) = Imp(p)? Can we choose O(p) = Imp(p)? 40
41
41 Example Ordering Metrics: BFS, DFS DFS (Depth First Search = חיפוש לעומק) BFS (Breadth First Search = חיפוש לרוחב) Advantages? Disadvantages?
42
42 BFS versus DFS Which is better? What advantages do each have? –DFS: locality of requests (save time on DNS) –BFS: get shallow pages first (usually more important), pages are distributed among many sites, load distribution
43
43 Example Ordering Metrics: Estimated Importance Each URL has a ranking that determines how important it is Always remove the highest ranking URL first How is the ranking defined? –By estimating importance measures. How?
44
44 Similarity to Driving Query Compile a list of words Q that describes a topic of interest Define the importance of a page P by its textual similarity to Q by using a formula that combines –The number of appearances of words from Q in P –For each word of Q, its frequency throughout the web (why is this important?) Problem: We must decide if a link is important without (1) seeing the page and (2) knowing how rare a word is Estimated Importance: Use an estimate, e.g., context of the link, or entire page containing the link
45
45 Backlink Count, PageRank The importance of a page P is proportional to the number of pages with a link to P or to its PageRank As before, need to estimate this amount Estimated importance computed by evaluating these values for the portion of the Web already downloaded
46
46 Location Metric The importance of P is a function of its URL Example: –Words appearing on URL (e.g., edu or ac) –Number of “/” on the URL We can compute this precisely and use it as an ordering metric, if desired (no need for estimation)
47
Bottom Line: What Ordering Metric to Use? In general, they have shown that PageRank outperforms BFS and estimated backlinks if our importance metric is PageRank or Backlinks Estimated similarity to query (combined with PageRank for cases in which the estimated similarity is 0) works well for finding pages that are similar to a query For more details, see the paper 47
48
Duplicate Detection 48
49
49 How to Avoid Duplicates Store normalized versions of URLs Store hash values of the pages. Compare each new page to hashes of previous pages –Example: use MD5 hashing function –Does not detect near duplicates
50
50 Detecting approximate duplicates Can compare the pages using known distance metrics e.g., edit distance –Takes far too long when we have millions of pages! One solution: create a sketch for each page that is much smaller than the page Assume: we have converted page into a sequence of tokens –Eliminate punctuation, HTML markup, etc
51
51 Shingling Given document D, a w-shingle is a contiguous subsequence of w tokens The w-shingling S(D,w) of D, is the set of all w- shingles in D Example: D=(a rose is a rose is a rose) –a rose is a – rose is a rose – is a rose is – a rose is a – rose is a rose –S(D,4) = {(a,rose,is,a),(rose,is,a,rose), (is,a,rose,is)}
52
52 Resemblance The resemblance of docs A and B is defined as: In general, 0 r w (A,B) 1 –Note r w (A,A) = 1 –But r w (A,B)=1 does not mean A and B are identical! What is a good value for w? Remember this?
53
53 Sketches Set of all shingles is large –Bigger than the original document We create a document sketch by sampling only a few shingles Requirement –Sketch resemblance should be a good estimate of document resemblance Notation: From now on, w will be a fixed value, so we omit w from r w and S(A,w) –r(A,B)=r w (A,B) and S(A) = S(A,w)
54
54 Choosing a sketch Random sampling does not work! –Suppose we have identical documents A, B each with n shingles –Let m A be a shingle from A, chosen uniformly at random; similarly m B For k=1: E[{m A } {m B }|] = 1/n –But r(A,B) = 1 –So the sketch overlap is an underestimate
55
55 Choosing a sketch Assume the set of all possible shingles is totally ordered Define: –m A is the “smallest” shingle in A –m B is the “smallest” shingle in B –r’(A,B) = | {m A } {m B } | For identical documents A & B –r’(A,B) = 1 = r(A,B)
56
Problems With a sketch of size 1, there is still a high probability of error –Especially if we choose minimum shingle, e.g., in alphabetical order Size of each shingle is still large –e.g., if w = 7, about 40-50 bytes –100 shingles = 4K-5K 56
57
57 Solution We can improve the estimate by picking more than one shingle Idea: –Choose many different random orderings on the set of all shingles. –For each ordering, take minimal shingles, as explained before How do we choose random orderings of shingles?
58
58 Solution (cont) Compute K fingerprint(s) for each shingle –For example, use MD5 hashing function with different parameters For each choices of parameters to hashing function, choose lexicographically first shingle in each document Near duplication likelihood is computed by checking how many of the K smallest shingles are common to both documents
59
59 Solution (cont) Observe that this gives us: –An efficient way to compute random orders of shingles –Significant reduction in size of shingles stored (40 bits is usually enough to keep estimates reasonably accurate)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.