
1 Crawler (AKA Spider) (AKA Robot) (AKA Bot)

What is a Web Crawler? A system for bulk downloading of Web pages Used for: –Creating the corpus of a search engine –Web archiving –Data mining of the Web for statistical properties (e.g., Attributor monitors the Web for copyright infringement) 2

Push and Pull Models Push Model: –Web content providers push content of interest to aggregators Pull Model: –Aggregators scour the Web for new / updated information Crawlers use the pull model –Low overhead for content providers –Easier for new aggregators 3

4 Basic Crawler Start with a seed set of URLs Download all pages with these URLs Extract URLs from downloaded pages Repeat

5 Basic Crawler [Diagram: a frontier queue of URLs feeds a loop of removeURL( ) → getPage( ) → getLinksFromPage( ), with newly extracted links added back to the frontier.] Will this terminate?
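To make the loop concrete, here is a minimal sketch of the basic crawler in Python (the names such as get_page, get_links_from_page and frontier are illustrative, not from the slides). The max_pages cap and the visited set make this sketch terminate; on the real Web the frontier essentially never empties.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def get_page(url):
    # Fetch the raw HTML of a page (no error handling, politeness, or robots.txt here).
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def get_links_from_page(base_url, html):
    # Very crude link extraction; a real crawler would use an HTML parser and normalize URLs.
    return {urljoin(base_url, href) for href in re.findall(r'href="([^"#]+)"', html)}

def basic_crawler(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # pool of work URLs
    visited = set(seed_urls)             # "Is URL visited?" check
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()         # removeURL()
        try:
            html = get_page(url)         # getPage()
        except Exception:
            continue
        pages[url] = html
        for link in get_links_from_page(url, html):   # getLinksFromPage()
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return pages
```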

Reminder: Downloading Web Pages 6 Suppose our robot wants to download: technion.ac.il/catalog/facs009.html

7 [Diagram: the crawler resolves the host name via DNS, opens a connection to the Web server, and sends an HTTP request for catalog/facs009.html; the server reads the file from its file system and returns it in the HTTP response.] Got it! Now, let's parse it and continue on…

Challenges Scale –huge Web, changing constantly Content selection tradeoffs –Cannot download the entire Web and be constantly up-to-date –Balance coverage and freshness with per-site limitations 8

Challenges Social obligations –Do not overburden the web sites being crawled –(Avoid mistakenly performing denial of service attacks) Adversaries –Content providers may try to misrepresent their content (e.g., cloaking) 9

10 Goals of a Crawler Download a large set of pages Refresh downloaded pages Find new pages Make sure to have as many “good” pages as possible

Structure of the Web 11 [Diagram of the Web's link structure.] What does this mean for seed choice?

Topics Crawler architecture Traversal order Duplicate Detection 12

13 Crawler Architecture and Issues

14 [Crawler architecture diagram. Components: pool of work URLs; URL approval guard with an "Is URL visited?" check; wait for async DNS lookup (backed by a DNS cache); wait for HTTP socket; HTTP send/receive; hyperlink extractor and normalizer with an "Is page known?" check against the page index; text indexing and analysis; load monitor, per-server queues, and crawl metadata.]

15 Delays The crawler should fetch many pages quickly. Delays follow from: –Resolving the IP address of the URL's host using DNS –Connecting to the server and sending the request –Waiting for the response

16 Reducing Time for DNS Lookup Cache DNS addresses Perform pre-fetching –Look up the address immediately when a URL is inserted into the pool, so it is ready by the time the URL is downloaded
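A minimal sketch of both ideas, assuming a standard Python environment (the class and method names are illustrative): a dictionary acts as the DNS cache, and a small thread pool resolves host names as soon as URLs enter the pool of work URLs.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

class DnsCache:
    def __init__(self, workers=8):
        self.cache = {}                                  # host -> IP address
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def prefetch(self, url):
        # Called when a URL is inserted into the pool of work URLs:
        # start resolving its host in the background so the answer is ready later.
        host = urlparse(url).hostname
        if host and host not in self.cache:
            self.pool.submit(self._resolve, host)

    def _resolve(self, host):
        try:
            self.cache[host] = socket.gethostbyname(host)
        except OSError:
            self.cache[host] = None                      # resolution failed

    def lookup(self, url):
        # Used at download time; falls back to a blocking lookup on a cache miss.
        host = urlparse(url).hostname
        if host not in self.cache:
            self._resolve(host)
        return self.cache.get(host)
```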

17 [Crawler architecture diagram repeated from slide 14.]

18 How and Where Should the Crawler Crawl? When a crawler crawls a site, it uses the site's resources: –The web server needs to find the file in its file system –The web server needs to send the file over the network If a crawler asks for many of the pages at a high speed, it may –crash the site's web server or –be banned from the site (how?) Do not ask for too many pages from the same site without waiting enough time in between!
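A sketch of this politeness rule under simple assumptions (a single-threaded crawler and an illustrative minimum delay of a few seconds per host; the names are not from the slides): before each request, the crawler checks when it last contacted that host and sleeps if needed.

```python
import time
from urllib.parse import urlparse

MIN_DELAY_PER_HOST = 5.0          # assumed politeness interval, in seconds
last_access = {}                  # host -> time of the previous request

def polite_wait(url):
    # Sleep just long enough so that requests to the same host are
    # at least MIN_DELAY_PER_HOST seconds apart.
    host = urlparse(url).hostname
    now = time.monotonic()
    previous = last_access.get(host)
    if previous is not None:
        wait = MIN_DELAY_PER_HOST - (now - previous)
        if wait > 0:
            time.sleep(wait)
    last_access[host] = time.monotonic()

# Usage: call polite_wait(url) immediately before fetching url.
```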

19 Directing Crawlers Sometimes people want to direct automatic crawling over their resources “Do not visit my files!” “Do not index my files!” “Only my crawler may visit my files!” “Please, follow my useful links…” “Please update your data after X time…” Solution: publish instructions in some known format Crawlers are expected to follow these instructions

20 Robots Exclusion Protocol A method that allows Web servers to indicate which of their resources should not be visited by crawlers Put the file robots.txt at the root directory of the server, so that it is reachable at the path /robots.txt

21 robots.txt Format A robots.txt file consists of several records Each record consists of a set of crawler IDs and a set of URLs these crawlers are not allowed to visit –User-agent lines: names of crawlers –Disallow lines: which URLs are not to be visited by these crawlers (agents)

Examples 22

Exclude all crawlers from the entire site:
User-agent: *
Disallow: /

Allow all crawlers full access:
User-agent: *
Disallow:

Exclude all crawlers from a few directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

Exclude a single crawler:
User-agent: BadBot
Disallow: /

Allow only one crawler:
User-agent: Google
Disallow:

User-agent: *
Disallow: /
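Python's standard library ships a robots.txt parser; the following sketch (the site URL is a placeholder) shows how a polite crawler could check a URL against rules like those above before fetching it.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder site
rp.read()                                         # fetch and parse robots.txt

# Check whether a particular crawler may fetch a particular URL.
print(rp.can_fetch("MyCrawler", "http://www.example.com/tmp/page.html"))
print(rp.can_fetch("*", "http://www.example.com/index.html"))
```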

23 Robots Meta Tag A Web-page author can also publish directions for crawlers These are expressed by the meta tag with name robots, inside the HTML file Format: <meta name="robots" content="…"> Options (given in the content attribute): –index or noindex: index or do not index this file –follow or nofollow: follow or do not follow the links of this file

24 Robots Meta Tag An example: an HTML page whose head contains a robots meta tag with some combination of the options above How should a crawler act when it visits this page?

25 Revisit Meta Tag Web page authors may want Web applications to have an up-to-date copy of their page Using the revisit-after meta tag, page authors can give crawlers some idea of how often the page is updated For example: <meta name="revisit-after" content="7 days">

26 Stronger Restrictions It is possible for a (non-polite) crawler to ignore the restrictions imposed by robots.txt and robots meta directives Therefore, if one wants to ensure that automatic robots do not visit her resources, she has to use other mechanisms –For example, password protection

Interesting Questions Can you be sued for copyright infringement if you follow the robots.txt protocol? Can you be sued for copyright infringement if you don’t follow the robots.txt protocol? Can the site owner determine who is actually crawling their site? Can the site owner protect against unauthorized access by specific robots? 27

28 [Crawler architecture diagram repeated from slide 14.]

29 Duplication There are many “near duplicate” pages on the web, e.g., mirror sites Different host names can resolve to the same IP address Different URLs can resolve to the same physical address Challenge: Avoid repeat downloads

30 [Crawler architecture diagram repeated from slide 14.]

31 Multiple Crawlers Normally, multiple crawlers must work together to download pages URLs are divided among the crawlers –if there are k crawlers, compute a hash of the host name (with values 1,…,k) and assign pages accordingly
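A sketch of host-based partitioning, assuming k crawler processes numbered 0,…,k-1 (the slides number them 1,…,k; the hash and indexing here are illustrative). Hashing the host name rather than the full URL keeps all pages of a site at one crawler, which also makes politeness easier to enforce.

```python
import hashlib
from urllib.parse import urlparse

def assign_crawler(url, k):
    # Map a URL to one of k crawlers by hashing its host name,
    # so every page of a given site is handled by the same crawler.
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % k

# Usage: route each discovered URL to the queue of crawler assign_crawler(url, k).
urls = ["http://www.example.com/a.html", "http://www.example.org/b.html"]
for u in urls:
    print(u, "-> crawler", assign_crawler(u, 4))
```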

32 Ordering the URL Queues Politeness: do not hit a web server too frequently Freshness: crawl some pages more often than others –E.g., pages (such as news sites) whose content changes often These goals may conflict with each other.

33 [Crawler architecture diagram repeated from slide 14.]

34 Traversal Order

35 Choosing Traversal Order We are constantly inserting links into the list (frontier) and removing links from the list In which order should this be done? Why does it matter?

Importance Metrics Efficient Crawling Through URL Ordering, Cho, Garcia-Molina, Page, 1998 Similarity to driving query Backlink count PageRank Location metric Each of these implies an importance Imp(p) for pages, and we want to crawl the best pages 36

Measuring Success: Crawler Models (1) Crawl and Stop: The crawler starts at page p_0, crawls k pages p_1,…,p_k and stops. Suppose the "best" k pages on the Web are r_1,…,r_k Suppose that |{ p_i s.t. Imp(p_i) >= Imp(r_k) }| = M Performance: M / k What is ideal performance? How would a random crawler perform? 37

Measuring Success: Crawler Models (2) Crawl and Stop with Threshold: The crawler visits k pages p_1,…,p_k and stops. We are given an importance target G Suppose that the total number of pages on the Web with importance >= G is H Suppose that |{ p_i s.t. Imp(p_i) >= G }| = M Performance: M / H What is ideal performance? How would a random crawler perform? 38
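A small sketch of both performance measures, assuming we can evaluate Imp(p) exactly for every page involved (the function names are illustrative).

```python
def crawl_and_stop_performance(crawled_imps, best_k_imps):
    # crawled_imps: Imp values of the k pages the crawler actually visited.
    # best_k_imps: Imp values of the k best pages on the Web, sorted descending
    # (both lists are assumed to have the same length k).
    k = len(crawled_imps)
    threshold = best_k_imps[k - 1]              # Imp(r_k)
    m = sum(1 for imp in crawled_imps if imp >= threshold)
    return m / k                                # an ideal crawler scores 1

def crawl_and_stop_with_threshold_performance(crawled_imps, all_imps, target):
    # target: importance target G; all_imps: Imp values of every page on the Web.
    h = sum(1 for imp in all_imps if imp >= target)     # "hot" pages on the Web
    m = sum(1 for imp in crawled_imps if imp >= target) # hot pages actually crawled
    return m / h if h else 0.0
```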

Measuring Success: Crawler Models (3) Limited Buffer Crawl: The crawler can keep B pages in its buffer. After the buffer fills up, it must flush some pages out. –It can visit T pages, where T is the number of pages on the Web Performance: measured by the percentage of pages in the buffer at the end that are at least as good as G 39

Ordering Metrics The crawler stores a queue of URLs, and chooses the best URL from this queue to download and traverse An ordering metric O(p) is used to define the ordering: we will traverse the page with the highest O(p) Should we choose O(p) = Imp(p)? Can we choose O(p) = Imp(p)? 40
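A sketch of a frontier ordered by an ordering metric O(p), assuming a score can be estimated for every queued URL (the class and method names are illustrative). heapq is a min-heap, so scores are negated to pop the highest-scoring URL first.

```python
import heapq
import itertools

class Frontier:
    def __init__(self):
        self.heap = []                       # entries: (-score, tie_breaker, url)
        self.counter = itertools.count()     # breaks ties between equal scores

    def insert(self, url, score):
        # score is the ordering metric O(p) estimated for this URL.
        heapq.heappush(self.heap, (-score, next(self.counter), url))

    def remove_best(self):
        # Return the queued URL with the highest O(p).
        _, _, url = heapq.heappop(self.heap)
        return url

# Usage sketch:
frontier = Frontier()
frontier.insert("http://www.example.com/", score=0.9)
frontier.insert("http://www.example.com/deep/page.html", score=0.2)
print(frontier.remove_best())    # the higher-scoring seed page comes out first
```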

41 Example Ordering Metrics: BFS, DFS DFS (Depth-First Search) BFS (Breadth-First Search) Advantages? Disadvantages?

42 BFS versus DFS Which is better? What advantages does each have? –DFS: locality of requests (saves time on DNS) –BFS: gets shallow pages first (usually more important), pages are distributed among many sites, load distribution
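In terms of the basic crawler sketched earlier, the only difference between the two traversals is which end of the frontier the next URL is taken from; a minimal illustrative sketch:

```python
from collections import deque

frontier = deque()            # holds discovered-but-not-yet-fetched URLs

def next_url_bfs():
    # BFS: treat the frontier as a FIFO queue, fetching shallow pages first.
    return frontier.popleft()

def next_url_dfs():
    # DFS: treat the frontier as a LIFO stack, staying on the current site longer
    # (better locality of requests, so DNS answers and connections are reused).
    return frontier.pop()
```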

43 Example Ordering Metrics: Estimated Importance Each URL has a ranking that determines how important it is Always remove the highest ranking URL first How is the ranking defined? –By estimating importance measures. How?

44 Similarity to Driving Query Compile a list of words Q that describes a topic of interest Define the importance of a page P by its textual similarity to Q, using a formula that combines –The number of appearances of words from Q in P –For each word of Q, its frequency throughout the Web (why is this important?) Problem: we must decide if a link is important without (1) seeing the page and (2) knowing how rare a word is Estimated importance: use an estimate, e.g., the context of the link, or the entire page containing the link

45 Backlink Count, PageRank The importance of a page P is proportional to the number of pages with a link to P or to its PageRank As before, need to estimate this amount Estimated importance computed by evaluating these values for the portion of the Web already downloaded

46 Location Metric The importance of P is a function of its URL Examples: –Words appearing in the URL (e.g., edu or ac) –Number of "/" characters in the URL We can compute this precisely and use it as an ordering metric, if desired (no need for estimation)
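A sketch of a simple location metric along these lines (the particular weights and domain suffixes are illustrative assumptions): it rewards academic hosts and penalizes deep URLs, and it can be computed from the URL alone, with no estimation needed.

```python
from urllib.parse import urlparse

def location_metric(url):
    # Higher is better; computed purely from the URL, no page download needed.
    parsed = urlparse(url)
    score = 0.0
    host = parsed.hostname or ""
    if host.endswith(".edu") or ".ac." in host or host.endswith(".ac.il"):
        score += 1.0                          # academic sites assumed more important
    depth = parsed.path.count("/")            # number of "/" in the path
    score -= 0.1 * depth                      # shallower pages assumed more important
    return score

print(location_metric("http://www.technion.ac.il/catalog/facs009.html"))
print(location_metric("http://www.example.com/a/b/c/d/e.html"))
```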

Bottom Line: What Ordering Metric to Use? In general, Cho et al. have shown that PageRank outperforms BFS and estimated backlink count if our importance metric is PageRank or backlink count Estimated similarity to the query (combined with PageRank for cases in which the estimated similarity is 0) works well for finding pages that are similar to a query For more details, see the paper 47

Duplicate Detection 48

49 How to Avoid Duplicates Store normalized versions of URLs Store hash values of the pages. Compare the hash of each new page to the hashes of previous pages –Example: use the MD5 hash function –Does not detect near duplicates
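A sketch of both ideas for exact-duplicate detection (the normalization rules shown are a minimal illustrative subset; real crawlers apply many more, and near duplicates still slip through).

```python
import hashlib
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    # A few illustrative normalizations: lowercase scheme and host,
    # drop the fragment, remove a trailing slash on the path.
    p = urlparse(url)
    path = p.path.rstrip("/") or "/"
    return urlunparse((p.scheme.lower(), p.netloc.lower(), path, p.params, p.query, ""))

seen_urls = set()
seen_page_hashes = set()

def is_duplicate(url, page_content):
    # Returns True if we already saw this URL (after normalization)
    # or a byte-identical page; near duplicates are NOT detected here.
    norm = normalize_url(url)
    digest = hashlib.md5(page_content.encode("utf-8")).hexdigest()
    if norm in seen_urls or digest in seen_page_hashes:
        return True
    seen_urls.add(norm)
    seen_page_hashes.add(digest)
    return False
```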

50 Detecting Approximate Duplicates We can compare the pages using known distance metrics, e.g., edit distance –Takes far too long when we have millions of pages! One solution: create a sketch for each page that is much smaller than the page Assume: we have converted the page into a sequence of tokens –Eliminate punctuation, HTML markup, etc.

51 Shingling Given document D, a w-shingle is a contiguous subsequence of w tokens The w-shingling S(D,w) of D is the set of all w-shingles in D Example: D = (a rose is a rose is a rose) –a rose is a –rose is a rose –is a rose is –a rose is a –rose is a rose –S(D,4) = {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}

52 Resemblance The resemblance of documents A and B is defined as: r_w(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)| In general, 0 ≤ r_w(A,B) ≤ 1 –Note r_w(A,A) = 1 –But r_w(A,B) = 1 does not mean A and B are identical! What is a good value for w? Remember this? (r_w is the Jaccard coefficient of the shingle sets.)
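A sketch of shingling and resemblance in Python, assuming the page has already been tokenized into a list of words (the function names are illustrative); tuples are used so shingles can be set elements.

```python
def shingles(tokens, w):
    # S(D, w): the set of all contiguous w-token subsequences of the document.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(tokens_a, tokens_b, w=4):
    # r_w(A, B) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)|  (Jaccard coefficient)
    sa, sb = shingles(tokens_a, w), shingles(tokens_b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

d = "a rose is a rose is a rose".split()
print(shingles(d, 4))          # the three distinct 4-shingles from the slide
print(resemblance(d, d))       # 1.0 for identical documents
```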

53 Sketches Set of all shingles is large –Bigger than the original document We create a document sketch by sampling only a few shingles Requirement –Sketch resemblance should be a good estimate of document resemblance Notation: from now on, w will be a fixed value, so we omit w from r_w and S(A,w) –r(A,B) = r_w(A,B) and S(A) = S(A,w)

54 Choosing a sketch Random sampling does not work! –Suppose we have identical documents A, B, each with n shingles –Let m_A be a shingle from A, chosen uniformly at random; similarly m_B For k=1: E[ |{m_A} ∩ {m_B}| ] = 1/n –But r(A,B) = 1 –So the sketch overlap is an underestimate

55 Choosing a sketch Assume the set of all possible shingles is totally ordered Define: –m_A is the "smallest" shingle in A –m_B is the "smallest" shingle in B –r'(A,B) = |{m_A} ∩ {m_B}| For identical documents A & B –r'(A,B) = 1 = r(A,B)

Problems With a sketch of size 1, there is still a high probability of error –Especially if we choose the minimum shingle under a fixed order, e.g., alphabetical order Size of each shingle is still large –e.g., if w = 7, roughly 40–50 bytes per shingle –so 100 shingles = 4K–5K 56

57 Solution We can improve the estimate by picking more than one shingle Idea: –Choose many different random orderings of the set of all shingles –For each ordering, take the minimal shingle, as explained before How do we choose random orderings of shingles?

58 Solution (cont) Compute K fingerprints for each shingle –For example, use the MD5 hash function with K different parameter settings For each choice of parameters to the hash function, choose the lexicographically first (minimal) shingle in each document The near-duplication likelihood is estimated by checking how many of the K minimal shingles are common to both documents
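A sketch of this min-hash idea under simple assumptions: K salted MD5 hashes stand in for K random orderings (the salting scheme and the 40-bit truncation are illustrative), and the fraction of matching minima estimates the resemblance.

```python
import hashlib

def fingerprint(shingle, salt):
    # Hash a shingle with a salt; each salt induces a different "random ordering".
    data = (salt + " ".join(shingle)).encode("utf-8")
    return hashlib.md5(data).hexdigest()[:10]     # keep 40 bits (10 hex digits)

def sketch(shingle_set, k=100):
    # For each of the K orderings, keep the minimal fingerprint over all shingles.
    return [min(fingerprint(s, str(i)) for s in shingle_set) for i in range(k)]

def estimated_resemblance(sketch_a, sketch_b):
    # Fraction of orderings whose minima agree; estimates r(A, B).
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)

docs = ["a rose is a rose is a rose".split(),
        "a rose is a rose is a daisy".split()]
shingle_sets = [{tuple(d[i:i + 4]) for i in range(len(d) - 4 + 1)} for d in docs]
sk = [sketch(s) for s in shingle_sets]
print(estimated_resemblance(sk[0], sk[1]))
```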

59 Solution (cont) Observe that this gives us: –An efficient way to compute random orders of shingles –Significant reduction in size of shingles stored (40 bits is usually enough to keep estimates reasonably accurate)