1 Web crawlers
cs430 lecture, 02/22/01, Kamen Yotov

2 What is a web crawler?
Definition (crawler = spider): self-sufficient programs that index any site you point them at.
Useful for indexing:
websites distributed among multiple servers
websites related to your own!

3 Types of web crawlers
Server-side (business oriented)
The technology behind Google, AltaVista, …
Scalable, reliable, available, …
Resource hungry
Client-side (customer oriented)
Examples are Teleport Pro, WebSnake, …
Much smaller resource requirements
Need guidance to proceed

4 Simple web crawler algorithm
The same simple algorithm works for both types!
Let S be the set of pages we want to index; initially, S is the singleton set {p}.
Take an element p of S.
Parse the page p and retrieve the set L of pages it links to.
Substitute S = S + L − p.
Repeat as many times as necessary.
The loop is sketched in code below.
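
A minimal Python sketch of this loop (not code from the lecture), assuming the third-party requests library is installed; extract_links and the max_pages cap are illustrative stand-ins:

import re
import requests
from urllib.parse import urljoin

def extract_links(base_url, html):
    # Deliberately naive href extraction; slide 9 explains why real
    # crawlers need much more than a regular expression here.
    return {urljoin(base_url, href)
            for href in re.findall(r'href="([^"]+)"', html)}

def crawl(start_url, max_pages=50):
    S = {start_url}        # pages we still want to index
    done = set()           # pages already fetched and "indexed"
    while S and len(done) < max_pages:
        p = S.pop()                          # take an element p of S
        try:
            html = requests.get(p, timeout=5).text
        except requests.RequestException:
            continue
        done.add(p)                          # index the page (stubbed out)
        L = extract_links(p, html)           # pages that p links to
        S |= L - done                        # S = S + L - p
    return done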

5 Simple or not so much…
Representation of S?
Queue, stack, deque (see the deque sketch below)
Taking elements and completing S = S + L:
FIFO, LIFO, a combination
How deep do we go?
Not only finding, but indexing!
Links – not so easy to extract…
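
In Python, collections.deque covers all three representations in one type; a tiny illustrative sketch (the URLs are placeholders, not from the lecture):

from collections import deque

frontier = deque(["http://example.com/"])   # S, kept as an ordered frontier
frontier.append("http://example.com/next")  # new links go in at the right

page = frontier.popleft()   # FIFO: take from the left  -> breadth-first
page = frontier.pop()       # LIFO: take from the right -> depth-first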

6 FIFO Queue: BFS

7 LIFO Queue: DFS
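
To make the two orders concrete, here is a hypothetical in-memory link graph (not from the lecture) crawled both ways; the only difference is which end of the deque we take from:

from collections import deque

# Hypothetical link graph: page -> pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def crawl_order(start, fifo=True):
    frontier, seen, order = deque([start]), {start}, []
    while frontier:
        page = frontier.popleft() if fifo else frontier.pop()
        order.append(page)
        for link in graph[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl_order("A", fifo=True))   # BFS: ['A', 'B', 'C', 'D', 'E']
print(crawl_order("A", fifo=False))  # DFS: ['A', 'C', 'E', 'B', 'D']

Swapping popleft for pop is the entire difference between the two traversals.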

8 What to search for?
Most crawlers search only for:
HTML (leaves and nodes in the tree)
ASCII clear text (only as leaves in the tree)
Some search for:
PDF
PostScript, …
Important: indexing after search!
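
A crawler typically decides what to parse from the Content-Type response header. A sketch, again assuming the requests library; the PARSEABLE whitelist is an assumption, not from the slides:

import requests

# Assumed whitelist of MIME types worth fetching and indexing.
PARSEABLE = {"text/html", "text/plain", "application/pdf"}

def fetch_if_parseable(url):
    # HEAD first, so we never download bodies we cannot index.
    head = requests.head(url, timeout=5, allow_redirects=True)
    mime = head.headers.get("Content-Type", "").split(";")[0].strip()
    if mime in PARSEABLE:
        return requests.get(url, timeout=5).content
    return None  # skip images, video, etc.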

9 Links – not so easy to extract…
Relative/absolute URLs
CGI parameters
Dynamic generation of pages
Server-side scripting
Server-side image maps
Links buried in scripting code
Undecidable in the first place
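
The easy case, resolving relative URLs against the page's base, is handled by urllib.parse.urljoin; the standard-library sketch below deliberately ignores the hard cases listed above (links generated by scripts are undecidable in general):

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolving relative URLs."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))

extractor = LinkExtractor("http://example.com/dir/page.html")
extractor.feed('<a href="../other.html">x</a> <a href="http://foo/">y</a>')
print(extractor.links)
# e.g. {'http://example.com/other.html', 'http://foo/'}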

10 Performance issues
Commercial crawlers face problems:
They want to explore more than they can;
They have limited computational resources;
They need a lot of storage space and bandwidth.
Communication bandwidth issues:
The connection to the backbone is not fast enough to crawl at the desired speed;
They need to respect other sites, so as not to render them inoperable (see the politeness sketch below).
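
One common way to respect other sites is a per-host minimum delay between requests. A generic sketch; the one-second interval is an assumed value, not from the lecture:

import time
from urllib.parse import urlparse

MIN_DELAY = 1.0  # assumed minimum seconds between hits to the same host
last_hit = {}    # host -> timestamp of our last request

def polite_wait(url):
    """Sleep as needed so one host is hit at most once per MIN_DELAY."""
    host = urlparse(url).netloc
    now = time.monotonic()
    wait = last_hit.get(host, 0) + MIN_DELAY - now
    if wait > 0:
        time.sleep(wait)
    last_hit[host] = time.monotonic()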

11 An example (Google)
85 people; 50% technical, 14 with a PhD in computer science
Central system:
Handles 5.5 million searches per day
The increase rate is 20% per month
Contains 2,500 Linux machines
Has 80 terabytes of spinning disks
30 new machines are installed daily
The cache holds 200 million pages
The aim is to crawl the web once per month!
— Larry Page, Google

12 Typical crawling setting
Multi-machine, clustered environment
Multi-threaded, parallel searching (a single-machine thread-pool sketch follows)
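
Within a single machine, the simplest version of this is a thread pool of fetchers; a minimal sketch assuming requests, which ignores the cross-machine partitioning a real cluster needs (the seed URLs are illustrative):

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Worker: download one page; errors just yield None."""
    try:
        return url, requests.get(url, timeout=5).text
    except requests.RequestException:
        return url, None

urls = ["http://example.com/", "http://example.org/"]
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, html in pool.map(fetch, urls):
        if html is not None:
            print(url, len(html))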

13 Netiquette
robots.txt:

# robots.txt for
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:

Site bandwidth overload
Restricted material…
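
Python's standard library can enforce such a file directly via urllib.robotparser; in this sketch the URL is illustrative, and the agent name "cybermapper" comes from the slide's example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # illustrative URL
rp.read()  # fetches and parses the file

# With the slide's rules, a generic crawler must skip /tmp/ ...
print(rp.can_fetch("*", "http://www.example.com/tmp/x"))            # False
# ... while "cybermapper" may go anywhere.
print(rp.can_fetch("cybermapper", "http://www.example.com/tmp/x"))  # True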

14 An area open for R&D!
There is not much information about how real crawlers work:
people who know how to do it just do it (rather than explain it).
Maybe yours will be the next best crawler!

