1 Web crawlers
cs430 lecture, 02/22/01, Kamen Yotov

2 What is a web crawler?
Definition (crawler = spider): self-sufficient programs that index any site you point them at.
Useful for indexing:
websites distributed among multiple servers
websites related to your own!

3 Types of web crawlers
Server-side (business oriented)
The technology behind Google, AltaVista, …
Scalable, reliable, available, …
Resource hungry
Client-side (customer oriented)
Examples are Teleport Pro, WebSnake, …
Much smaller resource requirements
Need guidance to proceed

4 Simple web crawler algorithm
The same simple algorithm works for both types!
Let S be the set of pages we want to index; initially, S is the singleton set {p}.
Take an element p of S.
Parse the page p and retrieve the set L of pages it links to.
Substitute S = S + L − p.
Repeat as many times as necessary.
The loop is sketched in code below.
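
A minimal Python sketch of this loop (not code from the lecture), assuming the third-party requests library is installed; extract_links and the max_pages cap are illustrative stand-ins:

import re
import requests
from urllib.parse import urljoin

def extract_links(base_url, html):
    # Deliberately naive href extraction; slide 9 explains why real
    # crawlers need much more than a regular expression here.
    return {urljoin(base_url, href)
            for href in re.findall(r'href="([^"]+)"', html)}

def crawl(start_url, max_pages=50):
    S = {start_url}        # pages we still want to index
    done = set()           # pages already fetched and "indexed"
    while S and len(done) < max_pages:
        p = S.pop()                          # take an element p of S
        try:
            html = requests.get(p, timeout=5).text
        except requests.RequestException:
            continue
        done.add(p)                          # index the page (stubbed out)
        L = extract_links(p, html)           # pages that p links to
        S |= L - done                        # S = S + L - p
    return done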

5 Simple or not so much…
Representation of S?
Queue, stack, deque (see the deque sketch below)
Taking elements and completing S = S + L:
FIFO, LIFO, a combination
How deep do we go?
Not only finding, but indexing!
Links – not so easy to extract…
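
In Python, collections.deque covers all three representations in one type; a tiny illustrative sketch (the URLs are placeholders, not from the lecture):

from collections import deque

frontier = deque(["http://example.com/"])   # S, kept as an ordered frontier
frontier.append("http://example.com/next")  # new links go in at the right

page = frontier.popleft()   # FIFO: take from the left  -> breadth-first
page = frontier.pop()       # LIFO: take from the right -> depth-first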

6 FIFO Queue: BFS

7 LIFO Queue: DFS
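
To make the two orders concrete, here is a hypothetical in-memory link graph (not from the lecture) crawled both ways; the only difference is which end of the deque we take from:

from collections import deque

# Hypothetical link graph: page -> pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def crawl_order(start, fifo=True):
    frontier, seen, order = deque([start]), {start}, []
    while frontier:
        page = frontier.popleft() if fifo else frontier.pop()
        order.append(page)
        for link in graph[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl_order("A", fifo=True))   # BFS: ['A', 'B', 'C', 'D', 'E']
print(crawl_order("A", fifo=False))  # DFS: ['A', 'C', 'E', 'B', 'D']

Swapping popleft for pop is the entire difference between the two traversals.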

8 What to search for?
Most crawlers search only for:
HTML (leaves and nodes in the tree)
ASCII clear text (only as leaves in the tree)
Some search for:
PDF
PostScript, …
Important: indexing after search!
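
A crawler typically decides what to parse from the Content-Type response header. A sketch, again assuming the requests library; the PARSEABLE whitelist is an assumption, not from the slides:

import requests

# Assumed whitelist of MIME types worth fetching and indexing.
PARSEABLE = {"text/html", "text/plain", "application/pdf"}

def fetch_if_parseable(url):
    # HEAD first, so we never download bodies we cannot index.
    head = requests.head(url, timeout=5, allow_redirects=True)
    mime = head.headers.get("Content-Type", "").split(";")[0].strip()
    if mime in PARSEABLE:
        return requests.get(url, timeout=5).content
    return None  # skip images, video, etc.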

9 Links – not so easy to extract…
Relative/absolute URLs
CGI parameters
Dynamic generation of pages
Server-side scripting
Server-side image maps
Links buried in scripting code
Undecidable in the first place
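
The easy case, resolving relative URLs against the page's base, is handled by urllib.parse.urljoin; the standard-library sketch below deliberately ignores the hard cases listed above (links generated by scripts are undecidable in general):

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolving relative URLs."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))

extractor = LinkExtractor("http://example.com/dir/page.html")
extractor.feed('<a href="../other.html">x</a> <a href="http://foo/">y</a>')
print(extractor.links)
# e.g. {'http://example.com/other.html', 'http://foo/'}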

10 Performance issues
Commercial crawlers face problems:
They want to explore more than they can;
They have limited computational resources;
They need a lot of storage space and bandwidth.
Communication bandwidth issues:
The connection to the backbone is not fast enough to crawl at the desired speed;
They need to respect other sites, so as not to render them inoperable (see the politeness sketch below).
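
One common way to respect other sites is a per-host minimum delay between requests. A generic sketch; the one-second interval is an assumed value, not from the lecture:

import time
from urllib.parse import urlparse

MIN_DELAY = 1.0  # assumed minimum seconds between hits to the same host
last_hit = {}    # host -> timestamp of our last request

def polite_wait(url):
    """Sleep as needed so one host is hit at most once per MIN_DELAY."""
    host = urlparse(url).netloc
    now = time.monotonic()
    wait = last_hit.get(host, 0) + MIN_DELAY - now
    if wait > 0:
        time.sleep(wait)
    last_hit[host] = time.monotonic()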

11 An example (Google)
85 people; 50% technical, 14 with a PhD in computer science
Central system:
Handles 5.5 million searches per day
The increase rate is 20% per month
Contains 2,500 Linux machines
Has 80 terabytes of spinning disks
30 new machines are installed daily
The cache holds 200 million pages
The aim is to crawl the web once per month!
— Larry Page, Google

12 Typical crawling setting
Multi-machine, clustered environment
Multi-threaded, parallel searching (a single-machine thread-pool sketch follows)
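
Within a single machine, the simplest version of this is a thread pool of fetchers; a minimal sketch assuming requests, which ignores the cross-machine partitioning a real cluster needs (the seed URLs are illustrative):

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Worker: download one page; errors just yield None."""
    try:
        return url, requests.get(url, timeout=5).text
    except requests.RequestException:
        return url, None

urls = ["http://example.com/", "http://example.org/"]
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, html in pool.map(fetch, urls):
        if html is not None:
            print(url, len(html))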

13 Netiquette
robots.txt:

# robots.txt for
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:

Site bandwidth overload
Restricted material…
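
Python's standard library can enforce such a file directly via urllib.robotparser; in this sketch the URL is illustrative, and the agent name "cybermapper" comes from the slide's example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # illustrative URL
rp.read()  # fetches and parses the file

# With the slide's rules, a generic crawler must skip /tmp/ ...
print(rp.can_fetch("*", "http://www.example.com/tmp/x"))            # False
# ... while "cybermapper" may go anywhere.
print(rp.can_fetch("cybermapper", "http://www.example.com/tmp/x"))  # True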

14 An area open for R&D!
There is not much information about how real crawlers work:
people who know how to do it just do it (rather than explain it).
Maybe yours will be the next best crawler!

