1
Web Crawling and Automatic Discovery Donna Bergmark March 14, 2002
2
Web Resource Discovery. Surfing yields serendipity; searching yields specific information. Search relies on an inverted keyword list for page lookup, and a crawler supplies the text for keyword indexing. Hence, crawlers are needed for discovery of Web resources.
3
Definition. Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific Web-related tasks.
4
Some History. The first crawlers appeared in 1994. Why? Web growth: in April 1993 there were 62 registered web servers, and in 1994 Web (HTTP) traffic grew 15 times faster than the Internet itself. Lycos was announced as a search engine in 1994.
5
So, why not write a robot? You’d think a crawler would be easy to write: pick up the next URL; connect to the server; GET the URL; when the page arrives, extract its links (and optionally do other work); repeat.
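A minimal sketch of that loop, using only the Python standard library. The seed URL, page limit, and one-second delay are illustrative choices, not taken from the slides.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=10):
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # do not revisit URLs
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()                 # pick up the next URL
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                             # skip pages that fail to load
        fetched += 1
        parser = LinkExtractor()
        parser.feed(page)                        # when the page arrives, get its links
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)        # repeat with the new URLs
        time.sleep(1)                            # simple politeness pause


if __name__ == "__main__":
    crawl("https://example.com/")
```

Even this toy version already has to deal with revisits, failed fetches, and pacing, which previews the issues on the next slide.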
6
Crawler Issues. The URL itself; politeness; visit order; robot traps; the hidden Web; system considerations.
7
Standard for Robot Exclusion. Proposed by Martijn Koster (1994). The file lives at http://any-server:80/robots.txt and is maintained by the webmaster. It forbids access to particular pages and directories; /cgi-bin/ is commonly excluded. Adherence is voluntary for the crawler.
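A short sketch of voluntary compliance with the exclusion standard, using the standard library's urllib.robotparser. The host and the crawler name "CourseCrawler" are placeholders, not from the slides.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # one robots.txt per server
rp.read()                                          # fetch and parse the file

# can_fetch() answers: may this user agent request this path?
path = "https://www.example.com/cgi-bin/search"
if rp.can_fetch("CourseCrawler", path):
    print("allowed")
else:
    print("excluded by the webmaster")
```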
8
The Four Laws of Web Robotics. A crawler must identify itself. A crawler must obey robots.txt. A crawler must not hog resources. A crawler must report errors.
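A hedged sketch of laws 1, 3, and 4: identify the crawler in the User-Agent header, pause between requests, and report failures. The agent string, contact URL, and two-second delay are invented for illustration.

```python
import time
import urllib.request

USER_AGENT = "CourseCrawler/0.1 (+https://example.edu/crawler-info)"


def polite_get(url, delay=2.0):
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read()
    except OSError as err:
        print(f"error fetching {url}: {err}")    # law 4: report errors
        return None
    time.sleep(delay)                            # law 3: do not hog resources
    return body
```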
9
Visit Order. The frontier is the set of URLs waiting to be crawled. Breadth-first: FIFO queue. Depth-first: LIFO queue. Best-first: priority queue. Random order. Also consider the refresh rate for revisiting pages.
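A sketch of the visit-order disciplines listed above. The push/pop interface and the idea of scoring URLs for best-first are assumptions; the slide only names the queue types.

```python
import heapq
import random
from collections import deque


class FIFOFrontier:
    """Breadth-first: visit URLs in the order they were discovered."""
    def __init__(self):
        self.queue = deque()
    def push(self, url, score=0):
        self.queue.append(url)
    def pop(self):
        return self.queue.popleft()


class LIFOFrontier:
    """Depth-first: visit the most recently discovered URL next."""
    def __init__(self):
        self.queue = []
    def push(self, url, score=0):
        self.queue.append(url)
    def pop(self):
        return self.queue.pop()


class BestFirstFrontier:
    """Best-first: visit the highest-scoring URL next (priority queue)."""
    def __init__(self):
        self.heap = []
    def push(self, url, score=0):
        heapq.heappush(self.heap, (-score, url))   # negate: heapq is a min-heap
    def pop(self):
        return heapq.heappop(self.heap)[1]


class RandomFrontier:
    """Random: visit any waiting URL with equal probability."""
    def __init__(self):
        self.urls = []
    def push(self, url, score=0):
        self.urls.append(url)
    def pop(self):
        return self.urls.pop(random.randrange(len(self.urls)))
```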
10
Robot Traps. Cycles in the Web graph; infinite links on a page; traps set out by the webmaster.
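A sketch of simple trap defenses under assumed limits: the depth bound, URL-length bound, and per-host page cap are invented thresholds, not values from the slide.

```python
from urllib.parse import urlparse

MAX_DEPTH = 10            # bounds cycles and endless chains of generated links
MAX_URL_LENGTH = 256      # trap-generated URLs tend to grow without limit
MAX_PAGES_PER_HOST = 500  # caps how much one site can occupy the crawler


def should_visit(url, depth, seen, pages_per_host):
    host = urlparse(url).netloc
    if url in seen:                                   # cycle in the Web graph
        return False
    if depth > MAX_DEPTH:                             # likely an infinite chain
        return False
    if len(url) > MAX_URL_LENGTH:                     # suspiciously long URL
        return False
    if pages_per_host.get(host, 0) >= MAX_PAGES_PER_HOST:
        return False                                  # one host dominating the crawl
    return True
```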
11
The Hidden Web. Dynamic pages are increasing, and subscription pages and username/password pages are closed to ordinary crawlers. Research is in progress on how crawlers can “get into” the hidden Web.
12
System Issues. Crawlers are complicated systems; efficiency is of utmost importance; crawlers are demanding of system and network resources.
14
Mercator - 1. Written in Java. One file configures a crawl: how many threads, what analyzers to use, what filters to use, how to place links on the frontier, and how long to run.
15
Mercator - 2. Tell it what seed URL[s] to start with. You can add your own code: extend one or more of Mercator's base classes, or add totally new classes called by your own code. It is very efficient in memory usage: URLs are hashed and documents are finger-printed.
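This is not Mercator's implementation (Mercator is written in Java); the following is only a Python sketch of the two memory-saving ideas just named: store fixed-size hashes of URLs rather than the strings themselves, and fingerprint document content to detect pages already seen under another URL.

```python
import hashlib

seen_urls = set()
seen_fingerprints = set()


def url_hash(url):
    """Fixed-size digest standing in for the full URL string."""
    return hashlib.sha1(url.encode("utf-8")).digest()


def fingerprint(document_bytes):
    """Content fingerprint: identical documents map to the same digest."""
    return hashlib.sha1(document_bytes).hexdigest()


def is_new(url, document_bytes):
    h, fp = url_hash(url), fingerprint(document_bytes)
    if h in seen_urls or fp in seen_fingerprints:
        return False
    seen_urls.add(h)
    seen_fingerprints.add(fp)
    return True
```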
16
Mercator - 3. An industrial-strength crawler: multi-threaded for parallel crawls; polite, with one thread per server; it implements its own host lookup and uses its own DNS resolver.
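A sketch of the "one thread per server" politeness rule, again not Mercator's own code: every URL is routed to a worker chosen by its host name, so no host is ever fetched by two threads at once. The worker count of 8 is arbitrary.

```python
import queue
import threading
from urllib.parse import urlparse

NUM_WORKERS = 8
work_queues = [queue.Queue() for _ in range(NUM_WORKERS)]


def enqueue(url):
    host = urlparse(url).netloc
    work_queues[hash(host) % NUM_WORKERS].put(url)   # same host -> same worker


def worker(q):
    while True:
        url = q.get()
        # fetch and process url here, one request at a time for this host
        q.task_done()


for q in work_queues:
    threading.Thread(target=worker, args=(q,), daemon=True).start()
```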
17
The Web as a Graph. Crawling is meant to traverse the Web. Remove some edges to create a tree, i.e., do not revisit URLs. You can only crawl forwards, i.e., following in-links requires explicit back-links. PageRank is computed from this link structure.
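A minimal PageRank sketch over such a link graph. The damping factor of 0.85, the iteration count, and the toy graph are illustrative; the slide only names the idea.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, links in graph.items():
            if not links:                       # dangling page: spread rank evenly
                for target in graph:
                    new_rank[target] += damping * rank[page] / n
            else:
                for target in links:
                    new_rank[target] += damping * rank[page] / len(links)
        rank = new_rank
    return rank


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy_web))
```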
18
The Web is a BIG Graph. The “diameter” of the Web is large; even the static part cannot be crawled completely. New technology: the focused crawl.
20
Conclusion. Clearly, crawling is not simple. It was a hot research topic of the late 1990s, and good technologies resulted. Focused crawling is where crawling is going next (a hot topic of the early 2000s).