1
Web Crawling and Automatic Discovery Donna Bergmark March 14, 2002
2
Web Resource Discovery. Surfing yields serendipity; searching yields specific information. Search relies on an inverted keyword list for page lookup, and a crawler supplies the text for keyword indexing. Hence, crawlers are needed for discovery of Web resources.
3
Definition. Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific Web-related tasks.
4
Some History. The first crawlers appeared in 1994. Why? Web growth: in April 1993 there were 62 registered web servers, and in 1994 Web (HTTP) traffic grew 15 times faster than the Internet itself. Lycos was announced as a search engine in 1994.
5
So, why not write a robot? You’d think a crawler would be easy to write: pick up the next URL; connect to the server; GET the URL; when the page arrives, extract its links (and optionally do other work); repeat.
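A minimal sketch of that loop, using only the Python standard library. The seed URL, page limit, and one-second delay are illustrative choices, not taken from the slides.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=10):
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # do not revisit URLs
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()                 # pick up the next URL
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                             # skip pages that fail to load
        fetched += 1
        parser = LinkExtractor()
        parser.feed(page)                        # when the page arrives, get its links
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)        # repeat with the new URLs
        time.sleep(1)                            # simple politeness pause


if __name__ == "__main__":
    crawl("https://example.com/")
```

Even this toy version already has to deal with revisits, failed fetches, and pacing, which previews the issues on the next slide.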
6
Crawler Issues. The URL itself; politeness; visit order; robot traps; the hidden Web; system considerations.
7
Standard for Robot Exclusion. Proposed by Martijn Koster (1994). The file lives at http://any-server:80/robots.txt and is maintained by the webmaster. It forbids access to particular pages and directories; /cgi-bin/ is commonly excluded. Adherence is voluntary for the crawler.
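A short sketch of voluntary compliance with the exclusion standard, using the standard library's urllib.robotparser. The host and the crawler name "CourseCrawler" are placeholders, not from the slides.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # one robots.txt per server
rp.read()                                          # fetch and parse the file

# can_fetch() answers: may this user agent request this path?
path = "https://www.example.com/cgi-bin/search"
if rp.can_fetch("CourseCrawler", path):
    print("allowed")
else:
    print("excluded by the webmaster")
```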
8
The Four Laws of Web Robotics. A crawler must identify itself. A crawler must obey robots.txt. A crawler must not hog resources. A crawler must report errors.
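A hedged sketch of laws 1, 3, and 4: identify the crawler in the User-Agent header, pause between requests, and report failures. The agent string, contact URL, and two-second delay are invented for illustration.

```python
import time
import urllib.request

USER_AGENT = "CourseCrawler/0.1 (+https://example.edu/crawler-info)"


def polite_get(url, delay=2.0):
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read()
    except OSError as err:
        print(f"error fetching {url}: {err}")    # law 4: report errors
        return None
    time.sleep(delay)                            # law 3: do not hog resources
    return body
```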
9
Visit Order. The frontier is the set of URLs waiting to be crawled. Breadth-first: FIFO queue. Depth-first: LIFO queue. Best-first: priority queue. Random order. Also consider the refresh rate for revisiting pages.
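A sketch of the visit-order disciplines listed above. The push/pop interface and the idea of scoring URLs for best-first are assumptions; the slide only names the queue types.

```python
import heapq
import random
from collections import deque


class FIFOFrontier:
    """Breadth-first: visit URLs in the order they were discovered."""
    def __init__(self):
        self.queue = deque()
    def push(self, url, score=0):
        self.queue.append(url)
    def pop(self):
        return self.queue.popleft()


class LIFOFrontier:
    """Depth-first: visit the most recently discovered URL next."""
    def __init__(self):
        self.queue = []
    def push(self, url, score=0):
        self.queue.append(url)
    def pop(self):
        return self.queue.pop()


class BestFirstFrontier:
    """Best-first: visit the highest-scoring URL next (priority queue)."""
    def __init__(self):
        self.heap = []
    def push(self, url, score=0):
        heapq.heappush(self.heap, (-score, url))   # negate: heapq is a min-heap
    def pop(self):
        return heapq.heappop(self.heap)[1]


class RandomFrontier:
    """Random: visit any waiting URL with equal probability."""
    def __init__(self):
        self.urls = []
    def push(self, url, score=0):
        self.urls.append(url)
    def pop(self):
        return self.urls.pop(random.randrange(len(self.urls)))
```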
10
Robot Traps. Cycles in the Web graph; infinite links on a page; traps set out by the webmaster.
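A sketch of simple trap defenses under assumed limits: the depth bound, URL-length bound, and per-host page cap are invented thresholds, not values from the slide.

```python
from urllib.parse import urlparse

MAX_DEPTH = 10            # bounds cycles and endless chains of generated links
MAX_URL_LENGTH = 256      # trap-generated URLs tend to grow without limit
MAX_PAGES_PER_HOST = 500  # caps how much one site can occupy the crawler


def should_visit(url, depth, seen, pages_per_host):
    host = urlparse(url).netloc
    if url in seen:                                   # cycle in the Web graph
        return False
    if depth > MAX_DEPTH:                             # likely an infinite chain
        return False
    if len(url) > MAX_URL_LENGTH:                     # suspiciously long URL
        return False
    if pages_per_host.get(host, 0) >= MAX_PAGES_PER_HOST:
        return False                                  # one host dominating the crawl
    return True
```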
11
The Hidden Web. Dynamic pages are increasing, and subscription pages and username/password pages are closed to ordinary crawlers. Research is in progress on how crawlers can “get into” the hidden Web.
12
System Issues. Crawlers are complicated systems; efficiency is of utmost importance; crawlers are demanding of system and network resources.
14
Mercator - 1. Written in Java. One file configures a crawl: how many threads, what analyzers to use, what filters to use, how to place links on the frontier, and how long to run.
15
Mercator - 2. Tell it what seed URL[s] to start with. You can add your own code: extend one or more of Mercator's base classes, or add totally new classes called by your own code. It is very efficient in memory usage: URLs are hashed and documents are finger-printed.
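This is not Mercator's implementation (Mercator is written in Java); the following is only a Python sketch of the two memory-saving ideas just named: store fixed-size hashes of URLs rather than the strings themselves, and fingerprint document content to detect pages already seen under another URL.

```python
import hashlib

seen_urls = set()
seen_fingerprints = set()


def url_hash(url):
    """Fixed-size digest standing in for the full URL string."""
    return hashlib.sha1(url.encode("utf-8")).digest()


def fingerprint(document_bytes):
    """Content fingerprint: identical documents map to the same digest."""
    return hashlib.sha1(document_bytes).hexdigest()


def is_new(url, document_bytes):
    h, fp = url_hash(url), fingerprint(document_bytes)
    if h in seen_urls or fp in seen_fingerprints:
        return False
    seen_urls.add(h)
    seen_fingerprints.add(fp)
    return True
```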
16
Mercator - 3. An industrial-strength crawler: multi-threaded for parallel crawls; polite, with one thread per server; it implements its own host lookup and uses its own DNS resolver.
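A sketch of the "one thread per server" politeness rule, again not Mercator's own code: every URL is routed to a worker chosen by its host name, so no host is ever fetched by two threads at once. The worker count of 8 is arbitrary.

```python
import queue
import threading
from urllib.parse import urlparse

NUM_WORKERS = 8
work_queues = [queue.Queue() for _ in range(NUM_WORKERS)]


def enqueue(url):
    host = urlparse(url).netloc
    work_queues[hash(host) % NUM_WORKERS].put(url)   # same host -> same worker


def worker(q):
    while True:
        url = q.get()
        # fetch and process url here, one request at a time for this host
        q.task_done()


for q in work_queues:
    threading.Thread(target=worker, args=(q,), daemon=True).start()
```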
17
The Web as a Graph. Crawling is meant to traverse the Web. Remove some edges to create a tree, i.e., do not revisit URLs. You can only crawl forwards, i.e., following in-links requires explicit back-links. PageRank is computed from this link structure.
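A minimal PageRank sketch over such a link graph. The damping factor of 0.85, the iteration count, and the toy graph are illustrative; the slide only names the idea.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, links in graph.items():
            if not links:                       # dangling page: spread rank evenly
                for target in graph:
                    new_rank[target] += damping * rank[page] / n
            else:
                for target in links:
                    new_rank[target] += damping * rank[page] / len(links)
        rank = new_rank
    return rank


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy_web))
```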
18
The Web is a BIG Graph. The “diameter” of the Web is large; even the static part cannot be crawled completely. New technology: the focused crawl.
20
Conclusion. Clearly, crawling is not simple. It was a hot research topic of the late 1990s, and good technologies resulted. Focused crawling is where crawling is going next (a hot topic of the early 2000s).