Slide 1: Web Crawling and Automatic Discovery
Donna Bergmark, Cornell Information Systems, bergmark@cs.cornell.edu
CS502 Web Information Systems, March 26, 2003
Slide 2: Web Resource Discovery
Finding info on the Web:
– Surfing (random strategy; goal is serendipity)
– Searching (inverted indices; specific info)
– Crawling (follow links; "all" the info)
Uses for crawling:
– Find stuff
– Gather stuff
– Check stuff
Slide 3: Definition
Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.
Slide 4: Crawlers and Internet History
– 1991: HTTP
– 1992: 26 servers
– 1993: 60+ servers; self-registration; Archie
– 1994 (early): first crawlers
– 1996: search engines abound
– 1998: focused crawling
– 1999: web graph studies
– 2002: use for digital libraries
Slide 5: So, Why Not Write a Robot?
You'd think a crawler would be easy to write:
– Pick up the next URL
– Connect to the server
– GET the URL
– When the page arrives, get its links (optionally do other stuff)
– REPEAT
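The loop on this slide can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `fetch_links` callback stands in for the connect/GET/parse steps, and here it is driven by a tiny in-memory "web" instead of real HTTP.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Basic crawl loop: pick up the next URL, fetch it,
    extract its links, repeat until the frontier is empty."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # avoid queueing duplicates
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()        # pick up the next URL
        visited.append(url)
        for link in fetch_links(url):   # GET the page, get its links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Tiny in-memory "web" standing in for real HTTP fetches.
web = {"a": ["b", "c"], "b": ["c"], "c": []}
print(crawl("a", lambda u: web.get(u, [])))  # → ['a', 'b', 'c']
```

As the rest of the deck shows, each line of this loop hides real work: DNS resolution, politeness, canonicalization, robot exclusion, and trap avoidance.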
Slide 6: The Central Crawler Function
[Diagram: per-server URL queues (Server 1 queue, Server 2 queue, Server 3 queue) feed the fetcher]
– Resolve the URL to an IP address via DNS
– Connect a socket to the server; send the HTTP request
– Wait for the response: an HTML page
Slide 7: Handling the HTTP Response
[Diagram: FETCH hands each document to the processing step]
– Document seen before? If not, process this document:
– Extract text
– Extract links
Slide 8: Link Extraction
– Finding the links is easy (sequential scan)
– Need to clean them up and canonicalize them
– Need to filter them
– Need to check for robot exclusion
– Need to check for duplicates
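The clean-up/canonicalize step can be sketched with the standard library: resolve relative links against the page's base URL, drop fragments, lowercase the host, and strip the default port. The exact normalizations are a choice; these are common ones, not a rule from the slides.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def canonicalize(base: str, href: str) -> str:
    """Turn a raw href into one canonical absolute URL."""
    url, _fragment = urldefrag(urljoin(base, href))  # resolve relative, drop #fragment
    parts = urlparse(url)
    netloc = parts.netloc.lower()                    # hostnames are case-insensitive
    if parts.scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                         # :80 is the default for http
    return parts._replace(netloc=netloc).geturl()

print(canonicalize("http://Example.COM:80/a/", "../b.html#top"))
# → http://example.com/b.html
```

With links in one canonical form, the duplicate check on the next step becomes a simple set membership test.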
Slide 9: Update the Frontier
[Diagram: FETCH → PROCESS → newly extracted URLs (URL1, URL2, URL3, …) are appended to the FRONTIER]
Slide 10: Crawler Issues
– System considerations
– The URL itself
– Politeness
– Visit order
– Robot traps
– The hidden web
Slide 11: Standard for Robot Exclusion
– Martijn Koster (1994)
– http://any-server:80/robots.txt
– Maintained by the webmaster
– Forbids access to pages or directories
– Commonly excluded: /cgi-bin/
– Adherence is voluntary for the crawler
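Python's standard library ships a parser for this format. A small sketch, feeding it a robots.txt body directly rather than fetching one over the network; the user-agent name `mybot` is made up for the example:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse a robots.txt body; normally set_url()/read() would fetch
# http://any-server:80/robots.txt first.
rp.parse("""User-agent: *
Disallow: /cgi-bin/
""".splitlines())

print(rp.can_fetch("mybot", "http://example.com/cgi-bin/search"))  # False
print(rp.can_fetch("mybot", "http://example.com/index.html"))      # True
```

The library only reports what the file says; as the slide notes, actually honoring the answer is up to the crawler.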
Slide 12: Visit Order
The frontier can be organized as:
– Breadth-first: FIFO queue
– Depth-first: LIFO queue
– Best-first: priority queue
– Random
– Refresh rate
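The first three visit orders differ only in the data structure behind the frontier. A minimal sketch (the `Frontier` class and its `order` strings are illustrative, not an API from the slides):

```python
import heapq
from collections import deque

class Frontier:
    """One frontier, three visit orders: FIFO (breadth-first),
    LIFO (depth-first), or priority queue (best-first)."""
    def __init__(self, order="bfs"):
        self.order = order
        self.q = [] if order == "best" else deque()

    def push(self, url, score=0.0):
        if self.order == "best":
            heapq.heappush(self.q, (-score, url))  # highest score pops first
        else:
            self.q.append(url)

    def pop(self):
        if self.order == "best":
            return heapq.heappop(self.q)[1]
        if self.order == "bfs":
            return self.q.popleft()                # FIFO
        return self.q.pop()                        # LIFO
```

Best-first is the interesting case for the focused crawling that follows: the score can be a page's estimated relevance to the topic.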
Slide 13: Robot Traps
– Cycles in the Web graph
– Infinite links on a page
– Traps set out by the webmaster
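Traps are usually dodged with cheap heuristics rather than exact cycle detection. A sketch of two common ones (the specific limits here are assumptions for illustration, not values from the slides): very long URLs and heavily repeated path segments both suggest a dynamically generated infinite link space.

```python
MAX_URL_LEN = 256   # assumed limit; traps often grow URLs without bound
MAX_REPEAT = 3      # e.g. /calendar/2003/2003/2003/... suggests a cycle

def looks_like_trap(url: str) -> bool:
    """Heuristic: flag suspiciously long URLs or URLs whose path
    repeats the same segment many times."""
    if len(url) > MAX_URL_LEN:
        return True
    segments = [s for s in url.split("/") if s]
    return any(segments.count(s) > MAX_REPEAT for s in set(segments))

print(looks_like_trap("http://x/a/a/a/a/a"))                       # True
print(looks_like_trap("http://example.com/research/crawl.html"))   # False
```

Combined with the duplicate-content check from slide 7, this catches most cycle traps without remembering the whole crawl graph.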
Slide 14: The Hidden Web
– Dynamic pages are increasing
– Subscription pages
– Username-and-password pages
– Research in progress on how crawlers can "get into" the hidden web
Slide 15: MERCATOR
Slide 16: Mercator Features
– One file configures a crawl
– Written in Java
– Can add your own code:
– Extend one or more of Mercator's base classes
– Add totally new classes called by your own
– Industrial-strength crawler: uses its own DNS and java.net package
Slide 17: The Web Is a BIG Graph
– "Diameter" of the Web
– Cannot crawl even the static part completely
– New technology: the focused crawl
Slide 18: Crawling and Crawlers
[Diagram: the Web overlays the internet; a crawl, starting from a seed, overlays the Web]
Slide 19: Focused Crawling
Slide 20: Focused Crawling
[Diagram: from root R, a breadth-first crawl visits all nodes 1–7 level by level; a focused crawl visits only on-topic nodes 1–5 and prunes (X) the off-topic branches]
Slide 21: Focused Crawling
Recall the cartoon for a focused crawl (root R, nodes 1–5, pruned branches marked X). A simple way to do it is with two "knobs".
Slide 22: Focusing the Crawl
– Threshold: a page is on-topic if its correlation to the closest centroid is above this value
– Cutoff: follow links from pages whose "distance" from the closest on-topic ancestor is less than this value
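The two knobs can be sketched directly. This is an illustration under assumptions: cosine correlation between sparse term vectors, and made-up `THRESHOLD` and `CUTOFF` values; the deck does not fix these numbers or the helper names.

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine correlation between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

THRESHOLD = 0.3   # knob 1: on-topic if correlation to closest centroid >= this
CUTOFF = 1        # knob 2: keep tunneling this many links past an on-topic page

def should_expand(page_vec, centroids, dist_from_on_topic):
    """Return (follow this page's links?, distance for its children).
    On-topic pages reset the distance; off-topic pages are expanded
    only while still within CUTOFF links of an on-topic ancestor."""
    corr = max(cosine(page_vec, c) for c in centroids)
    if corr >= THRESHOLD:
        return True, 0
    return dist_from_on_topic < CUTOFF, dist_from_on_topic + 1
```

Raising the threshold makes the crawl stricter about what counts as on-topic; raising the cutoff lets it "tunnel" farther through off-topic pages, as in slide 23's illustration.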
Slide 23: Illustration
[Diagram: a crawl tree with cutoff = 1; nodes whose correlation >= threshold are on-topic, and off-topic nodes are expanded only one link deep]
Slide 24
[Chart comparing correlation to the closest vs. furthest centroid]
Slide 25: Correlation vs. Crawl Length
[Chart]
Slide 26: Fall 2002 Student Project
[System diagram: a Query and a Collection Description feed Mercator; term vectors yield centroids and a dictionary (with Chebyshev polynomials); output is collection URLs and HTML]
Slide 27: Conclusion
– We covered crawling: history, technology, deployment
– Focused crawling with tunneling
– We have a good experimental setup for exploring automatic collection synthesis
Slide 28
http://mercator.comm.nsdlib.org