Web Crawling and Automatic Discovery

1 Web Crawling and Automatic Discovery
CS431, Spring 2004. Carl Lagoze. March 31, Lecture 18.

2 Outline
Crawlers
Focused Crawling
Some Results
Student Project (Fall 2002)

3 Web Resource Discovery
Finding info on the Web: Surfing (random strategy; goal is serendipity), Searching (inverted indices; specific info), Crawling (follow links; "all" the info). Uses for crawling: find stuff, gather stuff, check stuff. (From Cheong)
Web resource discovery is the problem of finding useful information on the Web. Casual surfing is enjoyable because the Web is large, unpredictable, rich, decentralized, and dynamic, but finding the specific information you are looking for cannot be done just by following links. You rely on search engines, and search engines rely on Web crawlers. Web crawlers (robots) systematically traverse the Web and download pages for keyword indexing; search engines can then look up pages that match a query of keywords. Crawlers are also used for many other interesting purposes, such as collection building (the topic of the next lecture).
The Web, which overlays the internet in graph-theoretic terms, is the structure we traverse for resource discovery. There are three ways to find information on the Web, with varying goals and starting points: surfing starts from the current page; searching starts from a query, which goes into an inverted index to look up URLs; crawling starts from a set of "seed" URLs and caches the pages it visits. Search engines do use crawlers, but the engines themselves do not crawl the Web at query time. Their crawler follows links from one page to the next, caching pages for later indexing and ranking, generally in breadth-first order, except that a crawl will sometimes avoid staying on the same site.
Typical uses for crawlers: Find stuff, e.g. with a focused crawl, the topic of this talk. Gather stuff, e.g. for indexing or for collecting addresses. Check stuff, e.g. finding broken links or new pages.

4 Definition Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web. Applications of crawlers: health checks; building databases (link bases in particular, for calculating page ranks); indexing pages downloaded by crawlers for search engines; collecting addresses; and finding collections, a novel crawler application. I am going to go over some crawler basics and wind up describing the crawler I have been using, just for general interest, along with how crawlers came to be (i.e., some internet history).

5 Crawlers and internet history
1991: HTTP. 1992: 26 servers. 1993: 60+ servers; self-register; archie. 1994 (early): first crawlers. 1996: search engines abound. 1998: focused crawling. 1999: web graph studies. 2002: use for digital libraries.
Crawlers are the children of internet history. In the early days nobody needed crawlers: the number of servers was small (but growing rapidly), and servers registered themselves, so you knew who they were. It was simple to visit them and collect all their pages. (Carl: Gopher-like thinking at the beginning.)
In early 1994 crawling started (Colorado? Oliver McBryan?): programs went out and discovered where the servers were by following links in HTML pages. The HTTP protocol made this automatic approach very doable. Later that year the first search engine that I knew of, anyway (Lycos), was announced. They put the crawl data to use by adding some indexing and putting up a user interface where you could type in a query. Now there are more than 8 million sites (Web characterization project at OCLC). In 1994, Web (HTTP) traffic grew 15 times faster than the Internet itself.
By 1996 search engines were everywhere, and losing ground fast against the growth of the Web (more about that on a later slide). So focused crawling was introduced (Chakrabarti et al.). By 1999 the Web graph was sufficiently large that it became an object of study in and of itself by many computer scientists. And now digital libraries are getting into the act.

6 So, why not write a robot?
You'd think a crawler would be easy to write: pick up the next URL; connect to the server; GET the URL; when the page arrives, extract its links (optionally do other stuff); repeat. So anybody can write a robot, right? Wrong. At first blush, you'd think it would be easy to code up this program and just repeat these steps until you got tired. Let's look at what a typical, basic, large-scale crawler would look like. (Carl: a crawl overlays a tree on top of the Web graph; the Web is a graph that overlays the internet.)
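To make the naive loop concrete, here is a minimal single-threaded sketch of the fetch-extract-repeat cycle in Python. It is not the lecture's crawler: the seed URL, page limit, and error handling are illustrative assumptions, and it deliberately omits politeness, robots.txt, and all the issues the following slides address.

```python
# Minimal single-threaded crawler sketch: fetch a page, extract links, repeat.
# Not polite or robust -- it ignores robots.txt, rate limits, and most errors.
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    frontier = deque([seed])          # URLs discovered but not yet visited
    seen = {seed}                     # URLs already queued (avoid revisits)
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()      # FIFO queue -> breadth-first crawl
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute, _ = urldefrag(urljoin(url, link))   # crude canonicalization
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

if __name__ == "__main__":
    crawl("http://example.com/")      # hypothetical seed URL
```

Even this toy version already hints at why real crawlers are hard: the frontier grows much faster than pages can be fetched, and nothing here keeps the crawler from hammering one server.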

7 The Central Crawler Function
[Diagram: per-server URL queues (Server 1, Server 2, Server 3); for each URL: resolve the URL to an IP address via DNS, connect a socket to the server and send the HTTP request, then wait for the response: an HTML page.]
(Carl: multithreading is essential.) Network latency is probably the most time-consuming part of a crawl. Cover it up by fetching as many pages at a time as you can, but not from the same server! Thus we have the per-server queues illustrated here (I have run with 500 such threads). "Queue" denotes the fact that these are URL queues headed to non-overlapping sets of servers. But this raises another problem: your DNS server might not be able to handle the load, so a large-scale crawler will typically mirror the DNS server or use a stripped-down one. Now suppose a thread has gotten an HTML page. What happens next?
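A hedged sketch of the per-server queue idea described above: URLs are bucketed by host so that each worker thread drains only its own hosts and no server is hit by two threads at once. This is not Mercator's actual data structure; the class, host names, and worker assignment are illustrative.

```python
# Sketch: partition the frontier into per-host queues so each worker thread
# drains a disjoint set of hosts and no server is hammered by two threads.
import threading
from collections import defaultdict, deque
from urllib.parse import urlparse

class PerHostFrontier:
    def __init__(self):
        self.queues = defaultdict(deque)   # host -> queue of URLs for that host
        self.lock = threading.Lock()

    def add(self, url):
        host = urlparse(url).netloc
        with self.lock:
            self.queues[host].append(url)

    def next_for(self, hosts):
        """Pop the next URL belonging to one of this worker's assigned hosts."""
        with self.lock:
            for host in hosts:
                if self.queues[host]:
                    return self.queues[host].popleft()
        return None

# Usage sketch: assign disjoint host sets to worker threads (assignment made up).
frontier = PerHostFrontier()
frontier.add("http://example.com/a")
frontier.add("http://example.org/b")
worker_hosts = [["example.com"], ["example.org"]]
print(frontier.next_for(worker_hosts[0]))
```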

8 Handling the HTTP Response
[Diagram: FETCH leads to "Document seen before?"; if No, process this document: extract text, extract links.]
A lot of pages have to be collected in as short a time as possible, so it is important to detect duplicate pages and reduce redundant fetches. Several methods exist. You can use a page "fingerprint", but this gets thrown off by new borders, different styles, and dynamic content (such as a date or a page counter), although checksums might be good as a first pass. A more robust method is "shingles" or q-grams: take maybe 10 words per q-gram and ask how many q-grams two pages have in common. This can be calculated quickly. Since many Web sites are mirrored, duplicate detection is critical.
Note that this page processing goes on in parallel, because many documents are being downloaded per second; there is actually plenty of time to process a page. You could use multithreading, or non-blocking sockets with event handlers. Network and disk are the limiting factors. Typically the number of worker threads/sockets is allocated in advance.
Processing can be done entirely in parallel unless you are writing to global state (like updating word frequencies). The "document seen before?" check also involves a global update (writing fingerprints to a global store), but that takes little time compared to the computation needed to answer "seen before?", and races don't matter much since you are just adding to the global store. So all of this is nicely suited to parallelism.
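As a rough illustration of the shingling idea mentioned above, here is a small sketch that computes word q-grams for two documents and measures their overlap with a Jaccard-style ratio. The shingle size of 10 words comes from the slide; the sample texts, the q=5 used in the demo, and any similarity threshold you might apply are assumptions.

```python
# Sketch: near-duplicate detection with word shingles (q-grams).
def shingles(text, q=10):
    words = text.split()
    return {" ".join(words[i:i + q]) for i in range(max(1, len(words) - q + 1))}

def resemblance(text_a, text_b, q=10):
    a, b = shingles(text_a, q), shingles(text_b, q)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)    # Jaccard overlap of the shingle sets

# Usage sketch with made-up documents: a high ratio suggests near-duplicates.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank again"
print(resemblance(doc_a, doc_b, q=5))
```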

9 LINK Extraction Finding the links is easy (sequential scan)
Need to clean them up and canonicalize them. Need to filter them. Need to check for robot exclusion. Need to check for duplicates.
Text extraction is so straightforward that it really needs no discussion: you can store the whole content to disk (slow) or analyze it into a TF-IDF term vector on the fly (fast). But there are some issues with regard to link extraction.
Finding the links contained within the HTML page is straightforward (you can use a DOM parser if you JTidy the page first, or just do a linear scan for anchor tags). Canonicalization: throw away fragment-only (#) links; put URLs into a standard form; lowercase the host; resolve ./ and ../ components; add a trailing /, etc. Figure roughly 60 bytes per URL. Then you might want to apply filters: throw away pages from the .fr domain, ignore .com pages, skip URLs higher up in the same domain, and so on. Then come the robot exclusion checks (more on that in a minute): sites might ask robots not to crawl the site, or parts of it.
Once you have a nice standard, desirable URL, you should check whether it has already been processed. This is not always exact, because host names and addresses map many-to-many; it is best to avoid IP mapping and stick to the canonical hostname you get from your DNS. Use a quick hash function on the URL. MD5 is good: it gives 128 bits (16 bytes), which is shorter than the average URL, and you can go lower.
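Here is a hedged sketch of the canonicalize-then-hash step described above, using Python's standard urllib and hashlib. The specific normalization rules are illustrative; a production crawler would apply many more, and the example URL is made up.

```python
# Sketch: canonicalize a discovered link, then fingerprint it for duplicate checks.
import hashlib
from urllib.parse import urljoin, urlparse, urlunparse

def canonicalize(base_url, href):
    absolute = urljoin(base_url, href)          # resolve relative links, ./ and ../
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):   # filter non-HTTP schemes
        return None
    path = parts.path or "/"                    # add a trailing slash for empty paths
    # lowercase the scheme and host, drop the fragment
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(), path,
                       parts.params, parts.query, ""))

def url_fingerprint(url):
    # 128-bit MD5 digest: a compact, fixed-size key for the "seen before?" table
    return hashlib.md5(url.encode("utf-8")).hexdigest()

canonical = canonicalize("http://example.com/docs/", "../about.html#team")
print(canonical)                  # http://example.com/about.html
print(url_fingerprint(canonical))
```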

10 Update the Frontier
[Diagram: FETCH, then PROCESS, then add to FRONTIER (URL1, URL2, URL3, ...).]
Finally, you have a pile of links you want to pursue further, and you add them to the "frontier". This is the single most important shared data structure to get right, because it both grows and shrinks; when the frontier goes empty, the crawl is over. Note that the frontier is initially filled with one or more SEEDS. From a few seeds you can go to hundreds of thousands of URLs on the frontier in just minutes, so the storage structure for the frontier is crucial. The frontier is usually implemented as one or more queues; we'll see later how the queue order affects the crawl itself. And there you have it: the basic industrial-strength crawler like those used by AltaVista and the Internet Archive.
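A hedged sketch of a frontier that ties these pieces together: it is seeded with a few start URLs, hands out the next URL to fetch, uses URL fingerprints to refuse duplicates, and signals the end of the crawl when it runs empty. The class name, seed URLs, and the choice of MD5 fingerprints are illustrative assumptions.

```python
# Sketch: a minimal frontier -- seeded with start URLs, grows as links are found,
# shrinks as pages are fetched, and the crawl ends when it is empty.
import hashlib
from collections import deque

class Frontier:
    def __init__(self, seeds):
        self.queue = deque()
        self.seen = set()            # fingerprints of every URL ever enqueued
        for url in seeds:
            self.add(url)

    def _fingerprint(self, url):
        return hashlib.md5(url.encode("utf-8")).digest()

    def add(self, url):
        fp = self._fingerprint(url)
        if fp not in self.seen:      # never enqueue the same URL twice
            self.seen.add(fp)
            self.queue.append(url)

    def next(self):
        return self.queue.popleft()  # FIFO order -> breadth-first crawl

    def done(self):
        return not self.queue        # empty frontier means the crawl is over

frontier = Frontier(["http://example.com/", "http://example.org/"])  # made-up seeds
while not frontier.done():
    url = frontier.next()
    # fetch(url); for each extracted, canonicalized link: frontier.add(link)
    print("would fetch:", url)
```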

11 Crawler Issues
System considerations; the URL itself; politeness; visit order; robot traps; the hidden web.
We have seen that in writing a basic web crawler, a number of issues have to be addressed. System considerations: from DNS to managing the LOTS of stuff that crawlers keep track of. The URL itself: links are written in many ways (relative, absolute, erroneously), so canonicalize and check for duplicates. Politeness: if you write a crawler that hits the same server again and again, and the server is not yours, you may get lots of complaints from cranky web administrators; also, do obey robot exclusion. Visit order: you can visit links in breadth-first, depth-first, random, or best-first order, depending on how the frontier is queued. FIFO leads to breadth-first, which is good for parallelism; LIFO leads to depth-first, which can be very rude; random is where, instead of a queue, you save the links in a set and choose at random; best-first is the subject of focused crawling, currently a very hot research topic. For search engine purposes there is also the issue of "refresh rate". Robot traps: more on these later. The hidden web is the set of pages that are formed on the fly (e.g. from a database or weather radar) or returned only after filling in a form; in general it is difficult, if not impossible, for crawlers to get at these, but researchers are trying. (Carl: a nice example of a robot trap is an infinitely expanding calendar.)

12 Standard for Robot Exclusion
Martijn Koster (1994). Maintained by the webmaster. Forbids access to pages and directories. Commonly excluded: /cgi-bin/. Adherence is voluntary for the crawler.
It didn't take long before crawlers became a pain, back in 1994; hence the drafting of the robot exclusion standard (still a draft, but very widely used). The file is on the server at a known address, /robots.txt, and is maintained by the webmaster, though it may be constructed from other files sitting in local directories. You can forbid access to certain parts of your web site but allow crawlers to access the rest; for example, you probably want to forbid access to log files. /cgi-bin/ is commonly excluded: you don't want a robot to accidentally vote, for example. You can also say to which "user agents" the restrictions apply, where the user agent is the robot/crawler in question. It is NOT considered nice for a crawler to download and read the robots.txt file for the purpose of finding interesting pages to visit! The User-Agent header in the HTTP request usually tells a server who is visiting.
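For illustration, here is how a polite crawler could check the exclusion rules with Python's standard urllib.robotparser. The robots.txt contents, site name, and user-agent string are made up for the example.

```python
# Sketch: honoring robots.txt with the standard library's parser.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse a made-up robots.txt directly; a real crawler would instead call
# rp.set_url("http://example.com/robots.txt") followed by rp.read().
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /logs/",
])

for url in ["http://example.com/index.html", "http://example.com/cgi-bin/vote"]:
    if rp.can_fetch("MyCrawler/1.0", url):     # illustrative user-agent string
        print("allowed:", url)
    else:
        print("excluded by robots.txt:", url)
```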

13 Visit Order
The frontier. Breadth-first: FIFO queue. Depth-first: LIFO queue. Best-first: priority queue. Random. Refresh rate.
The frontier holds the URLs that have been discovered but not yet visited. The most commonly used order is breadth-first: a broad, shallow crawl in which the first URL placed on the queue is visited next. Last in, first out leads to a depth-first crawl, which typically means you will repeatedly hit the same server; worse yet, you might fall into an infinite sequence of URLs, fetching a page that contains a brand-new URL to the same page, and so on. If you do need to do a depth-first search, it is good to set a limit on the depth.
Best-first is where, based on some objective function, the crawler decides which of the URLs in the frontier it would be best to visit next. This is called a "focused crawl". The choice can be based on whether the parent document (the one containing the link) is interesting, whether the URL is relatively short, or whether the URL is in a specific domain (.edu might be preferred to .com). Another good idea: if a new URL points to a new server, visit that page right away. There are many, many different criteria, implemented with priority queues. Focused crawling is a hot research topic right now.
Random is when the frontier is implemented as a set rather than a queue, and you randomly pick a URL to visit next. Serendipity is one of the advantages of this approach, and you may quickly reach unrelated places on the web. One interesting use of random URLs is the web characterization work at OCLC, where they make up an IP address at random and see whether that site responds on port 80; they use this to estimate the size of the web.
Finally, refresh rate should be mentioned. If you are crawling for a search engine, you need to be concerned with updating the database from time to time: you need to know when pages change, move, come into being, or disappear. This means you recrawl some links at some specified rate. So this is one of the issues you have to decide on when you implement or choose your crawler. See the sketch below for how the visit order falls out of the frontier data structure.
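The sketch below shows the connection between visit order and the frontier data structure: a deque gives FIFO (breadth-first) or LIFO (depth-first), while heapq gives a best-first priority queue. The scoring function is a placeholder assumption, standing in for whatever relevance estimate a focused crawler would actually use.

```python
# Sketch: three frontier disciplines and the visit orders they produce.
import heapq
from collections import deque

# Breadth-first: FIFO queue
fifo = deque(["u1", "u2", "u3"])
print(fifo.popleft())        # u1 -- first in, first out

# Depth-first: LIFO queue (rude; set a depth limit if you do this)
lifo = ["u1", "u2", "u3"]
print(lifo.pop())            # u3 -- last in, first out

# Best-first: priority queue keyed by an estimated "promise" score.
def score(url):
    # Placeholder objective function: prefer .edu pages and short URLs.
    bonus = 1.0 if ".edu/" in url else 0.0
    return bonus - 0.01 * len(url)

best_first = []
for url in ["http://a.com/x", "http://b.edu/y", "http://c.com/a/very/deep/path"]:
    heapq.heappush(best_first, (-score(url), url))   # negate: heapq is a min-heap
print(heapq.heappop(best_first)[1])                  # the most promising URL next
```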

14 Robot Traps Cycles in the Web graph Infinite links on a page
Traps set out by the Webmaster
We mentioned before the problem of robot traps. First, if you don't keep track of what you have visited before in the crawl, you may hit a cycle in the Web graph and repeatedly visit the same set of sites. This not only angers the site administrators, but the crawler gets nowhere very fast. Keeping track of where you've been is the best way to avoid this trap. It is also possible for a page to have a newly generated link that points back to itself; the crawler will think the page has not been visited before and go to it, by which time the page has a brand-new link on it, and so on. Fingerprinting pages is one way to avoid this. You can likewise get into an infinite-link situation if pages with new links are generated on the fly. It is best to limit the depth on any one server, or to limit the number of slashes in the URL, as sketched below.
A slightly different kind of robot trap is one set in place by a Web site administrator, like pages with infinite links to catch crawlers that went into the cgi-bin when they weren't supposed to. Other administrators set up a page on their site that is linked to ONLY by some line in the robots.txt file; if the crawler visits that page, the administrator knows the crawler was reading robots.txt in order to get to forbidden pages. Sites have an "htaccess" file that can contain a list of robots or sites that are not allowed to visit the server, with a redirect to a page explaining why you were banned. (This happened to me at Google...) Finally, it is NOT considered nice for robots to wander around collecting addresses, so some site administrators plant a fake address that will come back to them if a crawler collected it. So, to avoid robot traps: set limits, adhere to the robots.txt file, and just be a nice crawler.
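One concrete defense mentioned above is capping how deep into a single server the crawl will go. Here is a small hedged sketch that filters URLs by path depth (counting slashes) and by a per-host page budget; the limit values are arbitrary assumptions, not recommendations from the lecture.

```python
# Sketch: simple guards against robot traps -- cap URL path depth and
# cap the number of pages fetched from any single host.
from collections import Counter
from urllib.parse import urlparse

MAX_PATH_DEPTH = 8         # assumed limit on slashes in the path
MAX_PAGES_PER_HOST = 1000  # assumed per-server budget

pages_per_host = Counter()

def should_visit(url):
    parts = urlparse(url)
    if parts.path.count("/") > MAX_PATH_DEPTH:
        return False                      # suspiciously deep: likely a trap
    if pages_per_host[parts.netloc] >= MAX_PAGES_PER_HOST:
        return False                      # this server has had its share
    return True

def record_visit(url):
    pages_per_host[urlparse(url).netloc] += 1

print(should_visit("http://example.com/a/b/c"))                          # True
print(should_visit("http://example.com/" + "x/" * 20 + "calendar.html")) # False
```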

15 The Hidden Web Dynamic pages increasing Subscription pages
Username and password pages. Research is in progress on how crawlers can "get into" the hidden web.
More and more of the Web consists of dynamic pages: ordering forms, weather maps, statistics generated on the fly, output of cgi-bin scripts, database results. All of these kinds of pages are in principle unavailable to a crawler. Current crawler technology only deals with static pages, and static pages are a dwindling fraction of the web (however, the web is growing so fast that the set of static pages is also growing at a respectable rate). The other parts of the web hidden from crawlers are subscription pages (such as the Communications of the ACM) and password-protected pages (such as the New York Times). There is research going on to figure out how crawlers can get into at least parts of the hidden web. Reference: Raghavan and Garcia-Molina, "Crawling the Hidden Web".

16 MERCATOR
So we at Cornell have been using an industrial-strength crawler called Mercator. It was originally put together at DEC's Systems Research Center, then developed further by Compaq, and now belongs to HP (if they know it). It continues to run on a couple of Alpha processors in a closed room in Palo Alto. The diagram illustrates a single Crawl object; the frontier belongs to the crawl object. The rest of the loop is run by hundreds of threads, each thread taking one trip around the loop. We access it via SSH, which is good because it keeps the load off our network. A couple of Mercator features make it particularly useful for research. Note especially the sequence of analyzers that can be run on each page; you can plug new ones in and take old ones out. A link extractor would be one of the analyzers.

17 Mercator Features One file configures a crawl Written in Java
Can add your own code: extend one or more of Mercator's base classes, or add totally new classes called by your own. Industrial-strength crawler: uses its own DNS and java.net package.
Mercator demands only one input in order to start running: either (1) the name of a configuration file (which includes the starting seed set), or (2) the name of an existing directory to which an existing crawl was last checkpointed. So you can run one crawl for days, and you can change some of its parameters as it goes. A very nice setup. The other nice thing is that it is written in Java: just extend one of its base analyzer classes to plug in your own code. I haven't had to touch a piece of their code. You can also add your own routines called by your analyzers; we exploited this in a student project last fall.
Mercator is very extensible. It has a number of abstract classes that can be implemented to provide specific services, and it comes with some existing implementations, such as a link extractor. When you extend the basic analyzer class you are given two main things, including a "docbundle" for the downloaded page, which contains things like the URL and the content of the page itself. The content need not be HTML; Mercator can download anything. Data structures in Mercator are squeezed down to their minimum size, and disk space is used extensively; Mercator fortunately runs on a machine with fast disks.
You really need to fetch pages in parallel for large coverage, because you spend most of your time waiting for a page to download. Mercator has intrinsic support for multiple threads, and it doesn't get in the way of your own custom analyzers. One thread per server avoids the problem, seen in many multi-threaded crawlers, of one server getting hit by two threads at once. So each thread has its own queue of frontier URLs and handles maybe 3 different servers, but again, this is configurable. Earlier versions of Java had synchronized host name lookup (in the java.net package), which is very bad if you are running 500 threads at once because it becomes a bottleneck, so the Mercator folks wrote their own java.net implementation. The next thing that happened is they crashed their company's DNS server! So they implemented their own stripped-down DNS, their own version of named, rather than using Berkeley's BIND right out of the box.

18 The Web is a BIG Graph “Diameter” of the Web
Cannot crawl even the static part completely. New technology: the focused crawl.
Ever since the early crawl days, it has been fashionable to measure the "size of the web", and when that became infeasible, just the "diameter" of the (static) Web. It has been amusing to follow. Adamic (1999) said the average surfer needs to go through only 4 links when going from one topic to the next (different from looking at the maximum). Albert et al. in 1999 said it was 18 at most. Then for some time the diameter was thought to be 23. But a large Web crawl in 2000 revealed that the diameter is AT LEAST 28. I suspect it is no longer measurable. Kumar et al.'s results are plausible: that for MOST pairs of pages there is no path from one to the other. That might be hard to verify, and co-citation might be a way to get from one page to the other. Search engines are stuck with having to do a comprehensive crawl. But what if you were to focus your crawl? The idea is to crawl only the areas of interest and to get to desired pages more directly, as illustrated in the following diagram. (Carl: paper in Nature by Giles and Lawrence.)

19 Crawling and Crawlers Web overlays the internet
A crawl overlays the web. [Diagram: a crawl tree growing from a seed page.]
A crawl is a traversal of the web graph such that no page is visited twice (in theory). Thus any particular crawl can be visualized as a tree layered on the web. Crawlers are usually restricted to the HTTP protocol, so the nodes are web pages (HTML) and the edges are the links (HREFs) from a parent page to child pages, but more variations are possible. The traversal of the tree is usually breadth-first, to avoid robot traps among other things; breadth-first means first in, first out. If you want to build a web graph, you must keep track of all the links you find, even if you don't visit them.

20 Focused Crawling The Web is a BIG Graph
"Diameter" of the Web: growing. Cannot crawl even the static part completely. New technology: the focused crawl.
Lawrence and Giles in 1998 pointed out that the Web was quickly outgrowing search engine indexing (coverage figures under 20%). Since the search engines need to be ready for any query, their crawl needs to be broad and general. [In the diagram, the arrows show the Web rapidly expanding; the bottom portion is what is crawled and cached for indexing by search engine crawlers; the right portion is a focused crawl that covers just what you want: theoretically the same percentage and effort, but more choosy. No .com, for example.]

21 Focused Crawling
[Diagram: two crawl trees over the same numbered pages, labeled "Breadth-first crawl" (left) and "Focused crawl" (right); relevant nodes are marked R and pruned branches are marked X.]
Say we crawl the web to find materials on Chebyshev Polynomials. We do not want ALL of the web (unlike traditional crawling); we skip irrelevant parts. Here is a simple picture. Recall that the web crawl is a tree, because we do not revisit nodes. Now we add relevance (R). The point of a focused crawl is that it is more efficient at finding what you want. A normal crawl is shown on the left; suppose node 7 is about our topic. This is a relevance judgement. There is no such thing as "relevance" for a search engine crawler: it has to find and index everything (although focused search engines have recently been developed). On the right is a focused crawl; note the improvement in efficiency. So that's enough on crawling and focused crawling and the relationship between crawling and search engines. Now, on to collection synthesis: how do we get a cluster of related items?

22 Focused Crawling Recall the cartoon for a focused crawl:
A simple way to do it is with 2 "knobs". [Diagram: the focused-crawl cartoon again, with relevant nodes marked R and pruned branches marked X.]
Time to move on to our third general topic, focused crawling. Next class Prof. Lagoze will talk about collection building, based on using a focused crawl. But as a sort of preview, here is one way to keep a crawl focused on the parts of the Web you are interested in. The idea of a focused crawl is to cut off the paths that probably won't work and preserve the paths that probably will. We have a simple way of doing this.

23 Focusing the Crawl
Threshold: a page is on-topic if its correlation to the closest centroid is above this value. Cutoff: follow links from pages whose "distance" from the closest on-topic ancestor is less than this value.
The documents to be downloaded are determined by what is on the frontier, which can be prioritized. With the Mercator crawler, the centroids, and the dictionary, we can proceed to do a focused crawl. It is efficient because we can skip some unprofitable paths through the web, but we might miss some "nuggets". So we have two knobs: a threshold and a cutoff.
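A hedged sketch of the two knobs: each fetched page gets a relevance score (here a cosine correlation against a topic centroid, with made-up term vectors), and links are followed only while the distance from the closest on-topic ancestor stays within the cutoff. The threshold and cutoff values below are taken from the example on the next slide; everything else, including the exact scoring and the decision interface, is an assumption rather than the project's actual code.

```python
# Sketch: the threshold/cutoff logic of a focused crawl.
import math

THRESHOLD = 0.3   # page is on-topic if cosine correlation >= this (from the slides)
CUTOFF = 2        # how far past off-topic pages we are willing to tunnel (from the slides)

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def decide(page_vector, centroid, parent_distance):
    """Return (is_on_topic, follow_links, distance to pass to child pages)."""
    on_topic = cosine(page_vector, centroid) >= THRESHOLD
    distance = 0 if on_topic else parent_distance + 1   # hops since on-topic ancestor
    follow = distance <= CUTOFF
    return on_topic, follow, distance

# Illustrative term-frequency vectors (made up):
centroid = {"chebyshev": 0.8, "polynomial": 0.6}
page = {"chebyshev": 1.0, "recurrence": 0.5}
print(decide(page, centroid, parent_distance=0))
```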

24 Illustration
[Diagram: a crawl tree with numbered nodes, showing Corr >= threshold and Cutoff = 1; branches are marked X once the cutoff is exceeded.]
Cutoff is a value y = 0, 1, 2, ... that says how far past a "dud" you are willing to crawl before applying the X. At node 3 we are "tunneling through trash to get to the nugget". A cutoff of 0 means you crawl ONLY from pages whose cosine correlation with the closest centroid is at least as great as the threshold. How well did this work? At threshold = 0.3 and cutoff = 2, this is what a crawl looks like, given 26 centroids in math.

25 Closest / Furthest
[Graph: maximum, average, and minimum correlation of the top 50 documents per class, averaged over 26 classes, as a function of crawl length.]
We can continue the focused crawl for as long as we want; the collections change with time. This graph gives a feeling for the rate at which collections are assembled. As we download more documents, we keep the top 50 (by correlation value) for each class. For each class we record the maximum, average, and minimum correlation value, then average across all 26 classes to get the points for the plot; that is, this is not the max of the maxes, but the average of the maxes. The averaged values are plotted as a function of how long we run the crawl. The main reason to crawl longer is to get rid of the lower-correlating pages; the maximum is achieved quickly. This is equivalent to shrinking the ball around the centroid by crawling for a longer and longer time. Whether or not there is value in crawling longer depends on whether the correlation value is an accurate predictor of relevance; for this we use a precision measure (next slide).

26 Fall 2002 Student Project
[Diagram: pipeline components: Query, Centroid synthesis, Centroids and Dictionary, Mercator, Term vectors, Collection URLs, Collection Description, HTML output; example topic: Chebyshev Polynomials.]
A project has just been completed (or almost completed) to tie all these concepts together into a well-packaged Java system that can hook on to Mercator, but potentially to other crawlers as well. We don't have a good name for it yet, but here is what it looks like. Here we see just one collection represented, but in fact we had 31 topics in astronomy, from "astronomy comets hale-bopp" to "astronomy meteors torino effect". There is a general search engine class that can be implemented for many search engines in the future. The output is intended to be a top-level HTML page that describes the whole astronomy collection, with links to an HTML page for each sub-collection. Natural language processing code can be inserted here; this is also where we would eventually like to hook on automatic Dublin Core metadata generation.
All this can be done on the fly in Palo Alto by having Mercator call the collection phase, passing it downloaded documents one by one. The collection phase passes back "instructions" concerning the page: is it a nugget? Should links be followed from this page? (Eventually, which links?) The query, centroid, and description synthesis are one-time steps: query and centroid synthesis happen when Mercator instantiates this code, and description synthesis happens when Mercator terminates the crawl (time is up, or whatever). Each package shown here also imports and exports XML, so you could run the first two phases, keep the XML for the centroids, and then start or resume a crawl from that. (The centroids and dictionary don't change during an extended crawl; the collections do.)

27 Conclusion
We covered crawling: history, technology, deployment.
Focused crawling with tunneling.
We have a good experimental setup for exploring automatic collection synthesis.





