Efficient, Automatic Web Resource Harvesting Michael L. Nelson, Joan A. Smith and Ignacio Garcia del Campo Old Dominion University Computer Science Dept.


1 Efficient, Automatic Web Resource Harvesting Michael L. Nelson, Joan A. Smith and Ignacio Garcia del Campo, Old Dominion University, Computer Science Dept., Norfolk VA 23529 USA {mln, jsmit, dgarcia}@cs.odu.edu; Herbert Van de Sompel and Xiaoming Liu, Los Alamos National Laboratory, Research Library, Los Alamos NM 87545 USA {herbertv, liu x}@lanl.gov

2 Presentation Overview: Introduction; OAI-PMH; Complex Object Formats as Metadata; mod_oai; Demo; Quantitative Evaluation; Representation Problem; Counting Problem; Sitemaps; Conclusion; References

3 Introduction What is a web crawler? A program or automated script that browses the World Wide Web in a methodical, automated manner. What makes web crawling difficult? Large volume, a fast rate of change, and dynamic page generation.

4 Crawling Difficulties The large volume implies that the crawler can only download a fraction of the web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

5 Problems Two problems are associated with conventional web crawling techniques: 1. Counting Problem: a crawler cannot know whether all resources at a web site have been discovered and crawled. 2. Representation Problem: the human-readable format of a resource is not always suitable for machine processing.

6 mod_oai: A Solution for Counting and Representation Solution: an Apache module, mod_oai, implements OAI-PMH + MPEG-21 DIDL. OAI-PMH: count everything (linked or not) using the "List" verbs. MPEG-21 DIDL: capture everything using a complex-object format and automated metadata extraction.

7 OAI-PMH Web servers do not have the capability to answer questions of the form "what resources do you have?" and "what resources have changed since 2004-12-27?" Using OAI-PMH, a web crawler can quickly get an update on the latest changes to a site: it requests only resources that are new or have changed since its last visit, and can therefore restrict its crawls.
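A selective-harvest request like the one described here is just an HTTP GET with a verb and a datestamp. A minimal sketch of building such a URL (the base URL and datestamp are illustrative, borrowed from the slides' examples):

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, **args):
    """Build an OAI-PMH request URL from a verb and optional arguments."""
    query = urlencode({"verb": verb, **args})
    return f"{base_url}?{query}"

# Ask only for records changed since the crawler's last visit.
# ("from" is a Python keyword, so it is passed via an unpacked dict.)
url = oai_request_url("http://www.foo.edu/modoai", "ListRecords",
                      metadataPrefix="oai_didl", **{"from": "2004-12-27"})
print(url)
```

The crawler would then issue this URL and follow resumptionTokens until the list is exhausted.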

8 OAI-PMH Data Model (figure) A resource is modeled as an item; the OAI-PMH identifier is the entry point to all records pertaining to the resource. Each record exposes the item in one metadata format, from the simple Dublin Core model, through the more expressive MARCXML, to the complex-object METS and MPEG-21 DIDL models.

9 OAI-PMH Queries return records containing metadata. The verbs Identify, ListMetadataFormats, and ListSets help a harvester understand the nature of the repository; ListIdentifiers, ListRecords, and GetRecord are used for the actual harvesting of the metadata.

10 OAI-PMH The powerful feature of OAI-PMH is that it can support any metadata format defined by an XML schema. But in most cases we are interested in transmitting the actual resource, not just the metadata.

11 Complex Object Formats as Metadata To enable resource harvesting, we use XML-based complex object formats. The Dublin Core metadata format is simple, but its flat structure cannot represent complex objects. Hence we use the MPEG-21 DIDL (Digital Item Declaration Language).

12 Digital Item A Digital Item is a combination of: Resources (such as videos, audio tracks, images, etc.), Metadata (such as descriptors, identifiers, etc.), and Structure (describing the relationships between resources).

13 Data Model A Container is a grouping of Containers and/or Items. An Item is a grouping of Items and/or Components. A Component is a grouping of Resources. Multiple Resources in the same Component are considered equivalent and consequently an agent may use any one of them. A Resource is an individual datastream. A Descriptor conveys secondary information pertaining to a Container, an Item, or a Component.
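The containment rules above can be sketched with a few small Python classes (illustrative only; the names mirror the DIDL data model, not any particular library):

```python
from dataclasses import dataclass, field

@dataclass
class Resource:          # an individual datastream
    uri: str

@dataclass
class Component:         # groups equivalent Resources
    resources: list = field(default_factory=list)

    def any_resource(self):
        # Resources in one Component are equivalent, so an agent may use any one.
        return self.resources[0] if self.resources else None

@dataclass
class Item:              # groups Items and/or Components
    components: list = field(default_factory=list)
    items: list = field(default_factory=list)

@dataclass
class Container:         # groups Containers and/or Items
    containers: list = field(default_factory=list)
    items: list = field(default_factory=list)

# Two equivalent copies of the same image, wrapped in one Component.
comp = Component([Resource("http://a.example/img.png"),
                  Resource("http://b.example/img.png")])
item = Item(components=[comp])
print(item.components[0].any_resource().uri)
```

Descriptors (secondary information on a Container, Item, or Component) are omitted here for brevity.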

14 XML View (figure: a sample MPEG-21 DIDL XML document)

15 (image-only slide)

16 mod_oai mod_oai began as a research project at ODU. It is an Apache module that responds to OAI-PMH requests on behalf of a web server. Goal: to bring the efficiency of OAI-PMH to everyday web sites. If Apache and mod_oai are installed at http://www.foo.edu/, then the baseURL for OAI-PMH requests is http://www.foo.edu/modoai.

17 mod_oai View of the OAI-PMH Data Model OAI-PMH identifier: the URL of the resource serves as the OAI-PMH identifier. OAI-PMH datestamp: the modification time of the resource is used as the OAI-PMH datestamp for all 3 metadata formats. OAI-PMH sets: a set organization is introduced based on the MIME type of the resource.

18 Supported Metadata Formats oai_dc: Dublin Core is supported as mandated; only technical metadata that can be derived from http header information is included. http_header: contains all http response headers that would be returned if the web resource were obtained by means of an http GET. oai_didl: introduced to allow harvesting of the resource itself; the web resource is represented by means of an XML wrapper document compliant with MPEG-21 DIDL.
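The core idea of oai_didl, embedding the resource itself (base64-encoded) inside an XML wrapper, can be sketched as follows. The element and attribute names here are illustrative stand-ins, not the actual MPEG-21 DIDL schema that mod_oai emits:

```python
import base64

def wrap_resource(url, data, mime_type):
    """Base64-encode a resource and embed it in a minimal XML wrapper.
    Hypothetical element/attribute names; real mod_oai uses MPEG-21 DIDL."""
    b64 = base64.b64encode(data).decode("ascii")
    return (f'<Resource mimeType="{mime_type}" encoding="base64" '
            f'ref="{url}">{b64}</Resource>')

xml = wrap_resource("http://www.foo.edu/robots.txt",
                    b"User-agent: *", "text/plain")
print(xml)
```

Because the payload is base64-encoded, any byte stream (images, PDFs, binaries) survives transport inside the XML response.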

19 Structural View

20 Does the server at http://www.foo.edu/ support mod_oai? Request: http://www.foo.edu/modoai?verb=Identify. If the response is a valid OAI-PMH Identify response, the server supports mod_oai.
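The crawler-side check can be sketched in Python. This version validates a canned response string; a real check would first issue the HTTP GET to the verb=Identify URL:

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def is_valid_identify(xml_text):
    """Return True if the text parses as an OAI-PMH Identify response."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return (root.tag == OAI_NS + "OAI-PMH"
            and root.find(OAI_NS + "Identify") is not None)

# A trimmed-down example response (repositoryName value is illustrative).
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-01-01T00:00:00Z</responseDate>
  <Identify><repositoryName>foo</repositoryName></Identify>
</OAI-PMH>"""
print(is_valid_identify(sample))
```

An ordinary HTML error page, or any non-OAI-PMH XML, fails the check, so this distinguishes a mod_oai-enabled server from one that merely returns 200 for unknown paths.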

21 Demo http://whiskey.cs.odu.edu/

22 Quantitative Evaluation To examine the performance of mod_oai, the authors compared OAI-PMH harvesting using the OCLC Harvester with the wget web crawling utility. http://www.cs.odu.edu served as the testbed. Overall, the testbed included 5268 files using 292 MB of disk space. The server was at ODU and the client was at LANL. User files, data files, and mail files were excluded from the testbed.

23 Experiments The authors performed two experiments, crawling the 5268-file testbed with wget from two different seeds:

                              seed: index.html    seed: "find . -type f"
  # of files in baseline             709                  5739
  # of files in update (25%)         114                  1318

24 Experiments contd. Reason for the difference: only a portion of the valid URLs at a site are linked from web pages hosted on that site. Using the "find" seed, wget downloads more URLs (5739) than there are files (5268), because it finds additional URLs that the seed points to, including directories and broken links.

25 Experiments contd. Using the seed generated with "find", the authors baselined both wget and mod_oai: all file modification dates were set to "2000-01-01". In the second test, the modification dates of 25% of the files were changed to "2002-01-01", simulating the expected monthly update rate of ".edu" sites. Request types: ListRecords, ListIdentifiers. From values: 1900-01-01, 2001-01-01. Apache was restarted after each round of harvesting.
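The baselining procedure can be sketched as follows (illustrative; the authors' actual scripts are not shown in the slides):

```python
import calendar
import os
import random
import time

def set_mtime(path, date_str):
    """Set a file's modification time to midnight UTC on the given date."""
    t = calendar.timegm(time.strptime(date_str, "%Y-%m-%d"))
    os.utime(path, (t, t))

def baseline_and_update(paths, fraction=0.25):
    """Baseline every file, then 'touch' a random fraction to simulate edits."""
    for p in paths:                      # baseline: everything looks unchanged
        set_mtime(p, "2000-01-01")
    updated = random.sample(paths, int(len(paths) * fraction))
    for p in updated:                    # simulate a month of updates
        set_mtime(p, "2002-01-01")
    return updated
```

With this setup, a harvest using from=1900-01-01 sees every file, while from=2001-01-01 sees only the simulated 25% of updates.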

26 Comparison of Crawling Performance (figure a: baseline wget & mod_oai)

27 After 25% File Updates (figure b: wget & mod_oai after 25% file updates)

28 Results Surprisingly, wget takes more time when accessing only the updated files. The Apache log file shows that wget uses both the http HEAD and GET methods to check modification times; thus, in checking for updates, wget issues more http requests (5739 HEAD + 1318 GET).

29 Testing the Performance of mod_oai Using Resumption Tokens (figure: impact of resumptionToken size)

30 Results The performance of mod_oai was tested using resumption tokens. Q: why does the curve level off? ListRecords returns the base64-encoded file, while ListIdentifiers returns just the resource identifiers. The bottom line: ListRecords and ListIdentifiers should use different resumptionToken sizes.

31 Discussion and Future Work Issues associated with mod_oai: 1. Counting Problem: what constitutes a complete list of the site's crawl-able resources? 2. Representation Problem: the resource as sent to a browser is not necessarily an optimal representation for the crawler. mod_oai can help solve both problems.

32 The Representation Problem What is that page all about? mod_oai does not generate descriptive metadata; the authors are not positioning mod_oai as a replacement for existing repository systems with extensive descriptive metadata. Rather, the paper aims to improve the efficiency of web crawlers. A plug-in architecture for mod_oai could provide: rules to extract descriptive metadata; the ability to integrate 3rd-party metadata extraction tools; creation of complex data items for long-term archiving.

33 The Counting Problem The problem of determining how many files are on a web server: we need to find all resources at a web site. OAI-PMH applications are deterministic; web harvesting is not. Apache maps URLs to files (U -> F), and mod_oai maps files to URLs (F -> U); neither function is one-to-one or onto. Apache can "cover up" legitimate files. Consider an httpd.conf file with these directives:

  Alias /A /usr/local/web/htdocs/B
  Alias /B /usr/local/web/htdocs/A

A user or crawler requesting http://www.foo.edu/B will actually receive the resource from htdocs/A, and vice versa. mod_oai, on the other hand, is unaware of the alias and returns metadata from htdocs/B as if the directive did not exist.
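The crossed Alias directives can be modeled as a tiny URL-to-file lookup to show why the URL-to-file mapping is not invertible. A sketch, using the paths from the httpd.conf example above (real Apache alias resolution has more rules than this):

```python
# Apache consults the Alias table before falling back to the document root.
ALIASES = {"/A": "/usr/local/web/htdocs/B",
           "/B": "/usr/local/web/htdocs/A"}
DOCROOT = "/usr/local/web/htdocs"

def url_to_file(url_path):
    """Resolve a URL path to a filesystem path, honoring Alias directives."""
    for prefix, target in ALIASES.items():
        if url_path.startswith(prefix):
            return target + url_path[len(prefix):]
    return DOCROOT + url_path

# Apache: a request for /B is served from htdocs/A ...
print(url_to_file("/B/page.html"))
# ... while mod_oai, walking the filesystem, would still report htdocs/B as /B.
```

Because two different URLs can reach the same file, and the same file can be reachable under multiple URLs, neither a URL walk nor a filesystem walk alone yields a complete, correct inventory.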

34 Security "Security through obscurity"? mod_oai will not export any files that are not accessible through normal http access. How does mod_oai handle protected files? They are included only if the credentials in the current http connection are sufficient to meet the requirements specified in the .htaccess file. As a result, harvesters with different credentials will see a different number of records for the same server.

35 Hidden Files mod_oai will not advertise any file that the request does not have the credentials to retrieve. The file system may also contain files that Apache itself cannot read; to preserve OAI-PMH semantics, mod_oai will not include such files in responses.

36 Sitemap XML Format: the XML schema for the Sitemap protocol (Google Sitemaps)

37 Basics Sitemap: 1) a site map (or sitemap) is a graphical representation of the architecture of a web site; 2) the Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. Benefits of using Sitemaps: useful where crawlers cannot access all areas of a website; improved comprehensiveness; helps freshness by notifying the search engine of changes; identifies unchanged pages to 1) prevent unnecessary crawling and 2) save search bandwidth. A Sitemap is a supplementary tool for search engines, NOT a replacement for existing crawling mechanisms.

38 Sample XML Sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

39 Sample XML Sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
    <lastmod>2004-12-23</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
    <lastmod>2004-12-23T18:00:15+00:00</lastmod>
    <priority>0.3</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
    <lastmod>2004-11-23</lastmod>
  </url>
</urlset>

40 Sitemap Index Files Each Sitemap file may contain no more than 50,000 URLs and be no larger than 10 MB. Larger sites should use multiple Sitemap files and list each one in a Sitemap index file: an XML file that lists the multiple XML Sitemap files.
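The 50,000-URL limit suggests a simple chunking step when generating Sitemaps. A sketch (the helper names are hypothetical; real generators also enforce the 10 MB size limit, not modeled here):

```python
def chunk_urls(urls, max_per_sitemap=50000):
    """Split a URL list into groups small enough for one Sitemap file each."""
    return [urls[i:i + max_per_sitemap]
            for i in range(0, len(urls), max_per_sitemap)]

def index_entries(base, n):
    """Names for the Sitemap files a Sitemap index would list."""
    return [f"{base}/sitemap{i + 1}.xml.gz" for i in range(n)]

chunks = chunk_urls([f"http://www.example.com/page{i}" for i in range(120000)])
print(len(chunks), [len(c) for c in chunks])   # 3 [50000, 50000, 20000]
print(index_entries("http://www.example.com", len(chunks)))
```

Each chunk becomes one Sitemap file; the index file then lists the generated file names, as in the sample on the next slide.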

41 Sample XML Sitemap Index File Listing two Sitemaps:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2005-01-01</lastmod>
  </sitemap>
</sitemapindex>

42 Sitemap Index Tags

  <sitemapindex> (required): encapsulates information about all of the Sitemaps in the file.
  <sitemap> (required): encapsulates information about an individual Sitemap.
  <loc> (required): identifies the location of the Sitemap; this location can be a Sitemap, an Atom file, an RSS file, or a simple text file.
  <lastmod> (optional): identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. By providing the last-modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index, i.e. a crawler may retrieve only Sitemaps that were modified since a certain date.

43 Sitemap File Location The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap at http://example.com/catalog/sitemap.xml covers URLs under http://example.com/catalog/ but not under http://example.com/images/; having permission to change http://example.org/path/sitemap.xml implies scope over http://example.org/path/. For http://example.com/catalog/sitemap.xml, valid URLs include http://example.com/catalog/show?item=23 and http://example.com/catalog/show?item=233&user=3453; invalid URLs include http://example.com/image/show?item=23, http://example.com/image/show?item=233&user=3453, and https://example.com/catalog/page1.php.
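The scoping rule can be sketched as a prefix check against the Sitemap's own location (a simplification; the full protocol also allows cross-submission arrangements not shown here):

```python
from urllib.parse import urlsplit

def in_sitemap_scope(sitemap_url, candidate):
    """A URL is in scope if it shares the Sitemap's scheme and host and
    lives at or below the directory containing the Sitemap file."""
    s, c = urlsplit(sitemap_url), urlsplit(candidate)
    base_dir = s.path.rsplit("/", 1)[0] + "/"
    return (s.scheme, s.netloc) == (c.scheme, c.netloc) and \
        c.path.startswith(base_dir)

sm = "http://example.com/catalog/sitemap.xml"
print(in_sitemap_scope(sm, "http://example.com/catalog/show?item=23"))   # True
print(in_sitemap_scope(sm, "http://example.com/image/show?item=23"))     # False
print(in_sitemap_scope(sm, "https://example.com/catalog/page1.php"))     # False
```

Note the last case: even the same host and path fail the check when the scheme differs (https vs http), matching the invalid examples above.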

44 Extending the Sitemaps Protocol You can extend the Sitemaps protocol using your own namespace; simply declare the namespace in the root element. For example:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.x"
        xmlns:example="http://www.example.com/schemas/example_schema">
  ...
</urlset>

45 mod_oai vs Google Sitemaps The authors executed "sitemap-gen.py" in mode 3 to generate URLs. The Sitemap script generated 5843 URLs, which did not match mod_oai's results. Sitemaps behaved overly optimistically: it returned some URLs that did not even exist, and URLs that were protected by a .htaccess file.

46 mod_oai vs Google Sitemaps Google's Sitemap is designed to be an extremely lightweight mechanism for informing Google's web crawlers of new and updated URLs at a web site, similar to using mod_oai with only ListIdentifiers and no datestamps, sets, or metadata formats. There is a trade-off between the dynamic access of mod_oai and the static access of a Sitemap: a mod_oai response is always up to date, but at a computational cost; a Sitemap is only as up to date as the local refresh policy, but crawler accesses to the Sitemap impose no computational cost on the server.

47 Conclusion mod_oai is a module that combines complex-object metadata formats with OAI-PMH for efficient web resource harvesting. mod_oai can be used in resource discovery and preservation. Experiments reveal that the performance of mod_oai is comparable to that of wget in baseline harvests, and that it outperforms wget when file updates are considered.

48 References
Efficient, Automatic Web Resource Harvesting: http://www.cs.odu.edu/~mln/pubs/widm-2006/modoai-widm06.pdf
Sitemap Protocol: http://www.sitemaps.org/protocol.php
OAI-PMH Data Model: http://www.cs.odu.edu/~mln/presentations.html

