Efficient, Automatic Web Resource Harvesting Michael L. Nelson, Joan A. Smith and Ignacio Garcia del Campo Old Dominion University Computer Science Dept.

Efficient, Automatic Web Resource Harvesting Michael L. Nelson, Joan A. Smith and Ignacio Garcia del Campo Old Dominion University Computer Science Dept Norfolk VA USA {mln, jsmit, Herbert Van de Sompel and Xiaoming Liu Los Alamos National Laboratory Research Library Los Alamos NM USA {herbertv, liu

Presentation Overview Introduction, OAI-PMH, Complex Data Formats as Metadata, mod_oai, Demo, Quantitative Evaluation, Representation Problem, Counting Problem, Sitemaps, Conclusion, References

Introduction What is a web crawler? A program or automated script that browses the World Wide Web in a methodical, automated manner. What makes web crawling difficult? (1) Large volume, (2) fast rate of change, and (3) dynamic page generation.

Crawling Difficulties The large volume implies that the crawler can only download a fraction of the web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

Problems Two problems are associated with conventional web crawling techniques: 1. Counting Problem: A crawler cannot know whether all resources at a web site have been discovered and crawled. 2. Representation Problem: The human-readable format of a resource is not always suitable for machine processing.

Mod_oai A Solution For Counting And Representation Solution: an Apache module, mod_oai, that implements OAI-PMH + MPEG-21 DIDL. OAI-PMH: count everything (linked or not) using the “List” verbs. MPEG-21 DIDL: capture everything using a complex-object format and automated metadata extraction.

OAI-PMH Web servers do not have the capability to answer questions of the form “what resources do you have?” and “what resources have changed since ?” Using OAI-PMH, a web crawler can quickly get an update on the latest changes to a site: it requests only the resources that are new or have changed since its last visit, and can therefore restrict its crawls.
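The incremental request described on this slide can be sketched as a small URL builder. This is only an illustration: the base URL, the helper name `build_request` and the date are assumptions, not values from the presentation; `verb`, `metadataPrefix` and `from` are standard OAI-PMH request arguments.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.example.org/modoai"  # hypothetical baseURL, not from the slides


def build_request(base_url, verb, **args):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    for key, value in args.items():
        # 'from' is a Python keyword, so callers pass it as 'from_'.
        params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


# Ask only for identifiers of resources changed since an (assumed) last visit.
url = build_request(
    BASE_URL,
    "ListIdentifiers",
    metadataPrefix="oai_dc",
    from_="2006-01-01",  # hypothetical date of the previous harvest
)
print(url)
```

A crawler would store the date of each harvest and pass it as `from` on the next visit, so that unchanged resources are never listed again.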

OAI-PMH Data Model [Figure: a resource is exposed as an item; the OAI-PMH identifier is the entry point to all records pertaining to the resource. Records shown: Dublin Core (simple model), MARCXML (more expressive model), MPEG-21 DIDL (complex model) and METS (complex model). A record carries either metadata pertaining to the resource or a modeled representation of the resource.]

OAI-PMH Queries return records containing metadata. The verbs Identify, ListMetadataFormats and ListSets help a harvester understand the nature of the repository. ListIdentifiers, ListRecords and GetRecord are used for the actual harvesting of the metadata.
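As a rough illustration of what a harvester does with a ListIdentifiers response, the sketch below parses record headers from a trimmed sample document. The sample identifiers, datestamps and set names are made up, and a real response also carries `responseDate` and `request` elements; only the OAI-PMH 2.0 namespace and element names are taken from the protocol.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed, hypothetical response; values are illustrative only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.example.org/index.html</identifier>
      <datestamp>2006-01-15</datestamp>
      <setSpec>mime:text:html</setSpec>
    </header>
    <header>
      <identifier>http://www.example.org/logo.png</identifier>
      <datestamp>2006-02-01</datestamp>
      <setSpec>mime:image:png</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def parse_headers(xml_text):
    """Return (identifier, datestamp) pairs for every record header."""
    root = ET.fromstring(xml_text)
    headers = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        headers.append((identifier, datestamp))
    return headers


headers = parse_headers(SAMPLE_RESPONSE)
print(headers)
```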

OAI-PMH A powerful feature of OAI-PMH is that it can support any metadata format defined by an XML schema. But in most cases we are interested in transmitting the actual resource and not just the metadata.

Complex Object Formats As Metadata To enable resource harvesting, we use XML-based complex object formats. The Dublin Core metadata format is simple, but its flat structure cannot represent complex objects. Hence we use DIDL (Digital Item Declaration Language).

Digital Item A Digital Item is a combination of: resources (such as videos, audio tracks, images, etc.), metadata (such as descriptors, identifiers, etc.), and structure (describing the relationships between resources).

Data Model A Container is a grouping of Containers and/or Items. An Item is a grouping of Items and/or Components. A Component is a grouping of Resources. Multiple Resources in the same Component are considered equivalent and consequently an agent may use any one of them. A Resource is an individual datastream. A Descriptor conveys secondary information pertaining to a Container, an Item, or a Component.
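The containment rules above can be mirrored in a few Python dataclasses. This is a simplified sketch of the abstract model only, not the DIDL XML syntax; the field names, the `all_resources` helper and the example URLs are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Descriptor:
    statement: str  # secondary information about the parent entity


@dataclass
class Resource:
    mime_type: str
    ref: str  # location of the individual datastream


@dataclass
class Component:
    # Multiple resources in one component are considered equivalent.
    resources: List[Resource] = field(default_factory=list)
    descriptors: List[Descriptor] = field(default_factory=list)


@dataclass
class Item:
    items: List["Item"] = field(default_factory=list)
    components: List[Component] = field(default_factory=list)
    descriptors: List[Descriptor] = field(default_factory=list)


@dataclass
class Container:
    containers: List["Container"] = field(default_factory=list)
    items: List[Item] = field(default_factory=list)
    descriptors: List[Descriptor] = field(default_factory=list)


def all_resources(node):
    """Collect every Resource below a Container or Item."""
    found = []
    if isinstance(node, Container):
        for child in node.containers + node.items:
            found.extend(all_resources(child))
    elif isinstance(node, Item):
        for child in node.items:
            found.extend(all_resources(child))
        for component in node.components:
            found.extend(component.resources)
    return found


# Hypothetical site: one HTML page, plus one image offered in two equivalent encodings.
page = Item(components=[Component([Resource("text/html", "http://www.example.org/index.html")])])
image = Item(components=[Component([
    Resource("image/png", "http://www.example.org/logo.png"),
    Resource("image/jpeg", "http://www.example.org/logo.jpg"),
])])
site = Container(items=[page, image])
print(len(all_resources(site)))
```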

XML View

Mod_oai mod_oai began as a research project at ODU. It is an Apache module that responds to OAI-PMH requests on behalf of a web server. Goal: to bring the efficiency of OAI-PMH to everyday web sites. If Apache and mod_oai are installed at then the baseURL for OAI-PMH requests is

Mod_oai View Of OAI-PMH Data Model: OAI-PMH identifier: The URL of the resource serves as the OAI-PMH identifier. OAI-PMH datestamp: The modification time of the resource is used as the OAI-PMH datestamp for all three metadata formats. OAI-PMH sets: A set organization is introduced based on the MIME type of the resource.
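The mapping on this slide (URL as identifier, modification time as datestamp, MIME type as set) can be sketched for a single file as follows. This is not mod_oai's code: the function name, the site URL, and the `mime:type:subtype` spelling of the set name are assumptions made for the example.

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def record_header(site_url, doc_root, file_path):
    """Derive identifier, datestamp and set for one file under doc_root."""
    rel_path = os.path.relpath(file_path, doc_root).replace(os.sep, "/")
    mtime = os.stat(file_path).st_mtime
    datestamp = datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    mime, _ = mimetypes.guess_type(file_path)
    mime = mime or "application/octet-stream"
    return {
        "identifier": site_url.rstrip("/") + "/" + rel_path,  # URL of the resource
        "datestamp": datestamp,                               # file modification time
        "setSpec": "mime:" + mime.replace("/", ":"),          # assumed set naming
    }


# Demo on a temporary file with a known modification time (2000-01-01 UTC).
with tempfile.TemporaryDirectory() as doc_root:
    path = os.path.join(doc_root, "index.html")
    with open(path, "w") as handle:
        handle.write("<html></html>")
    os.utime(path, (946684800, 946684800))
    header = record_header("http://www.example.org", doc_root, path)

print(header)
```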

Supported Metadata Formats oai_dc: Dublin Core is supported as mandated; only technical metadata that can be derived from HTTP header information is included. http_header: contains all HTTP response headers that would be returned if the web resource were obtained by means of an HTTP GET. oai_didl: introduced to allow harvesting of the resource itself; the web resource is represented by means of an XML wrapper document that is compliant with MPEG-21 DIDL.
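To make the idea of an XML wrapper document concrete, the sketch below embeds a resource by value as base64 text (a later slide notes that ListRecords returns the base64-encoded file). The element and attribute names are simplified stand-ins inspired by the DIDL data model, without the real namespace or schema, and the payload is dummy data.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_resource(url, mime_type, payload):
    """Return a simplified wrapper document carrying the resource bytes."""
    item = ET.Element("Item")
    descriptor = ET.SubElement(item, "Descriptor")
    ET.SubElement(descriptor, "Statement").text = url  # identifies the resource
    component = ET.SubElement(item, "Component")
    resource = ET.SubElement(
        component, "Resource", {"mimeType": mime_type, "encoding": "base64"}
    )
    resource.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(item, encoding="unicode")


payload = b"\x89PNG demo bytes"  # dummy content, not a real image
xml_text = wrap_resource("http://www.example.org/logo.png", "image/png", payload)
print(xml_text)
```

Base64 inflates the payload by roughly one third, which is one reason a response carrying whole resources is much larger than one carrying identifiers only.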

Structural View

Does the site’s server support mod_oai? Request : If the response is valid, then the server supports mod_oai.

Demo

Quantitative Evaluation To examine the performance of mod_oai, the authors compared OAI-PMH harvesting using the OCLC Harvester with the wget web crawling utility.  served as the testbed. Overall, the testbed included 5268 files that used 292 MB of disk space. The server was at ODU and the client was at LANL. User files, data files and mail files were excluded from the server utility.

Experiments The authors performed two experiments. Files accessed by “wget” (total files used: 5268), by seed: index.html and “find . -type f”. Table rows: # of files in baseline; # of files in update (25%).

Experiments contd … Reason for the difference: only a portion of the valid URLs at a site are linked from web pages hosted on that site. Using the “find” seed, wget downloads more URLs (5739) than there are files (5268), because it finds additional URLs that the seed points to, including directories and broken links.

Experiments contd … Using the seed generated with “find”, the authors baselined both wget and mod_oai: All file modification dates were set to “ ”. In the second test, 25% of the files were changed to make their modification date “ ”, which simulates the expected monthly update rate of “.edu” sites. Request types: ListRecords, ListIdentifiers. From values: , Apache was restarted after each round of harvesting.

Comparison of crawling performance (a) Baseline wget & mod_oai

After 25% file updates (b) wget & mod_oai after 25% file updates

Results Surprisingly, wget takes more time when accessing only the updated files. The Apache log file shows that wget uses both the HTTP HEAD and GET methods to check the modification time. Thus, in checking for updates, wget will use more HTTP requests (5739 HEAD GET).

Testing the performance of Mod_oai using Resumption Tokens Impact of resumptionToken size.

Results The performance of mod_oai was tested using resumption tokens. Q: Why did it level off? ListRecords returns the base64-encoded file, whereas ListIdentifiers returns just the resource identifiers. The bottom line: we should use different resumptionToken sizes for ListRecords and ListIdentifiers.
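How the resumptionToken size affects the number of round trips can be sketched with a toy, in-memory repository. Nothing here is mod_oai's implementation: the token is simply the next offset, and the identifiers and page sizes are made up. It only illustrates the paging loop a harvester runs.

```python
def list_identifiers(all_ids, page_size, resumption_token=None):
    """Toy repository: return one page of identifiers and the next token (or None)."""
    start = int(resumption_token) if resumption_token else 0
    page = all_ids[start:start + page_size]
    next_start = start + page_size
    token = str(next_start) if next_start < len(all_ids) else None
    return page, token


def harvest(all_ids, page_size):
    """Follow resumption tokens until the list is complete; count the requests."""
    harvested, requests, token = [], 0, None
    while True:
        page, token = list_identifiers(all_ids, page_size, token)
        harvested.extend(page)
        requests += 1
        if token is None:
            break
    return harvested, requests


ids = ["http://www.example.org/file%d.html" % n for n in range(10)]  # hypothetical
small_pages = harvest(ids, page_size=4)
large_pages = harvest(ids, page_size=10)
print(small_pages[1], large_pages[1])
```

Larger pages mean fewer requests, but each ListRecords page carries whole base64-encoded files while each ListIdentifiers page carries only identifiers, so one page size is unlikely to suit both verbs.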

Discussion and future work Issues associated with mod_oai: 1. Counting Problem: what constitutes a complete list of the site’s crawlable resources? 2. Representation Problem: the resource as sent to a browser is not necessarily an optimal representation for the crawler. mod_oai can help solve both problems.

The Representation Problem What’s that page all about? mod_oai does not generate descriptive metadata. We are not positioning mod_oai as a replacement for existing repository systems with extensive descriptive metadata; rather, this paper aims to improve the efficiency of web crawlers. A plug-in architecture for mod_oai: rules to extract descriptive metadata; the ability to integrate third-party metadata extraction tools; and the creation of complex data items for long-term archiving.

The Counting Problem The problem of determining how many files are on a web server: we need to find all resources at a web site. OAI-PMH applications are deterministic; web harvesting, however, is different. Apache maps U -> F (URLs to files), and mod_oai maps F -> U. Neither function is one-to-one or onto. Apache can “cover up” legitimate files! Consider an httpd.conf file with these directives:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A
A user or crawler requesting will actually receive the resource from htdocs/A, and vice versa. mod_oai, on the other hand, is unaware of the alias and returns metadata from htdocs/B as if the directive did not exist.
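The mismatch between the two mappings can be reproduced with a few lines of Python. This is a toy model of prefix-based Alias resolution, not Apache's or mod_oai's code; the DocumentRoot value and the file name `page.html` are assumptions added for the example, while the two Alias directives are the ones on the slide.

```python
DOC_ROOT = "/usr/local/web/htdocs"  # assumed DocumentRoot
ALIASES = {
    "/A": "/usr/local/web/htdocs/B",
    "/B": "/usr/local/web/htdocs/A",
}


def apache_resolve(url_path):
    """Alias-aware mapping from URL path to file (U -> F)."""
    for prefix, target in ALIASES.items():
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOC_ROOT + url_path


def alias_unaware_resolve(url_path):
    """Mapping that ignores Alias directives, as the slide describes for mod_oai."""
    return DOC_ROOT + url_path


served = apache_resolve("/B/page.html")
described = alias_unaware_resolve("/B/page.html")
print(served)     # file a browser or crawler actually receives
print(described)  # file whose metadata an alias-unaware harvester reports
```

The two results name different files for the same URL, which is exactly why a count based on the file system and a count based on served URLs can disagree.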

Security “Security through obscurity”? mod_oai will not export any files that are not accessible through normal HTTP access. How does mod_oai handle protected files? It exports them only if the credentials in the current HTTP connection are sufficient to meet the requirements specified in the .htaccess file. As a result, harvesters with different credentials will see a different number of records for the same server.

Hidden Files Mod_oai will not advertise any file that the request does not have the credentials to retrieve. Apache will advertise files that it cannot read. To preserve OAI-PMH semantics, mod_oai will not include such files in responses.

Sitemap XML Format XML schema for the sitemap protocol Google Sitemaps

Basics Sitemap: 1) A site map (or sitemap) is a graphical representation of the architecture of a web site. 2) The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. Benefits of using sitemaps: useful where a browser cannot access all areas of a website; improved comprehensiveness; helps freshness by notifying search engines of changes; identifies unchanged pages, which 1) prevents unnecessary crawling and 2) saves search bandwidth. A Sitemap is “a supplementary tool” for search engines and NOT a replacement for existing crawling mechanisms.

Sample XML Sitemap A urlset element containing five url entries: (1) loc, lastmod, changefreq “monthly”, priority 0.8; (2) loc, changefreq “weekly”; (3) loc, lastmod, changefreq “weekly”; (4) loc, lastmod (a full timestamp ending in T18:00:15+00:00), priority 0.3; (5) loc, lastmod.
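A Sitemap with this shape can be generated with the standard library. The sketch below is illustrative only: the URLs and dates are placeholders (the slide's own values did not survive), and the namespace is the one used by the current sitemaps.org protocol, which may differ from the Google Sitemaps version shown in the talk.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"  # assumed protocol version


def build_sitemap(entries):
    """Serialize a list of dicts (loc required, other keys optional) as a Sitemap."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Placeholder entries mirroring the element combinations on the slide.
entries = [
    {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
     "changefreq": "monthly", "priority": 0.8},
    {"loc": "http://www.example.com/catalog", "changefreq": "weekly"},
    {"loc": "http://www.example.com/news", "lastmod": "2004-12-23T18:00:15+00:00",
     "priority": 0.3},
]
sitemap_xml = build_sitemap(entries)
print(sitemap_xml)
```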

Sitemap index files Each Sitemap file may contain no more than 50,000 URLs and be no larger than 10 MB, so larger sites must use multiple Sitemap files. Definition: a Sitemap index file is an XML file that lists multiple XML Sitemap files. List each Sitemap file in a Sitemap index file.
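The URL limit translates into a simple chunking step when generating Sitemaps. The sketch below handles only the 50,000-URL limit, not the 10 MB size limit, and the URL list is synthetic.

```python
MAX_URLS_PER_SITEMAP = 50_000


def split_into_sitemaps(urls, max_urls=MAX_URLS_PER_SITEMAP):
    """Split a URL list into chunks that each fit in one Sitemap file."""
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]


# Synthetic site with 120,001 URLs: needs three Sitemap files.
urls = ["http://www.example.com/page%d.html" % n for n in range(120_001)]
chunks = split_into_sitemaps(urls)
print([len(chunk) for chunk in chunks])
```

Each chunk would be written to its own Sitemap file, and the resulting file locations listed in one Sitemap index file.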

Sample XML Sitemap index file An index listing two Sitemaps: a sitemapindex element containing two sitemap entries: (1) loc, lastmod (a full timestamp ending in T18:23:17+00:00); (2) loc, lastmod.

Sitemap index file elements (Attribute, required or optional, Description):
sitemapindex (required): Encapsulates information about all of the Sitemaps in the file.
sitemap (required): Encapsulates information about an individual Sitemap.
loc (required): Identifies the location of the Sitemap. This location can be a Sitemap, an Atom file, an RSS file or a simple text file.
lastmod (optional): Identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. By providing the last modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index, i.e. a crawler may retrieve only the Sitemaps that were modified since a certain date.
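The selective retrieval that lastmod enables can be sketched as follows. The index document, its URLs and its dates are hypothetical, the namespace is the current sitemaps.org one, and treating a missing lastmod as "always fetch" is a design assumption of this sketch rather than something the protocol prescribes.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical Sitemap index with two entries.
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml</loc>
    <lastmod>2006-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml</loc>
    <lastmod>2006-01-01</lastmod>
  </sitemap>
</sitemapindex>"""


def parse_lastmod(text):
    """Parse a date or full timestamp; assume UTC when no offset is given."""
    parsed = datetime.fromisoformat(text)
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def sitemaps_modified_since(index_xml, cutoff):
    """Return the locations of Sitemaps modified at or after the cutoff."""
    root = ET.fromstring(index_xml)
    selected = []
    for sitemap in root.findall("sm:sitemap", NS):
        loc = sitemap.findtext("sm:loc", namespaces=NS)
        lastmod = sitemap.findtext("sm:lastmod", namespaces=NS)
        # Without lastmod the crawler cannot rule the Sitemap out, so keep it.
        if lastmod is None or parse_lastmod(lastmod) >= cutoff:
            selected.append(loc)
    return selected


last_visit = datetime(2006, 6, 1, tzinfo=timezone.utc)  # hypothetical previous crawl
to_fetch = sitemaps_modified_since(INDEX_XML, last_visit)
print(to_fetch)
```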

Sitemap file location The location of a Sitemap file determines the set of URLs that can be included in that Sitemap: Permission to change ? In : Valid URLs include : Invalid URLs include:
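The location rule can be expressed as a small check: under the Sitemaps protocol, a Sitemap may list only URLs on the same scheme and host that sit at or below the directory holding the Sitemap file. The URLs below are hypothetical stand-ins, since the slide's own examples did not survive, and the helper name is an assumption.

```python
from urllib.parse import urlsplit


def allowed_in_sitemap(sitemap_url, page_url):
    """Return True if page_url falls within the scope of the Sitemap's location."""
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    base_dir = sitemap.path.rsplit("/", 1)[0] + "/"  # directory holding the Sitemap
    same_origin = (sitemap.scheme, sitemap.netloc) == (page.scheme, page.netloc)
    return same_origin and page.path.startswith(base_dir)


SITEMAP = "http://example.com/catalog/sitemap.xml"  # hypothetical location
checks = {
    "http://example.com/catalog/show?item=23": allowed_in_sitemap(
        SITEMAP, "http://example.com/catalog/show?item=23"),
    "http://example.com/images/logo.png": allowed_in_sitemap(
        SITEMAP, "http://example.com/images/logo.png"),
    "https://example.com/catalog/page1.html": allowed_in_sitemap(
        SITEMAP, "https://example.com/catalog/page1.html"),
}
print(checks)
```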

Extending the Sitemaps protocol You can extend the Sitemaps protocol using your own namespace. Simply specify this namespace in the root element. For example: <urlset xmlns:xsi=" xsi:schemaLocation=" xmlns= xmlns:example="

Mod_oai vs Google sitemaps The authors executed “sitemap-gen.py” in mode3 to generate URLs. The Sitemap script generated 5843 URLs, which does not match mod_oai’s results. The Sitemap generator was overly optimistic: it returned some URLs that did not even exist, and it returned URLs that were protected by an .htaccess file.

Mod_oai vs Google sitemaps Google’s Sitemap is designed to be an extremely lightweight mechanism for informing Google’s web crawlers of new and updated URLs at a web site, similar to using mod_oai with only “ListIdentifiers” and no datestamps, sets or metadata formats. There is a trade-off between the dynamic access of mod_oai and the static access of a Sitemap: a mod_oai response is always up to date, but at the cost of computation; a Sitemap is only as up to date as the local refresh policy, but each crawler access to the Sitemap imposes no computational cost on the server.

Conclusion mod_oai is a module that combines complex object metadata formats with OAI-PMH for efficient web resource harvesting. mod_oai can be used in resource discovery and preservation. Experiments reveal that the performance of mod_oai is comparable to that of wget in baseline harvests, and that it outperforms wget when file updates are considered.

References widm06.pdf: Efficient, Automatic Web Resource Harvesting; Sitemap Protocol; OAI-PMH Data Model