Avano, an OAI harvester for marine and aquatic sciences Fred Merceur What could be improved in OAI-PMH protocol and in repositories implementation?


1 Avano, an OAI harvester for marine and aquatic sciences Fred Merceur What could be improved in OAI-PMH protocol and in repositories implementation?

2 Table of contents
Main technical ideas of OAI-PMH
Avano presentation: general information, filtering aquatic and marine records, demonstrations
What could be improved in OAI-PMH protocol and in repositories implementation?

3 Main technical ideas of OAI-PMH Open Archives Protocol for Metadata Harvesting

4 Definitions and concepts
A protocol to share bibliographic records. The digital objects (documentation, images, datasets…) stay inside the repositories.
Two groups of players, communicating over HTTP / XML:
Data providers (OAI servers): Open Archives, Institutional Repositories, commercial publishers, e.g. Aquatic Commons, OceanDocs, MBL/WHOI
Service providers (OAI harvesters), including Avano
A simple protocol: OAI-PMH is based on major web standards: HTTP, XML, Dublin Core

5 Harvesters query repositories with simple HTTP requests. There are 6 request types (verbs) that can be issued by harvesters:
Identify: retrieve information about a repository (administrator email, information about the deleted-records strategy…)
ListMetadataFormats: retrieve the metadata formats (XML schemas) available from a repository. All repositories must at least allow the sharing of their records in unqualified Dublin Core
ListSets: get the optional list of sets suggested by the data provider to harvest a selection of records (thematic sets, type of documents, full text available…)
ListIdentifiers: get the list of record identifiers available from a data provider
GetRecord: get the complete record for the identifier sent as parameter
ListRecords: get a list of complete records available from a data provider

6 Some parameters to query a repository
from, until (optional): specify the range of dates of the records to harvest (this applies to the date of last modification, not to the date of publication)
set (optional): specify the set of records to retrieve (thematic sets, type of document, full text available…)
metadataPrefix (mandatory): specify in which format (XML schema) the records must be returned
One example: http://www.ifremer.fr/docelec/oai/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc
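
As an illustration of the parameters above, here is a minimal Python sketch that assembles a request URL. The endpoint is the one quoted on the slide; the helper name `build_request` and the trailing-underscore convention for `from_` are assumptions made for this example, not part of Avano.

```python
from urllib.parse import urlencode

# Endpoint quoted on the slide
BASE_URL = "http://www.ifremer.fr/docelec/oai/OAIHandler"

def build_request(base_url, verb, **params):
    """Return an OAI-PMH request URL; parameters set to None are dropped."""
    query = {"verb": verb}
    for key, value in params.items():
        if value is not None:
            # 'from' is a Python keyword, so callers pass 'from_' instead
            query[key.rstrip("_")] = value
    return base_url + "?" + urlencode(query)

# Incremental harvest of everything modified since a (hypothetical) date
url = build_request(BASE_URL, "ListRecords",
                    metadataPrefix="oai_dc", from_="2007-01-01", set=None)
print(url)
```

Optional parameters that are left as `None` are simply omitted, which mirrors the fact that from, until and set are optional in the protocol.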

7 Minimal OAI-compliant metadata consists of the 15 fields of unqualified Dublin Core: TITLE, CREATOR, SUBJECT, DESCRIPTION, PUBLISHER, CONTRIBUTOR, DATE, TYPE, FORMAT, IDENTIFIER, SOURCE, LANGUAGE, RELATION, COVERAGE, RIGHTS
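
To make the record format concrete, the following sketch reads the Dublin Core fields out of one `oai_dc` record with Python's standard library. The sample record is invented for illustration; only the two namespace URIs come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record, for illustration only
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Hypothetical title</dc:title>
  <dc:creator>Author, A.</dc:creator>
  <dc:creator>Author, B.</dc:creator>
  <dc:date>2006</dc:date>
</oai_dc:dc>"""

def dc_fields(xml_text):
    """Return {field name: [values]} for the Dublin Core elements of a record."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        if child.tag.startswith(DC_NS):
            name = child.tag[len(DC_NS):]
            # Every Dublin Core field is optional and repeatable
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields

record = dc_fields(SAMPLE)
print(record)
```

Because every field is optional and repeatable, the values are kept as lists and a missing field (here, type) is simply absent from the result.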

8 Avano, a thematic OAI-PMH harvester implementation example

9 General information
Avano was launched in September 2006. It is available at: http://www.ifremer.fr/avano/
Part of the system is based on the University of Illinois Open Archives Initiative Metadata Harvesting Project.
The publication web site and the filtering system are Ifremer in-house developments.
It handles marine resources but also freshwater resources (rivers, lakes, groundwater, drinking water treatment...).
Avano harvests Open Archives, Institutional Repositories and a few commercial publishers (e.g. HighWire).
When a full-text subset is available, we only harvest records with full text.
Repositories are not loaded if there is no full-text subset and the repository contains mainly records with no full text.
Repositories are not loaded if they offer records that link to digital objects stored outside the repository server.

10 Harvesting marine repositories
The full content of these 9 marine repositories is automatically loaded into Avano via OAI-PMH (18904 records):
ePic, Alfred Wegener Institute: 2679 records
Aquatic Commons, Iamslic: 269 records
ArchiMer, Ifremer: 2241 records
DRS, National Institute of Oceanography of India: 637 records
IBSS, Institute of Biology of the Southern Seas: 181 records
Marine & Ocean Science ePrints @ Plymouth: 1974 records
OceanDocs, Africa and Latin America marine pub.: 1568 records
Plankton*Net (AWI and Roscoff marine station): 7686 images
WHOAS (Woods Hole): 1660 records

11 Harvesting non-marine repositories
[Diagram] 146 non-marine repositories are harvested via OAI-PMH into a temporary table (4.500.000 records). Filters are then applied:
Aquatic and marine terms or expressions (… fishery, fishes, fishing% …)
Journal titles (… Ocean Dynamics, Ocean Engineering, Ocean Modelling, Ocean Navigator, Ocean Research …)
Aquatic species scientific names (… abietinaria inconstans, abietinaria kincaidi, abietinaria labrata, abietinaria pacifica …)
Manual checking (40 000 records removed manually)
Result: Avano (88000 records)

12 Harvesting non-marine repositories
Records that contain an aquatic journal title, aquatic expressions or scientific names of aquatic species are automatically loaded into Avano. Avano currently uses:
An aquatic journal title list from ASFA
A list of scientific names of fishes from FishBase
A list of scientific names of aquatic species from the FAO
Several lists of scientific names of aquatic species from the NODC
If you have lists of scientific names for aquatic algae, fungi, plants, mollusks, gastropods, insects, birds or mammals that contain only aquatic species, please contact me!
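
The filtering idea can be sketched in a few lines of Python. This is not Avano's actual filtering module: the term list is a tiny sample taken from the slides, and reading the trailing `%` in "fishing%" as a SQL-style wildcard ("any word ending") is an assumption.

```python
import re

def compile_term(term):
    """Turn a filter term into a regex; a trailing '%' is assumed to be a wildcard."""
    if term.endswith("%"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.compile(pattern, re.IGNORECASE)

# Small sample of terms, journal titles and species names shown on the slides
TERMS = ["fishery", "fishes", "fishing%", "Ocean Dynamics", "abietinaria pacifica"]
PATTERNS = [compile_term(t) for t in TERMS]

def is_aquatic(record_text):
    """True if any term matches the concatenated text of a record."""
    return any(p.search(record_text) for p in PATTERNS)

print(is_aquatic("Effects of fishing on demersal stocks"))  # matches 'fishing%'
print(is_aquatic("Quantum field theory"))                   # no term matches
```

Matching on word boundaries keeps a term such as "fishes" from firing inside an unrelated word, but it cannot replace the manual checking step described on the previous slide.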

13 Limitations of the keyword filtering method
It is a time-consuming method.
We may validate records (1 or 2%?) that don't match any Avano subject.
We may also miss a few records from non-marine repositories (1 or 2%?), especially when the records are poor (no abstract) or only available in a local language.
But this is the only way we found to get the 80% of Avano records that come from general repositories.

14 Avano now contains more than 107 000 records from 156 Open Archives and 4 commercial publishers

15 Publication year of documents available from Avano

16 The number of connections to Avano is increasing [Chart: number of connections]

17 An international audience

18 Demonstrations
Filtering module
Public web site: http://www.ifremer.fr/avano/

19 One year of harvester management review: what could be improved in OAI-PMH protocol and in repositories implementation?

20 OAI-PMH, what could be improved? Repository stability
Many repositories (10-20%?) are difficult to harvest because of poor reliability:
Undocumented errors occur during harvesting
HTTP time-out errors occur during harvesting
The OAI-PMH protocol is not completely supported (some repositories can only be harvested via the ListRecords verb, some others only via ListIdentifiers, and some do not return the same number of records via ListRecords and via ListIdentifiers)
The OAI-PMH server URL changes without notification
…

21 OAI-PMH, what could be improved? XML encoding, UTF-8 errors
Many repositories deliver an incorrect XML stream or records that contain UTF-8 errors (character encoding errors). This is a problem for harvesters (e.g. Avano) whose XML parsers cannot bypass these XML encoding or UTF-8 errors.
Records with UTF-8 errors are not loaded in Avano.
Repositories with XML encoding errors cannot be harvested by Avano via the ListRecords verb (which is a problem when the ListIdentifiers verb doesn't work either).
…
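
One possible workaround on the harvester side, sketched here with Python's standard library, is to decode the response leniently before handing it to the XML parser. This is an illustration, not what Avano does: it assumes the response carries no conflicting encoding declaration, and the sample bytes are invented.

```python
import xml.etree.ElementTree as ET

def parse_lenient(raw_bytes):
    """Parse XML after replacing undecodable bytes with U+FFFD."""
    # Invalid UTF-8 sequences become the replacement character instead of
    # aborting the whole parse; the affected text is damaged but the record survives.
    text = raw_bytes.decode("utf-8", errors="replace")
    return ET.fromstring(text)

# Hypothetical fragment: 0xE9 is Latin-1 'é', which is not valid UTF-8 here
raw = b"<title>Oc\xe9an Atlantique</title>"
element = parse_lenient(raw)
print(element.text)
```

The trade-off is that the damaged characters are lost, so a harvester may still prefer to flag such records for review rather than index them silently.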

22 OAI-PMH, what could be improved? Big or slow repository harvesting
Big or slow repositories can take several days to harvest.
This is a problem for unreliable repositories: if one error occurs, the harvesting must be restarted from the beginning (there is no way to resume from where the harvesting stopped).
For some of these repositories, an intermediate solution consists in dividing the harvest into date ranges, but it cannot be applied all the time.
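
Dividing a harvest by date range can be sketched as follows. The 30-day window size and the function name `date_windows` are arbitrary choices for this example; each pair would be sent as the from and until parameters of one request, so that a failure only forces the current window to be repeated.

```python
from datetime import date, timedelta

def date_windows(start, end, days=30):
    """Yield (from, until) pairs of ISO dates covering start..end inclusive."""
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=days - 1), end)
        yield current.isoformat(), window_end.isoformat()
        # The next window starts the day after this one ends, so nothing overlaps
        current = window_end + timedelta(days=1)

windows = list(date_windows(date(2007, 1, 1), date(2007, 3, 1)))
for from_date, until_date in windows:
    print(from_date, until_date)
```

As the slide notes, this does not always work: the windows are based on modification dates, so a repository that stamps all of its records with the same date gains nothing from the split.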

23 OAI-PMH, what could be improved? Duplicated records
This can happen if, for example, a publication is written in collaboration between several institutions. The publication may then be archived on each institution's server. The international deposit rate is so low, especially for life sciences, that this is not really a problem nowadays.
Some national projects are also aggregating a selection of IRs and re-exposing the records in OAI-PMH. For example, HAL is a French national Open Archive. Some French scientific organizations use this platform to build their IR (IN2P3, INSERM…). All the records loaded in these IRs are exposed twice (via the national platform and via the IR). If harvester managers have not heard about these specific national projects, they may load these duplicated IRs (e.g. all IN2P3, INSERM… records are duplicated in OAIster).

24 OAI-PMH, what could be improved? Deleted records
Many repositories don't support a mechanism (transient or persistent) that tells harvesters that a record has been deleted.
Harvesters then have to re-harvest the repositories completely (instead of using incremental harvests) to detect deleted records, which is a major problem for big, slow or unreliable repositories that need several days to be re-harvested.
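
When a repository gives no deletion information, the full re-harvest boils down to comparing identifier lists. A minimal sketch, with made-up identifiers and a hypothetical helper name:

```python
def find_deleted(local_ids, harvested_ids):
    """Identifiers held locally that the repository no longer exposes."""
    return sorted(set(local_ids) - set(harvested_ids))

# Hypothetical identifiers: what the harvester stored last time...
local = ["oai:example.org:1", "oai:example.org:2", "oai:example.org:3"]
# ...and what a complete ListIdentifiers pass returns today
harvested = ["oai:example.org:1", "oai:example.org:3", "oai:example.org:4"]

deleted = find_deleted(local, harvested)
print(deleted)
```

The comparison itself is cheap; the cost lies in obtaining a complete and trustworthy list of current identifiers, which is exactly what is hard for the big, slow or unreliable repositories mentioned above.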

25 OAI-PMH, what could be improved? Type field
26 000 of the 107 000 records available in Avano have no type field.
A few (>500) have a type field that is impossible to normalize, for example: "A1", "Airticle", "8", "Treball Final de Carrera"…
All these records are removed from the results if the end-user limits the query to a given type.
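
The effect on search can be illustrated with a small normalization sketch. The mapping table is hypothetical (it is not Avano's vocabulary); the point is that any value outside the table, or a missing field, yields `None` and therefore never matches a type filter.

```python
# Hypothetical mapping from raw dc:type values to a small controlled vocabulary
TYPE_MAP = {
    "article": "article",
    "journal article": "article",
    "thesis": "thesis",
    "image": "image",
}

def normalize_type(raw):
    """Return the normalized type, or None if the value is missing or unknown."""
    if raw is None:
        return None
    return TYPE_MAP.get(raw.strip().lower())

def matches_type(record, wanted):
    """True only for records whose type could be normalized to `wanted`."""
    return normalize_type(record.get("type")) == wanted

for raw in [" Article ", "Airticle", "A1", "8", None]:
    print(repr(raw), "->", normalize_type(raw))
```

A record typed "Airticle" is almost certainly an article, but without a manually maintained list of such variants it is excluded from a query limited to articles, just like a record with no type at all.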

26 OAI-PMH, what could be improved? Publication date field
15 000 of the 107 000 records available in Avano have no publication date.
A few (>500) have a badly formatted date, for example: "1970-04-00", "1981.", "Montréal, 2000", "[196-?]", "2005-92-26"…
All these records are removed from the results list if the end-user limits the query to a date range.
All these records are displayed at the end of the hit list if the end-user chooses to sort the hit list by date.
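
A harvester can try to salvage at least a year from such values. The sketch below is one possible heuristic, not Avano's implementation; the accepted year range (1500 to 2099) is an arbitrary assumption.

```python
import re
from datetime import date

# A plausible four-digit year that is not part of a longer number
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")

def is_valid_iso_date(value):
    """True if the value is a real calendar date in YYYY-MM-DD form."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

def extract_year(value):
    """Return the first plausible year found in the value, or None."""
    match = YEAR.search(value)
    return int(match.group(1)) if match else None

for raw in ["1970-04-00", "1981.", "Montréal, 2000", "[196-?]", "2005-92-26"]:
    print(raw, "->", is_valid_iso_date(raw), extract_year(raw))
```

With this heuristic four of the five examples from the slide yield a usable year, while "[196-?]" still has to be treated as undated.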

27 OAI-PMH, what could be improved? Poor records
Some repositories contain poor records (no abstract, no keyword, no author…). Some others contain records only available in national languages.
These records have poor visibility in harvester search engines, because harvesters only index the bibliographic data and often display their result lists sorted by rank.

28 OAI-PMH, what could be improved? Aggregating documentation and dataset records
This can be a problem for harvesters if dataset records do not have the same granularity as documentation records.
E.g. Pangaea is a publishing network for geological and environmental data. It contains thousands of records that are almost identical (only a few geographical references differ between these records).

29 E.g. Pangaea contains 1389 almost identical records that contain the expression "color reflectance". An end-user who wants to find the few documentation records that also contain this expression has no chance of finding them in this list of results:

30 OAI-PMH, what could be improved? Records without free access to the digital object: maybe the main problem!
Many Open Archives and IRs now contain records without full text, records with pay-per-view full text (e.g. BePress/ProQuest) or records with restricted access to the full text.
This would not be a problem if harvesters could give their end-users information about access to the full text (and offer, as an option, the possibility to filter on it). But this is not the case!
We still have to convince scientists and end-users that Open Access is useful and/or necessary. Immediate and free access to the full text is maybe the main argument to convince them. It is my opinion that hiding records with free full text among records with inaccessible full text is not helpful.

31 OAI-PMH, what could be improved? Thematic harvesting
Thematic harvesting is supposed to be available via sets.
In practice, no repository offers a set that exactly matches the scope of Avano.
The OAI-PMH protocol does not allow harvesting the records that belong to several sets at once. For example, it would not have been possible to harvest a "Full-Text" set and a "Marine and aquatic" set at the same time.
This limitation led to the development of the keyword-spotting system used to filter marine and aquatic records in general repositories.
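
Since a request names at most one set, combining two sets has to happen on the harvester side. The sketch below shows the idea under the assumption that a repository offered both sets, which the slide says is not the case in practice; the identifiers and the helper name are invented.

```python
def in_all_sets(*identifier_lists):
    """Identifiers present in every list, i.e. records belonging to all the sets."""
    common = set(identifier_lists[0])
    for ids in identifier_lists[1:]:
        common &= set(ids)
    return sorted(common)

# Hypothetical results of two separate ListIdentifiers requests, one per set
full_text_ids = ["oai:example.org:1", "oai:example.org:2", "oai:example.org:5"]
marine_ids = ["oai:example.org:2", "oai:example.org:3", "oai:example.org:5"]

both = in_all_sets(full_text_ids, marine_ids)
print(both)
```

This costs one complete identifier harvest per set, so it adds load on exactly the repositories that are already slow to harvest.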

32 Conclusion (1/2)
What do harvesters need to be able to find their place between Google and commercial bibliographic databases?
A higher Open Access deposit rate (less than 3% in marine/aquatic sciences?) and/or more commercial publishers exposing their records in OAI-PMH, in order to cover the main part of the international scientific production
A new version of OAI-PMH that would offer a more reliable way to harvest OA and more qualified mandatory information (date and type fields, information about access to the full text…), so that harvesters can offer more powerful and reliable search options

33 Conclusion (2/2)
Please test and comment on Avano. Do not hesitate to suggest modifications!
Check whether your repository is already harvested by Avano and, if not, please register!
Contact me if you have lists of scientific names for aquatic algae, fungi, plants, mollusks, gastropods, insects, birds or mammals that contain only aquatic species!

