Bitter Harvest Metadata Harvesting Issues, Problems, and Possible Solutions Roy Tennant California Digital Library.

2 Bitter Harvest Metadata Harvesting Issues, Problems, and Possible Solutions Roy Tennant California Digital Library

3 Outline
Brief Harvesting Overview
Harvesting Problems
Steps to a Fruitful Harvest
A Harvesting Service Model
Indexing and Interfaces
What’s Next?

4 Open Archives Initiative Open Archives Initiative: “develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content” Huh? Let’s just say it’s an effort to help people find stuff Protocol for Metadata Harvesting (OAI-PMH) specifies how repositories can expose their metadata for others to harvest Well over 500 repositories world-wide support the protocol OAIster.org has indexed 3.5 million items from those repositories

5 OAI-PMH
Data providers (DP): those with the stuff
Service providers (SP): those who harvest metadata and provide aggregation and search services
OAI-PMH verbs: Identify, ListIdentifiers, ListMetadataFormats, ListSets, ListRecords, GetRecord
Software for both DPs and SPs readily available
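
To make the six verbs concrete, here is a minimal Python sketch of how a service provider might build request URLs. The base URL `https://example.org/oai` and the helper name `build_request` are assumptions for illustration; they are not part of the presentation or of any particular harvester.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs listed on the slide.
OAI_VERBS = {
    "Identify",
    "ListIdentifiers",
    "ListMetadataFormats",
    "ListSets",
    "ListRecords",
    "GetRecord",
}


def build_request(base_url, verb, **arguments):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    base = "https://example.org/oai"  # placeholder repository address
    print(build_request(base, "Identify"))
    print(build_request(base, "ListRecords", metadataPrefix="oai_dc"))
```

A data provider answers each of these URLs with an XML document; the service provider's job starts with parsing that response.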

6 www.oaforum.org/tutorial/

7 OAI Architecture Source: Open Archives Forum Tutorial

8 gita.grainger.uiuc.edu/registry/

9 errol.oclc.org

10 Harvesting Problems Sets Metadata Formats Metadata Artifacts Granularity Metadata Variances

11 Sets Records are harvested in clumps, called “sets” created by DPs No guidelines exist for defining sets Examples: Collection Organizational structure Format (but is a page image an image? See example)

12 Metadata Formats Only required format is simple Dublin Core, although any format can be made available in addition Few DPs surface richer metadata Simple DC is simply too simple! Example (artifact vs. surrogate dates)
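
The "too simple" problem can be seen by reading a simple Dublin Core record into a dictionary. The sketch below is illustrative only: the sample record is invented, and `dc_fields` is a hypothetical helper. The two `dc:date` values stand for the date of the original artifact and the date of its digital surrogate, which simple DC gives no way to tell apart.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample record; not taken from any real repository.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Harvest scene</dc:title>
  <dc:date>1897</dc:date>
  <dc:date>2003-05-14</dc:date>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Collect the Dublin Core elements of one record as {name: [values]}."""
    root = ET.fromstring(xml_text)
    fields = defaultdict(list)
    for child in root:
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]
            fields[name].append((child.text or "").strip())
    return dict(fields)


if __name__ == "__main__":
    # Both dates come back as plain "date" values with no qualifier.
    print(dc_fields(SAMPLE)["date"])
```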

13 Metadata Artifacts “unintended, unwanted aberrations” Sample causes: Idiosyncratic local practices Anachronisms HTML code Examples: Circa = string of dates for searching purposes [electronic resource]
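
A service provider might strip some of these artifacts with a small cleaning pass. The following Python sketch handles two of the causes named on the slide, stray HTML code and the "[electronic resource]" qualifier; the function name `clean_value` and the sample strings are assumptions, and a real cleanup would need rules tuned to each data provider.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
QUALIFIER_RE = re.compile(r"\s*\[electronic resource\]", re.IGNORECASE)


def clean_value(value):
    """Remove HTML tags, HTML entities, and the '[electronic resource]' qualifier."""
    value = TAG_RE.sub("", value)        # drop markup such as <b> or <br/>
    value = html.unescape(value)         # turn &amp; back into &
    value = QUALIFIER_RE.sub("", value)  # drop the cataloging qualifier
    return re.sub(r"\s+", " ", value).strip()


if __name__ == "__main__":
    print(clean_value("Atlas of California [electronic resource]"))
    print(clean_value("<b>Map</b> of &amp; around Davis"))
```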

14 Granularity Record Granularity: what is an “object”? A book, or each individual page? Examples: CDL, Univ. of Michigan Metadata Granularity: Multiple values in one field Example: Univ. of Washington
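
For the metadata granularity problem, one possible repair is to split a packed field into separate values. This sketch assumes the values are separated by semicolons, which is only a guess at one common practice; the delimiter and the helper name `split_values` are hypothetical.

```python
def split_values(value, delimiter=";"):
    """Split one packed field into a list of trimmed, non-empty values."""
    return [part.strip() for part in value.split(delimiter) if part.strip()]


if __name__ == "__main__":
    print(split_values("Fishing; Salmon ; Boats"))
```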

15 Metadata Variances Subject terminology differences Disparities in recording the same metadata Example: date variances Mapping oddities or mistakes Examples: 1) format into description, 2) description into subject

16 Steps to a Fruitful Harvest
Needs Assessment (it’s the user, stupid)
DP Identification and Communication
Metadata Capture
Metadata Analysis
Metadata Subsetting
Metadata Normalization
Metadata Enrichment
Indexing
Interface (it’s still the user, stupid)

17 Needs Assessment What are you trying to accomplish? What will your users want to be able to do? What metadata will you need, and what procedures will you need to set up to enable these activities? Which repositories have what you want? Is what they have (e.g., sets, metadata) usable as is, or ?

18 DP Identification & Communication Identification: Use UIUC directory of DPs to identify potential sources Communication: Not required to tell them you are harvesting, but may help establish a good relationship May want to request that they surface a richer metadata format and/or provide a different set

19 Metadata Capture Sample questions to answer: Individual sets, or all? Richer metadata formats available? How frequently to reharvest? Start from scratch each time or update? Many software options
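
Several of these questions map directly onto `ListRecords` arguments: `set` selects an individual set, `metadataPrefix` selects the format, and `from` limits a reharvest to records changed since a given date. The Python sketch below is one way to write such a harvesting loop, not the method used in the presentation. The function names, the injectable `fetch` parameter, and any base URL passed in are assumptions; the loop follows the protocol's `resumptionToken` mechanism for paging through large result lists.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def fetch(url):
    """Download one OAI-PMH response as bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", set_spec=None,
            from_date=None, fetch=fetch):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # harvest one set instead of all
    if from_date:
        params["from"] = from_date    # update instead of starting from scratch
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for record in root.iter(OAI + "record"):
            yield record
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break                     # no token, or an empty one: list is complete
        # A resumption token is sent on its own, without the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing a different `fetch` function makes it possible to try the loop against saved responses before pointing it at a live repository.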

21 Virginia Tech Perl Harvester

+-----------------------------------------+
| Harvester Sample Configurator           |
+-----------------------------------------+
| Version 1.1 :: July 2002                |
| Hussein Suleman                         |
| Digital Library Research Laboratory     |
| www.dlib.vt.edu :: Virginia Tech        |
+-----------------------------------------+

Defaults/previous values are in brackets - press to accept those
enter "&delete" to erase a default value
enter "&continue" to skip further questions and use all defaults
press -c to escape at any time (new values will be lost)

Press to continue

[ARCHIVES]
Add all the archives that should be harvested
Current list of archives:
No archives currently defined !
Select from: [A]dd [D]one
Enter your choice [D] : a{return}

[ARCHIVE IDENTIFIER]
You need a unique name by which to refer to the archive you will harvest metadata from
Examples: nsdl-380602, VTETD
Archive identifier [] : nsdl-380602{return}

22 Metadata Analysis Finding out what you have (and don’t have) Encoding practices Gap analysis (e.g., missing fields, etc.) Mistakes (e.g., mapping errors) Software can help Commercial software like Spotfire In-house or open source software tools
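
As an example of an in-house analysis tool, the Python sketch below counts how often each element is used across harvested records and reports which required fields are missing. It assumes each record has already been parsed into a dictionary of value lists; the function names and the sample records are invented for illustration.

```python
from collections import Counter


def element_usage(records):
    """Return each element's share of all element occurrences."""
    counts = Counter()
    for record in records:
        for name, values in record.items():
            counts[name] += len(values)
    total = sum(counts.values())
    if not total:
        return {}
    return {name: count / total for name, count in counts.items()}


def missing_fields(records, required):
    """Count, per required element, how many records lack it (gap analysis)."""
    gaps = Counter()
    for record in records:
        for name in required:
            if not record.get(name):
                gaps[name] += 1
    return dict(gaps)


if __name__ == "__main__":
    sample = [
        {"title": ["A"], "date": ["1900"]},
        {"title": ["B"], "creator": ["X", "Y"]},
    ]
    print(element_usage(sample))
    print(missing_fields(sample, ["title", "date"]))
```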

23 Five elements are used 71% of the time. Source: 2002 Master’s Thesis, Jewel Hope Ward, UNC Chapel Hill

30 Metadata Analysis Model

31 Metadata Subsetting DP sets are unlikely to serve all SP uses well SPs will need the ability to subset harvested metadata Example: prototype subsetting tool
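
A very small version of subsetting on the service provider side could look like the Python sketch below. It is not the prototype tool mentioned on the slide. It assumes records are dictionaries of value lists and selects those whose chosen field contains a search string; the helper name `subset` and the sample data are hypothetical.

```python
def subset(records, field, needle):
    """Return the records whose `field` has a value containing `needle`."""
    needle = needle.lower()
    return [
        record
        for record in records
        if any(needle in value.lower() for value in record.get(field, []))
    ]


if __name__ == "__main__":
    harvested = [
        {"title": ["Salmon run"], "type": ["Image"]},
        {"title": ["Annual report"], "type": ["Text"]},
    ]
    print(subset(harvested, "type", "image"))
```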

33 A Subsetting Model

34 Metadata Normalization Normalizing: to reduce to a standard or normal state Prototype date normalization service screen
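
Dates are a natural first target for normalization. The Python sketch below is not the prototype service shown on the slide; it is a rough illustration that reduces free-text date strings to a start year, an end year, and an "approximate" flag. The function name, the output shape, and the handful of patterns it recognizes are all assumptions, and real harvested data would need many more rules.

```python
import re

# Four-digit years from 1000 to 2099, not embedded in longer numbers.
YEAR_RE = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")
# "circa", "ca.", "ca", "c." or "c" directly before a digit.
CIRCA_RE = re.compile(r"\b(circa|ca|c)\.?\s*\d", re.IGNORECASE)


def normalize_date(value):
    """Reduce a date string to a year range, or None if no year is found."""
    years = [int(year) for year in YEAR_RE.findall(value)]
    if not years:
        return None
    return {
        "start": min(years),
        "end": max(years),
        "approximate": bool(CIRCA_RE.search(value)),
    }


if __name__ == "__main__":
    for raw in ["ca. 1900", "1920-1925", "2003-05-14", "undated"]:
        print(raw, "->", normalize_date(raw))
```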

35 Metadata Enrichment Adding fields or values may be useful or required, for example: Metadata provider information Geographic coverage Subject terms mapped to a different thesaurus Authority control record
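
Two of these enrichments, provider information and subject mapping, can be sketched in a few lines of Python. The field names `provider` and `subject_mapped`, the function name `enrich`, and the sample mapping are all invented for illustration; a real service would choose its own field names and thesaurus.

```python
def enrich(record, provider_name, subject_map=None):
    """Return a copy of the record with provider and mapped-subject fields added."""
    enriched = {name: list(values) for name, values in record.items()}
    enriched["provider"] = [provider_name]
    if subject_map:
        mapped = [
            subject_map[subject]
            for subject in enriched.get("subject", [])
            if subject in subject_map
        ]
        if mapped:
            enriched["subject_mapped"] = mapped
    return enriched


if __name__ == "__main__":
    record = {"title": ["Salmon run"], "subject": ["Fishes"]}
    print(enrich(record, "Example University", {"Fishes": "Fish"}))
```

The original record is left unchanged, so the harvested metadata can always be compared with its enriched form.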

36 A Harvesting Service Model

37 Indexing Pick your favorite database/indexing software: MySQL SWISH-E May need to specifically set up a method to search across the entire record May need different fields for indexing than for display
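
The point about searching across the entire record can be shown with a toy in-memory index. This Python sketch is a stand-in for illustration, not a substitute for MySQL or SWISH-E: it concatenates every field of a record into one searchable text while the original fields stay available for display. The function names and sample records are assumptions.

```python
import re
from collections import defaultdict


def build_index(records):
    """Map each lowercased word to the ids of records containing it in any field."""
    index = defaultdict(set)
    for record_id, record in records.items():
        # Join all fields so a query can match anywhere in the record.
        text = " ".join(value for values in record.values() for value in values)
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(record_id)
    return index


def search(index, query):
    """Return ids of records that contain every word of the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


if __name__ == "__main__":
    records = {
        "r1": {"title": ["Salmon fishing"], "creator": ["Smith"]},
        "r2": {"title": ["Wheat harvest"], "subject": ["Salmon"]},
    }
    index = build_index(records)
    print(search(index, "salmon"))
    print(search(index, "salmon smith"))
```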

38 Interface Software interface (API) for other applications: SRU/SRW? Arbitrary Web Services schema? User interface

39 What’s Next? Further protocol development Services layered on top of OAI-PMH Shared software tools Best practices for both DPs and SPs

40 oai-best.comm.nsdl.org

