Metadata Normalisation in Europeana The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop
Session
A. Workflow
B. Metadata normalisation with ESE
C. Approach in practice: Demo of tools used
D. Knowledge Sharing Workshop: Discussion of the practice for EuropeanaLocal
A. Workflow
CONTENT SURVEY #0
Stage #0: Content survey
Input: questionnaire
Output: specifications of content contribution (Excel specs)
Stage #1: Harvesting and package creation
Input: Excel specs
Output:
- Harvested data in XML (XML raw data)
- Collection-specific analysis tool (HTML analysis tool)
- Sample of source data: 1000 records (XML sample raw data)
- Mapping specifications template (TXT mapping template)
Stage #2: Analysis and mapping specifications
Input: Excel specs, HTML analysis tool, XML sample raw data, TXT mapping template
Output: TXT mapping specs
Stage #3: Mapping and normalisation
Input: XML raw data, TXT mapping specs
Output: XML normalised mapped data, XML profile
Quality check
[Diagram: stage #3, the Normaliser]
Stage #4: Database storage and indexing
Input: XML normalised mapped data
Output: DB, INDEX
B. Metadata normalisation with ESE
Europeana Semantic Elements (ESE)
Europeana "Schema" for the Prototype
Based on the Dublin Core Metadata Element Set (DCMES) (ISO )
49 elements (26 elements and 23 refinements)
Created through discussions in July/August 2008
ESE specialities
- europeana:country
- europeana:provider (dc:source)
- europeana:language (dc:language)
- europeana:type (dc:type, dc:format)
- europeana:year (dc:date)
- europeana:isShownBy (dc:relation)
- europeana:isShownAt (dc:relation)
- europeana:object
- europeana:uri (dc:identifier)
ESE specialities
All normalised: syntax and value
Let's examine their characteristics
Normalised ESE terms: Country
Definition: Country of the content provider. If several countries: Europe
Format: String, e.g. switzerland, germany, …
Reference: TEL controlled list. Supports the TEL interface translation mechanism
Mechanism: Manual
In portal: Facet browsing of search results
Normalised ESE terms: Provider
Definition: Organisation sending the data to Europeana
Format: String, e.g. Musées lausannois, Nasjonalbiblioteket, …
Reference: Europeana controlled list of content providers
Mechanism: Manual, but can potentially be automated
In portal: Facet browsing of search results
Normalised ESE terms: Language
Definition: Language of the provider's country (ESE: languages of the metadata)
Format: 2 letters, e.g. it, no, fr, en, es, …
Reference: ISO 639-1 language codes
Exception: If several languages: "mul"
Mechanism: Manual, but can potentially be automated
In portal: Facet browsing of search results
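The slide notes that language normalisation is manual but could be automated. A minimal sketch of what such automation might look like follows. The function name and the five-code sample set are illustrative assumptions, not part of the Europeana tooling; a real check would use the full ISO 639-1 list.

```python
# Tiny stand-in for the full ISO 639-1 code list (assumption: only the
# example codes from the slide are included here).
ISO_639_1_SAMPLE = {"it", "no", "fr", "en", "es"}


def normalise_language(codes):
    """Return a europeana:language value for a list of raw language codes.

    Hypothetical helper: one known code is returned lower-cased,
    several distinct codes collapse to "mul" as the slide specifies.
    """
    cleaned = sorted({c.strip().lower() for c in codes if c.strip()})
    if not cleaned:
        raise ValueError("no language code supplied")
    if len(cleaned) > 1:
        return "mul"  # exception rule from the slide: several languages
    code = cleaned[0]
    if code not in ISO_639_1_SAMPLE:
        raise ValueError(f"unknown language code: {code!r}")
    return code
```

Unknown single codes raise an error rather than passing through, so that a typo in the source data is caught before indexing.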
Normalised ESE terms: Type
Definition: Type of the original object
Format: String
Reference: 4 Europeana types: IMAGE, TEXT, SOUND, VIDEO
Mechanism: Manual: mapping specified by the content provider
In portal: Categorisation display; facet browsing of search results
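The type mapping is specified by the content provider and then applied to each record. The sketch below shows one way to apply such a mapping. The mapping entries and the function name are hypothetical; only the four target types come from the slide.

```python
EUROPEANA_TYPES = {"IMAGE", "TEXT", "SOUND", "VIDEO"}

# Hypothetical provider-specified mapping from source values to Europeana types.
EXAMPLE_TYPE_MAPPING = {
    "photograph": "IMAGE",
    "manuscript": "TEXT",
    "film": "VIDEO",
}


def map_type(source_value, type_mapping):
    """Return the Europeana type for a source value, or None if unmapped."""
    target = type_mapping.get(source_value.strip().lower())
    if target is None:
        return None  # unmapped value: the record has no Europeana type
    if target not in EUROPEANA_TYPES:
        raise ValueError(f"mapping targets an unknown type: {target!r}")
    return target
```

Returning None for unmapped values leaves it to a later step to decide what happens to records without a Europeana type.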
Normalised ESE terms: Year
Definition: Date of creation of the original object (analog or born digital)
Format: 4 digits [YYYY], e.g. 1950
Reference: Europeana year
Mechanism: Automatic extraction with the "YearExtractor" converter
In portal: Facet browsing of search results; browsing by time (timeline)
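To illustrate the idea of automatic year extraction, here is a minimal sketch that pulls standalone four-digit groups out of a free-text date value. It is not the actual "YearExtractor" converter, whose rules are not described in the slides; the function name and the regular expression are assumptions.

```python
import re

# A run of exactly four digits that is not part of a longer digit run.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def extract_years(date_value):
    """Return all standalone four-digit groups found in a dc:date string."""
    return YEAR_RE.findall(date_value)


print(extract_years("1950-03-12"))     # ['1950']
print(extract_years("ca. 1890-1895"))  # ['1890', '1895']
```

Values with no four-digit group, such as "12/03/50", yield an empty list and would need manual attention.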
Normalised ESE terms: isShownBy
Definition: URL to the digital object
Format: URL
Mechanism: Automatic or manual
In portal: Linking
Normalised ESE terms: isShownAt
Definition: URL to the digital object with context
Format: URL
Mechanism: Automatic or manual
In portal: Linking
Normalised ESE terms: Object
Definition: URL to the digital object as thumbnail
Format: URL
Mechanism: Automatic or manual
In portal: Display
Normalised ESE terms: URI
Definition: Record identifier for the Europeana system
Format: URI
Mechanism: Automatic: special algorithm guaranteeing uniqueness (and integrity) of records
In portal: MyEuropeana; full digital object view in Europeana
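The slides do not describe the algorithm used to generate record URIs. Purely as an illustration of how a deterministic identifier can be derived from a record, the sketch below hashes a collection identifier and a local record identifier with SHA-256. The base URL, the function name, and the choice of hash are all assumptions and should not be read as the Europeana method.

```python
import hashlib

# Hypothetical base URL, for illustration only.
BASE_URI = "http://example.org/record/"


def make_record_uri(collection_id, record_id):
    """Derive a stable URI from a collection id and a local record id."""
    # The unit-separator character keeps ("a", "bc") distinct from ("ab", "c").
    key = f"{collection_id}\x1f{record_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return BASE_URI + digest
```

Because the URI depends only on its inputs, re-running normalisation on the same record yields the same identifier.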
C. Approach in practice: Demo of tools used
Metadata normalisation in practice
Demo of the stage #3 workflow (Mapping & Normalisation):
1. Go through the data of example collection #1
2. Practical exercise: let's normalise example collection #2 for Europeana!
3. Two examples of known issues
SUBVERSION (SVN)
COLLECTION FOLDER: SOURCE XML, MAPPING SPECS TXT, OUTPUT XML, MAPPING/NORM. SPECS XML
Example 1: "Midas" collection
83 moving image records from the Association des Cinémathèques Européennes
- Harvested data
- Fields mapping / Type values mapping specs
- Analysis file (source data)
- Mapping file
- Profile file
- Analysis file + sample (normalised data)
Example 2: “Outsider Art Museum” collection 4142 records from the Musées Lausannois
Known issues with mapping/profile files
1. Wrong syntax in the mapping file causes errors in profile.xml: using "=>" inside a comment in mapping.txt creates a mapping entry in profile.xml! Ex: ………
BEFORE
AFTER
Known issues with mapping/profile files
2. Wrong syntax in the mapping file causes errors in profile.xml: there should be two blanks, not one, between "=>" and "N/A"; otherwise the mapping specification is not well formatted as XML in profile.xml. Ex: ………………….
[Screenshots: MAPPING.TXT and PROFILE.XML, shown twice] profile.xml with error: 2 white spaces!
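A spacing problem like the one in known issue 2 can be caught before the profile is generated. The sketch below follows the rule as stated on the slide (two blanks between "=>" and "N/A") and reports offending line numbers. The function name and the sample mapping lines are hypothetical; the real mapping.txt syntax is not reproduced in the slides.

```python
import re

# Captures the run of blanks between "=>" and "N/A".
NA_RE = re.compile(r"=>( *)N/A")


def find_bad_na_spacing(mapping_text):
    """Return 1-based line numbers where "=>" and "N/A" are not two blanks apart."""
    bad = []
    for number, line in enumerate(mapping_text.splitlines(), start=1):
        match = NA_RE.search(line)
        if match and len(match.group(1)) != 2:
            bad.append(number)
    return bad


# Hypothetical mapping lines, for illustration only.
sample = "title => dc:title\nrights =>  N/A\ncoverage => N/A"
print(find_bad_na_spacing(sample))  # [3]
```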
Documentation in Europeana context
- Europeana Semantic Elements (ESE) v3.1
- "Europeana – Data Offline Preparation"
- Commented version of "profile.xml"
- "Quality Control Checklist"
D. Knowledge Sharing Workshop: Discussion of the practice for EuropeanaLocal
Discussion
Questions about the Europeana metadata ingestion/normalisation process?
Integration and/or compatibility of this process with the EuropeanaLocal content strategy:
- Where will normalisation take place?
- By whom?
- …
Thank you
Discarding factors during normalisation
- Duplicated records
- Records without URLs to the digital object
- Records without a Europeana type (SOUND, TEXT, IMAGE, VIDEO)
- Records pointing to copyright-protected digital objects
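The discarding factors above can be read as a filter applied to each record. A minimal sketch follows, assuming records are plain dictionaries. The key names ("id", "is_shown_by", "is_shown_at", "type", "copyright_protected") and the way copyright status is flagged are hypothetical, chosen only for illustration.

```python
EUROPEANA_TYPES = {"IMAGE", "TEXT", "SOUND", "VIDEO"}


def discard_reason(record, seen_ids):
    """Return why a record would be discarded, or None if it is kept."""
    if record["id"] in seen_ids:
        return "duplicate"
    if not (record.get("is_shown_by") or record.get("is_shown_at")):
        return "no URL to digital object"
    if record.get("type") not in EUROPEANA_TYPES:
        return "no Europeana type"
    if record.get("copyright_protected"):
        return "copyright-protected"
    return None


def filter_records(records):
    """Split records into kept records and (id, reason) pairs for discards."""
    kept, discarded, seen_ids = [], [], set()
    for record in records:
        reason = discard_reason(record, seen_ids)
        if reason is None:
            kept.append(record)
        else:
            discarded.append((record["id"], reason))
        seen_ids.add(record["id"])
    return kept, discarded
```

Keeping the reason alongside each discarded identifier makes it easier to report back to the content provider.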