XCL-Tools in relation to Significant Characteristics in Planets
Manfred Thaller, Universität zu* Köln
*University at, not of, Cologne
What are “significant characteristics”? Those properties of a digital file which have to be known to enable the processing of the file within a specific setup.
Why extract them by software? To create technical metadata as required by organizational models for long-term preservation. (NLNZ)
Within Planets … … served by solutions to identify formats: formats registry / PRONOM / DROID. … and a solution for extracting and processing such characteristics: XCL.
A Vision [Diagram: Migrator (tiff -> png); Extractor (tiff XCEL, png XCEL); Comparator (tiff XCDL, png XCDL); result: 93%]
A Vision [Diagram: Extractor (appropriate XCELs); Comparator (C-Set)]
Why automate? 1 million objects, one second each: == … minutes == … hours == … working days of a computer == … hour days for a human == 7 working weeks
Why automate? 1 million objects, five minutes each: == … hours == … hour days for a human
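The arithmetic behind these two slides can be sketched as follows. The 24-hour computer day, the 8-hour human working day and the 5-day working week are assumptions, since the extracted slide text lost the figures; they are consistent with the "7 working weeks" that survives on the slide.

```python
OBJECTS = 1_000_000


def workload(seconds_per_object, objects=OBJECTS):
    """Return the total effort for a batch of objects in several units."""
    total_seconds = seconds_per_object * objects
    hours = total_seconds / 3600
    return {
        "minutes": total_seconds / 60,
        "hours": hours,
        "computer_days_24h": hours / 24,  # assumption: machine runs around the clock
        "human_days_8h": hours / 8,       # assumption: 8-hour working day
        "human_weeks_5d": hours / 8 / 5,  # assumption: 5-day working week
    }


for label, seconds in [("one second each", 1), ("five minutes each", 300)]:
    print(label)
    for unit, value in workload(seconds).items():
        print(f"  {unit}: {value:,.1f}")
```

Under these assumptions, one second per object already costs a human roughly seven working weeks, and five minutes per object is 300 times that.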
Why automate? Assumption: preservation is only feasible if the content of two digital objects can be compared without human intervention, giving a numerical estimate of their degree of similarity.
Abstract solution I
(1) Language to represent the complete content of a digital object: XCDL
(2) Language to describe any machine-readable format in a formal language: XCEL
(3) Software to extract the content of a file based upon a description as under (2) and express it in the language as specified under (1): “extractor”
(4) Software to compare two such content descriptions: “comparator”
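The role of component (4) can be illustrated with a toy comparator. This is a minimal sketch, not the Planets comparator: the content descriptions are plain dictionaries standing in for XCDL, the function name `compare` and the similarity measure (share of properties with equal values) are assumptions, and the property values are made up for illustration.

```python
def compare(description_a, description_b):
    """Return the share (0.0 to 1.0) of properties on which both descriptions agree."""
    names = set(description_a) | set(description_b)
    if not names:
        return 1.0  # two empty descriptions are trivially identical
    equal = sum(
        1
        for name in names
        if name in description_a
        and name in description_b
        and description_a[name] == description_b[name]
    )
    return equal / len(names)


# Hypothetical extractor output for a tiff and its png migration.
tiff = {"height": 429, "width": 600, "imageType": "greyscale", "compression": "none"}
png = {"height": 429, "width": 600, "imageType": "greyscale",
       "compression": "zlibDeflateInflate"}

print(f"{compare(tiff, png):.0%}")  # 3 of 4 properties agree
```

A real comparator would weight properties and use metrics suited to each data type; the point here is only that the result is a number obtained without human intervention.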
[Example listing; markup lost in extraction. Surviving tokens: height, greyscale, imageType, zlibDeflateInflate, compression, 429, uint32, 32]
Are the following items equal: VIII / 8 / eight / otto / acht / 8.0
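The question on this slide can be made concrete: the six tokens differ as data but agree once each is mapped onto a common model of "number". The following is a minimal sketch; the three-word lexicon and the function names are hypothetical and cover only the tokens shown on the slide.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
WORDS = {"eight": 8, "otto": 8, "acht": 8}  # tiny hypothetical lexicon


def roman_to_int(text):
    """Convert a Roman numeral; a smaller digit before a larger one is subtracted."""
    total = 0
    for char, following in zip(text, text[1:] + " "):
        value = ROMAN[char]
        total += -value if ROMAN.get(following, 0) > value else value
    return total


def normalise(token):
    """Map one surface representation onto a number."""
    text = token.strip()
    if text.lower() in WORDS:
        return WORDS[text.lower()]
    if text and all(char in ROMAN for char in text.upper()):
        return roman_to_int(text.upper())
    return float(text)


items = ["VIII", "8", "eight", "otto", "acht", "8.0"]
print(len({normalise(item) for item in items}) == 1)
```

Equality only becomes decidable after choosing the model the tokens are interpreted against, which is what the information model and format ontology on the following slides provide.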
VIII 8 eight otto acht Information model: „an image“
VIII 8 information model: „an image“ format ontology: „what terms are used in formats to describe image properties“
Extraction language: “how to get the terms describing an image out of a file” Information model: „what is an image“ Format ontology: „what terms are used in formats to describe image properties“
Abstract solution II
(1) A theoretical model of information (not: data) types: “image”, “text”, “audio”...
(2) Ontologies which map existing file format terminologies onto this model.
(3) A language, XCDL, which allows the content of files in different formats to be expressed using the vocabulary of the ontologies and the “grammar” of the information model.
XCDL: eXtensible Characterisation Definition Language. Purpose: describe the contents of a file in terms of an abstract model.
XCDL: text model (1) A text (= ) is composed of data (= ) plus interpretations of data according to the underlying format specification (= ).
XCDL: text model (2) Or, one level of abstraction higher, a text is composed of content-carrying tokens, accompanied by rendering info, deployment info and historical info.
This is a text … fontsize 48 unsignedInt8
Thank you! Questions?
XC(E/D)L - & related issues (originally from Sebastian Beyl)
Already known: ORIGINAL FILE + XCEL (machine-readable format description) -> Extractor -> XCDL (normdata and properties from the original file)
Problem: propertySets and their relation to normdata
[Diagram: the normdata of the original file with property 1 and property 2; in the XCDL, propertySet 1, propertySet 2 and propertySet 3 are attached to stretches of the normdata, with propertySet 1 and propertySet 2 recurring ("again!")]
Rules:
- Relation to normdata ONLY via a propertySet
- No overlapping relations
- Every propertySet definition (in one object) only once
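One possible reading of these rules can be sketched as a validity check. The model is an assumption, not the XCDL schema: a propertySet is defined once, carries its properties, and may refer to several stretches of normdata (which covers the "again!" cases in the diagram). The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertySet:
    set_id: str
    properties: tuple  # e.g. (("fontsize", 48),)
    ranges: tuple      # normdata stretches as (start, end), end exclusive


def check_rules(property_sets):
    """Return a list of rule violations; an empty list means the object is valid."""
    problems = []
    seen_ids = set()
    spans = []
    for pset in property_sets:
        if pset.set_id in seen_ids:
            problems.append(f"propertySet {pset.set_id} is defined more than once")
        seen_ids.add(pset.set_id)
        spans.extend((start, end, pset.set_id) for start, end in pset.ranges)

    # Sweep over the stretches in order and remember the furthest end seen so far.
    furthest_end, furthest_id = None, None
    for start, end, set_id in sorted(spans):
        if furthest_end is not None and start < furthest_end:
            problems.append(f"relation of {set_id} overlaps relation of {furthest_id}")
        if furthest_end is None or end > furthest_end:
            furthest_end, furthest_id = end, set_id
    return problems


valid = [
    PropertySet("ps1", (("fontsize", 48),), ((0, 4), (10, 14))),
    PropertySet("ps2", (("fontsize", 12),), ((4, 10),)),
]
print(check_rules(valid))
```

Here `ps1` applies to two separate stretches but is defined only once, so the list of violations is empty.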
Problem: recursive structures. Footnote example from koffice.org: the normdata has a property fontSize and a property footnote; the footnote in turn has normdata of its own (normdata of a property?), which again has a property (a property of the normdata of a property?). How can this be expressed in XCDL?
Problem: recursive structures. Footnote example from koffice.org
[Diagram: Object A (normdata, property fontSize) holds the property „Object B“ as footnote; Object B (normdata, property fontSize)]
Rules:
- Properties and propertySets belong to ONE object only
- The upper object always points to the lower object, so the lower object can exist on its own
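The two rules can be illustrated with a small object model. This is a sketch under assumptions: the class names, the registry and the font sizes are hypothetical, and a real XCDL document would express the reference in XML rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectRef:
    """A property value that points from an upper object to a lower one."""
    target_id: str


@dataclass
class XcdlObject:
    object_id: str
    norm_data: str
    properties: dict = field(default_factory=dict)  # belongs to this object only


def render(obj, registry, depth=0):
    """Return indented lines describing obj and every object it points to."""
    lines = [f"{'  ' * depth}{obj.object_id}: {obj.norm_data!r}"]
    for name, value in obj.properties.items():
        if isinstance(value, ObjectRef):
            lines.append(f"{'  ' * (depth + 1)}{name} ->")
            lines.extend(render(registry[value.target_id], registry, depth + 2))
        else:
            lines.append(f"{'  ' * (depth + 1)}{name} = {value}")
    return lines


object_b = XcdlObject("B", "Footnote text", {"fontSize": 8})
object_a = XcdlObject("A", "Body text", {"fontSize": 12, "footnote": ObjectRef("B")})
registry = {"A": object_a, "B": object_b}

print("\n".join(render(object_a, registry)))
print("\n".join(render(object_b, registry)))  # B is complete without A
```

Because the reference only runs downwards, Object B carries no knowledge of Object A and can be rendered or compared by itself.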
Problem: embedded objects. Example from wikipedia.de
[Diagram: the original (container) file holds text data and picture data as an embedded file; extraction yields XCDL-Object A (text data) and XCDL-Object B (image data)]
Object A handles Object B as an „image property“.
Problem: embedded objects. Example from wikipedia.de
[Diagram: XCDL-Object A (text data) handles XCDL-Object B (image data) as an „image property“; Object B can also stand as a standalone image XCDL]
Rules: If the upper object (A) is not readable or cannot be used for comparison, the embedded object can be handled as a „standalone“ XCDL.
Problem: embedded objects. Example from wikipedia.de
[Diagram: XCDL-Object A (text data) contains XCDL-Object B in an UNKNOWN IMAGE FORMAT; a second parsing follows once the image format is known]
Rules: If the lower object (B) cannot be parsed, its raw data can be stored for later parsing, without data loss or comparison problems for the upper object (A).
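The "store raw data, parse later" rule can be sketched as follows. The function names and the dictionary layout are hypothetical; the only format knowledge used is the standard PNG layout (8-byte signature, then the IHDR chunk with width and height as big-endian 32-bit integers at byte offsets 16 to 24).

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def parse_png(raw):
    """Read width and height from the IHDR chunk of a PNG byte string."""
    width, height = struct.unpack(">II", raw[16:24])
    return {"format": "png", "width": width, "height": height}


def extract_embedded(raw, parsers):
    """Parse the embedded object if its format is known, else keep the raw bytes."""
    for signature, parser in parsers.items():
        if raw.startswith(signature):
            return {"status": "parsed", "properties": parser(raw)}
    return {"status": "unparsed", "raw": raw}  # nothing is lost


def reparse(embedded, parsers):
    """Second parsing pass, run once more formats are known."""
    if embedded["status"] == "unparsed":
        return extract_embedded(embedded["raw"], parsers)
    return embedded


# A made-up PNG header: signature, IHDR chunk length, chunk type, width, height.
picture = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 600, 429)

first_pass = extract_embedded(picture, parsers={})  # format still unknown
second_pass = reparse(first_pass, parsers={PNG_SIGNATURE: parse_png})
print(first_pass["status"], second_pass["status"])
```

Object A is unaffected by either pass: it only holds the embedded object as a property, whatever state that object is in.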