Preservation Seminar 8 Jan CASPAR: Long term preservation of digitally encoded information David Giaretta
CASPAR aims
  Produce tools and techniques to support digital preservation and make it easier to share the cost
    – must be relatively easy to use
    – must have a low “buy-in” in terms of effort required for adoption
    – must avoid requiring wholesale change of everyone else’s systems
    – must be decentralised and reproducible so that it can live on after the formal end of the CASPAR project
    – must be “preservable”
    – must be open: open source, open standards
  Cannot do everything but should do something broadly useful
  Working closely with the UK Digital Curation Centre
Digital Preservation…
  Easy to do…
    …as long as you can provide money forever
  Easy to test claims about tools…
    …as long as you live a long time
Validation
  Demonstrate theoretical basis
  “Accelerated lifetime” tests
    – Changes in hardware
    – Changes in environment
    – Changes in Designated Community
  Demonstrate increased trustworthiness
    – Measured using draft Certification Standard
Digital Preservation
  Need to preserve information & knowledge – not just “the bits”
    – Documents, videos are rendered – simple?
    – Data must be processed, in new ways – harder
  Need to manage knowledge to keep archives alive through time
    – Preservation is a process, not a one-time event
    – Preservation is expensive – costs need to be shared
  The alternative is money – endless supplies of money
  The Open Archival Information System (OAIS) Reference Model (ISO 14721) provides a general conceptual framework
Disincentives for preservation: cost
  [Graph: Money against Time, with the Budget available marked]
  If cost of preserving old information increases…
  Need to show that costs are contained
Immediate benefits of Digital Preservation: Use of Unfamiliar Data
  Global Cyber-Infrastructures allow users to find and try to use data from many sources
    – Some sources will be familiar
    – Most available sources will be unfamiliar
  How can one be sure that the unfamiliar data is used correctly?
  Garbage in – garbage out
  Need to be able to deal with unfamiliar data whether it is contemporary or old (preserved)
OAIS Reference Model
  ISO 14721: Reference Model for an Open Archival Information System (OAIS)
  An OAIS is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.
  Long Term Preservation: The act of maintaining information, in a correct and Independently Understandable form, over the Long Term.
  Long Term: long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community.
  Designated Community: An identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities.
  Independently Understandable: has sufficient documentation to allow the information to be understood and used by the Designated Community without having to resort to special resources not widely available, including named individuals.
  [Graphic: “OASIS” and “OAI”, each marked with an X]
OAIS Reference Model – Functional Model
OAIS Information Model
  [Diagram] An Information Object is a Data Object interpreted using 1+ Representation Information
  A Data Object is a Physical Object or a Digital Object; a Digital Object is made up of 1+ Bit Sequences
  Representation Information is itself interpreted using further Representation Information
  Recursion ends at the KNOWLEDGE BASE of the DESIGNATED COMMUNITY (this knowledge will change over time and region)
Rep.Info. Classification
[Diagram of a Representation Information network; labels: FITS FILE, FITS STANDARD, PDF STANDARD, FITS JAVA s/w, JAVA VM, PDF s/w, FITS DICTIONARY SPECIFICATION, UNICODE SPECIFICATION, XML SPECIFICATION]
Representation Information
  The Data Object is “interpreted using” the Representation Information (RepInfo)
  The Reference Model is designed to ensure that an OAIS is not set the impossible task of having to provide all possible RepInfo immediately
  Hence:
    – Take account of the Designated Community and its associated Knowledge Base
  The amount of RepInfo is not fixed
    – Additional RepInfo will be needed over time
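The role of the Knowledge Base can be illustrated with a small graph walk. This is a minimal sketch, not CASPAR code: the "interpreted using" links in the table are illustrative assumptions, and `repinfo_needed` is a hypothetical helper. It follows the links from a data object and stops wherever an item is already in the Designated Community's Knowledge Base.

```python
from typing import Dict, List, Set

# Hypothetical "interpreted using" links; these edges are illustrative assumptions.
INTERPRETED_USING: Dict[str, List[str]] = {
    "FITS file": ["FITS standard", "FITS Java s/w"],
    "FITS standard": ["PDF standard"],
    "PDF standard": ["PDF s/w"],
    "FITS Java s/w": ["Java VM"],
}


def repinfo_needed(obj: str, knowledge_base: Set[str]) -> Set[str]:
    """Return the RepInfo that must be supplied for obj, given what is already known."""
    needed: Set[str] = set()
    stack = list(INTERPRETED_USING.get(obj, []))
    while stack:
        item = stack.pop()
        if item in knowledge_base or item in needed:
            continue  # recursion ends at the Knowledge Base
        needed.add(item)
        stack.extend(INTERPRETED_USING.get(item, []))
    return needed
```

A community that already understands PDF and Java needs less RepInfo than one that knows nothing; if that knowledge is lost over time, the same call with a smaller Knowledge Base returns a larger set.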
Early Results
  High level architecture for sharing cost and access to Representation Information
  Detailed examinations of specific datasets to understand what is really needed to keep them understandable and usable
Rep. Info. Use and maintenance
Registry for Representation Info
  1 – User gets data from archive. Data has an associated Curation Persistent Identifier (CPID)
  2 – User is unfamiliar with the data so requests Rep.Info. using the CPID
  3 – User receives Rep.Info., which has its own CPID in case it is not immediately usable
  The Digital Object could have RepInfo packed with it, as well as the CPID
  Support automated access & processing
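The lookup loop on this slide can be sketched as follows. This is an assumption-laden illustration rather than the registry's real interface: the in-memory `REGISTRY`, the `cpid:` identifiers and `fetch_repinfo` are all hypothetical. The user keeps following CPIDs until reaching RepInfo that is already usable.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class RepInfo:
    description: str
    cpid: Optional[str] = None  # CPID of the RepInfo that describes this RepInfo


# Hypothetical registry contents; the identifiers are made up for illustration.
REGISTRY: Dict[str, RepInfo] = {
    "cpid:fits-format": RepInfo("FITS standard document (PDF)", cpid="cpid:pdf-format"),
    "cpid:pdf-format": RepInfo("PDF specification"),
}


def fetch_repinfo(cpid: str, usable: Set[str]) -> List[RepInfo]:
    """Follow CPIDs from the data's CPID until the user reaches usable RepInfo."""
    chain: List[RepInfo] = []
    seen: Set[str] = set()
    current: Optional[str] = cpid
    while current is not None and current not in seen:
        seen.add(current)  # guard against circular references
        entry = REGISTRY[current]
        chain.append(entry)
        if current in usable:
            break  # the user can already work with this RepInfo
        current = entry.cpid
    return chain
```

A user who can read PDF but not FITS would call `fetch_repinfo("cpid:fits-format", {"cpid:pdf-format"})` and receive both entries in order.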
CASPAR information flow architecture
  [Diagram; surviving label: Rep Info]
CASPAR Testbeds
  Three testbeds
    – Cultural: UNESCO
    – Performing Arts: INA, IRCAM
    – Scientific: ESA and CCLRC
  Complex, multi-source, multifaceted data
  Many common preservation & evaluation & validation issues
  Some specific requirements on preservation (technical, delivery, legal)
    – Specific user communities/Knowledge Bases
  Also test the OAIS model
Science: CCLRC example
  World map of ionosondes
Example of use of RepInfo
  Laser facility produces binary data normally used by proprietary software
  Describe using EAST data description language
  Use in generic application (shown here) to display/process
Some Issues
  Difficult to derive physical quantities from data
    – Can be analysed in multiple ways
    – Raises fundamental questions about Representation Information
  Common automated method is proprietary
    – Data structure also proprietary
    – Paper documentation: restricted access
  Provenance and trust
ESA example
  GOME: Global Ozone Monitoring Experiment on ERS-2
GOME data processing
  GOME Level 4 product: Integration of GOME, other data and models
  GOME Level 3 product: Integration of time and space data
  GOME Level 2 product: Ozone profile at given location
Some Issues
  Provenance and Context of processed data
    – relationship to Representation Information of raw data and Knowledge Base of Designated Community
UNESCO examples
  DATA:
    – Scanned documents and maps
    – Aerial and close range photography (Digital photogrammetry)
    – Monument measurements (Laser scanning)
    – Satellite images (Remote sensing and image processing)
    – Multi-scale digital cartography (Geographic information systems (GIS) and CAD)
    – 3D models, virtual tours (Computer visualization)
  Mandatory Documentation:
    – Identification of property
    – Description of property
    – Justification of inscription
    – State of conservation and factors affecting the property
    – Protection and Management
    – Monitoring
    – Documentation
    – Contact information of responsible authorities
    – Signature on behalf of the State Party(ies)
  World Heritage List
Performing Arts examples
  Examples:
    – Score
    – MAX/MSP patches
    – Additional instructions
  Figure 2: Preservation of interactive multimedia performances
  [Diagram labels: Motion Capture and Processing; Motion Analysis and Recognition; Motion-Multimedia Mapping Strategy; Multimedia Generation; GUI (for monitor & control); Motions; 3D motion data; Mapping Parameters; Multimedia output]
Some Issues
  What is Preservation of “performability”?
    – Composer’s intention
  Authenticity
  Proprietary software and hardware
  Copyright
  Digital Rights Management
Shared Infrastructure
  Registries of Representation Information
  Persistent Identifier name resolvers
    – DOI? ARK? URL? None are guaranteed
  Interfaces – support preservation and interoperability
  Standards – Preservation Description Information
    – Fixity, Provenance, Reference, Context
Accreditation/Certification for repositories
  Long-standing demand for ability to measure trustability of digital repositories
  Part of OAIS “roadmap”
  RLG/NARA working group
    – Version 1.0 Audit and Certification Checklist about to be released
  New open workgroup to produce ISO standard for Audit and Certification
    – See http://mailman.ccsds.org/cgi-bin/mailman/listinfo/moims-rac to join the mailing list
Knowledge at the heart of preservation
  Knowledge driven approach
  Knowledge management to support long-term preservation of concepts/information including:
    – Single, complex, on demand, interactive objects
    – DRM
    – Authenticity
    – Access
    – Storage
    – Designated Community – descriptions
  Knowledge base definition ontologies
Possible Infrastructure Build-up
  [Diagram labels: European Preservation Infrastructure; Task Force on Permanent Access; Alliance; Other Alliance Members; CCLRC Curation Activities; CASPAR; Other CCLRC projects; FP7 projects]
WHEN
  Component architecture and prototypes by month 12
  Framework architecture month 18
  Component integration months
  Testbed implementations months
  Project completion month 42
Conclusions
  Information and Knowledge – needs more than just storing the “bits”
  Understanding and being able to process the vast amount of unfamiliar data which is available is hard
  It is expensive
    – Costs must be shared
  So far the Open Archival Information System Reference Model provides a conceptual framework
    – Many similarities can be exploited
    – Many subtleties need to be explored
  Watch this space
BACKUP SLIDES
Example RepInfo Label
  A Label is itself RepInfo. It provides a way to collect together, in a sensible way, many individual pieces of RepInfo
Re-using RepInfo
  Existing RepInfo can be used to build up further RepInfo
    – E.g. refer to existing RepInfo in labels
Versioning and LID
  Each object has a unique identifier
  Versions of an object share a “logical ID” (LID)
  Simply using the LID gives the latest version
  Can specify a particular version
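The LID behaviour described here can be sketched in a few lines. The class and method names are hypothetical and the storage is a plain in-memory dictionary, so this only illustrates the rule that a bare LID resolves to the latest version while a version number selects a specific one.

```python
from typing import Dict, List, Optional


class VersionedStore:
    """Maps a logical ID (LID) to the unique IDs of its versions, oldest first."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}

    def add(self, lid: str, unique_id: str) -> int:
        """Register a new version under lid and return its 1-based version number."""
        self._versions.setdefault(lid, []).append(unique_id)
        return len(self._versions[lid])

    def resolve(self, lid: str, version: Optional[int] = None) -> str:
        """Return the unique ID for lid: the latest by default, or a given version."""
        ids = self._versions[lid]
        if version is None:
            return ids[-1]
        if not 1 <= version <= len(ids):
            raise KeyError(f"{lid} has no version {version}")
        return ids[version - 1]
```

For example, after adding `"uid-a"` and then `"uid-b"` under the same LID, `resolve(lid)` returns `"uid-b"` and `resolve(lid, 1)` returns `"uid-a"`.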
Clients
  DCC Registry:
    – Web browser
    – Thick client
  Any Registry
    – Applications using API
GUI access to Registry
Classifications
  Many Classification Schemes
  Help to find RepInfo
Initial RepInfo
  Simple text
    – ASCII
    – Unicode
    – UTF7/8
  PDF, Word(!)
  FITS format
  FITS standard dictionaries
  Things that are “MISSING”
RepInfo entry
  Simple command line tool
Creating RepInfo
  There are many tools which can be used to create RepInfo:
    – Simple text editor to create text describing the data
    – Complex tools to capture data description, e.g. EAST (see next slides), DFDL, etc.
    – Programming languages of various sorts
EAST descriptions
OASIS tool for creating EAST descriptions
  [Screenshot: OASIS]
Example of EAST description
Using RepInfo
  A pointer to RepInfo can be attached to data
  The RepInfo can be used to
    – Display
    – Examine
    – Process
    – Re-use the data
Simple Buy-In
  Need to add RepInfo to your Data Objects?
  Does the RepInfo already exist?
    – Yes: get its ID and put that in a label
    – No: register what you have and be assigned an ID. Add more details later when needed
  Or others can add more details
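The yes/no decision on this slide amounts to a "get or register" operation. The sketch below is hypothetical throughout (the `Registry` class, its methods and the label layout are assumptions, not the DCC registry API); it shows re-use of an existing ID, registration of a partial entry, and later enrichment.

```python
import uuid
from typing import Dict, List


class Registry:
    """Toy in-memory registry of RepInfo, keyed by a name chosen by the depositor."""

    def __init__(self) -> None:
        self._id_by_name: Dict[str, str] = {}
        self._details: Dict[str, List[str]] = {}

    def get_or_register(self, name: str, details: str = "") -> str:
        """Return the existing ID for name, or register what we have and assign one."""
        if name in self._id_by_name:
            return self._id_by_name[name]
        new_id = str(uuid.uuid4())
        self._id_by_name[name] = new_id
        self._details[new_id] = [details] if details else []
        return new_id

    def add_details(self, repinfo_id: str, more: str) -> None:
        """Add more details later; the depositor or anyone else may do this."""
        self._details[repinfo_id].append(more)

    def details(self, repinfo_id: str) -> List[str]:
        return list(self._details[repinfo_id])


def make_label(data_name: str, repinfo_id: str) -> Dict[str, str]:
    """A label here is just a pointer from the data to its RepInfo ID."""
    return {"data": data_name, "repinfo_id": repinfo_id}
```

The effort needed for adoption stays low because a depositor only needs one call and one small label per Data Object.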
Preservation Issues
  Given a file or a stream of bits, how does one know what Representation Information is needed? (This question applies to Representation Information itself as well as to the digital objects we are primarily interested in preserving and using.) How does one know, for example, if this thing is in FITS format?
    – Someone may simply “know” what it is and how to deal with it, i.e. the bits are within the Knowledge Base
    – One may be able to recognise the format by looking for various types of patterns
    – One may feed the bits into all available interpreters to see which accept the data as valid
    – Other means…
  The only safe way: have an associated label which points to the appropriate Representation Information
    – Note this does not exclude the other methods, e.g. for data rescue
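Pattern-based recognition, with the label taking priority, might look like the sketch below. The two signatures are real (a FITS file starts with the header keyword `SIMPLE  =`, a PDF with `%PDF-`), but the function names and the idea of passing the label in as a string are assumptions made for illustration.

```python
from typing import Dict, Optional

# Leading byte patterns for two formats mentioned in the slides.
SIGNATURES: Dict[str, bytes] = {
    "FITS": b"SIMPLE  =",  # first header card of a FITS file
    "PDF": b"%PDF-",
}


def guess_format(data: bytes) -> Optional[str]:
    """Recognise a format from its leading bytes; None if nothing matches."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def identify(data: bytes, label: Optional[str] = None) -> Optional[str]:
    """Prefer the associated label; fall back to pattern matching (e.g. data rescue)."""
    if label is not None:
        return label
    return guess_format(data)
```

A match on a signature is only a hint, since unrelated data can begin with the same bytes; that is why the label is treated as authoritative when present.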
Example Label:
Access to Registry
  Send a letter? Phone? ?
  Read the Web page and copy the relevant information?
  Software Access?
    – URL
    – Web Service
    – Application?
Registries – software access
  Roll-your-own?
Lazy person’s Registry/Repository
  Use existing standards
    – UDDI: no repository
    – ebXML
  Additional advantage: helps integration with the GRID
Registry/Repository access
  Interface and protocols – JAXR “standard”
    – Can talk to UDDI and ebXML registries
  FreebXML implementation
    – many access methods: URL, Web Services, API, etc.
Persistent IDs
  Findability
    – Persistent IDs: DOI, URN, ARK, PURL, etc.
  What can we rely on?
  Don’t put all your eggs in one basket
Example
  [XML example not recoverable; surviving fragment: e1fe9271-cd a63e-b112ebf792c /]
  For example, the ARK identifier is created by appending the string in "value" to that in the resolver of resolverType="ark".
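The appending rule can be sketched as below. The resolver base URLs are hypothetical placeholders (the slide's actual values did not survive), and `build_identifier` is an illustrative helper, not part of any registry API.

```python
from typing import Dict

# Hypothetical resolver strings keyed by resolverType; these are not real services.
RESOLVERS: Dict[str, str] = {
    "ark": "http://resolver.example.org/",
    "purl": "http://purl.example.org/",
}


def build_identifier(resolver_type: str, value: str,
                     resolvers: Dict[str, str] = RESOLVERS) -> str:
    """Append the stored value to the resolver string for the given resolverType."""
    try:
        base = resolvers[resolver_type]
    except KeyError:
        raise ValueError(f"no resolver configured for type {resolver_type!r}")
    return base + value
```

Keeping the value separate from the resolver means that if one resolver disappears, only the table entry changes and the stored values stay valid.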
Registry/Repository (regrep)
  Has to be a trusted repository (of RepInfo)
    – Authenticity of RepInfo
    – Access control
    – Certificates/Digests (are they trustable over the long term?)
  Extensibility
  Distributed
    – Share the effort
  Notification Service
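A digest check of the kind hinted at under Certificates/Digests could be sketched as follows, using Python's standard `hashlib`. Recording more than one algorithm is an assumption added here as one possible hedge against an algorithm weakening over the long term; the function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable


def record_digests(data: bytes,
                   algorithms: Iterable[str] = ("sha256", "sha512")) -> Dict[str, str]:
    """Compute a digest of the RepInfo bytes under each named algorithm."""
    return {name: hashlib.new(name, data).hexdigest() for name in algorithms}


def verify(data: bytes, recorded: Dict[str, str]) -> bool:
    """True only if every recorded digest still matches the data."""
    return all(
        hashlib.new(name, data).hexdigest() == expected
        for name, expected in recorded.items()
    )
```

A digest only shows that the bits are unchanged since it was recorded; whether the recorded digest itself can be trusted is the repository's problem, which is why the slide asks for a trusted repository.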
Operating Registries
  See RegistryProcedures