SCAPE Training, Guimarães. Characterisation: An introduction to the identification and characterisation of file formats. Carl Wilson, Open Planets Foundation. This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT (Grant Agreement number ).
About Us: Carl Wilson, Open Planets Foundation. SCAPE Project: an EU-funded research project, SCAlable Preservation Environments.
About You: Once around the room: name, where you work, what you do, why you’re here. DO ask questions, or tell me to slow down, or ask me to repeat something.
File Formats: What is a file format? A “standard” method of encoding data for storage. It may follow an open specification or a proprietary one (open preferred), or simply a loosely documented convention.
Who Cares About Formats? Operating systems: to open a file with an application that can interpret or render it. Web servers: to negotiate Content-Type in HTTP requests. Memory institutions: to identify software stacks that can render or extract meaning from a file, now or at a later date. More generally: everyone with digital content, whether they know it or not.
Some Uses of Format Information: Format information associates a file with software that can interpret and/or render its contents. It can be used to find documentation and specifications that help interpret a file’s contents. It is a first step in preservation planning: knowing what you have.
File Name Extension: A file name suffix separated from the file base name by a dot “.”. Examples: .pdf, .txt, .jpg, .doc, .docx. This has worked for a number of years, BUT any user with the right permissions can change a file extension, and bytes aren’t always transferred with a name.
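A minimal Python sketch of extension-based identification, to show how little it has to go on. The helper name `extension_of` is made up for illustration. It only ever inspects the name, never the bytes, so renaming a file changes its answer.

```python
from pathlib import PurePath


def extension_of(file_name: str) -> str:
    """Return the lower-cased final suffix of a file name, or "" if there is none."""
    return PurePath(file_name).suffix.lower()


print(extension_of("report.PDF"))      # final suffix, lower-cased
print(extension_of("archive.tar.gz"))  # only the last suffix is returned
print(extension_of("README"))          # no dot in the name, so no extension
```

A byte stream that arrives without a name (for example, from a pipe) gives this approach nothing to work with.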
Internet Media (MIME) Types: The format identifiers used by the web. Examples: text/plain, text/html, image/jpeg. They don’t readily hold extra information such as format version, but may be extended.
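As a rough illustration, Python's standard `mimetypes` module maps file names to MIME types. The wrapper `mime_from_name` and the fallback to `application/octet-stream` are choices made for this sketch, not part of the original slides.

```python
import mimetypes


def mime_from_name(file_name: str) -> str:
    """Guess a MIME type from the file name alone, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(file_name)
    return guessed or "application/octet-stream"


for name in ("notes.txt", "index.html", "photo.jpg", "report.pdf", "data.zzzunknown"):
    print(name, "->", mime_from_name(name))
```

Note that every PDF file, whatever its version, maps to the same type here, which is the "no version information" limitation mentioned above.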
Apple’s Alternatives: Pre-OS X versions of Mac OS used Creator and Type codes. Creator: the software that created the file. Type: the type of information, e.g. TEXT. More flexible than extensions, but no longer used. Recent OS X versions also use Uniform Type Identifiers.
PRONOM Unique Identifiers (PUIDs): PRONOM is a web-based registry of file format information, created and hosted by the National Archives of the UK in 2002. It uses PUIDs to identify file formats: fmt/15 == Acrobat PDF 1.1, fmt/16 == Acrobat PDF 1.2, fmt/17 == Acrobat PDF
The Unix File Utility: A standard Unix program for identifying the data in a file. First released in 1973; written in C, so it requires operating-system-dependent compilation. The open source version used in Linux distributions was written in 1986. Identification is based upon compiled “magic” files. It provides text information about files, or MIME types with the right options.
FIDO (Format Identification of Digital Objects): An open source format identification tool, based upon the PRONOM signature data compiled to regular expressions. Written in Python, so it can be run on different operating systems. Richer command line syntax than DROID.
Apache Tika: An open source toolkit for detecting and extracting metadata and structured text from files. It performs format identification and deeper characterisation (more on that later). Java-based, so it will run on different platforms. Returns MIME types as format identifiers.
How Do These Tools Identify Formats? They exploit “common features” of the format. PDF start of file: %PDF-1.1 (PDF version 1.1), %PDF-1.2 (PDF version 1.2), %PDF-1.6 (PDF version 1.6). Tika and File simply look for files starting with the string %PDF- and return the MIME type. FIDO, however…
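The header check described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the code used by Tika, File, or FIDO; the names `looks_like_pdf` and `pdf_version` and the single-digit version pattern are assumptions made for the example.

```python
import re
from typing import Optional

# Matches a header such as "%PDF-1.6"; re.match only matches at the start of the data.
PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def looks_like_pdf(data: bytes) -> bool:
    """Coarse check: does the data start with the string %PDF-?"""
    return data.startswith(b"%PDF-")


def pdf_version(data: bytes) -> Optional[str]:
    """Finer check: return the version from the header, or None if there is no header."""
    match = PDF_HEADER.match(data)
    return match.group(1).decode("ascii") if match else None


sample = b"%PDF-1.6\n% rest of the file would follow here\n"
print(looks_like_pdf(sample), pdf_version(sample))
```

The coarse check is enough to return a MIME type; the finer one is the kind of extra detail needed to tell one PDF version from another.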
FIDO & PDF Identification: FIDO identifies the different PDF versions, each of which has a PUID. FIDO also looks for an end-of-file marker for PDFs: %%EOF. This could be a problem…
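The slide does not say what the problem is. One plausible case, offered here as an assumption, is a file with extra bytes after the marker, which a strict end-of-file test would reject. The sketch below is not FIDO's implementation; the function names, the sample data, and the 1024-byte window are all hypothetical.

```python
def ends_with_pdf_eof(data: bytes) -> bool:
    """Strict check: ignoring trailing ASCII whitespace, the data must end with %%EOF."""
    return data.rstrip().endswith(b"%%EOF")


def has_pdf_eof_near_end(data: bytes, window: int = 1024) -> bool:
    """Lenient check: %%EOF must appear somewhere in the last `window` bytes."""
    return b"%%EOF" in data[-window:]


clean = b"%PDF-1.4\n...objects...\n%%EOF\n"
padded = clean + b"\x00" * 16  # hypothetical trailing bytes added after the marker

print(ends_with_pdf_eof(clean), ends_with_pdf_eof(padded))
print(has_pdf_eof_near_end(clean), has_pdf_eof_near_end(padded))
```

The strict check accepts the clean sample and rejects the padded one, while the lenient check accepts both. A signature that requires both a header and a trailer is more specific, but also easier to break.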