FITS: The File Information Tool Set
Background
- FITS is part of the second-generation Harvard University Library Digital Repository Service (DRS2), which supports content models and METS/PREMIS object descriptors
- Developed in Fall 2008
- First public release in Spring 2009: http://fits.googlecode.com
Why?
- We needed an automatic way to identify and extract metadata for a wide range of file types
- No single file analysis tool satisfied our needs
Design Goals
- Act as a wrapper around other open source tools
- Extensible
- Usable both as a standalone command-line tool and through an API
- Allow priority setting for tools
- Open source
The Tools
Current tools:
- Jhove 1.5
- Exiftool
- National Library of New Zealand Metadata Extractor (NLNZ)
- DROID
- FFIdent
- File Utility
3 Categories:
- File identification (all of them)
- Metadata extraction (Jhove, Exiftool, NLNZ)
- Format validation (Jhove)
Process
Features
- Conflict management
  - Value normalization: “inches” vs “2”
  - Tool prioritization
- Format tree for understanding more specific format identities, e.g. PDF/A is a more specific version of PDF
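The three features above can be sketched in a few lines. This is a minimal illustration, not the FITS consolidator itself: the `Report` class, the `UNIT_ALIASES` table, the `FORMAT_PARENT` tree, and the `AGREEMENT`/`CONFLICT` labels are all hypothetical names chosen for the example (only `SINGLE_RESULT` appears in the FITS output shown in these slides). It also assumes that “inches” vs “2” refers to a unit field such as the TIFF resolution unit, where the numeric code 2 means inches.

```python
from dataclasses import dataclass

# Assumed alias table for a resolution-unit field (TIFF code 2 = inch, 3 = centimeter).
UNIT_ALIASES = {
    "2": "inches", "inch": "inches", "in": "inches",
    "3": "centimeters", "cm": "centimeters",
}

# Hypothetical format tree, stored as child -> parent.
FORMAT_PARENT = {"PDF/A": "PDF"}


@dataclass
class Report:
    tool: str   # name of the tool that produced the value
    value: str  # raw value exactly as the tool reported it


def normalize(value):
    """Map tool-specific spellings of a unit onto one canonical spelling."""
    key = value.strip().lower()
    return UNIT_ALIASES.get(key, key)


def consolidate(reports, priority):
    """Return (value, status) for one metadata field.

    The value comes from the highest-priority tool; the status says whether
    the tools agreed once their values were normalized.
    """
    if not reports:
        raise ValueError("at least one report is required")
    rank = {tool: i for i, tool in enumerate(priority)}
    # Tools missing from the priority list sort after all listed tools.
    ordered = sorted(reports, key=lambda r: rank.get(r.tool, len(priority)))
    distinct = {normalize(r.value) for r in ordered}
    if len(ordered) == 1:
        status = "SINGLE_RESULT"
    elif len(distinct) == 1:
        status = "AGREEMENT"   # illustrative label
    else:
        status = "CONFLICT"    # illustrative label
    return normalize(ordered[0].value), status


def is_more_specific(candidate, other):
    """True if `candidate` is a descendant of `other` in the format tree."""
    parent = FORMAT_PARENT.get(candidate)
    while parent is not None:
        if parent == other:
            return True
        parent = FORMAT_PARENT.get(parent)
    return False


def most_specific(formats):
    """Pick the most specific format name among those reported."""
    best = formats[0]
    for fmt in formats[1:]:
        if is_more_specific(fmt, best):
            best = fmt
    return best
```

With these definitions, one tool reporting `inches` and another reporting `2` no longer count as a conflict, and `most_specific(["PDF", "PDF/A"])` selects `PDF/A`.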
Example Output
<fits>
  <identification>
    <identity format="Graphics Interchange Format" mimetype="image/gif">
      <tool toolname="Jhove" toolversion="1.5" />
      ...
    </identity>
  </identification>
  <fileinfo>
    <size toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">40149</size>
    <md5checksum toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">265c9345ebf93c89d472766fda095de4</md5checksum>
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</valid>
  </filestatus>
  <metadata>
    <image>
      <height toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">1024</height>
    </image>
  </metadata>
</fits>
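Output in this shape is easy to consume downstream. The sketch below reads the example with Python's standard `xml.etree.ElementTree`; the `summarize` function and the keys of the dictionary it returns are hypothetical names for illustration. It assumes the output carries no XML namespace, as in the example shown; if the output is emitted with a namespace, the element paths would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# The example output from the slide, with the elided "..." content left out.
SAMPLE = """<fits>
  <identification>
    <identity format="Graphics Interchange Format" mimetype="image/gif">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
  </identification>
  <fileinfo>
    <size toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">40149</size>
    <md5checksum toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">265c9345ebf93c89d472766fda095de4</md5checksum>
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</valid>
  </filestatus>
  <metadata>
    <image>
      <height toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">1024</height>
    </image>
  </metadata>
</fits>"""


def summarize(xml_text):
    """Collect the identification, file info, and validity fields into a dict."""
    root = ET.fromstring(xml_text)
    identity = root.find("identification/identity")  # first identity block
    return {
        "format": identity.get("format"),
        "mimetype": identity.get("mimetype"),
        "identified_by": [t.get("toolname") for t in identity.findall("tool")],
        "size": int(root.findtext("fileinfo/size")),
        "md5": root.findtext("fileinfo/md5checksum"),
        "well_formed": root.findtext("filestatus/well-formed") == "true",
        "valid": root.findtext("filestatus/valid") == "true",
        "image_height": int(root.findtext("metadata/image/height")),
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```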
Configuration
- All settings are in the fits.xml config file
- Enable or disable tools (also available through the API)
- Prevent tools from processing files with specific file extensions
- Set tool priority
- Add new tools
- Use your own consolidator code
- Report or ignore conflicts
- Optionally display the original tool output
Sample Configuration File
<fits_configuration>
  <!-- Order of the tools determines preference -->
  <tools>
    <!-- exclude-exts attribute is a comma delimited list of file extensions that the tool should not try to process -->
    <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" exclude-exts="dng,mbx"/>
    <tool class="edu.harvard.hul.ois.fits.tools.fileutility.FileUtility" exclude-exts="dng,wps"/>
    <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" exclude-exts="txt,wps,vsd"/>
    <tool class="edu.harvard.hul.ois.fits.tools.droid.Droid" exclude-exts="dng"/>
    <tool class="edu.harvard.hul.ois.fits.tools.nlnz.MetadataExtractor" exclude-exts="dng,zip,odb,ott,odg,otg,odp,otp,ods,ots,odc,otc,odi,oti,odf,otf,odm,oth"/>
    <tool class="edu.harvard.hul.ois.fits.tools.oisfileinfo.FileInfo"/>
    <tool class="edu.harvard.hul.ois.fits.tools.oisfileinfo.XmlMetadata"/>
    <tool class="edu.harvard.hul.ois.fits.tools.ffident.FFIdent" exclude-exts="dng,wps,vsd"/>
  </tools>
  <output>
    <dataConsolidator class="edu.harvard.hul.ois.fits.consolidation.OISConsolidator"/>
    <display-tool-output>true</display-tool-output>
    <report-conflicts>true</report-conflicts>
    <validate-tool-output>false</validate-tool-output>
    <internal-output-schema>xml/fits_output.xsd</internal-output-schema>
    <external-output-schema>http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd</external-output-schema>
    <fits-xml-namespace>http://hul.harvard.edu/ois/xml/ns/fits/fits_output</fits-xml-namespace>
  </output>
  <!-- file name of the droid signature file to use in tools/droid/-->
  <droid_sigfile>DROID_SignatureFile_V35.xml</droid_sigfile>
</fits_configuration>
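To make the effect of tool order and `exclude-exts` concrete, the sketch below works out which tools would be asked to process a given file, in preference order. It is an illustration of the configuration semantics described in the comments of the sample file, not FITS code: the `tools_for` function is a hypothetical name, the config is abbreviated to four tools, and matching extensions case-insensitively is an assumption.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample configuration (four of the eight tools).
CONFIG = """<fits_configuration>
  <tools>
    <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" exclude-exts="dng,mbx"/>
    <tool class="edu.harvard.hul.ois.fits.tools.fileutility.FileUtility" exclude-exts="dng,wps"/>
    <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" exclude-exts="txt,wps,vsd"/>
    <tool class="edu.harvard.hul.ois.fits.tools.oisfileinfo.FileInfo"/>
  </tools>
</fits_configuration>"""


def tools_for(config_xml, filename):
    """Return short tool names, in configured order, that may process `filename`."""
    # Extension without the dot; empty string when the name has no dot.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    selected = []
    for tool in ET.fromstring(config_xml).findall("tools/tool"):
        raw = tool.get("exclude-exts", "")
        excluded = {e.strip().lower() for e in raw.split(",") if e.strip()}
        if ext not in excluded:
            # Keep only the class name after the last dot of the package path.
            selected.append(tool.get("class").rsplit(".", 1)[-1])
    return selected


if __name__ == "__main__":
    for name in ("letter.wps", "photo.dng", "README"):
        print(name, "->", tools_for(CONFIG, name))
```

Because document order is preserved, the returned list doubles as the priority order: the first tool listed is the preferred one when results disagree.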
Some Limitations...
- Speed
- Technical metadata is only returned if the tool that reported it is in the first <identity> block
- FITS considers a successful identification to be a combination of the format name and MIME type
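The last two limitations interact, which a small sketch makes visible. It assumes that tools are grouped into identity blocks by their exact (format name, MIME type) pair and that the first block is the one reported by the highest-priority tool; both are assumptions for illustration, and the function names are hypothetical.

```python
def group_identities(results):
    """Group tools by the (format, mimetype) pair they reported.

    `results` is a list of (tool, format, mimetype) tuples in tool-priority
    order. The returned list keeps the order in which pairs first appeared.
    """
    groups = {}
    for tool, fmt, mime in results:
        groups.setdefault((fmt, mime), []).append(tool)
    return list(groups.items())


def tools_reporting_metadata(results):
    """Tools whose technical metadata survives: those in the first identity."""
    identities = group_identities(results)
    return identities[0][1] if identities else []


if __name__ == "__main__":
    # Invented tool results for illustration only.
    results = [
        ("Jhove", "Portable Document Format", "application/pdf"),
        ("Exiftool", "PDF", "application/pdf"),
        ("DROID", "Portable Document Format", "application/pdf"),
    ]
    print(group_identities(results))
    print(tools_reporting_metadata(results))
```

In this invented example Exiftool agrees on the MIME type but spells the format name differently, so it lands in a second identity and its metadata would not be returned.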
Future Plans
- More tools
  - Apache Tika (text document formats)
  - Jhove 2
  - Aduna Aperture (text, documents, email formats)
  - Mediainfo (audio and video formats)
- Better audio and video format support as we add object support for them to DRS2
Wrap Up
- http://fits.googlecode.com
- http://ots-schemas.googlecode.com
  - Java library for reading and writing METS (limited support), MODS, PREMIS, MIX, TextMD, DocumentMD, and soon AES audio metadata
- More information on DRS2: http://hul.harvard.edu/ois/systems/drs/enhancements.html