Andrea Goethals, Harvard Library ASERL Webinar 2013 File Information Tool Set.

Presentation transcript:

Andrea Goethals, Harvard Library ASERL Webinar 2013 File Information Tool Set

 Intro to… ◦ File formats ◦ File tools  FITS

 “Specific structure or arrangement of data code stored as a computer file. A file format tells the computer how to display, print, and process, and save the data.” – BusinessDictionary.com  “The organization of information according to preset specifications” - TheFreeDictionary

  Unclear specifications  Complex/long specifications  Specifications that depend on many other specifications  No accessible sources (proprietary formats, very old formats)

 Related formats, examples: ◦ OpenDocument formats are packaged as ZIP files ◦ Many formats (e.g. XML) are text formats  Some formats lack obvious identifying features (e.g. magic numbers), examples: ◦ Text character encoding ◦ TIFF versions
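The point about related formats can be sketched in Python. This is a minimal illustration, not how FITS or any of the tools listed later identify files: it checks the ZIP magic number and then looks for the `mimetype` entry that OpenDocument packages carry. The function name `sniff` and the returned labels are assumptions made for this example.

```python
import io
import zipfile

# Local file header signature found at the start of a non-empty ZIP file.
ZIP_MAGIC = b"PK\x03\x04"


def sniff(data: bytes) -> str:
    """Return a coarse format guess for the given file bytes (illustrative only)."""
    if not data.startswith(ZIP_MAGIC):
        return "not a ZIP container"
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # OpenDocument packages declare their media type in a "mimetype" entry.
        if "mimetype" in archive.namelist():
            declared = archive.read("mimetype").decode("ascii", "replace").strip()
            if declared.startswith("application/vnd.oasis.opendocument."):
                return declared
    # A magic-number check alone stops here: the file is "just" a ZIP.
    return "application/zip"
```

A tool that only checks the magic number would report every OpenDocument file as ZIP, which is the kind of specificity difference the next slide describes.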

 Can be hard for tools to accurately identify formats  Some tools are more specific than others for particular formats ◦ E.g. Zip vs. OpenDocument vs. OpenDocument Spreadsheet  Some subjectivity behind format tools ◦ Different names for same format ◦ Different opinions about format validity

 Identify formats  Validate formats  Extract metadata

Poll 1

Tool | Identify formats | Validate formats | Extract metadata | Formats
DROID | YES | NO | | > 1000
ExifTool | YES | NO | YES | couple hundred
FFident | YES | NO | | ~ 50
File utility | YES | NO | | > 1000
JHOVE | YES | | | 11 + variations
MediaInfo | YES | NO | YES | ~ 30 A/V containers
NLNZ ME | YES | NO | YES | ~ 20

 Original motivation ◦ Offset risk of accepting any format (Web archives, attachments, donated hard drives)  Thoughts ◦ No single format identification tool can suffice (format support varies, accuracy varies) ◦ Unsustainable to only use “library” tools - want to incorporate tools from any domain

Polls 2 & 3

 Develop a tool manager instead of a tool  Include open source tools from any domain  Make highly configurable, tweak over time as experience & knowledge is gained  Account for tool inaccuracy in the design ◦ Check the tools against each other  Do any disagree?  How many are in agreement?
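The idea of checking the tools against each other can be sketched as a simple tally. This is a hedged illustration of the design principle, not FITS's actual consolidation logic; the function name `summarize`, the result keys, and the sample tool names in the usage note are assumptions.

```python
from collections import Counter
from typing import Dict, Optional


def summarize(reports: Dict[str, Optional[str]]) -> dict:
    """Tally the format name each tool reported; None means the tool had no opinion."""
    votes = Counter(name for name in reports.values() if name is not None)
    return {
        "formats": dict(votes),                       # format name -> number of tools reporting it
        "conflict": len(votes) > 1,                   # do any tools disagree?
        "agreeing": max(votes.values(), default=0),   # size of the largest group in agreement
    }
```

For example, `summarize({"Jhove": "Portable Document Format", "Droid": "Portable Document Format", "file utility": None})` reports no conflict with two tools in agreement.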

 Identify many file formats  Validate a few file formats  Extract metadata  Calculate basic file info (file size, MD5, etc.)  Output technical metadata ◦ Community-standard metadata schemas  Identify problem files ◦ Conflicting opinions on format, metadata values ◦ Unidentifiable file formats
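As an illustration of the "basic file info" item, the sketch below computes a file's name, size, and MD5 checksum with the Python standard library. It is only an example of the kind of values meant here; the function name `basic_file_info` and the dictionary keys are assumptions, not FITS output field names.

```python
import hashlib
import os


def basic_file_info(path: str, chunk_size: int = 65536) -> dict:
    """Return name, size in bytes, and MD5 hex digest for the file at path."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return {
        "name": os.path.basename(path),
        "size": os.path.getsize(path),
        "md5": md5.hexdigest(),
    }
```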

[Diagram] Any file → tools, each run through a FITS wrapper + XSL: JHOVE, DROID, NLNZ ME, ExifTool, File utility, FFIdent (+ Tika, AudioInfo, ADLTool, FileInfo, XMLMetadata) → FITS XML → Standard XML

  Different names for the same format ◦ ‘JPEG2000’ vs ‘JPEG 2000’ vs ‘JPEG 2000 image’  Different values for the same metadata ◦ “inches” vs “2” vs “in.” ◦ “Grayscale” vs “Greyscale”  Different ways of saying it can’t identify it ◦ ‘Unknown Binary’ vs ‘bytestream’ vs ‘data’ vs no value ◦ ‘application/octet-stream’ vs ‘application/unknown’ vs no value  Different ways metadata is output ◦ Ex: bits per sample (single or multiple values)
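Normalizing these differences can be pictured as a lookup step applied to each tool's raw output. The sketch below is a simplified Python illustration; FITS itself does this with per-tool XSL, and the canonical labels chosen here ("JPEG 2000", "in.") and the names `normalize_format` and `normalize_unit` are assumptions.

```python
from typing import Optional

# Variant spellings mapped to one canonical name (canonical choice is an assumption).
FORMAT_ALIASES = {
    "jpeg2000": "JPEG 2000",
    "jpeg 2000": "JPEG 2000",
    "jpeg 2000 image": "JPEG 2000",
}

# Different ways a tool says "I could not identify this".
UNKNOWN_FORMATS = {"unknown binary", "bytestream", "data", ""}

# Different values for the same resolution unit.
UNIT_ALIASES = {"inches": "in.", "2": "in.", "in.": "in."}


def normalize_format(raw: Optional[str]) -> Optional[str]:
    """Map a tool's format name to a canonical name, or None if unidentified."""
    key = (raw or "").strip().lower()
    if key in UNKNOWN_FORMATS:
        return None
    return FORMAT_ALIASES.get(key, (raw or "").strip())


def normalize_unit(raw: str) -> str:
    """Map a tool's resolution unit to a canonical label, leaving unknown values alone."""
    return UNIT_ALIASES.get(raw.strip().lower(), raw.strip())
```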

// format name, version, registry IDs // file name, size, MD5, etc. // validity info // normalized, combined metadata // native tool output

 Format ◦ Name ◦ Version  MIME media type  Format registry identifier(s) ◦ PRONOM puid

  Format ◦ Name = Portable Document Format ◦ Version = 1.4  MIME media type = application/pdf  Format registry identifier(s) ◦ PRONOM puid = fmt/18
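To show how these identification fields might be read programmatically, here is a small Python sketch using the standard library XML parser. The sample XML is a simplified assumption: element and attribute names (`identity`, `format`, `mimetype`, `version`, `externalIdentifier`) are illustrative, and real FITS output is namespaced and carries extra tool attribution that this sketch ignores. The sample values are illustrative too.

```python
import xml.etree.ElementTree as ET

# Simplified, un-namespaced stand-in for the identification section (assumed layout).
SAMPLE = """
<fits>
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf">
      <version>1.6</version>
      <externalIdentifier type="puid">fmt/20</externalIdentifier>
    </identity>
  </identification>
</fits>
"""


def read_identities(xml_text: str) -> list:
    """Return one dict per identity element: name, version, MIME type, PUIDs."""
    root = ET.fromstring(xml_text)
    identities = []
    for identity in root.iter("identity"):
        identities.append({
            "name": identity.get("format"),
            "version": identity.findtext("version"),
            "mimetype": identity.get("mimetype"),
            "puids": [
                ext.text
                for ext in identity.findall("externalIdentifier")
                if ext.get("type") == "puid"
            ],
        })
    return identities
```

More than one entry in the returned list would correspond to the conflicting-identification case discussed at the end of the presentation.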

cmd (open up a shell)
cd ..\..\Program Files\Fits\fits (navigate to install)
.\fits.bat -h (see parameters)
.\fits.bat -i ..\testfiles\myfile.pdf (FITS XML metadata only)

<fits xmlns=" xmlns:xsi=" xsi:schemaLocation=" harvard.edu/ois/xml/ns/fits/fits_output version="0.6.2" timestamp="4/24/13 10:40 AM"> 1.6 fmt/ / 2013:04:24 10:40:05-04: :04:24 10:39:31-04:00 C:\Program Files\Fits\fits-0.6.2\..\testfiles\myfile.pdf ..\testfiles\myfile.pdf 40d af9ff5c6046b6a1ad2f true Local Disk C:\Program Files\Fits\testfiles\Felix_output.txt 2 no yes no

 For text: TextMD (Library of Congress)  For images: MIX (Library of Congress)  For documents: DocumentMD (Florida Virtual Campus / Harvard Library)  For audio: AES57 (Audio Engineering Society)

.\fits.bat -i ..\testfiles\myfile.pdf -xc (FITS XML metadata + standard technical metadata)
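For scripted use, the same command lines can be assembled in Python. This sketch only builds the argument list from the options shown on the slides (`-i` and `-xc`); the function name `fits_command` and the default launcher path are assumptions, and nothing here is executed.

```python
from typing import List


def fits_command(input_path: str, combined: bool = False,
                 launcher: str = r".\fits.bat") -> List[str]:
    """Build the argument list for a FITS run on one input file."""
    args = [launcher, "-i", input_path]
    if combined:
        # -xc: FITS XML metadata plus standard technical metadata, as on the slide.
        args.append("-xc")
    return args
```

The resulting list could be passed to `subprocess.run(args, capture_output=True, text=True)` from the install directory to capture the XML that FITS prints.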

<fits xmlns=" xmlns:xsi=" instance" xsi:schemaLocation=" harvard.edu/ois/xml/ns/fits/fits_output version="0.6.2" timestamp="4/24/13 10:48 AM"> 1.6 fmt/20. (snip). 2 hasOutline

  xml xml ◦ PREMIS:  Generic technical metadata (fixity, size, format, creating application)  Format-specific technical metadata in objectCharacteristicsExtension ◦ FITS XML output in administrative metadata

 xml/ directory  fits.xml (tweak your tool preferences)  fits_format_tree.xml (tweak knowledge-base of related formats)

 ◦ Downloads: get the newest version ◦ Mailing list: fits-users (new releases announced here) ◦ Issues: File any bugs  Source code (if you want to contribute):

C:\Program Files\Fits\fits-0.6.1>.\fits.bat -i demo\Acknowledgements.rtf <fits xmlns=" xmlns:xsi=" instance" xsi:schemaLocation=" harvard.edu/ois/xml/ns/fits/fits_output version="0.6.1" timestamp="7/21/12 3:51 PM"> fmt/50 fmt/51

  Indicate tool inaccuracies and/or areas for educating ourselves  To resolve ◦ Is Rich Text Format a more specific form of Plain Text?  If so, adjust fits_format_tree.xml ◦ What should the MIME media type be for Rich Text Format? (consult specification if possible)  Normalize the tool output to this MIME media type
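The first resolution step can be illustrated with a small format tree in Python. This is a sketch of the idea only: it does not reproduce the syntax of fits_format_tree.xml, the tree contents and the names `FORMAT_TREE`, `is_more_specific`, and `resolve` are assumptions, and treating Rich Text Format as a child of Plain text is exactly the open question from the slide, answered "yes" here purely for illustration.

```python
from typing import Dict, Iterable, List

# General format -> more specific formats (illustrative contents).
FORMAT_TREE: Dict[str, List[str]] = {
    "ZIP Format": ["OpenDocument"],
    "OpenDocument": ["OpenDocument Spreadsheet"],
    "Plain text": ["Rich Text Format"],
}


def is_more_specific(candidate: str, general: str,
                     tree: Dict[str, List[str]] = FORMAT_TREE) -> bool:
    """True if candidate appears anywhere below general in the tree."""
    stack = list(tree.get(general, []))
    while stack:
        name = stack.pop()
        if name == candidate:
            return True
        stack.extend(tree.get(name, []))
    return False


def resolve(reported: Iterable[str]) -> List[str]:
    """Drop any reported format for which a more specific format was also reported."""
    names = set(reported)
    return sorted(
        name for name in names
        if not any(is_more_specific(other, name) for other in names)
    )
```

With this tree, a file reported as both Plain text and Rich Text Format resolves to Rich Text Format alone, while two unrelated formats both survive and remain a conflict to investigate.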