SCAPE Carl Wilson Open Planets Foundation SCAPE Training Guimarães Characterisation - 101 An introduction to the identification and characterisation of.

Presentation transcript:

SCAPE Carl Wilson Open Planets Foundation SCAPE Training Guimarães Characterisation An introduction to the identification and characterisation of file formats. This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT (Grant Agreement number ).

About Us: Carl Wilson, Open Planets Foundation. The SCAPE Project is an EU-funded research project; SCAPE stands for SCAlable Preservation Environments.

About You: Once around the room: your name, where you work, what you do, and why you're here. DO ask questions, or tell me to slow down, or ask me to repeat something.

File Formats: What is a file format? A "standard" method of encoding data for storage. It may follow an open specification or a proprietary one (open preferred), or simply a loosely documented convention.

Who Cares About Formats? Operating systems: in order to open a file with an application that can interpret or render it. Web servers: to negotiate Content-Type in HTTP requests. Memory institutions: to identify software stacks that can render or extract meaning from a file, now or at a later date. More generally: everyone with digital content, whether they know it or not.

Some Uses of Format Information: Format information associates a file with software that can interpret and/or render its contents. It can be used to find documentation or specifications that help interpret a file's contents. It is also a first step in preservation planning: knowing what you have.

File Name Extension: A file name suffix, separated from the file base name by a dot ".". Examples: .pdf, .txt, .jpg, .doc, .docx. This has worked for a number of years, BUT any user with the right permissions can change a file extension, and bytes aren't always transferred with a name.
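As a minimal sketch of why an extension is only a hint, the Python below extracts an extension and then renames a file without touching its bytes. The helper names `extension_of` and `rename_demo`, the file names, and the sample content are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path, PurePath


def extension_of(file_name: str) -> str:
    """Return the lower-cased final extension (with its dot), or '' if none."""
    return PurePath(file_name).suffix.lower()


def rename_demo() -> tuple:
    """Rename a file and report (old_ext, new_ext, bytes_unchanged)."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "report.pdf"
        original.write_bytes(b"%PDF-1.4\n")  # hypothetical sample content
        before = original.read_bytes()
        renamed = Path(tmp) / "report.txt"
        original.rename(renamed)  # only the name changes, not the bytes
        unchanged = renamed.read_bytes() == before
        return extension_of(original.name), extension_of(renamed.name), unchanged


if __name__ == "__main__":
    print(extension_of("Thesis.DOCX"))  # .docx
    print(extension_of("README"))       # prints an empty line: no extension
    print(rename_demo())                # ('.pdf', '.txt', True)
```

The rename changes what the extension claims while the content stays byte-for-byte identical, which is why the identification tools discussed later inspect the bytes themselves.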

Internet Media (MIME) Types: The format identifiers used by the web. Examples: text/plain, text/html, image/jpeg. They don't readily hold extra information such as format version, but may be extended.
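A short sketch using Python's standard library shows both halves of this slide: mapping a file name to a media type, and carrying extra information as a parameter. Note that `mimetypes.guess_type` works from the file extension alone, and that the `version=1.4` parameter below is an illustrative assumption, not something the slides specify. The helper names are hypothetical. The registered type for JPEG images is `image/jpeg`.

```python
import mimetypes
from email.message import Message


def guess_mime(file_name: str):
    """Guess a media type from the file name's extension (None if unknown)."""
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return mime_type


def split_media_type(value: str):
    """Split a Content-Type style value into (type/subtype, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns [(type, ''), (name, value), ...]; skip the first.
    params = dict(msg.get_params()[1:])
    return msg.get_content_type(), params


if __name__ == "__main__":
    print(guess_mime("photo.jpg"))   # image/jpeg
    print(guess_mime("notes.txt"))   # text/plain
    # Illustrative only: a parameter used to carry a format version.
    print(split_media_type("application/pdf; version=1.4"))
```

The bare type says nothing about the format version; any such detail has to travel in an extension such as a parameter, and a consumer has to know to look for it.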

Apple's Alternatives: Versions of Mac OS before OS X used Creator and Type codes. Creator: the software that created the file. Type: the type of information, e.g. TEXT. These were more flexible than extensions, but are no longer used. Recent OS X versions also use Uniform Type Identifiers.

PRONOM Unique Identifiers or PUIDs: PRONOM is a web-based registry of file format information, created in 2002 and hosted by the National Archives of the UK. It uses PUIDs to identify file formats: fmt/15 == Acrobat PDF 1.1, fmt/16 == Acrobat PDF 1.2, fmt/17 == Acrobat PDF

The Unix File Utility: A standard Unix program for identifying the data in a file. It was first released in 1973 and is written in C, so it requires operating-system-dependent compilation. The open source version used in Linux distributions was written in 1986. Identification is based upon compiled "magic" files. It provides text information about files, or MIME types with the right options.

FIDO: Format Identification of Digital Objects. An open source format identification tool, based upon the PRONOM signature data compiled to regular expressions. It is written in Python, so it can be run on different operating systems. It has a richer command line syntax than DROID.

Apache Tika: An open source toolkit for detecting and extracting metadata and structured text from files. It performs format identification and deeper characterisation (more on that later). It is Java based, so it will run on different platforms. It returns MIME types as format identifiers.

How Do These Tools Identify Formats? They exploit "common features" of the format. PDF start of file: %PDF-1.1 (PDF version 1.1), %PDF-1.2 (PDF version 1.2), %PDF-1.6 (PDF version 1.6). Tika and File simply look for files starting with the string %PDF- and return the MIME type. FIDO, however…
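The header check described above can be sketched in a few lines of Python. This is an illustration of the general signature-matching idea, not the actual logic of Tika, File, or FIDO; the function name `sniff_pdf` and the regular expression are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: the literal "%PDF-" followed by a "digit.digit" version.
PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def sniff_pdf(data: bytes) -> Optional[Tuple[str, str]]:
    """Return (mime_type, version) if data starts with a PDF header, else None."""
    match = PDF_HEADER.match(data)  # match() anchors at the start of the data
    if match is None:
        return None
    return "application/pdf", match.group(1).decode("ascii")


if __name__ == "__main__":
    print(sniff_pdf(b"%PDF-1.6\n% rest of the file"))  # ('application/pdf', '1.6')
    print(sniff_pdf(b"plain text, not a PDF"))         # None
```

A tool that stops at the `%PDF-` prefix can only report the MIME type; reading the version digits as well is what makes a per-version identifier possible.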

FIDO & PDF Identification: FIDO identifies the different PDF versions, each of which has a PUID. FIDO also looks for an end-of-file marker for PDFs: %%EOF. This could be a problem…
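One way an end-of-file check could cause trouble can be sketched as follows. This is a hypothetical illustration, not FIDO's actual signature logic: the function names, the 1024-byte window, and the sample byte strings are all assumptions. The PDF format writes its end-of-file marker as `%%EOF`.

```python
PDF_EOF = b"%%EOF"


def ends_with_eof_strict(data: bytes) -> bool:
    """True only if the marker is the last thing in the data (bar whitespace)."""
    return data.rstrip(b" \t\r\n").endswith(PDF_EOF)


def has_eof_near_end(data: bytes, window: int = 1024) -> bool:
    """True if the marker appears anywhere in the last `window` bytes."""
    return PDF_EOF in data[-window:]


if __name__ == "__main__":
    clean = b"%PDF-1.4\n...body...\n%%EOF\n"   # hypothetical well-formed file
    padded = clean + b"\x00" * 16              # extra bytes after the marker
    truncated = clean[:-8]                     # transfer cut off before the marker

    for name, data in [("clean", clean), ("padded", padded), ("truncated", truncated)]:
        print(name, ends_with_eof_strict(data), has_eof_near_end(data))
    # clean True True
    # padded False True
    # truncated False False
```

Under these assumptions, a strict test rejects a file that merely has trailing bytes, and no test can find the marker in a truncated file, so a file that starts like a PDF may still go unidentified.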