Digital Preservation: Context and Content

Edward M. Corrado, ELAG 2011 (@ecorrado)

Binghamton University (NY, USA)
Binghamton is the “premier public university in the northeast” (Fiske Guide to Colleges, 2010)
Undergraduates: 11,787
Graduate students: 3,108
Average SAT score range:
~2.4 million volumes in the libraries
Started as Triple Cities College in 1946
Became a “University Center” in 1965
Part of the State University of New York system
Notes: Obligatory "about us." Point out that we are a relatively new university.

Digital preservation: A definition (1 of 2)
A “series of managed activities necessary to ensure continued access to digital materials for as long as necessary”
“Actions required to maintain access to digital materials beyond the limits of media failure or technological change.”
“materials may be records created during the day-to-day business of an organisation; "born-digital" … or the products of digitisation projects.”
Notes: The Digital Preservation Coalition says: Digital preservation refers to the series of managed activities necessary to ensure continued access to digital materials for as long as necessary. Digital preservation […] refers to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological change. Those materials may be records created during the day-to-day business of an organization; "born-digital" materials created for a specific purpose (e.g. teaching resources); or the products of digitization projects.

Digital preservation: A definition (2 of 2)
JISC Beginner’s Guide to Digital Preservation elaborates:
Managed: Digital preservation is a management problem.
Activities: The policy needs to filter down to a list of processes: tasks that can take place at specified times and in specified ways.
Necessary: What needs to be done. How long do you want to preserve the objects for? Discussions about the activities needed to achieve a level of preservation are necessary.
Continued Access: Access is the key here. Most objects in the public sphere are preserved to support access and retrieval.
Digital Materials: Digital materials, digital objects, call them what you will. This is the stuff you are preserving. Different objects require different processes.
Notes: The JISC Beginner’s Guide to Digital Preservation adds:
Managed – Digital preservation is a managerial problem. All activities (the planning, resource allocation, use of technologies) need to have been thought through, and they take place for a reason. The term "managed" stresses the need for a high-level strategy and a policy.
Activities – The policy needs to filter down to a list of processes: tasks that can take place at specified times and in specified ways.
Necessary – We are looking at what needs to be done. In your policy you will have looked at how long you want to preserve the objects for. Discussions about the activities needed to achieve a level of preservation are necessary.
Continued Access – Access is the key here. Most objects in the public sphere are preserved to support access and retrieval. How long this access is needed will have been discussed and should be defined in your policy.
Digital Materials – Digital materials, digital objects, call them what you will. This is the stuff you are preserving. Different objects require different processes.

Why digital preservation? (1 of 2)
Local content as the future of academic libraries?
At least in regard to physical collections?
Google Books, HathiTrust
To a large degree, journal collections and government documents are already being “managed” outside of libraries
Notes: By "managed," I don’t mean there is no work involved locally, but it is different work. Will books go the same way? (There are obvious differences between the roles of research university libraries and 4-year colleges.)

Why digital preservation? (2 of 2)
Extends the role of University Archives into the digital world
Not a new idea, a new format
Majority of new material is published in digital format
Scholarly articles
Campus newsletters
Course catalogs
Web sites
Notes: A major research library with any level of archives/special collections wouldn’t think of not performing preservation on physical books.
Photo by Klara Kim. © CC BY-NC-SA

Long Tail of librarianship
A University is a collection of niche markets (especially when it comes to faculty/researchers)
Notes: A major research library with any level of archives/special collections wouldn’t think of not performing preservation on physical books.

Low-hanging fruit
Faculty Research
Special Collections/University Archives
Course Catalogs
Newsletters
Notes: Mention that in the US, faculty research is the property of the individual faculty member, and they can be very possessive. Also need to be opportunistic (blood serum collection).
Images cc-by-nc-2.0:

At-risk materials
Image cc-by-nc-sa 2.0:

Digital Asset/Content Management System vs. Digital Preservation System
“Digital asset management (DAM) consists of management tasks and decisions surrounding the ingestion, annotation, cataloguing, storage, retrieval and distribution of digital assets.” (Wikipedia, 2011)
Preservation is not mentioned
“Often, those creating digital materials or designing digital content management systems do not take great interest in their long-term preservation.” (Trusted Digital Repositories: Attributes and Responsibilities, An RLG-OCLC Report, 2002, p. 18)

Digital Preservation: Binghamton’s Solution
Experimented with various “Digital Content” systems including Content Pro, CONTENTdm, DSpace, EPrints
None of these has preservation “built-in”
Building our own was not practical
Staffing levels
Lack of expertise
Rosetta by Ex Libris

Rosetta and discovery
No preservation system is useful if there is no access (especially for a University Library)
Rosetta does not have a public discovery layer
Rosetta’s Digital Publishing System is flexible, so there are options
Primo for discovery
First University to use Primo with Rosetta
Works well with other library systems such as Aleph and Primo Central
One-stop shopping
Notes: I have harvested records from Rosetta into other systems. It works fine (although the other system I used didn’t harvest everything, only standard OAI-PMH). With Primo, we wanted a single interface that tied in all systems but was still flexible for our local needs.
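
The OAI-PMH harvesting mentioned on this slide can be sketched as a simple request builder. This is only an illustration: the `ListRecords` verb and the `oai_dc` metadata prefix come from the OAI-PMH standard, but the endpoint URL and the set name are hypothetical and are not Binghamton's or Rosetta's actual values.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a harvester."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict the harvest to one set (for example, one collection).
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and set name, for illustration only.
url = list_records_url("https://repository.example.edu/oai", set_spec="link_collection")
print(url)
```

A harvester would fetch this URL and page through the results; a system that only understands standard OAI-PMH sees just the Dublin Core view of each record.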

Implementation timeline
Early December 2010: Initial training
January 2011: Installation and functional training
February 2011: 2 Metadata librarians begin working at Binghamton
End of March 2011: Went live with first collection
May 2011: Present at ELAG!
May 2011: Go live beta with Primo for Aleph, etc.
More collections coming soon to Rosetta!!

Overall staffing
Systems: 1 person (~0.5 FTE): Project Management, Systems/Technical
Metadata/Cataloging: 3 people (~1.0 FTE)
User Interface: Part of Web Services Librarian's time
Special Collections: Not directly involved with implementation, but relied on heavily for collection-level expertise
Could really use a Digital Curation Librarian
Notes: The point is that you don’t need a lot of staff.

O/S and hardware
RHEL 5.5
IBM x3550 M2 (2x – one production, one test)
24 GB RAM
(2) 2.93 GHz Intel x5570 Xeon Processors
(2) 146 GB hard drives
HP StorageWorks MSA2300i Controller
24 TB (pre-mirroring)
Tape backup
Notes: Servers bought for other reasons.

What is put into Rosetta?
Each file (or group of files) deposited is an Intellectual Entity, or IE
With an IE are:
Metadata forms (in DC format)
Submission formats (how a file is to be uploaded)
Copyright statement to be used
Access rights forms (who can view the content)
Notes: When you talk about Rosetta and what goes into it, you talk about IEs, or Intellectual Entities. An IE can be a single file containing the digital image of a single photograph. An IE can be a single PDF containing 50 images, each of a page of a book. An IE can be 50 TIFFs, each containing the digitized image of a single page of a book. Each IE has accompanying metadata, which for the version of Rosetta we are using is in Dublin Core. The Dublin Core metadata is wrapped up together with administrative information about the file size, file format, access rights, etc., into a submission information package called a SIP. The SIP moves along the deposit process where, depending on the rules that have been set up, it gets assessed, arranged and approved until it is put into the Permanent Repository.
However, what we are interested in for the moment are the metadata forms. Looking at these will give you a quick glimpse into how you work with Rosetta. The metadata form, like a MARC record, is designed to contain information about the IE. The metadata can be entered, edited and indexed, and most importantly it is used to access an individual IE. After all, no matter how well a repository takes care of a file, how well it keeps it and preserves it, it makes no sense to put items into a system if you cannot get them back out.
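
As a rough illustration of the idea described here (descriptive Dublin Core metadata bundled with administrative information into one package), the sketch below builds a toy package in Python. It is not Rosetta's actual SIP format: the `record` wrapper element, the dictionary layout, the file name, size and access value are all hypothetical. Only the Dublin Core namespace URI and the sample title come from real sources.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # standard Dublin Core element namespace
ET.register_namespace("dc", DC_NS)


def dc_xml(fields):
    """Serialize {element_name: value} as a small Dublin Core XML record."""
    root = ET.Element("record")  # hypothetical wrapper element
    for element, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")


def build_package(files, dc_fields, access_rights):
    """Bundle descriptive metadata with administrative file information."""
    return {
        "metadata": dc_xml(dc_fields),
        "files": [
            {"name": name, "format": fmt, "size_bytes": size}
            for name, fmt, size in files
        ],
        "access_rights": access_rights,
    }


# One IE made of a single image file (file details are made up).
package = build_package(
    files=[("image_001.tif", "image/tiff", 1024)],
    dc_fields={"title": "Perry Cubmarine on Sea Diver"},
    access_rights="public",
)
print(package["metadata"])
```

An IE made of 50 TIFFs would simply pass 50 entries in `files` with one shared set of Dublin Core fields.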

Rosetta metadata decisions
Dublin Core fields
Non-standard fields
Create a master list
Dublin Core is so flexible that it is inflexible
Notes: We found that making decisions about Dublin Core fields was much harder than we thought it would be, and we are still making decisions and second-guessing some of the elements [isFormatOf and source]. We not only looked at the DCMI site; we also found and read several best-practices sites, which were very helpful to our understanding of how libraries actually use Dublin Core, and we freely stole a lot of people’s hard work. It took us a while to visualize how the data we had would fit into the available fields. We decided to create a master list of Dublin Core fields, and then for each project we would choose the fields we needed for that particular project. We ended up using the core fields and each of the qualifiers. We set out to keep to the Dublin Core straight and narrow. Days of discussion later we were pretty happy with our choices; we could see how most of the elements would fit our needs, even though the terms "square pegs" and "round holes" kept popping up in the conversation.
The big exception was that we wanted more notes fields. Although “description” is designed for notes, we wanted an extra field for notes and a field for internal notes. We wanted to use internal notes like a MARC 590 field: if we were to digitize something that was not unique, we wanted to be able to say something about Binghamton’s copy. We drew the line there, however, and resisted some of the tempting extra fields we saw in our research, though some of them looked really useful. We then went into Rosetta and created a master metadata form. Let’s look at the metadata form.
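
The "master list, then choose per project" approach can be sketched in a few lines of Python. This is an assumption-laden illustration: the 15 names are the standard Dublin Core elements (qualifiers are omitted for brevity), while the two local field names `note` and `internalNote` are hypothetical stand-ins for the extra notes fields described above.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_CORE = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Hypothetical names for the two local, non-standard notes fields.
LOCAL_FIELDS = ["note", "internalNote"]

MASTER_LIST = DC_CORE + LOCAL_FIELDS


def project_fields(wanted):
    """Pick a project's fields from the master list, keeping master-list order."""
    unknown = sorted(set(wanted) - set(MASTER_LIST))
    if unknown:
        raise ValueError(f"not in the master list: {unknown}")
    return [field for field in MASTER_LIST if field in wanted]


print(project_fields({"rights", "title", "internalNote"}))
# ['title', 'rights', 'internalNote']
```

Rejecting anything outside the master list is one way to keep every project on the same agreed vocabulary while still letting each project use only what it needs.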

Staffing for metadata
Metadata Librarians create the metadata forms and set up the rules to use them
Other staff and student help from within the Libraries will create metadata
Projects ingested from outside the library will get our help but use their own staff/students

Metadata Form
Need to create a metadata form for each collection.
Notes: A metadata form will have the fields that you need for a specific project. So while one project may need MeSH subject headings, another may not. One of the nice things about metadata forms is that they can be duplicated, and I’ll explain why. We created a metadata form that uses all of the Dublin Core fields that we had just compiled. We did this because it is quicker to duplicate a default form and subtract the fields you do not need than to create the form and add all the fields over and over again. You start with the name of the form and a brief description. Then [see the arrow] you click the fields information tab.

Create metadata form
For each collection, you can make a metadata form with only the fields you want. You keep adding new fields. The drop-down you see highlighted contains the DC tags; you choose a tag. On this screen, you also choose the type of field it will be. Rosetta is designed to let you set up radio button fields, date fields, text boxes, etc.

Choose your properties for the field
And my favorite part: you add your tooltip, where you leave breadcrumbs about how to use the field. Here it says to try to create a unique title for each item. You can make the field mandatory or not (we did make the title mandatory) and single-line or not. Add the label and description.
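
The field properties described on these slides (DC tag, field type, label, tooltip, mandatory or not) map naturally onto a small data structure. The sketch below is a hypothetical model for illustration, not Rosetta's internal representation or API; the class and function names are invented, and the date field is an assumed example.

```python
from dataclasses import dataclass


@dataclass
class FormField:
    dc_tag: str                # e.g. "dc:title", chosen from the drop-down
    label: str                 # label shown to the person entering metadata
    field_type: str = "text"   # e.g. "text", "date", "radio"
    mandatory: bool = False
    tooltip: str = ""          # guidance on how to use the field


def missing_mandatory(form, record):
    """Return the DC tags of mandatory fields that are empty or absent."""
    return [
        field.dc_tag
        for field in form
        if field.mandatory and not record.get(field.dc_tag, "").strip()
    ]


form = [
    FormField("dc:title", "Title", mandatory=True,
              tooltip="Try to create a unique title for each item."),
    FormField("dc:date", "Date", field_type="date"),
]

print(missing_mandatory(form, {"dc:date": "2011"}))  # ['dc:title']
```

Duplicating a default form and removing unneeded fields then amounts to copying the list and filtering it.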

A finished form
After putting all those fields into a metadata form, this, if you can see this slide, is what it looks like. If you can see them, 3 of the fields have been automatically filled in: the URL to our rights statement, the digital publisher (us) and the rights holder (us again). Also, in between the rights and the rights holder there is a field for the digital collection. In this case we actually have 3 parts to the overall collection and have created a field with radio buttons to choose which part of the collection this IE (remember what an IE is?) belongs to.
Now at this point I should mention that you don’t have to use this form; you can also load IEs in bulk with the metadata in spreadsheets, where the data will map to the fields. However, for demonstration purposes I think it is easier to understand the process if you visualize creating a metadata form and then filling it out. There are many ways to get to the same ending, and in most cases different ways will be used for different projects. For example, we are working on one project where graduate students are entering all the metadata; for that project they are working with spreadsheets, which will be loaded and matched with the IEs. But the concept is the same. So we have added IEs, we have added metadata, the IE has been added to the permanent repository, it has undergone risk analysis, and a display copy has been made. Are we there yet?
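
The spreadsheet route can be pictured as a column-to-field mapping. The sketch below is a minimal illustration under stated assumptions: the column names (`Filename`, `Title`, `Description`), the mapping, and the sample row's file name are hypothetical, and this is not Rosetta's actual bulk-load mechanism.

```python
import csv
import io

# Hypothetical mapping from spreadsheet column headers to Dublin Core tags.
COLUMN_TO_DC = {"Title": "dc:title", "Description": "dc:description"}


def spreadsheet_to_records(csv_text, column_to_dc, file_column="Filename"):
    """Map each spreadsheet row to DC fields, keyed by the file it describes."""
    records = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        metadata = {}
        for column, dc_tag in column_to_dc.items():
            value = (row.get(column) or "").strip()
            if value:  # skip empty cells rather than storing blank fields
                metadata[dc_tag] = value
        records[row[file_column]] = metadata
    return records


sample = "Filename,Title,Description\nimage_001.tif,Perry Cubmarine on Sea Diver,\n"
records = spreadsheet_to_records(sample, COLUMN_TO_DC)
print(records)
# {'image_001.tif': {'dc:title': 'Perry Cubmarine on Sea Diver'}}
```

Keying each record by file name is what allows the loaded metadata to be matched with the corresponding IE afterwards.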

Searching for “cubmarine” in Rosetta
As Edward mentioned, Rosetta does not have a public display, though it does have its own internal searching and display. The largest collection that has been ingested so far is the Edwin A. Link Collection. Link, who lived in Binghamton, was a pioneer in both aviation and ocean engineering. He is most remembered for inventing the flight simulator called the “Blue Box” or “Link Trainer”. One of his inventions was the “cubmarine”, a small submersible; we like this one simply because we like the name.

Here we see the result of our search within Rosetta, and here is the cubmarine on the back of Link’s ship, the Sea Diver. You can see part of the metadata above the picture. As you can see, the title is “Perry Cubmarine on Sea Diver”. However, you cannot see all of the metadata; you have to scroll down.

Primo at Binghamton
To display our Rosetta holdings to our users, we decided to use Primo as a discovery layer. For the search scope, you can search just our digital collections (Rosetta), just our Library catalog (Aleph), or, by choosing “Local Digital Collections & Catalog”, both the library catalog and Rosetta. This was important to us since for some of our collections we will have materials in both locations and want the user to find all available materials at once. Of course they can also search “everything” to include journal and newspaper articles associated with their search.

For example, searching for “edwin a link” gives us 5134 results if we search both the library catalog and our digital collections. By looking at the facets on the left of the screen, we see that 5097 of the results are available from Rosetta, with the other 37 available in the Library. But let’s redo our cubmarine search, this time in Primo.

Same search in Binghamton’s Primo
Searching only the digital collections in Primo for “cubmarine” brings up the same image we just saw from our Rosetta search. Under “details” you can see part of the metadata. All of the metadata is there, but you do need to scroll down to see the rest of it. One big change from the original metadata view is that the labels have become much more user friendly. For example, you will see “Title” instead of dc:title and Subject(TGM) instead of dcterms:TGM. We are still working with Dublin Core to determine the best fit for the data. As you can see, you have a thumbnail of the image with your results, and a click on “View Online” will take the user to Rosetta for the full image.
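
The relabeling described here is, conceptually, a lookup from internal tags to display labels. The sketch below is a hypothetical illustration in Python, not how Primo is actually configured; only the two tag-to-label pairs come from the slide, and the fallback behavior is an assumption.

```python
# Tag-to-label pairs taken from the slide; any others would be added per project.
DISPLAY_LABELS = {
    "dc:title": "Title",
    "dcterms:TGM": "Subject(TGM)",
}


def display_label(tag):
    """Return the user-friendly label for a tag, or the tag itself if unmapped."""
    return DISPLAY_LABELS.get(tag, tag)


print(display_label("dc:title"))    # Title
print(display_label("dc:creator"))  # dc:creator (no label defined yet)
```

Falling back to the raw tag makes unmapped fields easy to spot while labels are still being decided.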

And this is exactly the same image we saw in Rosetta – the picture is watermarked.

Can zoom in!
You can rotate, zoom in or out, or best-fit the picture. You can also print it – with the watermark.

Comments and challenges
Built for National Libraries
Comes with the territory of being the first University partner
Scales down well
Ex Libris is extremely responsive and is making improvements
Staffing
Current staffing levels are adequate, but we have big plans and need more staffing to get things done as quickly as we’d like

Future projects
More Special Collections (digitization)
Blood Serum
University Photographer
767 DVDs!
Born Digital
Newsletters
Campus website
Planning to be opportunistic
Prioritization: Committee to set priorities

Edward M. Corrado http://ecorrado.us/ ecorrado@ecorrado.us
Digital Preservation: Context and Content: ELAG 2011 Edward M. Corrado
