Challenges of Digital Preservation MA / CS 109 April 22, 2011 Andrea Goethals Manager of Digital Preservation & Repository Services Harvard Library.


1 Challenges of Digital Preservation MA / CS 109 April 22, 2011 Andrea Goethals Manager of Digital Preservation & Repository Services Harvard Library

3 “Digital Content”? Digitized (born-analog) Born-digital ◦ Tweets ◦ Web sites ◦ Email ◦ Documents ▪ PDF ▪ Word, OpenOffice … ▪ Spreadsheets ◦ Data sets

4 Digital content is not new 1957: 1st digital image 1969: ARPAnet 1971: 1st email sent 1972: 1st consumer-level video game 1975: 1st digital camera [Image: Russell Kirsch’s son (source: NIST)]

5 But has only recently exploded 1998: 1st Google index ◦ 26 million pages 2000: Google index ◦ 1 billion pages 2008: Google link processors ◦ 1 trillion unique URIs ◦ “… and the number of individual Web pages out there is growing by several billion pages per day” – from the official Google blog

6 The coming tsunami 2010: estimated at 1.2 ZB (1 ZB is 1 billion TBs) ◦ DVDs stacked from Earth to the Moon and back 2020: expected to grow by a factor of 44 to 35 ZB ◦ DVDs stacked halfway to Mars Source: 2010 IDC Digital Universe Study sponsored by EMC

7 Outpacing storage Source: 2009 IDC Digital Universe Study sponsored by EMC

8 Why do we care?

9 May be historically significant Captured March 19, 2011 for a Japan Earthquake collection created by Virginia Tech, Internet Archive (http://www.archive-it.org/public/collection.html?id=2438)

10 May be a work of art YouTube Play. A Biennial of Creative Video (Oct. 2010 -)

11 May be an important reference Only available in digital form

12 May only be possible digitally

13 Who cares? Cultural heritage institutions ◦ Libraries, archives ◦ Museums, historical societies ◦ Academic institutions Governments Entertainment, news and media industry Scientific community Funding bodies (NSF, NIH) You?

14 Preservation historically Archives and libraries have been preserving all kinds of analog material for centuries using: ◦ Environmental control ◦ Conservation treatments Can store away until resources allow processing ◦ Benign neglect approach works well

15 Analog content is fairly durable Even damaged, may still be identifiable, readable, usable Anatolian Cuneiform Tablet, circa 1850 BCE

16 In contrast digital content is Easily destroyed Transient Hidden Requires more active attention – benign neglect approach doesn’t work

17 Digital content is easily destroyed Bad people Hardware or software failures Human mistakes ◦ The slip of a finger can lead to catastrophic results ◦ “Help! Accidental deletion. I accidentally deleted 62 images… can you please recover them from backups?”

18 Digital content is transient Average lifespan of a Web site is between 44 and 100 days [Screenshots: captured April 8, 2009; visited October 13, 2010]

19 Digital content is hidden Which is corrupt?

20 Digital content is hidden Both. Use helps, but it’s not enough to detect corruption.

21 But is it usable??? It’s not enough to preserve the digital bits ◦ AppleWorks? ◦ WordStar? ◦ Excel 1.0? To use digital content we need software that can read the format

22 Reading formats ffd8ffe000104a46494600010201 008300830000ffed0fb050686f74 6f73686f7020332e30003842494d 03e90a5072696e7420496e666f00 0000007800000000004800480000 000002f40240ffeeffee03060252 0347052803fc0002000000480048 0000000002d80228000100000064 000000010003030300000001270f 0001000100000000000000000000 0000600800190190000000000000 0000000000000000000000000000 0000000000000000000000003842 494d03ed0a5265736f6c7574696f 6e0000000010008313a3000200...

23 Reading formats ffd8ffe000104a46494600010201 008300830000ffed0fb050686f74 6f73686f7020332e30003842494d 03e90a5072696e7420496e666f00 0000007800000000004800480000 000002f40240ffeeffee03060252 0347052803fc0002000000480048 0000000002d80228000100000064 000000010003030300000001270f 0001000100000000000000000000 0000600800190190000000000000 0000000000000000000000000000 0000000000000000000000003842 494d03ed0a5265736f6c7574696f 6e0000000010008313a3000200... SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2...
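The labels on this slide (SOI, APP0, JFIF 1.2, APP13, ...) are what you get by walking the JPEG marker segments in the hex dump. A minimal Python sketch of that walk, run on the first 38 bytes of the dump above. The function names and the small marker table are illustrative assumptions, not taken from any particular preservation tool:

```python
import struct

# Subset of JPEG marker codes (the byte that follows 0xFF) named on the slide.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT",
    0xDA: "SOS", 0xD9: "EOI",
}
STANDALONE = {0xD8, 0xD9}  # SOI and EOI carry no length field


def list_segments(data):
    """Return (marker name, declared length) pairs up to SOS or end of data."""
    segments = []
    pos = 0
    while pos + 2 <= len(data) and data[pos] == 0xFF:
        code = data[pos + 1]
        name = MARKER_NAMES.get(code, "0x%02X" % code)
        pos += 2
        if code in STANDALONE:
            segments.append((name, 0))
            continue
        if pos + 2 > len(data):
            break  # length field itself is cut off
        # Segment length is a big-endian 16-bit value that includes its own 2 bytes.
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        segments.append((name, length))
        if code == 0xDA:
            break  # entropy-coded image data follows; stop the simple walk here
        pos += length
    return segments


def jfif_version(data):
    """Return (major, minor) if the data starts with SOI plus a JFIF APP0 segment."""
    if len(data) >= 13 and data[:4] == b"\xff\xd8\xff\xe0" and data[6:11] == b"JFIF\x00":
        return data[11], data[12]
    return None


# First 38 bytes of the hex dump shown on the slide.
excerpt = bytes.fromhex(
    "ffd8ffe000104a46494600010201"
    "008300830000ffed0fb050686f74"
    "6f73686f7020332e3000"
)

for name, length in list_segments(excerpt):
    print(name, length)
print("JFIF version:", jfif_version(excerpt))
```

On this excerpt the walk reports SOI, a 16-byte APP0 segment identifying JFIF 1.2, and an APP13 segment that declares 4016 bytes, far more than the excerpt contains, so the walk stops there.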

25 Access to information [Diagram: Analog book, unmediated access: information content, symbols, language, HW (paper). Digital book, technology-mediated access: information content, bits, formats, SW, HW]

26 Formats are key to digital preservation [Diagram: information content, bits, formats, SW, HW; labels: digital content, supporting technologies] If the format of our content is unsupported by technology, we can’t access the content’s information!

27 Dependent on fleeting technology We are dependent on technology to interpret (render, play, etc.) digital content No technology sticks around – it all ages and disappears Eventually all digital content in its original format becomes unusable!

28 Format obsolescence Kodak PhotoCD ◦ Used by libraries in the 1990s and into the 2000s as a preservation format ◦ Best decoders were from Kodak and are no longer supported ◦ Very few software decoders remaining – soon images in this format will be unusable ◦ Harvard’s Digital Repository Service has 7,243 of these

29 Two sub-problems Keep the bits safe Keep the information usable as technology changes

30 Safe bits Infrastructure, policies, practices and professional staff to counter risks ◦ High quality storage ◦ Redundancy (multiple copies, multiple locations) ◦ Media refreshing (replacing) ◦ Security and access restrictions ◦ Content recovery ◦ Integrity monitoring (check for corruption)…

31 Integrity monitoring Message digests – effectively unique signatures for digital content ◦ Fixed-size bit strings ▪ 6326ec82b3200df4a87fc54356d2cb73 ◦ Calculated by cryptographic hash functions, e.g. MD5, SHA-1, … Any change to a file results in a changed message digest Useful for detecting corruption
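A minimal Python sketch of integrity monitoring with message digests, using the standard `hashlib` module. The helper names, the three storage locations and the majority-vote rule are assumptions made for illustration. A repository would more typically compare each copy against a digest recorded at ingest:

```python
import hashlib
from collections import Counter


def digest(data, algorithm="md5"):
    """Hex message digest of a byte string (MD5 yields 32 hex characters)."""
    return hashlib.new(algorithm, data).hexdigest()


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Digest a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def suspect_copies(copies):
    """Given {location: bytes}, return locations whose digest differs from
    the most common digest (a simple majority vote, assumed here)."""
    digests = {loc: digest(data) for loc, data in copies.items()}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(loc for loc, d in digests.items() if d != majority)


original = b"digital content"
corrupted = bytearray(original)
corrupted[0] ^= 0x01  # flip a single bit to simulate silent corruption

# Hypothetical storage locations holding redundant copies.
copies = {"site-a": original, "site-b": original, "site-c": bytes(corrupted)}

print(digest(original))
print(digest(bytes(corrupted)))
print("suspect:", suspect_copies(copies))
```

A one-bit flip changes the digest, so the copy at the hypothetical `site-c` is flagged even though the damaged file might still open and look fine.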

32 Usable information People have to be able to find it People must be able to manage it Document what’s important (description, context, ownership, processing history) Know what you are preserving (formats)…

33 A TIFF is a TIFF? TIFF 4.0 TIFF 5.0 TIFF 6.0 TIFF 6.0 extension YCbCr (Class Y) TIFF/IT (ISO 12639:2003) TIFF/EP (ISO 12234-2:2001) RichTIFF EXIF 2.0 EXIF 2.1 (JEIDA-49-1998) EXIF 2.2 (JEITA CP-3451) GeoTIFF 1.0 TIFF-FX (RFC 2301) Class F (RFC 2306) RFC 1314 Canon RAW (.crw, .cr2, .tif) Nikon RAW (.nef) DNG (Adobe Digital Negative)

34 Identifying formats Techniques: “magic numbers”, full parse Few tools ◦ Support limited number of formats ◦ Accuracy varies Some improvements ◦ File Information Tool Set (FITS) ▪ fits.google.code ◦ NARA-sponsored research
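A minimal Python sketch of the “magic numbers” technique: compare the first bytes of a file against a table of known signatures. The table below is a tiny, hand-picked subset and `identify` is a hypothetical helper. This is not how FITS or any other named tool is implemented:

```python
# (signature bytes at offset 0, format label). A small illustrative subset.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian byte order
    (b"MM\x00*", "TIFF"),  # big-endian byte order
]


def identify(header):
    """Return a format label for the leading bytes of a file, or 'unknown'."""
    for signature, label in MAGIC_NUMBERS:
        if header.startswith(signature):
            return label
    return "unknown"


def identify_file(path, probe_size=16):
    """Read only the first few bytes of a file and identify them."""
    with open(path, "rb") as f:
        return identify(f.read(probe_size))


print(identify(bytes.fromhex("ffd8ffe000104a464946")))  # start of the JPEG dump
print(identify(b"%PDF-1.4\n"))
print(identify(b"II*\x00\x08\x00\x00\x00"))
print(identify(b"plain text"))
```

The check is cheap but coarse. TIFF-based formats such as DNG begin with the same bytes as a plain TIFF, so telling the variants on the previous slide apart needs a fuller parse.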

35 Usable information Make sure there’s technology to support the formats! (technology watch) Preservation strategies ◦ Technology preservation ◦ Creation of viewing software ◦ Emulation & variations: ▪ Universal Virtual Machine ▪ Universal Virtual Computer ◦ Format normalization ◦ Format migrations…

36 Key format migration considerations What can’t be lost in the transformation? “Significant properties” ◦ E.g. color, embedded metadata, resolution, ICC profiles, interaction, attachments, fonts, links ◦ How important is each of these properties? – weighted criteria To what format? “Preservable” formats What else must be changed? Ex: Links How many versions to keep?
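One way to read “weighted criteria” is as a weighted score per candidate target format. A minimal Python sketch follows. The weights, the candidate names and the per-property preservation values are all invented for illustration and do not come from the presentation:

```python
def weighted_score(weights, preserved):
    """Weighted fraction of significant properties a migration preserves.

    weights:   {property: importance weight}
    preserved: {property: 0.0 (lost) .. 1.0 (fully kept)}; missing means lost
    """
    total = sum(weights.values())
    return sum(w * preserved.get(prop, 0.0) for prop, w in weights.items()) / total


# Hypothetical importance of each significant property for one collection.
weights = {"color": 5, "resolution": 4, "embedded metadata": 3, "links": 1}

# Hypothetical outcomes of migrating to two candidate formats.
candidates = {
    "candidate A": {"color": 1.0, "resolution": 1.0},
    "candidate B": {"color": 1.0, "resolution": 0.5,
                    "embedded metadata": 1.0, "links": 1.0},
}

scores = {name: weighted_score(weights, kept) for name, kept in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 3))
print("best:", best)
```

A score like this only ranks options. Which properties count as significant, and how heavily, remains a collection management decision.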

37 Preservation lifecycle – a series of hand-offs Create or acquire digital content Ingest into a preservation repository ◦ Continuous cycle of: ▪ Monitoring ▪ Planning ▪ Intervention ◦ Subject to collection management decisions Transfer to next generation of the repository or to a different repository

38 Ongoing commitment Requires continual proactive program ◦ You can’t just start and stop ◦ Time frames are MUCH shorter than for preservation of analog material Requires ongoing investment in infrastructure and staff

39 Can’t do it alone Digital preservation activities must be shared across institutions Even collectively we don’t have adequate resources or understanding

40 Preservation community Collaborative organizations (NDSA, IIPC, OPF) Collaborative projects Standards and best practices Shared infrastructure and tools ◦ Formats registry ◦ Repository software ◦ Preservation planning tools ◦ Format tools

