Analysing the Impact of File Formats on Data Integrity Volker Heydegger University of Cologne Archiving 2008 Bern, 23rd – 27th June 2008.

1 Analysing the Impact of File Formats on Data Integrity Volker Heydegger University of Cologne Archiving 2008 Bern, 23rd – 27th June 2008

2 Overview
Volker Heydegger | Archiving 2008 | 25th June 2008 | Bern, Switzerland
- Introduction
- File format data and information loss
  - What happens if data is corrupted in files?
  - Categories of file format data
- Measuring information loss
  - Robustness indicators
  - Study results for different file formats

4 Background
- EU-funded project “Planets”: characterisation of file format content (www.planets-project.eu)
- University of Cologne, Computer Science for the Humanities (Historisch-Kulturwissenschaftliche Informationsverarbeitung, HKI): Planets partner (www.hki.uni-koeln.de)

5 Context
Long-term preservation of digital information: which file format to choose?
Criteria, e.g.:
- Open standard
- Spread of usage
- Hardware/software dependencies
- Authenticity
- …
- Robustness

6 Robustness ::= error resilience of file formats against bit-stream corruption

7 Issues / research topics
- Is there any correlation between file format and data integrity?
- If so, are there any differences among file formats concerning the degree of robustness?
- Which file-format-based factors are responsible for varying degrees of robustness?
- How can we improve the robustness of file formats?

8 Benefits
- Digital preservation: decision support for choosing a file format for long-term preservation
- Contribution to file format research
- Improvement of existing file formats
- Design of future file formats

10 File Format Data and Information Loss
What is “file format” in our context?
- A set of rules constituting the logical organisation of data
- A set of rules indicating how to interpret data
- Set of rules → file format specification
File format data ::= binary data, formatted according to the rules of a file format

11 What happens if data is corrupted in files?
Test image: TIF, greyscale, 32x32 pixels, 8 bits per pixel

12 First 224 bytes of the test file [figure: hex dump; byte value shown: FF]

13 Plain information loss: 1 byte of data == 1 pixel
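
The plain case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pixel buffer, the corrupted offset (100) and the helper names `corrupt_one_byte` and `count_changed` are hypothetical and not taken from the talk. With one byte per pixel, overwriting one byte of pixel data affects exactly one pixel.

```python
WIDTH, HEIGHT = 32, 32  # same size as the 32x32, 8-bit greyscale test image


def corrupt_one_byte(pixels: bytes, offset: int, new_value: int) -> bytes:
    """Return a copy of pixels with the byte at offset set to new_value."""
    damaged = bytearray(pixels)
    damaged[offset] = new_value
    return bytes(damaged)


def count_changed(before: bytes, after: bytes) -> int:
    """Count the positions at which two equal-length buffers differ."""
    return sum(a != b for a, b in zip(before, after))


# Hypothetical pixel data: a uniform mid-grey image, one byte per pixel.
original = bytes([0x80]) * (WIDTH * HEIGHT)
damaged = corrupt_one_byte(original, offset=100, new_value=0xFF)
changed = count_changed(original, damaged)
print(changed, "of", len(original), "pixels changed")
```

The damage stays local because each pixel value is stored independently of every other byte in the buffer.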

14 What happens if data is corrupted in files?
Test image: TIF, greyscale, 32x32 pixels, 8 bits per pixel

15 Part of the TIF Image File Directory, tag: Photometric Interpretation 00

16 Conditional information loss: 1 bit changes == 100% information changed
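
A rough sketch of why one bit in this tag can change everything: in TIFF, PhotometricInterpretation 0 means WhiteIsZero and 1 means BlackIsZero, so flipping the lowest bit of the tag value inverts how every stored sample is displayed. The `render` function, the sample data and the direction of the flip (1 to 0) are assumptions made for illustration; the slide does not state which direction was tested.

```python
WHITE_IS_ZERO = 0  # TIFF PhotometricInterpretation value 0
BLACK_IS_ZERO = 1  # TIFF PhotometricInterpretation value 1


def render(pixels: bytes, photometric: int) -> bytes:
    """Map stored 8-bit samples to display intensities (0 = black, 255 = white)."""
    if photometric == BLACK_IS_ZERO:
        return bytes(pixels)
    if photometric == WHITE_IS_ZERO:
        return bytes(255 - value for value in pixels)
    raise ValueError("unsupported photometric interpretation")


# Hypothetical payload: 1024 samples (32 x 32 pixels) covering all grey levels.
pixels = bytes(range(256)) * 4

tag_before = BLACK_IS_ZERO
tag_after = tag_before ^ 0x01  # flip the lowest bit of the tag value: 1 -> 0

shown_before = render(pixels, tag_before)
shown_after = render(pixels, tag_after)
changed = sum(a != b for a, b in zip(shown_before, shown_after))
share = changed / len(pixels) * 100
print(f"{changed} of {len(pixels)} pixels displayed differently ({share:.0f}%)")
```

No payload byte is touched here. The loss is conditional on technical data that controls how the payload is interpreted.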

17 Categories of File Format Data
Technical data (data for processing):
- Image width: 277
- Image length: 339
- Compression: uncompressed

18 “Payload” data (basic data of usage): pixel data, starting from byte #0x008

20 Robustness Indicators
(1) $R_B = \Delta(b_0, b_1) / m$
where
i. $b_0$ is the basic data of usage before being corrupted,
ii. $b_1$ is the basic data of usage after being corrupted,
iii. $m$ is the number of corruption procedures.
Here $\Delta(b_0, b_1)$ is the number of bytes changed, summed over the $m$ procedures. $R_B$ indicates an average information loss.

21 Example
A file X has 2000 bytes of payload data. Suppose the file is corrupted 3 times and the number of bytes changed in each corruption procedure is:
1. $\Delta(b_0, b_1) = 200$ bytes
2. $\Delta(b_0, b_1) = 150$ bytes
3. $\Delta(b_0, b_1) = 250$ bytes
The average information loss for file X based on 3 corruption procedures is then $R_B = 600 / 3 = 200$.
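
The same arithmetic as a minimal Python sketch. The function name `robustness_rb` is hypothetical; the three per-procedure values are the ones from the example.

```python
def robustness_rb(deltas: list[int]) -> float:
    """Average number of payload bytes changed per corruption procedure (R_B)."""
    if not deltas:
        raise ValueError("at least one corruption procedure is required")
    return sum(deltas) / len(deltas)


# Bytes changed in each of the 3 corruption procedures from the example.
deltas = [200, 150, 250]
r_b = robustness_rb(deltas)
print(r_b)  # 200.0
```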

22 $R_B$ related to the total amount of payload data:
(2) $R_{Bt} = R_B / n$
where $n$ is the total number of bytes of basic data of usage (payload data).
(3) $R_{Bt} = R_B / n \cdot 100$, i.e. $R_{Bt}$ expressed as a percentage
Interpretation: $R_{Bt} = 0\,\%$: max. robustness (min. information loss)

23 Example (continued)
(2) $R_{Bt} = 200 / 2000 = 0.1$
(3) $R_{Bt} = 200 / 2000 \cdot 100 = 10\,\%$
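
Formulas (2) and (3) can be sketched in Python as follows. The function name `robustness_rbt` and its `as_percent` switch are illustrative assumptions, not part of the original study tooling.

```python
import math


def robustness_rbt(r_b: float, n: int, as_percent: bool = False) -> float:
    """Relate the average loss R_B to the payload size n (R_Bt)."""
    if n <= 0:
        raise ValueError("payload size n must be positive")
    ratio = r_b / n
    return ratio * 100 if as_percent else ratio


# Values from the example: R_B = 200 bytes, n = 2000 bytes of payload data.
r_bt = robustness_rbt(200, 2000)
r_bt_percent = robustness_rbt(200, 2000, as_percent=True)
print(r_bt, r_bt_percent)

# 0 % means maximum robustness: no payload byte changed on average.
assert math.isclose(robustness_rbt(0, 2000, as_percent=True), 0.0)
```

Dividing by n makes values comparable across files of different payload size, which an absolute byte count such as R_B does not allow.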

24 Study on Robustness for various file formats: Example Results
- TIF: uncompressed, LZW, JPEG (2 different compression levels), ZIP
- PNG (filtered, unfiltered)
- JPEG2000 (lossless, lossy)
- BMP (uncompressed)

25 Study on Robustness for various file formats: Example Results
Method
- Simulation of file corruption: every file is corrupted up to 3000 times (3000 corruption procedures)
- Applying 3-5 different corruption ratios: less than 0.01%, 0.01%, 0.1%, 1.0%, more than 1.0%

26 Method
- Compressed payload data is decompressed
- Original and corrupted payload data are compared
- Robustness indicator values are computed
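
The method steps above can be approximated with a small simulation. This is a sketch under several assumptions, not the tool used in the study: raw bytes and a zlib stream stand in for real image formats, corruption is modelled as overwriting randomly chosen bytes, and a stream that can no longer be decompressed is counted as total payload loss. All function names are hypothetical.

```python
import random
import zlib


def corrupt(data: bytes, ratio: float, rng: random.Random) -> bytes:
    """Change round(len(data) * ratio) randomly chosen bytes (at least one)."""
    damaged = bytearray(data)
    count = max(1, round(len(data) * ratio))
    for pos in rng.sample(range(len(data)), count):
        damaged[pos] ^= rng.randrange(1, 256)  # non-zero XOR always changes the byte
    return bytes(damaged)


def payload_delta(original: bytes, recovered: bytes) -> int:
    """Number of original payload bytes that differ or are missing after recovery."""
    differing = sum(a != b for a, b in zip(original, recovered))
    missing = max(0, len(original) - len(recovered))
    return differing + missing


def decode_zlib(stored: bytes) -> bytes:
    """Decompress; an unreadable stream yields no payload (assumed total loss)."""
    try:
        return zlib.decompress(stored)
    except zlib.error:
        return b""


def simulate(payload, encode, decode, ratio, procedures, seed=0) -> float:
    """Return R_Bt in percent for one payload, one encoding and one corruption ratio."""
    rng = random.Random(seed)
    stored = encode(payload)
    total = 0
    for _ in range(procedures):
        recovered = decode(corrupt(stored, ratio, rng))
        total += payload_delta(payload, recovered)
    r_b = total / procedures            # formula (1)
    return r_b / len(payload) * 100     # formula (3)


# Hypothetical 64x64 greyscale payload with a simple repeating pattern.
payload = bytes((x * 3 + y * 5) % 256 for y in range(64) for x in range(64))

raw_loss = simulate(payload, bytes, bytes, ratio=0.001, procedures=200)
zlib_loss = simulate(payload, zlib.compress, decode_zlib, ratio=0.001, procedures=200)
print(f"uncompressed: {raw_loss:.3f} %   zlib-compressed: {zlib_loss:.3f} %")
```

In this toy setting the uncompressed payload loses only the bytes that were hit, while nearly every corruption of the compressed stream makes it undecodable. Real formats may behave differently, for example by recovering part of a damaged stream.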

28 Example: JP2-formatted image, corruption of 1 byte, “bad case”

29 Example: JP2-formatted image, corruption of 1 byte, “good case”

30 Example: JP2-formatted image, corruption of 1 byte, “good case” with visualized differences in pixel data

31 Thank you very much! Volker Heydegger University of Cologne Archiving 2008 Bern, 23rd – 27th June 2008

