1
Analysing the Impact of File Formats on Data Integrity
Volker Heydegger, University of Cologne
Archiving 2008, Bern, 23rd – 27th June 2008
2
Overview
Volker Heydegger | Archiving 2008 | 25th June 2008 | Bern, Switzerland
- Introduction
- File format data and information loss
- What happens if data is corrupted in files?
- Categories of file format data
- Measuring information loss
- Robustness indicators
- Study results for different file formats
4
Background
- EU-funded project “Planets”: characterisation of file format content (www.planets-project.eu)
- University of Cologne, Computer Science for the Humanities (Historisch-Kulturwissenschaftliche Informationsverarbeitung, HKI), Planets partner (www.hki.uni-koeln.de)
5
Context
- Long-term preservation of digital information
- Which file format to choose?
- Criteria, e.g.:
  - Open standard
  - Spread of usage
  - Hardware/software dependencies
  - Authenticity
  - …
  - Robustness
6
Robustness ::= error resilience of file formats against bitstream corruption
7
Issues / research topics
- Is there any correlation between file format and data integrity?
- If so, are there any differences among file formats concerning the degree of robustness?
- Which file-format-based factors are responsible for varying degrees of robustness?
- How can we improve the robustness of file formats?
8
Benefits
- Digital preservation: decision support for choosing a file format for long-term preservation
- Contribution to file format research
- Improvement of existing file formats
- Design of future file formats
10
File Format Data and Information Loss
What is “file format” in our context?
- Set of rules constituting the logical organisation of data
- Set of rules indicating how to interpret data
- Set of rules: file format specification
File format data ::= binary data, formatted according to the rules of a file format
11
What happens if data is corrupted in files?
Test image: TIFF, greyscale, 32x32 pixels, 8 bits per pixel
12
[Figure: first 224 bytes of the test file; byte value shown: FF]
13
Plain information loss: 1 byte of data == 1 pixel
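The "plain" case can be reproduced with a minimal sketch. It assumes the payload is a raw, uncompressed buffer with one byte per pixel, matching the 32x32, 8-bit test image; the pixel values, the corrupted offset, and the helper `count_changed_pixels` are made up for illustration and are not taken from the study.

```python
WIDTH, HEIGHT = 32, 32  # test image dimensions from the slides

def count_changed_pixels(original: bytes, corrupted: bytes) -> int:
    """Count pixels (1 byte each at 8 bits per pixel) whose value differs."""
    return sum(a != b for a, b in zip(original, corrupted))

# Hypothetical pixel data: values 0..199, so no pixel is 0xFF initially.
payload = bytes(i % 200 for i in range(WIDTH * HEIGHT))

# Corrupt a single payload byte by overwriting it with 0xFF
# (the offset 100 is arbitrary).
damaged = bytearray(payload)
damaged[100] = 0xFF
damaged = bytes(damaged)

changed = count_changed_pixels(payload, damaged)
print(f"{changed} of {WIDTH * HEIGHT} pixels changed")
```

Because every payload byte maps to exactly one pixel here, the damage stays local to the corrupted byte.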
14
What happens if data is corrupted in files?
Test image: TIFF, greyscale, 32x32 pixels, 8 bits per pixel
15
[Figure: part of the TIFF Image File Directory, tag: PhotometricInterpretation; byte value shown: 00]
16
Conditional information loss: 1 bit changed == 100% of the information changed
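A rough sketch of the "conditional" case follows. It relies on the TIFF 6.0 convention that PhotometricInterpretation 0 means WhiteIsZero and 1 means BlackIsZero; the `render` function and the pixel data are hypothetical, and the slides do not state which bit was flipped in the original experiment.

```python
WHITE_IS_ZERO, BLACK_IS_ZERO = 0, 1  # PhotometricInterpretation values (TIFF 6.0)

def render(payload: bytes, photometric: int) -> bytes:
    """Map stored 8-bit samples to displayed brightness (0 = black, 255 = white)."""
    if photometric == BLACK_IS_ZERO:
        return payload
    return bytes(255 - v for v in payload)  # WhiteIsZero inverts every sample

payload = bytes(i % 256 for i in range(32 * 32))  # hypothetical pixel data

before = render(payload, BLACK_IS_ZERO)
# Flip the lowest bit of the tag value; the payload bytes stay untouched.
after = render(payload, BLACK_IS_ZERO ^ 1)

changed = sum(a != b for a, b in zip(before, after))
fraction_changed = changed / len(payload)
print(f"{fraction_changed:.0%} of pixels displayed differently")
```

The payload is bit-for-bit identical in both renderings, yet every pixel is interpreted differently, which is why damage to technical data can outweigh damage to payload data.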
17
Categories of File Format Data
Technical data (data for processing):
- Image width: 277
- Image length: 339
- Compression: uncompressed
18
“Payload” data (basic data of usage): pixel data, starting from byte #0x008
20
Robustness Indicators
(1) $R_B = \frac{\sum \Delta(b_0, b_1)}{m}$
where
i. $b_0$ is the basic data of usage before being corrupted,
ii. $b_1$ is the basic data of usage after being corrupted,
iii. $m$ is the number of corruption procedures.
$\Delta(b_0, b_1)$ is the number of bytes that differ between $b_0$ and $b_1$, and the sum runs over all $m$ corruption procedures.
$R_B$ indicates the average information loss.
21
Example
A file X has 2000 bytes of payload data. Suppose the file is corrupted 3 times and the number of bytes changed by each corruption procedure is:
1. $\Delta(b_0, b_1) = 200$ bytes
2. $\Delta(b_0, b_1) = 150$ bytes
3. $\Delta(b_0, b_1) = 250$ bytes
The average information loss for file X based on 3 corruption procedures is then $R_B = 600 / 3 = 200$ bytes.
22
$R_B$ relative to the total amount of payload data:
(2) $R_{Bt} = R_B / n$
where $n$ is the total amount of basic data of usage (payload data).
(3) $R_{Bt} = R_B / n \cdot 100$, i.e. $R_{Bt}$ expressed as a percentage
Interpretation: $R_{Bt} = 0\,\%$: maximum robustness (minimum information loss)
23
Example (continued)
(2) $R_{Bt} = 200 / 2000 = 0.1$
(3) $R_{Bt} = 200 / 2000 \cdot 100 = 10$ (%)
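The three indicators can be recomputed from the example figures with a short sketch. The function name `robustness_indicators` and its signature are illustrative only; the input values (200, 150, 250 changed bytes and 2000 bytes of payload) come from the example for file X.

```python
def robustness_indicators(deltas, n):
    """Return (R_B, R_Bt, R_Bt in percent).

    deltas: bytes changed in the payload, one entry per corruption procedure
    n:      total number of payload bytes
    """
    m = len(deltas)            # number of corruption procedures
    r_b = sum(deltas) / m      # (1) average information loss in bytes
    r_bt = r_b / n             # (2) loss relative to the payload size
    r_bt_pct = r_b * 100 / n   # (3) the same value as a percentage
    return r_b, r_bt, r_bt_pct

r_b, r_bt, r_bt_pct = robustness_indicators([200, 150, 250], n=2000)
print(r_b, r_bt, r_bt_pct)
```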
24
Study on robustness for various file formats: example results
- TIFF
  - uncompressed
  - LZW
  - JPEG (2 different compression levels)
  - ZIP
- PNG (filtered, unfiltered)
- JPEG2000 (lossless, lossy)
- BMP (uncompressed)
25
Study on robustness for various file formats: example results
Method
- Simulation of file corruption: every file is corrupted up to 3000 times (3000 corruption procedures)
- Applying 3-5 different corruption ratios:
  - less than 0.01%
  - 0.01%
  - 0.1%
  - 1.0%
  - more than 1.0%
26
Method
- Compressed payload data is decompressed
- The original payload data is compared with the corrupted payload data
- Robustness indicator values are computed
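The method can be approximated by a small simulation. This is a sketch under several assumptions, not the tooling used in the study: a bare byte buffer stands in for an uncompressed file, `zlib` stands in for a compressed format, a stream that can no longer be decompressed is counted as total loss, and all function names are hypothetical.

```python
import random
import zlib

def corrupt(data: bytes, n_bytes: int, rng: random.Random) -> bytes:
    """Change n_bytes randomly chosen bytes; XOR with 1..255 guarantees a change."""
    out = bytearray(data)
    for pos in rng.sample(range(len(out)), n_bytes):
        out[pos] ^= rng.randrange(1, 256)
    return bytes(out)

def delta(b0: bytes, b1: bytes) -> int:
    """Payload bytes that differ; bytes missing from b1 count as changed."""
    changed = sum(x != y for x, y in zip(b0, b1))
    return changed + max(0, len(b0) - len(b1))

def r_bt_percent(payload, encode, decode, ratio, m, seed=0):
    """Corrupt the stored form m times and return R_Bt in percent."""
    rng = random.Random(seed)
    stored = encode(payload)
    n_bytes = max(1, round(len(stored) * ratio))  # corruption ratio -> byte count
    total = 0
    for _ in range(m):
        try:
            recovered = decode(corrupt(stored, n_bytes, rng))
        except zlib.error:
            recovered = b""  # assumption: an undecodable stream is a total loss
        total += delta(payload, recovered)
    r_b = total / m
    return r_b / len(payload) * 100

payload = bytes(range(256)) * 4  # 1024 bytes of hypothetical payload data
raw_pct = r_bt_percent(payload, bytes, bytes, ratio=0.001, m=100)
zlib_pct = r_bt_percent(payload, zlib.compress, zlib.decompress, ratio=0.001, m=100)
print(f"uncompressed: {raw_pct:.3f} %   zlib: {zlib_pct:.3f} %")
```

Treating a failed decompression as total loss is a simplification; a real decoder may recover part of the payload, so the compressed figure here is an upper bound rather than a measurement.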
28
Example: JP2-formatted image, corruption of 1 byte, “bad case”
29
Example: JP2-formatted image, corruption of 1 byte, “good case”
30
Example: JP2-formatted image, corruption of 1 byte, “good case” with visualized differences in pixel data
31
Thank you very much! Volker Heydegger University of Cologne Archiving 2008 Bern, 23rd – 27th June 2008