File Formats in the Context of Archiving
Dr. Thomas Fischer
Göttingen State and University Library (SUB)
EMANI Project Meeting, February 14th - 16th, 2002, Springer-Verlag Heidelberg

Archives Store Different Kinds of Data...
Archives have to deal with different kinds of data:
  raw binary data
  texts
  images
  multimedia
  ...

... in Different File Formats
  binary data: stream of bytes
  text: ASCII, other encodings of simple text, formatted text
  images: vector-oriented or pixel-oriented graphics
  multimedia: a plethora of different file types for different purposes

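The point about "other encodings of simple text" can be illustrated with a minimal Python sketch. The sample word and the choice of Latin-1 and UTF-8 are assumptions made for illustration only; they are not taken from the slides. The sketch shows that the same characters become different byte streams under different encodings, and that reading bytes with the wrong encoding garbles the text without raising an error.

```python
# Illustrative sample text (an assumption, chosen because it contains a non-ASCII letter).
text = "Göttingen"

# The same characters map to different byte sequences under different encodings.
latin1_bytes = text.encode("latin-1")   # one byte per character
utf8_bytes = text.encode("utf-8")       # "ö" takes two bytes

# Pure ASCII cannot represent "ö" at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the wrong encoding does not fail; it silently yields wrong characters.
misread = utf8_bytes.decode("latin-1")

print(len(latin1_bytes), len(utf8_bytes), ascii_ok, misread)
```

For an archive this suggests that the encoding has to be recorded alongside the file, since the bytes alone do not say how they are to be read.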
Focus on...
  mathematics consists mostly of text, formulas, diagrams, and some images
  further contents might be (compiled) programs, interactive simulations etc.
  for learned journals the content is overwhelmingly text with few images

Text!
Text files usually contain two kinds of information:
  textual data providing the contents (words) of the file
  structural data containing the information for the presentation of the text

Two Kinds of Problems
If problems occur with the media or with the program that reads the file, some information may be lost:
  loss of structure leads to loss of formatting
  loss of text leads to loss of meaning
Loss of text is usually considered the more serious of the two.

Two Types of Text File Formats
  structured format (e.g. Microsoft Word, PDF): the file consists of text (more or less uninterrupted) and tables (usually at the beginning or the end of the file) that provide additional information, formatting etc.
  mark-up format (e.g. HTML, XML, RTF, TeX): the file consists of a stream of text with formatting information interspersed

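As a rough illustration of the mark-up case, the following Python sketch separates the text stream from the interspersed formatting information. The HTML fragment and the class name `TextExtractor` are hypothetical, made up for this example; only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the character data of a mark-up stream and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


# Hypothetical mark-up fragment: text with formatting information interspersed.
markup = "<p>Let <i>f</i> be a <b>continuous</b> function.</p>"

parser = TextExtractor()
parser.feed(markup)
parser.close()

plain = "".join(parser.chunks)
print(plain)
```

Because the words sit in the stream in reading order, dropping the tags is enough to recover them; a structured format would first require its tables to be interpreted.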
For Archiving Purposes
  the file format chosen should be readable without the use of specialized programs
  the file format should be robust against damage to the media and loss of data

Types of Text Format
  Mark-up languages like XML or TeX store text and formatting together. The text can be reconstructed using any text editor, and the formatting can probably be regained.
  Structured formats like MS Word or PDF need the dedicated program for proper representation. They may or may not allow the extraction of the text they contain, depending on circumstances that are usually not visible to the user.
Consequence: mark-up formats are better suited for archiving.
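
The robustness argument can be tried out with a small Python sketch. It rests on loud assumptions: the TeX fragment is invented, the helper `damage` is hypothetical, and a zlib-compressed copy serves only as a stand-in for a binary container that needs a dedicated decoder. It is not a model of how MS Word or PDF actually store text. The sketch simulates a media fault by inverting four bytes and compares what remains readable.

```python
import zlib

# Invented TeX fragment standing in for an archived mark-up file.
source = "\\section{Result} Let $f$ be continuous. Then $f$ attains its maximum on $[a,b]$."
markup_bytes = source.encode("ascii")

# Stand-in for a binary container (an assumption, not a real document format).
packed_bytes = zlib.compress(markup_bytes)


def damage(data: bytes, start: int, length: int) -> bytes:
    """Invert `length` bytes beginning at `start` to simulate a media fault."""
    broken = bytes(b ^ 0xFF for b in data[start:start + length])
    return data[:start] + broken + data[start + length:]


# Mark-up file: the fault stays local, the rest can still be read in any editor.
damaged_markup = damage(markup_bytes, 20, 4).decode("ascii", errors="replace")

# Packed file: the decoder is expected to reject the whole stream.
try:
    zlib.decompress(damage(packed_bytes, 10, 4))
    packed_readable = True
except zlib.error:
    packed_readable = False

print(damaged_markup)
print("packed copy readable:", packed_readable)
```

In the mark-up copy only the four damaged characters are lost, while the surrounding words and commands survive. This is one small experiment under the stated assumptions rather than a general proof, but it matches the consequence drawn above.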