EMANI Göttingen
Data Formats in Mathematics: EMANI and DML
EMANI Meeting, Göttingen
Dr. Thomas Fischer, Metadaten und Datenbanken (Metadata and Databases), SUB Göttingen
Overview
  Basic situation
  Purposes of formats
  Formats for purposes
  Text formats for archiving
  Text formats for retrieval
  Image formats for archiving
  Presentation formats: text and images
  Co-operation and compatibility
  Import of data
  Coordination
Basic situation
Archiving for presentation: Preserve the original appearance of documents for clients of electronic journals over a long time.
Archive documents in a fashion independent of software and hardware to minimize migration problems.
Is it possible to unify the procedures of electronic publishing with those of archiving and presenting this material?
Purposes of Formats I
In the EMANI project, we propose to collect a vast corpus of data from different sources:
  Retrodigitized material (different image formats)
  Digital material (different text and image formats)
  Multimedia-type material (interactive, video …)
  Programs
This material comes in different formats because no single format serves the needs of producing all of it. To integrate the material into one collection, standardization will be extremely valuable.
Purposes of Formats II
A closer look:
Retrodigitization: This process produces images. Requirements come from usage and administration. The participants can usually decide on the format.
Digital material: The majority of mathematical text is written in TeX these days, but articles may include images (e.g. EPS). There is material which is not produced in TeX (e.g. editorials) or whose sources are no longer available. Some text-processing formats (Word, WordPerfect) and presentation formats (PS, PDF) are common.
Programs: These should be archived as source code (essentially ASCII). Compiled programs cause migration problems.
Multimedia: Not considered for now (e.g. videos from fractal images).
Formats for Purposes
In the context of archiving mathematics, different formats are needed for different purposes:
Archiving: A stable representation of the content and form of the article. The format should be independent of proprietary software and insensitive to minor errors.
Retrieval: Metadata and probably a full textual representation of the contents. Formulas are still an open question (much more complex than in chemistry).
Presentation: The presentation of the material should be as true as possible to the original “look and feel”. It should be rendered by standard agents and not require special programs (beyond simple plug-ins) on the client’s side. Special measures may be necessary for the visually impaired.
Text formats for archiving
The best-suited formats for archiving are mark-up formats like TeX or MathML. There is some progress in the conversion from MathML to TeX and vice versa.
For documents in TeX, a suitable environment is necessary for correct rendering. This requires documenting and archiving the respective additional files (stylesheets, fonts etc.).
Included images will come in different formats, usually EPS, but PDF and TeX-defined graphics are possible. It is not clear to me how to handle these. The format of the images is not necessarily obvious.
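Since the format of an included image is not necessarily obvious from its file name, one possible first step is to inspect the leading bytes of the file. The following is only a minimal sketch under that assumption: the function names sniff_format and sniff_file are hypothetical, the signature table covers just the formats mentioned in this talk, and a matching signature does not prove that the rest of the file is valid.

```python
from typing import Optional

# Well-known leading byte signatures ("magic numbers") for formats named in
# this talk. The table is illustrative and deliberately incomplete.
SIGNATURES = [
    (b"%!PS-Adobe", "PostScript or EPS"),
    (b"\xc5\xd0\xd3\xc6", "EPS (DOS binary header)"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"AT&TFORM", "DjVu"),
]


def sniff_format(head: bytes) -> Optional[str]:
    """Return a format name if the leading bytes match a known signature."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None  # unknown: needs manual inspection


def sniff_file(path: str) -> Optional[str]:
    """Read the first bytes of a file (hypothetical helper) and classify them."""
    with open(path, "rb") as fh:
        return sniff_format(fh.read(16))
```

TeX-defined graphics cannot be recognized this way, because they live inside the TeX source itself rather than in a separate file.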
Retrieval formats
For metadata, a scheme (application profile) is needed (another work package).
For retrieval using full text, an additional textual layer may have to be added to the text file (unless some TeX-based full-text search mechanism becomes available). This textual layer should be stored in a normalized format; the one provided by the Text Encoding Initiative (TEI) might be useful (mark-up of structural information).
The alternative is an integrated search engine which provides access to the data by storing the relevant information in its own database, separate from the original data (as Google does for internet files).
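The second alternative, a search engine that keeps its own database apart from the archived files, can be illustrated with a very small inverted index. This is a sketch only: the function names and the document identifiers are hypothetical, the tokenizer handles plain ASCII words, and nothing here addresses the open question of searching formulas.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids containing it.

    docs is a dict {doc_id: plain_text}. The archived originals are not
    modified; only this separate index is consulted at search time.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A call such as search(build_index(docs), "complex semigroups") returns only the identifiers of documents in which both words occur.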
Image Formats for Archiving
High-quality images are necessary for further processing of the data: conversion to other formats, OCR, printing.
For management of the files, the possible inclusion of metadata is extremely desirable.
If the images are to be archived in compressed form, the compression algorithm should be lossless and free of copyright restrictions.
This points to 600 dpi TIFF as the standard format, compressed using CCITT G4 compression for bitonal images.
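To give a feeling for why compression matters at 600 dpi, the uncompressed size of a bitonal page can be worked out directly. The sketch below assumes an A4 page (210 mm by 297 mm), which is an assumption for illustration and not a figure from this talk; the function name is hypothetical, and no claim is made about the ratio that CCITT G4 achieves on real scans.

```python
import math


def bitonal_page_bytes(width_mm, height_mm, dpi=600):
    """Pixel dimensions and uncompressed size of a 1-bit-per-pixel scan.

    Each row is padded to a whole number of bytes, as is usual for
    uncompressed bitonal raster data.
    """
    width_px = round(width_mm / 25.4 * dpi)
    height_px = round(height_mm / 25.4 * dpi)
    row_bytes = math.ceil(width_px / 8)  # 8 pixels per byte
    return width_px, height_px, row_bytes * height_px


# Assumed example: an A4 page scanned at 600 dpi.
width_px, height_px, size = bitonal_page_bytes(210, 297)
print(width_px, height_px, size)  # 4961 7016 4356936
```

Under these assumptions a single uncompressed page already takes more than 4 million bytes, so a journal volume of several hundred pages reaches the gigabyte range before compression.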
Presentation formats I
TeX is not well suited for the presentation of mathematical articles on the net:
  Additional files such as stylesheets are required
  Special fonts are necessary
  The IBM techexplorer Hypermedia Browser shows the possibilities and limitations
TeX files have to be processed on the server side and delivered in a unified format. Possible options are DVI, PS, PDF and DjVu.
Since DVI viewers usually exist only in a TeX environment, and almost the same holds for GhostView for reading PS on screen, PDF or DjVu have to be considered.
Presentation formats II
Image files from scanning are usually very large, so they would create a heavy load when sent over the net.
Image files have to be processed on the server side and delivered in a unified format. This format may differ depending on the resolution desired on the client’s side, e.g. for viewing on screen or for high-quality printing.
Possible options are JPEG, PDF and DjVu.
Cooperation: Exchange of Metadata
Metadata come in different formats. Springer uses the MAJOUR-header DTD (European Workgroup on SGML) in different versions (note: this is fairly complicated; the original documentation has 151 pages).
This header is presented in SGML mark-up and/or RDF syntax. Both can technically be imported into an envisaged EMANI system.
The compatibility of the metadata schemes has to be studied (richness and availability of data compared with the emerging EMANI scheme).
MAJOUR-header: SGML
Springer-Verlag Berlin Heidelberg 211 Numerische Mathematik Numer. Math X NUMMA Original article Springer-Verlag Berlin Heidelberg 2000
MAJOUR-header: RDF
On the dual of complex Ol'shanski\uı semigroups EN <rdfs:label rdf:parseType="Literal">English
Cooperation: Import of Articles
Import of articles in PDF format:
  Quality check: appropriate resolution, scalability, printable?
  Standards for handling PDF are needed
Import of articles in TeX format:
  Check for necessary additional files
  Create an appropriate container for all files referring to one article
  Create a structure to manage general additional files
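The check for necessary additional files could start from the TeX source itself. The following is a rough sketch, assuming LaTeX-style commands with brace-delimited arguments; the function names are hypothetical, and macros that load files indirectly (for example from inside a class file) are not detected, so the result is a lower bound on what the article container has to hold.

```python
import re

# Commands whose arguments name files the article depends on.
PATTERNS = {
    "class": re.compile(r"\\documentclass(?:\[[^\]]*\])?\{([^}]+)\}"),
    "package": re.compile(r"\\usepackage(?:\[[^\]]*\])?\{([^}]+)\}"),
    "input": re.compile(r"\\(?:input|include)\{([^}]+)\}"),
    "graphic": re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}"),
    "bibliography": re.compile(r"\\bibliography\{([^}]+)\}"),
}


def strip_comments(source):
    """Drop TeX comments: an unescaped % up to the end of the line."""
    return re.sub(r"(?<!\\)%.*", "", source)


def list_dependencies(source):
    """Return {kind: [names]} for the files referenced in a LaTeX source."""
    source = strip_comments(source)
    deps = {}
    for kind, pattern in PATTERNS.items():
        names = []
        for group in pattern.findall(source):
            # One command may list several names separated by commas.
            names.extend(n.strip() for n in group.split(",") if n.strip())
        deps[kind] = names
    return deps
```

The returned names still have to be resolved to actual files (for instance a class name to its .cls file) before they can be copied into the container for the article.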
TeX files from Springer
TeX needs a friendly environment:
  Document Class: svjour 2001/10/17 LaTeX document class for Springer journals - version 1.9
  Class Springer-SVJour Warning: Specified option or subpackage "leqno" not found - on input line 92.
  ! Class Springer-SVJour Error: No valid journal specified in option list.
  See the Springer-SVJour class documentation for explanation.
  Type H for immediate help.
  ...
  l.93...ournal specified in option list}{}
  ?
  ) )
  No pages of output.
The run produced output after installation of svjour.cls, svnummat.clo and TOTAL00.NUM.
But: references missing!
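When many articles are imported, logs like the one above could be screened automatically instead of being read by hand. The sketch below relies on two conventions of TeX logs: error lines begin with "! ", and class or package warnings contain "Warning:". The function name scan_log and the shape of the returned dictionary are hypothetical, and wrapped log lines are not rejoined.

```python
import re

# Matches e.g. "Class Springer-SVJour Warning: ..." or "LaTeX Warning: ...".
WARNING_RE = re.compile(r"^(?:Class|Package|LaTeX) (?:\S+ )?Warning: (.*)$")


def scan_log(lines):
    """Collect errors and warnings from the lines of a TeX log.

    Returns a dict with the error messages, the warning messages, and a
    flag telling whether TeX reported that no pages were produced.
    """
    errors, warnings = [], []
    no_output = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("! "):
            errors.append(line[2:])
        elif "No pages of output" in line:
            no_output = True
        else:
            match = WARNING_RE.match(line)
            if match:
                warnings.append(match.group(1))
    return {"errors": errors, "warnings": warnings, "no_output": no_output}
```

A clean log does not guarantee a complete article: missing references, as in the case above, typically show up only as warnings or as question marks in the output, so the warnings list deserves a look even when no error is reported.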
Cooperation
Internal:
  NSF/DFG: Access to Mathematical Literature over Time
  Jahrbuch Projekt
  Mathematical Monographs
  ...
External (?):
  DML
  NUMDAM
  Elsevier?
  ...
Thank you for your attention!