Published by Hollie Baker. Modified over 9 years ago.
1
Document Formats
How to Build a Digital Library
Ian H. Witten and David Bainbridge
2
Documents
  Building blocks of digital libraries
  Many different standards for documents
  Internationalization
  Fixed versus fluid
  Permanent versus transient
  Indexing
3
Representing Characters
  EBCDIC
    Extended Binary Coded Decimal Interchange Code
    Represented in 8 bits
  ASCII (1968)
    American Standard Code for Information Interchange
    Represented with 7 bits
    Does not support many foreign languages
    Many expansions made to the basic ASCII character set
  ISCII (1983)
    Indian Script Code for Information Interchange
    Hindi and related languages
  GB and Big-5 for Chinese
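
The 7-bit limit of ASCII can be seen directly. The following is a minimal sketch in Python that uses only the built-in codecs; the helper name `fits_in_ascii` and the sample strings are illustrative assumptions, not part of the slides. The codec `cp037` is one of several EBCDIC code pages that Python ships.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character has a code point below 128 (7 bits)."""
    return all(ord(ch) < 128 for ch in text)

# Illustrative samples: English, French, Hindi.
samples = ["Digital Library", "bibliothèque", "पुस्तकालय"]
for sample in samples:
    print(sample, fits_in_ascii(sample))

# The same letter has different byte values in ASCII and in an EBCDIC code page.
ascii_a = "A".encode("ascii")
ebcdic_a = "A".encode("cp037")
print(ascii_a, ebcdic_a)

# Encoding non-ASCII text with the ASCII codec raises an error.
try:
    "bibliothèque".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("encodable as ASCII:", encodable)
```

Only the first sample passes the check, which is why extended and script-specific character sets such as ISCII, GB, and Big-5 appeared.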
4
Unicode
  Successor of ASCII
  ISO-10646 (1993)
  Universal
    Aims to represent all the world’s languages
  Default encoding for HTML and XML
  Development began in 1988 as a joint effort between Apple and Xerox
  Unicode standard continues to evolve
  Round-trip compatibility: text in an established character set can be converted to Unicode and back without loss
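
Round-trip compatibility can be checked for a particular string and character set. This is a minimal sketch, assuming Python's built-in `latin-1`, `big5`, and `gb2312` codecs; the function name `survives_round_trip` and the sample text are hypothetical. It tests individual strings only and does not prove the property for a whole character set.

```python
def survives_round_trip(legacy_bytes: bytes, codec: str) -> bool:
    """Decode legacy bytes to Unicode, re-encode, and compare with the input."""
    as_unicode = legacy_bytes.decode(codec)  # a Python str holds Unicode code points
    return as_unicode.encode(codec) == legacy_bytes

# Build legacy-encoded byte strings for the illustration.
latin1_bytes = "café".encode("latin-1")
big5_bytes = "中文".encode("big5")
gb_bytes = "中文".encode("gb2312")

print(survives_round_trip(latin1_bytes, "latin-1"))
print(survives_round_trip(big5_bytes, "big5"))
print(survives_round_trip(gb_bytes, "gb2312"))

# The same two characters have different bytes in each legacy encoding,
# but decode to the same Unicode string.
same_text = big5_bytes.decode("big5") == gb_bytes.decode("gb2312")
print(big5_bytes != gb_bytes, same_text)
```

The last line shows the practical benefit: once decoded, Big-5 and GB text can be compared and indexed together.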
5
Unicode Character Set
  Unicode standard is massive
  Two parts of the standard: ISO 10646-1 and ISO 10646-2
  94,000 characters defined
  Represents scripts
    Scripts versus languages
    Punctuation shared among scripts
  Universal character set: characters at the core of Unicode
6
Representing Documents
  Plain text
  Full-text indexing
    Bag of words
    Inversion of the text
    Inverted files
    Granularity of document
    Granularity of index
  Word segmentation
    Chinese and Japanese are written without spaces
    Spacing in Chinese sentences can completely change the meaning
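
Inverting the text can be sketched in a few lines. The example below is an illustration under stated assumptions, not the method of any particular digital library system: the function names, the document ids, and the sample texts are hypothetical; index granularity is the whole document; and words are split with a simple regular expression.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """Split text into lowercase words (a bag of words: order is discarded)."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return dict(index)

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every query word (Boolean AND)."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))

docs = {
    "d1": "How to build a digital library",
    "d2": "Digital formats for audio and video",
    "d3": "The public library",
}
index = build_inverted_index(docs)
print(search(index, "digital library"))
print(search(index, "library"))
```

The tokenizer relies on spaces and punctuation to find word boundaries, so it would treat an unsegmented Chinese or Japanese sentence as one long word. Those languages need a separate word segmentation step before indexing.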
7
Page Description Languages
  Device independence
  PostScript (also a programming language)
    First commercially developed page description language (1985)
    Fonts: Type 1, TrueType, OpenType
    Text extraction
    Using PostScript in a digital library
8
Page Description Languages
  Portable Document Format (PDF)
    PDF versus PostScript
    Not a full-scale programming language
    New features for interactive display
    Random access to pages
    Hierarchically structured content
    Navigation within a document
    Hyperlinks
    File format: header, objects, cross-references, trailer
    Searchable image option
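
The four-part file layout (header, objects, cross-references, trailer) can be illustrated by scanning the raw bytes. This is a rough sketch, not a PDF parser: the function name `pdf_outline` is hypothetical, the regular expressions are simplifications that could miscount in files with binary streams, and `sample` is a toy byte string shaped like a PDF that is not guaranteed to open in a viewer.

```python
import re

def pdf_outline(data: bytes) -> dict:
    """Report the header version, a naive object count, and the xref offset."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    if header is None:
        raise ValueError("missing %PDF- header")
    # 'startxref' near the end of the file gives the byte offset of the
    # cross-reference table, which is what allows random access to objects.
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if tail is None:
        raise ValueError("missing startxref / %%EOF")
    return {
        "version": header.group(1).decode("ascii"),
        "objects": len(re.findall(rb"\d+ \d+ obj\b", data)),
        "xref_offset": int(tail.group(1)),
    }

# Toy file: header and one object, then cross-references and trailer.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
)

info = pdf_outline(sample)
print(info)
print(sample[info["xref_offset"]:info["xref_offset"] + 4])
```

A reader starts at the end of the file, follows the offset to the cross-reference table, and from there jumps straight to any object, which is how PDF supports random access to pages.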
9
Word-Processor Documents
  Rich Text Format
    240-page specification
    Document-level metadata
    Software for conversion to HTML is available
  Native Word formats
    Binary
    Proprietary
  LaTeX format
    Typed formatting commands
    Non-proprietary
10
Representing Images
  Lossless image compression
    GIF
    PNG
    JPEG-Lossless
    JPEG-2000
  Lossy image compression
    JPEG
  Progressive refinement
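
The difference between lossless and lossy compression can be shown on a synthetic image. This is a sketch under assumptions: `pixels` is a made-up 64 x 64 grayscale pattern, the lossless step uses DEFLATE through Python's `zlib` (the algorithm PNG uses, though this is not a PNG encoder), and the lossy step is a simple bit-depth reduction standing in for a real lossy codec. It is not the JPEG algorithm.

```python
import zlib

# Synthetic 64 x 64 grayscale image, one byte per pixel (illustrative data).
pixels = bytes((x * 7 + y * 3) % 256 for y in range(64) for x in range(64))

# Lossless: decompression restores every byte exactly.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)
is_lossless = restored == pixels

def quantize(data: bytes, bits: int = 4) -> bytes:
    """Keep only the top `bits` bits of each pixel; the discarded detail is gone."""
    shift = 8 - bits
    return bytes((value >> shift) << shift for value in data)

# Lossy stand-in: the original pixels cannot be recovered from `coarse`.
coarse = quantize(pixels)
max_error = max(abs(a - b) for a, b in zip(pixels, coarse))

print("lossless round trip:", is_lossless)
print("lossy copy identical:", coarse == pixels)
print("largest pixel error after quantizing:", max_error)
print("sizes:", len(pixels), len(packed), len(zlib.compress(coarse)))
```

For a digital library the choice matters for preservation: a lossless master can always be converted to a smaller lossy copy for delivery, but not the other way around.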
11
Representing Audio and Video
  Evolution of signals over time
  Sample rate
    Samples per second
  Multimedia compression
    Codec
    Asymmetry
    Redundancy
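
Sample rate translates directly into storage and bandwidth. The arithmetic below is a small sketch: the function names are hypothetical, the CD audio parameters (44,100 samples per second, 16 bits, 2 channels) are standard values, and the 128 kbit/second target is an assumed example of a compressed rate rather than a figure from the slides.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bits per second of uncompressed (PCM) audio."""
    return sample_rate_hz * bits_per_sample * channels

def megabytes_per_minute(bit_rate: int) -> float:
    """Storage for one minute of audio, in units of 10**6 bytes."""
    return bit_rate / 8 * 60 / 1_000_000

cd_rate = pcm_bit_rate(44_100, 16, 2)   # CD audio parameters
target_rate = 128_000                   # assumed compressed rate, for illustration

print(cd_rate)                          # 1411200
print(megabytes_per_minute(cd_rate))
print(cd_rate / target_rate)            # compression ratio the codec must achieve
```

Uncompressed CD audio alone comes to about 1.4 Mbit/second, which is why codecs that remove redundancy are needed before audio and video can share a modest channel.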
12
MPEG
  ISO Moving Picture Experts Group (1988)
  Audio and video at 1.5 Mbit/second
  Family of standards
  MPEG-1
    Low resolution video, 30 fps, near CD quality
    Layer 3 – MP3
  MPEG-2
    Higher quality video (DVD)
    Supports interlaced images (broadcast TV)
    Multichannel audio
13
MPEG
  MPEG-3 abandoned
  MPEG-4
    Low bandwidth networks: mobile and WWW
    Object based (vs. signal based)
    Interactive
    Strategies for identifying and managing intellectual property
  MPEG-7
    Metadata description for content delivered via MPEG-1, MPEG-2, and MPEG-4
  MPEG-21
    Multimedia lifecycle
    Interoperability
14
Other Multimedia Formats
  Audio and video
    AVI (Microsoft)
    QuickTime (Apple)
  Streaming
    RealAudio, RealVideo, RealOne (RealNetworks)
    ASF (Microsoft)
  Audio only
    WAV (Microsoft, IBM)
    AIFF (Apple)
    AU (Sun)
15
Multimedia in a Digital Library
  Indexing and browsing structures
    Text-based
    Content-based
  Summarizing audio and video
  Digitizing media
    Linear resolution, color depth, frame rate, sample rate
  Preservation issues