Chapter Four Documents: The raw material How to Build a Digital Library Ian H. Witten and David Bainbridge.

1 Chapter Four Documents: The raw material How to Build a Digital Library Ian H. Witten and David Bainbridge

2 Documents
Building blocks of digital libraries
Many different standards for documents
Internationalization
Fixed versus fluid
Permanent versus transient
Indexing

3 Standards Organizations
American National Standards Institute (ANSI)
International Organization for Standardization (ISO)

4 Representing Characters
EBCDIC
  Extended Binary Coded Decimal Interchange Code
  Represented in 8 bits
ASCII (1968)
  American Standard Code for Information Interchange
  Represented with 7 bits
  Does not support many foreign languages
  Many expansions made to the basic ASCII character set
ISCII (1983)
  Indian Script Code for Information Interchange
  Hindi and related languages
GB and Big-5 for Chinese
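As a rough illustration of the 7-bit and 8-bit codes listed on this slide, the sketch below encodes a sample word with Python's built-in codecs. The sample strings and the helper name `fits_in_ascii` are assumptions made for the example, and `cp037` is only one of several EBCDIC code pages.

```python
# Illustrative sketch: the sample strings are assumptions, not taken from the slides.
text = "Library"

ascii_bytes = text.encode("ascii")    # 7-bit ASCII: every byte value is below 128
ebcdic_bytes = text.encode("cp037")   # cp037 is one EBCDIC code page (8-bit)
big5_bytes = "中文".encode("big5")     # Big-5 uses two bytes for each Chinese character


def fits_in_ascii(s: str) -> bool:
    """Return True if every character of s has a 7-bit ASCII code."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(ascii_bytes.hex(" "))
print(ebcdic_bytes.hex(" "))
print(fits_in_ascii("Library"), fits_in_ascii("Bibliothèque"))
```

The same word yields different byte values under ASCII and EBCDIC, and a single accented letter is enough to fall outside basic ASCII.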

5 Unicode
Successor of ASCII
ISO 10646 (1993)
Universal
  Aims to represent ALL the world’s languages
Default encoding for HTML and XML
Development began in 1988 as a joint effort between Apple and Xerox
Unicode standard continues to evolve
Round-trip compatibility – text in an existing character set can be converted to Unicode and back without loss
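Round-trip compatibility can be checked mechanically for a given legacy encoding. The following is a minimal sketch, assuming Python's built-in codecs for Latin-1, Big-5, and KOI8-R. The function name `round_trips` and the sample strings are hypothetical, and three codecs do not demonstrate the property for every character set.

```python
def round_trips(raw: bytes, legacy_encoding: str) -> bool:
    """Decode legacy bytes to Unicode, re-encode them, and compare with the original."""
    as_unicode = raw.decode(legacy_encoding)
    return as_unicode.encode(legacy_encoding) == raw


# Sample data (assumed for illustration): all 256 Latin-1 byte values,
# plus short strings in two other legacy encodings.
samples = {
    "latin-1": bytes(range(256)),
    "big5": "中文".encode("big5"),
    "koi8-r": "Привет".encode("koi8-r"),
}

results = {name: round_trips(raw, name) for name, raw in samples.items()}
print(results)
```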

6 Unicode Character Set
Unicode standard is massive
Two subsets of standard: ISO 10646-1/2
94,000 characters defined
Represents scripts
  Scripts versus languages
  Punctuation shared among scripts
Universal character set – characters at the core of Unicode

7 Five Zones of Unicode
Alphabetic scripts (Western languages, Latin, Greek, Cyrillic, Hebrew and Arabic)
Ideographic scripts (Chinese, Japanese, Korean)
Other characters (Braille, mathematical symbols)
Surrogates
Reserved codes

8 Composite and Combining Characters
Unicode Terms
Character: abstract form of a letter
Glyph: a particular rendition of a character on a page
  Different fonts → different glyphs
  Unicode does not distinguish between different glyphs
  Characters are abstract members of linguistic scripts, not graphic entities
Code Point: a Unicode value, specified by prefixing U+
  Includes ligatures and combining diaeresis
  Canonical and compatibility equivalence
  Deprecated characters
Code Range: range of values that characters span
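The difference between canonical and compatibility equivalence can be seen with Python's standard `unicodedata` module. This is a small sketch: the choice of "ä" and the "fi" ligature as examples is an assumption, and `code_points` is a hypothetical helper that prints values in the U+ notation mentioned on the slide.

```python
import unicodedata

composed = "\u00e4"       # ä as a single precomposed code point
decomposed = "a\u0308"    # a followed by U+0308 COMBINING DIAERESIS

# Canonical equivalence: the same character spelled as different code point sequences.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed

# Compatibility equivalence: the fi ligature U+FB01 is folded to "fi"
# only by the compatibility forms (NFKC/NFKD), not by NFC/NFD.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)


def code_points(s: str) -> str:
    """Format a string as a sequence of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


print(code_points(composed), "|", code_points(decomposed))
print(code_points(nfc_ligature), "|", code_points(nfkc_ligature))
```

For indexing, normalizing both documents and queries to one form is what lets the two spellings of "ä" match each other.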

9 Unicode Character Encoding
UTF: Unicode character set Transformation Format
UTF-32
  ISO standard uses a 32-bit (4-byte) value
  Unicode Consortium uses 21 bits
  32 planes (5 bits) of 65,536 characters (16 bits); code points stop at U+10FFFF, so only 17 planes are in use
    Basic multilingual plane (living languages)
    Supplementary multilingual plane (historic scripts, other alphabets)
    Supplementary ideographic plane (ancient Chinese ideographs)
    Supplementary special-purpose plane (tags for languages)
UTF-16, UTF-8
Hindi and related scripts (ISCII)
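The split of a code point into a plane number and a 16-bit position, and the differing sizes of the three transformation formats, can be sketched as follows. The function name `describe` and the four sample characters are assumptions chosen for illustration. The last sample, U+1D11E, lies outside the basic multilingual plane.

```python
def describe(ch: str) -> dict:
    """Report the plane, in-plane offset, and encoded size of one character."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,        # upper 5 of the 21 bits select the plane
        "offset": cp & 0xFFFF,    # lower 16 bits give the position within the plane
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-be")),
        "utf32_bytes": len(ch.encode("utf-32-be")),
    }


for sample in ["A", "é", "中", "\U0001D11E"]:
    print(describe(sample))
```

UTF-32 always takes four bytes, UTF-16 takes two bytes inside the basic multilingual plane and four (a surrogate pair) outside it, and UTF-8 takes between one and four.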

10 Representing Documents
Plain text
Full-text indexing
  Bag of words
  Inversion of the text
  Inverted files
  Granularity of document
  Granularity of index
Word segmentation
  Chinese and Japanese are written without spaces
  Spacing in Chinese sentences can completely change the meaning
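An inverted file maps each word to the documents that contain it. The sketch below is a minimal version at document-level granularity, assuming three made-up documents. The function names `build_inverted_index` and `search` are hypothetical and do not come from any digital library software.

```python
import re
from collections import defaultdict


def build_inverted_index(documents: dict) -> dict:
    """Map each word to the sorted list of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index: dict, query: str) -> list:
    """Return ids of documents containing every query word (bag-of-words AND)."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    postings = [set(index.get(word, ())) for word in words]
    return sorted(set.intersection(*postings))


# Hypothetical sample collection.
docs = {
    1: "Digital libraries store documents",
    2: "Documents are the raw material",
    3: "Raw text is indexed",
}
index = build_inverted_index(docs)
print(index["documents"])
print(search(index, "raw documents"))
```

The `\w+` pattern relies on spaces and punctuation to find word boundaries, so it would not segment Chinese or Japanese text correctly. A finer index granularity would store word positions rather than only document ids.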

11 Page Description Languages
Device independence
PostScript (also a programming language)
  First commercially developed page description language (1985)
  Fonts: Type 1, TrueType, OpenType
  Text extraction
  Using PostScript in a digital library

12 Page Description Languages
Portable Document Format (PDF)
  Successor to PostScript
PDF versus PostScript
  Not a full-scale programming language
  New features for interactive display
  Random access to pages
  Hierarchically structured content
  Navigation within a document
  Hyperlinks
File format: header, objects, cross-references, trailer
Searchable image option
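The four parts of the PDF file format can be made concrete by assembling a tiny file by hand. This is a sketch under stated assumptions: the version string 1.4, the single blank 612 x 792 point page, and the function names `build_minimal_pdf` and `find_xref_offset` are all choices made for the example, and real viewers may be stricter than this layout suggests.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a one-page blank PDF: header, objects, cross-reference table, trailer."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")               # header
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                 # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)  # cross-reference table
    out += b"0000000000 65535 f \n"              # entry 0 heads the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset      # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def find_xref_offset(pdf: bytes) -> int:
    """Read the startxref value near the end of the file, as a reader would."""
    tail = pdf[pdf.rindex(b"startxref") + len(b"startxref"):]
    return int(tail.split()[0])


pdf = build_minimal_pdf()
print(len(pdf), find_xref_offset(pdf))
```

Because the trailer points to the cross-reference table and the table holds the byte offset of every object, a reader can jump straight to any page object without scanning the whole file, which is what makes random access to pages possible.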

13 Word-Processor Documents
Rich Text Format
  240-page specification
  Document-level metadata
  Software for conversion to HTML is available
Native Word formats
  Binary
  Proprietary
LaTeX format
  Typed formatting commands
  Non-proprietary

14 Representing Images
Lossless image compression
  GIF
  PNG
  JPEG-Lossless
  JPEG-2000
Lossy image compression
  JPEG
  Progressive refinement
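The distinction between lossless and lossy compression can be shown with the standard `zlib` module, which implements DEFLATE, the algorithm PNG uses. The synthetic 64 x 64 grayscale image below is an assumption, and dropping the low four bits of each pixel is only a crude stand-in for lossy coding. JPEG itself works quite differently, quantizing transform coefficients rather than raw pixel values.

```python
import zlib

# Hypothetical 8-bit grayscale image: a 64 x 64 gradient with a little texture.
width, height = 64, 64
pixels = bytes(
    (x * 3 + y * 2 + (x * y) % 7) % 256
    for y in range(height)
    for x in range(width)
)

# Lossless: decompression restores every byte exactly.
packed = zlib.compress(pixels, 9)
restored = zlib.decompress(packed)

# Lossy stand-in: discard the 4 low bits of each pixel, then compress.
quantized = bytes(p & 0xF0 for p in pixels)
packed_lossy = zlib.compress(quantized, 9)
max_error = max(abs(a - b) for a, b in zip(pixels, quantized))

print(len(pixels), len(packed), len(packed_lossy), max_error)
```

The lossless path gives back the original bytes, while the lossy path can only give back the quantized image, with each pixel off by at most 15 gray levels.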

15 Representing Audio and Video
Evolution of signals over time
Sample rate
  Samples per second
Multimedia compression
  Codec
  Asymmetry
  Redundancy
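Sample rate translates directly into storage cost for uncompressed audio. The arithmetic below is a sketch using the standard audio CD parameters (44,100 samples per second, 16 bits, two channels). The 128 kbit/s target is an assumed figure for comparison, and the function names are hypothetical.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed (PCM) bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def pcm_size_bytes(sample_rate_hz: int, bits_per_sample: int, channels: int, seconds: int) -> int:
    """Uncompressed storage needed for a recording of the given length."""
    return pcm_bit_rate(sample_rate_hz, bits_per_sample, channels) * seconds // 8


cd_bit_rate = pcm_bit_rate(44_100, 16, 2)          # audio CD parameters
one_minute = pcm_size_bytes(44_100, 16, 2, 60)

target_bit_rate = 128_000                          # assumed compressed rate, for comparison only
compression_ratio = cd_bit_rate / target_bit_rate

print(cd_bit_rate, one_minute, round(compression_ratio, 3))
```

One minute of CD-quality audio needs about 10.6 MB uncompressed, which is why a codec that exploits redundancy matters for a digital library.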

16 MPEG
ISO Moving Picture Experts Group (1988)
Audio and Video at 1.5 Mbit/second
Family of standards
MPEG-1
  Low resolution video, 30 fps, near CD quality
  Layer 3 – MP3
MPEG-2
  Higher quality video (DVD)
  Supports interlaced images (Broadcast TV)
  Multichannel audio

17 MPEG
MPEG-3 abandoned
MPEG-4
  Low bandwidth networks – mobile and WWW
  Object based (vs. signal based)
  Interactive
  Strategies for identifying and managing intellectual property
MPEG-7
  Metadata description for content delivered via MPEG-1, -2, and -4
MPEG-21
  Multimedia lifecycle
  Interoperability

18 MPEG
Television standards: NTSC, PAL, SECAM
Digital television and video standard: CCIR 601
MPEG video
  Frames: intra (I), predicted (P), bidirectional (B)
MPEG audio
  Acoustic masking
  Three compression layers
Mixed media
  Time-stamped packets are multiplexed into a single stream
  Typically – video takes 1.2 Mbit/s and audio takes 0.3 Mbit/s
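Multiplexing time-stamped packets amounts to merging streams that are each already in time order. The sketch below uses `heapq.merge` from the standard library. The packet tuples, timestamps, and payload labels are invented for illustration and do not follow the actual MPEG system stream packet layout.

```python
import heapq


def multiplex(*streams):
    """Merge packet streams, each already sorted by timestamp, into one stream."""
    return list(heapq.merge(*streams, key=lambda packet: packet[0]))


# Hypothetical packets: (timestamp in seconds, stream name, payload label).
video = [(0.00, "video", "I"), (0.04, "video", "B"), (0.08, "video", "P")]
audio = [(0.00, "audio", "a0"), (0.03, "audio", "a1"), (0.06, "audio", "a2")]

stream = multiplex(video, audio)
for packet in stream:
    print(packet)

# Bit rates from the slide, in kbit/s to keep the arithmetic in integers.
video_kbit_per_s = 1200
audio_kbit_per_s = 300
total_kbit_per_s = video_kbit_per_s + audio_kbit_per_s
```

The typical video and audio rates quoted on the slide add up to the 1.5 Mbit/s figure given for MPEG-1.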

19 Other Multimedia Formats
Audio and Video
  AVI (Microsoft)
  QuickTime (Apple)
Streaming
  RealAudio, RealVideo, RealOne (RealNetworks)
  ASF (Microsoft)
Audio only
  WAV (Microsoft, IBM)
  AIFF (Apple)
  AU (Sun)

20 Multimedia in a Digital Library
Indexing and browsing structures
  Text-based
  Content-based
Summarizing audio and video
Digitizing media
  Linear resolution, color depth, frame rate, sample rate
Preservation issues
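Linear resolution, color depth, and frame rate together determine how much raw data digitization produces. The figures below are a sketch with assumed inputs: a letter-size page scanned at 300 dpi, and 352 x 240 video at 30 frames per second. The function names are hypothetical.

```python
def raw_image_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of a scan, from physical size, resolution, and color depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def raw_video_bytes_per_second(width_px: int, height_px: int, bits_per_pixel: int, fps: int) -> int:
    """Uncompressed video data rate in bytes per second."""
    return width_px * height_px * bits_per_pixel // 8 * fps


# Assumed example: an 8.5 x 11 inch page at 300 dpi.
color_page = raw_image_bytes(8.5, 11, 300, 24)     # 24-bit color
bilevel_page = raw_image_bytes(8.5, 11, 300, 1)    # 1-bit black and white

# Assumed example: 352 x 240 pixels, 24-bit color, 30 frames per second.
video_rate = raw_video_bytes_per_second(352, 240, 24, 30)

print(color_page, bilevel_page, video_rate)
```

A single color page comes to roughly 25 MB before compression and the small video to about 7.6 MB per second, so the choice of digitization parameters has a direct effect on storage and preservation costs.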

