Chapter Four. Documents: The Raw Material
How to Build a Digital Library
Ian H. Witten and David Bainbridge



Documents
- Building blocks of digital libraries
- Many different standards for documents
- Internationalization
- Fixed versus fluid
- Permanent versus transient
- Indexing

Standards Organizations
- American National Standards Institute (ANSI)
- International Organization for Standardization (ISO)

Representing Characters
- EBCDIC
  - Extended Binary Coded Decimal Interchange Code
  - Represented in 8 bits
- ASCII (1968)
  - American Standard Code for Information Interchange
  - Represented with 7 bits
  - Does not support many foreign languages
  - Many expansions made to the basic ASCII character set
- ISCII (1983)
  - Indian Script Code for Information Interchange
  - Hindi and related languages
- GB and Big-5 for Chinese
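The limits of these character sets can be checked directly. The following is a minimal sketch, assuming Python's built-in codecs stand in for the standards named on the slide ("cp500" is one EBCDIC code page among several); the helper names `is_seven_bit` and `encode_or_none` are hypothetical.

```python
def is_seven_bit(text):
    """Return True if every character fits in 7 bits, i.e. is plain ASCII."""
    return all(ord(ch) < 128 for ch in text)


def encode_or_none(text, encoding):
    """Encode text, or return None if the character set cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # The same letter has different byte values in ASCII and EBCDIC.
    print(encode_or_none("A", "ascii"), encode_or_none("A", "cp500"))
    # ASCII cannot represent accented or Chinese characters; Big-5 covers Chinese.
    for sample in ["library", "caf\u00e9", "\u4e2d\u6587"]:
        print(sample, is_seven_bit(sample),
              encode_or_none(sample, "ascii"),
              encode_or_none(sample, "big5"))
```

A character that falls outside a given set simply cannot be encoded, which is the problem the ASCII extensions and national standards each solved separately.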

Unicode
- Successor of ASCII
- ISO 10646 (1993)
- Universal: aims to represent ALL the world's languages
- Default encoding for HTML and XML
- Development began in 1988 as a joint effort between Apple and Xerox
- Unicode standard continues to evolve
- Round-trip compatibility: text in an established character set can be converted to Unicode and back without loss

Unicode Character Set
- Unicode standard is massive
- Two parts of the standard: ISO 10646-1 and 10646-2
- 94,000 characters defined
- Represents scripts
  - Scripts versus languages
  - Punctuation shared among scripts
- Universal character set: the characters at the core of Unicode

Five Zones of Unicode
- Alphabetic scripts (Western languages, Latin, Greek, Cyrillic, Hebrew and Arabic)
- Ideographic scripts (Chinese, Japanese, Korean)
- Other characters (Braille, mathematical symbols)
- Surrogates
- Reserved codes

Composite and Combining Characters: Unicode Terms
- Character: abstract form of a letter
- Glyph: a particular rendition of a character on a page
  - Different fonts → different glyphs
  - Unicode does not distinguish between different glyphs
  - Characters are abstract members of linguistic scripts, not graphic entities
- Code point: a Unicode value, specified by prefixing U+
  - Includes ligatures and combining diaeresis
  - Canonical and compatibility equivalence
  - Deprecated characters
- Code range: range of values that characters span
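Canonical and compatibility equivalence can be illustrated with Python's standard `unicodedata` module. This is a sketch rather than anything taken from the slides; the sample characters and the helper `code_points` are chosen here for illustration.

```python
import unicodedata

precomposed = "\u00fc"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
combining = "u\u0308"    # U+0075 followed by U+0308 COMBINING DIAERESIS
ligature = "\ufb01"      # U+FB01 LATIN SMALL LIGATURE FI


def code_points(text):
    """List the code points of a string in U+ notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    # The two spellings look identical on screen but are different code point sequences.
    print(code_points(precomposed), code_points(combining))
    print(precomposed == combining)
    # Canonical equivalence: NFC composes, NFD decomposes.
    print(unicodedata.normalize("NFC", combining) == precomposed)
    print(unicodedata.normalize("NFD", precomposed) == combining)
    # Compatibility equivalence: only the NFK* forms replace the ligature with "fi".
    print(code_points(unicodedata.normalize("NFC", ligature)))
    print(code_points(unicodedata.normalize("NFKC", ligature)))
```

For a digital library, normalizing text to one form before indexing is what allows a query typed one way to match a document stored the other way.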

Unicode Character Encoding
- UTF: Unicode character set Transformation Format
- UTF-32
  - ISO standard uses a 32-bit (4-byte) value
  - Unicode Consortium uses 21 bits
  - 5 bits select a plane, 16 bits select one of 65,536 characters within it (5 bits allow 32 planes; only 17, U+0000 to U+10FFFF, are in use)
  - Basic multilingual plane (living languages)
  - Supplementary multilingual plane (historic scripts, other alphabets)
  - Supplementary ideographic plane (ancient Chinese ideographs)
  - Supplementary special-purpose plane (tags for languages)
- UTF-16, UTF-8
- Hindi and related scripts (ISCII)
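The relationship between code points, planes, and the three transformation formats can be sketched as follows. The sample characters and the function name `describe` are illustrative assumptions; the encodings themselves are Python's built-in codecs.

```python
def describe(ch):
    """Report the plane of a character and its size in each UTF encoding."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,  # the bits above the low 16 select the plane
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-be")),
        "utf32_bytes": len(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # Latin letter, accented letter, Chinese ideograph, and a musical symbol
    # from the supplementary multilingual plane (plane 1).
    for ch in ["A", "\u00e9", "\u4e2d", "\U0001d11e"]:
        print(describe(ch))
```

UTF-32 is fixed width, UTF-8 grows from one to four bytes, and UTF-16 uses two bytes inside the basic multilingual plane and a surrogate pair (four bytes) outside it.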

Representing Documents
- Plain text
- Full-text indexing
  - Bag of words
  - Inversion of the text
  - Inverted files
  - Granularity of document
  - Granularity of index
- Word segmentation
  - Chinese and Japanese are written without spaces
  - Spacing in Chinese sentences can completely change the meaning
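The idea of inverting the text can be sketched in a few lines. This is a toy, assuming document-level granularity and whitespace-delimited languages; the function names and sample documents are hypothetical, and the regular-expression tokenizer would not segment Chinese or Japanese correctly.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (treats each document as a bag of words)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document identifiers that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    docs = {
        "d1": "Digital libraries store documents",
        "d2": "Documents are the raw material",
        "d3": "Raw text is indexed word by word",
    }
    index = build_inverted_index(docs)
    print(search(index, "raw documents"))
```

A finer index granularity would store paragraph or word positions instead of whole-document identifiers, at the cost of a larger index.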

Page Description Languages
- Device independence
- PostScript (also a programming language)
  - First commercially developed page description language (1985)
  - Fonts: Type 1, TrueType, OpenType
  - Text extraction
  - Using PostScript in a digital library

Page Description Languages
- Portable Document Format (PDF)
  - Successor to PostScript
- PDF versus PostScript
  - Not a full-scale programming language
  - New features for interactive display
  - Random access to pages
  - Hierarchically structured content
  - Navigation within a document
  - Hyperlinks
- File format: header, objects, cross-references, trailer
- Searchable image option

Word-Processor Documents
- Rich Text Format
  - 240-page specification
  - Document-level metadata
  - Software available for conversion to HTML
- Native Word formats
  - Binary
  - Proprietary
- LaTeX format
  - Typed formatting commands
  - Non-proprietary

Representing Images
- Lossless image compression
  - GIF
  - PNG
  - JPEG-Lossless
  - JPEG-2000
- Lossy image compression
  - JPEG
  - Progressive refinement
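The difference between lossless and lossy compression can be sketched with the standard `zlib` module, which implements DEFLATE, the method PNG uses. The "lossy" step below is a crude quantization chosen purely for illustration; it is an assumption of this sketch and not how JPEG works. The synthetic gradient stands in for one row of grayscale pixels.

```python
import zlib

# A synthetic 1024-pixel grayscale gradient (values 0..255, each repeated 4 times).
pixels = bytes(x // 4 for x in range(1024))


def lossless_round_trip(data):
    """Compress and decompress with DEFLATE; the result is bit-for-bit identical."""
    return zlib.decompress(zlib.compress(data))


def quantize(data, step=16):
    """Illustrative lossy step: snap each value down to a multiple of `step`."""
    return bytes((p // step) * step for p in data)


if __name__ == "__main__":
    print("lossless restored exactly:", lossless_round_trip(pixels) == pixels)
    lossy = quantize(pixels)
    print("lossy restored exactly:", lossy == pixels)
    print("largest pixel error:", max(abs(a - b) for a, b in zip(pixels, lossy)))
    print("compressed sizes:", len(zlib.compress(pixels)), len(zlib.compress(lossy)))
```

Discarding detail makes the data more repetitive, so it compresses further, but the original values cannot be recovered afterwards.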

Representing Audio and Video
- Evolution of signals over time
- Sample rate
  - Samples per second
- Multimedia compression
  - Codec
  - Asymmetry
  - Redundancy
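The sample rate determines how much raw data a codec has to deal with. As a worked example, assuming the audio CD parameters (44,100 samples per second, 16 bits per sample, two channels), which are not stated on the slide, the uncompressed bit rate works out as follows; the function names are hypothetical.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed audio bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def bytes_per_minute(bit_rate):
    """Storage needed for one minute at the given bit rate."""
    return bit_rate // 8 * 60


if __name__ == "__main__":
    cd = pcm_bit_rate(44_100, 16, 2)   # assumed audio CD parameters
    print(cd, "bit/s")                 # 1411200 bit/s
    print(bytes_per_minute(cd), "bytes per minute")  # 10584000 bytes per minute
```

Roughly 10 MB per minute of stereo sound is the figure that makes multimedia compression a practical necessity.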

MPEG
- ISO Moving Picture Experts Group (1988)
- Audio and video at 1.5 Mbit/second
- Family of standards
- MPEG-1
  - Low resolution video, 30 fps, near CD quality
  - Layer 3: MP3
- MPEG-2
  - Higher quality video (DVD)
  - Supports interlaced images (broadcast TV)
  - Multichannel audio

MPEG
- MPEG-3 abandoned
- MPEG-4
  - Low bandwidth networks: mobile and WWW
  - Object based (vs. signal based)
  - Interactive
  - Strategies for identifying and managing intellectual property
- MPEG-7
  - Metadata description for content delivered via MPEG-1, 2, 4
- MPEG-21
  - Multimedia lifecycle
  - Interoperability

MPEG
- Television standards: NTSC, PAL, SECAM
- Digital television and video standard: CCIR 601
- MPEG video
  - Frames: intra (I), predicted (P), bidirectional (B)
- MPEG audio
  - Acoustic masking
  - Three compression layers
- Mixed media
  - Time-stamped packets are multiplexed into a single stream
  - Typically video takes 1.2 Mbit/s and audio takes 0.3 Mbit/s
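Multiplexing time-stamped packets can be sketched as a merge of two already-ordered packet sequences. This is a simplification for illustration only: the `Packet` type, the timestamps, and the payloads are hypothetical, and a real MPEG system stream has its own packet and clock structure. The bit rates are the typical figures from the slide.

```python
import heapq
from typing import NamedTuple


class Packet(NamedTuple):
    timestamp_ms: int   # presentation time of the packet
    stream: str         # "video" or "audio"
    payload: bytes


def multiplex(video, audio):
    """Merge two timestamp-ordered packet lists into one timestamp-ordered stream."""
    return list(heapq.merge(video, audio, key=lambda p: p.timestamp_ms))


VIDEO_KBITS = 1200   # typical video rate from the slide
AUDIO_KBITS = 300    # typical audio rate from the slide

if __name__ == "__main__":
    video = [Packet(t, "video", b"v") for t in (0, 40, 80)]
    audio = [Packet(t, "audio", b"a") for t in (0, 26, 52, 78)]
    for packet in multiplex(video, audio):
        print(packet.timestamp_ms, packet.stream)
    print("combined rate:", VIDEO_KBITS + AUDIO_KBITS, "kbit/s")
```

The combined 1,500 kbit/s matches the 1.5 Mbit/s target that MPEG-1 was designed around.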

Other Multimedia Formats
- Audio and video
  - AVI (Microsoft)
  - QuickTime (Apple)
- Streaming
  - RealAudio, RealVideo, RealOne (RealNetworks)
  - ASF (Microsoft)
- Audio only
  - WAV (Microsoft, IBM)
  - AIFF (Apple)
  - AU (Sun)

Multimedia in a Digital Library
- Indexing and browsing structures
  - Text-based
  - Content-based
- Summarizing audio and video
- Digitizing media
  - Linear resolution, color depth, frame rate, sample rate
- Preservation issues
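The digitization parameters listed above translate directly into storage requirements. The sketch below works through two cases; the page size, scanning resolution, and video dimensions are assumptions picked for illustration and do not come from the slides, and the function names are hypothetical.

```python
def image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a scanned page: linear resolution times color depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def video_bytes_per_second(width_px, height_px, bits_per_pixel, frames_per_second):
    """Uncompressed video data rate: frame size times frame rate."""
    return width_px * height_px * bits_per_pixel // 8 * frames_per_second


if __name__ == "__main__":
    # Assumed: a letter-size page (8.5 x 11 inches) scanned at 300 dpi in 24-bit color.
    print(image_bytes(8.5, 11, 300, 24), "bytes per page")        # 25245000
    # Assumed: 352 x 240 pixel video, 24-bit color, 30 frames per second.
    print(video_bytes_per_second(352, 240, 24, 30), "bytes/s")    # 7603200
```

Doubling the linear resolution quadruples the image size, which is why the choice of scanning resolution dominates both storage cost and preservation planning.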