Published by August Kelly. Modified over 9 years ago.
1 CS 502: Computing Methods for Digital Libraries
Lecture 4: Text
2 Administration
Assignment 1 submission problems: due date postponed to Thursday, 12:20
Demonstration by Dean Eckstrom
Wednesday discussion classes: Olin 155, 7:30-8:25 and 8:35-9:00
Check Notices for sections
3 Digital Libraries and Checking Information
Email to Teaching Assistants:
  "I have heard that..."
  "There is a rumor that..."
Authoritative source(s): Course web site -- Notices
4 Text
The richness of text
Elements: letters, scripts, symbols
Structure: words, sentences, paragraphs, headings, tables
Appearance: fonts, layout, design, materials
Special: mathematics, music
Digital libraries must represent every variant!
5 Markup and Page Description
Markup languages represent the structure of text, e.g., SGML, XML.
The markup must be combined with a style sheet for rendering.
Page description languages represent the appearance of text, e.g., PostScript, PDF.
6 Markup and Style Sheets
Diagram: document content and structure + style sheet -> rendering software -> formatted document
7 Alternative Renderings
Diagram: document content and structure + style sheet for display -> rendering software -> computer display
Diagram: document content and structure + style sheet for print -> rendering software -> printed document
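The two rendering paths can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real rendering software works: the document representation, the two style sheets, and the `render` function are all hypothetical.

```python
# Hypothetical structured document: (element type, text) pairs with no
# appearance information attached.
document = [
    ("heading", "Text"),
    ("paragraph", "Digital libraries must represent every variant."),
]

# Two hypothetical style sheets, each mapping an element type to a format string.
display_style = {"heading": "<h1>{}</h1>", "paragraph": "<p>{}</p>"}
print_style = {"heading": "{}\n====", "paragraph": "    {}"}


def render(doc, style_sheet):
    """Combine content and structure with a style sheet to get a formatted document."""
    return "\n".join(style_sheet[kind].format(text) for kind, text in doc)


display_output = render(document, display_style)
print_output = render(document, print_style)
print(display_output)
print(print_output)
```

The same `document` produces both outputs; only the style sheet passed to `render` changes.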
8 Example: the Oxford English Dictionary
Typography of the printed text represented semantic information.
Keyboard the text, capturing all typographic information.
Automatic parser to extract semantics (e.g., date, quotation, phonetics).
Markup in SGML to tag semantic information.
Separate style sheets for various editions: print, CD-ROM, online.
Designed before the web, yet used with the web.
9 Character
Distinguish between:
  the abstract character as a structural element: "A"
  representations of the character: the glyph A in several typefaces, the binary code 1000001, the name "capital a"
10 ASCII
A binary encoding of characters, usually stored one per 8-bit byte, e.g., 01000001 is the encoding for "A"
Code ranges:
  0-127: standard (7-bit) ASCII
  32-126: printable ASCII
  0-255: extended (8-bit) ASCII
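A small Python sketch of the ASCII encoding and its ranges. The function names are illustrative and not part of the lecture.

```python
def ascii_bits(ch):
    """Return the 8-bit binary string for one standard ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a standard (7-bit) ASCII character")
    return format(code, "08b")


def is_printable_ascii(code):
    """Printable ASCII runs from 32 (space) to 126 (tilde)."""
    return 32 <= code <= 126


print(ascii_bits("A"))         # 01000001
print(is_printable_ascii(10))  # False: 10 is the line feed control code
```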
11 Unicode
16-bit codes that represent distinct characters (later versions extend the code space beyond 16 bits)
organized by scripts, not languages
compatible with Unihan (Chinese, Japanese, Korean)
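Python's standard `unicodedata` module gives a quick view of how characters are named by script rather than by language. The `describe` helper is an illustrative name, and the four sample characters are chosen here, not taken from the slides.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin A, Greek alpha, Hiragana a, and a Han ideograph.
for ch in ["A", "\u03b1", "\u3042", "\u4e2d"]:
    print(describe(ch))
```

Each name begins with the script (LATIN, GREEK, HIRAGANA, CJK), and the Han ideograph has one code point however many languages use it.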
12 Scripts
Scripts supported by Unicode 2.0:
Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Latin, Lao, Malayalam, Oriya, Phonetic, Tamil, Telugu, Thai, Tibetan
13 More Scripts
Numbers
General Diacritics
General Punctuation
General Symbols
Mathematical Symbols
Technical Symbols
Dingbats
Arrows, Blocks, Box Drawing Forms & Geometric Shapes
Miscellaneous Symbols
Presentation Forms
14 Unicode and UTF-8
UTF-8 is a stream encoding of Unicode characters.
Each Unicode character is represented by one to six bytes in the original design (current standards allow at most four); the length of a multi-byte sequence is given by the number of leading ones in its first byte.
Single-byte characters are identical to 7-bit ASCII, e.g., 01000001 has no leading one, therefore it is a single-byte code.
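The leading-ones rule can be checked with Python's built-in UTF-8 encoder. The helper names below are illustrative, and the sample characters are chosen here, not taken from the slides.

```python
def leading_ones(byte):
    """Count the 1 bits at the start of an 8-bit value."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


# "A" (U+0041), e with acute accent (U+00E9), a Han ideograph (U+4E2D).
for ch in ["A", "\u00e9", "\u4e2d"]:
    first_byte = ch.encode("utf-8")[0]
    print(utf8_bits(ch), "leading ones in first byte:", leading_ones(first_byte))
```

The single-byte code has no leading one, while the two-byte and three-byte sequences start with two and three leading ones respectively; every continuation byte starts with 10.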
15 Markup Languages
SGML (Standard Generalized Markup Language): a system for creating markup languages that represent the structure of a document
XML (eXtensible Markup Language): a simplified version of SGML intended for use with online information
DTD (Document Type Definition): a markup specification for a class of documents, defined within the SGML framework
HTML (Hypertext Markup Language): a markup and formatting language with links to other objects
16 XML Example (Metadata)
Digital Libraries and the Problem of Purpose
David M. Levy
Corporation for National Research Initiatives
January 2000
article
continued on next slide
17 XML Example (Metadata)
continued from previous slide
10.1045/january2000-levy
http://www.dlib.org/dlib/january00/01levy.html
English
D-Lib Magazine
1082-9873
6
1
Copyright (c) David M. Levy
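To show how such a metadata record might be read by a program, here is a sketch using Python's standard `xml.etree.ElementTree`. Only the values come from the slides. Every element name (`record`, `title`, `creator`, and so on) and the pairing of values with names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Element names are hypothetical; the values are those shown on the slides.
record_xml = """
<record>
  <title>Digital Libraries and the Problem of Purpose</title>
  <creator>David M. Levy</creator>
  <publisher>Corporation for National Research Initiatives</publisher>
  <date>January 2000</date>
  <type>article</type>
  <language>English</language>
</record>
""".strip()

record = ET.fromstring(record_xml)

# Because the markup names each value, a program can look fields up by name.
metadata = {child.tag: child.text for child in record}
print(metadata["title"])
print(metadata["creator"])
```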
18 Constructing a DTD: Entities
Entities are basic units of information:
Character entities: a b ... z 0 1 ... 9 ! ? ... &lt; &alpha;
Any other entities: &logo; &square-root;
19 Entities
The name of an entity is purely mnemonic. It makes no assertions about the context in which the entity is used or its appearance when rendered.
The DTD used by a scientific publisher will have about 4,000 entities to represent all the special symbols and the variants used in scientific disciplines.
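Entity references can be expanded by looking the mnemonic name up in a table. The sketch below uses Python's standard table of HTML entity names as a stand-in for a publisher's DTD. The `custom_entities` table and its replacement text are hypothetical.

```python
import re
from html.entities import name2codepoint

# Hypothetical publisher-defined entities; the replacement text is an assumption.
custom_entities = {"logo": "[publisher logo]", "square-root": "\u221a"}


def expand_entities(text):
    """Replace &name; references with the text the entity stands for."""
    def replace(match):
        name = match.group(1)
        if name in custom_entities:
            return custom_entities[name]
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # leave unknown references untouched

    return re.sub(r"&([A-Za-z][A-Za-z0-9-]*);", replace, text)


print(expand_entities("x &lt; &alpha; and &square-root;2"))
```

The name says nothing about appearance: how the expanded symbol looks is decided later, at rendering time.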
20 Constructing a DTD: Elements
Elements define the structure.
An element is a string of entities, bracketed by tags:
  This is a paragraph.
  Some heading
  Jane Austen
  John Hancock
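A sketch of elements as content bracketed by tags, using Python's standard `xml.etree.ElementTree`. The tag names `p` and `author` are assumptions chosen for illustration, and `make_element` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def make_element(tag, text):
    """Bracket a string with start and end tags and return the serialized element."""
    element = ET.Element(tag)
    element.text = text
    return ET.tostring(element, encoding="unicode")


print(make_element("p", "This is a paragraph."))  # <p>This is a paragraph.</p>
print(make_element("author", "Jane Austen"))      # <author>Jane Austen</author>
```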
21 Constructing a DTD: Grammar
Every DTD has a grammar that defines:
  allowable relationships between entities and elements
  hierarchies and nesting
  etc.
The grammar is expressed as a set of rules that can be processed automatically.
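A toy illustration of rules that can be processed automatically. This is not DTD syntax and not a real validator: the `grammar` table and its element names are hypothetical, and it checks only which elements may nest inside which, whereas a real DTD also constrains order and repetition.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: each element name maps to the child elements it may contain.
grammar = {
    "article": {"heading", "author", "p"},
    "heading": set(),
    "author": set(),
    "p": set(),
}


def conforms(element):
    """Check recursively that every element and its children obey the grammar."""
    allowed = grammar.get(element.tag)
    if allowed is None:
        return False  # element not declared in the grammar
    return all(child.tag in allowed and conforms(child) for child in element)


good = ET.fromstring(
    "<article><heading>Some heading</heading><p>This is a paragraph.</p></article>"
)
bad = ET.fromstring("<article><p><heading>Misplaced</heading></p></article>")
print(conforms(good), conforms(bad))  # True False
```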