BSIM0010 Digital Libraries: Documents (mainly based on Chapter 4 of Witten & Bainbridge’s book)


October 2005, Documents, slide 2
Why bother with documents in a DL?
- Documents are a DL’s building blocks.
- High-level considerations
  - How are documents organized? Would documents be classified into various collections?
  - What do they look like, e.g., web pages or file attachments?
- Low-level considerations
  - How are documents represented?
  - Document types, formats, and their characteristics need to be considered.

Challenges in keeping documents in DLs
- Probably the biggest challenge
  - Tension between fast-changing technologies and the long-term existence of libraries
- Decision problem
  - Which document formats should be supported (and why)? A page description format (e.g., PDF) or an editable format?
- Does a DL need to support internationalization?
  - Displaying documents in various languages properly
  - Multilingual user interface
  - Ability to search text properly (which is not always easy)

Your turn …
- Download a document called “sample.rtf” from ILN and view it with MS Word.
- Now open it with Notepad and find the number of occurrences of the word “bold” in it.
- Is the number in line with your expectation?
- Does your finding have any implications for the search design of a digital library?
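One way to see why the count can be surprising is to compare a search over the raw file with a search over the text a reader actually sees. The sketch below is only an illustration: `SAMPLE_RTF` is a made-up fragment (not the course's sample.rtf), and `visible_text` is a crude, hypothetical stripper rather than a real RTF parser.

```python
import re

# Hypothetical RTF fragment: the first "bold" is split by a bold-formatting
# group ({\b ...}), the second one is ordinary running text.
SAMPLE_RTF = r"{\rtf1\ansi This word is {\b bo}ld and this one is bold.}"


def visible_text(rtf: str) -> str:
    """Crudely remove control words and braces (not a full RTF parser)."""
    without_controls = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", rtf)
    return without_controls.replace("{", "").replace("}", "")


raw_hits = SAMPLE_RTF.count("bold")                  # search over the markup
text_hits = visible_text(SAMPLE_RTF).count("bold")   # search over what is displayed

print(raw_hits, text_hits)  # prints: 1 2
```

A search engine that indexes the raw bytes would miss the first occurrence here, which suggests that a DL should index the extracted text rather than the stored markup.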

Searching properly is not always easy
- Internationalization can be problematic for searching
  - The same text (more exactly, the same character) may not be represented in the same way in different document types, e.g., the ‘ï’ in naïve is represented as
    - &iuml; (name code) or &#239; (number code) in HTML
    - a single character in extended ASCII (decimal equivalent 139)
- Solution: Unicode
  - Helps with character representation only (a unique representation of each character)
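The different representations of ‘ï’ can be checked with a few lines of Python. This is a minimal sketch; it assumes that "extended ASCII" here means the IBM PC code page 437, where byte 139 is ‘ï’, which the slide does not state explicitly.

```python
import html

# Two HTML spellings of the same word.
named = html.unescape("na&iuml;ve")     # name code
numeric = html.unescape("na&#239;ve")   # number code

# The same word stored as single bytes in code page 437 (assumed here to be
# the "extended ASCII" of the slide): byte 139 (0x8B) is the i with diaeresis.
cp437_word = b"na\x8bve".decode("cp437")

# Once decoded, all three are the same Unicode string.
print(named == numeric == cp437_word)   # prints: True

# The stored bytes differ from one encoding to the next, though.
print("na\u00efve".encode("cp437"))     # b'na\x8bve'
print("na\u00efve".encode("latin-1"))   # b'na\xefve'
print("na\u00efve".encode("utf-8"))     # b'na\xc3\xafve'
```

A byte-level search for one of these forms will not find the others, so text has to be decoded to a common representation before it is indexed.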

Indexing can be difficult too
- Indexes are often created out of “important” words from documents so as to facilitate rapid full-text searching.
- Automatic indexing is not easy, as the following issues are involved:
  - How can “important” words be identified (as opposed to, e.g., is, the, am)?
  - How can meaningful words be segmented out of a document when the language is not written with spaces between words, e.g., Chinese?
  - Text abstraction is an area of research that aims to deal with these problems. It requires knowledge of the language as well as domain knowledge.
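A minimal sketch of the two issues above, under stated assumptions: `STOP_WORDS` is a tiny illustrative list (not a standard one), `build_index` is a hypothetical helper, and words are found simply by splitting on non-word characters.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only; real systems use much longer ones.
STOP_WORDS = {"is", "the", "am", "a", "of", "and", "in"}


def build_index(docs: dict) -> dict:
    """Map each non-stop word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return dict(index)


docs = {"d1": "The library is digital", "d2": "I am in the digital age"}
index = build_index(docs)
print(sorted(index["digital"]))   # ['d1', 'd2']
print("the" in index)             # False

# Splitting on spaces does not segment Chinese: the whole phrase
# ("digital library") comes back as a single token.
chinese_tokens = re.findall(r"\w+", "\u6570\u5b57\u56fe\u4e66\u9986")
print(len(chinese_tokens))        # 1
```

The last lines show why word segmentation needs language knowledge: without a dictionary or a statistical model, the tokenizer has no way to tell where one Chinese word ends and the next begins.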

Document formats are not restricted to text
- Text: Portable Document Format (PDF), Rich Text Format (RTF), MS Word, LaTeX, XML, etc.
- Image: GIF, JPEG, PNG, TIFF
- Video: MPEG, Apple’s QuickTime, Microsoft’s AVI
- Audio: WAV, AIFF, WMA, AU, AAC, MP3, …

Are document format details important?
- When striving for standardization, knowing how different formats work helps one appreciate their strengths and weaknesses, e.g.,
  - Interactive features of PDF are lost when a document is converted into PostScript.
  - Converting GIF to JPEG can degrade image quality irreversibly.
  - Converting HTML to PostScript is easy, but converting PostScript to HTML with the visual features accurately replicated is extremely difficult.

Representing characters
- Different computer vendors adopted different sets of codes in their products until the 7-bit American Standard Code for Information Interchange (ASCII) was announced by the American National Standards Institute (ANSI). One bit has 2 combinations (0 and 1), so 7 bits give 128 combinations.
- In addition to printable characters, ASCII includes non-printable (control) characters too:
  - BEL rings the bell
  - BS backspaces the cursor
  - STX (start of text) marks the start of the text in a transmission
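The arithmetic and the control codes above can be checked directly. This is a small sketch; the variable names are illustrative only.

```python
# 7 bits, each with 2 possible values, give 2**7 code values.
combinations = 2 ** 7
print(combinations)  # 128

# The three control characters named above and their ASCII codes.
controls = {"BEL": "\a", "BS": "\b", "STX": "\x02"}
codes = {name: ord(ch) for name, ch in controls.items()}
print(codes)  # {'BEL': 7, 'BS': 8, 'STX': 2}

# The printable part of ASCII runs from space (32) to tilde (126).
printable = [chr(c) for c in range(128) if chr(c).isprintable()]
print(len(printable), repr(printable[0]), repr(printable[-1]))  # 95 ' ' '~'
```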

Extended ASCII: a solution or not?
- ASCII supports the English language only.
- The International Organization for Standardization (ISO) extended ASCII by using the unused bit of the 8-bit code to include more characters of alphabet-based languages, e.g., French and German.
  - Not a solution for many other languages: an 8-bit code supports only 256 patterns, which is far too few for Chinese and cannot hold a script such as Hebrew alongside the Western European letters.
  - Result: new competing standards arose, e.g., GB and Big-5 for Chinese (these support Chinese and English but no other languages).
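The consequence of competing 8-bit and national standards is that the same bytes mean different things under different codes. The sketch below uses codecs that ship with Python (`latin-1`, `iso8859_8`, `gb2312`, `big5`); treating GB2312 as the "GB" of the slide is an assumption.

```python
# One byte, two meanings: 0xE9 is e-acute in ISO 8859-1 (Western European)
# but a Hebrew letter in ISO 8859-8.
byte = b"\xe9"
as_latin1 = byte.decode("latin-1")
as_hebrew = byte.decode("iso8859_8")
print(as_latin1 == as_hebrew)  # False

# One Chinese character, two incompatible two-byte codes.
zhong = "\u4e2d"  # the character for "middle"
gb_bytes = zhong.encode("gb2312")
big5_bytes = zhong.encode("big5")
print(gb_bytes != big5_bytes)  # True

# An 8-bit Western code simply has no slot for it.
try:
    zhong.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)  # False
```

Unless a document records which code was used, a program reading these bytes can only guess.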

Unicode
- Work was started by Apple and Xerox in 1988.
- A consortium of many other interested parties was founded in 1991.
- ISO-10646, the “Universal Multiple-Octet Coded Character Set” (UCS; an octet is 8 bits), was released in 1993.
- Unicode aims to represent the scripts of languages in use around the world (and this has been achieved).
- Unicode offers round-trip compatibility with existing encoding schemes (e.g., 7-bit ASCII).
- Current work addresses historic languages and notations such as music.

Support for Unicode
- Computer languages
  - Java (built-in Unicode support)
  - C, Perl, Python, etc. (standard Unicode libraries)
- All key operating systems, e.g., Windows 2000 and its successors
- Application support, e.g., Web browsers
- Default encoding for HTML and XML

Unicode character set
- Comes in two parts of the ISO standard
- Specifies a total of 94,000 characters
- One part focuses on commonly used languages and is called the Basic Multilingual Plane (BMP); it aims to express all information while ensuring that each character is unique
  - Divided into 1,000 pages
  - Contains 49,000 characters
  - Code space is divided among different scripts
- Distinguishes characters by script and not by language, e.g., CJK ideographs are shared by Chinese, Japanese, and Korean
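The "by script, not by language" principle is visible in the official character names, which Python exposes through its standard `unicodedata` module. A small sketch, with an arbitrary choice of sample characters:

```python
import unicodedata

zhong = "\u4e2d"  # an ideograph used in Chinese, Japanese, and Korean text

# The name mentions the script (CJK), not any one language.
print(unicodedata.name(zhong))     # CJK UNIFIED IDEOGRAPH-4E2D
print(unicodedata.name("\u00ef"))  # LATIN SMALL LETTER I WITH DIAERESIS
print(unicodedata.name("\u05d0"))  # HEBREW LETTER ALEF

# All three sample characters lie in the Basic Multilingual Plane,
# i.e. their code values fit in 16 bits.
in_bmp = all(ord(ch) <= 0xFFFF for ch in (zhong, "\u00ef", "\u05d0"))
print(in_bmp)                      # True
```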

Unicode character code charts by script

CJK unified ideographs
[Figure with panels labelled Japanese, Simplified Chinese, and Traditional Chinese]

Composite and combining characters (1/3)
- A code point is a Unicode value written as U+ followed by the numeric value in hexadecimal; it identifies a specific character.
- A code point does not always represent a single character, e.g., the Latin small ligature fi.
- A code point may specify part of a character, e.g., the combining diaeresis (i.e., the two dots above ‘ï’, ‘ä’, ‘ü’, etc.), which must be combined with a base character to form a meaningful character.
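The three statements above can be tried out with Python's standard `unicodedata` module. In this sketch, `code_point` is a hypothetical helper written for the illustration.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Format a single character in the U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(code_point("\u00ef"))  # U+00EF (i with diaeresis)

# One code point that stands for two letters: the fi ligature.
ligature = "\ufb01"
print(unicodedata.name(ligature))               # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFKC", ligature))  # fi

# One code point that is only part of a character: the combining diaeresis.
diaeresis = "\u0308"
print(unicodedata.name(diaeresis))              # COMBINING DIAERESIS
is_combining = unicodedata.combining(diaeresis) > 0
built_by_hand = "i" + diaeresis                 # two code points, displayed as one character
print(is_combining, len(built_by_hand))         # True 2
```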

Composite and combining characters (2/3)
- To comply with the round-trip compatibility requirement, Unicode has single code points that correspond to precisely the same units as character combinations; this means that some characters have more than one representation.
- Combining characters help compensate for omissions that would otherwise have to be fixed by a lengthy standardization process.

Composite and combining characters (3/3)
- Problem: alternative representations of the same character complicate
  - Text searching (as alternative representations have to be considered)
  - String comparison
  - Sorting text into lexicographic order
- Solution
  - Representing text in some normalized form; or
  - Excluding the use of combining characters (Unicode Level 1)
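The problem and the normalization remedy can be reproduced with `unicodedata.normalize` from the Python standard library. A minimal sketch, using the word naïve in its two representations:

```python
import unicodedata

precomposed = "na\u00efve"   # single code point U+00EF
decomposed = "nai\u0308ve"   # base letter i followed by combining diaeresis

# The two strings look the same on screen but do not compare equal.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 5 6

# A naive search for the precomposed character misses the decomposed text.
print("\u00ef" in decomposed)             # False

# Normalizing both sides to the same form (here NFC) removes the difference.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                     # True

# NFD goes the other way and splits the precomposed character.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

A DL would typically normalize documents when they are indexed and normalize each query in the same way, so that both sides of a comparison use one form.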

Unicode character encodings (1/3)
- UTF-32
  - The ISO standard reserves 32 bits for each Unicode character with future extension in mind
  - The Unicode Consortium limits the range of code values to the first 21 bits
  - Rarely used in practice due to inefficient use of space
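The space cost and the 21-bit limit can both be observed from Python. This is a small sketch; the sample text is arbitrary, and the big-endian codec variants are used so that no byte-order mark is added to the counts.

```python
import sys

text = "digital library"  # 15 ASCII characters

# Bytes needed for the same text under each encoding.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}
print(sizes)  # {'utf-8': 15, 'utf-16-be': 30, 'utf-32-be': 60}

# Every character takes four bytes in UTF-32, mostly zeros for Western text.
print("A".encode("utf-32-be"))  # b'\x00\x00\x00A'

# The largest code value allowed by the Unicode Consortium fits in 21 bits.
print(hex(sys.maxunicode), sys.maxunicode.bit_length())  # 0x10ffff 21
```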

Unicode character encodings (2/3)
- UTF-16
  - Characters in the Basic Multilingual Plane (BMP) can be expressed in 16 bits
  - A pair of surrogates (each 16 bits), defined inside the BMP, is used to represent a character outside the BMP (see
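How a character outside the BMP is split into two surrogates can be sketched as follows. `to_surrogates` is a hypothetical helper written for this illustration, and the sample character U+1D11E (the musical G clef) is an arbitrary choice; the result is checked against Python's own UTF-16 codec.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = cp - 0x10000               # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))  # 0xd834 0xdd1e

# Python's codec produces the same two 16-bit units.
encoded = chr(0x1D11E).encode("utf-16-be")
print(encoded == bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF]))  # True
```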

Unicode character encodings (3/3)
- UTF-8
  - Motivation: the majority of UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modifications
  - Ken Thompson (inventor of the Unix operating system and of the programming language B, the predecessor of C) invented UTF-8 over a placemat on a dining table in September 1992

Encoding Unicode characters as UTF-8
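The encoding scheme can be sketched in a few lines. `utf8_encode` is a hypothetical function written for this illustration; it assumes the 21-bit range of code values used by Unicode, so it produces at most four bytes, and it is checked against Python's built-in codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (sketch limited to U+0000..U+10FFFF)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                      # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 21 bits: 11110xxx followed by three 10xxxxxx bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


samples = [0x41, 0xEF, 0x4E2D, 0x1D11E]
matches = [utf8_encode(cp) == chr(cp).encode("utf-8") for cp in samples]
print(utf8_encode(0xEF), matches)  # b'\xc3\xaf' [True, True, True, True]
```

The first byte announces how many bytes follow, and every continuation byte starts with the bits 10, which is what lets a reader resynchronize in the middle of a stream.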

UTF-8 properties
- UCS characters U+0000 to U+007F (ASCII) are encoded as bytes 0x00 to 0x7F (ASCII compatibility). Thus files and strings that contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
- All possible 2^31 UCS codes can be encoded.
- UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are at most three bytes long.
- To avoid ambiguity, UTF-8 must use the shortest possible encoding.
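Several of these properties can be checked with Python's built-in codec. One caveat for this sketch: Python follows the later, stricter definition of UTF-8, which stops at four bytes, so the six-byte forms cannot be demonstrated with it. `is_valid_utf8` is a hypothetical helper.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII compatibility: pure ASCII text has identical bytes in both encodings.
ascii_text = "Digital Library"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))  # True

# BMP characters need at most three bytes; characters beyond the BMP need four.
print(len("\u20ac".encode("utf-8")))      # 3 (euro sign, inside the BMP)
print(len("\U0001D11E".encode("utf-8")))  # 4 (musical G clef, outside the BMP)

# Shortest-form rule: 0xC0 0xAF would be a two-byte spelling of "/" (0x2F)
# and is rejected by a conforming decoder.
print(is_valid_utf8(b"\x2f"), is_valid_utf8(b"\xc0\xaf"))  # True False
```

The shortest-form rule matters for searching and security alike: without it, one character could hide behind several different byte sequences.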

Using Unicode in a DL
- Ensure that the right data type is selected in the underlying programming language to represent Unicode characters.
- If possible, work with the BMP and Unicode Level 1 to make string matching operations easy to implement.
- Save documents in UTF-8 to save storage space.

Documents in plain text (1/2)
- Ensure that two communicating programs make the same assumption about the text encoding, or characters may not be displayed correctly.
- Paragraphs are separated by two successive line breaks, or the first line is indented with a tab character (tab stops usually fall at every eighth character position).
- Different operating systems adopt conflicting conventions for specifying line breaks.

Documents in plain text (2/2)
- Different operating systems adopt conflicting conventions for specifying line breaks
  - Microsoft Windows starts a new line on a carriage return followed by a line feed, whereas Apple Macintosh and Unix start a new line on a line feed alone (this may cause display problems when a text file is created in one computing environment and opened in another)
- No metadata can be included explicitly
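The plain-text conventions above can be handled by converting every line break to one form before further processing. A minimal sketch; `normalize_newlines` and `paragraphs` are hypothetical helpers, and treating a lone carriage return as a line break as well is an added assumption.

```python
import re


def normalize_newlines(text: str) -> str:
    """Convert CR LF pairs (and lone CRs) to a single line feed."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def paragraphs(text: str) -> list:
    """Split plain text into paragraphs at two or more successive line breaks."""
    parts = re.split(r"\n{2,}", normalize_newlines(text))
    return [p.strip() for p in parts if p.strip()]


windows_file = "First paragraph.\r\n\r\nSecond paragraph.\r\n"
unix_file = "First paragraph.\n\nSecond paragraph.\n"

print(normalize_newlines(windows_file) == unix_file)  # True
print(paragraphs(windows_file))  # ['First paragraph.', 'Second paragraph.']

# A tab advances to the next multiple of eight columns.
print(repr("a\tb".expandtabs(8)))  # 'a       b'
```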

Data quality issues in DL
- Three levels of data quality
  - Absolute data quality (quality of the original source)
    - Overall level of data quality of both digital objects and metadata within a DL
  - Faithful reproduction data quality (e.g., scanning)
    - Data quality of objects that originated elsewhere, e.g., are scanned-in documents correctly recognized? (data quality of the translation process)
  - Digital data quality
    - Data quality of objects and metadata that were born digital within a DL

Typographical errors (Gardner, 1992)
- Errors of letter omission, letter insertion, letter substitution, and letter transposition
- Typographical errors can also occur at the word or sentence level, e.g., omission of words or paragraphs
- Typographical errors can hinder access for phrase and proximity searching, and typos can create greater obstacles when they occur in the metadata associated with an object
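The four letter-level error types correspond to the four operations counted by a restricted Damerau-Levenshtein (optimal string alignment) distance. That technique is not named in the slides; it is brought in here only as an illustration, and `osa_distance` is a hypothetical helper.

```python
def osa_distance(a: str, b: str) -> int:
    """Minimum number of omissions, insertions, substitutions, and adjacent
    transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # letter dropped from a
                          d[i][j - 1] + 1,          # letter added to a
                          d[i - 1][j - 1] + cost)   # letter substituted (or kept)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent letters swapped
    return d[-1][-1]


typos = {
    "omission": "libary",
    "insertion": "librarry",
    "substitution": "librory",
    "transposition": "lirbary",
}
distances = {kind: osa_distance("library", typo) for kind, typo in typos.items()}
print(distances)  # each of the four typos is at distance 1
```

A search tool could treat index terms within distance 1 of the query as candidate matches, at the price of more false hits.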

Scanning and data conversion errors
- Scanning errors, such as the insertion of an extra space or a misread letter or character
- Errors can also creep into text documents when they are converted from one format to another, e.g., conversion of a document from MS Word format to HTML
- Use find-and-replace only with care

Metadata errors
- Metadata errors can block access to documents in a DL
- Particularly bad for objects like images, for which full-text searching is not available
- Automatic creation of metadata by means of special software can lead to errors in the metadata
- Metadata creation generally needs human intervention to be successful
- Translating metadata from one scheme or format to another can also be a source of errors

Managing errors (the problem of data quality)
- Fix the error in the document (if you own it)
- Make available a new document that replaces the document with the error
- Make available a new document that contains a notice of the error and its correction (this document would be hyperlinked to and from the original document containing the error)
- Use special search software to compensate for some errors
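As one illustration of the last option, a search front end can suggest index terms that are close to a misspelled query. The sketch below uses `difflib.get_close_matches` from the Python standard library; the vocabulary, the `suggest` helper, and the 0.8 cutoff are assumptions made for the example, not features of any particular DL system.

```python
import difflib

# Hypothetical vocabulary taken from a DL's index.
INDEX_TERMS = ["metadata", "document", "library", "unicode"]


def suggest(query: str, vocabulary=INDEX_TERMS, cutoff: float = 0.8) -> list:
    """Return index terms that closely resemble the (possibly misspelled) query."""
    return difflib.get_close_matches(query.lower(), vocabulary, n=3, cutoff=cutoff)


print(suggest("metdata"))  # ['metadata']
print(suggest("libary"))   # ['library']
print(suggest("xyz"))      # []
```

This compensates for typos in queries; typos inside the documents or metadata themselves still have to be fixed at the source or matched approximately at index time.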

Document management issues
- Document storage is only one of the DL issues; other important questions to ask include
  - Would documents in a “digital library” be updated? If yes,
    - who can perform the update?
    - would an update need someone to approve it before it takes effect?
    - would content changes be tracked?
  - How can a document be searched? By metadata, by content, or by both?
