Lecture 2 Character Codes and Low-Structure Text Document Formats
Character Codes
  A character encoding consists of a code that pairs a sequence of characters from a given character set with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the storage of text in computers and the transmission of text through telecommunication networks. (Wikipedia)
Character Codes (contd.)
  Morse Code – Samuel F. B. Morse, 1838; first employed in the American telegraphs.
  Baudot Code – J. Baudot, 1874; a 5-bit code, and the world's first binary character code for processing textual data; standardized in 1932 as "International Telegraphic Alphabet No. 2."
  Hollerith code – employed in Hollerith's tabulating machines used in the 1890 census in the USA; first use of punched cards for processing of data; an IBM core technology until the 1960s.
Character Codes (contd.)
  ASCII – American Standard Code for Information Interchange; a standard 7-bit code proposed by ANSI in 1963 and finalized in 1968.
  EBCDIC – Extended Binary-Coded Decimal Interchange Code; an 8-bit IBM code for representing characters as numbers, first used in 1964; based on the earlier 6-bit BCDIC (BCD).
  ISO European 7-bit coded character set for information interchange – adopted by ISO in 1967.
  Extended ASCII – 8-bit code with an additional 128 characters to represent characters from other languages and special symbols for drawing pictures.
  ISO 2022, ISO 8859 – set of 8-bit character codes which cover accented Latin characters and the Greek, Cyrillic and Hebrew alphabets.
  ANSI – the Microsoft collective name for all Windows code pages; sometimes used specifically for code page 1252, which is a superset of ISO 8859-1 (Latin 1).
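The relationship between 7-bit ASCII, Latin 1 and code page 1252 can be checked byte by byte. The sketch below is illustrative only: the helper name `decode_byte` is hypothetical, and the codec names (`ascii`, `latin-1`, `cp1252`) are the ones Python's standard library happens to use.

```python
def decode_byte(value: int, encoding: str) -> str:
    """Decode a single byte with the given codec; return '?' if undefined."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return "?"

# x41 is defined in all three codes, xE9 only in the 8-bit codes,
# and x80 is a printable character only in code page 1252.
for value in (0x41, 0xE9, 0x80):
    row = [decode_byte(value, enc) for enc in ("ascii", "latin-1", "cp1252")]
    print(f"x{value:02X}", [repr(ch) for ch in row])
```

In this sketch x80 decodes to a control character in Latin 1 and to the euro sign in code page 1252, so "superset" is best read as applying to the printable characters.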
Character Codes (contd.)
  Unicode – multilingual character code, originally 16-bit; project launched in the 1980s in the USA by the Unicode Consortium.
  UCS (ISO 10646) – Universal Coded Character Set, first introduced in 1993; UCS-4 (32-bit), UCS-2 (16-bit).
  Standard                 Representation of 'A' (Latin)
  ASCII                    x41
  ISO 8859-1 (Latin 1)     x41
  Unicode, UCS             x0041
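The representations of 'A' can be reproduced with a few lines of Python. This is a minimal sketch: `encoded_hex` is a hypothetical helper, and because Python has no separate UCS-2/UCS-4 codecs, the big-endian UTF-16 and UTF-32 codecs are used as stand-ins (an assumption that holds for 'A').

```python
def encoded_hex(ch: str, codec: str) -> str:
    """Return the bytes of one character in the given codec as hex."""
    return ch.encode(codec).hex(" ").upper()

# Stand-ins: utf-16-be for the 16-bit form, utf-32-be for the 32-bit form.
for label, codec in [
    ("ASCII", "ascii"),
    ("Latin 1", "latin-1"),
    ("UCS-2 (16-bit)", "utf-16-be"),
    ("UCS-4 (32-bit)", "utf-32-be"),
]:
    print(f"{label:16} {encoded_hex('A', codec)}")
```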
Character Codes (contd.)
  UTF-8 and UTF-16 – UCS/Unicode Transformation Formats: methods for encoding a string of UCS/Unicode characters as a sequence of bytes.
  UCS-4 (hex)                  UTF-8 (binary)
  U-00000000 – U-0000007F:     0xxxxxxx
  U-00000080 – U-000007FF:     110xxxxx 10xxxxxx
  U-00000800 – U-0000FFFF:     1110xxxx 10xxxxxx 10xxxxxx
  U-00010000 – U-001FFFFF:     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  U-00200000 – U-03FFFFFF:     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  U-04000000 – U-7FFFFFFF:     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
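The bit patterns in the UTF-8 table can be turned into a small encoder. The sketch below is not a replacement for a real codec: `utf8_encode` is a hypothetical name, it follows the original six-row scheme, and it does no validation beyond the range check. Current UTF-8 (RFC 3629) stops at U+10FFFF, so only the one- to four-byte forms are in use today.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one UCS-4 value using the lead-byte/continuation-byte patterns."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        return bytes([code_point])          # 0xxxxxxx
    # (upper limit of the range, total number of bytes, lead-byte prefix)
    ranges = (
        (0x7FF, 2, 0xC0),        # 110xxxxx
        (0xFFFF, 3, 0xE0),       # 1110xxxx
        (0x1FFFFF, 4, 0xF0),     # 11110xxx
        (0x3FFFFFF, 5, 0xF8),    # 111110xx
        (0x7FFFFFFF, 6, 0xFC),   # 1111110x
    )
    for limit, length, prefix in ranges:
        if code_point <= limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))   # 10xxxxxx, lowest 6 bits first
        code_point >>= 6
    return bytes([prefix | code_point] + tail[::-1])

for ch in ("A", "é", "€"):
    print(ch, utf8_encode(ord(ch)).hex(" ").upper())
```

For code points that Python itself accepts, the result can be compared with `chr(cp).encode("utf-8")`.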
ASCII Text Files (.txt)
  An ASCII/ANSI/Unicode file is a file in which characters are represented by their ASCII/ANSI/Unicode codes, respectively.
  The ASCII codes 0–31 (hex x00–x1F) and 127 (hex x7F) are nonprintable control characters.
  x09 – Character Tab (HT)      x0A – Line Feed (LF)
  x0D – Carriage Return (CR)    x20 – Space
  x48 – 'H'                     x65 – 'e'
  Text:
    Hello World
    Document Architectures
  ANSI (hw-ansi.txt):
    48 65 6C 6C 6F 20 57 6F 72 6C 64 0D 0A
    44 6F 63 75 6D 65 6E 74 20 41 72 63 68 69 74 65 63 74 75 72 65 73
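The split between control characters and printable characters is easy to test in code. A minimal sketch, with hypothetical helper names (`is_control`, `describe`) and a made-up sample string:

```python
# Names for the codes listed on the slide.
NAMES = {0x09: "HT", 0x0A: "LF", 0x0D: "CR", 0x20: "Space"}

def is_control(code: int) -> bool:
    """True for the nonprintable ASCII codes 0-31 and 127."""
    return 0x00 <= code <= 0x1F or code == 0x7F

def describe(code: int) -> str:
    """Return a short label for one ASCII code."""
    if code in NAMES:
        return NAMES[code]
    if is_control(code):
        return "control"
    return repr(chr(code))

# Iterating over bytes yields integers, one per character code.
for code in "Hello\tWorld\r\n".encode("ascii"):
    print(f"x{code:02X}  {describe(code)}")
```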
ASCII Text Files (.txt) (contd.)
  Unicode (hw-unicode.txt), same text; little-endian byte order (Intel architecture):
    FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 0D 00 0A 00
    44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 20 00 41 00 72 00 63 00 68 00 69 00 74 00 65 00 63 00 74 00 75 00 72 00 65 00 73 00
ASCII Text Files (.txt) (contd.)
  Unicode Big Endian (hw-unicode-be.txt), same text; big-endian byte order (Sun's SPARC, Motorola's 68K, Java VM):
    FE FF 00 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 0D 00 0A
    00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 20 00 41 00 72 00 63 00 68 00 69 00 74 00 65 00 63 00 74 00 75 00 72 00 65 00 73
ASCII Text Files (.txt) (contd.)
  UTF-8 (hw-utf-8.txt), same text; the file starts with the UTF-8 signature EF BB BF:
    EF BB BF 48 65 6C 6C 6F 20 57 6F 72 6C 64 0D 0A
    44 6F 63 75 6D 65 6E 74 20 41 72 63 68 69 74 65 63 74 75 72 65 73
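The four variants of the sample file can be regenerated and told apart by their leading bytes. The sketch below rests on assumptions that the slides do not state outright: "ANSI" is taken to be code page 1252, "Unicode" is taken to be UTF-16 with a byte order mark, and the names `hex_dump`, `VARIANTS` and `sniff_signature` are hypothetical.

```python
import codecs

TEXT = "Hello World\r\nDocument Architectures"

def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return data.hex(" ").upper()

# Assumed mapping from the editor's encoding names to Python codecs.
VARIANTS = {
    "ANSI": TEXT.encode("cp1252"),
    "Unicode": codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le"),
    "Unicode Big Endian": codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be"),
    "UTF-8": TEXT.encode("utf-8-sig"),   # utf-8-sig prepends EF BB BF
}

def sniff_signature(data: bytes) -> str:
    """Guess the encoding from the leading signature bytes, if any.

    Simplification: UTF-32 signatures are not considered.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "unknown (no signature)"

for name, data in VARIANTS.items():
    print(f"{name:20} {sniff_signature(data):24} {hex_dump(data[:8])} ...")
```

A file without a signature, such as the ANSI variant, cannot be identified this way; the reader has to know or guess the code page.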
Rich Text Format (.rtf)
  Microsoft, RTF 1.9.1, 2007 (first version from the 1980s).
  RTF is an 8-bit ASCII format.
  An RTF file consists of unformatted text and commands (or control words) that RTF uses to mark printer control codes and information that applications use to manage documents.
  A control word takes the form \LetterSequence, where LetterSequence is made up of lowercase alphabetic characters.
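Control words of this form can be picked out with a regular expression. The sketch below is a rough illustration, not an RTF parser: `control_words` is a hypothetical helper, the sample string is an abbreviated, made-up sequence in the style of real RTF, and the optional trailing number is an assumption based on control words such as \fs24 that carry a numeric parameter. It does not handle escaped backslashes, groups or hex escapes.

```python
import re

# Backslash, lowercase letters, then an optional (possibly negative) number.
CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)?")

def control_words(rtf: str):
    """Return (name, parameter) pairs; parameter is None when absent."""
    return [
        (m.group(1), int(m.group(2)) if m.group(2) else None)
        for m in CONTROL_WORD.finditer(rtf)
    ]

sample = r"\pard\plain \ql \li0\ri0\fs24\lang2057"
for name, param in control_words(sample):
    print(name, param)
```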
Rich Text Format (.rtf) (contd.)
  Example 1. “Hello World/Document Architectures”
  ……………\li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0\pararsid \f2\fs20\lang1033\langfe1033\cgrid\langnp1033\langfenp1033 {\insrsid Hello}{\insrsid \charrsid World \par Document Architectures \par }}
  Example 2. “Никола” (a word in Cyrillic) saved with a Latin code page
  ……………\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl8\pnlcltr\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl9\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}\pard\plain \ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 \fs24\lang2057\langfe1033\cgrid\langnp2057\langfenp1033 {\lang1049\langfe1033\langnp1049 \'cd\'e8\'ea\'ee\'eb\'e0\par }}
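The \'xx sequences in Example 2 are hexadecimal byte values, and what they display as depends on the code page used to interpret them. A minimal sketch, assuming the bytes are meant to be read with the Windows Cyrillic code page (cp1251); `decode_hex_escapes` is a hypothetical helper and ignores everything except the hex escapes.

```python
import re

HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")

def decode_hex_escapes(rtf: str, encoding: str) -> str:
    """Collect the \\'xx bytes of an RTF fragment and decode them."""
    raw = bytes(int(pair, 16) for pair in HEX_ESCAPE.findall(rtf))
    return raw.decode(encoding)

fragment = r"\'cd\'e8\'ea\'ee\'eb\'e0"
print(decode_hex_escapes(fragment, "cp1251"))   # Cyrillic reading: Никола
print(decode_hex_escapes(fragment, "cp1252"))   # Latin reading: Íèêîëà
```

The same six bytes give the intended word under one code page and accented Latin letters under the other, which is why the RTF text also records the language (\lang1049).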
MS Word and PowerPoint (.doc, .ppt)
  .doc and .ppt are binary document formats.
  Example 1. “Hello World/Document Architectures” in a .doc file
  ………………… Ъ Ак‹РїУВј 0 м Ъ Ъ + 0 [ Ъ № № Ъ 0 ј ј Ё Ё Ё Ё Щ Hello World Document Architectures ……………………
PostScript (.ps)
  PostScript is a programming (or page description) language optimized for printing graphics and text.
  Introduced by Adobe; an alternative, Printer Control Language (PCL), was introduced by HP.
  The main purpose of PostScript was to provide a convenient language in which to describe images in a device-independent manner.
  Printing requires a PostScript-capable printer.
PostScript (.ps) (contd.)
  Example PostScript program
  %!
  % Sample of printing text
  /Times-Roman findfont    % Get the basic font
  20 scalefont             % Scale the font to 20 points
  setfont                  % Make it the current font
  newpath                  % Start a new path
  72 72 moveto             % Lower left corner of text at (72, 72)
  (Hello, world!) show     % Typeset "Hello, world!"
  showpage
Portable Document Format (.pdf)
  Adobe’s binary document format based on the PostScript language.
  Widely used for submitting digital files for printing.
  PDF is designed to preserve all the fonts, formatting, graphics, and color of any source document, regardless of the application and platform used to create it.
Structured vs. Unstructured Documents
  Examples of highly structured text document formats are SGML, HTML and LaTeX.
  Structure in a document makes it easier to process the document automatically.
  However, more structure in a document leads to a bigger file.
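The size cost of structure can be seen by storing the same two lines with and without markup. This is a toy comparison only: the HTML below is a made-up minimal wrapper, and `size_in_bytes` is a hypothetical helper.

```python
def size_in_bytes(text: str, encoding: str = "utf-8") -> int:
    """Number of bytes needed to store the text in the given encoding."""
    return len(text.encode(encoding))

PLAIN = "Hello World\r\nDocument Architectures"
# Hypothetical markup carrying the same two lines as paragraphs.
MARKED_UP = (
    "<html><body>"
    "<p>Hello World</p>"
    "<p>Document Architectures</p>"
    "</body></html>"
)

print("plain text:", size_in_bytes(PLAIN), "bytes")
print("with markup:", size_in_bytes(MARKED_UP), "bytes")
```

In exchange for the extra bytes, the marked-up version tells a program where each paragraph starts and ends.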
References
  Online Hex Dump
  Unicode Home Page