Lecture 2 Character Codes and Low-Structure Text Document Formats.


1 Lecture 2 Character Codes and Low-Structure Text Document Formats

2 Character Codes
A character encoding consists of a code that pairs a sequence of characters from a given character set with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the storage of text in computers and the transmission of text through telecommunication networks. (Wikipedia)

3 Character Codes (contd.)
Morse Code - Samuel F. B. Morse, 1838; first employed in the American telegraphs.
Baudot Code - J. Baudot, 1874; a 5-bit code and the world's first binary character code for processing textual data; standardized in 1932 as "International Telegraph Alphabet No. 2."
Hollerith Code - employed in Hollerith's tabulating machines used in the 1890 census in the USA; the first use of punched cards for data processing; an IBM core technology until the 1960s.

4 Character Codes (contd.)
ASCII - The American Standard Code for Information Interchange; standard 7-bit code proposed by ANSI (then the American Standards Association) in 1963 and finalized in 1968.
EBCDIC - Extended Binary-Coded Decimal Interchange Code; an 8-bit IBM code for representing characters as numbers, first used in 1964; based on the earlier 6-bit BCDIC (BCD).
ISO 646 - International 7-bit coded character set for information interchange, adopted in 1967 by ISO; it allows national variants of ASCII.
Extended ASCII - 8-bit codes with an additional 128 characters, used for letters of other languages and for special symbols such as box-drawing characters.
ISO 2022 - A code-extension technique for switching between several character sets within 7-bit and 8-bit codes.
ISO 8859 - A family of 8-bit character codes which cover accented Latin characters and the Greek, Cyrillic and Hebrew alphabets, among others.
ANSI - The Microsoft collective name for all Windows code pages. Sometimes used specifically for code page 1252, which is a superset of ISO 8859-1 (Latin 1) as far as printable characters are concerned.
See for example http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm
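
The relation between code page 1252 and ISO 8859-1 can be checked byte by byte. The sketch below is only an illustration, and it assumes that Python's built-in "cp1252" and "latin-1" codecs are faithful stand-ins for the two standards. It collects every byte value that the two code pages decode differently.

```python
# Compare ISO 8859-1 (Latin 1) with Windows code page 1252, byte by byte.
# Assumption: Python's "latin-1" and "cp1252" codecs model the two standards.
differing = []
for value in range(256):
    raw = bytes([value])
    latin1 = raw.decode("latin-1")        # defined for all 256 byte values
    try:
        cp1252 = raw.decode("cp1252")
    except UnicodeDecodeError:
        cp1252 = None                     # byte is unassigned in code page 1252
    if cp1252 != latin1:
        differing.append(value)

# Every difference falls in one block of 32 byte values.
print(f"first differing byte: x{min(differing):02X}")
print(f"last differing byte:  x{max(differing):02X}")
print("x80 in code page 1252:", b"\x80".decode("cp1252"))
```

All differences fall in the range x80 - x9F, where ISO 8859-1 has control codes and code page 1252 places printable characters such as the euro sign. Outside that range the two agree.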

5 Character Codes (contd.)
Unicode - Multilingual character code, originally designed as a 16-bit code; the project was launched in the 1980s in the USA and is maintained by the Unicode Consortium (www.unicode.org).
UCS (ISO 10646) - Universal Coded Character Set, first introduced in 1993; UCS-4 (32-bit), UCS-2 (16-bit).

Representation of 'A' (Latin)
Standard            Binary                                 Hex
ASCII               1000001                                x41
8859-1 (Latin 1)    01000001                               x41
Unicode, UCS-2      00000000 01000001                      x0041
UCS-4               00000000 00000000 00000000 01000001    x00000041
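
The table of representations of 'A' can be reproduced with a few lines of Python. This is a sketch rather than part of the lecture. It assumes that Python's "utf-16-be" and "utf-32-be" codecs can stand in for UCS-2 and UCS-4, which holds for a character like 'A' but not for every code point.

```python
# Encode the letter 'A' under several character codes and show the bytes.
# Assumption: utf-16-be / utf-32-be stand in for UCS-2 / UCS-4 here.
text = "A"
encodings = ["ascii", "latin-1", "utf-16-be", "utf-32-be"]

table = {name: text.encode(name).hex() for name in encodings}

for name, hex_bytes in table.items():
    bits = " ".join(f"{byte:08b}" for byte in bytes.fromhex(hex_bytes))
    print(f"{name:10} x{hex_bytes:8} {bits}")
```

Note that ASCII is a 7-bit code, but when stored in an octet the leading bit is simply zero, so the ASCII and Latin 1 rows show the same byte.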

6 Character Codes (contd.)
UTF-8 and UTF-16 - UCS/Unicode Transformation Formats, methods for encoding a string of UCS/Unicode characters as a sequence of bytes.

UCS-4 (hex)                UTF-8 (binary)
U-00000000 - U-0000007F    0xxxxxxx
U-00000080 - U-000007FF    110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
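
The UTF-8 table can be turned into code almost row by row. The function below is a minimal sketch; the name utf8_encode is hypothetical and it is not a library API. It follows the six-row scheme from the slide, although current UTF-8 (RFC 3629) stops at U+10FFFF, so the 5- and 6-byte rows are of historical interest only. It also does not reject surrogate code points, which a production encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one UCS code point using the six-row table from the slide."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return bytes([code_point])            # row 1: 0xxxxxxx

    # (upper limit of the row, lead-byte marker, number of 10xxxxxx bytes)
    rows = [
        (0x7FF, 0xC0, 1),         # 110xxxxx + 1 continuation byte
        (0xFFFF, 0xE0, 2),        # 1110xxxx + 2
        (0x1FFFFF, 0xF0, 3),      # 11110xxx + 3
        (0x3FFFFFF, 0xF8, 4),     # 111110xx + 4 (historical)
        (0x7FFFFFFF, 0xFC, 5),    # 1111110x + 5 (historical)
    ]
    for limit, marker, count in rows:
        if code_point <= limit:
            # The lead byte carries the highest bits of the code point.
            lead = marker | (code_point >> (6 * count))
            # Each continuation byte carries the next 6 bits, high to low.
            tail = [
                0x80 | ((code_point >> (6 * i)) & 0x3F)
                for i in range(count - 1, -1, -1)
            ]
            return bytes([lead] + tail)
    raise ValueError("code point above U-7FFFFFFF")


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

For code points that the current standard allows, the result can be compared against Python's own encoder, for example utf8_encode(0x20AC) == "\u20ac".encode("utf-8").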

7 ASCII Text Files (.txt)
An ASCII/ANSI/Unicode file is a file in which characters are represented by their ASCII/ANSI/Unicode codes respectively.
The ASCII codes 0-31 (hex x00 - x1F) and 127 (hex x7F) are nonprintable control characters.

x09 - Character Tab (HT)      x0A - Line Feed (LF)
x0D - Carriage Return (CR)    x20 - Space
x48 - 'H'                     x65 - 'e'

Example: the two lines "Hello World" and "Document Architectures" saved as ANSI (hw-ansi.txt)
48 65 6C 6C 6F 20 57 6F 72 6C 64 0D 0A
44 6F 63 75 6D 65 6E 74 20 41 72 63 68 69 74 65 63 74 75 72 65 73
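
The hex dump of hw-ansi.txt can be regenerated in a few lines. This sketch assumes that "ANSI" means code page 1252 and that the file uses a CR LF line break, as the dump suggests. The helper name describe is made up for the illustration.

```python
# Names for the control and space codes listed on the slide.
CONTROL_NAMES = {0x09: "HT", 0x0A: "LF", 0x0D: "CR", 0x20: "Space"}


def describe(byte: int) -> str:
    """Return a readable label for one byte of an ASCII/ANSI text file."""
    if byte in CONTROL_NAMES:
        return CONTROL_NAMES[byte]
    if byte < 0x20 or byte == 0x7F:
        return "control"                  # other nonprintable ASCII codes
    if byte < 0x7F:
        return chr(byte)                  # printable ASCII
    return f"x{byte:02X}"                 # outside ASCII: depends on code page


# Assumption: "ANSI" here is Windows code page 1252.
data = "Hello World\r\nDocument Architectures".encode("cp1252")
dump = " ".join(f"{byte:02X}" for byte in data)
labels = [describe(byte) for byte in data]

print(dump)
print(labels[9:14])
```

The slice labels[9:14] covers the end of the first line and the start of the second, so the CR and LF codes appear between 'd' and 'D'.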

8 ASCII Text Files (.txt)
Example: the two lines "Hello World" and "Document Architectures" saved as Unicode (hw-unicode.txt); 16-bit code units in little-endian byte order, as on the Intel architecture
FF FE
48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 0D 00 0A 00
44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 20 00 41 00 72 00 63 00 68 00 69 00 74 00 65 00 63 00 74 00 75 00 72 00 65 00 73 00

9 ASCII Text Files (.txt)
Example: the two lines "Hello World" and "Document Architectures" saved as Unicode Big Endian (hw-unicode-be.txt); big-endian byte order, as on Sun's SPARC, Motorola's 68K and the Java VM
FE FF
00 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 0D 00 0A
00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 20 00 41 00 72 00 63 00 68 00 69 00 74 00 65 00 63 00 74 00 75 00 72 00 65 00 73

10 ASCII Text Files (.txt)
Example: the two lines "Hello World" and "Document Architectures" saved as UTF-8 (hw-utf-8.txt); the first three bytes are the UTF-8 signature
EF BB BF
48 65 6C 6C 6F 20 57 6F 72 6C 64 0D 0A
44 6F 63 75 6D 65 6E 74 20 41 72 63 68 69 74 65 63 74 75 72 65 73
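
The leading bytes FF FE, FE FF and EF BB BF in the three Unicode examples act as signatures that tell a reader how to decode the rest of the file. The sketch below shows one way a program might use them. The function name decode_with_signature and the fallback to code page 1252 are assumptions made for the illustration, and the 32-bit signatures are deliberately ignored.

```python
import codecs

# Signature bytes and the encoding each one announces.
SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def decode_with_signature(data: bytes, fallback: str = "cp1252") -> str:
    """Decode a text file, choosing the encoding from its signature."""
    for signature, encoding in SIGNATURES:
        if data.startswith(signature):
            return data[len(signature):].decode(encoding)
    return data.decode(fallback)          # no signature: assume "ANSI"


text = "Hello World\r\nDocument Architectures"

# Rebuild the four example files from the slides in memory.
samples = {
    "hw-ansi.txt": text.encode("cp1252"),
    "hw-unicode.txt": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "hw-unicode-be.txt": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "hw-utf-8.txt": codecs.BOM_UTF8 + text.encode("utf-8"),
}

for name, data in samples.items():
    print(f"{name:18} {data[:6].hex(' ')} ... -> {decode_with_signature(data)!r}")
```

All four byte sequences decode to the same two lines of text; only the stored bytes differ.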

11 Rich Text Format (.rtf)
Microsoft, RTF 1.9.1, 2007 (first version from the 1980s).
RTF is a text-based format: a standard RTF file can consist of only 7-bit ASCII characters, and 8-bit character values are written as escapes of the form \'hh.
An RTF file consists of unformatted text and commands (or control words) that RTF uses to mark printer control codes and information that applications use to manage documents.
A control word takes the form \LetterSequence, where LetterSequence is made up of lowercase alphabetic characters.

12 Rich Text Format (.rtf)
Example 1. “Hello World/Document Architectures”
……………\li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0\pararsid4738926 \f2\fs20\lang1033\langfe1033\cgrid\langnp1033\langfenp1033 {\insrsid15036461 Hello}{\insrsid4002691\charrsid4738926 World \par Document Architectures \par }}

Example 2. “Никола” (a word in Cyrillic) saved with a Latin code page
……………\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl8\pnlcltr\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl9\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}\pard\plain \ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 \fs24\lang2057\langfe1033\cgrid\langnp2057\langfenp1033 {\lang1049\langfe1033\langnp1049 \'cd\'e8\'ea\'ee\'eb\'e0\par }}
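
The \'hh escapes in Example 2 are byte values in some 8-bit code page, so the same escapes give different text depending on which code page is used to read them. The sketch below decodes them twice. The choice of code page 1251 (Windows Cyrillic) is an assumption, since the excerpt elides the part of the file that would name the code page, and the function name decode_hex_escapes is hypothetical.

```python
import re

# An RTF hex escape: backslash, apostrophe, two hexadecimal digits.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_hex_escapes(rtf_fragment: str, code_page: str) -> str:
    """Collect the \\'hh escapes in a fragment and decode them as one byte string."""
    raw = bytes(int(digits, 16) for digits in HEX_ESCAPE.findall(rtf_fragment))
    return raw.decode(code_page)


fragment = r"\'cd\'e8\'ea\'ee\'eb\'e0"

# Assumption: the escapes were written using the Windows Cyrillic code page.
print(decode_hex_escapes(fragment, "cp1251"))
# The same bytes read with the Western European code page give accented Latin letters.
print(decode_hex_escapes(fragment, "cp1252"))
```

Read as code page 1251 the six bytes spell the Cyrillic word from the example; read as code page 1252 they turn into a run of accented Latin letters, which is the typical symptom of a mismatched code page.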

13 MS Word and PowerPoint (.doc, .ppt)
.doc and .ppt are binary document formats.
Example 1. “Hello World/Document Architectures” in a .doc file
………………… Ъ Ак‹РїУВј 0 м Ъ Ъ + 0 [ Ъ № № Ъ 0 ј ј Ё Ё Ё Ё Щ Hello World Document Architectures ……………………

14 PostScript (.ps)
PostScript is a programming (or page description) language optimized for printing graphics and text.
Introduced by Adobe in 1985. (An alternative, Printer Command Language (PCL), was introduced by HP.)
The main purpose of PostScript was to provide a convenient language in which to describe images in a device-independent manner.
Requires a PostScript (capable) printer.

15 PostScript (.ps)
Example PostScript program (http://www.cs.indiana.edu/docproject/programming/postscript/postscript.html)

%!
% Sample of printing text
/Times-Roman findfont   % Get the basic font
20 scalefont            % Scale the font to 20 points
setfont                 % Make it the current font
newpath                 % Start a new path
72 72 moveto            % Lower left corner of text at (72, 72)
(Hello, world!) show    % Typeset "Hello, world!"
showpage

16 Portable Document Format (.pdf)
Adobe’s binary document format based on the PostScript language.
Widely used for submitting digital files for printing.
PDF is designed to preserve all the fonts, formatting, graphics, and color of any source document, regardless of the application and platform used to create it.

17 Structured vs. Unstructured Documents
Examples of highly structured text document formats are SGML, HTML and LaTeX.
Structure in a document makes it easier to process the document automatically.
However, more structure in a document leads to a bigger file.

18 References
Online Hex Dump http://mark0.net/hexdump.html
Unicode Home Page http://www.unicode.org/

