Lecture 2 Character Codes and Low-Structure Text Document Formats.


Character Codes
A character encoding consists of a code that pairs a sequence of characters from a given character set with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the storage of text in computers and the transmission of text through telecommunication networks. (Wikipedia)

Character Codes (contd.)
Morse code – Samuel F. B. Morse, 1838; first employed in the American telegraphs.
Baudot code – J. Baudot, 1874; a 5-bit code and the world's first binary character code for processing textual data; standardized in 1932 as "International Telegraphic Alphabet No. 2".
Hollerith code – employed in Hollerith's tabulating machines used in the 1890 census in the USA; the first use of punched cards for data processing; an IBM core technology until the 1960s.

Character Codes (contd.)
ASCII – American Standard Code for Information Interchange; a standard 7-bit code proposed by ANSI in 1963 and finalized in 1968.
EBCDIC – Extended Binary-Coded Decimal Interchange Code; an 8-bit IBM code for representing characters as numbers, first used in 1964; based on the earlier 6-bit BCDIC (BCD).
ISO 646 – European 7-bit coded character set for information interchange, adopted by ISO in 1967.
Extended ASCII – 8-bit code with an additional 128 characters for letters used in other languages and special symbols for drawing pictures.
ISO 2022, ISO 8859 – set of 8-bit character codes which cover accented Latin characters and the Greek, Cyrillic and Hebrew alphabets.
ANSI – Microsoft's collective name for all Windows code pages; sometimes used specifically for code page 1252, which is a superset of ISO 8859-1 (Latin 1).

Character Codes (contd.)
Unicode – 16-bit multilingual character code project launched in the 1980s in the USA by the Unicode Consortium.
UCS (ISO 10646) – Universal Coded Character Set, first introduced in 1993; UCS-4 (32-bit), UCS-2 (16-bit).

Standard                 Representation of 'A' (Latin)
ASCII                    x41
ISO 8859-1 (Latin 1)     x41
Unicode, UCS-2           x0041
UCS-4                    x00000041
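The different widths of the code for 'A' can be checked with a short Python sketch. It rests on an assumption that is not in the slides: Python's "utf-16-be" and "utf-32-be" codecs are used as stand-ins for UCS-2 and UCS-4, which give the same bytes for a character such as 'A'. The function name code_units is made up for this illustration.

```python
def code_units(ch):
    """Return the hex representation of one character under several codes."""
    # Assumption: utf-16-be / utf-32-be stand in for UCS-2 / UCS-4 here;
    # for a Basic Multilingual Plane character like 'A' the bytes are identical.
    codecs = {
        "ASCII": "ascii",
        "ISO 8859-1 (Latin 1)": "latin-1",
        "UCS-2": "utf-16-be",
        "UCS-4": "utf-32-be",
    }
    return {name: ch.encode(codec).hex().upper() for name, codec in codecs.items()}


if __name__ == "__main__":
    for name, value in code_units("A").items():
        print(f"{name}: x{value}")
```

The code value stays the same (x41) in every case; only the number of bytes used to store it grows.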

Character Codes (contd.)
UTF-8 and UTF-16 – UCS/Unicode Transformation Formats, methods for encoding a string of UCS/Unicode characters as a sequence of bytes.

UCS-4 (hex)                  UTF-8 (binary)
U-00000000 – U-0000007F:     0xxxxxxx
U-00000080 – U-000007FF:     110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF:     1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF:     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF:     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF:     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
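A minimal Python sketch of the UCS-4 to UTF-8 mapping may help make the bit patterns concrete. The function name utf8_encode is hypothetical, and the sketch follows the original six-row scheme for values up to x7FFFFFFF; note that current UTF-8 (RFC 3629) is restricted to code points up to U+10FFFF, so only the first four rows are used in practice and Python's built-in encoder will not produce 5- or 6-byte sequences.

```python
def utf8_encode(cp):
    """Encode one UCS-4 value (0 .. 0x7FFFFFFF) using the original UTF-8 scheme."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code value out of UCS-4 range")
    if cp <= 0x7F:
        return bytes([cp])  # 0xxxxxxx: a single byte, identical to ASCII
    # (upper limit of the range, number of continuation bytes, lead-byte prefix)
    rows = (
        (0x7FF, 1, 0xC0),       # 110xxxxx
        (0xFFFF, 2, 0xE0),      # 1110xxxx
        (0x1FFFFF, 3, 0xF0),    # 11110xxx
        (0x3FFFFFF, 4, 0xF8),   # 111110xx
        (0x7FFFFFFF, 5, 0xFC),  # 1111110x
    )
    for limit, n, prefix in rows:
        if cp <= limit:
            # The lead byte carries the highest bits; each continuation byte
            # carries the next 6 bits behind the marker bits 10.
            lead = prefix | (cp >> (6 * n))
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 1, -1, -1)]
            return bytes([lead] + tail)


if __name__ == "__main__":
    for cp in (0x41, 0x410, 0x20AC, 0x1F600):
        print(f"U+{cp:04X}:", utf8_encode(cp).hex(" ").upper())
```

For code points that Python itself accepts (excluding surrogates), the result should agree with chr(cp).encode("utf-8").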

ASCII Text Files (.txt)
An ASCII/ANSI/Unicode file is a file in which characters are represented by their ASCII/ANSI/Unicode codes, respectively.
The ASCII codes 0 – 31 (hex x00 – x1F) and 127 (hex x7F) are nonprintable control characters.
x09 – Character Tab (HT)      x0A – Line Feed (LF)
x0D – Carriage Return (CR)    x20 – Space
x48 – 'H'                     x65 – 'e'
Example: the lines "Hello World" and "Document Architectures" saved as ANSI (hw-ansi.txt):
48 65 6C 6C 6F 20 57 6F 72 6C 64 0D 0A 44 6F 63 75 6D 65 6E 74 20 41 72 63 68 69 74 65 63 74 75 72 65 73

ASCII Text Files (.txt) (contd.)
Example: the lines "Hello World" and "Document Architectures" saved as Unicode (hw-unicode.txt); little-endian byte order, Intel architecture:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 0D 00 0A 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 20 00 41 00 72 00 63 00 68 00 69 00 74 00 65 00 63 00 74 00 75 00 72 00 65 00 73 00

ASCII Text Files (.txt) (contd.)
Example: the lines "Hello World" and "Document Architectures" saved as Unicode Big Endian (hw-unicode-be.txt); big-endian byte order, as on Sun's SPARC, Motorola's 68K and the Java VM:
FE FF 00 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 0D 00 0A 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 20 00 41 00 72 00 63 00 68 00 69 00 74 00 65 00 63 00 74 00 75 00 72 00 65 00 73

ASCII Text Files (.txt) (contd.)
Example: the lines "Hello World" and "Document Architectures" saved as UTF-8 (hw-utf-8.txt); the first three bytes, EF BB BF, are the UTF-8 signature:
EF BB BF 48 65 6C 6C 6F 20 57 6F 72 6C 64 0D 0A 44 6F 63 75 6D 65 6E 74 20 41 72 63 68 69 74 65 63 74 75 72 65 73
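The byte sequences of the four sample files can be reproduced with a short Python sketch. Several things here are assumptions rather than facts from the slides: that "ANSI" means code page 1252, that the two lines are separated by CR LF (x0D x0A) with no trailing line break, and that Python's codec names correspond to the editor's "Unicode", "Unicode Big Endian" and "UTF-8" save options. The helper name hex_dump is made up for this example.

```python
# Assumed file content: two lines separated by CR LF, no trailing line break.
TEXT = "Hello World\r\nDocument Architectures"


def hex_dump(data):
    """Format bytes as space-separated two-digit uppercase hex values."""
    return " ".join(f"{b:02X}" for b in data)


# Assumption: these codecs mirror the four "save as" encodings.
ENCODED = {
    "ANSI (code page 1252)": TEXT.encode("cp1252"),
    "Unicode (little-endian)": b"\xff\xfe" + TEXT.encode("utf-16-le"),  # byte order mark FF FE
    "Unicode big-endian": b"\xfe\xff" + TEXT.encode("utf-16-be"),       # byte order mark FE FF
    "UTF-8 with signature": TEXT.encode("utf-8-sig"),                   # prepends EF BB BF
}

if __name__ == "__main__":
    for name, data in ENCODED.items():
        print(f"{name} ({len(data)} bytes)")
        print(hex_dump(data))
```

The byte counts make the storage cost visible: the 16-bit encodings need two bytes per character plus the two-byte byte order mark, while UTF-8 adds only the three signature bytes to the ANSI size for this pure-ASCII text.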

Rich Text Format (.rtf)
Microsoft, RTF 1.9.1, 2007 (first version from the 1980s).
RTF is an 8-bit ASCII format.
An RTF file consists of unformatted text and commands (or control words) that RTF uses to mark printer control codes and information that applications use to manage documents.
A control word takes the form \LetterSequence, where LetterSequence is made up of lowercase alphabetic characters.
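As a rough illustration of the control word form, the following Python sketch pulls control words out of an RTF fragment with a regular expression. The function name control_words and the sample string are made up for this example. The optional numeric parameter after the letters (as in \fs24) is not described in the text above; it is included here because RTF control words commonly carry one. The pattern is a simplification, not a full RTF tokenizer: it ignores groups, control symbols and escaped backslashes.

```python
import re

# Backslash, lowercase letters, then an optional (possibly negative) number.
# Simplified on purpose: this is not a complete RTF parser.
CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)?")


def control_words(rtf):
    """Return (name, parameter) pairs; parameter is None when absent."""
    return [
        (m.group(1), int(m.group(2)) if m.group(2) else None)
        for m in CONTROL_WORD.finditer(rtf)
    ]


if __name__ == "__main__":
    sample = r"\pard\plain \ql \li0\ri0\widctlpar\fs24\lang2057"  # hypothetical fragment
    for name, param in control_words(sample):
        print(name, param)
```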

Rich Text Format (.rtf) (contd.)
Example 1. "Hello World/Document Architectures":
……………\li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0\pararsid \f2\fs20\lang1033\langfe1033\cgrid\langnp1033\langfenp1033 {\insrsid Hello}{\insrsid \charrsid World \par Document Architectures \par }}

Example 2. "Никола" (a word in Cyrillic) saved with a Latin code page:
……………\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl8\pnlcltr\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl9\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}\pard\plain \ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 \fs24\lang2057\langfe1033\cgrid\langnp2057\langfenp1033 {\lang1049\langfe1033\langnp1049 \'cd\'e8\'ea\'ee\'eb\'e0\par }}
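The \'xx sequences in Example 2 are 8-bit character codes written as two hex digits. A small Python sketch shows how they map back to the Cyrillic word. The choice of Windows code page 1251 for decoding is an assumption made for this illustration (it is suggested by \lang1049, the language identifier for Russian), and the function name decode_hex_escapes is hypothetical.

```python
import re


def decode_hex_escapes(rtf, codepage="cp1251"):
    """Collect the \\'xx escapes of an RTF fragment and decode them as text.

    Assumption: the bytes are in Windows code page 1251 (Cyrillic).
    """
    hex_pairs = re.findall(r"\\'([0-9a-fA-F]{2})", rtf)
    raw = bytes(int(pair, 16) for pair in hex_pairs)
    return raw.decode(codepage)


if __name__ == "__main__":
    fragment = r"{\lang1049\langfe1033\langnp1049 \'cd\'e8\'ea\'ee\'eb\'e0\par }"
    print(decode_hex_escapes(fragment))
```

Passing a Latin code page such as "cp1252" instead yields accented Latin letters, which shows why the code page has to be known to interpret 8-bit text correctly.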

MS Word and PowerPoint (.doc, .ppt)
.doc and .ppt are binary document formats.
Example 1. "Hello World/Document Architectures" in a .doc file:
………………… Ъ Ак‹РїУВј 0 м Ъ Ъ + 0 [ Ъ № № Ъ 0 ј ј Ё Ё Ё Ё Щ Hello World Document Architectures ……………………

PostScript (.ps)
PostScript is a programming (or page description) language optimized for printing graphics and text.
Introduced by Adobe; an alternative, Printer Control Language (PCL), was introduced by HP.
The main purpose of PostScript was to provide a convenient language in which to describe images in a device-independent manner.
Requires a PostScript (capable) printer.

PostScript (.ps) (contd.)
Example PostScript program:
%!
% Sample of printing text
/Times-Roman findfont   % Get the basic font
20 scalefont            % Scale the font to 20 points
setfont                 % Make it the current font
newpath                 % Start a new path
72 72 moveto            % Lower left corner of text at (72, 72)
(Hello, world!) show    % Typeset "Hello, world!"
showpage

Portable Document Format (.pdf)
Adobe's binary document format based on the PostScript language.
Widely used for submitting digital files for printing.
PDF is designed to preserve all the fonts, formatting, graphics, and color of any source document, regardless of the application and platform used to create it.

Structured vs. Unstructured Documents
Highly structured text document formats include, for instance, SGML, HTML and LaTeX.
Structure in a document makes it easier to process the document automatically.
However, more structure in a document leads to a bigger file.
