Digital Text Primer Prepared for: AIEA Roundtable on Digitization of Armenian Documents Saturday 7 October 2006, University of Geneva, Switzerland Roland.


Digital Text Primer Prepared for: AIEA Roundtable on Digitization of Armenian Documents Saturday 7 October 2006, University of Geneva, Switzerland Roland Telfeyan Robert Coffin October 2, 2006 Charlotte, North Carolina

2 Contents Text encoding • ASCII Problem • Unicode Solution OCR • ABBYY FineReader • Sample scans

3 1963: ASCII Telegraph machines American Standard Code for Information Interchange (ASCII) 128 numbers representing: • Printed characters, like ‘A’, ‘B’, ‘+’, ‘=’, etc. • Commands to control the print head of the teletype, like “carriage return”, “line feed”, “tab”, “backspace”, etc.

4 ASCII: Cont’d
0 nul 1 soh 2 stx 3 etx 4 eot 5 enq 6 ack 7 bel
8 bs 9 ht 10 nl 11 vt 12 np 13 cr 14 so 15 si
16 dle 17 dc1 18 dc2 19 dc3 20 dc4 21 nak 22 syn 23 etb
24 can 25 em 26 sub 27 esc 28 fs 29 gs 30 rs 31 us
32 sp 33 ! 34 " 35 # 36 $ 37 % 38 & 39 '
40 ( 41 ) 42 * 43 + 44 , 45 - 46 . 47 /
48 0 49 1 50 2 51 3 52 4 53 5 54 6 55 7
56 8 57 9 58 : 59 ; 60 < 61 = 62 > 63 ?
64 @ 65 A 66 B 67 C 68 D 69 E 70 F 71 G
72 H 73 I 74 J 75 K 76 L 77 M 78 N 79 O
80 P 81 Q 82 R 83 S 84 T 85 U 86 V 87 W
88 X 89 Y 90 Z 91 [ 92 \ 93 ] 94 ^ 95 _
96 ` 97 a 98 b 99 c 100 d 101 e 102 f 103 g
104 h 105 i 106 j 107 k 108 l 109 m 110 n 111 o
112 p 113 q 114 r 115 s 116 t 117 u 118 v 119 w
120 x 121 y 122 z 123 { 124 | 125 } 126 ~ 127 del
No indication of type appearance. Only numbers representing letters.
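The point that ASCII stores only numbers, with no information about appearance, can be illustrated with a short Python sketch. The sample string and the helper name `is_control` are illustrative assumptions, not part of the original slides.

```python
# Each character is stored only as a number; nothing about its
# appearance (typeface, size, weight) is recorded.
text = "Ab+="
codes = [ord(ch) for ch in text]
print(codes)  # [65, 98, 43, 61]

def is_control(code: int) -> bool:
    """True for the ASCII control codes (0-31 and 127, 'del')."""
    return code < 32 or code == 127

# The remaining codes, 32 (space) through 126 ('~'), are printable.
printable = [chr(c) for c in range(128) if not is_control(c)]
print(len(printable))  # 95
```

Going the other way, `chr(98)` gives back `'b'`, the same code the hard-wired keyboard on the next slide emits.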

5 Early Keyboards Keyboards were “hard-wired.” To get a lowercase ‘b’, you press the [B] key, making the keyboard emit code 98.

6 Mid 1970’s: Computer Fonts An array of glyphs, one per ASCII code Character code 97 (‘a’) can be rendered variously: a, a, a, ա,...

7 Significance of Fonts Fonts were the first flexible mapping interposed between the hard-wired keyboard and the printed glyphs. This technology made the Macintosh famous.

8 Font Design Dilemma Now 97 can mean not only ‘a’ but ‘ ա ’. However, should 98 mean ‘ բ ’ or ‘ պ ’?

9 Font Design Dilemma Font designers assigned glyphs to specific character codes that satisfied their own personal keyboard layout preferences. An Armenian text file could not be viewed reliably in absence of the font used to create it.
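A minimal sketch of the dilemma, assuming two made-up font tables: the names `FONT_A` and `FONT_B` and their code-to-glyph assignments are hypothetical and only mirror the 97/98 example from the slides. The same stored codes render as different Armenian text depending on which font is installed.

```python
# Hypothetical glyph tables for two legacy (pre-Unicode) Armenian fonts.
# Both reuse ASCII codes, but their designers disagreed about code 98.
FONT_A = {97: "ա", 98: "բ"}  # designer A put 'բ' on the code for 'b'
FONT_B = {97: "ա", 98: "պ"}  # designer B put 'պ' on the same code

def render(codes, font):
    """Draw each stored code with the glyph the given font assigns to it."""
    return "".join(font.get(code, chr(code)) for code in codes)

stored = [98, 97]  # the only thing actually saved in the file

print(render(stored, FONT_A))  # բա
print(render(stored, FONT_B))  # պա
```

The file itself never changes; only the font does, which is why the text cannot be read reliably without the original font.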

to 2006: NeXT to Mac Steve Jobs (whose adoptive mother was a Hagopian) founded NeXT, whose computer had user-definable keyboard layouts. Today’s Mac OS X is built largely on NeXT’s operating system. Today, the placement of letters on a keyboard is a user preference, like the location of windows on a screen.

11 Unicode The character set has been extended to allow for more than 95,000 characters. The goal is a set of standard character codes for every known language. For the first time, Armenian (and other) characters have their own codes, defined by a de-facto international standard.

12 Unicode (Cont'd) The Unicode Character Set is a standard definition of character codes for the characters of most known languages. The Armenian block runs from 1328 to 1423 (U+0530 to U+058F), a range of 96 code points, not all of which are assigned.
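As a rough check of the Armenian range, the following Python sketch walks the code points 1328 through 1423 and prints the ones that the interpreter's Unicode database knows by name. The constant names are illustrative; which code points have names depends on the Unicode version bundled with Python.

```python
import unicodedata

ARMENIAN_FIRST = 0x0530  # 1328 in decimal
ARMENIAN_LAST = 0x058F   # 1423 in decimal

# Counting both endpoints, the range holds 96 code points.
block_size = ARMENIAN_LAST - ARMENIAN_FIRST + 1
print(block_size)  # 96

for code in range(ARMENIAN_FIRST, ARMENIAN_LAST + 1):
    # name() returns the default (None) for unassigned code points.
    name = unicodedata.name(chr(code), None)
    if name is not None:
        print(code, f"U+{code:04X}", chr(code), name)
```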

13 But I like my old system If you want Armenian, Georgian, Greek, Hebrew, Arabic, Chinese, and more all on the same page using one font with a consistent look, … If you want to type using your own key layout, … If you want others to be able to read your text in absence of the font or keyboard layout or computer system you used, … … use Unicode.

14 But I have a lot of ASCII Unicode conversion tools at:

15 95,000 Glyphs? With more than 95,000 potential glyphs in a Unicode font, any one font can represent multiple language scripts. How can a computer keyboard address all these characters? User-defined keyboard layouts map selected characters in the Unicode font to the physical keyboard.
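A user-defined keyboard layout can be thought of as a small lookup table from physical keys to Unicode characters. The sketch below assumes two hypothetical layouts (`layout_one`, `layout_two`); they are illustrations, not actual system layouts. Two users press different keys yet produce identical Unicode text.

```python
# Hypothetical user-defined layouts: physical key -> Unicode character.
LAYOUTS = {
    "layout_one": {"a": "ա", "b": "բ"},  # this user keeps 'բ' on the B key
    "layout_two": {"a": "ա", "p": "բ"},  # this user prefers 'բ' on the P key
}

def type_keys(keys: str, layout_name: str) -> str:
    """Translate a sequence of key presses using the chosen layout."""
    layout = LAYOUTS[layout_name]
    return "".join(layout.get(key, key) for key in keys)

text_one = type_keys("ba", "layout_one")
text_two = type_keys("pa", "layout_two")

print(text_one == text_two)          # True
print([ord(ch) for ch in text_one])  # [1378, 1377]
```

Only the resulting character codes are stored, so the layout that produced them stays a private preference of the person typing.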

16 Review: Two Main Points Keyboard layouts are user preferences that have nothing to do with legibility of text on another system. Unicode text is legible in absence of the fonts or keyboard mappings or possibly the application used to compile it.

17 [Diagram: Physical Keyboard → ASCII 67 (code saved in file) → Kevork font renders “K”; Tigran font renders “G”.] Different fonts had different glyphs for the same character.

18 [Diagram: Physical Keyboard → Key Code 67 → Virtual Keyboard (user-selected keyboard preference: Armenian, Georgian, Hebrew) → Any Unicode standard font (multilingual): ABCD… ΑΒΓΔ… אבגד… ԱԲԳԴ… ႠႡႢႣ…] Armenian letter “Gim” (“Գ”): Unicode 0533. Hebrew letter “Gimel” (“ג”): Unicode 05D2. Georgian letter “Gan” (“Ⴂ”): Unicode 10A2. Unicode characters are saved in the text file: the same Unicode character code for the same character, regardless of font.
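The three letters from the diagram can be inspected directly in Python. The sketch assumes UTF-8 as the file encoding, which the slides do not specify; the variable names are illustrative.

```python
import unicodedata

# The three letters from the diagram, written by code point.
letters = ["\u0533", "\u05D2", "\u10A2"]

for ch in letters:
    # Code point, official character name, and the bytes that would be
    # written to a UTF-8 file (UTF-8 is an assumption of this sketch).
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())

# Saving and re-reading the text needs no font or keyboard layout:
saved_bytes = "".join(letters).encode("utf-8")
restored = saved_bytes.decode("utf-8")
print(restored == "".join(letters))  # True
```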

19 OCR ABBYY FineReader is commercial multilingual OCR software that recognizes Armenian and many other languages. Built-in dictionaries assist in checking accuracy, and all text is handled through Unicode.

20 FineReader The program is simple yet powerful. The program links each letter of text with its location in the scanned image, for fast proofreading.

21 FineReader (Cont’d) Ample control over page layout Tools to automate large batches Outputs Word, PDF, HTML, XML, …

22 OCR: Results Armenian accuracy depends on the typeface and on the richness of the internal dictionary. Arial Armenian: ~99.9%. Times, Aramian, Nork: ~96%. Երկաթագիր (Erkatagir) and Գրաբար (Grabar) manuscripts: not too good, ~70%.
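The slides do not say how these accuracy figures were measured. One common way to score OCR output, sketched here purely as an assumption, is character accuracy: one minus the edit distance between the recognized text and a hand-checked reference, divided by the reference length. The function names and the sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))

reference = "Հայաստան"   # hand-checked text (8 letters)
ocr_output = "Հայասաան"  # hypothetical OCR result with one wrong letter
print(character_accuracy(reference, ocr_output))  # 0.875
```

Under this measure, ~99.9% would correspond to roughly one wrong character per thousand and ~70% to roughly three in ten.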

23 FineReader: Conclusion Tuned for modern, Arial-like letters. We are working with ABBYY to improve recognition rates on old manuscripts and books.

24 Screen Shots On the next slides are: • A screenshot of FineReader • A scanned image • MS Word output

25 FineReader Screen Recognized Text Scan

26 Example Original Scan

27 MS Word Text Output

28 Further Information Questions, suggestions, and corrections are welcome. Updates will be posted to