

Character Encoding, Fonts

Overview Why do character encoding and fonts matter to linguists? How can you identify problems? Why do these problems arise? How might they be solved?

Why does this matter to linguists?
We work in a global market…
–You might receive a file which you cannot read.
–You might be asked to translate a file which has been encoded in a way that is not supported by the translation tools available to you.
–Even if you’re not directly responsible for a particular language, you might be responsible for a translation project overall.

You might visit a web page that looks like this… …in fact, you may be asked to translate a web page that looks like this!

You might open a file to edit it, and find it looks like this:

Or this…

Why does this happen?
No standard has been universally implemented to store and transfer characters. Typically different standards are used by:
–Different applications
–Different types of computer
–Different languages
–Different countries (or locales)
Alternatively, the character may have been correctly interpreted but is not covered by the font in use.

Alright, alright… …how does it work?

Bits and bytes
Computers deal in binary digits, known as bits. Each bit has a value of 0 or 1. Bits are stored and transferred in groups known as bytes. A byte has 8 bits.
–The value of a byte is a number between 0 (00000000) and 255 (11111111).
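
As a small illustration (not part of the original slides), the relationship between a byte's numeric value and its 8 bits can be sketched in Python. The helper name byte_to_bits is made up for this example.

```python
# Show how a byte value (0-255) maps onto 8 binary digits.
def byte_to_bits(value: int) -> str:
    """Return the 8-digit binary form of a byte value."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return format(value, "08b")  # pad with leading zeros to 8 digits

print(byte_to_bits(0))    # 00000000
print(byte_to_bits(255))  # 11111111
print(byte_to_bits(65))   # 01000001
```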

Characters and numbers Computers must be able to associate the characters used in human writing systems with binary values. Every character is associated with a number. The number is called a code point.
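
A minimal sketch of the character-to-number link, assuming Unicode as the coded character set (which is what Python's built-in ord() and chr() use). The helper code_point is a hypothetical name chosen for this example.

```python
# ord() returns the code point of a character; chr() goes the other way.
def code_point(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["a", "é", "я", "中"]:
    print(ch, ord(ch), code_point(ch))

# Going back from a number to a character:
print(chr(0x4E2D))  # 中
```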

A coded character set relates each character in a character repertoire to a code point using a table. (Example from Roman Czyborra:

The range of characters encoded
Limited by the size of the table used in the coded character set. In turn, this is limited by the number of bits used to encode characters:
–8 bits offer 256 combinations
The number of bits available is limited by the number of bytes used to store a character:
–1 byte per character = 256 possible combinations
–2 bytes per character = 65,536 possible combinations
Which brings us to…

…Character encoding schemes
A character encoding scheme relates each code point in a coded character set to a byte or series of bytes.
–This is how character codes are packaged for storage and transfer.
Problems occur when applications misinterpret these sequences of bytes.
–One character can be mistaken for another.
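
The misinterpretation described above can be reproduced in a few lines of Python. This is only a sketch: the function name misread and the sample word are assumptions chosen for illustration, and the pairing of UTF-8 with Latin 1 is just one common case.

```python
# Write text with one encoding scheme, then read the bytes back with another.
def misread(text: str, written_as: str, read_as: str) -> str:
    stored = text.encode(written_as)   # characters -> bytes
    return stored.decode(read_as)      # bytes -> (possibly wrong) characters

print("café".encode("utf-8"))               # b'caf\xc3\xa9'
print(misread("café", "utf-8", "latin-1"))  # cafÃ©
print(misread("café", "utf-8", "utf-8"))    # café
```

The two bytes that UTF-8 uses for "é" are each read as a separate Latin 1 character, which is why one character turns into two wrong ones.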

Coded character sets for alphabets
Region or language group covered | ISO standard | Other name | Other available character encodings
Western Europe | ISO 8859-1 | Latin 1 | CP1252 (WinLatin1), Unicode
Languages which use the Cyrillic alphabet | ISO 8859-5 | Cyrillic | CP1251 (WinCyrillic), KOI8-R, KOI8-U, Unicode
Arabic (missing the 4 additional letters required for Farsi and 8 required for Urdu) | ISO 8859-6 | Arabic | CP1256 (WinArabic), Unicode
Modern Greek | ISO 8859-7 | Greek | CP1253 (WinGreek), Unicode
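
To see why the choice of coded character set matters, the sketch below decodes one and the same byte under several of the single-byte sets listed above. The function name interpret is hypothetical; the encoding names are those of Python's standard codecs.

```python
# One byte value, several coded character sets, several different characters.
def interpret(byte_value: int, encodings: list[str]) -> dict[str, str]:
    raw = bytes([byte_value])
    return {enc: raw.decode(enc) for enc in encodings}

result = interpret(0xE1, ["latin-1", "cp1251", "iso8859_7"])
for enc, ch in result.items():
    print(f"{enc:10} {ch}  U+{ord(ch):04X}")
```

Under Latin 1 the byte is "á", under CP1251 it is the Cyrillic "б", and under ISO 8859-7 it is the Greek "α".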

Coded character set families for other writing systems
EUC (Extended Unix Code):
–Used for Chinese, Japanese and Korean (CJK) languages.
Big5 and GB (Guojia Biaozhun, 国家标准):
–Big5 is used in Taiwan and Hong Kong, also known as Chinese Traditional
–GB is used in the PRC and Singapore, also known as Chinese Simplified
JIS (Japanese Industrial Standards):
–Used for Japanese
Unicode
All use multibyte character encoding schemes.
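
A rough sketch of what "multibyte" means in practice: the same two Chinese characters are encoded with several of the families named above and the resulting byte counts compared. The function name encoded_lengths and the sample text are assumptions for this example; the codec names are Python's.

```python
# Count how many bytes each encoding needs for the same text.
def encoded_lengths(text: str, encodings: list[str]) -> dict[str, int]:
    return {enc: len(text.encode(enc)) for enc in encodings}

sample = "中文"  # two characters
for enc, size in encoded_lengths(sample, ["big5", "gb2312", "euc_jp", "utf-8"]).items():
    print(f"{enc:8} {size} bytes")
```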

Unicode
Organised into 16-bit planes, each holding up to 65,536 characters.
–Characters from nearly all contemporary writing systems, including both alphabetic and ideographic scripts… plus additional characters
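
Because each plane covers 65,536 (2^16) code points, the plane a character belongs to can be worked out by dropping the lowest 16 bits of its code point. A minimal sketch, with plane as a made-up helper name:

```python
# Plane number = code point divided by 65,536 (i.e. shifted right by 16 bits).
def plane(ch: str) -> int:
    return ord(ch) >> 16

print(plane("A"))           # 0
print(plane("中"))           # 0
print(plane("\U0001F600"))  # 1 (an emoji, outside the first plane)
```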

Unicode Transformation Formats
The Unicode standard is a coded character set. Different encoding schemes, or Unicode Transformation Formats, are used:
–UTF-8: uses a 1, 2, 3, or 4 byte sequence for each character
  Now the most common encoding on the WWW
  Less efficient than legacy systems for CJK characters (typically 3 bytes rather than 2)
–UTF-16: uses 2 bytes for most characters, and 4 bytes for characters outside the first plane
  Known as Unicode in Microsoft applications
  Doubles the size of ASCII/ISO 8859 files
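
The byte counts can be checked directly. The sketch below is illustrative only: the helper name utf_sizes is an assumption, and "utf-16-le" is used rather than "utf-16" because Python's plain "utf-16" codec prepends a 2-byte byte order mark that would distort the counts.

```python
# Compare how many bytes UTF-8 and UTF-16 need for individual characters.
def utf_sizes(ch: str) -> tuple[int, int]:
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in ["a", "é", "中", "\U0001F600"]:
    utf8, utf16 = utf_sizes(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8} bytes  UTF-16: {utf16} bytes")
```

Plain ASCII text takes 1 byte per character in UTF-8 and 2 in UTF-16, while a CJK character takes 3 bytes in UTF-8 and 2 in UTF-16.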

Fonts
Character encoding schemes deal with abstract characters:
–“a” is the same character as “a”, whatever typeface it is displayed in
By contrast, fonts are collections of specific forms or images. Different fonts can be applied to a single encoded character.

Some solutions
Problems should be reduced by implementation of Unicode.
Web pages: experiment with settings in your browser
–In IE: use Encoding on the View menu
Text files:
–Open the file in Microsoft Word; Word should identify the character encoding scheme used
–Save as Plain Text and select the required character encoding
Check font settings:
–Use Unicode fonts for multilingual documents
Send text as an attachment (rather than in the message body)
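
The "open, identify the encoding, save in the required encoding" routine for text files can also be scripted. The following is a hedged sketch rather than a reliable detector: the function name decode_with_fallback and the candidate list are assumptions, and a decode that succeeds does not prove the guess was right, so the result should still be checked by eye.

```python
# Try a list of candidate encodings in order and keep the first that decodes.
def decode_with_fallback(data: bytes, candidates=("utf-8", "cp1252")) -> tuple[str, str]:
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot explain the bytes; try the next
    raise ValueError("none of the candidate encodings fits these bytes")

# Bytes as they might arrive from a legacy Western European file.
legacy_bytes = "café".encode("cp1252")
text, guessed = decode_with_fallback(legacy_bytes)
print(guessed, text)

# Re-save as UTF-8 so that other tools read the text consistently.
utf8_bytes = text.encode("utf-8")
```

UTF-8 is tried first because random legacy bytes are rarely valid UTF-8, so a successful UTF-8 decode is a fairly strong signal.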