Character Encoding, Fonts

Overview
Why do character encoding and fonts matter to linguists?
How can you identify problems?
Why do these problems arise?
How might they be solved?

Why does this matter to linguists?
We work in a global market…
–You might receive a file or email which you cannot read.
–You might be asked to translate a file which has been encoded in a way that is not supported by the translation tools available to you.
–Even if you’re not directly responsible for a particular language, you might be responsible for a translation project overall.

You might visit a web page that looks like this… …in fact, you may be asked to translate a web page that looks like this!

You might open a file to edit it, and find it looks like this:

Or this…

Why does this happen?
No standard has been universally implemented to store and transfer characters.
Typically different standards are used by:
–Different applications
–Different types of computer
–Different languages
–Different countries (or locales)
Alternatively, the character has been correctly interpreted but is not covered by the font in use.

Alright, alright… …how does it work?

Bits and bytes
Computers deal in binary digits, known as bits.
Each bit has a value of 0 or 1.
Bits are stored and transferred in groups known as bytes.
A byte has 8 bits.
–The value of a byte is a number between 0 (00000000) and 255 (11111111)
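As a minimal sketch in Python (not part of the original slides), the relation between the eight bits of a byte and its numeric value can be checked directly. The variable names are illustrative only.

```python
# Smallest and largest values that fit in one byte (8 bits),
# parsed from their binary spellings.
lowest = int("00000000", 2)
highest = int("11111111", 2)

# Format a number back into its 8-bit binary form.
bits_of_65 = format(65, "08b")

print(lowest, highest, bits_of_65)
```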

Characters and numbers
Computers must be able to associate the characters used in human writing systems with binary values.
Every character is associated with a number.
The number is called a code point.
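A short illustration, assuming Python, whose built-in ord() and chr() work with Unicode code points (for the characters shown here these coincide with ASCII and ISO 8859-1 values):

```python
# ord() returns the code point of a character; chr() goes the other way.
code_point_a = ord("A")
character_back = chr(code_point_a)

# U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
code_point_e_acute = ord("\u00e9")

print(code_point_a, character_back, code_point_e_acute)
```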

A coded character set relates each character in a character repertoire to a code point using a table. (Example from Roman Czyborra: http://czyborra.com/charsets/iso646.html)

The range of characters encoded
Limited by the size of the table used in the coded character set.
In turn, this is limited by the number of bits used to encode characters:
–8 bits offer 256 combinations
The number of bits available is limited by the number of bytes used to store a character:
–1 byte per character = 256 possible combinations
–2 bytes per character = 65,536 possible combinations
Which brings us to…
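The arithmetic behind these figures is simply 2 raised to the number of bits. A small sketch in Python, where the function name combinations is a hypothetical helper chosen for illustration:

```python
def combinations(num_bytes: int) -> int:
    """Number of distinct values that fit in num_bytes bytes of 8 bits each."""
    return 2 ** (8 * num_bytes)

one_byte = combinations(1)
two_bytes = combinations(2)
print(one_byte, two_bytes)
```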

…Character encoding schemes
A character encoding scheme relates each code point in a coded character set to a byte or series of bytes.
–This is how character codes are packaged for storage and transfer.
Problems occur when applications misinterpret these sequences of bytes.
–One character can be mistaken for another.
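One way to see such a misinterpretation, sketched in Python under the assumption that a file was saved as UTF-8 but opened as ISO 8859-1 (Latin 1); the sample word is arbitrary:

```python
text = "caf\u00e9"                    # "café"

# Package the characters as bytes using UTF-8.
stored = text.encode("utf-8")

# An application that wrongly assumes Latin 1 turns the two bytes
# of "é" into two separate characters.
misread = stored.decode("latin-1")

# Decoding with the scheme that was actually used recovers the text.
correct = stored.decode("utf-8")

print(stored, misread, correct)
```

The bytes never change; only the interpretation does, which is why choosing a different encoding in the viewing application can make garbled text readable again.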

Coded character sets for alphabets
Region or language group covered | ISO Standard | Other Name | Other available character encodings
Western Europe | ISO 8859-1 | Latin 1 | CP1252 (WinLatin1), Unicode
Languages which use the Cyrillic alphabet | ISO 8859-5 | Cyrillic | CP1251 (WinCyrillic), KOI8-R, KOI8-U, Unicode
Arabic. Missing the 4 additional letters required for Farsi and 8 required for Urdu. | ISO 8859-6 | Arabic | CP1256 (WinArabic), Unicode
Modern Greek | ISO 8859-7 | Greek | CP1253 (WinGreek), Unicode
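To illustrate why several encodings for one alphabet cause trouble, the following Python sketch encodes a single Cyrillic letter under three of the encodings listed above, then reads KOI8-R bytes as if they were CP1251. The codec names are the ones Python's standard library uses; the sample letter and word are arbitrary choices.

```python
letter = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE
encodings = ["iso8859_5", "cp1251", "koi8_r"]

# The same character is stored as a different byte in each encoding.
byte_values = {name: letter.encode(name) for name in encodings}
for name, value in byte_values.items():
    print(name, value.hex())

# Text written as KOI8-R but read as CP1251 still decodes without
# an error, yet yields different Cyrillic letters.
original = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет"
garbled = original.encode("koi8_r").decode("cp1251")
print(original, garbled)
```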

Coded character set families for other writing systems
EUC (Extended Unix Code):
–Used for Chinese, Japanese and Korean (CJK) languages.
Big5 and GB (Guojia Biaozhun - 国家标准):
–Big5 used in Taiwan and Hong Kong - also known as Chinese Traditional
–GB used in PRC and Singapore - also known as Chinese Simplified
JIS (Japanese Industrial Standard):
–Used for Japanese
Unicode
All use multibyte character encoding schemes
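A brief Python sketch of the multibyte point: one Chinese character is encoded with a representative codec from each family. The choice of gb2312, euc_jp and shift_jis as representatives is an assumption made for illustration; each family includes other variants.

```python
character = "\u4e2d"  # 中, used in both Chinese and Japanese

# One representative codec per family (illustrative selection).
encodings = ["big5", "gb2312", "euc_jp", "shift_jis"]

# Count how many bytes each encoding needs for the single character.
lengths = {name: len(character.encode(name)) for name in encodings}
for name, length in lengths.items():
    print(name, length)
```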

Unicode
16-bit planes: each holds up to 65,536 characters
–Characters from nearly all contemporary writing systems, including both alphabetic and ideographic scripts… plus additional characters
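Because each plane spans 65,536 (2 to the power 16) code points, the plane of a character can be found by discarding the low 16 bits of its code point. A minimal Python sketch, with plane_of as a hypothetical helper name:

```python
def plane_of(character: str) -> int:
    """Return the Unicode plane number of a single character."""
    return ord(character) >> 16

latin_plane = plane_of("A")
cjk_plane = plane_of("\u4e2d")         # 中
emoji_plane = plane_of("\U0001F600")   # an emoji beyond the first plane

print(latin_plane, cjk_plane, emoji_plane)
```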

Unicode Transformation Formats
The Unicode standard is a coded character set.
Different encoding schemes, or Unicode Transformation Formats, are used:
–UTF-8: uses a sequence of 1, 2, 3, or 4 bytes for each character
  The most common encoding on the WWW
  Less efficient than legacy systems for CJK characters
–UTF-16: uses 2 bytes for most characters (4 bytes for characters outside the first plane)
  Known as Unicode in Microsoft applications
  Doubles the size of ASCII/ISO 8859 files
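The size differences can be measured directly. The Python sketch below counts bytes per character in UTF-8 and UTF-16; the sample characters are arbitrary, and the utf-16-le codec is used so that no byte order mark is added to the count. Note that a character outside the first plane takes 4 bytes in UTF-16.

```python
samples = ["a", "\u00e9", "\u4e2d", "\U0001F600"]  # a, é, 中, an emoji

# Map each character to (bytes in UTF-8, bytes in UTF-16).
sizes = {
    ch: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    for ch in samples
}
for ch, (utf8_len, utf16_len) in sizes.items():
    print(f"U+{ord(ch):04X}", utf8_len, utf16_len)

# Plain ASCII text doubles in size when stored as UTF-16.
ascii_size = len("hello".encode("ascii"))
utf16_size = len("hello".encode("utf-16-le"))
print(ascii_size, utf16_size)
```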

Fonts
Character encoding schemes deal with abstract characters:
–“a” is the same character as “a”, whichever typeface it is drawn in
By contrast, fonts are collections of specific forms or images.
Different fonts can be applied to a single encoded character.

Some solutions
Problems should be reduced by implementation of Unicode.
Web pages - experiment with settings in your browser:
–In IE: use Encoding on the View menu
Text files:
–Open file in Microsoft Word
  Word should identify the character encoding scheme used
–Save as Plain Text
  Select required character encoding
Check font settings
–Use Unicode fonts for multilingual documents
Send text as email attachment (rather than in body)
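The "open, then save with the required encoding" step for text files can also be done with a few lines of script. This is only a sketch in Python, not a method from the slides: convert_to_utf8 is a hypothetical helper, and it assumes you already know (or have guessed) the encoding the file was saved in.

```python
from pathlib import Path

def convert_to_utf8(source: Path, target: Path, source_encoding: str) -> None:
    """Read a text file in a known legacy encoding and rewrite it as UTF-8."""
    # Decode the bytes using the encoding the file was actually saved in.
    text = source.read_text(encoding=source_encoding)
    # Re-encode the same characters as UTF-8.
    target.write_text(text, encoding="utf-8")
```

For example, convert_to_utf8(Path("letter.txt"), Path("letter-utf8.txt"), "cp1251") would re-save a Windows Cyrillic file; the file names here are placeholders. If the wrong source encoding is given, the result may be garbled or Python may raise a UnicodeDecodeError.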
