Published by Marcus Andrews. Modified over 9 years ago.
1
Software Internationalisation — Single-Byte Scripts Guy Lacoursière Software Globalisation Consultant
2
Agenda
  Deliverables
  Definitions
  Scripts
    Latin scripts
    Greek
    Hebrew
  Cumulative testing
  Sorting (optional)
  References
3
Deliverables: English Internationalized Products
We currently support Latin 1 and Asian character sets:
  ISO-8859-1: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, French, German, Icelandic, Indonesian, Italian, Norwegian, Portuguese, Spanish, Swedish
  Multibyte character sets: Japanese, traditional Chinese, simplified Chinese, Korean
Newly supported character sets:
  ISO-8859-2: Albanian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian, Slovak, Slovenian
  ISO-8859-7/8/9: Greek, Hebrew, Turkish
Complex languages are not supported: Thai, Indic languages, Arabic
Goal: Unicode
4
Definitions
Script
  System of characters composed of:
    Letters, syllables or ideographs (with one or more possible directions)
    Punctuation symbols
    Numbers ( 0 1 2 3 4 5 6 7 8 9 ¼ ½ ¾ )
    Other symbols ( ® $ # % & ± ° _ @ )
  A language may use several scripts, and a script may serve several languages.
Character set (or code page, or coded character set)
  Ordered group of characters assigned to code points.
Encoding
  System defining the storage mechanism for a given character set.
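The difference between a character set and its encoding can be sketched in Python. This is only an illustration: the helper name `describe` is hypothetical, and UTF-8 is included purely as a contrast with the single-byte sets, since the stated goal is Unicode.

```python
def describe(char, encodings=("iso8859_1", "iso8859_2", "utf_8")):
    """Return {encoding: stored bytes as hex}; None when the set lacks the character."""
    stored = {}
    for name in encodings:
        try:
            stored[name] = char.encode(name).hex()
        except UnicodeEncodeError:
            # The character is not part of this character set.
            stored[name] = None
    return stored

print(describe("é"))  # same single byte in Latin 1 and Latin 2, two bytes in UTF-8
print(describe("ł"))  # absent from Latin 1, one byte in Latin 2
```

The same character can therefore be stored as different byte sequences, or not be storable at all, depending on the character set and encoding chosen.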
5
Single-Byte Character Sets
  Each character is stored in a single 8-bit byte.
  The character set does not exceed 256 code points.
  The encoding is the order of the character set code points: the stored byte value is the code point.
  A given code point may represent a different character depending on the character set.
  The first 128 code points (0 to 127) are always the same.
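A short Python sketch makes the last two points concrete. It assumes the standard Python codec names for the ISO 8859 sets; the names `CHARSETS`, `decode_everywhere` and `lower_half_is_shared` are hypothetical.

```python
CHARSETS = {
    "Latin 1 (ISO 8859-1)": "iso8859_1",
    "Latin 2 (ISO 8859-2)": "iso8859_2",
    "Greek (ISO 8859-7)": "iso8859_7",
    "Hebrew (ISO 8859-8)": "iso8859_8",
    "Latin 5 (ISO 8859-9)": "iso8859_9",
}

def decode_everywhere(byte_value):
    """Interpret one stored byte under each character set."""
    raw = bytes([byte_value])
    return {label: raw.decode(codec) for label, codec in CHARSETS.items()}

def lower_half_is_shared():
    """True if code points 0 to 127 decode identically in every set above."""
    lower_half = bytes(range(128))
    return len({lower_half.decode(codec) for codec in CHARSETS.values()}) == 1

for label, char in decode_everywhere(0xE6).items():
    print(f"0xE6 in {label}: {char}")
print("First 128 code points shared:", lower_half_is_shared())
```

A file written in one character set and read in another shows exactly this effect: the bytes are unchanged, but the upper-half characters come out wrong.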
6
Latin Scripts: Latin 1 Character Set (ISO 8859-1)
Languages covered
  Afrikaans, Albanian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, Swahili, Swedish
Notes
  Uppercase and lowercase letters have separate code points even though they are two forms of the same letter.
  Some letters have no uppercase (for example ß and ÿ).
  The base characters are the same for all Latin character sets.
Base characters
  a b c d e f g h i j k l m n o p q r s t u v w x y z
  0 1 2 3 4 5 6 7 8 9
  ! " ' ( ) , . : ; ? [ ] ^ { | } ~ # $ % & ÷ × + - * / = \ _
Extended characters
  àÀ áÁ âÂ ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË ðÐ íÍ îÎ ïÏ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ýÝ ÿ þÞ
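The note about letters without an uppercase form can be checked with a small Python sketch. The function name `has_latin1_uppercase` is hypothetical, and the check relies on Python's Unicode case mapping rather than on any Latin 1 specific table.

```python
def has_latin1_uppercase(letter):
    """True if the letter has a one-character uppercase form that exists in ISO 8859-1."""
    upper = letter.upper()
    if upper == letter or len(upper) != 1:
        # No distinct uppercase, or it expands to several characters (ß -> SS).
        return False
    try:
        upper.encode("iso8859_1")
    except UnicodeEncodeError:
        # The uppercase exists in Unicode but not in Latin 1 (ÿ -> Ÿ).
        return False
    return True

for letter in "éßÿ":
    print(letter, has_latin1_uppercase(letter))
```

Code that uppercases Latin 1 text in place has to decide what to do with such letters, since the result may not fit in the same character set or the same number of bytes.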
7
Latin Scripts: ISO 8859-1 vs. Windows 1252
Microsoft Windows' Latin 1 character set (code page 1252) is different from ISO 8859-1. It contains 27 extra characters in the range 0x80 to 0x9F, among others:
  The euro symbol (€)
  The English curly quotes ( “ ” )
  The ellipsis (…)
  The German opening quotes ( „ )
  The bullet (•)
  The en dash (–)
  The em dash (—)
  The French uppercase and lowercase oe ligatures (œ Œ)
  The English trademark symbol (™)
These may not display correctly in non-Latin 1 systems.
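The extra Windows 1252 characters can be listed with a few lines of Python. This sketch assumes Python's built-in `cp1252` codec, which leaves five positions in the 0x80 to 0x9F range unassigned; the function name `cp1252_extras` is hypothetical.

```python
def cp1252_extras():
    """Map each byte in 0x80..0x9F to the printable character code page 1252 assigns to it."""
    extras = {}
    for value in range(0x80, 0xA0):
        try:
            extras[value] = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # position left unassigned in code page 1252
    return extras

extras = cp1252_extras()
print(len(extras), "extra characters:", " ".join(extras.values()))
```

In ISO 8859-1 the same byte values are control codes, which is why text containing these characters tends to break when a Windows 1252 file is labelled or read as ISO 8859-1.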
8
Latin Scripts: ISO 8859-1 vs. Windows 1252
[Code charts: Latin 1 (ISO 8859-1) and Windows code page 1252]
9
Latin Scripts: Latin 2 Character Set (ISO 8859-2)
Languages covered
  Czech, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian, Sorbian
Notes
  Some characters are duplicates from the Latin 1 character set.
  The caron diacritic has two forms: “ ˇ ” and “ ’ ”.
  The T with cedilla has a glyph variant (T with comma) for Romanian.
  Latin 2 characters common to Latin 1 use identical code points.
Extended characters
  ąĄ áÁ âÂ ăĂ äÄ ćĆ çÇ čČ ďĎ éÉ ęĘ ëË ěĚ đĐ íÍ îÎ łŁ ľĽ ĺĹ ńŃ ňŇ óÓ ôÔ őŐ öÖ ŕŔ řŘ śŚ šŠ şŞ § ß ťŤ ţŢ ůŮ úÚ űŰ üÜ ýÝ źŹ žŽ żŻ
10
ISO 8859-1 vs. ISO 8859-2
[Code charts: Latin 1 (ISO 8859-1) and Latin 2 (ISO 8859-2)]
11
ISO 8859-1 vs. ISO 8859-2
  All common characters have the same code points.
  Characters that are different belong to separate language families (mostly West European vs. East European).
  This allows a certain level of flexibility between languages.
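The claim that shared characters keep their code points can be verified mechanically. The following Python sketch compares the upper halves (0xA0 to 0xFF) of two character sets using Python's built-in codecs; the helper names `upper_half` and `moved_characters` are hypothetical.

```python
def upper_half(codec):
    """Map each character in 0xA0..0xFF of a single-byte set to its code point."""
    table = {}
    for value in range(0xA0, 0x100):
        try:
            table[bytes([value]).decode(codec)] = value
        except UnicodeDecodeError:
            pass  # unassigned position
    return table

def moved_characters(codec_a, codec_b):
    """Characters present in both sets but stored at different code points."""
    a, b = upper_half(codec_a), upper_half(codec_b)
    return sorted(ch for ch in a.keys() & b.keys() if a[ch] != b[ch])

print("Latin 1 vs Latin 2:", moved_characters("iso8859_1", "iso8859_2"))
print("Latin 4 vs Latin 6:", moved_characters("iso8859_4", "iso8859_10"))
```

An empty list for Latin 1 vs Latin 2 means that text limited to the shared characters reads the same under either set, which is the flexibility mentioned above. A non-empty list, as for Latin 4 vs Latin 6, means no such shortcut exists.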
12
Latin Scripts: Latin 3 Character Set (ISO 8859-3)
Languages covered
  Esperanto, Maltese
Notes
  Covered Turkish before the introduction of Latin 5 in 1988.
  Not supported.
Extended characters
  àÀ áÁ âÂ äÄ ċĊ ĉĈ çÇ èÈ éÉ êÊ ëË ğĞ ħĦ ĥĤ ıI iİ ìÌ íÍ îÎ ïÏ ĵĴ ñÑ òÒ óÓ ôÔ öÖ şŞ ŝŜ § ß ùÙ úÚ ûÛ üÜ ŭŬ żŻ £ ¤
13
Latin Scripts: Latin 4 Character Set (ISO 8859-4)
Languages covered
  Estonian, Latvian, Lithuanian, Greenlandic, Lappish
Notes
  Not supported.
Extended characters
  ąĄ āĀ áÁ âÂ ãÃ äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ đĐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷
14
Latin Scripts: Latin 5 Character Set (ISO 8859-9)
Languages covered
  Turkish
Notes
  Very similar to Latin 1.
  The letters ð, ý and þ from Latin 1 are replaced with Turkish letters.
  Latin 5 characters common to Latin 1 use identical code points.
  Issue (case conversion): *.ini = *.İNİ, and *.ını = *.INI; *.ini ≠ *.INI, and *.ını ≠ *.İNİ
Extended characters
  àÀ áÁ âÂ ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ÿ
Replaced letters
  ðÐ ---> ğĞ
  ýÝ ---> ıİ
  þÞ ---> şŞ
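The Turkish case-conversion issue can be reproduced in Python, whose built-in `upper()` and `lower()` are locale independent. The helpers `turkish_upper` and `turkish_lower` below are hypothetical and cover only the dotted and dotless i; they are a sketch, not a full Turkish locale implementation.

```python
def turkish_upper(text):
    """Uppercase using Turkish rules: i -> İ (dotted), ı -> I (dotless)."""
    return text.replace("i", "İ").replace("ı", "I").upper()

def turkish_lower(text):
    """Lowercase using Turkish rules: I -> ı (dotless), İ -> i (dotted)."""
    return text.replace("I", "ı").replace("İ", "i").lower()

print("ini".upper())            # locale-independent result: INI
print(turkish_upper("ini"))     # Turkish result: İNİ
print(turkish_lower("INI"))     # Turkish result: ını

# In Latin 5 (ISO 8859-9) the Turkish letters take the slots of Latin 1's Ý and ý.
print("İ".encode("iso8859_9").hex(), "ı".encode("iso8859_9").hex())
```

A check such as "does this file name end in .INI" therefore gives different answers depending on which case rules are applied, so identifiers and file extensions are usually compared with locale-independent rules.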
15
Latin Scripts: Latin 6 Character Set (ISO 8859-10)
Languages covered
  Nordic area: Inuit (Greenlandic Eskimo), non-Skolt Sami (Lappish), Icelandic
Notes
  Similar characters to Latin 4, but with extra letters for the Nordic languages.
  Latin 6 characters common to Latin 4 use different code points.
  Not supported at all.
Extended characters
  ąĄ āĀ áÁ âÂ ãÃ äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ ðÐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷
16
Latin Scripts: Latin 7 & 8 Character Sets (ISO 8859-13 & 14)
Languages covered
  Latin 7: Baltic languages
  Latin 8: Celtic languages
Notes
  Similar characters to Latin 4 and 6, but with extra letters for the Nordic languages.
  Latin 7 characters common to Latin 4 and 6 use different code points.
  Latin 8 characters common to Latin 1 use identical code points.
  Not supported.
17
Latin Scripts: Latin 9 Character Set (ISO 8859-15)
Languages covered
  Same as Latin 1.
Notes
  Latin 9 characters common to Latin 1 use identical code points.
  Eight less-used Latin 1 characters are replaced at their code points:
    ¤ ---> €
    ¦ ---> Š
    ¨ ---> š
    ´ ---> Ž
    ¸ ---> ž
    ¼ ---> Œ
    ½ ---> œ
    ¾ ---> Ÿ
Extended characters
  àÀ áÁ âÂ ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ðÐ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ œŒ šŠ ß ùÙ úÚ ûÛ üÜ ýÝ ÿŸ žŽ þÞ
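The differences between Latin 1 and Latin 9 can be listed by decoding every byte value under both sets. This Python sketch uses the built-in `iso8859_1` and `iso8859_15` codecs; the function name `latin9_replacements` is hypothetical.

```python
def latin9_replacements():
    """Map each code point that differs to (Latin 1 character, Latin 9 character)."""
    changes = {}
    for value in range(0x100):
        raw = bytes([value])
        old, new = raw.decode("iso8859_1"), raw.decode("iso8859_15")
        if old != new:
            changes[value] = (old, new)
    return changes

for value, (old, new) in latin9_replacements().items():
    print(f"0x{value:02X}: {old} ---> {new}")
```

Only these positions need attention when data moves between the two sets; every other byte means the same thing in both.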
18
ISO 8859-15 vs. Windows 1252
[Code charts: Latin 9 (ISO 8859-15) and Windows 1252]
19
Latin Scripts in... Non-Latin Character Sets!
Other languages
  Traditional Chinese
  Simplified Chinese
  Japanese (romaji)
  Vietnamese
Notes
  Chinese, Japanese and Korean use Latin letters for transliteration (sometimes with tone accents) and numbers.
  Vietnamese uses Latin characters with diacritics.
  Latin characters are also used in the transliteration of Greek, Hebrew, Russian, etc.
Some Vietnamese extended characters
  đĐ ăĂ âÂ êÊ ôÔ …with tones
20
Languages Covered by Latin Character Sets
  Language      Character set (Latin-n)
  Czech         2
  Danish        1 4 5 6 7 8 9
  Dutch         1 5 9
  English       1 2 3 4 5 6 7 8 9
  Finnish       1 2 3 4 5 6 7 8 9
  French        1 3 5 8 9
  German        1 2 3 4 5 6 7 8 9
  Hungarian     2
  Italian       1 3 5 8 9
  Norwegian     1 2 3 4 5 6 7 8 9
  Polish        2 7
  Portuguese    1 3 5 8 9
  Romanian      2
  Spanish       1 8 9
  Swedish       1 4 5 6 7 8 9
  Turkish       3 5
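A rough way to see which Latin-n sets can hold a given piece of text is to try encoding it under each one. The Python sketch below assumes the built-in codec names for ISO 8859-1, 2, 3, 4, 9, 10, 13, 14 and 15; `LATIN_SETS` and `covering_latin_sets` are hypothetical names. It only tests the characters present in the sample, not a language's full alphabet.

```python
LATIN_SETS = {
    1: "iso8859_1", 2: "iso8859_2", 3: "iso8859_3",
    4: "iso8859_4", 5: "iso8859_9", 6: "iso8859_10",
    7: "iso8859_13", 8: "iso8859_14", 9: "iso8859_15",
}

def covering_latin_sets(text):
    """Return the Latin-n numbers whose character set contains every character of text."""
    covering = []
    for number, codec in LATIN_SETS.items():
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this set
        covering.append(number)
    return covering

for sample in ("plain text", "Łódź", "ğ"):
    print(sample, covering_latin_sets(sample))
```

Unaccented text fits every set, which matches the rows for English in the table, while text with language-specific letters narrows the choice quickly.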
21
Greek Script: Greek Character Set
One script, one character set, one language.
Contains modern monotonic uppercase and lowercase Greek letters, punctuation and a few accented Greek letters.
The rest is almost identical to Latin 1!
Latin 1 characters missing from the Greek set:
  Latin punctuation: ¡ ¿
  Currency symbols: ¢ ¤ ¥
  Other symbols: ® ª º × ÷ µ ¶
  Diacritics: ¸
  Numbers: ¹ ¼ ¾
Extended characters
  αβγδεζηικλμν…
  ΑΒΓΔΖΗΘΙΚΛΝΞ…
The rest...
  ² ³ ½ £ ¦ § © ¬ ¯ ° ± « » · ¨
22
Hebrew Script: Hebrew Character Set
One script, one character set:
  Hebrew
  Yiddish
Directionality of text:
  Hebrew letters are written from right to left (RTL).
  Numbers (Arabic) are written from left to right (LTR).
  Latin characters are written from left to right (LTR).
  Order of the text depends on the predominant language.
  Order of mirrored characters depends on neighboring characters.
Differences from Latin 1:
  Latin punctuation: ¡ ¿ are missing
  Currency symbol: ₪ (new sheqel) is absent
  Other symbols: ª º are missing; × ÷ have different code points
Extended characters
  תשרקעסליטחזוהדגבא
Final & nominal forms
  ך -כ ן -נ ם -מ ף -פ ץ -צ
23
Hebrew User Interface
There are two types of Hebrew support:
  Hebrew-enabled product (supporting Hebrew characters)
  Hebrew product (translated into Hebrew)
Both types must support RTL display.
Text alignment may differ for characters, strings and documents.
Normally, the logical order (or storage order or file order) is the same as the reading order.
The display order is bi-directional and does not follow the logical order.
24
Hebrew User Interface: Logical vs. Visual
Input string: "Hebrew text : ילגנא טסקט"
How should it be displayed?
  In an LTR document: Hebrew text : טקסט אנגלי
  In an RTL document: טקסט אנגלי : Hebrew text
You get different displays depending on the main direction (script) of the document or the string. Notice the direction of the colon.
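The directional behaviour of individual characters can be inspected with Python's `unicodedata` module, which exposes the Unicode bidirectional class of each character. The helpers `bidi_classes` and `base_direction` are hypothetical, and `base_direction` uses a simple "first strong character" rule, which is a simplification and not the "predominant language" rule mentioned earlier. Hebrew letters are written as escapes so the source displays the same in any editor.

```python
import unicodedata

def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class (L, R, EN, CS, WS, ...)."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

def base_direction(text):
    """Guess the main direction from the first strongly directional character."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):
            return "RTL"
        if cls == "L":
            return "LTR"
    return None  # only neutral or weak characters (digits, punctuation, spaces)

sample = "a\u05d0 1:"  # Latin letter, Hebrew alef, space, digit, colon
print(bidi_classes(sample))
print(base_direction("Hebrew text : \u05d8\u05e7\u05e1\u05d8"))
print(base_direction("\u05d8\u05e7\u05e1\u05d8 : Hebrew text"))
print(unicodedata.mirrored("("))  # 1: parentheses are mirrored in RTL runs
```

The colon and the space have no direction of their own, which is why their position on screen changes with the main direction of the document or string.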
25
Hebrew User Interface: Issues
  Display of improper characters.
  Display in improper order.
  Display in correct order; cursor in logical position.
  Mix of Hebrew and Latin text.
  Alignment inside an input field.
  Copy and paste.
  Carriage returns inside a Hebrew or mixed string.
26
Cumulative Testing
Premises:
  Testing in French or German includes English issues.
  Testing of Greek includes non-Latin 1 character and font issues.
Special cases:
  Cursory testing of character and font issues per character set.
  Sorting and comparison per language.
  Hebrew: bi-directionality
  Turkish: INI files and anything related to case conversion
27
Total: 50% Increase for ALL Languages
  French or German: 100%
  Greek: 15%
  Hebrew: 15%
  Turkish: 5%
  Czech or Polish: 5%
  Cursory testing: 10%
  English: 0%
The items add up to 150% of one full test pass, hence the 50% increase.
English coverage: 100%
28
Sorting — 1
29
Sorting: 2
Sort order
  The system generates a sort key based on locale-specific rules.
  A sort key consists of several weighted components that represent a character's script, diacritics, case, etc.
Simple example of French sorting:
  Rules:
    1) Alphanumeric base
    2) Diacritics
    3) Case
    4) Non-alphanumeric data
  Sorting:
    élémentaire
    élève
    Élève
    élevé
    Élevé
    élever
    e-lever
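A multi-level sort key of this kind can be sketched in Python with the standard `unicodedata` module. This is a simplified illustration, not the algorithm any particular system uses: the function name `french_sort_key` is hypothetical, and two choices are assumptions made to reproduce the order of the accented words in the example, namely comparing diacritics from the end of the word and sorting lowercase before uppercase.

```python
import unicodedata

def french_sort_key(word):
    """Build a four-level key: base letters, diacritics, case, other characters."""
    base, accents, case, others = [], [], [], []
    for position, char in enumerate(unicodedata.normalize("NFC", word)):
        if char.isalnum():
            decomposed = unicodedata.normalize("NFD", char)
            base.append(decomposed[0].lower())    # level 1: unaccented, caseless letter
            accents.append(decomposed[1:])        # level 2: combining marks, "" if none
            case.append(decomposed[0].isupper())  # level 3: False (lowercase) sorts first
        else:
            others.append((position, char))       # level 4: hyphens, punctuation, ...
    # Assumption: diacritics are compared starting from the last letter.
    return ("".join(base), tuple(reversed(accents)), tuple(case), tuple(others))

words = ["Élevé", "élever", "élevé", "Élève", "élémentaire", "élève"]
for word in sorted(words, key=french_sort_key):
    print(word)
```

Because Python compares tuples element by element, a later level is consulted only when all earlier levels are equal, which is the weighting idea behind the sort key described above.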
30
References
  The ISO 8859 Alphabet Soup by Roman Czyborra. An absolute classic...
    http://czyborra.com/charsets/iso8859.html
  Character table:
    http://www.microsoft.com/globaldev/reference/sbcs/1250.htm
  Some Internet Explorer limitations:
    http://sizif.mf.uni-lj.si/linux/cee/app/ie30.html#http
  More of the same:
    http://sizif.mf.uni-lj.si/linux/cee/charset.html
  On fonts (a bit specialized):
    http://studweb.euv-frankfurt-o.de/twardoch/f/en/index.html
  ISO 8859-2 vs. Windows Central European code page (1250):
    http://titus.uni-frankfurt.de/unicode/iso8859/iso8859b.htm#start