Software Internationalisation — Single-Byte Scripts Guy Lacoursière Software Globalisation Consultant.


1 Software Internationalisation — Single-Byte Scripts Guy Lacoursière Software Globalisation Consultant

2 Agenda  Deliverables  Definitions  Scripts  Latin scripts  Greek  Hebrew  Cumulative testing  Sorting (optional)  References

3 Deliverables — English Internationalized Products  We currently support Latin 1 and Asian character sets:  ISO-8859-1: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, French, German, Icelandic, Indonesian, Italian, Norwegian, Portuguese, Spanish, Swedish  Multibyte character sets: Japanese, traditional Chinese, simplified Chinese, Korean  Newly supported character sets:  ISO-8859-2: Albanian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian, Slovak, Slovenian  ISO-8859-7/8/9: Greek, Hebrew, Turkish  Complex languages are not supported: Thai, Indic languages, Arabic  Goal: Unicode

4 Definitions  Script  System of characters composed of:  Letters, syllables or ideographs (with one or more possible directions)  Punctuation symbols  Numbers ( 0 1 2 3 4 5 6 7 8 9 ¼ ½ ¾ )  Other symbols ( ® $ # % & ± ° _ @ )  n scripts/language or n languages/script  Character set (or code page, or coded character set)  Ordered group of characters assigned to code points.  Encoding  System defining the storage mechanism for a given character set.

5 Single-Byte Character Sets  Expressed in 8-bit sequences.  The character set does not exceed 256 code points.  The encoding is the order of the character set code points.  A given code point may have a different value (character) depending on the character set.  The first 128 code points are always the same.
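The point that one code point can stand for different characters can be checked directly. The sketch below is only an illustration, not part of the original deck: it decodes a single byte under several ISO 8859 parts using Python's standard codec aliases. The function name `decode_everywhere` and the choice of byte 0xF0 are assumptions made for the example.

```python
# Python's standard codec aliases for five of the ISO 8859 parts.
CHARSETS = ["iso8859-1", "iso8859-2", "iso8859-7", "iso8859-8", "iso8859-9"]


def decode_everywhere(byte_value):
    """Return the character that one byte value maps to in each character set."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in CHARSETS}


if __name__ == "__main__":
    # 0xF0 is above 127, so each character set is free to assign it differently.
    for name, char in decode_everywhere(0xF0).items():
        print(f"{name}: U+{ord(char):04X} {char}")
    # 0x41 is in the shared lower half, so every character set agrees on it.
    print(set(decode_everywhere(0x41).values()))
```

Byte 0xF0 comes out as a different letter in each of the five sets, while any byte below 0x80 gives the same character everywhere.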

6 Latin Scripts Latin 1 Character Set (ISO 8859-1)  Latin 1  Languages covered  Afrikaans, Albanian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, Swahili, Swedish  Notes  Uppercase and lowercase letters have separate code points even though they are two forms of the same letter.  Some letters have no uppercase.  The base characters are the same for all Latin character sets.  Base characters  a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 ! " ' ( ) , . : ; ? [ ] ^ { | } ~ # $ % & ÷ × + - * / = \ _  Extended characters  àÀ áÁ â ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË ðÐ íÍ îÎ ïÏ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ýÝ ÿ þÞ

7 Latin Scripts ISO 8859-1 vs. Windows 1252  Microsoft Windows' Latin 1 character set (code page 1252) is different from ISO 8859-1.  It contains about 20 extra characters, among others:  The euro symbol (€)  The English curly quotes ( “ ” )  The ellipsis (…)  The German opening quotes ( „ )  The bullet (•)  The n-dash (–)  The m-dash (—)  The French uppercase and lowercase oe ligatures (œ Œ)  The English trademark symbol (™)  These may not display correctly in non-Latin 1 systems.
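The difference between the two character sets sits in the byte range 0x80 to 0x9F, which ISO 8859-1 reserves for control codes. The following is a minimal sketch, assuming Python's built-in `cp1252` and `iso8859-1` codecs are acceptable stand-ins for the two tables; the helper names `cp1252_only_characters` and `can_encode` are invented for this illustration.

```python
def cp1252_only_characters():
    """Map each byte in 0x80-0x9F to the printable character cp1252 assigns to it."""
    extras = {}
    for value in range(0x80, 0xA0):
        try:
            extras[value] = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # this byte is left unassigned by Python's cp1252 codec
    return extras


def can_encode(text, charset):
    """Report whether every character of text exists in the given character set."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for value, char in cp1252_only_characters().items():
        print(f"0x{value:02X} U+{ord(char):04X} {char}")
    # The euro sign exists in code page 1252 but not in ISO 8859-1.
    print(can_encode("\u20ac", "cp1252"), can_encode("\u20ac", "iso8859-1"))
```

A check like `can_encode` is one way to detect, before writing a file, text that would be lost when a Windows-authored string is stored as ISO 8859-1.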

8 Latin Scripts ISO 8859-1 vs. Windows 1252 Latin 1 (ISO 8859-1) Windows code page 1252

9 Latin Scripts Latin 2 Character Set (ISO 8859-2)  Latin 1  Latin 2  Languages covered  Czech, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian, Sorbian  Notes  Some characters are duplicates from the Latin 1 character set.  The caron diacritic has two forms: “ ˇ ” and “ ’ ”.  The T with cedilla has a glyph variant (T with comma) for Romanian.  Latin 2 characters common to Latin 1 use identical code points.  Extended characters  ąĄ áÁ â ăĂ äÄ ćĆ çÇ čČ ďĎ éÉ ęĘ ëË ěĚ đĐ íÍ îÎ łŁ ľĽ ĺĹ ńŃ ňŇ óÓ ôÔ őŐ öÖ ŕŔ řŘ śŚ šŠ şŞ § ß  ťŤ ţŢ  ůŮ úÚ űŰ üÜ  ýÝ źŹ žŽ żŻ

10 ISO 8859-1 vs. ISO 8859-2 Latin 1 (ISO 8859-1) Latin 2 (ISO 8859-2)

11 ISO 8859-1 vs. ISO 8859-2  All common characters have the same code points.  Characters that are different belong to separate language families (mostly West European vs. East European).  Allows a certain level of flexibility between languages.
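The claim that common characters keep the same code points can be tested mechanically. This is a sketch under the assumption that Python's `iso8859-1` and `iso8859-2` codecs reproduce the standard tables; `shared_mismatches` is a hypothetical helper name, not something from the presentation.

```python
def upper_half(charset):
    """Map each character in the upper half (0xA0-0xFF) of a charset to its byte value."""
    table = {}
    for value in range(0xA0, 0x100):
        char = bytes([value]).decode(charset, errors="ignore")
        if char:  # an empty string means the byte is unassigned in this charset
            table[char] = value
    return table


def shared_mismatches(charset_a, charset_b):
    """List characters present in both charsets but stored at different byte values."""
    table_a, table_b = upper_half(charset_a), upper_half(charset_b)
    shared = table_a.keys() & table_b.keys()
    return sorted(char for char in shared if table_a[char] != table_b[char])


if __name__ == "__main__":
    print(shared_mismatches("iso8859-1", "iso8859-2"))
    print(shared_mismatches("iso8859-4", "iso8859-10"))
```

An empty list for a pair of character sets means text limited to their shared characters can be relabelled from one to the other without converting any bytes.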

12 Latin Scripts Latin 3 Character Set (ISO 8859-3)  Latin 1  Latin 2  Latin 3  Languages covered  Esperanto, Maltese  Notes  Covered Turkish before the introduction of Latin 5 in 1988.  Not supported.  Extended characters  àÀ áÁ â äÄ ċĊ ĉĈ çÇ èÈ éÉ êÊ ëË ğĞ ħĦ ĥĤ ıI iİ ìÌ íÍ îÎ ïÏ ĵĴ ñÑ òÒ óÓ ôÔ öÖ şŞ ŝŜ §  ß ùÙ úÚ ûÛ üÜ ŭŬ żŻ £¤

13 Latin Scripts Latin 4 Character Set (ISO 8859-4)  Latin 1  Latin 2  Latin 3  Latin 4  Languages covered  Estonian, Latvian, Lithuanian, Greenlandic, Lappish  Notes  Not supported.  Extended characters  ąĄ āĀ áÁ â ãÃ äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ đĐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷

14 Latin Scripts Latin 5 Character Set (ISO 8859-9)  Latin 1  Latin 2  Latin 3  Latin 4  Latin 5  Languages covered  Turkish  Notes  Very similar to Latin 1.  The letters ð, ý and þ from Latin 1 are replaced with Turkish letters.  Latin 5 characters common to Latin 1 use identical code points.  Issue (case conversion):  *.ini = *.İNİ, and *.ını = *.INI  *.ini ≠ *.INI, and *.ını ≠ *.İNİ  Extended characters  àÀ áÁ â ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ðÐ ---> ğĞ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ýÝ ---> ıİ ÿ þÞ ---> şŞ
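The case-conversion issue on this slide comes from Turkish having two letter pairs, dotted i/İ and dotless ı/I, where English has one. Python's `str.upper()` and `str.lower()` are not locale-aware, so the sketch below shows one possible workaround. `turkish_upper` and `turkish_lower` are hypothetical helpers written for this illustration, not a standard API.

```python
def turkish_upper(text):
    """Uppercase text using the Turkish i/İ and ı/I pairings."""
    # Handle the two Turkish-specific pairs first, then let str.upper() do the rest.
    return text.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(text):
    """Lowercase text using the Turkish İ/i and I/ı pairings."""
    return text.replace("İ", "i").replace("I", "ı").lower()


if __name__ == "__main__":
    print("*.ini".upper())         # default, English-style conversion
    print(turkish_upper("*.ini"))  # Turkish conversion keeps the dot
    print(turkish_upper("*.ını"))  # dotless i pairs with plain I
    print(turkish_lower("*.INI"))
```

For file extensions and other identifiers, a common design choice is the opposite one: always use the locale-independent conversion so that `*.ini` matches `*.INI` regardless of the user's language.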

15 Latin Scripts Latin 6 Character Set (ISO 8859-10)  Latin 1  Latin 2  Latin 3  Latin 4  Latin 5  Latin 6  Languages covered  Nordic area Inuit (Greenlandic Eskimo), non-Skolt Sami (Lappish), Icelandic  Notes  Similar characters to Latin 4, but with extra letters for the Nordic languages.  Latin 6 characters common to Latin 4 use different code points.  Not supported.  Extended characters  ąĄ āĀ áÁ â ãÃ äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ ðÐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷

16 Latin Scripts Latin 7 & 8 Character Sets (ISO 8859-13 & 14)  Latin 1  Latin 2  Latin 3  Latin 4  Latin 5  Latin 6  Latin 7  Latin 8  Languages covered  Latin 7: Baltic languages  Latin 8: Celtic languages  Notes  Similar characters to Latin 4 and 6, but with extra letters for the Nordic languages.  Latin 7 characters common to Latin 4 and 6 use different code points.  Latin 8 characters common to Latin 1 use identical code points.  Not supported. 

17 Latin Scripts Latin 9 Character Set (ISO 8859-15)  Latin 1  Latin 2  Latin 3  Latin 4  Latin 5  Latin 6  Latin 7  Latin 8  Latin 9  Languages covered  Same as Latin 1.  Notes  Eight Latin 1 code points are assigned different characters in Latin 9.  Less used characters are replaced:  ¨ ---> š  ¦ ---> Š  ¸ ---> ž  ´ ---> Ž  ½ ---> œ  ¼ ---> Œ  ¾ ---> Ÿ  ¤ ---> €  Extended characters  àÀ áÁ â ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ðÐ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ œŒ šŠ ß ùÙ úÚ ûÛ üÜ ýÝ ÿ Ÿ žŽ þÞ
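The list of replaced characters can be regenerated by decoding every byte under both character sets and keeping the positions that disagree. This is a rough sketch that assumes Python's `iso8859-1`, `iso8859-15` and `iso8859-9` codecs match the published tables; `replaced_code_points` is a name chosen for the example.

```python
def replaced_code_points(old="iso8859-1", new="iso8859-15"):
    """Return (byte value, old character, new character) for every differing position."""
    changes = []
    for value in range(256):
        raw = bytes([value])
        before, after = raw.decode(old), raw.decode(new)
        if before != after:
            changes.append((value, before, after))
    return changes


if __name__ == "__main__":
    for value, before, after in replaced_code_points():
        print(f"0x{value:02X}: {before} ---> {after}")
```

Calling the same function with `new="iso8859-9"` lists the six positions where Latin 5 substitutes Turkish letters.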

18 ISO 8859-15 vs. Windows 1252 Latin 9 (ISO 8859-15) Windows 1252

19 Latin Scripts in... Non-Latin Character Sets!  Latin 1  Latin 2  Latin 3  Latin 4  Latin 5  Latin 6  Latin 7  Latin 8  Latin 9  Other  Languages  Traditional Chinese, Simplified Chinese, Japanese (romaji or romanji), Vietnamese  Notes  Chinese, Japanese and Korean use Latin letters for transliteration (sometimes with tone accents) and numbers.  Vietnamese uses Latin characters with diacritics.  Latin characters are also used in the transliteration of Greek, Hebrew, Russian, etc.  Some Vietnamese extended characters  đĐ ăĂ â êÊ ôÔ …with tones

20 Languages Covered by Latin Character Sets
Language     Character set (Latin-n)
Czech        2
Danish       1, 4, 5, 6, 7, 8, 9
Dutch        1, 5, 9
English      1, 2, 3, 4, 5, 6, 7, 8, 9
Finnish      1, 2, 3, 4, 5, 6, 7, 8, 9
French       1, 3, 5, 8, 9
German       1, 2, 3, 4, 5, 6, 7, 8, 9
Hungarian    2
Italian      1, 3, 5, 8, 9
Norwegian    1, 2, 3, 4, 5, 6, 7, 8, 9
Polish       2, 7
Portuguese   1, 3, 5, 8, 9
Romanian     2
Spanish      1, 8, 9
Swedish      1, 4, 5, 6, 7, 8, 9
Turkish      3, 5

21 Greek Script Greek Character Set  One script, one character set, one language.  Contains modern monotonic upper & lowercase Greek letters, punctuation and a few accented Greek letters.  The rest is almost identical to Latin 1 !  Missing from Latin 1:  Latin punctuation: ¡ ¿  Currency symbols: ¢ ¤ ¥  Other symbols: ® ª º × ÷ µ ¶  Diacritics: ¸  Numbers: ¹ ¼ ¾  Extended characters  αβγδεζηικλμν… ΑΒΓΔΖΗΘΙΚΛΝΞ…  The rest... ² ³ ½ £ ¦ § © ¬ ­ ¯ ° ± « » · ¨

22 Hebrew Script Hebrew Character Set  One script, one character set:  Hebrew  Yiddish  Directionality of text:  Hebrew letters are written from right to left (RTL).  Numbers (Arabic) are written from left to right (LTR).  Latin characters are written from left to right (LTR).  Order of the text depends on the predominant language.  Order of mirrored characters depends on neighboring characters.  Differences from Latin 1:  Latin punctuation: ¡ ¿ are missing  Currency symbol:₪ (new sheqel) is absent  Other symbols: ª º are missing × ÷ have different code points  Extended characters  תשרקעסליטחזוהדגבא  Final & nominal forms: ך -כ ן -נ ם -מ ף -פ ץ -צ Final form

23 Hebrew User Interface  There are two types of Hebrew support:  Hebrew-enabled product (supporting Hebrew characters)  Hebrew product (translated into Hebrew)  Both types must support RTL display.  Text alignment may differ for characters, strings and document.  Normally, the logical order (or storage order or file order) is the same as the reading order.  The display order is bi-directional and does not follow the logical order.

24 Hebrew User Interface Logical vs. Visual  Input string: "Hebrew text : ילגנא טסקט"  In a LTR document: Hebrew text : טקסט אנגלי  In a RTL document: טקסט אנגלי : Hebrew text 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  How should it be displayed?  You get different displays depending on the main direction (script) of the document or the string.  Notice the direction of the colon.
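The display differences described here follow from each character carrying its own directionality class, which a bidirectional layout engine uses to reorder text that is stored in logical order. Python's standard library does not perform that reordering, but it does expose the classes. The sketch below only inspects them; `bidi_classes` is a hypothetical helper, and the sample string is made up for the example.

```python
import unicodedata


def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(char, unicodedata.bidirectional(char)) for char in text]


if __name__ == "__main__":
    # Stored in logical (reading) order: a Latin letter, a colon, a space,
    # two Hebrew letters (alef, bet), a space and a digit.
    sample = "a: \u05d0\u05d1 1"
    for char, bidi_class in bidi_classes(sample):
        print(f"U+{ord(char):04X} {bidi_class}")
    # Brackets are flagged as mirrored: their glyph flips in a right-to-left run.
    print(unicodedata.mirrored("("))
    # ISO 8859-8 stores the Hebrew letters in the same logical order.
    print("\u05d0\u05d1".encode("iso8859-8"))
```

Letters come out as strongly left-to-right (`L`) or right-to-left (`R`), while the digit, the colon and the space get weak or neutral classes. This is why the colon's position depends on the main direction of the document.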

25 Hebrew User Interface — Issues  Display of improper characters.  Display in improper order.  Display in correct order; cursor in logical position.  Mix of Hebrew and Latin text.  Alignment inside an input field.  Copy and paste.  Carriage returns inside a Hebrew or mixed string.

26 Cumulative Testing  Premises:  Testing in French or German includes English issues.  Testing of Greek includes non-Latin 1 character and font issues.  Special cases:  Cursory testing of character and font issues per character set.  Sorting and comparison per language.  Hebrew: bi-directionality  Turkish: INI files and anything related to case conversion

27 Total 50% Increase for ALL Languages  French or German: 100%  Greek: 15%  Hebrew: 15%  Turkish: 5%  Czech or Polish: 5%  Cursory testing: 10%  English: 0%  English coverage: 100%

28 Sorting — 1

29 Sorting — 2  Sort order  The system generates a sort key based on locale-specific rules.  A sort key consists of several weighted components that represent a character’s script, diacritics, case, etc.  Simple example of French sorting:
Sorting: élémentaire, élève, Élève, élevé, Élevé, élever, e-lever
Rules: 1) Alphanumeric base  2) Diacritics  3) Case  4) Non-alphanumeric data
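The four rules can be sketched as a tuple-valued sort key, one component per rule. This is a simplified illustration and not the algorithm any particular system uses. It assumes Unicode decomposition (NFD) to separate base letters from accents, and it compares accents from the end of the word, a French convention that explains why élève precedes élevé. The name `french_sort_key` and the hyphenated test word é-lever are assumptions added to exercise rule 4.

```python
import unicodedata


def french_sort_key(word):
    """Build a four-level key: base letters, accents, case, other characters."""
    base, accents, case, others = [], [], [], []
    for char in unicodedata.normalize("NFD", word):
        if unicodedata.combining(char):
            if accents:
                accents[-1] += (ord(char),)   # attach the accent to its base letter
        elif char.isalnum():
            base.append(char.lower())         # level 1: alphanumeric base
            accents.append(())                # level 2: accents, filled in above
            case.append(char.isupper())       # level 3: lowercase before uppercase
        else:
            others.append((len(base), char))  # level 4: non-alphanumeric data
    # Accent weights are reversed so they are compared from the end of the word.
    return ("".join(base), tuple(reversed(accents)), tuple(case), tuple(others))


if __name__ == "__main__":
    words = ["élever", "Élevé", "é-lever", "élève", "élémentaire", "élevé", "Élève"]
    for word in sorted(words, key=french_sort_key):
        print(word)
```

Real collation libraries use weight tables rather than raw code points, so this key is only meant to show how the levels are layered: a later level is consulted only when all earlier levels tie.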

30 References  The ISO 8859 Alphabet Soup by Roman Czyborra. An absolute classic...  http://czyborra.com/charsets/iso8859.html  Character table:  http://www.microsoft.com/globaldev/reference/sbcs/1250.htm  Some Internet Explorer limitations:  http://sizif.mf.uni-lj.si/linux/cee/app/ie30.html#http  More of the same:  http://sizif.mf.uni-lj.si/linux/cee/charset.html  On fonts (a bit specialized):  http://studweb.euv-frankfurt-o.de/twardoch/f/en/index.html  ISO 8859-2 vs. Windows Central European code page (1250):  http://titus.uni-frankfurt.de/unicode/iso8859/iso8859b.htm#start

