Software Internationalisation: Single-Byte Scripts
Guy Lacoursière, Software Globalisation Consultant


Agenda
- Deliverables
- Definitions
- Scripts
- Latin scripts
- Greek
- Hebrew
- Cumulative testing
- Sorting (optional)
- References

Deliverables: English Internationalized Products
- We currently support Latin 1 and Asian character sets:
  - ISO 8859-1: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, French, German, Icelandic, Indonesian, Italian, Norwegian, Portuguese, Spanish, Swedish
  - Multibyte character sets: Japanese, traditional Chinese, simplified Chinese, Korean
- Newly supported character sets:
  - ISO 8859-2: Albanian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian, Slovak, Slovenian
  - ISO 8859-7/8/9: Greek, Hebrew, Turkish
- Complex languages are not supported: Thai, Indic languages, Arabic
- Goal: Unicode

Definitions
- Script
  - System of characters composed of:
    - Letters, syllables or ideographs (with one or more possible directions)
    - Punctuation symbols
    - Numbers ( ¼ ½ ¾ )
    - Other symbols ( ® $ # % & ± ° )
  - A language may use several scripts, and a script may serve several languages.
- Character set (or code page, or coded character set)
  - Ordered group of characters assigned to code points.
- Encoding
  - System defining the storage mechanism for a given character set.
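The distinction between a character set and an encoding can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper name `stored_bytes` is hypothetical, and the codec names are the ones shipped with Python's standard library, not anything named in the slides.

```python
def stored_bytes(ch, encodings=("iso8859_1", "cp1252", "utf_8", "utf_16_le")):
    """Return the bytes used to store one character under each encoding."""
    return {name: ch.encode(name) for name in encodings}

# The character has one identity (its code point) but several storage forms.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
storage = stored_bytes(e_acute)

for name, raw in storage.items():
    print(f"{name:10} {raw.hex(' ')}")
```

In a single-byte character set the stored byte equals the code point (0xE9 here), while a Unicode encoding such as UTF-8 needs two bytes for the same character.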

Single-Byte Character Sets
- Each character is stored in a single 8-bit byte.
- The character set does not exceed 256 code points.
- The encoding is simply the order of the code points: a character is stored as the byte whose value is its code point.
- A given code point may represent a different character depending on the character set.
- The first 128 code points (the ASCII range) are always the same.
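The last two points can be checked with a small sketch. It assumes Python's built-in ISO 8859 codecs as stand-ins for the character sets discussed in this deck; `decode_in_each` is a hypothetical helper name.

```python
CHARSETS = ["iso8859_1", "iso8859_2", "iso8859_7", "iso8859_8", "iso8859_9"]

def decode_in_each(raw, charsets=CHARSETS):
    """Interpret the same bytes under several single-byte character sets."""
    return {name: raw.decode(name) for name in charsets}

# One byte value above 127, five different characters.
decoded = decode_in_each(bytes([0xF0]))

# The first 128 code points decode identically in every set.
low_half = bytes(range(128))
ascii_is_shared = all(
    low_half.decode(name) == low_half.decode("ascii") for name in CHARSETS
)

for name, ch in decoded.items():
    print(name, repr(ch))
print("ASCII range shared:", ascii_is_shared)
```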

Latin Scripts: Latin 1 Character Set (ISO 8859-1)
- Languages covered
  - Afrikaans, Albanian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, Swahili, Swedish
- Notes
  - Uppercase and lowercase letters have separate code points even though they are two forms of the same letter.
  - Some letters have no uppercase.
  - The base characters are the same for all Latin character sets.
- Base characters
  - a b c d e f g h i j k l m n o p q r s t u v w x y z ! " ' ( ) , . : ; ? [ ] ^ { | } ~ # $ % & ÷ × + - * / = \ _
- Extended characters
  - àÀ áÁ âÂ ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË ðÐ íÍ îÎ ïÏ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ýÝ ÿ þÞ

Latin Scripts: ISO 8859-1 vs. Windows 1252
- Microsoft Windows' Latin 1 character set (code page 1252) is different from ISO 8859-1.
- It contains more than 20 extra characters, placed in the range that ISO 8859-1 reserves for control codes (0x80 to 0x9F), among others:
  - The euro symbol (€)
  - The English curly quotes ( “ ” )
  - The ellipsis (…)
  - The German opening quotes ( „ )
  - The bullet (•)
  - The en dash (–)
  - The em dash (—)
  - The French uppercase and lowercase oe ligatures (œ Œ)
  - The English trademark symbol (™)
- These may not display correctly in non-Latin 1 systems.
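The difference between the two character sets can be listed programmatically. The sketch below assumes Python's `cp1252` and `latin-1` codecs are faithful to the two tables; `cp1252_extras` is a hypothetical helper name.

```python
import unicodedata

def cp1252_extras():
    """Map each byte in 0x80-0x9F to its code page 1252 character, if any."""
    extras = {}
    for byte in range(0x80, 0xA0):
        try:
            extras[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # position left unassigned in code page 1252
    return extras

extras = cp1252_extras()

# The same byte decoded as ISO 8859-1 is a control code, not a printable character.
latin1_view = bytes([0x80]).decode("latin-1")

for byte, ch in extras.items():
    print(f"0x{byte:02X} {ch} {unicodedata.name(ch)}")
```

A byte such as 0x80 written by a Windows application therefore shows as the euro sign only if the reader also assumes code page 1252.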

Latin Scripts: ISO 8859-1 vs. Windows 1252 [code charts: Latin 1 (ISO 8859-1) and Windows code page 1252]

Latin Scripts: Latin 2 Character Set (ISO 8859-2)
- Languages covered
  - Czech, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian, Sorbian
- Notes
  - Some characters are duplicates from the Latin 1 character set.
  - The caron diacritic has two forms: “ ˇ ” and “ ’ ”.
  - The T with cedilla has a glyph variant (T with comma) for Romanian.
  - Latin 2 characters common to Latin 1 use identical code points.
- Extended characters
  - ąĄ áÁ âÂ ăĂ äÄ ćĆ çÇ čČ ďĎ éÉ ęĘ ëË ěĚ đĐ íÍ îÎ łŁ ľĽ ĺĹ ńŃ ňŇ óÓ ôÔ őŐ öÖ ŕŔ řŘ śŚ šŠ şŞ § ß ťŤ ţŢ ůŮ úÚ űŰ üÜ ýÝ źŹ žŽ żŻ

ISO 8859-1 vs. ISO 8859-2 [code charts: Latin 1 (ISO 8859-1) and Latin 2 (ISO 8859-2)]

ISO 8859-1 vs. ISO 8859-2
- All common characters have the same code points.
- Characters that are different belong to separate language families (mostly West European vs. East European).
- Allows a certain level of flexibility between languages.
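The first point can be verified mechanically. This sketch assumes Python's `iso8859_1` and `iso8859_2` codecs; `high_half` is a hypothetical helper name.

```python
def high_half(codec):
    """Decode every byte from 0xA0 to 0xFF under the given character set."""
    return {byte: bytes([byte]).decode(codec) for byte in range(0xA0, 0x100)}

latin1 = high_half("iso8859_1")
latin2 = high_half("iso8859_2")

# Characters present in both sets.
shared = set(latin1.values()) & set(latin2.values())

# Shared characters whose byte value differs between the two sets.
moved = [ch for ch in shared if ch.encode("iso8859_1") != ch.encode("iso8859_2")]

print(len(shared), "shared characters,", len(moved), "at different code points")
```

Text that uses only the shared characters reads the same under either character set, which is the flexibility the slide refers to.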

Latin Scripts: Latin 3 Character Set (ISO 8859-3)
- Languages covered
  - Esperanto, Maltese
- Notes
  - Covered Turkish before the introduction of Latin 5.
  - Not supported.
- Extended characters
  - àÀ áÁ âÂ äÄ ċĊ ĉĈ çÇ èÈ éÉ êÊ ëË ğĞ ħĦ ĥĤ ıI iİ ìÌ íÍ îÎ ïÏ ĵĴ ñÑ òÒ óÓ ôÔ öÖ şŞ ŝŜ § ß ùÙ úÚ ûÛ üÜ ŭŬ żŻ £¤

Latin Scripts: Latin 4 Character Set (ISO 8859-4)
- Languages covered
  - Estonian, Latvian, Lithuanian, Greenlandic, Lappish
- Notes
  - Not supported.
- Extended characters
  - ąĄ āĀ áÁ âÂ ãÃ äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ đĐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷

Latin Scripts: Latin 5 Character Set (ISO 8859-9)
- Languages covered
  - Turkish
- Notes
  - Very similar to Latin 1.
  - The letters ð, ý and þ from Latin 1 are replaced with Turkish letters.
  - Latin 5 characters common to Latin 1 use identical code points.
- Issue: Turkish has a dotted and a dotless i, each with its own uppercase form.
  - *.ini = *.İNİ, and *.ını = *.INI
  - *.ini ≠ *.INI, and *.ını ≠ *.İNİ
- Extended characters
  - àÀ áÁ âÂ ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ðÐ ---> ğĞ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ýÝ ---> ıİ ÿ þÞ ---> şŞ
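The case-conversion issue can be reproduced in a few lines. Python's `str.upper()` and `str.lower()` do not take the locale into account, so the Turkish rules are emulated here with two hypothetical helpers, `upper_tr` and `lower_tr`; this is a sketch of the pairing described on the slide, not a complete Turkish casing implementation.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı

def upper_tr(text):
    """Uppercase using the Turkish pairing i -> İ and ı -> I."""
    return text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()

def lower_tr(text):
    """Lowercase using the Turkish pairing İ -> i and I -> ı."""
    return text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()

default_upper = "ini".upper()   # locale-independent result
turkish_upper = upper_tr("ini")

# A case-insensitive match for an extension fails if the two sides use different rules.
matches = upper_tr("setup.ini").endswith(".INI")
print(default_upper, turkish_upper, matches)
```

Code that uppercases a file name under Turkish rules and then compares it with the literal ".INI" will miss the file, which is why INI files are called out in the testing plan.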

Latin Scripts: Latin 6 Character Set (ISO 8859-10)
- Languages covered
  - Nordic area: Inuit (Greenlandic Eskimo), non-Skolt Sami (Lappish), Icelandic
- Notes
  - Similar characters to Latin 4, but with extra letters for the Nordic languages.
  - Latin 6 characters common to Latin 4 use different code points.
  - Not supported at all.
- Extended characters
  - ąĄ āĀ áÁ âÂ ãÃ äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ ðÐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷

Latin Scripts: Latin 7 & 8 Character Sets (ISO 8859-13 & 14)
- Languages covered
  - Latin 7: Baltic languages
  - Latin 8: Celtic languages
- Notes
  - Similar characters to Latin 4 and 6, but with extra letters for the Nordic languages.
  - Latin 7 characters common to Latin 4 and 6 use different code points.
  - Latin 8 characters common to Latin 1 use identical code points.
  - Not supported.

Latin Scripts: Latin 9 Character Set (ISO 8859-15)
- Languages covered
  - Same as Latin 1.
- Notes
  - Some Latin 9 characters common to Latin 1 use different code points.
  - Less used characters are replaced:
    - ¨ ---> š    ¦ ---> Š
    - ¸ ---> ž    ´ ---> Ž
    - ½ ---> œ    ¼ ---> Œ
    - ¾ ---> Ÿ    ¤ ---> €
- Extended characters
  - àÀ áÁ âÂ ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ðÐ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ œŒ šŠ ß ùÙ úÚ ûÛ üÜ ýÝ ÿ Ÿ žŽ þÞ
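The replacements can be recovered by comparing the two tables byte by byte. The sketch assumes Python's `iso8859_1` and `iso8859_15` codecs; `differences` is a hypothetical helper name.

```python
def differences(codec_a, codec_b):
    """Return {byte: (char_in_a, char_in_b)} for every byte decoded differently."""
    result = {}
    for byte in range(0x100):
        raw = bytes([byte])
        a, b = raw.decode(codec_a), raw.decode(codec_b)
        if a != b:
            result[byte] = (a, b)
    return result

diffs = differences("iso8859_1", "iso8859_15")

for byte, (old, new) in sorted(diffs.items()):
    print(f"0x{byte:02X}: {old} ---> {new}")
```

Only eight positions differ, so a Latin 1 file read as Latin 9 is wrong only where it used one of the replaced symbols.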

ISO 8859-15 vs. Windows 1252 [code charts: Latin 9 (ISO 8859-15) and Windows 1252]

Latin Scripts in... Non-Latin Character Sets!
- Languages
  - Traditional Chinese
  - Simplified Chinese
  - Japanese (romaji)
  - Vietnamese
- Notes
  - Chinese, Japanese and Korean use Latin letters for transliteration (sometimes with tone accents) and numbers.
  - Vietnamese uses Latin characters with diacritics.
  - Latin characters are also used in the transliteration of Greek, Hebrew, Russian, etc.
- Some Vietnamese extended characters
  - đĐ ăĂ âÂ êÊ ôÔ …with tones

Languages Covered by Latin Character Sets

Language      Character set (Latin-n)
Czech         2
Danish
Dutch         1, 5, 9
English
Finnish
French        1, 3, 5, 8, 9
German
Hungarian     2
Italian       1, 3, 5, 8, 9
Norwegian
Polish        2, 7
Portuguese    1, 3, 5, 8, 9
Romanian      2
Spanish       1, 8, 9
Swedish
Turkish       3, 5

Greek Script: Greek Character Set
- One script, one character set, one language.
- Contains modern monotonic uppercase and lowercase Greek letters, punctuation and a few accented Greek letters.
- The rest is almost identical to Latin 1!
- Latin 1 characters missing from the Greek character set:
  - Latin punctuation: ¡ ¿
  - Currency symbols: ¢ ¤ ¥
  - Other symbols: ® ª º × ÷ µ ¶
  - Diacritics: ¸
  - Numbers: ¹ ¼ ¾
- Extended characters
  - αβγδεζηικλμν… ΑΒΓΔΖΗΘΙΚΛΝΞ…
- The rest...
  - ² ³ ½ £ ¦ § © ¬ ­ ¯ ° ± « » · ¨
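A quick way to see which Latin 1 characters have no slot in the Greek character set is to try encoding each one. The sketch assumes Python's `iso8859_7` codec represents the Greek character set discussed here (the codec follows a later revision of the standard, so a few positions may differ from the slide); `missing_from` is a hypothetical helper name.

```python
def missing_from(target, source="iso8859_1"):
    """List the upper-half characters of `source` that `target` cannot encode."""
    missing = []
    for byte in range(0xA0, 0x100):
        ch = bytes([byte]).decode(source)
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

not_in_greek = missing_from("iso8859_7")

# Symbols only; the accented Latin letters are all missing as expected.
symbols = [ch for ch in not_in_greek if not ch.isalpha()]
print(" ".join(symbols))
```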

Hebrew Script: Hebrew Character Set
- One script, one character set:
  - Hebrew
  - Yiddish
- Directionality of text:
  - Hebrew letters are written from right to left (RTL).
  - Numbers (Arabic numerals) are written from left to right (LTR).
  - Latin characters are written from left to right (LTR).
  - Order of the text depends on the predominant language.
  - Order of mirrored characters depends on neighboring characters.
- Differences from Latin 1:
  - Latin punctuation: ¡ ¿ are missing
  - Currency symbol: ₪ (new sheqel) is absent
  - Other symbols: ª º are missing; × ÷ have different code points
- Extended characters
  - תשרקעסליטחזוהדגבא
  - Final and nominal forms (final form first in each pair): ך -כ ן -נ ם -מ ף -פ ץ -צ

Hebrew User Interface
- There are two types of Hebrew support:
  - Hebrew-enabled product (supporting Hebrew characters)
  - Hebrew product (translated into Hebrew)
- Both types must support RTL display.
- Text alignment may differ for characters, strings and document.
- Normally, the logical order (or storage order or file order) is the same as the reading order.
- The display order is bi-directional and does not follow the logical order.

Hebrew User Interface: Logical vs. Visual
- Input string: "Hebrew text : ילגנא טסקט"
- How should it be displayed?
  - In a LTR document: Hebrew text : טקסט אנגלי
  - In a RTL document: טקסט אנגלי : Hebrew text
- You get different displays depending on the main direction (script) of the document or the string.
- Notice the direction of the colon.
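The reason the colon and the spaces move is that they have no direction of their own and take it from their surroundings. The sketch below only inspects the Unicode bidirectional class of each character with the standard `unicodedata` module; it does not implement the display reordering itself. The sample string and the helper name `bidi_classes` are illustrative assumptions, and the Hebrew word is written with escapes so the source stays in logical order.

```python
import unicodedata

def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Logical (storage) order: Latin text, a colon, then a Hebrew word.
sample = "Hebrew text : \u05d8\u05e7\u05e1\u05d8"
classes = dict(bidi_classes(sample))

# Mirrored characters such as parentheses swap glyphs in an RTL run.
paren_is_mirrored = bool(unicodedata.mirrored("("))

for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {cls}")
```

Latin letters report class L, Hebrew letters R, and the colon and space report neutral or weak classes (CS and WS), which is why their displayed position depends on the main direction of the document.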

Hebrew User Interface: Issues
- Display of improper characters.
- Display in improper order.
- Display in correct order; cursor in logical position.
- Mix of Hebrew and Latin text.
- Alignment inside an input field.
- Copy and paste.
- Carriage returns inside a Hebrew or mixed string.

Cumulative Testing
- Premises:
  - Testing in French or German includes English issues.
  - Testing of Greek includes non-Latin 1 character and font issues.
- Special cases:
  - Cursory testing of character and font issues per character set.
  - Sorting and comparison per language.
  - Hebrew: bi-directionality
  - Turkish: INI files and anything related to case conversion

Total 50% Increase for ALL Languages
- Baseline
  - French or German: 100%
- Additional effort (15 + 15 + 5 + 5 + 10 = 50%)
  - Greek: 15%
  - Hebrew: 15%
  - Turkish: 5%
  - Czech or Polish: 5%
  - Cursory testing: 10%
- English: 0% (English coverage: 100%)

Sorting — 1

Sorting: 2
- Sort order
  - The system generates a sort key based on locale-specific rules.
  - A sort key consists of several weighted components that represent a character’s script, diacritics, case, etc.
- Simple example of French sorting
  - Rules:
    1) Alphanumeric base
    2) Diacritics
    3) Case
    4) Non-alphanumeric data
  - Sorting:
    élémentaire
    élève
    Élève
    élevé
    Élevé
    élever
    e-lever
  - The same list reduced to its alphanumeric base: elementaire, eleve, Eleve, eleve, Eleve, elever
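The four rules can be sketched as a multi-level sort key. This is a simplified illustration, not the key generation of any real collation library: `french_sort_key` is a hypothetical helper, accent weights are just code point order, and accents are compared from the end of the word, which is an assumption based on the traditional French rule and on the slide placing élève before élevé.

```python
import unicodedata

def french_sort_key(word):
    """Build a four-level key: base letters, accents, case, other characters."""
    base, accents, case, other = [], [], [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += ch            # attach the accent to its base letter
        elif ch.isalnum():
            base.append(ch.lower())          # level 1: alphanumeric base
            accents.append("")               # level 2: filled in by combining marks
            case.append(ch.isupper())        # level 3: lowercase sorts first
        else:
            other.append((len(base), ch))    # level 4: position of punctuation
    # Accents are compared starting from the end of the word (assumption).
    return ("".join(base), tuple(reversed(accents)), tuple(case), tuple(other))

words = ["élever", "Élevé", "élevé", "Élève", "élève", "élémentaire"]
ordered = sorted(words, key=french_sort_key)
print(ordered)
```

Because the key is a tuple, a lower level is consulted only when all higher levels tie, which is the weighting the slide describes.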

References
- The ISO 8859 Alphabet Soup by Roman Czyborra. An absolute classic...
- Character table:
- Some Internet Explorer limitations:
- More of the same:
- On fonts (a bit specialized):
- ISO vs. Windows Central European code page (1250):