1 Internationalization Using Locales Achim Ruopp

2 Agenda
- Working with multilingual data
- Language and locale identifiers
- Locale Data
- Frameworks for locale support
- Ideas/discussion on how this could be used in computational linguistics

3 Not about character encoding
- Read Jeremy’s slides from last quarter
    - http://students.washington.edu/jgk/talks/char-enc/char-encodings.pdf
- Use Unicode wherever possible

4 Internationalization: More than Encoding Text
- Where are the word breaks?
    - คลิกปุ่มเมาส์ขวา
- Your balance is $1234.56... I think.
- How do I sort these words in French?
    - cote (dimension)
    - côte (coast)
    - coté (with dimensions)
    - côté (side)
- How do I uppercase this word in Turkish?
    - istiyorum - İstiyorum
- How do I transcribe this text into Latin characters?
    - 인수문제를 - in'su'mun'je'reul'
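Two of the questions on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not how ICU or .NET implement it: `french_key`, `turkish_upper` and `turkish_capitalize` are hypothetical helper names, the French key is a simplified model of the traditional dictionary rule that accent differences are compared from the end of the word, and the Turkish helpers handle only the dotted/dotless i pair.

```python
import unicodedata


def french_key(word: str):
    """Sort key: base letters first, then accents compared from the END of the word.

    Assumption: a simplified model of traditional French dictionary ordering.
    """
    bases = []
    marks = []  # marks[i] collects the combining accents attached to bases[i]
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and marks:
            marks[-1] += ch
        else:
            bases.append(ch.lower())
            marks.append("")
    # Reversing the accent list makes the LAST accent difference decide first.
    return ("".join(bases), marks[::-1])


def turkish_upper(text: str) -> str:
    """Uppercase with the Turkish i rules: i -> U+0130 (dotted), U+0131 -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_capitalize(word: str) -> str:
    """Capitalize only the first letter, using the Turkish i rules."""
    return turkish_upper(word[:1]) + word[1:]


# The four words from the slide, forced to precomposed (NFC) form.
words = [unicodedata.normalize("NFC", w) for w in ["côté", "coté", "côte", "cote"]]

by_code_point = sorted(words)                    # naive: compares code points
by_french_rule = sorted(words, key=french_key)   # sketch of the French rule

print(by_code_point)
print(by_french_rule)
print("istiyorum".upper(), "vs", turkish_upper("istiyorum"))
```

The naive sort puts coté before côte, while the sketched key reproduces the order listed on the slide. In real code a collation library fed with locale data is the better choice.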

5 Cultural Conventions
- What does this date stand for?
    - 3/8/2006
- What is the currency symbol for Hungary?
- … linguistic characteristics of languages and cultural conventions – a locale
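The date question can be made concrete with a short Python sketch. It parses the string from the slide under two common conventions; the two format strings are assumptions chosen for illustration, not an exhaustive list of what locales use.

```python
from datetime import datetime

raw = "3/8/2006"

# Month/day/year, as commonly written in the United States.
us_reading = datetime.strptime(raw, "%m/%d/%Y").date()

# Day/month/year, as commonly written in much of Europe.
eu_reading = datetime.strptime(raw, "%d/%m/%Y").date()

print(us_reading.isoformat())
print(eu_reading.isoformat())
```

The same eight characters yield two different days, which is why the locale has to travel with the data.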

7 Internet Language Tags
- Used today: RFC 3066 (replaces RFC 1766)
    - Generative: ISO 639-1/2 language tag[-ISO 3166 country tag]
        - e.g. fr, en-US, ale-CA
    - Registered with IANA
        - e.g. no-nyo, zh-Hant
    - Exceptions
        - x-…
- Several problems
    - Dependency on ISO standards
    - No generative options for dialects etc.
    - RFC3066bis should solve this
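The tag structure on this slide can be sketched as a small Python parser. It is a simplified reading of the RFC 3066 shape (subtags of 1 to 8 ASCII letters or digits separated by hyphens, a two-letter second subtag read as an ISO 3166 country code, `x` for private use, `i` for IANA-registered tags). `LanguageTag` and `parse_tag` are hypothetical names, and the sketch does not check subtags against the ISO or IANA lists.

```python
from typing import NamedTuple, Optional, Tuple


class LanguageTag(NamedTuple):
    kind: str                  # "language", "private-use" or "iana-registered"
    primary: str               # first subtag, lowercased
    country: Optional[str]     # two-letter second subtag, uppercased, if present
    rest: Tuple[str, ...]      # remaining subtags, unchanged


def parse_tag(tag: str) -> LanguageTag:
    """Split a language tag into its parts (shape check only, no registry lookup)."""
    subtags = tag.split("-")
    for sub in subtags:
        if not (1 <= len(sub) <= 8 and sub.isascii() and sub.isalnum()):
            raise ValueError(f"malformed subtag {sub!r} in {tag!r}")
    primary = subtags[0].lower()
    if not primary.isalpha():
        raise ValueError(f"primary subtag must be alphabetic: {tag!r}")
    rest = subtags[1:]
    if primary == "x":
        return LanguageTag("private-use", primary, None, tuple(rest))
    if primary == "i":
        return LanguageTag("iana-registered", primary, None, tuple(rest))
    country = None
    if rest and len(rest[0]) == 2 and rest[0].isalpha():
        country = rest[0].upper()
        rest = rest[1:]
    return LanguageTag("language", primary, country, tuple(rest))


for example in ["fr", "en-US", "ale-CA", "zh-Hant", "x-sil-swg"]:
    print(example, "->", parse_tag(example))
```

Note that `zh-Hant` comes back with no country and `Hant` left in `rest`: under this scheme such a tag is only meaningful because it was registered as a whole, which is one of the problems the slide lists.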

8 SIL Ethnologue
- Cataloging all of the world’s 6,912 known living languages
    - http://www.ethnologue.com/
- Uses ISO/DIS 639-3 3-letter codes
    - E.g. Swabian dialect: x-sil-swg
- Hope for consolidation with RFC3066 or successor once 639-3 becomes full standard
- Not so well supported in programming frameworks

10 Types of Locale Data
- Date/time formats
- Number/currency formats
- Collation specification
    - For sorting and comparison
- Translated names for language, region, script, timezones, currencies, …
- Script and characters used by a language
- Measurement system
- Paper sizes
- …

11 Common Locale Data Repository
- “The purpose of the Common Locale Data Repository project is to provide a general XML format for the exchange of locale information for use in application and system development, and to gather, store, and make available a common set of locale data generated in that format.”
- http://www.unicode.org/cldr/

12 Common Locale Data Repository
- Collection/vetting process
    - Contributors add/modify data
    - Reviewed by committee
- Accessible over the web
    - Locale Data Markup Language XML format
    - E.g. http://unicode.org/cldr/data/common/main/fr.xml
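As a rough sketch of what working with LDML looks like, the Python below reads translated language names from a small fragment. The fragment is hand-written for illustration in the shape of a CLDR main file (it is not copied from fr.xml), and `language_names` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative fragment; a real CLDR file is much larger.
LDML_SAMPLE = """\
<ldml>
  <identity>
    <language type="fr"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="de">allemand</language>
      <language type="en">anglais</language>
    </languages>
    <territories>
      <territory type="DE">Allemagne</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def language_names(ldml_text: str) -> dict:
    """Map language codes to their display names in the file's own language."""
    root = ET.fromstring(ldml_text)
    path = "./localeDisplayNames/languages/language"
    return {el.get("type"): el.text for el in root.findall(path)}


# The <identity> block says which locale the data describes.
locale_id = ET.fromstring(LDML_SAMPLE).find("./identity/language").get("type")
names = language_names(LDML_SAMPLE)

print(locale_id, names)
```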

14 Frameworks: POSIX Locale
- Standard C/C++ library
    - LC_COLLATE – sorting/comparison
    - LC_CTYPE – behavior of character handling
    - LC_MONETARY – monetary formatting
    - LC_NUMERIC – numeric formatting
    - LC_TIME – date/time formatting
- Used in Un*x systems for command line functions too
- Results can be platform-dependent
- Stable, but feature set stuck in the 1980s
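Python's standard `locale` module wraps these C library categories, so the POSIX model can be sketched without leaving the standard library. The helper names `posix_locale`, `format_number` and `sort_words` are hypothetical. Because installed locales differ from system to system (the platform dependence noted on the slide), the sketch returns `None` when a locale is unavailable instead of assuming it exists.

```python
import locale
from contextlib import contextmanager


@contextmanager
def posix_locale(category, name):
    """Temporarily switch one locale category; yields False if `name` is not installed."""
    saved = locale.setlocale(category)          # query the current setting
    try:
        locale.setlocale(category, name)
    except locale.Error:
        yield False
        return
    try:
        yield True
    finally:
        locale.setlocale(category, saved)       # always restore the old setting


def format_number(name, value):
    """Format with LC_NUMERIC conventions (decimal point, digit grouping)."""
    with posix_locale(locale.LC_NUMERIC, name) as ok:
        return locale.format_string("%.2f", value, grouping=True) if ok else None


def sort_words(name, words):
    """Sort with LC_COLLATE rules via strxfrm."""
    with posix_locale(locale.LC_COLLATE, name) as ok:
        return sorted(words, key=locale.strxfrm) if ok else None


# "C" is always available; names like "de_DE.UTF-8" depend on the system.
print(format_number("C", 1234.56))
print(format_number("de_DE.UTF-8", 1234.56))
print(sort_words("C", ["b", "a", "c"]))
```

Note that `setlocale` changes process-wide state, so this pattern is not safe to use from several threads at once.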

15 Frameworks: ICU Library
- IBM Open Source project
- Developed originally for the Taligent OS project in the late 80s/early 90s
- Java and C++ APIs
- Extensive locale data and APIs to use it
    - http://www.icu-project.org/cgi-bin/locexp
- Also includes localization support
- Everybody (Mac OS X, Java, DB2, Mathworks …) is using it
- But …

16 Frameworks: Microsoft
- Windows NLS API
- Microsoft .NET Framework System.Globalization namespace
- Similar set of data to ICU
    - Vetted by subsidiaries
- APIs accessible from all MS programming languages
- Localization support in different API

17 Microsoft demos
- Culture Explorer
- Microsoft Transliteration Utility

18 Extensibility
- What if I don’t find the locale I need?
- What if I need to modify some of the data?
- ICU
    - Can create new locales
- Microsoft
    - .NET Framework v2.0: custom cultures
    - Windows Vista: custom locales
- LDML can be interchange format

20 Usages for Computational Linguistics
- Up to the imagination
    - Transliteration use in MT
    - Named Entity Recognition
    - … suggestions?
- Most importantly: Do not reinvent the wheel!
    - Check if API or data you need is available
- If possible write code in a language/locale-independent fashion
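One possible reading of "language/locale-independent" code, sketched in Python: compare tokens through Unicode normalization and case folding rather than ASCII-only lowercasing. `caseless_key` and `same_token` are hypothetical helper names, and the sketch deliberately applies no language-specific rules (so it will not, for example, apply the Turkish treatment of i).

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Key for caseless comparison: decompose, case-fold, decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def same_token(a: str, b: str) -> bool:
    """True if two tokens match ignoring case and composed/decomposed accents."""
    return caseless_key(a) == caseless_key(b)


print(same_token("Stra\u00dfe", "STRASSE"))   # sharp s folds to "ss"
print(same_token("caf\u00e9", "CAFE\u0301"))  # precomposed vs combining accent
print(same_token("cote", "c\u00f4te"))        # accents still distinguish words
```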

21 References
- RFC3066bis: http://www.inter-locale.com/ID/why-rfc3066bis.html
- Ethnologue: http://www.ethnologue.com/
- Common Locale Data Repository: http://www.unicode.org/cldr/
- POSIX Locale: http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
- ICU: http://icu.sourceforge.net/
- Microsoft: http://www.microsoft.com/globaldev/
- UNGEGN Working Group on Romanization Systems: http://www.eki.ee/wgrs/

