

Internationalization Using Locales Achim Ruopp

Agenda
- Working with multilingual data
- Language and locale identifiers
- Locale data
- Frameworks for locale support
- Ideas/discussion: how this could be used in compling

Not about character encoding
- Read Jeremy's slides from last quarter: http://students.washington.edu/jgk/talks/char-enc/char-encodings.pdf
- Use Unicode wherever possible

Internationalization: More than Encoding Text
- Where are the word breaks?
  คลิกปุ่มเมาส์ขวา
  Your balance is $ I think.
- How do I sort these words in French?
  cote   dimension
  côte   coast
  coté   with dimensions
  côté   side
- How do I uppercase this word in Turkish?
  istiyorum - İstiyorum
- How do I transcribe this text into Latin characters?
  인수문제를 - in'su'mun'je'reul'
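To make two of these questions concrete, here is a minimal Python sketch. It is an illustration only: `turkish_upper` and `french_sort_key` are hypothetical helper names, and the sort key assumes the slide has the traditional French dictionary rule in mind, where accent differences are compared starting from the end of the word. Real code should rely on a locale-aware library rather than hand-written rules like these.

```python
import unicodedata


def turkish_upper(word: str) -> str:
    """Uppercase a word using the Turkish dotted/dotless i pairs."""
    # Turkish pairs i with İ (U+0130) and ı (U+0131) with I,
    # so both are mapped by hand before the generic upper() runs.
    return word.replace("i", "\u0130").replace("\u0131", "I").upper()


def french_sort_key(word: str):
    """Sort key: base letters first, then accents compared from the word's end."""
    base = []
    accents = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            # Attach the combining mark to the preceding base letter.
            accents[-1] += ch
        else:
            base.append(ch.lower())
            accents.append("")
    return ("".join(base), tuple(reversed(accents)))


words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=french_sort_key))
print(turkish_upper("istiyorum"))
```

Under these assumptions the four French words come out in the same order the slide lists them, whereas a plain code-point sort would place coté before côte.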

Cultural Conventions
- What does this date stand for?
  3/8/2006
- What is the currency symbol for Hungary?
- … linguistic characteristics of languages and cultural conventions: a locale
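The date question can be shown in a few lines of Python. This sketch simply parses the same string under two conventions; which convention applies in a given locale is exactly the kind of information locale data is meant to supply, and the two format strings here are chosen by hand for illustration.

```python
from datetime import datetime

raw = "3/8/2006"

# Month-first reading (the usual convention in the United States).
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

# Day-first reading (the usual convention in much of Europe).
day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(month_first.isoformat())  # 2006-03-08
print(day_first.isoformat())    # 2006-08-03
```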


Internet Language Tags
- Used today: RFC 3066 (RFC 1766)
  - Generative: ISO 639-1/2 language tag[-ISO 3166 country tag]
    e.g. fr, en-US, ale-CA
  - Registered with IANA
    e.g. no-nyn, zh-Hant
  - Exceptions
    x-…
- Several problems
  - Dependency on ISO standards
  - No generative options for dialects etc.
  - RFC3066bis should solve this
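The tag categories above can be sketched as a small classifier. This is a rough illustration of the RFC 3066 shape rules only: `classify_tag` is a hypothetical helper, it does not check subtags against the actual ISO 639 or ISO 3166 code lists or the IANA registry, and it ignores third and later subtags.

```python
import re

# RFC 3066 shape: a primary subtag of 1-8 letters, followed by
# optional subtags of 1-8 letters or digits, separated by hyphens.
_TAG = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")


def classify_tag(tag: str) -> dict:
    """Classify a language tag by shape (no lookup in ISO or IANA lists)."""
    if not _TAG.match(tag):
        raise ValueError(f"not a well-formed RFC 3066 tag: {tag!r}")
    subtags = tag.split("-")
    primary = subtags[0].lower()
    if primary == "x":
        # Private-use tags, e.g. x-sil-swg.
        return {"language": None, "country": None, "kind": "private-use"}
    if primary == "i":
        # Whole tag is registered with IANA.
        return {"language": None, "country": None, "kind": "iana-registered"}
    country = None
    kind = "generative"
    if len(subtags) > 1:
        second = subtags[1]
        if len(second) == 2 and second.isalpha():
            # A two-letter second subtag is read as an ISO 3166 country code.
            country = second.upper()
        else:
            # Longer second subtags need an IANA registration.
            kind = "iana-registered"
    return {"language": primary, "country": country, "kind": kind}


for example in ("fr", "en-US", "ale-CA", "zh-Hant", "x-sil-swg"):
    print(example, classify_tag(example))
```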

SIL Ethnologue
- Catalogs all of the world's 6,912 known living languages
- Uses ISO/DIS letter codes
- E.g. Swabian dialect: x-sil-swg
- Hope for consolidation with RFC 3066 or its successor once it becomes a full standard
- Not so well supported in programming frameworks


Types of Locale Data
- Date/time formats
- Number/currency formats
- Collation specification
  - For sorting and comparison
- Translated names for languages, regions, scripts, time zones, currencies, …
- Script and characters used by a language
- Measurement system
- Paper sizes
- …

Common Locale Data Repository
“The purpose of the Common Locale Data Repository project is to provide a general XML format for the exchange of locale information for use in application and system development, and to gather, store, and make available a common set of locale data generated in that format.”

Common Locale Data Repository
- Collection/vetting process
  - Contributors add/modify data
  - Reviewed by committee
- Accessible over the web
  - Locale Data Markup Language (LDML) XML format
  - E.g. ain/fr.xml
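As a rough idea of what working with LDML looks like, the sketch below parses a small hand-written fragment with Python's standard XML library. The fragment is an abbreviated illustration in the style of a French locale file, not a copy of the CLDR data, and `LDML_SAMPLE` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated fragment in the style of an LDML locale file.
LDML_SAMPLE = """\
<ldml>
  <identity>
    <language type="fr"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="de">allemand</language>
      <language type="en">anglais</language>
    </languages>
    <territories>
      <territory type="HU">Hongrie</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""

root = ET.fromstring(LDML_SAMPLE)

# Which locale this file describes.
locale_id = root.find("identity/language").get("type")

# Translated display names, keyed by language or territory code.
language_names = {
    el.get("type"): el.text
    for el in root.iterfind("localeDisplayNames/languages/language")
}
territory_names = {
    el.get("type"): el.text
    for el in root.iterfind("localeDisplayNames/territories/territory")
}

print(locale_id, language_names, territory_names)
```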


Frameworks: POSIX Locale
- Standard C/C++ library
  - LC_COLLATE: sorting/comparison
  - LC_CTYPE: behavior of character handling
  - LC_MONETARY: monetary formatting
  - LC_NUMERIC: numeric formatting
  - LC_TIME: date/time formatting
- Used in Un*x systems for command line functions too
- Results can be platform-dependent
- Stable, but feature set stuck in the 1980s
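Python's standard `locale` module wraps these C library categories, so the LC_COLLATE behavior can be sketched as follows. `sort_with_locale` is a hypothetical helper, and the locale name `fr_FR.UTF-8` is an assumption: whether it is installed, and what order it produces, depends on the platform, which is the point the slide makes.

```python
import locale


def sort_with_locale(words, locale_name):
    """Sort words under the given LC_COLLATE locale, then restore the old one."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key that compares per LC_COLLATE.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


words = ["cote", "côté", "coté", "côte"]
for name in ("C", "fr_FR.UTF-8"):
    try:
        print(name, sort_with_locale(words, name))
    except (locale.Error, OSError):
        # The locale may be missing, or the C library may reject the input.
        print(name, "is not usable on this system")
```

Because `setlocale` changes process-wide state, the helper restores the previous setting in a `finally` block.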

Frameworks: ICU Library
- IBM Open Source project
- Developed originally for the Taligent OS project in the late 80s/early 90s
- Java and C++ APIs
- Extensive locale data and APIs to use it
- Also includes localization support
- Everybody (Mac OS X, Java, DB2, Mathworks …) is using it
- But …

Frameworks: Microsoft
- Windows NLS API
- Microsoft .NET Framework System.Globalization namespace
- Similar set of data to ICU
  - Vetted by subsidiaries
- APIs accessible from all MS programming languages
- Localization support in different API

Microsoft demos
- Culture Explorer
- Microsoft Transliteration Utility

Extensibility
- What if I don't find the locale I need?
- What if I need to modify some of the data?
- ICU
  - Can create new locales
- Microsoft
  - .NET Framework v2.0: custom cultures
  - Windows Vista: custom locales
- LDML can be the interchange format


Usages for Computational Linguistics
- Up to the imagination
  - Transliteration use in MT
  - Named Entity Recognition
  - … suggestions?
- Most importantly: Do not reinvent the wheel!
  - Check if the API or data you need is available
- If possible, write code in a language/locale-independent fashion

References
- RFC3066bis
- Ethnologue
- Common Locale Data Repository
- Posix Locale: bd_chap07.html
- ICU
- Microsoft
- UNGEGN Working Group on Romanization Systems