New in Unicode. Mark Davis, John Jenkins. Agenda: Unicode 4.1.0; UCA 4.1.0; Regular Expressions; Security Considerations; Character Mapping; Common Locale Data.

Presentation transcript:

New in Unicode Mark Davis, John Jenkins

Agenda: Unicode 4.1.0; UCA 4.1.0; Regular Expressions; Security Considerations; Character Mapping; Common Locale Data Repository; Expanded Role for Consortium.

Unicode 4.1.0. Released 2005 March 31. New characters; new Unicode Character Database; new specifications.

1,273 New Characters. Roundtripping for HKSCS and GB; five new currency signs; additional characters for Indic and Korean; eight new scripts.

Changes in the Standard. Conformance changes: modifications to default case operations; clarification of decomposition mappings. Other changes: SPACE not recommended as base for nonspacing marks; use of CGJ to prevent reordering and to prevent contractions in sorting/matching (UCA); positioning of Meteg; rendering of Thai combining marks.
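To illustrate the CGJ point, here is a minimal sketch using Python's standard `unicodedata` module. It assumes only that COMBINING GRAPHEME JOINER (U+034F) has canonical combining class 0, so canonical reordering cannot move marks across it; the sample marks are chosen for illustration and are not taken from the slides.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, combining class 0
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

# Without CGJ, normalization sorts the marks by combining class.
plain = "a" + DIAERESIS + DOT_BELOW
reordered = unicodedata.normalize("NFD", plain)

# With CGJ between the marks, the original order is kept.
guarded = "a" + DIAERESIS + CGJ + DOT_BELOW
preserved = unicodedata.normalize("NFD", guarded)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in preserved])
```

The first list shows the dot below moved in front of the diaeresis; the second keeps the marks in the order they were typed.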

Unicode Character Database. Determines the behavior of characters in modern software: alphabetics, letters, numbers, identifiers, scripts, … New properties: Grapheme_Cluster_Break, Sentence_Break, Word_Break, Pattern_Syntax, and Pattern_White_Space. Revised property values, e.g. Alphabetic (Lowercase, Uppercase). Expanded documentation. Each release now complete, not delta.
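As a rough illustration of how software consumes UCD properties, the sketch below queries a few of them through Python's standard `unicodedata` module. The helper name `describe` is hypothetical, and this module exposes only a small subset of the database; it does not provide the newer segmentation properties such as Word_Break.

```python
import unicodedata

def describe(ch):
    """Collect a few UCD-derived properties for one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),    # General_Category
        "combining": unicodedata.combining(ch),  # Canonical_Combining_Class
        "decimal": unicodedata.decimal(ch, None),
    }

for ch in ("A", "\u0663", "\u0301"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

The Python build determines which Unicode version these values come from; `unicodedata.unidata_version` reports it.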

New Specifications. UAX #31: Identifier and Pattern Syntax. Basis for backwards-compatible identifiers (programming languages, resources and services) and for stable syntax characters (whitespace, operators). UAX #34: Unicode Named Character Sequences. Mechanism for identifying/naming significant sequences; standardized list.
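One concrete use of UAX #31-style identifiers: Python 3 defines its own identifier syntax in terms of the XID_Start and XID_Continue properties, so `str.isidentifier()` gives a quick feel for what such a definition accepts. This is only a sketch of one language's profile, not the full UAX #31 specification, and the candidate strings are made up for illustration.

```python
# Python's identifier rules are based on XID_Start / XID_Continue.
candidates = ["total", "n\u00famero", "\u03c0", "2fast", "a b", "x-y"]

results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {'identifier' if ok else 'not an identifier'}")
```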

Major Revisions in Annexes. UAX #15: Unicode Normalization Forms: correction for idempotency problem; enhanced discussion of Hangul. UAX #14: Line Breaking Properties: modifications for Hangul; changes because SPACE is not recommended as base for nonspacing marks; all suggested tailorings moved into a separate section. UAX #29: Text Boundaries: using new properties, adding Joiner/Non-Joiner; modifications to Word_Break.
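The idempotency requirement mentioned for UAX #15 means that normalizing already-normalized text must not change it. A minimal check with Python's `unicodedata.normalize` might look like the sketch below; the helper `is_idempotent` and the sample strings are illustrative assumptions, and the samples are not the specific sequences the correction addressed.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_idempotent(text, form):
    """True if applying the normalization form twice equals applying it once."""
    once = unicodedata.normalize(form, text)
    return unicodedata.normalize(form, once) == once

samples = [
    "\u00e9",        # precomposed e with acute
    "e\u0301",       # e followed by combining acute
    "\ufb01",        # LATIN SMALL LIGATURE FI (compatibility character)
    "\u1100\u1161",  # Hangul jamo that compose to one syllable
]

all_ok = all(is_idempotent(s, f) for s in samples for f in FORMS)
hangul_nfc = unicodedata.normalize("NFC", "\u1100\u1161")
print(all_ok, f"U+{ord(hangul_nfc):04X}")
```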

UTS #10: Unicode Collation Algorithm. Basis for language-sensitive sorting, searching, and matching. Synchronized with Unicode 4.1.0. New: characters; revised weights. Specification: matching, ignorables, Thai, …
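The idea of multi-level comparison behind UCA-style sorting can be sketched with a toy key function. This is explicitly not the Unicode Collation Algorithm: it uses no collation element table, and the function `toy_sort_key` is a hypothetical stand-in that only mimics the ordering of concerns (base letters first, then accents, then case).

```python
import unicodedata

def toy_sort_key(s):
    """Three-level key: base letters, then accents, then case (toy only)."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    primary = base.casefold()                                   # letters only
    secondary = tuple(c for c in decomposed if unicodedata.combining(c))
    tertiary = tuple(c.isupper() for c in base)                 # lowercase first
    return (primary, secondary, tertiary)

words = ["roles", "r\u00f4le", "Role", "role"]
ordered = sorted(words, key=toy_sort_key)
print(ordered)
```

A plain code point sort would place "Role" before "role" and "rôle" after "roles"; the layered key keeps the variants of "role" together.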

UTS #18: Unicode Regular Expressions. Regular expressions are used widely in programs for matching patterns (e.g. wildcards). Unicode expands the scope drastically. Explicit conformance clauses. POSIX conformance.
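A small example of how Unicode widens the scope of a familiar pattern: in Python's standard `re` module, `\w` on `str` patterns matches word characters from any script unless `re.ASCII` is set. This sketch shows only that one behaviour; the standard `re` module does not implement everything UTS #18 describes (for instance, it has no `\p{...}` property syntax).

```python
import re

text = "na\u00efve caf\u00e9 \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac 42"

# Default for str patterns: \w covers Unicode letters and digits.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, the same pattern only sees [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```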

UTR #36: Unicode Security Considerations. Incorrect usage of Unicode can expose programs or systems to possible security attacks! Examples:

Numbers: a string mixing a Bengali digit and an Oriya digit = 42!

Domain names:

ID  String     UTF-16                                    IDNA
1a  ät.com    0061 0308 0074 002E 0063 006F 006D        xn--t-zfa.com
1b  ät.com     00E4 0074 002E 0063 006F 006D             xn--t-zfa.com
2a  tοp.com    0074 03BF 0070 002E 0063 006F 006D        xn--tp-jbc.com
2b  top.com    0074 006F 0070 002E 0063 006F 006D        top.com
4a  so̷s.com   0073 006F 0337 0073 002E 0063 006F 006D   xn--sos-rjc.com
4b  søs.com    0073 00F8 0073 002E 0063 006F 006D        xn--ss-lka.com
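The spoofing examples above can be reproduced with Python's built-in `idna` codec (an IDNA 2003 implementation) and `int()`. This is a sketch for illustration: the code points are written as escapes so the difference is visible in source, and the digit string assumes the slide's number example used BENGALI DIGIT FOUR and ORIYA DIGIT TWO, which the transcript does not show.

```python
# Two names that look alike on screen but differ in one code point.
latin = "top.com"
mixed = "t\u03bfp.com"  # GREEK SMALL LETTER OMICRON in place of Latin o

latin_ace = latin.encode("idna")
mixed_ace = mixed.encode("idna")

# Composed and decomposed spellings map to the same name, because the
# codec normalizes labels before Punycode conversion.
composed_ace = "\u00e4t.com".encode("idna")
decomposed_ace = "a\u0308t.com".encode("idna")

# Assumed digits: BENGALI DIGIT FOUR followed by ORIYA DIGIT TWO.
mixed_digits = "\u09ea\u0b68"
value = int(mixed_digits)

print(latin_ace, mixed_ace)
print(composed_ace, decomposed_ace)
print(value)
```

Comparing the ASCII-compatible forms rather than the displayed strings is one way to see that `latin` and `mixed` are different names.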

Character Mapping ML. XML format for the interchange of mapping data for character encodings and aliases. Promoted to Unicode Technical Standard, with new Conformance section (2). Added explicit text about multi-character mappings.
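To make "mapping data for character encodings and aliases" concrete, the sketch below uses Python's built-in codecs, which embody the same kind of byte-to-code-point tables and alias lists. It does not read or write the Character Mapping ML format itself; the encodings chosen are only examples.

```python
import codecs

# The same character sits at different byte values in different encodings.
euro_cp1252 = b"\x80".decode("cp1252")
euro_latin9 = b"\xa4".decode("iso8859-15")

# Aliases: different labels resolve to one underlying mapping table.
same_codec = codecs.lookup("latin1").name == codecs.lookup("ISO-8859-1").name

# Round trip from Unicode back to the legacy byte.
roundtrip = "\u20ac".encode("cp1252")

print(f"U+{ord(euro_cp1252):04X}", f"U+{ord(euro_latin9):04X}", same_codec, roundtrip)
```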

Common Locale Data Repository. Common, necessary software locale data for world languages; XML format for effective interchange. Sample data: Δευτέρα, 05 Σεπτεμβρίου 2005; Montag, 5. September; ,57 руб.; Arabic – arabski; Bulgarian – bułgarski; Czech – czeski; …; Africa –; Central America –; Eastern Africa –; Northern Africa –; …; AED – د. إ.; BHD – د. ب.; DZD – د. ج.; EGP – ج. م.; EUR – …; Z < Å

Typical Locale Data. Dates/time formats; number/currency formats; measurement systems; collation specifications (UCA-based), used for sorting, searching, matching; tailorings of translated names for language, territory, script, timezones, currencies, …
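A toy sketch of what locale data of this kind drives: the same date rendered through per-locale names and patterns. The table `LOCALE_DATA`, its keys, and the function `format_date` are hypothetical and hold only the few entries needed here; real CLDR data is far richer and is interchanged as XML.

```python
import datetime

# Hypothetical, hand-written fragment of locale data (not real CLDR files).
LOCALE_DATA = {
    "de": {
        "days": {0: "Montag"},
        "months": {9: "September"},
        "pattern": "{day_name}, {d}. {month_name} {y}",
    },
    "el": {
        "days": {0: "\u0394\u03b5\u03c5\u03c4\u03ad\u03c1\u03b1"},
        "months": {9: "\u03a3\u03b5\u03c0\u03c4\u03b5\u03bc\u03b2\u03c1\u03af\u03bf\u03c5"},
        "pattern": "{day_name}, {d:02d} {month_name} {y}",
    },
}

def format_date(date, locale):
    """Render a date with the names and pattern stored for one locale."""
    data = LOCALE_DATA[locale]
    return data["pattern"].format(
        day_name=data["days"][date.weekday()],  # Monday is 0
        d=date.day,
        month_name=data["months"][date.month],
        y=date.year,
    )

day = datetime.date(2005, 9, 5)
german = format_date(day, "de")
greek = format_date(day, "el")
print(german)
print(greek)
```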

Latest Release: CLDR. Locales: 96 languages, 130 territories. Languages: Afar [Qafar]; Afrikaans; Albanian [shqipe]; Amharic; Arabic [العربية]; Armenian [Հայերէն]; … Territories: Afghanistan [افغانستان]; Albania [Shqipëria]; Algeria [الجزائر]; Argentina; Armenia [Հայաստանի Հանրապետութիւն]; Australia; Austria [Österreich]; Azerbaijan [Azərbaycan, Азәрбајҹан]; … Complete set of generated POSIX-format data, plus a tool to generate versions tuned for different platforms. Expanded locale data: timezone localizations; UN M.49 continents and regions; many other revisions and additions of data. New tests and tools.

Expanded Role for Consortium. Dedicated to the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes. Providing the fundamental specifications for full software globalization and full interoperability.

Full Members

Institutional & Supporting Members (New Membership Categories)

Associate Members

Liaison Members. Center of Computer and Information Development (CCID), Beijing, China; High Council of Informatics (HCI), Iran; Information and Communication Technology Agency of Sri Lanka (ICTA); The International Forum for Information Technology in Tamil (INFITT); The Internet Engineering Task Force (IETF); ISO/IEC JTC1/SC2 and WG2; Linguistic Society of America (LSA); National Endowment for the Humanities (NEH); National Information Standards Organization (NISO); NSAI/ICTSCC/SC4: Irish standardization: Codes, Character Sets, and Internationalization; Open I18n.org: The Free Standards Group Open Internationalization Initiative; Research Institute for ILCAA, Tokyo University of Foreign Studies; Research Institute for the Languages of Finland (RILF); Special Libraries Association (SLA); Technical Committee on Information Technology (TCVN/TC1), Hanoi, Viet Nam; United Nations Group of Experts on Geographical Names (UNGEGN); World Wide Web Consortium - W3C I18N Core Working Group.

Unicode Technical Committee. Multiple globalization standards: the Unicode Standard, including UAXes; Unicode Technical Standards (Collation, …); Unicode Technical Notes (best practices, background information). Quarterly F2F meetings; discussion.

CLDR Technical Committee. Meetings: short, frequent (telecon + instant messaging). Discussion. Data: all additions/revisions in bug database; anyone can file; committee assesses, vets.

Why Join? Support the technology that enables your success in international, technical, and emerging markets. Protect your investment: the stability you need; the extensions you require; the developments you call for (security, …). Demonstrate your leadership for the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.