New in Unicode Mark Davis, John Jenkins. Agenda Unicode 4.1.0 UCA 4.1.0 Regular Expressions Security Considerations Character Mapping Common Locale Data.

New in Unicode Mark Davis, John Jenkins

Agenda Unicode 4.1.0 UCA 4.1.0 Regular Expressions Security Considerations Character Mapping Common Locale Data Repository Expanded Role for Consortium

Unicode 4.1.0
- Released 2005 March 31
- New Characters
- New Unicode Character Database
- New Specifications

1,273 New Characters
- Roundtripping for HKSCS and GB 18030
- Five new currency signs
- Additional characters for Indic and Korean
- Eight new scripts

Changes in the Standard
- Conformance Changes
  - Modifications to Default Case Operations
  - Clarification of Decomposition Mappings
- Other Changes
  - SPACE not recommended as base for nonspacing marks
  - Use of CGJ (COMBINING GRAPHEME JOINER) to prevent reordering, and to prevent contractions in sorting/matching (UCA)
  - Positioning of Meteg
  - Rendering of Thai Combining Marks
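The CGJ item can be illustrated with a small sketch using Python's standard `unicodedata` module. It assumes a base letter followed by two combining marks of different combining classes; the variable names are illustrative only. Normalization reorders the marks unless a CGJ (combining class 0) sits between them.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, canonical combining class 0
DIAERESIS = "\u0308"  # combining class 230 (above)
DOT_BELOW = "\u0323"  # combining class 220 (below)


def code_points(text):
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


plain = "a" + DIAERESIS + DOT_BELOW
guarded = "a" + DIAERESIS + CGJ + DOT_BELOW

# Canonical ordering sorts adjacent marks by combining class,
# so the dot below (220) moves in front of the diaeresis (230).
reordered = unicodedata.normalize("NFD", plain)

# The CGJ has class 0 and blocks the reordering across it.
kept = unicodedata.normalize("NFD", guarded)

print(code_points(plain), "->", code_points(reordered))
print(code_points(guarded), "->", code_points(kept))
```

The same blocking effect is what allows CGJ to keep a sequence from being treated as a contraction in collation, although that part is not shown here.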

Unicode Character Database
- Determines the behavior of characters in modern software: Alphabetics, Letters, Numbers, Identifiers, Scripts, …
- New properties: Grapheme_Cluster_Break, Sentence_Break, Word_Break, Pattern_Syntax, and Pattern_White_Space
- Revised property values, e.g. Alphabetic ⊇ (Lowercase ∪ Uppercase)
- Expanded documentation
- Each release now complete, not delta
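As a rough illustration of how software consults the Unicode Character Database, the sketch below reads a few properties through Python's standard `unicodedata` module. The module reflects whichever Unicode version the interpreter ships with, not 4.1.0, and it does not expose the newer segmentation properties such as Word_Break. The helper name `describe` is an assumption made for this example.

```python
import unicodedata


def describe(ch):
    """Collect a few UCD properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
    }


# Unicode version of the database bundled with this interpreter.
print("UCD version:", unicodedata.unidata_version)

for ch in ("A", "a", "5", "\u0308", "\u09ea"):
    print(describe(ch))
```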

New Specifications
- UAX #31: Identifier and Pattern Syntax
  - Basis for Backwards-Compatible Identifiers
    - Programming Languages
    - Resources and Services
  - Basis for Stable Syntax characters
    - Whitespace
    - Operators
- UAX #34: Unicode Named Character Sequences
  - Mechanism for identifying/naming significant sequences
  - Standardized list
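One concrete programming-language use of the identifier rules: Python 3 bases its own identifier syntax on UAX #31 (with its own additions, such as allowing a leading underscore), and exposes the check as `str.isidentifier()`. The sketch below only shows Python's profile, not the generic UAX #31 definition, and the sample strings are made up for illustration.

```python
# Candidate names; the non-ASCII ones are written with escapes so the
# exact code points are unambiguous.
candidates = [
    "total",      # plain ASCII
    "caf\u00e9",  # Latin letter with acute accent
    "\u03c0",     # GREEK SMALL LETTER PI
    "2nd",        # starts with a digit
    "a-b",        # contains an operator character
    "a b",        # contains whitespace
]

results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(f"{name!r:10} identifier: {ok}")
```

Letters from any script are accepted, while whitespace and operator characters are kept out of identifiers. That separation is what the Pattern_Syntax and Pattern_White_Space properties are meant to keep stable.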

Major Revisions in Annexes
- UAX #15: Unicode Normalization Forms
  - Correction for Idempotency Problem
  - Enhanced discussion of Hangul
- UAX #14: Line Breaking Properties
  - Modifications for Hangul
  - Changes because SPACE not recommended as base for nonspacing marks
  - Moved all suggested tailorings into a separate section
- UAX #29: Text Boundaries
  - Using new properties, adding Joiner/Non-Joiner
  - Modifications to Word Break
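The Hangul and idempotency items can be made concrete with a short sketch. It uses the arithmetic Hangul syllable decomposition with the constants defined in the Unicode Standard and compares the result with the standard `unicodedata` module. The function name `decompose_hangul` is an assumption for this example.

```python
import unicodedata

# Constants of the algorithmic Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT       # 11172 precomposed syllables


def decompose_hangul(syllable):
    """Decompose one precomposed Hangul syllable into conjoining jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a precomposed Hangul syllable
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trailing = T_BASE + s_index % T_COUNT
    jamo = chr(leading) + chr(vowel)
    if trailing != T_BASE:  # T_BASE itself means "no trailing consonant"
        jamo += chr(trailing)
    return jamo


han = "\ud55c"  # HANGUL SYLLABLE HAN
print([f"{ord(ch):04X}" for ch in decompose_hangul(han)])

# Normalizing an already normalized string must not change it.
once = unicodedata.normalize("NFC", decompose_hangul(han))
twice = unicodedata.normalize("NFC", once)
print(once == twice == han)
```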

UTS #10: Unicode Collation Algorithm
- Basis for language-sensitive sorting, searching, and matching
- Synchronized with Unicode 4.1.0
- New:
  - Characters
  - Revised Weights
  - Specification: matching, ignorables, Thai, …
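To show why language-sensitive sorting needs more than code point order, here is a deliberately simplified multi-level sort key. It is not the Unicode Collation Algorithm and uses no DUCET weights. It only imitates the idea of comparing base letters first, accents second, and case third. The function `sketch_sort_key` and the word list are assumptions made for this example.

```python
import unicodedata


def sketch_sort_key(text):
    """Toy three-level key: base letters, then accents, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and primary:
            # Attach the combining mark to the preceding base letter.
            secondary[-1] += ch
        else:
            primary.append(ch.lower())
            secondary.append("")
            tertiary.append(ch.isupper())
    return (tuple(primary), tuple(secondary), tuple(tertiary))


words = ["r\u00e9sum\u00e9", "Resume", "rest", "resume"]

by_code_point = sorted(words)
by_levels = sorted(words, key=sketch_sort_key)

print("code point order:", by_code_point)
print("multi-level order:", by_levels)
```

Code point order puts the capitalized word first and separates the accented variant from its unaccented twin. The multi-level key keeps the three spellings together and orders them by accent and case only as tie-breakers.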

UTS #18: Unicode Regular Expressions
- Regular expressions used widely in programs, for matching patterns (e.g. wildcards)
- Unicode expands the scope drastically
- Explicit Conformance Clauses
- POSIX-Conformance
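A small sketch of how Unicode widens the scope of familiar regular expression classes, using Python's standard `re` module. Note that `re` does not implement the property syntax (`\p{...}`) described in UTS #18. The example only contrasts Unicode-aware and ASCII-only interpretations of `\w` and `\d`, and the sample text is made up.

```python
import re

# "naive" with i-diaeresis, a Greek word, and "x" followed by a
# Bengali digit four and an Oriya digit two.
text = (
    "na\u00efve "
    "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac "
    "x\u09ea\u0b68"
)

# For str patterns, \w and \d are Unicode-aware by default in Python 3.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d+", text)

# re.ASCII restricts the same classes to the ASCII range.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(unicode_digits)
print(ascii_words)
```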

UTR #36: Unicode Security Considerations
- Incorrect usage of Unicode can expose programs or systems to possible security attacks!
- Examples:
  - Numbers: a digit string can have the value 42 without looking like it, for example by mixing Bengali and Oriya digits.
  - Domain Names:

| # | String | UTF-16 | Internal (IDNA) |
|---|---|---|---|
| 1a | ät.com | 0061 0308 0074 002E 0063 006F 006D | xn--t-zfa.com |
| 1b | ät.com | 00E4 0074 002E 0063 006F 006D | xn--t-zfa.com |
| 2a | tοp.com | 0074 03BF 0070 002E 0063 006F 006D | xn--tp-jbc.com |
| 2b | top.com | 0074 006F 0070 002E 0063 006F 006D | top.com |
| 4a | so̷s.com | 0073 006F 0337 0073 002E 0063 006F 006D | xn--sos-rjc.com |
| 4b | søs.com | 0073 00F8 0073 002E 0063 006F 006D | xn--ss-lka.com |
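The domain-name rows can be checked with Python's built-in `idna` codec, which implements the 2003 version of IDNA that was current when these slides were written. The sketch below is only an illustration of the spoofing problem, not a defense against it. The dictionary keys mirror the row labels in the slide, and the helper `to_ascii` is a name chosen for this example.

```python
# Host names from the slide, written with escapes so the exact code
# points are unambiguous even where the glyphs look identical.
hosts = {
    "1a": "a\u0308t.com",   # a + COMBINING DIAERESIS
    "1b": "\u00e4t.com",    # precomposed a-diaeresis
    "2a": "t\u03bfp.com",   # GREEK SMALL LETTER OMICRON
    "2b": "top.com",        # LATIN SMALL LETTER O
    "4a": "so\u0337s.com",  # o + COMBINING SHORT SOLIDUS OVERLAY
    "4b": "s\u00f8s.com",   # LATIN SMALL LETTER O WITH STROKE
}


def to_ascii(host):
    """Convert a host name to its ASCII (Punycode) form via IDNA 2003."""
    return host.encode("idna").decode("ascii")


for label, host in hosts.items():
    print(label, to_ascii(host))

# The number example: Python's int() accepts decimal digits from any
# script, so a Bengali four followed by an Oriya two parses as 42.
mixed_digits = "\u09ea\u0b68"
value = int(mixed_digits)
print(value)
```

Rows 1a and 1b collapse to the same name because normalization unifies them. Rows 2a/2b and 4a/4b look alike on screen but map to different names, which is the spoofing risk.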

Character Mapping ML
- XML format for the interchange of mapping data for character encodings and aliases
- Promoted to Unicode Technical Standard, with new Conformance section (2)
- Added explicit text about multi-character mappings

Common Locale Data Repository
- Common, necessary software locale data for world languages
- XML format for effective interchange
- Δευτέρα, 05 Σεπτεμβρίου 2005
- Montag, 5. September 2005
- 1,234.57
- 1 234,57руб.
- Arabic – arabski; Bulgarian – bułgarski; Czech – czeski; …
- Africa – ; Central America – ; Eastern Africa – ; Northern Africa – ; …
- AED – د. إ. BHD – د. ب. DZD – د. ج. EGP – ج. م. EUR – …
- Z < Å

Typical Locale Data
- Date/time formats
- Number/Currency formats
- Measurement Systems
- Collation Specifications (UCA-based)
  - Used for sorting, searching, matching
- Tailorings of translated names for language, territory, script, timezones, currencies, …
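A minimal sketch of how an application might apply locale-specific number formats. The separator table below is hypothetical and hand-written for this example. It is not read from CLDR, and it covers only the two sample numbers quoted in the CLDR slide (1,234.57 and 1 234,57). Real CLDR data is richer, for instance in its choice of space characters and its currency patterns.

```python
# Hypothetical separator table: locale id -> (grouping, decimal).
SEPARATORS = {
    "en": (",", "."),
    "ru": (" ", ","),
}


def format_number(value, locale_id):
    """Format a number with two decimals using the table above."""
    group, decimal = SEPARATORS[locale_id]
    # Start from the English-style layout, then swap the separators,
    # using a placeholder so the two replacements do not collide.
    english = f"{value:,.2f}"
    return (
        english.replace(",", "\x00")
        .replace(".", decimal)
        .replace("\x00", group)
    )


print(format_number(1234.567, "en"))
print(format_number(1234.567, "ru"))
```

Keeping such tables in a shared repository, rather than hand-writing them per application as done here, is the gap that a common locale data format is meant to close.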

Latest Release: CLDR 1.3
- 296 locales: 96 languages, 130 territories
  - Languages: Afar [Qafar]; Afrikaans; Albanian [shqipe]; Amharic []; Arabic [ العربية ]; Armenian [ Հայերէն ]; …
  - Territories: Afghanistan [ افغانستان ]; Albania [Shqipëria]; Algeria [الجزائر ]; Argentina; Armenia [ Հայաստանի Հանրապետութիւն ]; Australia; Austria [Österreich]; Azerbaijan [Azərbaycan, Азәрбајҹан]; …
- Complete set of generated POSIX-format data
  - Plus tool to generate versions tuned for different platforms
- Expanded locale data
  - Timezone localizations
  - Including UN M.49 continents and regions
  - Many other revisions and additions of data
- New Tests & Tools

Expanded Role for Consortium
- Dedicated to the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes
- Providing the fundamental specifications for full software globalization, full interoperability

Full Members

Institutional & Supporting Members (New Membership Categories)

Associate Members

Liaison Members
- Center of Computer and Information Development (CCID), Beijing, China
- High Council of Informatics (HCI), Iran
- Information and Communication Technology Agency of Sri Lanka (ICTA)
- The International Forum for Information Technology in Tamil (INFITT)
- The Internet Engineering Task Force (IETF)
- ISO/IEC JTC1/SC2 and WG2
- Linguistic Society of America (LSA)
- National Endowment for the Humanities (NEH)
- National Information Standards Organization (NISO)
- NSAI/ICTSCC/SC4: Irish standardization: Codes, Character Sets, and Internationalization
- Open I18n.org: The Free Standards Group Open Internationalization Initiative
- Research Institute for ILCAA, Tokyo University of Foreign Studies
- Research Institute for the Languages of Finland (RILF)
- Special Libraries Association (SLA)
- Technical Committee on Information Technology (TCVN/TC1), Hanoi, Viet Nam
- United Nations Group of Experts on Geographical Names (UNGEGN)
- World Wide Web Consortium - W3C I18N Core Working Group

Unicode Technical Committee
- Multiple Globalization Standards
  - The Unicode Standard, including UAXes
  - Unicode Technical Standards: Collation, …
  - Unicode Technical Notes: Best Practices, Background Information
- Quarterly F2F Meetings
- Email Discussion

CLDR Technical Committee
- Meetings
  - Short, frequent: Telecon + Instant Messaging
  - Email Discussion
- Data
  - All additions / revisions in bug database
  - Anyone can file; committee assesses, vets

Why Join?
- Support the technology
  - That enables your success in international, technical, and emerging markets
- Protect your investment
  - The stability you need
  - The extensions you require
  - The developments you call for: security, …
- Demonstrate your leadership
  - For the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes
