Unicode 4.0 Mark Davis President, The Unicode Consortium.


1 Unicode 4.0 Mark Davis President, The Unicode Consortium

2 Schedule 2003, April: UCD/UAXes Final data files available Implementation can proceed 2003, September: Book Available

3 New Characters: 1,228. Modern Scripts: (additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac; (minority scripts) Limbu, Tai Le, Osmanya. Historic Scripts: Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers. Symbols: monograms, digrams, tetragrams, other symbols; modifier & combining characters.

4 New Characters (cont.) Special Characters: additional variation selectors (for future CJK variants), double diacritics for dictionary use. For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts. Character repertoire corresponds to ISO/IEC 10646:2003.

5 Conformance Substantially improved specification of conformance requirements. Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes. Tightened definitions of UTF-8, UTF-16, UTF-32. Separate definition of Unicode String. Clarified conformance status of Unicode Standard Annexes. Formal definitions of properties & algorithms. Provisional properties: draft, NRFPT.

6 UTF vs Unicode String. UTF: unique representation for each code point; all else illegal (e.g. C0 80; D800 0061). Unicode String: sequence of code units; for internal processing, not interchange; not necessarily valid UTF (e.g. C0 A0; D800 0061).
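
The distinction on this slide can be checked with a strict decoder. The following is a minimal sketch, assuming Python's built-in `utf-8` and `utf-16-be` codecs as the validators; the helper names `is_valid_utf8` and `is_valid_utf16` are made up for illustration and are not part of the standard.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict decoder accepts data as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf16(code_units) -> bool:
    """Return True if the given 16-bit code units are accepted as UTF-16."""
    raw = b"".join(unit.to_bytes(2, "big") for unit in code_units)
    try:
        raw.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False


# The byte and code unit sequences quoted on the slide.
print(is_valid_utf8(bytes([0xC0, 0x80])))   # C0 is never a legal UTF-8 lead byte
print(is_valid_utf8(bytes([0xC0, 0xA0])))
print(is_valid_utf16([0xD800, 0x0061]))     # high surrogate with no low surrogate
print(is_valid_utf16([0xD801, 0xDC00]))     # a properly paired surrogate
```

A program may still hold sequences like `D800 0061` in memory as a Unicode string of code units; the checks above only matter at the point where the data is claimed to be a UTF for interchange.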

7 Conformance (cont.) Formalized policies for stability of the standard. Clarification of semantics of important characters, including BOM. Revised scope of enclosing combining marks. Revised semantics of ZWJ for cursive scripts. Normalization Corrections: U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF.

8 Textual Clarifications Major changes to Chapters 2, 3, 6, 14 and 15. Definitive terminology for code points: graphic, format, control, private-use = assigned characters; surrogate, noncharacter, reserved: not characters. Substantial improvements to many character block descriptions, especially Indic.
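
The seven code point types named on this slide can be approximated from the General_Category property. This is a rough sketch, assuming Python's `unicodedata` module (whose data reflects whatever Unicode version ships with the interpreter, not 4.0); the function name `code_point_type` is hypothetical.

```python
import unicodedata


def code_point_type(cp: int) -> str:
    """Classify a code point into one of the seven types listed on the slide."""
    cat = unicodedata.category(chr(cp))
    if cat == "Cs":
        return "surrogate"
    if cat == "Co":
        return "private-use"
    if cat == "Cc":
        return "control"
    if cat == "Cn":
        # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return "noncharacter"
        return "reserved"
    if cat in ("Cf", "Zl", "Zp"):
        return "format"
    return "graphic"  # letters, marks, numbers, punctuation, symbols, spaces


for cp in (0x0041, 0x200D, 0x0009, 0xE000, 0xD800, 0xFFFE, 0x0378):
    print(f"U+{cp:04X}", code_point_type(cp))
```

The "reserved" result for a given code point can change as later versions assign characters, which is why the answer depends on the Unicode data in use.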

9 Programming language identifiers Now backwards-compatible: once a Unicode identifier, always a Unicode identifier. Alternate definition for complete stability: fix set of allowed characters; allow all reserved code points. Pro (+): complete stability. Con (-): odd characters.
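
As an illustration only: Python's own identifier syntax is built on the Unicode identifier properties (XID_Start and XID_Continue), so `str.isidentifier()` gives a quick feel for what a Unicode-based identifier definition accepts. This is one language's adoption of the idea, not the definition from the slide, and the wrapper name `is_unicode_identifier` is made up for this sketch.

```python
def is_unicode_identifier(name: str) -> bool:
    """Return True if name is a valid identifier under Python's Unicode-based rules."""
    return name.isidentifier()


for candidate in ["total", "naïve", "λ", "変数", "x1", "1x", "a-b", ""]:
    print(repr(candidate), is_unicode_identifier(candidate))
```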

10 Case mappings Now normative (but tailorable). Clearer definition of string functions: isUpper(), isLower(), isTitle(), isFold(); toUpper(), toLower(), toTitle(), toFold(). Definition of titlecase uses word boundaries. Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
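
The string functions above have close analogues in Python's `str` methods, which apply the default (untailored) full case mappings. A small sketch follows, with two caveats: `str.title()` uses a simpler word rule than the word-boundary definition the slide mentions, and the helper `is_fold` is hypothetical, written here to mirror isFold().

```python
def is_fold(text: str) -> bool:
    """Rough analogue of isFold(): text is unchanged by case folding."""
    return text == text.casefold()


word = "Straße"
print(word.upper())      # toUpper: sharp s expands to two letters
print(word.lower())      # toLower
print(word.casefold())   # toFold
print(is_fold(word), is_fold(word.casefold()))

# Default lowercasing of U+0130 (capital I with dot above) yields i + U+0307.
# A Turkic tailoring would map it to plain "i" instead; Python's built-in
# methods do not offer that tailoring.
lowered = "\u0130".lower()
print([f"U+{ord(ch):04X}" for ch in lowered])
```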

11 UAX #9: The Bidirectional Algorithm Canonical equivalence now preserved (data change, not algorithm). Shaping is done after reordering, but not across directional boundaries. Clarifications of: ZWJ, ZWNJ; intermediate level processing.

12 UAX #14: Line Breaking Properties Negative numbers and dates with hyphens will not break across lines. Word-Joiner will link any characters (except hard line breaks). Behavior of soft hyphen clarified: marks opportunity for breaking, not specific graphic appearance. Rules for GL relaxed: SP and ZW override GL. New Property Values: NL, WJ.

13 UAX #15: Unicode Normalization Forms Description of Stable Code Points. Notation NFC(x) and isNFC(x), in Notation. Added pointer to UTN #5 Canonical Equivalences in Applications. Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections. Added Annex 13: Canonical Equivalence.
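
The NFC(x) and isNFC(x) notation maps directly onto Python's `unicodedata` module. A minimal sketch, assuming Python 3.8 or later (for `unicodedata.is_normalized`); the wrapper names `nfc` and `is_nfc` are chosen here to echo the notation.

```python
import unicodedata


def nfc(x: str) -> str:
    """NFC(x): the Normalization Form C version of x."""
    return unicodedata.normalize("NFC", x)


def is_nfc(x: str) -> bool:
    """isNFC(x): True when x is already in Normalization Form C."""
    return unicodedata.is_normalized("NFC", x)


decomposed = "e\u0301"           # e followed by combining acute accent
composed = nfc(decomposed)       # single precomposed code point U+00E9

print([f"U+{ord(ch):04X}" for ch in composed])
print(is_nfc(decomposed), is_nfc(composed))
print(nfc(composed) == composed)  # normalizing twice changes nothing
```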

14 UAX #29: Text Boundaries New: extracted from 3.0, but significantly revised. Default definitions. Word, sentence: tailoring expected. Grapheme cluster ("user character"): Hangul syllable or other base, plus (optionally) any number of NSMs (nonspacing marks).
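
The "base plus any number of NSMs" idea can be sketched in a few lines. This is a deliberate simplification, not the UAX #29 algorithm: it treats every character of General_Category Mn or Me as extending the previous cluster and ignores Hangul jamo sequences, CR LF, and every other rule. The function name `simple_grapheme_clusters` is hypothetical.

```python
import unicodedata


def simple_grapheme_clusters(text: str) -> list:
    """Group each base character with the nonspacing/enclosing marks that follow it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and clusters:
            clusters[-1] += ch       # attach the mark to the preceding base
        else:
            clusters.append(ch)      # start a new cluster
    return clusters


sample = "e\u0301a\u0308\u0323x"     # e + acute, a + diaeresis + dot below, x
for cluster in simple_grapheme_clusters(sample):
    print([f"U+{ord(ch):04X}" for ch in cluster])
```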

15 No Substantive Changes UAX #11: East Asian Width. UAX #24: Script Names (except now a UAX!).

16 Superseded UAXes Incorporated into and thus superseded by Unicode Version 4.0: UAX #13: Unicode Newline Guidelines; UAX #19: UTF-32; UAX #21: Case Mappings; UAX #27: Unicode 3.1; UAX #28: Unicode 3.2.

17 Unicode Character Database Documentation coalesced into UCD.html. New properties and values: Hangul_Syllable_Type, Unicode_Radical_Stroke. CJK numeric values added. PropertyValueAliases adds block names. UCD fallback properties more precisely defined for code points not explicitly in data files. New characters: appropriate properties assigned.

18 UCD 4.0 (cont.) Modifier letters: the general category of 02B9..02BA, 02C6..02CF changed to Lm. Khmer: two Khmer characters are deprecated; four others strongly discouraged. Decimal digits: Numeric_Type=decimal digit now aligned with General_Category=Nd. Braille: added script value.
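
Two of these data changes can be spot-checked from Python, with the caveat that `unicodedata` reports the Unicode version bundled with the interpreter rather than 4.0 itself. The sketch below compares General_Category with whether a decimal digit value exists; the helper name `has_decimal_value` is hypothetical.

```python
import unicodedata


def has_decimal_value(ch: str) -> bool:
    """Return True if the character carries a decimal digit value in the UCD."""
    try:
        unicodedata.decimal(ch)
        return True
    except ValueError:
        return False


# ASCII 7, Arabic-Indic 3, superscript two, Roman numeral eight.
for ch in ["7", "\u0663", "\u00b2", "\u2167"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), has_decimal_value(ch))

# Modifier letters whose general category the slide lists as Lm.
print(unicodedata.category("\u02b9"), unicodedata.category("\u02c6"))
```

In the sampled characters, a decimal value is present exactly when the category is Nd, which is the alignment the slide describes.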

19 UCD 4.0 (cont. 2) Case Mapping: fixed for Turkish, Lithuanian. Default Ignorables: Hangul Filler characters; Soft-Hyphen, CGJ, ZWS; Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed. Grapheme_Extend: removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence).

20 Related Items UTS #10: Unicode Collation Algorithm: not part of Unicode 4.0, but closely related; from 4.0 on, to be sync'ed in repertoire and version with the Unicode Standard. UTS #6: SCSU: added suitability for XML. Draft UTS #18: Unicode Regular Expressions: draft as UTS with conformance requirements. Draft UTR #23: Character Properties: draft Character Property Model.

21 Q&A

22 Background Slides

23 Unicode 3.2 (March, 2002) New Characters: 1,016. Symbols: large collection of mathematical symbols, especially targeted at MathML; recycling symbols; ornamental brackets. Special Characters: combining grapheme joiner, word joiner, invisible operators for math, variation selectors. Modern Scripts: minority scripts of the Philippines.

24 Conformance Eliminates irregular UTF-8. Defines variation sequences. Replaces ZWNBSP with Word Joiner. Clarifies scope of combining marks (further revised in 4.0). Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables.

25 Textual Clarifications Combined vowels in Khmer, characters discouraged in Khmer Use of dingbats

26 Unicode Standard Annexes UAX #21: Case Mappings (was UTR)

27 Unicode Character Database New properties: IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph, Default_Ignorable_Code_Point, Deprecated, Soft_Dotted, Logical_Order_Exception, Grapheme_Base, Grapheme_Extend, Grapheme_Link. DerivedAge. Normalization Corrections. Added Property & Property Value Aliases. Adds StandardizedVariants.html.

28 Related Items UTS #10: Unicode Collation Algorithm: ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, noncharacters. Note: base version still U3.1. UTR #26: CESU-8. Unicode Technical Notes. Updated Character Encoding Stability Policy. Added Public Review process. Updated Glossary.

29 Unicode 3.1 (March, 2001) New Characters: 44,946 First supplementaries encoded! Modern scripts CJK Ideographs (now totaling 71,039) Historic scripts Old Italic, Gothic, Deseret, Byzantine Musical Symbols Symbols Mathematical Alphanumeric Symbols, (Western) Musical Symbols

30 Conformance Non-shortest-form UTF-8 excluded. Clarification of the stability of the standard, code units vs. code points, noncharacters, normative properties, informative properties, normative references. Revisions of guidelines: wchar_t, unassigned code points, identifiers. Major revision of Georgian. Use of ZWNJ and ZWJ for ligatures. Language tag characters encoded but discouraged.

31 Unicode Standard Annexes UAX #19: UTF-32

32 Unicode Character Database Major revision of PropList properties: White_Space, Bidi_Control, Join_Control, Hex_Digit, Alphabetic, Ideographic, Lowercase, Uppercase, ID_Start, ID_Continue, XID_Start, XID_Continue, Noncharacter_Code_Point, Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender. New properties: Case folding, Scripts. Added DerivedProperties, NormalizationTest.

33 Related Items Documented Character Encoding Stability Policy. UTS #10: Unicode Collation Algorithm: merged data files; updated to base version 3.1. UTR #18: Unicode Regular Expression Guidelines. UTR #20: Unicode in XML and other Markup Languages. UTR #22: Character Mapping Tables. UTR #24: Script Names.

