Unicode 4.0 Mark Davis President, The Unicode Consortium Note: slides differ from proceedings.


Overview New Characters Conformance UAX:Unicode Standard Annexes UCD:Unicode Character Database UTS:Unicode Technical Standards Not part of the Standard, but can claim conformance

Properties and Behavior Unicode is not just a list of characters Properties and behavior are crucial With them, new characters can work out of the box Some are part of the standard (BIDI, Normalization), others are associated (Collation, Regular Expressions)

New Characters: 1,228 Modern Scripts (additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac (minority scripts) Limbu, Tai Le, Osmanya Historic Scripts Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers Symbols Monograms, digrams, tetragrams, other symbols modifier & combining characters

New Characters (cont.) Special Characters additional variation selectors (for future CJK variants), double-diacritics for dictionary use For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts. Character repertoire corresponds to ISO/IEC 10646:2003.

Conformance Substantially improved specification of conformance requirements Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes Tightened definitions of UTF-8, UTF-16, UTF-32 Separate definition of Unicode String Clarified conformance status of Unicode Standard Annexes Formal definitions of properties & algorithms Provisional properties

UTF vs. Unicode String Important Distinction UTF Unique representation for Code Point All else illegal C0 80 D Unicode String Sequence of code units Internal Processing, not interchange Not necessarily valid UTF C0 A0 D
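The distinction can be illustrated with a minimal Python sketch. It assumes Python's built-in strict `utf-8` codec as the validity check; the helper names `is_well_formed_utf8` and `can_interchange` are hypothetical. A Python `str` is a sequence of code points rather than code units, so the lone-surrogate case is only an analogy for an in-memory Unicode string that is not valid UTF.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the bytes decode under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def can_interchange(text: str) -> bool:
    """Return True if the in-memory string can be serialized as UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# C0 80 is a non-shortest-form spelling of U+0000; only 00 is well-formed.
overlong_nul = b"\xc0\x80"
shortest_nul = b"\x00"

# A lone surrogate can sit in memory during processing,
# but it has no valid UTF-8 representation.
lone_surrogate = "\ud800"

print(is_well_formed_utf8(shortest_nul))
print(is_well_formed_utf8(overlong_nul))
print(can_interchange(lone_surrogate))
```

The point of the sketch is the asymmetry: internal strings may temporarily hold sequences that a conformant UTF encoder or decoder must reject at the interchange boundary.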

Conformance (cont.) Formalized policies for stability of the standard Clarification of semantics of important characters, including BOM Revised scope of enclosing combining marks Revised semantics of ZWJ for cursive scripts Normalization Corrections U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF All corrections subject to strict stability constraints: For 3.2 repertoire, NFC 3.2 (X) = NFC 4.0 (X)

Textual Clarifications Major changes to Chapters 2, 3, 6, 14 and 15 Definitive terminology for code points: graphic, format, control, private-use = assigned characters; surrogate, noncharacter, reserved = not characters Substantial improvements to many character block descriptions, especially Indic

Programming language identifiers Now backwards-compatible Once a Unicode identifier, Always a Unicode identifier Alternate definition for complete stability Fix set of allowed characters Allow all reserved code points + Complete stability - Odd characters Also see new UTR on Syntax Characters
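As an illustration of property-based identifiers, here is a small Python sketch. It relies on `str.isidentifier()`, which follows Python's own identifier rules (built on XID_Start and XID_Continue); this is one language's adoption of the idea, not the definition given in the standard. The helper name `classify` is hypothetical.

```python
def classify(names):
    """Map each candidate name to whether Python accepts it as an identifier."""
    return {name: name.isidentifier() for name in names}


candidates = [
    "naïve",      # Latin letters with a diacritic
    "变量",        # CJK ideographs may start an identifier
    "a\u0301",    # base letter followed by a combining mark
    "\u0301a",    # a combining mark may not start an identifier
    "1x",         # a digit may not start an identifier
    "a-b",        # hyphen is not an identifier character
]

for name, ok in classify(candidates).items():
    print(repr(name), ok)
```

Because the check is driven by character properties rather than a hard-coded list, newly encoded letters become usable in identifiers when the underlying character database is updated.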

Case mappings now normative (but tailorable) Clearer definition of string functions: isUpper(), isLower(), isTitle(), isFold() toUpper(), toLower(), toTitle(), toFold() Definition of titlecase uses word boundaries Note that the Turkic mappings do not maintain canonical equivalence, without additional processing.
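The string functions and the Turkic caveat can be sketched in Python. The built-in `upper()`, `lower()`, `title()` and `casefold()` apply the default (untailored) mappings. The helper `turkic_lower` is hypothetical and deliberately simplified: it covers only dotted and dotless I and ignores the other conditions in the full special-casing rules. It composes to NFC first, which is one form of the "additional processing" needed so that canonically equivalent inputs give the same result.

```python
import unicodedata


def turkic_lower(text: str) -> str:
    """Simplified Turkic lowercasing: I -> dotless i, İ -> i (assumption: NFC first)."""
    composed = unicodedata.normalize("NFC", text)  # I + U+0307 becomes U+0130
    composed = composed.replace("\u0130", "i").replace("I", "\u0131")
    return composed.lower()


# Default mappings: one character may map to several.
print("Straße".upper())
print("Straße".casefold())
print("\u01c6".title() == "\u01c5")  # the dž digraph has a distinct titlecase form

# Default lowercasing is not Turkic-aware.
print("I".lower())

# Tailored lowercasing treats both spellings of dotted capital I alike.
print(turkic_lower("\u0130stanbul") == turkic_lower("I\u0307stanbul"))
```

Dropping the normalization step would turn `I` + U+0307 into dotless i plus a combining dot, which is not equivalent to the result for precomposed U+0130.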

UAX #9: BIDI BIDI: Arabic/Hebrew Display HTML, all modern word processors, OSs, … New: canonical equivalence now preserved (data change, not algorithm) shaping is done after reordering but not across directional boundaries clarifications of: ZWJ, ZWNJ intermediate level processing

UAX #15: Normalization Unique form for text comparison W3C Character Model, International Domain Names, Network File System, … New: Description of Stable Code Points. Notation NFC(x) and isNFC(x), in Notation. Added pointer to UTN #5 Canonical Equivalences in Applications. Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections. Added Annex 13: Canonical Equivalence.
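The NFC(x) and isNFC(x) notation maps directly onto Python's standard `unicodedata` module, as in this minimal sketch. The wrapper names `nfc` and `is_nfc` are hypothetical, and `unicodedata.is_normalized` requires Python 3.8 or later. Results reflect the Unicode version bundled with the interpreter, not necessarily 4.0.

```python
import unicodedata


def nfc(text: str) -> str:
    """NFC(x): canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """isNFC(x): True when x is already in NFC."""
    return unicodedata.is_normalized("NFC", text)


decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
angstrom = "\u212b"      # ANGSTROM SIGN, a canonical singleton

# Raw comparison sees two different strings; comparison after NFC sees one.
print(decomposed == composed)
print(nfc(decomposed) == nfc(composed))
print(is_nfc(decomposed), is_nfc(composed))
print(nfc(angstrom) == "\u00c5")
```

Normalizing both sides before comparing is what gives the "unique form for text comparison" mentioned on the slide.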

UAX #14: Line Breaking Line-Break (word-wrap) all Unicode text Customizable for different languages New: Negative numbers and dates with hyphens will not break across lines Word-Joiner will link any characters (except hard line breaks) Behavior of soft hyphen clarified marks opportunity for breaking, not specific graphic appearance. Rules for GL relaxed: SP and ZW override New Property Values: NL, WJ

UAX #29: Text Boundaries Default User Character, Word, Sentence boundaries Customizable for different languages Word, sentence: tailoring expected New: Extracted from 3.0, but significantly revised Grapheme cluster ( user character ) Hangul Syllable or other Base plus (optionally) any number of NSMs
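The "base plus any number of NSMs" rule can be approximated in a few lines of Python. This is a rough sketch, not the UAX #29 algorithm: it assumes that non-spacing marks are exactly the characters with General_Category Mn or Me, and it ignores Hangul syllables, CR LF, and any tailoring. The function name `simple_clusters` is hypothetical.

```python
import unicodedata

NON_SPACING = ("Mn", "Me")  # assumption: stands in for Grapheme_Extend


def simple_clusters(text: str) -> list:
    """Group each base character with the non-spacing marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in NON_SPACING:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new user-perceived character
    return clusters


word = "re\u0301sume\u0301"  # "résumé" spelled with combining accents
print(len(word))
print(len(simple_clusters(word)))
```

A cursor or selection that moves by these clusters, rather than by code points, never separates an accent from its base letter.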

No Sub. Changes UAX #11: East Asian Width Guidelines for choosing character width UAX #24: Script Names Default script assignment Used in regular expressions Now UAX
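For East Asian Width, Python's `unicodedata.east_asian_width` exposes the property value directly. The sketch below uses it for a simple column count. The function `display_width` is hypothetical, and treating Wide and Fullwidth as two columns and everything else (including Ambiguous) as one is an assumption suited to a non-East-Asian context; combining marks are not handled.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate terminal columns: 2 for Wide/Fullwidth, 1 otherwise (assumption)."""
    width = 0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


samples = ["A", "\uff21", "\u4e2d", "\uff76"]  # A, fullwidth A, CJK ideograph, halfwidth KA
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch), display_width(ch))
```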

Superseded UAXes Incorporated into and thus superseded by Unicode Version 4.0: UAX #13: Unicode Newline Guidelines UAX #19: UTF-32 UAX #21: Case Mappings UAX #27: Unicode 3.1 UAX #28: Unicode 3.2

Unicode Character Database Crucial Component of Unicode Documentation coalesced into UCD.html. New properties and values Hangul_Syllable_Type, Unicode_Radical_Stroke CJK numeric values added. PropertyValueAliases adds block names UCD fallback props more precisely defined. for code points not explicitly in data files New Characters Appropriate properties assigned

UCD4.0 (cont.) Modifier letters The general category of 02B9..02BA, 02C6..02CF changed to general category Lm. Khmer Two Khmer characters are deprecated; four others strongly discouraged. Decimal Digits Numeric_Type=decimal digit now aligned with General_Category=Nd Braille Added script value
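Two of these changes can be observed through Python's `unicodedata` module, as in this small sketch. It reports the Unicode version bundled with the interpreter (later than 4.0), so it shows the current state of the properties rather than the 4.0 data files. The helper `decimal_matches_nd` is hypothetical.

```python
import unicodedata


def decimal_matches_nd(ch: str) -> bool:
    """True when 'has a decimal digit value' agrees with General_Category == Nd."""
    has_decimal = unicodedata.decimal(ch, None) is not None
    is_nd = unicodedata.category(ch) == "Nd"
    return has_decimal == is_nd


print(unicodedata.unidata_version)

# Modifier letters listed on the slide carry General_Category Lm.
print(unicodedata.category("\u02b9"), unicodedata.category("\u02c6"))

# ARABIC-INDIC DIGIT THREE is a decimal digit; SUPERSCRIPT TWO is not.
print(unicodedata.category("\u0663"), unicodedata.decimal("\u0663"))
print(unicodedata.category("\u00b2"), unicodedata.decimal("\u00b2", None))
```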

UCD4.0 (cont. 2) Case Mapping Fixed for Turkish, Lithuanian Default Ignorables Hangul Filler characters Soft-Hyphen, CGJ, ZWS Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed. Grapheme_Extend removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)

Unicode Technical Standard UTS: separate standard independent conformance requirements UTR: information and guidelines Documents may move from UTR status to UTS

UTS #10: Unicode Collation Significance: String comparison, matching, searching Compares all Unicode characters Handles linguistic features Accents, Case, Punctuation, … Contextual weighting, … Tailor for different languages Version due Sept From now on, to be sync'ed in repertoire and version with the Unicode Standard.

UTS #18: Regular Exp. Significance: Crucial to many applications: web, XML, … Unicode adds significant requirements Level 1: Basic Support Perl Level 2: Extended Support Level 3: Tailored Support New: Recently approved as UTS (was UTR) Adds clearer conformance requirements Flexible list of features Partial conformance claims
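A small Python sketch shows why Unicode adds requirements for regular expressions. It uses the standard `re` module, where `\w` on `str` patterns is Unicode-aware by default and `re.ASCII` restores ASCII-only matching. This only illustrates Unicode-aware character classes; it makes no claim about which UTS #18 level the `re` module meets. The function name `words` is hypothetical.

```python
import re


def words(text: str, ascii_only: bool = False) -> list:
    """Find runs of word characters, optionally restricted to ASCII semantics."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\w+", text, flags)


sample = "naïve café"

print(words(sample))                    # Unicode-aware word characters
print(words(sample, ascii_only=True))   # accented letters split the words
```

With ASCII-only semantics the accented letters are treated as separators, which is the kind of failure the basic support level is meant to rule out.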

UTS #6: SCSU Simple Unicode Compression Added suitability for XML See also Technical Note on BOCU Main difference: preserves binary order: x < y implies BOCU(x) < BOCU(y)

New UTRs Draft UTR #23: Character Properties Draft Character Property Model Character Folding Hiragana-Katakana, Case, … Programming Language IDs, Syntax characters

Q&A Other talks here: Common Locale Data interchange of language-specific data for sorting, dates, times, currencies ICU premier Unicode enablement library full-featured, x-platform C, C++, Java

Background Slides

Unicode 3.2 (March, 2002) New Characters: 1,016 Symbols Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets. Special Characters combining grapheme joiner, word joiner, invisible operators for math, variation selectors Modern Scripts minority scripts of the Philippines

Conformance Eliminates irregular UTF-8 Defines variation sequences Replaces ZWNBSP with Word Joiner Clarifies scope of combining marks (further revised in 4.0) Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables,

Textual Clarifications Combined vowels in Khmer, characters discouraged in Khmer Use of dingbats

Unicode Standard Annexes UAX #21: Case Mappings (was UTR)

Unicode Character Database New properties: IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph, Default_Ignorable_Code_Point, Deprecated, Soft_Dotted, Logical_Order_Exception, Grapheme_Base, Grapheme_Extend, Grapheme_Link DerivedAge Normalization Corrections Added Property & Property Value Aliases Adds StandardizedVariants.html

Related Items UTS #10: Unicode Collation Algorithm Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, noncharacters Note: base version still U3.1 UTR #26: CESU-8 Unicode Technical Notes Updated Character Encoding Stability Policy Added Public Review process Updated Glossary

Unicode 3.1 (March, 2001) New Characters: 44,946 First supplementaries encoded! Modern scripts CJK Ideographs (now totaling 71,039) Historic scripts Old Italic, Gothic, Deseret, Byzantine Musical Symbols Symbols Mathematical Alphanumeric Symbols, (Western) Musical Symbols
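Supplementary characters lie above U+FFFF, so UTF-16 represents each one as a surrogate pair. The arithmetic is sketched below in Python and cross-checked against the built-in `utf-16-be` codec. The function name `to_utf16_code_units` is hypothetical.

```python
def to_utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (one or two) for a Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]                 # BMP: a single code unit
    offset = code_point - 0x10000           # 20-bit offset into the supplementary range
    high = 0xD800 + (offset >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits -> low surrogate
    return [high, low]


old_italic_a = 0x10300   # OLD ITALIC LETTER A
g_clef = 0x1D11E         # MUSICAL SYMBOL G CLEF

for cp in (0x41, old_italic_a, g_clef):
    units = to_utf16_code_units(cp)
    print(f"U+{cp:04X}", [f"{u:04X}" for u in units])

# Cross-check against Python's own encoder.
print(chr(old_italic_a).encode("utf-16-be").hex())
```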

Conformance Non-shortest-form UTF-8 excluded Clarification of the stability of the standard, code units vs. code points, non-characters, normative properties, informative properties, normative references Revisions of guidelines: wchar_t, unassigned code points, identifiers Major revision of Georgian Use of ZWNJ and ZWJ for ligatures Language tag characters encoded but discouraged

Unicode Standard Annexes UAX #19: UTF-32

Unicode Character Database Major revision of PropList properties: White_Space, Bidi_Control, Join_Control, Hex_Digit Alphabetic, Ideographic, Lowercase, Uppercase ID_Start, ID_Continue, XID_Start, XID_Continue Noncharacter_Code_Point Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender New properties: Case folding, Scripts Added DerivedProperties, NormalizationTest

Related Items Documented Character Encoding Stability Policy UTS #10: Unicode Collation Algorithm Merged data files; updated to base version 3.1 UTR #18: Unicode Regular Expression Guidelines UTR #20: Unicode in XML and other Markup Languages UTR #22: Character Mapping Tables UTR #24: Script Names

Schedule 2003, April: UCD/UAXes Final data files available Implementation can proceed 2003, September: Book Available