Implementation Issues Mark Davis 2003-09-24. Properties.

Presentation transcript:

Implementation Issues Mark Davis

Properties

Behavior
- Bidirectional Algorithm (Arabic/Hebrew)
- Linebreak, User-Character, Word, …
- Normalization
- Collation
- Regular Expressions
- Programming Identifiers
- …
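As a rough illustration of how these behaviors are driven by character properties, the sketch below reads a few properties from Python's standard `unicodedata` module. The helper name `describe` and the choice of sample characters are assumptions made for this example; the slides do not prescribe any particular API.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),    # general category, e.g. Ll, Lo, Mn
        "bidi": unicodedata.bidirectional(ch),   # bidirectional class, e.g. L, R, AL
        "combining": unicodedata.combining(ch),  # canonical combining class
    }

# Latin a, Hebrew alef, Arabic alef, Devanagari virama (halant)
for ch in ("a", "\u05D0", "\u0627", "\u094D"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

The bidirectional class feeds the Bidirectional Algorithm, and the combining class feeds normalization; the values come from whichever Unicode version the Python build ships.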

Scripts, not Languages
- a: English, German, Italian
- . : English, Russian, Armenian
- ।: Hindi, Gujarati, Marathi
- ¨: English, Russian, Greek

Size Doesn't Matter
- Text storage size is approximately the same for all languages
- In real data, other data dominates
- Compression available if needed: ZIP, SCSU, BOCU

Normalization
- Produces Unique Form
  - Comparison, Matching, Counting
- Used in
  - Collation
  - International Domain Names
  - W3C Character Model (Web)
  - Network File System
  - …
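A minimal sketch of why a unique form matters for comparison, using `unicodedata.normalize` from the Python standard library. The helper name `same_text` is hypothetical, and the choice of NFC as the default form is an assumption for this example.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00E0"     # à as a single code point
decomposed = "a\u0300"  # a followed by COMBINING GRAVE ACCENT

print(composed == decomposed)           # raw code point comparison
print(same_text(composed, decomposed))  # comparison after normalization
print(len(unicodedata.normalize("NFD", composed)))  # code points in the decomposed form
```

Without normalization the two spellings compare unequal even though they display identically, which is the problem for matching and counting that the slide points at.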

Transcoding: ISCII - Unicode
- ISCII: Halant + Halant, Halant + Nukta, INV halant RA, ATR, EXT
- Unicode: Halant + ZWJ, Halant + ZWNJ, SPACE virama RA, Not in plain text, Not required

Unicode = Lingua Franca
- Transcoding = converting from one character encoding to another
- Many standards / systems defined in terms of Unicode: C#, Java, XML, …
- [Diagram: Unicode, cp1252, SJIS, GB18030, ISCII]
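The lingua franca idea can be sketched as transcoding through Unicode: decode the source bytes to a Unicode string, then encode to the target. The function name `transcode` is hypothetical. The codec names `cp1252`, `shift_jis`, and `gb18030` exist in Python's standard library; as far as I know there is no built-in ISCII codec, so ISCII is left out of this example.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one character encoding to another via Unicode."""
    text = data.decode(source)   # source encoding -> Unicode
    return text.encode(target)   # Unicode -> target encoding

# cp1252 bytes for "café", re-encoded as UTF-8
print(transcode(b"caf\xe9", "cp1252", "utf-8"))

# Shift_JIS to GB18030, with Unicode as the pivot
sjis = "日本語".encode("shift_jis")
gb = transcode(sjis, "shift_jis", "gb18030")
print(gb.decode("gb18030"))
```

With Unicode as the pivot, each legacy encoding needs only one pair of tables (to and from Unicode) instead of one pair per other encoding.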

Transliteration
- Round-trip Transliterations: श ↔ śa
  - Ideal published form
  - Unique source sequence → unique target
- Best-Fit Transliterations: श → sa
  - For limited environments
- Keyboard Transliterations: श ← ssa
  - Limited to QWERTY keys
- Indic-Indic not simple mapping; “holes”
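The difference between round-trip and best-fit transliteration can be sketched with two toy tables. Only the pairs श ↔ śa and श → sa come from the slide; the other entries, the table names, and the helper functions are assumptions added for illustration, and a real transliterator handles far more than single-letter lookups.

```python
# Toy tables; only the entries for श are taken from the slide.
ROUND_TRIP = {"श": "śa", "ष": "ṣa", "स": "sa"}  # distinct targets
BEST_FIT = {"श": "sa", "ष": "sa", "स": "sa"}    # assumed ASCII-only fallback

def transliterate(text: str, table: dict) -> str:
    """Replace each mapped character; pass everything else through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

def is_reversible(table: dict) -> bool:
    """A table can round-trip only if no two sources share a target."""
    return len(set(table.values())) == len(table)

def invert(table: dict) -> dict:
    """Build the reverse table; only meaningful when is_reversible(table)."""
    return {target: source for source, target in table.items()}

print(transliterate("श", ROUND_TRIP), is_reversible(ROUND_TRIP))
print(transliterate("श", BEST_FIT), is_reversible(BEST_FIT))
```

The best-fit table collapses three letters onto one target, so the original text cannot be recovered from its output.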

Keyboards
- One key → many characters: क (0915) ् (094D) ष (0937)
- Many keys → one character: ` a → à (00E0)
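Both keyboard cases can be sketched in a few lines. The key label `<KSSA>`, the table names, and the function `type_keys` are hypothetical; real input methods are considerably more involved, and this sketch simply drops a dead key that is not followed by another key.

```python
import unicodedata

# Dead keys: the key produces a combining mark applied to the next character.
DEAD_KEYS = {"`": "\u0300"}  # grave accent
# One key producing a sequence: KA + VIRAMA + SSA (hypothetical key label).
ONE_KEY_MANY = {"<KSSA>": "\u0915\u094D\u0937"}

def type_keys(keys: list) -> str:
    """Turn a list of key presses into text."""
    out = []
    pending = None  # combining mark waiting for a base character
    for key in keys:
        if key in DEAD_KEYS:
            pending = DEAD_KEYS[key]
            continue
        text = ONE_KEY_MANY.get(key, key)
        if pending is not None:
            # Attach the mark to the first character and compose if possible.
            text = unicodedata.normalize("NFC", text[0] + pending) + text[1:]
            pending = None
        out.append(text)
    return "".join(out)

print(type_keys(["`", "a"]))         # many keys -> one character
print(len(type_keys(["<KSSA>"])))    # one key -> three code points
```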

Supporting Sequences
- Keyboards
- Fonts
- Selection

Fonts
- Required Glyphs, Positioning
- Sequences Necessary to produce them
- Context (e.g. in OpenType)
- क (0915) ् (094D) ष (0937)

Selection
- Use appropriate boundaries for user-characters
- Arrow keys, mouse selection, etc.
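A deliberately simplified sketch of user-character boundaries: it attaches every combining mark (general category M*) to the preceding character. This is not the full segmentation algorithm of Unicode Standard Annex #29, and the function name `user_characters` is hypothetical. Production code would normally use a library such as ICU.

```python
import unicodedata

def user_characters(text: str) -> list:
    """Split text into base characters together with their combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # a mark extends the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters

print(user_characters("a\u0300b"))             # two clusters from three code points
print(user_characters("\u0915\u094D\u0937"))   # KA + VIRAMA, then SSA
```

An arrow key or a mouse selection would then move by whole clusters rather than by code points. Under this simple rule the conjunct क्ष splits after the virama; an implementation tailored for Indic scripts may treat the whole conjunct as a single unit.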

Unicode Stability
- Encoding. Once a character is encoded, it will not be moved or removed.
- Name. Once a character is encoded, its character name will not be changed.
- Normalization. Once a character is encoded, its canonical combining class and decomposition mapping will not be changed in a way that will destabilize normalization.
- Identity. Once a character is encoded, its properties may still be changed, but not in such a way as to change the fundamental identity of the character.
- Property Value. The structure of certain property values in the Unicode Character Database will not be changed.

Locale Data (examples)

Q & A