Lecture 101 Unicode support status in various platforms (Microsoft Windows)
Windows 9x / ME
–Do not support Unicode internally
–Only a limited set of Unicode APIs is supported
–Unicode applications compiled with the Microsoft Layer for Unicode can run on Win9x
–Use code pages to support different encodings
Windows NT / 2000 / XP
–Support Unicode
–Use wide characters (fixed 2 bytes)
–Use UCS-2

Lecture 102 Unicode support status in various platforms (Linux & Mac OS)
Linux
–Newer kernels support Unicode
–Requires glibc and XFree or newer
–Uses UTF-8 in most cases, e.g. in the filesystem
–Set the locale to language_TERRITORY.codeset, e.g. zh_TW.utf8
–Enable UTF-8 support in the console by executing unicode_start
Mac OS
–Mac OS 9.1 and Mac OS X support Unicode
–16 bits per Unicode character

Lecture 103 What is a code page
There are many different encodings, e.g. EUC-TW, Big5, Latin-1.
A code page (code page identifier) is a number that identifies a codeset.
–e.g. 950 – Traditional Chinese (Big5)
–e.g – Windows Latin-1
–Other code page identifiers can be found in:
In Windows NT/2000/XP, code page conversion tables provide the information needed to convert between different encodings.

Lecture 104 Java
Java uses Unicode internally.
The supported encoding sets are provided by the Java library packages rt.jar and i18n.jar.
The supported encoding sets for the java.io.*, java.lang.* and java.nio.* APIs can be found in:
User input/output is automatically converted between Unicode and the system code page.
Specify the encoding of the source files when compiling.
–javac -encoding
Convert to another supported encoding, e.g. byte[] utf8Bytes = str.getBytes("UTF-8");
Convert from another supported encoding, e.g. String str = new String(utf8Bytes, "UTF-8");
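
The two conversions on this slide can be tried out with a minimal sketch like the one below. The class name Utf8RoundTrip and its helper methods are hypothetical names chosen for illustration. It uses StandardCharsets.UTF_8 instead of the string name "UTF-8", which avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encode a Java String (Unicode internally) into UTF-8 bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a Java String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String str = "\u4E00"; // the ideograph U+4E00
        byte[] utf8Bytes = toUtf8(str);
        System.out.println("UTF-8 length: " + utf8Bytes.length);
        System.out.println("Round trip ok: " + fromUtf8(utf8Bytes).equals(str));
    }
}
```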

Lecture 105 Code Conversion
Generally, codeset conversion cannot provide a one-to-one mapping (unless the two character sets are exactly the same).
Unicode is a superset of every existing national standard => guaranteed round-trip conversion
Round-trip conversion: Suppose a file file 1 in codeset A is converted to a file file 2 in codeset B, which is then converted back to codeset A as file 3. If file 3 = file 1, we say that codeset B guarantees round-trip conversion for A.
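
The round-trip definition can be checked mechanically. The following is a minimal sketch, assuming the "files" are small byte arrays held in memory; RoundTripCheck and roundTrips are hypothetical names. It shows ISO-8859-1 data surviving a trip through UTF-8 but not through US-ASCII, because Java replaces characters the target codeset cannot represent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCheck {
    // file1 is encoded in codeset a. Convert it to codeset b (file 2),
    // convert back to codeset a (file 3), and compare file 3 with file 1.
    static boolean roundTrips(byte[] file1, Charset a, Charset b) {
        byte[] file2 = new String(file1, a).getBytes(b);
        byte[] file3 = new String(file2, b).getBytes(a);
        return Arrays.equals(file1, file3);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // e with acute accent in ISO-8859-1
        System.out.println(roundTrips(latin1,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
        System.out.println(roundTrips(latin1,
                StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII));
    }
}
```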

Lecture 106 Java Code conversion
Conversion from multibyte to Unicode
byte[] my_data = { (byte) 0xA4, 0x40 };
String my_unicode_data = new String(my_data, "Big5");
where "Big5" is the name of the multibyte codeset. This name is needed to perform the code conversion.
Conversion from Unicode to multibyte
String my_unicode_data = "\u4E00"; ( 一 )
byte[] my_b5_data = my_unicode_data.getBytes("Big5");
my_b5_data will have the value 0xA440
byte[] my_gb_data = my_unicode_data.getBytes("GBK");
my_gb_data will have the value 0xD2BB
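
A self-contained version of the conversions on this slide might look like the sketch below. Big5GbkDemo and its helpers are hypothetical names. It assumes the Java runtime ships the optional "Big5" and "GBK" charsets, which it checks before converting.

```java
import java.nio.charset.Charset;

class Big5GbkDemo {
    // Render bytes as upper-case hex, e.g. {0xA4, 0x40} -> "A440".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Unicode -> multibyte: encode text in the named codeset, return hex.
    static String encodeHex(String text, String charsetName) {
        return toHex(text.getBytes(Charset.forName(charsetName)));
    }

    // Multibyte -> Unicode: decode bytes that are in the named codeset.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("Big5") || !Charset.isSupported("GBK")) {
            System.out.println("Big5/GBK charsets are not available in this runtime");
            return;
        }
        byte[] myData = {(byte) 0xA4, 0x40};
        String unicode = decode(myData, "Big5");
        System.out.println("Decoded to U+4E00: " + unicode.equals("\u4E00"));
        System.out.println("Big5 bytes: " + encodeHex(unicode, "Big5"));
        System.out.println("GBK bytes:  " + encodeHex(unicode, "GBK"));
    }
}
```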

Lecture 107 Text stream import
File I = new File("input");
FileInputStream tmpin = new FileInputStream(I);
BufferedReader in = new BufferedReader(new InputStreamReader(tmpin, "Big5"));
Once the BufferedReader in is established, data can be read using the readLine() method.
inputStr = in.readLine();

Lecture 108 Text Stream Export
File o = new File("output.big5");
FileOutputStream tmpout = new FileOutputStream(o);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(tmpout, "Big5"));
….
out.write("\u6CB3\u8C5A"); ( 河豚 )
out.newLine();
out.close();
The bytes written for 河豚 are 0xAA65 0xB362.
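
The export and import slides can be combined into one runnable sketch that writes a line in a chosen encoding and reads it back. Big5FileDemo, writeLine, readFirstLine and roundTrip are hypothetical names, and a temporary file stands in for the slides' "input" and "output.big5" files. Note that BufferedWriter has no println() method, so the sketch uses write() followed by newLine().

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class Big5FileDemo {
    // Export: encode one line of Unicode text into the given codeset.
    static void writeLine(Path file, String line, Charset cs) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file.toFile()), cs))) {
            out.write(line);
            out.newLine();
        }
    }

    // Import: decode the first line of a file stored in the given codeset.
    static String readFirstLine(Path file, Charset cs) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), cs))) {
            return in.readLine();
        }
    }

    // Write the line to a temporary file, read it back, then clean up.
    static String roundTrip(String line, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        try {
            Path file = Files.createTempFile("output", ".txt");
            try {
                writeLine(file, line, cs);
                return readFirstLine(file, cs);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Fall back to UTF-8 if this runtime does not ship the Big5 charset.
        String name = Charset.isSupported("Big5") ? "Big5" : "UTF-8";
        String text = "\u6CB3\u8C5A"; // U+6CB3 U+8C5A
        System.out.println(name + " round trip ok: " + roundTrip(text, name).equals(text));
    }
}
```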

Lecture 109 Multilingual applications
Software teaching Chinese to English speakers
Software teaching English to Chinese speakers
Conceptually separate two types of data in a multilingual application:
–Data related to the display of menus/instructions
–Data related to the processing in the program
Multilingual applications vs. I18N applications
–I18N: data related to display and processing are the same and are for the same language/convention
–Multilingual applications: data related to display is for one language (and can be internationalized). Data related to processing can be multilingual and is not necessarily related to the display language.
–Unicode is the most convenient encoding for multilingual applications, but it is not absolutely necessary

Lecture 1010 The Ideographic Composition Scheme Used in ISO
Introduction to Ideograph Description Characters (IDCs)
The ideographic composition scheme
Applications using IDCs

Lecture 1011 What are Ideograph Description Characters
12 structure symbols used to describe the formation of characters from smaller ideograph functional units such as character components:
⿰ ⿱ ⿲ ⿳ ⿴ ⿵ ⿶ ⿷ ⿸ ⿹ ⿺ ⿻

Lecture 1012 Characteristics of Ideographs
Ideograph characters are often formed from smaller ideographic elements such as radicals, ideographs proper, and other ideographic components, which we generally call ideograph components.
Components arise naturally in the formation of characters. Examples: 2 components =>
Chinese speakers have long used components to describe characters, especially characters with the same pronunciation.

Lecture 1013 Problems with ideograph Character Encoding
Each character is treated as a different symbol, and thus given its own codepoint.
Codepoint assignment within a block does try to follow radical order, but it does not consider the substructures (components), so such information is not revealed.
When a new character is created, its codepoint must be allocated in a new block, so radical order cannot be maintained globally.
There is also a potentially endless standardization process.
–Encoding rarely used ideograph characters wastes resources, both in code space and in standardization effort

Lecture 1014 Introduction of IDCs
Work was started in 1995 by ISO/IEC SC2/WG2/IRG.
Objective of the original proposal: use coded ideographs and "structure symbols" to describe not-yet-coded ideographs.
The original proposal had 15 "Ideograph Structure Symbols" based on a study of Han characters. Three of them did not make it into ISO 10646/Unicode:
–Ideograph_Proper ( 日 ): every coded character is considered an ideograph proper, so the symbol is not needed
–Left_Up_Encompass: no un-encoded example
–Mirror_Symmetry ( 非 ): the left is mirrored to the right, but this can be described by Left_to_Right
The remaining 12 symbols were renamed Ideograph Description Characters.

Lecture 1015 Ideographic Composition Scheme
An Ideographic Description Sequence (IDS) describes a character using its components and indicating the relative positions of the components.
IDCs are considered operators on the components.
IDSs can be expressed by a context-free grammar in Backus-Naur Form (BNF). The grammar G has four components: G = {Σ, N, P, S}, where
Σ: the set of terminal symbols, i.e. coded radicals, coded ideographs, and the 12 IDCs
N: the set of 5 non-terminal symbols, N = {IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component}
S = {IDS}, the start symbol of the grammar
P: a set of rewrite rules

Lecture 1016 The following is the set of rewriting rules P:
IDS ::= Binary_Symbol IDS1 IDS1 | Ternary_Symbol IDS1 IDS1 IDS1
IDS1 ::= IDS | Ideograph_Component
Ideograph_Component ::= coded_ideograph | coded_radical | coded_component
Binary_Symbol ::= ⿰ | ⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ | ⿺ | ⿻
Ternary_Symbol ::= ⿲ | ⿳
Note that even though the IDCs are terminal symbols, they are not part of the ideograph components.
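
Because the IDCs act as prefix operators with a fixed number of operands, the legality of an IDS can be checked with a small recursive-descent parser. The sketch below is illustrative only. IdsValidator and its methods are hypothetical names. It assumes the IDCs are the code points U+2FF0 to U+2FFB, with U+2FF2 and U+2FF3 as the ternary symbols. As a simplification, any non-IDC code point is accepted as an Ideograph_Component, without checking that it is really a coded ideograph, radical or component.

```java
class IdsValidator {
    // Ternary_Symbol: U+2FF2 and U+2FF3.
    static boolean isTernary(int cp) {
        return cp == 0x2FF2 || cp == 0x2FF3;
    }

    // Binary_Symbol: the remaining IDCs in U+2FF0..U+2FFB.
    static boolean isBinary(int cp) {
        return cp >= 0x2FF0 && cp <= 0x2FFB && !isTernary(cp);
    }

    // Parse one IDS1 (a component or a nested IDS) starting at pos.
    // Returns the index just past it, or -1 if the input is malformed.
    static int parseIds1(int[] cps, int pos) {
        if (pos >= cps.length) {
            return -1; // an operand is missing
        }
        int cp = cps[pos];
        int arity = isBinary(cp) ? 2 : isTernary(cp) ? 3 : 0;
        int next = pos + 1;
        for (int i = 0; i < arity && next != -1; i++) {
            next = parseIds1(cps, next);
        }
        return next;
    }

    // A legal IDS starts with an IDC and consumes the whole string.
    static boolean isLegalIds(String s) {
        int[] cps = s.codePoints().toArray();
        if (cps.length == 0 || !(isBinary(cps[0]) || isTernary(cps[0]))) {
            return false;
        }
        return parseIds1(cps, 0) == cps.length;
    }

    public static void main(String[] args) {
        // U+2FFB (overlaid) followed by U+4ECE and U+5DE5
        System.out.println(isLegalIds("\u2FFB\u4ECE\u5DE5"));
        // Two components with no IDC in front
        System.out.println(isLegalIds("\u83AB\u8A00"));
    }
}
```

Counting code points rather than chars keeps the check correct for components outside the BMP, which occupy two chars in a Java String.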

Lecture 1017 Examples

Lecture 1018 IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID (IDC-OLD, ⿻):
–The IDS introduced by IDC-OLD describes the abstract form of the ideograph with D1 and D2 overlaying each other.
–⿻从工 is an example of an IDS which represents the abstract form of 巫
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT (IDC-SUR, ⿹):
–The IDS introduced by IDC-SUR describes the abstract form of the ideograph with D1 at the top right corner of D2, and D2 encompassed by D1.
–(example lost in extraction)

Lecture 1019 IDS allows a character to be described by different sequences. One IDS should describe only one character, yet one character can be described by different IDSs.

Lecture 1020 IDS describes ideographic character composition at the abstract level. It indicates the relative positions of the components, but does not indicate their proportions. It is not intended for rendering. Nesting is natural in ideographs and is reflected in the IDS scheme.

Lecture 1021 Components
Ideographic Components (IRG definition): units which can be used to represent ideographs. These components consist of ideographs proper coded in ISO (BMP) and some basic elements used to form ideographs.
Radicals (IRG definition): those ideographic components listed in the index pages of KX1, DKW, DJW, HYD.
ISO extensions:
–Radicals
–Components

Lecture from GBK and more from IRG ISO IRG component sample

Lecture 1023 Extending the Objectives of IDCs
Using coded characters to describe not-yet-coded ideographs, both for representation and exchange
Limiting standardization to modern characters only, excluding rarely used characters
Learning character composition (education)
Revealing substructures of ideograph characters
Describing ideograph variants

Lecture 1024 Examples
Given characters => IDS? 忂   蠿
Given an IDS => what are the characters? 莫言 艹旲言
Is the following a legal IDS? 莫言 艹旲

Lecture 1025 Conclusion
IDCs were introduced in Unicode 3.0.
Their use is going beyond the original objective.
Applications based on IDCs have already been developed, such as the Hong Kong Glyph Specification.
IDCs should also be useful in ideograph variant specifications.
Additional search site: