Introduction to Character Encodings, Java and You.


Private and Confidential

Agenda
Defining the problem
– Where webMethods products encounter character set problems.
– What the symptoms look like.
Understand core concepts
– What is a character set? What’s an encoding?
– What is Unicode, really?
Code examples to avoid problems

Confusion Reigns
Character encoding is generally the most confusing aspect of internationalization.
1. Many, many standards to choose from.
2. Arcane terminology.
3. American programmers rarely seem to encounter it head-on.
– We’re presenting this because many of our products are encountering this problem now.

Problem Domain
webMethods products interface with:
– non-Java systems (for example, in the adapters)
– non-Java environments (file systems, databases, libraries, ftp, http, etc.)

Java’s Text Representation
Java provides a convenient text processing architecture centered on the Java String object.
– A Java String is basically an array of Java characters (char values).

Java Characters
Each Java Character object represents a Unicode character.
– (Currently) a 16-bit unsigned integer value between 0 and 65,535.
– The Character class provides access to character properties:
  UPPER, lower, and Titlecase mapping
  Comparison
  Directionality
  Compatibility
  C-TYPE values such as ‘alpha-ness’, ‘digit-ness’, ‘alphanumeric-ness’

Non-Java Text
Non-Java files, applications, filesystems, databases, et al. typically do not use Unicode.
Java sees them as an array of bytes (byte[]).

Three Problems
– Bad conversion: ?
– No glyph
– Random-seeming trash characters: ƒÃƒ\ǂÙ

Bad Conversion
The target character set doesn’t have this character in it. Java replaces each such character with a “?”.
Input String: 日本語
Output String: ???
Typically:
– Using the default encoding when we meant to specify one.
– Writing to a device (such as System.out) whose legacy encoding doesn’t support the characters.
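The bad-conversion symptom can be reproduced in a few lines. This is a minimal sketch, not code from the presentation: the class and method names are made up for illustration, and ISO 8859-1 stands in for "a target character set that lacks the character".

```java
import java.nio.charset.StandardCharsets;

public class BadConversionDemo {
    // Encode to a charset that cannot represent the text, then decode the
    // resulting bytes with the same charset to see what survived.
    static String roundTrip(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u65E5\u672C\u8A9E" is the slide's input string (nihongo in kanji).
        // Each unmappable character becomes the replacement byte '?'.
        System.out.println(roundTrip("\u65E5\u672C\u8A9E")); // ???
        // Text that the target charset does support passes through intact.
        System.out.println(roundTrip("abc"));                // abc
    }
}
```

The damage happens at encoding time: once the bytes hold 0x3F, no later decoding step can recover the original characters.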

“No Glyph”
Java knows what the character is and is handling it properly, but doesn’t have a picture of it to show you (in the currently selected Font).
Input String: 日本語
Output String: □□□ (empty boxes)
Typically:
– Nothing is wrong, just using the wrong Font.

Random Trash
A byte[] was converted using the wrong character encoding. Bytes were mapped to the wrong characters.
Input String: 日本語
Output String: “ú–{Œê
Typically:
– Using the wrong encoding, the underlying bytes are mapped to different, random-seeming characters.
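The garbage string on this slide looks like Shift-JIS bytes read as Windows code page 1252. The sketch below reproduces it under that assumption; the charset pairing is inferred from the output rather than stated in the presentation, the class and method names are hypothetical, and it relies on the JDK's extended charsets (Shift_JIS) being installed.

```java
import java.nio.charset.Charset;

public class RandomTrashDemo {
    // Encode text with one charset, then decode the same bytes with another.
    static String misdecode(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        String nihongo = "\u65E5\u672C\u8A9E"; // the slide's input string
        // Bytes 93 FA 96 7B 8C EA, read one byte per character.
        String trash = misdecode(nihongo, "Shift_JIS", "windows-1252");
        System.out.println(trash.length()); // 6 characters instead of 3
        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(misdecode(nihongo, "Shift_JIS", "Shift_JIS").equals(nihongo));
    }
}
```

Unlike the “?” case, nothing is lost here: the bytes are intact, so re-reading them with the right encoding fixes the problem.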

Examples
Same byte sequences, different results:
Shift JIS byte[] = 0xE0, 0x41, 0x83, 0x70 = “漓パ”
Latin-1 byte[] = 0xE0, 0x41, 0x83, 0x70 = “àAƒp”
Java String = 0xE0, 0x41, 0x83, 0x70 = “荰” (read as the 16-bit values 0xE041 and 0x8370; U+E041 is a private-use character)
Java String = “漓パ” = U+6F13 U+30D1

Character Set Terminology

What is a Character?
A character is a single, atomic unit of text. The exact definition varies with the writing system and context.

Abstract characters
Some abstract characters include:
A  Roman letter capital A
`  Combining accent grave
に Hiragana character “ni”
語 CJK ideograph
ي  Arabic letter
앚 Hangul syllable
Ａ Fullwidth compatibility letter A

What is a Character Set?
A character set is a “set”: a collection of characters, usually organized in some fashion.
You’re probably most familiar with ASCII:
– 0x41 ‘A’
– 0x42 ‘B’
– Etc.

What is a Character Encoding?
Character set: a collection of characters, basically, a bucket.
Character encoding: the specific ones and zeroes assigned to a character set.
Character Set: ‘A’ == 0x41
Character Encoding: ‘A’ == 0x41
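For ‘A’ the set value and the encoded bytes look the same, which hides the distinction. A small sketch (class and helper names are illustrative only) shows one character-set value, 0x41, turning into different byte sequences under different encodings:

```java
import java.nio.charset.StandardCharsets;

public class CodePointVsBytes {
    // Render a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "A";
        System.out.println(a.codePointAt(0));                           // 65, i.e. 0x41
        System.out.println(hex(a.getBytes(StandardCharsets.US_ASCII))); // 41
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_8)));    // 41
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```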

Eight Bit Encodings
8-bit encodings allow for 256 characters:
– 128 ASCII
– 32 ‘C1’ controls
– 96 extended

Latin-1
The standard for Western Europe is generally ISO 8859-1, AKA “Latin-1”.
– Used by UNIX systems and the Web.
– An extended version is used by Microsoft for Windows.

Let a Thousand Encodings Bloom…
Each language has its own character set…
– Everywhere: ASCII*
– Western European (like German or French): Latin-1
– Eastern European (like Polish or Slovak): Latin-2
– Simplified Chinese: GB2312

Actually, many for each language…

Other Writing Systems
Writing systems vary around the world (in order of increasing complexity, more or less):
– Latin-based alphabets (ABCDEFG…): English
– Cyrillic and Greek-based alphabets ( АБВГДЕЖЩ...): Russian
– Ideographic writing systems have thousands of characters ( 一丁勺両亀困... ): Japanese
– Bi-directional (RTL) languages go right to left (... זוהדגבא ): Hebrew
– Complex scripts (everything else) ( ऋऌऍऎ ): Devanagari

Expanded Character Sets
Most languages have alphabetic or phonetic writing systems:
– Russian, Greek, Slavic, (many) Native American, Bahasa, Hebrew, Arabic, Semitic, etc.: alphabetic
– Indian (subcontinent), Thai, Japanese kana, Korean: phonetic writing systems
– 8 bits is enough for all of the above (with some tricks)
Some languages use scripts based on Chinese ideographic writing (“Han” or “Hanja”):
– Chinese
– Korean
– Vietnamese (traditional)
– Japanese Kanji

“Double-Byte”
8-bit character encodings use eight bits per character.
– 2^8 = 256 characters
Must “double-byte” character sets be 2 bytes per character?
– 2^16 = 65,536 characters
They should actually be called “multi-byte” (MBCS).
– Each character can be ONE, TWO, THREE and sometimes FOUR bytes in length.
– MAY involve shift states.

Multibyte Encodings
A typical Japanese character set: JIS X 0208 ( 漢字)
Character encodings of JIS X 0208:
– Shift-JIS (CP932): 0x8A 0xBF 0x8E 0x9A
– EUC-JP: 0xB4 0xC1 0xBB 0xFA
– ISO-2022-JP: 0x1B 0x24 0x42 0x34 0x41 0x3B 0x7A 0x1B 0x28 0x4A
Non-legacy:
– UTF-16: 0x6F22 0x5B57
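The byte sequences on this slide can be checked from Java. The following is a sketch with a hypothetical helper class; it assumes the JDK's extended charsets (Shift_JIS, EUC-JP) are installed. ISO-2022-JP is left out because the escape sequence an encoder emits at the end of the text may differ from the one shown on the slide.

```java
import java.nio.charset.Charset;

public class KanjiBytes {
    // Encode text in the named charset and render the bytes as hex.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // the two characters shown on the slide
        for (String name : new String[] { "Shift_JIS", "EUC-JP", "UTF-16BE" }) {
            // Same two characters, three different byte sequences.
            System.out.println(name + ": " + hex(kanji, name));
        }
    }
}
```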

An MBCS Example: Shift-JIS
Character encoding used by DOS, Windows, Macs, and a few UNIX-like systems for Japanese.
– Code Page 932
– JIS X 0208:1997

Shift-JIS
In order to reach more characters, double-byte values start with a limited range of “lead bytes”.
These can be followed by any byte value of 0x40 or above (the “trail byte”).

Shift-JIS
Each “lead byte” provides a “window” onto additional characters.

Shift-JIS Problems:
– Lead byte values are also valid as trail bytes.
– Common special characters (such as “\”!) are valid trail bytes.
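The backslash problem is easy to demonstrate. In this sketch (hypothetical class name; assumes the Shift_JIS charset is available in the JDK), katakana SO encodes as 0x83 0x5C, so a byte-level scan for “\” reports a match in text that contains no backslash at all:

```java
import java.nio.charset.Charset;

public class TrailByteDemo {
    // Naive byte-level search, the kind done by code that is not MBCS-aware.
    static boolean containsByte(byte[] bytes, int value) {
        for (byte b : bytes) {
            if ((b & 0xFF) == value) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String so = "\u30BD"; // katakana SO; the string contains no backslash
        byte[] sjis = so.getBytes(Charset.forName("Shift_JIS"));
        System.out.println(containsByte(sjis, 0x5C)); // true: the trail byte is 0x5C
        System.out.println(so.indexOf('\\'));         // -1: character-level search is safe
    }
}
```

Searching the decoded String rather than the raw bytes avoids the false match, which is one reason to convert to Unicode as early as possible.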

Han
CJK scripts require up to 100,000 unique characters for complete representation.
– Four major variants:
  Traditional Chinese
  Simplified Chinese
  Japanese Kanji
  Korean (non-Hangul)

“Kanji”
Sometimes you hear Japanese called “kanji”.
– Kanji is actually one of four writing systems used in Japan.
– Kanji should be avoided as a generic term for DBCS.
Kanji (“Han” or Chinese writing): 日本語
Hiragana (phonetic for Japanese words): にほんご
Katakana (phonetic for “foreign” words): ニホンゴ
Romaji (“Roman script”): nihongo

Chinese
Upper two are Traditional. Lower character is the Simplified variant.

Hangul
Korean Hangul is a syllabic phonetic system, which has thousands of combinations.
– Hangul is not related to Han ideographic writing.

Code Page Hell
With hundreds of encodings and character sets to choose from, making internationalized code work in the late 1980s and early 1990s was “hellish”.
Internationalization folks referred to this as “code page hell”.

Unicode and Java To the Rescue

Unicode (ISO 10646)
– Unicode is a character set that supports all of the world’s languages and writing systems.*
– Originally designed as a “wide character set”: every character was represented by 16 bits. This allowed for 65,536 potential characters.
– Extended to allow 1.1 million characters.
– Unicode is maintained by an industry consortium. ISO 10646 is maintained by WG2. The two are kept identical.

It’s a character set?
Unicode is a character set. It has these encodings:
– UTF-32 (BE/LE). A 32-bit encoding. All characters are 32 bits.
– UTF-16 (BE/LE). A 16-bit encoding. Characters in the “Basic Multilingual Plane” (up to 0xFFFF) are 16 bits. Characters above 0xFFFF require two special “surrogate” code units.
– UTF-8. An 8-bit variable-width encoding. Characters are 1, 2, 3 or 4 bytes long. There are no byte-order variants.
  ASCII == ASCII
  All other characters have a special bit pattern
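The surrogate mechanism is visible directly in a Java String. The sketch below uses U+2000B, an ideograph outside the Basic Multilingual Plane chosen here only as an example; the class and method names are illustrative.

```java
public class SurrogateDemo {
    // Build a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x2000B);
        System.out.println(s.length());                           // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));      // 1 character
        System.out.println(Integer.toHexString(s.charAt(0)));     // d840 (high surrogate)
        System.out.println(Integer.toHexString(s.charAt(1)));     // dc0b (low surrogate)
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 2000b
    }
}
```

Code that counts or indexes by char will therefore see two units for one such character, so codePointAt and codePointCount are the safer calls when supplementary characters may appear.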

UTF-8 Bit Pattern
ASCII == ASCII
– 0x41 == ‘A’
All other characters are multibyte.
– 110xxxxx == lead byte of a two-byte sequence
– 1110xxxx == lead byte of a three-byte sequence
– 11110xxx == lead byte of a four-byte sequence
– 10xxxxxx == trail byte
– U+00C0 == À == 0xC3 0x80 (11000011 10000000)
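The bit patterns can be applied by hand to see where the bytes come from. This is a teaching sketch only (hypothetical class name; it does not reject surrogate values or out-of-range input), and in real code the built-in converters should be used instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByHand {
    // Pack a code point into 1 to 4 bytes using the lead/trail patterns.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };                 // 0xxxxxxx
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                   // 110xxxxx
                (byte) (0x80 | (cp & 0x3F)) };               // 10xxxxxx
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                  // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),                      // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        // Compare the hand-rolled result with the JDK's UTF-8 encoder.
        for (int cp : new int[] { 0x41, 0xC0, 0x8A9E, 0x2000B }) {
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.toHexString(cp) + " matches: "
                    + Arrays.equals(encode(cp), expected));
        }
    }
}
```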

Convenience Method for UTF8
Almost true: readUTF and writeUTF allow direct access to UTF-8 in DataInput/DataOutputStreams.
– This is not really UTF-8, but a Sun specialized version.
– Use InputStreamReader/OutputStreamWriter to do proper conversions.
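One visible difference is how the NUL character is written. The sketch below (illustrative class name) compares writeUTF output with standard UTF-8 for the two-character string ‘A’ followed by NUL: writeUTF adds a two-byte length prefix and encodes NUL as 0xC0 0x80.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    // Capture the exact bytes that DataOutputStream.writeUTF produces.
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String s = "A\0";
        // writeUTF: 00 03 (length), 41, C0 80 (NUL in the modified form)
        System.out.println(viaWriteUTF(s).length);                     // 5
        // Standard UTF-8: 41 00
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 2
    }
}
```

Bytes written this way are meant to be read back with readUTF, not handed to another system as UTF-8.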

Java Uses Unicode
Every character in every Java String object is encoded as UTF-16 Unicode.
– Every string is converted from a legacy encoding, either by the compiler or by the String class.
– This is the reason for native2ascii and –encoding switches.
Once you have a String object, everything is Unicode UTF-16.

“Special” encodings
There are two encodings that the system treats as special:
– file.encoding
– ISO 8859-1
All basic conversion functions use your system default encoding.
Most servlet conversion functions use ISO 8859-1 as the default.

Two File Encodings
Windows systems generally have two different file encodings:
– “ANSI” encoding is the Windows default code page for GUI applications.
– “OEM” encoding is the code page used by the ‘cmd’ or ‘command’ interpreter shells.

Stream Readers and Writers
InputStreamReader and OutputStreamWriter classes perform controlled conversion between byte[] and String.
– Always pass the encoding as a variable.
– Use the IANA preferred name for the encoding, if possible (see ftp://ftp.isi.edu/in-notes/iana/assignments/).
– Prefer UTF-8 for on-the-wire transport.

Code Sample
import java.io.*;

// use with any type of InputStream class
InputStream is = new FileInputStream(file);
// name the encoding explicitly instead of relying on the default
InputStreamReader isr = new InputStreamReader(is, encoding);
// use BufferedReader for efficiency
BufferedReader br = new BufferedReader(isr);
StringBuffer sb = new StringBuffer();
int chr;
// read() returns -1 at end of stream
while ((chr = br.read()) > -1) {
    sb.append((char) chr); // cast, otherwise the int is appended as digits
}
* Note: Try blocks eliminated for clarity.

OutputStreamWriter Code Sample
import java.io.*;

// use with any type of OutputStream class
OutputStream os = new FileOutputStream(file);
// name the encoding explicitly instead of relying on the default
OutputStreamWriter osw = new OutputStreamWriter(os, encoding);
osw.write(myString, 0, myString.length());
osw.flush();
* Note: Try blocks eliminated for clarity.
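The two samples can be combined into a complete round trip. The version below is a sketch under stated assumptions: it uses in-memory streams in place of files so it runs anywhere, the class and method names are invented for illustration, and checked exceptions are wrapped so the helpers are easy to call.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

public class StreamRoundTrip {
    // String -> bytes, with the encoding passed in as a variable.
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(os, encoding)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return os.toByteArray();
    }

    // bytes -> String, again naming the encoding explicitly.
    static String read(byte[] bytes, String encoding) {
        try (Reader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), encoding))) {
            StringBuilder sb = new StringBuilder();
            int chr;
            while ((chr = r.read()) > -1) {
                sb.append((char) chr);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u65E5\u672C\u8A9E";
        byte[] utf8 = write(text, "UTF-8");
        System.out.println(utf8.length);                     // 9: three bytes per character
        System.out.println(read(utf8, "UTF-8").equals(text)); // true
    }
}
```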

Character Class
Provides access to Unicode character properties.
– UnicodeBlock (inner class)
– getType (defined character types)
– isDigit
– isLetter
– isLetterOrDigit
– isUpperCase/isLowerCase/isTitleCase
– toUpperCase/toLowerCase/toTitleCase
– isSpace/isWhitespace (isSpace is deprecated)
– isISOControl/isJavaIdentifierStart/isJavaIdentifierPart
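A few of these methods in action, as a short sketch (class and helper names are illustrative). The point is that the property checks are Unicode-aware rather than ASCII-only:

```java
public class CharacterProps {
    // Numeric value of a decimal digit from any script, or -1 if not a digit.
    static int digitValue(char c) {
        return Character.digit(c, 10);
    }

    public static void main(String[] args) {
        char arabicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE
        System.out.println(Character.isDigit(arabicThree));    // true
        System.out.println(digitValue(arabicThree));            // 3
        System.out.println(Character.isLetter('\u8A9E'));       // true (a CJK ideograph)
        System.out.println(Character.UnicodeBlock.of('\u306B')
                == Character.UnicodeBlock.HIRAGANA);            // true
        System.out.println(Character.toUpperCase('\u00E0')
                == '\u00C0');                                   // true: a-grave maps to A-grave
        System.out.println(Character.getType('A')
                == Character.UPPERCASE_LETTER);                 // true
    }
}
```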

Normalization
Many characters have two (or more) representations in Unicode.
– Normalization makes the sequences the same.
– Simplifies user input parsing and validation.

ICU4J Normalizer Class
Four forms of Normalization:
– Form C (canonical composed)
– Form D (canonical decomposed)
– Form KC (compatibility composed)
– Form KD (compatibility decomposed)
– Special handling for Hangul characters!
– Note that there is a private class java.text.Normalizer in the JDK.
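The four forms can be tried without ICU4J on current JDKs, since java.text.Normalizer has been a public API since Java 6. The sketch below uses that JDK class in place of the ICU4J Normalizer the slide refers to; the wrapper class and method names are hypothetical.

```java
import java.text.Normalizer;

public class NormalizationDemo {
    static String nfc(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFC); }
    static String nfd(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFD); }
    static String nfkc(String s) { return Normalizer.normalize(s, Normalizer.Form.NFKC); }

    public static void main(String[] args) {
        String composed = "\u00C0";    // A-grave as one precomposed character
        String decomposed = "A\u0300"; // 'A' followed by combining grave accent
        System.out.println(composed.equals(decomposed));      // false: different sequences
        System.out.println(nfc(decomposed).equals(composed)); // true
        System.out.println(nfd(composed).equals(decomposed)); // true
        // Compatibility forms also fold variants such as fullwidth letters.
        System.out.println(nfkc("\uFF21").equals("A"));       // true
        // Hangul syllables decompose algorithmically into jamo.
        System.out.println(nfd("\uD55C").equals("\u1112\u1161\u11AB")); // true
    }
}
```

Normalizing both sides to the same form before comparing is what makes user input such as the two A-grave spellings above match.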

Demo Programs
– UnicodeDemo: a Java program that demonstrates the byte sequences of different encodings and also provides some code that shows ISR and OSW in action.
– Charsets: a Windows program by my buddy Bill Hall for playing with encodings.
– My personal website, with examples and demos of certain Java I18n things.

Questions? Addison Phillips