1
Unicode (and Java)
Brice Giesbrecht
2
Objective of Presentation
The need for Unicode
How it works
Differentiate between encodings
How to get your browser to work…
See how Java consumes and produces data
3
Overview of Presentation
Character Sets
Unicode
Encodings
Unicode Support in Java
Unicode Support in Databases (?)
Demonstration (web app)
Resources
Door Prizes (for those still awake…)
4
Character Sets
What is a character set?
Code page: "a mapping in which a sequence of bits, usually a single octet representing integer values 0 through 255, is associated with a specific character" (Wikipedia)
Most character sets are a direct mapping of a character to a number (7 bit / 8 bit)
Character sets are NOT fonts!
Encoding is usually a table lookup
Most IBM and Microsoft code pages use ASCII as their base set of characters
The English bias (compare to Indic languages)
5
Character Sets
Issues within a single language
  Selectors to overcome 8-bit limitations (especially for CJK sets)
  Historical importance of platforms and hardware
  Compatibility (or more likely, lack thereof)
  ISCII as an example
Issues outside a single language
  How do you produce content using multiple languages? (Or the characters from those languages?)
6
Character Sets
Enter the standards
ISO-646 (ASCII, still 7 bit)
  12 whole code points to play with!
  C0 Control Set (0x00 – 0x1F)
ISO-8859-n
  0x00 – 0x7F: ISO-646 IRV
  0x80 – 0xFF: different for each set (or part)
  ISO-8859-1 (Latin1)
  C1 Control Set (0x80 – 0x9F)
ISO-2022
  Designed for transmission
  Non-Latin bases & multi-byte sets
7
Character Sets
Enter Microsoft!
Windows code pages
Cp1252
  Based on ISO-8859-1
  C1 code points used for printable characters
  Often mislabeled as ISO-8859-1 due to their similarities
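The difference between Cp1252 and ISO-8859-1 shows up only in the 0x80 – 0x9F range. The following is a minimal sketch, assuming the JDK charset name `windows-1252` for Cp1252; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252VsLatin1 {
    // Decode one byte with the given charset and return the resulting code point.
    static int decode(byte b, Charset cs) {
        return new String(new byte[] { b }, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // 0x41 and 0xE9 decode identically; 0x80 and 0x99 fall in the C1 range and differ.
        for (int value : new int[] { 0x41, 0x80, 0x99, 0xE9 }) {
            byte b = (byte) value;
            System.out.printf("0x%02X  ISO-8859-1: U+%04X  windows-1252: U+%04X%n",
                    value, decode(b, StandardCharsets.ISO_8859_1), decode(b, cp1252));
        }
    }
}
```

Text that is labeled ISO-8859-1 but was really written as Cp1252 therefore only breaks on characters such as the euro sign or curly quotes, which is why the mislabeling often goes unnoticed.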
8
Unicode
What is Unicode?
"Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."
9
Unicode
ISO/IEC 10646
  Its repertoire was merged with that of the Unicode Consortium
  Ties a character, a name, and a code point together
  BMP – Basic Multilingual Plane (the first 65,536 code points)
  The ISO and Unicode Consortium (UC) character repertoires are synchronized
  UCS (Universal Character Set)
10
Unicode
Q: So are they the same thing?
A: No. Although the character codes and encoding forms are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646.
11
Unicode
"The Unicode Standard includes a set of characters, names, and coded representations that are identical with those in ISO/IEC 10646:2003. It additionally provides details of character properties, processing algorithms, and definitions that are useful to implementers. [It] strengthens Unicode support for worldwide communication, software availability, and publishing."
12
Unicode
UCS code space (0x00000000 – 0x7FFFFFFF)
  128 x 256 x 256 x 256 (group, plane, row, cell: GPRC)
  2,147,483,648 possible code points
The Unicode Character Database
  Main definition (UnicodeData.txt)
  Available online
Unicode code space (0x000000 – 0x10FFFF)
  17 x 256 x 256
  1,114,112 code points
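The two code-space sizes above are plain multiplication. A small sketch that checks the arithmetic and compares the Unicode figure with the JDK constant `Character.MAX_CODE_POINT` (the class and method names are illustrative only):

```java
class CodeSpace {
    // UCS: 128 groups x 256 planes x 256 rows x 256 cells.
    static long ucsSize() {
        return 128L * 256 * 256 * 256;
    }

    // Unicode: 17 planes x 256 rows x 256 cells.
    static int unicodeSize() {
        return 17 * 256 * 256;
    }

    public static void main(String[] args) {
        System.out.printf("UCS code points:     %,d%n", ucsSize());
        System.out.printf("Unicode code points: %,d%n", unicodeSize());
        // The highest Unicode code point is 0x10FFFF, so the count is one more than that.
        System.out.println(unicodeSize() == Character.MAX_CODE_POINT + 1);
    }
}
```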
13
Unicode
As of Unicode 5.0.0, 101,063 (9.1%) of these code points are assigned, with another 137,468 (12.3%) reserved for private use, leaving 875,441 (78.6%) unassigned.
The number of assigned code points is made up as follows:
  98,884 graphic characters
  140 formatting characters
  65 control characters
  2,048 surrogate characters
14
Unicode
Plane 0 (0000-FFFF): Basic Multilingual Plane (BMP)
Used for most of the alphabets
Not all code points are used
Allocated in areas/blocks
15
Unicode
Plane 1 (10000-1FFFF): Supplementary Multilingual Plane (SMP)
Used mostly for historic scripts such as Linear B, but also for musical and mathematical symbols.
16
Unicode
Plane 2 (20000-2FFFF): Supplementary Ideographic Plane (SIP)
Used for about 40,000 rare Chinese characters that are mostly historic
17
Unicode
Planes 3 to 13 (30000-DFFFF)
Unassigned
18
Unicode
Plane 14 (E0000-EFFFF): Supplementary Special-purpose Plane (SSP)
Glyph (font) selection
code point + variation selector = variation sequence (Ideographic Variation Database)
19
Unicode
Private Use Area (PUA)
  Plane 15 (F0000-FFFFF)
  Plane 16 (100000-10FFFF)
  Plane 0 (E000-F8FF)
The PUA is a concept inherited from certain Asian encoding systems. These systems had private use areas to encode Japanese gaiji (rare personal-name characters) in application-specific ways.
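Since each plane holds 65,536 code points, the plane number is simply the code point shifted right by 16 bits. A brief sketch of that, together with a private-use check through the JDK's `Character.getType`; the class and helper names are hypothetical.

```java
class Planes {
    // Plane number: 0 for the BMP, 1 for the SMP, up to 16.
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    // True when the JDK's character database classifies the code point as private use.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        int[] samples = { 0x0041, 0x4E8C, 0xE000, 0x1D11E, 0x20000, 0xF0000, 0x10FFFD };
        for (int cp : samples) {
            System.out.printf("U+%06X  plane %2d  private use: %b%n",
                    cp, plane(cp), isPrivateUse(cp));
        }
    }
}
```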
20
Unicode
ConScript Unicode Registry
"The purpose of the ConScript Unicode Registry (CSUR) is to coordinate the assignment of blocks out of the Unicode Private Use Area (E000-F8FF and 000F0000-0010FFFF) to constructed/artificial scripts, including scripts for constructed/artificial languages."
Cirth, Klingon, Tengwar, etc.
21
Encodings
The purpose of the following encodings is to get the Unicode value to you. Different storage or transmission protocols call for different encodings. These are not different character sets; they are ways of representing the characters in Unicode.
22
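Encodings
Endianness
  0x1234: LE 34 12, BE 12 34
Byte Order Mark (BOM): 0xFEFF
  Helps determine endianness
  Unicode 3.2: 0x2060 (word joiner) takes over the zero-width no-break space role, and 0xFEFF is set aside for the BOM
  0xFFFE is reserved, so a byte-swapped BOM can be recognized
  Also used to declare the encoding (UTF-8)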
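The byte orders and BOMs on this slide can be seen directly with the JDK's standard charsets. A minimal sketch follows; the `hex` helper and the class name are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u1234"; // one BMP character, code unit 0x1234
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
        // The JDK's plain "UTF-16" encoder writes a big-endian BOM first.
        System.out.println("UTF-16:   " + hex(s.getBytes(StandardCharsets.UTF_16)));
        // The BOM character itself, encoded as UTF-8.
        System.out.println("BOM in UTF-8: " + hex("\uFEFF".getBytes(StandardCharsets.UTF_8)));
    }
}
```

When reading, a leading FF FE tells the decoder the data is little-endian, because FFFE can never be a real character.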
23
Encodings
UTF-8
Variable-length character encoding
Can address all characters in the UCS, but was limited by RFC 3629 to the Unicode code space
BOM: EF BB BF
Format
  000000 – 00007F   0zzzzzzz
  000080 – 0007FF   110yyyyy 10zzzzzz
  000800 – 00FFFF   1110xxxx 10yyyyyy 10zzzzzz
  010000 – 10FFFF   11110www 10xxxxxx 10yyyyyy 10zzzzzz
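The UTF-8 bit layout can be written out by hand. This is a sketch of an encoder for a single code point, limited to the Unicode code space as RFC 3629 requires, and checked against the JDK's own encoder; `Utf8Sketch` and `encode` are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encode one code point using the 1- to 4-byte UTF-8 patterns.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {
            return new byte[] { (byte) (0xC0 | (cp >> 6)), (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {
            return new byte[] { (byte) (0xE0 | (cp >> 12)),
                    (byte) (0x80 | ((cp >> 6) & 0x3F)), (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] { (byte) (0xF0 | (cp >> 18)), (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)), (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x41, 0xE9, 0x4E8C, 0x1D11E }) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```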
24
Encodings
UTF-32/UCS-4
Fixed-length character encoding
Uses 31 bits
UCS-4 is capable of addressing the entire UCS, but was restricted to cover only the Unicode code space
UTF-32 only covers the Unicode code space
Example: 4E8C = 00004E8C
BE BOM: 00 00 FE FF
LE BOM: FF FE 00 00
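Because UTF-32 is just the code point as a 4-byte integer, it can be sketched with a `ByteBuffer` rather than a charset lookup. The class, `encode`, and `hex` names below are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Utf32Sketch {
    // Write the code point as a 4-byte integer in the requested byte order.
    static byte[] encode(int codePoint, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(codePoint).array();
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("U+4E8C BE: " + hex(encode(0x4E8C, ByteOrder.BIG_ENDIAN)));
        System.out.println("U+4E8C LE: " + hex(encode(0x4E8C, ByteOrder.LITTLE_ENDIAN)));
        System.out.println("BOM BE:    " + hex(encode(0xFEFF, ByteOrder.BIG_ENDIAN)));
        System.out.println("BOM LE:    " + hex(encode(0xFEFF, ByteOrder.LITTLE_ENDIAN)));
    }
}
```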
25
Encodings
UCS-2
Fixed-length encoding
Two octets per character
It is NOT UTF-16!
Only addresses the BMP
UCS-2BE, UCS-2LE
Obsoleted by UTF-16
26
Encodings
UTF-16
Variable-length encoding
UTF-16BE, UTF-16LE
BE BOM: FE FF
LE BOM: FF FE
Surrogates are used to address code points outside the BMP. (We will cover this later)
27
Encodings
UTF-16 Surrogate Pairs
Needed for code points > 0xFFFF
High surrogate (first): 0xD800 – 0xDBFF
Low surrogate (second): 0xDC00 – 0xDFFF
Algorithm:
  high = ((cp - 0x10000) high 10 bits) | 0xD800
  low = ((cp - 0x10000) low 10 bits) | 0xDC00
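The surrogate algorithm is only a subtraction, a shift, and a mask. A sketch of it, compared with the JDK's `Character.toChars`; the class and method names are illustrative.

```java
import java.util.Arrays;

class SurrogateSketch {
    // Split a supplementary code point into its high and low surrogate.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int v = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 | (v >>> 10));  // top 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1D11E; // a musical symbol in Plane 1
        char[] pair = toSurrogates(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]);
        System.out.println("matches JDK: " + Arrays.equals(pair, Character.toChars(cp)));

        // One character for the reader, two char values for Java.
        String s = new String(pair);
        System.out.println("length = " + s.length()
                + ", code points = " + s.codePointCount(0, s.length()));
    }
}
```

This is also why `String.length()` counts UTF-16 code units rather than characters once text leaves the BMP.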
28
Encodings
Which encoding should you use?
If dealing with CJK or Hindi (>0x0800), UTF-8 requires 3 bytes per character whereas UTF-16 needs only 2
UTF-8 is great for ASCII (1 byte per character), whereas UTF-16 needs 2 bytes for it
Java uses UTF-16
Windows uses UTF-16LE internally
UTF-32 is not really used that much
UTF-8 and UTF-16 are the most common
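The size trade-off is easy to measure. A rough sketch with three sample strings chosen for this example (ASCII, Japanese, and Hindi, written as escapes so the source file stays ASCII); UTF-16BE is used so that no BOM is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the string occupies in the given encoding.
    static int size(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String ascii = "Unicode";
        String japanese = "\u65E5\u672C\u8A9E";                // 3 characters
        String hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"; // 6 code points
        for (String s : new String[] { ascii, japanese, hindi }) {
            System.out.printf("%d chars: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.length(), size(s, StandardCharsets.UTF_8),
                    size(s, StandardCharsets.UTF_16BE));
        }
    }
}
```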
29
Java
Unicode version supported by each release
  J2SE 1.5: Unicode 4.0
  J2SE 1.4: Unicode 3.0
  J2SE 1.3: Unicode 2.1
Supplementary characters were introduced in Unicode 3.1
Addressed in JSR 204
30
Java
Unicode characters are specified using \u, such as \u0039
Unicode can be used in source files
file.encoding=Cp1252 on my machine
  You can change this, but beware…
Java reads and writes using this encoding by default
You can specify the character set to use for reading or writing
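Relying on the platform default is what usually goes wrong, so here is a minimal sketch of naming the charset explicitly on both the writer and the reader. It uses in-memory streams to stay self-contained; the class and its `write`/`read` helpers are hypothetical, and `windows-1252` is assumed to be available as the JDK's name for Cp1252.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetIo {
    // Encode text through a Writer that is told which charset to use.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode bytes through a Reader that is told which charset to use.
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("platform default: " + Charset.defaultCharset());
        String text = "caf\u00E9 \u0039";
        byte[] bytes = write(text, StandardCharsets.UTF_8);
        System.out.println("same charset round-trips: "
                + read(bytes, StandardCharsets.UTF_8).equals(text));
        // Reading UTF-8 bytes as Cp1252 turns the accented letter into two wrong characters.
        System.out.println("wrong charset round-trips: "
                + read(bytes, Charset.forName("windows-1252")).equals(text));
    }
}
```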
31
Java
Big5 Big5-HKSCS EUC-JP EUC-KR GB18030 GB2312 GBK IBM-Thai
ISO-2022-CN ISO-2022-JP ISO-2022-KR
ISO-8859-n (11 entries)
JIS_X0201 JIS_X KOI8-R Shift_JIS TIS-620
US-ASCII UTF-16 UTF-16BE UTF-16LE UTF-8
windows-1250 windows-1251 windows-1252 windows-1253 windows-1254 windows-1255 windows-1256 windows-1257 windows-1258 windows-31j
x-Big5-Solaris x-euc-jp-linux x-EUC-TW x-eucJP-Open
x-IBM1006 x-IBM1025 x-IBM1046 x-IBM1097 x-IBM1098 x-IBM1112 x-IBM1122 x-IBM1123 x-IBM1124 x-IBM1381 x-IBM1383 x-IBM33722 x-IBM737 x-IBM856 x-IBM874 x-IBM875 x-IBM921 x-IBM922 x-IBM930 x-IBM933 x-IBM935 x-IBM937 x-IBM939 x-IBM942 x-IBM942C x-IBM943 x-IBM943C x-IBM948 x-IBM949 x-IBM949C x-IBM950 x-IBM964 x-IBM970
x-ISCII91 x-ISO-2022-CN-CNS x-ISO-2022-CN-GB x-iso x-JIS0208 x-JISAutoDetect x-Johab
x-MacArabic x-MacCentralEurope x-MacCroatian x-MacCyrillic x-MacDingbat x-MacGreek x-MacHebrew x-MacIceland x-MacRoman x-MacRomania x-MacSymbol x-MacThai x-MacTurkish x-MacUkraine
x-MS950-HKSCS x-mswin-936 x-PCK x-windows-874 x-windows-949 x-windows-950
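A list like the one above can be produced on any JVM with `Charset.availableCharsets()`. The result depends on the JDK version and installation, so the output will likely differ from the slide. A short sketch (the class name is illustrative):

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class ListCharsets {
    // Canonical charset names mapped to Charset objects, sorted by name.
    static SortedMap<String, Charset> all() {
        return Charset.availableCharsets();
    }

    public static void main(String[] args) {
        for (String name : all().keySet()) {
            System.out.println(name);
        }
        System.out.println(all().size() + " charsets available on this JVM");
    }
}
```

Before using a name that may not be installed everywhere, `Charset.isSupported(name)` is a cheap guard.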
32
Databases (Maybe)
SQL 92
  NATIONAL CHARACTER
  "The <key word>s NATIONAL CHARACTER are used to specify a character string data type with a particular implementation-defined character repertoire. Special syntax (N'string') is provided for representing literals in that character repertoire."
Collation
Database Support
  MySQL
  Oracle
  SQL Server
  Postgres
33
Demonstration
Read/write/examine UTF-8/UTF-16/UTF-16LE encoded text (with a hex editor)
Show encoding settings in Eclipse and Java
Show how Windows (and the Eclipse console) can/can't display some characters
Web browser settings
  Chinese article on the cracking of SHA-1
  Martin Fowler article on Dependency Injection
34
Resources
The big ones:
The rest: (GREAT resource)
For fun: