Compact Encodings of Unicode
Markus W. Scherer
Unicode/G11N Software Engineer
IBM Globalization Center of Competency
San Jose, California, September 2002

International Unicode Conference, San Jose, California, September 2002

Agenda
Encodings in files and protocols
  – Not: processing encoding forms
Unicode “is too big”
  – Issues and non-issues
How to reduce size of Unicode text
  – Choice of encoding
  – Optional compression
Examples and comparisons

What is ICU?
Internationalization libraries for C, C++, Java*
  – Open source (non-viral)
  – Sponsored by IBM
Unicode standard compliant
  – Full supplementary support
Cross-platform; extensible and customizable
High performance and thread-safe
  – Multiple locales in same thread, simultaneously
Converters for all Unicode charsets & hundreds of legacy codepages
*Sun’s Java licenses an earlier ICU version; ICU4J updates it.

Encodings of Unicode
Common Unicode character set
External encodings
  – Files and protocols
  – Almost always byte-serialized
  – Character Encoding Schemes/charsets
Processing encodings
  – Character Encoding Forms, often 16/32-bit
  – Different requirements
  – Topic for a different presentation…

Unicode “is too big”?
Perceived large size of Unicode text
  – Compared with legacy codepages
Size matters
  – Low-speed connections (dial-up, mobile)
  – Little memory (PDA, cell phone, embedded)
Size does not matter when…
  – Images & other binaries swamp text size
  – High-speed network
  – Temporary documents
  – Large amounts of memory

How big is it?
Size depends on language/script
Bytes/char for some language groups:

  Languages        Legacy   UTF-8   UTF-16
  Western/Latin    1        1       2
  Russian/Arabic   1        2       2
  Hindi/Thai       1        3       2
  CJK              2        3       2
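
The bytes/char figures on this slide can be spot-checked with a short Python sketch. The sample strings and the choice of legacy codecs (cp1252, cp1251, cp874, shift_jis) are illustrative assumptions, not taken from the slides; real text mixes in spaces, digits and punctuation, so averages will differ somewhat.

```python
# Average bytes per character for a few sample strings.
# Sample texts and legacy codec choices are illustrative assumptions.
SAMPLES = {
    "Western/Latin": ("Hello", "cp1252"),
    "Russian": ("\u041f\u0440\u0438\u0432\u0435\u0442", "cp1251"),   # "Privet" in Cyrillic
    "Thai": ("\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35", "cp874"),       # Thai greeting
    "Japanese": ("\u65e5\u672c\u8a9e", "shift_jis"),                 # "Nihongo" in kanji
}


def bytes_per_char(text, codec):
    """Encoded length divided by the number of code points."""
    return len(text.encode(codec)) / len(text)


def size_table():
    """Map each sample name to (legacy, UTF-8, UTF-16) bytes per character."""
    rows = {}
    for name, (text, legacy) in SAMPLES.items():
        rows[name] = (
            bytes_per_char(text, legacy),
            bytes_per_char(text, "utf-8"),
            bytes_per_char(text, "utf-16-be"),  # BE form, so no BOM is counted
        )
    return rows


for name, row in size_table().items():
    print(name, row)
```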

Legacy codepages
Compact because
  – Designed for single/few languages
  – Few characters compared with Unicode
Conversion problems
  – Fallback/substitution of unmappable chars
  – Mapping table differences
  – Loss of parts of text common
Large number/size of mapping tables

Reduce Unicode text size
Choice of encoding
  – Encodings designed for different purposes
  – Compactness vs. direct applicability vs. software support etc.
General-purpose compression
  – Best on top of compact encoding
  – Not available in all applications

UTF-8/16
Designed for processing but all-purpose
UTF-8:
  – Byte-based, ASCII-compatible
  – BMP: up to 3 bytes/char
UTF-16 (BE/LE):
  – Byte-serialization of 16-bit form, not ASCII-compatible
  – BE/LE forms or Byte Order Mark
  – BMP: always 2 bytes/char
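
As a small illustration of the UTF-8 and UTF-16 properties listed on this slide, the following Python sketch encodes three BMP characters with the standard library codecs. The sample string is an assumption chosen to show 1-, 2- and 3-byte UTF-8 sequences.

```python
# 'A' (U+0041), e-acute (U+00E9), euro sign (U+20AC): all in the BMP.
text = "A\u00e9\u20ac"

utf8 = text.encode("utf-8")           # 1 + 2 + 3 bytes; 'A' stays a plain ASCII byte
utf16be = text.encode("utf-16-be")    # always 2 bytes per BMP character, no BOM
utf16le = text.encode("utf-16-le")    # same code units, bytes swapped
utf16bom = text.encode("utf-16")      # platform byte order, Byte Order Mark prepended

print(utf8.hex(), len(utf8))
print(utf16be.hex(), len(utf16be))
print(utf16le.hex(), len(utf16le))
print(utf16bom.hex(), len(utf16bom))
```

The zero bytes in the UTF-16 output are what makes it incompatible with ASCII-based tools, while the UTF-8 output keeps `A` as the single byte 0x41.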

UTF-7
7-bit encoding designed for email
  – Obsolete: email is now 8-bit-safe
Partially ASCII-compatible
BMP: 2.67 bytes/char plus overhead
  – Base64-encoded UTF-16BE
Stateful
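
The 2.67 bytes/char figure follows from base64 carrying 6 bits per output byte, so one 16-bit code unit needs 16/6 ≈ 2.67 bytes. A minimal sketch with Python's built-in `utf-7` codec, using an assumed three-character sample:

```python
text = "\u65e5\u672c\u8a9e"          # three CJK characters, all in the BMP

encoded = text.encode("utf-7")        # '+' opens a base64 run, '-' closes it
payload = len(encoded) - 2            # bytes excluding the '+' and '-' shift overhead

print(encoded)                        # b'+ZeVnLIqe-'
print(payload / len(text))            # base64 payload bytes per character
print(round(16 / 6, 2))               # theoretical value: 2.67

# Partial ASCII compatibility: plain letters pass through unchanged.
print("Hi".encode("utf-7"))           # b'Hi'
```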

SCSU & BOCU-1
About as compact as legacy codepages
  – 1 byte/char for small scripts, 2 for CJK; stateful
  – Compress short strings better than LZW (zip) etc.
SCSU:
  – Limited ASCII compatibility (initial state)
  – Complex state, many encoding choices
  – Non-deterministic; arbitrary byte values
  – Established encoding, supported in various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)
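
The point about short strings can be illustrated with a general-purpose compressor. SCSU and BOCU-1 are not in the Python standard library, so this sketch only shows the other side of the comparison: zlib (DEFLATE) is used as a stand-in for zip-style compression, and the sample string is an assumption.

```python
import zlib


def deflate_size(text, codec="utf-8"):
    """Return (raw byte length, zlib-compressed byte length) for the text."""
    raw = text.encode(codec)
    return len(raw), len(zlib.compress(raw, 9))


short = "\u041f\u0440\u0438\u0432\u0435\u0442"   # six Cyrillic letters
long_text = short * 200                           # highly repetitive long text

print(deflate_size(short))       # compressed form is larger than the 12 raw bytes
print(deflate_size(long_text))   # compressed form is far smaller than the raw bytes
```

A short string gives the compressor no repetition to exploit, and the stream header and checksum add fixed overhead, whereas a compact encoding at about 1 byte/char would need roughly 6 bytes for the same six letters.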

BOCU-1
BOCU-1:
  – Delta-encoding; avoids control codes
  – MIME text-compatible but not ASCII
  – Deterministic
  – Preserves binary order (for sorting, databases)
  – New encoding; supported by ICU
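
The delta-encoding idea can be sketched in a few lines of Python. This is a toy illustration of the principle only, not the BOCU-1 byte format: it just shows that differences between consecutive code points within one script are small numbers. The function names and the `start` value are hypothetical.

```python
def code_point_deltas(text, start=0x40):
    """Differences between each code point and the previous one.

    `start` is a hypothetical initial 'previous' value for this sketch.
    """
    prev = start
    deltas = []
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - prev)
        prev = cp
    return deltas


def from_deltas(deltas, start=0x40):
    """Inverse of code_point_deltas: rebuild the text from the differences."""
    prev = start
    chars = []
    for d in deltas:
        prev += d
        chars.append(chr(prev))
    return "".join(chars)


sample = "\u041f\u0440\u0438\u0432\u0435\u0442"   # six Cyrillic letters
deltas = code_point_deltas(sample)
print(deltas)   # one large jump into the Cyrillic block, then small steps
```

After the first jump into the script's block, every difference fits comfortably in one byte, which is why a delta scheme can reach about 1 byte/char for small scripts.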

SCSU & BOCU-1 text sizes
Average bytes/char relative to UTF-8

  Languages        SCSU          BOCU-1
  English/French   100% (both)
  Russian/Arabic   55%           60%
  Hindi            40% (both)
  Thai             40%           45%
  Japanese         55%           60%
  Korean           85%           75%
  Chinese          70%           75%

Encoding vs. compression
For example: BOCU-1 with WinZip (sum of seven files)

                 UTF-8          BOCU-1
  Uncompressed   32024 bytes    20723 bytes
  Compressed     11659 bytes    10722 bytes
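
The ratios implied by these four numbers can be worked out directly. The sketch below uses only the byte counts from the slide; the variable names are ours.

```python
# Byte counts from the slide (sum of seven files).
utf8_raw, bocu_raw = 32024, 20723
utf8_zip, bocu_zip = 11659, 10722

encoding_gain_raw = bocu_raw / utf8_raw   # BOCU-1 vs. UTF-8, uncompressed
encoding_gain_zip = bocu_zip / utf8_zip   # BOCU-1 vs. UTF-8, both compressed
zip_gain_utf8 = utf8_zip / utf8_raw       # effect of compression on UTF-8
zip_gain_bocu = bocu_zip / bocu_raw       # effect of compression on BOCU-1

print(f"BOCU-1 / UTF-8, uncompressed: {encoding_gain_raw:.1%}")
print(f"BOCU-1 / UTF-8, compressed:   {encoding_gain_zip:.1%}")
print(f"zip on UTF-8:                 {zip_gain_utf8:.1%}")
print(f"zip on BOCU-1:                {zip_gain_bocu:.1%}")
```

In this sample the compact encoding saves roughly a third of the size on its own, but only a few percent more once both versions are compressed; compression on top of the compact encoding still gives the smallest result.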

Performance
Converter performance
  – Roundtrip to/from UTF-16 with ICU:
      SCSU: 45%..125% of UTF-8 roundtrip time
      BOCU-1: 40%..160% of UTF-8 roundtrip time
Depends on encoding ratio
  – Fast for small scripts, 1 byte/char
Separate compression adds to I/O time
Conversion time typically swamped by
  – Transmission (low-bandwidth connections)
      Shorter texts transmit faster!
  – Parsing/processing

Further considerations
In-document encoding declarations require ASCII readability (XML, HTML)
Protocol may limit byte values (SMTP)
  – TES (transfer encoding syntax) required for some encodings
      base64 for SCSU or UTF-16 in emails
      Increases text size
Compression removes ASCII readability and uses arbitrary byte values
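
The size increase from a transfer encoding can be made concrete: base64 maps every 3 input bytes to 4 ASCII bytes, about a 33% increase before any line breaks are added. A minimal sketch with Python's `base64` module, using an assumed UTF-16BE sample payload:

```python
import base64

payload = "\u65e5\u672c\u8a9e".encode("utf-16-be")   # 3 characters, 6 bytes
encoded = base64.b64encode(payload)                   # 8 bytes, all 7-bit ASCII

print(len(payload), len(encoded), encoded)

# The 4/3 growth holds for larger inputs too.
big = bytes(300)
print(len(base64.b64encode(big)) / len(big))
```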

Conclusion
UTF-8 and/or UTF-16 work in most cases
Size of text often not critical
When small text size needed:
  – Use SCSU or BOCU-1
  – Consider compression
  – Make sure receiver can handle it

References
Forms of Unicode:
Character Encoding Model: UTR #
SCSU: UTS #6
BOCU-1: nversion/bocu1/bocu1.html
ICU homepage:
Unicode Consortium:
IBM developerWorks: