Published by Marvin Roberts. Modified over 9 years ago.
1
San Jose, California, September 2002
Compact Encodings of Unicode
Markus W. Scherer
Unicode/G11N Software Engineer
IBM Globalization Center of Competency
2
San Jose, California, September 2002, 22nd International Unicode Conference
Agenda
Encodings in files and protocols
  – Not: Processing encoding forms
Unicode “is too big”
  – Issues and non-issues
How to reduce size of Unicode text
  – Choice of encoding
  – Optional compression
Examples and comparisons
3
What is ICU?
Internationalization libraries for C, C++, Java*
  – Open source, non-viral
  – Sponsored by IBM
Unicode standard compliant
  – Full supplementary support
Cross-platform; extensible and customizable
High performance and thread-safe
  – Multiple locales in the same thread, simultaneously
Converters for all Unicode charsets & hundreds of legacy codepages
http://oss.software.ibm.com/icu/
* Sun’s Java licenses an earlier ICU version; ICU4J updates it.
4
Encodings of Unicode
Common Unicode character set
External encodings
  – Files and protocols
  – Almost always byte-serialized
  – Character Encoding Schemes/charsets
Processing encodings
  – Character Encoding Forms, often 16/32-bit
  – Different requirements
  – Topic for a different presentation…
5
Unicode “is too big”?
Perceived large size of Unicode text
  – Compared with legacy codepages
Size matters
  – Low-speed connections (dial-up, mobile)
  – Little memory (PDA, cell phone, embedded)
Size does not matter when…
  – Images & other binaries swamp text size
  – High-speed network
  – Temporary documents
  – Large amounts of memory
6
How big is it?
Size depends on language/script
Bytes/char for some language groups:

Languages        Legacy   UTF-8   UTF-16
Western/Latin    1        1       2
Russian/Arabic   1        2       2
Hindi/Thai       1        3       2
CJK              2        3       2
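The figures in the table can be reproduced for short sample strings. The sketch below is only an illustration: the sample words and the legacy codepages chosen (cp1252, cp1251, cp874, Shift_JIS) are assumptions made for this example, not data from the slides, and real text mixes in spaces and punctuation that pull the UTF-8 averages down.

```python
# Sample words and legacy codepages are illustrative assumptions,
# not data taken from the presentation.
SAMPLES = {
    "Western/Latin": ("Hello", "cp1252"),
    "Russian": ("Привет", "cp1251"),
    # Thai greeting "sawatdi", written with escapes so the six code points are explicit
    "Thai": ("\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35", "cp874"),
    "Japanese (CJK)": ("日本語", "shift_jis"),
}


def bytes_per_char(text, encoding):
    """Average number of encoded bytes per Unicode code point."""
    return len(text.encode(encoding)) / len(text)


def size_table(samples):
    """Map each language to (legacy, UTF-8, UTF-16) bytes per character."""
    rows = {}
    for name, (text, legacy) in samples.items():
        rows[name] = (
            bytes_per_char(text, legacy),
            bytes_per_char(text, "utf-8"),
            bytes_per_char(text, "utf-16-be"),  # BE form, so no byte order mark is added
        )
    return rows


TABLE = size_table(SAMPLES)
for name, (legacy, utf8, utf16) in TABLE.items():
    print(f"{name:16} {legacy:.0f}  {utf8:.0f}  {utf16:.0f}")
```

Python's standard library has no ISCII codec, so Hindi is left out of the sample set.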
7
Legacy codepages
Compact because
  – Designed for single/few languages
  – Few characters compared with Unicode
Conversion problems
  – Fallback/substitution of unmappable chars
  – Mapping table differences
  – Loss of parts of text common
Large number/size of mapping tables
8
Reduce Unicode text size
Choice of encoding
  – Encodings designed for different purposes
  – Compactness vs. direct applicability vs. software support etc.
General-purpose compression
  – Best on top of compact encoding
  – Not available in all applications
9
UTF-8/16
Designed for processing but all-purpose
UTF-8:
  – Byte-based, ASCII-compatible
  – BMP: up to 3 bytes/char
UTF-16 (BE/LE):
  – Byte-serialization of 16-bit form, not ASCII-compatible
  – BE/LE forms or Byte Order Mark
  – BMP: always 2 bytes/char
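The byte counts on this slide follow directly from the code point ranges. As a minimal sketch (the helper names `utf8_length` and `utf16_length` are made up for this example), the following computes the lengths from the code point and checks them against Python's built-in codecs; it also shows the BE/LE byte order and the Byte Order Mark.

```python
import codecs


def utf8_length(code_point):
    """Bytes needed for one code point in UTF-8 (surrogates excluded)."""
    if code_point < 0x80:
        return 1          # ASCII
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3          # rest of the BMP
    return 4              # supplementary planes


def utf16_length(code_point):
    """Bytes needed for one code point in UTF-16 (2 in the BMP, else a surrogate pair)."""
    return 2 if code_point < 0x10000 else 4


# 'A' (ASCII), 'é' (Latin-1), '€' (BMP), U+1D11E (supplementary)
for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
    cp = ord(ch)
    print(f"U+{cp:04X}: UTF-8 {utf8_length(cp)} bytes, UTF-16 {utf16_length(cp)} bytes")

big_endian = "A".encode("utf-16-be")     # high byte first
little_endian = "A".encode("utf-16-le")  # low byte first
with_bom = "A".encode("utf-16")          # BOM followed by platform byte order
```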
10
UTF-7
7-bit encoding designed for email
  – Obsolete: email now 8-bit-safe
Partially ASCII-compatible
BMP: 2.67 bytes/char plus overhead
  – Base64-encoded UTF-16BE
Stateful
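The 2.67 bytes/char figure comes from base64 carrying 6 bits per output character while each BMP character needs 16 bits, so 16 / 6 ≈ 2.67. A small sketch using Python's built-in `utf-7` codec follows; the helper `utf7_base64_length` and the sample string are assumptions for illustration, and the "overhead" is the shift markers around each base64 run.

```python
import math


def utf7_base64_length(num_bmp_chars):
    """Base64 characters for a run of BMP characters, excluding shift markers."""
    return math.ceil(num_bmp_chars * 16 / 6)


text = "日本語"                      # three BMP characters (illustrative sample)
encoded = text.encode("utf-7")      # base64 run plus shift overhead
ascii_text = "Hello".encode("utf-7")  # plain ASCII letters pass through unchanged

print(round(16 / 6, 2))             # bytes per BMP character inside a base64 run
print(utf7_base64_length(len(text)), len(encoded))
```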
11
SCSU & BOCU-1
About as compact as legacy codepages
  – 1 byte/char for small scripts, 2 for CJK; stateful
  – Compress short strings better than LZW (zip) etc.
SCSU:
  – Limited ASCII compatibility (initial state)
  – Complex state, many encoding choices
  – Non-deterministic; arbitrary byte values
  – Established encoding, supported in various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)
12
BOCU-1
BOCU-1:
  – Delta-encoding; avoids control codes
  – MIME text-compatible but not ASCII
  – Deterministic
  – Preserves binary order (for sorting, databases)
  – New encoding; supported by ICU
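To see why delta-encoding is compact for small scripts, it helps to look at the differences between successive code points. The sketch below is not BOCU-1: it takes the plain difference from the previous code point, whereas the real encoding adjusts the reference value and maps differences onto specific byte sequences. The function names, the starting value, and the "small delta" window are illustrative assumptions.

```python
def code_point_deltas(text, start=0x40):
    """Differences between successive code points (simplified, not BOCU-1 itself)."""
    previous = start  # assumed initial reference value for this sketch
    deltas = []
    for ch in text:
        code_point = ord(ch)
        deltas.append(code_point - previous)
        previous = code_point
    return deltas


def small_delta_count(deltas, low=-64, high=63):
    """How many deltas fall in a window that a single byte could plausibly cover."""
    return sum(1 for d in deltas if low <= d <= high)


deltas = code_point_deltas("привет")  # Russian sample word (illustrative)
print(deltas)
print(small_delta_count(deltas), "of", len(deltas), "deltas are small")
```

Only the first character, which jumps from the starting value into the Cyrillic block, produces a large difference; the rest stay within the same block and so stay small.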
13
SCSU & BOCU-1 text sizes
Average bytes/char relative to UTF-8

Languages        SCSU    BOCU-1
English/French       100%
Russian/Arabic   55%     60%
Hindi                40%
Thai             40%     45%
Japanese         55%     60%
Korean           85%     75%
Chinese          70%     75%
14
Encoding vs. compression
For example: BOCU-1 with WinZip (sum of seven files)

               UTF-8         BOCU-1
Uncompressed   32024 bytes   20723 bytes
Compressed     11659 bytes   10722 bytes
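The ratios implied by the table are easier to compare as percentages. The following is plain arithmetic on the four byte counts from the slide; the dictionary layout and the `percent` helper are just a convenient way to write it down.

```python
# Byte counts copied from the slide (sum of seven files).
SIZES = {
    ("UTF-8", "uncompressed"): 32024,
    ("BOCU-1", "uncompressed"): 20723,
    ("UTF-8", "compressed"): 11659,
    ("BOCU-1", "compressed"): 10722,
}


def percent(part, whole):
    """Size of `part` as a percentage of `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)


baseline = SIZES[("UTF-8", "uncompressed")]
bocu_vs_utf8 = percent(SIZES[("BOCU-1", "uncompressed")], baseline)
zip_utf8_vs_utf8 = percent(SIZES[("UTF-8", "compressed")], baseline)
zip_bocu_vs_utf8 = percent(SIZES[("BOCU-1", "compressed")], baseline)
zip_bocu_vs_zip_utf8 = percent(
    SIZES[("BOCU-1", "compressed")], SIZES[("UTF-8", "compressed")]
)

print(bocu_vs_utf8, zip_utf8_vs_utf8, zip_bocu_vs_utf8, zip_bocu_vs_zip_utf8)
```

On these numbers, BOCU-1 alone saves about a third, and compressing BOCU-1 rather than UTF-8 still yields an archive roughly 8% smaller.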
15
Performance
Converter performance
  – Roundtrip to/from UTF-16 with ICU:
      SCSU: 45%..125% of UTF-8 roundtrip time
      BOCU-1: 40%..160% of UTF-8 roundtrip time
Depends on encoding ratio
  – Fast for small scripts, 1 byte/char
Separate compression adds to I/O time
Conversion time typically swamped by
  – Transmission (low-bandwidth connections)
      Shorter texts transmit faster!
  – Parsing/processing
16
Further considerations
In-document encoding declarations require ASCII readability (XML, HTML)
Protocol may limit byte values (SMTP)
  – TES (transfer encoding syntax) required for some encodings
      base64 for SCSU or UTF-16 in emails
      Increases text size
Compression removes ASCII readability and uses arbitrary byte values
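The size increase from a transfer encoding is easy to quantify for base64: every 3 input bytes become 4 output characters. A short sketch with Python's standard `base64` module, applied here to UTF-16BE bytes (the sample string and the helper `base64_size` are assumptions for illustration):

```python
import base64
import math


def base64_size(num_bytes):
    """Length of padded base64 output for `num_bytes` of input."""
    return 4 * math.ceil(num_bytes / 3)


text = "日本語"                      # illustrative sample
raw = text.encode("utf-16-be")      # 2 bytes per BMP character
wrapped = base64.b64encode(raw)     # 7-bit-safe form for a restricted protocol

print(len(raw), "bytes before,", len(wrapped), "bytes after base64")
```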
17
Conclusion
UTF-8 and/or UTF-16 work in most cases
Size of text often not critical
When small text size needed:
  – Use SCSU or BOCU-1
  – Consider compression
  – Make sure receiver can handle it
18
References
Forms of Unicode: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/
Character Encoding Model: UTR #17 http://www.unicode.org/reports/tr17/
SCSU: UTS #6 http://www.unicode.org/reports/tr6/
BOCU-1: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html
ICU homepage: http://oss.software.ibm.com/icu/
Unicode Consortium: http://www.unicode.org/
IBM developerWorks: http://www.ibm.com/developerworks/unicode/