Unicode
  WTF is UTF? (for Secondary School Students)
  Jan Zidek, Tieto Czech s.r.o.
  ☺ U+263A
Table of contents
  Puzzles
    Game
    Questionnaire
    What Is It?
    What Contains More Characters?
  Unicode Overview
    Evolution of Character Encoding
    Can you read it?
    Unicode Standard
    Timeline
    Unicode Characters
    Unicode on the Web
    Unicode Script Blocks
    17 Code Point Planes
    BMP – Basic Multilingual Plane
    Characters per Plane
  Encoding
    Unicode Encoding
    UTF-32
    UTF-8
    UTF-16
    From Unicode 1.0 to Unicode 2.0
    Surrogates
    UTF-16 Transformation
    Surrogates Mapping
    UTF-16 Encoding Example
    Endianness
    Endianness in Normal Life
    Unicode BOM – Byte Order Mark
  Encodings Summary
    Properties of Each UTF
    Unicode Encodings Length
    Unicode Encoding – Example 1
    Unicode Encoding – Example 2
  Useful Stuff
Puzzles ¿ U+00BF
Game
Questionnaire
  Who knows what binary code is?
  Who can convert between decimal and binary?
  Who knows what hexadecimal code is?
  Who can convert between hexadecimal and binary?
  Who has heard the word Unicode?
  Who has heard the terms UTF-8, UTF-16, or UTF-32?
  Who creates web pages?
What Is It?
What Contains More Characters? Unicode? UTF-8? UTF-16? UTF-32?
Unicode Overview U+1F026
Evolution of Character Encoding
  Pre-standards
  ASCII – 1960s – 7 bits – 128 characters
  Extended ASCII – 8 bits – 128 more characters
    → Kamenický brothers' encoding (Kód bratří Kamenických)
    → MS-DOS CP852, …
    → ISO , ISO , ISO , …
    → Microsoft CP1252, CP1250, …
    → …, …, …
Can you read it?
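Why text can become unreadable: the same bytes mean different letters under different 8-bit code pages. The following is a minimal sketch using Python's built-in codecs; the sample word is an assumption chosen for illustration and does not come from the slides.

```python
# One Czech word, stored with one 8-bit code page and read back with others.
# The sample word is an illustrative assumption.
word = "řeč"  # "speech"

raw = word.encode("cp1250")  # bytes as a Central European Windows program stores them

as_cp1250 = raw.decode("cp1250")                   # right code page: text survives
as_cp1252 = raw.decode("cp1252")                   # Western European code page: wrong letters
as_cp852 = raw.decode("cp852", errors="replace")   # MS-DOS Latin 2 code page: wrong again

print(raw.hex())
print(as_cp1250, as_cp1252, as_cp852)

# Unicode removes the guessing: each character has exactly one code point.
print([f"U+{ord(ch):04X}" for ch in word])
```

The bytes never change; only the table used to interpret them does.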
Unicode Standard
  Character coding system
Timeline
  Year  Version      Features                                 Characters defined  Address space
  1991  Unicode 1.0  Code space: 16 bits, U+0000 – U+FFFF     7,161               65,536
  1996  Unicode 2.0  Code space: 21 bits, U+0000 – U+10FFFF   38,950              1,114,112 (17 * 65,536)
  2015  Unicode 8.0                                           120,737             1,114,112
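The address-space figures in the timeline can be checked with a few lines of arithmetic. This is only a sketch; the constant names are made up for the example.

```python
# Size of the code space in Unicode 1.0 and from Unicode 2.0 onward.
BMP_SIZE = 2 ** 16               # 16-bit code space: U+0000 – U+FFFF
PLANES = 17                      # the original plane plus 16 extra planes
CODE_SPACE = PLANES * BMP_SIZE   # U+0000 – U+10FFFF

LAST_CODE_POINT = CODE_SPACE - 1

print(BMP_SIZE)                      # 65536
print(CODE_SPACE)                    # 1114112
print(f"U+{LAST_CODE_POINT:X}")      # U+10FFFF
print(LAST_CODE_POINT.bit_length())  # 21 bits are needed for the largest code point
```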
Unicode Characters
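Every Unicode character has a number (its code point, written U+XXXX) and an official name. A small sketch with Python's standard `unicodedata` module; the helper name `describe` is invented for this example.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point in U+XXXX notation plus the official character name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Characters that appear on the section slides of this deck.
for ch in "☺¿א":
    print(ch, describe(ch))

# The reverse direction: from a code point back to the character.
print(chr(0x263A))
```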
Unicode on the Web
  Before HTML 5:
  HTML 5:
  NCR (numeric character reference): a
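A numeric character reference (NCR) writes a character into HTML by its code point, in decimal (`&#9786;`) or hexadecimal (`&#x263A;`). A minimal sketch using only the Python standard library; the helper names are assumptions for the example.

```python
import html

def to_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

def to_hex_ncr(ch: str) -> str:
    """Build the hexadecimal NCR for a single character."""
    return f"&#x{ord(ch):X};"

print(to_ncr("Smile ☺"))          # Smile &#9786;
print(to_hex_ncr("☺"))            # &#x263A;
print(html.unescape("&#97;"))     # a
print(html.unescape("&#x263A;"))  # ☺
```

NCRs work regardless of the encoding the page itself is saved in, because they use ASCII characters only.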
Unicode Script Blocks
17 Code Point Planes
BMP – Basic Multilingual Plane
Characters per Plane
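Each plane holds 65,536 code points, so the plane number is simply whatever is left of a code point above its lowest 16 bits. A short sketch; the function names are hypothetical.

```python
def plane_of(code_point: int) -> int:
    """Return the plane (0 to 16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # drop the 16 bits that address a position inside the plane

def is_bmp(code_point: int) -> bool:
    """Plane 0 is the Basic Multilingual Plane."""
    return plane_of(code_point) == 0

for cp in (0x263A, 0x1F427, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```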
Encoding U+1F427
Unicode Encoding
  CEF: Character Encoding Form
  UTF: Unicode Transformation Format
UTF-32 U+1F467
UTF-32
  1:1 (each code point is stored as exactly one 32-bit code unit)
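Because UTF-32 is a 1:1 mapping, encoding is nothing more than writing the code point as a 4-byte number. A sketch assuming big-endian byte order; `utf32_be` is a name made up for the example, and the result is compared with Python's built-in codec.

```python
def utf32_be(ch: str) -> bytes:
    """Encode one character as UTF-32, big-endian: the code point as a 4-byte integer."""
    return ord(ch).to_bytes(4, "big")

for ch in ("A", "☺", "\U0001F467"):
    encoded = utf32_be(ch)
    assert encoded == ch.encode("utf-32-be")  # same result as the built-in codec
    print(f"U+{ord(ch):04X} -> {encoded.hex()}")
```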
UTF-8 U+1F466
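UTF-8
  Bits  Last code point  Byte 1     Byte 2     Byte 3     Byte 4     Byte 5     Byte 6
  7     U+007F           0xxx xxxx
  11    U+07FF           110x xxxx  10xx xxxx
  16    U+FFFF           1110 xxxx  10xx xxxx  10xx xxxx
  21    U+1FFFFF         1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx
  26    U+3FFFFFF        1111 10xx  10xx xxxx  10xx xxxx  10xx xxxx  10xx xxxx
  31    U+7FFFFFFF       1111 110x  10xx xxxx  10xx xxxx  10xx xxxx  10xx xxxx  10xx xxxx
  The 5- and 6-byte forms come from the original UTF-8 design; because the code space ends at U+10FFFF, at most 4 bytes are used today.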
UTF-EBCDIC
  Bits  Last code point  Byte 1     Byte 2     Byte 3     Byte 4  Byte 5  Byte 6
  …     U+007F           0xxx xxxx
  …     U+009F           100x xxxx
  …     U+03FF           110x xxxx  101x xxxx
  …     U+3FFF           1110 xxxx  101x xxxx  101x xxxx
  …     U+… FFFF         …xx        101x xxxx  …
  …     U+… FFFF         …x         101x xxxx  …
UTF-8 Example
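The UTF-8 bit patterns can be applied by hand with shifts and masks. The sketch below covers the 1- to 4-byte forms only (the range in use today); `utf8_encode` is a name invented for the example, and its output is checked against Python's built-in codec.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 using the 1- to 4-byte bit patterns."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encoded in UTF-8")
    if 0 <= code_point <= 0x7F:            # 0xxx xxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                # 110x xxxx  10xx xxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:               # 1110 xxxx  10xx xxxx  10xx xxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:             # 1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("outside the Unicode code space")

for ch in ("A", "¿", "☺", "\U0001F466"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # agrees with the built-in codec
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```

For ☺ (U+263A, binary 0010 0110 0011 1010) the 16 bits are split 4 + 6 + 6 and placed into the three x-fields, giving E2 98 BA.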
UTF-16 U+1F467
From Unicode 1.0 to Unicode 2.0
  65,536 characters ought to be enough for anybody
  Workaround concept for backward compatibility: surrogates
  Planes (65,536 code points each)
    Original Unicode 1.0: Basic Multilingual Plane
    Added 16 extra planes
    Total: 17 * 65,536 = 1,114,112 code points
Surrogates
        Range            Mask
  High  U+D800 – U+DBFF  D8 XX = 1101 1000 xxxx xxxx
                         D9 XX = 1101 1001 xxxx xxxx
                         DA XX = 1101 1010 xxxx xxxx
                         DB XX = 1101 1011 xxxx xxxx
                         4 * 256 = 1,024 high surrogates
  Low   U+DC00 – U+DFFF  DC XX = 1101 1100 xxxx xxxx
                         DD XX = 1101 1101 xxxx xxxx
                         DE XX = 1101 1110 xxxx xxxx
                         DF XX = 1101 1111 xxxx xxxx
                         4 * 256 = 1,024 low surrogates
  Combinations: 1,024 * 1,024 = 1,048,576 new code points
UTF-16 Transformation
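The transformation into a surrogate pair can be sketched in a few steps: subtract 0x10000, split the remaining 20 bits into two halves of 10 bits, and add each half to the start of the high and low surrogate ranges. The function name is an assumption for this example.

```python
def utf16_code_units(code_point: int) -> tuple:
    """Return the UTF-16 code units (one or two 16-bit numbers) for a code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x10000:
        return (code_point,)               # BMP: stored as is
    offset = code_point - 0x10000          # 20-bit value, 0x00000 – 0xFFFFF
    high = 0xD800 + (offset >> 10)         # top 10 bits    -> U+D800 – U+DBFF
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> U+DC00 – U+DFFF
    return (high, low)

for cp in (0x263A, 0x1F467):
    units = utf16_code_units(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```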
Surrogates Mapping
  hi \ lo  DC00     DC01     DC02     DC03     …  DFFE     DFFF
  D800     1 0000   1 0001   1 0002   1 0003   …  1 03FE   1 03FF
  D801     1 0400   1 0401   1 0402   1 0403   …  1 07FE   1 07FF
  D802     1 0800   1 0801   1 0802   1 0803   …  1 0BFE   1 0BFF
  D803     1 0C00   1 0C01   1 0C02   1 0C03   …  1 0FFE   1 0FFF
  ⋮        ⋮        ⋮        ⋮        ⋮        ⋱  ⋮        ⋮
  DBFB     10 EC00  10 EC01  10 EC02  10 EC03  …  10 EFFE  10 EFFF
  DBFC     10 F000  10 F001  10 F002  10 F003  …  10 F3FE  10 F3FF
  DBFD     10 F400  10 F401  10 F402  10 F403  …  10 F7FE  10 F7FF
  DBFE     10 F800  10 F801  10 F802  10 F803  …  10 FBFE  10 FBFF
  DBFF     10 FC00  10 FC01  10 FC02  10 FC03  …  10 FFFE  10 FFFF
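Any cell of the mapping table can be recomputed from its row (high surrogate) and column (low surrogate). A minimal sketch of the reverse direction, from a surrogate pair back to a code point; the function name is hypothetical.

```python
def from_surrogates(high: int, low: int) -> int:
    """Combine a high and a low surrogate into the code point they stand for."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not a low surrogate")
    # Each high surrogate selects a block of 0x400 (1,024) code points.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

for high, low in ((0xD800, 0xDC00), (0xD803, 0xDC03), (0xDBFB, 0xDC00), (0xDBFF, 0xDFFF)):
    print(f"{high:04X} {low:04X} -> U+{from_surrogates(high, low):X}")
```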
UTF-16 Encoding Example
Endianness
Endianness in Normal Life
  Language  92                            Endian
  English   ninety-two (90-2)             Big
  German    zweiundneunzig (2-and-90)     Little
  French    quatre-vingt-douze (4-20-12)
  Usage         Form             Endian
  Java package  com.tieto.intra  Big
  Domain name   intra.tieto.com  Little
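In a computer, endianness decides which byte of a multi-byte number comes first. A short sketch with the 16-bit value of ☺ (U+263A), compared with Python's built-in UTF-16 codecs.

```python
value = 0x263A  # code point of ☺, which fits into one 16-bit code unit

big = value.to_bytes(2, "big")        # most significant byte first:  26 3A
little = value.to_bytes(2, "little")  # least significant byte first: 3A 26

print(big.hex(" "), "|", little.hex(" "))

# The UTF-16BE and UTF-16LE encodings differ only in this byte order.
assert "☺".encode("utf-16-be") == big
assert "☺".encode("utf-16-le") == little

# Reading bytes with the wrong byte order yields a different number.
print(hex(int.from_bytes(big, "little")))
```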
Unicode BOM – Byte Order Mark
  U+FEFF
  Using a BOM at the start of a text stream is optional
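A reader can look at the first two bytes of UTF-16 data to learn its byte order: U+FEFF arrives as FE FF in big-endian and as FF FE in little-endian. This is a minimal sketch limited to UTF-16 (UTF-32 data would need a 4-byte check first); the function name is an assumption, while the `codecs` constants are part of the Python standard library.

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its optional BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return "unknown (no BOM)"

with_bom_be = codecs.BOM_UTF16_BE + "☺".encode("utf-16-be")
with_bom_le = codecs.BOM_UTF16_LE + "☺".encode("utf-16-le")
without_bom = "☺".encode("utf-16-be")

for data in (with_bom_be, with_bom_le, without_bom):
    print(data.hex(" "), "->", detect_utf16_order(data))

# UTF-8 has no byte-order problem, but U+FEFF may still appear as a signature.
print("☺".encode("utf-8-sig").hex(" "))  # ef bb bf e2 98 ba
```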
Encodings Summary U+1F3B8
Properties of Each UTF
  Name                        UTF-8  UTF-16   UTF-16BE    UTF-16LE       UTF-32   UTF-32BE    UTF-32LE
  Smallest code point         0000   0000     0000        0000           0000     0000        0000
  Largest code point          10FFFF 10FFFF   10FFFF      10FFFF         10FFFF   10FFFF      10FFFF
  Code unit size              8 bits 16 bits  16 bits     16 bits        32 bits  32 bits     32 bits
  Byte order                  N/A    <BOM>    big-endian  little-endian  <BOM>    big-endian  little-endian
  Fewest bytes per character  1      2        2           2              4        4           4
  Most bytes per character    4      4        4           4              4        4           4
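Unicode Encodings Length
  Code range (hex)     UTF-8  UTF-EBCDIC  UTF-16  UTF-32  GB 18030
  00 0000 – 00 007F    1      1           2       4       1
  00 0080 – 00 009F    2      1           2       4       2 or 4 (*)
  00 00A0 – 00 03FF    2      2           2       4       2 or 4 (*)
  00 0400 – 00 07FF    2      3           2       4       2 or 4 (*)
  00 0800 – 00 3FFF    3      3           2       4       2 or 4 (*)
  00 4000 – 00 FFFF    3      4           2       4       2 or 4 (*)
  01 0000 – 03 FFFF    4      4           4       4       4
  04 0000 – 10 FFFF    4      5           4       4       4
  (*) 2 for characters inherited from GB 2312/GBK (e.g. most Chinese characters), 4 for everything else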
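The lengths for UTF-8, UTF-16 and UTF-32 can be measured directly by encoding sample characters. A small sketch using Python's built-in codecs (UTF-EBCDIC has no codec in the standard library and is left out); the sample characters and the helper name are assumptions for illustration.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

# One sample character from several code ranges.
for ch in ("A", "¿", "א", "☺", "\U0001F427"):
    lengths = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}", lengths)
```

The BE variants are used so that no BOM is added to the byte count.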
Unicode Encoding – Example 1
Unicode Encoding – Example 2
Useful Stuff א U+05D0
Useful Utilities
  Online Character Converter
  BabelMap
  Unibook
  Alan Wood’s Unicode Resources
  Microsoft TrueTypeProperty Extension
  Uniview
Useful links
  http://en.wikipedia.org/wiki/Unicode_font#Comparison_of_fonts
Good Night! U+1F4A4