Unicode WTF is UTF? (for Secondary School Students) Jan Zidek Tieto Czech s.r.o. ☺ U+263A.

Table of contents
Puzzles
Game
Questionnaire
What Is It?
What Contains More Characters?
Unicode Overview
Evolution of Character Encoding
Can you read it?
Unicode Standard
Timeline
Unicode Characters
Unicode on the Web
Unicode Script Blocks
17 Code Point Planes
BMP – Basic Multilingual Plane
Characters per Plane
Encoding
Unicode Encoding
UTF-32
UTF-8
UTF-16
From Unicode 1.0 to Unicode 2.0
Surrogates
UTF-16 Transformation
Surrogates Mapping
UTF-16 Encoding Example
Endianness
Endianness in Normal Life
Unicode BOM – Byte Order Mark
Encodings Summary
Properties of Each UTF
Unicode Encodings Length
Unicode Encoding – Example 1
Unicode Encoding – Example 2
Useful Stuff

Puzzles ¿ U+00BF

Game

Questionnaire
Who knows what binary code is?
Who can convert between decimal and binary?
Who knows what hexadecimal code is?
Who can convert between hexadecimal and binary?
Who has heard the word Unicode?
Who has heard of UTF-8, UTF-16, or UTF-32?
Who creates web pages?

What Is It?

What Contains More Characters? Unicode? UTF-8? UTF-16? UTF-32?

Unicode Overview U+1F026

Evolution of Character Encoding
Pre-standards
ASCII – 1960s – 7 bits – 128 characters
Extended ASCII – 8 bits – 128 more characters
→ Kód bratří Kamenických (Kamenický brothers' encoding)
→ MS-DOS CP852, …
→ ISO …, ISO …, ISO …, …
→ Microsoft CP1252, CP1250, …
→ …, …, …
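
The trouble with the many 8-bit "extended ASCII" tables can be sketched in a few lines. This is only an illustration, assuming Python 3 and its built-in `cp1250`, `cp1252` and `cp852` codecs; the byte value `0xE8` is an arbitrary choice, not something taken from the slides.

```python
# One byte, three legacy 8-bit code pages: the character you get
# depends entirely on which table the reader assumes.
raw = bytes([0xE8])
CODE_PAGES = ("cp1250", "cp1252", "cp852")

decoded = {name: raw.decode(name) for name in CODE_PAGES}
for name, ch in decoded.items():
    print(f"{name}: byte 0xE8 -> {ch} (U+{ord(ch):04X})")

# The lower half (0x00-0x7F) is plain ASCII in all three code pages.
ascii_same = all(b"Unicode".decode(name) == "Unicode" for name in CODE_PAGES)
print("ASCII part identical:", ascii_same)
```

Only the upper 128 values differ between the tables, which is why English text survived a wrong code page while Czech text did not.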

Can you read it?

Unicode Standard
Character coding system
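
As a small sketch of what "character coding system" means in practice: every character has a number (its code point, written U+XXXX) and an official name. The example assumes Python 3 and its standard `unicodedata` module; `describe` is a hypothetical helper name used only here.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point in U+XXXX notation plus the character name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Characters that appear on the section slides of this deck.
for ch in ("☺", "¿", "א"):
    print(ch, describe(ch))

# The reverse direction: from a code point number back to the character.
print(chr(0x263A))
```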

Timeline
Year  Version      Features                                Characters defined  Address space
1991  Unicode 1.0  Code space: 16 bits, U+0000 – U+FFFF    7,161               65,536
1996  Unicode 2.0  Code space: 21 bits, U+0000 – U+10FFFF  38,950              1,114,112 (17 * 65,536)
2015  Unicode 8.0                                          120,737             1,114,112

Unicode Characters

Unicode Characters

Unicode on the Web

Unicode on the Web
Before HTML 5:
HTML 5:
NCR: a
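
The markup shown on this slide did not survive in the transcript, so the following is only a sketch of the NCR (numeric character reference) idea, not the slide's own example. It assumes Python 3 and the standard `html` module; `to_ncr` is a hypothetical helper. For reference, a widely used HTML 5 encoding declaration is `<meta charset="utf-8">`.

```python
import html

def to_ncr(text: str, hexadecimal: bool = False) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            parts.append(ch)                    # ASCII stays as it is
        elif hexadecimal:
            parts.append(f"&#x{ord(ch):X};")    # e.g. &#x263A;
        else:
            parts.append(f"&#{ord(ch)};")       # e.g. &#9786;
    return "".join(parts)

print(to_ncr("smile ☺"))
print(to_ncr("smile ☺", hexadecimal=True))
print(html.unescape("&#97;"))   # a browser would display the letter a
```

An NCR lets a page show any Unicode character even when the file itself is saved in plain ASCII.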

Unicode Script Blocks

17 Code Point Planes
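
The plane of a code point can be read straight from its number, as this minimal sketch shows. It assumes Python 3; `plane_of`, `PLANE_SIZE` and `PLANE_COUNT` are hypothetical names used for illustration.

```python
PLANE_SIZE = 0x10000    # 65,536 code points per plane
PLANE_COUNT = 17        # planes 0 to 16; plane 0 is the BMP

def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16     # drop the lower 16 bits

print(plane_of(0x263A))     # smiley: plane 0 (BMP)
print(plane_of(0x1F427))    # penguin: plane 1
print(PLANE_SIZE * PLANE_COUNT)
```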

BMP – Basic Multilingual Plane

Characters per Plane

Encoding U+1F427

Unicode Encoding
CEF: Character Encoding Form
UTF: Unicode Transformation Format

UTF-32 U+1F467

UTF-32
1:1 (each code point maps to exactly one 32-bit code unit)
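
The 1:1 mapping can be checked by hand. This is a minimal sketch assuming Python 3; `utf32_be` is a hypothetical helper that writes each code point as four bytes, most significant byte first.

```python
def utf32_be(text: str) -> bytes:
    """Encode text as big-endian UTF-32: the code point itself in 4 bytes."""
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)

girl = "\U0001F467"                 # U+1F467, the character on the UTF-32 slide
print(utf32_be(girl).hex(" "))      # the code point, zero-padded to 32 bits
print(utf32_be("A☺").hex(" "))      # two characters, eight bytes
```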

UTF-8 U+1F466

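UTF-8
Bits  Last code point  Byte 1     Byte 2     Byte 3     Byte 4     Byte 5     Byte 6
7     007F             0xxx xxxx
11    07FF             110x xxxx  10xx xxxx
16    FFFF             1110 xxxx  10xx xxxx  10xx xxxx
21    001F FFFF        1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx
26    03FF FFFF        1111 10xx  10xx xxxx  10xx xxxx  10xx xxxx  10xx xxxx
31    7FFF FFFF        1111 110x  10xx xxxx  10xx xxxx  10xx xxxx  10xx xxxx  10xx xxxx
The 5- and 6-byte forms belong to the original UTF-8 design. Since RFC 3629 (2003), UTF-8 stops at 4 bytes and at U+10FFFF.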
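
The bit patterns in the table can be turned into a small encoder. This is a sketch under stated assumptions: Python 3, only the 1- to 4-byte forms that are valid today, and a hypothetical function name `utf8_encode`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the 1- to 4-byte UTF-8 patterns."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                                   # 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:                                  # 110x xxxx 10xx xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 1110 xxxx + 2 trail bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 1111 0xxx + 3 trail bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0xBF, 0x263A, 0x1F466):
    print(f"U+{cp:04X} ->", utf8_encode(cp).hex(" "))
```

Each trail byte carries six payload bits, which is why the shifts go down in steps of six.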

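UTF-EBCDIC
Bits  Last code point  Byte 1     Byte 2     Byte 3     Byte 4     Byte 5
7     007F             0xxx xxxx
5     009F             100x xxxx
10    03FF             110x xxxx  101x xxxx
14    3FFF             1110 xxxx  101x xxxx  101x xxxx
18    03 FFFF          1111 0xxx  101x xxxx  101x xxxx  101x xxxx
21    10 FFFF          1111 100x  101x xxxx  101x xxxx  101x xxxx  101x xxxx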

UTF-8 Example

UTF-16 U+1F467

From Unicode 1.0 to Unicode 2.0
"65,536 characters ought to be enough for anybody"
Workaround concept for backward compatibility: surrogates
Planes (65,536 characters each)
→ The original Unicode 1.0 range became the Basic Multilingual Plane
→ Added 16 extra planes
→ Total: 17 * 65,536 = 1,114,112 code points

Surrogates
      Range            Mask
High  U+D800 – U+DBFF  D8 XX  1101 1000 xxxx xxxx
                       D9 XX  1101 1001 xxxx xxxx
                       DA XX  1101 1010 xxxx xxxx
                       DB XX  1101 1011 xxxx xxxx
                       4 * 256 = 1,024 high surrogates
Low   U+DC00 – U+DFFF  DC XX  1101 1100 xxxx xxxx
                       DD XX  1101 1101 xxxx xxxx
                       DE XX  1101 1110 xxxx xxxx
                       DF XX  1101 1111 xxxx xxxx
                       4 * 256 = 1,024 low surrogates
Combinations: 1,024 * 1,024 = 1,048,576 new code points

UTF-16 Transformation

Surrogates Mapping
hi \ lo  DC00     DC01     DC02     DC03     …  DFFE     DFFF
D800     1 0000   1 0001   1 0002   1 0003   …  1 03FE   1 03FF
D801     1 0400   1 0401   1 0402   1 0403   …  1 07FE   1 07FF
D802     1 0800   1 0801   1 0802   1 0803   …  1 0BFE   1 0BFF
D803     1 0C00   1 0C01   1 0C02   1 0C03   …  1 0FFE   1 0FFF
⋮        ⋮        ⋮        ⋮        ⋮        ⋱  ⋮        ⋮
DBFB     10 EC00  10 EC01  10 EC02  10 EC03  …  10 EFFE  10 EFFF
DBFC     10 F000  10 F001  10 F002  10 F003  …  10 F3FE  10 F3FF
DBFD     10 F400  10 F401  10 F402  10 F403  …  10 F7FE  10 F7FF
DBFE     10 F800  10 F801  10 F802  10 F803  …  10 FBFE  10 FBFF
DBFF     10 FC00  10 FC01  10 FC02  10 FC03  …  10 FFFE  10 FFFF
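
Every cell of the mapping table follows from one small calculation, sketched here. The example assumes Python 3; `to_surrogates` and `from_surrogates` are hypothetical helper names.

```python
from typing import Tuple

def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = cp - 0x10000                   # 20 bits remain
    high = 0xD800 + (offset >> 10)          # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # lower 10 bits
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Combine a surrogate pair back into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F467)
print(f"U+1F467 -> {high:04X} {low:04X}")
print(f"DBFF DFFF -> U+{from_surrogates(0xDBFF, 0xDFFF):X}")
```

Ten bits in each half give 1,024 * 1,024 combinations, exactly the 16 extra planes.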

UTF-16 Encoding Example

Endianness

Endianness in Normal Life
Language  92                            Endian
English   ninety-two (90-2)             Big
German    zweiundneunzig (2-and-90)     Little
French    quatre-vingt-douze (4-20-12)

Usage         Form             Endian
Java package  com.tieto.intra  Big
Domain name   intra.tieto.com  Little
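
The same idea applies to bytes: a 16-bit value can be stored with its big end or its little end first. This is a minimal sketch assuming Python 3; the variable names are illustrative only.

```python
smiley = 0x263A                         # code point of ☺

big = smiley.to_bytes(2, "big")         # most significant byte first
little = smiley.to_bytes(2, "little")   # least significant byte first
print(big.hex(" "), "|", little.hex(" "))

# The package name and the domain name from the table are the same parts
# in opposite order.
package = "com.tieto.intra"
domain = ".".join(reversed(package.split(".")))
print(domain)
```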

Unicode BOM – Byte Order Mark
U+FEFF
BOM use is optional at the start of the text stream
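
A reader can look at the first bytes of a stream to guess the encoding and byte order. The sketch below assumes Python 3 and the BOM constants from the standard `codecs` module; `sniff_bom` is a hypothetical helper, and real files may carry no BOM at all.

```python
import codecs

# Check the 4-byte signatures first: the UTF-32 LE BOM (FF FE 00 00)
# starts with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return (encoding, payload) if data starts with a BOM, else (None, data)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

stream = codecs.BOM_UTF16_LE + "☺".encode("utf-16-le")
encoding, payload = sniff_bom(stream)
print(encoding, payload.decode(encoding))
```

The BOM is simply U+FEFF encoded in the stream's own encoding, so the byte order it arrives in reveals the byte order of everything after it.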

Encodings Summary U+1F3B8

Properties of Each UTF
Name                        UTF-8   UTF-16   UTF-16BE    UTF-16LE       UTF-32   UTF-32BE    UTF-32LE
Smallest code point         0000    0000     0000        0000           0000     0000        0000
Largest code point          10FFFF  10FFFF   10FFFF      10FFFF         10FFFF   10FFFF      10FFFF
Code unit size              8 bits  16 bits  16 bits     16 bits        32 bits  32 bits     32 bits
Byte order                  N/A     BOM      big-endian  little-endian  BOM      big-endian  little-endian
Fewest bytes per character  1       2        2           2              4        4           4
Most bytes per character    4       4        4           4              4        4           4

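Unicode Encodings Length
Code range (hex)   UTF-8  UTF-EBCDIC  UTF-16  UTF-32  GB 18030
00 0000 – 00 007F  1      1           2       4       1
00 0080 – 00 009F  2      1           2       4       (*)
00 00A0 – 00 03FF  2      2           2       4       (*)
00 0400 – 00 07FF  2      3           2       4       (*)
00 0800 – 00 3FFF  3      3           2       4       (*)
00 4000 – 00 FFFF  3      4           2       4       (*)
01 0000 – 03 FFFF  4      4           4       4       (*)
04 0000 – 10 FFFF  4      5           4       4       (*)
(*) 2 for characters inherited from GB 2312/GBK (e.g. most Chinese characters), 4 for everything else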
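
Some rows of the length table can be measured directly. This is a sketch assuming Python 3 and its built-in codecs; UTF-EBCDIC is left out because Python ships no codec for it, and `encoded_lengths` is a hypothetical helper name.

```python
CODECS = ("utf-8", "utf-16-be", "utf-32-be", "gb18030")

def encoded_lengths(cp: int) -> dict:
    """Return the number of bytes one code point takes in each encoding."""
    ch = chr(cp)
    return {name: len(ch.encode(name)) for name in CODECS}

# A Latin letter, a common Chinese character, and a guitar emoji.
for cp in (0x41, 0x4E2D, 0x1F3B8):
    print(f"U+{cp:04X}", encoded_lengths(cp))
```

The explicit `-be` variants are used so that no BOM is added to the byte count.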

Unicode Encoding – Example 1

Unicode Encoding – Example 2

Useful Stuff א U+05D0

Useful Utilities
Online Character Converter
BabelMap
Unibook
Alan Wood’s Unicode Resources
Microsoft TrueType Property Extension
Uniview

Useful links
http://en.wikipedia.org/wiki/Unicode_font#Comparison_of_fonts

Good Night! U+1F4A4