Ruby M17N, RubyKaigi 08, Martin J. Dürst.

Presentation transcript:

Ruby M17N
RubyKaigi 08
Martin J. Dürst

Summary
(intro)
Ruby (Ruby's Case)
(transcoding) by Martin
(questions)

Who is naruse
nkf
Softbank Technology
–iPhone

Ruby M17N
CSI

M17N methods:
UCS Normalization
CSI (Code Set Independent)

UCS Normalization
UCS (Universal Character Set)

Perl's case (Unicode)
Decode:
$str = decode("UTF-8", "\xE3\x81\x82");
$str -> "あ"
Encode:
$bytes = encode("UTF-8", "あ");
$bytes -> "\xE3\x81\x82"

CSI
Code Set Independent
Solaris, Citrus
__STDC_ISO_10646__ C

Ruby
String
" ".encoding ->
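
As a rough illustration of the CSI idea on this slide, the sketch below shows that each Ruby string carries its own encoding. The sample character (U+3042, written as an escape so the file stays ASCII) and the variable names are assumptions for illustration, not taken from the slides.

```ruby
# HIRAGANA LETTER A, written as a \u escape so the source stays ASCII.
utf8  = "\u3042"
sjis  = utf8.encode("Shift_JIS")
eucjp = utf8.encode("EUC-JP")

# Same character, three different byte sequences, each tagged with its encoding.
[utf8, sjis, eucjp].each do |s|
  hex = s.bytes.map { |b| format("%02X", b) }.join(" ")
  puts "#{s.encoding.name}: #{hex}"
end
```

The three strings represent the same character but are not byte-identical; Ruby keeps them apart by the encoding attached to each one.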

3 Encoding Grades
ASCII Compatible
ASCII Incompatible
Dummy

ASCII Compatible
full support
script encoding
faster
UTF-8, Shift_JIS, EUC-JP, ...

Major Encodings
US-ASCII
ASCII-8BIT
UTF-8

Japanese Encodings
Shift_JIS
EUC-JP

Other Encodings
Big5, EUC-KR, EUC-TW, GBK, ISO-8859-X, KOI8-R, KOI8-U, etc.

Machine-dependent Encodings
Windows-31J
CP51932
eucJP-ms
Windows-125X

ASCII-8BIT
ASCII Compatible 8BIT String
BINARY?

ASCII Only
7BIT String is special
"abcde".ascii_only? -> true
"abcde" + " "
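
A minimal sketch of why a 7-bit string is "special": it can be concatenated with a string in another ASCII-compatible encoding, while two non-ASCII strings in different encodings cannot. The sample strings are assumptions; current Ruby raises Encoding::CompatibilityError here, whereas the slides (from the 1.9 development period) mention ArgumentError.

```ruby
ascii = "abcde".encode("US-ASCII")
eucjp = "\u3042".encode("EUC-JP")   # non-ASCII, EUC-JP
utf8  = "\u3042"                    # non-ASCII, UTF-8

puts ascii.ascii_only?

# 7-bit + EUC-JP works; the result takes the encoding of the non-ASCII side.
joined = ascii + eucjp
puts joined.encoding

# Non-ASCII UTF-8 + non-ASCII EUC-JP is rejected.
begin
  utf8 + eucjp
  mixed_error = nil
rescue EncodingError => e
  mixed_error = e.class
end
puts mixed_error.inspect
```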

ASCII Incompatible
limited support
Can't use as script encoding
UTF-{16,32}{BE,LE}

UTF-16 & UTF-32
UTF-16BE, UTF-16LE
UTF-32BE, UTF-32LE
UTF-16 UTF-32

Dummy encoding
Ruby
for stateful encodings
Encoding#dummy? -> true
ISO-2022-JP, UTF-7
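
The three grades can be checked from Ruby itself. The following is only a sketch: the helper name encoding_grade and the list of sample encodings are assumptions made for illustration, while Encoding#dummy? and Encoding#ascii_compatible? are the standard predicates.

```ruby
# Hypothetical helper: classify an encoding into one of the three grades.
def encoding_grade(enc)
  return "dummy" if enc.dummy?
  enc.ascii_compatible? ? "ASCII compatible" : "ASCII incompatible"
end

sample_names = %w[UTF-8 Shift_JIS EUC-JP UTF-16LE UTF-32BE ISO-2022-JP UTF-7]
grades = sample_names.map { |name| [name, encoding_grade(Encoding.find(name))] }.to_h

grades.each { |name, grade| puts "#{name}: #{grade}" }
```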

Encoding.list
Encoding.list -> [encoding, ..]
Encoding.name_list -> [enc_name, ..]
Encoding.aliases -> {alias => enc_name, ..}
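
A short sketch of the three class methods listed above. The exact set of encodings and aliases depends on the Ruby build, so only a few well-known entries are looked up; the variable names are assumptions.

```ruby
encodings  = Encoding.list        # array of Encoding objects
enc_names  = Encoding.name_list   # array of names, including aliases
enc_alias  = Encoding.aliases     # hash of alias name => canonical name

puts encodings.size
puts enc_names.include?("UTF-8")
puts enc_alias["ASCII"]           # canonical name behind the "ASCII" alias
```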

$KCODE is obsolete
$KCODE
Ruby 1.9
$KCODE

String
1.8: Byte String
–Ruby ignores encoding
1.9: Byte String with encoding
–Ruby knows the encoding of the string

No Character Object
but 1 Character String
?.class -> String
Why?

A character has...
codepoint
encoding
byte string
1 char string has them!
cf.

1.8:
?a -> 97 (Fixnum)
?\x61 -> 97 (Fixnum)
1.9:
?a -> "a" (US-ASCII)
?\x61 -> "a" (US-ASCII)
?あ -> "あ" (UTF-8)
?\u{3042} -> "あ" (UTF-8)

String#ord and Integer#chr
" ".ord # Unicode
chr
RangeError: out of char range
chr("UTF-8")
" "
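
A sketch of the round trip between a character and its codepoint. The sample character U+3042 is an assumption. Integer#chr without an argument is expected to raise RangeError for this value when Encoding.default_internal is unset, which is the usual case; the code tolerates either outcome.

```ruby
codepoint = "\u3042".ord
puts codepoint.to_s(16)

# Without an encoding argument, a codepoint above 0xFF normally cannot be
# turned into a character (unless Encoding.default_internal is set).
begin
  codepoint.chr
rescue RangeError => e
  puts "RangeError: #{e.message}"
end

# With an explicit encoding the conversion is well defined.
char = codepoint.chr("UTF-8")
puts char == "\u3042"
```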

?a.encoding
"a".encoding
"\xFF".encoding
"\u{3042}".encoding
"\u{ }".encoding

String#[]
1.8:
String#[] -> integer (1 byte)
[0] -> 0xE3 # UTF-8
1.9:
String#[] -> 1 string
[0] -> " "

String#length
1.8:
String#length -> byte length
.length -> 9 (UTF-8)
1.9:
String#length -> character length
.length -> 3
String#bytesize -> byte length
.bytesize -> 9 (UTF-8)
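
The 1.9 semantics of indexing and length can be sketched as follows. The sample string (three hiragana, nine bytes in UTF-8) is an assumption chosen to match the numbers on the slides, since the original string was lost from the transcript.

```ruby
s = "\u3042\u3044\u3046"   # three characters, three bytes each in UTF-8

puts s[0] == "\u3042"                  # indexing returns a one-character string
puts s.length                          # counts characters
puts s.bytesize                        # counts bytes
puts format("0x%02X", s.getbyte(0))    # the 1.8-style first byte
```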

String is not Enumerable
String#each is removed.
"hoge".each{|l|p l}
NoMethodError

String#each_*
String#each_byte (bytes)
String#each_char (chars)
String#each_line (lines)
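
Since String#each is gone, the unit of iteration has to be chosen explicitly. A small sketch with an assumed sample string:

```ruby
s = "\u3042b\nc\n"

byte_list = s.each_byte.to_a   # integers, one per byte
char_list = s.each_char.to_a   # one-character strings
line_list = s.each_line.to_a   # lines, including the trailing newline

puts s.respond_to?(:each)      # String no longer has #each
puts byte_list.size
puts char_list.size
puts line_list.size
```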

==
(ArgumentError)
7bit

/(.)/ =~ " "
$1 -> " "

/\xE3\x81\x82/n =~ " "
ArgumentError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
ASCII-8BIT

bytes = "A ".force_encoding("ASCII-8BIT")
/\xE3\x81\x82/ =~ bytes -> 1
/a/ =~ bytes -> 0
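
The two regexp slides can be reproduced roughly as below. This is a sketch under assumptions: the sample text is "A" followed by U+3042, the byte pattern is written with the /n flag so that it stays a binary regexp under a UTF-8 script encoding, and an uppercase /A/ is used so that it matches the sample. Current Ruby reports the mismatch as Encoding::CompatibilityError rather than the ArgumentError shown on the slide.

```ruby
pattern = /\xE3\x81\x82/n      # binary (ASCII-8BIT) regexp for three bytes
text    = "A\u3042"            # UTF-8 string

# A binary regexp cannot be matched against a non-ASCII UTF-8 string.
begin
  pattern =~ text
  match_error = nil
rescue EncodingError => e
  match_error = e.class
end
puts match_error.inspect

# After relabelling the string as bytes, the same pattern matches at byte offset 1.
bytes     = text.dup.force_encoding("ASCII-8BIT")
byte_pos  = pattern =~ bytes
ascii_pos = /A/ =~ bytes       # plain ASCII regexps work on byte strings too
puts byte_pos
puts ascii_pos
```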

Script Encoding

Magic Comment
#!/bin/env ruby
# -*- coding: UTF-8 -*-
/coding[:=]\s*(?<encname>[\w.-]+)[^\w.-]/
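
The regular expression on this slide can be applied to a script header to see what it extracts. This is only an illustration of the pattern as quoted in the slide, not a claim about how the Ruby parser is implemented; the variable names and the sample header are assumptions.

```ruby
magic_comment = /coding[:=]\s*(?<encname>[\w.-]+)[^\w.-]/
header = "#!/bin/env ruby\n# -*- coding: UTF-8 -*-\n"

found    = magic_comment.match(header)
declared = found && Encoding.find(found[:encname])

puts found[:encname]
puts declared
puts __ENCODING__   # the script encoding of the file running this code
```

Note that the pattern requires one more character after the encoding name, so a comment that ends the file without a trailing newline would not match it.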

-K option
-K
Encoding.default_external
script encoding
-E (external encoding)

script encoding
1. magic comment
2. -K
3. US-ASCII
-E
-e stdin
1. magic comment
2. -K or -E
3. Locale
locale

String#inspect vs String#dump
String#inspect
String#dump:
dump
Escape dump
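
Most of this slide was lost in the transcript, so the following is only a sketch of the general difference as documented for current Ruby: String#dump escapes every non-ASCII character and can be reversed with String#undump (Ruby 2.5 and later), while the output of String#inspect depends on the default external and internal encodings. The sample string is an assumption.

```ruby
s = "\u3042\n"

dumped = s.dump
puts dumped                 # escaped, ASCII-only representation
puts dumped.ascii_only?
puts s.inspect              # may show the character itself or an escape

restored = dumped.undump    # reverses dump
puts restored == s
```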

IO open
open(path, "r:utf-8") {|f| puts f.gets }
open(path, "r:utf-8:euc-jp") {.. }
open(path, "mode:external:internal")

IO with encoding option
open(path, encoding: "utf-8")
open(path, encoding: "utf-8:euc-jp")
open(path, encoding: "external:internal")

IO with encoding option
open(path, external_encoding: "utf-8")
open(path, external_encoding: "utf-8", internal_encoding: "euc-jp")
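
The mode string and the keyword options can be tried against a temporary file. This sketch reverses the direction used on the slides (the file is EUC-JP and is read into UTF-8) purely for convenience; the file name, contents, and variable names are assumptions.

```ruby
require "tmpdir"

as_external  = nil
as_internal  = nil
via_keywords = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.txt")
  # Write raw EUC-JP bytes so the file's encoding is known.
  File.binwrite(path, "\u3042\n".encode("EUC-JP"))

  # "mode:external": the string is tagged with the external encoding.
  as_external = File.open(path, "r:EUC-JP") { |f| f.gets }

  # "mode:external:internal": the data is transcoded while reading.
  as_internal = File.open(path, "r:EUC-JP:UTF-8") { |f| f.gets }

  # The same request expressed with keyword options.
  via_keywords = File.open(path, external_encoding: "EUC-JP",
                                 internal_encoding: "UTF-8") { |f| f.gets }
end

puts as_external.encoding
puts as_internal.encoding
puts via_keywords == as_internal
```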

Encoding.default_external
Default encoding for external input
-K or -E > locale

String as Bytes
String#getbyte(index)
String#setbyte(index, value)
String#bytesize
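
A minimal sketch of the byte-level accessors, using an assumed ASCII sample so the effect is easy to see:

```ruby
s = "abc".dup              # dup gives a mutable copy of the literal

first = s.getbyte(0)       # byte value at index 0
s.setbyte(0, 0x41)         # overwrite it with the byte for "A"

puts first
puts s
puts s.bytesize
```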

transcoding
Martin
RubyKaigiM17N.html

Encoding
String#encode
Magic comment
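
String#encode is the transcoding entry point named here. The sketch below, with assumed sample text, shows a conversion that fails because the target encoding lacks a character, the same conversion with a replacement character, and a lossless round trip.

```ruby
utf8 = "caf\u00E9 \u3042"   # U+00E9 exists in ISO-8859-1, U+3042 does not

# Strict transcoding raises when a character has no mapping.
begin
  utf8.encode("ISO-8859-1")
  encode_error = nil
rescue Encoding::UndefinedConversionError => e
  encode_error = e.class
end
puts encode_error.inspect

# The same conversion with unmappable characters replaced.
replaced = utf8.encode("ISO-8859-1", undef: :replace, replace: "?")
puts replaced.bytes.inspect

# UTF-8 -> EUC-JP -> UTF-8 is lossless for this character.
round_trip = "\u3042".encode("EUC-JP").encode("UTF-8")
puts round_trip == "\u3042"
```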

Dir.open
encoding
–Dir.glob, fnmatch
String#encode (transcode)
Unicode Win32API?

RubyM17N
!!!

any questions?

* UCS
* CSI

UCS
* UCS
* magic comment
* UCS

CSI
* encoding
* magic comment

FAQ
Any questions?

* US-ASCII

[0x3042,0x3044].pack("U")
* pack("U*") encoding
* pack("U*") UTF-8
* pack("UC")
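
A short sketch of the difference between "U" and "U*" as pack templates, and of the encoding of the result. The variable names are assumptions; the codepoints are the ones on the slide.

```ruby
codepoints = [0x3042, 0x3044]

first_only = codepoints.pack("U")    # "U" consumes a single codepoint
all_chars  = codepoints.pack("U*")   # "U*" consumes every codepoint

puts first_only.length
puts all_chars.length
puts all_chars.encoding              # a template starting with U yields UTF-8
puts all_chars.unpack("U*") == codepoints
```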

require
US-ASCII