Published by Michael Carlson. Modified over 11 years ago.
1
Ruby M17N. RubyKaigi 08. Martin J. Dürst
2
Summary: intro; Ruby's case; transcoding (by Martin); questions
3
Who is naruse: nkf; Softbank Technology (iPhone); naruse@ruby-lang.org
4
Ruby M17N: CSI
5
M17N methods: UCS Normalization; CSI (Code Set Independent)
6
UCS Normalization: UCS (Universal Character Set)
7
Perl's case (Unicode). Decode: $str = decode("UTF-8", "\xE3\x81\x82"); $str is "あ". Encode: $bytes = encode("UTF-8", "あ"); $bytes is "\xE3\x81\x82".
8
CSI: Code Set Independent; Solaris, Citrus; __STDC_ISO_10646__ (C)
9
Ruby String: str.encoding -> the Encoding of str
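A minimal sketch of what "a String has an encoding" means in practice, assuming a UTF-8 source file (the default in current Ruby releases). The variable names are illustrative only. It contrasts relabeling the bytes with `force_encoding` against converting them with `encode`.

```ruby
# Each String carries an encoding tag alongside its bytes.
utf8 = "\u3042"                  # HIRAGANA LETTER A
puts utf8.encoding               # UTF-8
p utf8.bytes                     # [227, 129, 130]

# force_encoding changes only the tag; the bytes stay the same.
relabeled = utf8.dup.force_encoding("ASCII-8BIT")
p relabeled.bytes == utf8.bytes  # true

# encode converts the bytes into the target encoding.
sjis = utf8.encode("Shift_JIS")
puts sjis.encoding               # Shift_JIS
p sjis.bytes                     # [130, 160]
```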
10
3 Encoding Grades: ASCII Compatible; ASCII Incompatible; Dummy
11
ASCII Compatible: full support; usable as script encoding; faster; UTF-8, Shift_JIS, EUC-JP, ...
12
Major Encodings: US-ASCII; ASCII-8BIT; UTF-8
13
Japanese Encodings: Shift_JIS; EUC-JP
14
Other Encodings: Big5, EUC-KR, EUC-TW, GBK, ISO-8859-X, KOI8-R, KOI8-U, etc.
15
Machine-dependent Encodings: Windows-31J; CP51932; eucJP-ms; Windows-125X
16
ASCII-8BIT: ASCII Compatible 8BIT String; BINARY?
17
ASCII Only: 7BIT String is special; "abcde".ascii_only? -> true; "abcde" + "あ"
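Why 7-bit strings are special can be sketched as follows: an ASCII-only string can be concatenated with a string in any ASCII-compatible encoding, while two non-ASCII strings in different encodings cannot be combined. The variable names are assumptions for illustration.

```ruby
ascii = "abcde".encode("US-ASCII")
utf8  = "\u3042"
sjis  = utf8.encode("Shift_JIS")

p ascii.ascii_only?              # true

# The result takes the encoding of the non-ASCII operand.
puts((ascii + utf8).encoding)    # UTF-8
puts((ascii + sjis).encoding)    # Shift_JIS

# Two non-ASCII strings with different encodings are incompatible.
p Encoding.compatible?(utf8, sjis)   # nil
begin
  utf8 + sjis
rescue Encoding::CompatibilityError => e
  puts "CompatibilityError: #{e.message}"
end
```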
18
ASCII Incompatible: limited support; can't be used as script encoding; UTF-{16,32}{BE,LE}
19
UTF-16 & UTF-32: UTF-16BE, UTF-16LE; UTF-32BE, UTF-32LE; UTF-16, UTF-32
20
Dummy encoding in Ruby: for stateful encodings; Encoding#dummy? -> true; ISO-2022-JP, UTF-7
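The three grades can be checked at runtime with `Encoding#ascii_compatible?` and `Encoding#dummy?`. The `grade` helper below is a hypothetical name introduced for this sketch, not part of Ruby.

```ruby
# Hypothetical helper: classify an Encoding into one of the three grades.
def grade(enc)
  return :dummy if enc.dummy?
  enc.ascii_compatible? ? :ascii_compatible : :ascii_incompatible
end

[Encoding::UTF_8, Encoding::Shift_JIS, Encoding::EUC_JP,
 Encoding::UTF_16LE, Encoding::UTF_32BE,
 Encoding::ISO_2022_JP, Encoding::UTF_7].each do |enc|
  puts "#{enc.name}: #{grade(enc)}"
end
```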
21
Encoding.list: Encoding.list -> [encoding, ...]; Encoding.name_list -> [enc_name, ...]; Encoding.aliases -> {alias => enc_name, ...}
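A short sketch of the three introspection methods. The exact number of encodings depends on the Ruby build, so no count is shown.

```ruby
# Encoding.list returns Encoding objects.
puts Encoding.list.size
p Encoding.list.include?(Encoding::UTF_8)         # true

# Encoding.name_list returns names (including aliases) as Strings.
p Encoding.name_list.include?("UTF-8")            # true

# Encoding.aliases maps an alias to the canonical name.
p Encoding.aliases["ASCII"]                       # "US-ASCII"
p Encoding.find("ASCII") == Encoding::US_ASCII    # true
```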
22
$KCODE is obsolete: $KCODE in Ruby 1.9
23
String. 1.8: Byte String (Ruby ignores the encoding). 1.9: Byte String with encoding (Ruby knows the encoding of the string).
24
No Character Object, but 1 Character String: ?あ.class -> String. Why?
25
A character has...: codepoint; encoding; byte string. 1 char string has them!
26
1.8: ?a -> 97 (Fixnum); ?\x61 -> 97 (Fixnum). 1.9: ?a -> "a" (US-ASCII); ?\x61 -> "a" (US-ASCII); ?あ -> "あ" (UTF-8); ?\u{3042} -> "あ" (UTF-8)
27
String#ord and Integer#chr: "あ".ord -> 12354 # Unicode; 12354.chr -> RangeError: 12354 out of char range; 12354.chr("UTF-8") -> "あ"
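The round trip between a one-character string and its codepoint can be sketched like this. `chr_or_error` is a hypothetical helper added for the example; the RangeError case assumes `Encoding.default_internal` is not set, which is the usual default.

```ruby
# Hypothetical helper: call Integer#chr and report a RangeError as text.
def chr_or_error(codepoint, encoding = nil)
  encoding ? codepoint.chr(encoding) : codepoint.chr
rescue RangeError => e
  "RangeError: #{e.message}"
end

p "\u3042".ord                            # 12354
puts chr_or_error(12354)                  # RangeError when no default_internal is set
p chr_or_error(12354, "UTF-8") == "\u3042"  # true

# Character literals are one-character Strings since Ruby 1.9.
p(?a)                                     # "a"
p(?a.class)                               # String
```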
28
?a.encoding; "a".encoding; "\xFF".encoding; "\u{3042}".encoding; "\u{3042 3044 3046}".encoding
29
String#[]. 1.8: String#[] -> integer (1 byte); "あいう"[0] -> 0xE3 # UTF-8. 1.9: String#[] -> 1-character string; "あいう"[0] -> "あ"
30
String#length. 1.8: String#length -> byte length; "あいう".length -> 9 (UTF-8). 1.9: String#length -> character length; "あいう".length -> 3; String#bytesize -> byte length; "あいう".bytesize -> 9 (UTF-8)
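The 1.9 behaviour of indexing and length can be sketched with a three-character hiragana string, written here with `\u` escapes so the example does not depend on the file's encoding. The sample string is an assumption chosen to match the 3-character, 9-byte figures on the slide.

```ruby
str = "\u{3042 3044 3046}"      # three hiragana characters

p str.length                    # 3  (characters)
p str.bytesize                  # 9  (bytes in UTF-8)

# Indexing returns a one-character String, not an Integer.
p str[0] == "\u3042"            # true
p str[0].class                  # String

# The first byte is still reachable when it is needed.
p str.bytes.first == 0xE3       # true
```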
31
String is not Enumerable: String#each is removed. "hoge".each{|l|p l} -> NoMethodError
32
String#each_*: String#each_byte (bytes); String#each_char (chars); String#each_line (lines)
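Since String#each is gone, the caller states which unit to iterate over. A small sketch with an illustrative two-line string:

```ruby
text = "a\u3042\nb\n"

p text.respond_to?(:each)          # false: String is not Enumerable

text.each_byte { |b| print b, " " }   # one Integer per byte
puts
text.each_char { |c| print c.inspect, " " }   # one String per character
puts
text.each_line { |l| print l.inspect, " " }   # one String per line
puts

# The array-returning counterparts.
p text.bytes.size                  # 7
p text.chars.size                  # 5
p text.lines.size                  # 2
```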
33
==: (ArgumentError); 7bit
34
/(.)/ =~ "あ"; $1 -> "あ"
35
/\xE3\x81\x82/n =~ "あ" -> ArgumentError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string); the /n regexp is ASCII-8BIT
36
bytes = "Aあ".force_encoding("ASCII-8BIT"); /\xE3\x81\x82/ =~ bytes -> 1; /A/ =~ bytes -> 0
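A sketch of byte-level matching under current Ruby, where source files default to UTF-8. Two assumptions differ from the slide: the regexp is given the `n` flag explicitly so that it is ASCII-8BIT, and the mismatch is rescued as `Encoding::CompatibilityError`, which is what current Ruby raises. `match_index` is a hypothetical helper.

```ruby
# Hypothetical helper: return the match index, or :incompatible when the
# regexp and the string have incompatible encodings.
def match_index(regexp, string)
  regexp =~ string
rescue Encoding::CompatibilityError
  :incompatible
end

utf8      = "A\u3042"
bytes     = utf8.b                    # copy tagged as ASCII-8BIT
binary_re = /\xE3\x81\x82/n           # ASCII-8BIT regexp

p match_index(binary_re, bytes)       # 1
p match_index(/A/, bytes)             # 0
p match_index(binary_re, utf8)        # :incompatible
```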
37
Script Encoding
38
Magic Comment: first line "#!/bin/env ruby", second line "# -*- coding: UTF-8 -*-"; matched by /coding[:=]\s*(?<encname>[\w.-]+)[^\w.-]/
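To see what the pattern on this slide accepts, it can be applied to a few comment lines. This only exercises the regexp as printed; how the Ruby parser itself recognises magic comments may differ. `MAGIC_COMMENT` and `magic_encoding_name` are names made up for the sketch.

```ruby
# The pattern from the slide, stored under a hypothetical constant name.
MAGIC_COMMENT = /coding[:=]\s*(?<encname>[\w.-]+)[^\w.-]/

# Hypothetical helper: return the encoding name found in a line, or nil.
def magic_encoding_name(line)
  match = MAGIC_COMMENT.match(line)
  match && match[:encname]
end

p magic_encoding_name("# -*- coding: UTF-8 -*-\n")           # "UTF-8"
p magic_encoding_name("# vim: set fileencoding=euc-jp :\n")  # "euc-jp"
p magic_encoding_name("puts 'hello'\n")                      # nil
```

The trailing `[^\w.-]` requires one more character after the name, so a line such as `# coding: utf-8` with no newline or other terminator does not match this particular pattern.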
39
-K option: -K sets Encoding.default_external and the script encoding; -E sets the external encoding
40
Script encoding priority: 1. magic comment; 2. -K; 3. US-ASCII. With -e or stdin: 1. magic comment; 2. -K or -E; 3. locale.
41
String#inspect vs String#dump: String#inspect; String#dump (escapes)
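A small sketch of the difference, assuming Ruby 2.5 or later for `String#undump`. The exact text printed by `inspect` depends on the output encoding, so no expected output is shown for it.

```ruby
str = "\u3042\n"

puts str.inspect           # human-readable; may keep non-ASCII characters
puts str.dump              # escaped so that only ASCII characters remain

p str.dump.ascii_only?     # true
p str.dump.undump == str   # true: dump output can be turned back into the string
```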
42
IO open: open(path, "r:utf-8") {|f| puts f.gets }; open(path, "r:utf-8:euc-jp") {.. }; open(path, "mode:external:internal")
43
IO with encoding option: open(path, encoding: "utf-8"); open(path, encoding: "utf-8:euc-jp"); open(path, encoding: "external:internal")
44
IO with encoding option: open(path, external_encoding: "utf-8"); open(path, external_encoding: "utf-8", internal_encoding: "euc-jp")
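The three ways of passing encodings to open can be compared on a temporary file. This is a sketch under assumptions: the file is written as EUC-JP and read back as UTF-8 (the reverse direction of the slide's example), and `demo_io_encodings` and the file name are made up for illustration.

```ruby
require "tmpdir"

# Hypothetical demo: write one EUC-JP line, then read it back four ways.
def demo_io_encodings
  Dir.mktmpdir do |dir|
    path = File.join(dir, "sample.txt")
    File.binwrite(path, "\u3042\n".encode("EUC-JP"))

    {
      # External encoding only: the string stays EUC-JP.
      external_only: File.open(path, "r:EUC-JP") { |f| f.gets },
      # "mode:external:internal": transcoded to UTF-8 while reading.
      mode_string: File.open(path, "r:EUC-JP:UTF-8") { |f| f.gets },
      # encoding: "external:internal"
      encoding_option: File.open(path, encoding: "EUC-JP:UTF-8") { |f| f.gets },
      # Separate keywords.
      keywords: File.open(path, external_encoding: "EUC-JP",
                                internal_encoding: "UTF-8") { |f| f.gets }
    }
  end
end

demo_io_encodings.each do |how, line|
  puts "#{how}: #{line.encoding} #{line.bytes.inspect}"
end
```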
45
Encoding.default_external: default encoding for external input; -K or -E > locale
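The current defaults can be inspected at runtime. The values printed depend on the command-line options and the locale, so no expected output is given; running the script with, for example, `ruby -E EUC-JP` changes the external default.

```ruby
puts __ENCODING__                  # script (source) encoding of this file
puts Encoding.default_external     # default for data read from IO
p Encoding.default_internal        # nil unless set, e.g. via -E ext:int
```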
46
String as Bytes: String#getbyte(index); String#setbyte(index, value); String#bytesize
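A brief sketch of the byte-level accessors; the sample strings are illustrative.

```ruby
str = "abc".dup            # dup so the literal itself is not modified

p str.getbyte(0)           # 97
str.setbyte(0, 0x41)       # overwrite the first byte with "A"
p str                      # "Abc"
p str.bytesize             # 3

multi = "\u3042"
p multi.getbyte(0)         # 227 (0xE3, first UTF-8 byte)
p multi.getbyte(5)         # nil: index is past the last byte
```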
47
transcoding: Martin; http://www.sw.it.aoyama.ac.jp/2008/pub/RubyKaigiM17N.html
48
Encoding; String#encode; Magic comment
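Transcoding with String#encode can be sketched as below, including what happens when the target encoding has no mapping for a character. The variable names are illustrative.

```ruby
utf8  = "\u3042"
eucjp = utf8.encode("EUC-JP")

p eucjp.bytes                        # [164, 162]
p eucjp.encode("UTF-8") == utf8      # true: round trip

# A character with no mapping in the target raises by default...
begin
  utf8.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "UndefinedConversionError: #{e.message}"
end

# ...unless a replacement policy is requested.
p utf8.encode("US-ASCII", undef: :replace)   # "?"
```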
49
Dir.open encoding (Dir.glob, fnmatch); String#encode (transcode); Unicode Win32API?
50
Ruby M17N!!!
51
Any questions?
52
* UCS * CSI
53
UCS: * UCS * magic comment * UCS
54
CSI: * encoding * magic comment
55
FAQ: Any questions?
56
* US-ASCII
57
[0x3042,0x3044].pack("U"): * pack("U*") encoding * pack("U*") UTF-8 * pack("UC")
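The difference between "U" and "U*" in this FAQ item can be sketched as follows: "U" consumes a single codepoint, "U*" consumes all of them, and a template that starts with U yields a UTF-8 string.

```ruby
codepoints = [0x3042, 0x3044]

first_only = codepoints.pack("U")     # only the first codepoint is packed
all        = codepoints.pack("U*")    # every codepoint is packed

p first_only == "\u3042"              # true
p all == "\u3042\u3044"               # true
puts all.encoding                     # UTF-8

# unpack("U*") reverses the operation.
p all.unpack("U*")                    # [12354, 12356]
```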
58
require: US-ASCII