Published by Shanon Park. Modified over 9 years ago.
1
Encoding and fonts. Edward Garrett, Software Developer, ELAR
2
Some issues. Data types: diverse scripts; multilingual data; IPA and other transcriptional notations. Modes: representation (in some scheme); storage (using some encoding); display (in browser, word processor, etc.); input (with various OS, keyboards, etc.). Your issues and challenges: data/problems to look at now? Friday AM advice clinic.
3
Representing data. Symbols. Encoding: character sets, Unicode. Fonts. Relationships (e.g. links). Structures (e.g. hierarchies).
4
Representing textual data. Plain text: lacks formatting information; transfer between applications; internal memory; saved in files. Encodings: Unicode. Markup: XML, HTML.
5
Plain text What is it? Try saving a document as plain text … in TextEdit …
8
Definitions. Background on digital data storage: a bit is 0 or 1; a byte is 8 bits, e.g. 00101100. Definitions from Yucca Korpela’s article: Character repertoire: a set of characters (from one or more scripts) constituting the data that can be represented. Character code: a mapping that gives each character in a repertoire a distinct numeric identifier. Character encoding: a method of mapping sequences of character codes into sequences of bytes.
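The three definitions above can be illustrated with a minimal Python sketch. The three-character repertoire and the choice of UTF-8 as the encoding are assumptions made for the example; they are not taken from the slides.

```python
# A (very small) character repertoire, chosen only for illustration.
repertoire = ["a", "é", "ŋ"]

# Character code: each character gets a distinct number.
# Python's ord() returns the Unicode code point.
codes = {ch: ord(ch) for ch in repertoire}

# Character encoding: the numbers are mapped to bytes (here, using UTF-8).
encoded = {ch: ch.encode("utf-8") for ch in repertoire}

for ch in repertoire:
    print(ch, codes[ch], list(encoded[ch]))
```

The same character code can be turned into different byte sequences by different encodings, which is why the code and the encoding are defined separately.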
9
Character encodings: ISO 8859-1. ISO 8859-1 uses 1 byte (8 bits) per character and covers the characters of most Western European languages.
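A short Python sketch of the one-byte limit: text within the ISO 8859-1 repertoire takes one byte per character, and anything outside it cannot be encoded at all. The helper name `fits_latin1` and the sample strings are hypothetical, chosen for the example.

```python
word = "café"
latin1_bytes = word.encode("iso-8859-1")

# One byte per character: 4 characters, 4 bytes.
print(len(word), len(latin1_bytes))


def fits_latin1(text):
    """Return True if every character of text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("café"))  # Western European text fits
print(fits_latin1("ŋ"))     # IPA eng (U+014B) is outside the repertoire
print(fits_latin1("Ж"))     # so is Cyrillic
```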
11
Unicode. International standard (ISO 10646). Industry standard (Unicode Consortium). Aims to code all characters from all of the world’s scripts: over 1 million code points. Privileges character semantics, not glyphic representations. Multiple encoding methods. Referencing a character: U+nnnn (in hexadecimal, base 16). Most characters are in the Basic Multilingual Plane (the first 65,536 character positions).
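The U+nnnn notation and the Basic Multilingual Plane boundary can be checked directly in Python. This is a small sketch; the helper names `u_notation` and `in_bmp` are hypothetical and introduced only for the example.

```python
import sys


def u_notation(ch):
    """Format a character's code point as U+nnnn (hexadecimal, at least 4 digits)."""
    return f"U+{ord(ch):04X}"


def in_bmp(ch):
    """True if the character lies in the Basic Multilingual Plane (U+0000 to U+FFFF)."""
    return ord(ch) <= 0xFFFF


# Total number of code points in the Unicode code space.
TOTAL_CODE_POINTS = sys.maxunicode + 1

gothic_hwair = "\U00010348"  # a character outside the BMP

print(u_notation("A"), in_bmp("A"))
print(u_notation("ŋ"), in_bmp("ŋ"))
print(u_notation(gothic_hwair), in_bmp(gothic_hwair))
print(TOTAL_CODE_POINTS)
```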
12
Unicode encodings. UTF-32: maps each code to 4 bytes; inefficient, as most commonly used characters are in the BMP. UTF-16: maps each code to either one 2-byte unit or two; efficient and widely used; good for the BMP. UTF-8: maps each code to 1-4 bytes; particularly compact for Western European languages; most widely supported across various internet protocols.
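The size differences between the three encodings can be seen by encoding a few characters and counting bytes. In this sketch the helper name `encoded_sizes` and the sample characters are assumptions for illustration; the big-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Return the number of bytes needed for ch in each Unicode encoding."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


samples = {
    "a": "a",                     # ASCII letter
    "é": "é",                     # Western European letter
    "ก": "\u0e01",                # Thai letter, still inside the BMP
    "U+10348": "\U00010348",      # Gothic letter, outside the BMP
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

UTF-32 always needs 4 bytes, UTF-16 needs 2 inside the BMP and 4 outside it, and UTF-8 ranges from 1 byte for ASCII to 4 bytes outside the BMP.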
13
Character semantics vs. glyphs. No difference between e, e, and e (the same letter shown in different typefaces). The IPA letter [c], the voiceless palatal plosive, is the same character as Roman c. No separate characters for cursive scripts or joined-up handwriting.
14
Character semantics vs. glyphs. Examples: U+0041 LATIN CAPITAL LETTER A; U+0410 CYRILLIC CAPITAL LETTER A; U+0391 GREEK CAPITAL LETTER ALPHA. IPA digraphs. “Never use a character just because it looks right.”
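The three look-alike capitals listed above can be told apart programmatically even when a font draws them identically. A minimal sketch using Python's standard `unicodedata` module:

```python
import unicodedata

latin_a = "\u0041"
cyrillic_a = "\u0410"
greek_alpha = "\u0391"

# The glyphs may look the same, but the characters are distinct.
for ch in (latin_a, cyrillic_a, greek_alpha):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# String comparison works on characters, not on appearance.
print(latin_a == cyrillic_a)
```

A search for a word typed with the Latin letter will therefore not match the same-looking word typed with the Cyrillic or Greek one.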
15
Precomposed characters. Complex characters involving a base character and one or more diacritics; treated as equivalent to the corresponding sequence of base character plus combining marks. A relevant case study: "Challenges in Writing Bih" [http://test.elar.soas.ac.uk/node/5]
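The equivalence between a precomposed character and its decomposed sequence can be sketched with Unicode normalization in Python. The letter é is an assumed example, not one taken from the case study.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings are different sequences of code points...
print(precomposed == decomposed, len(precomposed), len(decomposed))

# ...but normalization maps them onto each other.
to_nfc = unicodedata.normalize("NFC", decomposed)    # compose
to_nfd = unicodedata.normalize("NFD", precomposed)   # decompose

print(to_nfc == precomposed, to_nfd == decomposed)
```

Normalizing all text to one form (NFC or NFD) before searching or comparing avoids mismatches between visually identical words.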
16
Compatibility characters. Similar to their decompositions, but not equivalent; they include extra information (formatting, etc.).
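The difference between canonical equivalence and compatibility can be sketched with the normalization forms NFC and NFKC. The ligature U+FB01 and the superscript U+00B2 are assumed examples of compatibility characters, chosen for illustration.

```python
import unicodedata

ligature_fi = "\ufb01"     # LATIN SMALL LIGATURE FI
superscript_2 = "\u00b2"   # SUPERSCRIPT TWO

# Canonical normalization (NFC) leaves compatibility characters alone,
# because they are not equivalent to their decompositions.
nfc_fi = unicodedata.normalize("NFC", ligature_fi)

# Compatibility normalization (NFKC) replaces them, discarding the
# extra formatting information (ligature, superscript).
nfkc_fi = unicodedata.normalize("NFKC", ligature_fi)
nfkc_2 = unicodedata.normalize("NFKC", superscript_2)

print(nfc_fi == ligature_fi, nfkc_fi, nfkc_2)
```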
17
Pre-composed and compatibility characters. Why do they exist, if they run counter to Unicode’s focus on character semantics over glyphic representation? For compatibility with prior encodings. No such new characters will be accepted into Unicode.
18
Things to watch out for. An example to illustrate the difference between text rendering and document encoding.
24
Take-away message. Just because characters aren’t rendered properly doesn’t mean that they aren’t there. Just because characters are rendered properly doesn’t guarantee that they will stay that way. Beware your platform’s default encoding (probably not Unicode).
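Both points of the message can be sketched in Python: text decoded with the wrong encoding looks broken although the bytes are intact, and naming the encoding explicitly avoids relying on a platform default. The sample word and the temporary file name are assumptions for the example.

```python
import os
import tempfile

original = "café"

# Write the file with an explicit encoding instead of the platform default.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(original)

with open(path, "rb") as f:
    on_disk = f.read()  # the raw bytes actually stored

# Reading UTF-8 bytes as ISO 8859-1 renders the text wrongly...
misread = on_disk.decode("iso-8859-1")

# ...but the characters are still there: undo the wrong decoding step.
recovered = misread.encode("iso-8859-1").decode("utf-8")

print(on_disk, misread, recovered)
```

The recovery step works here only because ISO 8859-1 maps every byte to a character; with other mismatched encodings, information may be lost once the text is re-saved.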
25
Adding markup. Not only should the document be in Unicode, it must also declare itself as Unicode.
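A sketch of what the declaration does for XML, using Python's standard `xml.etree.ElementTree` parser. The element name `w` and the sample word are hypothetical. When the declared encoding matches the bytes, the parser recovers the text; when ISO 8859-1 bytes carry no declaration, an XML parser assumes UTF-8 and is expected to reject them.

```python
import xml.etree.ElementTree as ET

text = "café"

# UTF-8 bytes with a matching declaration.
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?><w>' + text + "</w>").encode("utf-8")

# ISO 8859-1 bytes with a matching declaration.
latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?><w>' + text + "</w>").encode("iso-8859-1")

utf8_text = ET.fromstring(utf8_doc).text
latin1_text = ET.fromstring(latin1_doc).text
print(utf8_text, latin1_text)

# ISO 8859-1 bytes with no declaration: the parser has to guess.
undeclared_doc = ("<w>" + text + "</w>").encode("iso-8859-1")
try:
    ET.fromstring(undeclared_doc)
    print("parsed without a declaration")
except ET.ParseError as err:
    print("could not parse:", err)
```

In HTML the equivalent declaration is a `<meta charset="utf-8">` element in the document head.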
29
Exercises. What's wrong with these Unicode words? Character encoding exercises I: http://test.elar.soas.ac.uk/taxonomy/term/1
30
Your questions and issues