Encoding and fonts Edward Garrett Software Developer, ELAR.

Some issues  Data types: diverse scripts multilingual data IPA and other transcriptional notations  Modes: representation (in some scheme) storage (using some encoding) display (in browser, word processor, etc.) input (with various OS, keyboards, etc.)  Your issues and challenges: data/problems to look at now? Friday AM advice clinic

Representing data  Symbols  Encoding: character sets, Unicode  Fonts  Relationships (eg links)  Structures (eg hierarchies)

Representing textual data  Plain text Lacks formatting information Transfer between applications Internal memory Saved in files  Encodings  Unicode  Markup XML HTML

Plain text  What is it?  Try saving a document as plain text … in TextEdit …

Definitions  Background on digital data storage: Bit: 0, 1 Byte: 8 bits, e.g. 00101100  Definitions from Yucca Korpela’s article: Character repertoire: a set of characters (from one or more scripts) constituting the data that can be represented Character code: a mapping that gives each character in a repertoire a distinct numeric identifier Character encoding: a method of mapping sequences of character codes into sequences of bytes
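The three definitions above can be seen side by side in a short Python sketch. The sample character and the variable names are chosen only for illustration; they are not from the original slides.

```python
# One character from the repertoire: LATIN SMALL LETTER E WITH ACUTE.
character = "\u00e9"

# Character code: the numeric identifier assigned to the character.
code_number = ord(character)                    # 233, i.e. 0xE9

# Character encoding: how that code is mapped to a sequence of bytes.
latin1_bytes = character.encode("iso-8859-1")   # one byte:  b'\xe9'
utf8_bytes = character.encode("utf-8")          # two bytes: b'\xc3\xa9'

# A byte is 8 bits; show the single ISO 8859-1 byte as bits.
bits = format(latin1_bytes[0], "08b")           # '11101001'

print(code_number, latin1_bytes, utf8_bytes, bits)
```

The same character code yields different byte sequences depending on the encoding, which is why a file's encoding has to be known before its bytes can be read as text.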

Character encodings: ISO 8859-1  ISO 8859-1: uses 1 byte (8 bits) to encode characters for most of the Western European languages
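A minimal sketch of what the one-byte limit means in practice. The helper name `fits_latin1` and the sample words are hypothetical, chosen for illustration.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be encoded in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

french = "caf\u00e9"                            # café
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # Привет

# Each character of the French word takes exactly one byte.
french_byte_count = len(french.encode("iso-8859-1"))  # 4

print(fits_latin1(french), fits_latin1(russian), french_byte_count)
```

Cyrillic falls outside the ISO 8859-1 repertoire, so the second word cannot be stored in this encoding at all.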

Unicode  International standard (ISO 10646)  Industry standard (Unicode Consortium)  Aims to code all characters from all of the world’s scripts (over 1 million code points)  Privileges character semantics, not glyphic representations  Multiple encoding methods  Referencing a character: U+nnnn (in hexadecimal, base 16)  Most commonly used characters are in the Basic Multilingual Plane (first 65,536 code points)
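The U+nnnn notation and the Basic Multilingual Plane boundary can be illustrated as follows. The helper names `u_plus` and `in_bmp` are hypothetical.

```python
def u_plus(ch: str) -> str:
    """Format a single character's code point in U+nnnn notation (hexadecimal)."""
    return f"U+{ord(ch):04X}"

def in_bmp(ch: str) -> bool:
    """Return True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF

latin_a = "A"
deseret = chr(0x10400)  # DESERET CAPITAL LETTER LONG I, outside the BMP

print(u_plus(latin_a), in_bmp(latin_a))
print(u_plus(deseret), in_bmp(deseret))
```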

Unicode encodings  UTF-32: maps each code to 4 bytes; inefficient, as the most commonly used characters are in the BMP  UTF-16: maps each code to either one 2-byte unit or two; efficient and widely used, good for the BMP  UTF-8: maps each code to 1-4 bytes; particularly compact for Western European languages; most widely supported across various internet protocols
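To compare the three encodings, the sketch below counts the bytes each one needs for a few sample characters. The helper name `encoded_lengths` and the samples are illustrative; the big-endian codec variants are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_lengths(ch: str) -> dict:
    """Number of bytes needed for one character in each Unicode encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

samples = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "CJK IDEOGRAPH U+4E2D": "\u4e2d",
    "DESERET CAPITAL LETTER LONG I": chr(0x10400),  # outside the BMP
}

for name, ch in samples.items():
    print(name, encoded_lengths(ch))
```

UTF-32 is always 4 bytes, UTF-16 needs a second 2-byte unit only outside the BMP, and UTF-8 grows from 1 to 4 bytes as code points get larger.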

Character semantics vs. glyphs  No difference between e, e, and e (the same letter set in different typefaces or styles)  IPA letter [c], voiceless palatal plosive, is the same character as Roman c  No separate characters for cursive scripts or joined-up handwriting

Character semantics vs. glyphs  Examples U+0041 LATIN CAPITAL LETTER A U+0410 CYRILLIC CAPITAL LETTER A U+0391 GREEK CAPITAL LETTER ALPHA  IPA digraphs  “Never use a character just because it looks right.”
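The three look-alike capitals listed above can be told apart by asking for their Unicode names. This sketch uses Python's standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

# Three characters that usually render with near-identical glyphs.
lookalikes = ["\u0041", "\u0410", "\u0391"]

names = [unicodedata.name(ch) for ch in lookalikes]
all_distinct = len(set(lookalikes)) == 3

for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X}", name)
```

A search for Latin A will not find Cyrillic A or Greek Alpha, however similar they look on screen.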

Precomposed characters  Complex characters involving a base character and one or more diacritics, encoded as a single code point; treated as equivalent to the base character followed by combining diacritics  A relevant case study: "Challenges in Writing Bih" [http://test.elar.soas.ac.uk/node/5]
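A minimal sketch of this equivalence, using the normalization forms NFC (compose) and NFD (decompose) from Python's standard `unicodedata` module. The sample characters are chosen for illustration and are not taken from the Bih case study.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings differ code point by code point...
raw_equal = precomposed == decomposed

# ...but normalization maps each onto the other.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)

# A base character with two diacritics: e + circumflex + dot below.
stacked = unicodedata.normalize("NFC", "e\u0302\u0323")

print(raw_equal, composed == precomposed, split == decomposed, len(stacked))
```

Normalizing both sides to the same form before comparing or searching is the usual way to make the equivalence hold in practice.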

Compatibility characters  Similar to their decompositions, but not equivalent; they include extra information (formatting, etc.)
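The difference from precomposed characters shows up under normalization: canonical normalization (NFC) leaves a compatibility character alone, while compatibility normalization (NFKC) replaces it and discards the extra information. A short sketch with two illustrative characters:

```python
import unicodedata

ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
superscript = "\u00b2"   # SUPERSCRIPT TWO

# Canonical normalization keeps compatibility characters as they are.
nfc_ligature = unicodedata.normalize("NFC", ligature)

# Compatibility normalization drops the formatting information.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)        # 'fi'
nfkc_superscript = unicodedata.normalize("NFKC", superscript)  # '2'

print(nfc_ligature == ligature, nfkc_ligature, nfkc_superscript)
```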

Precomposed and compatibility characters  Why do they exist, if they run counter to Unicode’s focus on character semantics over glyphic representation? Compatibility with prior encodings No such new characters will be accepted into Unicode

Things to watch out for  An example to illustrate the difference between: Text rendering Document encoding

Take away message  Just because characters aren’t rendered properly doesn’t mean that they aren’t there.  Just because characters are rendered properly doesn’t guarantee that they will stay that way.  Beware your platform’s default encoding (probably not Unicode).
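The first point can be sketched in a few lines: text saved as UTF-8 but read back as ISO 8859-1 looks wrong, yet the original bytes are intact and the text is recoverable. The sample word is hypothetical, and the recovery works here only because ISO 8859-1 maps every byte value to a character.

```python
original = "caf\u00e9"                  # café

# Stored on disk as UTF-8.
stored = original.encode("utf-8")       # b'caf\xc3\xa9'

# Opened by software that assumes ISO 8859-1: rendered wrongly.
misread = stored.decode("iso-8859-1")   # 'cafÃ©'

# The bytes never changed, so reversing the wrong assumption recovers the text.
recovered = misread.encode("iso-8859-1").decode("utf-8")

print(misread, recovered)
```

If the misread text is edited and saved again under the wrong encoding, the damage can become permanent, which is the second point above.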

Adding markup  Not only should the document be Unicode, but it must declare itself as Unicode.
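A sketch of why the declaration matters, using an XML declaration and Python's standard `xml.etree.ElementTree` parser. The helper name `read_word` and the `<word>` element are hypothetical. The document's bytes are UTF-8 in both calls; only the declared encoding differs.

```python
import xml.etree.ElementTree as ET

def read_word(text: str, declared: str) -> str:
    """Save text as UTF-8 bytes inside an XML document that declares the
    given encoding, then parse it and return what the parser sees."""
    doc = f'<?xml version="1.0" encoding="{declared}"?><word>{text}</word>'
    return ET.fromstring(doc.encode("utf-8")).text

word = "\u00e9"  # é

correct = read_word(word, "UTF-8")        # declaration matches the bytes
wrong = read_word(word, "ISO-8859-1")     # declaration contradicts the bytes

print(correct, wrong)
```

The parser trusts the declaration, so a UTF-8 document that declares another encoding is read incorrectly even though its bytes are fine.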

Exercises  What's wrong with these Unicode words?  Character encoding exercises I  http://test.elar.soas.ac.uk/taxonomy/term/1

Your questions and issues
