Strategies for Developing Non-English Websites Elizabeth J. Pyatt Instructional Designer Education Technology Services.

Supporting Multiple Languages
- Unpopular Language Support (Easy):
  - All English alphabet, all the time.
  - “Escribes vous Russki (Russian)? No”
- Preferred Language Support (Harder):
  - Display native scripts and punctuation
  - Display appropriate punctuation/symbols
  - «¿Escribes vous Русский? ¡Sí!»

Script versus Language
- Arabic script used for Arabic, Ottoman Turkish, Persian (Farsi), etc.
- Cyrillic script used for Russian, Ukrainian, Uzbek, Bulgarian, etc.
- Serbo-Croatian (1 language)
  - Cyrillic text = “Serbian”
  - Roman (English alphabet) text = “Croatian”
- Hindi-Urdu (also 1 language)
  - Hindi = Devanagari script / Urdu = Arabic script

Language of Scripts
- i18n = internationalization
- Roman/Latin alphabet = English alphabet
- Cyrillic = Russian
- RTL = Right to Left (e.g. Arabic/Hebrew)
- CJK = Chinese-Japanese-Korean
  - Chinese has largest character count
- South Asian = Scripts of India (many)

Taxonomy of scripts (C = Consonant; V = Vowel)
- Alphabet - 1 letter = 1 vowel or consonant
  - Roman, Cyrillic, Greek, Runes, Georgian, Armenian, etc.
  - Typing - map single letters to character
- Syllabary - 1 character = 1 CV syllable
  - Japanese, Cherokee, Ethiopic, Sumerian
  - Typing - map CV sequence into character (e.g. Japanese Katakana na-wa = ナワ)

Taxonomy of scripts (C = Consonant; V = Vowel)
- Ideographic (Chinese) - 1 character / 1 meaning
  - Symbols combined to make compounds
  - Typing - map CV sequence to list of possible characters
  - Ideographic scripts can have syllabary component
- Consonantal Syllabary - letters are consonants; vowels are diacritics on C’s
  - Korean, Thai, languages of India, Cree, etc.
  - Typing uses CV sequences. Fonts must alter characters depending on surrounding sounds
  - E.g. Susi = suis

Scripts & Encoding
- ASCII - assign a number to a character
  - Excel formula =CHAR(65) results in “A”
- Modern encoding expands the repertoire beyond ASCII, but with inconsistent implementations for different platforms/scripts
- Know the encoding for your script/language. Needed for debugging.
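
The Excel formula has direct counterparts in most programming languages. The lines below are a minimal sketch in Python (chosen for this illustration; the presentation itself uses only Excel and HTML) of the number-to-character mapping, and of how the bytes stop being unambiguous once a character falls outside ASCII.

```python
# ASCII assigns a number to each character; chr() and ord() expose that mapping.
assert chr(65) == "A"   # same idea as Excel's =CHAR(65)
assert ord("A") == 65

# Beyond ASCII, the stored bytes depend on which encoding is chosen.
latin1_bytes = "ñ".encode("latin-1")  # one byte in Latin 1
utf8_bytes = "ñ".encode("utf-8")      # two bytes in UTF-8

print(latin1_bytes.hex(), utf8_bytes.hex())
```

Because the same letter is stored differently, a reader (or browser) has to be told which encoding was used, which is why knowing the encoding matters for debugging.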

Some Notable Encodings
- Latin 1 (ISO-8859-1)
  - English, most W. Europe, Africa, Pacific Is., Nat. American
- Latin 2 (ISO-8859-2) (Latin 3/Latin 4…)
  - Central Europe (Hungarian, Polish, Czech)
- Big5 (Chinese only), Shift-JIS (Japanese only), etc.
- “ISO” vs. “Windows” parallel encodings (e.g. Hebrew)
  - ISO-8859-8 (Visual Hebrew)
  - Windows-1255 (Windows Hebrew) (also MacHebrew)
  - Parallel ISO/Windows for many scripts (Arabic, Cyrillic, etc.)
- Unicode (super encoding, all scripts)
  - “Exotic Latin alphabet” - Welsh, Hawaiian, Old Irish, etc.
  - Also Chinese, Japanese, Cyrillic, Arabic, Hebrew, Greek…
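
To make the "parallel encodings" point concrete, here is a small Python sketch using Cyrillic rather than Hebrew. The codec names (`iso8859_5`, `cp1251`) are Python's names for the ISO and Windows Cyrillic encodings; the choice of letter is only an illustration.

```python
# The same Cyrillic letter under three encodings that all support Russian.
letter = "Д"  # CYRILLIC CAPITAL LETTER DE, U+0414

encoded = {
    "iso8859_5": letter.encode("iso8859_5"),  # "ISO" Cyrillic
    "cp1251": letter.encode("cp1251"),        # "Windows" Cyrillic
    "utf-8": letter.encode("utf-8"),          # Unicode
}

for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")

# Reading Windows bytes as if they were ISO bytes yields a different letter.
misread = encoded["cp1251"].decode("iso8859_5")
print(misread)
```

Each encoding stores the letter as different bytes, so text written under one member of a parallel pair is misread under the other.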

Now What Do I Do?
- Step 1 - Select target languages (don’t forget English)
- Step 2 - Determine which encoding supports each language.
- Step 3 - Develop properly encoded page. Aim for Unicode (even for English).
- Step 4 - Declare encoding & language in HTML meta tags
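
Step 2 can be checked mechanically. The following is a rough Python sketch, not part of the original presentation: `supporting_encodings` and the `CANDIDATES` list are hypothetical names introduced here, and the candidate list is deliberately short.

```python
def supporting_encodings(text, candidates):
    """Return the candidate encodings that can represent every character in text."""
    supported = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this encoding
        supported.append(name)
    return supported


# Illustrative candidates only: ASCII, Latin 1, Latin 2, two Cyrillic sets, Unicode.
CANDIDATES = ["ascii", "latin-1", "iso8859_2", "iso8859_5", "cp1251", "utf-8"]

print(supporting_encodings("José Espiño", CANDIDATES))
print(supporting_encodings("Русский", CANDIDATES))
```

UTF-8 appears in both results, which is one practical reason to aim for Unicode in Step 3.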

How do I get properly encoded text?
- Latin 1 (English, Spanish, French, German)
  - Use entity codes (e.g. &ntilde; for ñ)
  - Declare encoding
- Major World Language
  - Set up keyboards
  - Type in text editor/HTML editor
  - Declare encoding & language
- Undersupported Language
  - Get correct fonts/keyboards or “PDF it”.

Character Codes (Latin 1 Langs)
- Applies to “Western European” languages only
- Always use for backwards compatibility
Some examples:
- Accent codes - e.g. &ntilde; = ñ
- Punctuation - e.g. &copy; = ©
- Old Math - e.g. &deg; = °
- New Math (recent browsers only)
  - &Sigma; = Σ, &sigma; = σ
  - &int; = ∫, &ne; = ≠
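
Entity codes can also be generated rather than typed by hand. The sketch below uses Python's standard `html.entities` table; `to_entities` is a hypothetical helper written for this example, and it leaves `&`, `<` and `>` untouched, so it is not a full HTML escaper.

```python
from html import unescape
from html.entities import codepoint2name


def to_entities(text):
    """Replace each non-ASCII character with a named HTML entity when one
    exists, otherwise with a decimal numeric reference."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                          # plain ASCII passes through
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # e.g. &ntilde;
        else:
            out.append(f"&#{code};")                 # fallback: numeric code
    return "".join(out)


encoded = to_entities("José Espiño © 2001, 90°")
print(encoded)

# Round trip: a browser-style decode restores the original text.
assert unescape(encoded) == "José Espiño © 2001, 90°"
```

The result is pure ASCII, so it survives even when a page's encoding is declared wrongly or not at all.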

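Encoding & Language Tags
- Set encoding in header (meta tag charset)
  - Latin 1: charset=iso-8859-1
  - Unicode: charset=utf-8
  - Shift_JIS (Japanese): charset=shift_jis
- Declare page language (ISO-639 code) with the lang attribute
  - English-U.S. document: lang="en-US" on the HTML tag
  - Spanish/French/German/Japanese document: es = Spanish, fr = French, de = German, zh = Chinese, ja = Japanese, etc.
  - Spanish paragraph: lang="es" on the P tag (or any HTML text tag)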
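
The transcript does not preserve the slide's actual markup, so the following is a reconstruction under assumptions: it uses the common HTML 4 style `http-equiv` meta tag and the `lang` attribute, and `html_skeleton` is a hypothetical helper. Python is used only to assemble and print the page.

```python
def html_skeleton(charset, lang, body):
    """Build a minimal page that declares its encoding and its language."""
    return (
        f'<html lang="{lang}">\n'
        "<head>\n"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">\n'
        "</head>\n"
        f"<body>{body}</body>\n"
        "</html>"
    )


# An English (U.S.) Unicode page containing one Spanish paragraph.
page = html_skeleton("utf-8", "en-US", '<p lang="es">¿Habla español?</p>')
print(page)
```

The page-level `lang` covers the whole document, while the `lang` on the paragraph overrides it for that span only.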

Challenge Set 1:
- How do you insert the name José Espiño into HTML?
- How do you declare the language Spanish? (multiple options)
- What encoding is needed? (Assume English page with Spanish word.)

Stray Unicode Characters
- You can hard-code a four-digit Unicode numeric code to force a character to appear. E.g. Cyrillic “D” (Д) = &#1044; or &#x0414; (hex)
- Best used for small spans of text or “exotic” Latin characters
- If you use the hex version, add the “x” prefix and a leading zero (to make 4 digits total)
- Set encoding to “utf-8” with meta tag
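
The numeric codes can be looked up programmatically. A minimal Python sketch, where `numeric_refs` is a hypothetical helper name:

```python
from html import unescape


def numeric_refs(ch):
    """Return the decimal and 4-digit hexadecimal HTML references for one character."""
    code = ord(ch)
    return f"&#{code};", f"&#x{code:04X};"


dec_ref, hex_ref = numeric_refs("Д")
print(dec_ref, hex_ref)  # &#1044; &#x0414;

# Both forms decode to the same character.
assert unescape(dec_ref) == unescape(hex_ref) == "Д"
```

The `04X` format pads the hex value with a leading zero to four digits, matching the advice above.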

Challenge 2:
- How do you insert ¿Escribes vous Русский? ¡Sí! into HTML? (Note: 1st letter capital in Cyrillic)
- How do you declare the page to be Unicode?

Setting Up Keyboards for Other Scripts
- Activate required keyboards from Control Panel (Windows) or System Preferences (OS X)
- You may need to install language utilities for East Asian and other unusual scripts from the System Disk
Quick Demo

Typing with Encoded Fonts
- Keyboarding utilities which match the “keys” to the right encoded number must be installed.
- Keyboards can arrange one encoding in several layouts
  - QWERTY (AKA “transliterated/phonetic”)
    - Preferred by U.S. students
  - Native layout (native script typewriters)
    - Preferred by native speakers (e.g. instructors)

Dreamweaver/Front Page: Options for Inputting Text
A. Switch keyboard (editor may add meta tag)
B. Type
C. Or cut and paste encoded text
D. Or import from international text editors via Save As HTML
  - Global Writer (Windows)
  - Simple Text (free from Apple)
  - Others for specific scripts
  - Avoid import from Word
Mini Demo 2

Challenge 3 (Research):
- What encodings can I use for Russian?
- How about Modern Greek vs. Ancient Greek?

Undersupported Scripts: Ultimate Challenge
- “Undersupported” = minority languages, ancient/medieval, small populations
- Third-party utilities may be needed
  - Unicode font (TrueType .ttf format)
  - Keyboard utility (if you can get it)
  - Print font for PDFs (the last resort)
- Test, Test, Test (esp. Mac vs. Win)

Print Font vs. Web Font
Print Font:
1. Replaces ASCII characters with random characters
2. Both parties must have same font to read document correctly
3. Ideal for print/PDF documents when no data transmission occurs
4. E.g. Symbol, Webdings
Web Font:
1. Complies with some encoding (e.g. ASCII)
2. Alternative fonts with same encoding can be used (e.g. Times or Arial)
3. Ideal for Web transmission, still difficult for typing purposes
4. E.g. Arial Unicode, Lucida Sans Unicode, Lucida Grande, TITUS Cyberbit (free), etc.

When Websites Show Gibberish
- Problem: No encoding specified (see gibberish)
  - Go to View menu and manually switch encoding
- Problem: No HTML entity codes for accents (see gibberish for accented letters)
  - Try switching View to Latin 1, Windows-1252, MacRoman, UTF-8 (Unicode)
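
The gibberish is reproducible. This Python sketch assumes a page whose text was saved as UTF-8 but is read as Latin 1; the sample name and the list of candidate encodings are illustrative only.

```python
# A page saved as UTF-8 but read as Latin 1 shows two wrong characters per accent.
original = "José Espiño"
raw = original.encode("utf-8")

garbled = raw.decode("latin-1")
print(garbled)  # JosÃ© EspiÃ±o

# Switching the View menu amounts to retrying the decode with other encodings.
for candidate in ["latin-1", "cp1252", "mac_roman", "utf-8"]:
    print(f"{candidate:10} {raw.decode(candidate, errors='replace')}")
```

Only the encoding that matches how the text was saved restores the accents, which is what manually switching the browser's encoding is searching for.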

ANGEL & Other Web Tools
1. Activate keyboards for needed scripts
2. Open Netscape 7/Mozilla
3. Go to ANGEL or other Web tool
4. Switch keyboards
5. Type!
6. Users can view in Netscape 7/Mozilla, IE5+ (Win) or Safari (OS X)

¡Escribez Русский! Where to Find Out More
- Penn State Computing with Accents
- Titus Cyberbit Unicode Font (free)
  - Look under “Instrumentalia”