Making Sense of Language Tags, 10th Metadata Open Forum

Presenter: Addison Phillips. Globalization Architect, Yahoo!; Chair, W3C Internationalization Core Working Group; Co-Editor, Language Tag Registry Update (LTRU) Working Group (RFC 4646, RFC 4647, RFC 4646bis)

Languages, Language Tags, and Locales (oh my!) Identifying language (and locale): the challenge ISO 639 IETF BCP 47 – RFC 4646, RFC 4647 – RFC 4646bis Challenges for users

Human Language as Metadata Some data is just data, but some data is human-readable text. Text processing depends on language: – spelling, stemming, tokenization, word/line/sentence boundaries, thesauri, terminology, morphological analysis, font and stylistic traditions, collation. IT systems depend on language negotiation: – localization, message selection, user interface, presentation, number/date/time/etc. formatting, list presentation

Human Language "Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!" (Mark Twain, Pudd'nhead Wilson)

Identifying Languages Languages don't form nice hierarchies – splitters vs. lumpers – dialects, subdialects, regional and stylistic differences, patois Differing communities with different needs – terminology, librarians, computer systems, translators, etc.

In the Beginning (ca CE) Received Wisdom from the Dark Ages Locales: – japanese, french, german, C – ENU, FRA, JPN – ja_JP.PCK – AMERICAN_AMERICA.WE8ISO8859P1 Languages… … looked a lot like locales (and vice versa)

ISO 639 Defines language identifier codes Multiple parts: – ISO 639-1 (alpha2 codes, 676 possible) (136 codes) – ISO 639-2 (alpha3 codes) (about 500) – ISO 639-3 (alpha3 codes) (about 7000) – ISO 639-4 (principles for encoding) – ISO 639-5 (language families) – ISO 639-6 (alpha4 codes) (under development)

Impact of ISO ISO and share a codespace – all codes are also codes – Macrolanguages

Human Language "Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!" (Mark Twain, Pudd'nhead Wilson) Tag: en

Parallel Efforts ISO 639 – ISO 639-1 (early 1980s) – ISO 639-2 (alpha3) – ISO 639-3 IETF BCP 47 – RFC 1766 (1995) – RFC 3066 (2001) – RFC 4646 (2006) – RFC 4646bis (2007)

BCP 47 Internet Engineering Task Force (IETF) Best Current Practice (BCP) Enable presentation, selection, and negotiation of content in protocols and formats – Widely used! XML, HTML, RSS, MIME, SOAP, SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP, perl, Apache, IE, Mozilla……….

Adds Granularity Need to identify language on varying levels of mutual intelligibility and granularity "Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!" (Mark Twain, Pudd'nhead Wilson) Tags: en, en-US

What's a Locale? – a concept or identifier used by programmers to represent a particular collection of cultural, regional, or linguistic preferences. Examples: java.util.Locale; .NET Culture; LANG (setlocale in C, C++); NLS_LANG in Oracle … and so on…

Locales? Huh? Theatre Center News: The date of the last version of this document was A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt.

Locale Identifiers Different ideas: – Accept-Locale vs. Accept-Language – URIs/URNs, etc. – CLDR/LDML And Requirements: – Operating environments and harmonization – App Servers – Web Services New Solution? Cost of Adoption: – UTF-8 to the browser: 8 long years

Locales and Language Tags meet We really need locale identifiers. Language tags are being (ab)used as locale identifiers anyway… Not going to need a big new thing… … we can do this really fast… Yeah, we'll write an RFC. (IUC23, March 2003)

BCP 47 (Historic) Basic Structure Alphanumeric (ASCII only) subtags Up to eight characters long Separated by hyphens Case not important (i.e. zh = ZH = zH = Zh) 1*8alphanum * [ - 1*8 alphanum ]
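
The structural rules on this slide can be sketched as a short check. This is only an illustration of the informal slide grammar (ASCII alphanumeric subtags of one to eight characters, joined by hyphens, compared without regard to case); the function names are made up for this example, and the check says nothing about whether a tag is registered or meaningful.

```python
import re

# Informal grammar from the slide: 1*8alphanum *( "-" 1*8alphanum )
HISTORIC_TAG = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_historic_shape(tag: str) -> bool:
    """Return True if the tag is hyphen-separated ASCII subtags of 1-8 characters."""
    return HISTORIC_TAG.fullmatch(tag) is not None


def same_tag(a: str, b: str) -> bool:
    """Compare two tags ignoring case, so zh = ZH = zH = Zh."""
    return a.lower() == b.lower()
```

For example, `is_historic_shape("zh-TW")` and `is_historic_shape("i-klingon")` both hold, while a POSIX-style `zh_TW` is rejected because of the underscore.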

RFC 1766 zh-TW: zh is an ISO 639-1 (alpha2) language code; TW is an ISO 3166 (alpha2) country code. i-klingon: registered value.

RFC 3066 sco-GB: sco is an ISO 639-2 (alpha3) code. But use alpha2 codes when they exist: eng-GB is wrong (use en-GB).

Problems Script Variation: – zh-Hant/zh-Hans – (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.) Obsolescence of registrations: – art-lojban (now jbo), i-klingon (now tlh) Instability in underlying standards: – sr-CS (CS used to be Czechoslovakia) Lack of a single authoritative, stable source

And More Problems Lack of scripts Little support for registered values in software Reassignment of values by ISO 3166 Lack of consistent tag formation (Chinese dialects?) Standards not readily available, bad references Bad implementation assumptions – 1*8 alphanum *[ - 1*8 alphanum] – 2*3 ALPHA [ - 2ALPHA ] Many registrations to cover small variations – 8 German registrations to cover two variations

LTRU and RFC 4646 Defines a generative syntax – machine readable – future proof, extensible Defines a single source (IANA Language Subtag Registry) – Stable subtags, no conflicts – Machine readable Defines when to use subtags – (sometimes)
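
To make "machine readable" concrete, here is a minimal sketch of reading registry-style records. It assumes the record-jar layout used by the IANA Language Subtag Registry: `Field: value` lines, records separated by a line containing `%%`, and folded continuation lines that start with whitespace. The `SAMPLE` text is hand-written for illustration, not copied from the live registry, and `parse_records` is a hypothetical helper name.

```python
# Hand-written sample in the registry's record-jar style (illustrative only).
SAMPLE = """\
Type: language
Subtag: sl
Description: Slovenian
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: script
Subtag: Latn
Description: Latin
Added: 2005-10-16
%%
Type: region
Subtag: IT
Description: Italy
Added: 2005-10-16
"""


def parse_records(text: str) -> list:
    """Parse record-jar text into a list of {field: [values]} dictionaries."""
    records = []
    for chunk in text.split("\n%%\n"):
        record = {}
        last_key = None
        for line in chunk.splitlines():
            if not line.strip():
                continue
            if line[0] in " \t" and last_key is not None:
                # Folded line: continue the previous field's last value.
                record[last_key][-1] += " " + line.strip()
                continue
            key, _, value = line.partition(":")
            last_key = key.strip()
            # Fields such as Description may repeat, so keep a list per field.
            record.setdefault(last_key, []).append(value.strip())
        if record:
            records.append(record)
    return records
```

Values are kept as lists because a single record may carry the same field more than once.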

Anatomy of a Language Tag sl-Latn-IT-rozaj-1994-x-mine – sl: language, ISO 639-1/2 (alpha2/3) – Latn: script, ISO 15924 script codes (alpha4) – IT: region, ISO 3166 (alpha2) or UN M49 – rozaj, 1994: registered variants – x-mine: private use and extension
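
Because subtags can be identified by length and content, the anatomy above can be recovered without consulting the registry. The following is a simplified sketch under stated assumptions: it checks shape only (it does not validate subtags against the registry), it ignores grandfathered tags such as i-klingon, and `parse_tag` and its result keys are names invented for this example.

```python
def parse_tag(tag: str) -> dict:
    """Split a language tag into its parts using subtag length and content."""
    if not tag.isascii():
        raise ValueError("language tags are ASCII only")
    subtags = tag.split("-")
    result = {"language": subtags[0].lower(), "extlang": [], "script": None,
              "region": None, "variants": [], "extensions": {}, "private_use": []}
    i = 1
    while i < len(subtags):
        s = subtags[i]
        # Nothing later than script/region/variant may precede these subtags.
        early = result["region"] is None and not result["variants"]
        if len(s) == 1:
            break  # singleton: extensions or private use start here
        if len(s) == 3 and s.isalpha() and early and result["script"] is None:
            result["extlang"].append(s.lower())
        elif len(s) == 4 and s.isalpha() and early and result["script"] is None:
            result["script"] = s.title()
        elif early and ((len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit())):
            result["region"] = s.upper()
        elif (5 <= len(s) <= 8 and s.isalnum()) or (len(s) == 4 and s[0].isdigit() and s.isalnum()):
            result["variants"].append(s.lower())
        else:
            raise ValueError("unexpected subtag: " + s)
        i += 1
    while i < len(subtags):
        singleton = subtags[i].lower()
        i += 1
        if singleton == "x":
            # Everything after "x" is private use.
            result["private_use"] = [s.lower() for s in subtags[i:]]
            break
        start = i
        while i < len(subtags) and len(subtags[i]) > 1:
            i += 1
        result["extensions"][singleton] = [s.lower() for s in subtags[start:i]]
    return result
```

Calling `parse_tag("sl-Latn-IT-rozaj-1994-x-mine")` yields language `sl`, script `Latn`, region `IT`, variants `rozaj` and `1994`, and private use `mine`.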

More Examples fr, de, nl, en, ja fr-FR, fr-CA, de-DE, de-CH … es-419 (Spanish for Americas) en-US (English for USA) de-CH-1996 (Old tags are all valid) sl-rozaj-1994 (Multiple variants) zh-t-wadegile (Extensions)

Solves the Script problem zh-Hant (!= zh-TW) zh-Hans (!= zh-CN) Azerbaijani (az) – Arab, Cyrl, Latn Serbian (sr) – Cyrl, Latn Yiddish (yi) – Hebr, Latn Mongolian (mn) – Cyrl, Latn, Hani Belarusian (be) – Cyrl, Latn Etc.

Benefits Subtag registry in one place: one source, machine-readable Subtags identified by length/content Extensible Compatible with RFC 3066 tags Stable: subtags are forever

Tag Choice Tag Content Wisely – use the shortest tag reasonable – use as many subtags as necessary to disambiguate – don't invent things; use the registry – map deprecated values to modern equivalents
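
The last guideline, mapping deprecated values to modern equivalents, can be sketched as a table lookup. The two entries below are the replacements named earlier in this talk (art-lojban is now jbo, i-klingon is now tlh); a real implementation would presumably build the table from the registry's preferred-value data, and `modernize` is a hypothetical helper name.

```python
# Deprecated tag -> modern equivalent, using only the examples from this talk.
DEPRECATED = {
    "art-lojban": "jbo",
    "i-klingon": "tlh",
}


def modernize(tag: str) -> str:
    """Return the modern equivalent of a deprecated tag, or the tag unchanged."""
    return DEPRECATED.get(tag.lower(), tag)
```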

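Specialized Codes – zxx (no linguistic content) – und (undetermined language) – mis (miscellaneous or uncoded languages) – Zxxx (script code for unwritten content)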

Problems Matching – Does en-US match en-Latn-US ? Tag Choices – Users have more to choose from. Implementations – More to do, more to think about – (easier to parse, process, support the good stuff)

Tag Matching (RFC 4647) Uses Language Ranges in a Language Priority List to select sets of content according to the language tag Three Schemes – Basic Filtering – Extended Filtering – Lookup

Tags are not Tokens! Many technologies would like language tags (attributes, etc.) to be atomic, but language tags have structure. foo:lang(en) { color: red; } Accept-Language: zh;q=1.0, de-DE;q=0.8
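
As an illustration of treating the header as structured data rather than an opaque token, this sketch turns an Accept-Language value into a language priority list ordered by quality value. It assumes the usual HTTP conventions (comma-separated items, an optional `q` parameter defaulting to 1.0, and `q=0` meaning "not acceptable"); `parse_accept_language` is a name chosen for this example.

```python
def parse_accept_language(header: str) -> list:
    """Return [(language_range, q)] sorted by descending q, ties in header order."""
    ranges = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        tag, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # default weight when no q parameter is given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as unacceptable
        if q > 0:
            ranges.append((tag, q, position))
    ranges.sort(key=lambda r: (-r[1], r[2]))
    return [(tag, q) for tag, q, _ in ranges]
```

The resulting list can be handed to a matching scheme such as filtering or lookup.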

Filtering Ranges specify the least specific item – en matches en, en-US, en-Brai, en-boont Basic matching uses plain prefixes – en-US matches en-US or en-US-boont but not en-Latn-US Extended matching can match inside bits – en-*-US
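
Both filtering schemes can be sketched in a few lines. The basic version is a case-insensitive prefix test on whole subtags; the extended version follows my reading of the RFC 4647 extended filtering steps, in which a `*` inside the range, or a missing subtag such as a script, may be skipped. The function names are illustrative, and neither function checks that its inputs are well formed.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Basic filtering: the range equals the tag or is a prefix ending at a hyphen."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def extended_filter_match(language_range: str, tag: str) -> bool:
    """Extended filtering: range subtags must appear in order, gaps allowed."""
    r = language_range.lower().split("-")
    t = tag.lower().split("-")
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1            # a wildcard matches any run of subtags
        elif ti >= len(t):
            return False       # the tag ran out before the range did
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False       # never skip past a singleton (extension or private use)
        else:
            ti += 1            # skip a tag subtag that the range does not mention
    return True
```

With these definitions, `en-US` basic-matches `en-US-boont` but not `en-Latn-US`, while the extended range `en-*-US` matches `en-Latn-US`.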

Lookup Range specifies the most specific tag in a match. Returns exactly one item. – en-US might return either en or en-US but not en-US-boont Mirrors the locale fallback mechanism and many language negotiation schemes.
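
A minimal sketch of lookup, assuming the progressive-truncation rule from RFC 4647: each range in the priority list is shortened from the right, one subtag at a time, until it equals an available tag; a single-letter subtag left at the end is removed as well. The `lookup` function and its `default` parameter are names chosen for this illustration.

```python
def lookup(priority_list, available, default=None):
    """Return the single best available tag for a language priority list."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue  # a bare wildcard carries no information for lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()  # truncate from the end
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()  # do not leave a dangling singleton such as "x"
    return default
```

Given only `en` and `en-US-boont` as available tags, the range `en-US` returns `en`: the more specific `en-US-boont` is never considered.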

Lookup and Language Negotiation Resources fall back to find the best match Global Binary Resources zh-Hans-SG (Chinese, Simplified script, Singapore) zh-Hans (Chinese, Simplified script) zh (Chinese) (root) Falling back
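
The fallback chain on this slide can be generated mechanically and then used to resolve individual resources. This is a sketch under assumptions: the `(root)` label, the `bundles` dictionary shape (tag to key/value mapping), and both function names are invented for the example and do not describe any particular resource-bundle API.

```python
def fallback_chain(tag: str, root: str = "(root)") -> list:
    """List the tag, each truncation of it, and finally the root bundle name."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()  # drop a trailing singleton together with its payload
    chain.append(root)
    return chain


def resolve(key: str, tag: str, bundles: dict):
    """Return (bundle_name, value) from the most specific bundle that defines key."""
    for name in fallback_chain(tag):
        if key in bundles.get(name, {}):
            return name, bundles[name][key]
    raise KeyError(key)
```

For `zh-Hans-SG` the chain is `zh-Hans-SG`, `zh-Hans`, `zh`, `(root)`, so a string missing from the Singapore bundle is taken from the nearest ancestor that has it.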

What Do I Do (Content Author)? Not much. – Existing tags are all still valid: tagging is mostly unchanged. – Resist temptation to (ab)use the private use subtags. Unless your language has script variations: – Tag content with the appropriate script subtag(s) Script subtags only apply to a small number of languages: zh, sr, uz, az, mn, and a very small number of others.

What Do I Do (Programmer)? Check code for compliance with 4646 – Decide on well-formed or validating – Implement suppress-script – Change to using the registry – Bother infrastructure folks (Java, MS, Mozilla, etc) to implement the standard
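
One of the items above, suppress-script, can be sketched as follows: if a tag carries the script that the registry marks as the default for its language, the script subtag is dropped. The table holds a handful of entries that I believe match the registry's Suppress-Script field, but it should be treated as illustrative; real code would load the field from the registry. The sketch also assumes the script sits directly after the primary language subtag (no extlang), and `strip_suppressed_script` is a hypothetical name.

```python
# Illustrative excerpt only; load the real Suppress-Script data from the registry.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn", "de": "Latn", "ru": "Cyrl", "ja": "Jpan"}


def strip_suppressed_script(tag: str) -> str:
    """Remove the script subtag when it is the language's suppressed (default) script."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        script = subtags[1]
        suppressed = SUPPRESS_SCRIPT.get(subtags[0].lower())
        is_script = len(script) == 4 and script.isalpha()
        if is_script and suppressed and script.lower() == suppressed.lower():
            del subtags[1]
    return "-".join(subtags)
```

Under this table `en-Latn-US` becomes `en-US`, while `zh-Hant-TW` is left alone because Chinese has no suppressed script here.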

I need a new subtag… Register new subtags with ietf- – only primary language or variant subtags – read RFC 4646 for instructions – two-week review period with expert approval

LTRU Milestone Dates RFC 4646 – Registry went live in December 2005 RFC 4647 (Anticipated) RFC 4646bis – This includes ISO 639-3 support, extended language subtags, and possibly ISO 639-6

RFC 4646bis (Internet-Draft) Currently taking shape – Adds about 7000 additional primary language subtags from ISO 639-3 – Extended language subtags for Chinese and other languages being debated – … and some cleanup work on processes and procedures

Macrolanguages and Extlang zh-Hant-HK Chinese, Traditional Script, Hong Kong SAR yue-Hant-HK Cantonese, Traditional Script, Hong Kong SAR zh-yue-Hant-HK Chinese, Cantonese, Traditional Script, Hong Kong SAR extlang

Things to Do (languages) Get involved in LTRU Get involved in W3C I18N Activity Write implementations Work on adoption of BCP 47: understand the impact Then get involved with Locale identifiers …

Back to Locales… IUC 20 Round Table Suzanne Topping's Multilingual article Tex Texin and the Locales list…

Locale Identifiers and Web Services

W3C and Unicode W3C – Identifiers and cross-over with language tags – Web services – XML, HTML Unicode Consortium – LDML – CLDR – Standards for content

Language Tags and Locale Identifiers REC (LTLI) Working Draft developed by W3C I18N Architecture WG – effort currently moribund: needs community participation – defines standards and guidelines for using language tags in W3C technologies – defines relationship of language tags to locale identifiers basis for efforts such as WS-I18N

Things to Read Tag and Registry RFC Matching RFC bis Draft References LTRU Mailing List

Ideas and Questions