1 Sorting it all out: An introduction to collation
Cathy Wissink, Michael Kaplan
Globalization Infrastructure and Font Technology, Windows International, Microsoft

2 Who is this talk geared towards?
25 March 2003 Prague, Czech Republic (IUC23)
This is a high-level introduction to the concepts of collation, assuming no prior knowledge.
Audience:
–Developers new to the concept
–People who need to understand collation well enough to “sell” this globalization feature to management
–Not intended to be a “nuts and bolts” talk (see the presentation immediately following!)

3 Collation: Used Every Day!
It may not be obvious, but you most likely use collation in some form every day:
Finding a mail slot for a colleague
Searching for an author at the bookstore
Library card catalog
Looking up a phone number

5 Any time you order or search for data in a logical fashion within a structure, you use collation!

6 Collation, the definition:
The culturally expected ordering of linguistic characters in a particular language
Often referred to as sorting, ordering, alphabetizing
Informants (native speakers) recognize correct vs. incorrect collation for their language, but often have a hard time explaining the particular collation rules

7 Great definitions, but what do they mean, really?
Every language (every culture) has an expected result when users search for data in “sorted” order
If the ordering isn’t perfectly correct, users have a very hard time finding data
This ordering can be influenced by a number of linguistic and orthographic elements within a language

8 Examples of linguistic elements that impact collation
“Character” order
Casing (upper case vs. lower case)
Modifiers (diacritics, Indic matras, vowel marks)
Radicals (CJK)
Stroke counts (CJK)
Syllable structure (SE Asian languages)
Pronunciation

9 Collation in Action
Latin script: English, French, Lithuanian, Swedish, Traditional Spanish
Chinese variants (Taiwanese orders)
Devanagari script: Hindi, Marathi
Tamil script: Tamil

10 English:

11 French:

12 Lithuanian:

13 Swedish:

14 Spanish (Traditional):

16 Devanagari
Hindi: consonants with modifier marks (candrabindu U+0901, anusvara U+0902, or visarga U+0903) sort differently from the consonant alone. A consonant followed by one of these modifier marks has a lighter primary sorting weight than the same consonant without a modifier mark.

17 Devanagari
कँ (Devanagari Ka + candrabindu)
कं (Devanagari Ka + anusvara)
कः (Devanagari Ka + visarga)
क (Devanagari Ka)
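
The ordering listed on this slide can be sketched in a few lines of Python. This is a toy illustration, not how any real collation library assigns weights: the name `hindi_key` and the rank table are assumptions made for the example, and the key only handles a single consonant optionally followed by one modifier mark.

```python
# Toy sketch: a consonant followed by candrabindu, anusvara, or visarga
# sorts before the bare consonant, matching the order listed on the slide.
# The ranks below are illustrative assumptions, not real collation weights.
CANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VISARGA = "\u0903"

MODIFIER_RANK = {CANDRABINDU: 0, ANUSVARA: 1, VISARGA: 2}
NO_MODIFIER_RANK = 3  # the bare consonant is "heavier" than any marked form


def hindi_key(text):
    """Return a sort key for one consonant plus an optional modifier mark."""
    base = text[0]
    mark = text[1:2]  # empty string when no mark follows
    return (base, MODIFIER_RANK.get(mark, NO_MODIFIER_RANK))


KA = "\u0915"  # Devanagari Ka
items = [KA, KA + VISARGA, KA + CANDRABINDU, KA + ANUSVARA]

by_code_point = sorted(items)             # bare consonant comes first
by_sketch = sorted(items, key=hindi_key)  # bare consonant comes last
```

A plain code point sort puts the bare consonant first because it is a prefix of the other strings; the sketch reverses that to match the slide.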

18 Devanagari
Hindi vs. Marathi
Two different languages within the Devanagari script, two different sort positions for Lla (U+0933)

19 Devanagari
Hindi: U+0932 < U+0933 < U+0934; that is: ल < ळ < ऴ
Marathi: U+0939 < U+0933 < U+0915+U+094D+U+0937 conjunct; that is: ह < ळ < क्ष
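
As a rough sketch of what language-level tailoring means here, the two orders can be written as small lookup tables. The tables cover only the letters named on the slide; the helper `make_key`, the position of Ha after U+0934 in the Hindi table, and the position of La before Ha in the Marathi table are assumptions based on the usual alphabet order, not taken from the slide.

```python
# Toy sketch: the same letter, Lla (U+0933), lands in different places
# depending on the language. The order tables are illustrative assumptions
# that cover only a handful of letters.
LA = "\u0932"
LLA = "\u0933"
LLLA = "\u0934"
HA = "\u0939"
KSSA = "\u0915\u094d\u0937"  # Ka + virama + Ssa conjunct

HINDI_ORDER = [LA, LLA, LLLA, HA]
MARATHI_ORDER = [LA, HA, LLA, KSSA]


def make_key(order):
    """Build a key function that ranks a letter by its position in order."""
    rank = {letter: i for i, letter in enumerate(order)}
    return lambda letter: rank[letter]


letters = [LLA, HA, LA]
hindi_sorted = sorted(letters, key=make_key(HINDI_ORDER))
marathi_sorted = sorted(letters, key=make_key(MARATHI_ORDER))
```

With the same input, the Hindi table places Lla before Ha and the Marathi table places it after, which is why a single per-script order is not enough.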

20 Tamil
A consonant + virama (halant) combination has a lighter primary weight than the consonant alone

21 Tamil
க் (Tamil Ka + virama)
க (Tamil Ka)
ங் (Tamil Nga + virama)
ங (Tamil Nga)
ச் (Tamil Ca + virama)
ச (Tamil Ca)
ஞ் (Tamil Nya + virama)
ஞ (Tamil Nya)
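
One way to picture this rule is a key function that scans the text and emits a lighter weight whenever a consonant is followed by a virama. The sketch below is only an illustration under that assumption: `tamil_key` and its 0/1 weights are made up for the example, and it treats every character that is not followed by a virama as a "consonant alone".

```python
# Toy sketch: a Tamil consonant followed by virama (U+0BCD) gets a lighter
# weight than the same consonant alone. Weights are illustrative assumptions.
VIRAMA = "\u0bcd"


def tamil_key(text):
    """Turn text into a list of (character, weight) pairs for sorting."""
    key = []
    i = 0
    while i < len(text):
        ch = text[i]
        if i + 1 < len(text) and text[i + 1] == VIRAMA:
            key.append((ch, 0))  # consonant + virama: lighter
            i += 2
        else:
            key.append((ch, 1))  # consonant alone: heavier
            i += 1
    return key


KA, NGA, CA, NYA = "\u0b95", "\u0b99", "\u0b9a", "\u0b9e"
items = [KA, NYA + VIRAMA, CA, NGA, KA + VIRAMA, NYA, CA + VIRAMA, NGA + VIRAMA]
ordered = sorted(items, key=tamil_key)
```

Sorting with this key reproduces the order listed on the slide, with each virama form immediately before its bare consonant.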

22 Myths about collation
“Well, if I localize my product, these kinds of details don’t matter”

23 Myths about collation
“If I already use Unicode in my product, sorting is covered by this universal encoding”
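
A short Python sketch shows why this is a myth: sorting Unicode strings by code point is not the same as sorting them linguistically. The word list and the helper `rough_key` are assumptions made for the example, and `rough_key` is a crude approximation (strip accents, ignore case), not a real collation algorithm.

```python
import unicodedata

# "\u00c4pfel" is "Äpfel" and "\u00e9clair" is "éclair" (precomposed forms).
words = ["zebra", "apple", "Zulu", "\u00c4pfel", "\u00e9clair"]

# Plain sorted() compares code points: uppercase letters come before
# lowercase ones, and accented letters come after "z".
by_code_point = sorted(words)


def rough_key(word):
    """Crude approximation: drop combining marks and ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


by_rough_key = sorted(words, key=rough_key)
```

Code point order gives Zulu, apple, zebra, Äpfel, éclair, while the rough key gives Äpfel, apple, éclair, zebra, Zulu. Neither is correct for every language; the point is only that the encoding alone does not provide the expected order.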

24 Myths about collation
“One collation is good enough for Europe*, right?”
* Replace with the market of your choice: Asia, North America, India, etc.

25 Myths about collation
“One collation is good enough for the Latin* script, right?”
* Replace with the script of your choice: Cyrillic, Han, Devanagari, etc.

26 Why should I care about all this?
Ideally, a well-globalized product uses culturally correct collation where the users expect it, for example:
–Address book
–Document filing system
–Database
–…
Your users will expect collation in a surprising number of places!

27 Collation Example

28 Yet another collation example

29 How do I make sure my users get the results they expect?
Collation usually needs to address the user’s expected ordering, not the linguistic ordering of the data source (these two can differ!)
Swedish user, German data
Multiple users, multilingual data
The Switzerland example
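
The "Swedish user, German data" case can be sketched with two simplified key functions applied to the same words. Swedish treats å, ä, ö as separate letters after z, while the German dictionary convention files ä, ö, ü with a, o, u. The names `swedish_key` and `german_key`, the word list, and the lowercase-only handling are assumptions made for the example.

```python
# Toy sketch: two Latin-script orderings for the same data.
# Both key functions are simplified assumptions that accept only
# lowercase words spelled with the letters listed below.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}
GERMAN_FOLD = str.maketrans({"\u00e4": "a", "\u00f6": "o", "\u00fc": "u"})


def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    return [SWEDISH_RANK[ch] for ch in word]


def german_key(word):
    """Base letters decide first; the original spelling breaks ties."""
    return (word.translate(GERMAN_FOLD), word)


words = ["zon", "\u00e4rm", "arm"]  # "\u00e4rm" is "ärm"
for_swedish_user = sorted(words, key=swedish_key)
for_german_user = sorted(words, key=german_key)
```

The Swedish key yields arm, zon, ärm, and the German key yields arm, ärm, zon. The data is the same in both cases; what changes is whose expectations the ordering serves.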

30 How do I make sure my users get the results they expect?
Make sure you’re using a collation-aware mechanism to order data
Windows APIs such as CompareString, LCMapString
SQL Server 2000 collations
The .NET Framework's CompareInfo class
Except when you want non-linguistic collation…
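
The mechanisms named on this slide are Windows, SQL Server, and .NET facilities. As a stand-in in a different environment, the sketch below uses Python's standard `locale` module, whose `strxfrm` function produces sort keys according to the active locale. The function name `sort_for_current_user` is an assumption, and the actual order depends on which locales are installed on the machine.

```python
import locale


def sort_for_current_user(words):
    """Sort words using the collation rules of the process locale."""
    try:
        # "" selects the user's default locale from the environment.
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        pass  # keep whatever locale is already active
    return sorted(words, key=locale.strxfrm)


result = sort_for_current_user(["pear", "apple", "fig"])
```

The design point is the same as on the slide: the program asks the platform for the ordering instead of hard-coding one.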

31 When not to use linguistic collation
When consistency across different cultures is required
–“Case insensitive” file systems
–File extension names (.INF, .GIF, etc.)
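
A minimal sketch of the file-extension case, assuming a locale-independent comparison is what is wanted: Python's `str.casefold()` uses Unicode default case folding rather than the user's locale, so `.inf` and `.INF` match everywhere. The helper `turkish_upper` is hypothetical and only imitates the Turkish rule that uppercases i to İ (U+0130), to show what would go wrong if a culture-specific rule were applied.

```python
import os.path


def same_extension(path_a, path_b):
    """Compare file extensions without any locale-specific casing rules."""
    ext_a = os.path.splitext(path_a)[1]
    ext_b = os.path.splitext(path_b)[1]
    # casefold() applies Unicode default case folding, not the user's locale.
    return ext_a.casefold() == ext_b.casefold()


def turkish_upper(text):
    """Hypothetical helper imitating Turkish uppercasing of the letter i."""
    return text.replace("i", "\u0130").upper()


matches = same_extension("setup.inf", "SETUP.INF")
turkish_form = turkish_upper("inf")  # dotted capital I, then "NF"
```

Under the Turkish-style rule the uppercased extension no longer equals `INF`, which is why identifiers such as file extensions are better compared with a fixed, culture-neutral rule.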

32 When not to use linguistic collation
When users expect data in a specific collation other than their own
–Excel column names
–“ASCII” order
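
Spreadsheet column names are a small example of such a fixed order: A through Z come first, then AA, AB, and so on, regardless of the user's language. A sketch of that ordering in Python, where the key function `excel_column_key` is an assumption made for the example:

```python
def excel_column_key(name):
    """Order column names by length first, then letter by letter."""
    return (len(name), name)


columns = ["AA", "B", "Z", "A", "AB"]
alphabetical = sorted(columns)                       # dictionary-style order
spreadsheet = sorted(columns, key=excel_column_key)  # column layout order
```

Dictionary-style sorting puts AA and AB right after A, while the spreadsheet key keeps all single-letter columns ahead of the two-letter ones.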

33 In summary…
Linguistically aware collation is an important feature of any well-globalized product
Collation needs to be considered at the language level
–Encoding, region, or script level is not enough!
There are many collation-aware mechanisms out there (within the OS, for example); take advantage of them!

34 Other applicable IUC talks
Stay tuned for the second half of this tutorial!
Cathy's "Issues in Indic Collation" talk on Thursday afternoon

35 Other References
This tutorial's corresponding paper Unicode Technical Note (UTN) #1 http://unicode.org/notes/tn1/
Nadine Kano, Developing International Software (out of print, but still available on the web) http://microsoft.com/globaldev/dis_v1/disv1.asp
New! Developing International Software, 2nd edition (available now or very soon): http://microsoft.com/globaldev/dis_v2/disv2.asp
Michael Kaplan, Internationalization with VB http://i18nWithVB.com/

36 Questions?

37 Don't forget to fill out your evals!
Sorting it all out: An introduction to collation

