Getting Started with ICU
Vladimir Weinstein, Eric Mader, Steven R. Loomis

The ICU library is a very powerful tool for solving globalization tasks. This paper provides the reader with instructions for obtaining and setting up both the ICU4J and ICU4C libraries. Several important ICU frameworks are also introduced: conversion, collation, message formatting and break iteration. Usage examples are given for each framework. To limit the complexity and size of the text, each framework is presented in one library only: conversion in ICU4C (there is no ICU4J conversion engine), collation and message formatting in ICU4J, and break iteration again in ICU4C. However, all four frameworks are used together in one example, UCount, a locale-aware and Unicode-enabled word count program. This example is provided in both C++ and Java.

27th Internationalization and Unicode Conference Berlin, Germany, April 2005
Agenda
  Getting & setting up ICU4C
  Using conversion engine
  Using break iterator engine
  Getting & setting up ICU4J
  Using collation engine
  Using message formats
  Example analysis
Getting ICU4C
  http://ibm.com/software/globalization/icu
  Get the latest release
  Get the binary package
  Source download for modifying build options
  CVS for bleeding edge: read instructions

There are several ways to get ICU. Start by visiting http://ibm.com/software/globalization/icu. The download page lists all the ICU releases. The safest bet is to use the latest release. ICU version numbers have two parts, such as 2.8 or 3.0. Most releases are major (“reference”) releases. A round number does not mean that a release is more significant than the others (in other words, the amount of change in 2.8 is probably about the same as in 3.0). Some reference releases have maintenance releases (such as 2.6.2).

If your platform is on the binary download list, picking a binary package will probably be the easiest option. It gives you a ready-to-use ICU library.

You might also want to try the source package. Having the source allows you to change build options, build only the parts of ICU that you really need, choose the data packaging options, and so on. The readme.html file is a good resource for learning about the different build modes.

We also provide CVS access to the library. Every release is tagged ‘release-x-y’, so if you want ICU 3.0, you can check out the release-3-0 tag. CVS HEAD is not guaranteed to be stable.
Setting up ICU4C
  Unpack binaries
  If you need to build from source
    Windows: MSVC .Net 2003 Project, CygWin + MSVC 6, just CygWin
    Unix: runConfigureICU
      make install
      make check

Once you have downloaded ICU, you need to set it up. A binary download only needs unpacking. A source download requires you to build the library. For Windows, we provide solution and project files for MSVC .Net 2003. In most cases, building the library is as simple as starting the build. Older versions of ICU provide workspace and project files for MSVC 6.

On UNIX platforms, you need to configure ICU first. The source distribution provides a configure script, which probes your system and creates Makefiles. There is a front end to the configure script, invoked by the runConfigureICU command. Reading readme.html is almost certainly required. Once configuration is done, you can build the library by invoking make. Two additional make targets are useful: make install installs ICU in the specified place, and make check builds and runs the test suite.
Testing ICU4C
  Windows - run: cintltst, intltest, iotest
  Unix - make check (again)
  See it for yourself:

    #include <stdio.h>
    #include "unicode/utypes.h"
    #include "unicode/ures.h"

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;
        /* open the root resource bundle from the default ICU data */
        UResourceBundle *res = ures_open(NULL, "", &status);
        if(U_SUCCESS(status)) {
            printf("everything is OK\n");
        } else {
            printf("error %s opening resource\n", u_errorName(status));
        }
        ures_close(res);
        return 0;
    }

It is always a good idea to run the test suite to make sure that the library is properly built. The test suite consists of three programs: cintltst, which tests the C APIs; intltest, which mostly tests the C++ APIs; and iotest, which tests the input/output library. If all of these programs run fine, ICU is ready to be used.

ICU4C consists of several libraries. The core library is common. It provides all the services and frameworks that the higher-level services require: configuration settings, basic types, locale conversion, resource management, service registration, normalization, character properties, code page conversion and other core services. Your projects will have to use at least the common library. The second library is i18n. It provides higher-level frameworks and services, such as collation, transformation and formatting. Also worth noting is the io library, which provides POSIX-like services. You will need it if your project requires globalized input/output services.
Conversion Engine - Opening
  ICU4C uses open/use/close paradigm
  Open a converter:

    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open(encoding, &status);
    if(U_FAILURE(status)) {
        /* process the error situation, die gracefully */
    }

  Almost all APIs use UErrorCode for status
  Check the error code!

One of the more popular uses for ICU is text conversion. One reason is that ICU provides probably the most complete set of conversion tables. Also, a lot of work has gone into properly identifying the various codepages and establishing an alias system. Therefore, if you need to convert text from one codepage to Unicode or to another codepage, chances are that ICU will be the best tool for the task.

In order to do conversion, a converter needs to be opened. ICU is based on the open/use/close paradigm. This means that in order to use a service, a service object needs to be opened and kept around for as long as the service is required. One benefit of this approach is that a service object can provide the best performance in subsequent uses. Therefore, it is wise to plan your programs so that you reuse service objects.

In ICU4C most of the APIs use a UErrorCode variable to return the status of the operation. If an error occurs during the API call, this variable is set to the error condition. After the API returns, it is usually wise to check the contents of the status variable using the U_SUCCESS or U_FAILURE macros.
What Converters are Available
  ucnv_countAvailable() - get the number of available converters
  ucnv_getAvailable() - get the name of a particular converter
  Many frameworks allow this kind of examination

Sometimes it is useful to know which converters are supported by the installed ICU library. First, find out how many converters are installed by calling the ucnv_countAvailable() API. Next, get the name of each converter in the list by calling the ucnv_getAvailable() API, which takes the index of a converter.

There are several other ways to open a converter. For more details, take a look at the ICU User's Guide and the API reference.
Converting Text Chunk by Chunk

    char buffer[DEFAULT_BUFFER_SIZE];
    char *bufP = buffer;
    len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE,
                          source, sourceLen, &status);
    if(U_FAILURE(status)) {
        if(status == U_BUFFER_OVERFLOW_ERROR) {
            /* len now holds the required size: allocate and convert again */
            status = U_ZERO_ERROR;
            bufP = (char *)malloc((len + 1) * sizeof(char));
            len = ucnv_fromUChars(cnv, bufP, len + 1,
                                  source, sourceLen, &status);
        } else {
            /* other error, die gracefully */
        }
    }
    /* do interesting stuff with the converted text */

There are various ways to convert text. The simplest scenario is to have a complete chunk of data that needs to be converted to or from Unicode. In that case, you only need to specify the buffer to hold the result and call the conversion API.

There are several approaches to finding the required size of the buffer. The first one is to estimate. If you are converting between a single-byte code page and Unicode, the number of units in the receiving buffer should be at least as big as the number of units in the source data. However, you might not know enough about the encoding. In that case, you can use the API itself to find out how much space you really need: when the buffer is too small, the call sets U_BUFFER_OVERFLOW_ERROR and returns the length that the full result requires.
Converting Text Character by Character

    UChar32 result;
    const char *source = start;
    const char *sourceLimit = start + len;
    while(source < sourceLimit) {
        /* advances 'source' past the bytes of one character */
        result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
        if(U_FAILURE(status)) {
            /* die gracefully */
        }
        /* do interesting stuff with the converted text */
    }

  Works only from code page to Unicode

Another conversion API allows you to convert one character at a time from the source encoding to Unicode. This API is useful, for example, for wrapping a converter in a character iterator. There is no API to convert a single code point from Unicode to a codepage.

Another interesting thing in this example is that using the converter modifies the pointer to the source text. So, you need to preserve the original pointer if you are going to need it later. During this conversion, the converter's internal state changes, and the next call to this API is affected by that state.
Converting Text Piece by Piece

    while((!feof(f)) &&
          ((count=fread(inBuf, 1, BUFFER_SIZE , f)) > 0) ) {
        source = inBuf;
        sourceLimit = inBuf + count;
        do {
            target = uBuf;
            targetLimit = uBuf + uBufSize;
            ucnv_toUnicode(conv, &target, targetLimit,
                           &source, sourceLimit, NULL,
                           feof(f)?TRUE:FALSE, /* pass 'flush' when eof */
                           /* is true (when no more data will come) */
                           &status);
            if(status == U_BUFFER_OVERFLOW_ERROR) {
                // simply ran out of space: we'll reset the
                // target ptr the next time through the loop.
                status = U_ZERO_ERROR;
            } else {
                // Check other errors here and act appropriately
            }
            text.append(uBuf, target-uBuf);
            count += target-uBuf;
        } while (source < sourceLimit); // while simply out of space
    }

Another interesting situation is reading a file. In that case, you don’t know in advance how long the file is going to be. Also, allocating a huge buffer to hold the whole source file is usually not a good idea. The ICU conversion engine provides a way to convert data that comes in pieces.

The core of this loop is the ucnv_toUnicode API. It takes a piece of text and converts it to Unicode. Its ‘flush’ argument tells the converter whether more text will arrive: as long as it is FALSE, the converter retains its state between calls. So, if the encoding that we are dealing with depends on previously converted characters, the conversion is still correct across piece boundaries.

The example shows that the API modifies both the source and the target pointers. Also, ucnv_toUnicode can be mixed with ucnv_getNextUChar if required.
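The piece-by-piece pattern (feed a chunk, drain a small output buffer, carry unfinished bytes over, flush at the end) can be tried without ICU. The sketch below is not ICU code: since there is no ICU4J conversion engine, it uses the JDK's java.nio.charset.CharsetDecoder as a stand-in, and the class name ChunkedDecodeSketch, the method decodeInPieces and the buffer sizes are hypothetical choices made for illustration. It assumes well-formed UTF-8 input and does no error handling.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ChunkedDecodeSketch {
    // Decode 'data' as UTF-8, handing it to the decoder pieceSize bytes at a time.
    static String decodeInPieces(byte[] data, int pieceSize) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // Extra room for bytes of an unfinished character carried over from the last piece.
        ByteBuffer in = ByteBuffer.allocate(pieceSize + 8);
        // Deliberately tiny output buffer so the overflow path is exercised.
        CharBuffer out = CharBuffer.allocate(4);
        StringBuilder text = new StringBuilder();
        int offset = 0;
        boolean end;
        do {
            int n = Math.min(pieceSize, data.length - offset);
            in.put(data, offset, n);
            offset += n;
            end = offset >= data.length;   // plays the role of the 'flush' flag
            in.flip();
            CoderResult result;
            do {
                result = decoder.decode(in, out, end);
                out.flip();
                text.append(out);          // appends only the chars just produced
                out.clear();
            } while (result.isOverflow()); // output buffer was full: go around again
            in.compact();                  // keep any undecoded trailing bytes
        } while (!end);
        CoderResult flushed;
        do {                               // let the decoder emit any pending output
            flushed = decoder.flush(out);
            out.flip();
            text.append(out);
            out.clear();
        } while (flushed.isOverflow());
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        // A piece size of 3 splits some two-byte characters across pieces.
        System.out.println(decodeInPieces(bytes, 3));
    }
}
```

As in the ICU loop, the decoder object is what carries state from one piece to the next, so a character split across two pieces is still decoded correctly.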
Clean up!
  Whatever is opened, needs to be closed
  Converters use ucnv_close
  Sample uses conversion to convert code page data from a file

After using a converter, you need to clean up by closing it. Otherwise, you’ll produce a memory leak.
Text Boundary Analysis
  Process of locating linguistic boundaries while formatting and processing text
  Many uses
  Relatively straightforward for English
  Hard for some other languages:
    Chinese and Japanese
    Thai
    Hindi

Text boundary analysis is the process of locating linguistic boundaries while formatting and processing text. Examples of this process include:
  Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.
  Locating the beginning of a word that the user has selected.
  Counting characters, words, sentences, or paragraphs.
  Determining how far to move the text cursor when the user hits an arrow key.
  Making a list of the unique words in a document.
  Capitalizing the first letter of each word.
  Locating a particular unit of the text (for example, finding the third word in the document).

Many of these tasks are straightforward for English text, but are more complicated for text written in other languages. For example:
  Chinese and Japanese are written without spaces between words. This means that we can’t just look for spaces to find word boundaries. In general, we can break a line after any character, with a few exceptions called taboo or kinsoku characters. Some kinsoku characters cannot start a line, and some cannot end a line.
  Thai is also written without spaces between words. However, we must still only break lines on word boundaries. This means that we need some way to find the word boundaries, since we can’t rely on spaces or other punctuation. This is usually done using a dictionary of Thai words.
  Hindi text is written using complex ligatures, called conjuncts. For text editing, conjuncts are usually treated as a unit, even though they are represented by multiple characters. To implement cursor movement in Hindi text, we need to be able to identify the groups of characters that make up a single conjunct.
Break Iteration - Introduction
  Character boundaries: grapheme clusters
  Word boundaries: word counting, double click selection
  Line break boundaries: where to break a line
  Sentence break boundaries: sentence counting, triple click selection
  ICU class - BreakIterator

Character boundaries are also called “grapheme cluster boundaries.” They are used for cursor positioning, selection, and counting characters.

Word boundaries are used for double-click selection, moving the cursor to the next word, and counting words.

Line break boundaries are the places where it is legal to break a line. For example, “POSIX-like” is one word, but a line can break at the hyphen.

Sentence boundaries are used for triple-click selection, counting sentences, and checking whether two words are in the same sentence.

A boundary position is the zero-based index of the character following the boundary. For example, position 0 is before the first character, and position 1 is before the second character.
Break Iteration - Starting states
  Points to a boundary between two characters
  Index of character following the boundary
  Use current() to get the boundary
  Use first() to set iterator to start of text
  Use last() to set iterator to end of text

First, let’s look at some general information about BreakIterators. The iterator always points to a boundary position between two characters. The numerical value of this boundary is the zero-based index of the character following the boundary. So a boundary position of zero represents the boundary just before the first character in the text, a boundary position of one represents the boundary between the first and second characters, and so on.

We can use the current() method to get the iterator’s current position. The first() and last() methods reset the iterator’s current position to the beginning or end of the text, respectively, and return that boundary position. The beginning and end of the text are always valid boundaries.
Break Iteration - Navigation
  Use next() to move to next boundary
  Use previous() to move to previous boundary
  Returns DONE if can’t move boundary

The next() and previous() methods move the iterator to the next or previous boundary, respectively. If the iterator is already at the end of the text, next() will return DONE. If the iterator is at the start of the text, previous() will return DONE.
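A minimal sketch of this navigation, written in Java so that it runs without an ICU installation. It assumes the JDK's java.text.BreakIterator as a stand-in, whose first(), last(), next(), previous() and DONE behave as described above; the class name BoundaryWalkSketch and its methods are hypothetical names chosen for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoundaryWalkSketch {
    // Walk from the start of the text to the end, collecting every word boundary.
    static List<Integer> forward(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            boundaries.add(pos);
        }
        return boundaries;
    }

    // Walk from the end of the text back to the start.
    static List<Integer> backward(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int pos = it.last(); pos != BreakIterator.DONE; pos = it.previous()) {
            boundaries.add(pos);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        System.out.println(forward("Hi there"));
        System.out.println(backward("Hi there"));
    }
}
```

Both loops end when the iterator reports DONE, which is the only signal that there is no further boundary in that direction.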
Break Iteration - Checking a position
  Use isBoundary() to see if position is boundary
  Use preceding() to find boundary before
  Use following() to find boundary after

If we want to know whether a particular location in the text is a boundary, we can use the isBoundary() method.

We can use the preceding() and following() methods to find the closest break location before or after a given location in the text. This holds even if the given location is itself a boundary: the result is always strictly before or strictly after that location, never the location itself. If the given location is not within the text, these methods will return DONE, and reset the iterator to the beginning or end of the text.
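The three position checks can be tried out with a short Java sketch. It assumes the JDK's java.text.BreakIterator as a stand-in for the ICU class; in that stand-in, following() and preceding() return a boundary strictly after or strictly before the given offset. The class name BoundaryCheckSketch and the sample text are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class BoundaryCheckSketch {
    // Build a word iterator over the given text.
    static BreakIterator wordIterator(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        return it;
    }

    public static void main(String[] args) {
        // Word boundaries in this text fall at offsets 0, 2, 3 and 8.
        BreakIterator it = wordIterator("Hi there");

        System.out.println(it.isBoundary(2)); // offset 2 is the end of "Hi"
        System.out.println(it.isBoundary(1)); // offset 1 is inside "Hi"

        // Starting from an offset inside "there":
        System.out.println(it.following(4));  // next boundary after offset 4
        System.out.println(it.preceding(4));  // last boundary before offset 4

        // Starting from an offset that is itself a boundary:
        System.out.println(it.following(2));  // strictly after 2
        System.out.println(it.preceding(3));  // strictly before 3
    }
}
```

Because the result never equals the offset passed in, a caller who wants "the boundary at or before position p" asks for preceding(p + 1).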
Break Iteration - Opening
  Use the factory methods:

    Locale locale = …; // locale to use for break iterators
    UErrorCode status = U_ZERO_ERROR;

    BreakIterator *characterIterator =
        BreakIterator::createCharacterInstance(locale, status);
    BreakIterator *wordIterator =
        BreakIterator::createWordInstance(locale, status);
    BreakIterator *lineIterator =
        BreakIterator::createLineInstance(locale, status);
    BreakIterator *sentenceIterator =
        BreakIterator::createSentenceInstance(locale, status);

The first thing we need to do is to create an iterator. We do this using the factory methods on the BreakIterator class. Don’t forget to check the status!
Set the text
  We need to tell the iterator what text to use:

    UnicodeString text;
    readFile(file, text);
    wordIterator->setText(text);

  Reuse iterators by calling setText() again.

The iterators created by the factory methods don’t have any text associated with them, so the next thing we need to do is set the text that we want to iterate over. Assume that readFile() will read the whole file into the UnicodeString.
Break Iteration - Counting words in a file:

    int32_t countWords(BreakIterator *wordIterator, UnicodeString &text)
    {
        UErrorCode status = U_ZERO_ERROR;
        UnicodeString word;
        UnicodeSet letters(UnicodeString("[:letter:]"), status);
        int32_t wordCount = 0;
        int32_t start = wordIterator->first();

        for(int32_t end = wordIterator->next();
            end != BreakIterator::DONE;
            start = end, end = wordIterator->next())
        {
            /* the text between two adjacent boundaries */
            text.extractBetween(start, end, word);
            /* count it only if it contains at least one letter */
            if(letters.containsSome(word)) {
                wordCount += 1;
            }
        }
        return wordCount;
    }

A word iterator will also return runs of punctuation as words. We deal with this using the UnicodeSet letters, to make sure the “word” contains at least one letter. We can also add numbers to the set if we want to count them as words too.

We should really check status after creating the UnicodeSet, but it will only fail if the pattern is incorrect, or if ICU has been incorrectly installed.

Note that this code would work for characters or sentences too!
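For readers following along in Java, the same counting loop can be sketched as below. It assumes the JDK's java.text.BreakIterator in place of the ICU class, and Character.isLetter in place of the UnicodeSet test; WordCountSketch, countWords and containsLetter are hypothetical names, not part of UCount.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordCountSketch {
    // True if the range [start, end) of the text contains at least one letter.
    static boolean containsLetter(String text, int start, int end) {
        return text.substring(start, end).codePoints().anyMatch(Character::isLetter);
    }

    // Count the segments between adjacent word boundaries that contain a letter.
    static int countWords(BreakIterator wordIterator, String text) {
        wordIterator.setText(text);
        int wordCount = 0;
        int start = wordIterator.first();
        for (int end = wordIterator.next();
             end != BreakIterator.DONE;
             start = end, end = wordIterator.next()) {
            // Runs of spaces, punctuation or digits are skipped here.
            if (containsLetter(text, start, end)) {
                wordCount++;
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        System.out.println(countWords(words, "Hello, world! 123"));
    }
}
```

In this sketch "123" is not counted because it contains no letter; widening containsLetter to accept digits would count it, which mirrors adding numbers to the UnicodeSet.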
Break Iteration - Breaking lines

    int32_t previousBreak(BreakIterator *breakIterator, UnicodeString &text,
                          int32_t location)
    {
        int32_t len = text.length();

        /* skip white space, which can hang in the margin */
        while(location < len) {
            UChar c = text[location];
            if(!u_isWhitespace(c) && !u_iscntrl(c)) {
                break;
            }
            location += 1;
        }
        return breakIterator->preceding(location + 1);
    }

This function is called with the location of the first character that doesn’t fit on the line. First we skip white space, since it can hang in the margin. We then call preceding() with location + 1 rather than location, in case location is already a line break boundary: preceding() only returns boundaries strictly before its argument.
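The same line-breaking helper can be sketched in Java. This assumes the JDK's java.text.BreakIterator line instance as a stand-in for the ICU iterator; the class name LineBreakSketch is hypothetical. Unlike the C++ slide code, the sketch sets the text itself and returns the text length when only white space remains, because the JDK rejects offsets beyond the end of the text.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class LineBreakSketch {
    // 'location' is the index of the first character that does not fit on the line.
    // Returns the line break boundary at or before the first non-space character
    // found from 'location' onward.
    static int previousBreak(BreakIterator lineIterator, String text, int location) {
        lineIterator.setText(text);
        int len = text.length();
        // Skip white space and control characters, which may hang in the margin.
        while (location < len) {
            char c = text.charAt(location);
            if (!Character.isWhitespace(c) && !Character.isISOControl(c)) {
                break;
            }
            location++;
        }
        if (location >= len) {
            return len; // only white space was left: break at the end of the text
        }
        // preceding() is strict, so pass location + 1 to allow 'location' itself.
        return lineIterator.preceding(location + 1);
    }

    public static void main(String[] args) {
        BreakIterator lines = BreakIterator.getLineInstance(Locale.ENGLISH);
        String text = "The quick brown fox";
        int breakAt = previousBreak(lines, text, 12); // offset 12 is inside "brown"
        System.out.println(text.substring(0, breakAt));
        System.out.println(text.substring(breakAt));
    }
}
```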
Break Iteration - Cleaning up
  Use delete to delete the iterators

    delete characterIterator;
    delete wordIterator;
    delete lineIterator;
    delete sentenceIterator;

Again, C/C++ programs need to release allocated memory. Do delete your break iterators.
Useful Links
  Homepage: http://ibm.com/software/globalization/icu
  API documents and User guide: http://ibm.com/software/globalization/icu/documents.jsp
Getting ICU4J
  Easiest: pick a .jar file from the download section on http://ibm.com/software/globalization/icu
  Use the latest version if possible
  For sources, download the source .jar
  For bleeding edge, use the latest CVS (see site for instructions)

If you want to use ICU4J, the best solution is to download a .jar from ICU4J’s website. You can access the different ICU4J versions by going to http://oss.software.ibm.com/icu4j/download/. You can drop this file in your class path, or you can name it explicitly when starting your applications. In most cases, you’ll want to use the latest available release.

If, however, you would like to modify ICU4J, or to have access to the latest code, you need to use CVS. ICU4J is hosted in CVS, like ICU4C.
Setting up ICU4J
  Check that you have the appropriate JDK version
    ICU4J 3.0 requires JDK 1.4
  Add ICU’s jar to classpath on command line
  Run the test suite
  Try the test code (ICU4J 3.0 or later):

    import com.ibm.icu.util.ULocale;
    import com.ibm.icu.util.UResourceBundle;

    public class TestICU {
        public static void main(String[] args) {
            UResourceBundle resourceBundle =
                UResourceBundle.getBundleInstance(null, ULocale.getDefault());
        }
    }
Building ICU4J
  Need ant in addition to JDK
  Use ant to build
  We also like Eclipse Integrated Development Environment

Eclipse works very well with CVS and is used by a lot of ICU4J developers. Eclipse will allow you to easily check out ICU4J and set up the environment. Detailed instructions can be found at http://icu.sourceforge.net/docs/eclipse_howto/eclipse_howto.html.

If you do not wish to use Eclipse, you can compile and run ICU4J using the JDK and Ant. Make sure that you check which JDK version is required for the ICU4J version that you need to use. While we try to maintain compatibility with the widest possible range of JDKs, we do sometimes need to stop supporting older versions. The latest ICU4J version (3.0) requires JDK 1.4 or later.
Collation Engine
  More on collation tomorrow!
  Used for comparing strings
  Lives in com.ibm.icu.text.Collator
  Instantiation:

    ULocale locale = new ULocale("fr");
    Collator coll = Collator.getInstance(locale);
    // do useful things with the collator

Collators are used to compare strings. Globalized applications need to compare strings in a linguistically sensitive way. The collation engine in ICU4J is a port of the UCA-compliant collation engine implemented in ICU4C. However, ICU4J’s collation follows the JDK’s collation API closely, in order to allow for drop-in replacement. Data changes and bug fixes are ported from ICU4C every release.

In order to use a collator, we need to instantiate it. The collator class is com.ibm.icu.text.Collator. After the factory method returns, the collator is ready to use.

Comparing strings in a linguistically sensitive way is much more complicated than simple binary comparison. Depending on your needs, there are two main ways to use the engine: direct string comparison and sort key calculation.
String Comparison
  Works fast
  You get the result as soon as it is ready
  Use when you don’t need to compare same strings many times

    int compare(String source, String target);

String comparison takes two strings and returns the relation of those strings according to the collator. The strings will be either equal or one string will be greater than the other. This function closely resembles the binary comparison function. The ICU4J version looks like this:

    int compare(String source, String target);

Use the compare function in cases where you will not be comparing the same strings many times. The advantage of this API is that you get the result as soon as possible: if two strings differ in the first symbol, the comparison takes much less time than if they differ only in the case of the last symbol.
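A small sketch of the difference between collator comparison and binary comparison. To keep it runnable without the ICU4J jar, it assumes the JDK's java.text.Collator, whose compare() has the same shape as the ICU4J method shown above; with ICU4J on the class path the import would be com.ibm.icu.text.Collator instead. The class name CompareSketch and the helper relation are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class CompareSketch {
    // Returns -1, 0 or 1: the relation of the two strings under the locale's collator.
    static int relation(Locale locale, String source, String target) {
        Collator coll = Collator.getInstance(locale);
        return Integer.signum(coll.compare(source, target));
    }

    public static void main(String[] args) {
        // Linguistic comparison: "apple" sorts before "Banana".
        System.out.println(relation(Locale.ENGLISH, "apple", "Banana"));
        // Binary comparison: lowercase 'a' has a higher code than uppercase 'B'.
        System.out.println(Integer.signum("apple".compareTo("Banana")));
    }
}
```

In real code the collator would be created once and reused, in line with the advice to keep service objects around.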
Sort Keys
  Used when multiple comparisons are required
  Indexes in data bases
  ICU4J has two classes
  Compare only sort keys generated by the same type of a collator

When you can anticipate that the same strings are going to be compared many times, you will be better off using sort keys. A sort key is a binary representation of a string that can be used for binary comparison with other sort keys. The result of such a comparison is identical to the result of using the compare function on the original strings.

A sort key is basically a zero-terminated array of unsigned bytes. Therefore, you can store sort keys the same way as you would store any byte array. It is not uncommon to use sort keys as values in index fields.

Sort keys can only be compared with sort keys generated by a collator that has the same locale and the same settings as the original collator. Comparing sort keys from functionally different collators doesn’t make sense.
CollationKey class
  JDK compatible
  Saves the original string
  Compare keys with compareTo method
  Get the bytes with toByteArray method
  We used CollationKey as a key for a TreeMap structure

ICU4J provides two ways to use sort keys. One way is to use the encapsulation class CollationKey. This class holds the binary sort key. If you need to compare two CollationKeys, you can use the compareTo method. This class also preserves the original string. If you need to get the sort key contents, you can use the toByteArray method.
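A sketch of using collation keys as TreeMap keys, as mentioned on the slide. Since the ICU4J CollationKey is described as JDK compatible, the sketch assumes the JDK's java.text.CollationKey and java.text.Collator so that it runs without the ICU4J jar; SortKeySketch and sortWithKeys are hypothetical names.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

public class SortKeySketch {
    // Sort words by computing each sort key once and letting the TreeMap
    // order its entries with CollationKey.compareTo().
    static List<String> sortWithKeys(Collator coll, List<String> words) {
        TreeMap<CollationKey, String> sorted = new TreeMap<>();
        for (String word : words) {
            CollationKey key = coll.getCollationKey(word);
            // The key keeps the original string, so it can be recovered later.
            // Note: words whose keys compare equal overwrite each other here.
            sorted.put(key, key.getSourceString());
        }
        return new ArrayList<>(sorted.values());
    }

    public static void main(String[] args) {
        Collator coll = Collator.getInstance(Locale.ENGLISH);
        System.out.println(sortWithKeys(coll, Arrays.asList("cherry", "apple", "Banana")));
    }
}
```

All keys in the map come from the same collator, which is what makes comparing them meaningful.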
RawCollationKey class
  Does not store the original string
  Get it by using getRawCollationKey method
  Mutable class, can be reused
  Simple and lightweight

The other encapsulation class is RawCollationKey. You can get an instance of this class by using the getRawCollationKey API. This class is mutable and reusable, so it might be the better choice when you want a simple, lightweight key and do not need the original string.
Message Format - Introduction
  Assembles a user message from parts
  Some parts fixed, some supplied at runtime
  Order different for different languages:
    English: My Aunt’s pen is on the table.
    French: The pen of my Aunt is on the table.
  Pattern string defines how to assemble parts:
    English: {0}''s {2} is {1}.
    French: {2} of {0} is {1}.
  Get pattern string from resource bundle

Message formatting is the process of assembling a message from parts, some of which are fixed and some of which are variable and supplied at runtime. For example, suppose we have an application that displays the locations of things that belong to various people. It might display the message “My Aunt’s pen is on the table.”, or “My Uncle’s briefcase is in his office.”

{0}, {1} and {2} are called “format elements.” The numbers are “argument numbers” and represent the variable pieces of the data. In this example, 0 is the person, 1 is the place, and 2 is the thing.

Notice the two single quotes in the English pattern. In a pattern, a single quote starts quoted literal text, so a literal apostrophe has to be written as two single quotes. We’ll say more about this later.

We need to get the pattern from a resource because translators will need to translate it!
Message Format - Example

    String person = …; // e.g. "My Aunt"
    String place = …;  // e.g. "on the table"
    String thing = …;  // e.g. "pen"

    String pattern = resourceBundle.getString("personPlaceThing");
    MessageFormat msgFmt = new MessageFormat(pattern);
    Object arguments[] = {person, place, thing};
    String message = msgFmt.format(arguments);
    System.out.println(message);
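A self-contained variant of this example, with the two patterns from the introduction inlined instead of loaded from a resource bundle. It assumes the JDK's java.text.MessageFormat so that it runs without the ICU4J jar; the class name MessagePatternSketch and the helper assemble are hypothetical. In real code the patterns would still come from resources.

```java
import java.text.MessageFormat;

public class MessagePatternSketch {
    // Fill a three-argument pattern: {0} = person, {1} = place, {2} = thing.
    static String assemble(String pattern, String person, String place, String thing) {
        MessageFormat msgFmt = new MessageFormat(pattern);
        Object[] arguments = {person, place, thing};
        return msgFmt.format(arguments);
    }

    public static void main(String[] args) {
        // English word order; '' in the pattern produces one apostrophe.
        System.out.println(
            assemble("{0}''s {2} is {1}.", "My Aunt", "on the table", "pen"));
        // French-style word order, written in English as on the slide.
        System.out.println(
            assemble("{2} of {0} is {1}.", "my Aunt", "on the table", "The pen"));
    }
}
```

The calling code is identical for both languages; only the pattern string changes, which is why the pattern belongs in a translatable resource.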
Message Format - Different data types
  We can also format other data types, like dates
  We do this by adding a format type:

    String pattern = "On {0, date} at {0, time} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {new Date(System.currentTimeMillis()), // 0
                     "a power failure"                     // 1
                    };
    System.out.println(fmt.format(args));

  This will output:

    On Jul 17, 2004 at 2:15:08 PM there was a power failure.

The pattern is in the code to simplify the example. Patterns should really come from resources so that translators can translate them.
Message Format - Format styles
  Add a format style:

    String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {new Date(System.currentTimeMillis()), // 0
                     "a power failure"                     // 1
                    };
    System.out.println(fmt.format(args));

  This will output:

    On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.

Remember, patterns should really come from resources so that translators can translate them.
Message Format - Format style details

    Format Type   Format Style   Sample Output
    number        (none)         123,456.789
                  integer        123,457
                  currency       $123,456.79
                  percent        12%
    date          (none)         Jul 17, 2004
                  short          7/17/04
                  medium
                  long           July 17, 2004
                  full           Saturday, July 17, 2004
    time          (none)         2:15:08 PM
                  short          2:15 PM
                  medium         2:15:08 PM
                  long           2:15:08 PM PDT
Message Format – No format type
If there is no format type, data is formatted like this:
    Data Type   Sample Output
    Number      123,456.789
    Date        7/17/04 2:15 PM
    String      on the table
    others      output of toString() method
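A short sketch of the defaults in this table, assuming the JDK's java.text.MessageFormat and the US English locale (the class name is illustrative). The date case is left out of the printed expectations because its exact text varies with the runtime's locale data.

```java
import java.text.MessageFormat;
import java.util.Arrays;
import java.util.Locale;

class DefaultFormatting {
    // "{0}" has no format type, so the formatter picks one based on the argument's class.
    static String show(Object value) {
        return new MessageFormat("{0}", Locale.US).format(new Object[] {value});
    }

    public static void main(String[] args) {
        System.out.println(show(123456.789));           // Number: 123,456.789
        System.out.println(show("on the table"));       // String: on the table
        System.out.println(show(Arrays.asList(1, 2)));  // others, via toString(): [1, 2]
    }
}
```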
Message Format – Counting files
Pattern to display number of files:
    There are {1, number, integer} files in {0}.
Code to use the pattern:
    String pattern = resourceBundle.getString("fileCount");
    MessageFormat fmt = new MessageFormat(pattern);
    String directoryName = … ;
    int fileCount = … ;
    Object[] args = {directoryName, Integer.valueOf(fileCount)};
    System.out.println(fmt.format(args));
This will output messages like:
    There are 1,234 files in myDirectory.
Using what we’ve already learned, this is how we’d display the number of files in a directory.
Message Format – Problems counting files
If there’s only one file, we get:
    There are 1 files in myDirectory.
Could fix by testing for special case of one file
But, some languages need other special cases:
    Dual forms
    Different form for no files
    Etc.
This message is grammatically incorrect because it uses the plural form for a single file.
Message Format – Choice format
Choice format handles all of this
Use special format element:
    There {1, choice, 0#are no files|
                      1#is one file|
                      1<are {1, number, integer} files} in {0}.
Using this pattern with the same code we get:
    There are no files in thisDirectory.
    There is one file in thatDirectory.
    There are 1,234 files in myDirectory.
The line breaks in the pattern are only for readability.
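The choice pattern can be exercised with a small runnable sketch. It assumes the US English locale and the JDK's java.text.MessageFormat, which accepts this choice syntax; the pattern is written on one line, and the class and method names are illustrative.

```java
import java.text.MessageFormat;
import java.util.Locale;

class FileCount {
    // Same pattern as on the slide, without the readability line breaks.
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    static String describe(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        return fmt.format(new Object[] {directoryName, fileCount});
    }

    public static void main(String[] args) {
        System.out.println(describe("thisDirectory", 0));    // There are no files in thisDirectory.
        System.out.println(describe("thatDirectory", 1));    // There is one file in thatDirectory.
        System.out.println(describe("myDirectory", 1234));   // There are 1,234 files in myDirectory.
    }
}
```

The third alternative contains a nested format element, {1,number,integer}, which is formatted again with the same argument list after the choice has been made.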
Message Format – Choice format patterns
Selects a string based on a number
    If the string is a format element, process it
Splits the real number line into two or more ranges
Range specifiers separated by vertical bar ("|")
    Lower limit, separator, string
Separator indicates type of lower limit:
    Separator   Lower Limit
    #           inclusive
    ≤           inclusive
    <           exclusive
Limits can be a number or the Unicode infinity sign, ∞ (U+221E)
    Can have minus sign
The first range really starts at negative infinity, no matter what the pattern says. Note that the first two separators are equivalent. ≤ is U+2264.
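The range rules can be checked directly against a choice format object. This is a minimal sketch using the JDK's java.text.ChoiceFormat, which follows the same separator rules described here; the labels and the class name are illustrative.

```java
import java.text.ChoiceFormat;

class ChoiceRanges {
    // Three ranges: below 1, exactly 1, above 1.
    // "#" and "\u2264" (the ≤ sign) both mark an inclusive lower limit; "<" marks an exclusive one.
    static final ChoiceFormat FORMAT = new ChoiceFormat("0#none|1\u2264one|1<many");

    static String pick(double n) {
        return FORMAT.format(n);
    }

    public static void main(String[] args) {
        System.out.println(pick(-5));   // none (the first range extends down to negative infinity)
        System.out.println(pick(0.5));  // none
        System.out.println(pick(1));    // one  (inclusive lower limit)
        System.out.println(pick(1.5));  // many (anything strictly greater than 1)
    }
}
```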
Message Format – Choice pattern details
Here’s our pattern again:
    There {1, choice, 0#are no files|
                      1#is one file|
                      1<are {1, number, integer} files} in {0}.
First range is [0..1)
    Really [-∞..1)
Second range is [1..1]
Third range is (1..∞]
Message Format – Other details
Format style can be a pattern string
    Format type number: use DecimalFormat pattern
    Format type date, time: use SimpleDateFormat pattern
Quoting in patterns
    Enclose special characters in single quotes
    Use two consecutive single quotes to represent one
The special characters include the '{' character, the '#' character and the single quote (') character. Remember the English pattern we used above?
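Both points, pattern strings as format styles and quoting, can be illustrated briefly. This sketch assumes the US English locale and the JDK's java.text.MessageFormat; the patterns and the helper name are made up for the illustration.

```java
import java.text.MessageFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class PatternStyles {
    static String format(String pattern, Object... arguments) {
        return new MessageFormat(pattern, Locale.US).format(arguments);
    }

    public static void main(String[] args) {
        Date day = new GregorianCalendar(2004, Calendar.JULY, 17).getTime();

        // DecimalFormat pattern as the number style.
        System.out.println(format("Total: {0,number,#,##0.00}", 1234.5)); // Total: 1,234.50
        // SimpleDateFormat pattern as the date style.
        System.out.println(format("Date: {0,date,yyyy-MM-dd}", day));     // Date: 2004-07-17
        // '' produces one apostrophe; '{' and '}' are literal braces.
        System.out.println(format("It''s in '{'0'}': {0}", "braces"));    // It's in {0}: braces
    }
}
```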
Useful Links
Homepage: http://ibm.com/software/globalization/icu
API documents and User guide: http://ibm.com/software/globalization/icu/documents.jsp