OPS-25: Unicode and the DataServer
David Moloney, Software Architect
Agenda: Unicode deployment with OpenEdge® DataServers
  Unicode: How did we get here?
  What are its broader OpenEdge implications?
  What are its DataServer implications?
  Specific implementation in the DataServers for:
    Oracle®
    MS SQL Server
Notes:
  Do you have a multilingual OpenEdge application now? Is it supported with one database or many? With Unicode or traditional code pages?
  If so, you know that localizing for diverse markets is a complicated undertaking.
  Unicode is the best first step to multilingual coverage and a modern foundation for internationalization.
  This session is half general (Unicode and OpenEdge) and half specific to the DataServers.
  Expectation: “basic” knowledge of internationalization, OpenEdge client development, and the role of DataServers in an OpenEdge deployment strategy. If you are shy on any of these you will still get plenty, but some parts may go too quickly.
  There is too much to cover!
Code Pages
  ASCII: 7-bit character set, code points 0-127
    Control characters:  9 \t (Tab), 10 \n (NL), 13 \r (CR)
    Special characters:  32 Space, 33 !, 34 ", 35 #, ..., 37 %
    Upper case:          65 A, 66 B, 67 C, 68 D, 69 E, 70 F, 71 G, 72 ...
    Lower case:          97 a, 98 b, 99 c, 100 d, 101 ...
    ... 125, 126, 127
  Extended ASCII: 8-bit character sets, code points 128-255
    128 €, 129 (undefined), 130 ‚, 131 ƒ, 132 „, 133 ..., ü, 253 ý, 254, 255
    Extended character sets: ISO8859-1, 1250, IBM437/850, ...
Notes:
  Code pages map characters to numbers (code points).
  1963: 7-bit ASCII; 8-bit EBCDIC.
  The eighth ASCII bit was supposed to be an error check. Instead it became the extended character set (non-English). The low-order 7 bits stayed common.
  Several 8-bit extended character set standards exist: ISO, Windows, IBM.
8-bit Code Pages
Examples of character encoding (hex):
  Char  ISO8859-1  ISO8859-2  1252  1250  IBM437  IBM850  IBM852
  a     61         61         61    61    61      61      61
  á     E1         E1         E1    E1    A0      A0      A0
  È     C8         n/a        C8    n/a   n/a     D4      n/a
  Č     n/a        C8         n/a   C8    n/a     n/a     AC
  “     n/a        n/a        93    93    n/a     n/a     n/a
Notes:
  “a” is part of 7-bit ASCII and has the same code point in all code pages.
  Accented “a” has different code points in different code pages.
  Not all code pages have all characters; some characters only exist in some code pages.
  Competing interests = competing standards.
  An 8-bit code page can't contain all characters for all languages (only Unicode can).
  Incompatible code pages and misguided translations lead to corruption.
Avoid This: Data Corruption
  France (ISO8859-1):     “è” = E8
  Czech Republic (1250):  “č” = E8
Notes:
  You cannot just transfer data between computers; you need to convert it between code pages. A code point needs context.
  What if you could deploy your application around the world without any risk of data corruption due to code pages?
  Starting with OpenEdge 10, you can, by using Unicode!
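As a small illustration of why a code point needs context, the sketch below reads the byte E8 under both code pages and then converts through Unicode instead of copying bytes. It uses Python's standard codecs (`latin-1` for ISO8859-1, `cp1250`, `cp850`) as stand-ins for the OpenEdge conversion tables; the `convert` helper is a hypothetical name, not an OpenEdge API.

```python
# The same byte means different characters under different code pages.
raw = bytes([0xE8])
as_iso8859_1 = raw.decode("latin-1")  # how the French machine reads it
as_1250 = raw.decode("cp1250")        # how the Czech machine reads it
print(as_iso8859_1, as_1250)


def convert(data: bytes, source_cp: str, target_cp: str) -> bytes:
    """Convert bytes between code pages by going through Unicode.

    Raises UnicodeEncodeError when the target code page has no such
    character, instead of silently storing the wrong one.
    """
    return data.decode(source_cp).encode(target_cp)


# An accented "a" keeps its identity but changes its byte value.
print(convert(b"\xe1", "latin-1", "cp850").hex())
```

Copying the raw byte would have turned the French character into the Czech one without any warning; converting either yields the right byte or fails loudly.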
What is Unicode? (“Unique Code”)
A character encoding standard that:
  Replaces all legacy SBCS and MBCS systems
  Can assign more than a million numbers
    Code points run from U+0000 to U+10FFFF: 2^20 + 2^16 = 1,114,112 code points
  Gives one “unique” number to each text symbol or character
  Provides one internationalization process
  Is not platform, program, country or language specific
  Is essential to the Web (HTML, XML, etc.)
Notes:
  Unicode continues to expand. Unicode version 5.1 (4/2008): 240,295 assigned code points; 873,817 unassigned.
  Good news for the “just speak English” crowd: Unicode has an upper limit, but languages are going extinct faster than business demand creates new glyph code points. Commerce trumps culture.
How is Unicode encoded? The Encoding Tradeoff
“UTF-x”
  UTF = Unicode Transformation Format
  x = size of the code unit in bits (the minimum encoded length of a character)
Unicode code space (1,114,112 code points):
  U+0000 - U+00FF      Extended ASCII (ISO8859-1)
  U+0000 - U+FFFF      BMP
  U+10000 - U+10FFFF   Supplementary range
  Char  ANSI number  ANSI hex  Unicode  Range
  ÿ     255          0xFF      U+00FF   Latin-1 Supplement
The tradeoff: ease of use (UTF-32) against storage space (UTF-8), with UTF-16 in between.
Notes:
  A UTF is not a code page (a code page assigns code points directly to characters). NOTE: If you hear me say “Unicode CP”, I mean “Unicode encoding”.
  “U+” marks a virtual code point of the Unicode “universal” character set. Unicode is a “virtual” code page with the range 0 to 0x10FFFF; U+10FFFF is the highest “virtual” code point.
  A UTF is an encoding format for the Unicode character set. The actual encoded value for a code point depends on the UTF form.
  UTF-32: 32-bit code units, one unit per code point. Fixed size helps sorting.
  UTF-16: 16-bit code units, 1 or 2 units per code point. Fixed size within the BMP helps sorting.
  UTF-8: 8-bit code units (bytes), 1 to 4 units per code point.
  There are other UTF encoding formats. UCS-2 = UTF-16 restricted to the BMP range.
  Encoding tradeoff: in UTF-32, 11 bits are ALWAYS unused, because only 21 bits of a 32-bit word ever store a Unicode code point value. UTF-16 tries to maximize use of the BMP, the first 65,536 code points, with smaller code units (Western, European, Arabic, Hebrew, most Asian). The 8-bit “variable width” UTF-8 can affect performance and ease of use, especially for sorting.
UTF Encoding Examples
  Unicode   UTF-8        UTF-16       UTF-32       Range
  U+004D    4D           00 4D        00 00 00 4D  BMP
  U+00A1    C2 A1        00 A1        00 00 00 A1  BMP
  U+00E1    C3 A1        00 E1        00 00 00 E1  BMP
  U+0470    D1 B0        04 70        00 00 04 70  BMP
  U+4E9C    E4 BA 9C     4E 9C        00 00 4E 9C  BMP
  U+10302   F0 90 8C 82  D8 00 DF 02  00 01 03 02  Supplementary
Notes:
  Notice that “M” is exactly the same in UTF-8 and ISO8859-1.
  UTF-8:
    1 byte  = 7-bit ASCII: 0-127
    2 bytes = European (except ASCII), Arabic, Hebrew; the Latin-1 Supplement code points are ISO8859-compatible
    3 bytes = Indic, Thai, Chinese, Japanese, Korean, the Euro sign
    4 bytes = supplementary characters: additional Chinese, Japanese, Korean and historic characters, plus musical and math symbols
  UTF-16 matches the Unicode code points in the BMP.
  Beyond the BMP, UTF-8 and UTF-16 both take 4 bytes.
  There are only 2048 UTF-16 “surrogate” code points. Used in pairs, they represent the “supplementary” characters for Chinese, Japanese, Korean, plus historic, music and math symbols.
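The encodings on this slide can be recomputed rather than looked up. The following is a minimal sketch using Python's built-in codecs; the helper names `hex_bytes` and `utf_forms` are illustrative only, and big-endian forms are chosen so the bytes read in the same order as the code point.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def utf_forms(code_point: int) -> tuple:
    """Return the (UTF-8, UTF-16, UTF-32) encodings of one code point."""
    ch = chr(code_point)
    return (
        hex_bytes(ch.encode("utf-8")),
        hex_bytes(ch.encode("utf-16-be")),
        hex_bytes(ch.encode("utf-32-be")),
    )


# The code points used on the slide.
for cp in (0x004D, 0x00A1, 0x00E1, 0x0470, 0x4E9C, 0x10302):
    utf8, utf16, utf32 = utf_forms(cp)
    print(f"U+{cp:04X}  {utf8:<12} {utf16:<12} {utf32}")
```

Running it against any other code point is a quick way to check how much a character widens in each encoding.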
UTF Encoding Examples: “standard” and “modified” UTF-8
  U+10302 under the Oracle NLS_LANG character sets:
    UTF8      3-byte “modified”:  ED A0 80 ED BC 82
    AL32UTF8  4-byte “standard”:  F0 90 8C 82
Notes:
  Careful: there are 2 UTF-8 encodings, “standard” and “modified”.
  The UTF Encoding Examples chart shows standard UTF-8, where there is no such thing as a surrogate pair for a supplementary character.
  OpenEdge uses “standard” (4-byte) UTF-8.
  “Modified” (3-byte) UTF-8 encodes supplementary characters outside the BMP as surrogate pairs. A surrogate pair is two 2-byte code units in UTF-16, which become two 3-byte sequences (6 bytes) in “modified” UTF-8.
  Oracle supports “standard” AL32UTF8 and “modified” UTF8.
  MSS supports UCS-2 in the BMP but can store UTF-16 characters.
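To make the difference between the two UTF-8 flavours concrete, here is a hedged sketch that encodes one supplementary character both ways. The “modified” form is approximated by encoding each UTF-16 code unit separately (the scheme often called CESU-8); `modified_utf8` is a hypothetical helper, not an Oracle or OpenEdge function.

```python
def standard_utf8(ch: str) -> bytes:
    """Standard UTF-8: a supplementary character is one 4-byte sequence."""
    return ch.encode("utf-8")


def modified_utf8(ch: str) -> bytes:
    """Surrogate-pair style UTF-8: each UTF-16 code unit is encoded on its own.

    A supplementary character therefore becomes two 3-byte sequences.
    """
    utf16 = ch.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python encode a lone surrogate code unit.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010302"  # the supplementary character from the slide
print(standard_utf8(ch).hex(" "))
print(modified_utf8(ch).hex(" "))
```

For characters inside the BMP both functions return the same bytes, which is why the difference only shows up once supplementary characters reach the database.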
Unicode Conversion
  All code pages convert to Unicode
  Unicode may not convert to other code pages
  Diagram: Unicode in the center, with conversions to and from IBM437, IBM850, IBM852, 1250, 1252, ISO8859-1 and ISO8859-2. The example shows “ü” on one side of a conversion and “?” on the other.
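The asymmetry on this slide can be checked with a short sketch. It assumes Python's `cp437`, `cp852` and `latin-1` codecs are close enough to the corresponding OpenEdge conversion tables for illustration; `round_trips` is a hypothetical helper.

```python
def round_trips(text: str, codepage: str) -> bool:
    """True when every character of text exists in the given code page."""
    try:
        return text.encode(codepage).decode(codepage) == text
    except UnicodeEncodeError:
        return False


# Every byte of these 8-bit code pages converts to some Unicode character.
for cp in ("cp437", "latin-1"):
    assert len(bytes(range(256)).decode(cp)) == 256

# The reverse direction depends on the target code page.
print(round_trips("\u00fc", "cp437"))  # u with diaeresis
print(round_trips("\u010c", "cp437"))  # C with caron
print(round_trips("\u010c", "cp852"))
```

A conversion layer has to decide what to do in the failing case, and substituting “?” is one common choice.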
Agenda: The path to successful development & deployment
  Unicode: How did we get here?
  What are its broader OpenEdge implications?
  What are its DataServer implications?
  Specific implementation in the DataServers for:
    Oracle
    MS SQL Server
Notes:
  OpenEdge 10 introduces UTF-8 Unicode (UTF-16 is only supported internally).
  Natural fit: UTF-8 and ISO8859-1 (the OpenEdge default) are fully compatible.
  Whose OpenEdge application uses the default OpenEdge code page, ISO8859-1? You're in luck: change your database and application code pages to UTF-8 and you have a Unicode application!
  But does this mean you're ready to go with Unicode?
The Unicode “Solution”?
  YES! One-stop shopping for internationalization!
  NO, there are considerations to be addressed:
    Operating system
    Web server (XML schemas and HTML)
    Print drivers
    Data from/to other systems
    OCXs
    Terminal emulators
Notes:
  Yes: a “global” app = an internationalized app = a “localization-capable” app (language, collation, settings). Unicode is the first step to achieving “global” capability.
  No: internationalization is not just the Unicode character set. You must deal with data widening, fonts and encoding standards. Environment setup and OpenEdge application settings still matter, just like before. Configuration dangers are lurking outside your app, just like before. You must decide what's right for your application.
OpenEdge Globalization Settings
For more info, see the “Internationalizing Applications” guide.
  Primary parameters   Secondary parameters   Database settings
  -cpinternal          -cplog                 _db._db-xl-name
  -cpstream            -cpterm                _db._db-coll-name
  -cpcoll              -cpprint
  -d                   -numsep
  -E                   -numdec
  Also listed: -cprcodein, -cprcodeout, -lng
  Existing OpenEdge constructs: Convmap.cp (character processing tables), Progress.ini, fonts
  New OpenEdge construct: ICU library, for linguistic sorting
Notes:
  Parameters for “globalization” are essentially the same as the “language settings” prior to Unicode.
  Convmap stores most UTF-8 conversion pages by default.
  One addition affects how you set the client -cpcoll parameter and the _db-coll-name metaschema field: the ICU (International Components for Unicode) library, which has 54 ICU collations for UTF-8. The icuil8n.dll ICU library is linked into OpenEdge to support this.
Common Mistakes: loading or importing data with the wrong code page
  Bytes in the file: C4 8C 7A 65 63 68
    Read as UTF-8:      Čzech
    Read as 1250:       ÄŚzech
    Read as ISO8859-1:  Äzech
  Files are just a sequence of bytes.
  A byte by itself does not tell the encoding.
  If we don't know the encoding, the data cannot be decoded.
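The three readings of the same six bytes can be reproduced directly. This is a minimal sketch with Python's standard codecs standing in for the -cpstream setting used at load time.

```python
# The six bytes from the slide; nothing in them says which encoding they use.
raw = bytes.fromhex("C4 8C 7A 65 63 68")

as_utf8 = raw.decode("utf-8")          # the encoding the writer intended
as_1250 = raw.decode("cp1250")         # C4 and 8C each become a character
as_iso8859_1 = raw.decode("latin-1")   # 8C becomes an invisible control code

for label, text in (("UTF-8", as_utf8), ("1250", as_1250), ("ISO8859-1", as_iso8859_1)):
    print(f"{label:<10} {len(text)} characters: {text!r}")
```

Only the reader that was told the right encoding gets a five-character word back; the other two get six characters and no error.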
Caution! Byte Order Mark (BOM)
  A UTF-8 file that starts with a BOM (EF BB BF C4 ... 68) reads as “Čzech” whether -cpstream is 1250, ISO8859-1 or UTF-8.
  Sample:
    OUTPUT TO text.txt CONVERT TARGET "UTF-8".
    PUT CONTROL "~357~273~277". /* BOM */
    PUT UNFORMATTED "UTF-8 text".
    OUTPUT CLOSE.
Notes:
  Write a BOM (Byte Order Mark): it is important if the minimum UTF code unit is larger than 1 byte (UTF-16 and UTF-32). These “canonical” multi-byte forms require knowledge of the “endian” format.
  Endian = machine byte ordering. “Big” = first byte is the most significant; “little” = first byte is the least significant.
  For UTF-8 (and all UTF encodings), the BOM is also used to detect the presence of a Unicode encoding, especially in Windows apps.
  Progress understands BOMs when reading. The BOM overrules -cpstream.
  CAUTION: This may hide problems. A user might think it was his -cpstream setting that read the file when actually it was the BOM. Always check whether the file has a BOM. If it does, any -cpstream will work because of the override, but using the correct -cpstream is still recommended.
  You may also need to output a BOM for Notepad and other apps (see the sample).
  NOTE: The BOM is encoded differently in each UTF encoding.
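Checking a file for a BOM before trusting -cpstream can be sketched as follows. The BOM constants come from Python's `codecs` module; `sniff_bom` is a hypothetical helper and says nothing about how OpenEdge itself detects the mark.

```python
import codecs

# Longer marks first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, payload) when data starts with a BOM, else (None, data)."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data


sample = codecs.BOM_UTF8 + "UTF-8 text".encode("utf-8")
encoding, payload = sniff_bom(sample)
print(encoding, payload.decode(encoding))
```

The octal string in the slide's PUT CONTROL statement, 357 273 277, is the same three bytes as `codecs.BOM_UTF8` (EF BB BF).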
Common Mistakes: loading or importing data with the wrong code page
  End of a .d file:
    (…)
    "imuller" "Ian Muller" "Y" "C" 1657 283200
    "jdoe" "Jane Doe" "N" "U" 3275 450010
    "jsmith" "John Smith" "Y" "C" 1450 323700
    "jsanchez" "Juan Sánchez" "Y" "C" 4250 323900
    .
    PSC
    filename=users
    records=0000000001133
    ldbname=mydatabase
    timestamp=2007/03/28-20:55:03
    numformat=44,46
    dateformat=mdy-1950
    map=NO-MAP
    cpstream=ISO8859-1
    0000143373
Notes:
  .d files written by OpenEdge store the code page of the data with the data.
  Data Administration asks for the code page of the data we load, not the code page we want the data converted to.
  In general:
    Using the OUTPUT TO statement, know the encoding format of the exported data.
    Using the INPUT FROM statement, know the encoding format of the imported data.
    Use the CONVERT keyword to specify the output target or input source and override the -cpstream default:
      OUTPUT TO file CONVERT TARGET "UTF-8".
      INPUT FROM file CONVERT SOURCE "UTF-8".
Common Mistakes: updating data with the wrong code page
  Client machine, OS code page 1252: the user types “à” (E0)
  _progres client, -cpstream IBM850, -cpinternal IBM850: E0 is taken to be “Ó”
  _mprosrv server, -cpinternal ISO8859-1, database _db-xl-name ISO8859-1: “Ó” is stored as D3
  Reading back: D3 returns to the client as E0, and the 1252 screen shows “à”
Notes:
  -cpstream governs the interfaces with external files and data. Set it incorrectly and it corrupts.
  RULE OF THUMB: set -cpstream to match the operating system code page.
  Here an accented “a” is input from a client machine whose OS code page is 1252, but -cpstream and -cpinternal are both IBM850, so the OpenEdge client assumes the data is in that code page.
  Because the OpenEdge client code page differs from that of the database server, “à” becomes “Ó”. Reading it back to the same client produces “à” again.
  The user is unaware that the database is storing a bad character. The client thinks everything is fine, but the data in the database is wrong. Two wrongs make a right!
Common Mistakes: updating data with the CORRECT code page
  Client machine, OS code page 1252: the user types “à” (E0)
  _progres client, -cpstream 1252, -cpinternal IBM850: “à” is held as 85
  _mprosrv server, -cpinternal ISO8859-1, database _db-xl-name ISO8859-1: “à” is stored as E0
Notes:
  Matching -cpstream to the OS code page causes the proper conversion (-cpstream to -cpinternal) on the client BEFORE the data goes to the server and into storage.
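The wrong and the correct configurations can be traced byte by byte. The sketch below is a simplified model, assuming the client decodes keyboard input with -cpstream, holds it in -cpinternal, and the server converts to the database code page; `store` is a hypothetical helper and Python codecs (`cp1252`, `cp850`, `latin-1`) stand in for the OpenEdge tables.

```python
def store(keyboard: bytes, cpstream: str, cpinternal: str, db_cp: str):
    """Return (bytes held by the client, bytes stored in the database)."""
    text = keyboard.decode(cpstream)        # what the client believes was typed
    internal = text.encode(cpinternal)      # client memory representation
    stored = internal.decode(cpinternal).encode(db_cp)
    return internal, stored


typed = "\u00e0".encode("cp1252")           # a-grave typed on a 1252 machine

wrong = store(typed, cpstream="cp850", cpinternal="cp850", db_cp="latin-1")
right = store(typed, cpstream="cp1252", cpinternal="cp850", db_cp="latin-1")

print("wrong -cpstream:", wrong[0].hex(), "->", wrong[1].hex())
print("right -cpstream:", right[0].hex(), "->", right[1].hex())

# Reading the wrongly stored byte back through the same wrong settings.
seen_again = wrong[1].decode("latin-1").encode("cp850").decode("cp1252")
print("client sees:", seen_again)
```

The last line is the trap described in the notes: the original client sees the right character again, so the corruption only surfaces when another correctly configured client or tool reads the database.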
Real Life Story: ASCII linefeed (0x0A) to EBCDIC newline (0x25)
  DataServer for ODBC; schema holder _db-xl-name IBM037 (EBCDIC)
  OpenEdge client: -cpstream iso8859-1, -cpinternal iso8859-1 (ASCII)
  Text entered: Hi Bob,CRLF How are you?CRLF Bye (CRLF = 0D 0A)
  0x0A arrives in the IBM037 data source still as 0x0A
  Result: Hi Bob,▐How are you?▐Bye
Notes:
  ISO8859-1 and EBCDIC IBM037 are not compatible code pages. IBM037 has platform-specific control characters, like the newline character.
  Here the user was cutting and pasting formatted text into their Windows OpenEdge app.
  CRLF in iso8859-1 (-cpstream and -cpinternal) is 0D 0A.
  Going from the ISO8859-1 -cpstream source to IBM037:
    0D (carriage return) is dropped in translation
    0A (linefeed) is left untranslated
  0A is the proper linefeed on Windows and Unix platforms, but it is not the proper “newline” control on EBCDIC IBM platforms.
  NOTE: You really shouldn't be storing control characters that have a machine dependency anyway.
Real Life Story: ASCII linefeed (0x0A) to EBCDIC newline (0x25), fixed
  DataServer for ODBC; schema holder _db-xl-name IBM037 (EBCDIC)
  OpenEdge client: -cpstream IBM850, -cpinternal IBM850 (ASCII)
  Text entered: Hi Bob,CRLF How are you?CRLF Bye (CRLF = 0D 0A)
  0x0A arrives in the IBM037 data source as 0x25
  Result:
    Hi Bob,
    How are you?
    Bye
Notes:
  By setting -cpstream and -cpinternal of the OpenEdge client to IBM850, the right translation occurs.
  IBM850 and IBM037 are compatible across the Windows and IBM platforms where this installation existed.
  The Windows “linefeed” was converted to the platform-specific “newline” control character.
Tips & Hints: un-corrupting data
  ISO8859-1 database with data encoded in IBM850
  Run in a session with -cpinternal iso8859-1:
    FOR EACH myTable EXCLUSIVE-LOCK:
        RUN FixChar(INPUT-OUTPUT myTable.myField).
    END.

    PROCEDURE FixChar:
        DEF INPUT-OUTPUT PARAM c AS CHAR NO-UNDO.
        c = CODEPAGE-CONVERT(c, "IBM850", "ISO8859-1").
    END PROCEDURE.
Notes:
  There are 2 types of errors: filling a void and reversing a curse.
  Filling a void: this last case is a translation that should have occurred but didn't (no conversion from iso8859-1 to ibm850). The code above manually performs such a conversion.
  Reversing a curse: in the previous case, a bad conversion took place between iso8859-1 and IBM850. Solution: apply the bad conversion in reverse. Again use CODEPAGE-CONVERT(string, target code page, source code page).
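As a rough analogue of the FixChar procedure, the sketch below applies the same conversion direction (source ISO8859-1, target IBM850) to a stored value. `reconvert` is a hypothetical helper that mirrors the argument order of the slide's CODEPAGE-CONVERT call; it assumes Python's `latin-1` and `cp850` codecs behave like the corresponding OpenEdge tables.

```python
def reconvert(stored: bytes, target_cp: str, source_cp: str) -> bytes:
    """Reinterpret stored bytes as source_cp text and re-encode it in target_cp."""
    return stored.decode(source_cp).encode(target_cp)


# D3 is what the misconfigured client left in the ISO8859-1 database
# after the user typed an a-grave.
repaired = reconvert(b"\xd3", "cp850", "latin-1")
print(repaired.hex(), repaired.decode("latin-1"))

# Plain ASCII content is unaffected, so rerunning over clean rows is harmless
# for ASCII data but would damage accented data that was already correct.
print(reconvert(b"Jane Doe", "cp850", "latin-1"))
```

Because the repair cannot tell corrupted rows from correct ones, it should be limited to rows known to have been written by the misconfigured client.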
Database Sorting Rules: they are not all the same
  FOR EACH table WHERE name <= CHR(126):
  FOR EACH table WHERE name >= CHR(126):
  Sort order by collation:
    -cpinternal / MSS 1252:             # $ ~ Alphanumerics
    _Db._Db-collate Iso8859-1 Basic:    # $ Alphanumerics ~
Notes:
  Collations can also cause problems, not just code pages.
  Rule: when changing collation, the source and target collations should be based on the same code page to avoid data loss from client selection.
  ICU collations are all based on the Unicode character set but still have different sort criteria. This highlights another fact: the database collation may not match our ICU collation.
  In this DataServer example, the customer used the iso8859-1 basic collation to determine that the high-order character in his database was code point 126 (the tilde), where it sorts high. But in code page 1252, the tilde sorts low.
  1st query: returns alphanumeric data in the WHERE bracket, but client selection on 1252 throws the rows out.
  2nd query: returns no data in the WHERE bracket.
  Both queries curiously return no data!
Agenda: The path to successful development & deployment
  Unicode: How did we get here?
  What are its broader OpenEdge implications?
  What are its DataServer implications?
  Specific implementation in the DataServers for:
    Oracle
    MS SQL Server
Notes:
  Now we'll discuss DataServers in relation to OpenEdge and Unicode.
  How many of you have OpenEdge applications that use DataServers?
  Most of the European crowd is probably already doing some multi-cultural support in their DataServer apps.
  But you can't get full Unicode-enabled international application support without Unicode support in the schema holder and in the DataServer interface to the client application.
Under Development
D I S C L A I M E R
  This talk includes information about potential future products and/or product enhancements. What I am going to say reflects our current thinking, but the information contained herein is preliminary and subject to change. Any future products we ultimately deliver may be materially different from what is described here.
Unicode Deliverables
  10.0A     Unicode
  10.0B     ICU Collation
  10.1B03   Unicode for MSS DataSrvr (limited)
  10.1C     Unicode for MSS + Oracle DataSrvr + CLOBs
  10.1C01   Oracle NCLOB Support
  Future    MSS CLOB Support + CLOB Params To Stored Proc.’s
  For 10.0B03: Contact your Technical Support Representative
OpenEdge Settings: _db-xl-name, -cpinternal and -cpstream
  Diagram: the OpenEdge database (_db-xl-name) connects to the OpenEdge process (-cpinternal). Through -cpstream the process connects to the keyboard, the screen (GUI or CHUI), the printer and OS files. Arrows mark the OpenEdge code page conversions.
Notes:
  A simple OpenEdge client connected to an OpenEdge database.
  The blue arrows represent translation.
  NOTE: _db-xl-name represents the code page of the database. Conversion takes place between -cpinternal and the database code page.
OpenEdge Settings: _db-xl-name, -cpinternal and -cpstream (DataServer configuration)
  Diagram: keyboard, screen (GUI or CHUI), printer and OS files connect through -cpstream to the OpenEdge process (-cpinternal), where OpenEdge code page conversions occur. The schema holder (_db-xl-name) stands in for the OpenEdge database. The DataServer layer or process reaches the foreign data source (database CP) through the DB driver, where driver conversions may occur. Two “needs to match” links tie the schema holder code page to the driver and the database CP.
Notes:
  In the equivalent “DataServer” configuration, a “schema holder” is introduced to act as the OpenEdge database to the DataServer (it holds the schema of the foreign data source).
  A separate connection adds a driver interface and a foreign database.
  The same RULE applies as always: the schema image code page must “mirror” the code page of the “foreign data source”. So the _Db._db-xl-name metaschema field in the schema holder should match the real foreign storage.
  More exactly, the schema image code page must match whatever format the data is presented in to the DataServer through the driver interface:
    the code page of the foreign data source when the driver does no translation
    the code page of the driver interface when the driver does translation
  NOTE: Translation is reduced if -cpinternal also matches _db-xl-name.
OpenEdge Settings (complex configuration)
  Diagram components:
    WebSpeed™: web browser to _progres -web (-cpinternal, -cpstream, OS files)
    AppServer™: _proapsv (-cpinternal, -cpstream, OS files)
    GUI client: prowin32 (-cpinternal, -cpstream; keyboard, screen, printer, OS files)
    CHUI client: _progres (-cpinternal, -cpstream; keyboard, screen, printer, OS files)
    DataServer: _orasrv (-cpinternal) to the driver to the ORACLE database
    Schema holders (_db-xl-name), each marked “match” against the data it fronts
Notes:
  In complex arrangements, the DataServer requirements don't change.
  The -cpinternal values between OpenEdge components carry out the translations.
  Attached database components still rely on the appropriate setting of _db-xl-name, based on the storage format.
Dictionary Utilities Changed for Unicode (for both Oracle and MS SQL Server)
  Schema Migration * (including Unicode batch mode parameters)
  Update/Add Table Definitions +
  Verify Table Definitions +
  Adjust Schema +
  Generate delta.sql *
  Dump as Create Table Statement *
  * “Use Unicode Types” GUI selection provided
  + Modified to handle Unicode types internally
  NOTE: The Dictionary “Field Editor” dialog will display the Unicode types where applicable in the foreign type label.
Comparing 10.1C Unicode: Oracle vs. MSS
  Unicode Definitions
    OpenEdge: DB-Codepage (_db._db-xl-name)
    ORACLE:   DB-Codepage
  Data Types
    OpenEdge: CHAR, LONGCHAR, CLOB
    ORACLE:   CHAR, VARCHAR2, LONG, CLOB; NCHAR, NVARCHAR2, NCLOB (in 10.1C01)
    MSS:      NCHAR, NVARCHAR, NVARCHAR(max) and NTEXT, mapped to OpenEdge CHAR
  Max. Char Size
    OpenEdge: CHAR: 30,000 bytes; LONGCHAR/CLOB: 1G
    ORACLE:   CHAR types: 4000 bytes; CLOB types: 4G
    MSS:      CHAR types: 8000 bytes; CLOB types: 2G
  Max. Char Size for Unicode
    OpenEdge: same as above, but CHAR: 15,000 bytes using the MSS DataServer
    ORACLE:   4000 bytes
    MSS:      4000 chars
  Semantics
    OpenEdge: Character
    ORACLE:   Character or Byte
    MSS:      (double-byte) Character
  Driver Settings
    OpenEdge: N/A
    ORACLE:   NLS_LANG=.AL32UTF8
    MSS:      ACP = Active Code Page
  Database Code Pages
    OpenEdge: UTF-8
    ORACLE:   NLS_CHARACTERSETS: AL32UTF8 & UTF8; NLS_NCHAR_CHARACTERSETS: AL16UTF16 or UTF8
    MSS:      UCS-2 (partial UTF-16)
Attribute Notes:
  OpenEdge Unicode storage is homogeneous. Oracle and MSS storage can be heterogeneous. An OpenEdge DataServer migration produces a homogeneous foreign database.
  MSS doesn't support CLOBs, so NVARCHAR(max) and NTEXT map to CHAR.
  CHAR size is undefined in OpenEdge. SQL databases have maximum lengths (and are probably SBCS-based).
  The maximum size of an OpenEdge CHAR holding Unicode for an MSS DataServer is 15K due to double-byte expansion.
  Note: Double-byte expansion reduces the MSS 8000-byte CHAR limit to 4000 characters.
  Oracle can define CHAR size in bytes or characters only when using the database Unicode encoding.
  NLS_LANG is a very important driver setting for the Oracle Unicode implementation.
  Oracle can have alternate character sets for the database and for columns.
  Oracle Unicode types (NCHAR) can support fixed-width encodings; the database character set can't.
  MSS supports UCS-2, the 65,536 code points within the BMP.
  MSS does NOT natively handle supplementary characters. It treats them as a pair of undefined Unicode surrogate characters which, combined, form a supplementary character.
  MSS can store supplementary characters without risk of loss or corruption.
  MSS can have different character sets per column.
Common Unicode Requirements: DataServer Migration
  Diagram: OpenEdge process (-cpinternal UTF-8, -cpstream UTF-8) with OpenEdge code page conversions; schema holder (_db-xl-name UTF-8); DataServer layer or process; DB driver (driver conversions?); foreign data source (database CP). Two “needs to match” links tie the schema holder to the driver and the database CP.
  Source side: PRODB database (_db-xl-name ANSI or UTF-8) and .d files (cpstream=ISO8859-1, cpstream=ISO8859-5)
  Build from: $DLC/prolang/utf/empty
  Recommended: set the $DLCDB environment variable to $DLC/prolang/utf
Notes:
  Considerations for the database being ported.
  NOTE: Migration produces homogeneously Unicode databases, whereas a pull operation from a foreign database might pull Unicode and non-Unicode types heterogeneously into a DataServer schema.
  Common requirements:
    Any OpenEdge database can be migrated to Unicode as long as its code page is defined to convmap.
    The OpenEdge DataServer schema holder must be derived from a UTF-8 OpenEdge database.
    The _db-xl-name metaschema field (the schema holder code page, SHCP) needs to be defined as “UTF-8”.
    Character data in the foreign data source that is defined as Unicode must be received into the DataServer through the interface driver as UTF-8 data (that is, it should “match” _db-xl-name), EXCEPT for MSS, where it is known in advance that UCS-2/UTF-16 is passed and must be converted.
    SQL character length must be determined with respect to Unicode characters.
    Consider the potential impact of data widening during conversions (iso8859-1 to UCS-2, the character content of the specific locales you will support, etc.).
    The SQL Width tool is highly recommended for sizing OpenEdge data. It recognizes Unicode data if you convert prior to migration.
    Consider how you will deal with large character data: when should characters be expanded to CLOBs or CLOB-like data types? Options are limited in MSS without CLOB support.
    Consider setting -cpinternal and -cpstream to UTF-8 to reduce conversion requirements.
    The database you are porting to must be compliant (Oracle 9i or MSS 2005 or greater).
Agenda: The path to successful development & deployment
  Unicode: How did we get here?
  What are its broader OpenEdge implications?
  What are its DataServer implications?
  Specific implementation in the DataServers for:
    Oracle
    MS SQL Server
Notes:
  Has anybody here already attempted to support Unicode in their Oracle DataServer?
  You can already do this by using a Unicode database code page and function-based indexes to do sorting, etc.
  But you can't get full application support without support for Unicode (Unicode column types, etc.) in the schema holder and in the DataServer code that interfaces with the client application.
Oracle DataServer Migration: _db-xl-name, -cpinternal and -cpstream
  Diagram: OpenEdge process (-cpinternal UTF-8, -cpstream UTF-8) with OpenEdge conversions; schema holder (_db-xl-name UTF-8); 10.1C ORACLE DataServer layer or process; OCI client library (NLS_LANG=.AL32UTF8) with driver conversions; ORACLE 9i+ database with a database charset (VARCHAR, CLOB, CFILE) and a national charset (NVARCHAR, NCLOB). The schema holder must match the OCI client library.
  Source side: database (_db-xl-name ANSI or UTF-8) and .d files (cpstream=ISO8859-1, cpstream=ISO8859-5)
Notes:
  Oracle requirements:
    Consider whether you will use a database code page or Unicode types to store Unicode data.
    NOTE: If the database code page is UTF8 or if Unicode types (AL16UTF16 or UTF8) are used, consider widening/narrowing.
    When the schema image code page is UTF-8, NLS_LANG must be set for Unicode or the Oracle DataServer will fail to connect. Remember: the schema matches the driver (OCI) in this case.
      NLS_LANG=UTF8: surrogate code points are 2 UTF8 characters and 6 bytes
      NLS_LANG=AL32UTF8: matches OpenEdge UTF-8
    NLS_LANG driver conversions happen at the driver or at the server, depending on circumstances.
    If you use Unicode types, ensure that the NCHAR charset is a subset of your database charset. The database sometimes performs conversions between the two, depending on the setting of the CHARSET_FORM attribute. Also, literal text in SQL can only be represented in the database code page, so NCHAR types must be convertible to CHAR types.
    Also note that CLOBs are stored in UTF-16.
Oracle Unicode Migration (migration settings)
  What version of ORACLE: the Unicode instance and Unicode drivers must be 9i or above
  Codepage for Schema Image: declares Unicode
  Collation Name: sets the ICU collation
Notes:
  Oracle version 9i or above.
  If SHCP=UTF8, then NLS_LANG=.AL32UTF8.
  Set a Unicode collation.
Oracle Unicode Migration: two ways to configure an ORACLE database to store Unicode
  Use Unicode Types
    Unchecked: uses the database charset. NLS_CHARACTERSETS: AL32UTF8, UTF8
    Checked: uses the national language charset. NLS_NCHAR_CHARACTERSETS: AL16UTF16
Notes:
  Decide how you will define your Unicode data in Oracle.
  When SHCP=UTF8 and “Use Unicode Types” = no, the database code page is implied.
    NOTE: An AL32UTF8 database code page provides the full range of data support, the least translation and the least character widening.
  When SHCP=UTF8 and “Use Unicode Types” = yes, the national character set for Unicode column types is used.
    You can have a Unicode or non-Unicode database code page and use Unicode character data types (AL16UTF16 or UTF8).
    AL16UTF16 optimizes string processing for NCHAR types.
  “Use Unicode Types” is off by default (the backward compatible selection).
  If it is checked, the schema holder code page is automatically changed to “utf-8”, and “Char semantics” is automatically disabled (it applies only to Unicode defined by a database code page).
Oracle Unicode Migration: field widths
  For field widths use: Width (recommended). Use the SQL Width Tool.
  Char semantics
    Checked: CHAR(10) = 10 chars (with UTF8 = 10-30 bytes; with AL32UTF8 = 10-40 bytes)
    Unchecked: CHAR(10) = 10 bytes
Notes:
  “Use Unicode Types” = yes: only character semantics are allowed.
  “Use Unicode Types” = no: you're using a database code page, so character or byte semantics apply.
  Best practice: migrate columns with character semantics.
  Selecting byte semantics assumes you know the maximum byte size of all your data and want to size character columns using this knowledge.
  We highly recommend using “Width” over FORMAT and sizing columns with the SQL Width Tool.
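The gap between character and byte semantics is easy to measure. The following is a small sketch that assumes standard UTF-8 (the AL32UTF8 case); `byte_length` and `fits` are hypothetical helpers, not Oracle or OpenEdge functions.

```python
def byte_length(text: str, encoding: str = "utf-8") -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


def fits(text: str, width: int, char_semantics: bool) -> bool:
    """Would text fit a CHAR(width) column under the chosen semantics?"""
    size = len(text) if char_semantics else byte_length(text)
    return size <= width


samples = {
    "ASCII": "M" * 10,
    "accented": "\u00e9" * 10,
    "CJK": "\u4e9c" * 10,
    "supplementary": "\U00010302" * 10,
}
for label, text in samples.items():
    print(f"{label:<14} chars={len(text)} bytes={byte_length(text)} "
          f"fits CHAR(10) bytes={fits(text, 10, False)} chars={fits(text, 10, True)}")
```

All four samples hold ten characters, but only the ASCII one fits a ten-byte column, which is the reason the slide recommends character semantics for migrated columns.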
Oracle Unicode Migration: maximum char length
  Maximum char length
    Use Unicode Types: 2000 (assumes national character set = AL16UTF16)
    Otherwise: 1000 (assumes database code page = AL32UTF8)
  Expand to CLOB
    Checked: fields greater than the maximum char length produce a CLOB
    Unchecked: fields greater than the maximum char length produce a LONG (backward compatible)
Notes:
  OpenEdge code page is not UTF-8: set the maximum char length to 4000. Fields longer than 4000 are migrated to LONG (backward compatible). Recommendation: check “Expand to CLOB” instead, to deal with character widening and multiple LONGs.
  OpenEdge code page = UTF-8 and “Use Unicode Types” = no, assuming AL32UTF8: set the maximum char length to 1000.
  OpenEdge code page = UTF-8 and “Use Unicode Types” = yes, assuming AL16UTF16: set the maximum char length to 2000. This assumes all fixed-length 2-byte characters, with no adjustment for supplementary characters beyond the BMP.
  OpenEdge code page = UTF-8 and “Use Unicode Types” = yes, assuming UTF8: the recommended maximum char length is 1333. This assumes 1-3 byte variable-length characters, with no adjustment for 6-byte supplementary characters beyond the BMP.
  NOTE: Irrespective of the above recommendations, Oracle imposes physical limits for the national character set: 2000 for AL16UTF16 and 4000 for UTF8.
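The recommended maximums follow from dividing a 4000-byte column limit by the widest character each character set is assumed to produce. A minimal sketch of that arithmetic, where the 4000-byte limit and the per-character widths are taken from the slides and `max_char_length` is a hypothetical helper:

```python
COLUMN_LIMIT_BYTES = 4000  # Oracle CHAR-type limit quoted on the comparison slide


def max_char_length(max_bytes_per_char: int, limit: int = COLUMN_LIMIT_BYTES) -> int:
    """Largest character count guaranteed to fit within the byte limit."""
    return limit // max_bytes_per_char


assumed_widths = {
    "AL32UTF8 (up to 4 bytes per character)": 4,
    "UTF8 (up to 3 bytes per BMP character)": 3,
    "AL16UTF16 (2 bytes per BMP character)": 2,
    "single-byte code page": 1,
}
for charset, width in assumed_widths.items():
    print(f"{charset:<42} {max_char_length(width)}")
```

Supplementary characters take 4 bytes in AL16UTF16 and 6 bytes in UTF8, so data that uses them needs a smaller maximum than these figures.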
Agenda: The path to successful development & deployment
  Unicode: How did we get here?
  What are its broader OpenEdge implications?
  What are its DataServer implications?
  Specific implementation in the DataServers for:
    Oracle
    MS SQL Server
Notes:
  Prior to 10.1C, you would not have been able to use Unicode in MSS:
    There was no support for Unicode types.
    It would have needed conversions between the UCS-2 and UTF-8 encodings.
MS SQL Server DataServer Migration: _db-xl-name, -cpinternal and -cpstream
  Diagram: OpenEdge process (-cpinternal UTF-8, -cpstream UTF-8) with OpenEdge conversions; schema holder (_db-xl-name UTF-8); 10.1C MSS DataServer layer or process; ODBC driver (ACP = OS code page) with driver conversions; MSS 2005 database storing UCS-2/UTF-16 in NCHAR, NVARCHAR, NTEXT and NVARCHAR(max) columns. The schema holder has an “implied match” with the Unicode column types.
  Source side: database (_db-xl-name ANSI or UTF-8) and .d files (cpstream=ISO8859-1, cpstream=ISO8859-5)
Notes:
  MSS requirements:
    You can only use Unicode types to store Unicode data (there is no Unicode database code page).
    The “Unicode” schema image code page, UTF-8, has an implied match with the “Unicode” server column types, which only support UCS-2/UTF-16.
      1-3 byte UTF-8 characters become 2-byte UCS-2 characters
      4-byte UTF-8 characters become two 2-byte surrogates (a surrogate pair)
    Driver conversions may occur on non-Unicode columns if the OS code page where the driver resides differs from the non-Unicode code page of the database columns.
    Columns passed as “Unicode” data types are generally not converted.
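The size change between UTF-8 on the OpenEdge side and UTF-16 on the MSS side can be tabulated per character. This is a minimal sketch using Python's codecs; `widths` is a hypothetical helper and little-endian UTF-16 is assumed only because it is the usual in-memory form on Windows.

```python
def widths(ch: str) -> tuple:
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


examples = [
    ("ASCII letter", "M"),
    ("accented Latin letter", "\u00e1"),
    ("CJK ideograph", "\u4e9c"),
    ("supplementary character", "\U00010302"),
]
for label, ch in examples:
    utf8_bytes, utf16_bytes = widths(ch)
    print(f"{label:<26} UTF-8: {utf8_bytes}  UTF-16: {utf16_bytes}")
```

Single-byte data doubles in size, three-byte characters shrink, and only supplementary characters need a surrogate pair, which is why sizing by character count is safer than sizing by bytes.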
MS SQL Server Unicode Migration (migration settings)
  ODBC Data Source Name: must be a Unicode driver
  Codepage for Schema Image: declares Unicode
  Collation Name: sets the ICU collation
  Use Unicode Types
    Checked: selects Unicode (changes the code page to UTF-8); NVARCHAR types
    Unchecked: uses non-Unicode character types; VARCHAR types
Notes:
  ODBC data source: the DataDirect 5.1 driver or later is Unicode enabled. Older non-Unicode drivers add significant overhead to support a Unicode ODBC application. The driver must point to an MSS 2005 data source or greater.
  MSS picks up the server's default charset and the database charset through SQL.
  Unicode column types provide UCS-2/UTF-16 Unicode for MSS on Windows, in columns independent of the database and server instance code page.
  Set an ICU collation (NOTE: MSS collation algorithms are NOT published!).
  “Use Unicode Types” = off: no change to the old default behavior (even when the code page is UTF-8).
  “Use Unicode Types” = on: generates Unicode types homogeneously for all character fields and automatically sets the code page to UTF-8. A migration error occurs if the target is not MSS 2005 or greater. THIS WILL CAUSE A DOUBLING OF STORAGE FOR SBCS DATA.
  WARNING: It also reduces the OpenEdge maximum record size to 16K for a DataServer client.
  Heterogeneous non-Unicode types pulled into a schema image may be subject to OS code page driver conversions.
MS SQL Server Unicode Migration: maximum char length and field widths
  Maximum char length
    Use Unicode Types: 4000 (assumes MSS 2005 = UCS-2)
  For field widths use: Width (recommended). Use the SQL Width Tool.
  Expand width (utf-8)
    Checked: doubles the width defined for NVARCHAR types. NVARCHAR(1000) becomes NVARCHAR(2000).
Notes:
  Maximum VARCHAR length = the VARCHAR limit before CLOB conversion:
    Non-Unicode: VARCHAR longer than 8000 becomes a TEXT column
    Unicode: NVARCHAR longer than 4000 becomes an NVARCHAR(max) column
    With “Expand width (utf-8)” checked: longer than 2000 becomes an NVARCHAR(max) column
  NOTE: No BLOB support. Only the first 30K bytes are usable in the OpenEdge record buffer, which is 15K characters.
  “Expand width (utf-8)” doubles the width for Unicode character columns. It covers the rare cases when use of the supplementary range violates the 2-byte character assumption.
  We highly recommend “Width” over FORMAT and sizing columns with the SQL Width Tool, but remember: inside the BMP, single-byte ANSI or UTF-8 characters become 2-byte characters.
Linguistic Sorting and Collation: sorting with Finnish collation
    FOR EACH mytable BY COLLATE(myfield,"CASE-INSENSITIVE","ICU-fi"):
        DISPLAY myfield WITH FONT 8.
    END.
  Basic   ICU-UCA   ICU-fi
  Aaa     Aaa       Aaa
  Ááá     Ááá       Ááá
  Äää     Äää       Bbb
  Ççç     Bbb       Ccc
  Ĉĉĉ     Ccc       Ĉĉĉ
  Bbb     Ĉĉĉ       Ççç
  Ccc     Ççç       Zzz
  Zzz     Zzz       Äää
Notes:
  A quick word about collation. Just as we saw with the earlier sorting example, sorting standards are all over the place:
    OpenEdge uses ICU.
    MSS uses an unpublished sorting algorithm.
    If the OpenEdge and database collations don't match, records not meeting the client sort criteria are dropped.
    BTW: supplementary characters are part of the “90” collation algorithms supported in MSS 2005, along with new comparative operators and functions.
  ICU-UCA, the “Unicode Collation Algorithm” (a series of weights), supplies the DUCET (Default Unicode Collation Element Table), the default collation for all Unicode characters. It sorts appropriately for many Western languages.
  Or you can always “COLLATE”, or re-sort, your client results according to the sorting requirements of a particular locale.
  NOTE: Using new collations requires that database objects be reindexed.
Linguistic Sorting and Collation: comparing with Finnish collation
    FOR EACH mytable
        WHERE COMPARE(myfield,">=","C","CASE-INSENSITIVE","ICU-fi")
        BY COLLATE(myfield,"CASE-INSENSITIVE","ICU-fi"):
        DISPLAY myfield WITH FONT 8.
    END.
  Basic   ICU-UCA   ICU-fi
  Ccc     Ccc       Ccc
  Zzz     Ĉĉĉ       Ĉĉĉ
          Ççç       Ççç
          Zzz       Zzz
                    Äää
Notes:
  You can also bracket result sets based on collation-based COMPARE criteria.
Linguistic Sorting and Collation: global setup
  Caution with performance!
  Diagram: the database (-cpcoll ICU-uca) serves an AppServer (-cpcoll ICU-uca), which uses the client collation in COMPARE and COLLATE and returns TEMP-TABLES to each client:
    English user  -cpcoll ICU-en
    French user   -cpcoll ICU-fr
    Czech user    -cpcoll ICU-cs
    Finnish user  -cpcoll ICU-fi
  Client call:
    RUN ASprg.p ON hAppServer
        (INPUT SESSION:CPCOLL,
         INPUT USERID,
         INPUT <other parameters>,
         OUTPUT TABLE ttMytable).
Notes:
  You can deal with localization:
    In your database: this costs storage and can decentralize data.
    In your application: see the example above, using temp-tables for specific locales. COMPARE and COLLATE are costly operations, so the application solution may save storage but starve for CPU.
8-bit Code Pages: where to find more
  Where to find code page tables:
    10.1B Internationalizing Applications manual (IBM850 and ISO8859-1)
    http://www.microsoft.com/globaldev/reference/cphome.mspx
    http://www-03.ibm.com/servers/eserver/iseries/software/globalization/codepages.html
    http://en.wikipedia.org
    http://www.fileformat.info/info/charset/index.htm
  Where to find Unicode fonts:
    http://en.wikipedia.org/wiki/Code2000
  Information about Windows fonts:
    http://www.microsoft.com/typography/fonts/default.aspx
    http://www.microsoft.com/globaldev/getwr/steps/wrg_font.mspx
  You need a Unicode-aware font to display Unicode characters.
  You need a surrogate-aware font to display surrogate characters.
For More Information, go to…
  PSDN:
    B2420-LV: From 26 to 96,000 Characters in 60 Minutes
    DEV-10: Supporting Multiple Languages in Your Application
    DEV-23: Global Applications and Code Pages
  Progress eLearning Community:
    Understanding Internationalization (Salvador Vinals)
  Documentation:
    OpenEdge Data Management: DataServer for Oracle
    OpenEdge Data Management: DataServer for Microsoft SQL Server
    OpenEdge Development: Internationalizing Applications
Questions?
Thank You