Agenda: Guidelines for Supporting Complex Scripts* on Windows 2000
- Key concepts
- Overview of Unicode
- Migrating existing applications
- Using Unicode text in resources
*Such as Devanagari and Tamil
Definitions
- Enabling for a script: Adding support for input, display, and output of the script
- Localization: Translating user interface elements
- Globalization: Developing software such that feature design and code design are not limited to a single locale or script
Requirements for Enabling Indian Scripts in Applications on Windows 2000:
- Use Unicode to encode text
- Enable for complex scripts
- Note: Many Microsoft products do not yet meet these requirements. However, we're working on it!
Overview of Unicode
Character Set Evolution
- MS-DOS: OEM character sets
- Windows 3.x: ANSI character sets
- Windows 9x: ANSI character sets
- Windows NT: Unicode
  - Supported for compatibility: OEM (console) character sets, ANSI character sets
Why do character set differences matter?
- Historically, they fragmented code bases for both Windows and applications
  - Single byte: European editions
  - Double byte: Far East editions
  - Bi-directional: Middle East editions
- They make it difficult to share data
- They make it difficult to develop multilingual applications
What is Unicode?
- A 16-bit character encoding
  - A mapping of characters to numbers
  - Syntax rules for display of complex scripts
  - Not a font or glyph encoding!
  - Not a sort algorithm!
- Includes all characters in common use in modern scripts (and others)
- Basis for the ISO character encoding standard
- Native text encoding for Windows NT
Unicode™ / ISO 16-bit international character encoding
- Windows 2000 uses Unicode version 2.0
[Figure: layout of the 16-bit code space from 0x0000 to 0xFFFF, with regions labeled ASCII, Latin, Greek, Arabic, Hebrew, Indian, Thai, Punctuation, Symbols, Kana, Hangul, Ideographs (Hanzi, Kanji, Hanja), Private use, Compatibility, and Future use]
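Because each script occupies a fixed region of the 16-bit code space, an application can tell which script a character belongs to from its numeric value alone. The sketch below is a minimal illustration, not part of the presentation: the block ranges for Devanagari (U+0900 to U+097F) and Tamil (U+0B80 to U+0BFF) come from the Unicode Standard, and the names `script_block`, `classify_code_unit`, and `contains_indic` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Block ranges from the Unicode Standard (assumption: not stated in the slides). */
#define DEVANAGARI_FIRST 0x0900u
#define DEVANAGARI_LAST  0x097Fu
#define TAMIL_FIRST      0x0B80u
#define TAMIL_LAST       0x0BFFu

typedef enum { SCRIPT_OTHER, SCRIPT_DEVANAGARI, SCRIPT_TAMIL } script_block;

/* Hypothetical helper: classify one 16-bit code unit by the block it falls in. */
script_block classify_code_unit(unsigned int cu)
{
    if (cu >= DEVANAGARI_FIRST && cu <= DEVANAGARI_LAST)
        return SCRIPT_DEVANAGARI;
    if (cu >= TAMIL_FIRST && cu <= TAMIL_LAST)
        return SCRIPT_TAMIL;
    return SCRIPT_OTHER;
}

/* Returns 1 if any code unit in text[0..len) is Devanagari or Tamil, else 0. */
int contains_indic(const unsigned short *text, size_t len)
{
    size_t i;
    assert(text != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (classify_code_unit(text[i]) != SCRIPT_OTHER)
            return 1;
    }
    return 0;
}
```

A check like this could be used, for example, to decide whether a string needs complex-script handling at all; it says nothing about how the text should be shaped or displayed.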
Relatives of Unicode
- ISO/IEC
  - 32-bit ISO standard of 64K x 64K "planes"
  - Unicode repertoire is plane 0
- UTF-7
  - 7-bit transformation format
  - Not widely used
- UTF-8
  - 8-bit transformation format
  - Used in web pages and some ...
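To make the idea of an 8-bit transformation format concrete, the following sketch encodes a single plane 0 character as UTF-8. It is an illustration under stated assumptions rather than code from the presentation: the function name `utf8_encode_bmp` is hypothetical, and surrogate code units and characters outside plane 0 are deliberately not handled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encodes one plane 0 code point (0x0000..0xFFFF) as UTF-8.
 * Writes 1 to 3 bytes into out and returns the number of bytes written.
 * Assumption: the caller never passes a surrogate code unit.
 */
size_t utf8_encode_bmp(unsigned int cp, unsigned char out[3])
{
    assert(out != NULL);
    assert(cp <= 0xFFFFu);

    if (cp < 0x80u) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0u | (cp >> 12));
    out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 3;
}
```

ASCII characters stay one byte long, which is why UTF-8 pages remain readable to ASCII-only tools, while Devanagari and Tamil characters each take three bytes.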
Why Should I Use Unicode and Win32 for Indian Text? My application works fine now!
Benefits of Using Unicode on Windows 2000
- Share data (e.g., cut and paste) with other Win32 applications
- Make use of full Win32 API for text processing
- Support multilingual documents, including multiple Indian scripts
- Use industry standard encoding
Summary: Use Unicode, the ultimate character encoding
- Represent all text with one unambiguous encoding
- Support multilingual text easily
- Avoid special processing for variable byte-length characters
- Use standard encoding recognized throughout the industry and the world
- Support new scripts that are only supported through Unicode
Migrating Existing Applications to Support Indian Text on Windows 2000…
Three Migration Scenarios:
1. ANSI application to Unicode
2. Standard Win32 application to complex script enabled
3. Existing Indian language application to Unicode and Win32
Migrating ANSI applications to Unicode
- Overview of "A" and "W" entry points
- How to build a Unicode Win32 application
- Unicode applications on Windows 98
Review of the W and A APIs
- Two kinds of window classes: Unicode, ANSI
- Win32 API has two versions of most functions:
  - "W" (wide) version handles Unicode
  - "A" (ANSI) version assumes the system default code page (character encoding)
- Macros resolve to W or A entry point
- Example: Macro for RegisterClassEx
    #ifdef UNICODE
    #define RegisterClassEx RegisterClassExW
    #else
    #define RegisterClassEx RegisterClassExA
    #endif
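The same compile-time dispatch can be tried without the Windows headers. The sketch below mirrors the pattern with a hypothetical pair of functions, `CountCharsA` and `CountCharsW`, and a hypothetical `TEXT_LIT` macro for string literals; none of these names come from the Win32 API, and the example only shows how one source line can bind to either entry point depending on whether `UNICODE` is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical "A" version: counts bytes in a narrow string. */
size_t CountCharsA(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Hypothetical "W" version: counts wide characters in a wide string. */
size_t CountCharsW(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);
}

/* One generic name resolves to the W or A entry point at compile time. */
#ifdef UNICODE
#define CountChars  CountCharsW
#define TEXT_LIT(s) L##s
#else
#define CountChars  CountCharsA
#define TEXT_LIT(s) s
#endif

/* Application code is written once against the generic name. */
size_t count_greeting(void)
{
    return CountChars(TEXT_LIT("namaste"));
}
```

Compiling this file with and without `-DUNICODE` selects a different function without touching `count_greeting`, which is the property the Win32 macros rely on.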
To Build a Unicode-enabled Application:
- Automatic in Visual Studio:
  - Compile with options -DUNICODE and -D_UNICODE
  - Specify wWinMainCRTStartup in Project Settings/Link/Output/Entry-point symbol
- Or, use only the "W" routines from Win32 API
- Metafiles:
  - Use Enhanced Metafiles (EMF)
  - Windows Metafiles (WMF) don't support Unicode
For Applications that Must Also Run on Windows 98…
- Use Unicode everywhere with single binary, two code paths:
  - On Windows NT, use W entry points
  - On Windows 98, convert Unicode to ANSI, use A entry points
  - See sample GLOBALDV for example
- See April 1999 Microsoft Systems Journal for details and other options
Migrating Standard Win32 Application to Support Complex Scripts Good news: In a Unicode application, it basically just works!
Simple, Plain-text Applications
- Use standard edit control in Visual C/C++
- Use standard Win32 API functions
  - Win32 APIs: ExtTextOutW or DrawTextW
  - ScriptString API in Uniscribe
Pitfalls in Enabling for Complex Scripts
- When displaying typed text:
  - Do not output characters one by one!
  - Do save text in a buffer and display the whole string with Uniscribe or Win32 API
- To measure line lengths:
  - Do not sum cached character widths
  - Do use a GetTextExtent function or Uniscribe
Simple Applications With Formatted Text
- Use rich edit control in Visual C/C++
- Internet Explorer 5.0: Use Document Object Model (more later)
Applications With Advanced Formatting and Layout Use script APIs (“Uniscribe”) See MSJ article of November 1998
What about Visual Basic, Visual J++?
- Visual Basic 6.0
  - Standard controls are ANSI, not Unicode
  - Use "MS Forms 2.0" controls to use Unicode in controls
  - Resource editor does support Unicode
- Visual J++
  - Resource editor supports Unicode
  - Text output is ANSI only
- Future plans: Make Unicode work everywhere in Visual Studio
Migrating Existing Indian language applications to Win32 and Unicode
Step 1 in Migrating Existing Indian Applications Follow guidelines for Unicode enabling and complex script enabling
Step 2 in Migrating Existing Indian Applications …
- Provide conversion facility to migrate documents
  - From your format to ISCII
  - From ISCII to Unicode
- MultiByteToWideChar(, …
  - Devanagari is codepage
  - Tamil is codepage
- See UCONVERT sample
  - Included on your CD
  - Modified from UCONVERT in Win32 SDK
Using Unicode Text in Resources
- Getting Unicode into Win32 resources
- Multilingual Visual C/C++ applications
Getting Unicode into Win32 Resources
- Create Unicode RC file
  - Resource editor in Visual Studio does not support Unicode yet, so:
  - Generate RC file for English using IDE
  - Translate to target language with Unicode editor (e.g., Notepad or Word)
  - Save as Unicode
- Compile with resource compiler RC.EXE
  - RC.EXE does support Unicode
  - Compile within Visual Studio IDE
Implementing Multilanguage User Interface in Applications
- Use satellite resource DLLs
- Default to user settings, but allow user to change
- For details, see:
  - April 1999 Microsoft Systems Journal
  - GLOBALDV sample code
Multilanguage User Interface
- Initialize to current UI language
  - Windows 2000: GetUserDefaultUILanguage()
  - Others: Use the language of the O/S
- Allow user to select UI language
  - Put language-dependent resources in resource DLLs
  - Use naming convention, e.g., res.dll
  - Find all resource DLLs, put up list box of choices
Agenda: Using Unicode and Complex Scripts in Enterprise Applications
- Intranet/internet applications
- Unicode support in SQL Server 7.0
- Other considerations
Intranet/Internet Applications
- Internet Explorer 5.01 on Win32 platforms
  - Displays multilingual text including complex scripts
  - Supports complex scripts in Document Object Model
  - Supports Indian text through Unicode
Encodings for Multi-lingual Text in Web Pages
- Raw Unicode
  - OK for intranet on Windows NT networks
  - Not good for internet pages
- Numeric entities, e.g., &#2325; for क
  - OK for occasional use, e.g., inserting characters not in the main script of page
  - Not good for large documents
- UTF-8: recommended encoding
  - Works just about everywhere
  - Supported by IE 4.0+, Netscape 4.0+
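A numeric entity spells out a character as its decimal Unicode value between `&#` and `;`, so क (U+0915, decimal 2325) is written `&#2325;`. The sketch below shows one way a program might produce such an entity when it must emit a character outside the page's main encoding. The function name `format_numeric_entity` and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Writes the decimal numeric entity for code point cp into out,
 * for example "&#2325;" for U+0915.
 * Returns the number of characters written (excluding the terminating
 * NUL), or -1 if the buffer is too small or formatting fails.
 */
int format_numeric_entity(unsigned int cp, char *out, size_t out_size)
{
    int n;

    assert(out != NULL);
    n = snprintf(out, out_size, "&#%u;", cp);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}
```

Each Devanagari character costs seven ASCII bytes in this form compared with three bytes in UTF-8, which is consistent with the advice to reserve entities for occasional use.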
Creating UTF-8 Webpages
- Use charset=UTF-8 in META tag
- Save HTML page as UTF-8 using Notepad, Word, etc.
- Saving as UTF-8 in Word:
  - Select File/Save As WebPage/Tools
  - Select Web Options/Encoding
  - Change charset designation to UTF-8
Embedded Fonts in Web Pages
- Downloadable fonts used only in web pages
- Deleted when page is closed
- WEFT tool
  - Creates embedded font from TTF file
  - Saves download time/space by using only those glyphs required for the page
- On Microsoft website, see workshop/author/fontembed/font_embed.asp
Introduction to DHTML
- Based on Document Object Model
  - Objects in HTML document
  - Text in objects including titles, headers, etc.
  - Attributes such as font, color, etc.
  - Are accessible via scripts, e.g., JScript or VBScript
  - Supported in IE 4.0+
- See various documents under for overview
Examples of DHTML
Heading tag:
    <H1 id=Head1 style="font-weight: normal"
        onmouseover="makeItalic();"
        onmouseout="makeNormal();">
    Sample Dynamic HTML
    </H1>
JScript functions that change style of heading text:
    <script language="JScript">
    function makeItalic() { Head1.style.fontStyle = "italic"; }
    function makeNormal() { Head1.style.fontStyle = "normal"; }
    </script>
Using Indian Scripts in DHTML
- Use same design rules as static HTML
  - Encode in UTF-8
  - Use embedded fonts if needed
- Consider multilingual pages
  - Display initial page in English
  - Offer option to change to other
Unicode Support in SQL Server 7.0
- Unicode datatypes in SQL Server 7.0
  - NCHAR
  - NVARCHAR
  - NTEXT
- Indicate Unicode text by N'text' in SQL queries:
    create table myTable (col1 CHAR(8), col2 NCHAR(8))
    insert into myTable (col1, col2) values ('Japan', N'日本')
- Utilities for entering/retrieving Unicode data:
  - Query Analyzer
  - Data Transformation Services
  - Client application using ODBC
Accessing Data Through ODBC
- ODBC supports Unicode data access
- Use Visual C/C++ for read/write
  - Use SQL "W" routines, e.g.,
      SQLExecDirectW(SQLHSTMT, LPWSTR, int);
  - Specify data type SQL_C_WCHAR as needed:
      SQLBindCol(hstmt, nColumn, SQL_C_WCHAR, szCol, nMaxCol, &cbName);
  - See GLOBALDV sample
- Use Visual Basic to retrieve and display
Accessing SQL Server 7.0 Unicode Data through ASP Webpages
- Use standard encodings:
  - UTF-8 in web pages
  - Unicode in SQL Server 7.0
- Access data through JScript/ODBC
- JScript automatically translates Unicode to current codepage in web page
  - Defaults to system codepage
  - Specify UTF-8 "codepage" using:
      // Scope=session
      // Scope=page
Summary of SQL Server 7.0 Unicode Access
Other Considerations …
- Handling Indian text in network applications
  - Indic Language Group must be installed on clients
  - Only necessary on server if display and input are required locally
- Sharing documents
  - Word 2000 documents: Must have Indic Language Group installed on local machine
  - HTML: Can use embedded fonts
Break!
OpenType Layout David C. Brown Development Lead, and David Meltzer Program Manager Microsoft Corporation
OpenType Layout
- File Format
- Benefits of OpenType
- Layout Features
- Indic Features
OpenType File Format
- sfnt table structure
  - Extension of the current TrueType file format
- A single font file may contain
  - TrueType outline data
  - PostScript (CFF) outline data
Benefits of OpenType
- Support for large character sets
- Multi-script character sets
- Unicode support
- Glyph alternates supported
- Advanced typography supported
- Better protection of font data
- Font embedding controls
Layout Features
- Glyph substitution
- Glyph positioning
- Script and Language information
Glyph Substitution
- Single glyph substitution
- One-to-many substitution
- Multiple glyph substitution
- Aesthetic alternatives
- Contextual glyph substitution
Glyph Positioning
- Two-dimensional positioning
- Single glyph adjustment
- Adjustment of paired glyphs
- Cursive attachment
- Mark attachment
- Contextual positioning
Script and Language Information
- Layout features encoded by
  - Scripts
  - Languages within scripts
Indic Features
- Language Forms
- Conjuncts and Typographical Forms
- Glyph Positioning
Language Forms
- Nukta
- Akhand
- Reph
- Below-base Form
- Half Form
- Post-base Form
- Vattu Variants
Example: Below-base form
Conjuncts and Typographical Forms
- Pre-base substitutions
- Below-base substitutions
- Above-base substitutions
- Post-base substitutions
- Halant Forms
Example: Pre-base consonant conjunct
Glyph Positioning
- Below-base marks
- Above-base marks
- Distance control
Coming Tools for Developing OpenType Fonts
- VTT (Visual TrueType)
- VOLT (Visual OpenType Layout Tool)
Installing Sample Fonts …
    copy …\cssamp\fonts.exe c:\temp
    cd c:\temp
    fonts /T:c:\temp /C
Use Explorer to drag mangal.ttf and latha.ttf into your winnt\fonts directory.
Resources
- OpenType Specification
- Indic Encoding Specification
  - Early draft available on your CD
  - contact