From Code to XLIFF: Bridging the Chasm
Dr. Stephen Flinter, Connect Global Solutions
LRC Conference – 19 November 2003
Agenda The XLIFF Transformation Problem Current approaches Grammar based approach – XPG XPG & XML Summary
The Problem XLIFF has made the representation of resources translation/localisation friendly Non-trivial to convert existing files to XLIFF Adding new file formats can be painful
XLIFF Transformation Definition: XLIFF Transformation is the process by which native file formats are transformed into XLIFF and, after translation, from XLIFF back to their native format. File formats include: Java, .properties, XML, HTML, custom.
Architecture
.com Business Model
Parody of the .com business model that has been floating around the web:
–Get lots of users
–???
–Profit
XLIFF Transformation Model
The XLIFF transformation model could be described in similar terms:
–Native file format
–???
–XLIFF
Architecture
Current Approaches
Current Approaches to XLIFF Use XLIFF as native format Use commercial tools Use regular expressions & scripts
XLIFF as Native Format Use XLIFF from software development onwards No transformation required Preferred approach in the long term
Disadvantages Requires significant changes to the software development process How to handle legacy resources? –Back to the original problem
Commercial Tools Tool support for XLIFF is improving all the time. Advantages of support and expertise of tool developer.
Disadvantages However, many tools still only read XLIFF, and won’t generate XLIFF from native formats Won’t necessarily support all formats required Can be difficult to identify in-line tags
Scripts and Regular Expressions Use a scripting language (e.g. perl, python, WordBasic) Encode rules to extract translatable resources using regular expressions
Examples
String                      Regular Expression
"Translatable text"         /"([^"]*)"/
id1 = Translatable text     /.* = (.*)/
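The two expressions in the table can be tried out directly. The following is a minimal sketch using java.util.regex; the class and method names are illustrative assumptions and not part of any tool described in these slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: class and method names are assumptions.
final class SimpleRegexExtract {
    // The two expressions from the slide, written as Java regexes.
    static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");
    static final Pattern KEY_VALUE = Pattern.compile(".* = (.*)");

    /** Returns the first captured group, or null when the line does not match. */
    static String firstGroup(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstGroup(QUOTED, "\"Translatable text\""));
        System.out.println(firstGroup(KEY_VALUE, "id1 = Translatable text"));
    }
}
```

Both calls yield "Translatable text" for these simple inputs, which is why the approach looks attractive at first.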
Advantages Superficially simple to develop Plenty of powerful RE languages (especially perl) available Full control and ownership of how the formats are managed
Disadvantages Error-prone – difficult to cover all situations To remove all errors, often have to add many parsing rules Has to be redone for every new file type REs have to change for inline tags
Other Examples
print("First string");
print("Second" + " string");
print("Third \"string\"");
print("Fourth {0} string");
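To see where the simple quoted-string expression breaks down on these four statements, the sketch below applies it to each line. The class and method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: applies the naive quoted-string regex to one line of code.
final class NaiveQuoteRegex {
    static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    /** Returns every group the naive expression captures on the given line. */
    static List<String> extract(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] lines = {
            "print(\"First string\");",
            "print(\"Second\" + \" string\");",
            "print(\"Third \\\"string\\\"\");",
            "print(\"Fourth {0} string\");"
        };
        for (String line : lines) {
            System.out.println(line + " -> " + extract(line));
        }
    }
}
```

The first statement is extracted cleanly. The second comes out as two separate fragments, the third is cut at the escaped quote (leaving a fragment ending in a backslash plus an empty string), and the fourth is returned with its {0} placeholder indistinguishable from ordinary text.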
Summary This approach is doomed to failure because of the disconnect between the grammar of the language, and the regular expressions used to identify strings.
Grammar Based Approach
A New Approach With this approach, we look at the language grammar (EBNF) Identify grammar productions that can hold translatable text Generate a parser that accepts instances of the grammar and emits XLIFF
Grammar-based Architecture
Architecture
New component: XLIFF parser generator (XPG)
Accepts a JavaCC grammar
Allows one or more productions to be marked as translatable
Generates the “extract” and “merge” programs
JavaCC JavaCC: Java Compiler Compiler Modelled after lex & yacc Works on EBNF-type grammars rendered as JavaCC .jj files JavaCC grammars are available for most modern programming languages.
Big Win Direct, one-to-one correspondence between the grammar and the mechanism for identifying strings.
Advantages
Consistent high quality
–Guaranteed to work in every case – for all instances of the grammar.
Painless
–No scripting/regular expressions required
–Extractor and merger generated automatically
Fast
–Just need to identify the strings in the grammar
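Example
Extract from Java BNF
<literal> ::= <string literal> | <boolean literal> | <null literal>
<string literal> ::= " <string characters>? "
<string characters> ::= <string character> | <string characters> <string character>
<string character> ::= <input character> except " and \ | <escape character>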
JavaCC Extract
void Literal() : {}
{
  <STRING_LITERAL>
| BooleanLiteral()
| NullLiteral()
}
< STRING_LITERAL:
    "\""
    (   (~["\"","\\","\n","\r"])
      | ("\\"
          ( ["n","t","b","r","f","\\","'","\""]
          | ["0"-"7"] ( ["0"-"7"] )?
          | ["0"-"3"] ["0"-"7"] ["0"-"7"]
          )
        )
    )*
    "\""
>
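For readers less familiar with JavaCC token syntax, the same lexical rule can be transcribed by hand into a java.util.regex pattern. This is only a sketch of what the token accepts; XPG itself relies on the JavaCC-generated lexer, and the class and method names here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand transcription of the STRING_LITERAL token, for illustration only.
final class StringLiteralToken {
    // Any character except quote, backslash, newline and carriage return.
    static final String PLAIN = "[^\"\\\\\\n\\r]";
    // A backslash followed by a simple escape or an octal escape.
    static final String ESCAPE = "\\\\(?:[ntbrf\\\\'\"]|[0-7][0-7]?|[0-3][0-7][0-7])";
    static final Pattern STRING_LITERAL =
            Pattern.compile("\"(?:" + PLAIN + "|" + ESCAPE + ")*\"");

    /** Returns the full image (quotes included) of each literal found in the source. */
    static List<String> find(String source) {
        List<String> images = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(source);
        while (m.find()) {
            images.add(m.group());
        }
        return images;
    }

    public static void main(String[] args) {
        System.out.println(find("print(\"Third \\\"string\\\"\");"));
    }
}
```

Because the rule knows about escape sequences, the third example from the earlier slide is recognised as one literal. A pattern applied to raw text still cannot tell a quote inside a comment or character literal from a real string; that distinction comes from the lexer tokenising the whole input, which is what the grammar-based approach provides.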
Identifying <STRING_LITERAL> We identify the <STRING_LITERAL> token as a language item that may contain strings. XPG then generates a new grammar, which compiles to the extractor. The extractor then generates XLIFF.
Modified JavaCC Grammar
void Literal() : {}
{
  StringLiteral()
| BooleanLiteral()
| NullLiteral()
}
StringLiteral()
void StringLiteral() :
{ Token t; }
{
  t = <STRING_LITERAL>
  {
    String s = t.image.substring(1, t.image.length() - 1);
    pw.println("<trans-unit>");
    pw.println("<source>" + s + "</source>");
    pw.println("</trans-unit>");
  }
}
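The action attached to the production above can be exercised outside JavaCC. The sketch below mirrors it in plain Java: strip the surrounding quotes from the token image and print a trans-unit. The running id counter and the XML escaping are assumptions added for illustration; neither appears on the slide.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Plain-Java mirror of the grammar action; names are illustrative assumptions.
final class TransUnitEmitter {
    private int nextId = 1; // assumed numbering scheme, not taken from the slides

    /** Escapes the characters that are special in XML text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Strips the surrounding quotes from the token image and prints one trans-unit. */
    void emit(String tokenImage, PrintWriter pw) {
        String s = tokenImage.substring(1, tokenImage.length() - 1);
        pw.println("<trans-unit id=\"" + nextId++ + "\">");
        pw.println("  <source>" + escapeXml(s) + "</source>");
        pw.println("</trans-unit>");
    }

    /** Convenience wrapper: emits each token image and returns the XML text. */
    static String render(String... tokenImages) {
        StringWriter buffer = new StringWriter();
        PrintWriter pw = new PrintWriter(buffer);
        TransUnitEmitter emitter = new TransUnitEmitter();
        for (String image : tokenImages) {
            emitter.emit(image, pw);
        }
        pw.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(render("\"First string\"", "\"a < b\""));
    }
}
```

Escaping is worth adding in practice because a source string containing < or & would otherwise produce malformed XLIFF.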
Other XPG Tasks Create XLIFF surrounding tags Create skeleton file Embed code for handling inline tags
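The slides do not describe the skeleton file format, so the following is only a hypothetical sketch of the idea: extraction replaces each string with a numbered slot and keeps the rest of the file untouched, and merging puts the translated strings back into those slots. The %%%n%%% slot syntax, the simplified string matcher and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical skeleton round trip; the slot syntax is invented for this sketch.
final class SkeletonRoundTrip {
    // Simplified string matcher, used only to keep the sketch short.
    static final Pattern LITERAL = Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");
    static final Pattern SLOT = Pattern.compile("%%%(\\d+)%%%");

    /** Adds each extracted string to 'units' and returns the skeleton text. */
    static String extract(String source, List<String> units) {
        Matcher m = LITERAL.matcher(source);
        StringBuffer skeleton = new StringBuffer();
        while (m.find()) {
            units.add(m.group(1));
            String slot = "\"%%%" + units.size() + "%%%\"";
            m.appendReplacement(skeleton, Matcher.quoteReplacement(slot));
        }
        m.appendTail(skeleton);
        return skeleton.toString();
    }

    /** Replaces slot n in the skeleton with translation n (1-based). */
    static String merge(String skeleton, List<String> translations) {
        Matcher m = SLOT.matcher(skeleton);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1)) - 1;
            // A real merger would re-escape quotes and backslashes here.
            m.appendReplacement(out, Matcher.quoteReplacement(translations.get(index)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> units = new ArrayList<>();
        String skeleton = extract("print(\"First string\");", units);
        System.out.println(skeleton);
        System.out.println(merge(skeleton, units));
    }
}
```

Merging the untranslated units back into the skeleton should reproduce the original file, which is a convenient sanity check for any extractor and merger pair.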
Inline Tags
Example:
–“Click on the {0} button to start the {1} job”
The {0} and {1} constitute inline tags
Not part of the grammar itself
Can vary from application to application
We must be able to extract these based on regular expressions:
–{[0-9]+}
XPG and Inline Tags
Embeds code to read a set of regular expressions from a file.
When the extractor identifies a string:
–Executes RE on string
–Moves matches to XLIFF inline tag
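A minimal sketch of the inline-tag step follows. It assumes the matches are wrapped in the XLIFF <ph> placeholder element with a running id; the slides do not say which inline element XPG emits. The pattern is hard-coded here rather than read from a file, and in java.util.regex the braces must be escaped, so {[0-9]+} becomes \{[0-9]+\}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: wraps placeholder matches in an assumed <ph> inline element.
final class InlineTagWrapper {
    // Java requires the braces to be escaped; Perl accepts {[0-9]+} as written.
    static final Pattern PLACEHOLDER = Pattern.compile("\\{[0-9]+\\}");

    /** Returns the text with every placeholder match wrapped in <ph id="n">...</ph>. */
    static String wrap(String text) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        int id = 1;
        while (m.find()) {
            String ph = "<ph id=\"" + id++ + "\">" + m.group() + "</ph>";
            m.appendReplacement(out, Matcher.quoteReplacement(ph));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("Click on the {0} button to start the {1} job"));
    }
}
```

Keeping the expressions outside the generated code, as the slide describes, means a project with a different placeholder convention only needs a different pattern file, not a new extractor.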
Final Architecture
XPG & XML
XPG and XML Applications A similar approach can be applied to XML Schemas Uses XSLT & DOM rather than JavaCC Can identify XML tags and attributes that may contain text
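The DOM side of this can be sketched with the standard Java XML APIs. The example assumes the set of translatable element names has already been identified (for instance from the schema); how XPG derives that set is not shown in the slides, and the class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative only: collects the text of elements marked as translatable.
final class XmlTextExtractor {
    /** Returns the trimmed text of every element whose tag name is in the given set. */
    static List<String> extract(String xml, Set<String> translatableTags) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> texts = new ArrayList<>();
            NodeList all = doc.getElementsByTagName("*"); // every element, document order
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                if (translatableTags.contains(e.getTagName())) {
                    texts.add(e.getTextContent().trim());
                }
            }
            return texts;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML input", ex);
        }
    }
}
```

Translatable attributes could be handled in the same loop with Element.getAttribute, given a second set of attribute names.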
Summary XPG is an approach to XLIFF transformation that corresponds to the grammar of the language being transformed. This ensures consistent, error-free and rapid XLIFF transformation. The XPG approach is suitable for computer languages and markup