TBX version 3 – Learning from users
Alan Melby, developed with Hanne Smaadahl
What is TBX?
TermBase eXchange
XML-based framework for representing structured terminological data.
Independent of programming language and operating systems.
Flexible enough to represent most of the information in a variety of terminology databases.
2018 Smaadahl / Melby

TBX, or TermBase eXchange, is the open, XML-based standard for exchanging structured terminological data. It is independent of programming language and operating system, and it is flexible enough to support a variety of terminology databases.

When we say “flexible enough to represent most of the information in a variety of terminology databases”, we mean two things:
1 - Structural relationships: we assume the termbase has three levels: concept, language, and term.
2 - Data categories: all the information in a termbase can be represented in TBX. Much of it can be represented in industry-standard data categories (also known as "data element types" or "fields"), depending on the target TBX dialect. The rest can be represented in note elements, using the "steamroller" approach.
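The "steamroller" approach mentioned above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name steamroll, the record fields, and the set of supported categories are all hypothetical, and the actual TBX Steamroller tool may work differently.

```python
def steamroll(record, supported):
    """Split a term record into supported fields and flattened notes.

    `record` maps a data category name to its value; `supported` is the
    set of category names the target dialect allows (both hypothetical).
    """
    kept = {}
    notes = []
    for category, value in record.items():
        if category in supported:
            kept[category] = value
        else:
            # Unsupported categories are not dropped: the name and value
            # are flattened into a plain-text note.
            notes.append(f"{category}: {value}")
    return kept, notes


# Hypothetical record from a termbase and a hypothetical target dialect.
record = {
    "term": "hard disk",
    "partOfSpeech": "noun",
    "reviewedBy": "J. Doe",
}
supported = {"term", "partOfSpeech"}

kept, notes = steamroll(record, supported)
print(kept)
print(notes)
```

The point of the design is that nothing is lost on export: information that has no matching data category in the target dialect survives as readable text in a note, even though it is no longer machine-interpretable as a separate field.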
Why do we have TBX?
Saving information in a termbase to a separate file: separating content from tool to support future software change
Exchanging information between systems (3 examples): authoring, translation, data mining
Guiding the design of a new termbase for interoperability

TBX is designed to satisfy a number of use cases. The main ones are:
Saving or archiving the information in a termbase. This separates the valuable content of your termbase from any specific tool, which is especially important when the software changes in the future.
Exchanging information between systems. Here are three examples:
  Authoring: sending monolingual information from a termbase to an authoring tool.
  Translation: sending a subset of the information from a termbase to a translator. This applies to both human and machine translation.
  Data mining: exporting most or all information from a termbase for analysis using XML tools, when you need more advanced analytics than your terminology management tool provides.
Guiding the design of a new termbase for interoperability with other termbases, and also with other tools that are used to repurpose terminology data.
History of TBX
2002: TBX 1.0 (LISA-OSCAR)
2008: TBX 2.0 (ISO 30042:2008)
2019: TBX 3.0 (ISO 30042:2019)

TBX has a long history dating back to the 1980s, when the need for a terminology format was first recognized. It has gone through several major iterations, evolving from SGML to XML, and this included close cooperation within the TEI (Text Encoding Initiative) during the 1990s. It was first published as an industry standard in 2002, when the OSCAR (Open Standards for Container/Content Allowing Re-use) working group of the now-disbanded Localisation Industry Standards Association (LISA) came out with the first version of TBX. The second generation of TBX was co-published in 2008 by LISA and ISO. LISA was disbanded in 2011. The third generation will be published in 2019 by ISO. TBX as an ISO standard is maintained in ISO Technical Committee 37, Language and terminology.

TBX predecessors: MATER (ISO 6156:1986) and MicroMATER (Melby, 1991). Cooperation within the TEI culminated in ISO 12200:1999, the MAchine-Readable Terminology Interchange Format (MARTIF).
What is the chaos we wish to tame?
Too powerful (complex)
TBX 2.0: no easy way to know what to expect*
*People didn’t know what to expect because they didn’t include an XCS file with each TBX document instance.

The “chaos” we are referring to arises when you receive a file that claims to be compliant with TBX version 2.0: it is usually hard for import software to know what to expect, and the import process often fails. This is because second-generation TBX used a complex mechanism, called an XCS (eXtensible Constraint Specification), to dynamically indicate what to expect in the file (by specifying constraints on metadata). The chaos was created because people did not follow the rule: they didn’t include an XCS file with each TBX document instance.
Sample XCS

<?xml version="1.0"?>
<TBXXCS name='DXFd-supplier' version="1.0" lang='en' xmlns="x-schema:TBX-XCS-XDRschema-v-0-1.xml">
  <header><title>subset DCS file for the Supplier example</title></header>
  <datCatSet>
    <termNoteSpec name="termType" datcatId="ISO12620A-0201">
      <contents datatype="picklist" targetType="none">fullForm abbreviatedForm</contents>
    </termNoteSpec>
    <descripSpec name="subjectField" datcatId="ISO12620A-04">
      <contents datatype="picklist" targetType="none">manufacturing finance</contents>
      <levels>termEntry</levels>
    </descripSpec>
    <descripSpec name="definition" datcatId="ISO12620A-0501">
      <contents datatype="noteText" targetType="none"/>
      <levels>termEntry</levels>
    </descripSpec>
  </datCatSet>
</TBXXCS>

Sample snippet of an XCS file.
How is complexity reduced?
No “generic” TBX
Specify the dialect
No more chaos!
A TBX file must belong to a dialect
TBX dialects are strictly constrained to certain data categories
3 current, public dialects: TBX-Core, TBX-Min, TBX-Basic
Public dialects follow a “telescoping” principle

TBX 3.0, on the other hand, uses a simple mechanism: a required dialect name on the root element of each TBX file. TBX 3.0 files cannot be “generic”; each must be an instance of a specific TBX dialect, for example “TBX-Basic” (it can no longer simply be “TBX”). This change addresses the single most common complaint about version 2.0: lack of predictability. Going forward, tools supporting the same dialect will know what to expect.

TBX dialects are strictly defined to include only certain data categories. The dialect name is linked to a formal description of what to expect. Public dialect descriptions are stored on freely available industry-standard websites, such as www.tbxinfo.net. All an import routine now needs to do is look at the dialect name to decide whether it is prepared to deal with that TBX file. Chaos has been tamed through a simple mechanism.
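The dialect check that an import routine performs can be sketched as follows. This is a minimal illustration, not a real importer: it assumes the dialect name is carried in a type attribute on the tbx root element, and the names SUPPORTED_DIALECTS, read_dialect, can_import, and the dialect "TBX-Example" are hypothetical. The sample documents are stand-ins and are not complete, valid TBX files.

```python
import xml.etree.ElementTree as ET

# Dialects this hypothetical importer is prepared to handle.
SUPPORTED_DIALECTS = {"TBX-Core", "TBX-Min", "TBX-Basic"}


def read_dialect(tbx_text):
    """Return the dialect name declared on the root element, or None."""
    root = ET.fromstring(tbx_text)
    # Assumption: the dialect name sits in a `type` attribute on the root.
    return root.get("type")


def can_import(tbx_text):
    """Decide from the dialect name alone whether to accept the file."""
    return read_dialect(tbx_text) in SUPPORTED_DIALECTS


# Minimal stand-in documents; real TBX files contain much more.
basic_file = '<tbx type="TBX-Basic"><text/></tbx>'
custom_file = '<tbx type="TBX-Example"><text/></tbx>'

print(read_dialect(basic_file), can_import(basic_file))
print(read_dialect(custom_file), can_import(custom_file))
```

The importer never has to inspect the body of the file to make this decision, which is the predictability that the dialect requirement is meant to provide.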
Telescoping principle
TBX-Core: Date, Term, Note
TBX-Min: Core + Administrative Status, Customer Subset, Part of Speech, Subject Field
TBX-Basic: Core + Min + Context, Definition, External xref, Gender, Geographical Usage, Project Subset, Related Concept, Related Term, Responsibility, Source, Term Location, Term Type, Transaction Type, xGraphic
TBX-??: Core + Min + Basic + needs of a given user community

There are three current public dialects of TBX: TBX-Core, TBX-Min, and TBX-Basic. These public dialects are built on a “telescoping” principle: TBX-Basic contains the data categories of TBX-Min, and TBX-Min contains the data categories of TBX-Core. If software supports TBX-Min, it can therefore partially support TBX-Basic. If it supports TBX-Basic, it can fully support TBX-Min. This “telescope” can be extended to future dialects.

The requirement for a dialect to be considered public is that it responds to the needs of a specific user community. For example, TerminOrgs, an organization that represents Terminology In Large Organizations, is working on a dialect that would further expand on TBX-Basic to meet the specific needs of terminology management in large organizations (compared to the more LSP-oriented or translation-oriented scenario of TBX-Basic).

(Photo: Canva, free stock photo)
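The telescoping principle amounts to a subset relationship between the sets of data categories, which a short Python sketch can make concrete. The category lists below are read off the slide diagram as best it can be reconstructed, so they should be treated as an assumption; the names DIALECTS, fully_supports, and unknown_categories are hypothetical.

```python
# Data categories per dialect as read from the slide; the spellings
# are illustrative and are not official data category identifiers.
CORE = {"Date", "Term", "Note"}
MIN = CORE | {
    "Administrative Status", "Customer Subset",
    "Part of Speech", "Subject Field",
}
BASIC = MIN | {
    "Context", "Definition", "External xref", "Gender",
    "Geographical Usage", "Project Subset", "Related Concept",
    "Related Term", "Responsibility", "Source", "Term Location",
    "Term Type", "Transaction Type", "xGraphic",
}

DIALECTS = {"TBX-Core": CORE, "TBX-Min": MIN, "TBX-Basic": BASIC}


def fully_supports(tool_dialect, file_dialect):
    """True if a tool built for tool_dialect knows every category
    that a file in file_dialect may contain."""
    return DIALECTS[file_dialect] <= DIALECTS[tool_dialect]


def unknown_categories(tool_dialect, file_dialect):
    """Categories in file_dialect that the tool does not know."""
    return sorted(DIALECTS[file_dialect] - DIALECTS[tool_dialect])


print(fully_supports("TBX-Basic", "TBX-Min"))
print(fully_supports("TBX-Min", "TBX-Basic"))
print(unknown_categories("TBX-Core", "TBX-Min"))
```

A tool for the larger dialect fully covers the smaller one, while a tool for the smaller dialect covers the larger one only partially; the second function lists exactly what it would miss.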
Features of 3.0
TBX 3.0: VALIDATION, INTEROPERABILITY, MODERNIZATION, EASE OF USE

VALIDATION: This version of the standard clearly defines the requirements for the Core, for dialect compliance, and for validating document instances with off-the-shelf XML tools (e.g. Oxygen).

INTEROPERABILITY features added in version 3.0 are:
The name of a TBX dialect must be declared as the value of the type attribute on the <tbx> root element, for example “TBX-Basic”.
TBX dialects are strictly defined to include certain data categories. There are currently 3 public dialects: Core, Min, Basic.
TBX 3.0 coordinates with certain aspects of OASIS XLIFF. This version implements the XLIFF 2.0 inline markup model.

This version of TBX has been MODERNIZED with features such as:
Introducing a simplified, more “modern” XML style, DCT (Data Category as Tag), alongside the traditional TBX style of DCA (Data Category as Attribute), which is preserved for legacy support. TBX dialects can be converted from one style to the other without data loss (the styles are isomorphic). Full elaboration of DCT is left for future versions. A preview is available on TBXinfo.
Introducing a permanent, default XML namespace for TBX 3.0. The Core namespace URN is urn:iso:std:iso:30042:ed:3.0
Adding a @dir attribute for text directionality (similar to dir in HTML or XLIFF) that is allowed to inherit structurally. The dir values are ltr (left-to-right), rtl (right-to-left), and auto (default).

Version 3.0 is EASIER TO USE because:
The XCS file in version 2.0 has been replaced by the requirement to declare the dialect name on the <tbx> root element.
It allows for validation using RNG Schema (or XSD), instead of DTD + XCS.
There is free access to machine-readable artifacts and supporting documentation on public websites, e.g. tbxinfo.net.
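Structural inheritance of the dir attribute can be sketched in Python as follows. This is an illustration under assumptions, not the normative rule from the standard: the element names conceptEntry, langSec, termSec, and term, the placement of dir on those elements, and the function name effective_dirs are assumed for the example, and the sample is not a validated TBX file. The namespace URN is the one quoted in the text.

```python
import xml.etree.ElementTree as ET

NS = "urn:iso:std:iso:30042:ed:3.0"  # namespace URN as quoted in the text

# Hypothetical sample; element names and dir placement are assumptions.
SAMPLE = f"""
<tbx xmlns="{NS}" type="TBX-Basic" dir="ltr">
  <text><body>
    <conceptEntry id="c1">
      <langSec dir="rtl">
        <termSec><term>example-1</term></termSec>
      </langSec>
      <langSec>
        <termSec><term dir="auto">example-2</term></termSec>
        <termSec><term>example-3</term></termSec>
      </langSec>
    </conceptEntry>
  </body></text>
</tbx>
""".strip()


def effective_dirs(element, inherited="auto", found=None):
    """Collect the effective direction for every term element.

    An element's own dir attribute wins; otherwise the value is
    inherited from the nearest ancestor, falling back to "auto".
    """
    if found is None:
        found = {}
    current = element.get("dir", inherited)
    if element.tag == f"{{{NS}}}term":
        found[element.text] = current
    for child in element:
        effective_dirs(child, current, found)
    return found


root = ET.fromstring(SAMPLE)
for term, direction in effective_dirs(root).items():
    print(term, direction)
```

Because ElementTree elements do not keep a pointer to their parent, the sketch passes the inherited value down during a recursive walk instead of looking upward from each term.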
Tools to help you
Dialect toolkit for public dialects Core, Min and Basic
TBX “Spyglass” (analyzing TBX files without looking at XML)
MultiTerm to TBX conversion: Mapping Wizard, MultiTerm-to-TBX Converter, collaboration with Glossary Converter
TBX “Steamroller”
TBX v2-to-v3 Conversion
TBX v3 Validation
You only need the ISO 30042 standard to define a new dialect

There are many tools available to help you, free of charge. A good starting point is TBXinfo.net. This is a community effort and the resources are freely available. We invite you to join and contribute to this community.
Dialect toolkits for the public dialects Core, Min and Basic.
TBX “Spyglass” allows you to analyze TBX files without looking at XML.
MultiTerm to TBX conversion. This is a plugin for MultiTerm and is the result of collaboration with Gerhard Kordmann’s Glossary Converter. It is perhaps the most downloaded MultiTerm plugin.
TBX “Steamroller” helps you convert any TBX file into valid TBX-Basic output, including the ability to convert invalid TBX into valid TBX.
TBX v2-to-v3 Conversion
TBX v3 Validation
What’s next?
See www.tbxinfo.net (TBX website)
We help you!
Import/export to/from various CAT tools
We need you!
Photo: SAP image library (royalty free)
Demos
Photo: SAP image library (SAP owned)
Thank You
Hanne Smaadahl, Senior Terminologist, SAP & Project Lead ISO 30042 v3.0, hanne.smaadahl@sap.com
Alan K. Melby, LTAC Global President & Professor Emeritus, Brigham Young University (BYU), akm@ltacglobal.org
Appendix
TBX sample