Standards for language resources: the ISO/TC 37(/SC 4) perspective
Laurent Romary, Directeur de Recherche, INRIA; ISO/TC 37/SC 4 chair
Context
- ISO TC37 - Terminology and other language resources
- SC3 - Computer applications in terminology
  - ISO 12200 - Martif
  - ISO 12620 - Data categories (under revision)
  - ISO 16642 - TMF (Terminological Markup Framework)
- SC4 - Language Resource Management
  - www.tc37sc4.org
An example scenario: information extraction
- Semantic content
- Content analysis
- Syntactic structures
- Chunk parsing
- Part-of-speech (POS) tagging
- Primary Data
Horizontal view (W3C perspective)
- Processing stack: Semantic content, Content analysis, Syntactic structures, Chunk parsing, Part-of-speech (POS) tagging, Primary Data
- W3C technologies shown across the stack: OWL, XML, RDF, SOAP
Vertical view (ISO/TC 37/SC 4 perspective)
- Processing stack: Semantic content, Content analysis, Syntactic structures, Chunk parsing, Part-of-speech (POS) tagging, Primary Data
- Shown alongside the stack: Evaluation, Linguistic models and descriptors (Data Categories), Lexica
Linguistic information sources …and initiatives
- Access protocols [Corba, SOAP]
- Primary resources (text, dialogues) [TEI, MPEG7, TMX, XLIFF, XHTML, etc.]
  - Structural mark-up
  - Basic annotations
- Knowledge structures [Topic Maps, OIL, RDF]
  - Hierarchies of types
  - Relations between concepts (subjects/topics etc.)
  - Links to primary resources
- Links
- NLP structures (annotations) [Eagles/ISLE, CES, MATE,…]
  - POS tagging
  - Chunks (cf. Named Entities)
  - Deep Syntactic structures
  - Co-references etc.
- Lexical structures (Language models) [TBX, OLIF, Eagles/ISLE (Genelex)]
  - Terminologies
  - Transfer lexica
  - LTAG/HPSG/LFG lexica
- Meta-data [Dublin core, OLAC, ISLE, MPEG7, RDF]
SC4 Approach
- Efforts geared towards defining abstract models and general frameworks for the creation and representation of language resources
- In principle, abstract enough to accommodate diverse linguistic, theoretical or practical approaches
- No provision of new formats
- Situate development squarely in the framework of XML and related standards
- Ensure compatibility with established and widely accepted web-based technologies
- Ensure feasibility of transduction from legacy formats into newly defined formats
SC4 and other standardizing bodies
- Contributing organizations
- TEI: text representation; reference for primary sources, e.g. text archives (Oscar Text)
- W3C: basic protocols and formats: XML (Schemas), XPath, XPointer + RDF, SVG, SMIL, SOAP
- ISO TC37/SC4: language resources, NLP perspective, e.g. linguistic annotations, lexical formats
- Technical background
- MPEG: multimedia, XML based, e.g. MPEG7-4; word and phone lattices; audio/speech
ISO/TC 37/SC 4 structure
- WG1: Basic descriptors and mechanisms for language resources
- WG2: Representation schemes
- WG3: Multilingual text representation
- WG4: Lexical databases
- WG5: Workflow of language resource management
- Data categories (shared across the working groups)
On-going activities
- Feature structure representation (in collaboration with the TEI - Text Encoding Initiative): ISO DIS 24610
- Morpho-syntactic annotation: ISO NP 24611
- Lexical markup framework: ISO NP 24612 (+ ISO NP 12620-3)
- Task force on Meta-data for language resources (OLAC+IMDI)
- ACL/Sigsem working group on multimodal content representation
- Data category registry for ISO/TC 37: ISO CD 12620-1 on ballot (deadline Jan. 2004)
Modeling linguistic annotation structures
General framework - 1
- Model for linguistic annotation that can be instantiated in a standard representational format
- GMT (Generic Mapping Tool): serves as a pivot format into and out of which proprietary formats may be transduced, to enable comparison, merging and manipulation via common tools
- Reference: ISO 16642 - Terminological Markup Framework
General framework - 2
- A meta-model
  - A general, underlying model that informs current practice
- A set of data-categories
  - Provides the precise semantics of the format
  - Obtained by sub-setting a Data Category Registry
  - Obtained by providing application-specific categories
- Vs. terminology - fixed named levels
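The two ways of obtaining a set of data-categories can be sketched in a few lines of Python. This is only an illustration: the registry contents, the `profile` labels, the function name `select_categories` and the category `projectCode` are assumptions made for the example, not part of ISO 12620 or ISO 16642.

```python
# Hypothetical miniature registry. The category names echo the slides;
# the grouping into profiles is an assumption made for this sketch.
REGISTRY = {
    "partOfSpeech": "morpho-syntax",
    "grammaticalGender": "morpho-syntax",
    "grammaticalNumber": "morpho-syntax",
    "subjectField": "terminology",
    "definition": "terminology",
    "termType": "terminology",
}


def select_categories(registry, wanted, application_specific=()):
    """Build a data category selection.

    The selection is a subset of the registry, extended with
    application-specific categories that the registry does not define.
    """
    missing = [name for name in wanted if name not in registry]
    if missing:
        raise KeyError(f"not in registry: {missing}")
    selection = {name: registry[name] for name in wanted}
    for name in application_specific:
        selection[name] = "application-specific"
    return selection


selection = select_categories(
    REGISTRY,
    wanted=["subjectField", "definition", "termType"],
    application_specific=["projectCode"],  # hypothetical local category
)
print(selection)
```

Combined with the meta-model, such a selection is what would pin down one concrete format.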
ISO 16642: A family of formats
- TMF defines a family of terminological markup languages: TML1, TML2, TML3, … TMLi
- Geneter and TBX are shown as members of the family
- GMT
Meta-model (* marks a component that may occur several times)
- Terminological Data Collection (TDC)
  - Global Information (GI)
  - Complementary Information (CI)
  - Terminological Entry (TE) *
    - Language Section (LS) *
      - Term Section (TS) *
        - Term Component Section (TCS) *
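As a rough sketch, the nesting of the meta-model can be written down as Python dataclasses. The class and field names below (`info`, `terms`, `languages`, `entries`, `terms_in`) are choices made for this illustration and are not defined by ISO 16642; the sample values come from the TMF example in this deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TermComponentSection:  # TCS
    info: Dict[str, str] = field(default_factory=dict)


@dataclass
class TermSection:  # TS
    info: Dict[str, str] = field(default_factory=dict)
    components: List[TermComponentSection] = field(default_factory=list)


@dataclass
class LanguageSection:  # LS
    info: Dict[str, str] = field(default_factory=dict)
    terms: List[TermSection] = field(default_factory=list)


@dataclass
class TerminologicalEntry:  # TE
    info: Dict[str, str] = field(default_factory=dict)
    languages: List[LanguageSection] = field(default_factory=list)


@dataclass
class TerminologicalDataCollection:  # TDC
    global_info: Dict[str, str] = field(default_factory=dict)          # GI
    complementary_info: Dict[str, str] = field(default_factory=dict)   # CI
    entries: List[TerminologicalEntry] = field(default_factory=list)


def terms_in(entry: TerminologicalEntry, lang: str) -> List[str]:
    """Collect the terms of one entry for a given language."""
    return [
        ts.info["term"]
        for ls in entry.languages
        if ls.info.get("lang") == lang
        for ts in ls.terms
    ]


entry = TerminologicalEntry(
    info={"id": "ID67", "subjectField": "manufacturing"},
    languages=[
        LanguageSection(
            info={"lang": "en"},
            terms=[
                TermSection(
                    info={"term": "alpha smoothing factor", "termType": "fullForm"}
                )
            ],
        )
    ],
)
collection = TerminologicalDataCollection(entries=[entry])
print(terms_in(entry, "en"))
```

Data categories are kept as plain key/value pairs here so that the structural skeleton stays separate from the descriptors attached to each level.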
TMF: example
- TE: id='ID67', subjectField='manufacturing', definition='A value…'
  - LS: lang='en'
    - TS: term='alpha smoothing factor', termType='fullForm'
  - LS: lang='hu'
    - TS: term='…'
Implementation in TBX (cf. www.lisa.org)
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
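To show how such an entry can be read back into the TE/LS/TS levels, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The fragment is a shortened copy of the entry with only the English section, and the function name `terms_by_language` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's entry, using the element and attribute
# names exactly as they appear on the slide.
TBX_FRAGMENT = """
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
</termEntry>
"""


def terms_by_language(xml_text):
    """Map each language section (langSet) to the terms it contains."""
    entry = ET.fromstring(xml_text.strip())
    result = {}
    for lang_set in entry.findall("langSet"):
        lang = lang_set.get("lang")
        result[lang] = [term.text for term in lang_set.iter("term")]
    return result


entry = ET.fromstring(TBX_FRAGMENT.strip())
subject_field = entry.find("descrip[@type='subjectField']").text

print(entry.get("id"), subject_field)
print(terms_by_language(TBX_FRAGMENT))
```

Each `langSet` plays the role of a Language Section and each `tig` that of a Term Section, so the traversal mirrors the meta-model.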
Implementing a Data Category Registry for ISO TC37
Data Category
- Definition: Elementary descriptor used in a linguistic description or annotation scheme
- Example: /Part of speech/, /Grammatical gender/, /Grammatical number/, /Feminine/, /Plural/, /Ablative/
- Background:
  - Experience gained from ISO 16642 in linguistic format specification
  - Wider notion of data-categories as meta-data for tagged language resources
Multiple uses of data categories
- Data category selection
- Meta model
- Documentation
- Meta-data
- XML schemas
- XSL filters
Application domains
- Terminological data collection (TC 37/SC 3)
  - Cf. “old” ISO 12620 set of data categories for terminology
- Language codes (TC 37/SC 2)
  - Cf. evolution from ISO 639-1 and ISO 639-2 to ISO 639-4
- On-going and future SC4 activities (TC 37/SC 4)
  - Meta-data for language resources
  - Morpho-syntax/Syntax, Discourse level annotation
  - NLP lexica, MT lexica
  - Multilingual data representations (e.g. translation memories) and access (query languages)
Technical background
- ISO 11179 (ISO JTC 1/SC 32): meta-data registry view
  - Provides mechanisms for the management of data categories
- ISO 16642 (ISO TC 37/SC 3): terminology view
  - Provides ways of dealing with multilingual issues
- OWL (W3C Sem. Web activity): ontology view
  - Provides a framework for dealing with hierarchies and expressing constraints on data-categories
  - E.g. a /noun/ can be described by means of /gender/ and /number/ in French
Relation to ISO 11179
- Complex datcat /gender/: data element concept
- Set of simple datcats /masculine/, /feminine/, /neuter/: conceptual domain
- XML object, implemented as an XML attribute named 'gen': data element
- List of values m, f, n: value domain
- XML schema declaration
- Example: <w lemme="vert" gen="f">verte</w>
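The mapping from a data category to its XML implementation can be illustrated with a small Python check. This is a sketch under assumptions: the dictionary layout and the function name `gender_of` are invented for the example; only the attribute name 'gen', the values m, f, n and the sample element come from the slide.

```python
import xml.etree.ElementTree as ET

# Assumed in-memory form of the mapping: the data element is the XML
# attribute 'gen', its value domain maps onto the simple data categories.
GENDER = {
    "attribute": "gen",
    "values": {"m": "masculine", "f": "feminine", "n": "neuter"},
}


def gender_of(word_xml, datcat=GENDER):
    """Return the simple data category encoded on a <w> element, if any."""
    element = ET.fromstring(word_xml)
    code = element.get(datcat["attribute"])
    if code is None:
        return None
    if code not in datcat["values"]:
        raise ValueError(f"{code!r} is outside the value domain")
    return datcat["values"][code]


print(gender_of('<w lemme="vert" gen="f">verte</w>'))
```

A schema declaration would enforce the same value domain at validation time; the check above only makes the correspondence explicit.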
The ISO 12620-1 proposal
Entry Identifier: gender
Profile: morpho-syntax
Definition (fr): Catégorie grammaticale reposant, selon les langues et les systèmes, sur la distinction naturelle entre les sexes ou sur des critères formels (Source: TLFi)
Definition (en): Grammatical category… (Source: TLFi (Trad.))
Conceptual Domain: {/feminine/, /masculine/, /neuter/}
  Object Language: fr
    Name: genre
    Conceptual Domain: {/feminine/, /masculine/}
  Object Language: en
    Name: gender
  Object Language: de
    Name: Geschlecht
    Conceptual Domain: {/feminine/, /masculine/, /neuter/}
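The entry shows a general conceptual domain that individual object languages may narrow. A minimal Python sketch of that lookup follows; the dictionary layout and the function `allowed_values` are assumptions for illustration, and falling back to the general domain when a language section gives none (as for en) is an interpretation of the entry, not something the proposal states.

```python
# Assumed in-memory form of the entry; field names are illustrative.
GENDER_ENTRY = {
    "identifier": "gender",
    "profile": "morpho-syntax",
    "conceptual_domain": {"feminine", "masculine", "neuter"},
    "languages": {
        "fr": {"name": "genre", "conceptual_domain": {"feminine", "masculine"}},
        "en": {"name": "gender"},
        "de": {
            "name": "Geschlecht",
            "conceptual_domain": {"feminine", "masculine", "neuter"},
        },
    },
}


def allowed_values(entry, lang):
    """Return the conceptual domain that applies to one object language.

    Assumption: a language without its own domain inherits the general one.
    """
    section = entry["languages"].get(lang, {})
    return section.get("conceptual_domain", entry["conceptual_domain"])


print(sorted(allowed_values(GENDER_ENTRY, "fr")))
print("neuter" in allowed_values(GENDER_ENTRY, "de"))
```

With this layout an annotation tool could reject /neuter/ on French data while accepting it on German data.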
Perspectives
ISO/TC 37/SC 4 in a wider picture
- Basic building blocks to bring coherence to the representation of linguistic information in a variety of application domains
  - E.g. e-documentation, e-learning, e-business (e-catalogues), multimedia, localisation…
- Provide vertical solutions for linguistically based applications
  - E.g. information extraction, indexing