Hypermedia Lexica and Lexicon Metadata The MetaLex model in the ModeLex project Dafydd Gibbon U Bielefeld Europe E-MELD Workshop, Detroit, August 2002
Overview
  Metalex goals
    Background: DATR, HyprLex, Speech, Language Documentation
  Metalex design: theory and practice
    Lexical documents & metadocuments
    Lexical objects, properties, structures
  Metalex implementation
    Ivory Coast encyclopaedia project
    Ega documentation model project
    The Modelex (multimodal lexicon) project
    Ivory Coast + Nigeria documentation curriculum project
  Extending Metalex
    Modalities & submodalities
    Data-driven lexicography
    Data structures & algorithms: trees, lattices; induction, inference
Metalex goals: background
  General objectives:
    Versatile high quality spoken language lexicography
    Motivated balance of high-tech + low tech
    Good resources are data-driven and theory-informed
  Specific project objectives:
    DATR/ILEX: formal lexicon theory and implementation
    VerbMobil: integrated HyprLex dissemination model
    HyprLex encyclopaedia model for Ivory Coast Languages
    Ega endangered language documentation model
    Modelex - theory and design of multimodal lexica
    Ivory Coast and Nigeria curricula for language documentation
Metalex design: data and theory
  Data-driven data + metadata acquisition:
    Systematic metatext derived from and supporting...
      Computational fieldwork
      Induction of lexica
  Theory-informed data + metadata acquisition:
    Integrated Lexicon (ILEX) consisting of...
      Abstract Lexicon (ALEX) - "theory" in the mathematical sense
      Object Lexicon (OLEX) - "model" in the mathematical sense
Metalex design: data
  Data-driven acquisition:
    Computational fieldwork
      Portable metadatabase with restricted vocabulary and general metatext, and
      Definition of and support for transcription + annotation
      Portable support for scenarios, scripts
      Portable support for lexicon processing
    Induction of lexica
      Lexicon tools for
        Extraction of macrostructural elements (lexeme elements)
        Induction of microstructural information (media concordance, POS,...)
        Induction of mesostructural regularities and subregularities (grammar,...)
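The induction steps named on this slide can be illustrated with a toy sketch. It assumes the corpus is available as a list of time-aligned tokens (label, start, end); the data, the token format and the function name `induce_concordance` are hypothetical and are not taken from the Metalex tools.

```python
from collections import defaultdict

# Hypothetical time-aligned transcription: (word label, start in s, end in s).
tokens = [
    ("a", 0.00, 0.12),
    ("b", 0.12, 0.40),
    ("a", 0.40, 0.55),
    ("c", 0.55, 0.90),
]

def induce_concordance(tokens):
    """Map each word type to the signal intervals of its occurrences."""
    concordance = defaultdict(list)
    for label, start, end in tokens:
        concordance[label].append((start, end))
    return dict(concordance)

# Microstructural information: a media concordance per word type.
concordance = induce_concordance(tokens)

# Macrostructural elements: the ordered list of entry keys.
lexemes = sorted(concordance)
```

Each interval points back into the recording, so an entry can link to every audio occurrence of its headword.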
Metalex design: theory
  Theory-informed formalisation:
    Abstract Lexicon (ALEX) - "theory" in the mathematical sense
      Decomposition (componential A-V description)
      Generalisation (inheritance)
      Composition (multilinear operations)
    Object Lexicon (OLEX) - "model" in the mathematical sense
      XML archiving and dissemination formats
      object-relational database acquisition and processing formats
    = Integrated Lexicon (ILEX)
Metalex implementation: architecture
  Data model Theory = shared lexicon architecture:
    Macrostructure: declarative and procedural components
      Lexicon architecture: relational, inheritance, text,...
      Lexical objects: entry types
      Lexical access: fact query, semasiological / onomasiological indexing
    Mesostructure:
      Generalisations: grammar, phonetics, cultural background,...
      Composition of lexicon object types: idioms, words, morphemes,...
      Lexical access: inferential query
    Microstructure:
      Lexical entry (article, lemma structure - atom, string, tree,...)
      Types of lexical information - standardly: "lexicon model"
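Semasiological (form to meaning) and onomasiological (meaning to form) access can be sketched as two indexes over the same entries. The three entries reuse the Anouman identifiers and English glosses from the field lexicon example in this talk. Treating "Anouman" as the shared lookup form, and the name `build_indexes`, are assumptions made for illustration.

```python
from collections import defaultdict

# (entry id, lookup form, English gloss); the lookup form is an assumption.
entries = [
    ("Anouman_1", "Anouman", "bird"),
    ("Anouman_2", "Anouman", "grandchild"),
    ("Anouman_3", "Anouman", "yesterday"),
]

def build_indexes(entries):
    """Return (semasiological index, onomasiological index)."""
    by_form = defaultdict(list)     # form -> [(entry id, gloss), ...]
    by_meaning = defaultdict(list)  # gloss -> [(entry id, form), ...]
    for entry_id, form, gloss in entries:
        by_form[form].append((entry_id, gloss))
        by_meaning[gloss].append((entry_id, form))
    return dict(by_form), dict(by_meaning)

by_form, by_meaning = build_indexes(entries)
```

Both indexes answer fact queries only; inferential queries over generalisations would need the mesostructure as well.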
Metalex implementation: microstructure
  Microstructure specification philosophy:
    Anybody can specify any kind of unpredictable detail
    Questionnaire / Experiment / Corpus / Archive dependence
    Lexicon architecture: relational, inheritance, text,...
    Intelligent (semi-)automatic classification, not fixed attributes
  Theory-informed coarse grouping is possible
    Media attributes: visual, auditory, tactile,...
    Meaning attributes: definition, gloss, lexical relations,...
    Composition attributes: context/category, parts, operations
    Use attributes: style, register, concordance, media illustrations,...
    Micrometadata attributes: lexicographer DB indices, source (e.g. fieldwork metadata) DB indices, modification,...
Metalex implementation: fieldwork metadata source (1)
  Situation dimensions
    participant: fieldworker, partners, contacts
    channel: modalities, media
    locale: indoor/outdoor, spatial configuration
    temporal: date, time, calendar event
    functional: affiliation, role, occasion; observation (prompt, metadata management)
  Language dimension
    affiliation
    discourse level: discourse type, genre + prosody
    phrase level: recursive phrasal categories/relations + prosody
    word level: clitics, inflexion, word formation + prosody
Metalex implementation: fieldwork metadata source (2)
  Technical dimension
    physical characteristics of participants: age, sex, health
    physical characteristics of locale: indoor/outdoor, spatial configuration, temporal sequence, date (season), time (of day)
    audio: mike type, position, room; A/D; channels, f sample, resolution; formats
    video: camera & microphone type, analogue/digital; filters, lenses; audio; formats
    other sensors: laryngograph, airflow, data glove,...
  Metalinguistic dimension
    empirical method: introspection, experiment, corpus
    elicitation materials: questionnaire, experiment layout, corpus scenario
    metadata specification: index, metatext type, metacatalogue type
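A fieldwork metadata record along these dimensions might be sketched as nested data classes with one restricted-vocabulary field. The class and field names are hypothetical and cover only a few of the listed attributes. The sample values are illustrative: name, place and date are borrowed from the field lexicon example in this talk, and the technical values are invented.

```python
from dataclasses import dataclass, asdict

# Restricted vocabulary for the empirical method, as listed on the slide.
EMPIRICAL_METHODS = {"introspection", "experiment", "corpus"}

@dataclass
class Situation:
    participants: list  # fieldworker, partners, contacts
    channel: str        # modalities, media
    locale: str
    date: str

@dataclass
class Technical:
    mike_type: str
    sample_rate_hz: int
    channels: int

@dataclass
class FieldworkMetadata:
    situation: Situation
    language: str
    technical: Technical
    empirical_method: str

    def __post_init__(self):
        # Enforce the restricted vocabulary when the record is created.
        if self.empirical_method not in EMPIRICAL_METHODS:
            raise ValueError(f"unknown empirical method: {self.empirical_method!r}")

record = FieldworkMetadata(
    situation=Situation(
        participants=["S. Adouakou"],
        channel="audio",
        locale="Adaou village, CI",
        date="March 2002",
    ),
    language="Anyi",
    # Hypothetical technical values, for illustration only.
    technical=Technical(mike_type="headset", sample_rate_hz=44100, channels=1),
    empirical_method="corpus",
)

flat = asdict(record)  # nested dict, ready for export
```

Checking the vocabulary at entry time keeps free-text variants out of the metadatabase, and open metatext fields could sit alongside the restricted ones.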
Metalex implementation: fieldwork metadata entry tool LREC 2002, Workshop on Portability Issues
Metalex implementation: fieldwork metadata entry tool HanDBase DBMS for PalmOS
Metalex objects
  in conjunction with work in ISLE CLWG (Computational Lexicon Working Group) (see Gibbon in reading list)
  LEXICON: {, }
  Macrostructure: Ordering( {ENTRY,...} )
  Mesostructure: Mesostructure: ENTRY:
The LEXICON object
  Front Matter Metadata:
    Bibliographical: creator, publisher, title, date,...
    Medium / format: paper, CD-ROM/DVD, web,...
    Macrostructure type:
      access: semasiological/onomasiological, n-lingual/langue(s),
      special: taxonomy (thesaurus), concordance
      structure, e.g. tabular: f(type,attrib)=value
The ENTRY object: metadata
  Entry Metadata: (see Gibbon & al. in reading list)
    Entry type (wrt macrostructure specification):
      encyclopaedic multiword unit, word,...
    Microstructure data model specification:
      entry structure: flat, tree, graph (net),...
      data categories specification (attribute, field, information type)
        DC groups - structural skeleton
        DCs
        DC substructure - homography, homophony, polysemy...
The ENTRY object: DC groups
  Media ("surface"):
    acoustic (phonetic, earcon, sonification,...), visual (orthography, icon, gesture,...)
  Composition (structure):
    part (e.g. morphology for words), context (e.g. POS, subcat for words)
  Meaning (definition, illustration):
    semantic (components, relations, senses, ontology)
    pragmatic (speech act, dialogue, disfluency,...)
  Use:
    typically: media (e.g. audio) concordance,...
  Metadata:
    lexicographer,...
The ENTRY object: DCs
  Countless Data Category models: (see reading list)
    every existing dictionary
    linguistic "types of lexical information"
    several European projects (GENELEX, MULTILEX, ACQUILEX,...)
    ISO terminology norms (cf. MARTIF etc.)
The ENTRY object: DC structures
  Computationally relevant properties of fields:
    type (atomic, complex: tree, string, xyz-formatted text)
    character encoding spec.: ASCII, Unicode, xyz
    tree (or other graph/net):
      finite depth flat, disjunctive disjunctive tree recursive graph (net) table, non-tree graph, anchor/link/index structure
    generated text: print, hypertext (compiled vs. dynamic, i.e. generated on the fly)
Metalex microstructure application
  Media ("surface"):
    phonemic & tonemic transcription (SAMPA ASCII - still waiting for Unicode...)
  Composition (structure):
    morphemic substructure, category & subcategory
  Meaning (definition, illustration):
    glosses (English, French, German)
    definitions, senses, relations, components; audio-visual illustration
  Use:
    genres; examples (e.g. concordance link); free text notes
  Metadata:
    first record; last field
Metalex field lexicon microstructure
  Anouman_1:
    Media attributes:
      Phonemic tier: `an'U~m`'a~
      Skeletal tier: VNVNV
      Tonal tier: L H LH
      Signal tier: Audio
    Meaning attributes:
      F-gloss: Oiseau
      E-gloss: Bird
      G-gloss: Vogel
      Definition: avis
      Homophone full: Anouman_2: grandchild
      Homophone phonemic: Anouman_3: yesterday
    Use:
      Genre: narrative
    Metadata:
      Lexicographer: S. Adouakou
      Source: Bielefeld-Anyi-Corpus, Adaou village, CI
      Date: March 2002
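As an illustration of how such an entry could be archived, the sketch below holds part of the Anouman_1 microstructure as a nested dictionary and serialises it to XML with Python's standard library. The element and attribute names (`entry`, `group`, `dc`, `name`) are hypothetical and are not the Metalex XML format.

```python
import xml.etree.ElementTree as ET

# Part of the Anouman_1 entry from the slide, grouped by DC group.
entry = {
    "id": "Anouman_1",
    "media": {
        "phonemic": "`an'U~m`'a~",
        "skeletal": "VNVNV",
        "tonal": "L H LH",
        "signal": "Audio",
    },
    "meaning": {
        "F-gloss": "Oiseau",
        "E-gloss": "Bird",
        "G-gloss": "Vogel",
        "definition": "avis",
    },
    "use": {"genre": "narrative"},
    "metadata": {
        "lexicographer": "S. Adouakou",
        "source": "Bielefeld-Anyi-Corpus, Adaou village, CI",
        "date": "March 2002",
    },
}

def entry_to_xml(entry):
    """Build an <entry> element with one <group> per DC group."""
    root = ET.Element("entry", id=entry["id"])
    for group, fields in entry.items():
        if group == "id":
            continue
        group_el = ET.SubElement(root, "group", name=group)
        for name, value in fields.items():
            dc = ET.SubElement(group_el, "dc", name=name)
            dc.text = value
    return root

xml_text = ET.tostring(entry_to_xml(entry), encoding="unicode")
```

Storing each data category name as an attribute value rather than as an element name lets lexicographers add unforeseen categories without changing the document structure.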
Metalex portable lexical database
  Relational database:
    Metalex specs
    flattened structure
    re-constitution via Metalex specs
  HanDBase for PalmOS
    Features:
      standard full RelDBMS
      XML, CSV, text export
      export/import via GSM
      inexpensive (wrt laptop)
      stylus, keyboard, sync input
      lightweight
      low power consumption
      inconspicuous in use
      interfaces to Scheme, C
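The idea of flattening a structured entry into relational rows and reconstituting it later can be sketched as follows. This is a minimal illustration under stated assumptions, not the HanDBase schema or the Metalex specs. The row layout (entry, dotted path, value) and all function names are hypothetical, and the sketch assumes that keys contain no dots.

```python
import csv
import io

def flatten(entry_id, node, path=()):
    """Yield (entry id, dotted path, value) rows for a nested dict."""
    for key, value in node.items():
        if isinstance(value, dict):
            yield from flatten(entry_id, value, path + (key,))
        else:
            yield (entry_id, ".".join(path + (key,)), value)

def reconstitute(rows):
    """Rebuild nested dicts, keyed by entry id, from flattened rows."""
    entries = {}
    for entry_id, dotted, value in rows:
        node = entries.setdefault(entry_id, {})
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return entries

def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["entry", "path", "value"])
    writer.writerows(rows)
    return buf.getvalue()

nested = {
    "meaning": {"E-gloss": "Bird", "F-gloss": "Oiseau"},
    "use": {"genre": "narrative"},
}
rows = list(flatten("Anouman_1", nested))
csv_text = to_csv(rows)
```

The dotted path plays the role of the structure specification here, since it is the only information needed to rebuild the tree from the flat table.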
Metalex extension
  The Modelex project: "Theory and Design of Multimodal Lexica"
  Goals:
    Data-driven, theory-informed lexicon models
    Formal properties of abstract data models for multimodal lexica
    Interpretation of abstract data models in XML
    Integration of parallel annotation lattices for modalities and submodalities
    Development of a prototype multimodal lexicon
The Modelex domain: modalities and submodalities
Modelex: data driven lexicography
Modelex: gesture annotation
  Time Aligned Signal Corpus System (Java, GPL)
    Jan-Torsten Milde, U Bielefeld
  TASX annotator:
    Phonological tier
    ToBI tiers
    Gesture tier
    Speech Act tier
  Anyi, Ega, German
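Relating parallel tiers of a time-aligned annotation comes down to finding events whose intervals overlap. The sketch below shows this for two toy tiers. The tuple format (start, end, label), the labels and the function names are hypothetical and do not reflect the TASX file format.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share a non-empty time span."""
    return a[0] < b[1] and b[0] < a[1]

def align(tier_a, tier_b):
    """Pair each event in tier_a with the labels of overlapping tier_b events."""
    return [
        (label_a, [label_b for (sb, eb, label_b) in tier_b
                   if overlaps((sa, ea), (sb, eb))])
        for (sa, ea, label_a) in tier_a
    ]

# Hypothetical tiers: (start in s, end in s, label).
words = [(0.0, 0.4, "w1"), (0.4, 0.9, "w2")]
gestures = [(0.3, 0.7, "g1")]

gesture_to_words = align(gestures, words)
word_to_gestures = align(words, gestures)
```

Here the gesture spans the boundary between two words, so it is paired with both. A multimodal concordance would record this kind of cross-tier link.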
Model-theoretic compilation in ILEX: INTERPRETATION ( ALEX ) = OLEX
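One way to read INTERPRETATION ( ALEX ) = OLEX is as a compilation step that expands an inheritance-based abstract lexicon into fully specified entries. The sketch below uses simple default inheritance over a toy hierarchy. The node names, attributes and the function `interpret` are hypothetical, and the sketch is far simpler than DATR-style inference.

```python
# Hypothetical abstract lexicon: each node has an optional parent
# and its own local attribute-value pairs.
ALEX = {
    "Word": {"parent": None, "attrs": {"medium": "audio", "pos": "unknown"}},
    "Noun": {"parent": "Word", "attrs": {"pos": "noun"}},
    "bird_entry": {"parent": "Noun", "attrs": {"gloss": "bird"}},
}

def interpret(alex):
    """Expand every node into a fully specified attribute-value structure."""
    def resolve(name):
        node = alex[name]
        inherited = resolve(node["parent"]) if node["parent"] else {}
        # Local values override inherited defaults.
        return {**inherited, **node["attrs"]}
    return {name: resolve(name) for name in alex}

OLEX = interpret(ALEX)
```

The abstract lexicon states each generalisation once, while the object lexicon repeats it in every entry that inherits it. The flat form is the one suited to archiving and database storage.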
Metalex in the Modelex project: Multimodal concordance as microstructure DC
  Prototype:
Metalex in the Modelex project: underspecified ALEX microstructure for gesture coordinates
  Hand: == "Palm" "Digit" == " " "> == " " " " " " " " <> ==.
  Palm: == == palm == pw == ph == == ( + ( - ) / 3 ) == ( + ( - ) * 2 / 3 ) == == px1 == py1 == ( + ) <> == Hand.
Metalex in the Modelex project: fully specified ALEX microstructure for gesture coordinates
  Hand: =
    palm px1 py1 ( px1 + pw ) ( py1 + ph )
    thumb px1 py1 ( px1 - lt ) py1
    fore px1 py1 px1 ( py1 - lf )
    middle ( px1 + ( ( px1 + pw ) - px1 ) / 3 ) py1 ( px1 + ( ( px1 + pw ) - px1 ) / 3 ) ( py1 - lm )
    ring ( px1 + ( ( px1 + pw ) - px1 ) * 2 / 3 ) py1 ( px1 + ( ( px1 + pw ) - px1 ) * 2 / 3 ) ( py1 - lr )
    pinky ( px1 + pw ) py1 ( px1 + pw ) ( py1 - lp )
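The coordinate expressions in the fully specified structure can be evaluated directly once values are supplied for the parameters. The sketch below transcribes them into a function. Reading the four values per part as (x1, y1, x2, y2) is an assumption, as are the function name and the sample numbers. The parameter names px1, py1, pw, ph, lt, lf, lm, lr and lp come from the slide.

```python
def hand_coordinates(px1, py1, pw, ph, lt, lf, lm, lr, lp):
    """Evaluate the slide's expressions; each part maps to (x1, y1, x2, y2)."""
    px2 = px1 + pw
    third = px1 + (px2 - px1) / 3           # x position of the middle finger
    two_thirds = px1 + (px2 - px1) * 2 / 3  # x position of the ring finger
    return {
        "palm": (px1, py1, px2, py1 + ph),
        "thumb": (px1, py1, px1 - lt, py1),
        "fore": (px1, py1, px1, py1 - lf),
        "middle": (third, py1, third, py1 - lm),
        "ring": (two_thirds, py1, two_thirds, py1 - lr),
        "pinky": (px2, py1, px2, py1 - lp),
    }

# Hypothetical parameter values, for illustration only.
coords = hand_coordinates(px1=0, py1=0, pw=30, ph=40,
                          lt=15, lf=25, lm=28, lr=26, lp=20)
```

With these sample values the middle and ring fingers sit at one third and two thirds of the palm width, matching the `/ 3` and `* 2 / 3` terms in the expressions.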
Metalex: conclusion & prospects
  User complexity: demands an open, data-driven approach
  Domain: demands a theory-informed approach with computational acquisition & inference
  Data-driven and theory-informed lexica
    are possible (METALEX)
    need integrated model-theoretic approach (ILEX): INTERPRETATION (ALEX) = OLEX
  a formal problem remains: differing complexity of
    trees (archive): simulation of other graphs via semantics only
    annotation lattices (data), tables (lexica): regular relations if non-recursive, indexed grammars if recursive?