Hypermedia Lexica and Lexicon Metadata The MetaLex model in the ModeLex project Dafydd Gibbon U Bielefeld Europe E-MELD Workshop, Detroit, August 2002.

Presentation transcript:

Overview
- Metalex goals
  - Background: DATR, HyprLex, Speech, Language Documentation
- Metalex design: theory and practice
  - Lexical documents & metadocuments
  - Lexical objects, properties, structures
- Metalex implementation
  - Ivory Coast encyclopaedia project
  - Ega documentation model project
  - The Modelex (multimodal lexicon) project
  - Ivory Coast + Nigeria documentation curriculum project
- Extending metalex
  - Modalities & submodalities
  - Data-driven lexicography
  - Data structures & algorithms: trees, lattices; induction, inference

Metalex goals: background
General objectives:
- Versatile high quality spoken language lexicography
- Motivated balance of high-tech + low-tech
- Good resources are data-driven and theory-informed
Specific project objectives:
- DATR/ILEX: formal lexicon theory and implementation
- VerbMobil: integrated HyprLex dissemination model
- HyprLex encyclopaedia model for Ivory Coast Languages
- Ega endangered language documentation model
- Modelex - theory and design of multimodal lexica
- Ivory Coast and Nigeria curricula for language documentation

Metalex design: data and theory
Data-driven data + metadata acquisition: systematic metatext derived from and supporting...
- Computational fieldwork
- Induction of lexica
Theory-informed data + metadata acquisition: Integrated Lexicon (ILEX) consisting of...
- Abstract Lexicon (ALEX) - "theory" in the mathematical sense
- Object Lexicon (OLEX) - "model" in the mathematical sense

Metalex design: data
Data-driven acquisition:
- Computational fieldwork: portable metadatabase with restricted vocabulary and general metatext, and
  - Definition of and support for transcription + annotation
  - Portable support for scenarios, scripts
  - Portable support for lexicon processing
- Induction of lexica: lexicon tools for
  - Extraction of macrostructural elements (lexeme elements)
  - Induction of microstructural information (media concordance, POS,...)
  - Induction of mesostructural regularities and subregularities (grammar,...)

Metalex design: theory
Theory-informed formalisation:
- Abstract Lexicon (ALEX) - "theory" in the mathematical sense
  - Decomposition (componential A-V description)
  - Generalisation (inheritance)
  - Composition (multilinear operations)
- Object Lexicon (OLEX) - "model" in the mathematical sense
  - XML archiving and dissemination formats
  - Object-relational database acquisition and processing formats
= Integrated Lexicon (ILEX)
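
The "Generalisation (inheritance)" step can be illustrated with a minimal Python sketch. It is not the DATR/ILEX implementation: the node names, attributes and the `interpret` helper are hypothetical, and the sketch only assumes that an abstract node inherits every attribute it does not specify locally. Resolving all nodes yields fully specified entries, which is one simple reading of INTERPRETATION (ALEX) = OLEX.

```python
# Hypothetical ALEX fragment: each node has local attribute-value pairs
# and an optional parent from which unspecified attributes are inherited.
ALEX = {
    "Noun": {"parent": None, "attrs": {"pos": "N", "plural": "suffix -s"}},
    "Bird": {"parent": "Noun", "attrs": {"gloss": "bird"}},
    "Sheep": {"parent": "Noun", "attrs": {"gloss": "sheep", "plural": "unchanged"}},
}


def interpret(alex, node):
    """Return the fully specified attributes of `node`.

    Local values override inherited ones (default inheritance).
    """
    chain = []
    while node is not None:
        chain.append(alex[node]["attrs"])
        node = alex[node]["parent"]
    entry = {}
    # Apply the most general node first so that more specific nodes win.
    for attrs in reversed(chain):
        entry.update(attrs)
    return entry


# Object lexicon: every entry spelled out in full.
OLEX = {name: interpret(ALEX, name) for name in ALEX}
```

Here `OLEX["Bird"]` picks up `pos` and `plural` from `Noun`, while `OLEX["Sheep"]` keeps its own exceptional `plural` value.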

Metalex implementation: architecture
Data model
- Theory = shared lexicon architecture:
  - Macrostructure: declarative and procedural components
    - Lexicon architecture: relational, inheritance, text,...
    - Lexical objects: entry types
    - Lexical access: fact query, semasiological / onomasiological indexing
  - Mesostructure:
    - Generalisations: grammar, phonetics, cultural background,...
    - Composition of lexicon object types: idioms, words, morphemes,...
    - Lexical access: inferential query
  - Microstructure:
    - Lexical entry (article, lemma structure - atom, string, tree,...)
    - Types of lexical information - standardly: "lexicon model"

Metalex implementation: microstructure
Microstructure specification philosophy:
- Anybody can specify any kind of unpredictable detail
- Questionnaire / Experiment / Corpus / Archive dependence
- Lexicon architecture: relational, inheritance, text,...
- Intelligent (semi-)automatic classification, not fixed attributes
- Theory-informed coarse grouping is possible
  - Media attributes: visual, auditory, tactile,...
  - Meaning attributes: definition, gloss, lexical relations,...
  - Composition attributes: context/category, parts, operations
  - Use attributes: style, register, concordance, media illustrations,...
  - Micrometadata attributes: lexicographer DB indices, source (e.g. fieldwork metadata) DB indices, modification,...

Metalex implementation: fieldwork metadata source (1)
Situation dimensions
- participant: fieldworker, partners, contacts
- channel: modalities, media
- locale: indoor/outdoor, spatial configuration
- temporal: date, time, calendar event
- functional: affiliation, role, occasion; observation (prompt, metadata management)
Language dimension
- affiliation
- discourse level: discourse type, genre + prosody
- phrase level: recursive phrasal categories/relations + prosody
- word level: clitics, inflexion, word formation + prosody

Metalex implementation: fieldwork metadata source (2)
Technical dimension
- physical characteristics of participants: age, sex, health
- physical characteristics of locale: indoor/outdoor, spatial configuration, temporal sequence, date (season), time (of day)
- audio: mike type, position, room; A/D; channels, f sample, resolution; formats
- video: camera & microphone type, analogue/digital; filters, lenses; audio; formats
- other sensors: laryngograph, airflow, data glove,...
Metalinguistic dimension
- empirical method: introspection, experiment, corpus elicitation
- materials: questionnaire, experiment layout, corpus scenario
- metadata specification: index, metatext type, metacatalogue type
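
A fieldwork metadata record with a restricted vocabulary might be sketched as follows. This is an illustration only, not the schema of the actual metadata entry tool: the class name, the field selection and the date format are assumptions; only the vocabulary values (indoor/outdoor; introspection, experiment, corpus elicitation) come from the dimensions listed above.

```python
from dataclasses import dataclass

# Restricted vocabularies taken from the dimensions above.
LOCALE_TYPES = {"indoor", "outdoor"}
EMPIRICAL_METHODS = {"introspection", "experiment", "corpus elicitation"}


@dataclass
class FieldworkRecord:
    """Hypothetical minimal record: one field per selected dimension."""
    fieldworker: str   # situation: participant
    date: str          # situation: temporal (format is an assumption)
    locale: str        # situation / technical: indoor or outdoor
    method: str        # metalinguistic: empirical method
    metatext: str = ""  # general free-text notes

    def __post_init__(self):
        # Reject values outside the restricted vocabulary at entry time.
        if self.locale not in LOCALE_TYPES:
            raise ValueError(f"unknown locale type: {self.locale!r}")
        if self.method not in EMPIRICAL_METHODS:
            raise ValueError(f"unknown empirical method: {self.method!r}")
```

Checking the vocabulary when the record is created keeps later indexing simple, while the free `metatext` field still allows unpredictable detail.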

Metalex implementation: fieldwork metadata entry tool LREC 2002, Workshop on Portability Issues

Metalex implementation: fieldwork metadata entry tool HanDBase DBMS for PalmOS

Metalex objects in conjunction with work in ISLE CLWG (Computational Lexicon Working Group) (see Gibbon in reading list) LEXICON:  {, }  Macrostructure: Ordering( {ENTRY,...} )  Mesostructure:  Mesostructure: ENTRY:  

The LEXICON object
Front Matter Metadata:
- Bibliographical: creator, publisher, title, date,...
- Medium / format: paper, CD-ROM/DVD, web,...
Macrostructure type:
- access: semasiological/onomasiological
- n-lingual/langue(s)
- special: taxonomy (thesaurus), concordance
- structure, e.g. tabular: f(type,attrib)=value

The ENTRY object: metadata
Entry Metadata: (see Gibbon & al. in reading list)
- Entry type (wrt macrostructure specification):
  - encyclopaedic
  - multiword unit, word,...
- Microstructure data model specification:
  - entry structure: flat, tree, graph (net),...
  - data categories specification (attribute, field, information type)
    - DC groups - structural skeleton
    - DCs
    - DC substructure - homography, homophony, polysemy...

The ENTRY object: DC groups
Media ("surface"):
- acoustic (phonetic, earcon, sonification,...), visual (orthography, icon, gesture,...)
Composition (structure):
- part (e.g. morphology for words), context (e.g. POS, subcat for words)
Meaning (definition, illustration):
- semantic (components, relations, senses, ontology)
- pragmatic (speech act, dialogue, disfluency,...)
Use: typically media (e.g. audio) concordance,...
Metadata: lexicographer,...

The ENTRY object: DCs
Countless Data Category models: (see reading list)
- every existing dictionary
- linguistic "types of lexical information"
- several European projects (GENELEX, MULTILEX, ACQUILEX,...)
- ISO terminology norms (cf. MARTIF etc....)

The ENTRY object: DC structures
Computationally relevant properties of fields:
- type (atomic, complex: tree, string, xyz-formatted text)
- character encoding spec.: ASCII, Unicode, xyz
- tree (or other graph/net):
  - finite depth
  - flat, disjunctive tree
  - recursive graph (net)
  - table, non-tree graph, anchor/link/index structure
- generated text:
  - print, hypertext (compiled vs. dynamic (generated on the fly))

Metalex microstructure application
Media ("surface"):
- phonemic & tonemic transcription (SAMPA ASCII - still waiting for Unicode...)
Composition (structure):
- morphemic substructure, category & subcategory
Meaning (definition, illustration):
- glosses (English, French, German)
- definitions, senses, relations, components; audio-visual illustration
Use: genres; examples (e.g. concordance link); free text notes
Metadata: first record; last field

Metalex field lexicon microstructure
Anouman_1:
- Media attributes:
  - Phonemic tier: `an'U~m`'a~
  - Skeletal tier: VNVNV
  - Tonal tier: L H LH
  - Signal tier: Audio
- Meaning attributes:
  - F-gloss: Oiseau
  - E-gloss: Bird
  - G-gloss: Vogel
  - Definition: avis
  - Homophone full: Anouman_2: grandchild
  - Homophone phonemic: Anouman_3: yesterday
- Use:
  - Genre: narrative
- Metadata:
  - Lexicographer: S. Adouakou
  - Source: Bielefeld-Anyi-Corpus, Adaou village, CI
  - Date: March 2002
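
As a rough illustration of an XML dissemination format for such an entry, the sketch below serialises part of Anouman_1 with Python's standard `xml.etree.ElementTree`. The element and attribute names (`entry`, `group`, `dc`, `name`) are assumptions made for this example, not the Metalex schema; the field values are copied from the entry above.

```python
import xml.etree.ElementTree as ET

# Field values copied from the Anouman_1 entry; keys follow the DC groups.
ANOUMAN_1 = {
    "Media": {
        "Phonemic tier": "`an'U~m`'a~",
        "Skeletal tier": "VNVNV",
        "Tonal tier": "L H LH",
        "Signal tier": "Audio",
    },
    "Meaning": {
        "F-gloss": "Oiseau",
        "E-gloss": "Bird",
        "G-gloss": "Vogel",
        "Definition": "avis",
    },
    "Use": {"Genre": "narrative"},
    "Metadata": {
        "Lexicographer": "S. Adouakou",
        "Source": "Bielefeld-Anyi-Corpus, Adaou village, CI",
        "Date": "March 2002",
    },
}


def entry_to_xml(entry_id, groups):
    """Build an <entry> element with one <group> per DC group."""
    root = ET.Element("entry", id=entry_id)
    for group_name, categories in groups.items():
        group = ET.SubElement(root, "group", name=group_name)
        for dc_name, value in categories.items():
            # One <dc> element per data category, value as text content.
            dc = ET.SubElement(group, "dc", name=dc_name)
            dc.text = value
    return root


xml_text = ET.tostring(entry_to_xml("Anouman_1", ANOUMAN_1), encoding="unicode")
```

Carrying the data category names in a `name` attribute rather than as element names lets a lexicographer introduce new categories without changing the document grammar.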

Metalex portable lexical database
Relational database:
- Metalex specs flattened
- structure re-constitution via metalex specs
- HanDBase for PalmOS
- Features:
  - standard full RelDBMS
  - XML, CSV, text export
  - export/import via GSM
  - inexpensive (wrt laptop)
  - stylus, keyboard, sync input
  - light weight
  - low power consumption
  - inconspicuous in use
  - interfaces to Scheme, C
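
Flattening and re-constitution can be sketched in a few lines of Python. This is a generic illustration, not the HanDBase table layout: the four-column row format (entry, group, attribute, value) and the function names are assumptions, chosen to match the tabular form f(type,attrib)=value.

```python
import csv
import io

# Small hypothetical sample in the grouped microstructure format.
SAMPLE = {
    "Meaning": {"E-gloss": "Bird", "F-gloss": "Oiseau"},
    "Use": {"Genre": "narrative"},
}


def flatten(entry_id, groups):
    """Flatten a grouped entry into (entry, group, attribute, value) rows."""
    return [
        (entry_id, group, attrib, value)
        for group, categories in groups.items()
        for attrib, value in categories.items()
    ]


def reconstitute(rows):
    """Rebuild the nested entry structure from flat rows."""
    lexicon = {}
    for entry_id, group, attrib, value in rows:
        lexicon.setdefault(entry_id, {}).setdefault(group, {})[attrib] = value
    return lexicon


def to_csv(rows):
    """Export flat rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["entry", "group", "attribute", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten("Anouman_1", SAMPLE)
```

Because each row carries its group and attribute names, `reconstitute(rows)` recovers the nested structure without any information beyond the rows themselves.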

Metalex extension
The Modelex project: "Theory and Design of Multimodal Lexica"
Goals:
- Data-driven, theory-informed lexicon models
- Formal properties of abstract data models for multimodal lexica
- Interpretation of abstract data models in XML
- Integration of parallel annotation lattices for modalities and submodalities
- Development of a prototype multimodal lexicon

The Modelex domain: modalities and submodalities

Modelex: data driven lexicography

Modelex: gesture annotation
Time Aligned Signal Corpus System (Java, GPL), Jan-Torsten Milde, U Bielefeld
TASX annotator:
- Phonological tier
- ToBI tiers
- Gesture tier
- Speech Act tier
Anyi, Ega, German

Model-theoretic compilation in ILEX: INTERPRETATION ( ALEX ) = OLEX

Metalex in the Modelex project: Multimodal concordance as microstructure DC Prototype:

Metalex in the Modelex project: underspecified ALEX microstructure for gesture coordinates Hand: == "Palm" "Digit" == " " "> == " " " " " " " " <> ==. Palm: == == palm == pw == ph == == ( + ( - ) / 3 ) == ( + ( - ) * 2 / 3 ) == == px1 == py1 == ( + ) <> == Hand.

Metalex in the Modelex project: fully specified ALEX microstructure for gesture coordinates
Hand: =
  palm px1 py1 ( px1 + pw ) ( py1 + ph )
  thumb px1 py1 ( px1 - lt ) py1
  fore px1 py1 px1 ( py1 - lf )
  middle ( px1 + ( ( px1 + pw ) - px1 ) / 3 ) py1 ( px1 + ( ( px1 + pw ) - px1 ) / 3 ) ( py1 - lm )
  ring ( px1 + ( ( px1 + pw ) - px1 ) * 2 / 3 ) py1 ( px1 + ( ( px1 + pw ) - px1 ) * 2 / 3 ) ( py1 - lr )
  pinky ( px1 + pw ) py1 ( px1 + pw ) ( py1 - lp )
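
The coordinate expressions above can be evaluated numerically, as in the following Python sketch. It assumes that each part is given as four values (x1, y1, x2, y2), that pw and ph are the palm width and height, and that lt, lf, lm, lr and lp are the lengths of the thumb and the four fingers; these readings, like the function name `hand`, are an interpretation of the slide rather than something it states.

```python
def hand(px1, py1, pw, ph, lt, lf, lm, lr, lp):
    """Evaluate the fully specified Hand expressions for given parameters.

    Returns one (x1, y1, x2, y2) tuple per part, following the term order
    of the specification above.
    """
    px2 = px1 + pw  # opposite palm corner, x
    py2 = py1 + ph  # opposite palm corner, y
    # Middle and ring fingers sit at one third and two thirds of the
    # palm width, exactly as in the expressions on the slide.
    middle_x = px1 + (px2 - px1) / 3
    ring_x = px1 + (px2 - px1) * 2 / 3
    return {
        "palm": (px1, py1, px2, py2),
        "thumb": (px1, py1, px1 - lt, py1),
        "fore": (px1, py1, px1, py1 - lf),
        "middle": (middle_x, py1, middle_x, py1 - lm),
        "ring": (ring_x, py1, ring_x, py1 - lr),
        "pinky": (px2, py1, px2, py1 - lp),
    }


# Example with arbitrary illustrative values (units are unspecified).
example = hand(px1=0, py1=0, pw=30, ph=40, lt=10, lf=20, lm=25, lr=20, lp=15)
```

With these values the palm spans (0, 0) to (30, 40), and the middle and ring fingers start at x = 10 and x = 20.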

Metalex: conclusion & prospects
User complexity:
- demands an open, data-driven approach
Domain:
- demands a theory-informed approach
- with computational acquisition & inference
Data-driven and theory-informed lexica
- are possible (METALEX)
- need integrated model-theoretic approach (ILEX): INTERPRETATION (ALEX) = OLEX
- a formal problem remains: differing complexity of
  - trees (archive): simulation of other graphs via semantics only
  - annotation lattices (data), tables (lexica): regular relations if non-recursive, indexed grammars if recursive?