Published by Curtis Stokes. Modified over 9 years ago.
1. STO: A Lexical Database of Danish for Language Technology Applications
Anna Braasch, Center for Sprogteknologi, Copenhagen
SPINN Seminar, October 27, 2001
2. Background
EU-funded international projects:
- EAGLES: recommendations for morphological and syntactic specifications for 9 languages
- GENELEX: development of a generic lexicon model
- PAROLE: development of harmonized WL resources (lexicon, corpus) for 12 languages
- SIMPLE: development of an ontology and model of semantic description for 12 languages
Follow-up: a Danish, nationally funded co-operative lexicon project, STO
3. Aims of the project
Monolingual aim: to eliminate the usual ‘bottleneck problem’, the lack of a large-size Danish lexical database for
- language technology applications
- computational language research purposes
Multilingual aim: to provide an elaborated Danish lexical database for linked bi- or multilingual databases for
- LT/NLP applications
- contrastive CL and lexicology research
- …
4. STO development objectives
Requirements of monolingual applications:
- tailor the linguistic specifications for Danish
- add more language-specific features
- extend the linguistic and lexical coverage
- refine the lexicon structure
- develop customized, user-friendly interfaces
... but also requirements of multilingual linking:
- keep the basic, harmonised lexicon structure
- keep the principles and language of lexical description
- be attentive to similar follow-up projects
‘More Danish’ but still consistent with the other lexicons
5. The three linguistic layers of description
Main info types: 3 independent but linked layers
Morphology
- Inflection (pattern-based)
- Spelling
- Compounding
Syntax (totally pattern-based)
- Syntactic frame (complementation structures & functional properties, etc.)
- Control, raising (constructional properties)
Semantics (the layer of multilingual linking)
- Domain (= sublanguage, source area)
- Semantic relations (qualia)
- Specification of meaning (SIMPLE model + core ontology)
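The idea of three independent but linked layers can be illustrated with a small sketch. This is not the STO database schema: the class names, field names, identifiers and pattern labels below are all hypothetical, chosen only to show how a morphological unit might point to syntactic units, which in turn point to semantic units, while each layer is stored separately.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphUnit:
    id: str
    lemma: str
    inflection_pattern: str      # hypothetical pattern label
    synt_ids: tuple[str, ...]    # links to the syntactic layer


@dataclass(frozen=True)
class SyntUnit:
    id: str
    frame_pattern: str           # hypothetical frame label
    sem_ids: tuple[str, ...]     # links to the semantic layer


@dataclass(frozen=True)
class SemUnit:
    id: str
    domain: str
    ontology_type: str           # illustrative type name, not taken from SIMPLE


def readings(morph_id, morph, synt, sem):
    """Follow the links morphology -> syntax -> semantics for one lemma."""
    m = morph[morph_id]
    result = []
    for synt_id in m.synt_ids:
        s = synt[synt_id]
        for sem_id in s.sem_ids:
            result.append((m.lemma, s.frame_pattern, sem[sem_id].ontology_type))
    return result


# Each layer lives in its own table; only identifiers cross layer boundaries.
morph = {"m1": MorphUnit("m1", "hus", "noun-neuter-1", ("s1",))}
synt = {"s1": SyntUnit("s1", "noun-no-complement", ("u1",))}
sem = {"u1": SemUnit("u1", "general", "Building")}

print(readings("m1", morph, synt, sem))
```

Because the layers share only identifiers, a pattern in one layer (for example an inflection pattern) can be revised without touching the other two.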
6. Between syntax and semantics
No clear-cut borderline: difficult to represent mutual dependencies in a strictly modular description.
Syntactic or semantic units?
- Collocations: combine features of complex structure, (morpho)syntactic constraints and slightly restricted compositionality (meaning transparency); strong subcategorisation and selectional restrictions...
- Phrasal verbs: combine features of complex syntactic structure and compositional/non-compositional semantics …
Different representation strategies: ‘early’ vs. ‘late’
7. Linking lexicons at the semantic level
Basic method: link between L1-meaning and L2-meaning
Basic requirement: harmonized semantics (ontology, model & method)
Advantages:
- proper treatment of all lexical units, including
  - homonyms
  - polysemes
  - complex lexical units (collocations, idioms)
- independent treatment of L1 and L2 wrt. morphology and syntax
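A minimal sketch of meaning-level linking, assuming both lexicons tag each semantic unit with a concept identifier from a shared ontology. The `SemUnit` class, the `link_by_concept` function and the concept names are hypothetical and serve only to illustrate why harmonized semantics lets homonyms be linked sense by sense rather than lemma by lemma.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SemUnit:
    lang: str
    lemma: str
    sense: int
    concept: str  # hypothetical id in a shared ontology


def link_by_concept(l1_units, l2_units):
    """Pair every L1 meaning with the L2 meanings that share its concept."""
    by_concept = defaultdict(list)
    for unit in l2_units:
        by_concept[unit.concept].append(unit)
    return [(a, b) for a in l1_units for b in by_concept.get(a.concept, [])]


# English "bank" is a homonym; each sense links to a different Danish lemma.
english = [
    SemUnit("en", "bank", 1, "FinancialInstitution"),
    SemUnit("en", "bank", 2, "Riverside"),
]
danish = [
    SemUnit("da", "bank", 1, "FinancialInstitution"),
    SemUnit("da", "bred", 1, "Riverside"),
]

for en, da in link_by_concept(english, danish):
    print(f"{en.lemma}#{en.sense} -> {da.lemma}#{da.sense} ({en.concept})")
```

Nothing in the link refers to inflection or syntactic frames, which is what allows the morphology and syntax of each language to be described independently.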
8. About the STO lexical database (V.1)
Point of departure: PAROLE material
- linguistic specifications elaborated (incl. also Danish)
- modular lexicon architecture developed
- information structure developed
- 20,000 general language lexicon entries encoded
Main STO development steps:
- tailor and refine the LingSpecs for Danish
- improve the information structure (DB)
- add new entry types (complex lexical units, etc.)
- extend the vocabulary to 50,000 entries (~35,000 GL and ~15,000 LSP from 6-8 domains)
9. Progress report for 2001 (1)
New status: nationally funded co-operative project requiring
- more thorough project planning (incl. ‘logistics’)
- more detailed information (guidelines, specifications, cross-checks, evaluation…)
Continuously ongoing supporting processes:
- updating and refinement of LingSpecs
- elaboration of an Encoding Manual
- elaboration of various additional documentation (evaluation sheets, etc.)
- revision of the database/info structure
10. Progress report for 2001 (2)
New supporting tools for lexicographers developed:
- encoding tools for morphological and syntactic info
- browsers for retrieval of encoded info
- ...
Number of entries encoded with
- morphological information: ~50,000
- syntactic information: ~23,000
- semantic information: ~8,500 (from SIMPLE)
Other tasks (ongoing/finished):
- selected entries (on customer’s request) downloaded
- work on principles of statistically based selection of lemmas and syntactic constructions to be encoded
- corpus-related work
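For orientation, the reported counts can be set against the 50,000-entry vocabulary target stated for the project. The short sketch below only does that arithmetic; the counts are the approximate figures from the slides, and the `coverage` helper is a hypothetical name.

```python
# Approximate entry counts per layer, as reported for 2001.
encoded = {"morphological": 50_000, "syntactic": 23_000, "semantic": 8_500}
TARGET_ENTRIES = 50_000  # vocabulary target stated for STO


def coverage(counts, target):
    """Return each layer's share of the target, in percent."""
    return {layer: round(100 * n / target, 1) for layer, n in counts.items()}


for layer, pct in coverage(encoded, TARGET_ENTRIES).items():
    print(f"{layer}: {pct}% of target")
```

On these figures the syntactic layer is a little under half of the target and the semantic layer under a fifth, with the semantic entries inherited from SIMPLE.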
11. Progress report for 2001 (3)
Treatment of new entry types:
- domain-specific (LSP) entries
- compounds (decomposition and linking elements implemented)
- geographical proper nouns (inflectional and agreement properties investigated; the results are implemented)
- collocations (information structure designed)
- revision of the treatment of phrasal verbs
12. Summing up the goals
STO will
- conform to ‘general’ linguistic knowledge
- meet demands of a broad application and research area (size, selection of domains and vocabulary, detail of linguistic description…)
- satisfy monolingual language-specific requirements
- be potentially compatible with other lexical databases for future linking
- be reasonably easy to access, customize/use...
- fulfil the development contract and meet the production deadlines