1 STO A Lexical Database of Danish for Language Technology Applications Anna Braasch Center for Sprogteknologi Copenhagen SPINN Seminar, October 27, 2001.

Presentation transcript:

1 STO A Lexical Database of Danish for Language Technology Applications Anna Braasch Center for Sprogteknologi Copenhagen SPINN Seminar, October 27, 2001

2 Background
EU-funded international projects:
- EAGLES: recommendations for morphological and syntactic specifications for 9 languages
- GENELEX: development of a generic lexicon model
- PAROLE: development of harmonized written-language (WL) resources (lexicon, corpus) for 12 languages
- SIMPLE: development of an ontology and a model of semantic description for 12 languages
Follow-up: STO, a nationally funded, co-operative Danish lexicon project

3 Aims of the project
Monolingual aim: to eliminate the usual ’bottleneck problem’, i.e. the lack of a large-size Danish lexical database for
- language technology applications
- computational language research purposes
Multilingual aim: to provide an elaborated Danish lexical database for linked bi- or multilingual databases for
- LT/NLP applications
- contrastive CL and lexicology research
- …

4 STO development objectives
Requirements of monolingual applications:
- tailor the linguistic specifications for Danish
- add more language-specific features
- extend the linguistic and lexical coverage
- refine the lexicon structure
- develop customized, user-friendly interfaces
- ...
... but also requirements of multilingual linking:
- keep the basic, harmonised lexicon structure
- keep the principles and language of lexical description
- be attentive to similar follow-up projects
=> ’more Danish’ but still consistent with the other lexicons

5 The three linguistic layers of description
Main info types: 3 independent but linked layers
Morphology
- Inflection (pattern-based)
- Spelling
- Compounding
Syntax (totally pattern-based)
- Syntactic frame (complementation structures & functional properties, etc.)
- Control, raising (constructional properties)
Semantics (the layer of multilingual linking)
- Domain (= sublanguage, source area)
- Semantic relations (qualia)
- Specification of meaning (SIMPLE model + core ontology)
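The three independent but linked layers can be pictured as three tables whose units point at each other. The following Python sketch is only an illustration of that idea, not the STO database schema: all class names, field names, identifiers and pattern labels (e.g. "V-regular", "transitive", "Activity") are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SemanticUnit:
    unit_id: str
    domain: str          # sublanguage / source area
    ontology_type: str   # hypothetical ontology node label
    gloss: str           # informal meaning description


@dataclass
class SyntacticUnit:
    unit_id: str
    frame_pattern: str   # hypothetical name of a syntactic frame pattern
    semantic_ids: List[str] = field(default_factory=list)


@dataclass
class MorphologicalUnit:
    lemma: str
    pos: str
    inflection_pattern: str  # hypothetical name of an inflection pattern
    syntactic_ids: List[str] = field(default_factory=list)


@dataclass
class Lexicon:
    # Each layer is stored separately; only identifiers link the layers.
    morphology: Dict[str, MorphologicalUnit] = field(default_factory=dict)
    syntax: Dict[str, SyntacticUnit] = field(default_factory=dict)
    semantics: Dict[str, SemanticUnit] = field(default_factory=dict)

    def senses_of(self, lemma: str) -> List[SemanticUnit]:
        """Follow the links morphology -> syntax -> semantics for a lemma."""
        morph = self.morphology.get(lemma)
        if morph is None:
            return []
        senses = []
        for syn_id in morph.syntactic_ids:
            for sem_id in self.syntax[syn_id].semantic_ids:
                senses.append(self.semantics[sem_id])
        return senses


# Illustrative entry for the Danish verb "spise" (to eat).
lexicon = Lexicon()
lexicon.semantics["sem:spise#1"] = SemanticUnit(
    "sem:spise#1", "general", "Activity", "eat")
lexicon.syntax["syn:spise#1"] = SyntacticUnit(
    "syn:spise#1", "transitive", ["sem:spise#1"])
lexicon.morphology["spise"] = MorphologicalUnit(
    "spise", "verb", "V-regular", ["syn:spise#1"])

print([sense.gloss for sense in lexicon.senses_of("spise")])
```

Keeping the layers in separate tables means a syntactic frame or an inflection pattern can be revised without touching the semantic units, which is one way to read "independent but linked".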

6 Between syntax and semantics
No clear-cut borderline: it is difficult to represent mutual dependencies in a strictly modular description.
=> Syntactic or semantic units?
- Collocations: combine features of complex structure, (morpho)syntactic constraints and slightly restricted compositionality (meaning transparency); strong subcategorisation and selectional restrictions...
- Phrasal verbs: combine features of complex syntactic structure and compositional/non-compositional semantics
- …
=> Different representation strategies: ’early’ vs. ’late’

7 Linking lexicons at the semantic level
Basic method: link between L1-meaning and L2-meaning
Basic requirement: harmonized semantics (ontology, model & method)
Advantages:
- proper treatment of all lexical units, including
  - homonyms
  - polysemes
  - complex lexical units (collocations, idioms)
- independent treatment of L1 and L2 wrt. morphology and syntax
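Linking meaning to meaning, rather than word to word, can be sketched as follows. This is a minimal illustration under stated assumptions, not the STO linking mechanism: the sense identifiers, the tiny lexicons and the `translate` helper are all hypothetical. The Danish homonym "dyr" (noun 'animal', adjective 'expensive') shows why sense-level links handle homonyms properly.

```python
from collections import defaultdict

# Hypothetical sense inventories: lemma -> list of sense identifiers.
DANISH = {"dyr": ["da:dyr#1", "da:dyr#2"]}
ENGLISH = {"animal": ["en:animal#1"], "expensive": ["en:expensive#1"]}

# Hypothetical links between an L1 meaning and an L2 meaning.
LINKS = [
    ("da:dyr#1", "en:animal#1"),
    ("da:dyr#2", "en:expensive#1"),
]


def sense_to_lemma(lexicon):
    """Invert a lexicon so each sense identifier maps back to its lemma."""
    return {sense: lemma
            for lemma, senses in lexicon.items()
            for sense in senses}


def translate(lemma, source, target, links):
    """Return, per source sense of `lemma`, the linked target lemmas."""
    target_lemma = sense_to_lemma(target)
    by_source = defaultdict(list)
    for src_sense, tgt_sense in links:
        by_source[src_sense].append(tgt_sense)
    return {
        sense: [target_lemma[t] for t in by_source[sense]]
        for sense in source.get(lemma, [])
    }


result = translate("dyr", DANISH, ENGLISH, LINKS)
print(result)
```

Because the links refer only to sense identifiers, each language's morphological and syntactic description can be changed without touching the links, in line with the "independent treatment of L1 and L2" advantage on the slide.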

8 About the STO lexical database (V.1)
Point of departure: PAROLE material
- linguistic specifications elaborated (incl. also Danish)
- modular lexicon architecture developed
- information structure developed
- 20,000 general language lexicon entries encoded
Main STO development steps:
- tailor and refine the LingSpecs for Danish
- improve the information structure (DB)
- add new entry types (complex lexical units, etc.)
- extend the vocabulary to 50,000 entries (~35,000 GL and ~15,000 LSP from 6-8 domains)

9 Progress report for 2001 (1)
New status: nationally funded co-operative project, requiring
- more thorough project planning (incl. ’logistics’)
- more detailed information (guidelines, specifications, cross-checks, evaluation…)
Continuously ongoing supporting processes:
- updating and refinement of LingSpecs
- elaboration of an Encoding Manual
- elaboration of various additional documentation (evaluation sheets, etc.)
- revision of the database/info structure

10 Progress report for 2001 (2)
New supporting tools for lexicographers developed:
- encoding tools for morphological and syntactic info
- browsers for retrieval of encoded info
- ...
Number of entries encoded with
- morphological information: ~50,000
- syntactic information: ~23,000
- semantic information: ~8,500 (from SIMPLE)
Other tasks (ongoing/finished):
- selected entries (on customer’s request) downloaded
- work on principles of statistically based selection of lemmas and syntactic constructions to be encoded
- corpus-related work

11 Progress report for 2001 (3)
Treatment of new entry types:
- domain-specific (LSP) entries
- compounds (decomposition and linking elements implemented)
- geographical proper nouns (inflectional and agreement properties investigated; the results are implemented)
- collocations (information structure designed)
- revision of the treatment of phrasal verbs

12 Summing up the goals
STO will
- conform to ’general’ linguistic knowledge
- meet the demands of a broad application and research area (size, selection of domains and vocabulary, detail of linguistic description…)
- satisfy monolingual, language-specific requirements
- be potentially compatible with other lexical databases for future linking
- be reasonably easy to access and to customize/use
- ...
- fulfil the development contract and meet the production deadlines