Published by Mary Logan. Modified over 9 years ago.
1 Building a Multilingual Lexicon
Robert Baud
SemanticMining WP20, Freiburg, 29 March 2004
2 Natural Language Processing Group
Building a multilingual lexicon
Starting from a model of medicine, or from a pragmatic observation of the languages?
What representation of knowledge should be added to a lexicon?
What makes a lexicon multilingual?
From signals to understanding: the different levels of granularity of language information
Defining the Lexicon Ontology (LO) in order to start on a sound basis
3 Modeling or not?
In the last decade, the idea of a model of medicine was prevalent, as in Snomed, Galen, UMLS, etc.
NLP was seen as a necessary way to help communicate the content of the model.
The principle of guidance by the model was accepted.
But a general model of medicine is far from being a reality, and this will certainly remain true for a few decades.
Therefore, it is not a good idea to base NLP on the existence of a model.
Make NLP free from any model!
4 Modeling the medical domain: light model
[Figure: a light model of the medical domain. Link types: has_parent, has_child, linked_to. Node labels: top, object, event, process, normal, path., surgery, trauma, disease, arm, hand, finger, palm, foot.]
5 Local model
Words are at different levels of detail:
  burn of the finger and burn of the thumb
  digestive disorder and post-prandial disorder
  vertebra and atlas
Attributes or properties are generalized to classes of concepts.
Local inference between close levels in a hierarchy of concepts is necessary before chunking information.
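The idea of local inference between close levels can be sketched in a few lines of Python. This is only an illustration: the `HAS_PARENT` table, the function names and the `max_steps` limit are assumptions made for the example, and only the word pairs come from the slide.

```python
# Hypothetical has_parent links built from the word pairs on the slide.
HAS_PARENT = {
    "thumb": "finger",
    "atlas": "vertebra",
    "post-prandial disorder": "digestive disorder",
}


def ancestors(concept, max_steps=2):
    """Follow has_parent links upward, at most max_steps times (local inference only)."""
    found = []
    current = concept
    for _ in range(max_steps):
        current = HAS_PARENT.get(current)
        if current is None:
            break
        found.append(current)
    return found


def is_local_specialization(specific, general, max_steps=2):
    """True if 'specific' equals 'general' or lies a few levels below it."""
    return specific == general or general in ancestors(specific, max_steps)
```

With such a table, a statement about "burn of the thumb" could be matched against knowledge attached to "burn of the finger", because `is_local_specialization("thumb", "finger")` holds while the reverse does not.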
6 Semantic lexicons
A semantic lexicon is a lexicon with attachments to existing terminologies and ontologies.
But what do we attach to what, and how?
  Grouping of words representing the same object?
  What is the semantics of this association?
  What about multilingual aspects?
  Problem of coherence of multiple attachments
7 From words to objects
[Figure: lexical representation on one side, ontological representation on the other.]
Lemma level: « paupière », « eyelid », « Augenlid »; « bléphar », « blephar »; « blépharo », « blepharo »; « palpébral », « palpebral », « ? »
Abstract Lexical Identifier: _eyelid, _blephar, _blepharo, _palpebral
Ontological level: Galen, UMLS, Semantic net, MeSH, Snomed, ICD10, other
Universal Object Identifier: cl_Eyelid
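The two-step path from words to objects can be sketched as two lookup tables. This is a minimal illustration, not the actual implementation: the language codes, the table and function names, and the exact grouping of lemmas under each identifier are assumptions read off the figure.

```python
# (language, lemma) -> abstract lexical identifier; grouping assumed from the figure.
LEMMA_TO_LID = {
    ("fr", "paupière"): "_eyelid",
    ("en", "eyelid"): "_eyelid",
    ("de", "Augenlid"): "_eyelid",
    ("fr", "bléphar"): "_blephar",
    ("en", "blephar"): "_blephar",
    ("fr", "blépharo"): "_blepharo",
    ("en", "blepharo"): "_blepharo",
    ("fr", "palpébral"): "_palpebral",
    ("en", "palpebral"): "_palpebral",
}

# abstract lexical identifier -> universal object identifier.
LID_TO_CID = {
    "_eyelid": "cl_Eyelid",
    "_blephar": "cl_Eyelid",
    "_blepharo": "cl_Eyelid",
    "_palpebral": "cl_Eyelid",
}


def object_for(language, lemma):
    """Return the object identifier reached from a lemma, or None if unknown."""
    lid = LEMMA_TO_LID.get((language, lemma))
    return LID_TO_CID.get(lid) if lid is not None else None


def translations(language, lemma, target_language):
    """Lemmas of the target language that share the same lexical identifier."""
    lid = LEMMA_TO_LID.get((language, lemma))
    if lid is None:
        return []
    return sorted(
        other
        for (lang, other), other_lid in LEMMA_TO_LID.items()
        if other_lid == lid and lang == target_language
    )
```

In this sketch the multilingual aspect lives entirely in the first table, while the second table carries the attachment to the ontological level.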
8 Dealing with proximity of words
[Figure: lexical representation on one side, ontological representation on the other.]
Lemma level: « corps », « body », « Körper » (shown three times, once per sense); « corps étranger », « foreign body », « Fremdkörper »
Abstract Lexical Identifier: _BodyAsWhole, _BodyAsTrunck, _BodyAsDeadPerson, _BodyAsForeign
Ontological level: MeSH, Semantic net, etc.
Universal Object Identifier: cl_Body, cl_Trunck, cl_DeadBody, cl_ForeignBody
9 From signals to understanding
utterances
lexicon entries
language words
abstract lexical identifier
universal object identifier
object
link between objects
10 Utterances
A speech, a sentence, a sign, a signal, generally issued by a human being
An expression of something to be communicated
Well-formed or ill-formed
It is difficult to delimit what is a unit of communication, or a kind of atomic message
Utterances are expected to be converted to written sentences for subsequent processing
11 Lexicon entries
All 3 kinds of lexicon entries point to well-defined objects of the world:
  Single-word entries, without blank characters, not decomposable
  Word components, or morphosemantems, which are the parts obtained by decomposing compound words
  Expressions or short terms, made of 2 to 5 words, representing single objects, such as idiomatic expressions and language idiosyncrasies, which cannot be represented by ordinary composition of their parts
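The three kinds of entries and their length constraints can be sketched as a small data type. The names `EntryKind` and `LexiconEntry` are hypothetical, chosen for this example only; the constraints (no blank in a single word or component, 2 to 5 words in an expression) are taken from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class EntryKind(Enum):
    WORD = "single word"
    COMPONENT = "word component"
    EXPRESSION = "expression"


@dataclass(frozen=True)
class LexiconEntry:
    text: str
    kind: EntryKind

    def __post_init__(self):
        # Count blank-separated words to check the constraint for this kind.
        n_words = len(self.text.split())
        if self.kind is EntryKind.EXPRESSION:
            if not 2 <= n_words <= 5:
                raise ValueError("an expression must have 2 to 5 words")
        elif n_words != 1:
            raise ValueError("words and components must not contain blanks")


entries = [
    LexiconEntry("eyelid", EntryKind.WORD),
    LexiconEntry("blepharo", EntryKind.COMPONENT),
    LexiconEntry("foreign body", EntryKind.EXPRESSION),
]
```

Note that a single word and a word component cannot be told apart by form alone, so the kind has to be stored explicitly with each entry.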
12 Language words
In most natural languages, words present morphological variations, which have to be resolved.
Rule-based systems are able to solve this problem.
A lemmatizer is a program that, given a sentence, produces the list of the lemmas of all its words, that is, their basic form: generally singular, masculine, nominative or infinitive, whichever applies.
A multilingual lexicon should include the definitions of the rules and should flag the regular words.
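A rule-based lemmatizer of the kind described above can be sketched as follows. This is a toy illustration for English plurals only: the suffix rules, the table of irregular forms and the function names are assumptions made for the example, not the rules of the actual lexicon.

```python
import re

# Hypothetical suffix rules, tried in order: (suffix, replacement).
# Rules that map a suffix to itself protect words such as "abscess" or "fœtus".
SUFFIX_RULES = [
    ("ies", "y"),
    ("sses", "ss"),
    ("ss", "ss"),
    ("is", "is"),
    ("us", "us"),
    ("s", ""),
]

# Words not covered by the rules are listed explicitly (they are not "regular").
IRREGULAR = {"vertebrae": "vertebra", "feet": "foot"}


def lemmatize_word(word):
    """Return the basic form of one word, using the exception table, then the rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


def lemmatize(sentence):
    """Split a sentence into alphabetic tokens and lemmatize each one."""
    return [lemmatize_word(token) for token in re.findall(r"[^\W\d_]+", sentence)]
```

Keeping the exception table separate from the rules mirrors the recommendation to flag regular words: anything absent from the table is handled by the rules alone.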
13 Abstract lexical identifier (LID)
The same word generally exists in different languages.
The same word may have different lemmas in a given language.
The information about these facts has to be explicitly collected.
The recipient of the collection of all forms is called an abstract lexical identifier.
It is represented by a unique string of characters, based on the English lemma, with an extension when necessary.
14 Universal object identifier (CID)
Physical objects and abstract objects are parts of the world.
A unique object identifier has to be defined for each object of the domain under scrutiny.
One and only one link has to be defined between an abstract lexical identifier and an object identifier.
Multiple links may converge on the same object identifier.
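The link constraint stated above (exactly one object identifier per lexical identifier, with several lexical identifiers allowed to share one object) can be sketched as a small table class. The class name `LinkTable` and its methods are hypothetical; the identifiers are those shown on the earlier slides.

```python
class LinkTable:
    """Stores LID -> CID links and refuses a second, different link for the same LID."""

    def __init__(self):
        self._cid_of = {}

    def link(self, lid, cid):
        existing = self._cid_of.get(lid)
        if existing is not None and existing != cid:
            raise ValueError(f"{lid} is already linked to {existing}")
        self._cid_of[lid] = cid

    def cid_of(self, lid):
        return self._cid_of[lid]

    def lids_of(self, cid):
        # Several lexical identifiers may converge on the same object identifier.
        return sorted(lid for lid, linked in self._cid_of.items() if linked == cid)


table = LinkTable()
table.link("_eyelid", "cl_Eyelid")
table.link("_blephar", "cl_Eyelid")
table.link("_BodyAsWhole", "cl_Body")
table.link("_BodyAsForeign", "cl_ForeignBody")
```

A plain dictionary keyed by lexical identifier already guarantees at most one link per key; the explicit check only turns a silent overwrite into an error.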
15 Abdomen and its context
16 Hypertension and its context
17 Insect and its context
18 Abandonment and its context
19 Abscess and its context
20 Fœtus and its context
21 Actual implementation
22 The Lexicon Ontology (LO)
Answers the need for a formal definition of all objects involved in building a multilingual lexicon
Based on sound recommendations regarding modern ontologies
Ensures proper communication of the design between the implementers and the users
Frame-based implementation using Protégé
May be used for a knowledge-driven implementation of the lexicon
23 LO Implementation
24 PermanentObject
25 Dependant Objects
26 FullWord
27 PartWord
28 Definition by genus and differentia
Definitions are composed automatically by following the inheritance schema through the isa links.
A Noun is a LexiconObject which:
  represents a physical or abstract object or any of their attributes,
  is a building block of a sentence,
  is used stand-alone in a text,
  is an undecomposable atom,
  is an object embodied in the construction of a multilingual lexicon of the medical domain,
  is necessary for processing of written medical text.
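Composing a definition by walking the isa links can be sketched as follows. The chain Noun, FullWord, LexiconObject and the assignment of each clause to a class are assumptions made for this example (the slide does not show which class contributes which clause); the clauses themselves are those of the Noun definition above.

```python
# Hypothetical isa chain; the real Lexicon Ontology hierarchy may differ.
ISA = {"Noun": "FullWord", "FullWord": "LexiconObject", "LexiconObject": None}

# Differentia assumed to be attached to each class.
DIFFERENTIA = {
    "Noun": [
        "represents a physical or abstract object or any of their attributes",
    ],
    "FullWord": [
        "is a building block of a sentence",
        "is used stand-alone in a text",
        "is an undecomposable atom",
    ],
    "LexiconObject": [
        "is an object embodied in the construction of a multilingual lexicon"
        " of the medical domain",
        "is necessary for processing of written medical text",
    ],
}


def define(name):
    """Collect the differentia of a class and of all its ancestors, most specific first."""
    clauses = []
    genus = name
    current = name
    while current is not None:
        clauses.extend(DIFFERENTIA.get(current, []))
        genus = current  # the last class visited is the root, used as the genus
        current = ISA.get(current)
    article = "An" if name[0] in "AEIOU" else "A"
    return f"{article} {name} is a {genus} which: " + ", ".join(clauses) + "."
```

Calling `define("Noun")` in this sketch yields a sentence of the same shape as the definition on the slide, with the clauses ordered from the most specific class up to the root.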
29 Available resources
Multilingual lexicon:
  French: > 35000
  English: > 138000
  German: > 23000
  Latin: > 6500 (+ 9000)
  Proper names: > 3000
Tools (achievement may depend on the language):
  Word decomposition
  Tokenizer
  Error correction
Several utilities: Semantic Net, MeSH, TA, etc.
Web server for lexicon access
30 Recommendations
Define the lexicon on a strong formal basis
Make the multilingual aspects explicit
Take care of inflectional morphology
Favour the proper treatment of compound words
Be open to the evolution of languages and the arrival of other European languages
Make links to well-known terminologies and ontologies available
31 Thank you for your attention
robert.baud@sim.hcuge.ch