1 Building a Multilingual Lexicon, by Robert Baud. SemanticMining WP20, Freiburg, 29 March 2004


2 Natural Language Processing Group: Building a multilingual lexicon
- Starting from a model of medicine, or starting from a pragmatic observation of the languages?
- What representation of knowledge is to be added to a lexicon? The question is what makes a lexicon multilingual.
- From signals to understanding, or the different levels of granularity of the language information
- Defining the Lexicon Ontology (LO) in order to start on a sound basis

3 Modeling or not?
- In the last decade, the idea of a model of medicine was prevalent, as in Snomed, Galen, UMLS, etc.
- NLP was necessary as a way to help communicate the content of the model.
- The principle of guidance by the model was accepted.
- But a general model of medicine is far from being a reality, and this will certainly remain true for a few decades.
- Therefore, it is not a good idea to base NLP on the existence of a model.
- Make NLP free from any model!

4 Modeling the medical domain
[Figure: a light model. Link types: has_parent, has_child, linked_to. Node labels: top, object, event, process, surgery, trauma, disease, path., normal, arm, hand, finger, foot, palm.]

5 Local model
- Words are at different levels of detail:
  - burn of the finger and burn of the thumb
  - digestive disorder and post-prandial disorder
  - vertebra and atlas
- Attributes or properties are generalized to classes of concepts.
- Local inference between close levels in a hierarchy of concepts is necessary before chunking information.
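The local inference mentioned on this slide can be sketched as a short walk up a small is-a table. This is only an illustration: the `IS_A` table, the function names, and the `max_steps` limit are assumptions, and treating each pair from the slide as a direct is-a link is an assumption as well.

```python
# Hypothetical is-a links built from the word pairs on the slide.
IS_A = {
    "thumb": "finger",
    "atlas": "vertebra",
    "post-prandial disorder": "digestive disorder",
}


def generalizations(term, max_steps=2):
    """Return the ancestors of `term`, nearest first, at most `max_steps` levels up."""
    found = []
    current = term
    for _ in range(max_steps):
        parent = IS_A.get(current)
        if parent is None:
            break
        found.append(parent)
        current = parent
    return found


def locally_related(a, b, max_steps=2):
    """True when a and b are equal, or one is a close ancestor of the other."""
    return (
        a == b
        or b in generalizations(a, max_steps)
        or a in generalizations(b, max_steps)
    )
```

With such a table, "burn of the thumb" and "burn of the finger" can be recognised as statements about closely related body parts before any chunking takes place.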

6 Semantic lexicons
- A semantic lexicon is a lexicon with attachments to existing terminologies and ontologies.
- But what do we attach to what, and how?
  - Grouping of words representing the same object?
  - What is the semantics of this association?
  - What about multilingual aspects?
- Problem of coherence of multiple attachments

7 From words to objects
[Figure: lexical representation on one side, ontological representation on the other.]
- Lemma level and Abstract Lexical Identifier:
  - « paupière », « eyelid », « Augenlid »: _eyelid
  - « bléphar », « blephar »: _blephar
  - « blépharo », « blepharo »: _blepharo
  - « palpébral », « palpebral », « ? »: _palpebral
- Ontological level, Universal Object Identifier: cl_Eyelid
- Attached resources: Galen, UMLS Semantic net, MeSH, Snomed, ICD10, other

8 Dealing with proximity of words
[Figure: lexical representation on one side, ontological representation on the other.]
- Lemma level: « corps », « body », « Körper » (shown three times, once per sense); « corps étranger », « foreign body », « Fremdkörper »
- Abstract Lexical Identifier: _BodyAsWhole, _BodyAsTrunck, _BodyAsDeadPerson, _BodyAsForeign
- Ontological level, Universal Object Identifier: cl_Body, cl_Trunck, cl_DeadBody, cl_ForeignBody
- Attached resources: MeSH, Semantic net, etc.

9 From signals to understanding
- utterances
- lexicon entries
- language words
- abstract lexical identifier
- universal object identifier
- object
- link between objects

10 Utterances
- A speech, a sentence, a sign, a signal, generally issued by a human being
- An expression of something to be communicated
- Well-formed or ill-formed
- It is difficult to delimit what counts as a unit of communication, or a kind of atomic message.
- Utterances are expected to be converted to written sentences for subsequent processing.

11 Lexicon entries
- All 3 kinds of lexicon entries point to well-defined objects of the world.
- Single-word entries, without blank characters, not decomposable
- Word components, or morphosemantemes, which are the parts obtained by decomposing compound words
- Expressions or short terms, made of 2 to 5 words, representing single objects, such as idiomatic expressions and language idiosyncrasies, which cannot be represented by ordinary composition of their parts.
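Decomposing a compound word into word components can be sketched as a longest-match search over a table of known components. This is a minimal illustration, not the group's decomposition tool: the `COMPONENTS` set and the `decompose` function are assumptions, and "plasty" and "itis" are added here only to have something to combine with the "blephar" and "blepharo" components shown earlier in the deck.

```python
# Hypothetical component table; "blephar" and "blepharo" appear on slide 7,
# "plasty" and "itis" are illustrative additions.
COMPONENTS = {"blephar", "blepharo", "plasty", "itis"}


def decompose(word):
    """Split `word` into known components, preferring the longest leading match.

    Returns a list of components, or None when no full decomposition exists.
    """
    if word == "":
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in COMPONENTS:
            rest = decompose(word[end:])
            if rest is not None:
                return [head] + rest
    return None
```

A word that cannot be fully covered by known components comes back as `None`, which matches the slide's idea that single-word entries are not decomposable.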

12 Language words
- In most natural languages, words present morphological variations, which have to be resolved.
- Rule-based systems are able to solve this problem.
- A lemmatizer is a program that, from a sentence, produces the list of the lemmas of all its words, that is, their basic forms: generally singular, masculine, nominative, and infinitive, whichever applies.
- A multilingual lexicon should include the definitions of the rules and should flag the regular words.
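A rule-based lemmatizer of the kind described here can be sketched with an exception table for irregular words and an ordered list of suffix rules for regular ones. The rules below are a toy set for English plurals only, and every name (`IRREGULAR`, `SUFFIX_RULES`, `lemmatize`, `lemmatize_sentence`) is an assumption made for illustration; a real lexicon would carry language-specific rules.

```python
# Irregular forms are looked up directly (toy examples).
IRREGULAR = {"vertebrae": "vertebra", "feet": "foot"}

# Ordered (suffix, replacement) rules; the first matching rule wins.
# The ("ss", "ss") rule protects words such as "abscess" from the plain "s" rule.
SUFFIX_RULES = [("ies", "y"), ("sses", "ss"), ("ss", "ss"), ("s", "")]


def lemmatize(word):
    """Return the basic form of one English word under the toy rules above."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement
    return word


def lemmatize_sentence(sentence):
    """Produce the list of lemmas for a whitespace-separated sentence."""
    return [lemmatize(token) for token in sentence.split()]
```

Words handled by `SUFFIX_RULES` are the "regular" ones the slide suggests flagging; anything in `IRREGULAR` would be stored explicitly. The toy rules over-strip some words (for example, ones ending in "us"), which is why real rule sets are larger.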

13 Abstract lexical identifier (LID)
- The same word generally exists in different languages.
- The same word may have different lemmas in a given language.
- The information about these facts has to be explicitly collected.
- The container that collects all these forms is called an abstract lexical identifier.
- It is represented by a unique string of characters, based on the English lemma, with an extension when necessary.

14 Universal object identifier (CID)
- Physical objects and abstract objects are parts of the world.
- A unique object identifier has to be defined for each object of the domain under scrutiny.
- One and only one link has to be defined between an abstract lexical identifier and an object identifier.
- Multiple links may converge on the same object identifier.
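The two constraints on this slide (each LID has exactly one link to a CID, while several LIDs may share a CID) can be sketched with two dictionaries. This is an illustrative structure, not the actual implementation: the `Lexicon` class, its method names, and the language codes are assumptions; the identifiers `_eyelid`, `_palpebral`, and `cl_Eyelid` come from slide 7.

```python
from collections import defaultdict


class Lexicon:
    """Toy store for lemma -> LID -> CID links."""

    def __init__(self):
        # (language, lemma) -> set of LIDs; one lemma may carry several senses.
        self.word_to_lids = defaultdict(set)
        # LID -> CID; a plain dict, so each LID has at most one CID.
        self.lid_to_cid = {}

    def add_word(self, language, lemma, lid):
        self.word_to_lids[(language, lemma)].add(lid)

    def link(self, lid, cid):
        """Link a LID to a CID, refusing a second, different CID for the same LID."""
        existing = self.lid_to_cid.get(lid)
        if existing is not None and existing != cid:
            raise ValueError(f"{lid} is already linked to {existing}")
        self.lid_to_cid[lid] = cid

    def objects_for(self, language, lemma):
        """Return the set of CIDs reachable from a lemma in a given language."""
        lids = self.word_to_lids.get((language, lemma), set())
        return {self.lid_to_cid[lid] for lid in lids if lid in self.lid_to_cid}


lexicon = Lexicon()
for language, lemma in [("fr", "paupière"), ("en", "eyelid"), ("de", "Augenlid")]:
    lexicon.add_word(language, lemma, "_eyelid")
lexicon.add_word("en", "palpebral", "_palpebral")
lexicon.link("_eyelid", "cl_Eyelid")
lexicon.link("_palpebral", "cl_Eyelid")  # a second LID converging on the same CID
```

Keeping `word_to_lids` as a set per lemma leaves room for cases such as « body », where one lemma is attached to several LIDs.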

15 Abdomen and its context

16 Hypertension and its context

17 Insect and its context

18 Abandonment and its context

19 Abscess and its context

20 Fœtus and its context

21 Actual implementation

22 The Lexicon Ontology (LO)
- Answers the need for a formal definition of all objects involved in building a multilingual lexicon
- Based on sound recommendations regarding modern ontologies
- Ensures proper communication of the design between the implementers and the users
- Frame-based implementation using Protégé
- May be used for a knowledge-driven implementation of the lexicon.

23 LO Implementation

24 PermanentObject

25 Dependant Objects

26 FullWord

27 PartWord

28 Definition by genus and differentia
- Definitions are composed automatically by the inheritance schema, through the isa links.
- A Noun is a LexiconObject which:
  - represents a physical or abstract object or any of their attributes,
  - is a building block of a sentence,
  - is used stand-alone in a text,
  - is an undecomposable atom,
  - is an object embodied in the construction of a multilingual lexicon of the medical domain,
  - is necessary for the processing of written medical text.
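Composing a definition through the isa links can be sketched as a walk from a class up to the root, collecting one differentia per level. The class chain below is hypothetical: `FullWord` and `LexiconObject` are names that appear in the deck, but which class sits where, and which differentia belongs to which class, are assumptions made for this illustration.

```python
# class name -> (parent via isa, differentia contributed by this class).
# The assignment of differentiae to classes is assumed, not taken from the LO.
CLASSES = {
    "LexiconObject": (None, "is an object embodied in the construction of a multilingual lexicon"),
    "FullWord": ("LexiconObject", "is used stand-alone in a text"),
    "Noun": ("FullWord", "represents a physical or abstract object or any of their attributes"),
}


def genus_and_differentiae(name):
    """Follow the isa links from `name` to the root.

    Returns (root class, list of differentiae ordered from `name` upward).
    """
    differentiae = []
    genus = name
    current = name
    while current is not None:
        parent, differentia = CLASSES[current]
        differentiae.append(differentia)
        genus = current
        current = parent
    return genus, differentiae


def define(name):
    """Render the collected pieces as a one-sentence definition."""
    genus, differentiae = genus_and_differentiae(name)
    return f"A {name} is a {genus} which " + ", ".join(differentiae) + "."
```

Calling `define("Noun")` yields a sentence of the same shape as the slide's definition, with the most specific differentia first and the inherited ones after it.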

29 Available resources
- Multilingual lexicon:
  - French: > 35000
  - English: > 138000
  - German: > 23000
  - Latin: > 6500 (+ 9000)
  - Proper names: > 3000
- Tools (achievement may be dependent on the language):
  - Word decomposition
  - Tokenizer
  - Error correction
- Several utilities: Semantic Net, MeSH, TA, etc.
- Web server for lexicon access

30 Recommendations
- Define the lexicon on a strong formal basis
- Make the multilingual aspects explicit
- Take care of inflectional morphology
- Favour the proper treatment of compound words
- Be open to the evolution of languages and to the arrival of other European languages
- Make links to well-known terminologies and ontologies available

31 Thank you for your attention. robert.baud@sim.hcuge.ch

