Language Data Resources (Zdroje jazykových dat): Word senses. Sense-tagged corpora.


1 Language Data Resources (Zdroje jazykových dat): Word senses. Sense-tagged corpora

2 Lev V. Ščerba: And indeed, every sufficiently complex word must actually become the subject of a scientific monograph; therefore it is hard to expect in the near future the completion of a good dictionary.

3 Word sense disambiguation The different meanings of polysemous words are known as “senses”, and the process of deciding which sense is being used in a particular context is called “word sense disambiguation”.

4 Lexical Acquisition Bottleneck In NLP, many systems do not perform in practice as well as they could with adequate dictionary resources, because these resources are costly to produce, adapt, and maintain. Solutions –Reusing existing dictionaries and ontologies as lexicons –Deriving disambiguation information directly from corpora

5 Usefulness of WSD NLP tools: –Systems – carry out some task of “interest for its own sake” (e.g. MT, IR); applications potentially interesting for non-linguists –Components – interesting for linguists and language engineers; e.g. WSD

6 Early approaches Preference semantics – 1970s –Selectional constraints (e.g. ANIMATE for the subject of “to drink”) Word experts – 1980s –Hand-crafted disambiguators constructed for each word separately –Limited applicability Polaroid words –Gradual disambiguation (grammar, parser, lexicon, semantic interpreter, knowledge representation language)
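
The selectional-constraint idea on this slide can be sketched in a few lines. This is a minimal illustration, not a reconstruction of any historical system: the lexicon entries, the feature names other than ANIMATE, and the function name are all assumptions made up for the example.

```python
# Toy lexicon: every entry below is invented for illustration.
NOUN_SENSES = {
    "crane": [
        {"sense": "crane/bird", "features": {"ANIMATE"}},
        {"sense": "crane/machine", "features": {"INANIMATE"}},
    ],
}

# Feature that a verb requires of its subject (hypothetical entries).
SUBJECT_CONSTRAINT = {"drink": "ANIMATE"}


def senses_allowed_as_subject(noun, verb):
    """Return the senses of `noun` that satisfy the verb's subject constraint."""
    candidates = NOUN_SENSES.get(noun, [])
    required = SUBJECT_CONSTRAINT.get(verb)
    if required is None:
        # No constraint known for this verb: nothing can be ruled out.
        return [c["sense"] for c in candidates]
    return [c["sense"] for c in candidates if required in c["features"]]


print(senses_allowed_as_subject("crane", "drink"))  # ['crane/bird']
```

For a verb with no recorded constraint the function returns every sense, which mirrors the limited reach of this kind of disambiguation: it only helps where a constraint has been written down.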

7 Dictionary Based Approaches Since the 1980s – dictionary publishers started to produce “Machine Readable Dictionaries” (now: machine-tractable dictionaries) Wider polysemy than in the systems described so far
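
The slide does not name a specific dictionary-based method. One well-known option is gloss overlap (a simplified form of the Lesk algorithm), sketched here under stated assumptions: the two glosses, the sense labels, and the stopword list are invented for the example and do not come from any real dictionary.

```python
# Invented glosses standing in for entries of a machine-readable dictionary.
GLOSSES = {
    "bank/finance": "an institution that accepts deposits of money and makes loans",
    "bank/river": "the sloping land beside a river or lake",
}

# Small hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "in", "on", "by"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def pick_sense_by_overlap(context, glosses):
    """Choose the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense


print(pick_sense_by_overlap("they sat on the bank of the river", GLOSSES))
```

The sketch does no stemming, so “deposit” and “deposits” do not match; a practical version would normalise word forms first.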

8 Two claims about sense distribution One sense per discourse –There is a very strong tendency for multiple uses of a word to share the same sense in a well-written discourse One sense per collocation –With a high probability an ambiguous word has only one sense in a given collocation
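
The “one sense per discourse” claim can be used as a post-processing step: once some occurrences of a word in a document have been tagged, the majority sense is propagated to all of them. The sketch below assumes that input format (a list of sense labels, with None for untagged occurrences); the labels and the function name are hypothetical.

```python
from collections import Counter


def apply_one_sense_per_discourse(tagged_occurrences):
    """Relabel all occurrences of one word in one document with the majority sense.

    `tagged_occurrences` holds a sense label per occurrence, or None if untagged.
    """
    counts = Counter(s for s in tagged_occurrences if s is not None)
    if not counts:
        # Nothing was tagged, so there is no majority to propagate.
        return list(tagged_occurrences)
    majority = counts.most_common(1)[0][0]
    return [majority] * len(tagged_occurrences)


doc = ["plant/factory", None, "plant/factory", "plant/flora"]
print(apply_one_sense_per_discourse(doc))
```

Overwriting the minority label is a design choice: it trades the occasional genuine sense shift within a text for consistency, which is only reasonable to the extent that the claim on the slide holds.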

9 Taxonomy of WSD Algorithms Knowledge based Corpus based –Tagged corpora –Untagged corpora Hybrid approaches

10 Word Senses and Lexicons Sense tagging = attaching senses from some lexicon to words in text Sense-enumerative dictionary

11 Deficiencies of dictionaries Omissions and oversights Coverage of names Ghost words – “dord” = density (a misreading of “D or d”) Differentiating senses (P. Hanks: A serious problem for computer applications is that dictionaries compiled for human users focus on giving lists of meanings for each entry, without saying much about how one meaning may be distinguished from another in text)

12 Two levels of sense distinction Homography –Two senses of a word are homographic when there is no obvious semantic relation between them (e.g. a ball – a dance or a rounded object) –Risk of amateur etymology Polysemy

13 Distinguishing senses P. Hanks: No generally agreed criteria exist for what counts as a sense, or for how to distinguish one sense from another. Zeugma: Arthur and his driving license expired last Thursday. Polysemy vs. vagueness (e.g. mountain)

14 The Bank Model Assumption A – Words have a finite set of clearly distinct, well-defined senses Assumption B – Native speakers of … know instantly and effortlessly which meaning applies in a given situation Criticism of the bank model: Kilgarriff (“I don’t believe in word senses”), Pustejovsky (Generative lexicon), and many others…

15 NLP Lexicons Longman Dictionary of Contemporary English (LDOCE) – three-level embedded structure for sense distinctions (homographs, senses, optional subsenses) Roget’s Thesaurus Cambridge International Dictionary of English COBUILD English Language Dictionary WordNet

16 Thesaurus

17 Ontology

18 There is little agreement on what an ontology is… In general, an ontology can be described as an inventory of the objects, processes, etc. in a domain, as well as a specification of (some of) the relations that hold among them. Aristotle: genus (the category to which something belongs) and differentiae (the properties that uniquely distinguish the category members from their parent and from one another) Nodes (concepts) in the hierarchy are related by subsumption
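
Subsumption between nodes of a hierarchy can be illustrated with a toy taxonomy in which every concept has at most one parent. The concepts and the single-parent restriction are assumptions made for the sketch; real ontologies often allow several parents.

```python
# Toy is-a hierarchy: child -> parent. All entries are invented.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "animal": "entity",
}


def subsumes(general, specific, parent=PARENT):
    """Return True if `general` is `specific` itself or one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = parent.get(node)  # None once the root has been passed
    return False


print(subsumes("animal", "dog"))  # True
print(subsumes("dog", "animal"))  # False
```

Here “mammal” plays the role of the genus for “dog” and “cat”; the differentiae that separate the two siblings are not modelled.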

19 Ontologies in different traditions Philosophical Cognitive Artificial intelligence Lexical semantics Lexicography Information science

20 Princeton WordNet Lexical semantic network structured around the notion of synsets Synset – a group of literals of the same part of speech that are mutually interchangeable in a certain context (“set of synonyms”) http://www.cogsci.princeton.edu/~wn/w3wn.html Inspired by psycholinguistic theories of human lexical memory Advantages: broad coverage, rich lexical information, freely available Drawback: too fine-grained for practical NLP tasks Relations between two synsets: hyponymy, hyperonymy, meronymy …
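
A synset as described on this slide can be modelled as a record holding a part of speech and a set of literals; the senses of a word are then the synsets that contain it. The sketch below does not use the WordNet database or any WordNet API: the identifiers and the second synset are invented, and only the first literal set is taken from the EuroWordNet example in this deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    synset_id: str   # hypothetical identifier, not a real WordNet offset
    pos: str         # part of speech shared by all literals
    literals: tuple  # literals interchangeable in some context


SYNSETS = [
    Synset("v-remark", "v", ("note", "observe", "make a remark", "remark")),
    Synset("v-watch", "v", ("observe", "watch")),  # invented second sense
]


def synsets_for(literal, pos, synsets=SYNSETS):
    """Return every synset of the given part of speech containing the literal."""
    return [s for s in synsets if s.pos == pos and literal in s.literals]


print([s.synset_id for s in synsets_for("observe", "v")])
```

A literal that appears in more than one synset is polysemous in this model, which is exactly the inventory a sense tagger has to choose from.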

21 EuroWordNet (i) Multilingual database containing several monolingual wordnets structured along the same lines as the Princeton WordNet 1.5 English, Dutch, German, Spanish, French, Italian, Czech, Estonian Inter-Lingual-Index http://www.hum.uva.nl/~ewn

22 EuroWordNet (ii) Princeton WordNet 1.5: note, observe, make a remark, remark EuroWordNet: prohodit, poznamenat, připomenout (Czech); anmerken, bemerken... (German)

23 Sense tagged corpora “interest” corpus –about 2,000 sentences containing the word “interest” SENSEVAL –http://www.senseval.org –WSD evaluation exercise, first run in 1998 SEMCOR –http://multisemcor.itc.it/semcor.php –Subset of the English Brown corpus, about 700,000 words –More than 200,000 words sense-tagged according to Princeton WordNet 1.6
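
One simple use of a sense-tagged corpus is to estimate the most frequent sense of each word and use it as a baseline tagger. The sketch assumes the corpus has already been read into (word, sense) pairs; the three pairs and the sense labels are invented and do not reflect the format or content of the corpora listed above.

```python
from collections import Counter, defaultdict


def train_most_frequent_sense(tagged_tokens):
    """Map each word to the sense it carries most often in the tagged data."""
    counts = defaultdict(Counter)
    for word, sense in tagged_tokens:
        counts[word][sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Invented toy data in the assumed (word, sense) format.
toy_corpus = [
    ("interest", "interest/money"),
    ("interest", "interest/money"),
    ("interest", "interest/attention"),
]

model = train_most_frequent_sense(toy_corpus)
print(model["interest"])  # interest/money
```

The baseline ignores context entirely, so it mainly serves as a reference point against which context-sensitive disambiguators are compared.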

24 Final remarks Similarity of POS- and sense tagging Mapping lexical resources

