Introduction to Computational Linguistics: The Lexicon

Introduction
An inventory of words is an essential component of programs for a wide variety of language-sensitive applications, such as:
– Spellchecking, stylechecking
– IR, IE, message understanding
– Parsing, generation, MT
– TTS and STT
Such an inventory is usually called a dictionary or lexicon.

Dictionaries
The purpose of a dictionary is to provide a wide range of information about words.
Some of this is linguistic information, e.g. syntactic category, pronunciation, distribution.
But dictionaries also contain definitions of word senses, thus providing knowledge not just about language but about the world itself.

What is "dog"?
dog (ANIMAL) noun [C]
a common four-legged animal, especially kept by people as a pet or to hunt or guard things:
my pet dog; wild dogs; dog food; We could hear dogs barking in the distance.
(from Cambridge Advanced Learner's Dictionary)

"Dictionary" versus "Lexicon"
A dictionary is a collection of words.
A lexicon is a collection of lexemes.
A lexeme roughly corresponds to a set of words that are different forms of "the same word". For example, English run, runs, ran and running are forms of the same lexeme.
A lexeme can also be regarded as a single sense of a word.

Senses of Dog
dog was found in the Cambridge Advanced Learner's Dictionary at the entries listed below:
– dog (ANIMAL)
– dog (PERSON)
– dog (FOLLOW)
– dog (PROBLEM)
These are different senses, or lexemes, for dog.
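
The two readings of "lexeme" used so far can be sketched in a few lines of Python. This is only an illustration: the names FORMS, SENSES, lexeme_of and sense_lexemes are hypothetical, and the data is limited to the run and dog examples from the slides.

```python
# View 1: a lexeme groups the inflected forms of "the same word".
FORMS = {"run": ["run", "runs", "ran", "running"]}

# View 2: a lexeme is one sense of a word (sense labels taken from the
# dictionary entries for "dog" listed above).
SENSES = {"dog": ["ANIMAL", "PERSON", "FOLLOW", "PROBLEM"]}

# Inverted index: from each word form to the lexeme it belongs to.
FORM_TO_LEXEME = {
    form: lexeme for lexeme, forms in FORMS.items() for form in forms
}


def lexeme_of(word_form):
    """Return the lexeme a word form belongs to, or None if unknown."""
    return FORM_TO_LEXEME.get(word_form)


def sense_lexemes(word):
    """Return one lexeme label per recorded sense of the word."""
    return [f"{word} ({label})" for label in SENSES.get(word, [])]


print(lexeme_of("ran"))
print(sense_lexemes("dog"))
```

Under the first view, ran and running map to one lexeme; under the second, dog alone corresponds to four lexemes.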

Two Views of the Lexicon
The two views give rise to different issues.
Lexicon as word database
– How to represent the word collection
– Access: given an arbitrary word, how to access the relevant entries
– What information to provide and how to express it
Lexicon as database about word senses
– What are the relations between word senses?
– How do word senses hook up with conceptual knowledge?

Representing the Word Collection
Some possible representations:
– Text file, 1 entry per line
– Finite state automaton
– Other specialised data structure which allows for common prefixes, e.g. letter tree
Full form vs. lexeme + morphological analysis

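FSA for Sublexicon Fragment
[Figure: finite state automaton for a sublexicon fragment. Only the arc labels survive in the transcript: t, h, e, s, e, i, s, a, t, o.]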

Letter Tree
ltree([
  [b, [a, [r, [k, bark]]]],
  [c, [a, [r, [r, [y, carry]]],
          [t, cat, [e, [g, [o, [r, [y, category]]]]]]]],
  [d, [e, [l, [a, [y, delay]]]]],
  [h, [e, [l, [p, help]]],
      [o, [p, hop, [e, hope]]]],
  [q, [u, [a, [r, [r, [y, quarry]]]],
          [i, [z, quiz]],
          [o, [t, [e, quote]]]]]
]).

Informal Definition of a Letter Tree
A tree is a list of branches.
Each branch is a list
– whose first element is a letter
– whose remaining elements are either another branch or a lexical entry for a word
– whose remaining elements are in a specific order: the lexical entry (if any) comes first, and the branches follow in alphabetical order by their first letters

Branch representing cat, category and cook
[c, [a, [t, cat, [e, [g, [o, [r, [y, category]]]]]]],
    [o, [o, [k, cook]]]]
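
Access in a letter tree can be sketched as follows. This is a minimal Python rendering, not the course's own code: the nested lists mirror the Prolog term from the Letter Tree slide, with atoms written as strings, and the names LTREE and lookup are hypothetical.

```python
# The letter tree from the slides, as nested Python lists.
# A branch is [letter, optional_entry, sub_branch, ...].
LTREE = [
    ["b", ["a", ["r", ["k", "bark"]]]],
    ["c", ["a", ["r", ["r", ["y", "carry"]]],
                ["t", "cat", ["e", ["g", ["o", ["r", ["y", "category"]]]]]]]],
    ["d", ["e", ["l", ["a", ["y", "delay"]]]]],
    ["h", ["e", ["l", ["p", "help"]]],
          ["o", ["p", "hop", ["e", "hope"]]]],
    ["q", ["u", ["a", ["r", ["r", ["y", "quarry"]]]],
                ["i", ["z", "quiz"]],
                ["o", ["t", ["e", "quote"]]]]],
]


def lookup(tree, word):
    """Follow one branch per letter; return the lexical entry or None."""
    branches = tree
    entry = None
    for letter in word:
        # Find the branch whose first element is the current letter.
        branch = next((b for b in branches if b[0] == letter), None)
        if branch is None:
            return None
        rest = branch[1:]
        # The lexical entry, if any, comes first and is not a list.
        entry = rest[0] if rest and isinstance(rest[0], str) else None
        branches = [b for b in rest if isinstance(b, list)]
    return entry


print(lookup(LTREE, "cat"), lookup(LTREE, "category"), lookup(LTREE, "ca"))
```

Words sharing a prefix (cat, category, carry) share the corresponding path, so the cost of a lookup depends on the length of the word rather than on the size of the word list.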

Full Form Dictionary
– There is an entry for every possible word.
– No need for morphological processing.
– Exceptions are handled automatically.
– OK when the number of entries is not too large.
– Information is repeated across entries.
– Because languages have different morphological properties, full form is better for some languages than for others.

Morphological Analysis + Lexicon
[Diagram: the input word "cats" is passed to a morphological analysis component that consults a lexicon with the entries cat (N), s (PL) and s (3SG).]
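
The alternative to a full form dictionary can be sketched as a stem lexicon plus suffix stripping. The Python below is a toy illustration under stated assumptions: the names STEMS, SUFFIXES and analyse are hypothetical, and the lexicon holds only the three entries shown in the diagram.

```python
# Stem lexicon: stem -> syntactic category.
STEMS = {"cat": "N"}

# Suffix lexicon: suffix -> list of (category it attaches to, feature).
# The suffix "s" is ambiguous between noun plural and verb 3rd singular.
SUFFIXES = {"s": [("N", "PL"), ("V", "3SG")]}


def analyse(word):
    """Return all (stem, category, feature) analyses of a word."""
    analyses = []
    if word in STEMS:
        # Bare stem: no suffix feature.
        analyses.append((word, STEMS[word], None))
    for suffix, readings in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for category, feature in readings:
                # Keep a reading only if the stem has the right category.
                if STEMS.get(stem) == category:
                    analyses.append((stem, category, feature))
    return analyses


print(analyse("cats"))
```

Only the plural reading survives for cats, because cat is listed as a noun and the 3SG reading of s requires a verb stem. A full form dictionary would instead list cat and cats as separate entries.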

Morphological Analysis
Very roughly, morphological analysis of a word involves two subproblems:
– A segmentation problem: how to get from the written text to the sequence of morphemes that make it up.
– A morphotactic problem: how to combine the individual morphemes in a legitimate way.

Segmentation/Morphotactic Subproblems
Segmentation problem:
– enlargement => en + large + ment
Morphotactic problem: given what we know about en, large and ment, how can they be legitimately combined?
– enlargement => (en + large) + ment
– enlargement =/> en + (large + ment)
– en + ADJ => V
– V + ment => N
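
Both subproblems can be tried out on the enlargement example. The following Python is a minimal sketch, assuming a three-morpheme inventory and the two combination rules from the slide; all names (STEMS, PREFIXES, SUFFIXES, segment, left_bracketing, right_bracketing) are hypothetical.

```python
STEMS = {"large": "ADJ"}
PREFIXES = {"en": ("ADJ", "V")}    # en + ADJ => V
SUFFIXES = {"ment": ("V", "N")}    # V + ment => N
MORPHEMES = set(STEMS) | set(PREFIXES) | set(SUFFIXES)


def segment(word):
    """Segmentation: all ways to split a word into known morphemes."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in MORPHEMES:
            for tail in segment(word[i:]):
                results.append([head] + tail)
    return results


def apply_prefix(prefix, category):
    """Category of prefix + base, or None if the rule does not apply."""
    needed, result = PREFIXES[prefix]
    return result if category == needed else None


def apply_suffix(category, suffix):
    """Category of base + suffix, or None if the rule does not apply."""
    needed, result = SUFFIXES[suffix]
    return result if category == needed else None


def left_bracketing(prefix, stem, suffix):
    """Morphotactics for (prefix + stem) + suffix."""
    inner = apply_prefix(prefix, STEMS[stem])
    return apply_suffix(inner, suffix) if inner else None


def right_bracketing(prefix, stem, suffix):
    """Morphotactics for prefix + (stem + suffix)."""
    inner = apply_suffix(STEMS[stem], suffix)
    return apply_prefix(prefix, inner) if inner else None


print(segment("enlargement"))
print(left_bracketing("en", "large", "ment"))
print(right_bracketing("en", "large", "ment"))
```

The left bracketing yields a noun, while the right bracketing fails because ment cannot attach to the adjective large.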

2-Level Morphology
In 1981 the four Ks (Kimmo Koskenniemi, Lauri Karttunen, Ronald M. Kaplan and Martin Kay) were working on morphological analysis (MA).
The basic idea was that MA is about computing a relation between sets of strings at two levels:
– Surface level (strings of words made from the surface alphabet)
– Lexical level (strings of morphemes made from the lexical alphabet)
The relation can be computed using finite state transducers.
The finite-state model is reversible: the same relation serves both analysis and generation.
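
Reversibility can be illustrated with a toy transducer. The Python below is a sketch under loud assumptions: it is a hand-written six-arc machine for the single stem cat, not Koskenniemi's two-level rule formalism, and the tags +N, +SG, +PL as well as the names ARCS, run, generate and analyse are invented for the example.

```python
EPS = ""  # epsilon: the empty symbol

# Each arc: (state, lexical symbol, surface symbol, next state).
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+SG", EPS, 5),
    (4, "+PL", "s", 5),
]
START, FINAL = 0, {5}
LEXICAL, SURFACE = 1, 2  # tuple positions of the two levels


def run(symbols, read, write):
    """Read symbols on one level and collect the strings on the other."""
    results = []

    def walk(state, pos, out):
        if state in FINAL and pos == len(symbols):
            results.append("".join(out))
        for arc in ARCS:
            if arc[0] != state:
                continue
            if arc[read] == EPS:
                # Nothing to consume on the input level.
                walk(arc[3], pos, out + [arc[write]])
            elif pos < len(symbols) and symbols[pos] == arc[read]:
                walk(arc[3], pos + 1, out + [arc[write]])

    walk(START, 0, [])
    return results


def generate(lexical_symbols):
    """Lexical level to surface level."""
    return run(lexical_symbols, LEXICAL, SURFACE)


def analyse(surface_word):
    """Surface level to lexical level, using the same arcs."""
    return run(list(surface_word), SURFACE, LEXICAL)


print(generate(["c", "a", "t", "+N", "+PL"]))
print(analyse("cats"))
```

The same arc table is used in both directions; only the choice of which level is read and which is written changes.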

What Information to Provide
Specific information, e.g. for "kicks"
Syntactic information
– POS = verb
– Tense = pres
– Number = singular
– Person = 3
– Type = Transitive
Semantic information
– event-type = Physical Action
– type-of subject = animate
– type-of object = physical

What Information to Provide
General information
Class attributes
– Agreement has (Number, Gender)
Enumeration of possible values
– Gender = [masc, fem]
– Number = [sing, plur]
Class relationships
– Transitive isa Verb
– Common isa Noun
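
Specific and general information can be kept apart in code as well. The sketch below is one possible Python encoding, not a prescribed format: the dictionary layout and the names ENUMERATIONS, ISA, KICKS, is_a and invalid_features are hypothetical, and the entry uses the enumerated value sing where the earlier slide writes singular.

```python
# General information: enumerations of possible values.
ENUMERATIONS = {
    "Gender": {"masc", "fem"},
    "Number": {"sing", "plur"},
}

# General information: class relationships (child isa parent).
ISA = {"Transitive": "Verb", "Common": "Noun"}

# Specific information for the word form "kicks".
KICKS = {
    "form": "kicks",
    "syntax": {
        "POS": "Verb",
        "Tense": "pres",
        "Number": "sing",
        "Person": 3,
        "Type": "Transitive",
    },
    "semantics": {
        "event-type": "Physical Action",
        "type-of subject": "animate",
        "type-of object": "physical",
    },
}


def is_a(cls, ancestor):
    """Follow isa links upwards from cls until ancestor or the top."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False


def invalid_features(features):
    """List (name, value) pairs that violate an enumeration."""
    return [
        (name, value)
        for name, value in features.items()
        if name in ENUMERATIONS and value not in ENUMERATIONS[name]
    ]


print(is_a(KICKS["syntax"]["Type"], "Verb"))
print(invalid_features(KICKS["syntax"]))
```

Stating enumerations and isa links once, outside the individual entries, lets every entry be checked against them instead of repeating the information per word.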

Two Views of the Lexicon
The two views give rise to different issues.
Lexicon as word database
– How to represent the word collection
– Access: given an arbitrary word, how to access the relevant entries
– What information to provide and how to express it
Lexicon as database about word senses
– What are the relations between word senses?
– How do word senses hook up with conceptual knowledge?

WordNet
In 1985 a group of psychologists and linguists at Princeton had the idea of searching dictionaries conceptually rather than alphabetically.
WordNet is an attempt to organise a dictionary in terms of word meanings rather than word forms.
What is the nature and organisation of the lexicalised concepts that words can express?
It distinguishes between word forms, word meanings, and entries.

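Lexical Matrix
Word        Word Forms
Meanings    F1      F2      ..      Fn
M1          E1,1    E1,2
M2          E2,1
..
Mm                                  Em,n
Entries in the same row (E1,1 and E1,2) indicate synonymy; entries in the same column (E1,1 and E2,1) indicate polysemy.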

WordNet
A key aspect of WordNet is that a given meaning or word sense is represented as the set of words that can be used to express it.
These meanings are called synsets: sets of words with synonymous readings.
Synsets are established empirically according to a principle of substitutability that is relativised to context.

The Principle of Substitutability
Two expressions are synonymous if the substitution of one for the other never alters the truth value of a sentence in which the substitution is made.
Two expressions are synonymous in linguistic context C if the substitution of one for the other in C does not alter the truth value.
e.g. plank/board in carpentry contexts

Lexical Matrix
Word Meanings        Word Forms
                     board    committee    plank    ..    Fn
board, committee     E1,1     E1,2
board, plank         E2,1                  E2,3
..
Mm                                                        Em,n
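
The lexical matrix can be held as a set of (meaning, form) entries, from which synsets and polysemy fall out as row and column lookups. This is a small Python sketch restricted to the board/committee/plank example; the meaning labels M1 and M2 and the function names are hypothetical.

```python
# One pair per filled cell of the matrix: (word meaning, word form).
ENTRIES = {
    ("M1", "board"), ("M1", "committee"),   # row "board, committee"
    ("M2", "board"), ("M2", "plank"),       # row "board, plank"
}


def synset(meaning):
    """Row lookup: all word forms that express a meaning."""
    return {form for m, form in ENTRIES if m == meaning}


def meanings(form):
    """Column lookup: all meanings a word form can express."""
    return {m for m, f in ENTRIES if f == form}


def is_polysemous(form):
    """A form is polysemous if its column has more than one entry."""
    return len(meanings(form)) > 1


def are_synonyms(form_a, form_b):
    """Two forms are synonyms if they share at least one meaning."""
    return bool(meanings(form_a) & meanings(form_b))


print(sorted(synset("M1")), is_polysemous("board"))
```

Here board is polysemous, board and plank are synonyms in one sense, and committee and plank are not synonyms at all.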

WordNet
In WordNet, the synonymy relation between words is fundamental.
Synsets can be thought of as representing concepts which stand in various semantic relations to each other:
– X Antonym Y: meaning (synset) X is opposite to meaning (synset) Y (e.g. big, small)
– X Hyponym Y: like isa (e.g. dog, mammal)
– X Meronym Y: X is a part of Y (e.g. leg, man)

Lexicon as a Concept Graph
We can thus imagine the WordNet lexicon as a gigantic graph whose nodes are synsets and whose arcs are semantic relations between synsets.
Such a structure can be regarded as a semantic map of the concepts used in a given language.
Many applications can be created using the WordNet graph as a resource.
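
A graph of this kind can be sketched with a plain list of labelled arcs. The Python below is illustrative only and does not use the real WordNet data or API: synsets are reduced to single-word tuples, the arc from mammal to animal is an assumption added to make the chain longer, and the names ARCS, related and hypernym_chain are hypothetical.

```python
# Nodes are synsets (here: tuples of word forms); arcs are
# (source synset, relation, target synset).
BIG, SMALL = ("big",), ("small",)
DOG, MAMMAL, ANIMAL = ("dog",), ("mammal",), ("animal",)
LEG, MAN = ("leg",), ("man",)

ARCS = [
    (BIG, "antonym", SMALL),
    (DOG, "hyponym", MAMMAL),
    (LEG, "meronym", MAN),
    (MAMMAL, "hyponym", ANIMAL),  # assumed arc, not from the slides
]


def related(node, relation):
    """Synsets reachable from node over one arc of the given relation."""
    return [y for x, r, y in ARCS if x == node and r == relation]


def hypernym_chain(node):
    """Follow hyponym (isa) arcs transitively, nearest concept first."""
    chain = []
    frontier = related(node, "hyponym")
    while frontier:
        current = frontier.pop(0)
        if current not in chain:
            chain.append(current)
            frontier.extend(related(current, "hyponym"))
    return chain


print(related(LEG, "meronym"))
print(hypernym_chain(DOG))
```

Applications typically work by traversing such arcs, for example collecting every more general concept above a synset, as hypernym_chain does here.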

Using WordNet to Measure Semantic Orientations of Adjectives
Jaap Kamps, Maarten Marx, Robert J. Mokken, Maarten de Rijke

Conclusion
– The lexicon is a central building block of language-sensitive systems.
– Lexical information has a dual status: linguistic knowledge versus world knowledge.
– As a wordlist, the lexicon has to solve the problems of representation and access.
– Morphological analysis can help to keep the number of entries at a manageable level.
– As a collection of definitions, the lexicon has to deal with relationships between word meanings.